Commit e9ab5fff authored by Morten Brun's avatar Morten Brun
Browse files

adding clustering

parent bbba91a8
%% Cell type:markdown id: tags:
# Explotation of transcriptomics data
The aim of this exercise is to visually divide samples by exposure. The main task here is to find discriminating features. This is what machine learning is about.
%% Cell type:code id: tags:
``` python
# First we load some libraries that we always need
# If you do not know what they are good for I suggest
# three options:
# 1. Do not worry about this for now
# 2. Use google to find information
# 3. Ask somebody who might know
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
First we locate the data files we are going to work with. This is a first major obstacle. If we need only one file it is easy, but for different reasons the data we wish to analyze can be spread around several files. This is the case here, so we need to do some work already at this stage. Unfortunately, this happens more often than not.
%% Cell type:code id: tags:
``` python
# Import tool used to walk throug the files
from os import walk
# In order to make the code cleaner it is often usefull
# to specify values in variables. Here we specify the file
# path, that is, the folder to fetch files from
path = '../rawdata/In vivo II transcriptomics/'
# Now we walk through all filenames in the specified folder
(_, _, filenames) = next(walk(path))
# We only keep the file ending with xlsx, that is, the excel files
filenames = [path + name for name in filenames if name.endswith('.csv')]
```
%% Cell type:markdown id: tags:
Let us list the files and read in the first one of them
%% Cell type:code id: tags:
``` python
filenames
```
%%%% Output: execute_result
['../rawdata/In vivo II transcriptomics/ctr2_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/ctr1_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/wyhigh_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/wylow_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/ttalow_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/gwlow_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/gwhigh_lib_norm.csv',
'../rawdata/In vivo II transcriptomics/ttahigh_lib_norm.csv']
%% Cell type:code id: tags:
``` python
df = pd.read_csv(filenames[0], index_col=0)
```
%% Cell type:markdown id: tags:
To get an idea about the data we look at the first few rows
%% Cell type:code id: tags:
``` python
filenames[0].split()[-1].split('/')[-1][:4]
```
%%%% Output: execute_result
'ctr2'
%% Cell type:code id: tags:
``` python
df.head()
```
%%%% Output: execute_result
sample_532 sample_533 sample_536 sample_538 \
ENSGMOG00000000001 5.895429 7.580259 5.988031 3.197445
ENSGMOG00000000002 5.243531 3.838597 5.961058 2.671838
ENSGMOG00000000003 0.141717 0.000000 0.000000 0.000000
ENSGMOG00000000004 10.912212 17.855290 14.916132 18.089656
ENSGMOG00000000005 0.000000 0.038774 0.000000 0.000000
sample_540 sample_545 sample_546 sample_547
ENSGMOG00000000001 3.664780 6.194208 5.696718 5.078485
ENSGMOG00000000002 4.716152 4.942187 4.970291 3.742042
ENSGMOG00000000003 0.000000 0.032948 0.000000 0.000000
ENSGMOG00000000004 10.543753 13.673385 17.166620 12.128225
ENSGMOG00000000005 0.000000 0.000000 0.000000 0.000000
%% Cell type:markdown id: tags:
Already here we have reason to be suspecious. Lots of zeros in a row is a bad sign. For example, judging by row 4, the sample 533 is very different from the rest of the samples. This is probably an artifact of technical limitations, so we may come back to this if we are unable to find clear patterns in the data.
%% Cell type:markdown id: tags:
The csv files are relatively clean, so we can just read them in and merge them together to one big table. There is one catch, though. The doses are specified in the filenames, so we need to collect that information too. We define a helper function to do this for us.
%% Cell type:code id: tags:
``` python
# Helper method for reading in multiple csv files and
# formatting them for use in the analysis
from itertools import chain
def read_csv_files(filepaths, max_null_columns=None):
'''
Read in excel files.
'''
# Force the filepaths to consist of a list of filepaths also
# when a single filename is given as input
if type(filepaths) == str:
filepaths = [filepaths]
# Read in the content of all csv files
datas = [pd.read_csv(filepath, index_col=0) for filepath
in filepaths]
doses = [[filepath.split()[-1].split('/')[-1][:4]] * df.shape[1]
for filepath, df in zip(filepaths, datas)]
# Format the individual spreadsheets
'''
for idx, df in enumerate(datas):
# Discard empty columns
df = df.dropna(how='all', axis=1)
# Discard columns with no name
df = df[df.index.notnull()]
# Discard duplicate columns keeping biggest values.
df_dup = df.loc[df.index.duplicated(keep=False)]
df = df.loc[~df.index.duplicated(keep=False)]
for name in np.unique(df_dup.index):
df.loc[name] = df_dup.loc[name].max(axis=0).values
datas[idx] = df
'''
# Merge all spreadsheets into one spreadsheet
df = pd.concat(datas, axis=1, sort=False)
# Discard duplicate columns
# df = df.loc[:,~df.columns.duplicated()]
# Discard all rows with more than `max_null_columns` empty entries
df = df.replace(0, pd.np.nan)
if max_null_columns is None:
max_null_columns = df.shape[1]
df = df.loc[df.isnull().sum(axis=1) <= max_null_columns]
# Fill in empty entries with columnwise minima
mins = df.min()
df = df.fillna(value=mins)
# Return then resulting spreadsheet. We will call it a DataFrame from now on.
return df, list(chain.from_iterable(doses))
```
%% Cell type:markdown id: tags:
Now we read in the csv files.
%% Cell type:code id: tags:
``` python
df, doses = read_csv_files(filenames, max_null_columns=0)
```
%% Cell type:markdown id: tags:
If you want to, you can inspect the content of the resulting DataFrame `df`. Since it clutters your screen I have commented the following cell out. You can activate it by seclecting it and changing its type to code. One way to do this is through the "Cell" option on the top of this page.
%% Cell type:raw id: tags:
df
%% Cell type:markdown id: tags:
We are going to work with the doses, so we need to extract them from the dataframe. This has been made a little difficult by the way the columns are named. This is unfortunately quitet realistic.
%% Cell type:markdown id: tags:
Next we flip (transpose) the dataframe (spreadsheet) and, for each gene id, calculate the mean values of the gene expressions with respect to dose.
%% Cell type:raw id: tags:
from sklearn.preprocessing import StandardScaler
%% Cell type:raw id: tags:
scaler = StandardScaler()
scaler.fit(df.values)
df[df.columns] = scaler.transform(df.values)
%% Cell type:code id: tags:
``` python
df[df.columns] = np.arcsinh(df[df.columns].astype(float) / 5)
```
%% Cell type:code id: tags:
``` python
# Construct auxillary dataframes holding the flipped and averaged expressions
df2 = df.transpose()
df2['doses'] = doses
df2groups = df2.groupby('doses')
# Extract the averaged expressions in a numpy array
fit_values = df2groups.mean().values
# Clean up by deleting the auxillary dataframes
del df2
del df2groups
```
%% Cell type:markdown id: tags:
### Magic data manipulation.
We first compute the first 5 principal directions of the gene ids with respect to the samples grouped into dosage. Then we use these principal directions to project the individual samples to 5 dimensional space. We store the result in the variable `X`.
%% Cell type:code id: tags:
``` python
# Import PCA module
from sklearn.decomposition import PCA
# Fit and transform PCA
pca = PCA(n_components=5).fit(fit_values)
print('explained_variance_ratio', np.cumsum(pca.explained_variance_ratio_))
X = pca.transform(df.values.transpose())
```
%%%% Output: stream
explained_variance_ratio [0.43189944 0.63387887 0.769785 0.84857145 0.91671413]
%% Cell type:markdown id: tags:
### Visualization
%% Cell type:markdown id: tags:
We define a helper function for plotting. It takes several inputs.
* X is two-dimensional data that will be plotted by placing abbreviated dose names in the appropritat position on the plane
* y_text holds the dose abbreviations
* y_color specifies the colors. To begin with we want text and color to correspond.
%% Cell type:code id: tags:
``` python
from itertools import cycle, islice
def plot_embedding(X, y_text, y_color=None, title=None, colors=None):
if y_color is None:
y_color = y_text
if colors is None:
palette = np.vstack((plt.cm.tab10(np.arange(10)), plt.cm.Set3(np.arange(12))))
palette = np.array(list(islice(cycle(palette), len(np.unique(y_color)))))
colors = {}
for idx, value in enumerate(np.unique(y_color)):
colors[value] = palette[idx]
x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
dimensions = x_max - x_min
x_max = x_max + 0.1*dimensions
x_min = x_min - 0.05*dimensions
#plt.figure()
ax = plt.subplot(111)
for i in range(X.shape[0]):
plt.text(X[i, 0], X[i, 1], str(y_text[i]),
color=colors[y_color[i]],
fontdict={'weight': 'bold', 'size': 9})
plt.xlim([x_min[0], x_max[0]])
plt.ylim([x_min[1], x_max[1]])
plt.xticks([]), plt.yticks([])
if title is not None:
plt.title(title)
```
%% Cell type:markdown id: tags:
We make a list of dosage abbreviations
%% Cell type:code id: tags:
``` python
y_dose = np.empty(len(doses), dtype=int)
for idx, dose in enumerate(np.unique(doses)):
y_dose[np.array(doses) == dose] = idx
```
%% Cell type:markdown id: tags:
Compute UMAP embedding intended to visualize the data
%% Cell type:markdown id: tags:
### Plot the data
%% Cell type:markdown id: tags:
We plot the first two principal components
%% Cell type:code id: tags:
``` python
plot_embedding(X[:, :2],
doses,
y_color=y_dose, colors=None)
```
%%%% Output: display_data
![]()
%% Cell type:markdown id: tags:
Next we compute a fancy embedding of the all principal components.
%% Cell type:code id: tags:
``` python
# Import UMAP module
from umap import UMAP
# Suppress annoying warnings
import warnings
warnings.filterwarnings('ignore')
Y = UMAP(n_components=2).fit_transform(X)
```
%% Cell type:markdown id: tags:
And plot it
%% Cell type:code id: tags:
``` python
plot_embedding(Y,
doses,
y_color=y_dose, colors=None)
```
%%%% Output: display_data
![]()
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment