The aim of this exercise is to visually separate samples by exposure. The main task here is to find discriminating features, which is a central problem in machine learning.
%% Cell type:code id: tags:
``` python
# First we load some libraries that we always need
# If you do not know what they are good for I suggest
# three options:
# 1. Do not worry about this for now
# 2. Use google to find information
# 3. Ask somebody who might know
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
First we locate the data files we are going to work with. This is a first major obstacle. If we need only one file it is easy, but for different reasons the data we wish to analyze can be spread around several files. This is the case here, so we need to do some work already at this stage. Unfortunately, this happens more often than not.
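As a minimal sketch of this step, the snippet below lists the files directly inside a folder and keeps only those with a given extension. The function name `list_files_with_extension` and the demo file names are made up for illustration; only `os.walk` itself is the tool used in this notebook.

```python
import os
import tempfile


def list_files_with_extension(folder, extension):
    """Return the sorted names of files directly inside `folder` that end with `extension`."""
    # next(os.walk(folder)) gives (dirpath, dirnames, filenames) for the top folder only
    _, _, filenames = next(os.walk(folder))
    return sorted(name for name in filenames if name.endswith(extension))


# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["a.xlsx", "b.xlsx", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()
    excel_files = list_files_with_extension(demo_folder, ".xlsx")

print(excel_files)
```

Filtering by extension right after listing keeps stray files, such as notes or temporary files, out of the later steps.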
%% Cell type:code id: tags:
``` python
# Import the tool used to walk through the files
from os import walk
# In order to make the code cleaner it is often useful
# to specify values in variables. Here we specify the file
# path, that is, the folder to fetch files from
path = '../rawdata/In vivo II transcriptomics/'
# Now we walk through all filenames in the specified folder
_, _, filenames = next(walk(path))
# We only keep the files ending with xlsx, that is, the Excel files
```
%% Cell type:markdown id: tags:
Already here we have reason to be suspicious. Lots of zeros in a row is a bad sign. For example, judging by row 4, sample 533 is very different from the rest of the samples. This is probably an artifact of technical limitations, so we may come back to this if we are unable to find clear patterns in the data.
%% Cell type:markdown id: tags:
The csv files are relatively clean, so we can just read them in and merge them into one big table. There is one catch, though. The doses are specified in the filenames, so we need to collect that information too. We define a helper function to do this for us.
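To illustrate the idea, here is a minimal sketch of such a helper. It is not the helper used in this notebook: the name `read_csv_with_dose`, the file names `dose_0.csv` and `dose_10.csv`, the assumption that the first number in a file name is the dose, and the `<dose>_<sample>` column naming are all hypothetical.

```python
import os
import re
import tempfile

import pandas as pd


def read_csv_with_dose(folder, filenames, dose_pattern=r"(\d+(?:\.\d+)?)"):
    """Read several csv files and merge them column-wise, tagging columns with the dose.

    Assumption: the first number found in each file name is the dose.
    """
    tables = []
    for name in filenames:
        match = re.search(dose_pattern, name)
        if match is None:
            raise ValueError(f"No dose found in file name {name!r}")
        # The first csv column is assumed to hold the gene ids
        table = pd.read_csv(os.path.join(folder, name), index_col=0)
        # Prefix every sample column with the dose taken from the file name
        table.columns = [f"{match.group(1)}_{column}" for column in table.columns]
        tables.append(table)
    return pd.concat(tables, axis=1)


# Demonstration with two small made-up files
with tempfile.TemporaryDirectory() as demo_folder:
    genes = ["geneA", "geneB"]
    pd.DataFrame({"s1": [1.0, 2.0], "s2": [3.0, 4.0]}, index=genes).to_csv(
        os.path.join(demo_folder, "dose_0.csv"))
    pd.DataFrame({"s3": [5.0, 6.0]}, index=genes).to_csv(
        os.path.join(demo_folder, "dose_10.csv"))
    merged = read_csv_with_dose(demo_folder, ["dose_0.csv", "dose_10.csv"])

print(merged)
```

Storing the dose in the column name keeps everything in one table, at the price of having to parse the names again later.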
%% Cell type:code id: tags:
``` python
# Helper method for reading in multiple csv files and
```
%% Cell type:markdown id: tags:
If you want to, you can inspect the content of the resulting DataFrame `df`. Since it clutters your screen I have commented the following cell out. You can activate it by selecting it and changing its type to code. One way to do this is through the "Cell" option at the top of this page.
%% Cell type:raw id: tags:
df
%% Cell type:markdown id: tags:
We are going to work with the doses, so we need to extract them from the dataframe. This has been made a little difficult by the way the columns are named. This is unfortunately quite realistic.
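How the extraction looks depends entirely on the naming scheme. As a small sketch, assume (hypothetically) that every column is named `<dose>_<sample>`; the toy table and the names `toy` and `toy_doses` are made up and do not reflect the real column names in `df`.

```python
import pandas as pd

# Toy table with hypothetical column names of the form "<dose>_<sample>"
toy = pd.DataFrame([[1.0, 2.0, 3.0]], index=["geneA"],
                   columns=["0_s1", "0_s2", "10_s3"])

# Split each name once at the first underscore and read the leading part as a number
toy_doses = [float(name.split("_", 1)[0]) for name in toy.columns]

print(toy_doses)
```

With messier names, a regular expression is usually a safer choice than splitting on a single character.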
%% Cell type:markdown id: tags:
Next we flip (transpose) the dataframe (spreadsheet) and, for each gene id, calculate the mean values of the gene expressions with respect to dose.
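To see what transposing and averaging does, here is the same recipe on a tiny made-up table with two genes and three samples. The names `toy`, `toy_doses`, `flipped`, and `toy_fit_values` are only for illustration.

```python
import pandas as pd

# Rows are gene ids, columns are samples
toy = pd.DataFrame({"s1": [1.0, 10.0], "s2": [3.0, 30.0], "s3": [5.0, 50.0]},
                   index=["geneA", "geneB"])
# One dose per sample, in column order: s1 and s2 share dose 0, s3 has dose 10
toy_doses = [0, 0, 10]

# After transposing, each row is a sample and each column a gene
flipped = toy.transpose()
flipped["doses"] = toy_doses

# Average the expressions of all samples that share a dose
toy_means = flipped.groupby("doses").mean()
toy_fit_values = toy_means.values

print(toy_means)
```

The result has one row per dose and one column per gene, so `toy_fit_values` here is a 2 by 2 array.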
%% Cell type:code id: tags:
``` python
# Construct auxiliary dataframes holding the flipped and averaged expressions
df2 = df.transpose()
df2['doses'] = doses
df2groups = df2.groupby('doses')
# Extract the averaged expressions in a numpy array
fit_values = df2groups.mean().values
# Clean up by deleting the auxiliary dataframes
del df2
del df2groups
```
%% Cell type:markdown id: tags:
### Magic data manipulation.
We first compute the first 5 principal directions of the gene ids with respect to the samples grouped by dosage. Then we use these principal directions to project the individual samples onto a 5-dimensional space. We store the result in the variable `X`.
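One way to sketch these two steps is with a singular value decomposition in NumPy. This is not necessarily how `X` is computed in this notebook; the toy data and the helper names `principal_directions` and `project` are made up for illustration. Note that with centered group means there are at most (number of dose groups - 1) meaningful directions, so the toy example uses 8 groups.

```python
import numpy as np


def principal_directions(group_means, n_components):
    """Return the center and the first n_components principal directions of the rows."""
    center = group_means.mean(axis=0)
    # Rows of vt are orthonormal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(group_means - center, full_matrices=False)
    return center, vt[:n_components]


def project(samples, center, directions):
    """Project the rows of `samples` onto the given directions."""
    return (samples - center) @ directions.T


# Toy data: 8 dose groups with 3 samples each, 50 genes per sample
rng = np.random.default_rng(0)
toy_doses = np.repeat(np.arange(8), 3)
toy_samples = rng.normal(size=(24, 50)) + toy_doses[:, None]

# Averages per dose group play the role of fit_values
toy_means = np.vstack([toy_samples[toy_doses == d].mean(axis=0)
                       for d in np.unique(toy_doses)])

center, directions = principal_directions(toy_means, 5)
X_toy = project(toy_samples, center, directions)

print(X_toy.shape)
```

Fitting the directions on the group averages rather than on the individual samples means the projection emphasizes differences between doses, while each sample still gets its own point in the reduced space.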