COS 323 - Computing for the Physical and Social Sciences

Fall 2013

Assignment 2: Regression on Embryonic Stem Cell Data

Due Tuesday, Oct. 22

In this assignment, you will implement nonlinear least squares fitting using Gauss-Newton iteration. You will be applying your implementation to fitting murine embryonic stem cell data to a nonlinear model.

1. Data and Motivation

Embryonic stem cells (ESC) hold great promise in many areas of biological research because they have the properties of self-renewal and pluripotency. Pluripotency is the state in which the cell has the potential to differentiate into different cell types. For this reason stem cells are important to fundamental research in cell differentiation and regenerative medicine.

For this assignment we will be using a small subset of the data and model proposed by Park and Nakai in the following paper:
Sung-Joon Park and Kenta Nakai. A regression analysis of gene expression in ES cells reveals two gene classes that are significantly different in epigenetic patterns. BMC Bioinformatics, 12 (Suppl 1):S50
The entire data set for the paper is available here: http://www.hgc.jp/~park/research/data/apbc2011/index.html
The paper was done entirely in silico. That means that the authors didn't do any wetlab work, but relied on publicly available gene expression data in the GEO database (GSE11431).

Gene expression is the process by which DNA is used to produce RNA. RNAs are then formed into proteins and functional RNAs and these determine the function of cells. Measuring the RNA in cells as well as specifying the type of RNA in cells can indicate their function. For this reason, characterizing and predicting gene expression is fundamental to understanding how cells function properly, when cells may not function properly, and determining what processes may be affecting the function of cells. The experiments that generated this data examined how transcription factors affect the gene expression of embryonic stem cells. Transcription factors (TF) are proteins that affect the expression of genes by binding to sequences of DNA and other proteins. The following 11 transcription factors were used in the model of the paper: n-Myc, Zfx, c-Myc, Klf4, Tcfcp2l1, Esrrb, Nanog, Oct4, Sox2, Stat3, and Smad1. In addition the authors used two additional different data sets in order to add the following epigenetic effects to their model: histone modification, methylation, and presence in a CpG island. You do not have to worry about understanding the biology for this assignment. You only need to understand that these transcription factors and epigenetic states are represented by variables in the model we will discuss in the next section to describe gene expression. The transcription factors and epigenetic states will be the explanatory variables in the model.

We will only be using a small subset of the data set contained in the files chen_Eland_Score_Epi.dat and epigenetic_effects.dat. The file chen_Eland_Score_Epi.dat contains the following 16 columns:
Gene expression n-Myc Zfx c-Myc Klf4 Tcfcp2l1 Esrrb Nanog Oct4 Sox2 Stat3 Smad1 HistM Methy CpGI
The first is the gene name, followed by the gene expression value, followed by the transcription factors and epigenetic effects. The transcription factor scores were based on the density profile of transcription factor bindings. Each epigenetic effect was encoded with a discrete value included in the epigenetic_effects.dat for the possible epigenetic state. The transcription factor scores and epigenetic effects discrete values have already been log transformed, but the expression value has not.

The file epigenetic_effects.dat contains the following columns:
Gene, RPKM(ESC), RPKM(EB), CpGI_Existence, Histone_Mark_Pattern, DNA_Methylation (n=none, U=unmethylation, M=methylation)
These are the the gene name, absolute expression for ESC, absolute expression for embryoid body, whether a CpG Island existed or note, type of histone mark, and whether there was methylation or not, respectively.

Write Matlab code in a file named ESC_data.m to do the following:

Read in the file chen_Eland_Score_Epi.dat. Create a vector of strings to store the gene names. Create a vector to store the gene expression value. Create a matrix of the remaining 14 columns in the file.
What is the range of the expression data? Explain why log transformation is or is not justified in your write up.
Optional: If you are interested in looking at the epigenetic effects data in the file epigenetic_effects.dat, you may want to see if high gene expression occurred with methylation or other patterns of epigenetic effects.

2. Model and Derivation

The authors proposed the following log-log transformed model to describe the gene expression of ESCs in terms of the transcription factors and epigenetic effects:
$$\log y_i =\sum_{j=1}^{11} a_j \log S_{ij}+ c \log C_i + d \log H_i + e \log M_i $$ where $y_i$ is the i-th observed expression, $S_{ij}$ is the score of the j-th transcription factor in observation i, $C_i$ the observed CpG state, $H_i$ the histone mark, and $M_i$, the methylation state, all for observation $i$. All values are log based 10. The authors then use linear regression to solve for the regression coefficients $a_j$,$c$,$d$, and $e$. All of the TF score and epigenetic state variables are explanatory variables and will be referred to as $x_{ij}$ in the following discussion where $i$ is the observation and $j$ is the number of the variable or column in the explanatory variable matrix. We will refer to the coefficients generically as $a,b,c,\ldots$ for the rest of the discussion.

However, there has been criticism of using log-log transformed data and linear regression rather than power laws and nonlinear regression. What is a problem mentioned in lecture about this? Instead we would like you to consider the power law version of this model. What is the model now?
To do: Write a Matlab function in a file named model.m that given a vector of coefficients and matrix of explanatory variables will return the vector for the dependent variable, $y$, for this model.

Fitting the model to data consists of estimating the coefficents using least squares, i.e, finding the coefficients that minimize the least squares. You will be using Gauss-Newton iteration to perform the fit - see the motivation behind this in slides 8-18 of Lecture 9. Here we will refer to the coefficients as $a,b,c \ldots$. Briefly, you will compute successively better estimates of the parameters via the iteration $$ \begin{pmatrix} a \\ b \\ c \\ \vdots \end{pmatrix}^{\!\!(k+1)} = \begin{pmatrix} a \\ b \\ c \\ \vdots \end{pmatrix}^{\!\!(k)} + \delta^{(k)}, \quad \mathrm{where} \quad J^\mathrm{T} J \, \delta = J^\mathrm{T} r. $$

The matrix $J$ is a Jacobian: the columns correspond to partial derivatives of our model with respect to $a, b, c, \ldots$, and each row corresponds to a different data point. The vector $r$ is the residual: each row is $y_i - f(x_{i},a,b,c \cdots)$ for a different data point where $f$ is the function for our model and $x_{i}$ is a vector of all the explanatory variables at observation $i$. (The parenthesized superscripts $(k)$ and $(k+1)$ are the values at different iterations.)

To do: Derive the expression for each row in the Jacobian.

3. Implementation

Implement a Matlab function in a file named gaussnewton.m that performs the following:

Takes as input a matrix in which each row contains $y_i$ and some number of $x_{ij}$ and a vector of an initial guess of coefficients, $\beta$.
Carries out a Gauss-Newton iteration using the derivations you did above.
Terminates the iteration at a point of your choosing. Justify your termination condition in the writeup.
Returns the best-fit parameters for the regression coefficients as well as the condition number of $J^\mathrm{T} J$ from the last iteration.

4. Evaluation

Test your implementation on the ESC data in the file chen_Eland_Score_Epi.dat. Remember to be careful with the data to make sure you need to have it log-transformed or not at each step below. To do so we would like you to compare your implementation for nonlinear regression with linear regression on the data. Write Matlab code in a file named regresseval.m to do the following:

Using pre-existing matlab function(s) run the simple linear regression on the data for the above log-log transformed model. Save the coefficients as $\beta_0$ and you will use them in the following steps. Include $\beta_0$ in your writeup file.
Using $\beta_0$ as your initial guess for the regression coefficients, run the following trials for your implementation on the following subsets of the data:
- The entire data set
- The subset of 1000 genes with the highest expression
- The subset of the 1000 genes with the lowest expression.
- 5 random subsets of 1000 genes
For each trial save the regression coefficients into a table for that trial. At the end you will have a table of 8 columns and 14 rows containing all of the regression coefficients for each trial. Label the columns with the trial and rows with the coefficient from the model. What can you say about the stability with which each parameter is recovered?
For the trial of your choice, create a scatterplot of the observed vs. predicted gene expression values for your method and compare it to a similar plot for linear regression. Recall that the predicted gene expression values are just the values obtained by using the computed regression coefficients and the data in the model. To do so, select a subset of the data, S, and a coefficients for a trial, $b$. Then compute the predicted gene expression using the data for S and $b$ and plot it against the observed expression for S. Then do the same using the log-log model on the data set S using the coefficients in $\beta_0$. Explain your choice of trial coefficients, $b$, and data subset, S. Compare the models based on what you can discern from your scatterplots.

Submitting

This assignment is due Tuesday, October 22, 2012 at 11:59 PM. Please see the course policies regarding assignment submission, late assignments, and collaboration.

Please submit:

Your code well-commented .m files including the files ESC_data.m, model.m, gaussnewton.m, regresseval.m and any auxiliary files.
Submit an assignment writeup in a plain-text or PDF file, containing:

A description of your code (minimally: which files contain solutions and a brief description of how they work).
Your responses to bolded questions in all sections.
Your derivation of the Jacobian for Section 2.
Justification of your termination condition for Section 3.
Discussion of stability for Section 4.
Your regression coefficient table and scatterplots for Section 4.

Any additional notes or comments you feel are necessary for us to understand your assignment and grade it fairly and easily.

The Dropbox link to submit your assignment is here.

Last update 17-Oct-2013 00:13:47

smr at princeton edu