COS 323 - Computing for the Physical and Social Sciences |
Fall 2013 |
Course home | Outline and lecture notes | Assignments |
In this assignment, you will implement nonlinear least squares fitting using Gauss-Newton iteration. You will be applying your implementation to fitting murine embryonic stem cell data to a nonlinear model.
Embryonic stem cells (ESC) hold great promise in many areas of biological research because they have the properties of self-renewal and pluripotency. Pluripotency is the state in which the cell has the potential to differentiate into different cell types. For this reason stem cells are important to fundamental research in cell differentiation and regenerative medicine.
For this assignment we will be using a small subset of the data and model proposed by Park and Nakai in the following paper:
Sung-Joon Park and Kenta Nakai.
A regression analysis of gene expression in ES cells reveals two gene classes that are significantly different in epigenetic patterns. BMC Bioinformatics, 12 (Suppl 1):S50
The entire data set for the paper is available here: http://www.hgc.jp/~park/research/data/apbc2011/index.html
The paper was done entirely in silico. That means that the authors didn't do any wetlab work, but relied on publicly available gene expression data in the GEO database (GSE11431).
Gene expression is the process by which DNA is used to produce RNA. RNAs are then formed into proteins and functional RNAs and these determine the function of cells. Measuring the RNA in cells as well as specifying the type of RNA in cells can indicate their function. For this reason, characterizing and predicting gene expression is fundamental to understanding how cells function properly, when cells may not function properly, and determining what processes may be affecting the function of cells. The experiments that generated this data examined how transcription factors affect the gene expression of embryonic stem cells. Transcription factors (TF) are proteins that affect the expression of genes by binding to sequences of DNA and other proteins. The following 11 transcription factors were used in the model of the paper: n-Myc, Zfx, c-Myc, Klf4, Tcfcp2l1, Esrrb, Nanog, Oct4, Sox2, Stat3, and Smad1. In addition the authors used two additional different data sets in order to add the following epigenetic effects to their model: histone modification, methylation, and presence in a CpG island. You do not have to worry about understanding the biology for this assignment. You only need to understand that these transcription factors and epigenetic states are represented by variables in the model we will discuss in the next section to describe gene expression. The transcription factors and epigenetic states will be the explanatory variables in the model.
We will only be using a small subset of the data set contained in the files chen_Eland_Score_Epi.dat and epigenetic_effects.dat. The file
chen_Eland_Score_Epi.dat contains the following 16 columns:
Gene expression n-Myc Zfx c-Myc Klf4 Tcfcp2l1 Esrrb Nanog Oct4 Sox2 Stat3 Smad1 HistM Methy CpGI
The first is the gene name, followed by the gene expression value, followed by the transcription factors and epigenetic effects. The transcription factor scores were based on the density profile of transcription factor bindings. Each epigenetic effect was encoded with a discrete value included in the epigenetic_effects.dat for the possible epigenetic state. The transcription factor scores and epigenetic effects discrete values have already been log transformed, but the expression value has not.
The file epigenetic_effects.dat contains the following columns:
Gene, RPKM(ESC), RPKM(EB), CpGI_Existence, Histone_Mark_Pattern, DNA_Methylation (n=none, U=unmethylation, M=methylation)
These are the the gene name, absolute expression for ESC, absolute expression for embryoid body, whether a CpG Island existed or note, type of histone mark, and whether there was methylation or not, respectively.
Write Matlab code in a file named ESC_data.m to do the following:
The authors proposed the following log-log transformed model to describe the gene expression of ESCs in terms of the transcription factors and epigenetic effects:
$$\log y_i =\sum_{j=1}^{11} a_j \log S_{ij}+ c \log C_i + d \log H_i + e \log M_i $$
where $y_i$ is the i-th observed expression, $S_{ij}$ is the score of the j-th transcription factor in observation i, $C_i$ the observed CpG state, $H_i$ the histone mark,
and $M_i$, the methylation state, all for observation $i$. All values are log based 10. The authors then use linear regression to solve for the regression coefficients $a_j$,$c$,$d$, and $e$. All of the TF score and epigenetic state variables are explanatory variables and will be referred to as $x_{ij}$ in the following discussion where $i$ is the observation and $j$ is the number
of the variable or column in the explanatory variable matrix. We will refer
to the coefficients generically as $a,b,c,\ldots$ for the rest of the discussion.
However, there has been criticism of using log-log transformed data and linear regression rather than power laws and nonlinear regression. What is a problem mentioned in lecture about this? Instead we would like you to consider the power law version of this model. What is the model now?
To do: Write a Matlab function in a file named model.m that given a vector of coefficients
and matrix of explanatory variables will return the vector for the
dependent variable, $y$, for this model.
Fitting the model to data consists of estimating the coefficents using least squares, i.e, finding the coefficients that minimize the least squares. You will be using Gauss-Newton iteration to perform the fit - see the motivation behind this in slides 8-18 of Lecture 9. Here we will refer to the coefficients as $a,b,c \ldots$. Briefly, you will compute successively better estimates of the parameters via the iteration $$ \begin{pmatrix} a \\ b \\ c \\ \vdots \end{pmatrix}^{\!\!(k+1)} = \begin{pmatrix} a \\ b \\ c \\ \vdots \end{pmatrix}^{\!\!(k)} + \delta^{(k)}, \quad \mathrm{where} \quad J^\mathrm{T} J \, \delta = J^\mathrm{T} r. $$
The matrix $J$ is a Jacobian: the columns correspond to partial derivatives of our model with respect to $a, b, c, \ldots$, and each row corresponds to a different data point. The vector $r$ is the residual: each row is $y_i - f(x_{i},a,b,c \cdots)$ for a different data point where $f$ is the function for our model and $x_{i}$ is a vector of all the explanatory variables at observation $i$. (The parenthesized superscripts $(k)$ and $(k+1)$ are the values at different iterations.)
To do: Derive the expression for each row in the Jacobian.
Implement a Matlab function in a file named gaussnewton.m that performs the following:
Test your implementation on the ESC data in the file chen_Eland_Score_Epi.dat. Remember to be careful with the data to make sure you need to have it log-transformed or not at each step below. To do so we would like you to compare your implementation for nonlinear regression with linear regression on the data. Write Matlab code in a file named regresseval.m to do the following:
This assignment is due Tuesday, October 22, 2012 at 11:59 PM. Please see the course policies regarding assignment submission, late assignments, and collaboration.
Please submit:
The Dropbox link to submit your assignment is here.