Instructor | Ellen Zhong |
Time | Thursdays 3:00-5:00p, Friend Center 007 |
"Precept" / student-only discussion | Tuesdays 4:30-5:30p, CS 401 |
Office hours | Wednesdays 4:00-5:00p, CS 314, or by appointment |
Slack | Link |
Syllabus | Link |
Recent breakthroughs in machine learning algorithms have transformed the study of the 3D structure of proteins and other biomolecules. This seminar class will survey recent papers on ML applied to tasks in protein structure prediction, structure determination, computational protein design, physics-based modeling, and more. We will take a holistic approach when discussing papers, considering their historical context, algorithmic contributions, and potential impact on scientific discovery and on applications such as drug discovery.
For more information on the discussion format, expectations, and grading, see the course syllabus.
A non-exhaustive list of topics we will cover includes:
Selected papers will cover a broad range of algorithmic concepts and machine learning techniques including:
In addition to the assigned papers, optional primers or reviews on relevant topics will be made available for background reading.
Monday, September 25th, 4:30pm ET
John Jumper (DeepMind)
Title: Highly accurate protein structure prediction with deep learning
Abstract: Our work on deep learning for biology, specifically the AlphaFold system, has demonstrated that neural networks are capable of highly accurate modeling of both protein structure and protein-protein interactions. In particular, the system shows a remarkable ability to extract chemical and evolutionary principles from experimental structural data. This computational tool has repeatedly shown the ability not only to predict accurate structures for novel sequences and novel folds but also to perform unexpected tasks such as selecting stable protein designs or detecting protein disorder. In this lecture, I will discuss the context of this breakthrough: the machine learning principles, the diverse data, and the rigorous evaluation environment that enabled it to occur, and the many innovative ways in which the community is using these tools to do new types of science. I will also reflect on some surprising limitations, such as insensitivity to mutations and the lack of context about the chemical environment of the proteins, and how these may be traced back to essential features of the training process. Finally, I will conclude with a discussion of some ideas on the future of machine learning in structural biology and how the experimental and computational communities can think about organizing their research and data to enable many more such breakthroughs in the future.
Bio: John Jumper received his PhD in Chemistry from the University of Chicago, where he developed machine learning methods to simulate protein dynamics. Prior to that, he worked at D.E. Shaw Research on molecular dynamics simulations of protein dynamics and supercooled liquids. He also holds an MPhil in Physics from the University of Cambridge and a B.S. in Physics and Mathematics from Vanderbilt University. At DeepMind, John is leading the development of new methods to apply machine learning to protein biology.
Thursday, November 16th, 3:00pm ET
Jason Yim (MIT)
Title: Diffusion models for protein structure and de novo design
Abstract: Generative machine learning is revolutionizing protein design. In this talk, I will discuss recent advances in using diffusion models to generate protein structures and perform conditional generation toward protein design desiderata. First, I will go over FrameDiff, including an overview of the mathematical foundation of SE(3) diffusion and a practical algorithm for training a frame-based generative model over protein backbones. Next, I will describe how SE(3) diffusion is used in RFdiffusion, a state-of-the-art protein design method that is pre-trained on protein structure prediction. We show that a single method, RFdiffusion, enables binder design, motif scaffolding, and symmetric protein generation. Finally, I will discuss current limitations and the technical challenges on the horizon for de novo protein design.
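To make the diffusion idea concrete: a minimal toy sketch of the forward (noising) and reverse (denoising) relationship that diffusion models are built on, using plain Gaussian noise on 2D coordinates. This is an illustration only, not FrameDiff or RFdiffusion; those methods diffuse over SE(3) rigid-body frames with a learned denoising network, whereas here the noise schedule and "backbone" are invented for the example and the true noise stands in for the network's prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, T=100):
    """Corrupt clean coordinates x0 at step t with Gaussian noise
    (signal fraction shrinks linearly with t) -- the forward process."""
    alpha = 1.0 - t / T
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps
    return xt, eps

# Toy "backbone": 10 residues on a straight line in 2D (hypothetical data)
x0 = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
xt, eps = forward_noise(x0, t=50)

# A trained network would predict eps from (xt, t); here we use the true
# noise to show how the reverse step recovers the clean structure.
alpha = 1.0 - 50 / 100
x0_hat = (xt - np.sqrt(1.0 - alpha) * eps) / np.sqrt(alpha)
assert np.allclose(x0_hat, x0)
```

In a real model, the denoising step is applied iteratively from pure noise, so novel backbones are sampled rather than reconstructed.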
Bio: Jason Yim is a PhD candidate at the Massachusetts Institute of Technology (MIT) Computer Science and Artificial Intelligence Laboratory, advised by Tommi Jaakkola and Regina Barzilay. His research focuses on developing generative models for scientific applications as well as experimental design in biology. He has previously worked as a research engineer at DeepMind and interned at Microsoft AI4Science.
Thursday, December 7th, 3:00pm ET
Stephan Eismann (Atomic AI)
Title: Enabling structure-based drug discovery for RNA using AI
Abstract: RNA molecules adopt three-dimensional structures that are critical to their function and of interest in drug discovery. Few RNA structures are known, however, and predicting them computationally has proven challenging. I will talk about ARES, a machine learning approach that enables identification of accurate structural models without assumptions about their defining characteristics, despite being trained on only 18 known RNA structures. ARES outperforms previous methods and has consistently produced the best results in community-wide blind RNA structure prediction challenges. In addition to ARES, I will talk about recent advancements in tertiary RNA structure prediction at Atomic AI.
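The workflow ARES-style scoring enables can be sketched in a few lines: generate many candidate structural models, score each with a learned function that predicts accuracy, and keep the best-ranked model. Everything below is a hypothetical stand-in; the `predicted_rmsd` function here is a fixed linear map over invented features, not the real ARES network, which is an equivariant neural network operating directly on atomic coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for learned scorer weights (assumption, not the real model)
weights = np.array([0.7, 0.3])

def predicted_rmsd(features):
    """Map a candidate model's geometric features to a predicted
    accuracy score (lower = better, mimicking predicted RMSD)."""
    return float(features @ weights)

# Five candidate structural models, each summarized by two toy features
candidates = {f"model_{i}": rng.random(2) for i in range(5)}

# Score every candidate and select the one predicted to be most accurate
scores = {name: predicted_rmsd(f) for name, f in candidates.items()}
best = min(scores, key=scores.get)
```

The key property described in the abstract is that the scorer generalizes from very little training data (18 structures), so the same select-the-best loop works on RNAs unlike anything in the training set.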
Bio: Stephan leads the ML team at Atomic AI. Originally from Germany, he completed his PhD at Stanford University, where his research focused on the development of novel ML algorithms for problems in structural biology.
Please fill out this form and contact Ellen if you are interested in signing up for this class. See last year's course website for a sample of topics and papers we will cover.
Post-lecture feedback: Please fill out this form if you are assigned to give feedback on a lecture.
Week | Date | Topic | Readings | Presenters | Questions and Feedback |
---|---|---|---|---|---|
1 | September 7 | Course overview; Introduction to machine learning in structural biology | Additional Resources: 1. Dill et al. The Protein Folding Problem. Annual Review of Biophysics 2008. | Ellen Zhong [Slides] | N/A |
2 | September 14 | Protein structure prediction; CASP; Supervised learning; Protein-specific metrics | 1. Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020. 2. Ingraham, J. et al. Learning Protein Structure with a Differentiable Simulator. ICLR 2019 Oral. [Talk] Additional Resources: 3. AlphaFold1 CASP13 slides 4. https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp13-what-just-happened/ 5. trRosetta: Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS 2020. | Ellen Zhong, David Shustin [Slides-1] [Slides-2] | Pre-lecture questions Feedback: Yihao Liang, Ambri Ma |
3 | September 21 | Breakthroughs in protein structure prediction | 1. Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021. 2. Tunyasuvunakool, K., Adler, J., Wu, Z. et al. Highly accurate protein structure prediction for the human proteome. Nature 2021. Additional Resources: 3. AlphaFold2 slides. [CASP14 talk] [Michael Figurnov slides] 4. https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/ 5. Baek et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021. [paper] 6. Primer on transformers: [1] [2] | Viola Chen, Xiaxin Shen, Ellen Zhong [Slides-1] [Slides-2] | Pre-lecture questions Feedback: Andy Zhang, Brendan Wang |
4 | September 28 | Protein structure determination I: Cryo-EM reconstruction | 1. Zhong et al. Reconstructing continuous distributions of protein structure from cryo-EM images. ICLR 2020 Spotlight. 2. Zhong et al. CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 2021. [pdf] Additional Resources: 3. Computer vision related works: i. Mildenhall, Srinivasan, Tancik et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020 Oral. [project page] ii. Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020 Spotlight. iii. Xie et al. Neural Fields in Visual Computing and Beyond. Computer Graphics Forum 2022. 5. Cryo-EM background: Singer & Sigworth. Computational Methods for Single-Particle Cryo-EM. Annual Review of Biomedical Data Science 2020. 6. Primer on Variational Autoencoders: [1] [2] [3] [4] | Ellen Zhong [Slides] | Pre-lecture questions |
5 | October 5 | Protein language modeling | Sample of: 1. Rives et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS 2020. 2. Hie et al. Learning Mutational Semantics. NeurIPS 2020. 3. Hie et al. Learning the language of viral evolution and escape. Science 2021. 4. ESM-2/ESMAtlas: Lin et al. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv 2022. 5. ESM-MSA-1b: Rao et al. MSA Transformer. ICML 2021. 6. Riesselman et al. Deep generative models of genetic variation capture the effects of mutations. Nature Methods 2018. 7. Bepler & Berger. Learning protein sequence embeddings using information from structure. ICLR 2019. 8. Nijkamp et al. ProGen2: Exploring the Boundaries of Protein Language Models. arXiv 2022. 9. Ferruz et al. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications 2022. 10. Chen et al. xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Proteins. bioRxiv 2023. 11. Rao, Bhattacharya, Thomas et al. Evaluating Protein Transfer Learning with TAPE. NeurIPS 2019 Spotlight. 12. Zheng et al. Structure-informed Language Models Are Protein Designers. ICML 2023 Oral. | Paper discussion + Short presentations | Flash talk info and sign up spreadsheet Written summary due before class on Canvas. Presentation upload form: here |
6 | October 12 | Protein design I: Inverse folding | 1. Ingraham et al. Generative models for graph-based protein design. NeurIPS 2019. 2. Dauparas et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022. 3. ESM-IF1: Hsu et al. Learning inverse folding from millions of predicted structures. ICML 2022. | Brendan Wang, Yukang Yang, Justin Wang [Slides-1] [Slides-2] [Slides-3] | Pre-lecture questions Feedback: Kaiqu Liang, Minkyu Jeon, Xiaxin Shen |
7 | October 19 | No class -- Fall Recess | Final Project Part 1 Due (Project proposal) | | |
8 | October 26 | Structural bioinformatics | 1. Mackenzie et al. Tertiary alphabet for the observable protein structural universe. PNAS 2016. 2. van Kempen, Kim et al. Fast and accurate protein structure search with Foldseek. Nature Biotechnology 2023. | Eugene Choi, Snigdha Sushil Mishra [Slides-1] [Slides-2] | Pre-lecture questions Feedback: Jiahao Qiu, Viola Chen |
9 | November 2 | Physics-based modeling | 1. Lindorff-Larsen et al. How fast-folding proteins fold. Science 2011. [Perspective] 2. Noe et al. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 2019. [talk] Optional further reading: 3. Shaw et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010. 4. CVPR 2021 tutorial on normalizing flows. 5. Grathwohl, Chen, et al. FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. ICLR 2019 Oral. | Ellen Zhong, Yihao Liang, Jiahao Qiu [Slides-1] [Slides-2] | Pre-lecture questions Feedback: Justin Wang, Eugene Choi |
10 | November 9 | Protein structure determination II | 1. Punjani and Fleet. 3DFlex: determining structure and motion of flexible proteins from cryo-EM. Nature Methods 2023. 2. Jamali et al. Automated model building and protein identification in cryo-EM maps. bioRxiv 2023. | Minkyu Jeon, Ambri Ma [Slides] | Pre-lecture questions Feedback: Alkin Kaz, Victor Chu |
11 | November 16 | Protein Design II | 1. Yim, Trippe, De Bortoli, Mathieu et al. SE(3) diffusion model with application to protein backbone generation. ICML 2023. 2. Watson, Juergens, Bennett et al. De novo design of protein structure and function with RFdiffusion. Nature 2023. 3. Ingraham et al. Illuminating protein space with a programmable generative model. bioRxiv 2022. | Jason Yim (guest speaker), Alkin Kaz [Slides-1] [Slides-2] | Pre-lecture questions Feedback: David Shustin, Howard Yen |
12 | November 23 | No class -- Thanksgiving | | | |
13 | November 30 | Small molecule drug discovery | 1. Corso, Stark, Jing et al. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR 2023. 2. Krishna, Wang, Ahern et al. Generalized Biomolecular Modeling and Design with RoseTTAFold All-Atom. bioRxiv 2023. | Victor Chu, Howard Yen [Slides] | Pre-lecture questions Feedback: Yukang Yang, Snigdha Sushil Mishra |
14 | December 7 | RNA structure prediction | 1. Townshend, Eismann, Watkins et al. Geometric deep learning of RNA structure. Science 2021. Additional Resources: 2. Zhang et al. Advances and opportunities in RNA structure experimental determination and computational modeling. Nature Methods 2022. | Stephan Eismann (guest speaker) | |
15 | Tuesday, December 12, 3:00-5:00pm | Final project presentations | | | |