Digging for Meaning in the Big Data of Human Biology | Computer Science Department at Princeton University

April 28, 2015

Since the Human Genome Project drafted the human body’s genetic blueprint more than a decade ago, researchers around the world have generated a deluge of information related to genes and the role they play in diseases like hypertension, diabetes, and various cancers.

Although thousands of studies have made discoveries that promise a healthier future, crucial questions remain. An especially vexing challenge has been to identify the function of genes in specific cells, tissues, and organs. Because human tissues cannot be studied by direct experimentation in living people, and many disease-relevant cell types cannot be isolated for analysis, the data have emerged in bits and pieces from studies that produced mountains of disparate signals.

A multi-year effort by researchers from Princeton and other universities and medical schools has taken a big step toward extracting knowledge from these big data collections and opening the door to new understanding of human illnesses. Their paper, published online by the prestigious biology journal Nature Genetics, demonstrates how computer science and statistical methods can comb broad expanses of diverse data to identify how genetic circuits function and change in different tissues relevant to disease.

Led by Olga Troyanskaya, professor in the Department of Computer Science and the Lewis-Sigler Institute of Integrative Genomics and deputy director for genomics at the Simons Center for Data Analysis in New York, the team used integrative computational analysis to dig out interconnections and relationships buried in the data pile. The study collected and integrated about 38,000 genome-wide experiments from an estimated 14,000 publications. Their findings produced molecular-level functional maps for 144 different human tissues and cell types, including many that are difficult or impossible to uncover experimentally.

“A key challenge in human biology is that genetic circuits in human tissues and cell types are very difficult to study experimentally,” Troyanskaya said. “For example, the podocyte cells in the kidneys, which are the cells that perform the filtering that the kidneys are responsible for, cannot be isolated and studied experimentally. Yet we must understand how proteins interact in these cells if we want to understand and treat chronic kidney disease. Our approach mines big data collections to build a map of how genetic circuits function in the podocyte cells, as well as in many other disease-relevant tissues and cell types.”

These tissue-specific networks allow biomedical researchers to understand the function and interactions of genes in specific cellular contexts and can illuminate the molecular basis of many complex human diseases. The researchers developed an algorithm, which they call a network-guided association study, or NetWAS, that combines these tissue-specific functional maps with standard genome-wide association studies (GWAS) in order to identify genes that are causal drivers of human disease. Because the technique is completely data-driven, NetWAS avoids biases toward well-studied genes and diseases, enabling discovery of completely new disease-associated genes, processes, and pathways.
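For readers who think in code, the general idea of combining a functional network with GWAS results can be sketched roughly as follows. This is a toy illustration and not the published NetWAS algorithm, whose details this article does not describe. It swaps in a simple guilt-by-association score: genes are reranked by their average network connection to genes that pass a nominal GWAS threshold. The gene names, edge weights, p-values, the 0.01 threshold, and the function names are all made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: gene -> {neighbor gene -> functional-link weight}.
Network = Dict[str, Dict[str, float]]


def edge_weight(network: Network, a: str, b: str) -> float:
    """Return the weight of the link between two genes, or 0.0 if none is stored.

    Edges are treated as undirected, so both orientations are checked.
    """
    if b in network.get(a, {}):
        return network[a][b]
    return network.get(b, {}).get(a, 0.0)


def rerank_genes(
    network: Network,
    gwas_pvalues: Dict[str, float],
    threshold: float = 0.01,  # assumed nominal-significance cutoff
) -> List[Tuple[str, float]]:
    """Rank genes by mean network connectivity to nominally significant genes."""
    seeds = [gene for gene, p in gwas_pvalues.items() if p < threshold]
    scores: Dict[str, float] = {}
    for gene in gwas_pvalues:
        # A gene does not count as its own supporting evidence.
        partners = [s for s in seeds if s != gene]
        if not partners:
            scores[gene] = 0.0
            continue
        total = sum(edge_weight(network, gene, s) for s in partners)
        scores[gene] = total / len(partners)
    # Highest-scoring genes first; ties keep their input order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up tissue network and GWAS p-values, for illustration only.
toy_network: Network = {
    "GENE_A": {"GENE_B": 0.9, "GENE_C": 0.8, "GENE_D": 0.1},
    "GENE_B": {"GENE_C": 0.7, "GENE_D": 0.2},
    "GENE_C": {"GENE_D": 0.1},
}
toy_pvalues = {"GENE_A": 0.001, "GENE_B": 0.005, "GENE_C": 0.2, "GENE_D": 0.03}

ranked = rerank_genes(toy_network, toy_pvalues)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

In this toy data, GENE_C has a weak GWAS signal but is strongly connected to the two nominally significant genes, so it ranks above GENE_D, which has a stronger p-value but few functional links. That is the kind of reprioritization a network-guided approach is meant to enable.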

To put NetWAS and the tissue-specific networks in the hands of biomedical researchers around the world, the team created an interactive server called GIANT (for Genome-scale Integrated Analysis of Networks in Tissues). GIANT allows users to explore these networks, compare how genetic circuits change across tissues, and analyze data from genetic studies to find genes that cause disease.

Aaron K. Wong, a data scientist at the Simons Center for Data Analysis and formerly a graduate student in the computer science department at Princeton, played the lead role in creating GIANT. “Our goal was to develop a resource that was accessible to biomedical researchers,” he said. “For example, with GIANT, researchers studying Parkinson’s disease can search the substantia nigra network, which represents the brain region affected by Parkinson’s, to identify new genes and pathways involved in the disease.” Wong is one of three co-first authors of the paper.

The paper’s other two co-first authors are Arjun Krishnan, a postdoctoral fellow at the Lewis-Sigler Institute; and Casey Greene, an assistant professor of genetics at Dartmouth College, who was a postdoctoral fellow at Lewis-Sigler from 2009 to 2012. The team also included Ran Zhang, a graduate student in Princeton’s Department of Molecular Biology, and Kara Dolinski, assistant director of the Lewis-Sigler Institute.

Looking to the future, Troyanskaya sees practical therapeutic uses for the group’s findings about the interrelatedness of genetic actions. “Biomedical researchers can use these networks and the pathways that they uncover to understand drug action and side effects, and to repurpose drugs,” she said. “They can also be useful for understanding how various therapies work and how to develop new ones.”

Other contributors to the study were Emanuela Ricciotti, Garret A. FitzGerald, and Tilo Grosser of the Department of Pharmacology and the Institute for Translational Medicine and Therapeutics at the Perelman School of Medicine, University of Pennsylvania; Rene A. Zelaya, of Dartmouth; Daniel S. Himmelstein, of the University of California, San Francisco; Boris M. Hartmann, Elena Zaslavsky, and Stuart C. Sealfon, of the Department of Neurology at the Icahn School of Medicine at Mount Sinai, in New York; and Daniel I. Chasman, of Brigham and Women's Hospital and Harvard Medical School in Boston.

The Simons Center for Data Analysis was formed in 2013 by the Simons Foundation, a private organization dedicated to advancing research in mathematics and the basic sciences.