Using Epigenomic Data to Accurately Predict and Distinguish Cell Types
April 28, 2020
If you are a clinician trying to treat cancer, one of the things you need to know is which types of cells are present in the tumor so that you can choose the most effective treatment. Alternatively, if you are culturing stem cells for regenerative medicine purposes, you need an accurate understanding of the resulting cell types so that you can reprogram the precursor cells accordingly.
However, significant gaps exist in our understanding of cell fate determination because it depends on how gene-environment interactions or epigenetics govern cell identity.
An exciting recent publication in the journal Science Advances presents a data-driven approach drawing from machine-learning techniques to predict cell type from high-throughput transcriptomic and chromatin-mapping datasets with high accuracy and efficiency.
The new method described in this paper is an advancement over existing approaches, which are quite restricted in their ability to distinguish multiple cell types based on epigenotype or epigenetic profiles. The complexity of intracellular networks that involve 10^4 to 10^6 genes and their products makes it difficult to find patterns that reliably predict cell phenotype, but this new study can potentially help to overcome these challenges and contribute to cell behavior alteration, cellular reprogramming, and development of regenerative therapies.
The authors were able to apply their new approach to both gene expression and Hi-C datasets to distinguish cell types better than existing methods, even for large sets of cell types containing multiple different human normal and cancer tissues.
Furthermore, their approach could help in cancer diagnostics by establishing biomarkers that reliably identify cancer subtypes. The global scope of this research contributes to advancing the field of network medicine, where large bioinformatic datasets are integrated to guide research towards disease treatment.
What Makes Different Cell Types Different From Each Other?
Generally speaking, all cells in an organism are genetically identical – a neuron has the same DNA sequence as a liver cell, for example. Despite all cells having the same genomic sequences, many organisms have multiple different cell types that have very different properties and functions.
We know that cellular behavior is largely regulated by epigenetic modifications, which are heritable changes to DNA and proteins that lead to distinctive gene expression patterns.
Epigenetic mechanisms include DNA methylation, chromatin remodeling, and histone post-translational modifications. Methods have been developed to investigate epigenetic changes (epigenotypes), but existing approaches to make predictions of cell type based on epigenetic patterns are limited due to the complexity and scale of intracellular networks.
How Can Epigenomic Data Be Used to Identify Cell Types?
In this study, the authors used publicly available gene expression and chromosome conformation data to generate models that allowed them to make reliable predictions about cell types and cell behaviors. They used human gene expression microarray data from GEO (Gene Expression Omnibus), Hi-C data from SRA (Sequence Read Archive), and RNA-Seq data from the Genotype-Tissue Expression (GTEx) database, referring to these datasets as GeneExp (microarray), Hi-C, and GTEx (RNA-Seq).
The authors used mathematical modeling to translate biomolecular data into cell type predictions. Their approach involved performing comparisons between cell types that the authors called “test” and “query” groups. The expectation was that certain genetic pathways would be active in the test cell type, where stronger correlations between the constituent genes would be observed.
Because correlations between genes and loci define the cellular state, the authors adopted an encoding process in which epigenomic measurements from two different cells yield different epigenotypes. The data were processed using correlations and matrices to determine a condition-specific effective network, which indicates the relationships that are enforced or possible under the specified conditions. Statistical analysis then distinguishes real biological signals from observation error, and cell type homogeneity is assessed from probability distributions.
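To make the idea of a correlation-based network more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function name `effective_network`, the 0.8 threshold, and the synthetic "test" and "query" data are all assumptions chosen for illustration. It only shows how an active pathway could appear as a cluster of strongly correlated genes in one cell type and not in another.

```python
import numpy as np

def effective_network(expression, threshold=0.8):
    """Build a simple correlation network for one cell type.

    expression: array of shape (samples, genes).
    threshold: hypothetical cutoff on |correlation| for drawing an edge.
    Returns the gene-gene correlation matrix and a boolean adjacency matrix.
    """
    corr = np.corrcoef(expression, rowvar=False)  # genes are columns
    adjacency = np.abs(corr) >= threshold
    np.fill_diagonal(adjacency, False)  # ignore self-correlation
    return corr, adjacency

rng = np.random.default_rng(0)
n_samples, n_genes = 50, 6

# Synthetic "test" cell type: genes 0-2 share a common driver,
# standing in for an active pathway (illustrative assumption).
driver = rng.normal(size=(n_samples, 1))
test_cells = rng.normal(scale=0.3, size=(n_samples, n_genes))
test_cells[:, :3] += driver

# Synthetic "query" cell type: all genes vary independently.
query_cells = rng.normal(size=(n_samples, n_genes))

_, test_net = effective_network(test_cells)
_, query_net = effective_network(query_cells)

# Each undirected edge appears twice in the symmetric adjacency matrix.
print("edges in test network:", int(test_net.sum()) // 2)
print("edges in query network:", int(query_net.sum()) // 2)
```

In this toy setup the edges concentrate among the three co-regulated genes of the test cell type, which is the kind of condition-specific structure the paragraph above describes.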
Further, the authors predicted cell type probabilities using a machine learning algorithm. Their results fared better than predictions made using other methods, and their cell identity predictions were robust and not affected by small perturbations.
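As a rough illustration of what predicting cell type probabilities can look like, the sketch below uses a plain k-nearest-neighbors vote as a stand-in classifier. The article does not specify the algorithm at this level of detail, so the function `knn_cell_type_probabilities`, the choice of k, the labels, and the synthetic profiles are all assumptions. The last step mimics the robustness check by adding a small perturbation to the query profile.

```python
import numpy as np

def knn_cell_type_probabilities(train_profiles, train_labels, query_profile, k=5):
    """Estimate cell type probabilities as the label fractions among
    the k training profiles closest to the query (Euclidean distance)."""
    train_labels = np.asarray(train_labels)
    distances = np.linalg.norm(train_profiles - query_profile, axis=1)
    nearest = np.argsort(distances)[:k]
    types, counts = np.unique(train_labels[nearest], return_counts=True)
    return {str(t): float(c) / k for t, c in zip(types, counts)}

rng = np.random.default_rng(1)
n_features = 10

# Synthetic profiles for two hypothetical cell types with distinct centroids.
monocyte = rng.normal(loc=0.0, size=(20, n_features))
liver = rng.normal(loc=3.0, size=(20, n_features))
profiles = np.vstack([monocyte, liver])
labels = ["monocyte"] * 20 + ["liver"] * 20

# A query profile near the "liver" centroid.
query = np.full(n_features, 3.0)
probs = knn_cell_type_probabilities(profiles, labels, query)
print(probs)

# Robustness check: a small perturbation should not change the call.
perturbed = knn_cell_type_probabilities(
    profiles, labels, query + rng.normal(scale=0.1, size=n_features)
)
print(max(perturbed, key=perturbed.get))
```

A neighbor vote is used here only because it yields probabilities directly and is easy to read; any classifier that outputs class probabilities would serve the same illustrative purpose.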
The authors further refined their machine learning algorithms to improve their predictive power. They demonstrated that there were few overlaps between cancer cell type groups and other groups, validating the performance of their algorithm. Moreover, the researchers observed substantial overlap among epithelial cell tissues, which reflects their functional similarity, and were not able to distinguish between different neural precursor groups, because of their reprogramming potential towards induced pluripotency. Hence, the authors’ approach preserves aspects of gene expression space that are relevant to cell behavior.
Their refined algorithm was shown to classify monocytes, lymphocytes, leukemias, liver tissue, kidney tissue, and renal cancer without errors, which reflects their uniqueness compared to other cell type groups. The prediction algorithm also maintained accuracy for the Hi-C data.
Summary: There’s Gold in Them Thar Epigenomic Data
Overall, this study contributes to efforts to predict how biology works by developing an improved method of processing “omics” datasets. The algorithm developed by the authors predicted different cell types across different datasets, e.g. DNA microarray, Hi-C, and RNA-Seq, and several variables, with over 60% accuracy, which is a significant achievement.
These findings are exciting because protein expression is the primary determinant of cell fate, and the method described in this paper predicted cell identity not from protein levels, but instead from mRNA fluctuations (RNA-Seq), which account for <45% of the variance in protein abundance, and from chromatin structure (Hi-C), whose degree of separation from protein expression is even greater.
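To put the "<45% of the variance" figure in perspective, the short sketch below assumes it refers to variance explained in the usual linear sense, that is, the squared Pearson correlation between mRNA and protein levels. That reading is an assumption, and the data are synthetic. Under it, 45% of variance explained corresponds to a correlation of roughly 0.67.

```python
import numpy as np

def variance_explained(x, y):
    """Fraction of the variance in y explained by a linear fit on x,
    computed as the squared Pearson correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)

# Correlation that corresponds to 45% of variance explained.
max_r = float(np.sqrt(0.45))
print(f"correlation at 45% variance explained: {max_r:.2f}")

# Synthetic mRNA and protein levels with a true correlation of 0.6
# (illustrative values only, not taken from the study).
rng = np.random.default_rng(2)
mrna = rng.normal(size=2000)
protein = 0.6 * mrna + 0.8 * rng.normal(size=2000)

r2 = variance_explained(mrna, protein)
print(f"variance in protein explained by mRNA: {r2:.2f}")
```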
Their method also highlighted predictive sensitivities between different data types. The predictive powers for analysis of the GTEx dataset (RNA-Seq) were higher than the GeneExp dataset (DNA microarray). Hence, the authors’ method can make predictions with impressively high accuracy and efficiency. Their results were even better than expected considering various constraints including biological, computational, and long-range conformational complexities.
The method is clinically advantageous for establishing biomarkers to identify cancer subtypes because it distinguishes closely related cell types using subtle alterations, and distant cell types using broader variations. By predicting cellular data, we can analyze normal versus diseased cells and develop therapies that are currently unavailable due to a lack of information.
This approach can also help develop precision medicine-based therapies using epigenomic profiles from patients to predict disease susceptibility and cellular reprogramming potential for cell therapy/gene therapy. The information can further help advance personalized medicine-based approaches against infectious diseases like COVID-19 and immune therapies against similar epidemiological diseases.
In addition to accurate predictions for cell behavior across versatile datasets, future applications can also include comparisons between sensitivities of various assays. The method can be applied to noncoding RNAs to understand their functional role in shaping cell types, gene expression, and cancer. The method can also be used to interpret phenotype from large databases of existing cell type patterns.
About the author
Rwik Sen, Ph.D.
Rwik is from Kolkata in eastern India, a city of history, multiple cultures and food. Kolkata is in the state of West Bengal, which hosts 2 UNESCO World Heritage Sites and has part of the Himalayan mountains to its north and the Bay of Bengal to its south. Love of the natural world and mystery novels made Rwik passionate about scientific discovery. Hence, after his undergraduate degree in biotechnology, Rwik went to Southern Illinois University for a Ph.D., followed by postdoctoral training at the University of Colorado. Interacting with people, traveling, and promoting STEM outreach and inclusion are some of the things Rwik enjoys in addition to the ocean and dancing.