Using Epigenomic Data to Accurately Predict and Distinguish Cell Types

April 28, 2020

If you are a clinician treating cancer, you need to know which cell types are present in the tumor so that you can choose the most effective treatment. Likewise, if you are culturing stem cells for regenerative medicine, you need an accurate understanding of the resulting cell types so that you can reprogram the precursor cells accordingly.

However, significant gaps remain in our understanding of cell fate determination, because cell fate depends on how gene-environment interactions, or epigenetics, govern cell identity.

An exciting recent publication in the journal Science Advances presents a data-driven approach drawing from machine-learning techniques to predict cell type from high-throughput transcriptomic and chromatin-mapping datasets with high accuracy and efficiency.

The new method described in this paper improves on existing approaches, which are quite limited in their ability to distinguish multiple cell types based on epigenotype, or epigenetic profile. The complexity of intracellular networks, which involve 10⁴ to 10⁶ genes and their products, makes it difficult to find patterns that reliably predict cell phenotype. This new study could help to overcome these challenges and contribute to cell behavior alteration, cellular reprogramming, and the development of regenerative therapies.

The authors applied their new approach to both gene expression and Hi-C datasets and distinguished cell types better than existing methods, even for large sets of cell types spanning many different normal and cancerous human tissues.

Furthermore, their approach could help in cancer diagnostics by establishing biomarkers that reliably identify cancer subtypes. The global scope of this research contributes to advancing the field of network medicine, where large bioinformatic datasets are integrated to guide research towards disease treatment.

What Makes Different Cell Types Different From Each Other?

Generally speaking, all cells in an organism are genetically identical – a neuron has the same DNA sequence as a liver cell, for example. Despite all cells having the same genomic sequences, many organisms have multiple different cell types that have very different properties and functions.

We know that cellular behavior is largely regulated by epigenetic modifications, which are heritable changes to DNA and proteins that lead to distinctive gene expression patterns.

Epigenetic mechanisms include DNA methylation, chromatin remodeling, and histone post-translational modifications. Methods have been developed to investigate epigenetic changes (epigenotypes), but existing approaches to make predictions of cell type based on epigenetic patterns are limited due to the complexity and scale of intracellular networks.

How Can Epigenomic Data Be Used to Identify Cell Types?

In this study, the authors used publicly available gene expression and chromosome conformation data to generate models that allowed them to make reliable predictions about cell types and cell behaviors. They used human gene expression microarray data from GEO (Gene Expression Omnibus), publicly available Hi-C data from SRA (Sequence Read Archive), and RNA-Seq data from the Genotype-Tissue Expression (GTEx) database, referring to these datasets as GeneExp (microarray), Hi-C, and GTEx (RNA-Seq).

The authors used mathematical modeling to translate biomolecular data into cell type predictions. Their approach involved comparing groups of cell types that the authors called “test” and “query” groups. The expectation was that certain genetic pathways would be active in the test cell type, where stronger correlations between the constituent genes would be observed.
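To make the idea of "stronger correlations between constituent genes" concrete, here is a minimal Python sketch. It is an illustration only, not the authors' published code: the expression values are made-up toy numbers, and the names `pearson`, `correlation_matrix`, `test_group`, and `other_group` are hypothetical. It computes gene-gene Pearson correlations within each group of samples, so that a pair of co-regulated genes stands out in one group but not in the other.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A gene that never varies has no defined correlation; report 0.0 here.
    return cov / denom if denom > 0 else 0.0

def correlation_matrix(samples):
    """samples: list of per-sample expression vectors (one value per gene)."""
    genes = list(zip(*samples))  # transpose: one tuple of values per gene
    return [[pearson(g1, g2) for g2 in genes] for g1 in genes]

# Toy data (assumed values): rows are samples, columns are genes A, B, C.
test_group = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 4.0],
    [3.0, 6.2, 6.0],
    [4.0, 7.8, 3.0],
]
other_group = [
    [1.0, 5.0, 5.0],
    [2.0, 3.0, 4.0],
    [3.0, 6.0, 6.0],
    [4.0, 4.0, 3.0],
]

corr_test = correlation_matrix(test_group)
corr_other = correlation_matrix(other_group)

# Genes A and B move together in the test group but not in the other group.
print(f"A-B correlation, test group:  {corr_test[0][1]:.3f}")
print(f"A-B correlation, other group: {corr_other[0][1]:.3f}")
```

In this toy example, genes A and B rise and fall together only in the first group, which is the kind of group-specific co-variation the approach looks for.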

Based on correlations between genes and loci that define cellular state, the authors adopted an encoding process in which epigenomic measurements from different cell types yield different epigenotypes. The data were processed into correlation matrices to determine a condition-specific effective network, which indicates the relationships that are enforced or possible under the specified conditions. Statistical analysis was then used to distinguish real biological signals from observation error, and to assess cell type homogeneity from probability distributions.
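One simple way to picture how a correlation matrix becomes a network while filtering out observation error is sketched below. This is an assumption-laden illustration, not the statistical procedure used in the paper: it swaps in the standard Fisher z-transform as the significance test, and the function name `significant_edges`, the example matrix, and the sample count of 30 are all hypothetical.

```python
import math

def significant_edges(corr, n_samples, z_crit=1.96):
    """Return (i, j, r) for gene pairs whose correlation is unlikely to be noise.

    Uses the Fisher z-transform: when the true correlation is zero,
    atanh(r) is approximately normal with standard error
    1 / sqrt(n_samples - 3).
    """
    if n_samples <= 3:
        raise ValueError("need more than 3 samples")
    se = 1.0 / math.sqrt(n_samples - 3)
    edges = []
    for i in range(len(corr)):
        for j in range(i + 1, len(corr)):
            # Clip so atanh stays finite for perfectly correlated pairs.
            r = max(min(corr[i][j], 0.999999), -0.999999)
            if abs(math.atanh(r)) / se > z_crit:
                edges.append((i, j, corr[i][j]))
    return edges

# Assumed example: correlations among three genes measured in 30 samples.
corr = [
    [1.00, 0.85, 0.10],
    [0.85, 1.00, -0.05],
    [0.10, -0.05, 1.00],
]
edges = significant_edges(corr, n_samples=30)
print(edges)
```

Only the strongly correlated pair survives the threshold here, and the weak correlations are treated as indistinguishable from noise. A real analysis across thousands of genes would also need to correct for multiple testing, which this sketch omits.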

Further, the authors predicted cell type probabilities using a machine learning algorithm. Their results fared better than predictions made using other methods, and their cell identity predictions were robust and not affected by small perturbations.
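As a rough illustration of what "predicting cell type probabilities" can look like in code, the sketch below uses a nearest-centroid classifier with a softmax over distances. This is a stand-in technique chosen for simplicity, not the algorithm from the paper, and the cell type labels, feature values, and function names (`centroid`, `predict_proba`) are hypothetical.

```python
import math

def centroid(rows):
    """Mean feature vector of a list of equal-length profiles."""
    return [sum(col) / len(col) for col in zip(*rows)]

def predict_proba(profile, centroids):
    """Softmax over negative squared distances to each cell-type centroid."""
    scores = {
        label: -sum((p - c) ** 2 for p, c in zip(profile, center))
        for label, center in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exp_scores.values())
    return {label: v / total for label, v in exp_scores.items()}

# Assumed toy training profiles: three features per sample.
training = {
    "monocyte": [[5.0, 1.0, 0.5], [4.8, 1.2, 0.4]],
    "hepatocyte": [[0.5, 4.0, 3.0], [0.7, 4.2, 2.8]],
}
centroids = {label: centroid(rows) for label, rows in training.items()}

probs = predict_proba([4.9, 1.0, 0.6], centroids)
best = max(probs, key=probs.get)

# A slightly perturbed profile should receive the same label.
perturbed = predict_proba([4.95, 1.05, 0.55], centroids)
best_perturbed = max(perturbed, key=perturbed.get)

print(best, best_perturbed)
```

The second call mirrors the robustness point above: nudging the input profile slightly leaves the predicted label unchanged.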

The authors further refined their machine learning algorithms to improve their predictive power. They demonstrated that there was little overlap between cancer cell type groups and other groups, validating the performance of their algorithm. Moreover, the researchers observed substantial overlap among epithelial cell tissues, which reflects their functional similarity, and were not able to distinguish between different neural precursor groups because of their reprogramming potential towards induced pluripotency. Hence, the authors’ approach preserves aspects of gene expression space that are relevant to cell behavior.

Their refined algorithm was shown to classify monocytes, lymphocytes, leukemias, liver tissue, kidney tissue, and renal cancer without errors, which reflects how distinct these groups are from other cell type groups. The prediction algorithm also maintained its accuracy for the Hi-C data.

Summary: There’s Gold in Them Thar Epigenomic Data

Overall, this study contributes to efforts to predict how biology works by developing an improved method of processing “omics” datasets. The algorithm developed by the authors predicted cell types across different types of datasets (DNA microarray, Hi-C, and RNA-Seq) and several variables with over 60% accuracy, which is a significant achievement.

These findings are exciting because protein expression is the primary determinant of cell fate, yet the method described in this paper predicted cell identity not from protein levels but from mRNA fluctuations (RNA-Seq), which account for <45% of the variance in protein abundance, and from chromatin structure (Hi-C), whose degree of separation from protein expression is even greater.

Their method also highlighted differences in predictive sensitivity between data types. Predictive power was higher for the GTEx dataset (RNA-Seq) than for the GeneExp dataset (DNA microarray). Overall, the authors’ method can make predictions with impressively high accuracy and efficiency. Their results were even better than expected considering various constraints, including biological, computational, and long-range conformational complexities.

The method is clinically advantageous for establishing biomarkers to identify cancer subtypes because it distinguishes closely related cell types using subtle alterations, and distant cell types using broader variations. By predicting cellular data, we can analyze normal versus diseased cells and develop therapies that are currently unavailable due to a lack of information.

This approach can also help develop precision medicine-based therapies that use epigenomic profiles from patients to predict disease susceptibility and cellular reprogramming potential for cell therapy or gene therapy. The information can further help advance personalized medicine-based approaches against infectious diseases like COVID-19 and immune therapies against similar epidemic diseases.

In addition to accurate predictions for cell behavior across versatile datasets, future applications can also include comparisons between sensitivities of various assays. The method can be applied to noncoding RNAs to understand their functional role in shaping cell types, gene expression, and cancer. The method can also be used to interpret phenotype from large databases of existing cell type patterns.

About the author

Rwik Sen, Ph.D.

Rwik is from Kolkata in eastern India, a city of history, multiple cultures and food. Kolkata is in the state of West Bengal, which hosts two UNESCO World Heritage Sites and has part of the Himalayan mountains to its north and the Bay of Bengal to its south. Love of the natural world and mystery novels made Rwik passionate about scientific discovery. Hence, after his undergraduate degree in biotechnology, Rwik went to Southern Illinois University for a Ph.D., followed by postdoctoral training at the University of Colorado. Interacting with people, traveling, and promoting STEM outreach and inclusion are some of the things Rwik enjoys, in addition to the ocean and dancing.

Contact Rwik on LinkedIn with any questions
