A Novel Computational Genetics Framework for Identifying New Candidate Circulating Biomarkers of Heart Failure
Palo Alto Senior High School - Palo Alto, Ca
My sincere and heartfelt thanks to Dr. Euan Ashley for giving me the opportunity to perform research at the Falk Cardiovascular Center in Stanford University. I thank him for providing insight and direction to my research interests.
I am grateful to Daryl Waggott, my primary mentor, who has been a constant pillar of support throughout this incredible experience. Daryl was always eager and readily available to provide answers to the numerous questions that I had with a smile. Thank you so much for putting up with me.
Additional thanks to Terra Coakley for her patience and support. I couldn’t have done this without her help and support.
I am honored and immensely thankful that the Ashley Lab accommodated a high school junior amidst their overwhelming busy schedules. I am grateful to have been given the chance to perform research in such world class institution amongst some the most well renowned and pioneering scientists in their fields.
Thanks to my computer science teacher, Mr. Christopher Kuszmaul, for encouraging my passion to take wings. He has always supported my strong belief that computer science has the power to provide simple solutions to complex global problems.
Last but not least, thanks to my wonderful parents and brother for their support and unconditional love.
The purpose of this research is to identify potential new biomarkers for the diagnosis of heart failure. Heart failure, a chronic condition in which the heart is unable to supply sufficient oxygen for the needs of the tissues, is one of the leading causes of mortality in the United States. Finding an informative biomarker is imperative. Though a small number of biomarkers for the diagnosis of heart failure exist, these lack discriminatory power and are not appropriate for all patient populations. Biomarkers are considered superior to other methods of diagnosis because of their objectivity and ability to enable non-invasive diagnosis. This study created a generalizable computational genetics framework that combined publicly available RNA-Seq and microarray datasets pertaining to heart failure. The framework includes pre-processing, statistical analysis steps and features a prioritization algorithm. The R-Programming environment was used. Extensive quality control and validity checks were performed as well as a meta-analysis on the data. The study yielded thirty-eight potential gene candidates, the protein products of which we hypothesize could serve as new biomarkers of heart failure. Of the thirty-eight gene candidates, FHOD3 and DSG2 are most notable due to their secreted nature, known ties to heart conditions and their high expression values in the plasma of heart failure patients. The findings of this study have several implications as they impact the diagnosis of heart failure for patients with genetic susceptibility to the disease. Better biomarkers, especially those that can help predict treatment options, are the future of this field. These candidates lay the foundation for moving this research forward.
Heart Failure (HF) is chronic condition in which the heart is unable to supply sufficient oxygen for the normal functions of body tissues. Its commonality and high mortality rate make it a major health problem not only in the United States, but also worldwide. As of late, an estimated 23 million people worldwide are living with HF (Gaggin & Januzzi, 2013). Finding non-invasive means for the accurate diagnosis, prognosis and guided therapy of HF has become imperative. Biomarkers are non-invasive objective tools that are indicative of a biological state or condition. Ideal biomarkers are safe, easy to measure and are consistent across sex, ethnic groups, and other stratifying factors. In the case of HF, biomarkers have impacted the way that the illness is diagnosed and monitored (Gaggin & Januzzi, 2013). Since natriuretic peptides (NPs) were discovered in 1985, they have been established as the best biomarkers of HF (Loncar et al., 2014). Their prognostic abilities and therapeutic values are well established. Although NPs are considered to be the “gold standard” biomarkers of HF, they face several limitations and setbacks. Factors known to influence the clinical interpretation of NPs include age, obesity, atrial arrhythmias, renal failure, and structural heart disease beyond the clinical diagnosis of HF (Loncar et al., 2014). In 2007, The National Academy of Clinical Biochemistry (NACB) set comparable goals in a consensus document that states that a biomarker in HF ideally enables clinicians to: “identify possible underlying (and potentially reversible) causes of HF; confirm the presence or absence of the HF syndrome; and estimate the severity of HF and the risk of disease progression” (Tang et al., 2007). Remarkably, NPs failed to meet the standards set by the NACB. Additionally, no other available biomarkers for HF have satisfied these guidelines. The lack of a sufficient and capable biomarker for HF, furthers the need for the discovery of new biomarker that have discriminatory powers under a variety of demographic and clinical settings.
We hypothesize that it is possible to identify candidate biomarkers of heart failure by combining multiple datasets in an analytically structured framework.
Materials and Methods
This study combines three microarray datasets of studies pertaining to HF with one RNA-Seq HF dataset through the use of an analytically structured framework. It jointly pre-processes and analyzes the data to produce a list of possible biomarkers for heart failure. A joint analysis of datasets from RNA-Seq and Microarray data has proved to be difficult in the past due to the vastly different technical modalities employed by the platforms. This study was able to overcome the barrier of combining RNA-Seq datasets with Microarray datasets, thus maximizing the potential of all the candidate biomarkers produced. The R-Programming language (R Development Core Team, 2008) and several Bioconductor packages were used to perform all statistical analysis steps in the framework created by the study. The R-Programming environment was chosen due to its ease in enabling statistical computation and graphics. It is useful for a large number of statistical procedures. Figure 1 gives a breakdown of the study design and framework created by the study.
Figure 1: Study Design
Figure 1 gives an apt description of the analytical structured framework created by the study. It outlines the methodology of the study.
Data used in this research was gathered from the National Center for Biotechnology Information (NCBI). Four microarray datasets and one RNA-Seq dataset of studies pertaining to HF were gathered from the site. The datasets were chosen based on large sample sizes and high patient variation (race, gender etc.). Four microarray datasets pertaining to heart failure: GSE42955 (Molina-Navarro et al., 2013 ), GSE57338 (Liu et al., 2015), GSE1145 (“Changes in cardiac transcription profiles brought about by heart failure”) and one dataset, GSE59867 (Maciejak et al., 2015) containing the plasma expression values of heart failure patients were chosen. In all of the datasets above sans, GSE59867, the data were taken from the heart tissue of both non-HF and HF candidates. The data from the GSE59867 dataset were taken from the blood of non-HF and HF patients. One RNA-Seq dataset, GSE55296 (Tarazón et al., 2014) which used the RNA-Seq chip, AB 5500xl Genetic Analyzer (Homo sapiens), was used. The data for this dataset were taken from the heart tissue of HF and non-HF patients. Data for the microarray datasets used were gathered using various microarray chips. The data for the GSE42955 and GSE59867 was taken using the Affymetrix Human Gene 1.0 ST Array chip. The data for the GSE1145 dataset was gathered using the Affymetrix Human Genome U133 Plus 2.0 Array and Affymetrix Human Genome U95 Version 2 Array chips. The GSE57338 dataset was configured using the Affymetrix Human Gene 1.1 ST Array chip. More information about the datasets can be found in Table 1.
Table 1: Description of cohort
*Only 13 samples were used from this dataset. All 13 of the samples were heart failure patients. 13 out of the 79 patients were used for analysis due to heterogeneity in the clinical definition of heart failure
Table 1 describes the cohort used in the study. The table contains five columns that describe each of the used datasets. The columns are as follows: study name, GEO ID, sample size, chip info, platform, and dataset info (number of controls and patients).
The pre-processing steps of the study included normalization, batch correction and merging. All analysis was performed using the R-Programming environment. This step found a novel way to jointly analyze and pre-process microarray and RNA-Seq datasets which have vastly different technical modalities. There were compatibility issues based on the differing natures of the platforms that generated the data. Some data were generated using various microarray chips and others using RNA-Seq chips. The datasets were preprocessed jointly to avoid systematic artifacts that could influence differential analysis. This was the most challenging and vital portion of the analytical framework created by this study. Additionally, joint pre-processing was performed to increase sample size and subsequent power of the analysis. Each dataset was normalized using the Universal exPression Code (UPC) method. The UPC normalization method was chosen because of it’s ability to allow for cross platform analysis after independent normalization (joint analysis of RNA-Seq and Microarray data). A description of the UPC approach is as follows, “[The] Universal exPression Code (UPC) approach, which corrects for platform-specific background noise using models that account for the genomic base composition and length of target regions; this approach also uses a mixture model to estimate whether a gene is active in a particular profiling sample” (Piccolo et al, 2013). The UPC expression values differ from other normalization methods in that the expression values range from 0 to 1 (not expressed to expressed). Before normalizing the RNA-Seq datasets, the raw-data files were converted to read-count values using the Rsubread package (Liao et al., 2013). After normalization the four datasets pertaining to HF, including three microarray and one RNA-Seq (GSE42955, GSE57338, GSE1145, GSE55296), were then merged by the InSilicoDB merge method (Taminau et al., 2012). The merge method combines expression sets from different datasets while applying techniques to remove biases produced by merging the datasets. Batch effects were removed using the ComBat method (Johnson et al., 2007). This yielded a single expression dataset that contained data from both the merge and batch correction steps. The GSE59867 dataset was processed independently from the others because it was not part of the joint analysis. It represents a different tissue (plasma) and was used for verification of results. To confirm the pre-processing steps, a meta-analysis was performed using the Meta package (“CRAN meta”). Forest plots were created to ensure that the known biomarkers for heart failure were up-regulated. Figure 1 is one of these plots.
Figure 2: Forest Plot of NPPA gene
Figure 2 is a forest plot of a commonly known biomarker of heart failure: Natriuretic Peptide A (NPPA). This forest plot is a product of the meta-analysis that was performed to check the validity of the normalization methods of the study.
Statistical Analysis: Differential Expression Analysis
The Limma package (Ritchie et al., 2015) was used to perform a differential analysis of the combined HF versus normal heart tissue dataset. The differential analysis was performed to determine which genes were up-regulated or down-regulated in patients with HF and non-HF in the merged expression set. Up-regulated indicates that the gene in question was more highly expressed relative to a reference point, whereas down-regulated indicates the opposite phenomenon. Limma’s LmFit method was used for the primary analysis. The topTable method was used to generate the log2 fold-changes, p-values, q-values as well as other values critical for further analysis. Figure 3 is a volcano plot showing the relationship between the log2 fold-change values and p-values generated by the the statistical analysis step.
Figure 3: Volcano plot of Log2 Fold Change vs P-Value
Figure 3 is a volcano plot of the Log2 Fold-Change values vs P-Values produced by the differential analysis of the combined dataset with HF vs non-HF heart tissue values. The region in black indicates all genes which have LogFC values greater than 1.
After the pre-processing and statistical analysis was completed, the data in the merged expression set was filtered to produce the relevant biomarkers for heart failure. The reasoning for the filters is as follows. Firstly, a filter was put in place to ensure that the biomarkers found were specific only to the heart under pathological conditions. Secondly, the biomarkers were selected based on evidence of being secreted and extracellular. Thirdly, the biomarkers were sorted to only include genes that were not in the blood under normal conditions. This being said, the filtering step included four main filters. The first filter, log2 fold-changes > 1 and Q.value < .1, ensured that the filtered genes were expressed highly in the heart tissue of HF patients while allowing for a type 1 error of 10%. For the second filter, a list of secreted proteins from the Human Protein Atlas Site (“The Human Protein Atlas”) was gathered, and the genes yielded from the first filter were filtered to only include secreted proteins. The third filter used was a list of genes from the GTEx database (“GTEx Portal”) that were classified to be expressed high in healthy cardiac tissue (Heart tissue Reads Per Kilobase per Million mapped reads (RPKM) > 1). The final filter used a list of genes from the GTEx database that were considered to be expressed low in the blood under normal conditions (Blood RPKM < .1). Table 2 gives a detailed breakdown of the number of genes that passed each filter. The four filters yielded 38 genes, the protein products of which we hypothesize could be potential biomarkers for HF. Table 3 gives a detailed description of each of the 10 of the top gene candidates based on heart RPKM values.
Table 2: Breakdown of number of genes that passed each filter. Table 2 gives a detailed account of the number of genes that passed each of the four filters (LogFC and Q-value, secreted protein, heart RPKM, and blood RPKM). The table includes two rows: independent and serial. The values in the independent row reflect the number of the genes that passed each filter separately. The consecutive filter reflects the number of genes that passed the filter given the genes that passed the previous filter.
Table 3:The 10 top genes (based on heart RPKM value) and known biomarkers for HF. Table 3 gives a detailed description of the top 10 gene candidates based on heart RPKM values and a description of known biomarkers of HF. For each gene, plasma and heart log2 fold-change values, p-values, blood RPKM values and heart RPKM values are shown. Each of the values shown are pertinent to the analysis and the filtering that was performed in the study.
Results of the Prioritization Algorithm
Filter 1: LogFC > 1 and Q.val < .1
The first filter of the study, LogFC > 1 and Q.val <.1, yielded 3870 genes. The LogFC > 1 and Q.val <.1 parameters were chosen based on their indication of high expression values of a gene. The parameters indicated that the genes that passed were up-regulated in heart tissue while allowing for a type 1 error of 10%.
Filter 2: Secreted – Human Protein Atlas Site
The second filter checked for biomarkers that were secreted. A list of secreted proteins generated from the Human Protein Atlas Site (“The Human Protein Atlas”) was used as the filter. Having such a filter would increase the likelihood that the biomarker candidates of HF would be secreted into the bloodstream, thus making them easier to measure. Notably, genes that were considered to be secreted could be found in the extracellular matrix, receptors on the plasma membrane or be exocrine in nature. Out of the 3870 genes that passed filter, 535 genes passed filter 2.
Filter 3: High in Heart – GTEx
The third filter used genes with high heart expression values (heart RPKM values) from the GTEx database that were classified as high in the heart (heart RPKM > 1). The genes that passed filter 2, were filtered against the list of high heart genes and yielded 524 genes. This filter was considered to be sensitive (less stringent).
Filter 4: Low in Blood – GTEx
The fourth and final filter aimed to ensure that final candidates had low expression values in the blood under normal conditions. Thus, a list of genes was generated from the GTEx database containing genes with blood expression RPKM values < .1. The genes that passed filter 3 were filtered against the list produced by the GTEx database and yielded 38 genes. This filter was aimed to be specific (more stringent).
Validation with GSE59867 dataset
In order to verify the results and methodology of the study, we checked to see if the proposed biomarker candidates for HF were in fact expressed highly (logFC > 1) in the blood of HF patients relative to healthy controls. In order to do this, the GSE59867 dataset was used. This dataset contained the plasma expression values of both HF and non-HF patients. Out of the 38 candidate genes, 30 were expressed highly in the heart. This high success rate indicates that it is more likely that the candidate genes are accurate circulating biomarkers for HF. Table 3 shows the plasma log2 fold-change values of the gene candidates.
True Positive – Known Biomarkers
In order to asses the validity of the methodology and results of the study, a check for true positives for known biomarkers was performed. The check aimed to validate the normalization method. We classified the HF biomarkers as “true positives” if they passed the first filter. Of the known biomarkers, the natriuretic biomarkers NPPB and NPPA, in addition to TNNI3 and MYH6, all passed the first filter. Table 3 shows the LogFC values and gives a description of the key values of known HF biomarkers.
Top Biomarkers Description
Of the candidate genes of HF proposed by this study, FHOD3, DSG2 and PLA2G5 show the most potential. The following paragraphs gives a brief description of each of the four genes.
Formin Homology 2 Domain Containing 3 (FHOD3) is a protein that plays a role in the regulation of the actin cytoskeleton (“FHOD3 Gene(Protein Coding)”). This gene has been known to be associated with hypertrophic cardiomyopathy (“FHOD3 formin homology 2 domain containing 3”). In this study its log2 fold-change values in the plasma dataset was 1.394, thus indicating that it is in fact highly expressed in the plasma of HF candidates. This makes it a lucrative option as a potential new biomarker of HF.
Desmoglein 2 (DSG2) is a type of desmosome: a cell-cell junction between epithelial, myocardial, and certain other cell types. It has been known to be associated with cardiomyopathy, dilated, 1bb and arrhythmogenic right ventricular dysplasia 10 (“DSG2 Gene - GeneCards”). In this study DSG2 was highly expressed in the plasma of HF patients (LogFC of 1.403). It had a relatively high LogFC value (1.453) in the heart tissue of HF patients. All factors combined, DSG2 has potential to be a biomarker of heart failure.
Phospholipase A2, Group V (PLA2G5) is a protein coding gene. It is associated with the secretory phospholipase A2 family. It is located on the secretory phospholipase A2 genes on chromosome 1 (“PLA2G5 Gene”). Though it has not been linked to HF in the past, this study shows that PLA2G5 had high expression values in the plasma dataset (1.943) while also having a high heart RPKM value (1.746). Thus, it is also a strong candidate for a biomarker of HF.
The filters used in this study were meant to increase the chances of finding a circulating biomarker that high expression in the heart and blood when HF was present and low expression in the blood in patients without the condition. In order to achieve this, the thresholds of the filters were chosen accordingly. Interestingly, choosing the thresholds of the filters can be done in two ways. The first is to adjust the thresholds such that the known biomarkers of HF pass through a majority of the chosen filters. The second is to adjust the thresholds by experimentally validating the results. The latter was used in this study.
Generalizability of Framework
One of the main achievements of the analytical framework that this study produced is its generalizability. The framework created can be applied to several other health conditions ranging from liver failure, kidney failure as well as various cancers. The generalizability of the frameworks furthers this study’s importance in the field of computational genetics.
The issue of data integration within genomics and big data sciences is a major challenge. This study created a framework that featured both the horizontal and vertical integration of heterogeneous data sources. Horizontal refers to the multiple technical RNA measurement technologies (relative expression measured by hybridization and fluorescent intensity versus direct measurement through sequencing RNA molecules). In other words, jointly analyzing RNA-Seq platform and Microarray platform datasets. Each of these platforms has dramatically different error profiles, distributions and technical modalities. This study was able to overcome the extremely difficult technological challenge of combining and jointly analyzing RNA-Seq and Microarray datasets. Extensive effort and research was put into place to find a way to combine these technical modalities that are so vastly different. The methodology of combining the two technologies has vast implications on the landscape of bioinformatics. Using the framework described in the study, other bioinformaticians will be able to combine RNA-Seq and Microarray datasets in order to create more impactful research. The vertical integration we undertook was also novel and innovative due to integration of multiple external large datasets: GTEx Database - thousands of tissue samples and the Human Protein Atlas Site database.
One question pertaining to the methodology of this research is why the study mainly focused on the tissue, as oppose to the blood, of HF patients. There are two main reasons why the study made the conscious decision to focus a majority of the research in the tissue. Firstly, the availability of data. Much of the publicly available data for HF pertains to primary tissue (same for other conditions). There is very little publicly available data that contain the plasma values of pathological conditions. Thus, if the study was to rely on plasma datasets alone, the sample size of the study would be relatively small. This in turn would lead to an analysis with less statistical power. Secondly, the use of primary tissue data gives the study a strong mechanistic argument for the origin of the biomarker coming from the diseased tissue. Additionally, it increases the chances that the biomarkers are directly related to HF as oppose to a secondary event occurring elsewhere in the body. It is commonly accepted that the secondary events are more heterogeneous in targeting patient response and are also less reliable.
Moving forward, this study aims to add more public and internal data to the study. Additionally, we will be performing a technical verification in primary patients with and without heart failure. We will test for the presence of the proposed proteins using low throughput antibody staining (western blot and ImmunoHistoChemistry) using a combination of banked and prospectively collected blood from patients with heart failure at defined stages. The defined stages are as follows: early HF, late HF, post transplant and post VAD’s. Thus we will evaluate the candidate markers for different traits: early diagnosis, prognosis of outcome and prediction of treatment strategies. Patient recruitment for this next step is currently in progress. Lastly, future research can be conducted to perform a primary discovery on blood using high throughput proteomics
The implications of this study are far reaching. Firstly, accuracy. As an example, some biomarkers are very sensitive for screening and some are very specific for ruling out false positives. Secondly, the most significant breakthrough is in precision medicine which is uplifting and revolutionizing the future of medical treatments. Precision medicine believes that different biomarkers can be significant to different demographics and ethnicities. This is the new frontier of medicine. More biomarkers, especially those that can help predict treatment options, are the future. Precision drug delivery and treatment options will be the most important and far reaching implications of this research. The candidates proposed by the study have the potential to greatly reduce treatment costs and also save time by avoiding unneeded tests. These candidates lay the foundation for moving this research forward.
Changes in cardiac transcription profiles brought about by heart failure. (n.d.). Retrieved from
CRAN - Package meta. (n.d.). Retrieved from
DSG2 Gene - GeneCards | DSG2 Protein | DSG2 Antibody. (n.d.). Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=DSG2
FHOD3 formin homology 2 domain containing 3 [Homo sapiens (human)] - Gene - NCBI. (n.d.). Retrieved from http://www.ncbi.nlm.nih.gov/gene/80206
FHOD3 gene(Protein Coding). (n.d.). Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=FHOD3
Gaggin, H. K., & Januzzi, J. L. (2013). Biomarkers and diagnostics in heart failure. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1832(12), 2442-2450. doi:10.1016/j.bbadis.2012.12.014
GTEx Portal. (n.d.). Retrieved from http://www.gtexportal.org/home/
Johnson, WE, Rabinovic, A, and Li, C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics 8(1):118-127.
Liao Y, Smyth GK and Shi W (2013). “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” Nucleic Acids Research, 41, pp. e108.
Liu, Y., Morley, M., Brandimarto, J., Hannenhalli, S., Hu, Y., Ashley, E. A., … Li, M. (2015). RNA-Seq identifies novel myocardial gene expression signatures of heart failure. Genomics, 105(2), 83-89. doi:10.1016/j.ygeno.2014.12.002
Loncar, G., Omersa, D., Cvetinovic, N., Arandjelovic, A., & Lainscak, M. (2014). Emerging
Biomarkers in Heart Failure and Cardiac Cachexia. IJMS, 15(12), 23878-23896. doi:10.3390/ijms151223878
Maciejak, A., Kiliszek, M., Michalak, M., Tulacz, D., Opolski, G., Matlak, K., … Burzynska, B. (2015). Gene expression profiling reveals potential prognostic biomarkers associated with the progression of heart failure. Genome Medicine, 7(1). doi:10.1186/s13073-015-0149-z
Molina-Navarro, M. M., Roselló-Lletí, E., Ortega, A., Tarazón, E., Otero, M., Martínez-Dolz, L., … Rivera, M. (2013). Differential Gene Expression of Cardiac Ion Channels in Human Dilated Cardiomyopathy. PLoS ONE, 8(12), e79792. doi:10.1371/journal.pone.0079792
PLA2G5 Gene. (n.d.). Retrieved from http://www.genecards.org/cgi-bin/carddisp.pl?gene=PLA2G5&keywords=PLA2G5
Piccolo, S. R., Withers, M. R., Francis, O. E., Bild, A. H., & Johnson, W. E. (2013). Multiplatform single-sample estimates of transcriptional activation. Proceedings of the National Academy of Sciences, 110(44), 17778-17783. doi:10.1073/pnas.1305823110
R Development Core Team (2008). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W and Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), pp. e47.
Taminau J, Meganck S, Lazar C, Steenhoff D, Coletta A, Molter C, Duque R, Schaetzen Vd, Solis DYW, Bersini H and Nowe A (2012). “ Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages.” BMC Bioinformatics, 13, pp. 355.
Tarazón, E., Roselló-Lletí, E., Rivera, M., Ortega, A., Molina-Navarro, M. M., Triviño, J. C., …
Portolés, M. (2014). RNA Sequencing Analysis and Atrial Natriuretic Peptide Production in Patients with Dilated and Ischemic Cardiomyopathy. PLoS ONE, 9(3), e90157. doi:10.1371/journal.pone.0090157
The Human Protein Atlas. (n.d.). Retrieved from http://www.proteinatlas.org
Wilson Tang, W., Francis, G. S., Morrow, D. A., Newby, L. K., Cannon, C. P., Jesse, R. L., … Wu, A. H. (2007). National Academy of Clinical Biochemistry Laboratory Medicine Practice Guidelines: Clinical Utilization of Cardiac Biomarker Testing in Heart Failure. Circulation, 116(5), e99-e109. doi:10.1161/circulationaha.107.185267