Two-Way Gene Interaction From Microarray Data Based on Correlation Methods
Background: High-throughput techniques for monitoring various aspects of gene activity have developed explosively in recent years. Networks offer a natural way to model interactions between genes, and extracting gene network information from high-throughput genomic data is an important and difficult task.
Objectives: The purpose of this study is to construct a two-way gene network based on parametric and nonparametric correlation coefficients. The first step in constructing a Gene Co-expression Network is to score all pairs of gene vectors. The second step is to select a score threshold and connect all gene pairs whose scores exceed this value.
Materials and Methods: In the foundation-application study, we constructed two-way gene networks using nonparametric methods, namely Spearman’s rank correlation coefficient and Blomqvist’s measure, and compared them with Pearson’s correlation coefficient. We surveyed six genes related to venous thrombosis, built a matrix in which each entry is the score for the corresponding gene pair, and obtained two-way interactions using Pearson’s correlation, Spearman’s rank correlation, and Blomqvist’s coefficient. Finally, the results were compared with two visualization methods: Cytoscape, based on BIND, and Gene Ontology (GO), based on molecular function. R software version 3.2 and Bioconductor were used for the analyses.
Results: Based on the Pearson and Spearman correlations, the results were the same and were confirmed by Cytoscape and GO visual methods; however, Blomqvist’s coefficient was not confirmed by visual methods.
Conclusions: Some results of the correlation coefficients do not agree with the visualization methods. This may be due to the small amount of data.
Keywords: Gene Expression; Gene Regulatory Networks; Gene Ontology; Molecular Structure; Nonparametric
In recent years, there has been a great explosion in the development of high-throughput techniques for globally monitoring various aspects of gene activity (1). High-throughput genomic data are a rich resource for explaining how genes are connected (2-5). Until recently, the properties, activities, and roles of genes and proteins, the molecular processes within cells, and the tissue-level and molecular aspects of disease were studied one or a few genes or proteins at a time. Microarray technology has emerged as one way of simultaneously measuring the expression levels of thousands of genes, and the general approaches to analyzing such data are gene set and cluster analyses (1). Likewise, several tools have been developed for the visualization and analysis of biological networks, such as Cytoscape (5), VisANT (6), and tYNA (7). Clustering is the classification of a heterogeneous population into a number of homogeneous subsets, which are referred to as clusters. The method seeks groups that differ significantly from each other while the members within each group are highly similar (8). Gene set enrichment analysis (GSEA) is designed to find differences in gene expression between phenotypes by combining biological knowledge with statistical analysis (9). Clustering techniques cannot recognize molecular networks, nor can they show direct or indirect connections between the genes inside a cluster. Furthermore, clustering methods assign each gene to a single cluster, whereas a gene such as the tumor protein p53 can participate in several physiological pathways. Thus, methods that represent gene interactions directly, based on different algorithms, are needed (8). Information about interactions improves our understanding of disease and could provide a basis for new treatment methods (10, 11).
Several approaches to gene network construction, such as Boolean networks (12, 13), mutual information, and Bayesian networks (14), have been used to discover more complex interactions and to detect interaction networks within gene expression data. One disadvantage of these methods is that they require large samples of expression data. Networks offer a natural way to model interactions between genes, with nodes representing genes and edges representing various interactions inferred from different data sources (15).
This study’s purpose is to construct gene networks using nonparametric Spearman’s correlation and Blomqvist’s coefficient and then to compare them with Pearson’s correlation. The first step in creating a gene co-expression network (GCN) is to score all pairs of gene vectors. The second step is to select a score threshold and connect all gene pairs whose scores exceed this value. Finally, the results were compared with Cytoscape, based on BIND, and GO visualization, based on molecular function methods.
3. Materials and Methods
Venous thrombosis is a blood clot that forms in a vein, and it is a common cause of morbidity and mortality. A classical form is deep vein thrombosis (DVT), in which the clot can break off and cause a life-threatening pulmonary embolism (16). The venous thromboembolism (VTE) microarray dataset includes 70 adults with one or more prior VTE on warfarin and 63 healthy controls (17). Blood was collected in PAXgene tubes, RNA was isolated, and gene expression profiles were obtained using the Affymetrix human genome U133A 2.0 array. From the data, a set of six genes, namely CYP2A6, NAT2, CYP1A2, CYP2A13, XDH, and NAT1, was selected. The KEGG pathway was used for performing gene set analysis (18). In the foundation-application study, we used correlation algorithms to construct gene networks. The first stage in constructing a GCN is to score all pairs of gene vectors using correlation coefficients. The second stage is to select a score threshold and to connect all gene pairs whose scores exceed this value. We focus on undirected networks, which indicate pairwise co-expression relationships without necessarily representing causality. There are several methods to survey the expression profiles of gene pairs.
The Pearson’s correlation coefficient, r, is a measure of the degree of linear relationship between two gene vectors, X and Y, and it is calculated as (19):
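The displayed equation does not survive in this text. For reference, the standard sample form of Pearson's coefficient for two gene vectors X = (X1, ..., XN) and Y = (Y1, ..., YN) is given here; the symbols for the sample means and the index N are notation assumed for this sketch, chosen to match the vectors used in the Spearman definition:

```latex
\[
r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^{2}}\,
          \sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^{2}}},
\qquad
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i,
\quad
\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i .
\]
```

The coefficient lies between -1 and 1, with values near either extreme indicating a strong linear relationship between the two expression profiles.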
Spearman’s rank correlation is like Pearson’s correlation coefficient except that it acts on the ranks of the data rather than on the raw data (20). Spearman’s rank correlation coefficient, rs, between two gene vectors, X = (X1, ..., XN) and Y = (Y1, ..., YN), with the respective ranks (R1, ..., RN) and (S1, ..., SN), is calculated as:
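The equation itself is not reproduced in this text. One common way to compute rs, assumed here rather than taken from the paper, is to apply Pearson's formula to the ranks, giving tied values the average of their positions; without ties this equals 1 - 6 Σ(Ri - Si)² / (N(N² - 1)). The function names below are illustrative and use only the Python standard library; the study itself reports using R.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rs computed as Pearson's r on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering matters, a monotone but nonlinear relationship (for example y = x² on positive x) yields rs = 1 while Pearson's r stays below 1.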
Blomqvist’s coefficient is a nonparametric measure of dependence between two random variables. It is based on how often paired observations fall on the same side of their respective medians. Let (x1, y1), ..., (xn, yn) denote a sample from a continuous bivariate population, and let x̃ and ỹ denote the sample medians. The (x, y)-plane is divided into four quadrants by the lines x = x̃ and y = ỹ. Then Blomqvist’s B is defined as (21, 22):
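The defining equation is missing from this text. In its usual form, Blomqvist's coefficient is (n1 - n2) / (n1 + n2), where n1 is the number of sample points in the first or third quadrant and n2 the number in the second or fourth. The sketch below follows that form; skipping points that lie exactly on a median line is a simplifying assumption of this example, and the function name is illustrative.

```python
from statistics import median


def blomqvist_beta(x, y):
    """Blomqvist's medial correlation for paired samples x and y.

    n1 counts points in the first or third quadrant around the medians
    (both deviations share a sign); n2 counts points in the second or
    fourth quadrant. Points on a median line are skipped (an assumption
    of this sketch, not a convention taken from the paper).
    """
    mx, my = median(x), median(y)
    n1 = n2 = 0
    for a, b in zip(x, y):
        product = (a - mx) * (b - my)
        if product > 0:
            n1 += 1
        elif product < 0:
            n2 += 1
    if n1 + n2 == 0:
        raise ValueError("all points lie on a median line")
    return (n1 - n2) / (n1 + n2)
```

Since each point contributes only the quadrant it falls in, the coefficient discards most of the magnitude information that Pearson's and Spearman's coefficients use, which is one possible reason it can disagree with them on small samples.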
The next step is to choose a score threshold and to create a GCN linking all gene pairs with scores exceeding this threshold. Let Z1, Z2, ..., Zp be p genes. For each pair (Zi, Zj), i ≠ j, i, j = 1, 2, ..., p, the P value associated with the index K ∈ {r, rs, B} and with each fixed pair (Z*a, Z*b) is defined by
where I(A) denotes the indicator function of the set A. The P value was computed for all gene pairs under Pearson’s correlation, Spearman’s rank correlation, and Blomqvist’s coefficient. If the P value is greater than 0.95, the two genes are linked in the network.
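The two-stage construction (score every pair, then link pairs above a threshold) can be sketched as follows. Because the P value formula is not reproduced in this text, the permutation-based indicator average used here is only an assumed stand-in, not the authors' definition; `permutation_p_value`, `build_network`, the number of permutations, and the seed are all hypothetical choices for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson's r; any pairwise score function could be swapped in."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, score, n_perm=1000, seed=0):
    """Fraction of shuffled scores whose magnitude is below the observed one.

    ASSUMPTION: an average of indicator functions over random permutations
    of y. The paper's own formula is not shown, so this is illustrative.
    """
    rng = random.Random(seed)
    observed = abs(score(x, y))
    shuffled = list(y)
    below = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(score(x, shuffled)) < observed:
            below += 1
    return below / n_perm


def build_network(expression, score, threshold=0.95, n_perm=1000):
    """Return (gene, gene, P) edges for pairs whose P value exceeds threshold.

    `expression` maps a gene name to its vector of expression values.
    """
    genes = sorted(expression)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            p = permutation_p_value(expression[g], expression[h], score, n_perm)
            if p > threshold:
                edges.append((g, h, p))
    return edges
```

With six genes there are 15 pairs to score, and each resulting edge is undirected: it records co-expression, not causation.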
Cytoscape is an open source software platform for visualizing gene interaction networks and for combining these interactions with gene expression and functional genomics data. At its core is a network graph, with genes displayed as nodes and interactions between genes displayed as edges. The Cytoscape program is written in Java and has been released under an LGPL Open Source license; graph structures and some layout algorithms (hierarchical and circular) are implemented using the yFiles Graph Library (23).
The GO ontology is structured as a directed acyclic graph, and each term has defined relationships to one or more other terms in the same domain, and sometimes to other domains. GO is developed as a collaborative project and includes three structured controlled vocabularies (ontologies) that describe gene products in terms of their cellular components, biological processes, and molecular functions in a species-independent way (24).
In the study, the VTE dataset is a microarray dataset including 70 adults with one or more prior VTE on warfarin and 63 healthy controls. The descriptive statistics of variables are given in Table 1.
Table 1. Descriptive Statistics for Six Genes in the VTE Dataset
The scatter plots in Figure 1 show the positive or negative linear relationships between pairs of genes. A Kolmogorov-Smirnov test was also conducted to examine the normality of the gene expression values. None of the tests was significant (P > 0.05), so normality was not rejected for any of the genes.
Figure 1. Scatter Plots for Pairs of Genes in the VTE Dataset
The dataset’s Pearson’s correlation, Spearman’s rank, Blomqvist’s coefficient, and calculated P value are presented in Table 2. We compared these relationships with GO and Cytoscape.
Table 2. Spearman’s Rank Coefficient, Pearson’s Correlation, Blomqvist’s Coefficient, and Associated P Value for 15 Pairs of the Genes CYP2A6, NAT2, CYP1A2, CYP2A13, XDH, and NAT1
As shown, XDH-CYP2A6 has a strong relationship under the Pearson and Spearman correlations; NAT2-CYP1A2 has a relationship under Blomqvist’s coefficient; and the other gene pairs have weak relationships at the 0.05 significance level. The Cytoscape visualization, based on BIND, is drawn for the genes in Figure 2.
Figure 2. Cytoscape Visualization Method, Based on BIND, for Six Genes
As shown, the XDH gene has a relationship with CYP1A2, and this gene is related to CYP2A6 through OXY. Pearson and Spearman correlations confirmed the relationship; however, Blomqvist’s beta does not show the relation.
Table 3 shows the GO method, based on molecular function, for six genes. The XDH gene has a relationship with CYP2A6, which confirms our algorithm. There are other relations in the GO method, as well.
Table 3. GO Method Based on Molecular Function
The results showed that the Pearson and Spearman correlation coefficients agreed with the visualization methods better than Blomqvist’s coefficient did. This may be due to the small amount of data.
In 2006, Kim et al. presented a new distance measure that accounts for both linear trends and fold-changes of expression in mouse (25). They compared the performance of different distance measures on seven experiments comprising 288 mouse oligonucleotide microarrays. They showed that the proposed distance measure for comparing expression profiles recognizes genes that share a number of common regulatory components, since it captures the underlying regulatory information better than previous distance measures. In the present study, we surveyed three correlation coefficients and two visualization methods that confirmed the relations. Although Blomqvist’s coefficient did not give similar results, the findings of the Pearson and Spearman correlation coefficients were the same.
In 2002, Kuo et al. compared matched mRNA measurements and calculated concordance between clusters from two DNA microarray technologies, Stanford type cDNA microarrays and Affymetrix oligonucleotide microarrays (26). They compared the Pearson correlation and Spearman’s rank correlation coefficient for genes, for cell lines, and across all 162,120 matched pairs of measurements. They assumed that the data were normally distributed and used Student’s t-distribution. Hierarchical clustering was done in Matlab using Euclidean distance as the measure of similarity and average linkage. There were poor correlations between the two platforms. In the present study, we applied several correlation coefficients using R-3.1.1 and visualization using the Cytoscape and GO methods. The Pearson and Spearman correlation coefficients showed the same results.
In 2000, Butte and Kohane used three methods not categories to cluster RNA expression data (27). The simple criterion for clusters was based on a fold-difference greater than a given threshold. They applied the Euclidean method for connecting all genes computing the extensive pair-wise mutual information, removed the connections under the threshold, isolated clusters of genes or related networks, and then detected related clusters biologically. Each gene was thus completely connected to every other gene with the calculated mutual information. In our study, we displayed the relationship between genes by correlation coefficients and compared them with visualization methods. Using Pearson and Spearman correlation coefficients, the results were the same.
In 2012, Bergen et al. reported that CYP2A6 is the metabolic enzyme principally involved in nicotine and cotinine metabolism (28). Other variables in the study were age, gender, BMI, smoking status, and hormonal status. They fitted a hierarchical linear model for DMET SNPs and adjusted NMR, and then adjusted for related tests (PACT) within genes with > 1 common SNP and with ≥ 1 SNP with nominal P < 0.05. They identified SNPs at 13 genes with PACT < 0.05 in ≥ 2 transmission models in a large twin dataset. Their article investigated the importance of CYP2A6 in tobacco smoking, whereas we considered CYP2A6 and five other genes in order to draw a gene network based on correlation methods and to compare it with Cytoscape and GO in venous thromboembolism. Comparing the two studies, we concluded that CYP2A6 and the five other genes play a role in venous thromboembolism, while Bergen et al. inferred that CYP2A6 is the predominant metabolic enzyme involved in nicotine and cotinine metabolism.
In 2012, Neal et al. surveyed the cytochrome P450 (CYP) family of 60 genes, which is involved in the metabolism and synthesis of various chemicals and cellular lipid molecules, including vitamin D (29). In genotyped NHANES III participants, they examined genetic variation in CYP genes (33 SNPs in 9 genes) and vitamin D receptor genes (2 SNPs), together with additional variables connected to sufficiency in previous studies, such as body mass index (BMI), season of sample collection (SSC), sex, supplementation habit, income, and age, for associations with vitamin D sufficiency. They applied chi-square tests and multiple logistic regression to assess these associations. There were important relationships between vitamin D sufficiency and SSC, BMI, sex, and age across RE level. Several CYP SNPs were associated with vitamin D sufficiency in general models. CYP2A6 (rs1801272) was meaningfully related to vitamin D sufficiency in several groups in adjusted and crude models. Their article is the first report of CYP2A6’s connection with vitamin D sufficiency, and the connection is biologically plausible because of the enzyme’s wide range of potential metabolic targets. Their article surveyed the relation between a gene and vitamin D. Our study surveyed relationships between CYP2A6 and five other genes in venous thromboembolism, drew a gene network based on correlation methods, and compared it with Cytoscape and GO. Comparing the two studies, we concluded that CYP2A6 can cause vitamin D deficiency and skeletal, cardiovascular, autoimmune, and metabolic disease, as well as venous thromboembolism.
Garcia-Closas et al. surveyed the association of NAT2 slow acetylation and the GSTM1 null genotype with the risk of bladder cancer (30). They studied polymorphisms in NAT2, GSTM1, NAT1, GSTT1, GSTM3, and GSTP1 in 1,150 patients with transitional cell carcinoma of the urinary bladder and 1,149 controls in Spain. They also performed meta-analyses of GSTM1, NAT2, and bladder cancer that included more than twice as many cases as previous reports. For bladder cancer, they compared the odds ratios for persons lacking one or two copies of the GSTM1 gene and for NAT2 slow acetylators with those for NAT2 rapid or intermediate acetylators. NAT2 slow acetylators had an increased overall risk of bladder cancer that was stronger in cigarette smokers than in nonsmokers. They concluded that the GSTM1 null genotype increases the risk of bladder cancer, and the NAT2 slow acetylator genotype enhances the risk among cigarette smokers. In the current study, we investigated NAT2 and five other genes in venous thromboembolism. We also drew a gene network with a correlation-based algorithm and compared it with the Cytoscape and GO visualization methods. Comparing the two studies, we concluded that NAT2 can cause bladder cancer and venous thromboembolism.
In 2000, Bartsch et al. studied several genes, such as CYP1A1, 1A2, 1B1, 2A6, 2D6, 2E1, 2C9, 2C19, 17, and 19, alone or in combination with detoxifying enzymes, as risk modifiers for tobacco-related cancers (31). They described the important reactions by which aromatic amines are metabolized and form DNA adducts in the bladder epithelium, including N-hydroxylation (CYP1A2) and N-acetylation (NAT1 and NAT2). These aromatic amines are main components of smoke and seem to be an important cause of urinary bladder cancer in smokers. They also described how deletion of the CYP2A6 region leads to an inactive enzyme or a lack of protein synthesis, how differences in the polyadenylation signal of NAT1 affect transcript half-life and the quantity of the enzyme, and how the CYP1A2 gene interacts with its enzyme catalysis products. In this study, we surveyed NAT1, CYP2A6, CYP1A2, and three other genes in venous thromboembolism. We also constructed a gene network using a correlation-based algorithm and compared it with the GO and Cytoscape methods. Comparing the two studies, it was concluded that NAT2, CYP2A6, and CYP1A2 play a role in tobacco-related cancers and venous thromboembolism.
We would like to express our sincere thanks to the referees for carefully reading our manuscript and for giving such constructive comments, which substantially helped in improving the paper’s quality.
- 1. Dehmer M, Emmert-Streib F, Graber A, Salvador A. Applied statistics for network biology: methods in systems biology. John Wiley and Sons; 2011. 478 pp.
- 2. Lander ES. Array of hope. Nat Genet. 1999;21(1 Suppl):3-4.
- 3. Quackenbush J. Genomics. Microarrays--guilt by association. Science. 2003;302(5643):240-1.
- 4. Zhang MQ. Extracting functional information from microarrays: a challenge for functional genomics. Proc Natl Acad Sci U S A. 2002;99(20):12509-11.
- 5. Killcoyne S, Carter GW, Smith J, Boyle J. Cytoscape: a community-based framework for network modeling. Methods Mol Biol. 2009;563:219-39.
- 6. Hu Z, Snitkin ES, DeLisi C. VisANT: an integrative framework for networks in systems biology. Brief Bioinform. 2008;9(4):317-25.
- 7. Yip KY, Yu H, Kim PM, Schultz M, Gerstein M. The tYNA platform for comparative interactomics: a web tool for managing, comparing and mining multiple networks. Bioinformatics. 2006;22(23):2968-70.
- 8. Wu X, Ye Y, Subramanian KR, Zhang L. Interactive Analysis of Gene Interactions Using Graphical gaussian model. Biol Knowl Discov Data Min. 2003;3:63-9.
- 9. Lists of Software for Bioinformatics: Pathway Analysis Tool. Available from: http://bioinformatics.ai.sri.c.../
- 10. Mendes P, editor(s). Advanced visualization of metabolic pathways in PathDB. Proceedings of the 8th Conference on Plant and Animal Genome; 2000; San Diego.
- 11. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiol Rev. 1995;59(1):94-123.
- 12. Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput. 1998;18-29.
- 13. Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002;18(2):261-74.
- 14. Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze expression data. J Comput Biol. 2000;7(3-4):601-20.
- 15. Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5(2):101-13.
- 16. Saha P, Humphries J, Modarai B, Mattock K, Waltham M, Evans CE, et al. Leukocytes and the natural history of deep vein thrombosis: current concepts and future directions. Arterioscler Thromb Vasc Biol. 2011;31(3):506-12.
- 17. Lewis DA, Stashenko GJ, Akay OM, Price LI, Owzar K, Ginsburg GS, et al. Whole blood gene expression analyses in patients with single versus recurrent venous thromboembolism. Thromb Res. 2011;128(6):536-40.
- 18. Alavi-Majd H, Khodakarim S, Zayeri F, Rezaei-Tavirani M, Tabatabaei SM, Heydarpour-Meymeh M. Assessment of gene set analysis methods based on microarray data. Gene. 2014;534(2):383-9.
- 19. Taylor R. Interpretation of the Correlation Coefficient: A Basic Review. J Diagn Med Sonogr. 1990;6(1):35-9.
- 20. Gauthier T. Detecting Trends Using Spearman's Rank Correlation Coefficient. Environ Forensics. 2001;2(4):359-62.
- 21. Blomqvist N. On a Measure of Dependence Between two Random Variables. Ann Math Stat. 1950;21(4):593-600.
- 22. Genest C, Plante JF. On blest's measure of rank correlation. Can J Stat. 2003;31(1):35-52.
- 23. yWorks - The Diagramming Company. yFiles Graph Library. Available from: www.yworks
- 24. Slimani T. Description and Evaluation of Semantic Similarity Measures Approaches. Int J Comput Appl. 2013;80(10):25-33.
- 25. Kim RS, Ji H, Wong WH. An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse. BMC Bioinformatics. 2006;7:44.
- 26. Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics. 2002;18(3):405-12.
- 27. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput. 2000;418-29.
- 28. Bergen A, Javitz H, Michel M, Krasnow R, Nishita D, Lessov-Schlaggar C, et al, editor(s). Drug Metabolizing Enzyme Genes and Nicotine and Cotinine Metabolism. American Society of Human Genetics 62nd Annual Meeting; 2012; San Francisco, California.
- 29. Neal C, Jackson J, Crider K, editor(s). Genetic variation and vitamin D sufficiency in the U.S. population (NHANES III). American Society of Human Genetics 62nd Annual Meeting; 2012; San Francisco, California.
- 30. Garcia-Closas M, Malats N, Silverman D, Dosemeci M, Kogevinas M, Hein DW, et al. NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet. 2005;366(9486):649-59.
- 31. Bartsch H, Nair U, Risch A, Rojas M, Wikman H, Alexandrov K. Genetic polymorphism of CYP genes, alone or in combination, as a risk modifier of tobacco-related cancers. Cancer Epidemiol Biomarkers Prev. 2000;9(1):3-28.