NIEHS Comparative Mouse Genomics Centers Consortium (CMGCC)

Aronow's Lab


Bruce J. Aronow, Ph.D.

Bioinformatics Projects

The Bioinformatics Core for the University of Cincinnati Comparative Mouse Genomics Centers Consortium (CMGCC) is pursuing several aims: (i) to assist researchers in designing and analyzing gene and protein expression experiments; (ii) to assist them in evaluating their experimental data in the context of other relevant expression profiles and other genetic data (e.g., genotype and sequence data); and (iii) to build three databases that can support efforts to create and understand mouse models generated by the entire consortium.

The three principal systems are:

GeneServer: Information Server for Genes of High Interest to CMGCC Investigators

The goal of the GeneServer database is to act as a repository and lookup source that integrates information about genes, transcripts, proteins, and pathways. This is being approached in successive steps:

GeneServer: GeneServer will collate available data pertaining to the human DNA damage repair and cell cycle control genes, with particular focus on the pathways and processes in which the genes participate and on the known polymorphisms in each gene and gene product. The website for the database is http://genome.chmcc.org/geneserver.

The informational content of the GeneServer will be developed in close collaboration with investigators within the CMGCC. The GeneInfo database application will consist of an underlying Oracle database with a series of Java servlets developed to perform different functions. The database will consist of linked tables that store descriptions of the gene features that correspond to properties of each biomolecule. The Java servlets will be designed to run under Tomcat, a Java servlet container, and will connect to the Oracle database to allow for database management and for visualization of functional pathways, protein complexes, individual genes, transcripts, and proteins, as well as the species homologues and human polymorphic alleles. The impact of human gene polymorphisms will be shown in reference to (a) known biochemical domains, (b) phylogenetically conserved sequence motifs, and (c) known or potential protein structure. The GeneServer will house approximately 100 different types of information within 20-30 tables that pertain to about 200 different genes.
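The idea of linked tables can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than Oracle and Java servlets, and the table names, column names, and placeholder rows are all assumptions made for illustration; they are not the actual GeneServer schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with three linked tables.

    The schema is hypothetical: it only illustrates how gene, protein,
    and polymorphism records could be linked by a shared gene key.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            symbol  TEXT NOT NULL UNIQUE,
            pathway TEXT
        );
        CREATE TABLE protein (
            protein_id INTEGER PRIMARY KEY,
            gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),
            name       TEXT NOT NULL
        );
        CREATE TABLE polymorphism (
            poly_id         INTEGER PRIMARY KEY,
            gene_id         INTEGER NOT NULL REFERENCES gene(gene_id),
            variant         TEXT NOT NULL,
            affected_domain TEXT
        );
        """
    )
    return conn


def polymorphisms_for(conn, symbol):
    """Return (variant, affected_domain) rows for one gene symbol."""
    return conn.execute(
        "SELECT p.variant, p.affected_domain "
        "FROM polymorphism p JOIN gene g ON g.gene_id = p.gene_id "
        "WHERE g.symbol = ? ORDER BY p.poly_id",
        (symbol,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    # Placeholder rows only; these are not real genes or variants.
    db.execute("INSERT INTO gene VALUES (1, 'GENE_A', 'placeholder pathway')")
    db.execute("INSERT INTO polymorphism VALUES (1, 1, 'variant_1', 'domain_x')")
    print(polymorphisms_for(db, "GENE_A"))
```

Keeping polymorphisms in their own table keyed by gene is one way to let a single gene carry any number of variants, each annotated with the domain it affects.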

Comparative genomics analysis tool: a tool to find highly conserved genomic regions that may contain functional domains and regions. So far this has been pursued mainly to identify critical promoter and enhancer elements by comparing cis-elements within conserved sequences.
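As a generic illustration of what finding highly conserved regions can mean, the sketch below scans a pairwise alignment with a sliding window and reports windows whose identity meets a threshold. This is not the consortium's algorithm; the function name, window size, and threshold are assumptions chosen for the example.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return (start, end, identity) for alignment windows meeting the threshold.

    seq_a and seq_b are the two rows of a pairwise alignment: equal length,
    with '-' marking gaps. Gap columns never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(0, len(seq_a) - window + 1):
        end = start + window
        matches = sum(
            1
            for a, b in zip(seq_a[start:end], seq_b[start:end])
            if a == b and a != "-"
        )
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, end, identity))
    return hits


if __name__ == "__main__":
    # Toy aligned sequences, not real genomic data.
    print(conserved_windows("ACGTACGTACTTTTT", "ACGTACGTACGGGGG", 10, 0.9))
```

Overlapping windows are reported separately here; a fuller tool would typically merge them into longer conserved blocks before looking for cis-elements inside them.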

TRAFAC

A comparative genomics cis-regulatory region discovery and analysis system. We have completed programming a web accessible database system for the detection of cis-regulatory regions in genes whose regulation is of interest to any investigator in the CMGCC. The website for this system is http://trafac.chmcc.org. This is a novel, powerful, and highly expandable system that should fuel progress into prediction of gene regulatory regions. A follow-up development for the system is intended to identify potential regulatory region polymorphisms that occur in human individuals. Currently there are more than 200 genes that have been curated into the database. The system has been designed to allow for rapid curation of any gene or gene group of interest to members of the CMGCC and has many of the genes for cell-cycle control and DNA damage repair already entered and available for exploration. We are now ready to accept additional genes for entry into the system. A key point for knock-in construction may be to avoid the insertion of flox/loxP sites into potential regulatory regions. An additional direction is to examine the probable regulatory regions for the occurrence of sequence polymorphisms.

Progress: We have annotated and set up the analysis of several dozen genes of the DNA damage and cell cycle control pathway and these are available for detailed analysis at the website: http://trafac.chmcc.org. In addition, the server at the current level of implementation has been described in detail in a publication in Genome Research:

Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, Aronow BJ. Detection and Visualization of Compositionally Similar cis-Regulatory Element Clusters in Orthologous and Coordinately Controlled Genes. Genome Res. 2002 Sep;12(9):1408-17.

Problems: The genes need a high degree of curation. This is an intensive activity, and it needs to be done in such a way that an audit trail and a revised features table accompany each gene.

Proposed solution: We are approaching this issue in the context of a more general implementation of the TRAFAC tool using the ENSEMBL mouse and human genome assemblies. We are making excellent progress toward this goal, which is to have a GeneServer that integrates this information and allows multiple users to improve the annotation depth of genes, proteins, and pathways. We are working with Robert Weiss of the Utah geneSNPs group and Debbie Nickerson of the University of Washington geneSNPs group to bring the solution to a general level.

GENET

Through GENET, the Gene Expression Data Server, we have begun to make available our published and some unpublished gene expression data for web viewing, searching, re-analysis, and download. The system is available at http://genet.chmcc.org. We have now placed on this web server data on the effects of Rb delta CK (PSMRB) that accompany a submitted manuscript (Markey et al.); these data can be accessed by logging in with username CMGCC and password CMGCC. We are planning to add additional data pertinent to DNA damage and cancer models. The important overall goal of these efforts is to identify gene responses that can predict the potential for harmful consequences of human polymorphism substitutions into the mouse genome.

Identifying differentially expressed genes in multifactorial experiments: We have developed a complete strategy for analyzing microarray data generated by the Genomics Core. The strategy incorporates curvilinear within-array and between-array normalization approaches that effectively remove systematic biases in the data. Linear-model-based statistical analysis of the processed data is then applied, which allows us to use information from the whole experiment to identify genes whose expression is affected by various combinations of treatments. To perform such analyses efficiently, which involves fitting thousands of ANOVA models, we developed SAS programs that take advantage of the extensive linear models capabilities of this statistical package. We also developed Perl routines for pre-processing the raw data generated by the Genomics Core into a format that can be directly accessed by SAS. Complete processing of a 15-array, 3-factor experiment, including data pre-processing, comprehensive normalization, fitting of all applicable ANOVA models, calculation of various measures of statistical significance (e.g., individual p-values and False Discovery Rate based significance measures), generation of comprehensive model and data quality diagnostics, and merging of gene annotations, takes about 1 hour of computing time on a high-end PC workstation. Depending on the preferences of individual investigators, the outputs of such analyses are then uploaded to GeneSpring or transferred to Microsoft Excel and/or other electronic formats.
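One common False Discovery Rate based measure is the Benjamini-Hochberg adjusted p-value. The sketch below shows that adjustment in Python for a list of per-gene p-values. It is an illustration only: the text does not say which FDR procedure the SAS programs use, and the function name and example values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the p-value with ascending rank r out of m tests, the raw adjustment
    is p * m / r; a running minimum from the largest rank downward keeps the
    adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up p-values standing in for per-gene ANOVA results.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Genes whose adjusted value falls below a chosen level (for example 0.05) would then be reported as significant at that false discovery rate.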

The linear model (i.e., Analysis of Variance) approach allows us to reduce the cost of multifactorial experiments by reducing the number of combinations of experimental factors that need to be directly compared on a microarray. An appropriate experimental design is crucial for one's ability to identify statistically significant changes. To choose an optimal experimental design, investigators whose microarray experiments are subsidized by CMGCC are required to consult a CMGCC member with experience in designing such experiments before conducting the experiment.

The Rb delta CK (PSMRB) data on GENET accompany the submitted manuscript by Markey et al. from the Knudsen laboratory. We have also completed the incorporation of a very large dataset of 350 microarrays from children with a variety of leukemia types in which different oncogenes are active. We are exploring ways of mining these data for gene functions in cell cycle control and DNA damage that are related to oncogenic pathways.

PathMaker

Objectives:

  • To annotate all the elements in a pathway
  • To save the pathway as a persistent object
  • To modify existing pathways

Components:

  • Selector: Populates the Gene Objects
    • Select genes based on the Gene Ontology assignments
    • Selected genes are placed in the gene object tab of a palette
    • A selected gene is used as a navigation point to gene-related objects, for example when adding a protein, a transcript, or a polymorphism
  • Assembler: Operates within Pathway Biological Object Model
    • Work from the PathMaker palette
    • Specify interactions between the genes
  • Viewer: Displays a graphical image of the current Pathway
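The objectives and components above can be sketched as a small object model. This is a hypothetical illustration in Python, not PathMaker's actual Pathway Biological Object Model: the class names, fields, JSON persistence, and the `select_by_go` helper are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Gene:
    """A gene object with Gene Ontology terms and links to related objects."""
    symbol: str
    go_terms: list = field(default_factory=list)
    related: dict = field(default_factory=dict)  # e.g. {"protein": ["..."]}


@dataclass
class Pathway:
    """A pathway that can be annotated, modified, and saved."""
    name: str
    genes: dict = field(default_factory=dict)         # symbol -> Gene
    interactions: list = field(default_factory=list)  # (source, target, kind)

    def add_gene(self, gene):
        self.genes[gene.symbol] = gene

    def add_interaction(self, source, target, kind):
        # Assembler step: only genes already in the pathway can interact.
        if source not in self.genes or target not in self.genes:
            raise KeyError("both genes must be in the pathway")
        self.interactions.append((source, target, kind))

    def to_json(self):
        """Serialize the pathway so it can be stored as a persistent object."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        pathway = cls(name=raw["name"])
        for gene_fields in raw["genes"].values():
            pathway.add_gene(Gene(**gene_fields))
        for source, target, kind in raw["interactions"]:
            pathway.add_interaction(source, target, kind)
        return pathway


def select_by_go(genes, go_term):
    """Selector step: keep genes annotated with the given GO term."""
    return [g for g in genes if go_term in g.go_terms]
```

A Viewer would read the `genes` and `interactions` fields to draw the pathway; rendering is left out of this sketch.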


11/29/2002