Large-scale microarray analyses reveal that transcriptional co-regulation patterns can be remarkably helpful in predicting the function of novel mouse genes.
Every eukaryotic genome-sequencing project to date has revealed the presence of thousands of novel predicted genes. Researchers interested in functional genomics now face some formidable challenges: defining how many unknown genes are yet to be discovered and working out what they do. Now, in Journal of Biology, Timothy Hughes and colleagues show that techniques that were first applied to yeast can be used to predict gene function in mice (see 'The bottom line' box for a summary of the work).
Hughes became something of a microarray aficionado during his postdoc at Rosetta Inpharmatics, LLC in Seattle, USA. He and his colleagues there demonstrated that a careful combination of genome-wide microarray analysis of gene expression patterns and sophisticated statistical methods could be used to predict gene function. Specifically, they showed that patterns of transcriptional co-regulation could effectively predict the biological function of novel genes. But those impressive studies were performed in a unicellular yeast, which has around 6,000 genes in total. It wasn't clear how well the approach would fare with larger mammalian genomes and the complexity of multicellular organisms. When Hughes moved to the University of Toronto, Canada, he was eager to give it a try. Mark Gerstein of Yale University says that the Hughes study has tackled an important problem in functional genomics: "That is, translating ideas that were found applicable in simple unicellular organisms to more complicated mammalian systems."
A mountain of microarray data
Hughes' first concern was which genes to spot onto his microarray slides (see the 'Background' box). Researchers are still undecided about how many genes make up a mouse. "There is no 'gold standard' cDNA database for mouse genes," explains Hughes. His team chose to start with a single source, the XM sequences from NCBI (see Table 1 for a list of the resources mentioned in this article). "We downloaded the XM collection from the NCBI. It's almost certainly not perfect, as it's all done using draft genome sequence, but it seems to contain a large majority of the known genes and a bunch of predicted genes, many of which were detectable on the arrays," says Hughes. "The collection contains about 75% of the current RefSeq sequences, it contains the majority of Ensembl genes, but it's missing a lot of the RIKEN clones." The team then made a single 60-residue oligonucleotide for each of the potential genes.
Table 1. The online genome-annotation and gene-listing resources described in this article
The Hughes team next got hold of as many different sources of mouse mRNA as they could and hybridized them to the microarrays carrying over 40,000 spots. They found that 21,622 transcripts were expressed in at least one of the 55 tissues examined. "We didn't really expect everything to be expressed," comments Hughes (see the 'Behind the scenes' box for more of the rationale for the work). "We mostly looked at adult tissues and we tried not to look at stress responses." He notes, however, that the latest estimates for the number of mouse genes are somewhere around the 20,000 to 25,000 mark.
Mining the resulting data mountain required a sophisticated bioinformatic approach. "You have to know what you are looking for and be able to formulate questions mathematically and execute them on a computer," notes Hughes. Hughes teamed up with computational colleagues in Brendan Frey's team and applied some fancy statistical tricks, such as 'variance stabilizing normalization', to allow comparison across the tissues, and implemented a learning algorithm called a support vector machine (SVM). "If you have a bunch of points in two- or three-dimensional space, an SVM looks for ways to distinguish between the ones that have a given feature and the ones that don't. No one had used SVMs before on this scale. If we have 55 tissues, then we are looking at 21,000 objects in a 55-dimensional space and trying to separate the ones that have a function from those that don't."
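To make Hughes' description concrete, here is a minimal sketch of a linear SVM that separates points in a 55-dimensional 'tissue space'. It is an illustration under stated assumptions, not the pipeline used in the study: the expression profiles are synthetic, the training routine (plain stochastic subgradient descent on the hinge loss) is a textbook simplification, and every name in it (train_linear_svm, make_profile, N_TISSUES) is hypothetical.

```python
import random

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=50, seed=0):
    """Fit weights w and bias b by stochastic subgradient descent
    on the L2-regularised hinge loss (a textbook linear SVM)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The L2 penalty shrinks the weights slightly on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Only points inside the margin (or misclassified) move the boundary.
            if y[i] * score < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 (gene assigned to the category) or -1 (not assigned)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Synthetic data (an assumption for illustration only): each "gene" is a
# point in 55-dimensional tissue space; genes in the hypothetical category
# are strongly expressed in the first five tissues.
N_TISSUES = 55
rng = random.Random(42)

def make_profile(in_category):
    profile = [rng.uniform(-0.5, 0.5) for _ in range(N_TISSUES)]
    if in_category:
        for t in range(5):
            profile[t] += 3.0
    return profile

X = [make_profile(True) for _ in range(20)] + [make_profile(False) for _ in range(20)]
y = [1] * 20 + [-1] * 20

w, b = train_linear_svm(X, y)
train_accuracy = sum(predict(w, b, x) == label for x, label in zip(X, y)) / len(X)
print(f"training accuracy: {train_accuracy:.2f}")
```

In a setting like the one Hughes describes, one such classifier would be trained per functional category, and genes of unknown function would then be scored against each of them.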
The statistical analysis revealed that quantitative co-expression could identify groups of genes with related functions; functions were judged to be related when annotation placed the genes in the same functional category within the Gene Ontology (see Figure 1). In fact, the SVM method was so effective that it could be used to predict functions for hundreds of genes of unknown function; indeed, the SVM was a much better predictor of gene function than were simple tissue-specific gene-expression patterns.
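The idea that co-expression points to shared function can be sketched with a simple 'guilt by association' rule: give an unannotated gene the category whose annotated members have the most similar expression profiles. This is a deliberately simpler stand-in for the SVM used in the study, and the gene names, category labels and five-tissue profiles below are invented for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

# Hypothetical annotated genes: (functional category, expression in 5 tissues).
annotated = {
    "gene_a": ("category_1", [9, 8, 1, 1, 0]),
    "gene_b": ("category_1", [8, 9, 2, 1, 1]),
    "gene_c": ("category_2", [1, 0, 9, 8, 1]),
    "gene_d": ("category_2", [0, 1, 8, 9, 2]),
}

# Hypothetical gene of unknown function.
unknown_profile = [7, 9, 1, 2, 0]

def predict_category(profile, annotated_genes):
    """Pick the category whose members correlate best, on average, with the profile."""
    by_category = {}
    for category, known_profile in annotated_genes.values():
        by_category.setdefault(category, []).append(pearson(profile, known_profile))
    mean_r = {c: sum(rs) / len(rs) for c, rs in by_category.items()}
    return max(mean_r, key=mean_r.get), mean_r

predicted, mean_correlations = predict_category(unknown_profile, annotated)
print(predicted, mean_correlations)
```

As Hughes notes later in the article, the correlation between co-regulation and function is strong but not absolute, so a rule like this yields hypotheses to test, not firm annotations.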
Figure 1. Correspondence between gene expression patterns and GO annotations. Colors show the significance values obtained by applying a statistical test to each correlation between a Gene Ontology functional category and expression in the indicated tissues. See  for further details.
The Canadian group is not the first to carry out such large-scale analyses of mammalian gene expression [4-6]. "But what I like about this paper is that it's really rock solid," says Stuart Kim of the Stanford University Medical Center, USA. "This is really believable stuff. It is really well grounded in the statistics, avoiding simplistic non-mathematical concepts like 'on and off' or 'two-fold up and two-fold down'. They did fairly sophisticated statistical analyses to make sure that the trends they were seeing were really valid. It's important to get better and better datasets published." John Hogenesch of Novartis Research Foundation Genomics Institute in San Diego, California, notes that "[Hughes'] application of SVMs and Gene Ontology to provide preliminary functional annotation for thousands of genes of unknown function is a major advance." The Hogenesch group is also creating an atlas of mammalian genes. "This approach had been used in yeast and worms, but it hadn't yet been applied to mammalian gene expression. Hughes' paper now provides testable hypotheses for the roles of thousands of genes in the genome."
An open resource at the click of a mouse
Hughes' analysis revealed that the results from the extensive mouse tissue-specific dataset correlate very well with the results of studies from other laboratories. One notable feature of the Hughes dataset is that it has been made openly accessible to the research community [1,7]. The additional data with the published article, and the Hughes lab website, provide information about the microarray oligonucleotide sequences, the SVM predictions, gene annotation, and so on, all of which can be downloaded without restriction and free of charge.
Kim points out that this is really important. "I think that every person that works on mice should now go to this study and type in the name of their favorite gene(s) and see where it is expressed in 55 tissues. It will cost nothing and then you will know where it is expressed strongly. You can make sure there are no hidden surprises [in your experiments] or find out what the hidden surprises are." Hogenesch concurs: "Most users will use the database to see where their gene of interest is expressed and what pathway it might participate in. Others will use the dataset itself to ask questions using other methodologies (tissue-specific gene expression, regulatory-element analysis, functional classification, and so on). The types of things you can do with a dataset like this are numerous, which is why it's important that the data are available."
Kim's group is building large genetic networks based on microarray datasets. "We use more than just tissue specificity to build our networks – we use everything that we can grab. So, we will go and grab these data and fold them into ours. Our next paper will include 1,700 mouse microarrays folded into the human-yeast-fly-worm networks. In worms, many labs have used our resource and published some pretty awesome papers based on the genetic network." Kim thinks that the networks will be even more powerful in accelerating the pace of research in mammalian systems, where classical experimental approaches are slow and expensive. Mark Gerstein agrees: "This is an important advance in helping to unravel the functions of the tens of thousands of human genes using functional genomics approaches."
Hughes has enjoyed the transition from studying yeast to working on mice, and is eager to collaborate with mouse geneticists to test some of the predictions that come out of the current study. And he wants to understand more about the correlation between co-regulation patterns and gene function. "As a yeast researcher the thing that blows my mind is how many things animal cells do. I learned a lot just looking at all the functional categories and Gene Ontology," admits Hughes. "The correlation between transcriptional co-regulation and function is very strong. It's much, much higher than you would get if genes were just expressed at random. But it's not absolute either. So, annotating function is a hard problem to crack and that gives us plenty to work on."
Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirnglibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR: The functional landscape of mouse gene expression.
Schadt EE, Edwards SW, GuhaThakurta D, Holder D, Ying L, Svetnik V, Leonardson A, Hart KW, Russell A, Li G, et al.: A comprehensive transcript index of the human genome generated using microarrays and computational approaches.