Identification of a set of genes showing regionally enriched expression in the mouse brain

Background The Pleiades Promoter Project aims to improve gene therapy by designing human mini-promoters (< 4 kb) that drive gene expression in specific brain regions or cell-types of therapeutic interest. Our goal was to first identify genes displaying regionally enriched expression in the mouse brain so that promoters designed from orthologous human genes can then be tested to drive reporter expression in a similar pattern in the mouse brain. Results We have utilized LongSAGE to identify regionally enriched transcripts in the adult mouse brain. As supplemental strategies, we also performed a meta-analysis of published literature and inspected the Allen Brain Atlas in situ hybridization data. From a set of approximately 30,000 mouse genes, 237 were identified as showing specific or enriched expression in 30 target regions of the mouse brain. GO term over-representation among these genes revealed co-involvement in various aspects of central nervous system development and physiology. Conclusion Using a multi-faceted expression validation approach, we have identified mouse genes whose human orthologs are good candidates for design of mini-promoters. These mouse genes represent molecular markers in several discrete brain regions/cell-types, which could potentially provide a mechanistic explanation of unique functions performed by each region. This set of markers may also serve as a resource for further studies of gene regulatory elements influencing brain expression.


Background
The Pleiades Promoter Project (please see Availability & requirements for more information) addresses two major challenges identified in gene therapy: first, the delivery of DNA to specific cell types to reduce side effects from treating healthy cells, and second, controlled delivery of DNA to a specific locus in the genome to avoid insertional mutagenesis. The goal for the project is the generation of human DNA promoters less than 4 kb in length (mini-promoters) that drive gene expression in brain regions important in neurological conditions. To achieve this goal, we have first identified genes with enriched expression in different regions of the adult mouse brain. Regional expression patterns within the brain tend to be conserved between orthologous human and mouse genes [1]. Additionally, as regulatory sequences in tissue-specific genes tend to be highly conserved [2], human mini-promoters are expected to drive regional gene expression in transgenic mice based on earlier studies [3]. Therefore, promoter regions from orthologous human genes will be assessed in the mouse brain for the ability to drive regional expression. Selection of optimal genes for promoter design necessitates detailed assessment of gene expression patterns. An invaluable resource to identify genes expressed in the mammalian brain is the serial analysis of gene expression (SAGE) technique [4,5]. A modern improvement of tag-based expression analysis is LongSAGE, which produces longer transcript tags (21-bp) better suited to unique mapping onto cDNA and genome sequences [6]. As part of the Mouse Atlas of Gene Expression project [7], LongSAGE was used to profile transcriptomes of 72 tissues of mouse strain C57BL/6J at various stages of development [8].
For the Pleiades Promoter Project [9], a scion of the Mouse Atlas project, we have generated new LongSAGE data on gene expression in the adult mouse central nervous system to identify genes that display enriched expression in key brain regions.
While LongSAGE provides a rich perspective on gene expression patterns, we extended our data mining efforts to include other large information sources. The PubMed database [10] provides an unparalleled compendium of text from the scientific literature. In order to facilitate extraction of key information from Medline abstracts or full-text articles in PubMed, natural language processing tools are routinely employed to semi-automate the process of literature mining [11,12]. In this study we investigated an approach to specifically and automatically identify associations between genes and brain regions from the literature. We further analysed expression data from the Allen Brain Atlas (ABA; [13]), a high-throughput in situ hybridization platform that has assayed expression for ~20,000 genes in the adult mouse brain [14,15]. Here, we report the successful utilization of a combination of gene-finding tools, including SAGE analysis, text mining and ABA expression data, to identify genes displaying regionally enriched expression in surrogate regions of therapeutic interest within the mouse brain.

Identification of brain region-enriched gene expression by LongSAGE
To identify regionally enriched gene expression within the brain of the adult mouse strain C57BL/6J, we used the precision of Laser Capture Microdissection (LCM; Figure 1) [16] to isolate component tissues and construct SAGE libraries from 17 brain regions as well as the whole adult mouse brain for comparison (Methods). As shown in Table 1, these libraries have been sampled to a depth of > 100,000 tags each, a level shown to be adequate for the discovery of medium-to-high level transcripts [8]. Bioinformatics analysis of differential gene expression was performed as described in Methods. Since the majority of transcripts were detected in multiple libraries, we employed a heuristic approach to identify and rank expression patterns (outlined in Table 2). For each brain region, we ranked genes from 1 to 91 based on the level and pattern of expression in descending order. Expression specificity of a ranked list of 1999 SAGE-identified genes was then confirmed by examining related literature information and Allen Brain Atlas in situ hybridization data. Based on this collective information, region-specific or region-enriched genes were further considered.
Of the 237 genes identified as displaying regionally enriched expression in this study, 132 genes [see Additional file 1] displayed expression patterns listed in Table 2. Only 22 genes were found in a single library and five of these (A930006D11Rik, Chrna6, Gdf10, Hcrt, and Hes3) were determined to be tissue-specific at a statistically significant level (tag counts > 5, P < 0.05).

Complexity of the adult mouse brain transcriptome and SAGE-based analysis of transcriptome similarity of brain regions
As an indication of complexity of the adult mouse brain transcriptome, within the 18 Pleiades libraries (including whole adult brain library) expression was observed for 11,836 genes of the total 17,098 genes detectable within the Mouse Atlas (total number of tags mapped to the Mouse Atlas libraries was approximately 8.8 million including singletons). In contrast, the Allen Brain Atlas (ABA) contains expression patterns of approximately 16,000 genes across the entire adult C57BL/6J mouse brain (Susan Sunkin, ABA, personal communication); of these genes, roughly 65.5% (10,479/16,000) were detectable in the 18 Pleiades libraries. Furthermore, the Pleiades libraries provided about 8% (1,357/17,357) additional genes to the total number of genes detectable by ABA.
We also analyzed SAGE data to measure transcriptome similarity between selected tissues. The premise was that tissues would cluster together or diverge based on the degree to which their genes are differentially expressed.
Hierarchical clustering was done based on unweighted average distance between formed clusters (see description in Methods), the results of which are displayed in the form of a dendrogram (Figure 2). A pattern of divergent tissue clusters consistently emerges: a cluster of neuronal tissues and several discrete single tissue clusters including Ependymal Layers, Cerebellum White Matter and Cerebellum Purkinje Cell Layer. Among neuronal tissues, the Ventral and Medial Thalamus consistently clustered tightly together and had the lowest expression divergence of any pair of tissues. Additionally, Visual Cortex, Primary Motor Cortex, Amygdala (basolateral), Amygdala (central), and Dorsal Striatum also clustered together. Segregation of the Ependymal tissue into a separate single cluster makes sense given its non-neuronal nature [17], and the Cerebellar White Matter is composed of myelinated axonal processes. Clustering is usually sensitive to the specific expression divergence measure used. However, we tried several empirical measures, as well as different P values for selecting differentially expressed genes, and observed that the main pattern of clustering outlined above remains unchanged.

Literature mining strategy to rapidly identify genes associated with brain regions of interest
We included in the present analysis several additional brain regions and cell-types, for example, Blood-Brain Barrier, Barrington's Nucleus, Astroglia etc., for which SAGE libraries had not been constructed. Therefore, to expand our set of genes with regionally enriched expression for all brain regions, we then scrutinized literature from PubMed. We obtained a list of Medline records using Boolean logic with search term combinations indicated in Table 3. To facilitate retrieval of publications from a large literature database such as PubMed, we also developed a semi-automated literature mining strategy (see Methods and Figure 3) based on natural language processing. In this approach we looked for the appearance of a gene name or synonym and a brain region in the same sentence. Of the 99.7 million sentences searched, 314,515 occurrences of a brain region term were found; 4,395 mouse gene names, or the names of their human orthologs, were found to appear within the same sentence as a brain region (not shown).
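The sentence-level co-occurrence search described above can be illustrated with a minimal Python sketch. It is not the medical sentence parser used in this study: the function name, the dictionary-based inputs, and the choice of case-insensitive whole-word matching are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def find_cooccurrences(sentences, gene_synonyms, region_terms):
    """Return {(gene, region): [sentence indices]} for sentences naming both.

    gene_synonyms and region_terms map a canonical name to a set of search
    strings (hypothetical inputs; the study derived its own term lists).
    """
    def compile_terms(terms):
        # Longest terms first so multi-word names win over their substrings.
        ordered = sorted(terms, key=len, reverse=True)
        body = "|".join(re.escape(term) for term in ordered)
        # Assumption: whole-word, case-insensitive matching.
        return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)

    gene_patterns = {g: compile_terms(s) for g, s in gene_synonyms.items()}
    region_patterns = {r: compile_terms(t) for r, t in region_terms.items()}

    hits = defaultdict(list)
    for index, sentence in enumerate(sentences):
        regions = [r for r, p in region_patterns.items() if p.search(sentence)]
        if not regions:
            continue  # no brain region term, so nothing to pair with
        for gene, pattern in gene_patterns.items():
            if pattern.search(sentence):
                for region in regions:
                    hits[(gene, region)].append(index)
    return dict(hits)
```

Checking for a region term first mirrors the counts reported above, where region-bearing sentences are a small fraction of all sentences searched, so most sentences are rejected before any gene pattern is tried.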
The candidature of literature-mined genes was verified by assessing available expression data (reporter gene expression, microarray expression profile, radioactive/non-radioactive in situ hybridization) in publications, and confirmed with in situ hybridization data from the Allen Brain Atlas (see below). In addition to promoter-reporter fusion data from the literature, reporter expression data for BAC (Bacterial Artificial Chromosome) transgenic mice, when available from the GENSAT database [18], was also considered as complementary evidence of expression [see Additional file 2].
Figure 1 Use of Laser Capture Microdissection to isolate the hippocampus dentate gyrus from an adult mouse. A) Intact coronal brain section at ~Bregma -1.35 stained with cresyl violet. B & C) dentate gyrus (DG) has been microdissected with laser. D) dentate gyrus has been isolated and captured for total RNA extraction and construction of SAGE libraries. Images were captured using a Sony DXC-390P 3-CCD color video camera attached to a Nikon Eclipse TE2000-S microscope (10× magnification). Scale bar = 100 μm. D: dorsal; V: ventral.

Data mining genes showing regionally enriched expression from Allen Brain Atlas
The entire Allen Brain Atlas (ABA) data set can be searched via a web-based application [13,14]. We used this feature to examine expression patterns of genes identified as regionally enriched by SAGE and/or the literature. This verification was particularly apt for SAGE because ABA in situ hybridization patterns were also derived from the same mouse strain C57BL/6J. We also employed the ABA Anatomic Search tool to identify additional genes whose expression patterns cluster within brain regions of interest. While this approach short-listed genes for major regions (Thalamus, Cerebral Cortex etc.) of the mouse brain listed under Anatomic Search, we also searched within these regions to identify expression in sub-regions of interest, e.g. within Pons for genes expressed in Locus Coeruleus. Recent introduction of the alternative ABA search tool, NeuroBlast, also proved to be useful. We used NeuroBlast to retrieve genes co-expressed with a seeded (query) gene in a region of interest. Identification of regionally enriched co-expressed genes in this manner is indispensable in subsequent identification of shared regulatory elements for efficient mini-promoter design.
Thus, SAGE analysis of the adult mouse brain transcriptome combined with meta-analysis using data mining resources described above identified 237 genes as showing regionally enriched expression (

Identification of over-represented GO terms among genes with region-enriched expression
Table 3 (search term combinations)
Gene AND brain AND in situ [qualifiers: Mouse/Human]
Gene AND brain region AND in situ
Gene AND regulation
Gene AND promoter
Gene AND promoter AND brain
Gene AND promoter AND brain region
Gene AND promoter AND transgenic mice
Gene AND promoter AND reporter (qualifiers: CAT/Luciferase/Gfp)
Figure 2 Transcriptome similarity among 17 brain tissues based on expression divergence at P value = 0.01. Tissues being compared are indicated on the Y-axis, and expression divergence (ED(p)) of clusters of tissues is plotted on the X-axis. At each node in the dendrogram, the number of genes shared between libraries in the tissue cluster is indicated. A threshold of 50% of maximum ED(p) was chosen for coloring of branch lines in the dendrogram.
The Gene Ontology (GO) resource [19] is a powerful tool to identify common functions shared by genes identified by high-throughput gene expression methods such as SAGE. We searched for over-representation of GO terms among our set of genes from each of three ontology classes: Biological Process, Molecular Function and Cellular Component (Methods). Of 237 genes in our selection, we found annotations for 216 genes in the whole mouse genome set of 18,535 annotated genes (as of March 18, 2008). From this list, we determined the top 12 statistically over-represented GO terms [see Additional file 3]. Annotations for the test selection of genes were compared with GO annotations of the whole mouse genome. Significant biological processes involved nervous system development, transmission of nerve impulse, cell-cell signaling, neurogenesis, behavior etc. Significant molecular functions involved neuropeptide hormone activity, sequence-specific DNA binding, neurotransmitter receptor activity, steroid hormone receptor activity, neurotransmitter transporter activity etc.
Products of some of these genes also tended to be localized in the extracellular region, plasma membrane, synapse, or within transcription factor complexes. Thus, it appears that many of the genes we identified have established neurological functions, which accounts for their regionally enriched expression. It is noteworthy that we found 28 transcription factor encoding genes representing 16 of 30 regions/cell-types of interest (Table 5). This information combined with identification of regulatory sequences within promoters of selected genes will aid the design of mini-promoters specific for each brain region. Because our selection of the 237 genes was biased towards those with known functions, we also carried out GO analysis on genes expressed in each of 18 SAGE libraries [see Additional file 4]. Specific neurological functions were less apparent among over-represented GO terms for these larger sets than for the 237 genes presented in this study.

Discussion
Targeting gene therapy to specific regions of the brain requires the application of well-defined promoters that can drive expression in a region-specific manner. In this study our goal was to identify regionally enriched transcripts in sub-structures/cell-types of the mouse brain with a particular focus on those brain regions associated with diseases. We were encouraged by findings from the ABA project that above-background expression was found for ~80% of genes assayed, and that approximately 70% of genes have been localized to fewer than 20% of all brain cells, suggesting that gene expression is clustered in small brain regions [14]. For a variety of reasons we believe that human orthologs of regionally enriched mouse genes would be good candidates from which to design promoters. First, at the genomic level, approximately 99% of mouse genes have an ortholog in the human genome [20]. Second, it has been shown that 84% of human-mouse orthologous gene pairs show significantly lower expression divergence than that of random gene pairs [21]. In another comparable study within the milieu of neurogenomics, it was demonstrated that there are significant constraints on the evolution of gene expression and nucleotide sequence of region-specific genes in the brains of humans and mice [1]. In general, transcripts that are regionally enriched in mice also appear to be regionally enriched in humans, further emphasizing conservation of mammalian brain gene expression. Nonetheless, we are exercising caution in assuming global conservation of expression across species as divergent as mouse and human, and will be testing multiple candidate genes for each region.
Figure 3 Text mining data flow. This shows the steps by which the medical sentence parser retrieves Medline records that contain expression information for a gene in a specific region of the brain.

Our study profiles region-enriched gene expression within 17 key areas of the adult mouse brain by LongSAGE analysis. For the small number of brain regions for which we had no SAGE data we interrogated the literature and the ABA directly. We used several expression indicators including SAGE tag abundance and specificity, in situ hybridization, promoter-reporter fusion data etc. to assess candidacy of genes. Our data mining strategy was to start with SAGE-identified genes ranked on the basis of specificity and expression level, confirmed with supporting evidence from the literature, ABA or GENSAT. Although we prioritized finding genes displaying absolute regional specificity (no detectable background expression), for our data mining strategy to be practicable we did not limit ourselves to this level of stringency, especially for the brain nuclei e.g. Basal Nucleus of Meynert, Barrington's Nucleus etc. Therefore, we also selected genes that displayed the highest level of regional enrichment with the idea that promoters of such genes can be manipulated to produce desired specificity of expression, as reported by Machon et al. for the mouse Dach1 gene [22]. Compared to ubiquitous expression of the native Dach1 gene, a transgene with 5.8 kb of Dach1 regulatory sequence restricts β-galactosidase reporter expression within the mouse brain to the neocortex. Deletion analysis of this 5.8 kb fragment further delimited cortex-specific activity to a minimal 2.5 kb promoter region. From a total of about 30,000 mouse genes [20], we have identified a set of 237 genes displaying regional enrichment of expression.
Analysis of SAGE data to delineate transcriptome similarity among 17 selected brain tissues revealed segregation of a large cluster of neuronal tissues from discrete single clusters of non-neuronal tissues (Ependymal tissue and the highly myelinated Cerebellar White Matter tissue) and the neuronal outlier Cerebellar Purkinje Cell Layer. This pattern of tissue clustering appears to be borne out by unique tissue composition at the very least. Among neuronal tissues, tight clustering of the Ventral and Medial Thalamus regions is possibly a reflection of common diencephalic origin, although from a functional standpoint the two tissues can be considered to be different. The expression signature of a tissue may either independently confer tissue uniqueness, or itself depend on unique tissue composition, the surrounding cellular environment, or a combination of factors.
Other studies have also demonstrated the utility of gene expression patterns in assessing cytoarchitectural distinctness of rodent brain regions. During review of this manuscript another study was published that employed SAGE gene expression profiling to identify regional expression in 11 regions of the adult mouse brain [23]. Interestingly, regional enrichment of some transcripts was found to be conserved in the human brain. Microarray analysis of gene expression patterns in 24 neural tissues in the mouse central nervous system has mapped discrete brain domains based on such expression patterns [24]. Importantly, it was revealed that embryological imprinting is still evident in the adult brain. Microarray analysis has similarly identified molecular markers for neuronal subtypes in the adult mouse forebrain [25], in brain regions in each of eight strains of inbred mice [26], as well as in the adult rat CNS [27,28]. Fang et al. have shown that the most regionally discriminative genes are associated with one of four specific factors: regional myelin/oligodendrocyte levels, resident neuron types, neurotransmitter innervation profiles, and Ca2+-dependent signaling and second messenger systems [28].
Table 5 (fragment): gene symbol | gene name | region/cell-type
... | Paired-like homeodomain transcription factor 2 | Subthalamic Nucleus
Lef1 | Lymphoid enhancer binding factor 1 | Thalamus
Tcf7l2 | Transcription factor 7-like 2 (T-cell specific, HMG-box) | Thalamus
Gcm1 | Glial cells missing homolog 1 | White Matter - Glia, Astrocytes
Gcm2 | Glial cells missing homolog 2 | White Matter - Glia, Astrocytes
Olig1 | Oligodendrocyte transcription factor 1 | White Matter - Glia, Oligodendroglia
Olig2 | Oligodendrocyte transcription factor 2 | White Matter - Glia, Oligodendroglia
Sox10 | SRY (sex determining region Y)-box 10 | White Matter - Glia, Oligodendroglia
By assessing over-representation of GO terms within our set of regionally expressed genes, we identified commonalities in molecular functions, cellular locations and involvement in key biological processes. This offers the promise of a unique set of molecular markers for each region/cell-type, and could potentially provide a mechanistic explanation of unique functions performed by discrete brain regions. Because of the disease application of our work, we were assured by the over-representation of genes involved in neurotransmitter synthesis, reception and degradation. Importantly, we have also identified many regionally expressed transcription factor-encoding genes. This is consistent with previous findings of Suzuki et al. who have identified region-specific transcription factors in 11 mouse brain regions by using medium-scale real-time RT-PCR [29]. They reported that 90% of known transcription factors display significant expression in at least one brain region. Additionally, it was found that 349 of over 1000 transcription factor and co-regulator genes, mapped by in situ hybridization in the brains of developing mice, show restricted expression patterns adequate to describe the anatomical organization of the mouse brain [30].
The identification of brain region-specific transcription factors is a prelude to explaining expression patterns of similarly enriched genes regulated by these factors. Armed with this knowledge, we can now search for evidence of transcription factor co-regulation of genes by making use of existing repositories of regulatory sequence collections [31-33]. In particular, the PAZAR system [33] has been employed to integrate transcription factor data and annotated regulatory sequences from the Pleiades Promoter Project. Additionally, given that much is already known about pathways that activate transcription factors, it would now be possible to identify pathways with which genes regulated by these transcription factors are associated. Indeed, a regulatory network comprising 15 important basic helix-loop-helix transcription factors and 153 target genes within the mouse brain has now been constructed [34]. From the perspective of the Pleiades Promoter Project, the identification of DNA-binding elements, transcription factors and pathways influencing their interaction will stand us in good stead for efficient mini-promoter design.
We encountered challenges during this study that deserve mention. In literature mining, curation was complicated by the existence of numerous synonyms for either mouse or human genes, references to a single protein rather than two distinct isoforms, or different genes with the same synonym. Furthermore, where genes were not represented on either ABA or GENSAT it was not possible to confirm expression, but nonetheless such genes were retained based on level and specificity of expression indicated by the literature or SAGE. Additionally, for a good number of genes there was low correlation between expression detected by SAGE and in situ hybridization. Despite the depth of sampling, expression of many genes was not detected by our SAGE procedure; for example, Pde1b1, which has been shown to be strongly expressed in the striatum by in situ hybridization on ABA and in the literature [35]. Also, Hcrt appeared to be Hypothalamus-specific by SAGE but ABA indicated enrichment in the Hypothalamus with low level, widespread background expression. Although our SAGE procedure and ABA in situ hybridization profiled gene expression from the same mouse strain C57BL/6J, lack of correlation between the two could be due to inherent differences in the way RNA is processed and/or detected in these procedures. Nonetheless, Hcrt was retained in our study after considering significance of expression in SAGE analysis (P value = 0) and the description of minimal promoters in the literature [36,37].

Conclusion
We have successfully identified genes displaying region-enriched expression in the mouse brain by the application of SAGE and data mining from a variety of publicly available sources. These genes represent useful molecular markers that could potentially aid in unraveling the functions of representative brain regions/cell-types. Importantly, for the Pleiades Promoter Project, identification of these genes has brought us closer to our goal of designing well-defined human promoters for gene therapy. Indeed, we have further identified promoters of human orthologs of a subset of these mouse genes, and are now preparing to test expression of reporter genes in transgenic mice (unpublished data). Ultimately, it will be of great interest to determine for how many of these promoters the mouse pattern of regional enrichment is recapitulated within the human brain, and which of these successfully remediate the disorders they may be designed for.

Mice
Mice used in our experiments were all adult male C57BL/6J mice (12-week old post-natal). All procedures used in these experiments were in accordance with the Canada Council on Animal Care and approved by the University of British Columbia Animal Care Committee (A05-1748). All experiments were conducted in accordance with Canadian and International standards for animal care. All efforts were made to minimize the number and suffering of any animals used in these experiments.

Whole brain manual dissection and RNA extraction
Whole brains were manually dissected at room temperature from the intact bodies of mice. To minimize the effects of stress on gene expression, the mother, and the entire litter remained in the family cage until harvest. Mice were removed, one at a time and killed in a separate room, by cervical dislocation. Tissue was immediately flash frozen in liquid nitrogen and stored at -80°C until further processing.

SAGE library preparation
The LongSAGE-Lite method was used to construct the libraries as previously described [5]. In brief, first strand cDNA was synthesized with Powerscript Reverse Transcriptase (Clontech, BD Biosciences, Mississauga, Canada) and LITE1/LITE TS primer mix (Invitrogen, Carlsbad, CA) using 15-120 ng of DNase-treated total RNA, and amplified by a 20-cycle PCR according to the SAGE-Lite method [38]. SAGE-Lite biochemistry for the generation of full-length cDNA libraries is based upon the SMART (Switching Mechanism At the 5' end of RNA Transcripts) cDNA synthesis strategy (Clontech, BD Biosciences, Mississauga, Canada). Following amplification, the cDNA were processed according to an adaptation of the standard LongSAGE protocol using the I-SAGE Long kit (Invitrogen, Carlsbad, CA). The SAGE protocol includes steps of anchoring by NlaIII, tagging by MmeI, and generating 131 bp ditags by T4 DNA ligase. The 131 bp ditags were amplified using the scale-up PCR varying from 23-27 cycles depending on the optimal scale up condition as described in the protocol, and were digested with NlaIII to remove adapter sequences. Purified 36-bp ditags were ligated to form concatemers that were cloned into SphI-digested pZErO-1 vector (Invitrogen, Carlsbad, CA), and transformations were done using One Shot DH10B T1 electrocompetent E. coli (Invitrogen, Carlsbad, CA).
After transformants had been screened by colony PCR, the fraction containing concatemers of sizes ranging from 900 bp-1300 bp was chosen for sequencing. Colonies were picked using a Q-Pix robot (Genetix, Beaverton, OR) and inoculated into 2xYT media with Zeocin (50 μg/ml) and glycerol (7.5%). After overnight culture, glycerol stocks were used to inoculate larger volume cultures for plasmid preparation, carried out using a standard alkaline-lysis procedure adapted for high-throughput processing with microtiter plates. DNA sequencing was performed with BigDye v3.1 dye terminator cycle sequencing reactions run on Tetrad thermal cyclers (MJ Research, Waltham, MA). Products from the sequencing reaction were purified by ethanol precipitation and then run on capillary DNA sequencers (Model 3730xl, Applied Biosystems, Foster City, CA).
Following inspection of data quality from a first 384-well sequencing plate, each library was sequenced to a depth of > 100,000 raw tags. The resulting sequence data were collected automatically and processed by both trimming the reads for sequence quality and removing sequences from non-recombinant clones, vector DNA and linker-derived tags. Processed data can be found on the Mouse Atlas website (please see Availability & requirements for more information).

SAGE data analysis
To obtain high quality SAGE tags for this study, all raw SAGE tags underwent a three-step cluster modification process developed by Siddiqui et al. [8]. In the first step, we calculated for each tag a P value based on the Phred quality score [39] to identify single nucleotide variants likely to originate from sequencing error. In the second step, we used tag sequence clustering to group such variants to combine tags likely to originate from a common transcript. Thus, some singletons were clustered and counted as a more abundant tag. The third step was to filter out low quality tags and compare each P value to a meta-library P value calculated from all SAGE libraries. Tag-to-gene mapping was then carried out using the DiscoverySpace 4.0 application [40]. All cluster-modified tags were then mapped to transcripts in the NCBI Reference Sequence Collection [41]. The remaining unmapped tags were mapped to transcripts in the Mammalian Gene Collection [42], followed by the Ensembl database [43]. Only sense transcripts and unique mappings were considered, and tags that mapped to more than one transcript in any of the three transcript databases were discarded. The three mapping results were subsequently merged based on gene symbol.
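The sequential tag-to-gene mapping described above can be sketched in Python as follows. This is only an illustration and not the DiscoverySpace implementation: the function name, the dictionary-based lookup tables, and the reading that a tag is discarded if it is ambiguous in any one of the databases are assumptions.

```python
from collections import defaultdict


def map_tags_to_genes(tag_counts, databases, gene_of):
    """Merge tag counts by gene symbol using prioritized transcript databases.

    tag_counts: {tag sequence: count} for one library.
    databases:  list of {tag: [sense transcript IDs]} in priority order
                (e.g. RefSeq, then MGC, then Ensembl); hypothetical inputs.
    gene_of:    {transcript ID: gene symbol}.
    """
    gene_counts = defaultdict(int)
    for tag, count in tag_counts.items():
        # Assumption: a tag hitting more than one transcript in any
        # database is ambiguous and is discarded outright.
        if any(len(db.get(tag, ())) > 1 for db in databases):
            continue
        # Otherwise take the first database, in priority order, that maps it.
        for db in databases:
            transcripts = db.get(tag, ())
            if transcripts:
                gene_counts[gene_of[transcripts[0]]] += count
                break
        # Tags found in no database stay unmapped and are not counted.
    return dict(gene_counts)
```

Summing counts under the gene symbol is what merges the three mapping results: two different tags that resolve to the same gene through different databases contribute to a single per-gene total.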
For each gene, a P value was assigned to each target (TL; brain region of interest) and off-target (OTL; background region) library pair using the P value option in DiscoverySpace. The P value was computed based on the Audic-Claverie algorithm [44] to assess confidence level of differential expression between two transcript libraries. A ranking system was implemented to facilitate selection of candidate genes with specific or enriched expression in each target library (Table 2). Region-specific transcripts were obtained by selecting transcripts detected with 5 tags or more in only one target library. To identify region-enriched transcripts, those detected in one target library and one off-target library (P_TL-OTL ≤ 0.05) were selected. Transcripts detected in multiple libraries were ranked based on pre-defined P value limits of differential expression (P_TL-TL, P_TL-OTL), as well as additional criteria such as target and off-target library counts. Transcripts whose expression patterns did not fit these criteria were not ranked.
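For readers unfamiliar with the Audic-Claverie statistic, the following Python sketch shows one way to compute it, together with the "5 tags or more in only one library" filter. It is an illustration only: the function names are hypothetical, the two-sided P value convention (doubling the smaller tail) is an assumption, and DiscoverySpace may compute its P value differently.

```python
from math import exp, lgamma, log


def ac_prob(y, x, n1, n2):
    """Audic-Claverie probability of y tags in library 2 (size n2),
    given x tags for the same gene in library 1 (size n1)."""
    ratio = n2 / n1
    log_p = (y * log(ratio)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1 + ratio))
    return exp(log_p)


def ac_pvalue(x, y, n1, n2):
    """Two-sided P value (assumed convention: twice the smaller tail)."""
    lower = sum(ac_prob(k, x, n1, n2) for k in range(y + 1))  # P(Y <= y | x)
    upper = 1.0 - lower + ac_prob(y, x, n1, n2)               # P(Y >= y | x)
    return min(1.0, 2.0 * min(lower, upper))


def region_specific(gene_counts, min_tags=5):
    """Genes detected in exactly one library with at least min_tags tags.

    gene_counts: {gene: {library: tag count}}; returns {gene: library}.
    """
    specific = {}
    for gene, by_library in gene_counts.items():
        detected = [lib for lib, n in by_library.items() if n > 0]
        if len(detected) == 1 and by_library[detected[0]] >= min_tags:
            specific[gene] = detected[0]
    return specific
```

Working in log space with `lgamma` avoids overflow in the factorial terms when tag counts are large.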
To analyze transcriptome similarity of tissues, a dendrogram was generated using MATLAB 7 (The MathWorks, Natick, MA) based on hierarchical clustering using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). The input data is a list of objects (tissue SAGE libraries) with their pair-wise distances (expression divergence ED; see below), and the output is a dendrogram. Initially, each object is in its own cluster; then, at each step of the hierarchical clustering the nearest two clusters are combined into a higher-level cluster. The distance between any two clusters A and B is taken to be the average of all distances between pairs of objects in A and B. We defined the pair-wise distance, or expression divergence (ED), between any two tissues as the fraction of differentially expressed genes in their corresponding SAGE libraries: $\mathrm{ED}(p) = N_{\mathrm{diff}}(p) / N$, where $N_{\mathrm{diff}}(p)$ is the number of differentially expressed genes at a given P value threshold $p$ and $N$ is the number of genes shared between the two corresponding libraries.
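The clustering procedure described above can be sketched in a few lines of plain Python. The original analysis was run in MATLAB; this version is an illustrative re-implementation, and the function names and the dictionary-of-pairs distance format are assumptions.

```python
from itertools import combinations


def expression_divergence(n_diff, n_shared):
    """ED(p): fraction of shared genes that are differentially expressed."""
    return n_diff / n_shared


def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering.

    labels: unique object names (e.g. tissue libraries).
    dist:   {frozenset({a, b}): distance} for every pair of labels.
    Returns the merges in order as (cluster_a, cluster_b, height),
    where each cluster is a tuple of labels.
    """
    def average_distance(a, b):
        # Mean of all pair-wise distances between members of a and b.
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in labels]  # each object starts alone
    merges = []
    while len(clusters) > 1:
        # Combine the two nearest clusters into a higher-level cluster.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges
```

Recomputing averages from the original pair-wise distances at every step keeps the sketch short and makes the averaging rule explicit; for 17 or 18 libraries the extra cost is negligible.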

Semi-automated Literature mining
All synonyms for 28,000 mouse genes were obtained from Entrez (RefSeq release 14) combined with Ensembl (build 34) of the mouse genome. Synonyms for the human orthologs were obtained using Compara (Ensembl build 34) to identify similarities between human and mouse together with Homologene (version 47) for homolog detection. In each case, Ensembl and Entrez were used as cross-references for gene identifiers. From these search strings, all names found in the English dictionary were subtracted to remove ambiguous gene terms such as "Ice". Abstracts were parsed from Medline (extraction performed September 7, 2006) and the complete text of articles was parsed from PubMed Central [45], and converted into individual sentences using the medical sentence parser [46]. Each sentence was searched for the co-occurrence of gene names with brain regions of interest. For each brain region, expanded search terms were applied referring to finer structures appropriate to the region as defined by the ontology available from the Allen