Navigating the genome
Cristina de Guzman Strong, Julia A. Segre


The complete sequencing of the DNA of multiple species has opened new avenues of research, enabling genomic studies into the complexity of biological function. Web-based genome browsers, such as those available at the NCBI, UCSC (Kent et al., 2002) and Ensembl (Hubbard et al., 2007), serve as public portals for access to these DNA sequences, including annotations useful for generating reagents and testing hypotheses that assess gene function. One can search for a gene by its name, UniGene ID or RefSeq ID in any of these genome browsers, which all utilize the same NCBI sequence database but can differ slightly in their annotations. Each of the three sites also provides an online tutorial. Here, we highlight some common genomics applications by using examples that demonstrate their effectiveness for cell-biological hypotheses.

Viewing a gene and its genomic region

The UCSC Genome Bioinformatics website provides a view of the gene within a chromosomal context. For example, to view the gene encoding the mouse transcription factor GATA3, first select Genome Browser from the UCSC web page. Then, in the Genome Browser Gateway, select `Mouse' from the genome drop-down list (the default is `Human') and type `gata3' (no hyphen) as the `position or search term'. On the next page, select the gata3 link under `RefSeq Genes' (a curated NCBI entry for this gene). The resulting view of Gata3 shows the chromosomal position and size of the gene and displays the gene as a `track'. The arrows indicate the direction of transcription. Information specific to Gata3 can be obtained by selecting any `GATA3' label in the image. Additional genomic features of the gene can be viewed by using the display option controls below or by clicking on the grey and blue tabs to the left of the tracks (see below for specific examples). Clicking on `default tracks' will always return you to the default view of the tracks.

The Ensembl website provides a more traditional, gene-centric view. To view the mouse Gata3 gene, select Mus musculus as the species and type `gata-3' or `gata3' as the search term. Then, on the returned Ensembl Mouse SearchView page, select the `Ensembl protein_coding Gene' for Gata-3. The gene report page displays the gene's chromosomal position. An arrow (within the `Features' image) indicates the transcript orientation. Additional features include alignments with other species, orthologue and paralogue predictions, and links to other genome-based web resources.

Viewing a transcript sequence

To view the transcript sequence from the UCSC Genome Browser, click on the link `RefSeq Genes' (listed on the left) and then click again on the new label `Gata3'. Within the Gata3 track and in the RefSeq page, click on the line under `mRNA/Genomic Alignments'. The returned page shows the cDNA sequence: the untranslated region (UTR) is in red, the protein-coding sequences are in dark blue, and splice sites are shown in light blue.

To view the transcript sequence for Gata-3 in Ensembl, click on `Transcript information' (listed on the left) on the Ensembl Gene Report page. One can then opt to view the exons and codons by choosing this option from the drop-down list `Show the following features' at the bottom of the Ensembl Transcript Report page. Exons are distinguished by alternating blue and black text, and UTRs are displayed against a dark yellow background. To view the complete intronic sequences, select `Exon information' from the menu on the left-hand side and tick the `show full intronic sequence' option on the Ensembl Exon Report page.

Designing primers and ensuring specificity

When designing primers (e.g. for RT-PCR), one typically wants to avoid repetitive DNA sequences and regions encoding conserved protein elements. Primers selected in such regions can produce false positives by amplifying a homologous region. In Ensembl, select the `Peptide info' option (above the track) from the Ensembl Gene Report page to generate the `Ensembl Protein Report'. This view (Protein Feature image) shows Znf_GATA (zinc-finger) protein similarity spanning exon 4 to exon 6 in multiple databases. Exons are displayed in alternating lavender/purple shading; in this case, the first exon of Gata3 is non-coding. In this example, one would select DNA sequences spanning exon 2 to exon 3, because primers that cross intron boundaries circumvent issues arising from genomic DNA contamination in the cDNA preparation. The selected DNA sequence can be pasted into a primer selection program such as Primer3 (Rozen and Skaletsky, 2000).

To ensure sequence specificity, the BLAT feature of the UCSC site will tell you whether your primers align with your gene of interest and/or elsewhere in the genome (Kent, 2002). Select BLAT from any of the UCSC web pages, then paste and submit your primers. In a returned BLAT search, the presence of multiple matches indicates a lack of specificity. Also displayed are the genomic coordinates of the primers, which can be very useful for organizing your reagents. However, a `not found' BLAT Search result does not necessarily mean that your primer has no match, because BLAT has limited sensitivity for sequences ⩽40 bp. Another way of double-checking the specificity and correct orientation of primers is to use UCSC's virtual In-Silico PCR tool. Select `PCR' from any of the UCSC web pages, paste in the designated primers (forward and reverse) and define the maximum product size to generate the coordinates and genomic sequence of the amplicon.
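The logic behind such a virtual PCR can be illustrated with a minimal Python sketch. It is not the UCSC implementation: it assumes exact primer matches on a single strand of a short template that is already in hand, and the function names (`reverse_complement`, `find_all`, `in_silico_pcr`) are hypothetical. More than one match for either primer would flag a lack of specificity, in the same way as multiple BLAT hits.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(template: str, query: str) -> list[int]:
    """Return every 0-based start position of query in template (overlaps allowed)."""
    positions = []
    start = template.find(query)
    while start != -1:
        positions.append(start)
        start = template.find(query, start + 1)
    return positions


def in_silico_pcr(template: str, forward: str, reverse: str, max_size: int = 4000):
    """List (start, end, amplicon) tuples for exact primer matches.

    Assumption: the forward primer matches the template as written and the
    reverse primer matches as its reverse complement, downstream of the
    forward primer. Mismatches and the opposite strand are not considered.
    """
    template = template.upper()
    fwd_sites = find_all(template, forward.upper())
    rev_sites = find_all(template, reverse_complement(reverse))
    products = []
    for f in fwd_sites:
        for r in rev_sites:
            end = r + len(reverse)  # end is exclusive
            if r >= f + len(forward) and end - f <= max_size:
                products.append((f, end, template[f:end]))
    return products
```

Calling `in_silico_pcr(template, forward, reverse, max_size=500)` returns one tuple per predicted product; an empty list means that no product within the size limit was found under these exact-match assumptions.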

Predicting the transcription start site

One can use the UCSC browser to predict the transcription initiation site by scrolling down to the `mRNA and EST tracks' controls and selecting `full' from the drop-down menu given below `Spliced ESTs'. Alternatively, click on the grey tab to the left of the `Spliced ESTs' track and similarly change the display mode to `full'. For this analysis, only spliced ESTs are queried, to circumvent possible contamination of cDNA libraries with genomic DNA. Transcription initiation at the most-5′ exon must, of course, be verified by RT-PCR and primer extension in the cells or tissue of interest, because there are many examples of alternative transcriptional start sites.
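The reasoning can be sketched in a few lines of Python, assuming the EST alignments have already been exported as lists of aligned genomic blocks. The `EstAlignment` record and the function names are hypothetical, and `spliced' is simplified here to mean `more than one aligned block', which is looser than the criteria a genome browser applies.

```python
from typing import NamedTuple, Optional


class EstAlignment(NamedTuple):
    name: str
    strand: str                      # "+" or "-"
    blocks: list[tuple[int, int]]    # aligned blocks (start, end) in genomic order


def is_spliced(est: EstAlignment) -> bool:
    """Simplifying assumption: an EST is spliced if it aligns in more than one block."""
    return len(est.blocks) > 1


def candidate_tss(ests: list[EstAlignment], strand: str) -> Optional[int]:
    """Return the most-5' genomic boundary among spliced ESTs on one strand.

    On the plus strand this is the smallest block start; on the minus strand
    it is the largest block end. Returns None if no spliced EST is available.
    """
    spliced = [est for est in ests if est.strand == strand and is_spliced(est)]
    if not spliced:
        return None
    if strand == "+":
        return min(est.blocks[0][0] for est in spliced)
    return max(est.blocks[-1][1] for est in spliced)
```

The returned coordinate is only a candidate: ESTs are frequently truncated at their 5′ ends, which is one reason why experimental verification remains necessary.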

Finding conserved noncoding regulatory sequences

Sequences that play a role in transcriptional regulation include enhancers, silencers and barrier elements. Identification of evolutionarily conserved non-coding sequences across multiple species can help locate such cis-regulatory elements (Hardison et al., 1997). To view regions of high sequence conservation at the UCSC site, scroll down and select `full' from the drop-down menu below `Conservation' in the Comparative Genomics section. The `Conservation' track shows the Multiz alignment of each species to the reference mouse genome (Blanchette et al., 2004) and PhastCons peaks of evolutionary conservation across all species (top `wiggle' portion of the track) (Siepel et al., 2005). Scored blocks of highly conserved sequence identified by PhastCons can also be viewed by selecting `pack' from the `Most Conserved' drop-down menu. Various web-based tools, such as MultiPipMaker (Schwartz et al., 2003) and VISTA (Frazer et al., 2004), can also be used to identify conserved sequences.
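The idea of scoring conservation along an alignment can be illustrated with a simple sliding-window percent-identity calculation. This is a deliberately simplified stand-in and not the PhastCons model, which is phylogenetic and considers many species at once. The sketch assumes two pre-aligned sequences of equal length with `-' for gaps; the function names, window size and threshold are illustrative assumptions.

```python
def window_identity(ref_aln: str, other_aln: str, window: int = 10, step: int = 1):
    """Return (start, identity) for each window of a pairwise alignment.

    Identity is the fraction of columns in the window where both sequences
    carry the same base; gap columns never count as matches.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have the same length")
    ref_aln, other_aln = ref_aln.upper(), other_aln.upper()
    scores = []
    for start in range(0, len(ref_aln) - window + 1, step):
        pairs = zip(ref_aln[start:start + window], other_aln[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(ref_aln: str, other_aln: str, window: int = 10,
                      threshold: float = 0.9) -> list[int]:
    """Return the start positions of windows whose identity meets the threshold."""
    return [start for start, score in window_identity(ref_aln, other_aln, window)
            if score >= threshold]
```

Window starts are in alignment columns, so they would still need to be mapped back to genomic coordinates before being compared with browser tracks.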

Conserved sequences can be further interrogated by computational programs such as TRANSFAC (Wingender et al., 2000) and rVISTA (Loots et al., 2002) to identify specific transcription-factor-binding sites. MEME (Bailey and Elkan, 1994) is a complementary approach that identifies common motifs in a set of queried sequences. Ultimately, these genomic approaches may narrow the regions to test, but they provide only hypothetical regulatory sequences for further laboratory analysis.
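As a rough illustration of what a binding-site search does, the sketch below scans both strands of a sequence for an IUPAC consensus. It is far simpler than the weight-matrix methods used by the programs above, and it is not their algorithm. The default consensus `WGATAR` is the commonly cited GATA-factor motif and is used here only as an illustrative assumption; the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_motif(sequence: str, consensus: str = "WGATAR"):
    """Return sorted (start, strand, site) hits for a consensus on both strands.

    Start positions are 0-based and always refer to the forward strand.
    A lookahead is used so that overlapping sites are all reported.
    """
    seq = sequence.upper()
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile(f"(?=({body}))")
    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        start = len(seq) - (m.start() + len(consensus))
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

A short degenerate consensus such as this one occurs frequently by chance, so restricting the scan to conserved blocks is what keeps the number of candidate sites manageable.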

Annotating genomic regions

The UCSC Genome Browser allows one to annotate genomic regions by temporarily adding `custom tracks'. This tool is extremely useful for organizing reagents and communicating with others in a `genomic' context. To create custom tracks from the above examples, first create a tab-delimited file (e.g. exported from Excel) with the sequence information. Then, from the Genome Browser Gateway page, click on `add custom tracks' and either upload the file or paste in the coordinates. Finally, on the `Manage Custom Tracks' page, click on `go to genome browser' to view the annotated file. In the poster we show the custom tracks for the RT-PCR primer set and the first exon of Gata3.
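A tab-delimited file of this kind can also be generated programmatically. The minimal sketch below writes features in BED format (chromosome, 0-based start, exclusive end, label) beneath a `track` header line. The function name is hypothetical and any coordinates passed to it in an example are placeholders, not the real positions of the Gata3 primers or exons.

```python
def bed_custom_track(name: str, description: str, features) -> str:
    """Build the text of a simple BED custom track.

    features is an iterable of (chrom, start, end, label) tuples, with
    0-based start and exclusive end coordinates as in the BED convention.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Placeholder coordinates for illustration only (NOT real Gata3 positions).
example_track = bed_custom_track(
    "RT_PCR_primers",
    "Hypothetical RT-PCR primer set",
    [("chr2", 1000, 1020, "forward_primer"),
     ("chr2", 1480, 1500, "reverse_primer")],
)
```

The resulting text can be saved to a file for upload or pasted directly into the custom track form.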

Concluding remarks

These sites have many additional applications. For example, if you identify an unknown DNA sequence in an experiment, BLAT is the quickest way to determine its genomic context. Other features of these websites, such as `DNA' at the top of the UCSC Genome Browser page, can display a (repeat-masked) DNA sequence of the viewed region that is useful for primer design. UCSC also provides tracks that annotate sequences generating miRNAs and snoRNAs (under the genes and gene prediction feature). Conversely, potential targets of miRNA can be viewed in the UCSC browser as T-ScanS miRNA (Lewis et al., 2005) or PicTar miRNA (Krek et al., 2005) tracks in the Expression and Regulation section.

The public genome initiative ENCODE (for ENCyclopedia Of DNA Elements) seeks to identify the function of every sequence in the genome. Using multiple, diverse experiments to ascertain the function of a select 1% of the genome, the ENCODE pilot phase identified more pervasive transcription than previously predicted, a specific relationship between transcription start sites, chromatin structure and histone modification, correlation of early and late replication with regions of gene activation and repression, respectively, as well as new insights into the human genomic landscape based on comparisons between inter- and intraspecies genome analysis (ENCODE Consortium, 2007).

As cell biologists, we test hypotheses to decipher cellular functions and mechanisms, many of which originate with specific gene expression. Cell-type-specific features, such as transcription factor occupancy and the histone modifications that predict active promoters, enhancers and silencers, are now being surveyed on a genome-wide scale with next-generation sequencing technology and integrated into web-based genome browsers. Mapping the genomic landscapes of specific cell types should provide further predictive insights into cellular functions and mechanisms.


This work was supported by the NHGRI Intramural Research Program. We thank Darryl Leja and Julia Fekecs for graphic design expertise and Ashley Owen, Andre Pilon, Tyra Wolfsberg and Donna Karolchik for the critical reading of the manuscript.