Ji Zhang

1Department of Pathology and Microbiology, University of Nebraska Medical Center; 2Center for Human Molecular Genetics, Munroe-Meyer Institute, University of Nebraska Medical Center, Omaha, Nebraska, USA


Dramatic advances in genome research in recent years will facilitate the precise determination of molecular mechanisms underlying human health and disease, and thus offer great potential for promoting health, lowering mortality and morbidity, and preventing disease. Nutritional science can benefit greatly from understanding the molecular mechanisms that cause the heterogeneous responses to nutrient intake observed in healthy adults. Therefore, it is of considerable value for nutrition scientists to gain knowledge of the essential technologies and resources of genome research. Originally, genomics referred to the scientific discipline of mapping, sequencing and analysing an organism's genome, the entire set of genes and chromosomes. Now, the emphasis of genomics has undergone a transition from structural analysis of the genome (structural genomics) to functional analysis of the genome (functional genomics). Structural genomics aims to construct high-resolution genetic, physical and transcriptional maps of an organism and ultimately to determine its entire DNA sequence. Functional genomics, however, represents a new phase of genome research, referring to the development of innovative technologies based on the vast amount of structural genomics information. The first section of this chapter will focus on tools and reagents utilized in structural genomics; the second section will focus on DNA microarray technology, a representative of today's functional genomics.

©CAB International 2003. Molecular Nutrition (eds J. Zempleni and H. Daniel)

Structural Genomics

Genome, genetic mapping and physical mapping

The human genome is composed of approximately 3 billion nucleotide base pairs, carrying genetic codes for 30,000-100,000 genes. The DNA of the diploid genome is organized into 22 pairs of autosomes and two sex chromosomes. Each chromosome is thought to contain one linear DNA molecule featuring three functional elements required for the successful duplication of the chromosome upon cell division: (i) autonomous replication sequences; (ii) the centromere, to which the mitotic or meiotic spindle attaches; and (iii) the telomere, which ensures the complete replication of the chromatid at its ends. A map of the genome defines the relative position of different loci (genes, regulatory sequences, polymorphic marker sequences, etc.) on the DNA molecules. Two distinct approaches are used to map the loci of the genome: genetic mapping and physical mapping.

When the distance between two loci is measured by the meiotic recombination frequency, a genetic map is constructed. The closer two loci are on the DNA molecule, the more likely it is that they are inherited together. The term synteny refers to loci that reside on the same chromosome; these loci are not necessarily linked. Loci located on different chromosomes, or far apart on the same chromosome, segregate independently, with a recombination frequency of 50%. Genetic distances are expressed in percentage recombination or centiMorgans (cM). One cM equals 1% recombination and corresponds to approximately 0.8 × 10^6 base pairs (bp). The genetic map of the human genome spans roughly 3700 cM.
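As a back-of-envelope consistency check on the two figures just quoted (a worked calculation added for illustration, not a result taken from the cited sources), the genetic map length can be converted into an approximate physical length:

```latex
\[
3700\ \text{cM} \times 0.8 \times 10^{6}\ \text{bp/cM} \approx 2.96 \times 10^{9}\ \text{bp}
\]
```

This agrees with the size of approximately 3 billion base pairs given above for the human genome. The conversion factor is only an average, because recombination frequency varies locally along the chromosomes.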

Genetic maps are constructed by using linkage analysis: the analysis of the segregation of polymorphic markers in pedigrees. The first genetic markers used were phenotypic traits (e.g. colour blindness) and protein polymorphisms. However, polymorphic markers of these types are rare. Detailed genetic maps were made possible by the discovery of highly polymorphic sequences in the genome, typically represented by microsatellite sequences; the latter contain variable numbers of small tandem repeats of di-, tri- or tetranucleotides (Litt and Luty, 1989; Stallings et al., 1991). Linkage analysis is a statistical method and its resolution also depends on the number of informative pedigrees.

Unlike genetic mapping, physical mapping directly measures the distance between two loci on the linear DNA molecule in nucleotide base pairs. Therefore, the complete sequence of the genome is considered to be the ultimate physical map. Both genetic and physical maps result in an identical order of genes, but the relative distance between genes can vary widely due to local variations in recombination frequency. Molecular biology provides the instruments needed to construct the physical map. At the most detailed level, the nucleotide sequence of a cloned gene can be determined. At a lower level of resolution, the distance between restriction sites is quantified in units of base pairs. Restriction sites are short sequences (typically 4-8 bp) recognized and cleaved specifically by class II restriction enzymes. The analysis of the presence of restriction sites yields a restriction map that typically comprises 10-100 kilobases (kb). Genomic fragments of this size can be cloned and manipulated using plasmids, phages, cosmids, BACs (bacterial artificial chromosomes) (Shizuya et al., 1992), PACs (P1 artificial chromosomes) (Ioannou et al., 1994) or YACs (yeast artificial chromosomes) as vectors (Burke et al., 1987).

Tools and reagents utilized in physical mapping

Mapping using human-rodent hybrid panels

Somatic cell hybrids are tools utilized to map human genes physically on to chromosomes.

Fusion of human and rodent cells and the subsequent culture selection results in stable hybrids, which usually contain a complete set of rodent chromosomes and a few human chromosomes. A panel of such hybrids allows for localization of a human gene or gene product on a specific human chromosome (Zhang et al., 1990). This approach can be used to develop synteny maps. Localizing a gene to a subchromosomal region, however, requires special hybrids containing only part of a human chromosome. This has been achieved by various deletion mapping approaches, including microcell-mediated subchromosome transfer or fusion using donor cells with specific translocations or interstitial deletions (Zhang et al., 1989a). The widely used radiation hybrid approach is also based on this deletion mapping concept (Walter et al., 1994). Here, a rodent hybrid cell line carrying a single human chromosome is irradiated with X-rays, which cleave the chromosomes into smaller fragments. Fusion of this irradiated cell population with a rodent cell and the subsequent selection for human chromosome material generates a panel of deletion hybrids containing different fragments of the human chromosome. Collections of radiation-reduced hybrids generated from each of 24 human chromosome-specific rodent hybrids produce a complete human radiation hybrid mapping panel. This facilitates the rapid mapping of genes, ESTs (expressed sequence tags), polymorphic DNA markers and STSs (sequence-tagged sites) (Olson et al., 1989; Weber and May, 1989) to subchromosomal regions using the polymerase chain reaction (PCR) (Deloukas et al., 1998).

Mapping using chromosome in situ hybridization

Genes can be assigned directly to chromosome regions by in situ hybridization, in which a DNA probe containing the sequence of interest is labelled, denatured and hybridized to its complementary chromosomal DNA from denatured metaphase spreads on a glass slide. Traditionally, the probe is labelled by the incorporation of [3H]nucleotides, and the post-hybridization detection is conducted by autoradiography, in which the chromosome spreads are overlaid with a liquid emulsion, exposed and developed. Silver signals appear near the radioactive probe and therefore highlight the chromosome location of the DNA sequence of interest (Marynen et al., 1989; Zhang et al., 1989b). The major drawback of this radioactive in situ hybridization is the time needed for one experiment (typically 2-3 weeks). Also, the detection usually yields a relatively high background, and the assignment of a chromosome locus occurs on a statistical basis.

Fluorescence in situ hybridization (FISH) circumvents this problem. The probe is labelled by incorporation of nucleotides labelled with antigen (e.g. biotin) and the detection is performed using fluorescence-conjugated antibody (e.g. avidin). Background noise is low, and most metaphase chromosomes show four specific fluorescent signals (one for each chromatid on the two homologous chromosomes). The sensitivity of this procedure is usually low, and thus it requires the use of large genomic probes (ideally >10 kb). Genomic phage (15 kb), cosmid (40 kb), PACs (80-135 kb), BACs (130 kb) and even YACs (200 kb to 2 Mb) can be utilized as probes for FISH mapping (Hardas et al., 1994).

Genome mapping and disease gene isolation

One of the most prominent applications of genetic mapping is reverse genetics, i.e. mapping of a genetic disease without knowing the underlying biochemical basis. On the basis of genetic and physical mapping, it is possible to map any Mendelian phenotypic trait. Linkage of a disease locus to a mapped genetic marker allows for the diagnosis of disease and identification of carriers if informative pedigrees are available. Once a disease locus has been mapped genetically, physical mapping can be applied to identify the disease gene and its primary defect (Collins, 1995). The identification of the primary defect is also essential for the design of a specific treatment for the disease and may lead to somatic gene therapy in the future. A similar approach can be applied to investigate multiple gene diseases. This further extends the application of gene mapping.

Comparative gene mapping, the mapping of homologous genes in genomes of different species, can also be helpful in the identification of disease genes. Linked gene groups have been conserved to some extent during evolution, and homologous chromosome regions containing different genes have been conserved among species, e.g. human, mouse and rat. The existence of many mouse or rat strains with well-defined, mapped genetic diseases or syndromes can then be used to map similar genetic loci in humans (Zhang et al., 1989b).

Alternatively, association of the disease with cytogenetic lesions such as small chromosome interstitial deletions or translocations allows the direct localization of the disease locus on the physical map and the identification of the locus by means of the same technology.

Mapping the malignant genome in cancer

Cancer, characterized by neoplastic transformations of cells, invasion of tissues and finally metastasis, is a 'disease of genes', caused by alterations of specific genes, which are partially known. Oncogenes play a role in the neoplastic evolution when activated by either mutation or gene amplification. Tumour suppressor genes control growth of cells. Loss or inactivation of these genes also contributes to tumorigenesis. Cytologically, specific cancers often display characteristic chromosome abnormalities, including translocations, deletions, inversions and DNA amplifications. Accordingly, efforts to use molecular tools in order to analyse these recurrent chromosomal abnormalities have led to the identification of numerous genes related to tumour initiation and progression.

Conventional chromosome banding techniques have provided the major basis for karyotypical analysis of malignant cells in tumours. As the interpretation of the chromosome banding pattern is a purely experience-dependent procedure, errors and ambiguous data are often difficult to prevent, especially with respect to detecting minor structural changes when analysing complex karyotypes. FISH has contributed significantly to the characterization of chromosome abnormalities in tumours. Rearrangements involving specific chromosomes or their derivatives in malignant cells can be visualized directly by hybridizing chromosome-specific painting probes, chromosome-specific repetitive sequence probes or chromosomal region-specific DNA probes to the tumour cells. However, using this approach to characterize the frequently observed marker chromosomes or genomic segments of unknown origin would require probes for all 24 human chromosomes. The development of spectral karyotyping (SKY) (Schrock et al., 1996) has attempted to circumvent this problem by visualizing the 24 human chromosomes individually in different arbitrary colours. The resolution of this technique is limited to the chromosome level. In addition, small marker chromosomes may escape detection.

Comparative genomic hybridization (CGH) provides an overview of unbalanced genetic alterations. It is based on a competitive in situ hybridization of differentially labelled tumour DNA and normal DNA to a normal human metaphase spread (Kallioniemi et al., 1992). Regions of gain or loss of DNA sequences are seen as an increased or decreased colour ratio of the two fluorochromes used to detect the labelled DNAs. The genomic information obtained from this technique, however, is restricted to the areas where gain or loss occurs in malignant genomes. In addition, it does not lead to the generation of DNA from the detected chromosome regions.

Chromosome microdissection provides an approach to isolate DNA directly from any cytologically recognizable region. The isolated DNA can then be used: (i) with region-specific painting probes for the detection of specific chromosomal disease (Zhang et al., 1993a); (ii) in gene amplification studies (Zhang et al., 1993b); and (iii) with region-specific DNA markers for positional cloning (Zhang et al., 1995). Technically, the molecular cytogenetic tools described above are complementary to each other, facilitating rapid scanning of malignant genomes and targeting chromosome abnormalities. This may lead to the identification of genes involved in tumorigenesis.

Genome mapping integration

Genome mapping has entered the final stage of integration, i.e. the integration of genetic and physical resources into more complete and comprehensive maps. One of the milestones in this respect was the construction of a human linkage map containing 5264 genetic markers (Dib et al., 1996). This high-resolution genetic map reaches the 1 cM resolution limit of genetic mapping, with marker spacing being <1 × 10^6 bp of physical distance. This provides one of the most comprehensive instruments in genetic disease studies. Soon after, a high-density physical map with >30,000 gene markers, with an average spacing of about 100 kb, was constructed (Deloukas et al., 1998). Based on these comprehensive maps, the assembly of isolated intact genomic fragments in PAC and BAC vectors into clone maps or comprehensive contigs has been facilitated. Representative PAC or BAC clones were then used as the DNA sources for a shotgun plasmid library, in which the average insert size is about 1 kb. Finally, these shotgun clones were used as DNA templates for high-throughput DNA sequencing using dideoxy-termination biochemistry and automated gel electrophoresis with laser fluorescent detection (Venter et al., 1998). Mapping of the human genome is still incomplete, owing to the presence of a large number of interspersed and tandemly repeated sequences in chromosomes. In particular, the assembly of repeats in centromeres and pericentric heterochromatin regions is not possible with current technologies. Repeat units may be as short as a few base pairs or as long as 200 bp, and they may be tandemly repeated hundreds to thousands of times in a single chromosome region. Although additional efforts are required to complete the human genome sequencing, the vast amount of structural information available to date has facilitated precise determination of molecular mechanisms in human cells (Deloukas et al., 1998; Lander et al., 2001; Venter et al., 2001).

Functional Genomics

The use of microarrays in functional genomics

Functional genomics represents a new phase of genome research: to assess genes functionally on the genome-wide scale. This is represented by the emergence of DNA microarray technology (Schena et al., 1995; DeRisi et al., 1997). In the in silico microarray methodology, inserts from tens of thousands of cDNA clones (i.e. probes) are arrayed robotically on to a glass slide and subsequently probed with two differentially labelled pools of RNA (i.e. targets). Typically, each RNA sample is labelled with a nucleotide conjugated to a fluorescent dye such as Cy3-dUTP or Cy5-dUTP. RNA (target) from at least two treatment groups is compared in order to identify differences in mRNA levels, e.g. normal cells versus diseased cells; wild-type versus transgenic animals; or a general control versus a series of study samples. After hybridization, the slide is excited by laser beams of the appropriate wavelengths to generate two 16-bit TIF images. The pixel number of each spot in each wavelength channel is proportional to the number of fluorescent molecules and hence permits the quantification of the number of target molecules that have hybridized to the cDNA clones (probes). The difference in signal intensities between the two wavelengths parallels the difference in the number of molecules from the two target sources that have hybridized to the same cDNA probe. The general process of a DNA microarray experiment is illustrated in Fig. 1.1. Experimental procedures for this technology have been well established. Thus, this chapter focuses on data analysis.

Microarray data analysis

Figure 1.2A illustrates typical DNA microarray images. The amount of data generated by each microarray experiment is substantial, potentially equivalent to that obtained through tens of thousands of individual nucleotide hybridization experiments done in the manner of traditional molecular biology (e.g. Northern blot). It is extremely challenging to convert such a massive amount of data into meaningful biological networks. Therefore, it is important for life scientists to understand working principles of data mining tools utilized in this field.

Data pre-processing

Various laser-based data acquisition scanners are now commercially available. For data analysis, it is usually necessary first to build a spreadsheet-like matrix, in which rows represent genes, columns represent RNA samples, and each cell contains a ratio (e.g. pixel number of Cy5 versus pixel number of Cy3) reflecting the transcriptional level of the particular gene in the particular sample. This matrix can be studied in two ways: comparing rows in the matrix and comparing columns in the matrix. By looking for similarities in expression patterns of genes in rows, functionally related genes that are co-regulated can be identified. By comparing expression profiles in samples, biologically correlated samples or differentially expressed genes can be determined. Usually, the matrix needs to be filtered to remove genes with missing or erroneous values. Then, numerical values in the matrix are transformed to logarithms with base 2 to normalize the data distribution and reduce potential bias caused by extreme values. When a series of test samples (e.g. clinical samples) is compared with an unpaired control (reference) sample, the log-transformed ratios need to be processed further by mean or median centring to allow for data analysis in test samples that is independent of the gene expression level in the unpaired control sample.
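The pre-processing steps described above (filtering, log2 transformation and median centring) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the software used by the author; the function name `preprocess`, the gene labels and the ratio values are hypothetical, and centring is assumed to be applied gene by gene across samples.

```python
import math
from statistics import median

def preprocess(ratios):
    """ratios maps a gene name to its Cy5/Cy3 ratios, one per RNA sample.
    A value of None marks a missing spot (hypothetical convention)."""
    # Step 1: filter out genes with any missing or non-positive ratio.
    kept = {gene: row for gene, row in ratios.items()
            if all(v is not None and v > 0 for v in row)}
    # Step 2: transform every ratio to its base-2 logarithm.
    logged = {gene: [math.log2(v) for v in row] for gene, row in kept.items()}
    # Step 3: median-centre each gene across samples (assumed gene-wise).
    centred = {}
    for gene, row in logged.items():
        m = median(row)
        centred[gene] = [v - m for v in row]
    return centred

# Hypothetical ratios for three genes measured in three samples.
example = {
    "geneA": [2.0, 4.0, 8.0],
    "geneB": [1.0, None, 0.5],   # removed by the filter
    "geneC": [0.5, 1.0, 2.0],
}
result = preprocess(example)
```

In this toy input, geneA and geneC differ fourfold in their raw ratios but follow the same trend across samples, so they end up with identical centred profiles. That is the sense in which centring makes the comparison independent of the expression level in the reference sample.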

Similarity measurements

Current efforts in understanding microarray data are focused primarily on clustering and visualization. Clustering is intended to catalogue genes or RNA samples into meaningful groups based on their similar behaviours; visualization is intended to depict clustering results in a readily accessible format. For comparisons of similarities, the concept of Euclidean distance and calculation of correlation coefficients are usually utilized to set up the similarity measurement. Euclidean distance is the distance between two n-dimensional points, e.g. X and Y. Corresponding values for X are X_1, X_2, ..., X_n, and corresponding values for Y are Y_1, Y_2, ..., Y_n. The Euclidean distance between X and Y is

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$

where n is the number of RNA samples for gene comparison, or the number of genes for sample comparison. For example, comparing any two genes (e.g. X and Y) in a three-dimensional (i.e. three samples) space, the Euclidean distance between X and Y is

$$d(X, Y) = \sqrt{(X_1 - Y_1)^2 + (X_2 - Y_2)^2 + (X_3 - Y_3)^2}$$

The closer the distance between two points, the more similar they are.

[Fig. 1.1 panel labels: 16-bit TIF image of the Cy3 channel; 16-bit TIF image of the Cy5 channel]

Fig. 1.1. Ideogram depicting the general procedure of in silico DNA microarrays. The image on the left illustrates the printing process of microarray fabrication, in which cDNA inserts from individual clones are prepared by PCR and printed on to polylysine-coated glass slides through a GMS 417 arrayer (Affymetrix). After an overnight hybridization with differentially labelled test and control probes, the slide is scanned using a GenePix 4000 scanner (Axon Instruments) to generate two TIF images: green channel and red channel. The pixel ratio between red and green for each spot is used as the numerical value for further data analysis.

Fig. 1.2. Microarray data visualizations using different algorithms (see colour version in Frontispiece). (A) Colour images illustrating similarities and dissimilarities between two brain development stages. The upper left is the full image generated by the Cy3-labelled day 11.5 post-coitum (p.c.) mouse brain cDNA pool versus a Cy5-labelled mouse embryonic liver cDNA pool, and the lower left is the full image generated from the Cy3-labelled day 12.5 p.c. mouse brain cDNA versus the same control. The right panel depicts partial images of the left panels to illustrate details. (B) Graphic distributions showing representative clusters obtained by K-means clustering. The horizontal scale represents RNA samples obtained from ten different time points of mouse embryonic brain development. The vertical scale weighs changes in expression, from high expression (red) to low expression (green) with units in log ratio, subtracted by median. (C) A partial tree view obtained through hierarchical clustering of 4608 mouse genes over ten embryonic development samples. Red represents up-regulation and green represents down-regulation. (D) A bar graphic display of SOM illustrating gene clustering and expression patterns of regulated genes during the yeast sporulation process. All genes were organized into 324 (18 × 18) hexagonal map units. Each bar in a given unit illustrates the average expression of genes mapped to that unit. (E) U-matrix and component plane presentations. The colour coding in U-matrix stands for Euclidean distance. The darker the colour, the smaller the distance. The large dark blue area that occupies the majority of the display represents unregulated genes, which form some noise clustering. The component plane presentations (t0 to t11) illustrate differential displays of regulated genes during sporulation of yeast on the genome-wide scale. The colour coding index stands for the expression values of genes. The brighter the colour, the higher the value. All these differential displays are linked by position: in each display, the hexagon in a certain position corresponds to the same map unit. It is straightforward to compare expression patterns in the same positions of different displays. The last label display shows the positions of each unit on the map.

The correlation coefficient between any two n-dimensional points is defined as

$$r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)$$

where n is the number of RNA samples for gene comparison, or the number of genes for sample comparison, \bar{X} is the average of values in point X, and \sigma_X is the standard deviation of values in point X (\bar{Y} and \sigma_Y are defined in the same way for point Y).

For example, if points X and Y are plotted as curves based on their values in all samples or genes, r will tell how similar the shapes of the two curves are. The correlation coefficient is always between -1 and 1. When r equals 1, the two shapes are identical. When r equals 0, the two shapes are linearly uncorrelated. When r equals -1, the two shapes are negatively correlated. Both Euclidean distance and correlation coefficient are used to measure similarities in clustering. There is no clear justification to favour one procedure over the other.
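To make the contrast between the two similarity measures concrete, the following sketch computes both for three hypothetical expression profiles. It assumes the standard Pearson form of the correlation coefficient with the population standard deviation (division by n); the function names and the values are illustrative only.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient (population standard deviation).
    Assumes neither profile is constant, otherwise the SD is zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / (sd_x * sd_y)

# Hypothetical log ratios for three genes across three samples.
gene_x = [1.0, 2.0, 3.0]
gene_y = [2.0, 4.0, 6.0]   # same shape as gene_x, larger amplitude
gene_z = [3.0, 2.0, 1.0]   # mirror image of gene_x

d_xy = euclidean(gene_x, gene_y)
r_xy = pearson(gene_x, gene_y)
r_xz = pearson(gene_x, gene_z)
```

Here gene_x and gene_y have a correlation of 1 (up to floating-point rounding) although their Euclidean distance is the square root of 14, about 3.74, while gene_x and gene_z have a correlation of -1. Correlation therefore groups profiles by shape, whereas Euclidean distance is also sensitive to the magnitude of the change.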

Clustering algorithms

Commonly applied algorithms for gene clustering include hierarchical clustering, K-means clustering and the self-organizing map (SOM). Hierarchical clustering is based primarily on a similarity measure between individuals (genes or samples), usually the correlation coefficient, applied through pairwise average-linkage clustering (Eisen et al., 1998; White et al., 1999). Through the pairwise comparison, this algorithm eventually clusters individuals into a tree view. The length of the branches of the tree depicts the relationship between individuals: the shorter the branch, the greater the similarity between individuals (Fig. 1.2C). This algorithm has been used frequently in microarray data analysis, and has proven to be a valuable tool. A major drawback of hierarchical clustering is the phylogenetic tree structure of the algorithm, which may be best suited to situations of true hierarchical descent, such as in the evolution of species (Tamayo et al., 1999), rather than situations of multiple distinct pathways in living cells. This may lead to incorrect clustering of genes, especially with large and complex data sets.
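The agglomerative procedure can be illustrated with a short sketch. It is a simplified stand-in rather than the implementation of Eisen et al. (1998): similarity between two clusters is assumed here to be the average of all pairwise correlation coefficients between their members, and the function names and profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def average_linkage(profiles):
    """Repeatedly merge the two most similar clusters.
    Returns the merges as (cluster_a, cluster_b, similarity) tuples."""
    clusters = {name: [name] for name in profiles}  # label -> member genes
    merges = []
    while len(clusters) > 1:
        best = None
        labels = list(clusters)
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                a, b = labels[i], labels[j]
                # Average similarity over all cross-cluster gene pairs.
                sims = [pearson(profiles[p], profiles[q])
                        for p in clusters[a] for q in clusters[b]]
                sim = sum(sims) / len(sims)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        sim, a, b = best
        # The merged label records the tree structure, e.g. "(g1,g2)".
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, sim))
    return merges

# Hypothetical profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}
merges = average_linkage(profiles)
```

The order of the merges defines the tree: the two rising genes join first, the two falling genes second, and the final merge joins the two groups at a negative similarity, which would appear as the longest branch. Note that every gene is forced into the tree even when no true hierarchy exists, which is the limitation discussed above.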

K-means clustering allows the partition of individuals into a given number (K) of separated and homogeneous groups based on repeated cycles of computation of the mean vector for all individuals in each cluster and reassignment of individuals to the cluster whose centre is closest to the individual (Fig. 1.2B). Euclidean distance is commonly used as the similarity measurement. A limitation of K-means clustering is that the arbitrarily determined number of gene clusters may not reflect the true situation in living cells. In addition, the relationship between clusters is not defined.
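The two alternating steps of K-means clustering can be sketched as below. This is a minimal illustration under stated assumptions: the initial centres are supplied by the caller so that the example is deterministic (random initial centres are common in practice), and the function names and profiles are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, initial_centres, max_iter=100):
    """Partition points into K = len(initial_centres) clusters."""
    centres = [list(c) for c in initial_centres]  # copy, do not mutate input
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: each point goes to its closest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: euclidean(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centre becomes the mean vector of its members.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres

# Hypothetical median-centred profiles of four genes over three samples.
profiles = [
    [-1.0, 0.0, 1.0],
    [-1.2, 0.1, 0.9],
    [1.0, 0.0, -1.0],
    [0.8, -0.1, -1.1],
]
labels, centres = k_means(profiles, [profiles[0], profiles[2]])
```

With K = 2 the rising and falling genes separate cleanly. The value of K has to be chosen before the analysis, and the result carries no information about how the two clusters relate to each other, which are the two limitations noted above.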

The SOM (Kohonen, 1995; Kohonen et al., 1996), an artificial intelligence algorithm based on unsupervised learning, appears to be particularly promising for microarray data analysis. It is, therefore, of considerable interest to discuss this application in further detail.

Self-organizing map algorithm

This algorithm has properties of both vector quantization and vector projection, and consequently configures output prototype vectors into a topological presentation of the original multidimensional input numerical data. A SOM consists of a given number of neurons on a usually two-dimensional grid. Each of these neurons is represented by a multidimensional prototype vector. The number of dimensions of the prototype vector is equal to that of the input vectors (i.e. the number of samples). The number of input vectors is equal to the number of inputs, i.e. the number of genes in the matrix. The neurons are connected to adjacent neurons by a neighbourhood relationship, which dictates the topology, or structure, of the map. The prototype vectors are initiated with random numerical values and trained iteratively. Each actual input vector is compared with each prototype vector on the mapping grid based on:

$$\|x - m_c\| = \min_i \{\|x - m_i\|\}$$

where x stands for the input vector and m_c for the output vector. The best-matching unit (BMU) is defined as the neuron whose prototype vector gives the smallest Euclidean distance to the input vector. Simultaneously, the topological neighbours around the BMU are stretched towards the training input vector so that they are updated as denoted by:

$$m_i(t + 1) = m_i(t) + \alpha(t)[x(t) - m_i(t)]$$

where \alpha(t) is the learning rate at training step t. The SOM training is usually processed in two phases, a first rough training step and then the fine tuning. After iterative training, the SOM is eventually formed in such a way that individuals with similar properties are mapped to the same map unit or to nearby neighbouring units, creating a smooth transition of related individuals over the entire map (Kohonen et al., 1996). More importantly, this ordered map provides a convenient platform for various inspections of the numerical data set. Although this algorithm has been utilized in several microarray-based investigations (Tamayo et al., 1999; Toronen et al., 1999; Chen et al., 2001), the potential of SOM (particularly for visual inspections) has not yet been fully exploited in microarray data analyses. Recently, we have introduced component plane presentations, a more in-depth visualization tool of SOM, for the illustration of microarray data, in order to depict transcriptional changes for genes. By integrating features of this component plane presentation with SOM, microarray analyses go beyond gene clustering to include, for instance, differential displays of regulated genes on a genome-wide scale.
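The training loop described above can be sketched in a few lines. This is a simplified illustration, not the SOM Toolbox implementation: it assumes a rectangular rather than hexagonal grid, a Gaussian neighbourhood weight around the BMU, and a single linearly decaying schedule for the learning rate and neighbourhood radius in place of separate rough-training and fine-tuning phases. All names, parameter values and data are hypothetical.

```python
import math
import random

def best_matching_unit(protos, x):
    """Return the grid position whose prototype is closest to input x."""
    return min(protos,
               key=lambda u: sum((a - b) ** 2 for a, b in zip(x, protos[u])))

def train_som(data, rows, cols, epochs=50, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Prototype vectors are initiated with random numerical values.
    protos = {(r, c): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
              for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = 1.0 - step / total_steps       # decays from 1 towards 0
            alpha = alpha0 * frac                 # learning rate alpha(t)
            radius = max(radius0 * frac, 0.5)     # neighbourhood radius
            bmu = best_matching_unit(protos, x)
            for unit, m in protos.items():
                # Assumed Gaussian weight: 1 at the BMU, smaller further away.
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for k in range(dim):
                    m[k] += alpha * h * (x[k] - m[k])
            step += 1
    return protos

# Hypothetical centred expression profiles (four genes, three samples).
data = [
    [-1.0, 0.0, 1.0],
    [-1.2, 0.1, 0.9],
    [1.0, 0.0, -1.0],
    [0.8, -0.1, -1.1],
]
protos = train_som(data, rows=3, cols=3)
unit_of_first_gene = best_matching_unit(protos, data[0])
```

Because every update moves a prototype part of the way towards an input vector, prototypes stay within the range spanned by the initial values and the data. Genes are then assigned to map units by looking up their BMU on the trained map.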

Simultaneous illustrations of gene clusters and genome-wide differential displays using component plane integrated SOM

To demonstrate the advantages of this approach over other analytical methods, we selected a previously analysed yeast sporulation data set with 6400 genes and seven RNA samples over seven time points (Chu et al., 1998). Sporulation in yeast is the process in which diploid cells undergo meiosis to produce haploid germ cells, involving two overlapping steps: meiosis and spore formation; the process can be divided into meiosis I, meiosis II and spore formation. The process of sporulation can be induced using a nitrogen-deficient medium.

For the SOM algorithm and its visualizations, we have utilized a SOM toolbox programmed by Vesanto et al. (2000). This toolbox, built in the Matlab 5 computation environment, has capacities to pre-process data, to train SOM using a range of different kinds of topologies and to visualize SOM in various ways. To maximize the number of neighbourhood contacts topologically, we utilized hexagonal map units instead of rectangular ones for the SOM training. The algorithm was then conducted using 324 prototype vectors on a two-dimensional lattice (18 × 18 grid). For the visualization, we first utilized a bar graphical display (Fig. 1.2D), similar to previously published displays, to gain a global view of gene clustering and expression patterns of expressed genes. The number of genes mapped to individual map units varied between seven and 62, and the bar chart displayed in each hexagonal unit represented the average expression pattern of genes mapped in the unit. It can be seen that the map has been organized in such a way that related patterns are placed in nearby neighbouring map units, producing a smooth transition of expression patterns over the entire map. Therefore, a gene cluster can be recognized not only by its core unit but also by the surrounding neighbouring map units.

To illustrate features other than clustering of regulated genes during the sporulation process, we integrated SOM analyses with the powerful visualization tool of component plane presentations.

Component plane presentations provide an in-depth way to visualize the variables that contribute to a SOM. Each component plane presentation can be regarded as a slice of the SOM that shows the value of a single vector component in every map unit. For example, the first component plane (t0) in Fig. 1.2E shows the SOM slice at time point 0 h and the last component plane (t11) shows the SOM slice at 11 h during the sporulation process (Chu et al., 1998). The colours of map units are selected so that the brighter the colour, the greater the average expression value of the genes mapped to the corresponding unit. Each of these SOM slices can also be considered a genome-wide differential display of regulated genes, in which all up-regulated units (hexagons in red), down-regulated units (hexagons in blue) and moderately transcribed units (hexagons in green and yellow) are well delineated. By comparing these genome-wide differential displays, we can learn many additional features of regulated genes in cells. For instance, the displays are correlated sequentially with each other, depicting the process of sporulation at the transcriptional level. The sequential inactivation of genes mapped to the two upper corners suggests that the functional group represented by genes on the right is more sensitive to induction by the nitrogen-deficient medium than the one on the left, although both are suppressed toward the end of the sporulation process. The sequential activation of genes mapped to the two bottom corners gives a more vivid picture of the process leading to spore formation. Genes in the bottom left corner and left edge are activated at an early stage of the process, indicating that these genes are associated specifically with meiosis I. In contrast, the progressively increased expression of genes in the right corner suggests that these genes are associated with meiosis II and spore formation. This is consistent with the observation that known genes of meiosis II and spore formation have been mapped to these corner units.
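The idea of slicing a SOM into component planes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for Fig. 1.2E: the function names `train_som` and `component_planes` are hypothetical, the map uses a rectangular rather than a hexagonal lattice, the training schedule is a simple linear decay, and the input is random synthetic data standing in for an expression matrix (rows are genes, columns are time points).

```python
import numpy as np

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM on a rectangular grid (an assumption of this sketch).

    Returns the codebook, shape (rows, cols, n_features): one prototype
    vector per map unit.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise prototypes from randomly chosen data rows.
    picks = rng.integers(0, n_samples, rows * cols)
    codebook = data[picks].astype(float).reshape(rows, cols, n_features)
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighbourhood radius shrinks
        x = data[rng.integers(0, n_samples)]     # one randomly drawn gene profile
        dist = np.linalg.norm(codebook - x, axis=2)
        br, bc = np.unravel_index(np.argmin(dist), dist.shape)  # best-matching unit
        grid_d2 = (grid_r - br) ** 2 + (grid_c - bc) ** 2
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))               # Gaussian neighbourhood
        codebook += lr * h[:, :, None] * (x - codebook)
    return codebook

def component_planes(codebook):
    """Slice the codebook into one 2-D plane per vector component.

    Returns an array of shape (n_features, rows, cols); plane k holds the
    value of component k (for example, one time point) in every map unit.
    """
    return np.moveaxis(codebook, 2, 0)

# Synthetic stand-in for an expression matrix: 200 genes x 7 time points.
rng = np.random.default_rng(42)
expression = rng.normal(size=(200, 7))
codebook = train_som(expression, rows=6, cols=5, n_iter=500)
planes = component_planes(codebook)
print(planes.shape)  # (7, 6, 5)
```

Each `planes[k]` can then be drawn as a colour-coded grid, so that the same position in every plane refers to the same map unit and therefore to the same group of genes.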

The SOM algorithm has great potential, in particular with regard to data visualization. To date, most of the procedures used to visualize microarray data are limited to gene clustering, typically represented by bar graphical displays as depicted in Fig. 1.2D. In contrast, the U-matrix (unified distance matrix) displayed in Fig. 1.2E is a distance matrix method that visualizes the pairwise distances between prototype vectors of neighbouring map units and helps to define the cluster structure of the SOM. We have utilized this display successfully to define some core clusters of developmentally related genes expressed during brain development. However, the interpretation of data can be difficult when the noise level is high. This concern is supported further by the presence of a large number of unregulated genes in the sporulation data set. These genes form clusters in a random manner, producing a visible clustering area in the centre of the SOM (Fig. 1.2D).
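The distance computation behind a U-matrix can be illustrated as follows. This is a simplified sketch: the hypothetical function `u_matrix` assigns each unit the mean Euclidean distance to its immediate neighbours on a rectangular grid, whereas full U-matrix displays (and the hexagonal lattice in the figure) also show the individual distances between each pair of neighbouring units. The codebook layout, one prototype vector per map unit in an array of shape (rows, cols, n_features), is an assumption of the example.

```python
import numpy as np

def u_matrix(codebook):
    """Mean distance from each unit's prototype to its grid neighbours.

    codebook: array of shape (rows, cols, n_features).
    Returns an array of shape (rows, cols); high values mark borders
    between clusters, low values mark the interior of a cluster.
    """
    rows, cols, _ = codebook.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            # 4-neighbourhood on a rectangular grid (simplifying assumption).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(codebook[r, c] - codebook[nr, nc]))
            u[r, c] = np.mean(dists) if dists else 0.0
    return u

# Toy codebook with two flat clusters: left half all zeros, right half all ones.
toy = np.zeros((2, 4, 3))
toy[:, 2:, :] = 1.0
u = u_matrix(toy)
print(np.round(u, 3))
```

In the toy example only the two middle columns, which face each other across the cluster border, receive non-zero values; the outer columns sit inside a homogeneous cluster and stay at zero.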

Component plane presentations provide an in-depth way to visualize the component variables that contribute to a SOM. With this approach, the SOM can be sliced into multiple sample-specific, genome-wide differential displays, each of which details the transcriptional changes of a specific sample on the genome-wide scale. These displays greatly help to identify the biological meaning of microarray data. As illustrated in this section, we were able to determine directly the functional significance of genes differentially expressed during the process of sporulation at the genome-wide scale. To reach similar conclusions by alternative methods would require a much greater effort (Chu et al., 1998). Component plane presentations are also applicable to microarray data from other organisms. For example, we have applied this approach to microarray data from mouse brain samples taken at ten time points during early brain development, and identified a large number of genes that are related to brain development. In addition, the displays can be used to correlate data from various samples, based on patterns in identical positions of the displays; this is particularly promising for samples from clinical studies. The potential impact of this approach on microarray data analysis can be substantial.
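One simple way to quantify how similar the displays from different samples are is to correlate them unit by unit. The sketch below is an illustrative assumption rather than the method used in this chapter: it takes component planes stored as a NumPy array of shape (n_planes, rows, cols), and the hypothetical function `plane_correlations` computes the Pearson correlation between every pair of planes across identical map positions.

```python
import numpy as np

def plane_correlations(planes):
    """Pearson correlation between every pair of component planes.

    planes: array of shape (n_planes, rows, cols), one plane per sample.
    Returns an (n_planes, n_planes) matrix; entry (i, j) compares planes
    i and j across identical map positions.
    """
    flat = planes.reshape(planes.shape[0], -1)  # one row per plane
    return np.corrcoef(flat)

# Synthetic planes: sample 1 resembles sample 0, sample 2 is its mirror image.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 5))
planes = np.stack([base, base + 0.05 * rng.normal(size=(6, 5)), -base])
corr = plane_correlations(planes)
print(np.round(corr, 2))
```

Planes with similar patterns in the same positions give a correlation close to 1, whereas inverted patterns give a correlation close to -1. A plane with identical values in every unit has zero variance, so its correlations are undefined.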


With advances in genome research, the concept of genomics extends beyond structural analysis of the genome to include functional analysis of the genome. Structural genomics focuses on genetic mapping and physical mapping of the genome by using various tools of molecular biology. A genetic map is based on linkage analysis of the segregation of polymorphic markers in pedigrees. A physical map measures the distance between loci in nucleotide base pairs. The ultimate physical map of a genome is its complete DNA sequence. One of the most prominent applications of genome mapping is disease gene studies, typically represented by reverse genetics. Cancer genetics is also an important aspect of disease gene studies. Although the sequencing of the human genome is approaching completion, an understanding of the tools and reagents involved in genome mapping may still be helpful for current research. This chapter emphasizes functional genomics, represented by DNA microarray technology. This technology allows the expression of tens of thousands of genes to be measured in parallel, providing the most comprehensive approach to understanding molecular mechanisms involved in living cells. The most challenging part of DNA microarray analysis is to convert the massive amount of data into biologically meaningful networks. Compared with other data mining tools, we believe that SOMs, in particular when integrated with component plane presentations, are the most powerful tool in this respect. This integrated approach not only allows genes to be clustered but also permits regulated genes to be displayed differentially on the genome-wide scale. This application is particularly appealing for clinical case studies, in which detailed comparison between transcriptional profiles of individual patients often is required. With the great abundance of genomic information and the rapid development of technology, the determination of molecular mechanisms that underlie living human cells has come within reach.


The author is grateful to Li Xiao and Yue Teng for their excellent assistance with data calculation and graphics.


Burke, D.T., Carle, G.F. and Olson, M.V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806-812.

Chen, J.J., Peck, K., Hong, T.M., Yang, S.C., Sher, Y.P., Shih, J.Y., Wu, R., Cheng, J.L., Roffler, S.R., Wu, C.W. and Yang, P.C. (2001) Global analysis of gene expression in invasion by a lung cancer model. Cancer Research 61, 5223-5230.

Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. and Herskowitz, I. (1998) The transcriptional program of sporulation in budding yeast. Science 282, 699-705.

Collins, F.S. (1995) Positional cloning moves from perditional to traditional. Nature Genetics 9, 347-350.

Deloukas, P., Schuler, G.D., Gyapay, G., Beasley, E.M., Soderlund, C., Rodriguez-Tome, P., Hui, L., Matise, T.C., McKusick, K.B., Beckmann, J.S. et al. (1998) A physical map of 30,000 human genes. Science 282, 744-746.

DeRisi, J.L., Iyer, V.R. and Brown, P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686.

Dib, C., Faure, S., Fizames, C., Samson, D., Drouot, N., Vignal, A., Millasseau, P., Marc, S., Hazan, J., Seboun, E., Lathrop, M., Gyapay, G., Morissette, J. and Weissenbach, J. (1996) A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380, 152-154.

Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95, 14863-14868.

Hardas, B.D., Zhang, J., Trent, J.M. and Elder, J. (1994) Direct evidence for homologous sequences on the paracentric regions of human chromosome 1. Genomics 21, 359-363.

Ioannou, P.A., Amemiya, C.T., Garnes, J., Kroisel, P.M., Shizuya, H., Chen, C., Batzer, M.A. and de Jong, P.J. (1994) A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nature Genetics 6, 84-89.

Kallioniemi, A., Kallioniemi, O.P., Sudar, D., Rutovitz, D., Gray, J.W., Waldman, F. and Pinkel, D. (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258, 818-821.

Kohonen, T. (1995) Self-organizing Maps. Springer Series in Information Sciences, Vol. 30, Springer, Berlin.

Kohonen, T., Oja, E., Simula, O., Visa, A. and Kangas, J. (1996) Engineering applications of the self-organizing map. Proceedings of the IEEE 84, 1358-1384.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Litt, M. and Luty, J.A. (1989) A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. American Journal of Human Genetics 44, 397-401.

Marynen, P., Zhang, J., Cassiman, J.J., Van den Berghe, H. and David, G. (1989) Partial primary structure of the 48- and 90-kilodalton core proteins of cell surface-associated heparan sulfate proteoglycans of lung fibroblasts. Journal of Biological Chemistry 264, 7017-7024.

Olson, M., Hood, L., Cantor, C. and Botstein, D. (1989) A common language for physical mapping of the human genome. Science 245, 1434-1435.

Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470.

Schrock, E., du Manoir, S., Veldman, T., Schoell, B., Wienberg, J., Ferguson-Smith, M.A., Ning, Y., Ledbetter, D.H., Bar-Am, I., Soenksen, D., Garini, Y. and Ried, T. (1996) Multicolor spectral karyotyping of human chromosomes. Science 273, 494-497.

Shizuya, H., Birren, B., Kim, U.J., Mancino, V., Slepak, T., Tachiiri, Y. and Simon, M. (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proceedings of the National Academy of Sciences USA 89, 8794-8797.

Stallings, R.L., Ford, A.F., Nelson, D., Torney, D.C., Hildebrand, C.E. and Moyzis, R.K. (1991) Evolution and distribution of (GT)n repetitive sequences in mammalian genomes. Genomics 10, 807-815.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences USA 96, 2907-2912.

Toronen, P., Kolehmainen, M., Wong, G. and Castren, E. (1999) Analysis of gene expression data using self-organizing maps. FEBS Letters 451, 142-146.

Venter, J.C., Adams, M.D., Sutton, G.G., Kerlavage, A.R., Smith, H.O. and Hunkapiller, M. (1998) Shotgun sequencing of the human genome. Science 280, 1540-1542.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science 291, 1304-1351.

Vesanto, J. (2000) Neural network tool for data mining: SOM toolbox. In: Proceedings of Symposium on Tool Environments and Development Methods for Intelligent Systems, TOOLMET2000. Oulun yliopistopaino, Oulu, Finland, pp. 184-196.

Walter, M.A., Spillett, D.J., Thomas, P., Weissenbach, J. and Goodfellow, P.N. (1994) A method for constructing radiation hybrid maps of whole genomes. Nature Genetics 7, 22-28.

Weber, J.L. and May, P.E. (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. American Journal of Human Genetics 44, 388-396.

White, K.P., Rifkin, S.A., Hurban, P. and Hogness, D.S. (1999) Microarray analysis of Drosophila development during metamorphosis. Science 286, 2179-2184.

Zhang, J., Marynen, P., Devriendt, K., Fryns, J.P., Van den Berghe, H. and Cassiman, J.J. (1989a) Molecular analysis of the isochromosome 12P in the Pallister-Killian syndrome. Construction of a mouse-human hybrid cell line containing an i(12p) as the sole human chromosome. Human Genetics 83, 359-363.

Zhang, J., Hemschoote, K., Peeters, B., De Clercq, N., Rombauts, W. and Cassiman, J.J. (1989b) Localization of the PRR1 gene coding for rat prostatic proline-rich polypeptides to chromosome 10 by in situ hybridization. Cytogenetics and Cell Genetics 52, 197-198.

Zhang, J., Devriendt, K., Marynen, P., Van den Berghe, H. and Cassiman, J.J. (1990) Chromosome mapping using polymerase chain reaction on somatic cell hybrids. Cancer Genetics and Cytogenetics

Zhang, J., Meltzer, P., Jenkins, R., Guan, X.Y. and Trent, J. (1993a) Application of chromosome microdissection probes for elucidation of BCR-ABL fusion and variant Philadelphia chromosome translocations in chronic myelogenous leukemia. Blood 81, 3365-3371.

Zhang, J., Trent, J.M. and Meltzer, P.S. (1993b) Rapid isolation and characterization of amplified DNA by chromosome microdissection: identification of IGF1R amplification in malignant melanoma. Oncogene 8, 2827-2831.

Zhang, J., Cui, P., Glatfelter, A.A., Cummings, L.M., Meltzer, P.S. and Trent, J.M. (1995) Microdissection based cloning of a translocation breakpoint in a human malignant melanoma. Cancer Research 55, 4640-4645.
