Genes coding for intermediate filament proteins: common features and unexpected differences in the genomes of humans and the teleost fish Fugu rubripes

We screened the genomic sequences of the teleost fish Fugu rubripes for genes that encode cytoplasmic intermediate filament (IF) proteins. Here, we compare the number of genes per subfamily (I to IV) as well as the gene mapping in the human and fish genomes. There are several unexpected differences. F. rubripes has a sizeable excess of keratin type I genes over keratin type II genes. Four of the six keratin type II genes map close to four keratin type I genes. Thus, a single keratin II gene cluster (as in mammals) seems excluded. Although a continuous genome sequence is not yet available for F. rubripes, it is difficult to see how all 19 keratin type I genes can be collected as in the human genome into a single cluster without the presence of type II genes and various unrelated genes. F. rubripes has more type III and type IV genes than humans. Some of the type IV genes acquired additional novel intron positions. One gene even harbors (in addition to the two type IV introns) three novel introns and three introns usually present only in mammalian and F. rubripes type I-III genes. This mixture of type IV and type I-III intron positions poses a problem for the traditional view that the first type IV gene arose in evolution by an mRNA-mediated translocation event. In the 42 F. rubripes genes analysed here, there are several differences in intron patterns compared with mammalian genes. Most correspond to additional introns in the fish genes. A search for genes encoding nuclear lamins reveals the four established fish lamins (A, B1, B2 and LIII) as well as an unexpected second lamin A.


Introduction
In humans, the family of genes encoding the structural proteins of the cytoplasmic intermediate filaments (IFs) has more than 60 members and is one of the 100 largest multigene families (Hesse et al., 2001). Sequence identity levels of IF proteins, the organization of the corresponding genes and their expression patterns define several IF subtypes (Fuchs and Weber, 1994;Herrmann and Aebi, 2000;Coulombe et al., 2001). Type I and type II keratins are the largest subfamilies. They give rise to the epithelial keratin filaments that are based on obligate heteromeric double-stranded coiled coils formed by a type I and a type II keratin. Type III includes four proteins that can form homopolymeric IFs. The genes for the seven type IV proteins show an entirely different intron pattern than do type I-III genes. They have only two to three introns related to the central rod domain of the proteins and these introns occur in positions not seen in type I-III genes. The nuclear lamins form the type V, whereas the eye lens proteins filensin and phakinin constitute a separate group (BF, for beaded filaments).
A survey of the draft sequence of the human genome (International Human Genome Sequencing Consortium, 2001) shows that genes coding for non-keratin IF proteins are not clustered (Hesse et al., 2001). By contrast, all type I keratin genes except for K18 form a dense cluster on chromosome 17q21, whereas all type II keratin genes and K18 form a similar cluster on chromosome 12q13 (Waseem et al., 1990).
Point mutations in a still-growing number of IF genes are connected with human diseases. Mutations in at least 14 epidermal keratin genes cause fragility syndromes of the skin (Irvine and McLean, 1999) and similar mutations in the type III desmin gene connect to myopathies of heart and skeletal muscle (Goldfarb et al., 1998), whereas mutations in the GFAP gene are found in Alexander's disease (Brenner et al., 2001;Li et al., 2002). Finally, in Caenorhabditis elegans, at least four of the 11 IF genes are essential for nematode development (Karabinos et al., 2001). Type I-III genes are not restricted to vertebrates but have also been documented in the early chordates, which, however, seem to lack type IV genes (reviewed in Karabinos et al., 2002;Wang et al., 2002).
Some genes for type I-IV cytoplasmic IF proteins from fish have previously been documented by cDNA cloning, in particular in the goldfish and the rainbow trout (Markl and Schechter, 1998; Schaffeld et al., 2002a,b), and nuclear lamins have been analysed in the goldfish (Yamaguchi et al., 2001) and the zebrafish (Hofemeister et al., 2002).
However, only the emerging genome of the teleost fish Fugu rubripes (Aparicio et al., 2002) allows a detailed comparison of IF gene organization and complexity in man and a lower vertebrate. Here, we report on some unexpected differences between IF genes in F. rubripes and mammals.

Results and Discussion
The F. rubripes genome is currently provided by 12,381 nonoverlapped scaffolds accounting for a total of 332.5 Mb. Joining of the scaffolds is expected soon (Aparicio et al., 2002). We searched the F. rubripes database between October and December 2002. Table 1 summarizes the scaffolds that contain cytoplasmic IF genes. The area number on the scaffold containing the IF gene and its direction are also given. Some genes are incomplete either because of a gap in the sequence or because the gene is located at one end of the scaffold (Table 1). A summary of this search is given in Table 2, which compares the number of F. rubripes genes for the different subfamilies with those deduced from a survey made on the draft sequence of the human genome in spring of 2001 (Hesse et al., 2001). The values provided are minimum values and might still increase slightly once the two genomes are completed. In the human genome, there are at least 62 cytoplasmic IF genes. If we subtract the 15 genes encoding hair keratins as a mammalian specialization, this number is reduced to 47. F. rubripes has at least 42 genes and thus displays a complexity that is similar to that of mammals. However, it shows a distinct distribution of the number of genes per subfamily (I-IV; Table 2).
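The gene counts can be checked with a short tally. The following is a minimal Python sketch, not part of the original analysis. The per-subfamily numbers are those quoted in the text of this paper (Table 2 itself is not reproduced here). Including the two beaded-filament (BF) genes in both totals is an assumption, made because the sums then match the stated minimum totals of 62 and 42.

```python
# Minimum gene counts per subfamily, as quoted in the text of the paper.
# Assumption: the two beaded-filament (BF) genes, phakinin and filensin,
# are counted once each in both genomes.
HUMAN = {"I": 25, "II": 24, "III": 4, "IV": 7, "BF": 2}
FUGU = {"I": 19, "II": 6, "III": 6, "IV": 9, "BF": 2}

# Hair keratins (type I and type II combined) are a mammalian specialization.
HUMAN_HAIR_KERATINS = 15


def total(counts):
    """Sum the gene counts over all subfamilies."""
    return sum(counts.values())


human_total = total(HUMAN)
human_without_hair = human_total - HUMAN_HAIR_KERATINS
fugu_total = total(FUGU)

# Ratio of type I to type II keratin genes in F. rubripes.
fugu_keratin_ratio = FUGU["I"] / FUGU["II"]

print(human_total, human_without_hair, fugu_total)
print(round(fugu_keratin_ratio, 2))
```

Under these assumptions the tally gives 62, 47 and 42 genes, and a type I to type II ratio of about 3.2 in F. rubripes.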
The F. rubripes database (Aparicio et al., 2002) is very reliable and well organized. In a few cases we needed to introduce a frame shift to keep the obvious reading frame or to explore a major change from the proposed gene structure (Table 1). Two expected difficulties arose: the occasional presence of sequence gaps in some genes situated in the interior of a scaffold and the location of a gene at the end of a scaffold (Table 1). The first problem can be solved directly by PCR amplification bridging the gaps between the known neighboring sequences. The second set of problems, which relates to eight cytoplasmic IF genes, requires overlaps for the more than 12,000 scaffolds, which should be supplied in the future by the F. rubripes genome sequencing consortium (Aparicio et al., 2002).
Journal of Cell Science 116 (11)
Table 1 footnotes (fragment): (Aparicio et al. 2002), release 6.1.1. † Indicates necessary frame shifts (f) and changes from the proposed gene structure (c). ‡ Indicates whether or not the gene is at a scaffold end. § Sequences lacking 5′ or 3′ ends are marked minus-5′ and minus-3′, respectively.
Striking excess of type I over type II keratin genes in F. rubripes
Tables 1 and 2 show the presence of 13 complete and six incomplete F. rubripes type I genes. The total of 19 genes surpasses the 16 human type I genes (not including the nine type I hair keratin genes thought to be a mammalian specialization). An entirely different situation is given by the type II keratin genes because we located only four complete and two nearly complete genes. Thus, there are about three times as many type I as type II genes in F. rubripes, whereas, in humans, the numbers of type I and II genes are similar (Tables 1, 2). The large excess of type I over type II genes could indicate that functional differences between the obligatory heteropolymeric keratin filaments of different cell types depend primarily on the type II genes that are expressed, whereas the type I genes provide additional variability. Most F. rubripes keratin genes show the intron patterns previously described for mammalian type I and type II genes. The two complete keratin I genes on scaffold 2605 have an additional intron between the traditional introns 5 and 6. The keratin II gene on scaffold 3830 has another novel intron position situated between introns 6 and 7. A striking case of an unusual intron pattern is observed in the keratin I gene on scaffold 7354. It has the normal type I intron pattern but, in addition, has an intron that occurs in all mammalian type II and in all F. rubripes type II genes (intron 1 of type II genes). The keratin I gene on scaffold 135 shows an unusual doubling of exon 6 (protein sequence identity 95%), which encodes the C-terminal end of the rod domain. Possibly, these exons are alternatives. The type I gene situated at the end of scaffold 8680 (Fig. 1; Table 1) is incomplete and lacks the 5′ end. Interestingly, it is the only F. rubripes IF gene that shows several gaps in alignments of the predicted protein sequence and might be a rare pseudogene.
Finally, the single keratin I genes on the two small scaffolds 7320 and 8762 (Table 1) share 96.8% sequence identity at the nucleotide level, including the six introns. This observation is clearly unrelated to the often suggested partial tetraploidy of fish genomes (Aparicio et al., 2002) and instead signals a very recent gene duplication event.
Lack of the keratin II gene cluster in F. rubripes
The human genome contains all 25 type I keratin genes except for the keratin 18 gene in a cluster on chromosome 17q21, where they are arranged in the same orientation. Similarly, all 24 type II keratin genes are in a similar cluster on chromosome 12q13, which also harbors the keratin 18 gene next to the keratin 8 gene (Hesse et al., 2001). Although the more than 12,000 scaffolds are not yet arranged as a continuous F. rubripes genome sequence, the current mapping results (Table 1, Fig. 1) suggest that keratin genes are differently organized in the F. rubripes and the mammalian genomes. Fig. 1 shows that four of the six type II genes locate as either paired (scaffold 214) or single (scaffolds 2158 and 3159) genes next to either one or two type I genes. The other two type II genes map to two separate scaffolds (285 and 3830). Thus, the presence of a single keratin II gene cluster as in humans is excluded. Interestingly, when type I and II genes map together, they show different orientations.
There is already clear evidence for some clustering of keratin I genes in F. rubripes. Fig. 1 shows four groups of two and three directly neighboring keratin I genes, and a pair of keratin I genes separated only by a hypothetical gene. Because another four keratin I genes lie on rather small scaffolds, one could envision a single cluster of many type I genes in F. rubripes. However, it does not seem possible to build a cluster that collects, as in humans, all type I genes without the simultaneous incorporation of four type II genes and various unrelated genes. Neighboring type I genes can have either the same or opposite orientation.
The emerging differences in keratin gene locations between mammalian and fish genomes argue that the keratin gene clustering documented for mammals was acquired after the fish lineage separated from the lineage leading to higher vertebrates. Forthcoming genomic data on the amphibian Xenopus tropicalis and the chicken (as a representative of the birds) will shed light on the question of when keratin gene clustering was acquired during vertebrate evolution. We also note that a recent phylogenetic analysis of keratin I and II proteins indicates that, although fish epidermal keratins diversified independently from the mammalian epidermal keratin radiation, keratins 8 and 18 of interior epithelia are true orthologs in fish and mammals (Schaffeld et al., 2002a).
Mammalian keratins 8 and 18 are typical of internal epithelia and represent the earliest keratin expression pair in embryogenesis. Interestingly, the gene for keratin 18, a type I keratin, is adjacent to the keratin 8 gene in the type II gene cluster on human chromosome 12q13 (Waseem et al., 1990; Hesse et al., 2001). The close proximity of keratin 8 and 18 genes also seems to hold for F. rubripes. The type II gene on scaffold 3159 codes for a keratin 8 (92% and 78% sequence identity with the rod domain of keratins 8 from rainbow trout and human, respectively). The keratin 8 gene is separated by a very small hypothetical gene from a type I keratin gene whose sequence is still not complete (Fig. 1). BLAST analysis registers the predicted protein as keratin 18 (81% and 64% identity of the available rod sequence with keratins 18 from rainbow trout and human, respectively). Fig. 2 gives some examples of the sequence similarity of F. rubripes and human IF proteins.
Table 2 footnotes (fragment): (Hesse et al., 2001). † The number of hair keratins is indicated (h); this is a mammalian specialization. ‡ Beaded eye lens filament.
Duplication of the type III desmin gene
Previous cDNA cloning studies established in fish the homologs of the four mammalian type III genes encoding vimentin, desmin, GFAP and peripherin (fish peripherin is often referred to as plasticin in the literature) (Markl and Schechter, 1998). The four complete genes are also present in F. rubripes, which shows two additional type III genes ( [...] amplification on F. rubripes DNA and sequence analysis, we have verified the proposed arrangement of the two scaffolds covering the desmin 2 gene. The sixth gene (scaffold 117) has some unexpected features. It is intronless and lies in the large tenth intron (3.2 kb) of a gene encoding the enzyme isoleucine tRNA synthase. The open reading frame predicts a second vimentin that shares 43% sequence identity with vimentin 1. The canonical sequence motif LNDR in coil 1a is changed to LNAK in vimentin 2. Currently, we do not know whether vimentin 2 is an active gene. The lack of a polyA tract argues against its being a processed pseudogene. The presence of two desmin and two vimentin genes in F. rubripes and the previous finding of at least two vimentin genes in Xenopus laevis (Herrmann et al., 1989), which is a tetraploid species, open the possibility that only higher vertebrates have single vimentin and desmin genes.
Figure legend (fragment): (Table 1). Identical amino acid residues in each pair are marked by bold print.
The beaded filaments of the mammalian and avian eye lens contain the two special cytoplasmic IF proteins, phakinin (CP49) (Hess et al., 1996) and filensin (Masaki and Quinlan, 1997), which together form the BF subfamily. The corresponding genes in F. rubripes locate to contig 29247 and scaffold 91, respectively (Table 1). The F. rubripes phakinin gene lacks the 5′ end and the consensus sequence at the end of the rod domain of the phakinin protein is, as in other phakinins, changed from YRKLLEGE to YHGILDGE (Sandilands et al., 1998). The F. rubripes filensin gene still has a small sequence gap.
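The sequence identity values quoted throughout are fractions of identical residues in aligned sequences. As a minimal illustration, the Python sketch below computes percent identity for two ungapped, equal-length strings and applies it to the two motif changes mentioned in the text (YRKLLEGE to YHGILDGE in phakinin, LNDR to LNAK in vimentin 2). The function name `percent_identity` is hypothetical. The published values were derived from alignments of much longer sequences, and how gaps were treated there is not stated, so this sketch shows only the basic calculation.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions in two pre-aligned, ungapped sequences.

    Assumption: both sequences are already aligned and have equal length;
    gap handling is deliberately left out of this sketch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Motifs quoted in the text.
rod_end_identity = percent_identity("YRKLLEGE", "YHGILDGE")
coil_1a_identity = percent_identity("LNDR", "LNAK")

print(rod_end_identity)  # 50.0 (Y, L, G and E are conserved)
print(coil_1a_identity)  # 50.0 (L and N are conserved)
```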

Surprisingly many type IV genes
The seven mammalian type IV genes (Lewis and Cowan, 1986) show an entirely different organization than do type I-III genes (Tyner et al., 1985) (Fig. 3). They have only two introns (three for NF-H), which occupy unique positions and occur late in the region encoding the rod domain. To account for this different placement of introns, it was proposed that the first type IV gene arose by an mRNA-mediated transposition event and that subsequent events led to the acquisition of the few new introns (Lewis and Cowan, 1986).
Using the presence of the two intron positions conserved in all mammalian type IV genes, a total of nine type IV genes can be identified in F. rubripes (Tables 1, 2). Several of these genes pose problems in annotation compared with the mammalian genes and so some assignments are tentative. The gene on scaffold 1912 predicts a protein with 84% sequence identity to goldfish gefiltin, the fish homolog of mammalian α-internexin (Markl and Schechter, 1998). Indeed, the F. rubripes gefiltin-internexin protein shares 60% identity with human internexin (Fig. 2). The gene on scaffold 2208 predicts a gefiltin-like protein (gefiltin-like 1) that shares 74% sequence identity with gefiltin but has a divergent tail domain (identity with human internexin 50%). A further gefiltin-like protein, gefiltin-like 2, is coded by the type IV gene on scaffold 1885. Although gefiltin-like 2 shows nearly the same similarity with gefiltin and vimentin over the rod domain, the intron pattern identifies the gene as a type IV gene.
The two genes present on scaffolds 137 and 2296 predict proteins that are related to gefiltin but have unique tail domains. The second halves of the tail domains are highly acidic owing to the presence of many glutamic acid residues, which often form polyglutamic acid strings. Because this is a distinctive feature of mammalian (Lewis and Cowan, 1986) and Xenopus (Charnas et al., 1992) neurofilament triplet NF-L proteins, we tentatively name these F. rubripes type IV genes NF-L1 and NF-L2, respectively (Table 1). The F. rubripes neurofilament triplet NF-M gene located on scaffold 6593 still has two sequence gaps that obscure the intron pattern. Because of its convincing relation to the corresponding goldfish gene (Glasgow et al., 1994), we used this latter gene in the comparison below. The type IV gene located on scaffold 1245 is tentatively called NF-H because it has the additional intron position of mammalian NF-H genes (Lees et al., 1988) and the predicted protein has a tail domain containing many short degenerate repeats. Depending on the choice between two possible gene structures, there are 19 or 30 degenerate repeats. Whereas the 21 degenerate repeats of mammalian NF-H involve essentially the 14 residue motif KSPEKAKSPVKEEA with two serine phosphorylation sites (Lees et al., 1988), the F. rubripes repeat is based on the 10 residue motif ETKPAAKEEP with one threonine phosphorylation site.
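Counting degenerate repeats depends on how much deviation from the consensus motif is tolerated. The Python sketch below shows one simple way to make such a count: a greedy left-to-right scan that accepts a window when it differs from the motif at no more than a chosen number of positions. This is an illustration only and not the procedure used for the counts reported above. The function name, the mismatch threshold and the test sequence are all hypothetical, and only the two consensus motifs are taken from the text.

```python
MAMMALIAN_NFH_MOTIF = "KSPEKAKSPVKEEA"  # 14-residue motif quoted in the text
FUGU_NFH_MOTIF = "ETKPAAKEEP"           # 10-residue motif quoted in the text


def find_degenerate_repeats(seq, motif, max_mismatches=2):
    """Greedy scan for non-overlapping, ungapped approximate motif copies.

    Returns a list of (start, window) tuples. The mismatch threshold is an
    arbitrary illustrative choice, not a value taken from the paper.
    """
    n = len(motif)
    hits = []
    i = 0
    while i + n <= len(seq):
        window = seq[i:i + n]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((i, window))
            i += n  # jump past the accepted repeat
        else:
            i += 1  # slide by one residue
    return hits


# Hypothetical tail fragment: two exact and two degenerate copies of the
# F. rubripes motif, with short spacers. Not a real protein sequence.
toy_tail = "GS" + "ETKPAAKEEP" + "ETKPVAKEEP" + "AAAA" + "ETRPAAKDEP" + "ETKPAAKEEP"

hits = find_degenerate_repeats(toy_tail, FUGU_NFH_MOTIF)
exact_hits = find_degenerate_repeats(toy_tail, FUGU_NFH_MOTIF, max_mismatches=0)

print(len(hits), len(exact_hits))  # 4 2
```

The difference between the tolerant and the exact count in this toy case shows why the number of repeats reported for a real tail domain depends on the acceptance criterion as well as on the gene structure chosen.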
Gene Y on scaffold 2477 predicts the only F. rubripes type IV protein with a low sequence similarity with gefiltin. The predicted protein has a very small head and a very long tail domain. Although this is a structural feature of mammalian nestin (Lendahl et al., 1990) and synemin (Titeux et al., 2001), no convincing homology was detected. Finally, gene X located on scaffold 120 predicts again a protein of 43% similarity with gefiltin but its astonishing intron pattern (see below) makes an annotation very difficult.
Although the F. rubripes collection of type IV genes already includes nine genes, it lacks obvious homologs encoding the large proteins nestin (Lendahl et al., 1990) and synemin (Titeux et al., 2001), and the protein syncoilin, which is a constituent IF member of the muscle dystrobrevin complex (Newey et al., 2001). Genes coding for non-keratin IF proteins are not clustered in the human genome (Hesse et al., 2001). Similarly, in F. rubripes, there is no scaffold that harbors more than one type III or one type IV gene.
Surprising intron additions and the problem of the origin of type IV genes
Although, in general, fish and mammalian genes have the same intron pattern (Aparicio et al., 2002), some F. rubripes type IV genes do not (Fig. 3). The genes for gefiltin, gefiltin-like 2, NF-L1, NF-M and protein Y have only the conserved two intron positions of mammalian type IV proteins. An additional intron is found in the same position in mammalian and F. rubripes NF-H genes. However, the genes encoding NF-L2 and gefiltin-like 1 have one or two additional introns situated at novel positions. Even more complex is the situation in gene X, which has eight introns: the two conserved intron positions of type IV genes, three novel intron positions and a further three positions that are characteristically found only in mammalian and F. rubripes type I-III genes. These include the first intron position of type II genes, the third intron position of type II genes (which is also present in type III genes) and the intron position corresponding to the end of the coil 1b domain, which is found in all type I-III genes. The documentation of a fish IF gene that combines type I-III intron positions with type IV intron positions (Fig. 3) is at first difficult to accommodate in a model assuming that the first type IV gene arose by translocation of an intronless mRNA into the genome (Lewis and Cowan, 1986). One possibility for the origin of this gene X that stays within this model is the speculation that it arose as a chimera of a keratin II and a type IV gene (Fig. 3).
Genes NF-L2, gefiltin-like 1 and X together provide a total of six new intron positions of IF genes that have no counterpart in human IF genes. The number of novel fish IF intron positions is increased to ten by the novel intron positions in two type I keratin genes, one type II gene and the desmin 2 gene (see above). If we consider vimentin 2 as a special case, there are ten intron gains in the F. rubripes IF genes analysed, but no unusual intron loss (except for vimentin 2).
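The classification of the gene X introns, and the tally of novel positions, can be written out with simple set operations. The Python sketch below is illustrative only. The intron labels are hypothetical placeholders, because the actual positions are shown in Fig. 3 and are not reproduced here. The combined count of three novel positions for NF-L2 and gefiltin-like 1 is derived from the stated total of six, not given separately in the text.

```python
# Hypothetical labels standing in for intron positions (see Fig. 3 of the
# paper for the real positions, which are not reproduced in this sketch).
TYPE_IV_CONSERVED = {"IV-a", "IV-b"}
TYPE_I_III_POSITIONS = {
    "type II intron 1",
    "type II intron 3 (also type III)",
    "end of coil 1b (all type I-III)",
}

# Gene X as described in the text: eight introns in three classes.
GENE_X = TYPE_IV_CONSERVED | TYPE_I_III_POSITIONS | {"novel-1", "novel-2", "novel-3"}


def classify(introns):
    """Split a set of intron labels into type IV, type I-III and novel."""
    return {
        "type IV": introns & TYPE_IV_CONSERVED,
        "type I-III": introns & TYPE_I_III_POSITIONS,
        "novel": introns - TYPE_IV_CONSERVED - TYPE_I_III_POSITIONS,
    }


gene_x_classes = classify(GENE_X)
class_sizes = {name: len(members) for name, members in gene_x_classes.items()}

# Tally of novel intron positions as stated in the text.
novel_in_gene_x = class_sizes["novel"]
novel_in_nfl2_and_gefiltin_like_1 = 3  # derived: 6 in total minus 3 in gene X
novel_in_type_iv_genes = novel_in_gene_x + novel_in_nfl2_and_gefiltin_like_1
novel_in_keratins_and_desmin_2 = 4     # two type I genes, one type II, desmin 2
novel_total = novel_in_type_iv_genes + novel_in_keratins_and_desmin_2

print(len(GENE_X), class_sizes)
print(novel_in_type_iv_genes, novel_total)
```

With these inputs gene X has eight introns split 2, 3 and 3 across the three classes, and the novel positions sum to six for the type IV genes and ten overall, in agreement with the numbers given above.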

F. rubripes has two A lamins
Previous studies have shown that fish have four nuclear lamins. Whereas lamins A, B1 and B2 are found in all classes of vertebrates, the additional lamin LIII is only detected in amphibia and fish (Döring and Stick, 1990; Yamaguchi et al., 2001; Hofemeister et al., 2002). Table 3 shows that the genomic F. rubripes sequences cover the complete genes for lamins A, B1 and LIII (with its two alternative last exons, which produce the isoforms LIIIa and LIIIb). The intron pattern of these three genes is perfectly conserved between fish and human. The lamin B2 gene bridges scaffolds 6482 and 7682. Exon 1 is located on scaffold 6482, where it is followed by a long intron sequence that overlaps extensively with the end of scaffold 7678. This scaffold also carries the middle part of the gene, but the 3′ end is probably obscured by the following large sequence gap. Interestingly, the F. rubripes B2 lamin gene has an additional intron inserted in the region encoding the coil 2a domain. Unexpectedly, a second lamin A is also indicated (Table 3). Lamin A2 starts in scaffold 2719 (exon 1 plus intron) and continues with scaffold 6631 (exon 2 to the end). Using the zebrafish lamin A as reference (Hofemeister et al., 2002), we find a sequence identity of 70% for both F. rubripes A lamins, which share only 63% identity with each other. This is the first report of the presence of two lamin A genes in a vertebrate genome. It raises the question of whether one of them belongs to those genes that contribute to the partial tetraploidy of F. rubripes (Aparicio et al., 2002). The comparatively low degree of sequence similarity would indicate an ancient duplication event.
We thank M. Osborn (Goettingen) and M. Hesse (Bonn) for helpful discussions.