EMBO reports (2001): Selenoprotein gene prediction in the Fly

In silico identification of novel selenoproteins in the Drosophila melanogaster genome

S. Castellano, N. Morozova, M. Morey, M.J. Berry, F. Serras, M. Corominas and R. Guigó*.

EMBO reports, 2(8):697-702 (2001)

*To whom correspondence should be addressed.

On this site we describe all the programs and data used to predict selenoproteins in the Drosophila melanogaster genome. An independent but similar study was reported by the Hatfield and Gladyshev labs (Martin-Romero et al., JBC, 276(32):29798-804, 2001).

Selenoproteins overview

The main points about selenoproteins are:

  1. They incorporate the amino acid selenocysteine (U or Sec), the 21st amino acid. It has its own tRNA, whose anticodon recognizes UGA (which we were taught was only a STOP codon!).
  2. So why do not all UGA codons code for Sec? Because the alternative decoding of UGA is conferred by an mRNA secondary structure termed SECIS. This structure, by means of one or more proteins, directs the ribosome to incorporate Sec.
  3. They are everywhere: Eukarya, Bacteria and Archaea. However, the SECIS element is located in the 3' UTR in eukaryotes and archaea, whereas in bacteria it lies in the coding region (just after the UGA). The SECIS elements of Eukarya, Bacteria and Archaea differ substantially.
  4. Standard gene prediction will, at best, yield truncated selenoprotein genes. Why not accept that UGA codes for Sec whenever there is a potential SECIS nearby? That is the work presented here.
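
The idea in point 4 can be illustrated with a minimal Python sketch. It is not the geneid implementation: the function name `translate` and the flag `secis_downstream` are hypothetical, and the sketch recodes every in-frame TGA once the flag is set, which a real predictor would not do blindly.

```python
# Standard genetic code, built from the usual TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds, secis_downstream=False):
    """Translate a DNA coding sequence, stopping at the first stop codon.

    If secis_downstream is True (hypothetical flag: a candidate SECIS was
    found in the 3' UTR), in-frame TGA is read as selenocysteine (U)
    instead of being treated as a stop.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        aa = CODON_TABLE.get(codon, "X")  # X for codons with ambiguous bases
        if codon == "TGA" and secis_downstream:
            aa = "U"
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Without the flag the product is truncated at the TGA; with it, translation continues to the next TAA or TAG.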

Genome Sequence

We used the 19 large genome scaffolds of the first release (March 24, 2000), described in Adams et al. (2000, Science, 287, 2185-2195). They can be obtained from Celera or NCBI.

SECIS

The SECIS structure, located in the 3' UTR, is the secondary/tertiary RNA structure that directs the recoding of the UGA codon. The PatScan SECIS pattern used in this work is the following:

r1={at,ta,gc,cg,tg,gt} p1=5...15 p2=1...7 a tga n p3=9...12 p4=0...2 aa p5=6...17 r1~p3[2,1,1] n gan p6=3...9 r1~p1[2,1,1]
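
As we read the pattern above, r1 defines the allowed base pairs (Watson-Crick plus G-T wobble), and units such as r1~p1[2,1,1] require a stretch that pairs with p1 under that rule, tolerating a few mismatches and indels. The Python sketch below illustrates only the pairing rule with a mismatch budget; it ignores insertions and deletions, is not PatScan, and the function names are hypothetical.

```python
# Pairing rule r1 from the pattern: Watson-Crick pairs plus G-T wobble.
R1 = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("T", "G"), ("G", "T")}


def count_mispairs(left, right, rule=R1):
    """Count positions where two helix strands fail to pair under `rule`.

    Both strands are written 5'->3', so left[i] is paired with right[-1-i].
    """
    if len(left) != len(right):
        raise ValueError("this sketch handles equal-length strands only")
    return sum(
        (a, b) not in rule
        for a, b in zip(left.upper(), reversed(right.upper()))
    )


def can_pair(left, right, max_mismatches=2):
    """True if the strands pair with at most `max_mismatches` mispairs."""
    return count_mispairs(left, right) <= max_mismatches
```

For example, `can_pair("GGGT", "GCTC")` holds because the two G-T pairs are accepted by r1, whereas a strict Watson-Crick rule would reject them.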

The search along the D. melanogaster genome yielded 35876 potential SECIS, which were then assessed thermodynamically using the RNAfold program (Vienna RNA package).

The 35876 potential SECIS are available here (one file per scaffold). Of these, 1220 were considered stable enough and given as input to geneid; they can be downloaded here (one file per scaffold, formatted in GFF).
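
The stability filter can be sketched in Python as below, assuming RNAfold output in its usual three-line form per FASTA record (header, sequence, then dot-bracket structure followed by the free energy in parentheses). The energy cutoff actually used is not stated on this page, so `max_energy` is a hypothetical parameter and the function names are ours.

```python
import re

# Matches the trailing free energy, e.g. "(((...))) ( -1.20)" -> -1.20
ENERGY_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\)\s*$")


def parse_rnafold(text):
    """Parse RNAfold output (header / sequence / structure+energy triplets).

    Returns a list of (name, sequence, structure, energy) tuples.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines) - 2, 3):
        name = lines[i].lstrip(">")
        seq = lines[i + 1]
        match = ENERGY_RE.search(lines[i + 2])
        if match is None:
            continue  # skip records without a parsable energy
        structure = lines[i + 2][:match.start()].strip()
        records.append((name, seq, structure, float(match.group(1))))
    return records


def stable_hits(records, max_energy):
    """Keep records whose free energy (kcal/mol) is at or below max_energy.

    max_energy is a placeholder; the threshold used in the paper is not
    given in this document.
    """
    return [rec for rec in records if rec[3] <= max_energy]
```

The surviving hits would then be written out in GFF, one file per scaffold, for geneid to read.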

Check the PatScan program for a detailed explanation of the pattern.

geneid

For a general introduction, please browse the geneid page. The modified geneid version able to predict selenoproteins can be found just below (source code in ANSI C and parameter file):




The parameter file is an external flat file read by geneid at run time. Take a look at it! It carries the statistical information used to predict genes in a given organism, together with the gene model (which states the relationships among the exons predicted along a sequence). Please read the geneid handbook for details.


Coding Potential

Coding potential can be understood as a measure of how likely a given nucleotide sequence is to code for a protein. Many coding statistics have been proposed; geneid uses hexamer frequencies, although in a rather more elaborate way.
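
A simple hexamer-based statistic can be sketched in Python as a log-likelihood ratio between coding and non-coding hexamer frequencies. This is a generic illustration rather than geneid's actual model, and the function names, the pseudocount and the training sets are assumptions.

```python
import math
from collections import Counter


def hexamer_counts(seqs, step):
    """Count hexamers in each sequence, sampling every `step` nucleotides."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def make_scorer(coding, noncoding, pseudocount=1.0):
    """Return a function giving the log-odds (coding vs non-coding) of a hexamer.

    Coding sequences are assumed to start in frame, so their hexamers are
    counted at codon boundaries (step 3); non-coding ones at every position.
    """
    cod = hexamer_counts(coding, 3)
    non = hexamer_counts(noncoding, 1)
    cod_total = sum(cod.values()) + pseudocount * 4096  # 4**6 hexamers
    non_total = sum(non.values()) + pseudocount * 4096

    def score(hexamer):
        p_cod = (cod[hexamer] + pseudocount) / cod_total
        p_non = (non[hexamer] + pseudocount) / non_total
        return math.log(p_cod / p_non)

    return score


def coding_score(seq, score, frame=0):
    """Sum hexamer log-odds along `seq` in the given reading frame."""
    seq = seq.upper()
    return sum(score(seq[i:i + 6]) for i in range(frame, len(seq) - 5, 3))
```

A positive total suggests that the sequence resembles the coding training set more than the non-coding one in that frame.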

The datasets used to estimate and test the coding potential in human mRNAs are given below. Each sequence was split into three regions (each at least 30 nt long): 1) coding region; 2) TGA-STOP or TAA/TAG-STOP region; and 3) STOP-STOP region.


Novel Selenoproteins

Three real selenoproteins were found: dSPS2, dSelK (previously named dSelG) and dSelH (previously named dSelM). The dSPS2 protein was recently reported in Hirosawa-Takamori et al. (2000, EMBO Rep., 1, 441-446).

The coding sequence (CDS) annotation of each gene is given with respect to the original scaffolds and in the original Celera style. Note that the incorrect Celera annotation for the same gene is also shown. The protein sequences are given below in FASTA format; U stands for selenocysteine (Sec). Each SECIS sequence is shown divided into its structural units.


dSPS2: gene, protein and SECIS
dSelK (dSelG): gene, protein and SECIS
dSelH (dSelM): gene, protein and SECIS