Assignment I : Gene Structure Prediction Tool using Neural Network

 

Dataset : dataset from UCSC

 3 datasets:

Only single exon genes:
          single_exon_GB.dat.gz
          single_exon_GB.sets
Multiple exon genes:
       multi_exon_GB.dat.gz
          multi_exon_GB.sets
Combined -- single and multiple exon genes:
          combined_GB.dat.gz
          combined_exon_GB.sets

 

Each dataset (*.dat.gz) consists of entries in GenBank flat-file format.

GenBank entries are separated by "//".
Accompanying each dataset is a ".sets" file listing 7 test/train subsets.  These subsets can be used for cross-validation.
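
As a minimal sketch of how the "//" separator can be used, the following Python splits a GenBank flat file into individual entries. The function names are illustrative only, the second record (TOY0001) in the demo is made up, and the sketch assumes a separator line always begins with "//" (possibly followed by a comment, as in the sample entry in this handout).

```python
import gzip
import io


def iter_genbank_entries(handle):
    """Yield each entry as a list of non-empty lines, splitting on '//' lines."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # A separator closes the current entry (if any lines were collected).
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:
        yield entry


def read_entries(path):
    """Read all entries from a gzip-compressed flat file such as multi_exon_GB.dat.gz."""
    with gzip.open(path, "rt") as handle:
        return list(iter_genbank_entries(handle))


# Small in-memory demo; TOY0001 is a hypothetical record used only for illustration.
SAMPLE = """//    #entry separator
LOCUS       HSAPOA2      3360 bp    DNA             PRI       16-FEB-1995
DEFINITION  Human gene for apolipoprotein AII.
ORIGIN
        1 cccgggaggt ggaggttgca
//
LOCUS       TOY0001      10 bp    DNA
ORIGIN
        1 acgtacgtac
//
"""

entries = list(iter_genbank_entries(io.StringIO(SAMPLE)))
for lines in entries:
    print(lines[0].split()[1], len(lines), "lines")
```

For the real data, `read_entries("multi_exon_GB.dat.gz")` would return one list of lines per entry, assuming the file is in the working directory.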

 

 * e.g. An entry from multi_exon_GB.dat

//    #entry separator

LOCUS       HSAPOA2      3360 bp    DNA             PRI       16-FEB-1995

DEFINITION  Human gene for apolipoprotein AII.

ACCESSION   X04898

KEYWORDS    apolipoprotein; apolipoprotein A-II; signal peptide.

SOURCE      human.

ORGANISM  Homo sapiens

            Eukaryotae; mitochondrial eukaryotes; Metazoa/Eumycota group;

            Metazoa; Eumetazoa; Bilateria; Coelomata; Deuterostomia; Chordata;

            Vertebrata; Gnathostomata; Osteichthyes; Sarcopterygii; Choanata;

            Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Archonta; Primates;

            Catarrhini; Hominidae; Homo.

REFERENCE   1  (bases 1 to 3360)

AUTHORS   Shelley,C.S., Sharpe,C.R., Baralle,F.E. and Shoulders,C.C.

TITLE     Comparison of the human apolipoprotein genes. Apo AII presents a

            unique functional intron-exon junction

JOURNAL   J. Mol. Biol. 186 (1), 43-51 (1985)

MEDLINE   86089113

REFERENCE   2  (bases 1 to 3360)

AUTHORS   Knott,T.J., Wallis,S.C., Robertson,M.E., Priestley,L.M., Urdea,M., Rall,L.B. and Scott,J.

TITLE     The human apolipoprotein AII gene: structural organization and sites of expression

JOURNAL   Nucleic Acids Res. 13 (17), 6387-6398 (1985)

MEDLINE   86016095

COMMENT     An Alu repetitive element is located around the polymorphic MspI site at pos. 3033.

               NCBI gi: 28743

FEATURES             Location/Qualifiers

     source          1..3360

                     /organism="Homo sapiens"

     misc_feature    complement(59..64)

                     /note="seq. pot. involved in steroid hormone/receptor

                     binding"

     misc_feature    complement(632..637)

                     /note="seq. pot. involved in steroid hormone/receptor

                     binding"

     CAAT_signal     1100..1108   # promoter signal

     TATA_signal     1148..1153   # promoter signal

     prim_transcript 1174..2507

     exon            1174..1210  # exon 1 (base positions in the sequence)

                     /number=1

     mRNA            join(1174..1210,1380..1455,1749..1881,2277..2507) # mature mRNA (base positions in the sequence)

                     /gene="apo AII"

     misc_feature    1182

                     /note="5' end of cDNA"

     intron          1211..1379  # intron 1 (base positions in the sequence)

                     /number=1

     exon            1380..1455

                     /number=2

     CDS             join(1404..1455,1749..1881,2277..2394) # protein-coding sequence (base positions in the sequence)

                     /gene="apo AII"

                     /note="NCBI gi: 671882"

                     /codon_start=1

                     /product="apolipoprotein"

                     /translation="MKLLAATVLLLTICSLEGALVRRQAKEPCVESLVSQYFQTVTDY

                     GKDLMEKVKSPELQAEAKSYFEKSKEQLTPLIKKAGTELVNFLSYFVELGTHPATQ"

     sig_peptide     1404..1455

                     /gene="apo AII"

     intron          1456..1748

                     /number=2

     repeat_region   1711..1742

                     /note="(GT) 16, pot. z-DNA sequence"

     misc_feature    1711..1748

                     /note="functional acceptor site variant sequence"

     exon            1749..1881

                     /gene="apo AII"

                     /number=3

     misc_feature    1749..1765

                     /gene="apo AII"

                     /note="pro-peptide"

     intron          1882..2276

                     /number=3

     exon            2277..2507

                     /number=4

     polyA_signal    2487..2492  # polyadenylation signal

     polyA_site      2507

BASE COUNT      904 a    849 c    832 g    775 t

ORIGIN     #DNA sequence

        1 cccgggaggt ggaggttgca gtgagccgag atcatgccat tacgctccag cctgagcaac

       61 aagagcaaaa ctctgtctca ggaaaacaaa caaaaaaacc tgcacatata cttctgaatt

      121 taaaacaaaa gttaaaaaac aaagatttct tggtctctgg tcactacctc cctcatcagc

      181 tttgcgcctc cactgtcacc ctcaggaatg ttccacatac tcagcgagta tgcttggggg

                                       ~

     3121 aaaaaaaaaa aaaaaagaaa gtaaagaaaa aaagaaaatg agggtacccc tcataatttc

     3181 ctgttagtca ttctatgaag aaaagaaagc ttcccaaggt gtcacccgtg gccctccttt

     3241 cccttctgag ccaggggaac actgtgtttc cccctttccc acaataaaag acttgagttt

     3301 gctcctctcc ctagaagtgc tctaatttct ccatttaaaa cctcttatct agaccaggca
//
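
The exon/intron structure in the entry above is encoded in the `join(...)` location strings. The sketch below shows one way to turn such a string into intervals and to derive the introns between consecutive exons. It is deliberately simplified: it handles only plain `start..end` ranges and ignores `complement(...)`, single-base positions, and partial markers, which a full parser would need to cover. The function names are illustrative.

```python
import re


def parse_location(location):
    """Return (start, end) pairs (1-based, inclusive) from a simple location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) pairs."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


# Location strings copied from the HSAPOA2 entry above.
mrna = parse_location("join(1174..1210,1380..1455,1749..1881,2277..2507)")
cds = parse_location("join(1404..1455,1749..1881,2277..2394)")

# Total coding length; a complete CDS should be a multiple of 3.
cds_length = sum(end - start + 1 for start, end in cds)

print(introns_between(mrna))
print(cds_length, cds_length % 3)
```

The introns derived from the mRNA line agree with the three `intron` features annotated in the entry, and the CDS length of 303 bases corresponds to the 100 residues of the `/translation` plus a stop codon.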
 
 
 
 
 
 
 

Dataset : Genesafe

3 datasets :

 * chromosome 1 (AGSR_20-06-99.Chr_1.tar.gz)

 * chromosome 13 (AGSR_20-06-99.Chr_13.tar.gz)

 * chromosome X (AGSR_20-06-99.Chr_X.tar.gz)

Files :

 * Lists of characterized sequences (as '*.list' text files)

 * The raw DNA sequences (as fasta *.seq files)

 * 'Gene Feature Finding' (GFF) format files (as *.gff files) containing 'true gene' (and possibly 'contig' level) annotations.

 * Some extra sequence details such as type of clone, adjacent clones, etc. (in an *.ace file)

 * GFF format help page

e.g. 1026E2.gff

##gff-version 2

##date 1999-6-20

##sequence-region 1026E2 1 100418

 

seqname  source      feature   start  end    score  strand  frame  attribute              comments
1026E2   Pseudogene  exon      -1873  100    .      -       .      Sequence "35C21.erv1"
1026E2   Pseudogene  CDS       -1873  100    .      -       0      Sequence "35C21.erv1"
1026E2   Pseudogene  sequence  -1873  100    .      -       .      Sequence "35C21.erv1"
1026E2   GD_mRNA     sequence  40911  93451  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      40911  42921  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       41808  42921  .      -       1      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       44548  44770  .      -       2      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      44548  44770  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       46899  47135  .      -       2      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      46899  47135  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      49740  49845  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       49740  49845  .      -       0      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      65949  66157  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       65949  66157  .      -       2      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       67224  67414  .      -       1      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      67224  67414  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     exon      93188  93451  .      -       .      Sequence "1026E2.1"
1026E2   GD_mRNA     CDS       93188  93450  .      -       0      Sequence "1026E2.1"

 

<seqname> name of the sequence

<feature> feature type, e.g. exon or CDS (protein-coding sequence)

<strand> transcription direction ("-" means the opposite direction, i.e. the feature lies on the complementary strand)

<frame> translational frame
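
The field layout above can be read with a few lines of Python. This is a sketch under assumptions: it splits on any whitespace (as the listing in this handout is laid out; real GFF files are normally tab-delimited), skips "##" header lines and the column-title row, and keeps score and frame as strings because they may be ".". The names `GffFeature`, `parse_gff_line`, `reverse_complement` and `feature_sequence` are illustrative, not part of any library.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attribute: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one GFF line; return None for comments, blanks and non-data rows."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split(None, 8)
    if len(fields) < 8:
        return None
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return None  # e.g. the column-title row
    attribute = fields[8].strip() if len(fields) > 8 else ""
    return GffFeature(fields[0], fields[1], fields[2], start, end,
                      fields[5], fields[6], fields[7], attribute)


COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def feature_sequence(seq: str, feat: GffFeature) -> str:
    """Extract a feature (1-based, inclusive); '-' strand features are reverse-complemented."""
    sub = seq[feat.start - 1:feat.end]
    return reverse_complement(sub) if feat.strand == "-" else sub


# Lines copied from the listing above.
LINES = [
    "##gff-version 2",
    '1026E2   GD_mRNA   exon   40911  42921  .   -   .   Sequence  "1026E2.1"',
    '1026E2   GD_mRNA   CDS    41808  42921  .   -   1   Sequence  "1026E2.1"',
]
features = [f for f in map(parse_gff_line, LINES) if f is not None]
for f in features:
    print(f.feature, f.start, f.end, f.strand, f.frame)
```

Given the raw sequence from the matching *.seq file as a string, `feature_sequence(seq, f)` would return the bases of a feature as read in the direction of transcription.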

 

 

 

 

 

Dataset : SpliceDB, a database of mammalian splice sites

Set of splice junction pairs

Sequences of donor and acceptor

ID @@ ACCES @@ INTRON @@ DON @@ ACC @@ SEQ_DON @@ SEQ_ACC @@ EST @@ EST_ACCES @@ CORR

ID (Database identifier): Always a single word: a unique identifier assigned to each pair, formed from the Infogene entry name, the assigned intron number, the donor position in the original sequence and the corresponding acceptor position, all joined using the "##" symbol (e.g. HG_0000731##114##122615##122965)

ACCES (Accession number): Always a single word: the accession number of the original GenBank entry (e.g. AB011399)

INTRON (Assigned intron number): Always a single word: the intron number assigned to each intron pair in the Infogene database (e.g. 114)

DON (Donor number): Always a single word: the donor position in the original Infogene entry (e.g. 122615)

ACC (Acceptor number): Always a single word: the acceptor position in the original Infogene entry (e.g. 122965)

SEQ_DON (Nucleotide sequence around donor): Always a single word: the nucleotide sequence centered on the characteristic donor dinucleotide, with 40 bp on each side, for a total of 82 bp (e.g. aacatctgtctctactggaaacctctgcactgaggagcagattgattgataagcaaaaggcttctactgcatttccatcctt)

SEQ_ACC (Nucleotide sequence around acceptor): Always a single word: the nucleotide sequence centered on the characteristic acceptor dinucleotide, with 40 bp on each side, for a total of 82 bp (e.g. aaaaagctcactttttttgttcttcacattttacaggagcagacgcctccgcctagacctgaagcctaccccatccccactc)

EST (EST classification): Always a single word: the EST classification obtained (see Material and Methods in the original paper for details) (e.g. B20)

EST_ACCES (EST accession number supporting classification): Always a single word: the accession number of the EST used to support the classification (e.g. gb|N35650|N35650)

CORR (Possible corrections): Optional, free text. All possible corrections after EST support are annotated in this field, based on ESTs or on HTG:

 * automatic EST correction in positions pos1 pos2 using ESTaccession: records which positions (pos1 and pos2) are ambiguous with respect to the annotated and supported junctions, and the accession number of the EST that supports the alternative junction (ESTaccession)

 * HTG text: records information about the HTG comparison for this entry (for more details see Results in the original paper)
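
A SpliceDB record laid out as above can be split into named fields as follows. This is a sketch that assumes one record per line with fields separated by "@@"; the sample record is assembled from the example values quoted in the field descriptions, and the function names are illustrative. With 40 bp on each side of the dinucleotide, the two central bases sit at 0-based offsets 40 and 41 of the 82 bp window.

```python
FIELDS = ["ID", "ACCES", "INTRON", "DON", "ACC",
          "SEQ_DON", "SEQ_ACC", "EST", "EST_ACCES", "CORR"]


def parse_splicedb_record(line):
    """Split one '@@'-delimited record into a dict keyed by field name."""
    values = [value.strip() for value in line.strip().split("@@")]
    record = dict(zip(FIELDS, values))
    record.setdefault("CORR", "")  # CORR is optional
    for key in ("INTRON", "DON", "ACC"):
        record[key] = int(record[key])
    return record


def central_dinucleotide(window, flank=40):
    """Return the two bases in the middle of a splice-site window."""
    return window[flank:flank + 2]


# Record assembled from the example values given in the field descriptions.
SAMPLE = " @@ ".join([
    "HG_0000731##114##122615##122965",
    "AB011399",
    "114",
    "122615",
    "122965",
    "aacatctgtctctactggaaacctctgcactgaggagcagattgattgataagcaaaaggcttctactgcatttccatcctt",
    "aaaaagctcactttttttgttcttcacattttacaggagcagacgcctccgcctagacctgaagcctaccccatccccactc",
    "B20",
    "gb|N35650|N35650",
])

record = parse_splicedb_record(SAMPLE)
print(record["ID"], record["ACC"] - record["DON"])
print(len(record["SEQ_DON"]), central_dinucleotide(record["SEQ_DON"]))
print(len(record["SEQ_ACC"]), central_dinucleotide(record["SEQ_ACC"]))
```

Extracting the central dinucleotide this way makes it easy to separate canonical from non-canonical junctions before building a training set for a splice-site classifier.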

 

 

 

 

 

 

 

 

 

 

 

 

Required 

1.  Distinguish exons and introns in unannotated sequence.

2.  Predict the protein-coding region (predictions must not be out of frame)

3.  Examine both strands

4.  Use the UCSC dataset (for extra points, use Genesafe and SpliceDB too)

5.  Evaluate the tool's performance

    (A test sequence set will be given.)
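
For requirement 5, one possible way to score a prediction is at the nucleotide level, comparing predicted coding bases with annotated ones. The sketch below is only one option and is not a prescribed scoring scheme for this assignment: it reports sensitivity, TP/(TP+FN), and precision, TP/(TP+FP), the latter often being called "specificity" in the gene-finding literature. The annotated intervals are the CDS of the HSAPOA2 entry; the predicted intervals are hypothetical.

```python
def coding_mask(length, intervals):
    """Return a list of booleans marking bases covered by 1-based inclusive intervals."""
    mask = [False] * length
    for start, end in intervals:
        for pos in range(start - 1, end):
            mask[pos] = True
    return mask


def nucleotide_accuracy(true_intervals, predicted_intervals, length):
    """Return (sensitivity, precision) computed base by base."""
    truth = coding_mask(length, true_intervals)
    pred = coding_mask(length, predicted_intervals)
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if p and not t)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


# Annotated CDS of HSAPOA2 (3360 bp) versus a hypothetical prediction
# whose last exon runs on to the end of the transcript.
annotated = [(1404, 1455), (1749, 1881), (2277, 2394)]
predicted = [(1404, 1455), (1749, 1881), (2277, 2507)]

sensitivity, precision = nucleotide_accuracy(annotated, predicted, 3360)
print(round(sensitivity, 3), round(precision, 3))
```

In this toy case every annotated coding base is found, so sensitivity is 1.0, while the 113 extra predicted bases lower the precision to 303/416. Exon-level measures (exact boundary matches) are a stricter alternative.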

 

For extra points

1.  Remove similar sequences from the dataset to reduce redundancy or bias

2.  Use more datasets from GenBank or other databases

3.  Predict genes in sequences containing multiple genes or partial genes

4.  Automatically annotate the predicted genes by similarity searching, e.g. with BLAST

5.  Apply the tool to other organisms (plant, mouse, yeast)

 

 

 

 

Gene prediction tools list page: http://www.ncgr.org/mew/gene_prediction.html

                                         http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-geneid.html

Gene recognition reference list page: http://linkage.rockefeller.edu/wli/gene/

 

Ref: Prediction tool using neural network (GRAIL)

 E.C. Uberbacher and R.J. Mural, Locating Protein-Coding Regions in Human DNA Sequences by a Multiple Sensor-Neural Network Approach. PNAS 88, 11261-11265 (1991)