Package Bio.UniGene

Parse UniGene flat files, such as the Hs.data file.

Here is an overview of the flat file format that this parser deals with:
   Line types/qualifiers:

       ID           UniGene cluster ID
       TITLE        Title for the cluster
       GENE         Gene symbol
       CYTOBAND     Cytological band
       EXPRESS      Tissues of origin for ESTs in cluster
       RESTR_EXPR   Single tissue or developmental stage contributes
                    more than half the total EST frequency for this gene.
       GNM_TERMINUS genomic confirmation of presence of a 3' terminus; 
                    T if a non-templated polyA tail is found among 
                      a cluster's sequences; else
                    I if templated As are found in genomic sequence or
                    S if a canonical polyA signal is found on 
                      the genomic sequence
       GENE_ID      Entrez gene identifier associated with at least one sequence in this cluster; 
                    to be used instead of LocusLink.  
       LOCUSLINK    LocusLink identifier associated with at least one sequence in this cluster;  
                    deprecated in favor of GENE_ID
       CHROMOSOME   Chromosome.  For plants, CHROMOSOME refers to mapping on the Arabidopsis genome.
       STS          STS
            NAME=        Name of STS
            ACC=         GenBank/EMBL/DDBJ accession number of STS [optional field]
            DSEG=        GDB Dsegment number [optional field]
            UNISTS=      identifier in NCBI's UniSTS database
       TXMAP        Transcript map interval
            MARKER=      Marker found on at least one sequence in this cluster
            RHPANEL=     Radiation Hybrid panel used to place marker
       PROTSIM      Protein Similarity data for the sequence with highest-scoring protein similarity in this cluster
            ORG=         Organism
            PROTGI=      Sequence GI of protein
            PROTID=      Sequence ID of protein
            PCT=         Percent alignment
            ALN=         length of aligned region (aa)
       SCOUNT       Number of sequences in the cluster
       SEQUENCE     Sequence
            ACC=         GenBank/EMBL/DDBJ accession number of sequence
            NID=         Unique nucleotide sequence identifier (gi)
            PID=         Unique protein sequence identifier (used for non-ESTs)
            CLONE=       Clone identifier (used for ESTs only)
            END=         End (5'/3') of clone insert read (used for ESTs only) 
            LID=         Library ID; see Hs.lib.info for library name and tissue        
            MGC=         5' CDS-completeness indicator; if present, 
                         the clone associated with this sequence  
                         is believed CDS-complete. A value greater than 511
                         is the gi of the CDS-complete mRNA matched by the EST,
                         otherwise the value is an indicator of the reliability
                         of the test indicating CDS completeness;
                         higher values indicate more reliable CDS-completeness predictions. 
            SEQTYPE=     Description of the nucleotide sequence. Possible values are
                         mRNA, EST and HTC.
            TRACE=       The Trace ID of the EST sequence, as provided by NCBI Trace Archive
            PERIPHERAL=  Indicator that the sequence is a suboptimal
                         representative of the gene represented by this cluster.
                         Peripheral sequences are those that are in a cluster
                         which represents a spliced gene without sharing a
                         splice junction with any other sequence.  In many
                         cases, they are unspliced transcripts originating
                         from the gene.

       //           End of record
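
The layout above can be illustrated with a minimal, standalone sketch. It is
not the implementation used by this package: the function name
iter_unigene_records is hypothetical, the sample text is invented for
illustration, and the sketch assumes that each line starts with its tag,
separated from the value by whitespace, and that a line containing only //
ends a record.

```python
import io


def iter_unigene_records(handle):
    """Yield one dict per record, mapping each tag to a list of its values.

    Hypothetical helper for illustration; not part of Bio.UniGene.
    """
    record = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line == "//":
            # End-of-record marker: emit what has been collected so far.
            if record:
                yield record
            record = {}
            continue
        # The tag is the first whitespace-delimited token; the rest is its value.
        parts = line.split(None, 1)
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        # Tags such as SEQUENCE or STS can repeat, so values are kept in a list.
        record.setdefault(tag, []).append(value)
    if record:
        yield record  # tolerate a final record with no terminator


# Invented sample data; identifiers and accessions are placeholders.
sample = (
    "ID          Hs.1\n"
    "TITLE       Example cluster\n"
    "SCOUNT      2\n"
    "SEQUENCE    ACC=AA000001.1; NID=g1; SEQTYPE=mRNA\n"
    "SEQUENCE    ACC=AA000002.1; NID=g2; CLONE=EX:2; END=5'; SEQTYPE=EST\n"
    "//\n"
    "ID          Hs.2\n"
    "TITLE       Another example cluster\n"
    "//\n"
)

records = list(iter_unigene_records(io.StringIO(sample)))
for rec in records:
    print(rec["ID"][0], len(rec.get("SEQUENCE", [])))
```

Keeping a list per tag is a deliberate choice here: it preserves every
SEQUENCE, STS and PROTSIM line in file order without special-casing the
tags that occur only once.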

Classes
Iterator  
RecordParser  
UnigeneProtsimRecord Store the information for one PROTSIM line from a Unigene file
UnigeneRecord Store a Unigene record. Stored attributes include self.ID (the ID line) and self.species (Hs, Bt, etc.).
UnigeneSequenceRecord Store the information for one SEQUENCE line from a Unigene file. Initialize with the text part of the SEQUENCE line, or nothing.
UnigeneSTSRecord Store the information for one STS line from a Unigene file
_RecordConsumer  
_Scanner Scans a Unigene Flat File Format file
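
The idea behind the per-line record classes listed above (initialize with
the text part of a line, or with nothing) can be sketched as follows. This
is an illustration only and does not reproduce the classes in this package:
SequenceLine, its fields attribute, mgc_is_gi and species_from_id are
hypothetical names, and the sketch assumes that qualifiers are written as
KEY=value pairs separated by semicolons and that MGC holds an integer.

```python
class SequenceLine:
    """Hypothetical stand-in for a record holding one SEQUENCE line."""

    def __init__(self, text=None):
        # With no text, the object starts empty and can be filled in later.
        self.fields = {}
        if text is not None:
            self._parse(text)

    def _parse(self, text):
        # Assumed layout: "ACC=...; NID=...; SEQTYPE=..."
        for item in text.split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition("=")
            self.fields[key.strip()] = value.strip()

    def mgc_is_gi(self):
        """Apply the MGC rule from the format notes.

        Returns None if MGC is absent, True if the value is greater than
        511 (the gi of the matched CDS-complete mRNA), and False if the
        value is a reliability indicator instead.
        """
        value = self.fields.get("MGC")
        if value is None:
            return None
        return int(value) > 511


def species_from_id(cluster_id):
    """Return the species prefix of a cluster ID, e.g. 'Hs' from 'Hs.1'."""
    return cluster_id.split(".", 1)[0]


# Invented example line; the accession and identifiers are placeholders.
seq = SequenceLine("ACC=AA000002.1; NID=g2; CLONE=EX:2; END=5'; MGC=600; SEQTYPE=EST")
print(seq.fields["ACC"], seq.fields["SEQTYPE"], seq.mgc_is_gi())
print(species_from_id("Hs.1"))
```

Storing the qualifiers in a dictionary keeps the sketch short; a class with
one attribute per qualifier (ACC, NID, PID and so on) would be an equally
reasonable design.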

Generated by Epydoc 2.1 on Mon Aug 27 16:13:10 2007 http://epydoc.sf.net