Package Bio.UniGene
Parse UniGene flat-file format files, such as the Hs.data file.
Here is an overview of the flat file format that this parser deals with:
Line types/qualifiers:
ID UniGene cluster ID
TITLE Title for the cluster
GENE Gene symbol
CYTOBAND Cytological band
EXPRESS Tissues of origin for ESTs in cluster
RESTR_EXPR Single tissue or developmental stage that contributes
    more than half the total EST frequency for this gene.
GNM_TERMINUS Genomic confirmation of presence of a 3' terminus:
    T if a non-templated polyA tail is found among a cluster's sequences; else
    I if templated As are found in genomic sequence; or
    S if a canonical polyA signal is found on the genomic sequence
GENE_ID Entrez Gene identifier associated with at least one sequence in this cluster;
    to be used instead of LocusLink.
LOCUSLINK LocusLink identifier associated with at least one sequence in this cluster;
    deprecated in favor of GENE_ID.
CHROMOSOME Chromosome. For plants, CHROMOSOME refers to mapping on the Arabidopsis genome.
STS Sequence-tagged site (STS)
    NAME= Name of STS
    ACC= GenBank/EMBL/DDBJ accession number of STS [optional field]
    DSEG= GDB Dsegment number [optional field]
    UNISTS= Identifier in NCBI's UniSTS database
TXMAP Transcript map interval
    MARKER= Marker found on at least one sequence in this cluster
    RHPANEL= Radiation Hybrid panel used to place marker
PROTSIM Protein similarity data for the sequence with the highest-scoring protein similarity in this cluster
    ORG= Organism
    PROTGI= Sequence GI of protein
    PROTID= Sequence ID of protein
    PCT= Percent alignment
    ALN= Length of aligned region (aa)
SCOUNT Number of sequences in the cluster
SEQUENCE Sequence
    ACC= GenBank/EMBL/DDBJ accession number of sequence
    NID= Unique nucleotide sequence identifier (gi)
    PID= Unique protein sequence identifier (used for non-ESTs)
    CLONE= Clone identifier (used for ESTs only)
    END= End (5'/3') of clone insert read (used for ESTs only)
    LID= Library ID; see Hs.lib.info for library name and tissue
    MGC= 5' CDS-completeness indicator; if present,
        the clone associated with this sequence
        is believed CDS-complete. A value greater than 511
        is the gi of the CDS-complete mRNA matched by the EST;
        otherwise the value is an indicator of the reliability
        of the test indicating CDS completeness;
        higher values indicate more reliable CDS-completeness predictions.
    SEQTYPE= Description of the nucleotide sequence. Possible values are
        mRNA, EST and HTC.
    TRACE= The Trace ID of the EST sequence, as provided by NCBI Trace Archive
    PERIPHERAL= Indicator that the sequence is a suboptimal
        representative of the gene represented by this cluster.
        Peripheral sequences are those that are in a cluster
        which represents a spliced gene without sharing a
        splice junction with any other sequence. In many
        cases, they are unspliced transcripts originating
        from the gene.
// End of record
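
The layout described above can be read with a few lines of plain Python. The following is a minimal sketch, not the parser provided by this package: the names parse_records, parse_subfields and MULTI_FIELD_TAGS are hypothetical, the sample record is invented for illustration, and the code assumes that each line starts with its line type followed by whitespace and that KEY=value sub-field values contain neither whitespace nor semicolons.

```python
import re

# Line types that carry KEY=value sub-fields and may repeat within a record.
# TXMAP is deliberately left out: this sketch keeps it as raw text.
MULTI_FIELD_TAGS = {"STS", "PROTSIM", "SEQUENCE"}


def parse_subfields(text):
    """Turn 'ACC=X; NID=Y' (or space-separated pairs) into a dict."""
    # Assumption: values contain neither whitespace nor semicolons.
    return dict(re.findall(r"(\w+)=([^;\s]+)", text))


def parse_records(lines):
    """Yield one dict per record; a line starting with '//' ends a record."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            yield record
            record = {}
            continue
        # Split the line type from the rest of the line on the first
        # run of whitespace.
        parts = line.split(None, 1)
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag in MULTI_FIELD_TAGS:
            record.setdefault(tag, []).append(parse_subfields(value))
        else:
            # Single-valued line types; a repeated tag overwrites the
            # earlier value in this sketch.
            record[tag] = value


# Hypothetical record, made up for illustration (not taken from Hs.data).
SAMPLE = """\
ID          Xx.1
TITLE       Example cluster
GENE        EXMPL
SCOUNT      2
SEQUENCE    ACC=AA000001.1; NID=g1000001; SEQTYPE=EST
SEQUENCE    ACC=AA000002.1; NID=g1000002; PID=g1000003; SEQTYPE=mRNA
//
"""

records = list(parse_records(SAMPLE.splitlines()))
print(records[0]["ID"], len(records[0]["SEQUENCE"]))  # prints "Xx.1 2"
```

Storing the repeating line types as lists of dicts keeps every SEQUENCE entry of a cluster available, while single-valued line types such as ID and TITLE stay plain strings.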