CleanEx June 2008
The current release is based on the Unigene database as available on June 11, 2008.
CONTENTS
- INTRODUCTION
- ENTRY SELECTION
- FORMAT CONVENTIONS
  - The title line
  - CleanEx entries
    - The ID line
    - The DE line
    - The ON line
    - The RNA line
    - The DR lines
    - The EXP line
    - The // line
  - CleanEx_trg entries
    - The ID line
    - The OA line
    - The GN line
    - The GC line
    - The QU line
    - The SR line
    - The FM line
    - The FN line
    - The UG line
    - The F1-F25 lines
    - The DR line
- APPENDIX A: SURVEY OF CleanEx LAST RELEASE
- APPENDIX B: EXPRESSION DISPLAY
1 INTRODUCTION
CleanEx is an expression reference database. Its goal is to link the information found in established gene and sequence databases with publicly available expression data.
Entries are cross-linked with pages allowing users to view expression data locally as well as in the original published format.
2 ENTRY SELECTION
So far, CleanEx contains only human genes for which the symbol is approved by the HUGO nomenclature committee.
There is one entry per gene name.
3 FORMAT CONVENTIONS
3.1 The title line
The title line of CleanEx is shown below:
TI CleanEx EXPRESSION DATABASE
3.2 CleanEx entries
A CleanEx entry contains the following information:
- Gene name, description and localisation
- Corresponding RNA sequences found in EMBL
- Cross-references to other databases
- Cross-references to available expression data
CleanEx entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries. Each line starts with a line code identifying the type of information presented. The current line types and line codes, in the order in which they appear in an entry, are shown below:
ID - IDentification.
DE - DEscription.
ON - Old gene Name.
RNA - RNA sequence in EMBL.
DR - Databases crosslinks.
EXP - EXPression cross-references.
// - Termination line.
Spacer lines (XX) are inserted to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72.
Below is an example of an entry:
ID HS_FN1 2q34.
XX
DE fibronectin 1.
ON none.
XX
RNA EMBL; AF130095.1; AF130095.
RNA EMBL; AF312399.1; AF312399.
RNA EMBL; AJ276395.1; HSA276395.
RNA EMBL; AJ320525.1; HSA320525.
RNA EMBL; AJ320526.1; HSA320526.
RNA EMBL; AJ320527.1; HSA320527.
RNA EMBL; BC005858.1; BC005858.
RNA EMBL; M10905.1; HSFNC.
RNA EMBL; M27589.1; HSFNPFH1.
RNA EMBL; M27590.1; HSFNPFHL1.
RNA EMBL; U41724.1; U41724.
RNA EMBL; U41850.1; U41850.
RNA EMBL; U42404.1; U42404.
RNA EMBL; U42455.1; U42455.
RNA EMBL; U42456.1; U42456.
RNA EMBL; U42457.1; U42457.
RNA EMBL; U42458.1; U42458.
RNA EMBL; U42592.1; U42592.
RNA EMBL; U42593.1; U42593.
RNA EMBL; U42594.1; U42594.
RNA EMBL; U60067.1; U60067.
RNA EMBL; U60068.1; U60068.
RNA EMBL; X02761.1; HSFIB1.
XX
DR Entrez GeneID; 2335.
DR Unigene; Hs.339722.
DR MIM; 135600.
DR Genew; HGNC:3778; FN1.
DR RefSeq; NM_002026.
DR RefSeq; NM_054034.
DR SWISSPROT; P02751; FINC_HUMAN.
DR EPD; EP16038; HS_FINC.
XX
EXP HSEST; HSEST_FN1; NM_002026.
EXP LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP LYMPHOMA1; L0001_16112; IMAGE_139009.
EXP LYMPHOMA1; L0001_17791; IMAGE_139009.
EXP NCI60; NCI60_136798; IMAGE_136798.
EXP NCI60; NCI60_151144; IMAGE_151144.
EXP NCI60; NCI60_512275; IMAGE_512275.
EXP NCI60; NCI60_512287; IMAGE_512287.
EXP PEROU1; P0001_139009; IMAGE_139009.
EXP PEROU1; P0001_268091; IMAGE_268091.
EXP PEROU1; P0001_269203; IMAGE_269203.
EXP PEROU1; P0001_296556; IMAGE_296556.
EXP PEROU1; P0001_60846; IMAGE_60846.
EXP ROSETTA; R0001_20907; RNA_X02761.
EXP SERUM1; S0001_136798; IMAGE_136798.
EXP SERUM1; S0001_151144; IMAGE_151144.
EXP SERUM1; S0001_512275; IMAGE_512275.
EXP SERUM1; S0001_512287; IMAGE_512287.
//
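The line-code layout described above lends itself to a simple reader. The following is a minimal Python sketch, not part of the CleanEx distribution: the function name `parse_entry` and the `SAMPLE` text (a shortened copy of the entry above) are illustrative, and the sketch assumes that the line code is separated from the data by whitespace, as in the example.

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one flat-file entry by their line code."""
    fields = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "XX":  # blank lines and spacer lines carry no data
            continue
        if line == "//":  # terminator line: the entry ends here
            break
        parts = line.split(None, 1)  # line code, then the rest of the line
        fields[parts[0]].append(parts[1] if len(parts) > 1 else "")
    return dict(fields)


# Shortened copy of the HS_FN1 example entry (illustrative only).
SAMPLE = """\
ID HS_FN1 2q34.
XX
DE fibronectin 1.
ON none.
XX
RNA EMBL; M27590.1; HSFNPFHL1.
RNA EMBL; X02761.1; HSFIB1.
XX
DR Unigene; Hs.339722.
DR MIM; 135600.
XX
EXP HSEST; HSEST_FN1; NM_002026.
//
"""

entry = parse_entry(SAMPLE)
print(entry["ID"])
print(len(entry["RNA"]), len(entry["DR"]), len(entry["EXP"]))
```

Line types that occur several times (RNA, DR, EXP) end up as lists with one item per line, which is why every value is stored as a list.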
A detailed description of each line type is given below.
3.2.1 The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID GENE_NAME genetic_locus.
- GENE_NAME is the species code followed by the gene identifier, which obeys the Human Gene Nomenclature rules.
- The genetic_locus field is the cytogenetic location of the gene. It is cross-linked with NCBI's genome map viewer.
The ID line is terminated by a period.
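As an illustration of the ID line layout, the sketch below splits a line into species code, gene symbol and locus. It is a hedged example only: the regular expression assumes that the species code consists of uppercase letters and is joined to the gene symbol by the first underscore, as in HS_FN1, and the name `parse_id_line` is hypothetical.

```python
import re

# Assumed layout: "ID <SPECIES>_<SYMBOL> <locus>." (species code in uppercase letters).
ID_LINE = re.compile(
    r"^ID\s+(?P<species>[A-Z]+)_(?P<symbol>\S+)\s+(?P<locus>.+)\.$"
)


def parse_id_line(line):
    """Return the species code, gene symbol and cytogenetic locus of an ID line."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a CleanEx ID line: {line!r}")
    return match.groupdict()


print(parse_id_line("ID HS_FN1 2q34."))
```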
3.2.2 The DE line
DE fibronectin 1.
The description lines contain general descriptive information about the gene. This information is extracted from the corresponding Unigene entry. The description is given in ordinary English and is free-format. In some cases, more than one DE line is required; the text is then divided only between words. The last DE line is terminated by a period.
3.2.3 The ON line
ON STGD1, ABCR, RP19, STGD.
The ON line describes the history of the gene nomenclature. It lists all the previous gene symbols which have been attributed to the specific gene.
3.2.4 The RNA line
RNA EMBL; M27590.1; HSFNPFHL1.
The RNA line contains cross-references to the mRNA entries for this gene. These mRNAs are found in the EMBL database. The RNA lines can refer to partial mRNAs.
The format of this line is given below:
RNA EMBL; EMBL_SV; EMBL_ID.
- EMBL_SV is the EMBL sequence version identifier (the accession number followed by the version number, as in M27590.1).
- EMBL_ID is a secondary identifier or name for the EMBL entry.
The line is terminated by a period.
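The RNA line can be taken apart along its semicolons. The sketch below is one possible reading, assuming that the sequence version field has the form accession.version (as in M27590.1); the names `RnaRef` and `parse_rna_line` are illustrative and not part of CleanEx.

```python
from typing import NamedTuple


class RnaRef(NamedTuple):
    accession: str
    version: int
    embl_id: str


def parse_rna_line(line):
    """Parse 'RNA EMBL; EMBL_SV; EMBL_ID.' into its components."""
    body = line.strip()
    if not body.startswith("RNA ") or not body.endswith("."):
        raise ValueError(f"not an RNA line: {line!r}")
    # Drop the line code and the final period, then split on ';'.
    database, sequence_version, embl_id = [
        part.strip() for part in body[4:-1].split(";")
    ]
    if database != "EMBL":
        raise ValueError(f"unexpected database: {database!r}")
    # Assumption: the sequence version is written as accession.version.
    accession, _, version = sequence_version.rpartition(".")
    return RnaRef(accession, int(version), embl_id)


print(parse_rna_line("RNA EMBL; M27590.1; HSFNPFHL1."))
```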
3.2.5 The DR lines
The DR lines contain cross-references to entries from other databases.
So far, we have incorporated links to SWISS-PROT, Entrez GeneID, RefSeq, Unigene, GeneCards and EPD. The precise format of these lines depends on the target database.
The format of the DR line is shown by the following examples :
DR Entrez GeneID; 2335.
DR Unigene; Hs.339722.
DR MIM; 135600.
DR Genew; HGNC:3778; FN1.
DR RefSeq; NM_002026.
DR RefSeq; NM_054034.
DR SWISSPROT; P02751; FINC_HUMAN.
DR EPD; EP16038; HS_FN1.
- The first item on the DR line is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:
| Entrez GeneID | A single query interface to curated sequence and descriptive information about genetic loci. |
| Unigene | The gene cluster database from NCBI. |
| MIM | The Mendelian Inheritance in Man Database, a catalog of human genes and genetic disorders. |
| Genew | The Human Gene Nomenclature Database providing data for all human genes which have approved symbols. |
| RefSeq | The NCBI Reference Sequence project. |
| SWISSPROT | Protein sequence database. |
| EPD | The eukaryotic promoter database. |
- The second item is the primary accession number (or an equivalent unique identifier of another data bank) of the entry to which reference is made.
- The third item (if it exists) is a secondary identifier or name for the cross-referenced database entry. For Genew, the second item is the HGNC (HUGO Gene Nomenclature Committee) identifier and the third item is the approved gene symbol.
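Because the third item is optional, a reader of DR lines has to accept both two-item and three-item forms. The following Python sketch shows one way to do this; the function name `parse_dr_line` is hypothetical, and the sketch assumes that items never contain a semicolon themselves.

```python
def parse_dr_line(line):
    """Return (database, primary_id, secondary_id_or_None) for a DR line."""
    body = line.strip()
    if not body.startswith("DR ") or not body.endswith("."):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the line code and the final period, then split on ';'.
    items = [item.strip() for item in body[3:-1].split(";")]
    if len(items) not in (2, 3):
        raise ValueError(f"expected two or three items: {line!r}")
    secondary = items[2] if len(items) == 3 else None
    return items[0], items[1], secondary


for example in (
    "DR Entrez GeneID; 2335.",
    "DR Genew; HGNC:3778; FN1.",
    "DR SWISSPROT; P02751; FINC_HUMAN.",
):
    print(parse_dr_line(example))
```

Only the final period is removed, so identifiers that contain a period themselves, such as Hs.339722, are kept intact.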
3.2.6 The EXP line
The EXP line contains cross-references to the publicly available human gene expression data. An exhaustive list of the datasets already integrated in CleanEx is available here.
Currently, the different data types considered for integration in CleanEx are:
- Stanford cDNA arrays.
- Nylon membrane arrays.
- Affymetrix oligoarrays.
- Other oligoarrays (Incyte, Resgen, Rosetta).
- EST counts per tissue category and per gene.
- SAGE experiments.
- MPSS experiments.
The format of the EXP line is shown by the following examples.
EXP HSEST; HSEST_FN1; NM_002026.
EXP LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP NCI60; NCI60_136798; IMAGE_136798.
EXP PEROU1; P0001_139009; IMAGE_139009.
EXP ROSETTA; R0001_20907; RNA_X02761.
EXP SERUM1; S0001_136798; IMAGE_136798.
EXP AFFY001; AFFY001_1575_at; AFFY_HC-G110_1575_at.
- The first field of the EXP line is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:
  - PEROU1 for the data from Perou et al.
  - SERUM1 for the data from Iyer et al.
  - NCI60 for the data from Ross et al.
  - LYMPHOMA1
  - ROSETTA
  - AFFY001
  - HSEST for the data from EST counts.
- The second field is the local identifier of the corresponding expression entry. It is built from the local code for the dataset (HSEST_ for ESTs, P0001_ for Perou, S0001_ for SERUM1, etc.) followed by the clone number used for the experiment (mostly IMAGE clone numbers). For the EST dataset, we used the official HUGO name.
- The third field is the reference to the corresponding CleanEx_trg entry.
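The three EXP fields can be collected per dataset, for instance to see which targets measure a gene in each experiment. The sketch below is illustrative only: `parse_exp_line` and `targets_by_dataset` are hypothetical helper names, and the input lines are copied from the examples above.

```python
from collections import defaultdict


def parse_exp_line(line):
    """Return (dataset, local_id, target) for an EXP line."""
    body = line.strip()
    if not body.startswith("EXP ") or not body.endswith("."):
        raise ValueError(f"not an EXP line: {line!r}")
    # Drop the line code and the final period, then split on ';'.
    dataset, local_id, target = [part.strip() for part in body[4:-1].split(";")]
    return dataset, local_id, target


def targets_by_dataset(lines):
    """Map each dataset name to the sorted CleanEx_trg references it uses."""
    grouped = defaultdict(set)
    for line in lines:
        dataset, _local_id, target = parse_exp_line(line)
        grouped[dataset].add(target)
    return {dataset: sorted(targets) for dataset, targets in grouped.items()}


EXP_LINES = [
    "EXP LYMPHOMA1; L0001_15953; IMAGE_139009.",
    "EXP LYMPHOMA1; L0001_16112; IMAGE_139009.",
    "EXP NCI60; NCI60_136798; IMAGE_136798.",
    "EXP NCI60; NCI60_151144; IMAGE_151144.",
    "EXP AFFY001; AFFY001_1575_at; AFFY_HC-G110_1575_at.",
]

print(targets_by_dataset(EXP_LINES))
```

Several local identifiers can point to the same target (as for LYMPHOMA1 above), which is why the targets are gathered in a set before sorting.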
Users can visualize the available expression data for the sequence given in the EXP line in two ways:
- First, by using (if it exists) the link to the web site provided by the original authors of the expression dataset.
- Second, by using a local visualiser. A few explanations about this display are given in Appendix B.
3.2.7 The // line
The // (terminator) line contains no data or comments. It designates the end of an entry.
3.3 CleanEx_trg
Each CleanEx_trg entry corresponds to one "target" (or "expression feature") used in an expression measurement experiment. Identifiers are composed of a code which describes the target type followed by an underscore and the target accession number. Types could be, for example, IMAGE clone (IMAGE), Affymetrix probeset (AFFY), SAGE tags (SAGE), or EMBL RNA or DNA sequences (RNA,DNA).
The format of CleanEx_trg resembles that of CleanEx. Each CleanEx_trg entry contains the following information:
- ID CleanEx_trg ID
- OS Organism Species
- GC Gene Count
- GN Official HUGO Gene Symbol
- OA Original Annotation (if existing)
- QU QUality
- SR Sequence Reference
- FM Format of the features
- FN Feature Number
- UG UniGene release
- F1 Feature
- DR CleanEx_ref ID
- //
Below is an example of an entry for an Affymetrix probe set :
ID AFFY_HC-G110_1575_at Type=Affy_Tag
OA M14758; HUMMDR1 Human P-glycoprotein (MDR1) mRNA; complete cds.
OS Homo sapiens (human).
GN ABCB1
GC 1
QU High
SR Unigene=Hs.21330;
FM Tag;
FN 16
UG UniGene Build #160
F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F2 AAAGCGCCAGTGAACTCTGACTGTA:284-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F3 GCGCCAGTGAACTCTGACTGTATGA:285-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F4 CCAGTGAACTCTGACTGTATGAGAT:286-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F5 TTAACATTTCCTCAGTCAAGTTCAG:287-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F6 ACATTTCCTCAGTCAAGTTCAGAGT:288-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F7 TTTCCTCAGTCAAGTTCAGAGTCTT:289-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F8 CCTCAGTCAAGTTCAGAGTCTTCAG:290-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F9 AGACATCATCAAGTGGAGAGAAATC:291-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F10 ATTTTCCCATTTGGACTGTAACTGA:292-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F11 TTCCCATTTGGACTGTAACTGACTG:293-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F12 CCATTTGGACTGTAACTGACTGCCT:294-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F13 TTTGGACTGTAACTGACTGCCTTGC:295-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F14 TAACTGACTGCCTTGCTAAAAGATT:296-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F15 CTGACTGCCTTGCTAAAAGATTATA:297-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F16 ACTGCCTTGCTAAAAGATTATAGAA:298-105; Refseq=NM_000927(+); Unigene=Hs.21330;
DR AFFY001_1575_at;
//
Description of the line formats :
3.3.1 The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID TRG_ID Type
- TRG_ID is the internal identifier for the entry. The first part of the ID is a target type identifier. The second part is built from the original target name (IMAGE clone identifier, Affymetrix chip and probeset name, etc.).
- The Type field is a description of the target's provenance. Type could be, for example, "Seq_Ref" (for a sequence in EMBL or in RefSeq), "cDNA_clone", "Affy_Tag" or "SAGE_Tag".
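The two parts of TRG_ID can be separated at the first underscore. The sketch below assumes this reading, and also assumes that the Type field is written as Type=value, as in the Affymetrix example above; the function name `parse_trg_id_line` is hypothetical.

```python
def parse_trg_id_line(line):
    """Split a CleanEx_trg ID line into target type, original name and provenance."""
    parts = line.split()
    if len(parts) != 3 or parts[0] != "ID" or not parts[2].startswith("Type="):
        raise ValueError(f"not a CleanEx_trg ID line: {line!r}")
    trg_id = parts[1]
    # Assumption: the target type code ends at the first underscore.
    target_type, separator, original_name = trg_id.partition("_")
    if not separator:
        raise ValueError(f"identifier has no type prefix: {trg_id!r}")
    return {
        "trg_id": trg_id,
        "target_type": target_type,
        "original_name": original_name,
        "provenance": parts[2][len("Type="):],
    }


print(parse_trg_id_line("ID AFFY_HC-G110_1575_at Type=Affy_Tag"))
```

Splitting at the first underscore only keeps original names that contain underscores themselves, such as HC-G110_1575_at, in one piece.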
3.3.2 The OA line
OA X60188; Human ERK1 mRNA for protein serine/threonine kinase
This line contains either the target's Original Annotation found in the corresponding description files, for example the Affymetrix chips annotation, or the description of the sequence given in the corresponding EMBL entry. It exists only for CleanEx_trg entries corresponding to Affymetrix tags.
3.3.3 The GN line
GN TIE
The GN line lists the official gene symbols which correspond to that entry. If more than four genes match the target, only the first four are listed.
3.3.4 The GC line
GC 1
The GC line gives the total count of genes with an approved symbol that match that target entry.
3.3.5 The QU line
QU High
The QU line is a quality tag based on the precision of the mapping of the target.
This tag can take different values, depending on the corresponding entry type or on the mapping protocol. For AFFY tags and IMAGE clones, the four tags have the following meanings:
- High: All the features of the target correspond to a maximum of two gene clusters.
- Medium: All the features of the target correspond to a maximum of four gene clusters. Three mismatches are allowed.
- Low: Criteria are below those of the "Medium" tag.
- Undefined: The target does not yet belong to a Unigene cluster.
For INCYTE clones, quality tags are a bit more stringent and correspond to the following criteria:
- 1: Both 3' and 5' ends of the clone are available and match the same Unigene cluster.
- 2: Either 3' or 5' ends of the clone are available and both give a statistically significant result.
- 3: Both 3' and 5' ends of the clone are available, but only one is statistically significant.
- 4: No statistically significant result has been found.
- 5: Both ends of the clone match different genes.
- 6: The sequence is not yet available.
3.3.6 The SR line
SR Unigene=Hs.21330;
The SR line stands for Sequence Reference and gives the associated Unigene Cluster for the whole target.
3.3.7 The FM line
FM Tag;
This line describes the format of the features for the target.
3.3.8 The FN line
FN 16
The FN line gives the number of features belonging to that target. For cDNA clones, this number is typically one. For Affymetrix probesets, it can vary between eleven and twenty-five.
3.3.9 The UG line
UG UniGene Build #160
The UG line shows the Unigene release which has been used to map the target sequences to their corresponding clusters.
3.3.10 The F1-F25 lines
F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
These lines show the individual mapping for all the features of the corresponding target. Fields are separated by a ";". The first field is the name of the feature. The second field contains the RefSeq accession numbers of the sequences to which the feature maps. The sign in parentheses indicates whether the tag mapped on the positive or on the negative strand of the RefSeq sequence. The last field shows the Unigene clusters with which the RefSeq sequences are associated.
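A feature line can be decomposed along these fields. The following Python sketch is one hedged reading of the layout: it assumes that the second and last fields are written as Refseq=... and Unigene=..., as in the example, and that RefSeq accessions look like NM_000927. How several accessions or clusters would be listed on one line is not shown in this document, so the sketch simply collects every accession(strand) pair it finds and keeps the Unigene value as a single string. The name `parse_feature_line` is illustrative.

```python
import re

F_LINE = re.compile(r"^F(?P<index>\d+)\s+(?P<body>.+)$")
# Assumed accession pattern: two uppercase letters, underscore, digits, then (+) or (-).
REFSEQ_HIT = re.compile(r"(?P<accession>[A-Z]{2}_\d+)\((?P<strand>[+-])\)")


def parse_feature_line(line):
    """Parse one F1-F25 line into name, RefSeq hits and Unigene value."""
    match = F_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a feature line: {line!r}")
    fields = [f.strip() for f in match.group("body").split(";") if f.strip()]
    feature = {
        "index": int(match.group("index")),
        "name": fields[0],
        "refseq": [],
        "unigene": None,
    }
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key == "Refseq":
            feature["refseq"] = [
                (hit.group("accession"), hit.group("strand"))
                for hit in REFSEQ_HIT.finditer(value)
            ]
        elif key == "Unigene":
            feature["unigene"] = value
    return feature


print(parse_feature_line(
    "F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;"
))
```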
3.3.11 The DR line
DR AFFY001_1575_at;
DR lines in CleanEx_trg are cross-links to the expression data found in CleanEx under the line type "EXP". The link is made via the local identifier of the expression data.
APPENDIX A : SURVEY OF CleanEx LAST RELEASE
References from UniGene Build #213 Homo sapiens
| Number of entries | 21820 |
| Number of RNA cross-references | 124038 |
| Number of Entrez GeneID cross-references | 21699 |
| Number of Unigene cross-references | 19138 |
| Number of Genew cross-references | 21820 |
| Number of RefSeq cross-references | 24981 |
| Number of EPD cross-references | 1400 |
| Number of SWISS-PROT cross-references | 17755 |
| Number of cross-references to EST count | 17556 |
| Number of cross-references to dual channel experiments | 1886076 |
| Number of cross-references to Affymetrix experiments | 6423998 |
| Number of cross-references to SAGE experiments | 644525 |
References from UniGene Build #172 Mus musculus
| Number of entries | 36044 |
| Number of RNA cross-references | 48710 |
| Number of Entrez GeneID cross-references | 28468 |
| Number of Unigene cross-references | 18545 |
| Number of MGD cross-references | 36043 |
| Number of RefSeq cross-references | 26310 |
| Number of EPD cross-references | 115 |
| Number of SWISS-PROT cross-references | 14880 |
| Number of cross-references to EST count | 15715 |
| Number of cross-references to Affymetrix experiments | 1964526 |
| Number of cross-references to SAGE experiments | 421824 |
APPENDIX B : EXPRESSION DISPLAY
The local display gives different representations of the gene's expression.
- The EST dataset:
  For the EST dataset, the ratio is the number of ESTs per gene for one category divided by the total number of ESTs for this category. This ratio is represented by a scale from white (underexpressed) to black (overexpressed).
- Other datasets:
  - In the first column, the color represents the log2 of the ratio between the two channels (green and red). The color display goes from light green (underexpressed) to light red (overexpressed).
  - The second column displays the superposition of both scanned images.
In all cases, the color range goes from 1 to 256, and the value is scaled according to the following formula:
(logratio - logmin) / ((logmax - logmin) / 256)
where logratio is the log2 of the ratio, logmin is the minimum log ratio found in the dataset and logmax is its maximum.
For the display showing the sum of both channels, the color is obtained by superposing the intensities of both channels. The color value for each channel is scaled in the same way as the ratio color value:
(logchannel - logminchannel) / ((logmaxchannel - logminchannel) / 256)
where logchannel is the log2 of the channel intensity, logminchannel is the minimum value of the log2 channel intensity, and logmaxchannel is its maximum.
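The scaling can be sketched in a few lines of Python. This is not the code used by the CleanEx display; it reads the formula as (value - minimum) / ((maximum - minimum) / 256), which is an assumption about the intended grouping of the parentheses, and it takes the ratio as red divided by green, which is also an assumption. The function names are illustrative.

```python
import math


def scale_to_color(value, minimum, maximum, levels=256):
    """Map a log value onto the color range: (value - min) / ((max - min) / levels)."""
    if not maximum > minimum:
        raise ValueError("maximum must be greater than minimum")
    return (value - minimum) / ((maximum - minimum) / levels)


def ratio_color(red, green, log_min, log_max):
    """Color value for the log2 ratio of two channel intensities.

    Assumption: the ratio is red / green, so that higher values shade towards red.
    """
    return scale_to_color(math.log2(red / green), log_min, log_max)


# A dataset whose log2 ratios span -4 to 4 (illustrative numbers only).
print(scale_to_color(0.0, -4.0, 4.0))
print(ratio_color(4.0, 1.0, -4.0, 4.0))
```

Under this reading the smallest log ratio in the dataset maps to 0 and the largest to 256; a display limited to the range 1 to 256 would need to clamp or round the result, a step the formula above does not specify.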