Written by:
Philipp Bucher, Rouaida Cavin Perier and Viviane Praz
Swiss Institute of Bioinformatics
and Swiss Institute for Experimental Cancer Research
Ch. des Boveresses 155
CH-1066 Epalinges s/Lausanne
This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy.

CleanEx June 2008

The current release is based on the Unigene database available on June 11, 2008.


    1. The title line
    2. CleanEx entries
      1. The ID line
      2. The DE line
      3. The ON line
      4. The RNA line
      5. The DR lines
      6. The EXP line
      7. The // line
    3. CleanEx_trg entries
      1. The ID line
      2. The OA line
      3. The GN line
      4. The GC line
      5. The QU line
      6. The SR line
      7. The FM line
      8. The FN line
      9. The UG line
      10. The F1-F25 lines
      11. The DR line


CleanEx is an expression reference database. Its goal is to link different information found in known databases and public expression data. Entries are cross-linked with pages allowing users to view expression data locally as well as in the original published format.


So far, CleanEx contains only human genes for which the symbol is approved by the HUGO nomenclature committee.
There is one entry per gene name.


3.1 The title line

The title line of CleanEx is shown below:


3.2 CleanEx entries

A CleanEx entry contains the following information:

CleanEx entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries. Each line starts with a line code identifying the type of information presented. The current line types and line codes, in the order in which they appear in an entry, are shown below:
	ID  - IDentification.
	DE  - DEscription.
	ON  - Old gene Name.
	RNA - RNA sequence in EMBL.
	DR  - Database crosslinks.
	EXP - EXPression cross-references.
	//  - Termination line.

Spacer lines (XX) are inserted to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72.

Below is an example of an entry:
ID    HS_FN1     2q34.
DE    fibronectin 1.
ON    none.
RNA   EMBL; AF130095.1; AF130095.
RNA   EMBL; AF312399.1; AF312399.
RNA   EMBL; AJ276395.1; HSA276395.
RNA   EMBL; AJ320525.1; HSA320525.
RNA   EMBL; AJ320526.1; HSA320526.
RNA   EMBL; AJ320527.1; HSA320527.
RNA   EMBL; BC005858.1; BC005858.
RNA   EMBL; M10905.1; HSFNC.
RNA   EMBL; M27589.1; HSFNPFH1.
RNA   EMBL; M27590.1; HSFNPFHL1.
RNA   EMBL; U41724.1; U41724.
RNA   EMBL; U41850.1; U41850.
RNA   EMBL; U42404.1; U42404.
RNA   EMBL; U42455.1; U42455.
RNA   EMBL; U42456.1; U42456.
RNA   EMBL; U42457.1; U42457.
RNA   EMBL; U42458.1; U42458.
RNA   EMBL; U42592.1; U42592.
RNA   EMBL; U42593.1; U42593.
RNA   EMBL; U42594.1; U42594.
RNA   EMBL; U60067.1; U60067.
RNA   EMBL; U60068.1; U60068.
RNA   EMBL; X02761.1; HSFIB1.
DR    Entrez GeneID; 2335.
DR    Unigene; Hs.339722.
DR    MIM; 135600.
DR    Genew; HGNC:3778; FN1.
DR    RefSeq; NM_002026.
DR    RefSeq; NM_054034.
DR    EPD; EP16038; HS_FINC.
EXP   HSEST; HSEST_FN1; NM_002026.
EXP   LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP   LYMPHOMA1; L0001_16112; IMAGE_139009.
EXP   LYMPHOMA1; L0001_17791; IMAGE_139009.
EXP   NCI60; NCI60_136798; IMAGE_136798.
EXP   NCI60; NCI60_151144; IMAGE_151144.
EXP   NCI60; NCI60_512275; IMAGE_512275.
EXP   NCI60; NCI60_512287; IMAGE_512287.
EXP   PEROU1; P0001_139009; IMAGE_139009.
EXP   PEROU1; P0001_268091; IMAGE_268091.
EXP   PEROU1; P0001_269203; IMAGE_269203.
EXP   PEROU1; P0001_296556; IMAGE_296556.
EXP   PEROU1; P0001_60846; IMAGE_60846.
EXP   ROSETTA; R0001_20907; RNA_X02761.
EXP   SERUM1; S0001_136798; IMAGE_136798.
EXP   SERUM1; S0001_151144; IMAGE_151144.
EXP   SERUM1; S0001_512275; IMAGE_512275.
EXP   SERUM1; S0001_512287; IMAGE_512287.
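
The line-code layout described above lends itself to a simple reader. The following Python sketch is illustrative only and is not part of the CleanEx distribution: the function names `parse_entries` and `parse_id_line` are hypothetical, and the code assumes that the line code is separated from its content by at least one space and that the genetic locus is always present on the ID line.

```python
from collections import defaultdict


def parse_entries(text):
    """Split flat-file text into entries.

    Each entry is a dict mapping a line code (ID, DE, RNA, ...) to the
    list of content strings found under that code, in file order.
    """
    entries = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.split()[0] == "XX":
            continue  # skip blank lines and XX spacer lines
        if line.startswith("//"):
            # The terminator line closes the current entry.
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        code, _, content = line.partition(" ")
        current[code].append(content.strip())
    if current:
        entries.append(dict(current))  # tolerate a missing final terminator
    return entries


def parse_id_line(content):
    """Return (gene_name, genetic_locus) from the content of an ID line."""
    gene_name, locus = content.rstrip(".").split(None, 1)
    return gene_name, locus.strip()


sample = """\
ID    HS_FN1     2q34.
DE    fibronectin 1.
ON    none.
RNA   EMBL; M27590.1; HSFNPFHL1.
DR    Unigene; Hs.339722.
DR    MIM; 135600.
EXP   NCI60; NCI60_136798; IMAGE_136798.
//
"""

entries = parse_entries(sample)
gene_name, locus = parse_id_line(entries[0]["ID"][0])
```

Keeping a list per line code preserves the order of repeated lines such as RNA, DR and EXP, which occur many times in a single entry.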

A detailed description of each line type is given below.

3.2.1 The ID line

The identification line is always the first line of an entry. The general form of the ID line is:
	ID    GENE_NAME     genetic_locus.
The ID line is terminated by a period.

3.2.2 The DE line

	DE    fibronectin 1.

The description (DE) lines contain general descriptive information about the gene, extracted from the corresponding Unigene entry. The description is given in ordinary English and is free-format. In some cases, more than one DE line is required; the text is then divided only between words. The last DE line is terminated by a period.

3.2.3 The ON line


The ON line describes the history of the gene nomenclature. It lists all the previous gene symbols which have been attributed to the specific gene.

3.2.4 The RNA line

	RNA   EMBL; M27590.1; HSFNPFHL1.

The RNA lines contain cross-references to the mRNA entries for this gene, which are found in the EMBL database. RNA lines can refer to partial mRNAs.
The format of this line is illustrated by the example above. The line is terminated by a period.

3.2.5 The DR lines

The DR lines contain cross-references to entries from other databases. So far, we have incorporated links to SWISS-PROT, Entrez GeneID, RefSeq, Unigene, GeneCards and EPD. The precise format of these lines depends on the target database.
The format of the DR line is shown by the following examples :
DR    Entrez GeneID; 2335.
DR    Unigene; Hs.339722.
DR    MIM; 135600.
DR    Genew; HGNC:3778; FN1.
DR    RefSeq; NM_002026.
DR    RefSeq; NM_054034.
DR    EPD; EP16038; HS_FN1.
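
As a rough illustration of how the DR lines of a CleanEx entry could be read, the Python sketch below splits a line into the target database name and its identifier fields. The function name `parse_dr_line` is hypothetical. The sketch assumes the layout seen in the examples above: a database name, then one or more identifiers separated by ";", with a terminating period.

```python
def parse_dr_line(line):
    """Return (database, identifiers) for one DR line of a CleanEx entry."""
    code, _, rest = line.partition(" ")
    if code != "DR":
        raise ValueError(f"not a DR line: {line!r}")
    body = rest.strip()
    if body.endswith("."):
        # Drop only the terminating period; periods inside
        # identifiers such as Hs.339722 are kept.
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    return fields[0], fields[1:]


database, identifiers = parse_dr_line("DR    Genew; HGNC:3778; FN1.")
```

Because the precise format depends on the target database, the identifiers are returned as a plain list rather than being interpreted further.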

3.2.6 The EXP line

The EXP line contains cross-references to publicly available human gene expression data. An exhaustive list of the datasets already integrated in CleanEx is available separately.
Currently, the different data types considered for integration in CleanEx are:
  1. Stanford cDNA arrays.
  2. Nylon membrane arrays.
  3. Affymetrix oligoarrays.
  4. Other oligoarrays (Incyte, Resgen, Rosetta).
  5. EST counts per tissue category and per gene.
  6. SAGE experiments.
  7. MPSS experiments.
The format of the EXP line is shown by the following examples.
EXP   HSEST; HSEST_FN1; NM_002026.
EXP   LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP   NCI60; NCI60_136798; IMAGE_136798.
EXP   PEROU1; P0001_139009; IMAGE_139009.
EXP   ROSETTA; R0001_20907; RNA_X02761.
EXP   SERUM1; S0001_136798; IMAGE_136798.
EXP   AFFY001; AFFY001_1575_at; AFFY_HC-G110_1575_at.

Users can visualize available expression data about the sequence given in the EXP line in two ways :

3.2.7 The // line

The // (terminator) line contains no data or comments. It designates the end of an entry.

3.3 CleanEx_trg

Each CleanEx_trg entry corresponds to one "target" (or "expression feature") used in an expression measurement experiment. Identifiers are composed of a code describing the target type, followed by an underscore and the target accession number. Target types include, for example, IMAGE clones (IMAGE), Affymetrix probesets (AFFY), SAGE tags (SAGE), and EMBL RNA or DNA sequences (RNA, DNA).
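
The identifier convention just described can be sketched in Python as follows. The function name `split_target_id` is hypothetical, and the sketch assumes that the type code itself never contains an underscore, so the identifier is split at the first underscore only.

```python
def split_target_id(target_id):
    """Split a CleanEx_trg identifier into (type_code, accession)."""
    type_code, sep, accession = target_id.partition("_")
    if not sep or not type_code or not accession:
        raise ValueError(f"not a TYPE_accession identifier: {target_id!r}")
    return type_code, accession


affy = split_target_id("AFFY_HC-G110_1575_at")
image = split_target_id("IMAGE_139009")
```

Splitting at the first underscore matters because the accession part may itself contain underscores, as in the Affymetrix probeset example.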

The format of CleanEx_trg resembles that of CleanEx. Each CleanEx_trg entry contains the following information :

Below is an example of an entry for an Affymetrix probe set :
ID   AFFY_HC-G110_1575_at   Type=Affy_Tag
OA   M14758; HUMMDR1 Human P-glycoprotein (MDR1) mRNA; complete cds.
OS   Homo sapiens (human).
GC   1
QU   High
SR   Unigene=Hs.21330;
FM   Tag;
FN   16
UG   UniGene Build #160
F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F2  AAAGCGCCAGTGAACTCTGACTGTA:284-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F3  GCGCCAGTGAACTCTGACTGTATGA:285-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F4  CCAGTGAACTCTGACTGTATGAGAT:286-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F5  TTAACATTTCCTCAGTCAAGTTCAG:287-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F6  ACATTTCCTCAGTCAAGTTCAGAGT:288-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F7  TTTCCTCAGTCAAGTTCAGAGTCTT:289-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F8  CCTCAGTCAAGTTCAGAGTCTTCAG:290-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F9  AGACATCATCAAGTGGAGAGAAATC:291-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F10  ATTTTCCCATTTGGACTGTAACTGA:292-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F11  TTCCCATTTGGACTGTAACTGACTG:293-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F12  CCATTTGGACTGTAACTGACTGCCT:294-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F13  TTTGGACTGTAACTGACTGCCTTGC:295-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F14  TAACTGACTGCCTTGCTAAAAGATT:296-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F15  CTGACTGCCTTGCTAAAAGATTATA:297-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F16  ACTGCCTTGCTAAAAGATTATAGAA:298-105; Refseq=NM_000927(+); Unigene=Hs.21330;
DR   AFFY001_1575_at;

Description of the line formats:

3.3.1 The ID line

The identification line is always the first line of an entry. The general form of the ID line is:
        ID   TRG_ID     Type

3.3.2 The OA line

        OA   X60188; Human ERK1 mRNA for protein serine/threonine kinase

The OA line contains either the target's Original Annotation, as found in the corresponding description files (for example, the Affymetrix chip annotation), or the description of the sequence given in the corresponding EMBL entry. It exists only for CleanEx_trg entries corresponding to Affymetrix tags.

3.3.3 The GN line

        GN   TIE

The GN line lists the official gene symbols which correspond to that entry. If more than four genes match the target, only the first four are listed.

3.3.4 The GC line

        GC   1

The GC line gives the total number of genes with an approved symbol that match the target entry.

3.3.5 The QU line

        QU   High

The QU line is a quality tag based on the precision of the mapping of the target.

This tag can take different values, according to the corresponding entry type or to the mapping protocol. For AFFY tags and IMAGE clones, the meaning of the four tags is :

For INCYTE clones, quality tags are a bit more stringent and correspond to the following criteria :

3.3.6 The SR line

        SR   Unigene=Hs.21330;

The SR line stands for Sequence Reference and gives the associated Unigene Cluster for the whole target.

3.3.7 The FM line

        FM   Tag;

This line describes the format of the features for the target.

3.3.8 The FN line

        FN   16

The FN line gives the number of features belonging to that target. For cDNA clones, this number is typically one. For Affymetrix probesets, it can vary between eleven and twenty-five.

3.3.9 The UG line

        UG   UniGene Build #160

The UG line shows the Unigene release that was used to map the target sequences to their corresponding clusters.

3.3.10 The F1-F25 lines

        F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;

These lines show the individual mapping for each feature of the corresponding target. Fields are separated by a semicolon (";"). The first field is the name of the feature. The second field contains the RefSeq accession numbers of the sequences to which the feature maps; the sign in parentheses indicates whether the tag mapped to the positive or the negative strand of the RefSeq sequence. The last field shows the Unigene clusters with which the RefSeq sequences are associated.
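
A feature line of this kind could be read with a short Python sketch such as the one below. The function name `parse_feature_line` and the returned dictionary layout are hypothetical. The sketch covers only the layout of the example above, in which each "Refseq=" or "Unigene=" field carries a single value; how several accessions would be written in one entry is not specified here.

```python
import re

# Accession followed by the strand sign in parentheses, e.g. NM_000927(+)
REFSEQ_RE = re.compile(r"^(?P<acc>[^()]+)\((?P<strand>[+-])\)$")


def parse_feature_line(line):
    """Parse one F1-F25 line into its feature name, RefSeq hits and clusters."""
    code, _, rest = line.partition(" ")
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    result = {"code": code, "feature": fields[0], "refseq": [], "unigene": []}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key == "Refseq":
            match = REFSEQ_RE.match(value)
            if match:
                result["refseq"].append((match["acc"], match["strand"]))
            else:
                result["refseq"].append((value, None))  # strand not given
        elif key == "Unigene":
            result["unigene"].append(value)
    return result


feature = parse_feature_line(
    "F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;"
)
```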

3.3.11 The DR line

        DR   AFFY001_1575_at;

DR lines in CleanEx_trg are crosslinks to the expression data found in CleanEx under the line type "EXP". The link is made via the local identifier of the expression data.


References from UniGene Build #213 Homo sapiens

Number of entries 21820
Number of RNA cross-references 124038
Number of Entrez GeneID cross-references 21699
Number of Unigene cross-references 19138
Number of Genew cross-references 21820
Number of RefSeq cross-references 24981
Number of EPD cross-references 1400
Number of SWISS-PROT cross-references 17755
Number of cross-references to EST count 17556
Number of cross-references to dual channel experiments 1886076
Number of cross-references to Affymetrix experiments 6423998
Number of cross-references to SAGE experiments 644525

References from UniGene Build #172 Mus musculus

Number of entries 36044
Number of RNA cross-references 48710
Number of Entrez GeneID cross-references 28468
Number of Unigene cross-references 18545
Number of MGD cross-references 36043
Number of RefSeq cross-references 26310
Number of EPD cross-references 115
Number of SWISS-PROT cross-references 14880
Number of cross-references to EST count 15715
Number of cross-references to Affymetrix experiments 1964526
Number of cross-references to SAGE experiments 421824


The local display gives different representations of the gene's expression.

  1. The EST dataset :

    For the EST dataset, the ratio is the number of ESTs per gene for one category divided by the total number of ESTs for this category. This ratio is represented by a scale from white (underexpressed) to black (overexpressed).

  2. Other datasets

    • In the first column, the color represents the log2 of the ratio between the two channels (green and red). The color display goes from light green (underexpressed) to light red (overexpressed).
    • The second column displays the superposition of both scanned images.

In all cases, the ratio color range goes from 1 to 256 and the value is scaled according to the following formula:
where logratio is the log2 of the ratio, logmin is the minimum log ratio found in the dataset and logmax is its maximum.
For the display showing the sum of both channels, the color is obtained by superposing the intensity of both channels. The color value for each channel is scaled in the same way as the color ratio value:
where logchannel is the log2 of the channel intensity, logminchannel is the minimum value of the log2 channel intensity, and logmaxchannel is its maximum.
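
The scaling formulas themselves are not reproduced in this text. The Python sketch below shows one plausible reading, assuming a linear (min-max) mapping of the log2 value onto the 1 to 256 color range. The function names, the clamping of out-of-range values, the rounding, and the choice of red over green as the ratio are all assumptions made for illustration, not the formula actually used by CleanEx.

```python
import math


def scale_to_color(value, vmin, vmax, lo=1, hi=256):
    """Linearly map value from [vmin, vmax] onto the integer range [lo, hi]."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(1.0, max(0.0, fraction))  # clamp values outside the range
    return lo + round(fraction * (hi - lo))


def ratio_color(red, green, logmin, logmax):
    """Color value for the log2 ratio of the two channels (red/green assumed)."""
    logratio = math.log2(red / green)
    return scale_to_color(logratio, logmin, logmax)


def channel_color(intensity, logminchannel, logmaxchannel):
    """Color value for a single channel, scaled the same way as the ratio."""
    logchannel = math.log2(intensity)
    return scale_to_color(logchannel, logminchannel, logmaxchannel)


lowest = scale_to_color(-3.0, -3.0, 3.0)
highest = ratio_color(8.0, 1.0, -3.0, 3.0)
```

Under these assumptions, a log ratio equal to logmin maps to 1 and a log ratio equal to logmax maps to 256, with intermediate values spread evenly between them.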