The main goal of the CleanEx database is to provide access to public gene expression data via unique gene names and to represent heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-dataset comparisons. To achieve this goal, each gene expression experiment is regularly mapped to a permanent target identifier consisting of a physical description of the targeted RNA.
This manual leads the user through the different construction steps which are necessary to generate the CleanEx system. It also provides a "Learn-by-Example" tutorial page for each entry point in CleanEx and for each CleanEx tool.
- THE CleanEx DATABASE : CONCEPT AND DATA ORGANIZATION
- BUILDING CleanEx
- CleanEx format conventions
- CleanEx entry search engines and viewers
- Data extraction
- Data analysis
- Using CleanEx : Examples of applications
CleanEx : CONCEPT AND DATA ORGANIZATION
Introduction
CleanEx contains human and mouse genes for which the symbol is approved by the representative organism nomenclature committee. For human genes, we use the approved Genew gene symbols. The mouse gene index is based on the MGD (Mouse Genome Database) nomenclature. There is one entry per gene name for each organism.
CleanEx is a flat file formatted database system consisting of three different file types.
Each of these files contains specific information and is linked to the others through a defined accession number.
The three file types are named :
- CleanEx_exp : storage system for the expression datasets numerical values
- CleanEx_trg : target database storage of gene mapping and quality control information
- CleanEx : the gene index containing cross-references to expression data incorporated into the system
CleanEx_exp
CleanEx_exp files store publicly available gene expression data.
Each "exp" file contains a matrix of measured expression levels for a set of target sequences and conditions, which is typically published and analyzed at once, and referred to by a common name. A data entry contains expression values for a particular feature over all conditions.
Each "exp" file begins with a documentation entry for the corresponding dataset, which could be compared to the GEO series instance, plus one data entry for each expression target. The documentation entry provides general information about the data set including :
- number of spotted features
- number of tissues or experiments in the dataset
- description of the tissues or experiments in the dataset (as provided by the authors)
- organism
- published reference, if provided
- type of associated reference sequences (clones, RefSeq sequences, other RNA sequences...)
By feature we mean any molecule that is used to retrieve a certain transcript's abundance in an experiment, such as a clone or oligonucleotide spotted on a certain position of a dual-channel chip, an Affymetrix probe set, or a SAGE or MPSS tag.
The header line of each CleanEx_exp data entry contains the CleanEx_target identifier, which links this specific "exp" entry to its target expressed sequence in the "trg" file (the transcript which is "targeted" by the so-called feature).
The CleanEx_exp files are in principle static, except if the authors modify their own data. The only exception is the "exp" file that contains the tissue distribution of public ESTs, which is derived from Unigene and regenerated from scratch whenever the original source is updated.
CleanEx_exp files have short alpha-numeric strings as identifiers, which for most cases correspond to the GEO series identifier. The individual expression data entries have composite identifiers consisting of the corresponding "exp" file name followed by an underscore character and a second unique identifier.
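As an illustration of the composite identifiers described above, the following minimal Python sketch splits an expression entry identifier into its dataset prefix and its second identifier. The function name is hypothetical, and the sketch assumes that the prefix itself never contains an underscore, which the manual does not state explicitly.

```python
def split_exp_entry_id(entry_id):
    """Split a composite "exp" entry identifier at its first underscore.

    Assumption: the dataset prefix contains no underscore, so everything
    after the first underscore is the second unique identifier.
    """
    prefix, sep, local_id = entry_id.partition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a composite identifier: {entry_id!r}")
    return prefix, local_id


# Identifiers taken from the EXP line examples of this manual.
print(split_exp_entry_id("NCI60_136798"))
print(split_exp_entry_id("AFFY001_1575_at"))
```

Splitting at the first underscore only keeps second identifiers such as "1575_at" intact.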
CleanEx_trg
The "trg" file type contains a physical description of the expression targets, linked to genes and quality control information. A CleanEx "target" stands for the sequence to which any nucleotide element, or "feature", which is spotted or sequenced for an expression experiment corresponds. Features can be either :
- A spotted cDNA or oligo
- An Affymetrix probeset
- A SAGE or MPSS tag
An entry in the CleanEx "trg" file type is an annotated feature with its corresponding gene name and possibly its position on the gene nucleotide sequence reference.
The exact content of a target entry depends on the feature type. Currently we distinguish between the feature types listed below. The last two (gene names and sequence database accession numbers) are not true physical descriptions of spotted features and serve as substitutes when more precise information is lacking.
- Public cDNA clone names included in UniGene
- cDNA clones from private suppliers (e.g. Incyte)
- Affymetrix probe sets
- SAGE or MPSS tags
- Gene names
- Sequence database accession numbers
The CleanEx_trg entries consist of a stable part and a weekly updated dynamic part. The stable part is imported from external sources, such as the original feature names given by the experiment authors, or the probe set documentation files posted by Affymetrix, and is used to generate the dynamic part, namely the mapping of CleanEx_trg "features" to "targets" via a weekly updating procedure.
When a feature matches multiple targets, the CleanEx_trg entry lists all corresponding genes found but adds a quality-control flag to indicate that the mapping is ambiguous.
CleanEx
CleanEx is the catalog of officially approved genes from model organisms (for now : human and mouse) with cross-references to entries in CleanEx_trg and CleanEx_exp, and links to external databases. There is one entry per gene, regardless of whether there are corresponding expression data in CleanEx_exp. This file is completely rebuilt from scratch every week synchronously with the remapping of expression targets to genes. The process starts with a compilation of officially approved gene names from the reference gene catalogs (Genew for human and MGD for mouse). These names are then used to establish cross-references to CleanEx_trg entries and from there to expression data in CleanEx_exp via the target unique identifier. The link between sequences and gene names is done via the Unigene database. To have a complete view of the transcript and its product, we also link each entry to the corresponding protein. We also provide the genomic position of the transcription start site from EPD, when available. Otherwise we give the annotated start site position in Ensembl.
Building CleanEx : Main Steps
Introduction
The building procedure for the CleanEx system consists of regenerating from scratch the weekly updated files CleanEx_trg and CleanEx, and then adding the dataset information contained in the stable files (CleanEx_exp) to this new version, and concatenating all the cross-references together in CleanEx.
This page describes the building process of the stable CleanEx_exp files, which occurs only once, and the updating procedures for the two other file types, CleanEx_trg and CleanEx.
CleanEx_exp
The different platforms which have been integrated in the CleanEx system so far are :
- Dual channel chips from the Stanford Microarray Database (SMD)
- 60-mer oligoarray from the Rosetta institute (http://www.rii.com/)
- Nylon array from ClonTech (http://www.clontech.com/)
- Affymetrix experiments done with any commercially available Human or Mouse chip (http://www.affymetrix.com)
- EST counts
- SAGE tag counts
- MPSS tag counts
Though some features are similar between some datasets (for example the first three methods give as main output a ratio between a reference experiment and the tested condition, Affymetrix-like experiments usually give a single intensity per probeset, and the EST, SAGE, and MPSS methods all give a basic count of transcripts found), each type of dataset needs a specific protocol to be integrated in CleanEx. Nowadays, most of the datasets in CleanEx are extracted from the GEO (Gene Expression Omnibus) database at the NCBI. The GEO database has become the most popular expression dataset repository, and thus represents a very complete expression data source.
Typically, the metadata for each dataset, which contain information such as the type of experiment performed, organism, methods applied, paper reference and so on, give rise to the first entry of the dataset, namely the documentation file (DOC). This is the first part to generate for each dataset, regardless of its origin. This DOC entry is usually built by processing the information contained in the GEO "Series" files, as well as in the GSE "Samples" description part.
Data from GEO : semi-automatic dataset generation method
The semi-automatic procedure allows the direct generation of new CleanEx datasets from GEO.
GEO has a very specific and well-designed format, including the three following file types :
- "GPL" files : description of the platform used (chip description)
- "GSE" files : the series made (all the experiments corresponding to one dataset, or in other words one publication).
- "GSM" files : sample, containing the numerical values for individual experiments.
The series from GEO are stored in an in-house format called "soft". Each GSE soft file contains the above-mentioned information, namely the platform(s) used, the general information about the series, and the numerical values for each sample.
The procedure consists of the following main steps :
- Extract the series "soft" file
- Extract from the platform the correspondence between spots (features) and sequences (targets)
- Create the documentation entry from the information contained in the GSE file and from the individual sample descriptions.
- For all samples of the series, reformat the numerical values to adapt them to the CleanEx format (values are stored for each feature, and not for each experiment). Add the target name for each feature in each "exp" entry header line
- Add value scales in the DOC entry.
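The reformatting step above (from values stored per sample to values stored per feature) can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be an already parsed mapping of sample identifiers to per-feature values, and the function name, the sample names "GSM_A" and "GSM_B" and the feature names are hypothetical. It does not read real GEO "soft" files.

```python
def pivot_samples_to_features(samples):
    """Turn {sample_id: {feature_id: value}} into per-feature rows.

    Returns the sample order and a dict {feature_id: [value per sample]}.
    A feature missing from a sample gets None at that position.
    """
    sample_ids = list(samples)
    feature_ids = []
    seen = set()
    # Collect features in order of first appearance across samples.
    for values in samples.values():
        for feature_id in values:
            if feature_id not in seen:
                seen.add(feature_id)
                feature_ids.append(feature_id)
    table = {
        feature_id: [samples[sid].get(feature_id) for sid in sample_ids]
        for feature_id in feature_ids
    }
    return sample_ids, table


# Toy input with hypothetical sample and feature names.
samples = {
    "GSM_A": {"feature_1": 1.0, "feature_2": 2.0},
    "GSM_B": {"feature_1": 3.0},
}
order, per_feature = pivot_samples_to_features(samples)
print(order)
print(per_feature)
```

Each row of the result corresponds to what one "exp" data entry holds: the values of a single feature over all conditions.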
The EST dataset
The dataset generated from the EST counts needs a frequent update, as it is based on the Unigene database.
The EST dataset is an in silico expression dataset generated from a basic per-tissue split of ESTs from UniGene clusters according to the library from which they've been extracted (Figure 10). This allows EST counts in healthy and tumor specific tissues to be compared with results obtained via other expression experiment protocols.
The tissue split is based on the library classification from CGAP (Cancer Genome Anatomy Project) at the NCBI. The tissue-specific libraries from the CGAP, MGC and ORESTES projects can be classified as normal, precancer, or cancer. This type of classification is perfectly adapted to our need. The CGAP library classification contains fifty-five different tissue classes divided into three different histology classes. Amongst these tissue types, the classes chosen because they appear to contain a reasonable number of ESTs are the following :
- Colon cancer
- Colon normal
- Kidney cancer
- Kidney normal
- Lung cancer
- Lung normal
- Mammary gland cancer
- Mammary gland normal
- Skin cancer
- Skin normal
- Cell-line cancer
- Cell-line normal
- Other tissues cancer
- Other tissues normal
The main steps to generate the EST datasets are :
- Extract library identifier and full name, tissue type, tissue condition (tumor, normal) from CGAP.
- Extract the Unigene identifier and full name for each library from the Unigene library info.
- Classify ESTs found in Unigene according to their original library.
- Count all ESTs per tissue class, and then all ESTs per tissue class and per Unigene cluster.
- Generate the EXP file with one entry per Unigene cluster.
- The calculated relative amount of ESTs for each entry is given as TPMs (Tags per Million).
- Do all steps for both human and mouse data.
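The counting and normalisation steps above can be sketched in Python as follows. This is a minimal sketch, not the CleanEx implementation: it assumes that the TPM value of a cluster in a tissue class is the number of ESTs of that cluster in that class, divided by the total number of ESTs in the class, multiplied by one million. The function name and the cluster names are hypothetical.

```python
from collections import Counter


def est_counts_to_tpm(est_records):
    """Compute tags per million from (cluster_id, tissue_class) pairs.

    est_records holds one pair per EST. Assumed formula:
    TPM = 1,000,000 * ESTs(cluster, tissue) / ESTs(tissue).
    """
    per_tissue = Counter()
    per_cluster_tissue = Counter()
    for cluster_id, tissue in est_records:
        per_tissue[tissue] += 1
        per_cluster_tissue[(cluster_id, tissue)] += 1

    tpm = {}
    for (cluster_id, tissue), count in per_cluster_tissue.items():
        tpm.setdefault(cluster_id, {})[tissue] = (
            1_000_000 * count / per_tissue[tissue]
        )
    return tpm


# Toy data: four ESTs from "Colon normal" libraries, two hypothetical clusters.
records = [("cluster_A", "Colon normal")] * 3 + [("cluster_B", "Colon normal")]
print(est_counts_to_tpm(records))
```

Normalising by the total per tissue class makes classes with very different EST numbers comparable.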
[Figure: schematic description of the EST dataset update]
CleanEx_trg
The method used to build CleanEx_trg entries depends on the feature source given by the authors.
- For public cDNA clones, sequence accession numbers (RefSeq, EMBL/GenBank) and gene symbols, these links are established directly by associating the Unigene accession number, if it exists, to each given clone number or accession number.
- For oligonucleotides without direct links to Unigene (via a public accession number), the feature (nucleotide sequence) is first mapped to existing mRNA sequences. According to the mapping result, a quality tag is then associated to each feature. This quality measurement can be used to modulate expression data results according to the precision of the corresponding feature. The quality tag attribution, as well as the mapping procedure, depend on the feature type. They are detailed below.
Long clones
For long clones (e.g. Incyte clones), the re-mapping is done with megablast following these steps :
- Compare 3' and 5' end of the clones using megablast.
- Keep matches fulfilling these two criteria :
- Similarity > 95%
- Total alignment length >= (Original clone length-15)
- Assign quality score to each of the clones.
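The two filtering criteria listed above can be expressed as a small Python predicate. This sketch assumes that the megablast results have already been parsed into records; the `EndMatch` class, its field names and the clone and cluster names are hypothetical and do not reflect any particular megablast output format.

```python
from dataclasses import dataclass


@dataclass
class EndMatch:
    """One parsed match of a clone end against a reference (hypothetical record)."""
    clone_id: str
    cluster_id: str
    similarity_percent: float
    alignment_length: int
    query_length: int  # length of the original clone sequence that was compared


def passes_criteria(match, min_similarity=95.0, max_missing=15):
    """Keep a match if similarity > 95% and
    alignment length >= original length - 15."""
    return (
        match.similarity_percent > min_similarity
        and match.alignment_length >= match.query_length - max_missing
    )


matches = [
    EndMatch("clone_1", "cluster_A", 98.2, 490, 500),  # kept
    EndMatch("clone_2", "cluster_B", 95.0, 500, 500),  # rejected: not > 95%
    EndMatch("clone_3", "cluster_C", 99.0, 480, 500),  # rejected: too short
]
print([m.clone_id for m in matches if passes_criteria(m)])
```

The similarity threshold is strict (greater than 95%), while the length threshold is inclusive, following the wording of the two criteria.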
The assignment of the quality tag for long clones follows these rules :

| Quality | Criteria |
| --- | --- |
| High | Both 3' and 5' ends of the clone are available and match the same Unigene cluster. |
| Medium | Either 3' or 5' ends of the clone are available and give a statistically significant result. |
| Medium | Both 3' and 5' ends of the clone are available, but only one is statistically significant. |
| Low | No statistically significant results have been found. |
| Low | Both ends of the clone match different genes. |
| Unknown | The sequence is not yet available. |

Affymetrix probe sets
To give access to the precise position of Affymetrix probes, the Affymetrix mapping is done on the individual probes for each probe set, and not on the consensus sequence given in the available annotation files.
To avoid noise in the results, these short sequences (Affymetrix probes, SAGE and MPSS tags) are mapped via an in-house developed program called tagger, which generates a list of only perfect matches on the reference sequence database. The mapping is done as follows :
- Extract individual probes from an Affymetrix chip
- Use Tagger to map these probes on mRNA sequence databases (RefSeq, mRNA section of EMBL, HTC section of EMBL). Note that, as a supplementary file, the access to the probes mapping results on the dbEST database is available from the CleanEx ftp server.
- Cluster the mapping results in Affymetrix probe sets
- From the three reference databases (RefSeq, HTC and mRNA), keep the optimal results.
- Apply a quality criterion to each probe set.
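The idea of keeping only perfect matches and then grouping them by probe set can be illustrated with the naive Python sketch below. It is not the tagger program, whose algorithm is not described here: it simply searches each probe as an exact substring of each reference sequence, on the forward strand only. All names and sequences in the example are hypothetical.

```python
from collections import defaultdict


def map_probes_exactly(probes, references):
    """Find perfect matches of probes in reference sequences.

    probes:     {(probe_set, probe_index): probe_sequence}
    references: {reference_id: reference_sequence}
    Returns {probe_set: {probe_index: [(reference_id, position), ...]}}
    with 1-based start positions. Forward strand only (an assumption
    made to keep the sketch short).
    """
    hits = defaultdict(dict)
    for (probe_set, index), probe in probes.items():
        found = []
        for ref_id, ref_seq in references.items():
            start = ref_seq.find(probe)
            while start != -1:
                found.append((ref_id, start + 1))
                start = ref_seq.find(probe, start + 1)
        hits[probe_set][index] = found
    return dict(hits)


references = {"ref_1": "AAACGTACGTTT"}
probes = {("probe_set_1", 1): "ACGT", ("probe_set_1", 2): "GGGG"}
print(map_probes_exactly(probes, references))
```

Probes without any hit keep an empty list, so that a later quality step can count how many probes of a probe set matched nothing.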
The assignment of the quality tag for Affymetrix probe sets follows these rules :

| Quality | Criteria |
| --- | --- |
| High | A maximum of two Unigene identifiers match the probes of the probeset. All probes of the probeset match both Unigene identifiers. |
| Medium | A maximum of 4 Unigene identifiers match the probes of the probe set. In addition, a maximum of 3 "errors" were permitted. Errors were defined as probes that matched nothing, probes that failed to match a Unigene identifier or probes that matched an additional Unigene identifier. |
| Low | Anything below the two preceding criteria. |
| Unknown | Absolutely no match on the selected mRNA databases was found for all the probes of the probe set. |

[Figure: detailed chart of the Affymetrix mapping procedure]
The updated annotation files are available on the SIB ftp server.
Each subdirectory contains organism-specific chip annotation files corresponding to the mapping on RefSeq, HTC, RNA and EST. For the EST database matches, note that, to discard the matches on the wrong DNA strand, the mapping on ESTs is done via a supplementary check of the EST orientation. The first check is extracted from the Unigene EST direction annotation. This information is then cross-checked and corrected via the EST orientation found in the in-house transcriptome project called "trome". Thus, ESTs with tags 5' or 3' are accessible in two different files, respectively flanked with the extension "_PLUS" or "_MINUS".
Each line of these mapping files contains one match for one individual probe. Supplementary information included, if known, is :
- UniGene accession number
- Gene symbol
- Entrez GeneID
SAGE and MPSS tags
The clustering of SAGE, LONGSAGE, MPSS, LONGMPSS tags together with reference sequence databases is provided at the SIB via the trome project.
To speed up the database update process, the SAGE tags mapping is done on a trome-based pre-filtered reference sequence database. The procedure then follows the Affymetrix one, and the tagger program is used to extract exact mapping positions on this filtered reference database. The quality criteria given for each individual tag follow the same rules as the ones applied for Affymetrix, but are adapted for single tags, and not probe sets.
In the CleanEx_trg final file, all of the information is kept for the users.
As the CleanEx database does not contain only the most 3' end tags, the tag position on the reference sequence is given, which may also help in deciding whether to keep or to discard a suspicious tag.
[Figure: schematic view of the SAGE mapping procedure]
CleanEx
For each new Unigene release, the CleanEx files to be updated (namely CleanEx_trg and CleanEx) are rebuilt from scratch via this procedure :
- The CleanEx_trg files are rebuilt for each target type (clones, Affymetrix, SAGE..) according to the methods described above.
- Expression entries in CleanEx_exp are linked to their respective targets in CleanEx_trg
- CleanEx, the link file, is then rebuilt via a step-by-step data integration system.
- From the new Unigene release, extract Unigene cluster accession number, Gene description, Entrez GeneID, Locus position, RefSeq sequences and Gene symbol.
- From the EMBL database, extract the mRNAs list, and link them with their corresponding Unigene cluster.
- Extract genomic position of the transcription start site from Entrez GeneID genomic annotation file. This information will be used instead of the EPD transcription start site position whenever the latter is unavailable.
- From the Swissprot database, extract the Swissprot accession number and identifier, as well as the corresponding gene symbol and Entrez GeneID.
- From the EPD database, extract the EPD accession number, the EPD identifier, as well as the corresponding Swissprot accession number. Merge the EPD information to the Swissprot file.
- From the officially approved list of gene symbols, extract the database accession number and Entrez GeneID.
- Link all these above fields to the expression data entries via CleanEx_trg files
[Figure: schematic view of the build steps for the CleanEx database]
CleanEx : FORMAT CONVENTIONS
CleanEx main file format
A CleanEx entry contains the following information :
- Gene name description and localisation
- Corresponding RNA sequences found in EMBL
- Cross-references to other databases
- Cross-references to available expression data
CleanEx entries are presented in a similar format as EMBL and SWISS-PROT sequence entries. Each line starts with a line code identifying the type of information presented. The current line types and line codes and the order in which they appear in an entry, are shown below:
- ID : IDentification.
- DE : DEscription.
- ON : Old gene Name.
- RNA : RNA sequence in EMBL.
- DR : Databases crosslinks.
- EXP : EXPression cross-references.
- // : Termination line.
Spacer lines (XX) are inserted in order to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72.
Below is an example of an entry:
ID HS_FN1 2q34.
XX
DE fibronectin 1.
ON none.
XX
RNA EMBL; AF130095.1; AF130095.
RNA EMBL; AF312399.1; AF312399.
RNA EMBL; AJ276395.1; HSA276395.
RNA EMBL; AJ320525.1; HSA320525.
RNA EMBL; AJ320526.1; HSA320526.
RNA EMBL; AJ320527.1; HSA320527.
RNA EMBL; BC005858.1; BC005858.
RNA EMBL; M10905.1; HSFNC.
RNA EMBL; M27589.1; HSFNPFH1.
RNA EMBL; M27590.1; HSFNPFHL1.
RNA EMBL; U41724.1; U41724.
RNA EMBL; U41850.1; U41850.
RNA EMBL; U42404.1; U42404.
RNA EMBL; U42455.1; U42455.
RNA EMBL; U42456.1; U42456.
RNA EMBL; U42457.1; U42457.
RNA EMBL; U42458.1; U42458.
RNA EMBL; U42592.1; U42592.
RNA EMBL; U42593.1; U42593.
RNA EMBL; U42594.1; U42594.
RNA EMBL; U60067.1; U60067.
RNA EMBL; U60068.1; U60068.
RNA EMBL; X02761.1; HSFIB1.
XX
DR Entrez GeneID; 2335.
DR Unigene; Hs.339722.
DR MIM; 135600.
DR Genew; HGNC:3778; FN1.
DR RefSeq; NM_002026.
DR RefSeq; NM_054034.
DR SWISSPROT; P02751; FINC_HUMAN.
DR EPD; EP16038; HS_FINC.
XX
EXP HSEST; HSEST_FN1; NM_002026.
EXP LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP LYMPHOMA1; L0001_16112; IMAGE_139009.
EXP LYMPHOMA1; L0001_17791; IMAGE_139009.
EXP NCI60; NCI60_136798; IMAGE_136798.
EXP NCI60; NCI60_151144; IMAGE_151144.
EXP NCI60; NCI60_512275; IMAGE_512275.
EXP NCI60; NCI60_512287; IMAGE_512287.
EXP PEROU1; P0001_139009; IMAGE_139009.
EXP PEROU1; P0001_268091; IMAGE_268091.
EXP PEROU1; P0001_269203; IMAGE_269203.
EXP PEROU1; P0001_296556; IMAGE_296556.
EXP PEROU1; P0001_60846; IMAGE_60846.
EXP ROSETTA; R0001_20907; RNA_X02761.
EXP SERUM1; S0001_136798; IMAGE_136798.
EXP SERUM1; S0001_151144; IMAGE_151144.
EXP SERUM1; S0001_512275; IMAGE_512275.
EXP SERUM1; S0001_512287; IMAGE_512287.
//
A detailed description of each line type is given below.
The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID GENE_NAME genetic_locus.
- GENE_NAME is the species code followed by the gene identifier which obeys the Human Gene Nomenclature rules
- The genetic_locus field is the cytogenetic location of the gene. It is cross-linked with the NCBI's genome map viewer
The ID line is terminated by a period.
The DE line
DE fibronectin 1.
The description lines contain general descriptive information about the gene. It is extracted from the corresponding Unigene entry. The description is given in ordinary English and is free-format. In some cases, more than one DE line is required; in this case, the text is divided only between words. The last DE line is terminated by a period.
The ON line
ON STGD1, ABCR, RP19, STGD.
The ON line describes the history of the gene nomenclature. It lists all the previous gene symbols which have been attributed to the specific gene.
The RNA line
RNA EMBL; M27590.1; HSFNPFHL1.
It contains cross-references to the mRNA entries for this gene. These mRNAs are found in the EMBL database. The RNA lines can refer to partial mRNAs.
The format of this line is given below :
RNA EMBL; EMBL_SV; EMBL_ID.
- EMBL_SV is the EMBL sequence version number.
- EMBL_ID is a secondary identifier or name for the EMBL entry.
The line is terminated by a period.
The DR lines
The DR lines contain cross-references to entries from other databases. So far, we have incorporated links to SWISS-PROT, Entrez GeneID, RefSeq, Unigene, GeneCards and EPD. The precise format of these lines depends on the target database.
The format of the DR line is shown by the following examples :
DR Entrez GeneID; 2335.
DR Unigene; Hs.339722.
DR MIM; 135600.
DR Genew; HGNC:3778; FN1.
DR RefSeq; NM_002026.
DR RefSeq; NM_054034.
DR SWISSPROT; P02751; FINC_HUMAN.
DR EPD; EP16038; HS_FN1.
- The first item on the DR line is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:
- Entrez Gene : A single query interface to curated sequence and descriptive information about genetic loci.
- Unigene : The gene cluster database from NCBI.
- MIM : The Mendelian Inheritance in Man Database, a catalog of human genes and genetic disorders.
- Genew : The Human Gene Nomenclature Database providing data for all human genes which have approved symbols.
- RefSeq : The NCBI Reference Sequence project.
- SWISSPROT : Protein sequence database.
- EPD : The eukaryotic promoter database.
- The second item is the primary accession number (or an equivalent unique identifier of another data bank) of the entry to which reference is made.
- The third item (if it exists) is a secondary identifier or name for the cross-referenced database entry. For Genew, this number is the HGNC (Hugo Gene Nomenclature Committee) identifier.
The EXP line
The EXP line contains cross-references to the publicly available human data on gene expression. An exhaustive list of datasets already integrated in CleanEx is available HERE.
Currently, the different data types considered for integration in CleanEx are :
- Stanford cDNA arrays.
- Nylon membrane arrays.
- Affymetrix oligoarrays.
- Other oligoarrays (Incyte, Resgen, Rosetta).
- EST counts per tissue category and per gene.
- SAGE experiments.
- MPSS experiments.
The format of the EXP line is shown by the following examples.
EXP HSEST; HSEST_FN1; NM_002026.
EXP LYMPHOMA1; L0001_15953; IMAGE_139009.
EXP NCI60; NCI60_136798; IMAGE_136798.
EXP PEROU1; P0001_139009; IMAGE_139009.
EXP ROSETTA; R0001_20907; RNA_X02761.
EXP SERUM1; S0001_136798; IMAGE_136798.
EXP AFFY001; AFFY001_1575_at; AFFY_HC-G110_1575_at.
- The first field of the EXP line is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:
- PEROU1 for the data from Perou et al
- SERUM1 for the data from Iyer et al
- NCI60 for the data from Ross et al
- LYMPHOMA1
- ROSETTA
- AFFY001
- HSEST for the data from EST counts.
- The second field is the local identifier of the corresponding expression entry. It is built with the local code for the dataset (HSEST_ for ESTs, P0001_ for Perou and S0001_ for SERUM1...) followed by the clone number used for the experiment (mostly Image clone numbers). For the EST dataset, we used the HUGO official name.
- The third field is the reference to the corresponding CleanEx_trg entry.
Users can visualize available expression data about the sequence given in the EXP line in two ways :
- First by using (if it exists) the link to the web site provided by the original authors of the expression dataset.
- Second by using a local visualiser. A few explanations about this display are given in Appendix B.
The // line
The // (terminator) line contains no data or comments. It designates the end of an entry.
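The line-code conventions described in this section lend themselves to simple parsing. The following Python sketch reads one entry into a dictionary keyed by line code. It is a minimal illustration rather than an official parser: the function name is hypothetical, it assumes that fields are separated by ";" and that a line ends with a period, and it does not join multi-line DE descriptions.

```python
def parse_cleanex_entry(text):
    """Parse one CleanEx entry into {line_code: [list of fields per line]}.

    Spacer lines (XX) are skipped and parsing stops at the terminator (//).
    """
    entry = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("XX"):
            continue
        if line.startswith("//"):
            break
        code, _, rest = line.partition(" ")
        # Drop the terminating period, then split the fields on ";".
        fields = [f.strip() for f in rest.strip().rstrip(".").split(";")]
        entry.setdefault(code, []).append(fields)
    return entry


# Shortened version of the HS_FN1 example entry shown above.
sample = """\
ID HS_FN1 2q34.
XX
DE fibronectin 1.
ON none.
XX
RNA EMBL; X02761.1; HSFIB1.
XX
DR Unigene; Hs.339722.
DR RefSeq; NM_002026.
XX
EXP HSEST; HSEST_FN1; NM_002026.
EXP NCI60; NCI60_136798; IMAGE_136798.
//
"""
entry = parse_cleanex_entry(sample)
gene_id, locus = entry["ID"][0][0].split()
print(gene_id, locus)
for dataset, local_id, target in entry["EXP"]:
    print(dataset, local_id, target)
```

Line codes that occur several times (RNA, DR, EXP) are kept as lists, which preserves their order in the entry.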
CleanEx_trg
Each CleanEx_trg entry corresponds to one "target" (or "expression feature") used in an expression measurement experiment. Identifiers are composed of a code which describes the target type followed by an underscore and the target accession number. Types could be, for example, IMAGE clone (IMAGE), Affymetrix probeset (AFFY), SAGE tags (SAGE), or EMBL RNA or DNA sequences (RNA,DNA).
The format of CleanEx_trg resembles that of CleanEx. Each CleanEx_trg entry contains the following information :
- ID CleanEx_trg ID
- OS Organism Species
- GC Gene Count
- GN Official HUGO Gene Symbol
- OA Original Annotation (if existing)
- QU QUality
- SR Sequence Reference
- FN Feature Number
- UG UniGene release
- F1 Feature
- DR CleanEx_ref ID
- //
Below is an example of an entry for an Affymetrix probe set :
ID AFFY_HC-G110_1575_at Type=Affy_Tag
OA M14758; HUMMDR1 Human P-glycoprotein (MDR1) mRNA; complete cds.
OS Homo sapiens (human).
GN ABCB1
GC 1
QU High
SR Unigene=Hs.21330;
FM Tag;
FN 16
UG UniGene Build #160
F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F2 AAAGCGCCAGTGAACTCTGACTGTA:284-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F3 GCGCCAGTGAACTCTGACTGTATGA:285-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F4 CCAGTGAACTCTGACTGTATGAGAT:286-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F5 TTAACATTTCCTCAGTCAAGTTCAG:287-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F6 ACATTTCCTCAGTCAAGTTCAGAGT:288-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F7 TTTCCTCAGTCAAGTTCAGAGTCTT:289-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F8 CCTCAGTCAAGTTCAGAGTCTTCAG:290-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F9 AGACATCATCAAGTGGAGAGAAATC:291-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F10 ATTTTCCCATTTGGACTGTAACTGA:292-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F11 TTCCCATTTGGACTGTAACTGACTG:293-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F12 CCATTTGGACTGTAACTGACTGCCT:294-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F13 TTTGGACTGTAACTGACTGCCTTGC:295-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F14 TAACTGACTGCCTTGCTAAAAGATT:296-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F15 CTGACTGCCTTGCTAAAAGATTATA:297-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F16 ACTGCCTTGCTAAAAGATTATAGAA:298-105; Refseq=NM_000927(+); Unigene=Hs.21330;
DR AFFY001_1575_at;
//
Description of the line formats :
The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID TRG_ID Type
- TRG_ID is the internal identifier for the entry. The first part of the ID is a target type identifier. The second part is built with the original target name (image clone identifier, Affymetrix chip and probeset name,...)
- The Type field is a description of the target's provenance. Type could be for example "Seq_Ref" (for a sequence in EMBL or in RefSeq), "cDNA_clone", "Affy_Tag", "SAGE_Tag"...
The OA line
OA X60188; Human ERK1 mRNA for protein serine/threonine kinase
This line contains either the target's Original Annotation found in the corresponding description files, for example the Affymetrix chips annotation, or the description of the sequence given in the corresponding EMBL entry. It exists only for CleanEx_trg entries corresponding to Affymetrix tags.
The GN line
GN TIE
The GN line lists the official gene symbols which correspond to that entry. If more than four genes match the target, only the first four are listed.
The GC line
GC 1
The GC line gives the total count of genes having an approved symbol which match that target entry.
The QU line
QU High
The QU line is the quality tag based on the precision of the mapping of the target (see CleanEx build for details).
The SR line
SR Unigene=Hs.21330;
The SR line stands for Sequence Reference and gives the associated Unigene Cluster for the whole target.
The FM line
FM Tag;
This line describes the format of the features for the target.
The FN line
FN 16
The FN line gives the number of features belonging to that target. For cDNA clones, this number is typically one. For Affymetrix probesets, it can vary from eleven to twenty-five.
The UG line
UG UniGene Build #160
The UG line shows the Unigene Release which has been used to map the target sequences to their corresponding cluster.
The F1-F25 lines
F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
These lines show the individual mapping for all the features of the corresponding target. Fields are separated by a ";". The first field is the name of the feature. The second field contains the RefSeq accession number of the sequences which map the feature. The sign in parentheses indicates whether the tag mapped on the positive or on the negative strand of the RefSeq sequence. The last field shows the Unigene clusters to which the RefSeq sequences are associated.
The DR line
DR AFFY001_1575_at;
DR lines in CleanEx_trg are crosslinks to the expression data found in CleanEx under the line type "EXP". Link is done via the expression data local identifier.
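The feature lines (F1-F25) described in this section can be split into their fields with a few lines of Python. The sketch below is an illustration only: the function name is hypothetical, the first field is kept as an opaque feature name (its internal structure is not interpreted), and only the "Refseq=" and "Unigene=" keys seen in the example entry are handled.

```python
import re

# Accession followed by the strand sign in parentheses, e.g. NM_000927(+)
_REFSEQ = re.compile(r"^(?P<acc>[^()]+)\((?P<strand>[+-])\)$")


def parse_feature_line(line):
    """Parse one F line of a CleanEx_trg entry into a dictionary."""
    code, rest = line.split(None, 1)
    fields = [f.strip() for f in rest.split(";") if f.strip()]
    feature = {"code": code, "name": fields[0], "refseq": [], "unigene": []}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key == "Refseq":
            match = _REFSEQ.match(value)
            if match:
                feature["refseq"].append((match["acc"], match["strand"]))
        elif key == "Unigene":
            feature["unigene"].append(value)
    return feature


line = ("F1 TGTCCAGGCTGGAACAAAGCGCCAG:283-105; "
        "Refseq=NM_000927(+); Unigene=Hs.21330;")
print(parse_feature_line(line))
```

Keeping RefSeq and Unigene values as lists allows for features that map to more than one sequence or cluster.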
CleanEx entry search engines and viewers
- Querying CleanEx
- CleanEx expression data viewer
- CleanEx target entry viewer
- Batch search for CleanEx_trg
CleanEx entry queries and viewer
The CleanEx and CleanEx_trg data types are accessible either as flat files on the ftp server (a detailed explanation of the directory content can be found in the README file of this directory), or via a web-based entry search and retrieval system at : http://www.cleanex.isb-sib.ch.
The CleanEx_exp data can be accessed only via the search tools which are explained in the Data Extraction page of this tutorial.
Entry quick search
Retrieves CleanEx entries via their main identifiers. CleanEx identifiers are built by concatenating the organism code and the official up-to-date gene symbol. This tool is the fastest way to access data, but one needs to know the official gene symbol of the searched gene. This tool also works with partial gene names, meaning that typing for example "FN1" will not only retrieve HS_FN1 and MM_FN1, but also, for example, MM_ANKFN1, MM_SLFN10, or HS_MFN1.
Once the list is shown, one can choose the desired entry as well as the output format (either HTML or text).
Example :
Typing "fn1" in the quick search box will lead to the Quick Search Results page.
Now selecting "HS_FN1" and "CleanEx entry (NICE view)" output on this page will give the individual HS_FN1 entry from CleanEx.
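The partial-name behaviour of the quick search can be pictured with a short Python sketch. It is only an approximation of what the text describes: the function name and the small hand-made index are assumptions for this example, and the real search engine may work differently.

```python
def quick_search(query, identifiers):
    """Return the identifiers whose gene-symbol part contains the query (case-insensitive)."""
    needle = query.upper()
    hits = []
    for identifier in identifiers:
        # An identifier is the organism code and the gene symbol joined by "_".
        _organism, _, symbol = identifier.partition("_")
        if needle in symbol.upper():
            hits.append(identifier)
    return hits


# A tiny, hand-made index using the identifiers mentioned in the text.
index = ["HS_FN1", "MM_FN1", "MM_ANKFN1", "MM_SLFN10", "HS_MFN1", "HS_TP53"]
print(quick_search("fn1", index))
```

With this index, every identifier except HS_TP53 is returned, because each of the others contains "FN1" somewhere in its gene symbol.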
Entry browser
The entry browser can be filled with information as diverse as gene name, description, Unigene accession number, organism, RefSeq sequence, Swissprot or EPD identifiers, or even the clone accession numbers, or the expression experiment's identifiers.
Fields can be combined with limiting (AND, BUT NOT) or expanding (OR) operators. The entries which match the whole expression, with its combined fields and operators, will be selected.
As the search is done on the whole file, it is much slower than the quick search system.
Entry viewer
The HTML entry viewer contains all the information described in the Format convention description part of this tutorial.
In addition, the following tools are provided :
- A direct access to the promoter sequence ("Extract corresponding genomic sequence" button). The user can select the length of the sequence that has to be extracted
- A direct access to the list of all associated "features" ("View list of associated clones or tags" button). From this link, one can select the feature types that one wants to retrieve. The output list shows, for each feature, the gene symbol, the Unigene cluster, the sequence reference from the nucleotide databases, the feature type, and the CleanEx quality tag for this feature. A direct link to each target entry is also provided
- For each expression data measurement linked to this gene, a direct link to the expression data viewer for independent clones mapped to this gene
- For each expression dataset, a link to a parallel view of the expression of all the clones mapped to this gene in this dataset (the "view all dataset_name probes at once" button for each dataset line)
CleanEx expression viewers
The expression viewer depends on the type of the selected experiment.
- Dual channel data are represented in a "Treeview-like" display, from green to red, meaning under- and overexpression respectively. If the original dataset contains measurements for the two individual channels, a second display, showing the superposition of both channels, is shown. This gives an idea of the intensity level of the spot, and corresponds to the reconstructed image of the chip with both scanned values shown together. Flagged spots are displayed in grey, missing values in white
- Affymetrix data are colored from blue to pink, according to the first results published by Alon. If available, the "Absent/Present call" tag is shown in the experiment result
- Counts data (ESTs, SAGE, MPSS) are displayed on a grayscale basis
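The dual channel colour scheme described above can be sketched in Python as follows. This is a minimal illustration, not the viewer's actual code: the function name, the use of a log ratio as input, the clipping limit `max_abs` and the exact RGB values are all assumptions made for this example.

```python
def ratio_to_rgb(log_ratio, flagged=False, max_abs=3.0):
    """Map a log ratio to an (R, G, B) tuple in a Treeview-like green/red scheme."""
    if flagged:
        return (128, 128, 128)   # flagged spot: grey
    if log_ratio is None:
        return (255, 255, 255)   # missing value: white
    # Clip to [-max_abs, max_abs] and convert to an intensity between 0 and 255.
    clipped = max(-max_abs, min(max_abs, log_ratio))
    intensity = round(255 * abs(clipped) / max_abs)
    if clipped < 0:
        return (0, intensity, 0)  # under-expression: shades of green
    return (intensity, 0, 0)      # over-expression: shades of red (black if no change)


for value in (-3.0, -1.5, 0.0, 1.5, 3.0, None):
    print(value, ratio_to_rgb(value))
```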
CleanEx target entry viewer
The CleanEx target entries can be retrieved individually with the same search engines as the CleanEx entries, namely the "CleanEx Target quick search" and the Target browser.
As for the CleanEx viewer, the CleanEx_trg entry viewer gives access to all the fields described in the format description, as well as a direct link to the expression data viewer for the associated CleanEx_exp entries.
For locally mapped targets, the exact position of all the tags on the reference sequence is provided, as well as a link to the SIB "TagScan" system, which gives the tag position on the genome sequence.
The tag position on the mRNA sequence can thus be used for example to check the SAGE tags distance from the 3' end of the gene.
For Affymetrix, this position could help to solve two problems :
- First, it makes it possible to check the real length of the gene region which is spanned by the individual probes of one probeset, as these probes sometimes overlap
- Second, this is a way of understanding expression differences between two probe sets designed for the same gene. Are these probe sets in the same region, or is it possible that they belong to two different transcript variants, for example ?
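The two position checks mentioned above come down to simple arithmetic on the mapped coordinates. The following Python sketch shows one way of doing it, assuming 1-based inclusive coordinates on the mRNA sequence. The function names and the coordinates are made up for this example.

```python
def distance_from_3prime(tag_end, mrna_length):
    """Number of bases between the end of a tag and the 3' end of the mRNA."""
    return mrna_length - tag_end


def probeset_span(probe_positions):
    """Length of the mRNA region covered by a probe set, given (start, end) pairs."""
    starts = [start for start, _ in probe_positions]
    ends = [end for _, end in probe_positions]
    return max(ends) - min(starts) + 1


# Made-up coordinates for three 25-mer probes, two of which overlap.
probes = [(100, 124), (110, 134), (300, 324)]
print(probeset_span(probes))            # region covered on the mRNA
print(distance_from_3prime(324, 2000))  # bases left after the last probe
```

Because probes can overlap, the span is computed from the outermost coordinates rather than by summing the probe lengths.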
CleanEx target batch search
The Batch Search page is meant to help users determine what kind of genes correspond to their identifiers.
It differs from a single search at the NCBI, for example, in two main ways :
- First, it retrieves information for a large number of sequences at once
- Second, this retrieval system can deal with heterogeneous identifiers. The input can thus mix RefSeq, Unigene, or EMBL/GenBank accession numbers, as well as gene symbols or Entrez GeneIDs.
The output of the batch query shows, for each given identifier, the list of associated "features", the feature type, and the CleanEx target quality tag.
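To give an idea of what handling heterogeneous identifiers involves, the following Python sketch guesses the type of each identifier in a mixed input list. The patterns are simple heuristics chosen for this example and are an assumption, not the rules used by the Batch Search page. Anything that matches no pattern is treated as a gene symbol.

```python
import re

# Heuristic patterns, tried in order; they are illustrative only.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),          # e.g. NM_000927
    ("Unigene", re.compile(r"^[A-Z][a-z]{1,2}\.\d+$")),         # e.g. Hs.21330
    ("Entrez GeneID", re.compile(r"^\d+$")),                    # digits only
    ("EMBL/GenBank", re.compile(r"^[A-Z]{1,2}\d{5,6}(\.\d+)?$")),
]


def classify(identifier):
    """Guess the type of one identifier; fall back to 'Gene symbol'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "Gene symbol"


batch = ["NM_000927", "Hs.21330", "2335", "X02761", "FN1"]
for identifier in batch:
    print(identifier, classify(identifier))
```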
CleanEx expression data retrieval systems
- The MeSH-oriented expression data retrieval system
- The keywords-based expression data retrieval system
- Extracting expression data numerical values
- Finding common genes in different datasets
Finding expression datasets via the MeSH annotation
The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary produced by the National Library of Medicine and used for indexing, cataloging, and searching for biomedical and health-related information and documents.
MeSH descriptors are arranged in a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Organisms". More specific headings are found at more narrow levels of the eleven-level hierarchy, such as "Monocytes", "Kidney Tubules" or "Leukemia, Lymphocytic, Acute".
The MeSH thesaurus is used by NLM for indexing articles for the MEDLINE/PubMED database. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
CleanEx is the first gene expression database which provides MeSH-oriented search tools.
Each individual experiment in all the datasets included in CleanEx has been annotated with MeSH controlled terms via a semi-automatic process. From this hierarchical controlled annotation system, new search tools have been developed, which give rapid access to expression data having a certain biological or medical specificity. One can thus, for example, easily retrieve all the datasets containing expression measurements for "Breast AND Neoplasms" samples.
This search technique is much more precise than a basic "free-text" search in the experiments' annotations provided by the authors in the GEO SERIES files.
The proposed search tools, described below, can retrieve either :
- Whole expression datasets
- Individual experiments coming from heterogeneous datasets
Datasets retrieval using the MeSH hierarchical annotation
The MeSH-oriented dataset selection and numerical extraction module is a tool which goes step-by-step in the MeSH annotation tree to find the CleanEx datasets which correspond to the selected part of the tree.
One can select more than one branch at a time, for example "Anatomy" and "Diseases". Once the tree branches have been selected, one then chooses between the two following options in the "Select next step" part :
- Either go down one more level in the annotation tree to refine the search
- Or extract the corresponding data in CleanEx
If the first option is selected, a new search is performed on the selected branches, and the corresponding sub-branches are shown in the following step. This operation can be repeated until the correct precision has been reached.
To improve the next branch selection, clicking on the MeSH term itself will open a new window, which shows the corresponding MeSH subtree.
Once the correct level has been reached, selecting the second option will extract all the datasets which contain the MeSH terms that have been kept. This also includes datasets in which only one experiment corresponds to the given criteria.
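The branch selection described above can be sketched in Python using MeSH tree numbers, where a term lies below a branch when its tree number starts with the tree number of that branch. This is only an illustration under stated assumptions: the dataset names and annotations are invented, the tree numbers are given as examples only, and the function names do not come from CleanEx.

```python
def under_branch(tree_number, branch):
    """True if tree_number equals the branch or lies below it in the MeSH tree."""
    return tree_number == branch or tree_number.startswith(branch + ".")


def select_datasets(annotations, branches):
    """Keep datasets with at least one experiment annotated under any selected branch."""
    selected = []
    for dataset, experiments in annotations.items():
        if any(
            under_branch(number, branch)
            for tree_numbers in experiments.values()
            for number in tree_numbers
            for branch in branches
        ):
            selected.append(dataset)
    return sorted(selected)


# Invented dataset names and annotations; the tree numbers are illustrative only.
annotations = {
    "DATASET_A": {"exp1": ["A01.236", "C04.588.180"], "exp2": ["A01.236"]},
    "DATASET_B": {"exp1": ["A05.810.453"]},
}
print(select_datasets(annotations, ["C04"]))
print(select_datasets(annotations, ["A05", "C04"]))
```

A dataset is kept as soon as a single experiment matches, which mirrors the behaviour described in the text.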
The following step shows the selected CleanEx datasets with a brief description of their content. One can then select one of these datasets for numerical data extraction by clicking on the dataset's identifier. This will display a more precise description of the dataset's content, including :
- Experiments number
- Total Features number
- Channel number
- Experiment type
- Brief description of the numerical fields in the dataset
- Description of all the individual experiments in this dataset
At that stage, one has to select the numerical field to extract for the matrix generation of the expression values. This is especially useful for dual-channel experiments, as some people might want to work with one channel only, and others might want to use the experiment/reference ratio. One can also discard some experiments of this dataset by unchecking them.
The data extraction tool then provides access to three different files.
- The "matrix" file contains the numerical values. Each row represents one experiment, and each column is one feature.
- The "experiments" file contains the detailed description of the experiments, as given in the previous page. Each line represents one experiment. One line is divided into three fields. The first one is just an experiment counter, but it keeps track of the original experiment number in the CleanEx dataset (the number shown in parentheses after the experiment counter). The second field is the experiment's short name, as given for example in GEO, and the third one is the text description of the experiment.
- The "feature" file is the features description file. Each line is one feature, ordered as the columns in the matrix file. The different line fields are : feature counter, feature name, corresponding CleanEx target identifier and feature text description, respectively.
These file formats, especially the numerical matrix, can be directly imported into data analysis software, for example the R expression data analysis packages, or the online EPCLUST tool by Jaak Vilo.
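The relationship between the three files can be illustrated with a short Python sketch. It assumes tab-separated fields, which is not stated in this manual, and all the file contents below (experiment names, feature names, target identifiers, values) are invented for the example.

```python
import csv
import io

# Invented file contents; tab-separated fields are an assumption.
matrix_text = "1.2\t0.4\t3.1\n0.9\t0.5\t2.8\n"
experiments_text = "1 (4)\tGSM0001\tnormal tissue\n2 (7)\tGSM0002\ttumor tissue\n"
features_text = (
    "1\tprobe_a\tTRG_0001\tfirst feature\n"
    "2\tprobe_b\tTRG_0002\tsecond feature\n"
    "3\tprobe_c\tTRG_0003\tthird feature\n"
)


def read_rows(text):
    """Split tab-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


matrix = [[float(value) for value in row] for row in read_rows(matrix_text)]
experiments = read_rows(experiments_text)
features = read_rows(features_text)

# Rows of the matrix are experiments, columns are features.
assert len(matrix) == len(experiments)
assert all(len(row) == len(features) for row in matrix)


def value(experiment_name, feature_name):
    """Look up one measurement by experiment short name and feature name."""
    row = [fields[1] for fields in experiments].index(experiment_name)
    column = [fields[1] for fields in features].index(feature_name)
    return matrix[row][column]


print(value("GSM0002", "probe_c"))
```

The lookup works because the lines of the "experiments" file follow the row order of the matrix, and the lines of the "feature" file follow its column order.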
Individual experiment retrieval
The first steps to retrieve experiments from heterogeneous datasets, using the MeSH-oriented data selection and extraction module, are identical to those of the dataset retrieval system. The difference is that the number of corresponding experiments, and not datasets, is given for each tree branch.
At the data extraction stage, a new intermediate page makes it possible to refine the search by joining the selected MeSH terms with different operators. For example, one can discard all experiments annotated as "Neoplasms" by linking the terms with the "BUT NOT" or with the "AND" operator, or one could select data from "Colon" "OR" "Kidney" to keep both biological classes.
The important point to remember here is that if you want to discard one biological class by using the "BUT NOT" operator, you have to select this class via the MeSH-oriented tool from the beginning of your analysis.
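The effect of the three operators can be pictured as set operations on the experiments annotated with each MeSH term. The Python sketch below is an illustration only: the function name and the experiment identifiers are invented, and nothing is implied about how CleanEx implements the operators.

```python
def combine(first, operator, second):
    """Combine two sets of experiment identifiers with one of the three operators."""
    if operator == "AND":
        return first & second
    if operator == "OR":
        return first | second
    if operator == "BUT NOT":
        return first - second
    raise ValueError("unknown operator: " + operator)


# Invented experiment identifiers for each selected MeSH term.
annotated = {
    "Prostate": {"exp1", "exp2", "exp3", "exp4"},
    "Neoplasms": {"exp3", "exp4", "exp9"},
    "Colon": {"exp5"},
    "Kidney": {"exp6", "exp7"},
}

normal_prostate = combine(annotated["Prostate"], "BUT NOT", annotated["Neoplasms"])
colon_or_kidney = combine(annotated["Colon"], "OR", annotated["Kidney"])
print(sorted(normal_prostate))
print(sorted(colon_or_kidney))
```

The sketch also shows why a class to be discarded must have been selected beforehand: the "BUT NOT" operation needs the set of experiments annotated with "Neoplasms" in order to subtract it.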
The following page displays all the selected experiments. Here again, one can unselect experiments to discard before extracting the values.
To generate the numerical matrix from heterogeneous datasets, one has to take into account the differences between these datasets. Values coming from an Affymetrix experiment are very different from the ones coming from SAGE or MPSS data, for example. To deal with this problem, values have been pre-calculated for each experiment of each dataset, where all the raw numerical values are re-scaled on the same basis. All the final values for each experiment are scaled between 0 and 1000, so that the same range is conserved for all experiments. These are the values which will then be extracted for the final matrix.
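The manual does not specify the re-scaling formula. As a minimal sketch of the idea, the Python function below uses a simple linear (min-max) re-scaling of one experiment's values to the range 0 to 1000. This choice of formula is an assumption made for illustration and may differ from the method actually used in CleanEx.

```python
def rescale(values, upper=1000.0):
    """Linearly re-scale the values of one experiment to the range 0..upper."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: there is no range to scale, so return zeros.
        return [0.0 for _ in values]
    return [upper * (value - lowest) / (highest - lowest) for value in values]


affymetrix_like = [20.0, 250.0, 4800.0]   # invented signal intensities
sage_like = [0, 3, 12]                    # invented tag counts
print(rescale(affymetrix_like))
print(rescale(sage_like))
```

After re-scaling, both experiments cover the same range, which is what makes values from different technologies comparable in one matrix.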
Before generating the matrix, an intermediate step extracts all the common genes for the different datasets selected for this analysis. The matrix is then generated with only the common genes.
The three resulting files are identical to the ones generated for one single dataset.
The keywords-based expression data retrieval system
The keywords-based search tool is a general text search engine which parses the descriptions of the experiments. This tool works in two different flavours :
- The tool searches in the MeSH annotation files of datasets/experiments
- It searches in the original datasets/experiment descriptions provided by the authors
The search by MeSH term is faster and more precise, but the free-text tool is quite handy when one is not that familiar with MeSH terms, or when searching for an annotation which is not part of the MeSH catalog.
Expression datasets retrieval
As for the MeSH tree-based dataset selection, the Find CleanEx Expression Datasets page retrieves all the datasets for which all words of the query appear in any experiments of the dataset, independently or in the same experiment.
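The matching rule described above (all query words must appear somewhere in the dataset, not necessarily in the same experiment) can be sketched in Python as follows. The function name and the experiment descriptions are invented for this example, and word splitting on whitespace is a simplifying assumption.

```python
def dataset_matches(query, descriptions):
    """True if every query word occurs in at least one experiment description."""
    words = set()
    for text in descriptions:
        words.update(text.lower().split())
    return all(word in words for word in query.lower().split())


# Invented experiment descriptions for one dataset.
descriptions = [
    "normal breast tissue",
    "breast carcinoma sample",
    "kidney control",
]
print(dataset_matches("breast kidney", descriptions))   # words found in different experiments
print(dataset_matches("breast liver", descriptions))    # "liver" is never mentioned
```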
The result page lists the corresponding datasets, and allows the user to extract data from one dataset at a time, as with the MeSH tool.
Individual experiment retrieval
The "Find Specific Experiments in CleanEx Expression Datasets" page, like the last part of the MeSH experiment selection tool, allows keywords to be grouped and linked with different operators, namely "AND", "BUT NOT" and "OR". For people who are familiar with MeSH terms, this saves the time spent "walking down" the MeSH term tree. For free-text searches, it allows the search to be refined to a more specific definition. Still, the results of the free-text approach will always be noisier than those obtained with the controlled MeSH vocabulary.
Once the experiments have been selected, the numerical data extraction process explained in the paragraph above is proposed.
Extracting expression data numerical values
The Extract numerical data from a selected CleanEx dataset tool works as explained in the Datasets retrieval using the MeSH hierarchical annotation paragraph. It has been created for people who already know the accession number of one specific dataset. One can then just select this dataset from the list. All the dataset search part is thus discarded to go straight to the numerical data extraction part.
Finding common genes in different datasets
The "Search common genes in different datasets" tool finds, from the selected dataset list, the genes which are common to all the given datasets. The resulting page shows, for each common gene and for each selected dataset, a list of the corresponding features. Each feature is associated with its corresponding CleanEx target and its quality criteria.
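Finding the common genes amounts to intersecting the gene lists of the selected datasets. The Python sketch below illustrates this under stated assumptions: the dataset names, feature names and data layout are invented, and the function is not part of CleanEx.

```python
def common_genes(datasets):
    """Return {gene: {dataset: [features]}} for the genes present in every dataset."""
    if not datasets:
        return {}
    shared = set.intersection(*(set(genes) for genes in datasets.values()))
    return {
        gene: {name: genes[gene] for name, genes in datasets.items()}
        for gene in sorted(shared)
    }


# Invented datasets: each maps a CleanEx gene identifier to its features.
datasets = {
    "DATASET_A": {"HS_FN1": ["probe_1", "probe_2"], "HS_TP53": ["probe_3"]},
    "DATASET_B": {"HS_FN1": ["tag_7"], "HS_MFN1": ["tag_8"]},
}
print(common_genes(datasets))
```

Only HS_FN1 is present in both invented datasets, so it is the only gene reported, together with its features in each dataset.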
CleanEx Expression Data Analysis
CleanEx provides tools to extract expression measurement matrices from different datasets, and to format them so that they can be directly imported into powerful expression data analysis tools, such as "R".
CleanEx nevertheless offers some handy and fast methods of its own to compare gene expression levels within a single dataset, between datasets, or even across different datasets.
The first of these tools is a "step-by-step" method, which goes successively through different datasets, each time using the preceding result to improve and refine the final set of differentially expressed genes.
The second one is more complex. Using the previously described method to extract heterogeneous data from different datasets, it generates two matrices representing two different biological conditions, and then compares the gene expression levels between the two pools.
Step-by-step expression pattern search
The step_by_step tool first generates a form for the selected dataset. From this form, the user can separate the experiments into two pools, usually representing two different conditions (for example, the first pool could represent "prostate normal tissue", and the second could be "prostate cancer tissue"). One then selects the analysis to apply to these two experiment pools (over-expression in either the first or the second pool compared to the other one, or co-expression levels in the two pools), and the number or percentage of features/genes to keep.
The comparison is currently based on the general mean difference ranking, where the mean expression is calculated for each gene and for each experiment pool, and the difference between the two pools' means for each gene is then ranked.
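The mean difference ranking can be sketched in a few lines of Python. This is a minimal illustration of the principle described above, not the code used by the tool: the function name, the gene names and the expression values are invented, and the direction of the difference (second pool minus first pool) is an assumption.

```python
from statistics import mean


def rank_by_mean_difference(pool_one, pool_two, top=None):
    """Rank genes by mean(pool_two) - mean(pool_one), largest difference first."""
    shared = set(pool_one) & set(pool_two)
    differences = {gene: mean(pool_two[gene]) - mean(pool_one[gene]) for gene in shared}
    ranking = sorted(differences.items(), key=lambda item: item[1], reverse=True)
    return ranking if top is None else ranking[:top]


# Invented expression values: one list of measurements per gene and per pool.
normal = {"GENE_A": [100, 120], "GENE_B": [500, 520], "GENE_C": [300, 300]}
cancer = {"GENE_A": [400, 420], "GENE_B": [100, 80], "GENE_C": [310, 290]}

for gene, difference in rank_by_mean_difference(normal, cancer):
    print(gene, difference)
```

Genes at the top of the list are over-expressed in the second pool, and genes at the bottom are over-expressed in the first pool. The `top` parameter stands in for the number of genes to keep.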
The following step displays the gene list according to the difference rank. The user can then select between two options :
- Extract the promoter sequence in a fasta format for the shown over-expressed genes. This file can then be used for promoter analysis, for example from the SSA (Signal Search Analysis) online tool available at the Swiss Institute of Bioinformatics
- Proceed to the next analysis step, by selecting a new comparable dataset, then generating the two comparison pools, and launching a new analysis. This new step will generate a gene ranking from the newly selected dataset, but it will show only the top genes which are common with the first analysis results.
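The refinement performed by a second analysis step can be sketched as a filter on the new ranking. The Python function below shows one possible reading of the description above (keep the top genes of the new ranking that were also present in the previous result). The function name and the gene names are invented for this example.

```python
def refine(previous_genes, new_ranking, top):
    """Keep the top genes of the new ranking that were also in the previous result."""
    previous = set(previous_genes)
    return [gene for gene in new_ranking[:top] if gene in previous]


first_result = ["GENE_A", "GENE_C", "GENE_F"]              # genes kept after step one
second_ranking = ["GENE_C", "GENE_X", "GENE_A", "GENE_F"]  # ranking from the new dataset
print(refine(first_result, second_ranking, top=3))
```

Repeating this filter over successive datasets progressively narrows the list to the genes that are consistently found among the top-ranked ones.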
By class expression pattern search
The MeSH-oriented data extraction and comparison module works on the same basis as the MeSH-oriented data selection tools. It works as follows :
- First, by walking down the MeSH categories, the user selects two pools of experiments, coming from different datasets, to compare. For example, one could compare prostate normal tissue (Prostate BUT NOT Neoplasms) versus prostate cancer tissue (Prostate AND Neoplasms BUT NOT Neoplasm Metastasis).
- Once the experiments have been selected, the user can still discard some data that they do not want to use for further analysis
- Then, the program generates, for the two experiment pools, the three files, namely the matrix, the experiment and the feature files (see Data Extraction for details on the file formats).
- The next step is the analysis part. It uses the same rules as the step-by-step module, and produces a list of genes which are either :
  - Over-expressed in pool two compared with pool one
  - Under-expressed in pool two compared with pool one
  - Over- OR under-expressed in pool two compared with pool one
The difference value for over-expressed genes is shown in red, and the one for under-expressed genes is shown in green.
A direct link to each gene's corresponding entry in CleanEx is provided from this result page.