CleanEx : How-To

The main goal of the CleanEx database is to provide access to public gene expression data via unique gene names and to represent heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-dataset comparisons. To achieve this goal, each single gene expression experiment is regularly mapped on a permanent target identifier consisting of a physical description of the targeted RNA.
This manual leads the user through the different construction steps which are necessary to generate the CleanEx system. It also provides a "Learn-by-Example" tutorial page for each entry point in CleanEx and for each CleanEx tool.


  1. THE CleanEx DATABASE : CONCEPT AND DATA ORGANIZATION
  2. BUILDING CleanEx
  3. CleanEx format conventions
  4. CleanEx entry search engines and viewers
  5. Data extraction
  6. Data analysis
  7. Using CleanEx : Examples of applications

CleanEx : CONCEPT AND DATA ORGANIZATION

  1. Introduction
  2. CleanEx_exp
  3. CleanEx_trg
  4. CleanEx


Introduction

CleanEx contains human and mouse genes for which the symbol is approved by the representative organism nomenclature committee. For human genes, we use the approved Genew gene symbols. The mouse gene index is based on the MGD (Mouse Genome Database) nomenclature. There is one entry per gene name for each organism.
CleanEx is a flat-file database system consisting of three different file types.
Each file type contains specific information and is linked to the others through a defined accession number.
The three file types are named CleanEx_exp, CleanEx_trg and CleanEx.


CleanEx_exp

CleanEx_exp files store publicly available gene expression data.
Each "exp" file contains a matrix of measured expression levels for a set of target sequences and conditions, which is typically published and analyzed at once, and referred to by a common name.
Each "exp" file begins with a documentation entry for the corresponding dataset, which can be compared to a GEO series instance and which provides general information about the data set, followed by one data entry for each expression target.

A data entry contains expression values for a particular feature over all conditions.
By feature we mean any molecule that is used to retrieve a certain transcript's abundance in an experiment, such as a clone or oligonucleotide spotted on a certain position of a dual-channel chip, an Affymetrix probe set, or a SAGE or MPSS tag.
The header line of each CleanEx_exp data entry contains the CleanEx target identifier, which links this specific "exp" entry to its target (the transcript which is "targeted" by the feature) in the "trg" file.
The CleanEx_exp files are in principle static, unless the authors modify their own data. The only exception is the "exp" file that contains the tissue distribution of public ESTs, which is derived from Unigene and regenerated from scratch whenever the original source is updated.
CleanEx_exp files have short alphanumeric strings as identifiers, which in most cases correspond to the GEO series identifier. The individual expression data entries have composite identifiers consisting of the corresponding "exp" file name followed by an underscore character and a second unique identifier.
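As an illustration of the composite identifier convention, the short Python sketch below splits an expression entry identifier into its "exp" file name and its local identifier. The function name is hypothetical, and the sketch assumes that the "exp" file name itself never contains an underscore, which the text does not state explicitly.

```python
def split_exp_entry_id(entry_id: str) -> tuple[str, str]:
    """Split a composite identifier such as 'AFFY001_1575_at'.

    Assumption: the "exp" file name is everything before the FIRST
    underscore; the remainder is the second unique identifier.
    """
    dataset, sep, local_id = entry_id.partition("_")
    if not sep or not dataset or not local_id:
        raise ValueError(f"not a composite identifier: {entry_id!r}")
    return dataset, local_id


print(split_exp_entry_id("AFFY001_1575_at"))
```

Splitting on the first underscore only matters because local identifiers such as Affymetrix probe set names may themselves contain underscores.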


CleanEx_trg

The "trg" file type contains a physical description of the expression targets, linked to genes and quality control information. A CleanEx "target" stands for the sequence to which any nucleotide element, or "feature", spotted or sequenced for an expression experiment corresponds. Features can be either :


An entry in the CleanEx "trg" file type is an annotated feature with its corresponding gene name and possibly its position on the gene nucleotide sequence reference.
The exact content of a target entry depends on the feature type. Currently we distinguish between: The latter two are not true physical descriptions of spotted features and serve as substitutes when more precise information is lacking.
The CleanEx_trg entries consist of a stable part and a dynamic part that is updated weekly. The stable part is imported from external sources, such as the original feature names given by the experiment authors or the probe set documentation files posted by Affymetrix. It is used to generate the dynamic part, namely the mapping of CleanEx_trg "features" to "targets", via a weekly updating procedure.
When the same feature matches multiple targets, the CleanEx_trg entry lists all corresponding genes found but adds a quality-control flag to indicate that the mapping is ambiguous.

CleanEx

CleanEx is the catalog of officially approved genes from model organisms (for now : human and mouse) with cross-references to entries in CleanEx_trg and CleanEx_exp, and links to external databases. There is one entry per gene, regardless of whether there are corresponding expression data in CleanEx_exp. This file is completely rebuilt from scratch every week, synchronously with the remapping of expression targets to genes. The process starts with a compilation of officially approved gene names from the reference gene catalogs (Genew for human and MGD for mouse). These names are then used to establish cross-references to CleanEx_trg entries and from there to expression data in CleanEx_exp via the unique target identifier. The link between sequences and gene names is made via the Unigene database. To give a complete view of the transcript and its product, we also link each entry to the corresponding protein. We also provide the genomic position of the transcription start site from EPD when available; otherwise we give the annotated start site position in Ensembl.




Building CleanEx : Main Steps

  1. Introduction
  2. CleanEx_exp files
  3. CleanEx_trg files
  4. CleanEx with cross-references


Introduction

The building procedure for the CleanEx system consists of regenerating from scratch the weekly updated files CleanEx_trg and CleanEx, and then adding the dataset information contained in the stable files (CleanEx_exp) to this new version, and concatenating all the cross-references together in CleanEx.
This page describes the building process of the stable CleanEx_exp files, which occurs only once, and the updating procedures for the two other file types, CleanEx_trg and CleanEx.


CleanEx_exp

The different platforms which have been integrated in the CleanEx system so far are :

Some properties are shared between datasets : the first three methods give as main output a ratio between a reference experiment and the tested condition, Affymetrix-like experiments usually give a single intensity per probe set, and the EST, SAGE and MPSS methods all give a basic count of the transcripts found. Nevertheless, each type of dataset needs a specific protocol to be integrated in CleanEx. Most of the datasets in CleanEx are now extracted from the GEO (Gene Expression Omnibus) database at the NCBI, which has become the most popular expression dataset repository and thus represents a very complete expression data source.
Typically, the metadata for each dataset, which contains information such as the type of experiment performed, the organism, the methods applied, the paper reference and so on, gives rise to the first entry of the dataset, namely the documentation entry (DOC). This is the first part to generate for each dataset, regardless of its origin. The DOC entry is usually built by processing the information contained in the GEO "Series" files, as well as in the GSE "Samples" description part.


Data from GEO : semi-automatic dataset generation method

The semi-automatic procedure allows the direct generation of new CleanEx datasets from GEO.
GEO has a very specific and well-designed format, including the following three file types :

  1. "GPL" files : description of the platform used (chip description).
  2. "GSE" files : the series (all the experiments corresponding to one dataset, or in other words one publication).
  3. "GSM" files : the samples, containing the numerical values for individual experiments.
The series from GEO are stored in a GEO-specific format called "soft". Each GSE soft file contains the above-mentioned information, namely the platform(s) used, the general information about the series, and the numerical values for each sample.
The procedure consists of the following main steps :
  1. Extract the series "soft" file.
  2. Extract from the platform the correspondence between spots (features) and sequences (targets).
  3. Create the documentation entry from the information contained in the GSE file and from the individual sample descriptions.
  4. For all samples of the series, reformat the numerical values to adapt them to the CleanEx format (values are stored for each feature, and not for each experiment). Add the target name for each feature in each "exp" entry header line.
  5. Add value scales in the DOC entry.
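The reformatting in step 4 amounts to pivoting per-sample tables into per-feature rows. The Python sketch below illustrates the idea on a tiny, made-up SOFT fragment; it is not the CleanEx build code. The function name, the sample identifiers and the values are invented for the example, and only the `ID_REF` and `VALUE` columns of each sample table are used.

```python
from collections import defaultdict

# Made-up SOFT fragment with two samples (values are illustrative only).
SOFT_TEXT = """\
^SAMPLE = GSM1
!Sample_title = normal colon
!sample_table_begin
ID_REF\tVALUE
1575_at\t120.5
1576_g_at\t33.0
!sample_table_end
^SAMPLE = GSM2
!Sample_title = colon tumor
!sample_table_begin
ID_REF\tVALUE
1575_at\t410.2
1576_g_at\t29.1
!sample_table_end
"""


def pivot_soft_samples(soft_text):
    """Return (sample_ids, {feature_id: [one value per sample]})."""
    samples = []
    per_feature = defaultdict(dict)
    current, in_table, columns = None, False, None
    for line in soft_text.splitlines():
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
            samples.append(current)
        elif line.lower().startswith("!sample_table_begin"):
            in_table, columns = True, None
        elif line.lower().startswith("!sample_table_end"):
            in_table = False
        elif in_table and current is not None:
            fields = line.split("\t")
            if columns is None:
                columns = fields  # first table line is the column header
                continue
            row = dict(zip(columns, fields))
            per_feature[row["ID_REF"]][current] = float(row["VALUE"])
    # One row per feature, ordered by sample; None marks a missing value.
    matrix = {f: [vals.get(s) for s in samples] for f, vals in per_feature.items()}
    return samples, matrix


samples, matrix = pivot_soft_samples(SOFT_TEXT)
for feature, values in matrix.items():
    print(feature, values)
```

Storing one row per feature is what lets a single "exp" data entry carry the values of that feature over all conditions.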


The EST dataset


The dataset generated from the EST counts needs frequent updates, as it is based on the Unigene database.
The EST dataset is an in silico expression dataset generated by splitting the ESTs of each UniGene cluster per tissue, according to the library from which they were extracted (Figure 10). This allows EST counts in healthy and tumor-specific tissues to be compared with results obtained via other expression experiment protocols.
The tissue split is based on the library classification from CGAP (Cancer Genome Anatomy Project) at the NCBI. The tissue-specific libraries from the CGAP, MGC and ORESTES projects can be classified as normal, precancer, or cancer, a classification that is well suited to our needs. The CGAP library classification contains fifty-five different tissue classes divided into three different histology classes. Among these tissue types, the classes chosen because they contain a reasonable number of ESTs are the following :

Tissue            Histology
Colon             cancer
Colon             normal
Kidney            cancer
Kidney            normal
Lung              cancer
Lung              normal
Mammary gland     cancer
Mammary gland     normal
Skin              cancer
Skin              normal
Cell-line         cancer
Cell-line         normal
Other tissues     cancer
Other tissues     normal

The main steps to generate the EST datasets are :
  1. Extract the library identifier and full name, the tissue type and the tissue condition (tumor, normal) from CGAP.
  2. Extract the Unigene identifier and full name for each library from the Unigene library info.
  3. Classify the ESTs found in Unigene according to their original library.
  4. Count all ESTs per tissue class, and then all ESTs per tissue class and per Unigene cluster.
  5. Generate the EXP file with one entry per Unigene cluster.
  6. Express the calculated relative amount of ESTs for each entry in TPM (tags per million).
  7. Repeat all steps for both human and mouse data.
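Steps 3, 4 and 6 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual build script : the input structures, the library names and the "Hs.1" cluster are made up, and TPM is computed here as the per-cluster count divided by the total EST count of the same tissue class, times one million.

```python
from collections import Counter

# Hypothetical inputs: one (UniGene cluster, library id) pair per EST, and a
# library -> tissue class table such as one derived from the CGAP classification.
ests = [
    ("Hs.21330", "lib1"),
    ("Hs.21330", "lib2"),
    ("Hs.1", "lib1"),
    ("Hs.1", "lib1"),
]
library_class = {"lib1": "Colon normal", "lib2": "Colon cancer"}


def est_tpm(ests, library_class):
    """Return {(tissue class, cluster): tags per million}."""
    per_class = Counter()          # all ESTs per tissue class
    per_class_cluster = Counter()  # ESTs per tissue class and per cluster
    for cluster, library in ests:
        tissue = library_class.get(library)
        if tissue is None:
            continue  # library not in one of the retained tissue classes
        per_class[tissue] += 1
        per_class_cluster[(tissue, cluster)] += 1
    return {
        key: 1_000_000 * count / per_class[key[0]]
        for key, count in per_class_cluster.items()
    }


for (tissue, cluster), tpm in sorted(est_tpm(ests, library_class).items()):
    print(tissue, cluster, round(tpm, 1))
```

Normalising by the class total makes counts comparable between tissue classes that contain very different numbers of ESTs.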

A schematic description of the EST dataset update is given below :


CleanEx_trg

The method used to build CleanEx_trg entries depends on the feature source given by the authors.

Long clones

For long clones (e.g. Incyte clones), the re-mapping is done with megablast following these steps :

The assignment of the quality tag for long clones follows these rules :

Quality   Criteria
High      Both 3' and 5' ends of the clone are available and match the same Unigene cluster.
Medium    Either 3' or 5' ends of the clone are available and give a statistically significant result.
Medium    Both 3' and 5' ends of the clone are available, but only one is statistically significant.
Low       No statistically significant results have been found.
Low       Both ends of the clone match different genes.
Unknown   The sequence is not yet available.
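The rules above can be read as a small decision function. The following Python sketch is one possible reading, not the CleanEx implementation : the `EndMatch` type and function name are hypothetical, "different genes" is approximated by comparing Unigene cluster identifiers, and the significance test itself is assumed to have been done upstream.

```python
from typing import NamedTuple, Optional


class EndMatch(NamedTuple):
    """Result of mapping one clone end (hypothetical structure)."""
    significant: bool          # was the best hit statistically significant?
    cluster: Optional[str]     # Unigene cluster of the best hit, if any


def long_clone_quality(five_prime: Optional[EndMatch],
                       three_prime: Optional[EndMatch]) -> str:
    """Assign a quality tag; None means that end sequence is not available."""
    ends = [end for end in (five_prime, three_prime) if end is not None]
    if not ends:
        return "Unknown"   # no sequence available yet
    hits = [end for end in ends if end.significant]
    if not hits:
        return "Low"       # nothing statistically significant
    if len(hits) == 2:
        # Both ends significant: same cluster is High, conflicting is Low.
        return "High" if hits[0].cluster == hits[1].cluster else "Low"
    return "Medium"        # exactly one significant end


print(long_clone_quality(EndMatch(True, "Hs.21330"), EndMatch(True, "Hs.21330")))
print(long_clone_quality(EndMatch(True, "Hs.21330"), None))
```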

Affymetrix probe sets

To give access to the precise position of Affymetrix probes, the Affymetrix mapping is done on the individual probes for each probe set, and not on the consensus sequence given in the available annotation files.
To avoid noise in the results, these short sequences (Affymetrix probes, SAGE and MPSS tags) are mapped via an in-house program called tagger, which reports only perfect matches on the reference sequence database. The mapping is done as follows :

The assignment of the quality tag for Affymetrix probe sets follows these rules :

Quality   Criteria
High      A maximum of two Unigene identifiers match the probes of the probe set. All probes of the probe set match both Unigene identifiers.
Medium    A maximum of four Unigene identifiers match the probes of the probe set. In addition, a maximum of three "errors" are permitted. Errors are defined as probes that match nothing, probes that fail to match a Unigene identifier, or probes that match an additional Unigene identifier.
Low       Anything below the two preceding criteria.
Unknown   Absolutely no match on the selected mRNA databases was found for any of the probes of the probe set.
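The probe set rules leave some room for interpretation, so the Python sketch below should be read as one possible interpretation rather than the actual procedure. It assumes that each probe is summarised by the set of Unigene identifiers it matches, and that "errors" are counted relative to the identifier hit by the largest number of probes; both the function name and this choice of reference identifier are assumptions.

```python
from collections import Counter


def probe_set_quality(probe_hits):
    """probe_hits: one set of Unigene identifiers per probe of the probe set."""
    all_ids = set().union(*probe_hits) if probe_hits else set()
    if not all_ids:
        return "Unknown"  # no probe matched anything
    # High: at most two identifiers, and every probe matches all of them.
    if len(all_ids) <= 2 and all(hits == all_ids for hits in probe_hits):
        return "High"
    # Assumption: the reference identifier is the one matched by most probes.
    counts = Counter(uid for hits in probe_hits for uid in hits)
    main_id = counts.most_common(1)[0][0]
    # A probe is an "error" if it matches nothing, misses the reference
    # identifier, or matches an additional identifier.
    errors = sum(1 for hits in probe_hits if hits != {main_id})
    if len(all_ids) <= 4 and errors <= 3:
        return "Medium"
    return "Low"


print(probe_set_quality([{"Hs.21330"}] * 16))
print(probe_set_quality([{"Hs.21330"}] * 14 + [{"Hs.21330", "Hs.1"}] * 2))
```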

A detailed chart of the Affymetrix mapping procedure is given below.

The updated annotation files are available on the SIB ftp server.
Each subdirectory contains organism-specific chip annotation files corresponding to the mapping on RefSeq, HTC, RNA and EST. For the EST database matches, note that, to discard the matches on the wrong DNA strand, the mapping on ESTs is done via a supplementary check of the EST orientation. The first check is extracted from the Unigene EST direction annotation. This information is then cross-checked and corrected via the EST orientation found in the in-house transcriptome project called "trome". Thus, ESTs with tags 5' or 3' are accessible in two different files, respectively flanked with the extension "_PLUS" or "_MINUS".
Each line of these mapping files contains one match for one individual probe. Supplementary information, included when known, is :

SAGE and MPSS tags

The clustering of SAGE, LONGSAGE, MPSS, LONGMPSS tags together with reference sequence databases is provided at the SIB via the trome project.
To speed up the database update process, the SAGE tag mapping is done on a trome-based pre-filtered reference sequence database. The procedure then follows the Affymetrix one, and the tagger program is used to extract exact mapping positions on this filtered reference database. The quality criteria given for each individual tag follow the same rules as the ones applied for Affymetrix, but are adapted for single tags rather than probe sets.
In the final CleanEx_trg file, all of the information is kept for the users.
As the CleanEx database does not contain only the 3'-most tags, the tag position on the reference sequence is given, which may also help in deciding whether to keep or discard a suspicious tag.

Here is a schematic view of the SAGE mapping procedure.


CleanEx

For each new Unigene release, the CleanEx files to be updated (namely CleanEx_trg and CleanEx) are rebuilt from scratch via this procedure :

CleanEx_trg

Each CleanEx_trg entry corresponds to one "target" (or "expression feature") used in an expression measurement experiment. Identifiers are composed of a code which describes the target type followed by an underscore and the target accession number. Types could be, for example, IMAGE clone (IMAGE), Affymetrix probeset (AFFY), SAGE tags (SAGE), or EMBL RNA or DNA sequences (RNA,DNA).

The format of CleanEx_trg resembles that of CleanEx. Each CleanEx_trg entry contains the following information :

Below is an example of an entry for an Affymetrix probe set :
ID   AFFY_HC-G110_1575_at   Type=Affy_Tag
OA   M14758; HUMMDR1 Human P-glycoprotein (MDR1) mRNA; complete cds.
OS   Homo sapiens (human).
GN   ABCB1
GC   1
QU   High
SR   Unigene=Hs.21330;
FM   Tag;
FN   16
UG   UniGene Build #160
F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F2  AAAGCGCCAGTGAACTCTGACTGTA:284-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F3  GCGCCAGTGAACTCTGACTGTATGA:285-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F4  CCAGTGAACTCTGACTGTATGAGAT:286-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F5  TTAACATTTCCTCAGTCAAGTTCAG:287-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F6  ACATTTCCTCAGTCAAGTTCAGAGT:288-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F7  TTTCCTCAGTCAAGTTCAGAGTCTT:289-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F8  CCTCAGTCAAGTTCAGAGTCTTCAG:290-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F9  AGACATCATCAAGTGGAGAGAAATC:291-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F10  ATTTTCCCATTTGGACTGTAACTGA:292-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F11  TTCCCATTTGGACTGTAACTGACTG:293-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F12  CCATTTGGACTGTAACTGACTGCCT:294-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F13  TTTGGACTGTAACTGACTGCCTTGC:295-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F14  TAACTGACTGCCTTGCTAAAAGATT:296-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F15  CTGACTGCCTTGCTAAAAGATTATA:297-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F16  ACTGCCTTGCTAAAAGATTATAGAA:298-105; Refseq=NM_000927(+); Unigene=Hs.21330;
DR   AFFY001_1575_at;
//
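Because every line starts with a short line code, an entry like the one above is easy to read programmatically. The Python sketch below is a minimal reader, not an official parser : the function name and the returned dictionary layout are choices made for this example, and the sample text is a shortened copy of the entry above.

```python
SAMPLE_ENTRY = """\
ID   AFFY_HC-G110_1575_at   Type=Affy_Tag
GN   ABCB1
GC   1
QU   High
SR   Unigene=Hs.21330;
FM   Tag;
FN   16
F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;
F2  AAAGCGCCAGTGAACTCTGACTGTA:284-105; Refseq=NM_000927(+); Unigene=Hs.21330;
DR   AFFY001_1575_at;
//
"""


def parse_trg_entry(text):
    """Return {line code: [values], 'features': [F-line values]} for one entry."""
    entry = {"features": []}
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        code, _, value = line.partition(" ")
        value = value.strip()
        if code.startswith("F") and code[1:].isdigit():
            entry["features"].append(value)  # F1, F2, ... feature lines
        elif code:
            entry.setdefault(code, []).append(value)
    return entry


entry = parse_trg_entry(SAMPLE_ENTRY)
print(entry["GN"], entry["QU"], len(entry["features"]))
```

Values are kept as lists because a line type may in principle occur more than once in an entry.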
Description of the line formats :

The ID line

The identification line is always the first line of an entry. The general form of the ID line is:

        ID   TRG_ID     Type

The OA line

        OA   X60188; Human ERK1 mRNA for protein serine/threonine kinase

This line contains either the target's Original Annotation found in the corresponding description files, for example the Affymetrix chips annotation, or the description of the sequence given in the corresponding EMBL entry. It exists only for CleanEx_trg entries corresponding to Affymetrix tags.

The GN line

        GN   TIE

The GN line lists the official gene symbols which correspond to that entry. If more than four genes match the target, only the first four are listed.

The GC line

        GC   1

The GC line gives the total count of genes with an approved symbol which match that target entry.

The QU line

        QU   High

The QU line is the quality tag based on the precision of the mapping of the target (see CleanEx build for details).

The SR line

        SR   Unigene=Hs.21330;

The SR line stands for Sequence Reference and gives the associated Unigene Cluster for the whole target.

The FM line

        FM   Tag;

This line describes the format of the features for the target.

The FN line

        FN   16

The FN line gives the number of features belonging to that target. For cDNA clones, this number is typically one. For Affymetrix probe sets, it can vary between eleven and twenty-five.

The UG line

        UG   UniGene Build #160

The UG line shows the Unigene release which has been used to map the target sequence to its corresponding cluster.

The F1-F25 lines

        F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;

These lines show the individual mapping for all the features of the corresponding target. Fields are separated by a ";". The first field is the name of the feature. The second field contains the RefSeq accession number of the sequence to which the feature maps. The sign in parentheses indicates whether the tag mapped on the positive or on the negative strand of the RefSeq sequence. The last field shows the Unigene cluster to which the RefSeq sequence is associated.
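The field layout of a feature line can be illustrated with a short Python sketch. It is only an example reader for lines shaped like the one shown above; the function name and the dictionary keys are made up, and the first field is kept verbatim since its internal structure is not detailed here.

```python
import re

F_LINE = re.compile(r"^(F\d+)\s+(.*)$")
REFSEQ = re.compile(r"^(.+)\(([+-])\)$")


def parse_feature_line(line):
    """Split one F line into its ';'-separated fields."""
    match = F_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a feature line: {line!r}")
    label, rest = match.groups()
    fields = [field.strip() for field in rest.split(";") if field.strip()]
    record = {"label": label, "name": fields[0]}  # first field: feature name
    for field in fields[1:]:
        key, _, value = field.partition("=")
        refseq = REFSEQ.match(value) if key == "Refseq" else None
        if refseq:
            # Accession number followed by the strand sign in parentheses.
            record["refseq"], record["strand"] = refseq.groups()
        else:
            record[key.lower()] = value
    return record


line = "F1  TGTCCAGGCTGGAACAAAGCGCCAG:283-105; Refseq=NM_000927(+); Unigene=Hs.21330;"
print(parse_feature_line(line))
```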

The DR line

        DR   AFFY001_1575_at;

DR lines in CleanEx_trg are cross-links to the expression data found in CleanEx under the line type "EXP". The link is made via the local identifier of the expression data.




CleanEx entry search engines and viewers

  1. Querying CleanEx
  2. CleanEx expression data viewer
  3. CleanEx target entry viewer
  4. Batch search for CleanEx_trg


CleanEx entry queries and viewer

The CleanEx and CleanEx_trg data types are accessible either as flat files on the ftp server (a detailed explanation of the directory content can be found in the README file of this directory), or via a web-based entry search and retrieval system at : http://www.cleanex.isb-sib.ch.
The CleanEx_exp data can be accessed only via the search tools which are explained in the Data Extraction page of this tutorial.

Entry quick search

Retrieves CleanEx entries via their main identifiers. CleanEx identifiers are built by concatenating the organism code and the official up-to-date gene symbol. This tool is the fastest way to access data, but one needs to know the official gene symbol of the searched gene. This tool also works with partial gene names, meaning that typing for example "FN1" will not only retrieve HS_FN1 and MM_FN1, but also, for example, MM_ANKFN1, MM_SLFN10, or HS_MFN1.
Once the list is shown, one can choose the desired entry as well as the output format (either html or text).
Example :
Typing "fn1" in the quick search box will lead to the Quick Search Results page.
Now selecting "HS_FN1" and "CleanEx entry (NICE view)" output on this page will give the individual HS_FN1 entry from CleanEx.
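The partial-name behaviour of the quick search can be mimicked with a simple substring test, as in the Python sketch below. This only illustrates the matching behaviour described above and is not the server-side code; the function name is hypothetical and the identifier list is limited to the examples quoted in this section plus one non-matching entry.

```python
# CleanEx identifiers: organism code + "_" + official gene symbol.
ENTRY_IDS = ["HS_FN1", "MM_FN1", "MM_ANKFN1", "MM_SLFN10", "HS_MFN1", "HS_ABCB1"]


def quick_search(query, entry_ids):
    """Return the entries whose gene symbol contains the query (case-insensitive)."""
    needle = query.strip().upper()
    return [
        entry_id
        for entry_id in entry_ids
        if needle in entry_id.split("_", 1)[1].upper()
    ]


print(quick_search("fn1", ENTRY_IDS))
```

Matching on the gene symbol only, rather than on the full identifier, avoids spurious hits on the organism code.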

Entry browser

The entry browser can be filled with information as diverse as gene name, description, Unigene accession number, organism, RefSeq sequence, Swissprot or EPD identifiers, or even the clone accession numbers, or the expression experiment's identifiers.
Fields can be combined with limiting (AND, BUT NOT) or expanding (OR) operators. The entries which match the whole expression, with its combined fields and operators, will be selected.
As the search is done on the whole file, the search is much slower than the quick search system.

Entry viewer

The HTML entry viewer contains all the information described in the Format convention description part of this tutorial.
In addition, the following tools are provided :

CleanEx expression viewers

The expression viewer depends on the type of the selected experiment.

CleanEx target entry viewer

The CleanEx target entries can be retrieved individually with the same search engines as CleanEx entries, namely the "CleanEx Target quick search" and the Target browser.
As for the CleanEx viewer, the CleanEx_trg entry viewer gives access to all the fields described in the format description, as well as a direct link to the expression data viewer for the associated CleanEx_exp entries.
For locally mapped targets, the exact position of all the tags on the reference sequence is provided, as well as a link to the SIB "TagScan" system, which gives the tag position on the genome sequence.
The tag position on the mRNA sequence can thus be used, for example, to check the distance of SAGE tags from the 3' end of the gene.
For Affymetrix, this position can help solve two problems :

CleanEx target batch search

The Batch Search page is meant to help users determine which genes correspond to their identifiers.
It differs from a single search at the NCBI, for example, in two main ways :

The output of the batch query shows, for each given identifier, the list of associated "features", the feature type, and the CleanEx target quality tag.



CleanEx expression data retrieval systems

  1. The MeSH-oriented expression data retrieval system
  2. The keywords-based expression data retrieval system
  3. Extracting expression data numerical values
  4. Finding common genes in different datasets


Finding expression datasets via the MeSH annotation

The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary produced by the National Library of Medicine and used for indexing, cataloging, and searching for biomedical and health-related information and documents.
MeSH descriptors are arranged in a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Organisms". More specific headings are found at more narrow levels of the eleven-level hierarchy, such as "Monocytes", "Kidney Tubules" or "Leukemia, Lymphocytic, Acute".
The MeSH thesaurus is used by NLM for indexing articles for the MEDLINE/PubMED database. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
CleanEx is the first gene expression database which provides MeSH-oriented search tools.
Each individual experiment in all the datasets included in CleanEx has been annotated with MeSH controlled terms via a semi-automatic process. From this hierarchical controlled annotation system, new search tools have been developed, which give rapid access to expression data having a certain biological or medical specificity. One can thus, for example, easily retrieve all the datasets containing expression measurements for "Breast AND Neoplasms" samples.
This search technique is much more precise than a basic "free-text" search in the experiments' annotations provided by the authors in the GEO SERIES files.
The proposed search tools, described below, can retrieve either :

Datasets retrieval using the MeSH hierarchical annotation

The MeSH-oriented dataset selection and numerical extraction module is a tool which goes step-by-step in the MeSH annotation tree to find the CleanEx datasets which correspond to the selected part of the tree.
One can select more than one branch at a time, like for example "Anatomy" and "Diseases". Once the tree branches have been selected, one then chooses between the two following options in the "Select next step" part :

If the first option is selected, a new search is performed on the selected branches, and the corresponding sub-branches are shown in the following step. This operation can be repeated until the correct precision has been reached.
To improve the next branch selection, clicking on the MeSH term itself will open a new window, which shows the corresponding MeSH subtree.
Once the correct level has been reached, selecting the second option will extract all the datasets which contain the MeSH terms that have been kept. This includes also datasets in which only one experiment corresponds to the given criteria.
The following step shows the selected CleanEx datasets with a brief description of their content. One can then select one of these datasets for numerical data extraction by clicking on the dataset's identifier. This will display a more precise description of the dataset's content. At that stage, one has to select the numerical field to extract for the generation of the expression value matrix. This is especially useful for dual-channel experiments, as some people might want to work with one channel only, and others might want to use the experiment/reference ratio. One can also discard some experiments of this dataset by unchecking them.
The data extraction tool then provides access to three different files.
  1. The "matrix" file contains the numerical values. Each row represents one experiment, and each column is one feature.
  2. The "experiments" file contains the detailed description of the experiments, as given in the previous page. Each line represents one experiment and is divided in three fields. The first one is just an experiment counter, but keeps track of the original experiment number in the CleanEx dataset (number shown in parentheses after the experiment counter). The second field is the experiment's short name, as given for example in GEO, and the third one is the text description of the experiment.
  3. The "feature" file is the feature description file. Each line is one feature, ordered as the columns in the matrix file. The line fields are, respectively : feature counter, feature name, corresponding CleanEx target identifier and feature text description.
These files, especially the numerical matrix, can be directly imported into data analysis software, for example the R expression data analysis packages or the online EPCLUST tool by Jaak Vilo.
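The relationship between the three files (matrix rows follow the experiments file, matrix columns follow the feature file) can be checked with a few lines of Python. The sketch below is purely illustrative : it assumes tab-separated fields, which this tutorial does not specify, and all file contents, names and identifiers in it are made up.

```python
import csv
import io

# Made-up file contents; the tab delimiter is an assumption.
MATRIX_TXT = "1.0\t2.0\n3.0\t4.0\n5.0\t6.0\n"
EXPERIMENTS_TXT = (
    "1(4)\texpA\tfirst experiment\n"
    "2(5)\texpB\tsecond experiment\n"
    "3(9)\texpC\tthird experiment\n"
)
FEATURES_TXT = "1\tfeat1\tTRG_1\tfirst feature\n2\tfeat2\tTRG_2\tsecond feature\n"


def read_tsv(text):
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def load_extraction(matrix_txt, experiments_txt, features_txt):
    """Return {feature name: {experiment short name: value}}."""
    matrix = [[float(v) for v in row] for row in read_tsv(matrix_txt)]
    experiments = read_tsv(experiments_txt)
    features = read_tsv(features_txt)
    if len(matrix) != len(experiments):
        raise ValueError("one matrix row is expected per experiment")
    if any(len(row) != len(features) for row in matrix):
        raise ValueError("one matrix column is expected per feature")
    return {
        feature[1]: {exp[1]: matrix[i][j] for i, exp in enumerate(experiments)}
        for j, feature in enumerate(features)
    }


table = load_extraction(MATRIX_TXT, EXPERIMENTS_TXT, FEATURES_TXT)
print(table["feat2"])
```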

Individual experiment retrieval

The first steps to retrieve experiments from heterogeneous datasets with the MeSH-oriented data selection and extraction module are identical to those of the dataset retrieval system, except that the number of corresponding experiments, and not datasets, is given for each tree branch.
At the data extraction stage, a new intermediate page allows the search to be refined by joining the selected MeSH terms with different operators. For example, one can discard all experiments annotated as "Neoplasms" by linking the terms with the "BUT NOT" or with the "AND" operator, or one can select data from "Colon" "OR" "Kidney" to keep both biological classes.
The important point to remember here is that if you want to discard one biological class by using the "BUT NOT" operator, you have to select this class via the MeSH-oriented tool from the beginning of your analysis.
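The three operators behave like set operations on the experiments annotated with each term. The Python sketch below shows this correspondence on made-up annotations; the experiment names and the helper function are hypothetical.

```python
# Made-up MeSH annotations: experiment -> set of MeSH terms.
ANNOTATIONS = {
    "exp1": {"Colon"},
    "exp2": {"Colon", "Neoplasms"},
    "exp3": {"Kidney"},
    "exp4": {"Kidney", "Neoplasms"},
    "exp5": {"Lung", "Neoplasms"},
}


def with_term(term):
    """All experiments annotated with the given MeSH term."""
    return {exp for exp, terms in ANNOTATIONS.items() if term in terms}


colon_or_kidney = with_term("Colon") | with_term("Kidney")  # OR      -> union
tumor = colon_or_kidney & with_term("Neoplasms")            # AND     -> intersection
normal = colon_or_kidney - with_term("Neoplasms")           # BUT NOT -> difference

print(sorted(tumor), sorted(normal))
```

The difference can only be computed if the "Neoplasms" experiments have been selected as well, which is why a class to be excluded with "BUT NOT" has to be part of the initial selection.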
The following page displays all the selected experiments. Here again, one can unselect experiments to discard before extracting the values.
To generate the numerical matrix from heterogeneous datasets, one has to take into account the differences between these datasets. Values coming from an Affymetrix experiment are very different from the ones coming from SAGE or MPSS data, for example. To deal with this problem, pre-values have been calculated for each experiment of each dataset, where all the numerical raw values are re-scaled on the same basis. All the final values for each experiment are scaled between 0 and 1000, so that the same range is conserved for all experiments. These are the values which will then be extracted for the final matrix.
Before generating the matrix, an intermediate step extracts all the common genes for the different datasets selected for this analysis. The matrix is then generated with only the common genes.
The three resulting files are identical to the ones generated for one single dataset.
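The two operations described above, re-scaling each experiment to a common 0 to 1000 range and keeping only the genes common to all selected experiments, can be sketched as follows. The exact re-scaling used by CleanEx is not detailed here, so the linear min-max scaling in this Python example is an assumption, as are the function names and the made-up values.

```python
def rescale_0_1000(values):
    """Linear min-max scaling to [0, 1000] (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [1000.0 * (v - lo) / (hi - lo) for v in values]


def common_gene_matrix(experiments):
    """experiments: {experiment name: {gene: raw value}}.

    Returns (sorted common genes, {experiment name: [scaled value per gene]}).
    """
    common = set.intersection(*(set(by_gene) for by_gene in experiments.values()))
    genes = sorted(common)
    rows = {}
    for name, by_gene in experiments.items():
        # Scale over ALL values of the experiment, then keep the common genes.
        scaled = dict(zip(by_gene, rescale_0_1000(list(by_gene.values()))))
        rows[name] = [scaled[gene] for gene in genes]
    return genes, rows


experiments = {
    "affy_exp": {"A": 100.0, "B": 300.0, "C": 500.0},  # intensities
    "sage_exp": {"A": 0.0, "B": 10.0, "D": 40.0},      # tag counts
}
genes, rows = common_gene_matrix(experiments)
print(genes, rows)
```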

The keywords-based expression data retrieval system

The keywords-based search tool is a general text search engine which parses the experiment descriptions. This tool works in two different flavours :

The search by MeSH term is faster and more precise, but the free-text tool is quite handy when one is not that familiar with MeSH terms, or when searching for annotation which is not part of the MeSH catalog.

Expression datasets retrieval

As for the MeSH tree-based dataset selection, the Find CleanEx Expression Datasets page retrieves all the datasets for which all words of the query appear in any experiments of the dataset, independently or in the same experiment.
The result page lists the corresponding datasets, and allows the user to extract data from one dataset at a time, as with the MeSH tool.

Individual experiment retrieval

The Find Specific Experiments in CleanEx Expression Datasets page, like the last part of the MeSH experiment selection tool, allows keywords to be grouped and linked with different operators, namely "AND", "BUT NOT" and "OR". For people who are familiar with MeSH terms, this saves the time spent "walking down" the MeSH term tree. For free-text search, it allows the search to be refined to a more specific definition. Still, the results of the free-text approach will always be noisier than those obtained with the controlled MeSH vocabulary.
Once the experiments have been selected, the numerical data extraction process explained in the above paragraph is proposed.

Extracting expression data numerical values

The Extract numerical data from a selected CleanEx dataset tool works as explained in the Datasets retrieval using the MeSH hierarchical annotation paragraph. It has been created for people who already know the accession number of one specific dataset. One can then just select this dataset from the list. All the dataset search part is thus discarded to go straight to the numerical data extraction part.

Finding common genes in different datasets

The Search common genes in different datasets tool finds, from the selected dataset list, the genes which are common to all the given datasets. The resulting page shows, for each common gene and for each selected dataset, a list of the corresponding features. Each feature is associated with its corresponding CleanEx target and its quality criteria.




CleanEx Expression Data Analysis

  1. Step-by-step expression pattern search
  2. By class expression pattern search


CleanEx provides tools to extract expression measurement matrices from different datasets and to format them so that they can be directly imported into dedicated expression data analysis environments, such as "R".
In addition, CleanEx offers some handy and fast methods of its own to compare gene expression levels within a single dataset or across different datasets.
The first of these tools is a "step-by-step" method, which goes successively through different datasets, each time using the preceding result to improve and refine the final set of differentially expressed genes.
The second one is more complex. Using the previously described method to extract heterogeneous data from different datasets, it generates two matrices representing two different biological conditions, and then compares the gene expression levels between the two pools.

Step-by-step expression pattern search

The step_by_step tool first generates a form for the selected dataset. From this form, the user can separate the experiments in two pools, usually representing two different conditions (for example, the first pool could represent "prostate normal tissue", and the second could be "prostate cancer tissue"). One then selects the analysis to apply to these two experiment pools (over-expression in either the first or the second pool compared to the other one, or co-expression levels in the two pools), and the number or percentage of features/genes to keep.
The comparison is currently based on the general mean difference ranking : the mean expression is calculated for each gene and for each experiment pool, and the difference between the means of the two pools is then ranked for each gene.
The following step displays the gene list according to the difference rank. The user can then select between two options :

  1. Extract the promoter sequence in fasta format for the over-expressed genes shown. This file can then be used for promoter analysis, for example with the SSA (Signal Search Analysis) online tool available at the Swiss Institute of Bioinformatics.
  2. Proceed to the next analysis step, by selecting a new comparable dataset, then generating the two comparison pools, and launch a new analysis. This new step will generate a gene ranking from the newly selected dataset, but it will show only the top genes which are common with the first analysis results.
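The mean difference ranking and the refinement across successive datasets can be sketched in Python as follows. This is a simplified illustration, not the code behind the web tool : function names, gene names and values are made up, and the difference is taken here as the mean of the second pool minus the mean of the first pool.

```python
from statistics import mean


def rank_by_mean_difference(values, pool_one, pool_two, top=None):
    """values: {gene: {experiment: level}}.

    Returns [(gene, mean(pool_two) - mean(pool_one))], largest difference first.
    """
    diffs = {
        gene: mean(levels[e] for e in pool_two) - mean(levels[e] for e in pool_one)
        for gene, levels in values.items()
    }
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    return ranked if top is None else ranked[:top]


def refine(previous_top, ranked_next):
    """Keep, in the new ranking, only the genes retained at the previous step."""
    kept = {gene for gene, _ in previous_top}
    return [(gene, diff) for gene, diff in ranked_next if gene in kept]


dataset_a = {
    "G1": {"n1": 1, "n2": 1, "t1": 9, "t2": 11},
    "G2": {"n1": 5, "n2": 5, "t1": 5, "t2": 5},
    "G3": {"n1": 8, "n2": 8, "t1": 2, "t2": 2},
}
dataset_b = {
    "G1": {"n": 2, "t": 3},
    "G2": {"n": 1, "t": 7},
    "G3": {"n": 0, "t": 9},
}
top_a = rank_by_mean_difference(dataset_a, ["n1", "n2"], ["t1", "t2"], top=2)
top_ab = refine(top_a, rank_by_mean_difference(dataset_b, ["n"], ["t"]))
print([gene for gene, _ in top_a], [gene for gene, _ in top_ab])
```

In this example G3 ranks first in the second dataset but is not reported, because it was not among the top genes of the first analysis.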

By class expression pattern search

The MeSH-oriented data extraction and comparison module works on the same basis as the MeSH-oriented data selection tools. It works as follows :

  1. First, by walking down the MeSH categories, the user selects two pools of experiments, coming from different datasets, to compare. For example, one could compare prostate normal tissue (Prostate BUT NOT Neoplasms) versus prostate cancer tissue (Prostate AND Neoplasms BUT NOT Neoplasm Metastasis).
  2. Once the experiments have been selected, the user can still discard any data not wanted for further analysis.
  3. Then, the program generates, for the two experiment pools, the three files, namely the matrix, the experiment and the feature files (see Data Extraction for details on the file formats).
  4. The next step is the analysis part. It uses the same rules as the step-by-step module, and produces a list of genes which are either :
    • Over-expressed in pool two compared with pool one
    • Under-expressed in pool two compared with pool one
    • Over- OR under-expressed in pool two compared with pool one
The difference value for over-expressed genes is shown in red, and the one for the under-expressed genes is shown in green.
A direct link to each gene's corresponding entry in CleanEx is provided from this result page.




We're working hard to get it ready....

