CleanEx expression data retrieval systems

  1. The MeSH-oriented expression data retrieval system
  2. The keywords-based expression data retrieval system
  3. Extracting expression data numerical values
  4. Finding common genes in different datasets


Finding expression datasets via the MeSH annotation

The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary produced by the National Library of Medicine and used for indexing, cataloging, and searching for biomedical and health-related information and documents.
MeSH descriptors are arranged in a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Organisms". More specific headings are found at more narrow levels of the eleven-level hierarchy, such as "Monocytes", "Kidney Tubules" or "Leukemia, Lymphocytic, Acute".
The MeSH thesaurus is used by NLM for indexing articles for the MEDLINE/PubMed database. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
CleanEx is the first gene expression database which provides MeSH-oriented search tools.
Each individual experiment in all the datasets included in CleanEx has been annotated with MeSH controlled terms via a semi-automatic process. From this hierarchical controlled annotation system, new search tools have been developed, which give rapid access to expression data having a certain biological or medical specificity. One can thus, for example, easily retrieve all the datasets containing expression measurements for "Breast AND Neoplasms" samples.
This search technique is much more precise than a basic "free-text" search in the experiments' annotations provided by the authors in the GEO SERIES files.
The proposed search tools, described below, can retrieve either whole datasets or individual experiments.

Datasets retrieval using the MeSH hierarchical annotation

The MeSH-oriented dataset selection and numerical extraction module is a tool that walks step by step through the MeSH annotation tree to find the CleanEx datasets corresponding to the selected part of the tree.
One can select more than one branch at a time, for example "Anatomy" and "Diseases". Once the tree branches have been selected, one then chooses between two options in the "Select next step" part.

If the first option is selected, a new search is performed on the selected branches, and the corresponding sub-branches are shown in the following step. This operation can be repeated until the desired level of precision has been reached.
To improve the next branch selection, clicking on the MeSH term itself will open a new window, which shows the corresponding MeSH subtree.
Once the desired level has been reached, selecting the second option will extract all the datasets that contain the MeSH terms that have been kept. This also includes datasets in which only one experiment matches the given criteria.
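The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the CleanEx implementation: it assumes that each experiment carries a set of MeSH-style tree codes, that a code belongs to a branch when it equals the branch code or extends it after a dot, and that every selected branch must be matched by at least one experiment of the dataset. The function names, dataset identifiers and codes are all hypothetical.

```python
def under_branch(code, branch):
    """True if a tree code is the branch itself or one of its descendants."""
    return code == branch or code.startswith(branch + ".")


def select_datasets(datasets, branches):
    """Return the ids of datasets that match every selected branch.

    datasets: dict mapping a dataset id to a list of experiments,
              each experiment being a set of tree codes (assumed layout).
    branches: list of selected branch codes.
    """
    selected = []
    for dataset_id, experiments in datasets.items():
        # Pool the annotations of all experiments: one matching
        # experiment is enough for a branch to count as present.
        codes = set().union(*experiments)
        if all(any(under_branch(c, b) for c in codes) for b in branches):
            selected.append(dataset_id)
    return selected


# Hypothetical annotations, for illustration only.
example = {
    "ds1": [{"A01.100"}, {"C04.200"}],
    "ds2": [{"A01.100"}],
    "ds3": [{"A010"}],
}
print(select_datasets(example, ["A01", "C04"]))
```

Pooling the codes per dataset is what lets a dataset qualify even when only a single experiment carries a given term.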
The following step shows the selected CleanEx datasets with a brief description of their content. One can then select one of these datasets for numerical data extraction by clicking on the dataset's identifier. This will display a more detailed description of the dataset's content.
At that stage, one has to select the numerical field to extract for generating the matrix of expression values. This is especially useful for dual-channel experiments, as some users might want to work with one channel only, while others might want to use the experiment/reference ratio. One can also discard some experiments of this dataset by unchecking them.
The data extraction tool then provides access to three different files.
  1. The "matrix" file contains the numerical values. Each row represents one experiment, and each column is one feature.
  2. The "experiments" file contains the detailed description of the experiments, as given in the previous page. Each line represents one experiment and is divided into three fields. The first one is an experiment counter, which shows the original experiment number in the CleanEx dataset. The second field is the experiment's short name, as given for example in GEO, and the third one is the text description of the experiment.
  3. The "feature" file is the feature description file. Each line is one feature, in the same order as the columns of the matrix file. The line fields are, respectively: feature counter, feature name, corresponding CleanEx target identifier, and feature text description.
This file format, especially the numerical matrix, can be directly imported into data analysis software, for example the R expression data analysis packages or the online EPCLUST tool by Jaak Vilo.
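As a rough illustration of how the three files fit together, the following Python sketch loads them and checks that their dimensions agree. It assumes tab-delimited text without header lines and a matrix file containing only numbers; the actual delimiter and layout of the CleanEx files may differ, and the function names are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Return the non-empty rows of a tab-delimited file (assumed format)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def load_extraction(matrix_fh, experiments_fh, features_fh):
    """Load the matrix, experiments and feature files into Python objects."""
    matrix = [[float(v) for v in row] for row in read_rows(matrix_fh)]
    experiments = [
        {"counter": r[0], "name": r[1], "description": r[2]}
        for r in read_rows(experiments_fh)
    ]
    features = [
        {"counter": r[0], "name": r[1], "target": r[2], "description": r[3]}
        for r in read_rows(features_fh)
    ]
    # One matrix row per experiment, one matrix column per feature.
    if len(matrix) != len(experiments):
        raise ValueError("matrix rows do not match the experiments file")
    if any(len(row) != len(features) for row in matrix):
        raise ValueError("matrix columns do not match the feature file")
    return matrix, experiments, features


# Made-up file contents, for illustration only.
matrix, experiments, features = load_extraction(
    io.StringIO("1.0\t2.0\n3.0\t4.0\n"),
    io.StringIO("1\texp_a\tfirst sample\n2\texp_b\tsecond sample\n"),
    io.StringIO("1\tfeat_1\tT1\tfirst feature\n2\tfeat_2\tT2\tsecond feature\n"),
)
print(experiments[0]["name"], features[1]["target"], matrix[1][0])
```

With real files, the `io.StringIO` objects would be replaced by handles returned by `open(...)`.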

Individual experiment retrieval

The first steps to retrieve experiments from heterogeneous datasets with the MeSH-oriented data selection and extraction module are identical to those of the dataset retrieval system, except that the number of corresponding experiments, and not datasets, is given for each tree branch.
At the data extraction stage, a new intermediate page allows one to refine the search by joining the selected MeSH terms with different operators. For example, one can discard all experiments annotated as "Neoplasms" by linking the terms with "BUT NOT" instead of the "AND" operator, or one can select data from "Colon" "OR" "Kidney" to keep both biological classes.
The important point to remember here is that if you want to discard one biological class by using the "BUT NOT" operator, you have to select this class via the MeSH-oriented tool from the beginning of your analysis.
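The three operators behave like set operations on the experiments annotated with each term. The Python sketch below illustrates this under the assumption that each selected MeSH term maps to a set of experiment identifiers; the `combine` function and the identifiers are hypothetical and are not part of CleanEx.

```python
def combine(first, operator, second):
    """Combine two sets of experiment ids with one of the three operators."""
    if operator == "AND":
        return first & second   # experiments annotated with both terms
    if operator == "OR":
        return first | second   # experiments annotated with either term
    if operator == "BUT NOT":
        return first - second   # experiments with the first term only
    raise ValueError("unknown operator: " + operator)


# Made-up annotations: term -> ids of the experiments carrying it.
# "Neoplasms" can only be excluded because it was selected as a term.
selected_terms = {
    "Colon": {1, 2, 3},
    "Kidney": {4, 5},
    "Neoplasms": {2, 5},
}

tissues = combine(selected_terms["Colon"], "OR", selected_terms["Kidney"])
healthy = combine(tissues, "BUT NOT", selected_terms["Neoplasms"])
print(sorted(healthy))
```

The dictionary also shows why a class to be excluded must be selected from the start: a term that was never selected has no experiment set to subtract.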
The following page displays all the selected experiments. Here again, one can unselect experiments to discard before extracting the values.
To generate the numerical matrix from heterogeneous datasets, one has to take into account the differences between these datasets. Values coming from an Affymetrix experiment are very different from the ones coming from SAGE or MPSS data, for example. To deal with this problem, pre-values have been calculated for each experiment of each dataset, where all the numerical raw values are re-scaled on the same basis. All the final values for each experiment are scaled between 0 and 1000, so that the same range is conserved for all experiments. These are the values which will then be extracted for the final matrix.
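The exact re-scaling method is not described here. Purely as an illustration of what bringing each experiment onto a common 0 to 1000 range can look like, the following Python sketch applies a linear min-max rescaling per experiment. This is an assumed stand-in, not necessarily the transformation used by CleanEx, and the `rescale` function is hypothetical.

```python
def rescale(values, upper=1000.0):
    """Linearly map the values of one experiment onto the range 0..upper.

    Assumption: plain min-max scaling; the real CleanEx pre-values may
    be computed differently (for example from ranks).
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant experiment carries no information about range.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(v - lowest) / span * upper for v in values]


# Made-up raw values for two experiments on very different scales.
print(rescale([2.0, 4.0, 6.0]))
print(rescale([150.0, 9000.0, 300.0]))
```

After such a step, every experiment spans the same interval, whatever the platform it came from.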
Before generating the matrix, an intermediate step extracts all the common genes for the different datasets selected for this analysis. The matrix is then generated with only the common genes.
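The common-gene step can be sketched as follows in Python. The sketch assumes that each selected experiment is available as a mapping from a gene identifier to a single scaled value; how several features of the same gene are merged is not covered, and the function name and identifiers are hypothetical.

```python
def common_gene_matrix(experiments):
    """Build a matrix restricted to the genes shared by all experiments.

    experiments: list of dicts mapping a gene id to its scaled value.
    Returns the sorted list of common genes and a matrix with one row
    per experiment and one column per common gene.
    """
    if not experiments:
        return [], []
    common = set(experiments[0])
    for experiment in experiments[1:]:
        common &= set(experiment)
    genes = sorted(common)
    matrix = [[experiment[g] for g in genes] for experiment in experiments]
    return genes, matrix


# Made-up values for two experiments from different datasets.
genes, matrix = common_gene_matrix([
    {"g1": 10.0, "g2": 20.0},
    {"g2": 30.0, "g3": 40.0},
])
print(genes, matrix)
```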
The three resulting files have the same format as the ones generated for a single dataset.

The keywords-based expression data retrieval system

The keywords-based search tool is a general text search engine which parses the experiment descriptions. This tool works in two different flavours, dataset retrieval and individual experiment retrieval, both described below.

The search by MeSH term is faster and more precise, but the free-text tool is quite handy when one is not that familiar with MeSH terms, or when searching for annotation which is not part of the MeSH catalog.

Expression datasets retrieval

As with the MeSH tree-based dataset selection, the Find CleanEx Expression Datasets page retrieves all the datasets for which all words of the query appear in the dataset's experiments, whether in the same experiment or in different ones.
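The matching rule can be illustrated with a short Python sketch. It assumes case-insensitive matching on whole alphanumeric words, which may not correspond to how the actual search engine tokenizes text; the function names and sample descriptions are hypothetical.

```python
import re


def tokenize(text):
    """Split a text into a set of lower-case alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_datasets(query, datasets):
    """Return the ids of datasets whose experiments cover all query words.

    datasets: dict mapping a dataset id to a list of experiment
              descriptions (assumed layout).
    """
    words = tokenize(query)
    hits = []
    for dataset_id, descriptions in datasets.items():
        # Pool the words of every experiment: the query words may be
        # spread over different experiments of the same dataset.
        pooled = set()
        for description in descriptions:
            pooled |= tokenize(description)
        if words <= pooled:
            hits.append(dataset_id)
    return hits


# Made-up descriptions, for illustration only.
example = {
    "ds1": ["Breast tissue sample", "Neoplasm cell line"],
    "ds2": ["Breast tissue"],
}
print(find_datasets("breast neoplasm", example))
```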
The result page lists the corresponding datasets, and allows the user to extract data from one dataset at a time, as with the MeSH tool.

Individual experiment retrieval

The Find Specific Experiments in CleanEx Expression Datasets page, like the last part of the MeSH experiment selection tool, allows one to group keywords and to link them with different operators, namely "AND", "BUT NOT" and "OR". For people who are familiar with MeSH terms, this saves the time spent "walking down" the MeSH term tree. For free-text searches, it allows the search to be refined to a more specific definition. Still, the results of the free-text approach will always be noisier than those obtained with the controlled MeSH vocabulary.
Once the experiments have been selected, the numerical data extraction process described above is offered.

Extracting expression data numerical values

The Extract numerical data from a selected CleanEx dataset tool works as explained in the Datasets retrieval using the MeSH hierarchical annotation paragraph. It has been created for people who already know the accession number of one specific dataset. One can then just select this dataset from the list. The dataset search steps are thus skipped, and one goes straight to the numerical data extraction part.

Finding common genes in different datasets

The Search common genes in different datasets tool finds, from the selected dataset list, the genes which are common to all the given datasets. The resulting page shows, for each common gene and for each selected dataset, a list of the corresponding features. Each feature is associated with its corresponding CleanEx target and its quality criteria.
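The structure of the result page can be pictured with the following Python sketch, which groups the features of each common gene by dataset. It assumes that every dataset is available as a mapping from a gene identifier to its list of feature names; CleanEx targets and quality criteria are left out, and all names are hypothetical.

```python
def features_by_common_gene(datasets):
    """List, for each gene shared by all datasets, its features per dataset.

    datasets: dict mapping a dataset id to a dict that maps a gene id
              to the list of its feature names (assumed layout).
    """
    gene_sets = [set(genes) for genes in datasets.values()]
    common = set.intersection(*gene_sets) if gene_sets else set()
    return {
        gene: {ds_id: genes[gene] for ds_id, genes in datasets.items()}
        for gene in sorted(common)
    }


# Made-up feature names, for illustration only.
example = {
    "ds1": {"g1": ["probe_a", "probe_b"], "g2": ["probe_c"]},
    "ds2": {"g1": ["tag_x"], "g3": ["tag_y"]},
}
print(features_by_common_gene(example))
```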

