The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary produced by the National Library of Medicine and used for indexing, cataloging, and searching for biomedical and health-related information and documents.
MeSH descriptors are arranged in a hierarchical structure. At the most general level are very broad headings such as "Anatomy" or "Organisms". More specific headings are found at narrower levels of the eleven-level hierarchy, such as "Monocytes", "Kidney Tubules" or "Leukemia, Lymphocytic, Acute".
The MeSH thesaurus is used by NLM for indexing articles for the MEDLINE/PubMed database. Each bibliographic reference is associated with a set of MeSH terms that describe the content of the item. Similarly, search queries use MeSH vocabulary to find items on a desired topic.
CleanEx is the first gene expression database which provides MeSH-oriented search tools.
Each individual experiment in every dataset included in CleanEx has been annotated with MeSH controlled terms via a semi-automatic process. From this hierarchical controlled annotation system, new search tools have been developed, which give rapid access to expression data having a certain biological or medical specificity. One can thus, for example, easily retrieve all the datasets containing expression measurements for "Breast AND Neoplasms" samples.
This search technique is much more precise than a basic "free-text" search in the experiment annotations provided by the authors in the GEO SERIES files.
The proposed search tools, described below, can retrieve either whole datasets or individual experiments from heterogeneous datasets.
The MeSH-oriented dataset selection and numerical extraction module is a tool that walks step by step down the MeSH annotation tree to find the CleanEx datasets corresponding to the selected part of the tree.
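As a rough illustration of selecting a part of the tree, the sketch below assumes that each dataset carries the MeSH tree numbers of its experiments and that a branch matches every tree number at or below it. The annotation table, the dataset IDs, the tree numbers and the function names are all hypothetical; this is not the CleanEx implementation.

```python
# Hypothetical annotation table: dataset ID -> MeSH tree numbers of its experiments.
# The tree numbers are illustrative placeholders, not verified MeSH codes.
DATASET_ANNOTATIONS = {
    "dataset_1": {"A01.100", "C04.200"},
    "dataset_2": {"A01.100.300"},
    "dataset_3": {"C04.200.050", "A05.400"},
}


def in_branch(tree_number, branch):
    """Return True if tree_number is the branch itself or one of its descendants."""
    return tree_number == branch or tree_number.startswith(branch + ".")


def datasets_under(branch, annotations):
    """Return the datasets with at least one experiment annotated inside the branch."""
    return {
        dataset
        for dataset, numbers in annotations.items()
        if any(in_branch(number, branch) for number in numbers)
    }


# Walking one step further down the tree narrows the selection.
broad = datasets_under("A01", DATASET_ANNOTATIONS)
narrow = datasets_under("A01.100.300", DATASET_ANNOTATIONS)
print(sorted(broad), sorted(narrow))
```

Matching on the full prefix followed by a dot keeps a branch such as "A01" from accidentally matching an unrelated number such as "A010".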
One can select more than one branch at a time, for example "Anatomy" and "Diseases". Once the tree branches have been selected, one then chooses between the two following options in the "Select next step" part:
The first steps to retrieve experiments from heterogeneous datasets with the MeSH-oriented data selection and extraction module are identical to those of the dataset retrieval system. The number of corresponding experiments, rather than datasets, is given for each tree branch.
At the data extraction stage, a new intermediate page allows the search to be refined by joining the selected MeSH terms with different operators. For example, one can discard all experiments annotated as "Neoplasms" by linking the terms with "BUT NOT", combine terms with the "AND" operator, or select data from "Colon" "OR" "Kidney" to keep both biological classes.
The important point to remember here is that if you want to discard one biological class by using the "BUT NOT" operator, you have to select this class via the MeSH-oriented tool from the beginning of your analysis.
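The three operators can be read as set operations on the experiments annotated with each selected term. The following sketch assumes a hypothetical index from MeSH term to experiment IDs; the IDs and the `combine` helper are invented for illustration and do not reflect how CleanEx evaluates queries internally.

```python
# Hypothetical index: selected MeSH term -> IDs of the experiments annotated with it.
EXPERIMENTS_BY_TERM = {
    "Colon": {"exp1", "exp2", "exp3"},
    "Kidney": {"exp4", "exp5"},
    "Neoplasms": {"exp2", "exp5", "exp6"},
}


def combine(left, operator, right):
    """Combine two sets of experiment IDs with one of the three operators."""
    if operator == "AND":
        return left & right      # experiments annotated with both terms
    if operator == "OR":
        return left | right      # experiments annotated with either term
    if operator == "BUT NOT":
        return left - right      # experiments of the left term lacking the right term
    raise ValueError(f"unknown operator: {operator!r}")


colon_or_kidney = combine(
    EXPERIMENTS_BY_TERM["Colon"], "OR", EXPERIMENTS_BY_TERM["Kidney"]
)
non_neoplastic = combine(colon_or_kidney, "BUT NOT", EXPERIMENTS_BY_TERM["Neoplasms"])
print(sorted(non_neoplastic))
```

Note that "Neoplasms" has to be present in the index for the "BUT NOT" step to work, which mirrors the requirement that the class to discard be selected from the start.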
The following page displays all the selected experiments. Here again, one can deselect any experiments to discard before extracting the values.
To generate the numerical matrix from heterogeneous datasets, one has to take into account the differences between these datasets. Values coming from an Affymetrix experiment are very different from those coming from SAGE or MPSS data, for example. To deal with this problem, values have been precomputed for each experiment of each dataset, with all the raw numerical values rescaled on the same basis. The final values for each experiment are scaled between 0 and 1000, so that the same range is kept for all experiments. These are the values that are then extracted for the final matrix.
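The text does not state which rescaling method is used, so the sketch below simply assumes a linear min-max mapping of each experiment's raw values onto the 0 to 1000 range. The function name and the sample values are hypothetical, and the actual CleanEx procedure may differ.

```python
def rescale_experiment(raw_values, upper=1000.0):
    """Linearly map one experiment's raw values onto the range [0, upper].

    Assumption: plain min-max scaling per experiment; the source only says
    that values end up between 0 and 1000.
    """
    lowest = min(raw_values)
    highest = max(raw_values)
    if highest == lowest:
        # A constant experiment carries no relative information; map it to 0.
        return [0.0 for _ in raw_values]
    span = highest - lowest
    return [(value - lowest) / span * upper for value in raw_values]


# Two invented experiments on very different raw scales.
intensity_like = rescale_experiment([2.0, 4.0, 6.0])
count_like = rescale_experiment([0, 10, 40])
print(intensity_like, count_like)
```

After rescaling, both experiments span the same range, which is what makes it possible to place them side by side in one matrix.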
Before generating the matrix, an intermediate step extracts all the common genes for the different datasets selected for this analysis. The matrix is then generated with only the common genes.
The three resulting files are identical to the ones generated for one single dataset.
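The common-gene step amounts to intersecting the gene lists of the selected datasets before building the matrix. The sketch below is a simplified illustration: it assumes one rescaled value per gene and per dataset, whereas a real matrix would have one column per experiment. The dataset names, the values and the function name are hypothetical.

```python
def common_gene_matrix(datasets):
    """Build a matrix restricted to the genes present in every dataset.

    datasets: dataset name -> {gene symbol: rescaled value}
    Returns (genes, columns, rows), with rows[i][j] the value of genes[i]
    in dataset columns[j].
    """
    gene_sets = [set(values) for values in datasets.values()]
    genes = sorted(set.intersection(*gene_sets)) if gene_sets else []
    columns = list(datasets)
    rows = [[datasets[name][gene] for name in columns] for gene in genes]
    return genes, columns, rows


# Invented example values, already rescaled to the 0-1000 range.
selected = {
    "ds_a": {"TP53": 120.0, "MYC": 800.0, "GAPDH": 950.0},
    "ds_b": {"MYC": 640.0, "GAPDH": 990.0},
}
genes, columns, rows = common_gene_matrix(selected)
for gene, row in zip(genes, rows):
    print(gene, row)
```

Genes missing from any one dataset are dropped entirely, so the resulting matrix has no empty cells.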
The keyword-based search tool is a general text search engine that parses the experiment descriptions. This tool works in two different flavours:
As with the MeSH tree-based dataset selection, the Find CleanEx Expression Datasets page retrieves all the datasets in which every word of the query appears in at least one experiment of the dataset, whether the words occur in different experiments or in the same one.
The result page lists the corresponding datasets, and allows the user to extract data from one dataset at a time, as with the MeSH tool.
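The dataset-level matching rule described above can be sketched as follows: a dataset is kept when every query word occurs somewhere among its experiment descriptions, not necessarily within a single experiment. The descriptions, the tokenisation and the function names are assumptions made for this example.

```python
import re


def words(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_datasets(query, descriptions):
    """Return the datasets whose experiments, taken together, contain every query word.

    descriptions: dataset name -> list of free-text experiment descriptions
    """
    wanted = words(query)
    if not wanted:
        return []
    hits = []
    for dataset, experiments in descriptions.items():
        seen = set()
        for description in experiments:
            seen |= words(description)
        if wanted <= seen:
            hits.append(dataset)
    return hits


# Invented experiment descriptions.
DESCRIPTIONS = {
    "ds1": ["breast tumor biopsy", "normal breast tissue"],
    "ds2": ["kidney cortex", "colon tumor"],
    "ds3": ["breast epithelium", "lung tumor"],
}
matches = find_datasets("breast tumor", DESCRIPTIONS)
print(matches)
```

Here "ds3" matches even though no single experiment mentions both words, which shows why free-text results tend to be noisier than a search on controlled terms.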
The Find Specific Experiments in CleanEx Expression Datasets page, like the last part of the MeSH experiment selection tool, allows keywords to be grouped and linked with different operators, namely "AND", "BUT NOT" and "OR". For people who are familiar with MeSH terms, this saves the time spent "walking down" the MeSH term tree. For free-text searches, it allows the query to be narrowed to a more specific definition. Still, results from the free-text approach will always be noisier than those obtained with the controlled MeSH vocabulary.
Once the experiments have been selected, the numerical data extraction process explained in the above paragraph is proposed.
The Extract numerical data from a selected CleanEx dataset tool works as explained in the Datasets retrieval using the MeSH hierarchical annotation paragraph. It was created for people who already know the accession number of one specific dataset and can simply select it from the list. The dataset search step is thus skipped, leading straight to the numerical data extraction part.
The Search common genes in different datasets tool finds, from the selected dataset list, the genes that are common to all the given datasets. The resulting page shows, for each common gene and for each selected dataset, a list of the corresponding features. Each feature is associated with its corresponding CleanEx target and its quality criteria.