much.more (http://muchmore.dfki.de) develops technologies that will result in a prototype system for cross-lingual information organization and access in the medical domain. The project provides a framework in which existing technologies can be integrated and refined while new technologies are developed. Its main contributions will be research on the effective combination of statistical, knowledge-based and heterogeneous approaches and resources and, in connection with this, on the construction and expansion of domain-specific concept hierarchies through multilingual term and relation extraction. This approach is driven by the availability of very rich concept hierarchies in the medical domain (International Classification of Diseases: ICD, Medical Subject Headings: MeSH, and the Unified Medical Language System: UMLS), as well as large, correspondingly classified document collections that help advance the state of the art beyond the usual cross-lingual retrieval based on search terms. The medical domain is therefore an advanced starting point for research into the use of concept hierarchies in cross-lingual information access and management.
much.more will carry out the following tasks:
o Research regarding the effective combination of statistical, knowledge-based and heterogeneous approaches and resources and their integrated use for cross-lingual information access and management, including performance evaluation for realistic information access tasks.
o Research and technology development concerning the automated acquisition and effective use of domain-specific concept hierarchies and corresponding multi-lingual linguistic resources (parallel and comparable corpora).
o Demonstration of a cross-lingual information access prototype system for the medical domain, and user evaluation of the system to ensure usability for real-life tasks.
The cross-lingual information access prototype system for the medical domain will be made publicly accessible through the internet. It will provide access to multilingual information on the basis of a domain ontology and classification. For the main task of multilingual domain modelling, the project will focus on German and English. For the multilingual terminology extraction task, a broader range of languages will be covered, depending on the availability of parallel and comparable corpora.
A large part of the work in the period July 2000 to November 2001 consisted of compiling several reports to further define the scope and purpose of the project. The consortium first formulated a State of the Art report on cross-lingual information retrieval (CLIR) in general and on concept-based methods in the medical domain in particular. On the basis of this report, User Requirements for a concept-based, medical CLIR system were formulated, while EIT defined a Performance Testing Plan for evaluating such a system. Reports are available at http://muchmore.dfki.de/pub.html.
Parallel to these developments, relevant Medical Corpora were identified, collected and prepared for further processing. To facilitate an easy exchange of annotated data, an XML-based Annotation Format was defined, on the basis of which Corpus Annotation was initiated with tools for shallow processing (i.e. PoS tagging, morphological analysis and phrase recognition, or chunking) and semantic annotation (based on UMLS and EuroWordNet). An experimental prototype was set up by EIT that gives access to semantically annotated scientific medical journal abstracts.
To conduct a meaningful Performance Evaluation comparing different CLIR methods, and combinations of such methods, in the medical domain, the project initiated the development of a Test Collection of medical documents with corresponding relevance assessments, using the EIT probabilistic RotondoSpider Retrieval System.
Research and development work was also started in the areas of Bilingual Term Extraction, Sense Disambiguation, and Relation Extraction, where EIT's similarity thesaurus technology was deployed to generate domain-specific bilingual lexical resources.
A small group of relevant industry and research representatives is being consulted about becoming members of the MuchMore User Group.