Short description
(French, translated)
|
Specific objectives:
- EPD must remain the reference database for bioinformaticians who are developing prediction tools to identify and characterize the promoters of multicellular eukaryotes.
- In addition, EPD must become more useful to, and better known among, all researchers interested in the regulation of a specific gene.
- The list of promoters for the main organisms must be better described, in particular for human, mouse, Drosophila, C. elegans and Arabidopsis.
- Interfaces between EPD and other software must be developed so that non-specialists can access the promoters described in EPD through advanced bioinformatics methods.
- CleanEx must continue to serve as a tool (an intermediate structure) for linking the promoters in EPD to public gene expression data.
- In addition, CleanEx must strengthen its position as an annotation resource for expression data.
- In the medium term, CleanEx must consolidate and better define its place among other gene expression databases, and possibly benefit from more formal collaborations.
Most important measures:
- Refinement of existing automatic procedures, and development of new ones, for defining promoters from raw data such as chromatograms of 5' ESTs from full-length cDNAs.
- Expansion and improvement of the promoter documentation to make EPD more useful to a larger number of researchers.
- Collaborations, "labour splitting" and "resource sharing" with other SIB groups. The arrival of two new groups working on gene regulation and gene expression offers great potential for synergies.
- Regular evaluation of tools for predicting promoters and other regulatory signals. Should these tools become more reliable, we would then consider integrating predicted promoters into EPD.
- Collaboration with providers of "gene expression technologies" such as Affymetrix, to improve CleanEx as a tool for the biological annotation of expression data.
- Regular meetings with users of EPD and CleanEx, possibly within existing structures such as the Swiss Microarray Consortium, to better understand the needs of a broad spectrum of users.
- Organization of courses to make the databases known to a wider audience.
|
Abstract
(English)
|
1. Procedures for automatic promoter definition
During 2006, we developed a gene-driven data processing pipeline that now enables us to synchronize gene names and gene annotations between EPD, Ensembl and other gene-centric resources. This maintenance system, initially inspired by our automatic procedures for building CleanEx releases, is based on the following principles: stable information, such as old EPD entries based on experiments published in journal articles, or transcription initiation site profiles derived from public MGA (mass genome annotation) data, is stored separately in so-called EPD source files. The EPD release entries for a given organism are rebuilt dynamically from these source files, Ensembl and other resources each time a new genome assembly becomes available for that organism.
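The rebuild principle can be illustrated with a minimal Python sketch. It is not the EPD pipeline: the record layout, the function name `rebuild_entries`, the `SYMBOL_n` entry names and the example identifiers are all hypothetical, and coordinate remapping to the new assembly is left out. The point is only that stable source records keep a permanent gene key, while everything that can change (here, the gene symbol) is taken from the current annotation at build time.

```python
def rebuild_entries(sources, annotation):
    """Rebuild release entries from stable source records.

    sources    -- list of dicts with a stable 'gene_id' and an 'evidence' label
                  (hypothetical layout, not the real EPD source file format)
    annotation -- dict mapping gene_id -> current gene symbol, as it might be
                  taken from the latest annotation release
    """
    # Group source records by gene, dropping genes that are absent
    # from the current annotation.
    by_gene = {}
    for record in sources:
        if record["gene_id"] in annotation:
            by_gene.setdefault(record["gene_id"], []).append(record)

    # Name every entry after the *current* symbol and number the
    # promoters of each gene in source order.
    entries = []
    for gene_id, records in by_gene.items():
        symbol = annotation[gene_id]
        for number, record in enumerate(records, start=1):
            entries.append({
                "entry_id": f"{symbol}_{number}",
                "gene_id": gene_id,
                "evidence": record["evidence"],
            })
    return entries


if __name__ == "__main__":
    sources = [
        {"gene_id": "G1", "evidence": "journal article"},
        {"gene_id": "G2", "evidence": "MGA profile"},
        {"gene_id": "G1", "evidence": "MGA profile"},
    ]
    for entry in rebuild_entries(sources, {"G1": "ABC1", "G2": "XYZ2"}):
        print(entry["entry_id"], entry["evidence"])
```

Running the same function again with an updated `annotation` dictionary yields entries under the new symbols without touching the source records, which is the property the maintenance system relies on.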
The introduction of the new maintenance procedures required fundamental changes in the database format and organization. From now on, the identifiers of EPD entries will be based on official gene symbols, such as those provided by the Genew database for human genes. Alternative promoters for the same gene will be redefined and renumbered each time new MGA data become available for a given organism. A uniform file format has been worked out for the storage of TSS mapping data generated by high-throughput technologies. In collaboration with Christian Iseli from Victor Jongeneel's group, we developed fast software for matching sequence tags to new genome releases. The MADAP software for defining promoters in transcription start site histograms had to be re-parameterized so that preliminary promoters can be defined from lower-quality data, in order to increase coverage. Furthermore, on the purely technical side, the width of the field defining the position of transcription start sites within a sequence had to be increased in order to deal with human chromosome-sized sequence entries. This in fact required a major upgrade of the Signal Search Analysis software package described in previous reports.
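To make the idea of defining promoters from a transcription start site histogram concrete, here is a deliberately simple Python sketch. It is not MADAP and does not reproduce its algorithm: it uses plain gap-based grouping as a stand-in, and the function name `cluster_tss` and the parameters `max_gap` and `min_tags` are assumptions introduced for illustration. It does show why relaxing a threshold such as `min_tags` increases coverage at the cost of admitting weaker, preliminary promoters.

```python
def cluster_tss(counts, max_gap=10, min_tags=5):
    """Group TSS positions into candidate promoters.

    counts   -- dict mapping genomic position -> number of mapped 5' tags
    max_gap  -- positions further apart than this start a new cluster
    min_tags -- clusters with fewer tags in total are discarded
    """
    # Split the sorted positions wherever the gap exceeds max_gap.
    clusters, current = [], []
    for pos in sorted(counts):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    # Keep sufficiently supported clusters; the representative TSS is the
    # position with the most tags (leftmost position on ties).
    promoters = []
    for positions in clusters:
        total = sum(counts[p] for p in positions)
        if total >= min_tags:
            peak = max(positions, key=lambda p: (counts[p], -p))
            promoters.append({
                "start": positions[0],
                "end": positions[-1],
                "tss": peak,
                "tags": total,
            })
    return promoters


if __name__ == "__main__":
    histogram = {100: 1, 102: 6, 105: 2, 300: 2, 301: 1}
    print(cluster_tss(histogram))               # strict threshold
    print(cluster_tss(histogram, min_tags=2))   # relaxed threshold, more coverage
```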
The web display of promoter entries has been improved in several ways. Most importantly, we have introduced hyperlinks to genome browsers that display EPD-derived sequence features in the context of other genome annotations. These links use the BED (Browser Extensible Data) format for uploading promoter-related information.
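As an illustration of what such an upload can contain, the following Python sketch writes one promoter as a six-column BED line. The function name `promoter_to_bed`, the window sizes and the example entry are assumptions and do not describe the actual EPD export code. The one detail taken from the BED format itself is the coordinate convention: BED intervals are 0-based and half-open, so a 1-based start position has to be decremented by one while the end position is left unchanged.

```python
def promoter_to_bed(chrom, tss, strand, name, upstream=499, downstream=100):
    """Return a BED6 line for a window around a 1-based TSS position.

    The window spans `upstream` bases before and `downstream` bases after
    the TSS, measured in the direction of transcription (assumed sizes).
    """
    if strand == "+":
        first, last = tss - upstream, tss + downstream
    elif strand == "-":
        first, last = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")

    # Convert 1-based inclusive [first, last] to 0-based half-open BED.
    bed_start = max(first - 1, 0)
    bed_end = last
    return "\t".join([chrom, str(bed_start), str(bed_end), name, "0", strand])


if __name__ == "__main__":
    # Hypothetical entry name and position, for illustration only.
    print(promoter_to_bed("chr1", 1000, "+", "GENE_1"))
```

With the default window, a plus-strand TSS at position 1000 gives the interval 500 to 1100, that is, 600 bases with the TSS at the 500th position of the window.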
2. CleanEx developments
CleanEx has continued to grow rapidly thanks to automatic data import from public repositories, in particular GEO. Today, the database offers access to 580 gene expression datasets containing millions of individual gene expression measurements. Format and software support has been added for two new platforms: MPSS and LongSAGE.
As announced in last year's report, we introduced an annotation system for gene expression datasets based on MeSH (Medical Subject Headings) terms. This system enables end users to rapidly zoom in on datasets falling into their area of interest. In addition, we improved the web interfaces for exporting numerical data in various formats. In particular, it is now possible to download biologically annotated gene expression datasets in a format that can be directly imported into Bioconductor software.
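One way to picture the "zoom in" behaviour is through the hierarchical tree numbers that MeSH assigns to its terms: selecting a branch selects everything annotated below it. The Python sketch below assumes that each dataset carries a list of such tree numbers. The function name `datasets_under`, the dataset identifiers and the tree numbers shown are illustrative placeholders, not CleanEx data or its actual query mechanism.

```python
def datasets_under(annotations, branch):
    """Return the IDs of datasets annotated at or below a MeSH branch.

    annotations -- dict mapping dataset ID -> list of MeSH tree numbers
    branch      -- tree number of the branch of interest, e.g. "C04"
    """
    def is_under(tree_number):
        # Match the branch itself or any descendant; the dot check keeps
        # "C04" from matching an unrelated number such as "C040".
        return tree_number == branch or tree_number.startswith(branch + ".")

    return sorted(
        dataset_id
        for dataset_id, tree_numbers in annotations.items()
        if any(is_under(t) for t in tree_numbers)
    )


if __name__ == "__main__":
    # Hypothetical dataset IDs and annotations.
    annotations = {
        "dataset_A": ["C04.588.180"],
        "dataset_B": ["C04", "A11.251"],
        "dataset_C": ["A11.251.210"],
    }
    print(datasets_under(annotations, "C04"))
    print(datasets_under(annotations, "A11.251"))
```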
Additional improvements concern the HTML-based entry viewers and the export formats for target entries. The development of fast sequence tag mapping software (see under EPD) enabled us to add dynamic links from target entries to genome browsers.
3. Collaboration with other SIB groups
Several of the new features added to the EPD and CleanEx web interfaces are based on joint developments with other SIB groups. Computationally intensive maintenance procedures run on machines of the Vital-IT high-performance computing platform. The mapping of cDNA 5' ends to genome positions (a necessary step in the definition of promoters and in the construction of EPD entries) is based on the trome database maintained by Victor Jongeneel's group. The initiation site clustering program MADAP has been developed by Mauro Delorenzi. The tagger software (developed in collaboration with Victor Jongeneel's group) is used for mapping promoters and CleanEx expression targets to new genome assemblies, as well as for dynamic hyperlinks to genome browsers in the web-based display of CleanEx entries.
CleanEx developments have benefited from regular informal discussions with the local user community. We thank in particular Pascale Anderle, Thierry Sengstag, Eugenia Migliavacca, and Pierre Farmer for useful feedback and suggestions.
4. Organisation of workshops and courses
EPD, CleanEx and additional web-based bioinformatics resources of our group have been presented at the following workshops and courses: the EMBRACE Workshop on Regulatory Sequence Motif Discovery (Uppsala, Sweden, November 2006) and the SIB/CIG workshop on finding and analysing eukaryotic promoters (Lausanne, November 2006).
|