HUSAR Bioinformatics Lab
Deutsches Krebsforschungszentrum Genomics Proteomics Core Facility
HUSAR Research
  1. National Genome Research Network (NGFN)
  2. BioinfoGRID - EGEE
  3. HOBIT - Helmholtz Open Bioinformatics Technology
  4. Alu exonization in cancer
  5. Automatic Gene Function Annotation
  6. MicroRNA Prediction
  7. GENIUS Sequence Analysis
  8. Gene Expression Data Analysis

Former Projects:
  1. German Human Genome Project (DHGP)
  2. Helmholtz Network for Bioinformatics (HNB)

1. National Genome Research Network (NGFN)
The aim of the National Genome Research Network (NGFN) is to increase our knowledge about the function of medically relevant genes in humans. In the NGFN, clinical research and basic research complement each other in order to develop new therapeutic strategies.

We are currently collaborating with the group of Dr. Wiemann, a member of the German cDNA Sequencing Consortium at DKFZ and coordinator of the project "Sequence analysis of tissue- and development-specific full length cDNAs of the human genome":

1.1. Functional Annotation of cDNAs
In our collaboration we designed a web-enabled tool for the automatic high-throughput mapping and characterization of cDNAs. We plan to extend the task cDNA2Genome in the following directions:

  • Quality assessment of cDNAs: cDNA sequences are obtained from complementary DNA libraries (cDNA libraries). Theoretically, each of these libraries should contain full-length DNA copies of every functional messenger RNA. However, the quality of the cDNA sequences in these libraries is often highly inconsistent. We are currently implementing a tool for assessing cDNA quality in order to reduce the time-consuming manual curation, and the considerable experience, currently needed to validate each cDNA.

  • Integration of the NCBI and Ensembl databases: At the moment cDNA2Genome integrates experimental data as well as our own databases. To provide a broader overview, we will also integrate precomputed data from the Ensembl project. The two pipelines, Ensembl and NCBI, use different approaches to data analysis, which result in different sets of annotated genes. Benchmarking performed in our group has shown significant differences in sequence content between the two genome annotation projects.

  • Extension to other organisms as their genomes become available.

  • Pseudogene identification: The tool for cDNA quality evaluation will be extended in order to decide whether duplicated hits of a cDNA in the genome represent pseudogenes or a simple gene duplication.

  • Development of an interactive ORF viewer

1.2. Functional annotation of proteins
There are three tasks for the high-throughput functional annotation of proteins using the W3H task system: ProtSweep (identification of proteins), 2Dsweep (secondary structure features, protein localization and physicochemical characteristics) and DomainSweep (functional annotation of protein domains).

At the moment DomainSweep predictions are highly stringent. Only matches which fulfill database-specific criteria, such as cutoff values or E-value thresholds, are taken into account; for example, a predicted Pfam-A domain must score higher than the lowest score of all true positive family members. Further putative domain hits can be identified using less stringent cutoffs. We are working on a new version of DomainSweep in which the user will be able to specify cutoffs for the examination of domains within the twilight zone.
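The two-tier filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DomainSweep's own code: the `DomainHit` record, the `classify_hit` function, the family names and all threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One match of a query protein against a family database (hypothetical record)."""
    database: str
    family: str
    score: float


def classify_hit(hit, trusted_cutoff, twilight_cutoff):
    """Label a hit 'trusted', 'twilight' or 'rejected'.

    trusted_cutoff:  database-specific threshold, e.g. the lowest score of all
                     true positive family members (assumed to be known).
    twilight_cutoff: looser threshold chosen by the user.
    """
    if twilight_cutoff > trusted_cutoff:
        raise ValueError("twilight_cutoff must not exceed trusted_cutoff")
    if hit.score > trusted_cutoff:
        return "trusted"
    if hit.score >= twilight_cutoff:
        return "twilight"
    return "rejected"


# Illustrative values only; family names, scores and cutoffs are made up.
hits = [
    DomainHit("Pfam-A", "family_1", 42.0),
    DomainHit("Pfam-A", "family_2", 18.5),
    DomainHit("Pfam-A", "family_3", 3.0),
]
labels = [classify_hit(h, trusted_cutoff=25.0, twilight_cutoff=10.0) for h in hits]
```

Keeping the stringent and the user-defined cutoff as separate arguments means the default, highly stringent behaviour is unchanged when the user does not ask for twilight-zone hits.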

We are also working on integrating more family databases. To increase coverage, two databases are currently being added: SMART, which mainly represents nuclear, signalling and extracellular protein domains, and TIGRFAMs, a library of hidden Markov models of full-length proteins and shorter regions designed to support both automated and manually curated annotation of genomes. Subsequently, protein structure databases such as SCOP and CATH will be added to integrate structural information and enhance the capability of DomainSweep in protein classification and characterisation.

Additionally, we are improving the quality of DomainSweep functional assignments by increasing its capacity to search for proteins with identical domain architecture. A protein database with known domain combinations is being implemented. DomainSweep will be able to query it, group all proteins sharing a common domain structure, retrieve more functional annotations and perform further analysis steps such as multiple alignments.
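As a rough illustration of grouping proteins by domain architecture, the Python sketch below indexes proteins by their ordered list of domains. The function names, protein identifiers and domain names are hypothetical, and treating the order of domains as part of the architecture is an assumption rather than a documented DomainSweep rule.

```python
from collections import defaultdict


def group_by_architecture(protein_domains):
    """Map each ordered domain architecture to the proteins that share it.

    protein_domains: {protein_id: [domain, domain, ...]} in N- to C-terminal order.
    """
    groups = defaultdict(list)
    for protein_id, domains in protein_domains.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


def proteins_with_same_architecture(query_domains, protein_domains):
    """Return the sorted proteins whose architecture equals that of the query."""
    groups = group_by_architecture(protein_domains)
    return sorted(groups.get(tuple(query_domains), []))


# Made-up example data.
known = {
    "protein_A": ["dom1", "dom2"],
    "protein_B": ["dom1", "dom2"],
    "protein_C": ["dom2", "dom1"],  # same domains, different order
    "protein_D": ["dom1"],
}
matches = proteins_with_same_architecture(["dom1", "dom2"], known)
```

The proteins returned for a query could then be passed on to later steps such as annotation retrieval or multiple alignment.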


Our goal for the new NGFN-2 is to extend our data analysis system by creating individual analysis blocks and pipelines adapted to the needs of NGFN's KGs and SMPs. We are therefore looking for partners willing to collaborate on developing specialized analysis blocks not yet included in our library. These newly developed blocks and the existing ones in our library can then be combined to complement each other and to provide our collaborating partners with customized high-throughput analysis tasks.


2. BioinfoGRID - EGEE: Bioinformatics Grid Application for life science
The European Commission promotes the Bioinformatics Grid Application for life science (BioinfoGRID) project. The project aims to connect many European computer centres in order to carry out bioinformatics research and to develop new applications in the field, using a network of services based on Grid networking technology, which represents the natural evolution of the Web.

More specifically, the BioinfoGRID project will make research in genomics, proteomics and transcriptomics, as well as applications in molecular dynamics, much easier. Calculations will be distributed at any one time across thousands of computers in Europe and around the world, which reduces computation times.

Furthermore, it will provide access to many different databases and hundreds of applications belonging to thousands of European users by exploiting the Grid infrastructure created by the European EGEE project and coordinated by CERN in Geneva.

3. HOBIT - Helmholtz Open Bioinformatics Technology
The HOBIT initiative is dedicated to forming the core of a network linking bioinformatics centres. It should be understood as an initial organisational and technological platform for interconnecting bioinformatics activities. The aim of the network is to link applications and resources in a uniform way, thereby providing an efficient communication tier for access to bioinformatics resources.

Bioinformatics resources are commonly highly localized and accessible only via interactive web pages. This has several disadvantages. For example, it is difficult to integrate external resources into local applications, which very often leads to redundant installations of external applications and databases. As a consequence, information may diverge, especially in databases where additional information is generated locally, and significant administrative overhead is required. Further problems include the lack of uniform access mechanisms for information and the need to deal with different technologies.

Hobit pages at DKFZ


4. Alu exonization in cancer
Alu exonization and alternative splicing in cancer genes
Israel Cooperation in Cancer Research Project Ca-119 with Dr. Gil Ast, Tel Aviv University, Israel

Alternative splicing is a major mechanism that diversifies genetic information by producing more than one mRNA from a single gene. Aberrant regulation of alternative splicing has been implicated in cancer. Gil Ast's group has shown that more than 5% of human alternatively spliced exons originate from Alu retrotransposons, which are found in primates only. Gil Ast demonstrated that a single point mutation in the Alu sequence can turn an intronic Alu element into a new exon. The analysis predicted that in one of the genes the insertion of an Alu exon into the mRNA is specific to certain types of leukemia and lymphoma. Other genes involved in cancer, such as BRCA and APC, are known to contain Alu elements which are involved in alternative splicing.

Our collaboration with Gil Ast combines experimental work with bioinformatics work. We will build a high-throughput bioinformatics tool for analyzing DNA for Alu exonization. We will conduct a systematic search for Alu-containing splice isoforms in different cancer types and cancer stages, and systematically screen genes involved in pathways for Alu-containing splice isoforms which are modified in cancer cells.

5. Strategies for Automatic Gene Function Annotation using Gene Ontology
Traditional methods of gene or protein function prediction are often manual or semi-automated, which is inadequate for large-scale sequence analysis. Automated methods almost always rely on sequence homology searches against various databases and extract the annotation of the related sequences found there. These methods have to deal with the following problems:
  1. Databases usually differ in their quality of curation and in the amount of annotation they offer.

  2. The annotation currently found in databases is highly heterogeneous and inconsistent in its use of database fields.

  3. Annotation is often not expressed in a machine-readable or consistent manner or is misleading.

In collaboration with the department of Dr. Roland Eils we are developing a learning algorithm for the automatic annotation of gene function using the Gene Ontology (GO) database. This approach overcomes many of the usual problems faced in automatic annotation, such as incomplete, inconsistent or incorrect annotation and the lack of a formalised annotation language. The main strategy is to use GO-mapped databases, including model organism and domain databases, which contain reliable, high-quality information with structured GO terms. Sequence similarity, frequency and related or supporting information are selected from database searches to assign score values to each functional category. Different scoring functions are combined by a mathematical model. In parallel we use standard classification algorithms (e.g. support vector machines (SVM), artificial neural networks, sparse grids) to assign the correct GO nodes to a given sequence. For this we will use attributes such as the scores from BlastX searches, overlap length, percentage of similarity, hits against domain databases, and others.
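The text does not specify the mathematical model used to combine the scoring functions. Purely as an illustration, the Python sketch below assumes the simplest option, a weighted sum of per-category scores that have already been scaled to the range 0 to 1. The feature names, weights and GO labels are hypothetical placeholders.

```python
def combine_scores(evidence, weights):
    """Combine several per-category scores into one value per GO term.

    evidence: {go_term: {feature_name: score in [0, 1]}}
    weights:  {feature_name: weight}; a missing feature counts as 0.
    Assumption: a weighted sum stands in for the unspecified model.
    """
    combined = {}
    for go_term, features in evidence.items():
        combined[go_term] = sum(
            weight * features.get(name, 0.0) for name, weight in weights.items()
        )
    return combined


def rank_terms(evidence, weights):
    """Return GO terms sorted from highest to lowest combined score."""
    combined = combine_scores(evidence, weights)
    return sorted(combined, key=combined.get, reverse=True)


# Made-up weights and evidence for two placeholder GO terms.
weights = {"similarity": 0.5, "frequency": 0.25, "support": 0.25}
evidence = {
    "GO:term_a": {"similarity": 0.8, "frequency": 0.4},
    "GO:term_b": {"similarity": 0.2, "frequency": 1.0, "support": 1.0},
}
ranking = rank_terms(evidence, weights)
```

In a learning setting the weights would be fitted on the training sequences rather than set by hand, and the same feature vectors could be fed to a classifier such as an SVM.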

We use sequences from GO-mapped organism databases for yeast, mouse, Drosophila, C. elegans and Arabidopsis as training sets. Training on the model organism databases significantly improved the quality of annotations produced with our novel scoring scheme. However, the annotation covers only a limited number of sequences; to address this, we would like to include information from GO-mapped protein domain and family databases.

One of the problems encountered when using the Gene Ontology database is the existence of multiple parents for single GO entries, due to the nature of the underlying biological data. Proteins, for instance, may contain different domains involved in different biological processes and thus may have different GO numbers attached to them. To tackle this problem we make use of protein domain and family databases such as Pfam, Prosite and Blocks, which are linked to Gene Ontology via the InterPro system. Query sequences will be scanned against specific Pfam families. Scores above a threshold usually reflect the presence of a particular domain. In a similar way, specific Prosite and Blocks patterns can be searched for in the sequence. The existence (or non-existence) of certain domains in a sequence may help to confirm (or reject) the assignment of a GO entity to a sequence.

The true domains (existing domains) will be assembled to derive protein information from the domain information. For this purpose, reverse querying of the GO-linked database can be useful: GO numbers are used as the query, and the results are the proteins (reference sequences) which contain all true domains. Even though GO-mapped databases do not give positional information for domains in the respective sequences, this method can locate the domains (using the positional information acquired from Pfam searches, for instance) and map the domain arrangements in both the reference sequence and the query sequence. This will also give an insight into domain rearrangements (if any) between query and reference sequences. Finally, reference protein names will be assigned to query sequences along with domain-level GO numbers. Where no protein contains all the true domains, the annotation can be given at the domain level.
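The reverse query and its domain-level fallback can be sketched as a set intersection. The Python below is only a sketch: it assumes the GO-linked database has been loaded into a dictionary from GO number to the set of reference proteins annotated with it, and all identifiers and function names are hypothetical placeholders.

```python
def reference_proteins_for(true_domain_go_ids, go_to_proteins):
    """Return the reference proteins annotated with every given GO number.

    go_to_proteins: {go_id: set of reference protein names} (assumed index).
    """
    protein_sets = [go_to_proteins.get(go_id, set()) for go_id in true_domain_go_ids]
    if not protein_sets:
        return set()
    return set.intersection(*protein_sets)


def annotate(true_domain_go_ids, go_to_proteins):
    """Annotate at protein level if possible, otherwise at domain level."""
    references = reference_proteins_for(true_domain_go_ids, go_to_proteins)
    level = "protein" if references else "domain"
    return {
        "level": level,
        "references": sorted(references),
        "go_ids": sorted(true_domain_go_ids),
    }


# Placeholder index; the GO numbers and protein names are made up.
index = {
    "GO:x1": {"ref_1", "ref_2"},
    "GO:x2": {"ref_2", "ref_3"},
    "GO:x3": {"ref_4"},
}
protein_level = annotate(["GO:x1", "GO:x2"], index)
domain_level = annotate(["GO:x1", "GO:x3"], index)
```

Comparing domain positions between query and reference, as described above, would be a separate step that uses the coordinates reported by the domain searches.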

Using the strategy described above we will be able to assign a functional classification to a given cDNA or protein sequence with a high level of confidence. Where there are contradictions, or no assignments at all, we will fall back on more traditional methods of sequence annotation, albeit with less reliability. These will include several kinds of database searches against popular sequence databases such as SwissProt or certain parts of the EMBL database, as well as methods for predicting specific protein features such as secondary structure, protein localization or the occurrence of signal peptides.


6. MicroRNA Prediction
MicroRNAs (miRNAs) are small RNAs that form imperfect duplexes with the 3' UTRs of target messenger RNAs. MiRNAs are transcribed as short hairpin precursors and are processed into active miRNAs by Dicer, a ribonuclease. In all known cases miRNAs repress the translation of the target gene. To understand the biological function of miRNAs it is important to identify all miRNAs in the different organisms, find their targets and develop models for their regulation. Volinia et al. (PNAS, 103, no. 7, 2257-2261) have shown that miRNAs can contribute to cancer development and progression and are differentially expressed in normal tissues and cancer.

We have installed several tools for target identification, such as miRanda and TargetScan. These programs use different algorithms and identify partially different targets in different genomes. On the one hand, we will try to implement an improved and robust predictor for miRNA targets by using fuzzy logic techniques to combine the available predictors with other information such as GO. On the other hand, we are developing a new task for analyzing "possible" miRNAs by comparing them to known miRNAs and by checking their localization on the genome together with the hairpin precursor.
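To give an idea of how fuzzy logic can combine several predictors, the Python sketch below converts raw scores into membership values between 0 and 1 and merges them with the standard fuzzy AND (minimum) and OR (maximum). It does not call or imitate any existing prediction program: the score ranges, the rule and the function names are assumptions made for illustration only.

```python
def ramp(value, low, high):
    """Piecewise-linear fuzzy membership: 0 at or below low, 1 at or above high."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def target_confidence(pred_a, pred_b, go_support):
    """Fuzzy confidence that a gene is a genuine miRNA target.

    All three arguments are membership values in [0, 1].
    Hypothetical rule: (A AND B) OR ((A OR B) AND GO support),
    with AND = min and OR = max.
    """
    both_agree = min(pred_a, pred_b)
    one_plus_go = min(max(pred_a, pred_b), go_support)
    return max(both_agree, one_plus_go)


# Made-up raw scores on assumed scales of 0-10 (predictor A) and 0-100 (predictor B).
a = ramp(5.0, low=0.0, high=10.0)
b = ramp(100.0, low=0.0, high=100.0)
confidence = target_confidence(a, b, go_support=0.25)
```

Because both operators are monotonic, raising any single input can never lower the combined confidence, which makes the behaviour of such a rule easy to reason about.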


7. GENIUS Sequence Analysis
The HUSAR Bioinformatics Lab at DKFZ was established in 1986 as the German EMBnet (European Molecular Biology Network) node. The EMBnet consists of a group of 26 recognized bioinformatics centers (nodes). The combined expertise of the nodes allows the EMBnet to provide a service to the European molecular biology community, establishing a legitimate forum for questions and a recognised source of information about new developments.

As GENIUS Sequence Analysis, the HUSAR Bioinformatics Lab at DKFZ provides regularly updated databases, software, independent research, networking, training and documentation for the German scientific community.


8. Algorithmic Tools for Gene Expression Data Analysis
To interpret the biology of genetic profiles produced by microarray experiments, the expression data must be analysed in the context of the proteins that the genes encode. We have therefore started to develop computational tools that compare and analyse these expression profiles in a suitable way and retrieve information about the biological, biochemical and molecular function of the proteins.

The principal aim of the project is to develop a set of data mining tools that translate gene expression data from microarray analysis into functional profiles. This set of tools will be able to compare two or more sets of gene expression data from different experimental conditions, as well as from different organisms subjected to the same conditions.
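One simple reading of a "functional profile" is a count of functional categories over the genes selected from an experiment. The Python sketch below follows that assumption; the gene names, category names and function names are hypothetical, and the real tools may define profiles differently.

```python
from collections import Counter


def functional_profile(genes, gene_to_categories):
    """Count how often each functional category occurs among the given genes.

    Genes without any known category are skipped.
    """
    profile = Counter()
    for gene in genes:
        profile.update(gene_to_categories.get(gene, ()))
    return profile


def compare_profiles(profile_a, profile_b):
    """Return {category: count_a - count_b} for every category in either profile."""
    categories = set(profile_a) | set(profile_b)
    return {category: profile_a[category] - profile_b[category]
            for category in sorted(categories)}


# Made-up mapping and gene lists for two experimental conditions.
mapping = {"g1": ["cat_x"], "g2": ["cat_x", "cat_y"], "g3": ["cat_z"]}
condition_1 = functional_profile(["g1", "g2"], mapping)
condition_2 = functional_profile(["g2", "g3", "g_unknown"], mapping)
difference = compare_profiles(condition_1, condition_2)
```

For data from different organisms, the mapping from genes to categories would need to use a shared vocabulary so that the counts are comparable.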



Former Projects:


DHGP (German Human Genome Project) - Genome Computing Resource

The German Human Genome Project (DHGP), funded by the German Federal Ministry of Education and Research (BMBF) and the Deutsche Forschungsgemeinschaft (DFG), aims to systematically identify and characterize the structure, function and regulation of human genes, in particular those with medical relevance.

Genome Computing Resources
The HUSAR Bioinformatics Lab has established an efficient bioinformatics infrastructure (Genome Computing Resource) for the specific needs of the German Human Genome Project (DHGP) complementing the biocomputing services of the Resource Center.
  • Extensive support for scientific users: We offer a whole range of support to DHGP users, from 'start-up' help to bioinformatics counselling in more complex scientific projects, including the development of Unix scripts, advanced applications and custom-tailored analysis task flows.

  • Practical introductory courses: These courses combine theoretical and practical training sessions over the course of two days forming a thorough introduction to the HUSAR program package.

  • Advanced workshops: They bring the user up to date concerning the different fields and latest developments in bioinformatics. Important programs and their underlying algorithms are discussed and alternative programs are compared.

  • Development of an on-line tutorial for the WWW interface

  • Full service including analysis tools, databases, computers and network hardware

High throughput EST annotation
As a result of our collaboration with the DHGP group of PD Dr. T. Hankeln, GENenterprise Mainz, we developed ESTAnnotator, a tool for the high-throughput annotation of expressed sequence tags (ESTs). In this task the submitted DNA is first masked, so that vector parts and low-quality sequences within EST reads are excluded from the analysis. The analysis is then performed in successive steps. The first program section identifies already known transcripts present in human mRNA and genomic DNA reference databases. The second program section comprises tools for clustering 'anonymous' ESTs and for further database searches at the protein level. ESTAnnotator has allowed the semi-automatic analysis of 3500 EST sequences found by Hankeln's group, a step that had been the bottleneck in assigning functional annotations to these novel transcript sequences.
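The masking step can be pictured as replacing unwanted bases with N before any search is run. The Python sketch below is not ESTAnnotator code. It assumes that per-base quality values and the coordinates of vector matches are already available, and the function name, the default threshold of 20 and the example read are hypothetical.

```python
def mask_read(sequence, qualities, vector_spans=(), min_quality=20):
    """Return the read with vector regions and low-quality bases replaced by 'N'.

    sequence:     the EST read as a string
    qualities:    one quality value per base (same length as sequence)
    vector_spans: iterable of (start, end) half-open, 0-based vector matches
    min_quality:  bases below this value are masked (assumed threshold)
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    masked = list(sequence)
    for position, quality in enumerate(qualities):
        if quality < min_quality:
            masked[position] = "N"
    for start, end in vector_spans:
        for position in range(max(start, 0), min(end, len(masked))):
            masked[position] = "N"
    return "".join(masked)


# Made-up read: the first two bases are vector, positions 2 and 7 are low quality.
masked = mask_read("ACGTACGT", [30, 30, 10, 30, 30, 30, 30, 5], vector_spans=[(0, 2)])
```

Masking rather than trimming keeps all coordinates of the original read valid, which simplifies mapping later results back to the submitted sequence.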

The Helmholtz Network for Bioinformatics (HNB): A user-friendly web interface for performing complex bioinformatics tasks
The Helmholtz Network for Bioinformatics was funded as a venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. Based on a novel software framework that provides a general solution for transferring results from one bioinformatics tool to another, complex tool cascades ('tasks') have been implemented, allowing users to perform comprehensive data analyses without further manual interaction. Currently, in addition to hundreds of tools accessible via the Guided Solution Finder developed in the HNB, automated cascades are provided for the analysis of regulatory DNA segments and for the prediction of protein functional properties.