HUSAR Bioinformatics Lab
Deutsches Krebsforschungszentrum Genomics Proteomics Core Facility
HUSAR Developments
  1. HUSAR Tasks
  2. W2H - the WWW interface to HUSAR
  3. W3H - the task framework
  4. Webservices in Bioinformatics - Hobit @ DKFZ

1. HUSAR Tasks:
Competence sharing: The growing complexity of both biological data and bioinformatics tools requires knowledge of the underlying biological concepts and computing methods. Selecting the correct combination of applications and databases is often difficult and time-consuming. We have therefore developed a task system that integrates applications and methods to create tailor-made analyses.
At DKFZ the W3H task framework is currently used within the HUSAR environment (Heidelberg Unix Sequence Analysis Resources), where it allows bioinformatics tools within HUSAR to be combined into workflows.
W3H tasks result in XML data containing all relevant information obtained when combining the individual methods in the environment. This XML output can be used in subsequent analyses.

The HUSAR team is open to new collaborations with other groups in order to design new tasks.

Available tasks:

2DSweep Secondary Structure Prediction Tool
cDNA2Genome Tool for mapping cDNAs
DNASweep DNA Identification Tool
DomainSweep Protein Family Search Tool
ESTAnnotator EST Identification Tool
GeneConsensus Tool for combining gene prediction programs
GeneModel Tool for calculating complete gene structures
GOPET Tool for GO term prediction and validation
IntegrationMap Tool for mapping human integration sequences to the genome
IntegrationSeq Tool for isolating integration sequences
miRpredict A potential miRNA Identification Tool
miRTaCa miRNA Target Catcher - miRNA Target Prediction in UTR Regions
PATH Phylogenetic Analysis Task in HUSAR
PrimerSweep Primer Search & Analysis Tool
ProtSweep Protein Identification Tool
PromoterSweep Identification of Transcription Factor Binding Sites
SERpredict Detection of tissue- or tumor-specific isoforms created via the exonization of a retroelement

2. W2H - the WWW interface to HUSAR:
W2H is a free WWW interface to sequence analysis software tools such as the GCG package (Genetics Computer Group) and EMBOSS (European Molecular Biology Open Software Suite), or to services such as HUSAR (Heidelberg Unix Sequence Analysis Resources). It tries to cover as much functionality as possible while remaining as user-friendly as possible. It gives you access to more than a hundred programs from any computer platform with a JavaScript-enabled web browser. The interface is freely available and under constant maintenance. W2H development started in 1996 at the HUSAR Bioinformatics Lab at DKFZ (Senger et al.), and the interface has been maintained since 1997 in a collaborative project between DKFZ and EMBL-EBI (European Bioinformatics Institute, Hinxton, UK).
All information about W2H can be found on the W2H-Homepage.

3. W3H - the task framework:
W3H task framework
The task framework for W2H
The W3H task framework allows the execution of compound jobs by using meta-data to describe work and data flows in a heterogeneous bioinformatics environment. By means of these descriptions, the task system can schedule the necessary execution of applications available in the environment, depending on rules specified in the meta-data (Ernst et al.). Because the task framework is integrated into the web interface W2H, which is likewise based on meta-data, web access and data management are immediately available for each task description. Authors of task descriptions can base their work on the underlying classes and objects to describe dependency rules between previously independent applications. At DKFZ the W3H task framework is currently used within the HUSAR environment (Heidelberg Unix Sequence Analysis Resources), where it allows bioinformatics tools within HUSAR to be combined into workflows. W3H tasks result in XML data containing all relevant information obtained when combining the individual methods in the environment. The resulting XML data is translated via XSLT into web pages or plain text to report the result of the task to the user.
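The idea of scheduling applications from dependency rules can be illustrated with a minimal sketch. It is not the W3H implementation: the step names and the dictionary layout are hypothetical, and the real meta-data format is not shown on this page. The sketch only demonstrates how a run order can be derived once each step declares which other steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical task description: each step maps to the steps whose output it needs.
# Names and structure are illustrative only, not the real W3H meta-data format.
task_description = {
    "mask_repeats": [],
    "blast_search": ["mask_repeats"],
    "gene_prediction": ["mask_repeats"],
    "report": ["blast_search", "gene_prediction"],
}


def execution_order(description):
    """Return one valid order in which the steps can be executed."""
    sorter = TopologicalSorter(description)
    return list(sorter.static_order())


order = execution_order(task_description)
```

In this sketch a step is only scheduled after all steps it depends on, so `mask_repeats` always runs first and `report` last.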

Secondary Structure Prediction Tool
2DSweep is a task for performing secondary structure predictions on protein sequences. It reports predictions for alpha-helix, beta-strand, coiled-coil, and helix-turn-helix motifs. The task also predicts transmembrane regions, signal sequences, hydrophobicity, antigenicity, protease cleavage sites as well as possible protein localization, and provides peptide statistics (pI value, amino-acid composition, molecular weight). View XML schema documentation and example output.

A Tool for mapping cDNAs
cDNA2Genome is an application for the automatic high-throughput mapping and characterization of cDNAs (del Val et al.). It uses existing annotation data and, where possible, improves them with the most up-to-date databases, especially in the case of ESTs, proteins and mRNAs. cDNA2Genome focuses on determining the cDNA exon-intron structure, which is exhaustively assessed with a large number of gene prediction approaches. The input cDNA sequence is first masked for repetitive elements and then searched with BLAST against the human genomic database. From the BLAST output the best group of compatible HSPs (high-scoring segment pairs) is selected. To be selected, the HSPs must be consecutive both in the genomic sequence and in the cDNA, lie on the same genomic strand, and cover the input cDNA maximally. cDNA2Genome reports the chromosomal and contig location of the cDNA. It extracts the genomic sequence where the cDNA is located and predicts the exons and introns in this region for both strands. The gene prediction methods used are GenScan, HMMgene, GeneID, GeneWise, and Sim4. View data flow and example output.
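The HSP selection rule described above (consecutive in cDNA and genome, same strand, maximal cDNA coverage) can be sketched as a small chaining problem. This is an illustration under stated assumptions, not the cDNA2Genome code: the `HSP` record and its fields are hypothetical, genomic coordinates are assumed to be expressed in the orientation of the matched strand so that they increase along the cDNA, and coverage is simply the number of cDNA bases in the chain.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int  # start on the cDNA (query), inclusive
    q_end: int    # end on the cDNA, inclusive
    s_start: int  # start on the genome, in the orientation of the matched strand
    s_end: int    # end on the genome
    strand: str   # "+" or "-"


def best_chain(hsps):
    """Return the chain of compatible HSPs that covers the most cDNA bases."""
    hsps = sorted(hsps, key=lambda h: (h.q_start, h.q_end))
    if not hsps:
        return []
    # score[i]: cDNA bases covered by the best chain that ends with HSP i
    score = [h.q_end - h.q_start + 1 for h in hsps]
    prev = [None] * len(hsps)
    for i, cur in enumerate(hsps):
        base = cur.q_end - cur.q_start + 1
        for j in range(i):
            p = hsps[j]
            compatible = (
                p.strand == cur.strand
                and p.q_end < cur.q_start   # consecutive in the cDNA
                and p.s_end < cur.s_start   # consecutive in the genome
            )
            if compatible and score[j] + base > score[i]:
                score[i] = score[j] + base
                prev[i] = j
    # Walk back from the best-scoring chain end.
    i = max(range(len(hsps)), key=score.__getitem__)
    chain = []
    while i is not None:
        chain.append(hsps[i])
        i = prev[i]
    return chain[::-1]


example = [
    HSP(1, 100, 5000, 5099, "+"),
    HSP(101, 250, 7000, 7149, "+"),
    HSP(90, 260, 900, 1070, "-"),
    HSP(251, 400, 9000, 9149, "+"),
]
chain = best_chain(example)
```

Here the three plus-strand HSPs form the selected chain, because together they cover more of the cDNA than the single minus-strand hit.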

A DNA Identification Tool
DNASweep tries to identify a piece of eukaryotic DNA by homology search and locates possible genes or promoter elements in the sequence. By default the input DNA sequence is masked for repetitive elements, and a homology search is then performed against a database of non-EST sequences; users can choose different databases or run additional searches against EST or HTG sequences. If your sequence is human, the Human_Assembled (NCBI) database can be used. Because the search for genes and transcription factors is organism-specific, the organism to which your sequence belongs should be specified. The programs used are Genscan for gene and promoter prediction, Factor for the identification of transcription factor binding sites, and Fasta against the Eukaryotic Promoter Database (EPD) for the location of eukaryotic promoter elements. View example output.

A Protein Family Search Tool
DomainSweep identifies the domain architecture within a protein sequence and can thus help to find correct functional assignments for an uncharacterised protein sequence. It employs different database search methods to scan a number of protein/domain family databases, because different analytical approaches have been used to create the family signatures. Among these models, in increasing complexity, are: automatically generated protein family consensus sequences (Prodom), regular-expression patterns (Prosite), ungapped position-specific scoring matrices of sequence segments (Blocks) or sequence motifs (Prints), gapped position-specific scoring matrices (Prosite profiles), and Hidden Markov Models (Pfam, Smart, Tigrfams). Each database covers a slightly different, but overlapping, set of protein families/domains, and each model has its own diagnostic strengths and weaknesses. DomainSweep is an integrated search tool for the most important protein family databases. In the final result, domains are classified as "Significant" or "Putative" according to predefined rules such as database-specific cutoff values or e-value thresholds. Domain hits are linked to the corresponding protein family database entries and are grouped together if they belong to the same InterPro family. InterPro, as an integrated resource, provides extensive domain annotations including direct access to the GO (Gene Ontology) classification system. View data flow, XML schema documentation and example output.
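A minimal sketch of the classification and grouping step might look as follows. The cutoff values, the record fields and the InterPro identifiers are all hypothetical placeholders; the actual DomainSweep rules are not listed on this page, so this only shows the general shape of threshold-based labelling followed by grouping per InterPro family.

```python
from collections import defaultdict

# Hypothetical per-database e-value cutoffs (assumption, not the DomainSweep rules).
SIGNIFICANT_EVALUE = {"Pfam": 1e-5, "Smart": 1e-5}
DEFAULT_CUTOFF = 1e-3


def classify(hit):
    """Label a hit 'Significant' if its e-value passes the database cutoff."""
    cutoff = SIGNIFICANT_EVALUE.get(hit["database"], DEFAULT_CUTOFF)
    return "Significant" if hit["evalue"] <= cutoff else "Putative"


def group_by_interpro(hits):
    """Group classified hits by their (placeholder) InterPro family identifier."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["interpro"]].append((hit["database"], classify(hit)))
    return dict(groups)


hits = [
    {"database": "Pfam", "evalue": 1e-20, "interpro": "IPR_EXAMPLE_1"},
    {"database": "Smart", "evalue": 1e-4, "interpro": "IPR_EXAMPLE_1"},
    {"database": "Prodom", "evalue": 5e-2, "interpro": "IPR_EXAMPLE_2"},
]
grouped = group_by_interpro(hits)
```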

An EST Identification Tool
ESTAnnotator is a tool for the automatic analysis of EST sequences that supports the search for functional annotations of novel transcript sequences (Hotz-Wagenblatt et al.). In a first quality-check step, repeats, vector parts and low-quality sequences are masked. Then successive steps of BLAST searching against suitable databases and EST clustering are performed. Already known transcripts present within mRNA and genomic DNA reference databases are identified. Subsequently, tools for the clustering of anonymous ESTs and for further database searches at the protein level are executed. ESTAnnotator was successfully applied to the systematic identification and characterisation of novel human genes involved in cartilage/bone formation, growth, differentiation and homeostasis (Zabel et al.). View data flow and example output.

Tool for combining gene prediction programs
GeneConsensus combines the predictions of different gene-finding programs: GenScan, HMMgene and GeneID. From their outputs, it computes a consensus sequence using one of the following user-selectable algorithms: the "OR" method for high sensitivity, the "AND" method for high specificity, the "EUI" method, suitable for short sequences, and the "GI" method, which is optimized for long sequences. View example output.
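As a rough illustration of why "OR" favours sensitivity and "AND" favours specificity, the sketch below combines exon predictions base by base: "OR" keeps every base predicted by at least one program, "AND" keeps only bases predicted by all of them. The per-base interpretation and the coordinates are assumptions made for this example; the page does not specify how GeneConsensus defines these methods internally, and the "EUI" and "GI" methods are not covered.

```python
def to_positions(exons):
    """Expand a list of (start, end) exons (inclusive) into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def to_intervals(positions):
    """Collapse a set of positions back into sorted (start, end) intervals."""
    intervals = []
    for pos in sorted(positions):
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1][1] = pos
        else:
            intervals.append([pos, pos])
    return [tuple(interval) for interval in intervals]


def consensus(predictions, method):
    """Combine per-program exon lists with a per-base OR or AND."""
    sets = [to_positions(p) for p in predictions]
    if method == "OR":
        combined = set().union(*sets)
    elif method == "AND":
        combined = set.intersection(*sets) if sets else set()
    else:
        raise ValueError(f"unsupported method: {method}")
    return to_intervals(combined)


# Made-up predictions from three programs for the same sequence.
genscan = [(100, 200), (300, 400)]
hmmgene = [(120, 200), (300, 380)]
geneid = [(100, 190), (310, 400), (500, 550)]

or_result = consensus([genscan, hmmgene, geneid], "OR")
and_result = consensus([genscan, hmmgene, geneid], "AND")
```

In this example the exon at 500 to 550 is predicted by only one program, so it appears in the "OR" result but not in the "AND" result.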

A Tool for calculating complete gene structures
GeneModel calculates the full-length structure of a gene from an input cDNA or mRNA sequence. It integrates existing information from different resources such as NCBI, ENSEMBL, VEGA, and UCSC. To predict the gene structure it combines CpG island data, ESTs, and hand-annotated and computer-predicted genes from these resources, using algorithms from the W3H tasks Caftan (mapping and comparison of introns and exons) and GeneConsensus (detection of common groups of compatible exons). Because the annotation procedure and the reliability of the transcripts differ between the data sources, GeneModel applies to each transcript a quality score that depends on the origin of its annotation, in order to improve the prediction of the structure.
The web output of GeneModel is divided into six sections: (i) General Information, (ii) cDNA location, (iii) Complete gene structure, (iv) full-length cDNA exon table, (v) cDNA genomic context and (vi) Genomic table summary table. Sections (iii) and (iv) are graphical outputs, while the remaining sections are tables containing the information used to generate these graphics. Section (i) provides information about the parameters used to run GeneModel; section (ii) gives the Organism, Chromosome, Begin, End and Strand of the cDNA in the genome. Section (iv) lists all the exons that make up the full-length cDNA, indicating their begin and end in the genome and whether they are constitutive (present in all transcripts) or alternative. For overlapping genes, a filtering criterion is applied when selecting the exons that form the full-length cDNA: the exon with the best annotation source is selected. The last table, section (vi), is a summary that contains the following fields for every exon found for the subject gene: exon number (sorted by genomic begin and end), the name of the transcript where it was found, source and quality of the annotation, type of transcript, status of the transcript annotation, and whether it was found with sim4.
The user has immediate access to all complete application outputs and database entries via hyperlinks. At the bottom of the HTML output there is a link to the explanatory legend as well as to the XML output containing all the generated information. View data flow and example output.
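The distinction between constitutive and alternative exons used in the exon table can be sketched in a few lines. This is an illustration only: the transcript names and coordinates are made up, and exons are assumed to be identical only when their begin and end coordinates match exactly, which may be stricter than what GeneModel does.

```python
def classify_exons(transcripts):
    """Label each exon as constitutive (in every transcript) or alternative.

    transcripts: dict mapping a transcript name to a list of (begin, end) exons.
    """
    all_exons = sorted({exon for exons in transcripts.values() for exon in exons})
    labels = {}
    for exon in all_exons:
        in_all = all(exon in exons for exons in transcripts.values())
        labels[exon] = "constitutive" if in_all else "alternative"
    return labels


# Hypothetical transcripts of one gene.
transcripts = {
    "transcript_1": [(100, 200), (300, 400), (500, 600)],
    "transcript_2": [(100, 200), (500, 600)],
}
labels = classify_exons(transcripts)
```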

A Tool for GO term prediction and validation
GOPET is a completely automated tool for assigning molecular function or biological process terms to cDNA or protein sequences. It uses Gene Ontology for the annotation terms, GO-mapped protein databases for homology searches, and Support Vector Machines for the prediction and the assignment of confidence values. GOPET provides an organism-independent prediction, since the databases cover a broad variety of organisms and the selected attributes are independent of the organism. It was shown previously that the prediction quality was comparable to high-quality manual annotation and that a high number of sequences could be annotated compared to other systems. View example output.

Tool for mapping human integration sequences to the genome
IntegrationMap can be used to determine and profile integration sites of viruses or viral vectors at the chromosomal and genomic level. DNA sequences adjacent to the viral 'long terminal repeat' (LTR), as well as the actual viral insertion site, can be located exactly in the human genome. The output file shows information about hit or nearest genes and hit or adjacent repetitive elements such as SINEs, LINEs, CpGs and LTRs, together with their distances to the insertion site. Input sequences must start with the first base following 5' of the LTR. View example output.
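The reporting of hit or nearest features with their distance to the insertion site can be sketched as below. The feature names and coordinates are invented for the example, and the function is a hypothetical helper, not part of IntegrationMap. A distance of 0 means the insertion site lies inside the feature.

```python
def nearest_feature(site, features):
    """Return (name, distance) of the feature hit by or closest to the site.

    features: non-empty list of (name, start, end) with inclusive coordinates.
    """
    def distance(feature):
        _, start, end = feature
        if start <= site <= end:
            return 0  # the insertion site hits this feature
        return start - site if site < start else site - end

    best = min(features, key=distance)
    return best[0], distance(best)


# Hypothetical annotated genes on one chromosome.
genes = [("geneA", 1000, 2000), ("geneB", 5000, 6000)]

hit = nearest_feature(1500, genes)
nearest = nearest_feature(4000, genes)
```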

Tool for isolating integration sequences
IntegrationSeq can be used to prepare raw files from a genetic analyzer for mapping to the human genome. Beginning with a quality check, viral 'long terminal repeats' (LTRs), adaptor sequences and cloning vector backbone sequences are recognized and cut off. Internal vector sequences ('internal bands') are recognized as well. The input should be a multiple FASTA sequence file. More detailed information is available. View example output.
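The trimming step can be pictured with a deliberately simplified sketch. It uses exact string matching, whereas a real pipeline would have to tolerate sequencing errors, and the LTR and adaptor sequences below are invented placeholders rather than real vector sequences.

```python
def trim_read(read, ltr_end, adaptor):
    """Return the sequence between the LTR end and the adaptor.

    Returns None when the LTR end is not found in the read.
    """
    start = read.find(ltr_end)
    if start == -1:
        return None
    insert = read[start + len(ltr_end):]
    cut = insert.find(adaptor)
    if cut != -1:
        insert = insert[:cut]  # drop the adaptor and everything after it
    return insert


# Placeholder sequences for illustration only.
LTR_END = "CTAGCA"
ADAPTOR = "AGTCCC"
raw_read = "NNCTAGCA" + "GATTACAGATTACA" + "AGTCCCTT"

trimmed = trim_read(raw_read, LTR_END, ADAPTOR)
```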

A potential miRNA Identification Tool
miRpredict is a tool for the automatic identification of known and potential new miRNAs in DNA sequences. The sequences are split into overlapping pieces of miRNA-like size and compared to known miRNAs of miRBase and to organism-specific non-coding RNAs of EnsEMBL. The genomic precursor sequence is built after localization on the genome, and potential miRNAs are identified by recognizing a palindrome and classifying it as a miRNA-like palindrome using a triplet-SVM classifier (Xue, C. et al.; BMC Bioinformatics 6, 310, 2005). View data flow and example output.
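Splitting a sequence into overlapping pieces of miRNA-like size can be sketched with a sliding window. The window size of 22 and the step of 5 are assumptions chosen for illustration; the values used by miRpredict are not stated here. Pieces overlap as long as the step is smaller than the window size.

```python
def overlapping_pieces(sequence, size=22, step=5):
    """Split a sequence into overlapping windows of the given size.

    Returns a list of (start, piece) tuples with 0-based start positions.
    The last window is always aligned to the end of the sequence.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(sequence) <= size:
        return [(0, sequence)]
    last_start = len(sequence) - size
    pieces = [
        (start, sequence[start:start + size])
        for start in range(0, last_start + 1, step)
    ]
    if pieces[-1][0] != last_start:
        pieces.append((last_start, sequence[last_start:]))
    return pieces


dna = "ACGT" * 7 + "AC"  # 30 bases, made-up sequence
pieces = overlapping_pieces(dna)
```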

miRNA Target Catcher - miRNA Target Prediction in UTR Regions
miRTaCa can be used to find miRNA binding sites in the 3' UTR of cDNAs. UTR or cDNA sequences can be given as input. If a cDNA is given, the tool finds the 3' UTR and checks it for miRNA binding sites with the programs MIRANDA, TARGETSCAN and RNAHYBRID. It also looks for conserved regions in the UTR if the homologous gene of another organism can be found. The results can be combined by an AND, OR, or MAJORITY algorithm. The input should be a single sequence or a multiple FASTA sequence file. The result is a summary page listing the miRNA binding sites and the conserved sites of the 3' UTR that correspond to miRNA binding sites. View data flow and example output.
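The AND, OR and MAJORITY combination of program results can be sketched as a simple vote. The sketch assumes that two programs agree on a site only when they report exactly the same (miRNA, position) pair, which is a simplification, and the miRNA names and positions are placeholders.

```python
def combine_sites(predictions, mode):
    """Combine predicted binding sites from several programs.

    predictions: dict mapping a program name to a set of (mirna, position) sites.
    mode: "AND" (all programs), "OR" (any program) or "MAJORITY" (more than half).
    """
    programs = list(predictions.values())
    if mode == "OR":
        needed = 1
    elif mode == "AND":
        needed = len(programs)
    elif mode == "MAJORITY":
        needed = len(programs) // 2 + 1
    else:
        raise ValueError(f"unsupported mode: {mode}")
    all_sites = set().union(*programs)
    return {
        site for site in all_sites
        if sum(site in sites for sites in programs) >= needed
    }


# Placeholder predictions for one 3' UTR.
predictions = {
    "MIRANDA": {("miR-A", 120), ("miR-B", 300)},
    "TARGETSCAN": {("miR-A", 120), ("miR-C", 50)},
    "RNAHYBRID": {("miR-A", 120), ("miR-B", 300)},
}
and_sites = combine_sites(predictions, "AND")
majority_sites = combine_sites(predictions, "MAJORITY")
or_sites = combine_sites(predictions, "OR")
```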

Phylogenetic Analysis Task in HUSAR
PATH is a task for the inference of phylogenies (del Val et al.). It executes each of the three main phylogenetic methods: maximum likelihood (using TREE-PUZZLE), pairwise distance combined with Neighbor-Joining, and parsimony (using programs of the PHYLIP package). Following the recommendations of Jin and Nei (1990), it automatically chooses the evolutionary model for each data set in order to optimize the performance of neighbor-joining. The newly created phylogenetic trees are then compared for consistency of the subgroups. The output of the task shows the consensus trees together with the full results obtained from all executed methods as well as additional information generated in the process. To find inconsistencies in the input data, the splittability index of the split decomposition method is evaluated. View data flow and example output.

A Primer Searching And Analysing Tool
PrimerSweep finds primer pairs for PCR reactions that match your input sequence and a target region, or checks a given primer pair against its target and the sequence. The task performs a quality check for primer pairs by searching a user-defined database for all possible PCR products of the primers. The result of PrimerSweep is a list of primer pairs with melting temperature and GC content, and all possible PCR products created either by two primers or by a single one binding to the database sequences. View example output.
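GC content and a rough melting temperature for a primer can be computed as in the sketch below. The melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is a common approximation for short oligonucleotides. This page does not say which formula PrimerSweep uses, so the choice is an assumption, and the primer sequence is made up.

```python
def gc_content(primer):
    """Return the GC content of a primer in percent."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Estimate the melting temperature (deg C) with the Wallace rule.

    Assumption: this rule is used for illustration only and may differ
    from the method applied by PrimerSweep.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
primer_gc = gc_content(primer)
primer_tm = wallace_tm(primer)
```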

A Protein Identification Tool
ProtSweep can be used for the analysis and possible identification of newly obtained protein sequences. The result lists protein features such as molecular weight, and reports predicted secretory signals, the possible subcellular localization, and the results of homology searches against general DNA and protein sequence databases as well as against the protein family databases Prosite and Blocks. View XML schema documentation and example output.

Identification of Transcription Factor Binding Sites - Analysing Promoter Sequences
PromoterSweep is an automated bioinformatics pipeline to analyse promoter sequences and predict transcription factor binding sites. PromoterSweep uses a combination of different tools: sequence comparison to promoter databases, identification of transcription factor binding sites provided by the databases TRANSFAC and JASPAR, as well as collection of orthologous sequences and application of general motif discovery tools. The results are combined and classified for reliability. View data flow and example output.

A tool to predict tissue or tumor-Specific Exonised Repetitive element containing isoforms
SERpredict is an automated bioinformatics pipeline to predict tissue- or tumor-specific repetitive element (RE)-containing isoforms in human and mouse DNA. SERpredict extracts all available exons of the input sequence found in the EnsEMBL database and screens all of the exons for REs. For every RE-containing exon, the aim is to detect tissue- or tumor-specific isoforms caused by the exonization of the repetitive element. To this end, all EST and mRNA sequences are extracted and a statistical analysis is performed to classify potential tissue- or tumor-specific isoforms. View data flow and example output.