GARSA documentation

GARSA workflow

/garsa/img/GARSA-pipeline.jpg

Database (MySQL) schema

/garsa/img/GARSA-tables.jpg

Architecture

The current version of GARSA was implemented based on scripts, due to a tight schedule imposed by the need to obtain fast results on ongoing experiments. Next version is currently under development, and contemplates the use of Web services and parallelization techniques among others. Such development involves three master thesis and should be released within one year.

GARSAs architecture is supported by Perl scripts, resulting on some coding effort to add new tools to its pipeline. However, its implementation design is modular, well documented and based on development standards like page templates. Its flexibility has been proven by recent extensions, which have been developed by biologists with basic Perl knowledge, in short time.

Requirements

Software

* OS: Linux (Fedora Core tested)
* Servers: Apache 2.0 and MySQL 4.1 or higher
* Bioperl ("core" and "run") 1.4 or higher
Phylip 3.61-5 or higher
ClustalW 1.83-2 or higher
RBS finder
* CAP3
Glimmer 2.10 (for Yacop)
Glimmer 2.13
Interpro 3.3
NCBI Blast
Critica 1.05
Orpheus 2
Weblogo
Zcurve 1.02
Yacop
WU-Blast
Emboss 2.9.0.6 or higher
* Phred/Phrap

Perl modules: perl-DBI, IO-String, perl-DBD-MySQL, Mail-Mailer, GD-Graph, Spreadsheet::WriteExcel (get them via CPAN or RPMPan)

HTTP configuration:

The following options must be added to the the httpd.conf (Apache server) file:

<Directory "/var/www/html">
    Options Indexes FollowSymLinks ExecCGI Multiviews
    AddHandler cgi-script .cgi
    AllowOverride AuthConfig
    Order allow,deny
    Allow from all
</Directory>

DirectoryIndex index.cgi

Blast databases (Pre-formatted or fasta format):

NCBI
Uniprot
CDD
GeneOntology

Hardware

* 1 GB RAM or higher
* 20 GB Hard Disk available (80GB or higher is reccomended)
* 2.0 Ghz processor or higher

* Minimal requirements: GARSA is going to work with those (*) minimal packages, but in a limited way as similarity analyses won't be executed without NCBI-Blast and InterPro. Gene prediction won't be executed without Critica package, and Phylogeny without ClustalW and Phylip.

Download: Visit /garsa/ or contact Dr. Alberto Dávila (davila AT ioc.fiocruz.br)

Licensing: GPL

For users interested to use our servers for their own projects, we offer the option to host those projects and provide advice and consultancy. The costs for it will be evaluated case-by-case. Those costs are for hardware maintenance and upgrade as well as consultancy, if needed. Users interested in this option should contact Dr. Alberto Dávila "davila AT ioc.fiocruz.br". At the moment, we have 3 Intel Xeon Dual Processor servers and over 500GB of disk space available for this.

Starting a new Project

The only way to create a new project in GARSA is having super-user privileges, either as "admin_garsa" or "subadmin_garsa" user. The "admin_garsa" user can grant "subadmin_garsa" privileges to several users, then they can create several new projects without the need to be the "admin_garsa" user.

Any of the above mentioned users should use the "Create Project" option of the "Project Administration" menu to create a new project. GARSA ask for the following input data:

Project Name: scientific name of the species to be studied, eg: Trypanosoma cruzi or Drosophila melanogaster or Plasmodium falciparum
Project Code: code for the new projet, has to be 2 letters, eg: TC or DM or PF.
Minimum Read Quality: Phred minimum quality to be used in the chromatigrams, eg: 20
Minimum Lenght Size: Minimum Length (base pairs) of good quality sequence that GARSA will accept, eg: 100
Project administrator name: name of the administrator of the new project, eg: Joe Smith
Administrator email: eg: [email protected]
Administrator password: minimum 6 characters, a combination of letters and digits.

Project administrator login is created by GARSA based on "Project Code", eg: admin_TC or admin_DM or admin_PM

GARSA does not allow "admin_garsa" and "subadmin_garsa" users to administer projects, the only function of these users is to create projects.

Once a new project has been created, GARSA will send all the details of the new project to the administrator email.

Project Configuration

New Library: GARSA asks for the folowing input data:

Library Name: eg: Fat tissue or kDNA or Salivary gland
Library Code: eg: 001 or ABC
Library Description: EST Library 001 or GSS Library ABC or ORESTES Library ZY9
Vector: Choose a vector from the database or include a new one using "New Vector Sequence"

Primers (Optional): Add any pair of primer sequences (forward and reverse) that should be removed from your sequences.

Set Contaminant (Optional): Choose ribosomal or mithochondrial sequences that should be removed from your sequences. The model organism more phylogentically related to the organism to be studied by GARSA should be selected.

New Blast DB: project administrator can load zipped multifasta files (nucleotide or aminoacid) and format them (with formatdb) for NCBI Blast.

Load Sequences

Download from GenBank: Gene in Genomic, EST and GSS data can be downloaded from GenBank using scientific names, eg: Plasmodium falciparum. GARSA shows the number of available entries, then project administrator can decide to download the entries or not . Two scenarios ara antecipated for the use of "Download from GenBank": a) chromatograms are not available, then users aim to analyze data from GenBank, b) chromatograms are available, then user aim to complement their data with Genbank data.

Rename and Submit Plate: chromatograms from 1 sequencing run (equivalent to a plate of 96 sloths or less) should be copied to a single folder keeping their original names, zipped resulting in a file as "chromats1.zip" or "reads.zip" or "files9.zip". This zipped file is the input for GARSA. Library Code and Plate Code should be choosen from the available options, then GARSA can rename and upload properly the chromatograms in the zipped file. Minimum Read Quality and Minimum Size Length can be optionally modified here.

Submit Plate: chromatograms from 1 sequencing run (equivalent to a plate of 96 sloths or less) should be copied to a single folder and renamed to meet GARSA nomenclature. In a project with DM as Project Code, JS as Lab Code, 111 as Library Code and 001 as Plate Code should contain chromatogram files with the following names:

DMJS111001A07.g
DMJS111001C11.g
DMJS111001E08.g
Resulting Zipped File should be named: DMJS111001.zip

DMJSABC100A07.b
DMJSABC100A07.b
DMJSABC100A07.b
Resulting Zipped File should be named: DMJSABC100.zip

Only chromatograms from the same sequencing run or plate should be zipped together, resulting in a file as "DMJS111001.zip" or "DMJSABC100.zip". This zipped file is the input for GARSA. Minimum Read Quality and Minimum Size Length can be optionally modified here.

Sequence Assembly

Build clustering: Once sequences have been loaded (either in the form of chromatograms or download from GenBank) into a given project, they can be clustered using CAP3. The main CAP3 paramenters can be modified several times looking for the best results.

Each time a clusterization is done, GARSA produce 1 clusterization for each library plus 1 clusterization of all the libraries together.

Only after clusterization has been done, GARSA allows project administrators to run Gene Prediction, Clusters Analysis and Sequence Annotation.

Garsa shows a warning message when users try to analyze non-clustered sequences:

Gene Prediction

GARSA can use Glimmer or the YACOP metatool (RBS, Critica, Zcurve) for gene prediction.

Glimmer needs (complete) CDS (multifasta format) of the organism under study or from a closely related species to be trained.

YACOP: Critica needs a set of nucleotide sequences from the organism under study or a closely related species. The nucleotide sequences needed by Critica must be formatted to be used with WU-Blast.

Sequence Annotation

Run Blast

GARSA can uses as many Blast databases as available HardDisk space. The New Blast DB option is used to upload and format databases. TblastX, BlastX and BlastN flavours are active by default. However, only 2 Blast runs are allowed to happen at the same time, in order to avoid CPU overload. E-value is configurable at this stage.

A figure showing best Blast results according to each frame is showed aiming to help with the identification of the right frame of CDS:

Run InterPro

The current version of GARSA works with InterPro 3.2, but the new version of GARSA (under development) will work with InterPro 4.0.

Run RPSBlast

The Conserved Domain Databases from CDD, Smart, Kog, Cog and Kegg are available.

Notes

Users can enter comment or notes of each CLUSTER with this option. Notes entered by user "a" cannot be deleted or modified by user "b", then several users can work/comment the same cluster sharing and complementing analysis.

Validate CDS

When a cluster is being viewed or examined, there is always a link to "Validate CDS":

To Validate a CDS, users need to enter the begining and end coordinates, then Garsa translate that sequence range using the TranSeq program of the Emboss package. Validated CDS always appear listed at the bottom of the page:

Project queries

Generic database queries

A little console is presented, then users can query the MySQL database using MySQL command. For security reasons, only the SELECT command is allowed in this version.

Search Reads/Clusters

A search tool to facilitate the finding of specific reads or clusters.

Hit queries

A number of options to query the different analysis results from GARSA. Clusters with a specific number of hits can be easily found. Clusters with no hits can be easily found with this feature.

Blast vs Project Sequences

Garsa uses "formatdb" from the Blast package to format "Reads" and "Clusters" to be used for WWW-Blast analysis, then any sequence can be query against "Reads" and "Clusters" of a given project in Garsa.

Phylogeny

Users need first to clusterize sequences using "Build Clustering" in the Sequence Assembly" menu. Most options from the menu are only available once sequences has been clusterized and Blast done, those results are used to help with gene finding, alignment and phylogeny. For Blast, "Run Blast" from the "Sequence Annotation" menu should be used. For Logo, users should first have Blast results (after clustering), then view results frrom a given cluster either via "View Clusters by Library" (Project View menu) or "Search reads / cluster" (Project Queries).

Once users are viewing the Blast results from a given cluster, they can select one of the Blast DBs used (eg: kinetoplastida- nt) together with their respective results:

select from the bottom option what type of sequences you want to analyze (eg: Nucleotide Sequences) then click "Run Clustal and Phylip".

After that, user will be asked for the model of substitution that Phylip should use (eg Kimura 2- parameter). Once user have selected the model, him/her will have a screen like this:

Acknowledgements

To Dr. José Marcos Ribeiro (NIAID/NIH) for suggestions and sharing his experience on EST analysis. To Dr. João Setubal (VBI and LBI/IC/UNICAMP) for allowing us to modify the algorithm for processing EST chromatograms. To MCT/CNPq, IAEA, CIRAD and FAPESP for financial support. To the Open Source Community for all the valuabe help. To the authors of the softwares/modules used in/by GARSA for granting the academic and GPL licenses.