CDD FTP-archive
ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/README
rev. 22 Mar 2013
===============================================================================

This ftp-directory archives collections of position-specific scoring matrices
(PSSMs) that have been created for the CD-Search service
(http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). CD-Search can be used
to identify conserved domains in a query protein sequence and infer its
putative function (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml).
PSSMs are briefly described in the CDD help document:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CD_PSSM.

The PSSMs are meant to be used for compiling RPS-BLAST search databases.
The RPS-BLAST executable, as well as the makeprofiledb application needed to
convert files in this directory, are part of the BLAST executables
(ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/) and the NCBI software
development toolkit distribution (ftp://ncbi.nlm.nih.gov/toolbox).
The makeprofiledb application is described at
www.ncbi.nlm.nih.gov/books/NBK1763

Be sure to use recent BLAST executables in order to obtain the makeprofiledb
application that is compatible with the CDD FTP files. (The formatrpsdb
application packaged with earlier BLAST releases is not compatible and will
result in an error message, "unable to match element in intermediateData...
ERROR: no data found in file.")

The little_endian and big_endian subdirectories of this CDD FTP site contain
preformatted databases, eliminating the need to use the makeprofiledb
application. However, if you prefer to create customized search sets, you
will still need to run makeprofiledb.

Note that the E-values you get (for any given protein query--conserved domain
hit pair) on the CD-Search web service might differ from those you get when
using standalone RPS-BLAST on your local PC.
The last section of this document describes the differences between the web
service and the standalone program and provides a tip on how you can generate
the same results in standalone RPS-BLAST as those produced by the web service.

===============================================================================
SCOPE OF DATA in FTP FILES
===============================================================================

Data accessible via the CD-Search tool and in the Entrez Conserved Domain
Database (CDD) originate from a number of source databases, including
NCBI-curated domain models as well as models from external sources:

 1  cd ....... alignment models curated at NCBI as part of the CDD project
               (see:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains)

 2  Pfam ..... PSSMs from a mirror of the Pfam-A seed alignment database
               (see: http://pfam.sanger.ac.uk/)

 3  Smart .... PSSMs from a mirror of the Smart domain alignment database
               (see: http://smart.embl-heidelberg.de/)

 4  COG ...... PSSMs from automatically aligned sequences and sequence
               fragments classified in the COGs resource, which focuses
               primarily on prokaryotes
               (see: http://www.ncbi.nlm.nih.gov/COG/new/)

 5  PRK ...... PSSMs from automatically aligned sequences and sequence
               fragments classified as stable clusters in the Protein
               Clusters database (see:
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=proteinclusters)

 6  TIGRFAM .. PSSMs from a mirror of the TIGRFAM database of protein
               families (see:
               http://www.jcvi.org/cms/research/projects/tigrfams/overview/)

 7  KOG ...... PSSMs from automatically aligned sequences and sequence
               fragments classified in the KOGs resource, the eukaryotic
               counterpart to COGs
               (see: http://www.ncbi.nlm.nih.gov/COG/new/).
               These are available as a separate search set in CD-Search
               but are not indexed for text searching in Entrez CDD.

 8  LOAD ..... Library of Ancient Domains
               These 55 models are available only as a data file on the FTP
               site; they are not searchable via CD-Search and are not
               indexed for text searching in Entrez CDD. The domains in this
               set are represented by domain models in the other data
               collections above.

(Additional details about source databases are provided in the CDD Help Doc:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource)

The CD-Search databases, Entrez CDD, and the FTP files in this directory
encompass various data sets. The scope of data covered by each FTP file is
noted in the FILE LIST and SUMMARY, below, and can be one of the following:

  SCOPE A: ALL CD models accessible via the CD-Search tool (subsets 1-7,
           above) plus subset 8, which is accessible only in this FTP
           directory.

  SCOPE B: Data from the CD-Search tool's DEFAULT "cdd" database, which
           includes subsets 1-6, above. These subsets are also indexed and
           searchable in NCBI's Entrez CDD database.

  SCOPE C: NCBI-curated CD models (subset 1, above).

  SCOPE D: Data from specific, individual NCBI-curated CD models.

  SCOPE E: Conserved domain models that are members of superfamilies; these
           can include models from subset 1, and models from subsets 2-6
           that are not multidomains. Superfamilies and multidomains are
           described in the CDD Help document:
           http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types

===============================================================================
FILE LIST and SUMMARY
===============================================================================

The CDD FTP directory includes the following files and subdirectories.
The SCOPE levels noted in this summary table are described in the preceding
section, "SCOPE OF DATA in FTP FILES". Additional DETAILS for each file are
provided in the next section.
-------------------------------------------------------------------------
FILENAME           |scope| summary
-------------------------------------------------------------------------
cdd.tar.gz         |  A  | PSSMs originating from various alignment
                   |     | collections; can be used to build search
                   |     | databases for RPS-BLAST.
                   |     | (scope A: all CD models)
-------------------------------------------------------------------------
acd.tar.gz         | A+  | CD data as used by the CD-server for
                   |     | visualization of CD-search results
                   |     | (scope A, PLUS data for superfamily clusters)
-------------------------------------------------------------------------
cddid_all.tbl.gz   |  A  | summary information about all CD models in this
                   |     | distribution
                   |     | (scope A: all CD models)
-------------------------------------------------------------------------
fasta.tar.gz       |  A  | sequence alignments from the CDs in mFASTA format
                   |     | (scope A: all CD models)
-------------------------------------------------------------------------
cdd.versions       |  A  | list of all conserved domain model accessions,
                   |     | versions, and PSSM IDs present in the current and
                   |     | previous versions of the Conserved Domain Database
                   |     | (scope A: all CD models)
-------------------------------------------------------------------------
cdd.info           |  B  | CDD release version number and details
                   |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------
cddid.tbl.gz       |  B  | summary information about the CD models in this
                   |     | distribution that are part of the CD-Search tool's
                   |     | default "cdd" database and are indexed in
                   |     | NCBI's Entrez CDD database
                   |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------
cddmasters.fa.gz   |  B  | FASTA-formatted sequences that show representative
                   |     | sequences for each conserved domain model in the
                   |     | collection
                   |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------
cddannot.dat.gz    |  C  | information about conserved family features
                   |     | (such as binding and catalytic sites) as
                   |     | recorded for NCBI-curated CD models
                   |     | (scope C: NCBI-curated domain models)
-------------------------------------------------------------------------
cdtrack.txt        |  C  | information from NCBI's internal tracking system
                   |     | about hierarchies of related domain models in
                   |     | NCBI-curated domains (scope C)
-------------------------------------------------------------------------
bitscore_specific_X.XX.txt
                   |  C  | domain-specific score thresholds used by the
                   |     | CD-Search tool to determine whether hits to
                   |     | NCBI-curated domain models are specific or
                   |     | non-specific. The X.XX portion of the filename
                   |     | indicates the CDD release number. (scope C)
-------------------------------------------------------------------------
cd00882_notree.acd |  D  | versions of files distributed within acd.tar.gz
cd01659_notree.acd |  D  | that are meant for users of the old NCBI C-toolkit
cd02039_notree.acd |  D  | (scope D: specific, individual NCBI-curated models)
-------------------------------------------------------------------------
big_endian         | A-  | subdirectories containing pre-formatted search
little_endian      | A-  | databases for use with various architecture/OS
(subdirectories)   |     | combinations (little_endian for Intel CPUs and
                   |     | Linux or Windows, big_endian for SUN or SGI under
                   |     | Solaris or IRIX, for example).
                   |     | (almost scope A: ALL CD models accessible via
                   |     | the CD-Search tool (subsets 1-7, described in
                   |     | "SCOPE OF DATA in FTP FILES", above) BUT NOT
                   |     | subset 8 (LOAD))
-------------------------------------------------------------------------
family_superfamily_links
                   |  E  | list of NCBI-curated and imported domain models
                   |     | that are members of CDD superfamilies, along
                   |     | with the superfamily accession (cl*) to which
                   |     | each domain model belongs
                   |     | (scope E: superfamily members)
-------------------------------------------------------------------------

===============================================================================
DETAILS for each file
===============================================================================

Files below are listed in alphabetical order:

===============================================================================
acd.tar.gz
===============================================================================

"acd.tar.gz" is a gzipped archive that contains the CD data as used by the
CD-server for visualization of CD-search results. The data are stored as
ASN.1-formatted files. The types of information provided for each conserved
domain model are described in the CDD help document section on "CDD Record
(CD Summary page): What information is displayed for each domain model?"
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDVisual.

(SCOPE A+: this file includes data from all CD models, PLUS data for
superfamily clusters
(http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily).
Additional details about data coverage are provided in the earlier section,
"SCOPE OF DATA in FTP FILES")

Technical note: The "acd" acronym is used at NCBI to denote "ASN.1 Cd
Datafile". It is also used as a file extension for CD data files (e.g., the
"cd0????_notree.acd" files in this FTP directory). However, the conserved
domain file extensions appear as "*.cn3" when using the CDD database web
server's "Structure View" function.
That allows the conserved domain data files to be uniquely associated with
the Cn3D viewing program
(http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml), and differentiates
them from the *.acd file extension used by computer-aided design programs
such as AutoCAD.

===============================================================================
bitscore_specific_X.XX.txt
===============================================================================

"bitscore_specific_X.XX.txt" (e.g., "bitscore_specific_2.14.txt") contains
the domain-specific bit score thresholds used by the CD-Search tool to
determine whether hits to NCBI-curated domain models are specific or
non-specific (both hit types are described in:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types).
This file is saved for each CDD release (the X.XX portion of the filename
indicates the CDD release number), allowing retrieval of current and
previous bit score thresholds for a domain model.

The file contains three columns:

  1. conserved domain PSSM ID
     This is a unique identifier for a domain model's position-specific
     scoring matrix (PSSM). If a domain model's PSSM changes in any way as a
     result of updates to its multiple sequence alignment, it receives a new
     PSSM ID. This happens because a conserved domain model can evolve over
     time. For example, as new sequence data become available, curators
     might add sequences to a multiple sequence alignment or update the
     sequences already present. As a result of such changes to the domain
     model, the PSSM and its ID can change. Additional information about
     PSSMs is accessible from:
     http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess

  2. conserved domain accession number
     Domain-specific score thresholds are currently calculated only for
     NCBI-curated domains; therefore, all accessions in the file begin with
     the prefix "cd" (see:
     http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource_accession_prefix)

  3. domain-specific score threshold, shown as bit score
     This column shows the lowest bit score among self-hits of a domain's
     member protein sequences to the resulting domain model. This
     domain-specific score threshold can change for the same reasons the
     PSSM ID can change (explained in #1, above).

An illustrated example and additional details about specific hits and
domain-specific thresholds are provided in the CD-Search help document:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#SpecificHit

(SCOPE C: this file includes data from NCBI-curated domain models; see
section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cd0????_notree.acd
===============================================================================

"cd0????_notree.acd" are versions of files distributed within "acd.tar.gz"
that have been stored without the data representing the sequence tree of the
underlying set of sequence fragments. Trees in these particular examples are
deeply nested and cannot be read with the old NCBI C-toolkit object loaders.
These separate files allow users of the old NCBI C-toolkit to load the full
set of conserved domain models into their applications.
(SCOPE D: these files include data from specific, individual NCBI-curated
models; see section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cdd.info
===============================================================================

"cdd.info" contains the CDD release version number and details the content
of the release (number of models from each data source).

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cdd.tar.gz
===============================================================================

"cdd.tar.gz" is a gzipped archive file that contains Position-Specific
Scoring Matrices (PSSMs) originating from all of the alignment collections
encompassed by the Conserved Domain Database project.

(Scope A: this file includes data from ALL CD models; see section on
"SCOPE OF DATA in FTP FILES" for details)

To build search databases for RPS-BLAST, you need to unpack the archive and
extract its contents. It contains ASCII-formatted files only, with the
following extensions:

  *.smp ...... Position-Specific Scoring Matrices (PSSMs). These are stored
               in a new ASN.1 format ("scoremat"), which is shared between
               various BLAST applications.

  *.pn ....... lists of PSSM file names, which allow for the compilation of
               6 RPS-BLAST search databases:
                  Smart
                  Pfam
                  Cog
                  Kog
                  Prk
                  Cdd (domains from Smart, Pfam, COG, PRK, and cd; this is
                      the set that's indexed in NCBI's Entrez)

The databases must be formatted with the "makeprofiledb" application that is
distributed with the BLAST executables
(ftp://ftp.ncbi.nih.gov/blast/executables/). Be sure to use recent BLAST
executables in order to obtain the makeprofiledb application that is
compatible with the CDD FTP files.
(The formatrpsdb application packaged with earlier BLAST releases is not
compatible and will result in an error message, "unable to match element in
intermediateData... ERROR: no data found in file.")

The following sequence of commands will build the search databases:

  makeprofiledb -title SMART.v6.0 -in Smart.pn -out Smart -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true
  makeprofiledb -title Pfam.v.26.0 -in Pfam.pn -out Pfam -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true
  makeprofiledb -title COG.v.1.0 -in Cog.pn -out Cog -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true
  makeprofiledb -title KOG.v.1.0 -in Kog.pn -out Kog -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true
  makeprofiledb -title CDD.v.3.10 -in Cdd.pn -out Cdd -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true
  makeprofiledb -title PRK.v.6.00 -in Prk.pn -out Prk -threshold 9.82 \
      -scale 100.0 -dbtype rps -index true

Note that the parameter '-threshold' supplied with makeprofiledb, the
three-letter word score threshold for detecting and extending hits in
RPS-BLAST, will determine the size of the search database. A lower threshold
will result in larger databases and slightly increased search sensitivity,
at the cost of additional memory requirements and reduced search speed.
Matrices distributed for creating RPS-BLAST search databases are scaled by a
factor of 100 (parameter -scale). A score threshold value of 9.82 will
result in search databases of a size very similar to that obtained with
unscaled matrices and a threshold value of 11.

Note also that the RPS-BLAST search databases generated by makeprofiledb are
architecture dependent; it may not be possible to create them on one
platform and use them on another.

When searching with your local version of RPS-BLAST, use the command-line
argument "-db" ("-d" in the older C-toolkit version of rpsblast) to specify
the database name and location. You need an executable version of the
"rpsblast" program; type "rpsblast" without arguments to obtain a list of
command-line options.
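
The six makeprofiledb commands above differ only in the database title and
in the name of the *.pn list, so they can be generated from a small table.
The Python sketch below is one way to do that. The helper names
(build_makeprofiledb_commands, run_all) are hypothetical; the titles and
options are copied from the commands above; and the script assumes that
makeprofiledb is on the PATH and that the *.pn and *.smp files are in the
current working directory.

```python
import shlex
import subprocess

# (title, list file, output name) triples copied from the commands above.
SEARCH_SETS = [
    ("SMART.v6.0", "Smart.pn", "Smart"),
    ("Pfam.v.26.0", "Pfam.pn", "Pfam"),
    ("COG.v.1.0", "Cog.pn", "Cog"),
    ("KOG.v.1.0", "Kog.pn", "Kog"),
    ("CDD.v.3.10", "Cdd.pn", "Cdd"),
    ("PRK.v.6.00", "Prk.pn", "Prk"),
]


def build_makeprofiledb_commands(threshold="9.82", scale="100.0"):
    """Return one makeprofiledb argument list per search set (hypothetical helper)."""
    commands = []
    for title, pn_file, out_name in SEARCH_SETS:
        commands.append([
            "makeprofiledb",
            "-title", title,
            "-in", pn_file,
            "-out", out_name,
            "-threshold", threshold,
            "-scale", scale,
            "-dbtype", "rps",
            "-index", "true",
        ])
    return commands


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_makeprofiledb_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # Assumes makeprofiledb is installed and on the PATH.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Running the script as-is only prints the commands. Calling
run_all(dry_run=False) would execute them, and passing a different
threshold to build_makeprofiledb_commands is a simple way to explore the
size/sensitivity trade-off described above.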

You can now take any arbitrary subset of PSSMs and compile them into an
RPS-BLAST search database. All that makeprofiledb needs is a list of file
names (such as "Smart.pn" in the example above) and the corresponding
"scoremat" (*.smp) files. Newer versions of PSI-BLAST (blastpgp) can now
write out "checkpoints" in the "scoremat" format as well (blastpgp parameter
-u1). These again can be combined with arbitrary subsets of the
scoremat-formatted PSSMs distributed here to create customized RPS-BLAST
search sets. The scoremat-formatted PSSMs distributed here are scaled by a
factor of 100.0, and if one were to combine them with PSI-BLAST-generated
"scoremats", the same scaling factor must be set as a parameter with
makeprofiledb.

Note: If you prefer to use preformatted databases, see the big_endian and
little_endian subdirectories of the CDD FTP site. They contain databases
that have been preformatted for use with various architecture/OS
combinations (Intel, Sun, SGI / Linux, Windows, Solaris, IRIX).

===============================================================================
cdd.versions
===============================================================================

"cdd.versions" lists all conserved domain model accessions, versions, and
PSSM IDs present in the current and previous versions of the Conserved
Domain Database.

(Scope A: this file includes data from ALL CD models; see section on
"SCOPE OF DATA in FTP FILES" for details)

Example/Excerpt from file:

# Acc          ShortName         PssmId   Root Ver Lv Rl ER Time
# ------------ ----------------- -------- ---- --  -- -- -----------------
  ...
  pfam09006    Surfac_D-t        90442    N/A  4   1  1  0  01/09/08 09:49:00
  pfam09006    Surfac_D-t        87766    N/A  3   0  1  0  09/13/07 17:36:00
  pfam09006    Surfac_D-t        72424    N/A  2   0  1  0  05/07/07 17:24:00
  pfam09006    Surfac_D-t        72424    N/A  1   0  1  0  03/12/07 13:54:00
  ...
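
Rows like those in the excerpt can be read by splitting on whitespace. The
Python sketch below is one way to do that; VersionRow and parse_versions are
hypothetical names, not part of any NCBI tool. It assumes that lines starting
with "#" or "..." are not data, that the short name is the only field that
might contain spaces, and that the time stamp is month/day/two-digit-year
(consistent with "09/13/07" in the excerpt).

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class VersionRow(NamedTuple):
    acc: str
    short_name: str
    pssm_id: int
    root: str
    ver: int
    live: bool
    released: bool
    expendable: bool
    time: datetime


def parse_versions(lines: Iterable[str]) -> List[VersionRow]:
    """Parse cdd.versions-style rows; comment and '...' lines are skipped."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("..."):
            continue
        # The last eight whitespace-separated tokens are fixed fields
        # (the time stamp takes two); whatever is left is Acc + ShortName.
        head, pssm_id, root, ver, lv, rl, er, date, clock = line.rsplit(None, 8)
        parts = head.split(None, 1)
        rows.append(VersionRow(
            acc=parts[0],
            short_name=parts[1] if len(parts) > 1 else "",
            pssm_id=int(pssm_id),
            root=root,
            ver=int(ver),
            live=(lv == "1"),
            released=(rl == "1"),
            expendable=(er == "1"),
            # Assumed format: MM/DD/YY HH:MM:SS.
            time=datetime.strptime(f"{date} {clock}", "%m/%d/%y %H:%M:%S"),
        ))
    return rows


EXCERPT = """\
# Acc ShortName PssmId Root Ver Lv Rl ER Time
pfam09006 Surfac_D-t 90442 N/A 4 1 1 0 01/09/08 09:49:00
pfam09006 Surfac_D-t 87766 N/A 3 0 1 0 09/13/07 17:36:00
pfam09006 Surfac_D-t 72424 N/A 2 0 1 0 05/07/07 17:24:00
pfam09006 Surfac_D-t 72424 N/A 1 0 1 0 03/12/07 13:54:00
"""

rows = parse_versions(EXCERPT.splitlines())
live = [r for r in rows if r.live]
print(live[0].acc, live[0].pssm_id, live[0].ver)  # pfam09006 90442 4
```

Filtering on the live flag, as in the last lines, picks out the current
PSSM ID for an accession; the remaining rows give its history.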

Column descriptions:

  Acc       = conserved domain model accession number (e.g., pfam09006)

  ShortName = first 10 characters of the domain model's short name, in this
              case, Surfac_D-t, for Surfac_D-trimer.

  PssmId    = unique identifier for the position-specific scoring matrix
              (e.g., as the pfam09006 domain model has evolved, it has had
              three PSSMs, with IDs 72424, 87766, and 90442, respectively).
              If there are any changes in the protein sequence alignment of
              a domain model (for example, the addition/deletion of member
              protein sequences or changes in the span of aligned residues),
              or if there are changes in the interpretation of the
              alignment, a new PSSM will be calculated. In that case, it
              will receive a new PSSM ID, although the accession number of
              the conserved domain model will remain the same. If only the
              domain model description or other annotations have changed,
              but the PSSM did not change, the version of the model will be
              incremented but the PSSM ID will remain the same, as it did
              for versions 1 and 2 of pfam09006, both of which had the PSSM
              ID 72424.

  Root      = if the domain model is NCBI-curated, the "Root" column will
              show the accession number of the parent node of the curated
              domain hierarchy. If the domain hierarchy contains only a
              single node, the value in the "Root" column will be the same
              as that in the "Acc" column. The values will also be the same
              if the accession listed in the first column is the parent node
              of a multi-level hierarchy.

  Ver       = version number of that particular domain model

  Lv        = indicates the current live version of the record:
              1 = live status; 0 = dead, earlier version.

  Rl        = indicates whether the domain model version has been released
              into the public database. This is a flag NCBI uses for
              internal data tracking. For most domain models, the value will
              be 1 = released, which means at some point the model was live
              in the database. Occasionally a value of "0" might appear,
              primarily for NCBI-curated models.
              This indicates a newer version of a model is in preparation at
              NCBI and will be released in the future.

  ER        = Expendable or redundant models; value in this column can be:
              0 = non-expendable or not redundant
              1 = expendable or redundant; indicates a model that has been
                  removed from the default "cdd" search set because the
                  information in it is represented in another domain model.

  Time      = date and time on which the model was last updated in the
              internal conserved domain tracking database.

===============================================================================
cddannot.dat.gz
===============================================================================

"cddannot.dat.gz" contains information about conserved family features (such
as binding and catalytic sites) as recorded for curated CD models. This is a
tab-delimited text file, with a single row per "feature" and the following
columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd')
  CD "short name"
  Feature number
  Feature description/name
  Boolean flag (0/1), indicating presence of structure-based feature evidence
  Boolean flag (0/1), indicating presence of reference-based feature evidence
  Boolean flag (0/1), indicating presence of additional comments
  comma-separated feature addresses
  site type (numerical)

The feature addresses are positions on the alignment's "master sequence",
which is a consensus sequence, and on the alignment's PSSM (the database
search model). Note that feature addresses are stored in a coordinate system
that counts the first residue in the consensus sequence as "0".

The site types are assigned as follows:

  0 ... unassigned or type "other"
  1 ... active site
  2 ... polypeptide binding site
  3 ... nucleic acid binding site
  4 ... ion binding site
  5 ... chemical binding site
  6 ... posttranslational modification site

(SCOPE C: this file includes data from NCBI-curated domain models; see
section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cddid.tbl.gz
===============================================================================

"cddid.tbl.gz" contains summary information about the CD models in this
distribution that are part of the default "cdd" search database and are
indexed in NCBI's Entrez database. This is a tab-delimited text file, with a
single row per CD model and the following columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK' or 'CHL')
  CD "short name"
  CD description
  PSSM-Length (number of columns, the size of the search model)

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cddid_all.tbl.gz
===============================================================================

"cddid_all.tbl.gz" contains summary information about all CD models in this
distribution. This is a tab-delimited text file, with a single row per CD
model and the following columns:

  PSSM-Id (unique numerical identifier)
  CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK', 'CHL',
                'KOG', or 'LOAD')
  CD "short name"
  CD description
  PSSM-Length (number of columns, the size of the search model)

(Scope A: this file includes data from ALL CD models; see section on
"SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cddmasters.fa.gz
===============================================================================

"cddmasters.fa.gz" is a gzipped file containing FASTA-formatted
representative sequences for each conserved domain model in the collection.
The representative sequences are consensus sequences with an approximate
median length relative to all the sequence footprints used in the alignment.
They are constructed for calculating a position-specific score matrix
(PSSM); each residue in the representative sequence corresponds to a column
in the PSSM. When RPS-BLAST formats output, it will display pair-wise
alignments between the query and the PSSMs' representative sequences.

(SCOPE B: this file includes data from the default "cdd" database; see
section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cdtrack.txt
===============================================================================

"cdtrack.txt" lists information from NCBI's internal tracking system for
conserved domain models curated at NCBI. The intent of this file is to
provide information about hierarchies of related domain models. All models
that map to the same root accession have been linked together in a
hierarchical set, in which the alignment models are consistent with each
other. Columns in this table are:

  Acc .......... CD accession
  ShortName .... CD short name
  PssmId ....... CD PSSM-ID, a unique numerical identifier for each CD
  Root ......... Accession of the CD hierarchy root model.
  Ver .......... CD version number
  Lv ........... is model live in the tracking system?
  Rl ........... has model been released to the public?
  ER ........... has model been flagged as "expendable or redundant"?
  Time ......... time stamp in the tracking system (last modification)

(SCOPE C: NCBI-curated domain models; see section on "SCOPE OF DATA in FTP
FILES" for details)

===============================================================================
family_superfamily_links
===============================================================================

"family_superfamily_links" lists the conserved domain models that are
members of superfamilies, along with the superfamily cluster (cl*) accession
to which each domain model belongs. Superfamily members can include
NCBI-curated domain models as well as imported models that are not
multi-domains. More information about superfamilies and multidomains is
available at:
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily and
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types.

The file contains four columns:

  1. conserved domain accession number
     For examples, see:
     http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource_accession_prefix.

  2. conserved domain PSSM ID
     This is a unique identifier for a domain model's position-specific
     scoring matrix (PSSM). If a domain model's PSSM changes in any way as a
     result of updates to its multiple sequence alignment, it receives a new
     PSSM ID. This happens because a conserved domain model can evolve over
     time. For example, as new sequence data become available, the curators
     of a source database might add sequences to a multiple sequence
     alignment or update the sequences already present. As a result of such
     changes to the domain model, the PSSM and its ID can change. Additional
     information about PSSMs is accessible from:
     http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess

  3. superfamily cluster accession number
     If a conserved domain model belongs to a superfamily with two or more
     members, this column contains the accession of the corresponding
     superfamily (an alphanumeric string starting with a "cl" prefix, for
     "cluster", and followed by a series of digits, e.g., cl02915).

     If a conserved domain model is a "singleton" (the sole member of a
     superfamily), this column simply repeats the conserved domain model's
     accession number that is shown in column 1. (Note: The majority of
     superfamilies are singletons, containing a single model from either
     Pfam, TIGRFAM, COGs, etc. While the CDD data processing pipeline does
     generate corresponding superfamily cluster models, they are not indexed
     in the Entrez search system in order to reduce redundancy in the
     presentation of search results.)

     Superfamily clusters are produced via an automated procedure each time
     there is a new CDD release. Information about clustering methodology is
     provided at:
     http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily.
     The composition of a cluster can change over time due to a variety of
     factors, such as (a) availability of new domain models, (b) changes to
     previously existing models, (c) new and/or updated sequence records in
     the Entrez Protein database, and (d) refinements to the automated
     clustering procedures.

     A superfamily cluster accession number will remain the same if at least
     50 percent of its member models (conserved domain accessions) have not
     changed relative to the previous version of the cluster. If more than
     50 percent of the conserved domain accessions from a previous version
     of a cluster are no longer present in the new build of that cluster, or
     if the cluster size more than doubles with a new build, then the
     superfamily cluster accession is retired and replaced by a new
     accession(s). If two previous clusters merge into a single new cluster,
     the superfamily cluster accession number of the larger component
     cluster is used for the new grouping.

  4. superfamily cluster PSSM ID
     A superfamily's PSSM ID refers to the specific set of conserved domain
     PSSM IDs that comprise the superfamily, rather than to an actual
     position-specific scoring matrix for the overall superfamily. The
     superfamily cluster PSSM ID will change if there is any change to the
     set of member PSSM IDs relative to the previous version of the cluster
     (e.g., if a member conserved domain gets a new PSSM ID due to changes
     in its multiple sequence alignment, or if a new conserved domain model
     is added to the superfamily as the result of a CDD database update).

The family_superfamily_links file for each CDD release will be saved on the
FTP site and can be used to track changes in superfamily clusters over time.

(Scope E: this file includes data from NCBI-curated and imported domain
models that are members of superfamilies; see section on "SCOPE OF DATA in
FTP FILES" for details)

===============================================================================
fasta.tar.gz
===============================================================================

"fasta.tar.gz" contains sequence alignments from the CDs in mFASTA format.
Note that sequence fragments are identified with GIs and/or accessions, but
the alignments do not necessarily contain full-length sequences: the
fragments span the region between the first and last aligned residue only.

(Scope A: this file includes data from ALL CD models; see section on
"SCOPE OF DATA in FTP FILES" for details)

===============================================================================
little_endian and big_endian sub-directories
===============================================================================

"little_endian" and "big_endian" sub-directories store pre-formatted search
databases for these architectures.
Use with the following architecture/OS combinations:

    Intel/Linux, Intel/Windows, Intel/Solaris:   little_endian
    Sun/Solaris, SGI/IRIX:                       big_endian

The subdirectories contain gzipped archives for each of the 5 different search
sets listed above. Simply download the set you need, unpack the archive, and
use the search set with rpsblast on your platform.

(Scope A-: this file includes data from ALL CD models, but NOT models from the
LOAD data set; see section on "SCOPE OF DATA in FTP FILES" for details)

Note that starting with CDD version v2.09 the pre-calculated RPS-BLAST
databases will be presented in a new format that may require a recent
RPS-BLAST binary.

If you prefer to format the search databases on your own rather than use
preformatted databases, see the "cdd.tar.gz" file description.

===============================================================================
What accounts for the differences in search results generated by the
CD-Search web service and standalone RPS-BLAST?
===============================================================================

There are several differences between the CD-Search web service and standalone
RPS-BLAST, as distributed by NCBI and used with search databases as
distributed by the CDD group.

The web server is optimized for the most common use of the CDD resource, which
is to annotate protein sequences with clearly identified and well understood
protein domains, and is also optimized for speed in order to accommodate a
high volume of searches.

As part of the optimization, we use some different statistical parameters for
the web service than for the standalone RPS-BLAST application. Specifically,
we use a constant, assumed search "database size" setting on the web server
for calculating E-values. This means that the actual size of the search
database can change (we are adding new models every few weeks), but the
E-value computed for any individual GI -- PSSM match will remain constant.
This approach:

  (a) ensures that pre-calculated results are not dependent on the actual
      size of the model collection (which is redundant and mostly grows by
      increasing that redundancy);
  (b) facilitates incremental updates of pre-computed sequence annotation
      with conserved domains; and
  (c) is used for the creation of protein-CDD links.

In contrast, standalone RPS-BLAST does not employ the constant, assumed
database size parameter. So when you use a search set downloaded from the CDD
FTP site, the database size might be different than the one used by the
CD-Search web service, and the same hit of your query protein to a model will
receive a different E-value in the standalone result.

For example, if the size of the FTP'ed database is smaller than what the
CD-Search web service assumes in its database size parameter, the same hit of
your query protein to a conserved domain model will receive a lower E-value
in the standalone result. Conversely, if the size of the FTP'ed database is
larger than what the CD-Search web service assumes in its database size
parameter, the same hit will receive a higher E-value in the standalone
result.

If you want standalone RPS-BLAST to use the same database size parameter that
is used for the web server (and thereby reproduce the same E-values with
standalone RPS-BLAST that are generated by the web service), you can do that
by creating an "alias" file on your local computer and placing it in the same
directory as the standalone RPS-BLAST executable.
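
As a rough illustration of why the database size matters: under standard
BLAST statistics, the E-value of a hit with a given score is approximately
proportional to the size of the search space, and therefore to the database
length. The Python sketch below rescales an E-value on that assumption only.
It ignores effective-length corrections and any other differences between the
web service and the standalone program, and the function name and the numbers
used are hypothetical.

```python
def rescale_evalue(evalue, actual_db_length, assumed_db_length):
    """Approximate the E-value a hit would get under a different database size.

    Assumes the E-value scales linearly with total database length, which is
    a simplification of how BLAST computes its statistics.
    """
    if actual_db_length <= 0 or assumed_db_length <= 0:
        raise ValueError("database lengths must be positive")
    return evalue * (assumed_db_length / actual_db_length)


# A local database half the assumed size gives a lower standalone E-value;
# rescaling to the assumed size doubles it.
standalone_evalue = 1e-3
print(rescale_evalue(standalone_evalue, 2500000, 5000000))
```
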
The alias file can have a name such as "mycdd.pal" and can have contents such
as the following (where lines starting with "#" are comments):

    #
    # RPSBLAST alias file
    #
    TITLE mycdd
    #
    DBLIST ./Cdd
    #
    STATS_TOTLEN 5000000
    STATS_NSEQ 21000

This will now let you search against the database named "Cdd" using the two
search set size parameters as specified, e.g.:

    ~$ rpsblast -i rpstest.tfa -d mycdd -F T -e 0.01 -m 9
    # RPSBLAST 2.2.26 [Sep-21-2011]
    # Query: gi|156356500|ref|XP_001623960.1| predicted protein [Nematostella vectensis]
    # Database: mycdd
    # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|197660  31.91  47  29  2  432  475   4   50  7e-04  36.9
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|197660  31.48  54  31  3  493  545   6   54  8e-04  36.5
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|197660  33.33  42  27  1  312  352   2   43  0.003  35.3
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|119391  23.53  51  34  2  493  542   1   47  8e-04  36.4
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|119391  21.57  51  35  2  375  424   1   47  0.003  34.5
    gi|156356500|ref|XP_001623960.1|  gnl|CDD|177721  24.47  94  56  3  463  541  18  111  0.005  38.6

In addition to the different statistical parameters, the CD-Search web service
filters out, by default, compositionally biased regions in the query sequence.
In contrast, the standalone RPS-BLAST filters them out only if you specify
that option in the command line. (For example, in the current RPS-BLAST
version 2.2.23, you can do this by specifying "-F T", where "F" represents the
"Filter" option and "T" indicates a status of "True.") If that option is not
specified, the standalone RPS-BLAST may retrieve additional hits that could
also be false positives.

Finally, some advanced options in standalone RPS-BLAST are not available in
the web service, such as the ability to use a single-hit/two-pass mode in
order to detect more distant homologous relationships.
Users who select such options in the standalone version may get search results
that differ from those of the web service.

===============================================================================
Aron Marchler-Bauer, Renata Geer, 22 March 2013