CDD FTP-archive   ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/README  rev. 22 Mar 2013
===============================================================================

This ftp-directory archives collections of position-specific 
scoring matrices (PSSMs) that have been created for the CD-Search 
service (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi).  

CD-Search can be used to identify conserved domains in a 
query protein sequence and infer its putative function (see:  
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). 

PSSMs are briefly described in the CDD help document: 
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CD_PSSM.  

The PSSMs are meant to be used for compiling RPS-BLAST search
databases. The RPS-BLAST executable, as well as the makeprofiledb 
application needed to convert files in this directory, are part of the 
BLAST executables (ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/) 
and the NCBI software development toolkit distribution  
(ftp://ncbi.nlm.nih.gov/toolbox).  The makeprofiledb application is
described at http://www.ncbi.nlm.nih.gov/books/NBK1763.

Be sure to use recent BLAST executables in order to obtain the 
makeprofiledb application that is compatible
with the CDD FTP files. (The formatrpsdb application packaged 
with earlier BLAST releases is not compatible and will result in 
an error message, "unable to match element in intermediateData... 
ERROR: no data found in file.") 

The little_endian and big_endian subdirectories of this 
CDD FTP site contain preformatted databases, eliminating the 
need to use the makeprofiledb application. However, if you prefer 
to create customized search sets, you will still need to run makeprofiledb. 

Note that the E-values you get (for any given 
protein query--conserved domain hit pair) on the CD-Search web service 
might differ from those you get when using standalone RPS-BLAST 
on your local PC. The last section of this document describes the 
differences between the web service and standalone program and 
provides a tip on how you can generate the same results in 
standalone RPS-BLAST as those produced by the web service.

===============================================================================
SCOPE OF DATA in FTP FILES 
===============================================================================

Data accessible via the CD-Search tool and in the Entrez Conserved Domain 
Database (CDD) originate from a number of source databases, including NCBI-
curated domain models as well as models from external sources: 

1 cd ....... alignment models curated at NCBI as part of the CDD project
             (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains)
2 Pfam ..... PSSMs from a mirror of the Pfam-A seed alignment database
             (see: http://pfam.sanger.ac.uk/)
3 Smart .... PSSMs from a mirror of the Smart domain alignment database
             (see: http://smart.embl-heidelberg.de/)
4 COG ...... PSSMs from automatically aligned sequences and sequence
             fragments classified in the COGs resource, which focuses 
             primarily on prokaryotes 
             (see: http://www.ncbi.nlm.nih.gov/COG/new/)
5 PRK ...... PSSMs from automatically aligned sequences and sequence
             fragments classified as stable clusters in the 
             Protein Clusters database 
             (see: http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=proteinclusters) 
6 TIGRFAM .. PSSMs from a mirror of the TIGRFAM database of protein families
             (see: http://www.jcvi.org/cms/research/projects/tigrfams/overview/)
7 KOG ...... PSSMs from automatically aligned sequences and sequence
             fragments classified in the KOGs resource, the eukaryotic 
             counterpart to COGs (see: http://www.ncbi.nlm.nih.gov/COG/new/).
             These are available as a separate search set in CD-Search 
             but are not indexed for text searching in Entrez CDD.
8 LOAD ..... Library of Ancient Domains 
             These 55 models are available only as a data file on the  
             FTP site; they are not searchable via CD-Search and are not  
             indexed for text searching in Entrez CDD.  The domains
             in this set are represented by domain models in the other 
             data collections above. 
            
             (Additional details about source Databases are provided 
             in the  CDD Help Doc: 
             http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDSource)

The CD-Search databases, Entrez CDD, and the FTP files in this directory, 
encompass various data sets.  The scope of data covered by each FTP file is 
noted in the FILE LIST and SUMMARY, below, and can be one of the following: 

SCOPE A:     ALL CD models accessible via the CD-Search tool  
             (subsets 1-7, above) plus subset 8, which is 
             accessible only in this FTP directory. 

SCOPE B:     Data from the CD-Search tool's DEFAULT "cdd" database, 
             which includes subsets 1-6, above.  These subsets are 
             also indexed and searchable in NCBI's Entrez CDD database. 
 
SCOPE C:     NCBI-curated CD models (subset 1, above). 

SCOPE D:     Data from specific, individual NCBI-curated CD models

SCOPE E:     conserved domain models that are members of superfamilies; 
             these can include models from subset 1, and models from 
             subsets 2-6 that are not multidomains. 
             Superfamilies and multidomains are described in the 
             CDD Help document: 
             http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types
 

===============================================================================
FILE LIST and SUMMARY
===============================================================================

The CDD FTP directory includes the following files and subdirectories. 
The SCOPE levels noted in this summary table are described in the 
preceding section, "SCOPE OF DATA in FTP FILES". 
Additional DETAILS for each file are provided in the next section. 

-------------------------------------------------------------------------
FILENAME           |scope| summary 
-------------------------------------------------------------------------
cdd.tar.gz         |  A  | PSSMs originating from various alignment 
                   |     | collections; can be used to build search  
                   |     | databases for RPS-BLAST.   
                   |     | (scope A: all CD models) 
------------------------------------------------------------------------
acd.tar.gz         |  A+ | CD data as used by the CD-server for 
                   |     | visualization of CD-search results
                   |     | (scope A, PLUS data for superfamily clusters)
------------------------------------------------------------------------
cddid_all.tbl.gz   |  A  | summary information about all CD models in this 
                   |     | distribution 
                   |     | (scope A: all CD models) 
------------------------------------------------------------------------
fasta.tar.gz       |  A  | sequence alignments from the CDs in mFASTA format
                   |     | (scope A: all CD models)
------------------------------------------------------------------------
cdd.versions       |  A  | list of all conserved domain model accessions,  
                   |     | versions, and PSSM IDs present in the current and 
                   |     | previous versions of the Conserved Domain Database 
                   |     | (scope A: all CD models) 
------------------------------------------------------------------------
cdd.info           |  B  | CDD release version number and details
                   |     | (scope B: default "cdd" database)
-------------------------------------------------------------------------
cddid.tbl.gz       |  B  | summary information about the CD models in this
                   |     | distribution that are part of the CD-Search tool's
                   |     | default "cdd" database and are indexed in 
                   |     | NCBI's Entrez CDD database 
                   |     | (scope B: default "cdd" database)
------------------------------------------------------------------------
cddmasters.fa.gz   |  B  | FASTA-formatted sequences that show representative 
                   |     | sequences for each conserved domain model in the 
                   |     | collection
                   |     | (scope B: default "cdd" database)
------------------------------------------------------------------------
cddannot.dat.gz    |  C  | information about conserved family features
                   |     | (such as binding and catalytic sites) as  
                   |     | recorded for NCBI-curated CD models
                   |     | (scope C: NCBI-curated domain models)
------------------------------------------------------------------------
cdtrack.txt        |  C  | information from NCBI's internal tracking system
                   |     | about hierarchies of related domain models in 
                   |     | NCBI-curated domains (scope C) 
------------------------------------------------------------------------
bitscore_specific_X.XX.txt
                   |  C  | domain-specific score thresholds used by the
                   |     | CD-Search tool to determine whether hits to
                   |     | NCBI-curated domain models are specific or 
                   |     | non-specific.  The X.XX portion of the filename
                   |     | indicates the CDD release number.  (scope C)
------------------------------------------------------------------------
cd00882_notree.acd |  D  | versions of files distributed within acd.tar 
cd01659_notree.acd |  D  | that are meant for users of the old NCBI C-toolkit
cd02039_notree.acd |  D  | (scope D: specific, individual NCBI-curated models)
------------------------------------------------------------------------
big_endian         |  A- | subdirectories containing pre-formatted search 
little_endian      |  A- | databases for use with various architecture/OS
(subdirectories)   |     | combinations (little_endian for Intel CPUs and
                   |     | Linux or Windows, big_endian for SUN or SGI under
                   |     | Solaris or IRIX, for example).
                   |     | (almost scope A: ALL CD models accessible via 
                   |     | the CD-Search tool (subsets 1-7, described in 
                   |     | "SCOPE OF DATA in FTP FILES", above) BUT NOT
                   |     | subset 8 (LOAD))
------------------------------------------------------------------------ 
family_superfamily_links
                   |  E  | list of NCBI-curated and imported domain models   
                   |     | that are members of CDD superfamilies, along   
                   |     | with the superfamily accession (cl*) to which
                   |     | each domain model belongs
                   |     | (scope E: superfamily members)
------------------------------------------------------------------------


===============================================================================
DETAILS for each file 
===============================================================================

Files below are listed in alphabetical order: 

===============================================================================
acd.tar.gz
===============================================================================

"acd.tar.gz" is a gzipped archive that contains the CD data as 
used by the CD-server for visualization of CD-search results. The data are 
stored as ASN.1-formatted files. 

The types of information provided for each conserved domain model are 
described in the CDD help document section on "CDD Record (CD Summary page): 
What information is displayed for each domain model?"
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDVisual.
 
        (SCOPE A+: this file includes data from all CD models, 
        PLUS data for superfamily clusters 
        (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily).
        Additional details about data coverage are provided in the 
        earlier section, "SCOPE OF DATA in FTP FILES") 
        
Technical note: The "acd" acronym is used at NCBI to denote
"ASN.1 Cd Datafile". It is also used as a file extension for 
CD data files (e.g., the "cd0????_notree.acd" files in this
FTP directory). However, the conserved domain file extensions 
appear as "*.cn3" when using the CDD database web server's 
"Structure View" function.  That allows the conserved domain 
data files to be uniquely associated with the Cn3D viewing program 
(http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml), 
and differentiates them from the *.acd file extension used 
by computer-aided design programs such as AutoCAD. 

===============================================================================
bitscore_specific_X.XX.txt
===============================================================================

"bitscore_specific_X.XX.txt" (e.g., "bitscore_specific_2.14.txt") 
contains the domain-specific bit score thresholds used by the CD-Search 
tool to determine whether hits to NCBI-curated domain models are 
specific or non-specific (both hit types are described in:  
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types). 

This file is saved for each CDD release (the X.XX portion of the filename 
indicates the CDD release number), allowing retrieval of current and 
previous bit score thresholds for a domain model. 

The file contains three columns: 

1. conserved domain PSSM ID 

   This is a unique identifier for a domain model's position-specific 
   scoring matrix (PSSM).  If a domain model's PSSM changes in any way 
   as a result of updates to its multiple sequence alignment, it receives 
   a new PSSM ID.  This happens because a conserved domain model can evolve 
   over time.  For example, as new sequence data become available, curators 
   might add sequences to a multiple sequence alignment or update the  
   sequences already present. As a result of such changes to the domain model, 
   the PSSM and its ID can change. 

   Additional information about PSSMs is accessible from: 
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess 

2. conserved domain accession number  

   Domain-specific score thresholds are currently calculated only for 
   NCBI-curated domains; therefore, all accessions in the file begin with  
   the prefix "cd" 
   (see: http://www.ncbi.nlm.nih.gov/Structure/cdd/
   cdd_help.shtml#CDSource_accession_prefix)  

3. domain-specific score threshold, shown as bit score 

   This column shows the lowest bit score among self-hits of a domain's 
   member protein sequences to the resulting domain model. 
   This domain-specific score threshold can change for the same reasons
   the PSSM ID can change (explained in #1, above). 

   An illustrated example and additional details about specific hits 
   and domain-specific thresholds are provided in the CD-Search help document: 
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#SpecificHit 


        (SCOPE C: this file includes data from NCBI-curated domain models;
        see section on "SCOPE OF DATA in FTP FILES" for details)
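
As a rough illustration of how the three columns might be used, the Python
sketch below loads the thresholds and labels a hit as specific or
non-specific. It assumes that the columns are separated by whitespace and
that a hit counts as specific when its bit score is at least the threshold.
The function names and the sample values are made up for this example and
are not taken from an actual release file.

```python
def load_thresholds(lines):
    """Map PSSM ID -> (accession, bit score threshold).

    Assumes three whitespace-separated columns per line, in the order
    described above; blank lines and '#' comments are skipped.
    """
    thresholds = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pssm_id, accession, bit_score = line.split()
        thresholds[int(pssm_id)] = (accession, float(bit_score))
    return thresholds


def is_specific_hit(thresholds, pssm_id, hit_bit_score):
    """Return True if the hit meets the model's threshold (assumed rule).

    Models without a listed threshold are never reported as specific.
    """
    entry = thresholds.get(pssm_id)
    if entry is None:
        return False
    return hit_bit_score >= entry[1]


# Made-up sample rows, for illustration only.
SAMPLE = """\
100001 cd00001 55.3
100002 cd00002 120.8
"""

thresholds = load_thresholds(SAMPLE.splitlines())
print(is_specific_hit(thresholds, 100001, 60.0))   # True
print(is_specific_hit(thresholds, 100002, 60.0))   # False
```

Because the PSSM ID changes whenever the model changes, the lookup is keyed
on the PSSM ID rather than on the accession.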

===============================================================================
cd0????_notree.acd
=============================================================================== 

"cd0????_notree.acd" are versions of files distributed within "acd.tar",
which have been stored without data representing the sequence tree of the 
underlying set of sequence fragments. Trees in these particular examples are 
deeply nested and cannot be read with the old NCBI C-toolkit object loaders.
These separate files allow users of the old NCBI C-toolkit to load the full set
of conserved domain models into their applications.

        (SCOPE D: these files include data from specific, 
        individual NCBI-curated models; see section on 
        "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cdd.info
=============================================================================== 

"cdd.info" contains the CDD release version number and details the 
content of the release (number of models from each data source).

        (SCOPE B: this file includes data from the default "cdd" database;
        see section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cdd.tar.gz
===============================================================================

"cdd.tar.gz" is a gzipped archive file that contains Position-Specific 
Scoring Matrices (PSSMs) originating from all of the alignment collections 
encompassed by the Conserved Domain database project. 

        (Scope A: this file includes data from ALL CD models; 
        see section on "SCOPE OF DATA in FTP FILES" for details)

To build search databases for RPS-Blast, you need to unpack the
archive and extract its contents. It contains only ASCII-formatted
files, with the following extensions:

 *.smp ...... Position-Specific Scoring Matrices (PSSMs). These are
              stored in a new ASN.1 format ("scoremat"), which is shared
              between various BLAST applications.
 *.pn ....... lists of PSSM file names
 
These files allow for the compilation of 6 RPS-Blast search databases:

 Smart 
 Pfam
 Cog
 Kog
 Prk
 Cdd  (domains from Smart, Pfam, COG, PRK, and cd; 
       this is the set that's indexed in NCBI's Entrez)
 
The databases must be formatted with the "makeprofiledb" application 
that is distributed with the BLAST executables 
(ftp://ftp.ncbi.nih.gov/blast/executables/).  
Be sure to use recent BLAST executables
in order to obtain the makeprofiledb application that is compatible
with the CDD FTP files. (The formatrpsdb application packaged 
with earlier BLAST releases is not compatible and will result in 
an error message, "unable to match element in intermediateData... 
ERROR: no data found in file.") 

The following sequence of commands will build the search databases:

  
makeprofiledb -title SMART.v6.0 -in Smart.pn -out Smart -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title Pfam.v.26.0 -in Pfam.pn -out Pfam -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title COG.v.1.0 -in Cog.pn -out Cog -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title KOG.v.1.0 -in Kog.pn -out Kog -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title CDD.v.3.10 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title PRK.v.6.00 -in Prk.pn -out Prk -threshold 9.82 -scale 100.0 -dbtype rps -index true


Note that the parameter '-threshold' supplied with makeprofiledb, the three-letter
word score threshold for detecting and extending hits in RPS-Blast, will
determine the size of the search database. A lower threshold
will result in larger databases and slightly increased search sensitivity,
at the cost of additional memory requirements and reduced search speed.
Matrices distributed for creating RPS-Blast search databases are scaled by a
factor of 100 (parameter -scale). A score threshold value of 9.82 will result in 
search-databases of a size very similar to using unscaled matrices and
a threshold value of 11.
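
For anyone scripting these builds, the commands above can be generated from
a small table. The following Python sketch only assembles and prints the
argument lists; it uses no flags beyond those shown above, and the helper
name makeprofiledb_args is made up for this example.

```python
import shlex

# Database name -> title, copied from the commands above.
DATABASES = {
    "Smart": "SMART.v6.0",
    "Pfam": "Pfam.v.26.0",
    "Cog": "COG.v.1.0",
    "Kog": "KOG.v.1.0",
    "Cdd": "CDD.v.3.10",
    "Prk": "PRK.v.6.00",
}


def makeprofiledb_args(name, title, threshold=9.82, scale=100.0):
    """Build the argument list for one makeprofiledb run."""
    return [
        "makeprofiledb",
        "-title", title,
        "-in", f"{name}.pn",
        "-out", name,
        "-threshold", str(threshold),
        "-scale", str(scale),
        "-dbtype", "rps",
        "-index", "true",
    ]


for name, title in DATABASES.items():
    args = makeprofiledb_args(name, title)
    print(shlex.join(args))
    # To actually run a build (requires makeprofiledb on the PATH and the
    # unpacked archive contents in the current directory), one could use:
    #   import subprocess
    #   subprocess.run(args, check=True)
```

Keeping the threshold and scale as parameters makes it easy to try a lower
threshold for one database without retyping the full command line.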

Note also that the RPS-Blast search databases generated by makeprofiledb 
are architecture dependent; it may not be possible to create them on one
platform and use them on another.

When searching with your local version of RPS-Blast, use the command-line
argument "-db" to specify the database name and location ("-d" in the older
C-toolkit version of rpsblast). You need an executable version of the
"rpsblast" program; type "rpsblast -help" to obtain a list of
command-line options.
 
You can now take any arbitrary subset of PSSMs and compile them into an
RPS-Blast search database. All that makeprofiledb needs is a list of file
names (such as "Smart.pn" in the example above) and the corresponding 
"scoremats" (*.smp) files. Newer versions of Psi-BLAST (blastpgp) can now
write out "checkpoints" in the "scoremat" format as well (blastpgp parameter
-u1). These again can be combined with arbitrary subsets of scoremat-
formatted PSSMs distributed here, to create customized RPS-Blast search sets.
The scoremat-formatted PSSMs distributed here are scaled with a factor of 100.0;
if one were to combine them with Psi-BLAST generated "scoremats", the
same scaling factor must be set as a parameter with makeprofiledb. 
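
A custom list file can be assembled with a few lines of Python. The sketch
below assumes that a *.pn file simply holds one *.smp file name per line and
that the *.smp files are named after their accessions; both are assumptions
for illustration, as are the helper names and the sample listing.

```python
import os
import tempfile


def select_smp(file_names, prefixes):
    """Keep the *.smp names whose base name starts with one of the prefixes."""
    wanted = tuple(prefixes)
    return sorted(
        name for name in file_names
        if name.endswith(".smp") and os.path.basename(name).startswith(wanted)
    )


def write_pn(path, smp_names):
    """Write a list file with one PSSM file name per line (assumed layout)."""
    with open(path, "w") as handle:
        for name in smp_names:
            handle.write(name + "\n")


# Hypothetical directory listing; real names come from unpacking cdd.tar.gz.
listing = ["cd00001.smp", "pfam00001.smp", "smart00001.smp", "Cdd.pn"]
subset = select_smp(listing, ["cd", "smart"])

# Write the list into a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    pn_path = os.path.join(tmp, "Custom.pn")
    write_pn(pn_path, subset)
    with open(pn_path) as handle:
        written = handle.read()

print(written, end="")
```

The resulting list file would then be passed to makeprofiledb with the "-in"
parameter, together with "-scale 100.0" as in the commands above.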

Note: If you prefer to use preformatted databases, see the big_endian and 
little_endian subdirectories of the CDD FTP site. They contain databases 
that have been preformatted for use with various architecture/OS combinations 
(Intel, Sun, SGI / Linux, Windows, Solaris, IRIX). 

===============================================================================
cdd.versions 
=============================================================================== 

"cdd.versions" lists all conserved domain model accessions, versions, 
and PSSM IDs present in the current and previous versions of the 
Conserved Domain Database.  

        (Scope A: this file includes data from ALL CD models; 
        see section on "SCOPE OF DATA in FTP FILES" for details)


Example/Excerpt from file:  

# Acc          ShortName  PssmId  Root     Ver  Lv Rl ER Time             
# ------------ ----------------- -------- ---- -- -- -- -----------------
...
pfam09006      Surfac_D-t 90442   N/A      4    1  1  0  01/09/08 09:49:00
pfam09006      Surfac_D-t 87766   N/A      3    0  1  0  09/13/07 17:36:00
pfam09006      Surfac_D-t 72424   N/A      2    0  1  0  05/07/07 17:24:00
pfam09006      Surfac_D-t 72424   N/A      1    0  1  0  03/12/07 13:54:00
... 

Column descriptions: 

Acc = conserved domain model accession number (e.g., pfam09006) 

ShortName = first 10 characters of domain model's short name, 
        in this case, Surfac_D-t, for Surfac_D-trimer. 

PssmId = unique identifier for the position-specific scoring matrix
        (e.g., as the pfam09006 domain model has evolved, it has had
        three PSSMs, with IDs 72424, 87766, and 90442, respectively).
        
        If there are any changes in the protein sequence alignment 
        of a domain model (for example, the addition/deletion of 
        member protein sequences or changes in the span of aligned residues), 
        or if there are changes in the interpretation of the alignment, 
        a new PSSM will be calculated. In that case, it will receive
        a new PSSM ID, although the accession number of the conserved 
        domain model will remain the same. 
        
        If only the domain model description or other annotations have
        changed, but the PSSM did not change, the version of the model 
        will be incremented but the PSSM ID will remain the same, 
        as it did for versions 1 and 2 of pfam09006, both of which had 
        the PSSM ID 72424. 
        
Root =  if the domain model is NCBI-curated, the "Root" column will 
        show the accession number of the parent node of the curated
        domain hierarchy.  If the domain hierarchy contains only a
        single node, the value in the "Root" column will be the same
        as that in the "Acc" column.  The values will also be the same 
        if the accession listed in the first column is the parent node
        of a multi-level hierarchy. 

Ver =   version number of that particular domain model 

Lv =    indicates whether this is the current live version of the record:  
        1 = live status; 
        0 = dead, earlier version. 

Rl =    indicates whether the domain model version has been 
        released into the public database. This is a flag 
        NCBI uses for internal data tracking.  
        For most domain models, the value will be 
        1 = released, which means at some point the model was 
        live in the database.  Occasionally a value of "0" might 
        appear, primarily for NCBI-curated models.  This indicates
        a newer version of a model is in preparation at NCBI and 
        will be released in the future. 

ER =    expendable or redundant models; the value in this column can be: 
        0 = non-expendable or not redundant 
        1 = expendable or redundant; indicates a model that has been 
            removed from the default "cdd" search set because the 
            information in it is represented in another domain model. 

Time =  date and time on which the model was last updated in the 
        internal conserved domain tracking database.  
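
The excerpt above can be read with a short Python sketch such as the
following. It assumes whitespace-separated columns, as in the excerpt, and
splits each row from the right so that the time stamp (two tokens) and the
numeric flags are always picked up; the function names are made up for this
example.

```python
from collections import namedtuple

Row = namedtuple("Row", "acc short_name pssm_id root ver lv rl er time")


def parse_versions(lines):
    """Parse cdd.versions-style rows; '#' comments and '...' are skipped.

    Splits from the right so that the eight trailing tokens are always
    found, assuming accessions never contain whitespace.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("..."):
            continue
        head, pssm_id, root, ver, lv, rl, er, date, clock = line.rsplit(None, 8)
        acc, short_name = head.split(None, 1)
        rows.append(Row(acc, short_name, int(pssm_id), root, int(ver),
                        int(lv), int(rl), int(er), f"{date} {clock}"))
    return rows


def live_version(rows, acc):
    """Return the version number flagged as live (Lv = 1), or None."""
    for row in rows:
        if row.acc == acc and row.lv == 1:
            return row.ver
    return None


def pssm_history(rows, acc):
    """Distinct PSSM IDs for an accession, oldest version first."""
    history = []
    for row in sorted((r for r in rows if r.acc == acc), key=lambda r: r.ver):
        if row.pssm_id not in history:
            history.append(row.pssm_id)
    return history


# Rows copied from the excerpt above.
EXCERPT = """\
pfam09006      Surfac_D-t 90442   N/A      4    1  1  0  01/09/08 09:49:00
pfam09006      Surfac_D-t 87766   N/A      3    0  1  0  09/13/07 17:36:00
pfam09006      Surfac_D-t 72424   N/A      2    0  1  0  05/07/07 17:24:00
pfam09006      Surfac_D-t 72424   N/A      1    0  1  0  03/12/07 13:54:00
"""

rows = parse_versions(EXCERPT.splitlines())
print(live_version(rows, "pfam09006"))   # 4
print(pssm_history(rows, "pfam09006"))   # [72424, 87766, 90442]
```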

===============================================================================
cddannot.dat.gz
=============================================================================== 

"cddannot.dat.gz" contains information about conserved family features
(such as binding and catalytic sites) as recorded for curated CD models. 
This is a tab-delimited text file, with a single row per "feature" and the 
following columns:

 PSSM-Id (unique numerical identifier)
 CD accession (starting with 'cd')
 CD "short name"
 Feature number
 Feature description/name
 Boolean flag (0/1), indicating presence of structure-based feature evidence
 Boolean flag (0/1), indicating presence of reference-based feature evidence
 Boolean flag (0/1), indicating presence of additional comments
 comma-separated feature addresses
 site type (numerical)
 
The feature addresses are positions on the alignment's "master sequence", which
is a consensus sequence, and on the alignment's PSSM (the database search
model). Note that feature addresses are stored in a coordinate system that
counts the first residue in the consensus sequence as "0".

The site types are assigned as follows:

0 ... unassigned or type "other"
1 ... active site
2 ... polypeptide binding site
3 ... nucleic acid binding site
4 ... ion binding site
5 ... chemical binding site
6 ... posttranslational modification site

        (SCOPE C: this file includes data from NCBI-curated domain models;
        see section on "SCOPE OF DATA in FTP FILES" for details)
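
The column layout and the zero-based coordinate system can be handled as in
the Python sketch below. It assumes that each feature address is a plain
integer; the function name and the sample row are made up for this example
and do not come from the actual file.

```python
SITE_TYPES = {
    0: "unassigned or other",
    1: "active site",
    2: "polypeptide binding site",
    3: "nucleic acid binding site",
    4: "ion binding site",
    5: "chemical binding site",
    6: "posttranslational modification site",
}


def parse_feature(line):
    """Parse one tab-delimited feature row into a dictionary."""
    (pssm_id, acc, short_name, number, description,
     has_structure, has_reference, has_comment,
     addresses, site_type) = line.rstrip("\n").split("\t")
    # Addresses are stored zero-based on the master (consensus) sequence.
    zero_based = [int(a) for a in addresses.split(",") if a.strip()]
    return {
        "pssm_id": int(pssm_id),
        "accession": acc,
        "short_name": short_name,
        "feature_number": int(number),
        "description": description,
        "structure_evidence": has_structure == "1",
        "reference_evidence": has_reference == "1",
        "has_comment": has_comment == "1",
        "positions_0based": zero_based,
        "positions_1based": [p + 1 for p in zero_based],
        "site_type": SITE_TYPES.get(int(site_type), "unknown"),
    }


# Made-up sample row, for illustration only.
sample = "100001\tcd00001\tExample_dom\t1\tactive site\t1\t0\t0\t10,12,45\t1"
feature = parse_feature(sample)
print(feature["positions_1based"])   # [11, 13, 46]
print(feature["site_type"])          # active site
```

Keeping both coordinate systems in the parsed record avoids off-by-one
mistakes when the positions are later compared with one-based residue
numbering.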

===============================================================================
cddid.tbl.gz
=============================================================================== 

"cddid.tbl.gz" contains summary information about the CD models in this
distribution, which are part of the default "cdd" search database and are 
indexed in NCBI's Entrez database. This is a tab-delimited text file, with a 
single row per CD model and the following columns:

 PSSM-Id (unique numerical identifier)
 CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK' or 'CHL')
 CD "short name"
 CD description
 PSSM-Length (number of columns, the size of the search model)

        (SCOPE B: this file includes data from the default "cdd" database;
        see section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
cddid_all.tbl.gz
=============================================================================== 

"cddid_all.tbl.gz" contains summary information about all CD models in
this distribution. This is a tab-delimited text file, with a single row per CD 
model and the following columns:

 PSSM-Id (unique numerical identifier)
 CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK', 'CHL', 'KOG',
               or 'LOAD')
 CD "short name"
 CD description
 PSSM-Length (number of columns, the size of the search model)

        (Scope A: this file includes data from ALL CD models; 
        see section on "SCOPE OF DATA in FTP FILES" for details)
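
Both "cddid.tbl.gz" and "cddid_all.tbl.gz" share this five-column layout, so
one reader can serve for either file. The Python sketch below is a minimal
example that assumes no tab characters occur inside the description column;
the function name and the sample rows are made up, and the rows are
compressed in memory to stand in for the real file.

```python
import gzip
import io


def read_cddid(handle):
    """Map CD accession -> record from a cddid.tbl-style text stream."""
    models = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        pssm_id, acc, short_name, description, length = line.split("\t")
        models[acc] = {
            "pssm_id": int(pssm_id),
            "short_name": short_name,
            "description": description,
            "pssm_length": int(length),
        }
    return models


# Made-up rows, gzip-compressed in memory to stand in for the real file.
raw = (
    "100001\tcd00001\tExample_A\tFirst example model\t120\n"
    "100002\tpfam00001\tExample_B\tSecond example model\t85\n"
)
packed = gzip.compress(raw.encode("utf-8"))

with gzip.open(io.BytesIO(packed), "rt", encoding="utf-8") as handle:
    models = read_cddid(handle)

print(models["cd00001"]["pssm_length"])   # 120
```

For the downloaded file, the same function can be given the handle returned
by gzip.open("cddid_all.tbl.gz", "rt").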

===============================================================================
cddmasters.fa.gz
=============================================================================== 

"cddmasters.fa.gz" is an archive containing the FASTA-formatted 
representative sequences for each conserved domain model in the collection.
The representative sequences are consensus sequences with an approximate
median length relative to all the sequence footprints used in the alignment.
They are constructed for calculating a position-specific score matrix (PSSM);
each residue in the representative sequence corresponds to a column in the
PSSM. When RPS-BLAST formats output, it will display pair-wise alignments
between the query and the PSSMs' representative sequences.

        (SCOPE B: this file includes data from the default "cdd" database;
        see section on "SCOPE OF DATA in FTP FILES" for details)
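
Since each residue of a representative sequence corresponds to one PSSM
column, the sequence lengths can be compared with the PSSM-Length column of
"cddid.tbl.gz" as a consistency check. The Python sketch below shows a
minimal FASTA reader for that purpose; the deflines and sequences are
made up, and the actual defline layout of the file is not assumed here.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from FASTA-formatted lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif line and defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


# Made-up records, for illustration only.
SAMPLE = """\
>example_model_1 hypothetical defline
MKTAYIAKQR
QISFVKSHFS
>example_model_2 hypothetical defline
GSHMLEDPVA
"""

# Key each record by the first word of its defline.
lengths = {d.split()[0]: len(s) for d, s in read_fasta(SAMPLE.splitlines())}
print(lengths)   # {'example_model_1': 20, 'example_model_2': 10}
```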

===============================================================================
cdtrack.txt
=============================================================================== 

"cdtrack.txt" lists information from NCBI's internal tracking system
for conserved domain models curated at NCBI. The intent of this file is to 
provide information about hierarchies of related domain models. All models 
that map to the same root accession have been linked together in a 
hierarchical set, in which the alignment models are consistent with each 
other.

Columns in this table are:
Acc .......... CD accession
ShortName .... CD short name
PssmId ....... CD PSSM-ID, a unique numerical identifier for each CD
Root ......... Accession of the CD hierarchy root model.
Ver .......... CD version number
Lv ........... is model live in the tracking system?
Rl ........... has model been released to the public?
ER ........... has model been flagged as "expendable or redundant"?
Time ......... time stamp in the tracking system (last modification)

        (SCOPE C: NCBI-curated domain models;
        see section on "SCOPE OF DATA in FTP FILES" for details)
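
To recover the hierarchies, models can be grouped by the value in the Root
column. The Python sketch below assumes that "cdtrack.txt" uses
whitespace-separated columns and a two-token time stamp, like the
"cdd.versions" excerpt shown earlier; the function name and the sample rows
are made up for this example.

```python
from collections import defaultdict


def group_by_root(lines):
    """Map root accession -> sorted list of live member accessions."""
    families = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split from the right: PssmId, Root, Ver, Lv, Rl, ER, date, time.
        head, pssm_id, root, ver, lv, rl, er, date, clock = line.rsplit(None, 8)
        acc = head.split(None, 1)[0]
        if lv == "1":
            families[root].add(acc)
    return {root: sorted(members) for root, members in families.items()}


# Made-up rows: one three-model hierarchy and one single-node hierarchy.
SAMPLE = """\
cd00001  Example_A  100001  cd00001  3  1  1  0  01/09/08 09:49:00
cd00002  Example_B  100002  cd00001  2  1  1  0  01/09/08 09:49:00
cd00003  Example_C  100003  cd00001  1  1  1  0  01/09/08 09:49:00
cd00009  Example_D  100009  cd00009  1  1  1  0  01/09/08 09:49:00
"""

hierarchies = group_by_root(SAMPLE.splitlines())
print(hierarchies["cd00001"])   # ['cd00001', 'cd00002', 'cd00003']
print(hierarchies["cd00009"])   # ['cd00009']
```

A root that maps only to itself corresponds to a hierarchy with a single
node.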

=============================================================================== 
family_superfamily_links 
=============================================================================== 

"family_superfamily_links" lists the conserved domain models that are members 
of superfamilies, along with the superfamily cluster (cl*) accession to which 
each domain model belongs.  

Superfamily members can include NCBI-curated domain models as well as 
imported models that are not multi-domains.
More information about superfamilies and multidomains is available at:  
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily and 
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#RPSB_hit_types. 

The file contains four columns:  

1. conserved domain accession number  

   For examples, see: http://www.ncbi.nlm.nih.gov/Structure/cdd/
   cdd_help.shtml#CDSource_accession_prefix.  

2. conserved domain PSSM ID 

   This is a unique identifier for a domain model's position-specific 
   scoring matrix (PSSM).  If a domain model's PSSM changes in any way 
   as a result of updates to its multiple sequence alignment, it receives 
   a new PSSM ID.  This happens because a conserved domain model can evolve 
   over time.  For example, as new sequence data become available, the 
   curators of a source database might add sequences to a multiple sequence 
   alignment or update the  sequences already present. As a result of 
   such changes to the domain model, the PSSM and its ID can change. 
   Additional information about PSSMs is accessible from: 
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#CDProcess 

3. superfamily cluster accession number 

   If a conserved domain model belongs to a superfamily with two or 
   more members, this column contains the accession of the corresponding 
   superfamily (an alphanumeric string starting with a "cl" prefix (for "cluster") 
   and followed by a series of digits, e.g., cl02915).  
   
   If a conserved domain model is a "singleton" (the sole member of a 
   superfamily), this column simply repeats the conserved domain model's 
   accession number that is shown in column 1. (Note: The majority of  
   superfamilies are singletons, each containing a single model from a 
   source database such as Pfam, TIGRFAM, or COGs. While the CDD data 
   processing pipeline does generate corresponding superfamily cluster 
   models, they are not indexed in the Entrez search system, in order to 
   reduce redundancy in the presentation of search results.)
   
   Superfamily clusters are produced via an automated procedure 
   each time there is a new CDD release.  Information about  
   clustering methodology is provided at: 
   http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#Superfamily. 

   The composition of a cluster can change over time due to a variety  
   of factors, such as (a) availability of new domain models,  
   (b) changes to previously existing models, (c) new and/or updated  
   sequence records in the Entrez Protein database, and (d) refinements 
   to the automated clustering procedures.  

   A superfamily cluster accession number will remain the same if 
   at least 50 percent of its member models (conserved domain accessions)
   have not changed relative to the previous version of the cluster. 

   If more than 50 percent of the conserved domain accessions from   
   a previous version of a cluster are no longer present in the new build 
   of that cluster, or if the cluster size more than doubles with a new 
   build, then the superfamily cluster accession is retired and replaced 
   by one or more new accessions. If two previous clusters merge into a 
   single new cluster, the superfamily cluster accession number of the 
   larger component cluster is used for the new grouping.  

4. superfamily cluster PSSM ID 

   A superfamily's PSSM ID refers to the specific set of 
   conserved domain PSSM IDs that comprise the superfamily, rather 
   than to an actual position-specific scoring matrix for the overall 
   superfamily.   
  
   The superfamily cluster PSSM ID will change if there is any change 
   to the set of member PSSM IDs relative to the previous version of 
   the cluster (e.g., if a member conserved domain gets a new PSSM ID 
   due to changes in its multiple sequence alignment, or if a new conserved 
   domain model is added to the superfamily as the result of a CDD database 
   update). 
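
As a minimal sketch (not part of the CDD distribution), the four columns 
described above could be read with a few lines of Python. The sketch 
assumes whitespace-delimited columns and no header line; the names Link 
and parse_links, and every value in SAMPLE other than the accession 
format, are invented for illustration. 

```python
from typing import Iterable, List, NamedTuple


class Link(NamedTuple):
    cd_accession: str   # column 1: conserved domain accession
    cd_pssm_id: str     # column 2: conserved domain PSSM ID
    sf_accession: str   # column 3: superfamily cluster accession
    sf_pssm_id: str     # column 4: superfamily cluster PSSM ID

    @property
    def is_singleton(self) -> bool:
        # For singletons, column 3 repeats the accession from column 1.
        return self.sf_accession == self.cd_accession


def parse_links(lines: Iterable[str]) -> List[Link]:
    """Parse rows of the links file, skipping lines without four fields."""
    links = []
    for line in lines:
        fields = line.split()  # ASSUMPTION: whitespace-delimited columns
        if len(fields) == 4:
            links.append(Link(*fields))
    return links


# Invented rows, for illustration only (not real CDD data).
SAMPLE = [
    "cd00001 100001 cl02915 200001",
    "pfam00002 100002 cl02915 200001",
    "COG0003 100003 COG0003 200003",
]
links = parse_links(SAMPLE)
singletons = [link.cd_accession for link in links if link.is_singleton]
```

Comparing columns 1 and 3 is enough to separate singletons from members 
of multi-member superfamilies. 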
  

The family_superfamily_links file for each CDD release will be saved on the 
FTP site and can be used to track changes in superfamily clusters over time. 
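
When comparing two releases, the accession-retention rule described under 
column 3 can be sketched as a set comparison. This is only one reading 
of the rule as stated in this document, not the CDD pipeline's actual 
code; the function name accession_retained is invented, and the merge 
case (two clusters combining into one) is not covered. 

```python
from typing import Set


def accession_retained(previous: Set[str], current: Set[str]) -> bool:
    """Return True if a cluster would keep its accession between builds.

    `previous` and `current` hold the conserved domain accessions that
    belong to the same cluster in the old and new build, respectively.
    """
    if not previous:
        raise ValueError("the previous build of the cluster has no members")
    still_present = len(previous & current)
    # At least 50 percent of the previous members must still be present.
    enough_kept = 2 * still_present >= len(previous)
    # The cluster size must not have more than doubled.
    not_doubled = len(current) <= 2 * len(previous)
    return enough_kept and not_doubled
```

For example, a four-member cluster that keeps two of its members and 
gains one new model retains its accession under this reading, whereas 
one that keeps only a single original member does not. 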

        (Scope E: this file includes data from NCBI-curated and 
        imported domain models that are members of superfamilies; 
        see section on "SCOPE OF DATA in FTP FILES" for details)

===============================================================================
fasta.tar.gz
===============================================================================

"fasta.tar.gz" contains sequence alignments from the CDs in mFASTA
format. Note that sequence fragments are identified with GIs and/or accessions,
but the alignments do not necessarily contain full-length sequences: 
the fragments span the region between the first and last aligned residue only. 

        (Scope A: this file includes data from ALL CD models; 
        see section on "SCOPE OF DATA in FTP FILES" for details)
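
The alignments can be read without unpacking the archive to disk. The 
following is a minimal sketch using Python's standard tarfile module; 
it assumes that each regular file in the archive is a plain-text mFASTA 
file, and the function names read_fasta and iter_alignments are invented 
for illustration. 

```python
import io
import tarfile
from typing import BinaryIO, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def iter_alignments(
    fileobj: BinaryIO,
) -> Iterator[Tuple[str, List[Tuple[str, str]]]]:
    """Yield (member name, records) for each file in a gzipped tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            # ASSUMPTION: members are plain ASCII text in mFASTA format.
            text = handle.read().decode("ascii", errors="replace")
            yield member.name, list(read_fasta(text))
```

A typical call would open the downloaded file in binary mode, for 
example open("fasta.tar.gz", "rb"), and pass the file object to 
iter_alignments. Because the fragments cover only the aligned region, 
the sequences returned are generally shorter than the full-length 
database records. 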

===============================================================================
little_endian and big_endian sub-directories
=============================================================================== 

"little_endian" and "big_endian" sub-directories store pre-formatted 
search databases for little-endian and big-endian architectures, 
respectively. Choose the sub-directory that matches your 
architecture/OS combination: 

Intel/Linux, Intel/Windows, Intel/Solaris:  little_endian
Sun/Solaris, SGI/IRIX:                      big_endian

The subdirectories contain gzipped archives for each of the 5 different
search sets listed above. Simply download the set you need, unpack the
archive, and use the search set with rpsblast on your platform.
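
If you are unsure which sub-directory matches your machine, the byte 
order can be queried directly. The sketch below uses Python's standard 
sys.byteorder value; the function name cdd_subdirectory is invented 
for illustration. 

```python
import sys


def cdd_subdirectory(byteorder: str = sys.byteorder) -> str:
    """Map a byte order ("little" or "big") to the CDD sub-directory name."""
    if byteorder == "little":
        return "little_endian"
    if byteorder == "big":
        return "big_endian"
    raise ValueError("unknown byte order: %r" % (byteorder,))
```

Called without arguments, the function reports the sub-directory for the 
machine on which it runs. 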

        (Scope A-: this file includes data from ALL CD models, 
        but NOT models from the LOAD data set; 
        see section on "SCOPE OF DATA in FTP FILES" for details)

Note that, starting with CDD version v2.09, the pre-calculated RPS-BLAST 
databases will be presented in a new format that may require a recent 
RPS-BLAST binary. 

If you prefer to format the search databases on your own rather than use 
preformatted databases, see the "cdd.tar.gz" file description. 

===============================================================================
What accounts for the differences in search results generated by the 
CD-Search web service and standalone RPS-BLAST? 
===============================================================================

There are several differences between the CD-Search web service and 
standalone RPS-BLAST, as distributed by NCBI and used with search databases 
as distributed by the CDD group.

The web server is optimized for the most common use of the CDD resource, 
which is to annotate protein sequences with clearly identified and 
well understood protein domains, and is also optimized for speed in order to 
accommodate a high volume of searches.

As part of the optimization, we use some different statistical parameters 
for the web service than for the standalone RPS-BLAST application. 
Specifically, we use a constant, assumed search "database size" setting 
on the web server for calculating E-values. This means that the actual size 
of the search database can change (we are adding new models every few weeks), 
but the E-value computed for any individual GI -- PSSM match will remain 
constant. This approach: (a) ensures that pre-calculated results are 
not dependent on the actual size of the model collection (which is redundant 
and mostly grows by increasing that redundancy); (b) facilitates incremental 
updates of pre-computed sequence annotation with conserved domains; and 
(c) is used for the creation of protein-CDD links.

In contrast, standalone RPS-BLAST does not employ the constant, assumed 
database size parameter. So when you use a search set downloaded from the 
CDD FTP site, the database size might differ from the one used by 
the CD-Search web service, and the same hit of your query protein to a 
model will receive a different E-value in the standalone result. 
For example, if the size of the FTP'ed database is smaller than 
what the CD-Search web service assumes in its database size parameter, 
the same hit of your query protein to a conserved domain model will 
receive a lower E-value in the standalone result. Conversely, if the 
size of the FTP'ed database is larger than what the CD-Search web service 
assumes in its database size parameter, the same hit will receive 
a higher E-value in the standalone result.
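
The direction of this effect can be illustrated with a small calculation. 
The sketch below assumes that the E-value of a given hit is directly 
proportional to the database size (total length), which is the usual 
first-order behavior of BLAST statistics; it ignores any length 
corrections the programs apply, so the result is only an approximation. 
The function name rescale_evalue and the example sizes are invented 
for illustration. 

```python
def rescale_evalue(evalue: float, from_size: int, to_size: int) -> float:
    """Approximate the E-value of the same hit under another database size.

    ASSUMPTION: the E-value scales linearly with database size; real
    RPS-BLAST output may deviate slightly from this estimate.
    """
    if from_size <= 0 or to_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * to_size / from_size


# Hypothetical sizes: a hit scored against a database half the size
# receives roughly half the E-value.
smaller = rescale_evalue(1e-3, from_size=5_000_000, to_size=2_500_000)
larger = rescale_evalue(1e-3, from_size=5_000_000, to_size=10_000_000)
```

Here "smaller" is about 5e-04 and "larger" is about 2e-03, matching the 
lower and higher E-values described above. 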

If you want standalone RPS-BLAST to use the same database size parameter 
that is used for the web server (and thereby reproduce the same E-values 
with standalone RPS-BLAST that are generated by the web service), 
you can do that by creating an "alias" file on your local computer and 
placing it in the same directory as the standalone RPS-BLAST executable. 
The file can have a name such as "mycdd.pal" and can have contents 
such as the following (where lines starting with "#" are comments):

     #
     # RPSBLAST alias file
     #
     TITLE mycdd
     #
     DBLIST ./Cdd
     #
     STATS_TOTLEN    5000000
     STATS_NSEQ      21000

This will now let you search against the database named "Cdd" using the 
two search set size parameters as specified, e.g.: 

     ~$ rpsblast -i rpstest.tfa -d mycdd -F T -e 0.01 -m 9
     # RPSBLAST 2.2.26 [Sep-21-2011]
     # Query: gi|156356500|ref|XP_001623960.1| predicted protein [Nematostella vectensis]
     # Database: mycdd
     # Fields: Query id, Subject id, % identity, alignment length, mismatches,
     gap openings, q. start, q. end, s. start, s. end, e-value, bit score
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|197660  31.91   47      29
     2       432     475     4       50      7e-04 36.9
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|197660  31.48   54      31
     3       493     545     6       54      8e-04 36.5
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|197660  33.33   42      27
     1       312     352     2       43      0.003 35.3
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|119391  23.53   51      34
     2       493     542     1       47      8e-04 36.4
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|119391  21.57   51      35
     2       375     424     1       47      0.003 34.5
     gi|156356500|ref|XP_001623960.1|        gnl|CDD|177721  24.47   94      56
     3       463     541     18      111     0.005 38.6
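
To compare E-values from a run like the one above with those reported by 
the web service, the tabular output can be parsed by a short script. 
The sketch below assumes that each hit occupies a single tab-delimited 
line with the twelve fields named in the "# Fields:" comment (the hit 
lines shown above are wrapped for display); the names FIELDS and 
parse_tabular are invented for illustration. 

```python
from typing import Dict, Iterable, List

FIELDS = [
    "query_id", "subject_id", "pct_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "evalue", "bit_score",
]
INT_FIELDS = FIELDS[3:10]
FLOAT_FIELDS = ["pct_identity", "evalue", "bit_score"]


def parse_tabular(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse tabular hit lines, skipping comments and malformed lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")  # ASSUMPTION: tab-delimited fields
        if len(values) != len(FIELDS):
            continue
        raw = dict(zip(FIELDS, values))
        hit: Dict[str, object] = {
            "query_id": raw["query_id"],
            "subject_id": raw["subject_id"],
        }
        hit.update({name: int(raw[name]) for name in INT_FIELDS})
        hit.update({name: float(raw[name]) for name in FLOAT_FIELDS})
        hits.append(hit)
    return hits


# First hit from the example output above, joined into a single line.
example = "\t".join([
    "gi|156356500|ref|XP_001623960.1|", "gnl|CDD|197660", "31.91", "47",
    "29", "2", "432", "475", "4", "50", "7e-04", "36.9",
])
hits = parse_tabular(["# RPSBLAST 2.2.26 [Sep-21-2011]", example])
```

The number after "gnl|CDD|" in the subject id is the PSSM ID of the 
matching domain model. 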

In addition to the different statistical parameters, the CD-Search web service 
filters out, by default, compositionally biased regions in the query sequence. 
In contrast, the standalone RPS-BLAST filters them out only if you specify 
that option in the command line. (For example, in the current 
RPS-BLAST version 2.2.23, you can do this by specifying "-F T", 
where "F" represents the "Filter" option and "T" indicates a status of "True.") 
If that option is not specified, the standalone RPS-BLAST may retrieve 
additional hits, which could be false positives.

Finally, some advanced options in standalone RPS-BLAST are not available 
in the web service, such as the ability to use a single-hit/two-pass mode 
in order to detect more distant homologous relationships. Users who select 
such options in the standalone version may get search results that differ 
from those of the web service.

===============================================================================

  Aron Marchler-Bauer, Renata Geer, 22 March 2013