
How to edit gene sets

A customized gene set can be used to filter your VCF file, for instance for variants that lie in the coding regions of genes that are of special interest to you. If the community has reached a consensus about the coding regions of a gene (for us this means a CCDS ID exists), then it suffices to type in only the GeneSymbol or the Entrez ID of this gene, and we will extract the coding regions for you. If there is no consensus about the coding regions, or if you would like to include regulatory regions for a gene, then you have to define the target regions by hand, with respect to the reference build hg19, in the following nomenclature: (chromosome number, start, end), e.g. (9,1000000,2000000).
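To make the region nomenclature concrete, here is a minimal Python sketch of how hand-defined regions such as (9,1000000,2000000) could be parsed and applied to VCF lines. It is only an illustration, not the code our filter uses. The function names `parse_regions` and `filter_vcf_lines` are hypothetical, and the sketch assumes that start and end are 1-based and inclusive (like the POS column of a VCF file) and that an optional "chr" prefix can be ignored.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)

# Matches "(9,1000000,2000000)"; whitespace and a "chr" prefix are tolerated.
REGION_RE = re.compile(
    r"\(\s*(?:chr)?([0-9]{1,2}|X|Y|MT?)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)",
    re.IGNORECASE,
)


def parse_regions(text: str) -> List[Region]:
    """Extract every (chromosome, start, end) triple found in the text."""
    regions = []
    for chrom, start, end in REGION_RE.findall(text):
        start_i, end_i = int(start), int(end)
        if start_i > end_i:
            raise ValueError(f"start > end in region ({chrom},{start},{end})")
        regions.append((chrom.upper(), start_i, end_i))
    return regions


def filter_vcf_lines(lines: Iterable[str], regions: List[Region]) -> Iterator[str]:
    """Yield header lines plus variant lines whose POS lies in a region.

    Assumption: region bounds are 1-based and inclusive.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.split("\t")
        chrom = fields[0]
        if chrom.lower().startswith("chr"):
            chrom = chrom[3:]
        chrom = chrom.upper()
        pos = int(fields[1])
        if any(c == chrom and s <= pos <= e for c, s, e in regions):
            yield line


if __name__ == "__main__":
    regions = parse_regions("(9,1000000,2000000)")
    vcf = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "9\t1500000\t.\tA\tG\t.\t.\t.",
        "9\t2500000\t.\tC\tT\t.\t.\t.",
    ]
    for kept in filter_vcf_lines(vcf, regions):
        print(kept)
```

The sketch only compares the POS of each variant against the region bounds, so a longer deletion that starts just before a region and extends into it would be missed. A real filter would have to decide how to treat such cases.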

Currently we are using the CCDS release of September 2011. If a gene is not in this release and you wonder why, the CCDS team would probably say something like this:

Thank you for your question concerning the CCDS database. As you noted, the reference sequences (RefSeq) on gene X (GeneID: XXXX, NM_00xxxx.x) and gene Y (GeneID: YYYY, NG_00yyyy.y) are not assigned CCDS IDs.

Please note that the CCDS database, which is a collaboration between the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), the European Bioinformatics Institute (EBI), and the University of California Santa Cruz (UCSC), specifically represents only coding sequences (CCDS = Consensus CoDing Sequence). Each CCDS ID represents a distinct coding sequence (CDS), where the members of the CCDS collaboration have annotated the same CDS at the same genomic location, and hence there is a consensus. In other words, the CCDS database specifically represents distinct coding sequences that multiple collaborators have annotated identically.

In the case of gene Y, this locus represents a non-transcribed pseudogene of PIGB. Since the CCDS database only includes protein-coding genes, this locus is out of scope for CCDS representation as are all non-coding genes and pseudogenes.

In the case of gene X, NM_00xxxx.x lacks a CCDS ID because there are inconsistencies among the collaborators in the placement of the exon 2-3 splice junction to the reference genome sequence. In order for a RefSeq to be CCDS-eligible the coding sequence of a particular transcript needs to be annotated identically by NCBI and the other CCDS collaborators, including every splice junction. For NM_00xxxx.x, exons 2 and 3 should have an AA-AT splice junction that uses the U12 splicing pathway. The current WTSI/EBI annotation (ENST00000164305) has the correct placement for this splice site but NCBI’s build 37.3 annotation is shifted 2 nucleotides resulting in an AA-CT splice site. As a result of this discrepancy NM_00xxxx.x did not acquire a CCDS ID.  We are working on correcting this placement in future builds and when the correction is made NM_00xxxx.x should acquire a CCDS ID.

Once again, thank you for your query. For more information on the CCDS Project, please see our Home page, or PMID:19498102 if you are interested in more comprehensive details. We welcome and appreciate user feedback, suggestions and error reports, which helps us to maintain the integrity of our database.
