Variants in noncoding regions are always difficult to interpret on a genomic level. In a recently published article in Human Mutation entitled
“Rare Noncoding Mutations Extend the Mutational Spectrum in the PGAP3 Subtype of HPMRS”
we present an approach for filtering pathogenic noncoding variants and explain how to interpret the implications of the variants at the protein level. We analysed the variants at the RNA level by applying targeted gene panel sequencing to a cDNA library and sequenced all transcripts of the GPI anchor synthesis pathway.
When gDNA variant data is analysed, the filter is set to identify variants in the coding regions that have implications at the protein level. Conversely, when filtering cDNA variant data, mutations in noncoding regions provide valuable information about their pathogenicity. Variants that are found in retained introns indicate that the transcript is not correctly spliced. By comparing the allelic balance between gDNA and cDNA data, the regulation of the expression of a transcript (miRNA-mediated, nonsense-mediated decay, or methylation) can be determined. In our example a rare 3′UTR variant led to the downregulation of a healthy allele, resulting in a predominant expression of the mutant allele.
By using GeneTalk on variant data from cDNA sequencing, pathogenic noncoding variants can be identified!
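To make the gDNA versus cDNA comparison more concrete, here is a minimal sketch of one way to test for allelic imbalance. It is not the method used in the paper or in GeneTalk. It assumes a heterozygous site, takes the alternative-allele fraction seen in gDNA as the expected proportion, and asks with an exact two-sided binomial test whether the cDNA read counts deviate from it. The function name and the example read counts are made up for illustration.

```python
from math import exp, lgamma, log


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are at most as likely
    as the observed one. Requires 0 < p < 1.
    """
    def pmf(i):
        # Binomial probability computed in log space to stay stable
        # at high read depths.
        log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_coeff + i * log(p) + (n - i) * log(1 - p))

    observed = pmf(k)
    tail = sum(pr for pr in map(pmf, range(n + 1)) if pr <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical read counts at one heterozygous 3'UTR site.
gdna_ref, gdna_alt = 48, 52   # balanced in genomic DNA
cdna_ref, cdna_alt = 10, 90   # mutant allele dominates in cDNA

expected_alt_fraction = gdna_alt / (gdna_ref + gdna_alt)
p_value = binom_two_sided_p(cdna_alt, cdna_ref + cdna_alt, expected_alt_fraction)
```

A very small `p_value` suggests that the two alleles are not expressed in the proportion seen in the genome. It does not tell you which regulatory mechanism is responsible.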
Many of you perform trio exome sequencing to detect de novo mutations (DNMs) in an affected individual. This is for a good reason, as we all know that the probability that a coding DNM is disease-causing is very high. However, sometimes there is more than one loss-of-function (LoF) DNM in an exome, or the gene in which it occurs is simply not related to any disease yet. We looked over the shoulders of experts who are facing such cases. They usually go through the ExAC cohort data and count how many LoF variants they can find in such a gene. Now there is a much more elegant way to do so. The ExAC consortium computed a score, called pLI, that indicates the probability that a gene is intolerant to a loss-of-function mutation. The statistical framework behind this score is explained in detail by Samocha et al. Basically, the departure of a certain mutational class from the expectation is quantified. The figure below shows the distribution of z-scores for synonymous (gray), missense (orange), and protein-truncating (PTV, red) mutations in about 18,000 genes. There is a considerable right shift in the distributions for missense variants and PTVs, indicating that more genes are intolerant to these classes of mutations. The proportion of genes that are very likely intolerant of loss-of-function variation (pLI ≥ 0.9) is highest for ClinGen haploinsufficient genes, and stratifies by the severity and age of onset of the haploinsufficient phenotype. Vice versa, if you encounter a missense or protein-truncating DNM in a gene with a pLI close to 1, the chances are high that this mutation is disease-causing. As we don’t want you to count all the LoF mutations of a gene in ExAC by hand anymore, we added the pLI to the gene info in GeneTalk. See an example for ZEB2 below.
When you ask two lawyers for their opinion you might get three different answers. In genetics it could get even worse: Corpas et al. analyzed four publicly available exomes of a family with different software tools.
Surprisingly, there was only little overlap in the sequence variants that were assessed as clinically relevant. We would like you to form your own opinion, so we made the exome data of this study available in the Demo account as well. Enjoy!
The allele balance is defined as the fraction of reads that support the alternative allele among all reads covering a site in a next-generation sequencing data set. In a typical VCF file this ratio can be computed from the information encoded in the AD, AO, and DP fields. For heterozygous genotypes, and especially for de novo calls, the allele balance is also often a valuable indicator of quality. Krumm et al. showed, for example, that almost all candidates could be validated by Sanger sequencing if the allelic balance was restricted to the interval [0.3, 0.7].
In GeneTalk you can now use the allelic balance to refine your call set to high-confidence candidates. Enjoy!
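If you want to reproduce this filter outside of GeneTalk, the calculation is short. The following is a minimal sketch, assuming a biallelic site and an AD value that has already been parsed into a pair of read counts (reference first, alternative second). The function names are our own choice for this illustration, and the default interval is the one quoted above.

```python
def allele_balance(ad):
    """Fraction of reads supporting the alternative allele.

    `ad` is a (ref_reads, alt_reads) pair as found in the AD field of a
    biallelic VCF record. Returns None if no reads cover the site.
    """
    ref_reads, alt_reads = ad
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total


def passes_ab_filter(ad, low=0.3, high=0.7):
    """True if the allele balance lies within [low, high]."""
    ab = allele_balance(ad)
    return ab is not None and low <= ab <= high


# A well-balanced heterozygous call versus a likely artifact.
balanced = passes_ab_filter((12, 8))   # allele balance 0.4
skewed = passes_ab_filter((18, 2))     # allele balance 0.1
```

Here `balanced` is kept and `skewed` is rejected, since 0.1 lies outside the interval.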
Yippie, our new paper about rare variant association studies, RVAS, just appeared in Bioinformatics! In this work we describe strategies to maximise the probability of detecting the disease-causing mutations in a cohort of patients. Obviously, the detection power depends on the size of your case group and the genetic variability of the true disease gene. We tested multiple Mendelian disorders and found that our approach outperformed the existing analysis strategies that are based on simple intersection filtering. In the figure below you can see three example studies from the literature in which the disease-causing gene can be readily identified: a cohort of 10 patients with Kabuki make-up syndrome, 7 patients with Catel-Manzke syndrome, and 13 patients with Mabry syndrome. A suitable matching technique for the controls helps to reduce spurious artifacts from heterogeneous data quality and population backgrounds.
Maybe you have also got some unsolved cases, so don’t hesitate to contact our tech support for more information about this approach.
We are looking forward to meeting you at our Booth #462!
You suspect a recessive mode of inheritance. What cutoff should you select for the frequency filter? Well, it depends: are you looking for a homozygous pathogenic mutation or for compound heterozygotes? This makes a big difference, and the Hardy-Weinberg principle might help you to decide. Let’s review the math: In a sufficiently large population there is a relationship between allele and genotype frequencies if certain conditions are met. If f(a) is the frequency of allele ‘a’, the frequency of the genotype ‘aa’ should be close to g(aa)=f(a)*f(a). Let’s have a look at the figure: There are 10 individuals, 1 of them shows genotype ‘aa’, 4 individuals are heterozygous, and the remaining 5 show the wildtype genotype ‘AA’. One of the heterozygous individuals inherited their allele ‘a’ from their mother and the other 3 from their father. Thus, the genotype frequency of ‘aa’ is g(aa)=1/10. For the allele frequency we have to count the total number of ‘a’s and divide it by the number of all copies of the gene, f(a)=6/20. In this example we could state that the allele and genotype frequencies are in equilibrium, as f(a)*f(a)=36/400 is close to g(aa)=1/10.
Now, what will happen if there is selective pressure on individuals with genotype ‘aa’? This is certainly the case for pathogenic alleles in recessive disease genes, and the homozygous individual in the example above is already fading. In this case the ‘aa’s are removed from the pool, but the effect on the allele frequency is not so overwhelming: it’s still at f(a)=4/18. However, the allele and genotype frequencies are no longer in Hardy-Weinberg equilibrium. Actually, Hardy-Weinberg disequilibrium is often a strong indication of pathogenicity, and the mere existence of homozygotes in a healthy control group is a strong argument for ruling out a candidate mutation.
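The numbers of this toy population are easy to check. The following sketch uses exact fractions and reproduces the allele frequency before and after the homozygous individual is removed. The helper name `allele_freq` is just our choice for this illustration.

```python
from fractions import Fraction


def allele_freq(n_AA, n_Aa, n_aa):
    """Frequency of allele 'a' given the three genotype counts."""
    copies_of_a = n_Aa + 2 * n_aa
    all_copies = 2 * (n_AA + n_Aa + n_aa)
    return Fraction(copies_of_a, all_copies)


# 5 wildtype, 4 heterozygous, 1 homozygous individual.
f_a = allele_freq(5, 4, 1)        # 6/20
expected_g_aa = f_a * f_a         # 36/400, the Hardy-Weinberg expectation
observed_g_aa = Fraction(1, 10)   # 1 of 10 individuals is 'aa'

# The homozygous individual is removed by selection.
f_a_after = allele_freq(5, 4, 0)  # 4/18
```

`Fraction` reduces the values automatically, so 6/20 is shown as 3/10, 36/400 as 9/100, and 4/18 as 2/9.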
Now let’s think about how you can use that information for your filtering strategy. Let’s assume the recessive disease you are trying to elucidate has an incidence of about 1 in 10,000 individuals. This means the carrier rate of the risk allele could be as high as 2%, or two in a hundred healthy individuals. However, if there are more than, say, 6 homozygotes in 60,000 controls, you should wonder whether this is really the disease-causing mutation.
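Here is the mental arithmetic behind these numbers as a short sketch. It assumes Hardy-Weinberg proportions and, as an upper bound, that a single pathogenic allele accounts for all cases of the disease. The function name is hypothetical.

```python
import math


def recessive_expectations(incidence, n_controls):
    """Allele frequency, carrier rate and expected number of homozygotes.

    Assumes Hardy-Weinberg proportions and one pathogenic allele that
    explains the full disease incidence (an upper bound).
    """
    q = math.sqrt(incidence)             # g(aa) = f(a) * f(a)
    carrier_rate = 2 * q * (1 - q)       # heterozygous genotype frequency
    expected_homozygotes = incidence * n_controls
    return q, carrier_rate, expected_homozygotes


q, carrier_rate, expected_hom = recessive_expectations(1 / 10_000, 60_000)
```

With an incidence of 1 in 10,000, the allele frequency is about 1%, the carrier rate just under 2%, and about 6 homozygotes would be expected among 60,000 individuals if the genotype occurred at the disease incidence. In a healthy control group you would expect fewer, so seeing more than that is a reason for doubt.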
When designing the new frequency filter that works on genotype frequencies of several thousand healthy controls, we could almost hear your complaints: “Come on, do you really expect me to do this mental arithmetic every time I am analyzing a case?” That’s why we tried to be smart about the ‘aa’s: Once you set the frequency cutoff for the heterozygous genotype frequency, we will automatically set the homozygous genotype frequency to a reasonably lower value and only leave the fine adjustment to you. Enjoy!
The number of potentially disease-causing variants in an affected individual can be reduced effectively if further samples from the family are available for the analysis. As you already know, GeneTalk provides highly effective filters for different modes of inheritance that work on multi-sample VCF files.
However, these filters will only yield the correct results if the relationships between the samples have been defined properly. If the labels of samples have been mixed up by mistake or the DNA of another individual has been sequenced, your diagnostic workup will fail.
Luckily, with several thousand genotypes in a multi-sample VCF file it is possible to reconstruct the relationships between samples. We implemented a new feature in GeneTalk and are proud to present the pedigree predictor: In the pedigree editor there is a new “guess” button that will estimate the relationships of your samples. The predicted family structure is displayed in a pedigree. If the predicted relationships do not agree with your expectations, this might point to a sample mix-up. So don’t get tricked!
If you experience any predictions that you don’t trust, please let us know (verena.heinrich (at) charite.de)!
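To give an intuition for why genotypes reveal relationships, here is a small sketch of one classic signal. It is not the pedigree predictor itself, only an illustration. A parent and a child share at least one allele at every site, so apart from genotyping errors they should never be homozygous for opposite alleles. The function name and the toy genotypes (coded as the number of alternative alleles: 0, 1, or 2) are made up.

```python
def ibs0_fraction(sample_a, sample_b):
    """Fraction of sites where two samples share no allele.

    Genotypes are alternative-allele counts (0, 1, 2); None marks a
    missing call and is skipped. Returns None if no site is comparable.
    """
    compared = 0
    opposite = 0
    for ga, gb in zip(sample_a, sample_b):
        if ga is None or gb is None:
            continue
        compared += 1
        if abs(ga - gb) == 2:  # 0 versus 2: opposite homozygotes
            opposite += 1
    return opposite / compared if compared else None


# Toy data: five sites genotyped in three samples.
child = [0, 1, 2, 1, 0]
mother = [1, 1, 1, 2, 0]
stranger = [2, 1, 0, 1, 2]

child_vs_mother = ibs0_fraction(child, mother)      # no opposite homozygotes
child_vs_stranger = ibs0_fraction(child, stranger)  # three of five sites
```

In real data you would use thousands of sites and allow a small error rate, but a clearly nonzero fraction between a labelled parent and child is a warning sign for a sample mix-up.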
Gene panels comprising hundreds of genes are getting more and more popular for analyzing patients with rare inherited disorders. Currently the enrichment or amplification kits Agilent SureSelect, Illumina TruSight, and Ion Torrent AmpliSeq seem to be the most widely used.
We are working on a really great new feature that will make it almost fun to analyze variants from such gene panels. The new prioritizing feature will be platform agnostic and we will support all kinds of gene panels. However, we would first like to run a short survey about the approaches of the GeneTalk community.
We would highly appreciate it if you could participate in an online poll. It won’t take more than 30 seconds. Imagine a patient walks into your clinic and you suspect a monogenic disorder, but you are not exactly sure which gene to analyze first. What is your first choice for the diagnostic workup? In the following poll there are three preset answers that include gene panels for this use case:
IonTorrent AmpliSeq Inherited Disease (328 genes)
Illumina TruSight Inherited Disease (552 genes)
Agilent SureSelect Inherited Disease (2932 genes)
The performance of the Agilent panel was analyzed by Robinson et al. and showed a diagnostic yield of more than 30%.
Please also indicate whether you are using another gene panel for solving the case by selecting “others”. This could be, for example, the TruSight One panel, an exome, or anything else.
Enough explanations, now let’s start the voting!
The Watchlist – a customer-pull request
I just read “Running Lean” by Ash Maurya and learned many new words! Now I can refer to some of the stuff we recently released with the appropriate terms! The research variant database was clearly a customer-pull request. We heard from many different users about the dilemma of finding a second patient. So we moved this feature quickly from backlog to development and released it around Christmas.
Over the recent weeks we got feedback about the first user experiences with the new Watchlist and created a new whitepaper for the research variant database that also tries to address some frequently asked questions. Please let us know what isn’t covered yet!