Quality and Coverage Filters

Many of you asked us for a quality filter, so here it is! Compared to all the other filters it wasn’t actually that hard to implement, so you might wonder what took us so long. Well, the problem starts right away with the term “quality”. What is actually meant by quality? My best explanation would be: the creators of the VCF format wanted to include something like an error rate or p-value for the trustworthiness of a variant call. However, when they realized that this is not that easy, they included something in that direction, but definitely no proper probability, and called it the quality column.

Some people say the value in the quality column is something like a phred score. That is, minus ten times the base-10 logarithm of the probability that the variant call is wrong. A quality score of 30 would then mean a 1:1000 chance that the variant call is a false positive. Sounds good, so what don’t I like about it then? Well, this requires a probability model that is reasonable, and this is simply not the case for most variant callers. An easy example: most probability models assume diploid organisms with either heterozygous or homozygous genotypes. Thus the quality value is not applicable if you are interested in somatic mutations or mosaics. Another scenario where the quality value is usually meaningless is any value above 100. Most probability models in the variant callers ignore the fact that the DNA fragments are amplified before they are sequenced. Consider for example a position in the genome for which DNA was extracted from around 50 cells. If the genotype at this position is heterozygous, we would have 50 “ref” alleles and 50 “alt” alleles. However, if we sequence with a sequencing depth of around 200, the quality value would suggest that this call is much more trustworthy than one with a sequencing depth of only 100. But with only 100 original allele copies, many of the 200 reads must be amplified copies of the same molecules, so they are not independent observations and a binomial model for the distribution of the sequence fragments simply doesn’t apply anymore.
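To make the phred scale a bit more tangible, here is a minimal sketch of the conversion in both directions. The function names are made up for this example, and the result is only a real probability if the variant caller's model is reasonable, which, as said above, is often not the case.

```python
import math


def phred_to_error_probability(qual: float) -> float:
    """Convert a phred-scaled quality into the implied error probability."""
    return 10 ** (-qual / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a phred-scaled quality."""
    return -10 * math.log10(p)


# A quality of 30 corresponds to an implied error probability of about 1 in 1000.
print(phred_to_error_probability(30))
```

Read this way, a quality of 100 would claim an error chance of 1 in 10 billion, which is a good reminder not to take very high values literally.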

In these cases it would be safer to have a look at the coverage instead of the quality value. Here, most variant callers provide information in the DP, AD, or DP4 field. The DP field tells you how many sequence reads cover a certain position; the AD field lists the number of sequence reads with the reference allele and with the alternative allele. The field that I like best is DP4. Here the numbers of reads with the reference or the alternative allele that have been aligned forward or reverse are listed separately. This allows you to see whether there is, e.g., an artifact from a sequencing error: DP4:0,0,10,0 looks suspicious, because all ten reads with the alternative allele come from the same strand, whereas 0,0,5,5 looks very promising. As you know, all that information is shown when you move the mouse over the variant in the VCFviewer, if it was annotated in your VCF file.
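If you want to check the DP4 pattern outside of the viewer, something along these lines could work. This is only a sketch: it assumes DP4 is stored in the INFO column in the order ref-forward, ref-reverse, alt-forward, alt-reverse (as samtools/bcftools write it), and the helper names are invented for this example.

```python
from typing import Optional, Tuple


def parse_dp4(info: str) -> Optional[Tuple[int, int, int, int]]:
    """Extract the four DP4 counts from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP4":
            parts = value.split(",")
            if len(parts) == 4:
                # Raises ValueError if a count is not an integer.
                return tuple(int(x) for x in parts)
    return None


def alt_on_single_strand(dp4: Tuple[int, int, int, int]) -> bool:
    """True if alt reads exist but all of them come from one strand only."""
    _, _, alt_fwd, alt_rev = dp4
    return (alt_fwd + alt_rev) > 0 and (alt_fwd == 0 or alt_rev == 0)


dp4 = parse_dp4("DP=10;DP4=0,0,10,0")
print(dp4, alt_on_single_strand(dp4))
```

A call flagged this way is not necessarily wrong, but it deserves a second look before you trust it.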

So to summarize, the quality filter and the coverage filter require a healthy skepticism on the part of the user and some knowledge about how these values were created.

But to get started there are some rules of thumb: if you are looking for homozygous variants in your data, a minimum coverage of 5 should be a good trade-off between noise and a true signal; for heterozygous variants you should rather set the minimum to 10. The quality values depend highly on the probability model, as stated, but a minimum value of 30 is a good starting point.
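Written down as code, these rules of thumb could look like the sketch below. It is not how the VCFviewer filters are implemented; the function name and parameters are hypothetical, and the defaults are just the starting values suggested above.

```python
def passes_rule_of_thumb(
    qual: float,
    depth: int,
    homozygous: bool,
    min_qual: float = 30.0,
    min_depth_hom: int = 5,
    min_depth_het: int = 10,
) -> bool:
    """Apply the starting thresholds: quality >= 30, depth >= 5 (hom) or >= 10 (het)."""
    min_depth = min_depth_hom if homozygous else min_depth_het
    return qual >= min_qual and depth >= min_depth


# A heterozygous call with quality 45 but only 7 covering reads is filtered out.
print(passes_rule_of_thumb(qual=45, depth=7, homozygous=False))
```

The thresholds are parameters on purpose, since the right values depend on your variant caller and your data.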

Please let us know about your experience and discuss it with the community in this blog entry!