BCFTools liftover - replace NCBI remap which is retired #817
Here are the numbers: Shariant prod - NCBI linked 119 of 40620 alleles (~0.29%)
Saving the 0.5% that fail ClinGen Allele Registry is worth it, especially as that has a 2k size limit and we're going to be using CNVs. So, the first thing is to evaluate the options. There is the UCSC one and also: SNPLift: Fast and accurate conversion of genetic variant coordinates across genome assemblies. Then we need to make NCBI liftover only work historically, and switch in the replacement.
To dump out our variants that used NCBI (for testing), use the following:
@YasirKusay if you could:
Within a few days I will provide you with CSVs of NCBI liftover results (see comment above)
It was (relatively) OK to set up SNPLift.
Setup
Using it
Looks like the service is now gone; we should change settings in all environments:
I emailed you a list of NCBI liftovers from Shariant and SA Path test prod (vg3upgrade). You should convert these into VCF files. The coordinates are in the first genome build (ie 37 in "GRCh37_to_GRCh38"), so you need to make 4 files (2x GRCh37, 2x GRCh38).

The VCF has 2 mandatory header lines, starting with a single "#": the fileformat line and the #CHROM POS etc. line that shows the column names - see https://davetang.github.io/learning_vcf_file/img/vcf_format.png - heaps of info online. So, make a GRCh37 VCF file from the CSV file columns (chrom/start/ref/alt), and add an "INFO" field (an example is sketched at the end of this comment).

VCF files need to be sorted in chrom/pos order - you can probably do this in Pandas pretty easily as you read the CSV. Then put the GRCh38 g.HGVS into this INFO field for each record of the VCF file.

Then run the VCF file through, converting from GRCh37 to GRCh38. It should come out the other side with GRCh38 coordinates (chrom/pos/ref/alt) but keep the INFO fields it was given.

Then, write a program that goes through the VCF files and compares each VCF's chrom/pos/ref/alt with the chrom/pos/ref/alt in the g.HGVS from the INFO field. Remember you may need to adjust by 1 to handle VCF/HGVS conversion.

Get some stats, eg what percentage match, how many SNPLift can't do, etc. (I should go back and find out what % NCBI couldn't do.) Also grab a handful of examples where they are different or SNPLift fails, and see if you can spot any patterns. We can then dig into them and see which one was right (could have been NCBI?).
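A minimal pandas sketch of that conversion (the column names chrom/start/ref/alt/grch38_hgvs and the INFO key GRCH38_HGVS are my assumptions, not the real file headers):

```python
# Sketch: CSV of NCBI liftover results -> sorted GRCh37 VCF with the
# GRCh38 g.HGVS carried along in an INFO field. Column names are assumed.
import pandas as pd

df = pd.read_csv("shariant_ncbi_liftover_from_GRCh37_to_GRCh38.csv")
# A lexical chrom sort is good enough for a quick test; a real VCF
# should follow reference contig order.
df = df.sort_values(["chrom", "start"])

with open("grch37_input.vcf", "w") as f:
    f.write("##fileformat=VCFv4.2\n")
    f.write('##INFO=<ID=GRCH38_HGVS,Number=1,Type=String,'
            'Description="g.HGVS produced by NCBI liftover">\n')
    f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    for row in df.itertuples():
        f.write(f"{row.chrom}\t{row.start}\t.\t{row.ref}\t{row.alt}"
                f"\t.\t.\tGRCH38_HGVS={row.grch38_hgvs}\n")
```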
So yeah, to compare the SNPLift results with NCBI, you need to compare the newly produced lifted-over VCF coords with the g.HGVS from the INFO field (this is what NCBI liftover made). You can produce a g.HGVS from the VCF, then compare that - you can use this library if you want to get used to it: https://github.com/biocommons/hgvs - or annotate the VCF with VEP (online or offline) to produce g.HGVS. HGVS can be a pain to set up, and you can't use their remote UTA on the SA Path network, so maybe try cdot - https://github.com/SACGF/cdot. Or maybe just pull it apart with a regex and add/subtract 1 - you can probably skip anything more complicated than SNVs (ie A>C).
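One possible shape for the regex approach, SNVs only (assumes the hypothetical GRCH38_HGVS INFO key from the conversion sketch above, and that both sides use NC_ contig accessions):

```python
# Compare lifted-over VCF chrom/pos/ref/alt against the g.HGVS that NCBI
# produced (stashed in INFO). SNVs only; anything else is skipped.
import re

# e.g. NC_000001.11:g.12345A>C
HGVS_SNV_RE = re.compile(r"(NC_\d+\.\d+):g\.(\d+)([ACGT])>([ACGT])")

matches = mismatches = skipped = 0
with open("lifted_grch38.vcf") as f:
    for line in f:
        if line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
        m = re.search(r"GRCH38_HGVS=([^;]+)", info)
        hgvs = HGVS_SNV_RE.match(m.group(1)) if m else None
        if hgvs is None:
            skipped += 1  # missing INFO or non-SNV HGVS
            continue
        # For SNVs, VCF POS and the HGVS g. position coincide (both
        # 1-based, no anchor base), so no +/-1 adjustment is needed.
        if (chrom, pos, ref, alt) == hgvs.groups():
            matches += 1
        else:
            mismatches += 1

total = matches + mismatches
print(f"match: {matches}/{total} ({100 * matches / max(total, 1):.1f}%), skipped: {skipped}")
```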
I am up to the "comparison" phase but would just like to make a few notes on some observations/challenges from using SNPLift.
I have been trying to do some comparisons now but have encountered a few roadblocks:
Since I am mostly interested in comparing the VCF outputs, I have rewritten the code in SNPLift to include every item regardless of score (and to delete duplicate entries). However, I feel this is still something we should talk about next time we meet.
I did encounter an additional issue when working with SNPLift: it will translate some chromosomes (NC_0000XX) to something else (NW/NT contigs).
I will just comment the results here:
The skipped ones are because they were translated to a non-NC sequence. Overall it had quite a high error rate.
Requiring contig names to match exactly is normal - I think of it as a safety feature to make sure you have the right genome. In VG we use contig names internally; I just exported them as chromosome names as I probably wasn't thinking about it. But when we export we'll do it using contig names, so that's fine. I wrote up some docs on this here: https://github.com/SACGF/variantgrid/wiki/VCF-chromosome---contig-names - you should be able to convert them back via the instructions on that page.
Can you just force it to use 1 file / 1 cpu?
Maybe you could raise an issue on their GitHub and suggest a command-line flag to disable this, or make a patch that handles files with <20 entries? To me it should be looking within a certain region, ie at genomic distance, rather than at VCF file entries, as it could be eg chr20 next to chrX - the variants on one contig have nothing to do with the ones on other contigs where mapping is concerned.
We generally ignore these other contigs as they drastically increase the complexity of things - just regard this as a failure I think.
If you regard the translation from NC to something else as a failure, then this program has an extremely high error rate. In the second file alone, over 50% of entries were affected.
Maybe you could try stripping the FASTA files of all of the patch/alt contigs, then re-doing the tests? Can probably make a Python script that does something like the sketch below (thanks ChatGPT):
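A minimal sketch, assuming accession-style FASTA headers like ">NC_000001.11 ...":

```python
# Keep only primary assembly contigs (NC_*) from a FASTA, dropping
# patch/alt scaffolds (NT_/NW_ etc.). Filenames are placeholders.
keep = False
with open("GRCh38.fa") as fin, open("GRCh38.primary.fa", "w") as fout:
    for line in fin:
        if line.startswith(">"):
            keep = line[1:].startswith("NC_")
        if keep:
            fout.write(line)
```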
That would force it to only use those as a destination - hopefully it only drops the score a tiny bit.
That's actually not a bad idea. I will do that. Hopefully the score goes up this time.
As the software is at the pre-print stage, I think they would be pretty keen for feedback - the preprint's corresponding author is given as davoud.torkamaneh.1@ulaval.ca. But yeah, raising an issue and then making a pull request for command-line options etc. would be great.
Just got the updated results here:
Overall it had quite a high error rate.
It's hard to judge whether it has a high error rate, as this data set is from when ClinGen Allele Registry failed - so presumably it's tricky stuff. If we used this set to validate ClinGen Allele Registry, it would get a 100% failure rate even though it's sub-1% usually.

Maybe it would be worth creating a good benchmarking set as well as this difficult one? Perhaps take a 1k subset of ClinVar where the ClinGen allele ID is available in both 37/38 VCF files - that way we have a way of checking whether it was right.

There are some other liftover tools it might be worth evaluating - see this Biostars post, ie:
For the benchmark set - I have created a subsample of VCFs before using https://github.com/alexpreynolds/sample. Something like this (untested code, copy/pasted lines from my bash history; a Python equivalent is sketched below):
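As a stand-in for the exact `sample` invocation, a reservoir-sampling sketch in Python (filenames and the 1000-record size are placeholders):

```python
# Draw 1000 random records from a VCF while keeping all header lines.
# Note the output is not in original order - re-sort if needed.
import random

K = 1000
header, reservoir, n = [], [], 0
with open("clinvar_grch37.vcf") as f:
    for line in f:
        if line.startswith("#"):
            header.append(line)
            continue
        n += 1
        if len(reservoir) < K:
            reservoir.append(line)
        else:
            j = random.randrange(n)  # standard reservoir sampling (Algorithm R)
            if j < K:
                reservoir[j] = line

with open("clinvar_grch37.1k.vcf", "w") as out:
    out.writelines(header + reservoir)
```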
Should I also try a database aside from UTA?
You can use cdot if you want, but if UTA is set up that should work, as you're only using UTA for HGVS conversion of genomic variants. ClinVar should have everything in VCF. So to get the liftover truth-set subsample: read the ClinVar allele IDs into a set, then go through the other build's VCF and subset out the ones whose allele IDs match the ones in your set (see the sketch below).
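A sketch of that subsetting, assuming standard ClinVar VCFs (whose INFO lines carry an ALLELEID key; filenames are placeholders):

```python
# Build a 37/38 truth set: take the ALLELEIDs from the (already
# subsampled) GRCh37 file, then pull the matching records out of the
# GRCh38 ClinVar VCF.
import re

def allele_id(info):
    m = re.search(r"(?:^|;)ALLELEID=(\d+)", info)
    return m.group(1) if m else None

wanted = set()
with open("clinvar_grch37.1k.vcf") as f:
    for line in f:
        if not line.startswith("#"):
            aid = allele_id(line.rstrip("\n").split("\t")[7])
            if aid:
                wanted.add(aid)

with open("clinvar_grch38.vcf") as fin, open("clinvar_grch38.1k.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#") or allele_id(line.rstrip("\n").split("\t")[7]) in wanted:
            fout.write(line)
```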
Just gonna push out a few new stats. I extracted 1000 variants at random from ClinVar for the GRCh37 genome build. I also extracted the variants with the same IDs from GRCh38. Here are the tests for SNPLift, CrossMap and Picard for this new example:
I will soon also test your examples on CrossMap and Picard. That should give me a good indication of which might be the best to use.
(IN PROCESS OF WRITING THIS)
In my opinion, CrossMap is easier to use; Picard had quite stringent format requirements that made using it quite time-consuming/difficult.
shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
I would say that it's tied between CrossMap and Picard. However, Picard is more difficult to use. I have encountered the following issues when using it:

If a variant entry contains a '=' anywhere (e.g. in "ALT", "INFO", etc.), it will fail. This is because it thinks there is a key (from INFO) being assigned a value. Also, I think it expects that if you have a value inside INFO, it must have an associated key (and that key must be defined in the header). For testing purposes, I removed the lines that contain a '=' from the VCF input. I also believe (from some observations) that even minor issues cause the entire program to fail. In my opinion, it is quite frustrating to use, and it is often difficult to diagnose an issue when one arises.

Also, here is another major issue I encountered when using Picard and CrossMap. They require the chain file hg38ToHg19.over.chain.gz (or hg19ToHg38.over.chain.gz) to work. These are available from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ and http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/ respectively. The issue with these files is that they contain plain chromosome names only (not contig accessions), and they might be difficult to translate due to the structure of the file. To get around this temporarily, I modified the input VCFs and input genome files to use chromosome names, translating the VCF chromosome names back after performing the liftover (a sketch of this renaming follows below). This is not an issue with SNPLift; however, instead of using flags, it uses a shell file where you assign the input/output names etc. (these get exported to the environment). I guess we can just skip this step by assigning our own environment variables (or writing a command-line flag interface ourselves).

I recommend that we investigate CrossMap more.
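A sketch of that chromosome-name workaround (the mapping table here is truncated; in practice you'd build the full accession-to-name table, eg from the assembly report or the VariantGrid wiki page linked above; ##contig header lines would also need renaming in a real run):

```python
# Map RefSeq NC_ accessions to the UCSC-style names used in the chain
# files before liftover, then map back afterwards.
NC_TO_UCSC = {
    "NC_000001.11": "chr1",
    "NC_000002.12": "chr2",
    # ... fill in the remaining contigs in use ...
    "NC_000023.11": "chrX",
}
UCSC_TO_NC = {v: k for k, v in NC_TO_UCSC.items()}

def rename_vcf_contigs(in_path, out_path, mapping):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            fields = line.split("\t", 1)
            fields[0] = mapping.get(fields[0], fields[0])
            fout.write("\t".join(fields))

# Usage: rename before liftover, rename back after.
rename_vcf_contigs("in.GRCh37.vcf", "in.GRCh37.chr.vcf", NC_TO_UCSC)
# ... run CrossMap/Picard on in.GRCh37.chr.vcf ...
rename_vcf_contigs("out.GRCh38.chr.vcf", "out.GRCh38.vcf", UCSC_TO_NC)
```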
@davmlaw @YasirKusay thanks for looking into this. @davmlaw As we're now in November, I just wanted to confirm that we won't see any issues on the Shariant end when NCBI does retire? Particularly thinking about alleles that are unable to be lifted over using ClinGen Allele Registry and therefore try to go down the NCBI liftover route?
We have worked out what tool to use, but need to integrate it, do Shariant testing, then deploy. We have disabled even attempting NCBI liftover, so we won't get errors. But we will probably get 0.5% failing to lift over due to ClinGen Allele Registry failure; they will sit there with liftover failure until the deploy with the new tool comes through.
Is there an ETA on this? I've been looking at the imported alleles and saw that there have been two new alleles missing a genome build (against NM transcripts) since NCBI remap was retired. That compares to a total of three alleles missing a genome build (on NM transcripts) across the whole of Shariant. It may be that NCBI liftover was saving more alleles than we anticipated.
I hope to get this done within 2-3 weeks
bcftools liftover...
Have documented installing this: https://github.com/SACGF/variantgrid/wiki/Install-bcftools-liftover. Running VG test: /data/annotation/liftover
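For reference, roughly how the plugin gets invoked (the -s/-f/-c flags - source FASTA, target FASTA, chain file - are from memory of the plugin docs; verify against `bcftools +liftover -h` before relying on this):

```python
# Hedged sketch of a GRCh37 -> GRCh38 liftover via the bcftools plugin.
# Filenames are placeholders.
import subprocess

subprocess.run(
    [
        "bcftools", "+liftover", "-Oz", "-o", "out.GRCh38.vcf.gz",
        "in.GRCh37.vcf.gz", "--",
        "-s", "GRCh37.fa",                  # source build FASTA
        "-f", "GRCh38.fa",                  # target build FASTA
        "-c", "hg19ToHg38.over.chain.gz",   # UCSC chain file
    ],
    check=True,
)
```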
Another data point - after importing SA Path variant tags, the number that failed ClinGen was 1k out of 179k = 0.55%.
Initial testing failed - couldn't click "create variant" to make it. This was because that only worked with ClinGen. Fixed.

Testing on VG test, created a super long variant: NC_000008.11:g.117822562_117852562dup. This has an SVLEN of 30001, and ClinGen fails with ClinGenError: VariationTooLong. But I can now click create variant/liftover. Checking the annotated c.HGVS, it comes out the same: GRCh37 = NM_000127.3(EXT1):c.963-15361_1320dup
@davmlaw is there an equivalent setting to LIFTOVER_NCBI_REMAP_ENABLED for this that can be turned off until full testing is complete?
It's off by default, need to enable via:
Cheers, currently it's enabled for Shariant Test but not Prod. I'm happy to leave it that way for this deploy - and then do the full testing with a goal to enable it in prod at the next deploy.
@davmlaw If a variant is lifted over via BCFTools, it doesn't populate g.HGVS. See https://test.shariant.org.au/variantopedia/view_allele/28939
g.HGVS not being populated was due to VEP failing to annotate any new variants, not just BCFTools ones.
Note from issue https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid/1014

There are 2 separate things here:

VCF normalization - to make sure variants in inconsistent input representations end up stored in 1 consistent format in the database (left-aligned, smallest representation etc.) - prod currently uses "VT" for this. BCFTools is an extremely popular package (samtools is universally used in next-gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc. and fixed bugs when I raised them. So, we are switching to BCFTools for normalization - there is no turning this on/off; it's the only one we will use going forward. It should behave the same as VT, though (except for working correctly for structural variant normalization - ie this issue). A sketch of the normalization call is below.

For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings; the plan is to enable it in prod one deploy after the next release.
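For illustration, the normalization step with plain bcftools (this is the standard `bcftools norm` interface, not necessarily how VariantGrid wires it in; filenames are placeholders):

```python
# Left-align and trim variants to their smallest representation against
# the reference - the normalization role VT used to fill.
import subprocess

subprocess.run(
    [
        "bcftools", "norm",
        "--fasta-ref", "GRCh38.fa",     # reference used for left-alignment
        "-Oz", "-o", "normalized.vcf.gz",
        "input.vcf.gz",
    ],
    check=True,
)
```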
VG testing complete, Shariant testing spun off into new issue
After vg.com deploy found some issues:
OK, fixed that with update_or_create. Now I get an error about the reference not matching. We'll have to filter those out and fail them before writing to the VCF.

We should probably think about how we want to store AlleleLiftovers that fail validation and never actually attempt a liftover run. At the moment, because we need to store the destination, for ClinGen we shove them all into a fake run that always fails, and we don't have a way of doing that for other liftover types. Maybe we should return objects for the liftover run, but with errors, so they are created and not eg dumped into a VCF etc., just created with an initial failure? Then phase out that initial always-fail run.
OK, have rewritten the code to yield multiple results, including errors, so we now always record what happens. Deployed and tested on vg.com.
https://ncbiinsights.ncbi.nlm.nih.gov/2023/05/04/remap-tool-to-retire/