BCFTools liftover - replace NCBI remap which is retired #817

Closed · davmlaw opened this issue May 5, 2023 · 42 comments

davmlaw commented May 5, 2023

https://ncbiinsights.ncbi.nlm.nih.gov/2023/05/04/remap-tool-to-retire/

davmlaw commented Jun 15, 2023

Here are the numbers:

Shariant prod - NCBI linked 119 of 40620 alleles (0.292959133431807 %)
SA Path - NCBI linked 743 of 181911 alleles (0.408441490619039 %)

from snpdb.models import VariantAllele, AlleleConversionTool

# note field is 'conversion_tool' in older code
va_ncbi = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP)
va = VariantAllele.objects.all()
perc = va_ncbi.count() * 100 / va.count()
print(f"NCBI linked {va_ncbi.count()} of {va.count()} alleles ({perc} %)")

davmlaw commented Jun 15, 2023

Saving the ~0.5% that fail ClinGen Allele Registry is worth it, especially as that has a 2k size limit and we're going to be using CNVs

So, the first thing is to evaluate the options. There is the UCSC one, and also:

SNPLift: Fast and accurate conversion of genetic variant coordinates across genome assemblies

Then we need to make NCBI liftover work historically only, and switch in the replacement.

davmlaw removed the meeting label Jun 19, 2023
davmlaw added this to the SA Path VG 4 milestone Jun 19, 2023
davmlaw commented Aug 22, 2023

To dump out our variants that used NCBI (for testing), use the following:

import pandas as pd

from socket import gethostname
from snpdb.models import GenomeBuild, VariantAllele, AlleleConversionTool
from genes.hgvs import HGVSMatcher

def get_ncbi_from_build(src_genome_build, dest_genome_build):
    src_matcher = HGVSMatcher(src_genome_build)
    dest_matcher = HGVSMatcher(dest_genome_build)
    src_ghgvs = f"{src_genome_build.name}_ghgvs"
    dest_ghgvs = f"{dest_genome_build.name}_ghgvs"

    liftover_from_build = []
    # note: the field is 'conversion_tool' in older code
    va_ncbi_qs = VariantAllele.objects.filter(allele_linking_tool=AlleleConversionTool.NCBI_REMAP,
                                              genome_build=dest_genome_build)
    for va in va_ncbi_qs:
        v_src = va.allele.variant_for_build(src_genome_build)
        v_dest = va.allele.variant_for_build(dest_genome_build)
        # If this fails, need to use ._asdict().copy()
        data = v_src.coordinate.__dict__.copy()
        data["allele"] = va.allele_id
        data[src_ghgvs] = src_matcher.variant_to_g_hgvs(v_src)
        data[dest_ghgvs] = dest_matcher.variant_to_g_hgvs(v_dest)
        liftover_from_build.append(data)
    return liftover_from_build

def write_csv(src_genome_build, dest_genome_build):
    liftover_from_build = get_ncbi_from_build(src_genome_build, dest_genome_build)
    df = pd.DataFrame.from_records(liftover_from_build)
    filename = f"/tmp/{gethostname()}_ncbi_liftover_from_{src_genome_build}_to_{dest_genome_build}.csv"
    df.to_csv(filename)
    print(f"Wrote '{filename}'")

write_csv(GenomeBuild.grch37(), GenomeBuild.grch38())
write_csv(GenomeBuild.grch38(), GenomeBuild.grch37())

davmlaw commented Oct 27, 2023

@YasirKusay if you could:

  • Set up SNPLift
  • Document any tricky bits etc
  • Run it on some test VCFs, and see eg what % of them work, have errors etc

Within a few days I will provide you with CSVs of NCBI liftover results (see comment above)

  • Run SNPLift on the same variants - then we can compare results and see what concordance it has with NCBI liftover and ClinGen Allele Registry

YasirKusay commented Oct 30, 2023

It was (relatively) OK to set up SNPLift.

Setup

  • No requirements.txt file, so you need to install everything manually.
  • Versions are not listed (samtools needs to be newer than 1.10 - annoyingly, I had to reinstall samtools).
  • Also needs GNU Parallel
  • Also needs scipy
  • In addition, needs the R packages data.table and scales (not listed in the installation instructions)

Using it

  • Inside 02_infos/snplift_config.sh, add all necessary config info, including old/new genomes and inputs/outputs (and any extra settings to use e.g. for alignment).
  • To verify installation: ./01_scripts/util/run_test.sh
  • Run ./snplift 02_infos/snplift_config.sh

davmlaw commented Nov 1, 2023

Looks like the service is now gone, so we should change settings in all environments:

  • SA Path prod / test / train - need to wait till out of hours to restart prod
  • vg.com / vg test - settings.LIFTOVER_NCBI_REMAP_ENABLED has always been False
  • shariant / test
  • runx1 - settings.LIFTOVER_NCBI_REMAP_ENABLED has always been False

davmlaw commented Nov 3, 2023

I emailed you a list of NCBI liftovers from Shariant and SA Path test prod (vg3upgrade)

So, you should convert these into VCF files. The coordinates are in the first genome build (ie 37 in "GRCh37_to_GRCh38") so you need to make 4 files (2x GRCh37, 2x GRCh38)

The VCF has 2 mandatory header lines - the ##fileformat line and the #CHROM/POS etc line that shows column names - see https://davetang.github.io/learning_vcf_file/img/vcf_format.png - heaps of info online

So, make a GRCh37 VCF file from the CSV file columns (chrom/start/ref/alt), and add an "INFO" field eg:

##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,Description="NCBI g.HGVS to compare with snplift results">

VCF files need to be sorted in chrom/pos order - you can probably do this in Pandas pretty easily as you read the CSV

Then put the GRCh38 g.HGVS into this INFO field for each record of the VCF file

Then run the VCF file through SNPLift, converting from GRCh37 to GRCh38. It should come out the other side with GRCh38 coordinates (chrom/pos/ref/alt) but keep the INFO fields it was given.

Then, write a program that goes through the VCF files and compares each VCF chrom/pos/ref/alt with the chrom/pos/ref/alt in the g.HGVS from the INFO field. Remember you may need to adjust by 1 to handle VCF/HGVS conversion

Get some stats, eg what percentage match, how many SNPLift can't do, etc (I should go back and find out what % NCBI couldn't do)

Also grab a handful of examples where they are different or SNPLift fails, and see if you can spot any patterns. We can then dig into them and see which one was right (could have been NCBI?)
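A minimal sketch of that CSV-to-VCF step - the column names (chrom/position/ref/alt plus the GRCh38 g.HGVS column from the dump script above) are assumptions, so adjust them to the real CSV headers; the plain lexicographic chromosome sort is also a simplification:

import pandas as pd

# Assumed column names - check against the dump script's actual CSV output
CHROM, POS, REF, ALT = "chrom", "position", "ref", "alt"
HGVS_COL = "GRCh38_ghgvs"

def csv_to_vcf(csv_filename, vcf_filename):
    df = pd.read_csv(csv_filename)
    df = df.sort_values([CHROM, POS])  # VCF must be sorted by chrom then pos
    with open(vcf_filename, "w") as f:
        f.write("##fileformat=VCFv4.2\n")
        f.write('##INFO=<ID=EXPECTED_HGVS,Number=1,Type=String,'
                'Description="NCBI g.HGVS to compare with snplift results">\n')
        f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        for _, row in df.iterrows():
            f.write(f"{row[CHROM]}\t{row[POS]}\t.\t{row[REF]}\t{row[ALT]}"
                    f"\t.\t.\tEXPECTED_HGVS={row[HGVS_COL]}\n")

csv_to_vcf("shariant_ncbi_liftover_from_GRCh37_to_GRCh38.csv",
           "shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf")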

davmlaw commented Nov 3, 2023

So yeah, to compare the SNPLift results with NCBI, we need to compare the newly produced lifted-over VCF coords with the g.HGVS from the INFO field (which is what NCBI liftover made)

So you can produce a g.HGVS from the VCF, then compare that - you can use this library if you want to get used to it: https://github.com/biocommons/hgvs - or annotate the VCF with VEP (online or offline) to produce g.HGVS

HGVS can be a pain to set up, and you can't use their remote UTA on the SA Path network - maybe try cdot: https://github.com/SACGF/cdot

Or maybe just pull it apart with a regex and add/subtract 1 - you can probably skip anything more complicated than SNVs (ie A>C)
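
A minimal sketch of that regex approach for plain SNVs (anything more complicated gets skipped). For SNVs, the VCF POS and the g.HGVS position are both 1-based, so they compare directly - the +/-1 adjustment only matters for indels:

import re

# Matches simple g.HGVS SNVs like "NC_000001.11:g.12345A>C"
GHGVS_SNV_RE = re.compile(r"(?P<contig>NC_\d+\.\d+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$")

def matches_expected(vcf_chrom, vcf_pos, vcf_ref, vcf_alt, expected_ghgvs):
    """Return True/False if the VCF record matches the expected g.HGVS SNV,
    or None if the HGVS is not a simple SNV (skip those)."""
    m = GHGVS_SNV_RE.match(expected_ghgvs)
    if not m:
        return None
    return (vcf_chrom == m["contig"] and int(vcf_pos) == int(m["pos"])
            and vcf_ref == m["ref"] and vcf_alt == m["alt"])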

YasirKusay commented:

I am up to the "comparison" phase, but would just like to make a few notes about some observations/challenges using SNPLift.

  • It needs the genome files (GRCh37 and 38), and they need to be indexed with bwa. Fortunately, SNPLift can auto-generate the index, controlled by the SKIP_INDEXING option inside 02_infos/snplift_config.sh - be sure to switch SKIP_INDEXING on once you have generated the index so it isn't redone.
  • The versioning may be quite difficult to manage. The chromosome name in the VCF files must perfectly match the corresponding sequence name inside the "old" genome file - you cannot simply use the chromosome number; the name must match the full sequence name INCLUDING its version.

YasirKusay commented Nov 8, 2023

I have been trying to do some comparisons, but have encountered a few roadblocks:

  • SNPLift first aligns and writes the variant translations into NUM_CPU files, all with the same number of lines (except the last file). Translations are written in the same order as the original variants appeared in the VCF, which means translations from the same chromosome may land in different files, and translations from different chromosomes in the same file. Also, the same translation entry may appear in several different files (most likely a bug). These could be issues in the next step.
  • When deciding whether a VCF entry makes it into the new VCF file, if the alignment score of its translation is bad (<0.5), SNPLift inspects the adjacent variants (if physically close enough), and if their average score is good enough the entry is added anyway (this is a simplification). This causes problems when the input VCF is small: it looks at 20 adjacent entries by default (configurable in the settings), and I have seen crashes with small VCFs unless I adjust that number.

Since I am mostly interested in comparing the VCF outputs, I have chosen to re-write the code in SNPLift to just include every item regardless of score (and delete duplicate entries). However, I feel like this is still something we should talk about next time we meet.

YasirKusay commented:

I encountered an additional issue when working with SNPLift: it will translate some chromosomes (NC_0000XX) to something else (NW/NT contigs).

YasirKusay commented Nov 9, 2023

I will just post the results here:

  • shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • 66 inputs, 14 skipped, 4 mismatches, 0 exceptions.
  • shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • 64 inputs, 37 skipped, 4 mismatches.
  • vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • 292 inputs, 69 skipped, 24 mismatches, 4 variants caused an exception error.
  • vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • 593 inputs, 268 skipped, 57 mismatches, 14 variants caused an exception error.

The skipped ones are those that were translated to a non-NC sequence.

Overall, quite a high error rate.

davmlaw commented Nov 9, 2023

> The versioning may be quite difficult to manage. The chromosome name in the vcf files must perfectly match its corresponding sequence inside the "old" genome file

Requiring contig names to match exactly is normal - I think of it as a safety feature to make sure you have the right genome.

In VG we use contig names internally; I just exported them as chromosome names as I probably wasn't thinking about it. But when we export we'll do it using contig names, so that's fine.

I wrote up some docs on this here:

https://github.com/SACGF/variantgrid/wiki/VCF-chromosome---contig-names

You should be able to convert them back via the instructions on that page.

> Split into NUM_CPU files

Can you just force it to use 1 file / 1 CPU?

> alignment score of its translation is bad (<0.5)

Maybe you could raise an issue on their GitHub and suggest a command-line flag to disable this, or make a patch that handles files with <20 entries? To me it should be looking within a certain region, ie at genomic distance rather than VCF file entries, as adjacent entries could be eg chr20 next to chrX - the variants on one contig have nothing to do with those on other contigs where mapping is concerned.

> SNPLift will translate some chromosomes (NC_0000XX) to something else (NW/NT).

We generally ignore these other contigs as they drastically increase the complexity of things - just regard this as a failure, I think.

YasirKusay commented:

  • Regarding the contig names: when generating the VCF file from your input, I used the contig name from the HGVS of the non-translated entry.
  • Regarding the NUM_CPU files: yes, you can manually specify that. It splits into NUM_CPU files for multiprocessing.
  • I can raise an issue on SNPLift.
    • Regarding the genomic distance comment: if that condition is triggered for a poorly scoring translated variant, it will only look at the closest translated variants that belong to the same chromosome. Did NCBI Remap also do this check? (I am fairly sure it does not.) How important is this feature, in your opinion?
    • So far, I have just commented out that code in SNPLift and included all translated variants for comparison purposes.

If you regard the translation from NC to something else as a failure, then this program has an extremely high error rate. In the second file alone, > 50% of entries were such occurrences.

davmlaw commented Nov 9, 2023

Maybe you could try stripping the fasta files of all of the patch/alt contigs then re-doing the tests? You can probably make a Python script that does something like this (thanks ChatGPT):

import gzip
from Bio import SeqIO

input_file = 'orig.fa.gz'
output_file = 'main_contigs_only.fa.gz'

# gzipped fasta needs to be opened in text mode ('rt'/'wt') for Biopython
with gzip.open(input_file, 'rt') as f_in, gzip.open(output_file, 'wt') as f_out:
    for record in SeqIO.parse(f_in, 'fasta'):
        if record.id.startswith("NC_"):  # keep primary contigs only - I think this is what we want...
            SeqIO.write(record, f_out, 'fasta')

That would force it to only use those as a destination - hopefully only drop the score a tiny bit

YasirKusay commented:

That's actually not a bad idea. I will do that. Hopefully the score goes up this time.

davmlaw commented Nov 9, 2023

As the software is at the pre-print stage, I think they would be pretty keen for feedback - the preprint's corresponding author is davoud.torkamaneh.1@ulaval.ca

But yeah raising an issue then making a pull request for command line options etc would be great

YasirKusay commented:

Just got the updated results:

  • shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • 66 inputs, 5 mismatches, 6 variants caused an exception error.
  • shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • 64 inputs, 6 mismatches.
  • vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • 291 inputs, 33 mismatches, 1 variant caused an exception error.
  • vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • 593 inputs, 106 mismatches, 14 variants caused an exception error.

Overall, quite a high error rate.

davmlaw commented Nov 10, 2023

It's hard to judge whether it has a high error rate, as this data set is exactly the cases where ClinGen Allele Registry failed - so presumably it's tricky stuff. If we used this set to validate ClinGen Allele Registry, it would get a 100% failure rate, even though it's usually sub-1%.

Maybe it would be worth creating a good benchmark set as well as this difficult one? Perhaps take a 1k subset of ClinVar where the ClinGen allele ID is available in both the 37 and 38 VCF files - that way we have a way of checking whether it was right.

There are some other liftover tools it might be worth evaluating - see this Biostars post.

davmlaw commented Nov 10, 2023

For the benchmark set - I have created subsamples of VCFs before using https://github.com/alexpreynolds/sample

Something like this (untested - copy/pasted lines from my bash history):

# Get header
zgrep "^#" clinvar_20230430.GRCh38.vcf > clinvar_20230430.GRCh38.10k_sample.vcf
./sample --sample-size=10000 clinvar_20230430.GRCh38.vcf --preserve-order >> clinvar_20230430.GRCh38.10k_sample.vcf
bgzip clinvar_20230430.GRCh38.10k_sample.vcf

YasirKusay commented:

Should I also try a database aside from UTA?

davmlaw commented Nov 10, 2023

You can use cdot if you want, but if UTA is set up that should work, as you're only using UTA for HGVS conversion of genomic variants.

ClinVar should have everything in the VCF. So to get a liftover truth set: subsample one build, read the ClinVar allele IDs into a set, then go through the other build's VCF and subset out the records whose allele IDs match the ones in your set, as in the sketch below.
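
A minimal sketch of that subsetting, matching on the VCF ID column (the ClinVar VariationID); if you matched on a different key (eg the ALLELEID INFO field), swap the lookup accordingly - this is an illustration, not tested against a real ClinVar file:

import gzip

def read_variant_ids(vcf_path):
    """Collect the ID column (ClinVar VariationID) from a gzipped VCF."""
    ids = set()
    with gzip.open(vcf_path, "rt") as f:
        for line in f:
            if not line.startswith("#"):
                ids.add(line.split("\t")[2])
    return ids

def subset_vcf_by_ids(in_vcf, out_vcf, wanted_ids):
    """Write the header plus only the records whose ID is in wanted_ids."""
    with gzip.open(in_vcf, "rt") as f_in, open(out_vcf, "w") as f_out:
        for line in f_in:
            if line.startswith("#") or line.split("\t")[2] in wanted_ids:
                f_out.write(line)

# eg subset GRCh38 down to the IDs present in the GRCh37 10k sample
ids_37 = read_variant_ids("clinvar_20230430.GRCh37.10k_sample.vcf.gz")
subset_vcf_by_ids("clinvar_20230430.GRCh38.vcf.gz",
                  "clinvar_20230430.GRCh38.truth_set.vcf", ids_37)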

YasirKusay commented Nov 13, 2023

Just going to push out a few new stats. I extracted 1000 variants at random from the ClinVar GRCh37 VCF. I also extracted the variants with the same IDs from GRCh38.

Here are the results for SNPLift, CrossMap and Picard on this new set:

  • 37_to_38
    • SNPLIFT: 2 mismatches, 1 unmapped
    • CROSSMAP: 1 unmapped, 2 mismatches
    • PICARD: 2 unmapped, 0 mismatches
  • 38_to_37
    • SNPLIFT: 2 mismatches, 1 unmapped
    • CROSSMAP: 2 unmapped, 1 mismatch
    • PICARD: 2 unmapped, 0 mismatches

I will soon also test your examples on CrossMap and Picard. That should give me a good indication of which might be the best to use.

YasirKusay commented Nov 13, 2023

(IN PROCESS OF WRITING THIS)

  • shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • crossmap: 9 exceptions, 2 mismatches, 1 failed to map
    • picard: manually removed 1 record; 1 exception; 1 failed to translate
  • shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • crossmap: 6 mismatches, 1 exception, 3 failed to map
    • picard: 2 mismatches, 4 failed to translate
  • vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:
    • crossmap: 7 failed to map, 17 mismatches, 1 exception
    • picard: manually removed 2 records; FAILED COMPLETELY (will look into it more later)
  • vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:
    • crossmap: 25 failed to map, 50 mismatches, 14 exceptions
    • picard: 29 mismatches, 45 exceptions, 28 failed to translate

In my opinion, CrossMap is easier to use; Picard had quite stringent format requirements that made using it quite time-consuming/difficult.

YasirKusay commented Nov 14, 2023

shariant_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

Program   Failed to Map           Mismatches  Exceptions  Total Errors
CROSSMAP  1                       2           9           12
PICARD    2 (1 removed manually)  0           1           3
SNPLIFT   1                       5           6           12

shariant_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

Program   Failed to Map  Mismatches  Exceptions  Total Errors
CROSSMAP  3              6           1           11
PICARD    4              2           0           6
SNPLIFT   1              6           0           7

vg3upgrade_ncbi_liftover_from_GRCh37_to_GRCh38.vcf:

Program   Failed to Map  Mismatches  Exceptions  Total Errors
CROSSMAP  7              17          1           25
PICARD    FAILED         FAILED      FAILED      FAILED
SNPLIFT   12             33          1           46

vg3upgrade_ncbi_liftover_from_GRCh38_to_GRCh37.vcf:

Program   Failed to Map  Mismatches  Exceptions  Total Errors
CROSSMAP  25             50          14          89
PICARD    28             29          45          102
SNPLIFT   27             106         14          147

I would say that it's tied between CrossMap and Picard; however, Picard is more difficult to use. I encountered the following issues when using it: if a variant entry contains a '=' anywhere (e.g. in "ALT", "INFO", etc.), it fails, because it thinks there is a key (from INFO) being assigned a value. Also, I think it expects that any value inside INFO has an associated key (which must be defined in the header). For testing purposes, I removed the lines that contain a '=' from the VCF input. I also believe (from some observations) that even minor issues cause the entire program to fail. In my opinion it is quite frustrating to use, and it is often difficult to diagnose an issue when one arises.

Also, here is another major issue I encountered when using Picard and CrossMap: they require the chain file hg38ToHg19.over.chain.gz (or hg19ToHg38.over.chain.gz) to work. These are available from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/ and http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/ respectively.

The issue with these files is that they contain the chromosome indexes only (chr-style names), which can be difficult to translate due to the structure of the file. To get around this temporarily, I modified the input VCFs and input genome files to use chromosome indexes, then translated the VCF chromosome names back after performing the liftover - see the sketch below.
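
That renaming workaround amounts to a simple mapping pass over the VCF's CHROM column; a minimal sketch, where the mapping entries are illustrative (extend with one entry per contig per build):

# Illustrative RefSeq-accession -> chain-file-style name mapping, and its reverse
CONTIG_TO_CHR = {
    "NC_000001.10": "chr1",   # GRCh37
    "NC_000001.11": "chr1",   # GRCh38
    # ... one entry per contig per build
}
CHR_TO_CONTIG_38 = {"chr1": "NC_000001.11"}  # reverse map for the target build

def rename_chroms(in_vcf, out_vcf, mapping):
    """Rewrite the CHROM column using the given mapping; header lines pass through."""
    with open(in_vcf) as f_in, open(out_vcf, "w") as f_out:
        for line in f_in:
            if line.startswith("#"):
                f_out.write(line)
                continue
            fields = line.split("\t", 1)
            fields[0] = mapping.get(fields[0], fields[0])
            f_out.write("\t".join(fields))

# eg before liftover: rename_chroms("in.vcf", "chr_names.vcf", CONTIG_TO_CHR)
# and after liftover: rename_chroms("lifted.vcf", "contig_names.vcf", CHR_TO_CONTIG_38)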

This is not an issue with snplift; however, instead of using flags, it uses a shell file where you assign the input/output names etc (these get exported to the environment). I guess we can just skip this step by assigning our own environment variables (or writing a command-line flag interface ourselves).

I recommend that we investigate CrossMap further.

EmmaTudini commented:

@davmlaw @YasirKusay thanks for looking into this. @davmlaw As we're now in November, I just wanted to confirm that we won't see any issues on the Shariant end when NCBI does retire the service? Particularly thinking about alleles that are unable to be lifted over using ClinGen Allele Registry and therefore go down the NCBI liftover route.

davmlaw commented Nov 23, 2023

We have worked out which tool to use, but need to integrate it, do Shariant testing, then deploy.

We have disabled even attempting NCBI liftover, so we won't get errors. But we will probably get 0.5% failing to liftover due to ClinGen Allele Registry failure.

They will sit there with liftover failure until the deploy with the new tool comes through.

EmmaTudini commented:

Is there an ETA on this?

I've been looking at the imported alleles and saw that there have been two new alleles missing a genome build (against NM transcripts) since NCBI Remap was retired. That is compared to a total of three missing genome build alleles (on NM transcripts) throughout the whole of Shariant. It may be that NCBI liftover was saving more alleles than we anticipated.

davmlaw commented Jan 19, 2024

I hope to get this done within 2-3 weeks

davmlaw commented Jan 24, 2024

bcftools liftover...

> SNVs and indels and it drops less genetic variants than other tools commonly used for the same task such as Picard/LiftoverVcf and CrossMap

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae038/7585532?login=true

I have documented installing this:

https://github.com/SACGF/variantgrid/wiki/Install-bcftools-liftover

Running VG test: /data/annotation/liftover

davmlaw commented Feb 9, 2024

Another data point - after importing SA Path variant tags, the number that failed ClinGen was 1k out of 179k = 0.55%

davmlaw self-assigned this Feb 16, 2024
davmlaw commented May 2, 2024

Initial testing failed - couldn't click "create variant" to make it. This was because that only worked with ClinGen. Fixed.

Testing on VG test, I created a super-long variant:

NC_000008.11:g.117822562_117852562dup

This has an SVLEN of 30001, and ClinGen fails with ClinGenError: VariationTooLong

But I can now click create variant / liftover. Checking the annotated c.HGVS, it comes out the same:

GRCh37 = NM_000127.3(EXT1):c.963-15361_1320dup
GRCh38 = NM_000127.3(EXT1):c.963-15361_1320dup

davmlaw changed the title from "NCBI remap retires November 2023" to "New Genome build Liftover as NCBI remap retired" May 8, 2024
EmmaTudini assigned EmmaTudini and unassigned EmmaTudini May 16, 2024

TheMadBug commented:

@davmlaw is there an equivalent setting to LIFTOVER_NCBI_REMAP_ENABLED for this that can be turned off until full testing is complete?
(After the deploy of the somatic data, I'd like us to spend a deploy with no new functionality, just using biocommons and this)

davmlaw commented May 16, 2024

It's off by default; enable it via:

LIFTOVER_BCFTOOLS_ENABLED = True

TheMadBug commented:

Cheers - currently it's enabled for Shariant test but not prod. I'm happy to leave it that way for this deploy, and then do the full testing with a goal of enabling it in prod at the next deploy.

davmlaw changed the title from "New Genome build Liftover as NCBI remap retired" to "BCFTools liftover - replace NCBI remap which is retired" Jun 4, 2024

EmmaTudini commented:

@davmlaw If a variant is lifted over via BCFTools, it doesn't populate g.HGVS. See - https://test.shariant.org.au/variantopedia/view_allele/28939

davmlaw commented Jun 17, 2024

g.HGVS not being populated was due to VEP failing to annotate any new variants, not just the BCFTools ones.

EmmaTudini commented:

Note from issue https://app.zenhub.com/workspaces/everything-space-5bb3158c4b5806bc2beae448/issues/gh/sacgf/variantgrid/1014

There are 2 separate things here:

  • VCF normalization - making sure variants with inconsistent input representations end up stored in 1 consistent format in the database (left-aligned, smallest representation etc) - prod currently uses VT for this
  • Liftover - conversion between genome builds. Prod currently uses ClinGen Allele Registry (and just converts MT directly, as it's the same contig on both builds)

VT was the first tool to introduce normalization, but it does not support symbolic variants at all, has a number of open issues, and hasn't made a release since 2018.

BCFTools is an extremely popular package (samtools is universally used in next-gen sequencing) and now supports liftover. They have quickly added support for symbolic variants etc and fixed bugs when I raised them.

So, we are switching to BCFTools for normalization - there is no turning this on/off, it's the only one we will use going forward. It should behave the same as VT though (except for working correctly for structural variant normalization - ie this issue).

For liftover, we can enable/disable using "BCFTools +liftover" (a plugin) as our fallback liftover method via settings. It's currently turned off in the settings; the plan is to enable it in prod one release after the next.
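
(For context, the normalization step is a single bcftools call; a minimal sketch of invoking it from Python - file paths are placeholders, and this is an illustration rather than the actual pipeline code:)

import subprocess

def normalize_vcf(in_vcf, ref_fasta, out_vcf):
    """Left-align and trim to the smallest representation with bcftools norm
    (the job VT used to do for us)."""
    subprocess.run(
        ["bcftools", "norm", "-f", ref_fasta, "-O", "z", "-o", out_vcf, in_vcf],
        check=True,
    )

normalize_vcf("input.vcf.gz", "GRCh38.fa", "normalized.vcf.gz")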

davmlaw commented Jul 4, 2024

VG testing complete; Shariant testing spun off into a new issue.

davmlaw commented Aug 18, 2024

After the vg.com deploy, found some issues:

  • Need to check for chain files in deployment_check
  • Running the manual liftover pipeline button causes:
[2024-08-18 12:22:54,776: ERROR/ForkPoolWorker-1] duplicate key value violates unique constraint "snpdb_liftovererror_liftover_id_allele_id_ac62c5c6_uniq"
DETAIL:  Key (liftover_id, allele_id)=(28, 137) already exists.
  • This error didn't appear on the page (it was only an exception in the logs)
  • Should have some kind of way to delete / retry the bad data (do we have an issue for this?)

davmlaw added a commit that referenced this issue Aug 19, 2024
davmlaw commented Aug 19, 2024

OK, fixed that with update_or_create.
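
(For reference, update_or_create makes the retry idempotent - a hypothetical sketch, with the model and field names inferred from the constraint name in the log above rather than taken from the actual VariantGrid code:)

from snpdb.models import LiftoverError  # assumed import path

# Update the existing row instead of violating the unique (liftover, allele) constraint
liftover_error, created = LiftoverError.objects.update_or_create(
    liftover=liftover,
    allele=allele,
    defaults={"error_message": message},
)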

Now I get an error about the reference not matching. We'll have to filter those out and fail them before writing to the VCF.

We should probably think about how we want to store AlleleLiftovers that fail validation and never actually attempt a liftover run.

At the moment, because we need to store the destination, for ClinGen we shove them all into a fake liftover run that always fails, and we don't have a way of doing that for other liftover types.

Maybe we should return objects for the liftover run, but with errors, so they are created and not eg dumped into a VCF etc - just created with an initial failure?

Then phase out that initial always-fail run. (A sketch of this idea follows below.)
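
(To make that concrete, a hypothetical sketch of the "yield results, including errors" idea - the class and method names are illustrative, not the actual VariantGrid code:)

from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class AlleleLiftoverResult:
    allele_id: int
    error: Optional[str] = None  # set => record the failure, don't write to the VCF

def liftover_alleles(alleles) -> Iterator[AlleleLiftoverResult]:
    """Yield one result per allele: validation failures are yielded with an
    error instead of being silently dropped before the VCF is written."""
    for allele in alleles:
        if allele.ref_mismatch():  # hypothetical validation hook
            yield AlleleLiftoverResult(allele.pk, error="reference does not match")
            continue
        yield AlleleLiftoverResult(allele.pk)  # goes into the liftover VCF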

davmlaw commented Aug 19, 2024

OK, have rewritten the code to yield multiple results, including errors, so we now always record what happens.

Deployed and tested vg.com

davmlaw closed this as completed Aug 19, 2024