
Different Results Using EPACTS version 3.4 #10

Open
smhaider opened this issue Mar 8, 2019 · 35 comments

@smhaider

smhaider commented Mar 8, 2019

Dear All,

I am just curious: what is the difference between EPACTS version 3.4 and 3.2.6 or 3.3.2? I was unable to find proper documentation on the versions. I compared the results of all these versions, giving the same input files and calling the same method. The results from 3.2.6 and 3.3.2 were exactly the same (and make more sense given my data), whereas the result from 3.4 was totally different and does not make sense for my data. So I am just curious what the difference between these versions is.

ps. the command I used for all these versions is as follows.

epacts-single --vcf filtered.vcf.gz --ped phenotype.ped --pheno PHENO --test b.glrt --anno --out Single_SNP_Analysis_COmplete --run 2 | tee log.txt

I look forward to hearing from you.

Kind Regards,
Haider

@jonathonl

The only difference that is known to affect results is the handling of multiallelic sites. Can you elaborate on "doesn't make sense according to my data"?

@smhaider

smhaider commented Mar 8, 2019

Thanks for your quick response. By "doesn't make sense" I meant: the VCF file I provided had about 20k variants (not multiallelic; I have already filtered those out), but after the analysis I only got about 1,040 variants, whereas with versions 3.2.6 and 3.3.2 I got the same number of variants as the input.

Secondly, the allele frequencies in cases and controls, and the overall allele frequencies, were also wrong, whereas 3.2.6 and 3.3.2 gave the right frequencies; I confirmed this through manual calculation as well.

So that was a bit strange for me. I thought that maybe 3.4 uses a different build, but I am still not sure what went wrong.

@jonathonl

Ok. I'll test that exact command and get back to you. It might take a couple days to get to this though.

@smhaider

smhaider commented Mar 8, 2019

OK, that sounds perfect. Thanks again; looking forward to hearing from you.

have a nice weekend :)

@jonathonl

Can you pull the latest from the develop branch and give it a go? You will need to reinstall the savvy dependency as well after pulling the latest (cget remove statgen/savvy; cget install -f requirements.txt).

There were two issues that I fixed. One was the handling of PL and GL data, which produced incorrect results. The other was the handling of variants that do not have PL data, which I suspect is why the number of variants differs between your analyses.
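For reference, the update-and-reinstall steps might look like this as one sequence (run from the EPACTS source directory; the branch name and cget commands are taken from the comment above):

```shell
# Pull the latest develop branch
git checkout develop
git pull

# Rebuild the savvy dependency against the new sources
cget remove statgen/savvy
cget install -f requirements.txt
```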

@smhaider

Hey, good morning,

Thank you so much for the quick turnaround, and sorry for my late response; I was travelling. I will let our cluster team know about the changes. I am just curious whether these changes will fix the incorrect frequencies as well?

Thanks again.

Regards,
Haider

@smhaider

@jonathonl

Thanks again for your quick response. I just tried the analysis again, but unfortunately the results were the same.

Just to confirm we are using the right version: I am using this tool on our server, and our team couldn't find a git tag named 3.4 in the repo. Please see the response from the team below.

I installed version 3.3.0 and the latest commit from the development branch of the git repo in a container; I think that the latter is 3.4.0. Unfortunately, there is no git tag named 3.4.0 in the repo, so I can only refer to a specific git commit ID in order to specify the version. Our version 3.4.0 is commit ID 98d7ff7.

If we are doing something wrong, can you please point us in the right direction?

I look forward to hearing from you.

Kind Regards,
Haider

@jonathonl

jonathonl commented Mar 11, 2019

That is the correct commit id. A tarball can be found at https://github.com/statgen/EPACTS/archive/98d7ff7c0f0831d521324f16671d8fa989800f63.tar.gz. Are you saying that you are getting the same exact results when running b.glrt with this commit as you do when running with v3.4-rc? That would be very unlikely. I would double check that you are using the correct docker image. Maybe run docker images to see the details of your local images and then run docker ps -a to confirm that the container you ran matches the correct image ID.

Also, am I correct in assuming you have PL data in your VCF file?

@smhaider

@jonathonl

Thanks again for your response. I just double-checked the versions and re-ran the analysis, but unfortunately got the same results.
For your reference and better understanding, I am attaching the log files from runs through the different versions using the same command, along with the recipes by which the different versions of EPACTS were installed in the container. If you want, I can also share the output files from the different versions.

log_3.4.0.txt
log_3.3.0.txt
log_3.2.6.txt
log_3.4rc.txt

recipe_3.3.0.txt
recipe-v3.2.6.txt
recipe_3.4rc.txt
recipe_3.4.0.txt

Lastly, yes, I have PL data in my VCF file. I have also attached a chunk of my VCF file for your reference.

I look forward to hearing from you.

Kind Regards,
Haider
Filtered.txt

@jonathonl

I did some tests with the Filtered.txt VCF file you supplied. One problem is that this file has carriage returns, which break parsing in the master branch version (v3.3.0). This may just be an artifact of subsetting, and the CRs may not exist in your original VCF. Another issue is that this VCF does contain multiallelic variants. On the master branch, the extra alleles are ignored. On the develop branch, file parsing stops for that chunk, and apparently the error is not propagated up the call stack, resulting in truncated regions. This is why the number of variants doesn't match.
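A quick way to check a VCF for both issues (the file name is a placeholder; zcat, grep, sed, and bgzip are assumed to be available):

```shell
# Count lines that end in a carriage return (CRLF line endings)
zcat Filtered.vcf.gz | grep -c $'\r$'

# Count multiallelic records (ALT column containing a comma)
zcat Filtered.vcf.gz | grep -v '^#' | cut -f5 | grep -c ','

# Strip carriage returns and re-compress
zcat Filtered.vcf.gz | sed $'s/\r$//' | bgzip -c > Filtered.nocr.vcf.gz
```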

Can you run the following command on your end? The input files are in the tarball below. The expected results are also in the tarball.

epacts-single --vcf smhaider-nocr-noma.vcf.gz --ped smhaider_dummy_pheno.ped --pheno DISEASE --test b.glrt --anno --out test-out --run 2 --chr 1

test-files.tar.gz

@smhaider

Oh, I am so sorry; I forgot that I filtered for multiallelics after producing this VCF. I mixed it up with another VCF file, my bad.
I will test it and get back to you as soon as possible. Our cluster is currently down for maintenance; as soon as it is back up, I will test it and get back to you.

Thanks again for all your assistance and help; I really appreciate it.

Kind Regards,
Haider

@smhaider

@jonathonl

The link you provided is broken; can you please share it again?

https://github.com/statgen/EPACTS/archive/98d7ff7c0f0831d521324f16671d8fa989800f63

I am unable to access this link.

Kind Regards,
Haider

@jonathonl

@smhaider

thanks again will get back to you soon :)

@smhaider

@jonathonl

I hope you are fine. I have just tested with the material you provided, and the results match your output. Does that mean I just need to remove multiallelics from my VCF and I will be good to go?

Secondly, I tested your files with two versions, 3.3.0 and 3.4.0. I got the same results as yours with 3.4.0 but different results with 3.3.0. Please see both results in the attachments; I am just wondering why they are different.

Thirdly, I tested my real VCF with versions 3.3.2 and 3.2.6; the results from both versions match and seem to be correct as well. So I just need your suggestion: should I change my VCF and remove the multiallelics, or should I just keep using 3.3.2? The commit ID used for 3.3.2 was "a5209db1b3929c4dd2f15f27ea085edf3a634ee7".

Lastly, I am mainly interested in q.emmax testing, for which I calculated the kinship matrix. While running it through 3.2.6 and 3.3.2, I got some warnings during the run. There wasn't any error, but I am not sure whether these warnings could have affected the results, so I was wondering if you could have a look at the run; I would really appreciate that. I have attached the log file as well.

Thank you again for all your help and assistance.

I look forward to hearing from you.

Kind Regards,
Haider

test-out-v3.3.0.epacts.gz
test-out-v3.4.0.epacts.gz

kinship-run.txt

@jonathonl

The versions are admittedly a little wonky, and I will eventually clean that up. In the meantime, let me make some clarifications. According to recipe_3.3.0.txt, what you are calling v3.3.0 is the same as the git tag v1.0, which is the same as commit a5209db, which you also said was v3.3.2. So v3.3.0, v3.3.2, v1.0 and the current HEAD of master are all the same since they all represent the commit a5209db. For the sake of simplicity, I'm going to call this commit master.

When I run the test files in the tarball I provided above, I get the same exact results between what is on master (a5209db) and what is on develop (98d7ff7). These results also match what you provided in test-out-v3.4.0.epacts.gz.

I would recommend running bcftools norm -m- <input.vcf.gz> to convert the multiallelics to biallelics. This will give you more complete results for all versions.

I would stick with the version that is on master (a5209db). There are some issues still being worked out in the develop branch, and it is not entirely stable.

As for your kinship error log, I suspect that your dataset is sparse (chip data, maybe) and some of the regions being processed do not include any variants above the minMAF threshold. You can confirm this by looking at those regions with bcftools.
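Both suggestions as concrete commands (file names and the region are placeholders; bcftools and tabix assumed on PATH):

```shell
# Split multiallelic records into separate biallelic records
bcftools norm -m- filtered.vcf.gz -Oz -o filtered.biallelic.vcf.gz
tabix -p vcf filtered.biallelic.vcf.gz

# Count variants above a minor-allele-frequency cutoff in one of the
# regions flagged in the kinship log (the region shown is made up)
bcftools view -H -r 1:1-1000000 -q 0.01:minor filtered.biallelic.vcf.gz | wc -l
```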

@smhaider

@jonathonl

Thank you so much again. Everything is clear, and I really appreciate your help.

Kind Regards,
Haider

@smhaider

smhaider commented Apr 9, 2019

@jonathonl

I hope you are doing well. Sorry for bothering you again; I have a quick question. I ran the q.emmax test on my LOF variants using the master commit. It seems to run fine so far without any errors, and the Makefile is updating from time to time, but the analysis has now been running on the server for about 5 days, so I was wondering whether that is normal or something is fishy.
The documentation page states that it is a slow analysis, but I am curious whether it is normal for it to take that much time.

ps. I am running the analysis with 50 cores and 200 GB of memory.

I look forward to hearing from you.

Kind Regards,
Haider

@jonathonl

I would expect that to be normal if you have tens of thousands of samples. You can get an idea of progress by looking at the output directory. There will be temporary files for each region that has been run.

@smhaider

smhaider commented Apr 9, 2019

OK, thank you again for your response. Just to confirm that I am understanding you correctly: I have 18,125 samples in my VCF file. I made the kinship matrix from it following your standard guidance, and then ran the q.emmax analysis 5 days ago.
When I started the analysis, it instantly generated 4 files in the output folder:
1. annotated.ind
2. annotated.cov
3. annotated.Makefile
4. annotated.phe
But over the last 5 days, the total number of files in the output directory hasn't changed, and neither have their sizes, apart from the Makefile, whose size did increase. No other file has been generated since then.

Secondly, I am constantly checking the log file of the run; it hasn't updated in 5 days. Kindly have a look at the attachments.

The analysis is running (no errors so far) and it is consuming memory and CPUs, but I am a bit worried about whether this is normal or whether something is wrong.

Slurm.output.txt
log_lof.txt

I am sorry for bothering you again and again; I really appreciate your help.

Kind Regards,
Haider

@jonathonl

According to the log, it's stuck on matrix inversion, which I believe should be single threaded. What percentage of CPU and memory is the pEmmax process using?

@smhaider

@jonathonl

Thanks again for your response. I just checked the run a while ago, and this process seems to be working:

/usr/local/bin/pEmmax reml --phenof /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.phe --covf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.cov --kinf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Kinship_Matrix/kinship.kinf --indf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.ind --out-eigf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.eigR --out-remlf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.reml

It constantly uses one core at 100% and 16.9GB of RAM

There are two other processes running:

make -f /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.Makefile -j 4
/usr/bin/perl -w /usr/local/bin/epacts-single --vcf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/LOF_annotated.vcf.gz --ped /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/phenotype.ped -min-maf

They ran for a few seconds at most, and seem to have started the first, working process.

There is almost no communication with /data, so it is probably not writing anything to my output folder.

I don't really know whether this is normal or not, so I am looking forward to your advice. It has already been 6 days, and I submitted the job for 10 days only, so I am a bit worried that the server will cancel my job in 4 days.

Looking forward to hearing from you.

Kind Regards,
Haider

@jonathonl

6 days does seem too long for the reml step with 18,000 samples. I would still let the job run to see if it finishes before the time limit. In the meantime, I'm curious whether modifying the SVD problem slightly would have an effect. Maybe try adding a few more covariates. To be honest, I'm kind of grasping at straws here.

@smhaider

@jonathonl

Thanks again. The script just stopped, but somehow it didn't calculate anything; there are only NAs in the p-value, beta, sbeta, etc. columns.

I have also attached the log file and the traceback file for your reference. Can you kindly have a look at it again and point out the problem?

log_lof.txt

Slurm.output.txt

Below you can also find a few lines of the generated output.

Result.txt

Looking forward to hearing from you.

Kind Regards,
Haider

@jonathonl

All of the variants in the results file have a MAC below the threshold. I suspect you will have valid p-values for variants with MAC > 3 and MAF > 0.001. Have you looked at the manhattan plot?

@smhaider

Hi @jonathonl ,

Thanks for your help so far; however, I still do not understand your answer. Firstly, I only set a minimum MAF threshold of 0.001 but no MAC threshold, as the tool's manual only requires a minMAF cutoff. Secondly, I have 74 variants with an allele count > 3, yet there are no p-values for these 74 either (all p-values are shown as NA), and because of this, the Manhattan plot is empty. Kindly have a look at the complete EPACTS result and the Manhattan plot below.

SVT.LOF.annotated.epacts.xlsx

SVT.LOF.annotated.epacts.mh.pdf

@jonathonl

There are default min MAC (3) and min MAF (0.001) thresholds for the epacts-single command. You can see them being passed down to pEmmax in the logs. Is this the same set of variants you used for epacts make-kin? That subcommand has a default MAF threshold of 0.01. Common variants are necessary for generating a useful kinship matrix.
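The thresholds can also be set explicitly; a sketch with made-up file names (flag spellings as I recall them from the EPACTS wiki, so double-check against epacts-single --help):

```shell
# Lower the cutoffs to retain rarer variants in the single-variant test
epacts-single --vcf input.vcf.gz --ped phenotype.ped --pheno PHENO \
  --test q.emmax --kinf kinship.kinf \
  --min-maf 0.0001 --min-mac 1 \
  --out results --run 2
```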

@smhaider

smhaider commented Apr 15, 2019

@jonathonl

First of all, thank you so much for your response and guidance so far. I am sorry, but I didn't understand why it is better to include common variants in the kinship matrix when I am only interested in rare variants. The VCF file that I am using is already filtered for rare variants, and I used the same file to generate the kinship matrix as well.

Secondly, I wanted to explain one thing in detail; maybe I am doing things totally wrong.

The VCF file that I am using for the single variant test is filtered for rare variants and is also annotated. One of my seniors suggested extracting only the LOF variants (frameshift, stop-gain, start-loss, etc.) and running the single variant analysis (q.emmax) on only the LOF variants as well as on the complete VCF file, just to see if there is a difference. So I did exactly that: I extracted only the LOF variants from my annotated file, zipped it according to the instructions, generated the kinship matrix using the LOF.annotated.vcf.gz file, and ran the single variant analysis. The files that I have shared with you so far are the results of the single variant test on only the LOF variants. This run took more than 7 days.

After that, just as a test, I ran the same analysis on the complete VCF file containing all the variants. To my surprise, the analysis finished in just 5 hours, and it did generate some results (although I am not sure whether they are all right or not). BUT, unlike with the LOF-only VCF file, I have p-values for a lot of variants when I run the analysis on my complete VCF file.

However, even in the output for my complete VCF file there are many variants with a MAC greater than 3 that still have no p-values. And a strange thing: there are many variants with the same MAC, the same MAF, and the same call rate, yet some have p-values and some only have NA. So I am not sure whether the results are reliable or not. I have also attached the output file for the complete VCF file for your reference.

LASTLY, just to mention, I used the same PED file for both the LOF and the complete VCFs.

I am sorry for bothering you again and again, but I really need your assistance, as I am pretty confused by my results.

I hope I have managed to explain the scenario; if you need more information, please let me know.

I look forward to hearing form you.

Kind Regards,
Haider
Single_Variant_Complete_VCF.xlsx

@jonathonl

jonathonl commented Apr 15, 2019

You should use the kinship matrix generated with the full VCF in both the full and LOF association tests. The kinship matrix is an input to the mixed-model for each variant and is independent of the set of variants being analyzed. With small datasets, one could generate a kinship matrix from a pedigree without even looking at the individuals' genotypes. But for large datasets, we need to infer relatedness, and including common variants allows those inferences to be more accurate. Imagine trying to infer relatedness from a dataset of cousins using only singletons. The result would be a matrix of zeros even though these individuals are highly related.
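A sketch of that workflow with placeholder file names (subcommand and flag names assumed from the EPACTS wiki):

```shell
# Build the kinship matrix once, from the full VCF (common variants included;
# make-kin applies its own 0.01 MAF default)
epacts make-kin --vcf full.vcf.gz --ped phenotype.ped \
  --min-maf 0.01 --out full.kinf --run 2

# Reuse the same matrix for both the full and the LOF-only association runs
epacts-single --vcf full.vcf.gz --ped phenotype.ped --pheno PHENO \
  --test q.emmax --kinf full.kinf --out full-results --run 2
epacts-single --vcf LOF_annotated.vcf.gz --ped phenotype.ped --pheno PHENO \
  --test q.emmax --kinf full.kinf --out lof-results --run 2
```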

5 hours is more in line with the expected time for analyzing ~20,000 samples.

The MAF column is rounded. For example, the actual MAF for the marker 10:82182287_TC/CC,GC,AC,T is 36 / (2 * 17954), which is greater than 0.001. The MAF for 2:216285445_G/A is 36 / (2 * 18002), which is less than 0.001.
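The rounding point is easy to verify with awk (MAC = 36 for both markers; MAF = MAC / (2 * N), where N is the number of genotyped samples):

```shell
awk 'BEGIN { print 36 / (2 * 17954) }'   # just above the 0.001 cutoff: tested
awk 'BEGIN { print 36 / (2 * 18002) }'   # just below the 0.001 cutoff: NA
```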

@smhaider

@jonathonl

OK, that makes sense; thank you so much for explaining it so nicely. I have just run the single variant test again on my raw VCF file. I will test the LOF and filtered VCFs as well with the same kinship matrix used for the raw VCF and will get back to you again.

Thanks again for all your help I really appreciate that.

Kind Regards,
Haider

@smhaider

@jonathonl

So I am done with the single variant tests for the raw and the filtered VCFs. To be honest, I don't really understand the results; I just have a feeling that something is wrong. Firstly, there are still a lot of NAs for the variants. Secondly, I don't know why, but it makes no real sense how some variants pop up with such low p-values. If you have some time, can you please explain this a bit, as I am not sure whether the results are reliable or not?

For your reference, I am attaching all the relevant files.

Output files of Raw.vcf

EPACTS.Raw.vcf.Results.xlsx
EPACTS.Raw.vcf.qq.pdf
EPACTS.Raw.vcf.MP.pdf
log_Kinship.Raw.txt
log.Raw.RUN.txt

output files for Filtered.vcf
Results.EPACTS.Filtered.xlsx
filtered_vcf.epacts.qq.pdf
filtered_vcf.epacts.mh.pdf
log_filtered.txt

I am sorry again for bothering you; I am pretty new to this stuff, so I have some understanding issues as well.

I look forward to hearing from you.

Kind Regards,
Haider

@jonathonl

Am I correct in assuming you are using microarray data with custom content that targets rare alleles? If so, the results look fine to me. 1) All of the NAs are those variants that fall outside the MAC and MAF thresholds. The test statistics for those variants will be calculated if you set a lower MAF. 2) Which variants/p-values don't make sense to you and why don't they make sense? Have you checked the potential hits on chromosome 2 and 7 in the full VCF with the known hits for your phenotype? What exactly are you filtering on when you say "filtered VCF"? Are you simply including filter=PASS?

@smhaider

@jonathonl

Thanks again for your reply. First of all, I am extremely sorry for not explaining the dataset properly. Basically, I am working with panel sequencing (targeted sequencing) data. We had around 106 genes in this panel and sequenced around 12,000 cases and around 6,000 controls. These genes are all known CAD GWAS loci. The variant calling was done using the GATK standard whole-exome pipeline.

Well, I don't know whether the results make sense or not, as I don't fully understand the q.emmax test. I read Prof. Dr. Kang's publication as well, but unfortunately didn't really understand how the q.emmax test works and how the associations are calculated. I just noticed that there are a lot of variants showing high associations, but when I checked the frequencies in cases and controls, they were more frequent in the controls (even though we have a smaller number of controls). It might be a very basic question, but I will ask it anyway: how can a variant show an association with the phenotype if it is frequent in both cases and controls (in fact, more frequent in controls)? I was expecting variants to show associations driven by either cases or controls. Or am I not making sense at all? I am sorry again; I only started working on this project 5 months ago and am working with this kind of data for the first time, so my concepts are not clean enough yet and I have a lot to learn. Forgive me if my questions don't make sense.

The filtered file is filtered on the following criteria:

$bcftools filter -i "(MQ>=50)" $IN | $bcftools filter -S "." -i "((FORMAT/GT != '0/0') && (( (FORMAT/AD[1](INFO/QD)(INFO/QD)(FORMAT/GQ)(FORMAT/GQ))>(55*55) ) && ( FORMAT/DP>=10 )&& ((FORMAT/AD[1]>=25) || (FORMAT/AD[1]>=15 && FORMAT/DP>=100)))) || (FORMAT/DP>=5)" | $bcftools filter -i "(AN>=32500)&&(AC>0)" | bgzip -c > $OUT

Lastly, I would like to ask for your advice. Now that you know what kind of data I am dealing with, can you suggest or recommend which method is suitable for the analysis? Is q.emmax OK to use with this data, or do you recommend another method? If you have any other suggestions, I would really appreciate them.

Thanks again for all your help and guidance; I really appreciate your assistance.

Looking forward to hearing from you.

Kind Regards,
Haider

@jonathonl

Looking more closely, some of the betas are going in an unexpected direction. This can happen in extreme cases of relatedness. The kinship matrix is probably still not dependable, since the data is so sparse. The score test in EPACTS will probably give you more reliable results with gene panel data. With that said, I am not the best person to give advice on this.

@smhaider

@jonathonl

Thank you so much for your guidance; I am really grateful for everything. I will look into it and hope to get some suitable results.

Thanks again.

Kind Regards,
Haider
