How to set up the environment for the FASTQ to VCF (RNA-Seq) pipeline tool (Windows version)

This is a preparation guide for using the FASTQ to VCF tool.

Before you begin

Please finish setting up the RNA-Seq FASTQ file processing pipeline on Windows first, because this pipeline uses the same fastp and HISAT2 tools. The only additional software you need is GATK4.

The Environment Set-up Panel

Setting up this pipeline is similar to setting up the RNA-Seq FASTQ file processing pipeline. The settings for the fastp and HISAT2 executables are the same. The GATK Resource Bundle is available only for human (b37/hg19 and GRCh38/hg38), not for other organisms.

However, the indexes provided on the HISAT2 website cause an error when GATK runs, because their chromosome names look like "1", not "chr1". Instead, download the "chr"-added version of the GRCh38 HISAT2 indexes and the GTF file in which we added "chr" to the chromosome names. Please uncompress the zip files before you use them.
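To see which naming style a GTF file uses before running the pipeline, you can inspect its first column. The following is a minimal sketch, not part of the tool: the function names `sequence_names` and `uses_chr_prefix` are hypothetical, and the two sample records are made up for illustration.

```python
def sequence_names(gtf_lines):
    """Return the distinct sequence names (first column) found in GTF lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        names.add(line.split("\t", 1)[0])
    return names


def uses_chr_prefix(names):
    """True if all names start with 'chr', False if none do, None if mixed or empty."""
    flags = {name.startswith("chr") for name in names}
    return flags.pop() if len(flags) == 1 else None


# Two made-up records, one per naming style (not taken from any real file).
plain_style = ['1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "example";\n']
chr_style = ['chr1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "example";\n']

print(uses_chr_prefix(sequence_names(plain_style)))  # False
print(uses_chr_prefix(sequence_names(chr_style)))    # True
```

For a real file, pass an open text file handle instead of the sample lists. A result of False means the file uses the "1" style that triggers the GATK error described above.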

FASTQ to VCF (RNA-Seq) Environment Set-up Panel (Windows 10)

1. Java 8 Executable

Download the Linux x64 Compressed Archive version. If you don't have an Oracle account, please sign up. Uncompress the archive file and set the path to jdk-[version]\bin\java.

2. GATK jar Executable

Download the zip archive of the latest release. Uncompress the zip file and set the path to gatk[version]\gatk-package-[version]-local.jar.
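If you are unsure where the two executables ended up after uncompressing, a small script can locate them. This is only a sketch under assumptions: it supposes both archives were extracted under one folder, the function name `find_setup_paths` is hypothetical, and the glob patterns simply mirror the jdk-[version] and gatk[version] folder names mentioned in steps 1 and 2. The demo builds a throwaway folder with placeholder names rather than real releases.

```python
import tempfile
from pathlib import Path


def find_setup_paths(root):
    """Look under root for the Java executable and the GATK local jar."""
    root = Path(root)
    return {
        # Matches jdk-[version]/bin/java (any folder starting with "jdk").
        "java": sorted(root.glob("jdk*/bin/java")),
        # Matches gatk[version]/gatk-package-[version]-local.jar.
        "gatk_jar": sorted(root.glob("gatk*/gatk-package-*-local.jar")),
    }


# Demo on a temporary folder with placeholder names ("x" stands for a version).
with tempfile.TemporaryDirectory() as tmp:
    java = Path(tmp, "jdk-x", "bin", "java")
    jar = Path(tmp, "gatk-x", "gatk-package-x-local.jar")
    for path in (java, jar):
        path.parent.mkdir(parents=True)
        path.touch()
    found = find_setup_paths(tmp)

print([p.name for p in found["java"]])
print([p.name for p in found["gatk_jar"]])
```

Point `find_setup_paths` at the folder where you extracted the archives. An empty list for either key suggests the archive was not uncompressed there.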

3. GTF/GFF3 File

You can download the GRCh38 GTF file from here. This GTF file is a modified version of the original provided by EBI: the chromosome names have the "chr" prefix. For it to work, you need to use it together with the "chr"-added version of the HISAT2 index.

4. dbSNP VCF File

Get it from gcp-public-data--broad-references on Google Cloud. For GRCh38 data, please download Homo_sapiens_assembly38.dbsnp138.vcf under the hg38/v0 folder.

5 & 6. Known Indels VCF File

You can get them from the same directory as in step 4. Please find Homo_sapiens_assembly38.known_indels.vcf and Mills_and_1000G_gold_standard.indels.hg38.vcf. They may be listed on the second page.

Troubleshooting the files for steps 4 to 6

Although the names of those files end with ".vcf.gz" on Google Cloud, the downloaded files might end with ".vcf" or something else. Both raw VCF and gz-compressed files are accepted, but you may need to correct the extensions so that they match the actual file content.
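One way to tell whether a downloaded file is really gz-compressed is to look at its first two bytes, which are 0x1f 0x8b for gzip data. The sketch below is an illustration, not part of the tool: `is_gzip` and `suggested_name` are hypothetical helper names, and the demo files are created on the fly with a dummy header line.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzip(path):
    """Return True if the file content starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suggested_name(path):
    """Suggest a file name whose extension matches the actual content."""
    path = Path(path)
    base = path.name
    for suffix in (".vcf.gz", ".gz", ".vcf"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]  # drop the current extension
            break
    return base + (".vcf.gz" if is_gzip(path) else ".vcf")


with tempfile.TemporaryDirectory() as tmp:
    compressed = Path(tmp, "example.vcf")  # gzip content, misleading name
    with gzip.open(compressed, "wb") as out:
        out.write(b"##fileformat=VCFv4.2\n")
    plain = Path(tmp, "other.vcf.gz")  # plain text, misleading name
    plain.write_bytes(b"##fileformat=VCFv4.2\n")
    results = [suggested_name(compressed), suggested_name(plain)]

print(results)  # ['example.vcf.gz', 'other.vcf']
```

The script only suggests a name. Rename the file yourself once you have confirmed the suggestion.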

If you want to analyze on hg19

If you analyze on hg19, please download the original HISAT2 index (grch37_genome.tar.gz) and the GRCh37 GTF file, and use the following dbSNP and indels files.
