How to Build HISAT2 Indexes When You Can't Download Them from the Website

If you can't get HISAT2 indexes from the website, you have to create them yourself. These instructions are written for readers who are unfamiliar with Linux or Unix commands.

Preparation for Windows Users Only

Windows users run Linux bioinformatics tools through WSL (Windows Subsystem for Linux), which needs some setup before you can build indexes. Mac users can skip this section.

First, start WSL by typing "wsl" in Command Prompt, then update the package list with the following command.

apt update

If this command fails with a "permission denied" error, put "sudo" before it, like this:

sudo apt update

When prompted, enter the password you set while setting up Ubuntu. The same workaround applies if you get this error in any of the following steps.

Install python2 with this command:

apt -y install python2

Check the locations of python2 and python3 with these commands:

which python2
which python3

If the paths are "/usr/bin/python2" and "/usr/bin/python3", type the following commands to check their minor versions.

ll /usr/bin/python2*
ll /usr/bin/python3*

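If they are "/usr/bin/python2.7" and "/usr/bin/python3.8", type the following commands to register both versions as alternatives for "python" (adjust the version numbers if yours differ).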

update-alternatives --install /usr/bin/python python /usr/bin/python3.8 1
update-alternatives --install /usr/bin/python python /usr/bin/python2.7 2

Type the following command to select which version runs when you call plain "python".

update-alternatives --config python

Then enter the number shown next to python3.8.

Lastly, check that the setup worked with the following command.

python -V

If it prints "Python 3.8.x", where x can be any number, you have completed this section.

Get the Genome Sequence Files

You need the genome sequence as FASTA files, one per chromosome. For example, on Ensembl's chicken page, click "Download DNA sequence (FASTA)" and download the fa.gz files for all chromosomes.

I recommend renaming the files to something short, like "chr1.fa.gz" or "chrZ.fa.gz". This is not required, but it makes the build command much easier to type. Then decompress all the .gz files.
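The renaming and decompression can also be done in one pass from the shell. The following is only a sketch. It assumes the downloaded files are in the current directory and have names containing ".chromosome." followed by the chromosome label, such as "Gallus_gallus.GRCg6a.dna.chromosome.1.fa.gz". That pattern is an assumption, so check your actual file names first. The function name shorten_and_unpack is made up for this example.

```shell
# Sketch: rename "<anything>.chromosome.1.fa.gz" to "chr1.fa.gz", then decompress.
# The "*.chromosome.*" pattern is an ASSUMPTION about the downloaded file names.
shorten_and_unpack() {
    for f in *.chromosome.*.fa.gz; do
        [ -e "$f" ] || continue        # nothing matched: skip
        name="${f##*.chromosome.}"     # keep the part after ".chromosome.", e.g. "1.fa.gz"
        mv -- "$f" "chr${name}"        # becomes "chr1.fa.gz"
    done
    for g in chr*.fa.gz; do
        [ -e "$g" ] || continue
        gunzip -- "$g"                 # leaves "chr1.fa", "chrZ.fa", ...
    done
}
```

After moving into the directory with the downloaded files, run "shorten_and_unpack" and then list the directory to confirm that the chr*.fa files are there.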

Building Indexes

Windows users start WSL by typing "wsl" in Command Prompt. Mac users open Terminal.

Let's say the .fa files are in a directory named "genomeseq" inside your account's Documents directory. Change the current directory to "genomeseq" with the "cd" command.

For Windows Users:

cd /mnt/c/Users/XXXXX/Documents/genomeseq

For Mac Users:

cd /Users/XXXXX/Documents/genomeseq

Replace XXXXX with your account name. Let's also say the hisat2-2.1.0 folder is in your account's Documents directory. Then you can build the indexes for the organism with the command below.

With hisat2-2.2.0 or later, if the genome is longer than 4 billion nucleotides, use the hisat2-build-l command instead. Otherwise, hisat2-build or hisat2-build-s is fine.
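If you are not sure whether your genome crosses that size, you can count the nucleotides in the .fa files. This is a minimal sketch. The function name count_bases is made up for illustration, and it counts every character on the non-header lines, including N.

```shell
# Count sequence characters in the given FASTA files:
# drop header lines (starting with ">") and line breaks, then count what is left.
count_bases() {
    cat -- "$@" | grep -v '^>' | tr -d '\r\n' | wc -c | tr -d ' '
}
# Example (run inside "genomeseq"): count_bases chr*.fa
```

Compare the printed number with 4000000000 to decide which command to use.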

For Windows Users:

/mnt/c/Users/XXXXX/Documents/hisat2-2.1.0/hisat2-build -f chr1.fa,chr2.fa,chr3.fa,,,chrZ.fa Gallus_gallus_GRCg6a

For Mac Users:

[PATH]/hisat2-2.1.0/hisat2-build -f chr1.fa,chr2.fa,chr3.fa,,,chrZ.fa Gallus_gallus_GRCg6a

Mac users have to determine [PATH] for their own system. Replace the ",,," part with the names of all the remaining .fa files, separated by commas. Change the final "Gallus_gallus_GRCg6a" to any text without spaces that identifies the organism and genome version.
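Typing every file name by hand is error-prone, so you may prefer to let the shell build the comma-separated list. This is a sketch, assuming all chromosome files in the current directory match chr*.fa. The function name fa_list is made up for this example.

```shell
# Join all chr*.fa file names in the current directory with commas.
fa_list() {
    printf '%s\n' chr*.fa | paste -sd, -
}
# Example: print the list and check it before using it.
#   fa_list
# Then use it in place of the hand-typed list, for example:
#   [PATH]/hisat2-2.1.0/hisat2-build -f "$(fa_list)" Gallus_gallus_GRCg6a
```

The files are listed in the shell's sort order (chr1, chr10, chr11, ...), not in numeric order.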

Wait until the command finishes. You will then find the .ht2 files in the "genomeseq" folder.
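To confirm the build finished cleanly, you can count the index files. As far as I know, hisat2-build normally writes eight files, named from "<base>.1.ht2" to "<base>.8.ht2" (large indexes built with hisat2-build-l use the .ht2l extension instead). A small sketch, with a made-up function name:

```shell
# Count the .ht2 index files that share a given base name.
count_ht2() {
    ls "$1".*.ht2 2>/dev/null | wc -l | tr -d ' '
}
# Example (run inside "genomeseq"): count_ht2 Gallus_gallus_GRCg6a
```

If the number is smaller than expected, the build was probably interrupted, and you should run it again.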