Get Hadoop set up

Maven is the preferred build tool for Hadoop, Hadoop-BAM, and HadoopCNV. Please visit https://maven.apache.org/ if you do not already have Maven installed. Maven simplifies building a project by automatically resolving its dependencies on other projects and building those dependencies first.

You will want to install Hadoop on your cluster if you have not done so already. More information can be found here:

https://hadoop.apache.org/

Our software relies on an open-source package called Hadoop-BAM to parse binary BAM files and split them for our Mappers. Hadoop-BAM can be obtained from

http://sourceforge.net/projects/hadoop-bam/
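Hadoop-BAM itself is built with Maven as well. A minimal sketch, assuming you have downloaded and unpacked the source into a directory named Hadoop-BAM-master, might look like:

# build Hadoop-BAM and its bundled dependencies (skipping tests to save time)
cd Hadoop-BAM-master
mvn package -DskipTests
# the jar-with-dependencies produced under target/ is the JAR that HADOOP_CLASSPATH should point to later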

Please visit the Hadoop configuration page for information on how to configure Hadoop. This page is intended for systems administrators.

Downloads

Obtain source and build the project

Go to a directory where you want the HadoopCNV project to reside. You can pull the code from Git via HTTPS, SSH, or Subversion using the clone URL on the main page of this repository. Change into the new directory HadoopCNV. This directory will be referred to hereafter as PROJECT_ROOT (i.e. the project root directory). In PROJECT_ROOT, open pom.xml to make any necessary edits. In particular, you may want to change the value of the hadoop.version XML tag if you are running a newer version of Hadoop. This ensures that Maven pulls in the correct Hadoop libraries for your installation.
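For example, if your cluster runs Hadoop 2.7.3, the relevant property in pom.xml might look like the snippet below (the version number is only a placeholder, and the exact placement of the tag depends on how the pom is laid out):

<properties>
  <hadoop.version>2.7.3</hadoop.version>
</properties>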

To build the project, simply enter

mvn package

and the compiled classes and JAR file will be generated in PROJECT_ROOT/target. You can also optionally generate HTML documentation for the various Java source files; it will be placed in PROJECT_ROOT/target/site/apidocs with an invocation of

mvn javadoc:javadoc

Fetch a small dataset

Let's get started with a small example dataset available from the 1000 Genomes Project, namely chromosome 22 data for sample NA12878. We will need two files from the project and will download them into PROJECT_ROOT/example. The VCF file can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz (extract the NA12878 sample from it) and the BAM file can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.
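One way to fetch these files and pull out the NA12878 genotypes is sketched below; it assumes that wget and bcftools are available on your system, and the output file name NA12878.chr22.vcf.gz is just an example:

cd PROJECT_ROOT/example

# download the high-coverage chr22 BAM for NA12878
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam

# download the chr22 genotype VCF and keep only the NA12878 sample
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools view -s NA12878 -Oz -o NA12878.chr22.vcf.gz ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz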

Preprocessing

1) Please note that because HadoopCNV collects read depth information from BAM files, any region with no read coverage would otherwise be ignored. To address this, we provide a baseline BAM file for hg19 that ensures all valid genomic regions (excluding 'N' bases) are covered: 'example/hg19.bam.extra'.

For example, if your BAM file is called 'NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam', you should rename 'hg19.bam.extra' to 'NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra' and change the 'BAM_FILE' parameter in your config.txt to 'NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.*'. The additional baseline file is then included automatically, as sketched below.
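As a concrete sketch of the renaming step (paths assume you are in PROJECT_ROOT and that your BAM files sit next to the baseline file; adjust as needed):

# copy the hg19 baseline so its name matches your BAM file name plus the '.extra' suffix;
# the widened BAM_FILE wildcard in config.txt then picks it up automatically
cp example/hg19.bam.extra example/NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra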

2) For hg38 or other genomes, we provide a Python script, preprocessBAM.py, to generate this baseline BAM file from FASTA files.

First, put all FASTA files of the genome of interest into a folder ./fasta

Then run the script with python preprocessBAM.py <prefix> ./fasta. A SAM file named '<prefix>.extra.sam' will be generated. Then use Samtools to convert it to a BAM file:

samtools view -bt fasta/chrall.fa.fai <prefix>.extra.sam > NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra

NOTICE: <prefix> can be any name you choose for the file. 'fasta/chrall.fa.fai' is the FASTA index covering all chromosomes.

If you do not have this index file in your './fasta' directory, please first concatenate all FASTA files into one:

cat ./fasta/*.fa > ./fasta/chrall.fa

Then use Samtools to generate index:

samtools faidx ./fasta/chrall.fa

Configure HadoopCNV to use this data

Go to the directory where you saved the BAM and VCF files. Upload the BAM file to HDFS. This can be done with the command

hdfs dfs -put <BAMFILE> <TARGET_DIR>

where <BAMFILE> is the name of the BAM file you just downloaded and <TARGET_DIR> is the destination directory on HDFS. In our Hadoop configuration, we assigned <TARGET_DIR> the value /bamfiles.
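For the example dataset, and assuming the /bamfiles target directory from our configuration, the upload might look like:

# create the target directory on HDFS (if needed) and upload the BAM together with its .extra baseline file,
# so that the BAM_FILE wildcard in config.txt can match both
hdfs dfs -mkdir -p /bamfiles
hdfs dfs -put NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam /bamfiles/
hdfs dfs -put NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra /bamfiles/
hdfs dfs -ls /bamfiles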

Change to the PROJECT_ROOT/example directory. Move the VCF file into the directory called vcfs. In the VcfLookup component, HadoopCNV reads this VCF file and stores the necessary fields from it in the output directory of the Depth Caller. The purpose of this step is to provide the Reducer of the Binner component with information on which sites (among all possible sites generated by the Depth Caller) are actually known SNPs (based on inferences from a reliable variant caller).

Next, open config.txt for editing. Make sure the values of the BAM_FILE and VCF_FILE fields reflect the correct paths of the BAM file on HDFS and of the VCF file on your local file system, respectively. You will also want to ensure that RUN_DEPTH_CALLER, INIT_VCF_LOOKUP, RUN_BINNER, and RUN_CNV_CALLER are set to true. You can leave RUN_GLOBAL_SORT set to false; this last component simply sorts the output from RUN_DEPTH_CALLER and is mainly helpful for debugging. If you later want to tweak settings in the HMM or the Binner, the first components (RUN_DEPTH_CALLER and INIT_VCF_LOOKUP) can be set to false, because the depth calls are independent of such settings and will still be persisted on HDFS. There are other run-time settings in the configuration file that are not critical, although you may want to adjust them depending on the hardware resources of your cluster. Further explanations are provided as comments in the example config.txt file.

If you want to tune the calling behavior, the only parameters you should need to adjust are 'ABBERATION_PENALTY' and 'TRANSITION_PENALTY'. Increase 'ABBERATION_PENALTY' to make CNV calls more stringent; increase 'TRANSITION_PENALTY' so that only longer calls are reported.
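For illustration only, the relevant entries in config.txt for this example might look roughly like the excerpt below. The key names come from the documentation above, but the exact key/value syntax, paths, and any fields not shown here should follow the example config.txt shipped with the project:

# BAM on HDFS (wildcard also matches the .extra baseline) and VCF on the local file system
BAM_FILE=/bamfiles/NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.*
VCF_FILE=vcfs/NA12878.chr22.vcf.gz
# components to run
RUN_DEPTH_CALLER=true
INIT_VCF_LOOKUP=true
RUN_BINNER=true
RUN_CNV_CALLER=true
RUN_GLOBAL_SORT=false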

Advanced configuration

There are also other settings that can be used to customize HadoopCNV. These settings are stored in <PROJECT_ROOT>/src/main/java/Constants.txt. Parameters include the overall read mapping-quality threshold for inclusion in depth counts, the base-specific quality-score threshold, the bin size in base pairs, the minimum proportion of heterozygous sites (for LOH detection), the sensitivity to copy-number aberrations (i.e. CN!=2), the transition penalties (i.e. bias towards larger or smaller CNVs), and the mixture parameter governing the influence of BAF information. More details can be found in the Java comments. Please note that changes to these settings take effect only after re-compiling HadoopCNV, since they pertain to the Mappers and Reducers, which are by definition stateless. As a reminder, compile with the command:

mvn package

Run the example dataset

You should be ready to run this example. But before we do, let's make sure the environment variable HADOOP_CLASSPATH is set correctly. You will want to ensure HADOOP_CLASSPATH is pointing to the path of the main JAR file from the Hadoop BAM installation. This way the slave nodes will also be able to locate the dependencies of HadoopCNV. For example, in our installation, our Bash initialization script ~/.bash_profile includes a line reading

export HADOOP_CLASSPATH=/home/hadoop/hadoop/Hadoop-BAM-master/target/hadoop-bam-7.0.1-SNAPSHOT-jar-with-dependencies.jar

You may want to add this into your initialization script as well. Verify that the variable is set correctly by logging out, logging in, and typing

echo $HADOOP_CLASSPATH

At this point, you should be able to run HadoopCNV with a call to

./run.sh

This example will generate a CNV inference file called results.txt.

The output of the CNV inference file may look something like

[kaiwang@compute-0-0 results]$ head results.txt 
chr15   20000007        20000099        -0.6997057      2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000100        20000199        -0.39941147     2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000200        20000299        -0.14916627     2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000300        20000399        9.808615E-4     2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000400        20000499        0.051029906     2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000500        20000599        9.808615E-4     2       2       0.0     0.0     0.0     0.0     0.0     0.0
chr15   20000600        20000699        0.10107895      2       2       0.0     0.0     0.0     0.0     0.0     0.0

The columns are interpreted as: chromosome name, start base of the bin, end base of the bin, median depth count (normalized to [-1.0, 1.0]), HMM state (0 = total deletion, 1 = single deletion, 2 = copy-neutral LOH, 3 = normal, 4 = single-copy amplification, 5 = double-copy amplification), copy number, MSE_BAF_0, MSE_BAF_1, MSE_BAF_2, MSE_BAF_3, MSE_BAF_4, MSE_BAF_5. For more information regarding the last six fields, please refer to our manuscript. Briefly, these are the mean squared errors for each of the six states conditional on the BAF model for that state. Lower values signify a better fit for that particular state's hypothesis.
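As a quick sanity check on the bin-level output, you can list the bins assigned a non-normal HMM state (a sketch using standard awk; column 5 is the HMM state and 3 is the normal state, as described above):

# print bins whose HMM state (column 5) differs from the normal state 3
awk '$5 != 3' results.txt | head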

Then run

perl compile_penncnv3.pl < results.txt > out.cnv

to generate the final output CNV file, which looks like:

deletion    chr15:21101500-21299999 CN:1
deletion    chr15:25309100-25337899 CN:1
deletion    chr15:27340000-27340999 CN:1
deletion    chr15:28342100-28360899 CN:1
deletion    chr15:33022100-33520999 CN:1
deletion    chr15:37597000-37605899 CN:1
deletion    chr15:40612700-40613899 CN:1

The first column indicates the type of CNV (deletion, duplication, or loh), the second column gives the region in "chromosome:start-end" format, and the third column indicates the copy number.
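To get a quick overview of the final calls, you can count them by type, for example (a simple sketch assuming whitespace-separated columns as shown above):

# count CNV calls by type (deletion, duplication, loh)
awk '{print $1}' out.cnv | sort | uniq -c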