Get Hadoop set up
Maven is the preferred compilation tool for Hadoop, Hadoop-BAM, and HadoopCNV. Please visit https://maven.apache.org/ if you do not already have Maven installed. Maven facilitates building projects by automatically resolving their dependencies on other projects and building those dependencies first.
You will want to install Hadoop on your cluster if you have not done so already. More information can be found here:
Our software relies upon an open source package called Hadoop BAM to parse the BAM binary files and split these for our Mappers. Hadoop BAM can be obtained from
Please visit the Hadoop configuration page for information on how to configure Hadoop. This page is intended for systems administrators.
Obtain source and build the project
Go to a directory where you want the HadoopCNV project to reside. You can pull the code from Git via HTTPS, SSH, or Subversion using the clone URL on the main page of this repository. Change into the new directory HadoopCNV. This directory will be referred to hereafter as
PROJECT_ROOT (i.e. the project root directory). In
PROJECT_ROOT, open pom.xml to make any necessary edits. In particular, you may want to change the value of the XML tag hadoop.version if you are running a newer version of Hadoop. This ensures that Maven pulls in the correct Hadoop libraries for your installation.
To build the project, simply enter
and the compiled output will be generated in
PROJECT_ROOT/target. You can also optionally generate HTML documentation for the various Java source files; these pages will be generated in
PROJECT_ROOT/target/site/apidocs with an invocation of
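Concretely, the build and documentation steps are the standard Maven invocations (assuming the default lifecycle configured in pom.xml):

```shell
# From PROJECT_ROOT: compile, run tests, and package into target/
mvn package

# Optional: generate the Javadoc HTML, typically placed in target/site/apidocs
mvn javadoc:javadoc
```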
Fetch a small dataset
Let's get started with a small example dataset available from the 1000 Genomes Project, namely chromosome 22 data for sample NA12878. We will need two files from the project, which we will download into
PROJECT_ROOT/examples. The VCF file can be found at
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz (extract NA12878 out) and the BAM file can be found at
1) Please note that, as HadoopCNV collects read depth information from BAM files, any uncovered region will be neglected. To solve this problem, we made a baseline BAM file for hg19 to make sure all valid genomic regions (excluding 'N') are covered: 'example/hg19.bam.extra'.
For example, if your file is called 'NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam', you should rename 'hg19.bam.extra' to 'NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra' and change the 'BAM_FILE' parameter in your config.txt to 'NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.*'. The additional file will then be included automatically.
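The renaming step can be sketched as a single copy (file names here follow the 1000 Genomes example above and are placeholders for your own sample):

```shell
# Copy the baseline file so its name matches your sample BAM, plus '.extra';
# the BAM_FILE wildcard in config.txt will then match both files.
cp hg19.bam.extra NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra
```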
2) For hg38 or other genomes, we provide a Python script, preprocessBAM.py, to generate this baseline BAM file from FASTA files.
First, put all fasta files of a specific genome into a folder, e.g. './fasta'.
Then run the script:
python preprocessBAM.py <prefix> ./fasta
A SAM file will be generated as '<prefix>.extra.sam'.
Then use Samtools to convert it to a BAM file:
samtools view -b -t fasta/chrall.fa.fai <prefix>.extra.sam > NA12878.chrom.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.extra
NOTICE: <prefix> can be any name you give to the file. 'fasta/chrall.fa.fai' is the fasta index associated with all chromosomes.
If you don't have this file in your './fasta' dir, please first concatenate all fasta files into one:
cat ./fasta/*.fa > ./fasta/chrall.fa
Then use Samtools to generate index:
samtools faidx ./fasta/chrall.fa
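preprocessBAM.py ships with the repository; as a rough illustration of the idea behind it (not the script's actual implementation), synthetic reads covering every non-'N' stretch of the reference can be emitted as SAM records like this:

```python
import re

def baseline_sam_records(chrom, seq, read_len=100):
    """Return SAM alignment lines that tile every non-'N' run of seq,
    so that no valid (non-'N') position is left with zero read depth."""
    records = []
    for m in re.finditer(r"[^Nn]+", seq):
        pos = m.start()
        while pos < m.end():
            length = min(read_len, m.end() - pos)
            # SAM fields: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
            records.append("\t".join([
                f"baseline_{chrom}_{pos + 1}", "0", chrom, str(pos + 1),
                "0", f"{length}M", "*", "0", "0",
                seq[pos:pos + length].upper(), "I" * length,
            ]))
            pos += length
    return records

# Tiny demo: two covered runs separated by an 'N' gap.
recs = baseline_sam_records("chr22", "ACGTACGT" + "N" * 5 + "GGCC", read_len=4)
```

The key point is that 'N' runs are skipped entirely, so the baseline reads cover exactly the valid genomic positions.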
Configure HadoopCNV to use this data
Go to the directory where you saved the BAM and VCF files. Upload the BAM file into HDFS. This can be done with the command
hdfs dfs -put <BAMFILE> <TARGET_DIR>
Change to the
PROJECT_ROOT/example directory. Move the VCF file into the directory called
vcfs. In the VcfLookup component, HadoopCNV will read this VCF file, and store necessary fields from this file into the output directory of the Depth Caller. The purpose of this step is to provide the Reducer of the Binner component with information on which sites (among all possible sites generated by the Depth Caller) are actually known SNPs (based on inferences from a reliable variant caller).
Next, open up
config.txt for editing. Be sure the values of the fields BAM_FILE and VCF_FILE reflect the correct paths of the BAM file and VCF file on HDFS and on your local file system, respectively. You will also want to ensure that the values for RUN_DEPTH_CALLER, INIT_VCF_LOOKUP, RUN_BINNER, and RUN_CNV_CALLER are set to true. You can leave RUN_GLOBAL_SORT set to false; this last component simply sorts the output from RUN_DEPTH_CALLER and is helpful for debugging purposes. However, if you want to tweak settings in the HMM or the Binner, the earlier components can be set to false, as the depth calls are independent of such settings and will still be persisted on HDFS. There are other run-time settings in the configuration file that are not critical, although you may want to tweak them depending on the hardware resources of your cluster. Further explanations are provided as comments in the example configuration file.
If you want to change anything, the only parameters you should adjust are 'ABBERATION_PENALTY' and 'TRANSITION_PENALTY'. Increase 'ABBERATION_PENALTY' to make CNV calls more stringent; increase 'TRANSITION_PENALTY' so that only longer calls are considered.
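Putting the settings discussed above together, a config.txt fragment might look like the following. This is a hypothetical sketch assuming simple KEY=VALUE syntax; consult the shipped example config.txt for the exact format, and note that all paths below are placeholders:

```
# Placeholder paths -- substitute your own HDFS and local locations.
BAM_FILE=/user/yourname/NA12878.chrom*.ILLUMINA.bwa.CEU.high_coverage.20100311.bam.*
VCF_FILE=example/vcfs/NA12878.chr22.vcf
RUN_DEPTH_CALLER=true
INIT_VCF_LOOKUP=true
RUN_BINNER=true
RUN_CNV_CALLER=true
RUN_GLOBAL_SORT=false
# Optional tuning knobs (see above): ABBERATION_PENALTY, TRANSITION_PENALTY
```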
There are also other settings that HadoopCNV can be customized with. These settings are stored in
<PROJECT_ROOT>/src/main/java/Constants.txt. Parameters include the overall read mapping quality threshold for inclusion in depth counts, the base-specific quality score threshold, the bin size in base pairs, the minimum proportion of heterozygous sites (for LOH detection), sensitivity to copy number aberrations (i.e. CN!=2), transition penalties (i.e. bias towards larger or smaller CNVs), and the mixture parameter for the influence of BAF information. More details can be found in the Java comments. Please note that changes to these settings take effect only after re-compiling HadoopCNV, since they are baked into the Mappers and Reducers, which are by definition stateless. As a reminder, recompile with the same Maven command used to build the project.
Run the example dataset
You should be ready to run this example. But before we do, let's make sure the environment variable HADOOP_CLASSPATH is set correctly. You will want to ensure that HADOOP_CLASSPATH points to the path of the main JAR file from the Hadoop-BAM installation; this way the slave nodes will also be able to locate the dependencies of HadoopCNV. For example, in our installation, our Bash initialization script
~/.bash_profile includes a line reading
You may want to add this into your initialization script as well. Verify that the variable is set correctly by logging out, logging in, and typing
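Both the initialization line and the verification step might look like the following (the Hadoop-BAM jar path is an assumption; substitute your installation's actual path):

```shell
# In ~/.bash_profile -- the jar path below is a placeholder:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/hadoop-bam/hadoop-bam-7.10.0-jar-with-dependencies.jar

# After logging back in, verify the variable is set:
echo $HADOOP_CLASSPATH
```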
At this point, you should be able to run HadoopCNV with a call to
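As a purely hypothetical sketch (the jar name and argument layout are assumptions; use the actual jar produced under PROJECT_ROOT/target and the documented entry point for your HadoopCNV version), the invocation usually takes the form of a `hadoop jar` call:

```shell
# Hypothetical -- substitute the real jar name from PROJECT_ROOT/target
# and the configuration file you edited above.
hadoop jar target/HadoopCNV-1.0-SNAPSHOT.jar config.txt
```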
This example will generate a CNV inference file called results.txt.
The output of the CNV inference file may look something like
[kaiwang@compute-0-0 results]$ head results.txt
chr15 20000007 20000099 -0.6997057 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000100 20000199 -0.39941147 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000200 20000299 -0.14916627 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000300 20000399 9.808615E-4 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000400 20000499 0.051029906 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000500 20000599 9.808615E-4 2 2 0.0 0.0 0.0 0.0 0.0 0.0
chr15 20000600 20000699 0.10107895 2 2 0.0 0.0 0.0 0.0 0.0 0.0
The columns are interpreted as: chromosome name, start base for the bin, end base for the bin, median depth count (normalized to [-1.0,1.0]), HMM state (0=total deletion, 1=single deletion, 2=copy-neutral LOH, 3=normal, 4=single copy amplification, 5=double copy amplification), copy number, MSE_BAF_0, MSE_BAF_1, MSE_BAF_2, MSE_BAF_3, MSE_BAF_4, MSE_BAF_5. For more information regarding the last six fields, please refer to our manuscript. Briefly, these are mean squared errors for each of the six states conditional on the BAF model for that state. Lower values signify a better fit for that particular state's hypothesis.
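For downstream scripting, one line of results.txt can be parsed as follows. This is a minimal sketch: the field order is taken from the column description above, and the non-zero MSE values in the demo line are made up purely for illustration:

```python
def parse_results_line(line):
    """Split one results.txt line into named fields (order as documented)."""
    f = line.split()
    return {
        "chrom": f[0],
        "start": int(f[1]),
        "end": int(f[2]),
        "median_depth": float(f[3]),            # normalized to [-1.0, 1.0]
        "hmm_state": int(f[4]),
        "copy_number": int(f[5]),
        "mse_baf": [float(x) for x in f[6:12]],  # MSE_BAF_0 .. MSE_BAF_5
    }

# Demo line with fabricated MSE values for illustration only.
line = "chr15 20000007 20000099 -0.6997057 2 2 0.0 0.1 0.2 0.3 0.4 0.5"
rec = parse_results_line(line)

# Lower MSE means a better fit for that state's BAF hypothesis.
best_state = min(range(6), key=lambda s: rec["mse_baf"][s])
```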
Run
perl compile_penncnv3.pl < results.txt > out.cnv
to generate the final output CNV file, which looks like:
deletion chr15:21101500-21299999 CN:1
deletion chr15:25309100-25337899 CN:1
deletion chr15:27340000-27340999 CN:1
deletion chr15:28342100-28360899 CN:1
deletion chr15:33022100-33520999 CN:1
deletion chr15:37597000-37605899 CN:1
deletion chr15:40612700-40613899 CN:1
The first column indicates the type of CNV (deletion, duplication, or loh), the second column is in "chromosome:start-end" format, and the third column indicates the copy number.