Alignment algorithms
In the first step after getting the fastq files. The alignment step begins to work. There are multiple alignment algorithms which you can use in HiCPack simply by setting the ALIGNMENT_ALGORITHM
in HiCPack config.
NOTE: for now, Bowtie2 is just supported by HiCPack. However, bwa after testing and debugging process will be supported ASAP.
Bowtie 2
Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index (based on the Burrows-Wheeler Transform or BWT) to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and paired-end alignment modes. Multiple processors can be used simultaneously to achieve greater alignment speed.
You can easily use bowtie2 in HiCPack by passing your optional arguments for global(BOWTIE2_GLOBAL_OPTIONS
) and local(BOWTIE2_LOCAL_OPTIONS
) alignment. Bowtie also needs the path to indexes of reference genome so you can set this by passing the absolute path to BOWTIE2_IDX_PATH
in HiCPack config.
Burrows-Wheeler Aligner (BWA)
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
Alignment Results
The bowtie_results
folder contains the results of the reads mapping. The results of first mapping step are available in the bwt2_glob
folder, and the 2nd step in the bwt2_loc
folder. Final BAM files, reads pairing, and mapping statistics are available on the bwt2
folder.
Mapping statistics are available in the .mapstat
files. The total mapping statistics per sample are available in the 'SAMPLE_NAME'.mmapstat
file.
Usually, a high fraction of reads is expected to be aligned on the genome (80-90%). Among them, we usually observed a few percent (around 10%) of step 2 aligned reads. Those reads are chimeric fragments for which we detect a ligation junction. An abnormal level of chimeric reads can reflect a ligation issue during the library preparation.
Once R1 and R2 reads are aligned on the genome, HiCPack reconstruct the pairs information. The pairing statistics are available in the .pairstat
files. The combined pairing statistics per sample are available in the 'SAMPLE_NAME'.mpairstat
file.
The fraction of singleton or multi-hits depends on the genome complexity and the fraction of unmapped reads. The fraction of singleton is usually close to the sum of unmapped R1 and R2 reads, as it is unlikely that both mates from the same pair were unmapped.
The hic_results
folder contains all Hi-C processed data, and is further divided in several sub-folders. The hic_results/data
folder is used to store the valid interaction products (.validPairs
), as well as other statisics files.
The validPairs are stored using a simple tab-delimited text format. One validPairs file is generated per reads chunk. These files are then merged in the allValidPairs, and duplicates are removed if specified in the configuration file.
Statistics Results
Statistics about read pairs filtering are available in the hic_results/stat/
folder. This folder has a sub-folder for each sample and each sub-folder consists of 'SAMPLE_NAME'.mRSstat
file and .mmapstat
files. The ligation efficiency can be assessed using the filtering of valid and invalid pairs. As the ligation is a random process, 25% of each valid ligation class is expected. In the same way, a high level of dangling-end or self-circle read pairs is associated with a low quality experiment, and reveals a problem during the digestion, fill-in or ligation steps.
In case of allele specific analysis, an additional statistics file assplit.stat
is available with the fraction of read pairs assigned to the parental genomes. Note that allele specific assignment is also available from the .RSstat
files. Statistics from the assplit.stat
are usually different from the ones in the .RSstat
as they are calculated on the valid pairs after duplicates removal, whereas statistics on the .RSstat
files are calculated for each read chunk before duplicates removal.