Picard tools CollectHsMetrics

Posted by Joselynn Wallace on June 3, 2019

Picard tools are a set of tools for manipulating files (SAM, VCF, etc.) from high-throughput sequencing experiments, including several modules for collecting and reporting various quality metrics. This blog post details how I ran Picard tools’ CollectHsMetrics on an exome sequencing library.

You can find more details on Picard tools on github. According to the documentation, CollectHsMetrics requires the following input:

  1. Alignment file (.BAM or .SAM)
  2. Reference sequence .fasta file
  3. List of target intervals
  4. List of bait intervals

While the alignment and reference .fasta file should be easy to find, the bait and target intervals can be a little more challenging.

Download baits and intervals files

In this particular case, the capture kit used was the Agilent SureSelect Human All Exon v5 kit, so I went here and signed up for an account. After signing in, I was directed to the following landing page, where I clicked on the ‘Find Designs’ tab at the top middle of the page and clicked ‘SureSelect DNA’ from the dropdown menu that appeared.

When I arrived at the next page, I selected the tab for ‘Agilent Catalog’ and checked the box for H. sapiens (hg19) in the left menu. The results table did not populate until I also checked the box for ‘Design Category’ (also in the left menu).

Then I clicked the link for ‘SureSelect Human All Exon V5’ (highlighted in blue in the above figure) and another popup appeared, which had a pull-down menu to indicate which genome build I wanted to view. Again, I selected hg19 and clicked ‘View Design Details’ and clicked the download link.

I downloaded all of the files, since at this point I wasn’t sure which file would be the baits and which would be the intervals.

Which files are baits and intervals?

To get a better sense of what was in the .bed files, I looked at the first few lines of each file using the head command and saw the following information in the first few lines of each file:

The second line returned from head -10 S04380110_Padded.bed was:

The second line returned from `head -10 S04380110_Covered.bed` was:  
```track name="Covered" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5 - Genomic regions covered by probes".```

The second line returned from `head -10 S04380110_Regions.bed` was:   
```track name="Target Regions" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5"```      

Based on these descriptions, I used `S04380110_Covered.bed` as the baits and `S04380110_Regions.bed` as the targets files.    

**Convert baits and intervals and running CollectHsMetrics**
`CollectHsMetrics` expects the baits and targets files to be in interval list format, so I used Picard tools `BedToIntervalList` tool to convert the .bed files. `BedToIntervalList` requires a sequence dictionary file. If you don't already have access to a sequence dictionary file, it can be created with Picard tools `CreateSequenceDictionary`. I converted the .bed files to intervals with the following bash script:   

#SBATCH -t 8:00:00
#SBATCH --mem=4GB
#SBATCH -n 8
#SBATCH -J picardtools_BedToIntervalList

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda

java -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar BedToIntervalList \
I=S04380110_Covered.bed \
O=S04380110_Covered.interval_list \

After the BedToIntervalList script finishes running, you can run CollectHsMetrics:

#SBATCH -t 8:00:00
#SBATCH --mem=10GB
#SBATCH -n 32
#SBATCH -J picardtools_CollectHsMetrics

source /gpfs/runtime/cbc_conda/bin/activate_cbc_conda

java -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar CollectHsMetrics \
I=alignment.bam \
O=output_hs_metrics.txt \
R=/gpfs/data/cbc/refchef_refs/broad_bundle_hg19/primary/ucsc.hg19.fasta \
BAIT_INTERVALS=S04380110_Covered.interval_list \

Now you will have information to access the performance of your exome sequencing experiment. For more details on how to interpret the output from CollectHsMetrics, click here.

Joselynn Wallace

Joselynn is a genomics data scientist. Before coming to Brown, she got her PhD from the University of Rhode Island. Her background is primarily in studying microbes.