Picard tools are a set of tools for manipulating files (SAM, VCF, etc.) from high-throughput sequencing experiments, including several modules for collecting and reporting various quality metrics. This blog post details how I ran Picard tools’ `CollectHsMetrics` on an exome sequencing library.
You can find more details on Picard tools on GitHub. According to the documentation, `CollectHsMetrics` requires the following input:
- Alignment file (.BAM or .SAM)
- Reference sequence .fasta file
- List of target intervals
- List of bait intervals
While the alignment file and the reference .fasta file should be easy to find, the bait and target intervals can be a little more challenging to track down.
**Download the bait and target interval files**
In this particular case, the capture kit used was the Agilent SureSelect Human All Exon v5 kit, so I went here and signed up for an account. After signing in, I was directed to the following landing page, where I clicked on the ‘Find Designs’ tab at the top middle of the page and clicked ‘SureSelect DNA’ from the dropdown menu that appeared.
When I arrived at the next page, I selected the tab for ‘Agilent Catalog’ and checked the box for H. sapiens (hg19) in the left menu. The results table did not populate until I also checked the box for ‘Design Category’ (also in the left menu).
Then I clicked the link for ‘SureSelect Human All Exon V5’ (highlighted in blue in the above figure) and another popup appeared, which had a pull-down menu to indicate which genome build I wanted to view. Again, I selected hg19, clicked ‘View Design Details’, and then clicked the download link.
I downloaded all of the files, since at this point I wasn’t sure which file would be the baits and which would be the targets.
**Which files are the baits and which are the targets?**
To get a better sense of what was in the .bed files, I looked at the first few lines of each file using the `head` command:
The second line returned from `head -10 S04380110_Padded.bed` was:
The second line returned from `head -10 S04380110_Covered.bed` was:
```track name="Covered" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5 - Genomic regions covered by probes".```
The second line returned from `head -10 S04380110_Regions.bed` was:
```track name="Target Regions" description="Agilent SureSelect DNA - SureSelectXT Human All Exon V5"```
Based on these descriptions, I used `S04380110_Covered.bed` as the baits file and `S04380110_Regions.bed` as the targets file.
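The same check can be done in one pass rather than file by file. Here is a minimal sketch, assuming all of the downloaded .bed files sit in one directory; the helper name `show_track_lines` is hypothetical and not part of any toolkit:

```shell
# Print the "track" header line of every .bed file in a directory,
# prefixed with the file name, so the descriptions can be compared side by side.
# show_track_lines is a hypothetical helper name.
show_track_lines() {
    dir="${1:-.}"
    for bed in "$dir"/*.bed; do
        # Skip the unexpanded pattern if the directory has no .bed files.
        [ -e "$bed" ] || continue
        # Only inspect the first 10 lines, as with head -10.
        head -10 "$bed" | awk -v f="$(basename "$bed")" '/^track/ { print f ": " $0 }'
    done
}
```

Calling `show_track_lines /path/to/downloads` prints one line per file that has a track header, which makes it easier to tell the padded, covered, and target-region files apart.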
**Converting the baits and targets and running CollectHsMetrics**
`CollectHsMetrics` expects the baits and targets files to be in interval list format, so I used Picard tools' `BedToIntervalList` to convert the .bed files. `BedToIntervalList` requires a sequence dictionary file. If you don't already have one, you can create it with Picard tools' `CreateSequenceDictionary`. I converted the .bed files to interval lists with the following bash script:
#SBATCH -t 8:00:00
#SBATCH -n 8
#SBATCH -J picardtools_BedToIntervalList
java -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar BedToIntervalList \
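For reference, `BedToIntervalList` takes an input .bed file (`I=`), an output path (`O=`), and a sequence dictionary (`SD=`). Below is a minimal sketch of the conversion, assuming Picard 2.x `KEY=value` argument syntax. The jar path `picard.jar` and the dictionary name `reference.dict` are placeholders, and the `run` and `bed_to_interval_list` helpers and the `DRY_RUN` switch are illustrative additions, not part of Picard:

```shell
# Placeholder paths (assumptions); override them via the environment.
PICARD_JAR="${PICARD_JAR:-picard.jar}"
SEQ_DICT="${SEQ_DICT:-reference.dict}"

# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert one .bed file into an interval list written next to it,
# e.g. foo.bed -> foo.interval_list (hypothetical helper).
bed_to_interval_list() {
    run java -Xmx1000M -jar "$PICARD_JAR" BedToIntervalList \
        I="$1" \
        O="${1%.bed}.interval_list" \
        SD="$SEQ_DICT"
}
```

With this in place, `bed_to_interval_list S04380110_Covered.bed` would write `S04380110_Covered.interval_list`, and setting `DRY_RUN=1` first prints the command so it can be checked before submitting the job.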
After the `BedToIntervalList` script finishes running, you can run `CollectHsMetrics`:
#SBATCH -t 8:00:00
#SBATCH -n 32
#SBATCH -J picardtools_CollectHsMetrics
java -Xmx1000M -jar /gpfs/runtime/cbc_conda/cbc_conda_v1_root/envs/cbc_conda_v1/share/picard-2.13.2-1/picard.jar CollectHsMetrics \
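`CollectHsMetrics` takes the alignment file (`I=`), an output path (`O=`), the reference sequence (`R=`), and the two interval lists (`BAIT_INTERVALS=` and `TARGET_INTERVALS=`). The following is a sketch only, again assuming Picard 2.x `KEY=value` syntax. Every file name is a placeholder: the interval list names simply assume the converted files kept the base names of the .bed files, and the `run` and `collect_hs_metrics` helpers are hypothetical:

```shell
# Placeholder paths (assumptions); override them via the environment.
PICARD_JAR="${PICARD_JAR:-picard.jar}"
REFERENCE="${REFERENCE:-reference.fasta}"
BAITS="${BAITS:-S04380110_Covered.interval_list}"
TARGETS="${TARGETS:-S04380110_Regions.interval_list}"

# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Collect hybrid-selection metrics for one BAM file; the report is
# written next to it, e.g. sample.bam -> sample.hs_metrics.txt.
collect_hs_metrics() {
    run java -Xmx1000M -jar "$PICARD_JAR" CollectHsMetrics \
        I="$1" \
        O="${1%.bam}.hs_metrics.txt" \
        R="$REFERENCE" \
        BAIT_INTERVALS="$BAITS" \
        TARGET_INTERVALS="$TARGETS"
}
```

Running `DRY_RUN=1` with `collect_hs_metrics sample.bam` prints the full command, which is a cheap way to confirm that the baits and targets have not been swapped.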
Now you will have the information you need to assess the performance of your exome sequencing experiment. For more details on how to interpret the output from `CollectHsMetrics`, click here.
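The metrics report is a tab-separated text file with one very wide header row and one value row, which is hard to read in a terminal. Here is a small sketch for printing each metric on its own line, assuming the usual Picard metrics layout (a `## METRICS CLASS` line followed by the header row and then the value row); the helper name `show_hs_metrics` is hypothetical:

```shell
# Print each HsMetrics column name next to its value, one pair per line.
# Assumes a "## METRICS CLASS" line, then one tab-separated header row,
# then one tab-separated value row. show_hs_metrics is a hypothetical helper.
show_hs_metrics() {
    awk '
        /^## METRICS CLASS/ { want = 2; next }    # the next two rows are header and values
        want == 2 { n = split($0, name, "\t"); want = 1; next }
        want == 1 {
            split($0, value, "\t")
            for (i = 1; i <= n; i++) print name[i] "\t" value[i]
            exit                                  # ignore the histogram section that follows
        }
    ' "$1"
}
```

For example, `show_hs_metrics sample.hs_metrics.txt | grep -E 'MEAN_TARGET_COVERAGE|PCT_SELECTED_BASES'` pulls out two of the commonly checked values.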
Joselynn is a genomics data scientist. Before coming to Brown, she got her PhD from the University of Rhode Island. Her background is primarily in studying microbes.