Extracting alignment statistics using Python
Recently this paper by Ilicic et al. suggested a method for assessing the quality of individual cells in a single-cell RNA-seq experiment. The basic idea is to extract various biological and technical features from each the reads for each cell, then use PCA with outlier detection or a SVM to classify cells as “high” or “low” quality. There are two pieces of software associated with the paper: cellity
, an R package that performs the classification and Celloline
, a Python script that performs alignment, summarisation and extraction of alignment statistics such as the number of reads aligned to exons, introns, intergenic regions etc. I was interested in using cellity
but I didn’t want to change my whole workflow to use the Celloline
pipeline, so instead I decided to take the part responsible for extracting alignment statistics (available here) and convert it to a stand-alone Python script.
The core processing remains the same (except I have removed read counting which I do with featureCounts
), but I have added a few features:
- Multiple files - paths to multiple alignment files can now be provided as arguments on the command line.
- BAM files - the script can now handle BAM files as well as SAM using pysam. It will work if the BAM is unsorted, but the output can be slightly different.
- Index - reading the GTF annotation file can take a significant amount of time, particularly for a single-cell experiment where there are a large number of files with relatively few reads. To limit this overhead the object holding the annotation can be pickled to disk for future use.
- Parallel - multiple files can now be processed in parallel using joblib. This is fairly crude but it is a significant improvment, particularly when combined with a pickled index.
- Argument handling - now performed by argparse, complete with handy help message.
- Logging - progress and error messages.
Putting it all together I can now extract alignment statistics from multiple BAM files in parallel with a single command:
alignStats -o stats.csv -g annotation.gtf -i annotation.index -t bam -p 10 *.bam
The script is available on Github.