Detection of pseudouridine, m6A, and m5C on individual molecules from direct RNA nanopore signals.
Detailed documentation is available at https://comprna.github.io/SWARM/
SWARM was trained on signals basecalled with Guppy 6.4.6 for RNA002 and with Dorado 0.7.2 for RNA004.
Recommended parameters RNA002:
guppy_basecaller -i $INPUTDIR --recursive -s $output_path.fastq -c guppy/ont-guppy/data/rna_r9.4.1_70bps_hac.cfg --device cuda:all:100%
Recommended parameters RNA004:
MODEL=dorado-0.7.2-linux-x64/rna004_130bps_sup@v5.0.0
dorado basecaller $MODEL $INPUTDIR -r -x cuda:all --emit-fastq > $output_path.fastq
We use minimap2 2.24 for alignment and samtools 1.12 for filtering and quality checks.
Recommended parameters: -k 5 for synthetic IVTs and -k 14 for human transcriptomes
minimap2 -ax map-ont -k 5 ${fasta} ${input_path}/guppy_pass.fastq | samtools sort -o ${output_path}.bam
samtools index ${output_path}.bam
samtools view -b -F 2324 ${bam_file}.bam > ${bam_file}_pass_filtered.bam
samtools index ${bam_file}_pass_filtered.bam
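For reference, the -F 2324 filter above excludes unmapped, reverse-strand, secondary, and supplementary alignments; the value is simply the sum of the corresponding SAM flag bits:

```shell
# SAM flag bits excluded by -F 2324:
#   4    read unmapped
#   16   read mapped to the reverse strand (direct RNA reads map forward)
#   256  secondary alignment
#   2048 supplementary alignment
echo $((4 + 16 + 256 + 2048))   # prints 2324
```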
This step is optional but highly recommended, especially for large datasets.
https://github.com/hasindu2008/slow5tools
Example conversion command:
#convert fast5 files to slow5 files using 8 I/O processes
slow5tools f2s $INPUT_DIR -d $TEMPDIR -p 8
#Merge all the slow5 files into a single file using 8 threads
slow5tools merge $TEMPDIR -o $OUTDIR/${SAMPLE}.blow5 -t 8
#remove the temporary directory
rm -rf $TEMPDIR
Our workflow supports both the f5c SAM and nanopolish TSV event-alignment formats. We highly recommend the f5c SAM route; it requires the slow5 conversion outlined in the previous step.
https://github.com/hasindu2008/f5c
Example event align command:
#f5c index
f5c index -t 48 $FASTQ_PATH --slow5 $SLOW5_PATH
#sam event alignment format
f5c eventalign -t 48 -r $FASTQ_PATH --rna -g $genome -b $BAM --slow5 $SLOW5_PATH --min-mapq 0 --secondary=yes --signal-index --scale-events --samples --print-read-names --sam > $OUT
We used this format in earlier stages of the project, and our workflow still supports it. Note, however, that our prediction workflow is optimised for the f5c SAM format.
https://github.com/jts/nanopolish
Example event align command:
nanopolish index -d ${fast5_path} -s ${guppy_files}/sequencing_summary.txt $fastq
nanopolish eventalign -t 48 --reads $fastq --bam $bam_file \
--genome $fasta --signal-index --scale-events --samples --print-read-names > $output_path
Simply clone from GitHub (install Git LFS first to download the large h5 model files)
git lfs install
git clone https://github.com/comprna/SWARM/ && cd SWARM
This step is highly recommended and is required for our C++ preprocessing of SAM event files. Skip it if you are using the eventalign.tsv format.
cd SWARM_scripts/preprocess/
#build and compile htslib, slow5tools, SWARM_preprocess
bash build.sh
SWARM supports GPU inference with TensorFlow, tested with versions 2.8.0 and 2.15.0.
GPU-configured TensorFlow should be available on most HPC systems. Otherwise, you can install TensorFlow with GPU support as per https://www.tensorflow.org/install/
python requirements:
python=3.11.7
tensorflow==2.15.0
numpy==1.26.2
pandas==2.2.0
scikit-learn==1.4.0
pysam==0.22.1
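If the environment is managed with pip, the pins above can be kept in a requirements file; a sketch (note that the `sklearn` import is provided by the `scikit-learn` package on PyPI):

```shell
# pin the Python dependencies, then install them with pip
cat > requirements.txt <<'EOF'
tensorflow==2.15.0
numpy==1.26.2
pandas==2.2.0
scikit-learn==1.4.0
pysam==0.22.1
EOF
# pip install -r requirements.txt
```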
Use this approach for faster, simultaneous preprocessing and model inference. Run build.sh from the section above first.
Models for RNA002 or RNA004 chemistry are automatically selected based on the blow5 data.
Example bash code to run SWARM read-level prediction:
module load tensorflow
export MOD=m6A # [<m6A> <m5C> <pU>]
export FASTA=Homo_sapiens.GRCh38.cdna.fa
export BLOW5=Hek293_mRNA.blow5
export SAM=Hek293_mRNA_f5C.events.sam
export OUT=Hek293_mRNA.$MOD.m1.tsv
python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT
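To call all three modification types on the same sample, the command can be wrapped in a loop. A sketch using the file names from the example above; the leading echo only prints each command, drop it to actually run the predictions:

```shell
# print the SWARM read-level command for each supported modification
FASTA=Homo_sapiens.GRCh38.cdna.fa
BLOW5=Hek293_mRNA.blow5
SAM=Hek293_mRNA_f5C.events.sam
for MOD in m6A m5C pU; do
    OUT=Hek293_mRNA.$MOD.m1.tsv
    # drop the leading echo to execute the prediction
    echo python3 SWARM_read_level.py -m "$MOD" --sam "$SAM" --fasta "$FASTA" --raw "$BLOW5" -o "$OUT"
done
```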
Alternatively, preprocessing and prediction can be run separately from an eventalign.tsv file, but that involves large temporary files.
First preprocess the event alignments.
export MOD=pU
export BAM=Hek293_mRNA_f5C.bam
export EVENTS=Hek293_mRNA.events.tsv
export OUT=Hek293_mRNA_pU
python3 SWARM_read_level.py --preprocess -m $MOD --bam $BAM --nanopolish $EVENTS -o $OUT
Then predict modification states.
export MOD=pU
export PICKLE=Hek293_mRNA_pU_T.pickle
export OUT=Hek293_mRNA_pU.pred.tsv
python3 SWARM_read_level.py --predict -m $MOD --pickle $PICKLE -o $OUT
First sort the model 1 (read-level) output; use cat first if pooling multiple replicates.
cat Hek293_mRNA_rep1_pU.pred.tsv Hek293_mRNA_rep2_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv
sort -k 1 Hek293_mRNA_pooled_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv.sorted
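Sorting groups all read-level calls for the same reference together, which the site-level step consumes sequentially. A toy illustration (the rows and columns here are invented for the example, not the real pred.tsv schema):

```shell
# hypothetical read-level rows: reference, position, read, probability
printf '%s\n' 'tx2 10 readA 0.91' 'tx1 5 readB 0.12' 'tx1 5 readC 0.88' > toy.pred.tsv
# after sorting, the two calls at tx1 position 5 are adjacent
sort -k 1 toy.pred.tsv > toy.pred.tsv.sorted
cat toy.pred.tsv.sorted
```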
Run site-level detection on sorted read-level data:
INPUT=Hek293_mRNA_pooled_pU.pred.tsv.sorted
OUT=Hek293_mRNA_pooled_pU.m2.pred.tsv
python3 SWARM_site_level.py -i $INPUT -o $OUT
Run differential modification test to find reference coordinates with stoichiometry changes across different conditions. Handles multiple replicates.
Set up a tab-separated config file with paths to inputs. Make sure that the header is present.
M2_file_path RepName Condition
WT_rep1.m2.pred.tsv 1 WT
WT_rep2.m2.pred.tsv 2 WT
KD_rep1.m2.pred.tsv 1 KD
KD_rep2.m2.pred.tsv 2 KD
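The config above can be written with printf, which guarantees real tab separators (file names follow the example table; substitute your own m2 outputs):

```shell
# write the tab-separated diff config, header first
printf '%s\t%s\t%s\n' \
    M2_file_path RepName Condition \
    WT_rep1.m2.pred.tsv 1 WT \
    WT_rep2.m2.pred.tsv 2 WT \
    KD_rep1.m2.pred.tsv 1 KD \
    KD_rep2.m2.pred.tsv 2 KD > diff_config.tsv
```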
Run SWARM_diff on site-level predictions (12 threads):
python3 SWARM_diff.py --data_file diff_config.tsv --output_file diff_out.tsv -n 12
Use the --sam flag with SWARM_read_level.py to get mod.sam output (the pred.tsv is still produced).
Note that this runs more slowly, as multithreaded preprocessing is not implemented for the modsam output.
python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT --sam
mod.sam can also be generated from sorted read-level pred.tsv files.
This should be faster and also enables filtering of sites for cleaner results.
M1_sorted=Hek293_mRNA_pooled_pU.pred.tsv.sorted
SAM=Hek293_mRNA_f5C.events.sam
M2=Hek293_mRNA_pooled_pU.m2.pred.tsv
OUT=Hek293_mRNA_pooled_pU.mod.sam
python3 SWARM_make_modsam.py -i $M1_sorted -s $SAM -m $M2 -o $OUT
This optional step reduces model retraining time, as preprocessing only a fraction of the signals from a whole sample is usually enough for training. We trim to events comprising up to 500 signals per 9mer.
python3 train_models/trim_tsv_events.py -i <eventalign.tsv> -o <out_prefix> --limit-out 500
Preprocess trimmed files for model1 input features. Make sure to include --out_counter arg here!
python3 SWARM_read_level.py --preprocess -m <pU/m6A/m5C/ac4C> --bam <BAM> \
    --nanopolish <eventalign_trimmed.tsv> -o <out_prefix> --out_counter
Use this step for stratified sampling of the preprocessed signals. It splits an equal number of signals per 9mer between positive and negative labels in each of the train/validation/test sets (60/20/20 split by default). Run it on each sample to make the positive and negative data, and give the same outpath if multiple samples are intended to be used under the same positive or negative label.
python3 train_models/split_training_by_9mers.py -i <preprocessed.pickle> \
    --counts <preprocessed.counts> -o <outpath> --limit <signals_per_9mer> \
    [--train_percent 0.6] [--validate_percent 0.2]
Use this step for finalising training/validation/testing data with matching positive/negative labels.
python3 train_models/assemble_data.py --input_positive <positive_prefix> \
    --input_negative <negative_prefix> -o <outpath> [--positive_label 1] [--negative_label 0]
Use this step for training a binary classifier of modification states at single-base, single-molecule resolution.
python3 train_models/train_model1.py -i <assembled_prefix> -o <outpath> \
    [--vector_len 36] [--features 7] [--labels 2] [--epochs 100]