Rigorous Alignment Curation: Cleanup Of Outliers and Noise
Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.
- Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
- Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
- Identify frame-breaking indels in coding regions using a GenBank reference.
- Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.
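
To make the "SNPs adjacent to Ns" idea concrete, here is a minimal Python sketch. It is not Raccoon's implementation: the function names, the `window` parameter, and the rule that an N in any sequence counts are all assumptions made for illustration.

```python
def snp_columns(seqs):
    """Return 0-based columns with at least two distinct unambiguous bases."""
    cols = []
    for i in range(len(seqs[0])):
        bases = {s[i].upper() for s in seqs} & set("ACGT")
        if len(bases) >= 2:
            cols.append(i)
    return cols


def n_adjacent_snps(seqs, window):
    """Return SNP columns within `window` columns of an N in any sequence.

    Hypothetical rule for illustration; the real tool may define
    adjacency differently (for example, per sequence).
    """
    n_cols = {i for s in seqs for i, c in enumerate(s) if c.upper() == "N"}
    return [c for c in snp_columns(seqs)
            if any(abs(c - n) <= window for n in n_cols)]


# Toy alignment: SNPs at columns 3 and 8, Ns at columns 2 and 3.
toy = [
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACNNACGTTC",
]
flagged = n_adjacent_snps(toy, window=2)
```

In this toy alignment only the SNP at column 3 sits next to the N run, so it is the only one flagged.
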

From source:

    pip install .

For development (editable install):

    pip install -e .

Run alignment QC on the bundled example data:

    raccoon aln-qc examples/constructed_alignment.fasta -d outdir \
      --genbank examples/constructed_reference.gb --reference-id ref

Outputs:
- mask_sites.csv
- alignment_qc_summary.txt
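
A mask file like this is typically applied by replacing the listed sites with `N` before tree building. The sketch below shows one way to do that in Python. The column layout of `mask_sites.csv` is not documented here, so the `position` column name and its 1-based coordinates are assumptions; check the real file header before adapting this.

```python
import csv
import io


def load_mask_positions(handle, column="position"):
    """Read 1-based site positions from a CSV handle.

    `column` is a hypothetical column name, not a documented one.
    """
    return sorted({int(row[column]) for row in csv.DictReader(handle)})


def apply_mask(seq, positions, mask_char="N"):
    """Replace the given 1-based positions in `seq` with `mask_char`."""
    chars = list(seq)
    for p in positions:
        if 1 <= p <= len(chars):
            chars[p - 1] = mask_char
    return "".join(chars)


# Toy stand-in for a mask file; the layout is assumed, not taken from Raccoon.
toy_csv = io.StringIO("position,note\n4,clustered_snps\n9,N_adjacent\n")
positions = load_mask_positions(toy_csv)
masked = apply_mask("ACGTACGTAC", positions)
```
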

Show help:

    raccoon --help

Alignment QC:

    raccoon aln-qc <alignment.fasta> -d outdir

With a GenBank reference for frame-break detection:

    raccoon aln-qc <alignment.fasta> -d outdir \
      --genbank <reference.gb> --reference-id <ref_id>

Masking toggles (defaults are enabled):

    raccoon aln-qc <alignment.fasta> -d outdir \
      --no-mask-n-adjacent --no-mask-gap-adjacent

Key alignment options:

- `--n-threshold`: fraction of Ns allowed per sequence before flagging.
- `--cluster-window`: window size (bp) for clustered SNP detection.
- `--cluster-count`: minimum SNPs within a window to flag as clustered.
- `--mask-clustered` / `--no-mask-clustered`: include/exclude clustered SNPs.
- `--mask-n-adjacent` / `--no-mask-n-adjacent`: include/exclude SNPs adjacent to Ns.
- `--mask-gap-adjacent` / `--no-mask-gap-adjacent`: include/exclude SNPs adjacent to gaps.
- `--mask-frame-break` / `--no-mask-frame-break`: include/exclude frame-breaking indels.
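
The window and count options interact in the usual sliding-window way. The Python sketch below is a generic illustration of that idea, not Raccoon's own code. The function name and the exact boundary rule (positions spanning fewer than `window` bp) are assumptions.

```python
def clustered_snps(positions, window, min_count):
    """Return positions that fall in a run of at least `min_count` SNPs
    spanning fewer than `window` bp (hypothetical boundary rule)."""
    pos = sorted(positions)
    flagged = set()
    start = 0
    for end in range(len(pos)):
        # Advance the left edge until the run fits inside the window.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            flagged.update(pos[start:end + 1])
    return sorted(flagged)


# Three SNPs within 10 bp are flagged; the isolated and paired ones are not.
clustered = clustered_snps([5, 8, 11, 120, 400, 403], window=10, min_count=3)
```

Raising `min_count` or shrinking `window` makes the check stricter, so fewer sites are flagged.
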

Sequence QC:

    raccoon seq-qc a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

    raccoon seq-qc a.fasta b.fasta -o combined.fasta \
      --metadata metadata.csv other_metadata.csv --metadata-id-field id \
      --metadata-location-field location --metadata-date-field date \
      --header-separator '|'

Phylogenetic QC:

    raccoon tree-qc --phylogeny <treefile> -d outdir \
      --alignment <alignment.fasta> --asr-state <treefile>.state \
      --run-adar --adar-window 300 --adar-min-count 3

Key phylo options:

- `--phylogeny`: tree file (Newick or Nexus).
- `--alignment`: alignment used for ASR state parsing.
- `--asr-state`: ASR state file (defaults to `<treefile>.state` if present).
- `--tree-format`: `auto`, `newick`, or `nexus`.
- `--run-adar`: enable ADAR-like edit flagging.
- `--run-apobec`: enable APOBEC3-like edit flagging.
- `--adar-window`: max distance (bp) for ADAR clustering (default: 300).
- `--adar-min-count`: min ADAR sites in window to flag a branch (default: 3).
- `--long-branch-sd`: std dev threshold for long-branch flagging (default: 3.0).

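
A standard-deviation threshold for long branches can be read as "flag any branch longer than the mean plus k standard deviations". The Python sketch below shows that reading. Whether Raccoon uses exactly this statistic is an assumption, and the function name and input dictionary are hypothetical.

```python
from statistics import mean, stdev


def long_branches(branch_lengths, sd_threshold=3.0):
    """Return names of branches longer than mean + sd_threshold * SD.

    `branch_lengths` maps a branch or tip name to its length.
    """
    lengths = list(branch_lengths.values())
    if len(lengths) < 2:
        return []  # the sample SD is undefined for fewer than two values
    cutoff = mean(lengths) + sd_threshold * stdev(lengths)
    return [name for name, length in branch_lengths.items() if length > cutoff]


# Toy tree: 19 short branches and one long one.
toy_lengths = {f"tip{i}": 0.01 for i in range(19)}
toy_lengths["outlier"] = 0.5
flagged_branches = long_branches(toy_lengths)
```

With a sample standard deviation, a single outlier among n branches cannot exceed a z-score of (n - 1) / sqrt(n), so under this rule a threshold of 3.0 only flags anything on trees with at least 11 branches.
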
See full CLI details in [docs/cli.md](docs/cli.md).
## Mask notes
Mask output uses the following note values:
| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the CDS frame length. |
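
For the `frame_break` note, the underlying check is that a gap run inside a coding sequence breaks the reading frame when its length is not a multiple of 3. Here is a minimal Python sketch of that check. It assumes the input is a single aligned CDS string; the function name is hypothetical, and Raccoon's handling of GenBank CDS coordinates is not shown.

```python
import re


def frame_breaking_gaps(cds_seq):
    """Return (start, length) for each gap run whose length is not a
    multiple of 3. `start` is a 0-based offset into the aligned CDS."""
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(r"-+", cds_seq)
        if len(m.group()) % 3 != 0
    ]


# The 3-bp gap keeps the frame; the 1-bp and 2-bp gaps break it.
breaks = frame_breaking_gaps("ATG---AAA-CCC--TAA")
```
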
## Example data
The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:
- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
