artic-network/raccoon

raccoon

Rigorous Alignment Curation: Cleanup Of Outliers and Noise

Raccoon is a lightweight toolkit for alignment and phylogenetic QC workflows. It identifies problematic sites (e.g., clustered SNPs, SNPs near Ns/gaps, and frame‑breaking indels) and produces mask files and summaries for downstream analyses.


Use cases

  • Flag clustered SNPs that may indicate contamination, recombination, or misalignment.
  • Detect SNPs adjacent to low-coverage regions (Ns) or gaps.
  • Identify frame-breaking indels in coding regions using a GenBank reference.
  • Generate mask files to exclude suspect sites prior to phylogenetic or evolutionary analyses.
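To make the N- and gap-adjacency use case concrete, here is a minimal Python sketch of one way such a check could work. It is not raccoon's implementation: the function name `snps_adjacent_to`, the `window` parameter, and the definition of a SNP as a column where both reference and query carry different unambiguous bases are all assumptions made for illustration.

```python
BASES = set("ACGT")

def snps_adjacent_to(ref, seq, marks, window=2):
    """Return 0-based alignment columns where `seq` has a SNP relative to
    `ref` and any character from `marks` lies within `window` columns.

    Hypothetical helper: names and defaults are illustrative only.
    """
    ref, seq = ref.upper(), seq.upper()
    flagged = []
    for i, (r, s) in enumerate(zip(ref, seq)):
        # Only columns with two different unambiguous bases count as SNPs.
        if r not in BASES or s not in BASES or r == s:
            continue
        neighbourhood = seq[max(0, i - window): i + window + 1]
        if any(c in marks for c in neighbourhood):
            flagged.append(i)
    return flagged

ref = "ACGTACGTACGT"
n_adjacent = snps_adjacent_to(ref, "ACGAANNTACTT", marks="N")
gap_adjacent = snps_adjacent_to(ref, "ACGTACGTA-TT", marks="-")
print(n_adjacent, gap_adjacent)
```

In the first query the SNP at column 3 sits two columns from an N run, so it is flagged, while the SNP at column 10 is not. In the second query the SNP at column 10 is directly next to a gap.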

Installation

From source:

pip install .

For development (editable install):

pip install -e .

Quickstart

raccoon aln-qc examples/constructed_alignment.fasta -d outdir \
  --genbank examples/constructed_reference.gb --reference-id ref

Outputs:

  • mask_sites.csv
  • alignment_qc_summary.txt

CLI usage

Show help:

raccoon --help

Alignment QC:

raccoon aln-qc <alignment.fasta> -d outdir

With a GenBank reference for frame‑break detection:

raccoon aln-qc <alignment.fasta> -d outdir \
  --genbank <reference.gb> --reference-id <ref_id>
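
As a rough illustration of what frame-break detection involves, the sketch below flags gap runs inside a CDS whose length is not a multiple of three. It is an assumption-laden example rather than raccoon's actual logic: `frame_breaking_gaps` is a hypothetical helper, CDS coordinates are passed in directly instead of being parsed from a GenBank file, and alignment columns are assumed to match reference coordinates.

```python
import re

def frame_breaking_gaps(seq, cds_start, cds_end):
    """Return half-open (start, end) spans of gap runs inside the CDS
    whose length is not a multiple of three.

    Hypothetical helper: coordinates are 0-based alignment columns.
    """
    spans = []
    for match in re.finditer(r"-+", seq[cds_start:cds_end]):
        if (match.end() - match.start()) % 3 != 0:
            spans.append((cds_start + match.start(), cds_start + match.end()))
    return spans

# One single-base deletion (frame-breaking) and one 3-base deletion (in frame).
aligned = "ATGAA-CCC---GGGTAA"
print(frame_breaking_gaps(aligned, 0, len(aligned)))
```

Only the single-column gap is reported. The three-column gap removes a whole codon and leaves the reading frame intact.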

Masking toggles (each mask is enabled by default; pass the --no- form to disable it):

raccoon aln-qc <alignment.fasta> -d outdir \
  --no-mask-n-adjacent --no-mask-gap-adjacent

Key alignment options:

  • --n-threshold: fraction of Ns allowed per sequence before flagging.
  • --cluster-window: window size (bp) for clustered SNP detection.
  • --cluster-count: minimum SNPs within a window to flag as clustered.
  • --mask-clustered/--no-mask-clustered: include/exclude clustered SNPs in the mask.
  • --mask-n-adjacent/--no-mask-n-adjacent: include/exclude SNPs adjacent to Ns in the mask.
  • --mask-gap-adjacent/--no-mask-gap-adjacent: include/exclude SNPs adjacent to gaps in the mask.
  • --mask-frame-break/--no-mask-frame-break: include/exclude frame-breaking indels in the mask.
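
The interaction between a window size and a minimum count can be sketched with a simple sliding window over sorted SNP positions. This example only illustrates the general technique: the function name, the default values, and the choice to treat positions as clustered when they span fewer than `window` bp are assumptions, not a description of how raccoon applies --cluster-window and --cluster-count.

```python
def clustered_snps(positions, window=10, min_count=3):
    """Return the sorted SNP positions that fall in any run of at least
    `min_count` SNPs spanning fewer than `window` bp.

    Hypothetical helper: defaults are illustrative, not raccoon's.
    """
    positions = sorted(positions)
    flagged = set()
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window from the left until it spans fewer than `window` bp.
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 >= min_count:
            flagged.update(positions[left:right + 1])
    return sorted(flagged)

print(clustered_snps([5, 8, 12, 100, 250, 253], window=10, min_count=3))
```

Here positions 5, 8, and 12 form a cluster, while 250 and 253 are close together but fall short of the minimum count.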

Sequence QC:

raccoon seq-qc a.fasta b.fasta -o combined.fasta

With metadata-driven headers:

raccoon seq-qc a.fasta b.fasta -o combined.fasta \
  --metadata metadata.csv other_metadata.csv --metadata-id-field id \
  --metadata-location-field location --metadata-date-field date \
  --header-separator '|'
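
To show what metadata-driven headers might look like, the following sketch joins three metadata fields with a separator. The field order (id, location, date) and the function `build_headers` are assumptions for illustration; this README does not state the header layout that raccoon writes.

```python
import csv
import io

def build_headers(metadata_csv, id_field="id", location_field="location",
                  date_field="date", separator="|"):
    """Map each sequence id to a header built from its metadata row.

    Hypothetical helper: the id/location/date order is an assumption.
    """
    headers = {}
    for row in csv.DictReader(metadata_csv):
        parts = [row[id_field], row[location_field], row[date_field]]
        headers[row[id_field]] = separator.join(parts)
    return headers

# Illustrative metadata, not taken from the repository.
example = io.StringIO("id,location,date\nseq1,Edinburgh,2024-01-15\n")
print(build_headers(example))
```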

Phylogenetic QC:

raccoon tree-qc --phylogeny <treefile> -d outdir \
  --alignment <alignment.fasta> --asr-state <treefile>.state \
  --run-adar --adar-window 300 --adar-min-count 3

Key phylo options:

  • --phylogeny: tree file (Newick or Nexus)
  • --alignment: alignment used for ASR state parsing
  • --asr-state: ASR state file (defaults to <treefile>.state if present)
  • --tree-format: auto/newick/nexus
  • --run-adar: enable ADAR-like edit flagging
  • --run-apobec: enable APOBEC3-like edit flagging
  • --adar-window: max distance (bp) for ADAR clustering (default: 300)
  • --adar-min-count: min ADAR sites in window to flag a branch (default: 3)
  • --long-branch-sd: std dev threshold for long-branch flagging (default: 3.0)
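
A standard-deviation threshold for long branches can be illustrated with a z-score over branch lengths. This is a sketch of the general idea only: `long_branches` is a hypothetical helper, and whether raccoon uses the sample standard deviation, all branches or only terminal ones, or a transformed length is not stated here.

```python
from statistics import mean, stdev

def long_branches(branch_lengths, sd_threshold=3.0):
    """Return names of branches whose length exceeds the mean by more
    than `sd_threshold` sample standard deviations.

    Hypothetical helper: `branch_lengths` maps a branch name to its length.
    """
    values = list(branch_lengths.values())
    if len(values) < 2:
        return []
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return [name for name, length in branch_lengths.items()
            if (length - mu) / sd > sd_threshold]

# Illustrative lengths: twenty short branches and one much longer one.
lengths = {f"tip{i}": 0.001 for i in range(20)}
lengths["outlier"] = 0.05
print(long_branches(lengths))
```

With few branches a single outlier cannot reach a high z-score, because it inflates the standard deviation itself, so a threshold of this kind is most informative on larger trees.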

See full CLI details in [docs/cli.md](docs/cli.md).

## Mask notes

Mask output uses the following note values:

| Note | Meaning |
| --- | --- |
| clustered_snps | Clustered SNPs within the configured window. |
| N_adjacent | SNPs adjacent to an N run within the configured window. |
| gap_adjacent | SNPs adjacent to a gap within the configured window. |
| frame_break | Gap sites that break the reading frame of a CDS. |
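
A quick way to summarise a mask file is to count rows per note value. The sketch below assumes the CSV has a column named `note`; that column name, the `site` column, and the example rows are hypothetical, so check the header of your own mask_sites.csv before relying on it.

```python
import csv
import io
from collections import Counter

def count_notes(handle, note_column="note"):
    """Count mask rows per note value.

    Hypothetical helper: `note_column` is an assumed column name.
    """
    return Counter(row[note_column] for row in csv.DictReader(handle))

# Illustrative rows, not real raccoon output.
example = io.StringIO(
    "site,note\n101,clustered_snps\n102,clustered_snps\n250,N_adjacent\n"
)
print(count_notes(example))
```

For a real run, open the file with `open("outdir/mask_sites.csv", newline="")` and pass the handle in place of the in-memory example.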

## Example data

The [examples](examples) folder includes a constructed alignment and GenBank reference suitable for quick testing:

- [examples/constructed_alignment.fasta](examples/constructed_alignment.fasta)
- [examples/constructed_reference.gb](examples/constructed_reference.gb)
