A comprehensive toolkit for training and comparing different Handwritten Text Recognition (HTR) engines on historical manuscript datasets. Supports TrOCR, CRNN-CTC, Qwen3-VL, LightOnOCR, Party, and Kraken engines with a unified GUI interface.
Primary Focus: Cyrillic manuscripts (Russian, Ukrainian, Church Slavonic, Glagolitic)
- TrOCR: Transformer-based OCR (line-level)
- CRNN-CTC: Puigcerver CRNN with CTC decoding (line-level, PyLaia-inspired)
- Qwen3-VL: Vision-Language Model (line/page-level, custom prompts)
- LightOnOCR: Lightweight VLM (~4GB VRAM, line-level, fine-tuned variants)
- Churro: Qwen fork, experimental (line/page-level, custom prompts)
- Party: Transformer-based HTR (line-level, multilingual)
- Kraken: Segmentation & recognition
- PaddleOCR: Multi-language printed/mixed text detection + recognition (whole-page, subprocess-isolated)
- Commercial APIs: Google Gemini, Anthropic Claude Vision (via API keys)
- Local LLMs: OpenWebUI integration for local vision models
- Unified interface: All models accessible through same engine plugin system
- Plugin GUI: Compare engines side-by-side with unified interface
- Model management: Easy switching between trained models and API providers
- Export formats: TXT, CSV, PAGE XML
- CRNN-CTC: Custom CRNN training with PAGE XML support
- TrOCR: Fine-tuning pipeline with image caching (10-50x faster)
- Data preparation: Transkribus PAGE XML parser
- Line segmentation (automatic or PAGE XML-based)
- Custom prompt support (Qwen3-VL)
- Batch processing
- PAGE XML import/export
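The engine plugin system can be pictured as a small abstract base class that every engine implements and registers under a name, so the GUI can list and swap them. The real interface lives in htr_engine_base.py; the method and registry names below are illustrative assumptions, not the actual API:

```python
from abc import ABC, abstractmethod

class HTREngine(ABC):
    """Illustrative sketch of a pluggable HTR engine interface.

    The real base class lives in htr_engine_base.py; names here
    are assumptions for illustration only.
    """

    name: str = "base"

    @abstractmethod
    def transcribe_line(self, image_path: str) -> str:
        """Return the transcription of a single line image."""

class DummyEngine(HTREngine):
    """Stand-in engine used to show how a plugin slots in."""
    name = "dummy"

    def transcribe_line(self, image_path: str) -> str:
        return f"<transcription of {image_path}>"

# Engines register under a name so a GUI dropdown can enumerate them.
REGISTRY = {cls.name: cls for cls in (DummyEngine,)}
engine = REGISTRY["dummy"]()
print(engine.transcribe_line("line_001.png"))
```

Adding a new engine then means implementing the abstract methods and registering the class, with no changes to the GUI itself.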
# Clone repository
git clone https://github.com/achimrabus/polyscriptor.git
cd polyscriptor
# Create virtual environment
python3 -m venv htr_env
source htr_env/bin/activate # Linux/Mac
# or: htr_env\Scripts\activate  # Windows
GPU install (CUDA 12.1, Linux/Windows with NVIDIA GPU):
# Install CUDA torch first, then the rest
pip install -r requirements-gpu.txt --extra-index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
CPU-only install (no GPU required):
# Linux/Mac:
pip install -r requirements.txt
# Windows (avoids a torch DLL load error on CPU-only machines):
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
Local usage (Linux/Mac):
source htr_env/bin/activate
python3 transcription_gui_plugin.py
Local usage (Windows):
htr_env\Scripts\activate
python transcription_gui_plugin.pyNote: The plugin GUI requires
PyQt6(included inrequirements.txt). The web UI (uvicorn web.polyscriptor_server:app) works without PyQt6.
Remote server usage (GUI over X11):
# See REMOTE_GUI_GUIDE.md for detailed setup
# Quick test: X11 forwarding with MobaXterm
ssh -X user@server
cd ~/htr_gui/dhlab-slavistik
source htr_env/bin/activate
python3 transcription_gui_plugin.py
Recommended for remote: CLI batch processing
# More efficient than GUI for server workflows
python3 batch_processing.py \
--input-folder HTR_Images/my_folder \
--engine crnn-ctc \
--model-path models/crnn_ctc_model/best_model.pt \
--use-pagexml
See REMOTE_GUI_GUIDE.md for comprehensive remote access options (X11, VNC, CLI workflows)
# Step 1: Parse Transkribus PAGE XML export → CSV format
python3 transkribus_parser.py \
--input_dir /path/to/transkribus_export \
--output_dir ./data/my_dataset \
--preserve-aspect-ratio \
--target-height 128
# Step 2: Convert CSV → CRNN-CTC format (required!)
python3 convert_to_pylaia.py \
--input_csv ./data/my_dataset/train.csv \
--output_dir ./data/crnn_train
python3 convert_to_pylaia.py \
--input_csv ./data/my_dataset/val.csv \
--output_dir ./data/crnn_val
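Before converting, it can help to sanity-check the CSV from step 1, e.g. counting rows whose image file is missing on disk. A minimal sketch, assuming each row starts with an image path (the exact column layout of the transkribus_parser.py output may differ):

```python
import csv
from pathlib import Path

def check_rows(rows):
    """Count (total, missing-image) rows; each row starts with an image path."""
    total = missing = 0
    for row in rows:
        if not row:
            continue  # skip blank lines
        total += 1
        if not Path(row[0]).is_file():
            missing += 1
    return total, missing

def check_dataset(csv_path):
    """Run the check over a CSV file such as ./data/my_dataset/train.csv."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return check_rows(csv.reader(f))
```

A non-zero missing count before training usually points at a wrong --input_dir or a moved export folder.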
# Step 3: Train CRNN-CTC model
python3 train_pylaia.py \
--train_dir ./data/crnn_train \
--val_dir ./data/crnn_val \
--output_dir ./models/my_model \
--batch_size 32 \
--epochs 250
.
├── train_pylaia.py                # CRNN-CTC training script
├── inference_pylaia_native.py     # CRNN-CTC inference (native Linux)
├── inference_page.py              # Line segmentation + OCR pipeline
├── transcription_gui_plugin.py    # Main GUI application
├── polyscriptor_batch_gui.py      # Batch processing GUI
├── batch_processing.py            # Batch processing CLI
├── htr_engine_base.py             # HTR engine interface
│
├── engines/                       # HTR engine plugins
│   ├── trocr_engine.py            # TrOCR transformer
│   ├── pylaia_engine.py           # CRNN-CTC (Puigcerver CRNN)
│   ├── qwen3_engine.py            # Qwen3-VL (local)
│   ├── lighton_ocr_engine.py      # LightOnOCR VLM (lightweight)
│   ├── churro_engine.py           # Churro (Qwen fork)
│   ├── party_engine.py            # Party multilingual HTR
│   ├── kraken_engine.py           # Kraken segmentation
│   ├── commercial_api_engine.py   # Google Gemini, OpenAI GPT & Anthropic Claude APIs
│   ├── openwebui_engine.py        # OpenWebUI local LLMs
│   ├── paddle_engine.py           # PaddleOCR (subprocess, isolated venv)
│   └── paddle_worker.py           # PaddleOCR worker (runs inside venv_paddle)
│
├── optimized_training.py          # TrOCR fine-tuning script
├── transkribus_parser.py          # PAGE XML data preparation
├── alto_parser.py                 # ALTO XML data preparation
├── page_xml_exporter.py           # Export results to PAGE XML
├── qwen3_prompts.py               # Custom prompts for Qwen3-VL
│
├── requirements.txt               # Python dependencies
│
├── web/                           # Browser-based web interface
│   ├── polyscriptor_server.py     # FastAPI backend (SSE streaming)
│   ├── static/
│   │   ├── index.html             # Single-page app
│   │   ├── app.js                 # State management, event bus
│   │   ├── app.css                # Styles
│   │   └── components/            # ES6 modules (engine, viewer, transcription, batch)
│   └── tests/
│       └── test_server.py         # API tests (pytest + FastAPI TestClient)
│
└── models/                        # Trained models (excluded from git)
    ├── pylaia_*/                  # CRNN-CTC model checkpoints
    └── trocr_*/                   # TrOCR fine-tuned models
- Export data from Transkribus (PAGE XML format)
- Parse with preprocessing:
python3 transkribus_parser.py --input_dir ./transkribus_export --output_dir ./data/my_dataset --preserve-aspect-ratio --target-height 128
- Convert to CRNN-CTC format:
python3 convert_to_pylaia.py --input_csv ./data/my_dataset/train.csv --output_dir ./data/crnn_train
python3 convert_to_pylaia.py --input_csv ./data/my_dataset/val.csv --output_dir ./data/crnn_val
- Train model:
python3 train_pylaia.py --train_dir ./data/crnn_train --val_dir ./data/crnn_val --output_dir ./models/my_model --batch_size 32 --epochs 250
- Use in GUI: The model will appear in the CRNN-CTC engine dropdown
Trained models can be loaded in the GUI:
- CRNN-CTC models: Select from dropdown or browse to model directory
- TrOCR models: Specify HuggingFace Hub ID or local checkpoint path
- Commercial APIs: Enter API keys in engine configuration
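When comparing trained models or engines against ground truth, character error rate (CER), i.e. Levenshtein edit distance divided by reference length, is the standard HTR metric. A self-contained sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / len(reference)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))            # DP row for the empty-prefix case
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / max(m, 1)

print(cer("слово", "слова"))  # one substitution over five chars -> 0.2
```

Averaging CER over a held-out validation set gives a single number for side-by-side engine comparison.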
python3 inference_pylaia_native.py \
--checkpoint models/my_model/best_model.pt \
--syms models/my_model/symbols.txt \
--image line_image.png
python3 inference_page.py \
--image page.jpg \
--checkpoint models/my_model/best_model.pt \
--num-beams 4
For processing multiple images or folders, use the batch processing GUI:
python3 polyscriptor_batch_gui.py
Features:
- Process entire folders of images
- Automatic PAGE XML detection (uses existing segmentation if available)
- Progress tracking with live output
- Export results to TXT, CSV, or PAGE XML
- Resume interrupted processing
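Resume support comes down to skipping images that already have a transcription on disk. A minimal sketch of that bookkeeping (matching outputs to inputs by file stem is an assumption about the naming scheme):

```python
from pathlib import Path

def pending(images, done_stems):
    """Return image paths whose stem has no finished transcription yet."""
    return [p for p in images if Path(p).stem not in done_stems]

def pending_in_dirs(input_dir, output_dir, ext=".txt"):
    """Scan a folder pair and list images still to be processed."""
    done = {p.stem for p in Path(output_dir).glob(f"*{ext}")}
    return pending(sorted(str(p) for p in Path(input_dir).glob("*.jpg")), done)
```

Restarting after an interruption then only re-runs the images returned by pending_in_dirs.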
For scripted/automated workflows:
python3 batch_processing.py \
--input-folder ./images \
--engine crnn-ctc \
--model-path models/my_model/best_model.pt \
--segmentation-method kraken \
--output-folder ./output \
--use-pagexml
Key options:
- --engine: crnn-ctc, TrOCR, Qwen3-VL, LightOnOCR, Party, Kraken, PaddleOCR
- --segmentation-method: kraken (recommended), hpp (fast), none (pre-segmented)
- --use-pagexml: Auto-detect and use existing PAGE XML segmentation
- --resume: Skip already-processed files
- --dry-run: Test without writing output
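For scripted sweeps over many folders or engines, the CLI invocation can be assembled programmatically. A small sketch that only builds the argument list from the options above (flag names as documented; the engine and model values are examples):

```python
import subprocess

def build_cmd(folder, engine, model, dry_run=False):
    """Assemble a batch_processing.py invocation from the documented flags."""
    cmd = [
        "python3", "batch_processing.py",
        "--input-folder", folder,
        "--engine", engine,
        "--model-path", model,
        "--use-pagexml",
    ]
    if dry_run:
        cmd.append("--dry-run")  # test the run without writing output
    return cmd

# Example use (uncomment to actually run):
# subprocess.run(build_cmd("HTR_Images/my_folder", "crnn-ctc",
#                          "models/crnn_ctc_model/best_model.pt"), check=True)
```

Passing the list form to subprocess.run avoids shell quoting issues with paths that contain spaces.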
PaddleOCR performs its own text detection and recognition on whole pages, so no pre-segmented lines are needed. It runs in an isolated venv_paddle to avoid OpenCV conflicts with the main environment.
python3 -m venv venv_paddle
source venv_paddle/bin/activate
# CPU only:
pip install paddlepaddle paddleocr
# GPU (CUDA 12.x):
pip install paddleocr
pip install paddlepaddle-gpu==3.0.0 -f https://www.paddlepaddle.org.cn/packages/stable/cu126/
deactivate
PaddleOCR uses ISO language codes (not script names). Enter the code in the "Language code" field of the engine config. Common examples:
| Code | Language / Script |
|---|---|
| `en` | English |
| `ch` | Chinese + English (strongest general model) |
| `de` or `german` | German |
| `fr` or `french` | French |
| `ru` | Russian (Cyrillic) |
| `uk` | Ukrainian (Cyrillic) |
| `bg` | Bulgarian (Cyrillic) |
| `la` | Latin (classical) |
| `ar` | Arabic |
| `japan` | Japanese |
| `korean` | Korean |
Note: Models download automatically on first use (~50–200 MB per script group). Only `en` is fetched during initial setup. Other language models are cached in ~/.paddlex/official_models/. Full language list: https://paddlepaddle.github.io/PaddleOCR/main/en/ppocr/blog/multi_languages.html
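The subprocess isolation used for PaddleOCR generalizes to any engine with conflicting dependencies: the main process invokes the worker with the isolated venv's Python binary and exchanges JSON over stdin/stdout. A minimal sketch of that pattern (the actual paddle_worker.py protocol may differ):

```python
import json
import subprocess
import sys

def run_worker(python_bin, worker_args, payload):
    """Send a JSON request to an isolated worker process, parse its JSON reply."""
    proc = subprocess.run(
        [python_bin, *worker_args],
        input=json.dumps(payload),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

# In Polyscriptor this would be roughly:
#   run_worker("venv_paddle/bin/python", ["engines/paddle_worker.py"],
#              {"image": "page.jpg", "lang": "ru"})
# Demo with a stand-in worker that just echoes the language back:
echo = ["-c",
        "import json,sys; req=json.load(sys.stdin); "
        "print(json.dumps({'lang': req['lang'], 'lines': []}))"]
reply = run_worker(sys.executable, echo, {"image": "page.jpg", "lang": "ru"})
```

Because the worker imports its heavy dependencies in its own interpreter, the main environment never loads PaddleOCR's OpenCV build at all.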
Polyscriptor includes a browser-based web interface: run inference locally or on a remote server and interact from any browser. No X11 forwarding needed; when running on a remote server, no local Python install is required either.
# Web dependencies are included in requirements.txt; no extra install needed.
# Activate your virtual environment first:
source htr_env/bin/activate # Linux/Mac
# or: htr_env\Scripts\activate # Windows
# Start the server (run from the project root)
uvicorn web.polyscriptor_server:app --host 0.0.0.0 --port 8765
# Open in browser
# Local: http://localhost:8765
# Remote: use SSH tunnel (see below)
Works without a GPU. Commercial APIs (Gemini, Claude, OpenAI) and TrOCR run on CPU. CRNN-CTC also runs on CPU: inference is slower (~1–2 min/page) but fully functional, and our published Church Slavonic, Ukrainian, and Glagolitic models all work this way. Only Qwen3-VL and LightOnOCR require a GPU.
# On your laptop β tunnel port 8765 through SSH
ssh -L 8765:localhost:8765 user@your.server.edu
# Then open: http://localhost:8765
# No firewall issues: works on any university network
- Engine selection with dynamic configuration forms (CRNN-CTC, TrOCR, Kraken, etc.)
- Image upload – drag-and-drop, file picker, or PDF upload (multi-page PDFs become batch items)
- Segmentation – Kraken (neural blla or classical) with color-coded region overlay
- Live transcription – Server-Sent Events stream, lines appear as processed
- Batch queue – multi-image queue, drag-to-reorder, cancel, prev/next navigation
- Inline editing – double-click any transcription line to correct it
- Confidence filter – slider to dim low-confidence lines
- Export – TXT, CSV, PAGE XML (single image or ZIP for entire batch)
- Font selector – Monomakh Unicode (recommended for Church Slavonic), Old Standard TT, and others
- Kraken model presets – 12 Zenodo community models with one-click download
- Resizable panels – drag handles to adjust column widths, saved across sessions
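Under the hood, the confidence filter is just a threshold over per-line scores; the same logic in a few lines of Python (the line dictionary fields are illustrative, not the web UI's actual data model):

```python
def split_by_confidence(lines, threshold=0.85):
    """Partition transcription lines into confident and low-confidence sets."""
    confident = [l for l in lines if l["conf"] >= threshold]
    doubtful = [l for l in lines if l["conf"] < threshold]
    return confident, doubtful

# Lines as (text, confidence) records, e.g. as an engine might emit them:
sample = [
    {"text": "въ лѣто", "conf": 0.93},
    {"text": "???",     "conf": 0.41},
]
ok, dim = split_by_confidence(sample)
```

In the web UI the "dim" set is what the slider fades out, which makes the lines worth manual review easy to spot.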
source htr_env/bin/activate
pip install pytest httpx
pytest web/tests/test_server.py -v
Running on a remote Linux server without a GUI? You have several options:
Best for: Production workflows, processing many images
# Process entire folders efficiently
python3 batch_processing.py \
--input-folder HTR_Images/manuscripts \
--engine crnn-ctc \
--model-path models/crnn_ctc_model/best_model.pt \
--use-pagexml \
--output-folder output
Benefits: faster than GUI methods, no display overhead, scriptable
Best for: Interactive GUI work, visual parameter tuning, model comparison
Using MobaXterm on Windows:
- Install MobaXterm (X server auto-starts)
- SSH with X11 forwarding enabled
- Test: xclock & (should show a clock window)
- Launch GUI: python3 transcription_gui_plugin.py
Performance: Good over LAN/local WiFi, slower over internet connections. Enable compression for best results.
Best for: When X11 is too slow (poor internet), extended GUI sessions, session persistence
# On server
vncserver :1 -geometry 1920x1080
# Connect from Windows using a VNC viewer to: server:5901
Benefits: better compression than X11, survives disconnects, works well over the internet
| Method | Speed | Best For | Network Type |
|---|---|---|---|
| CLI Batch Processing | ⚡⚡⚡ | Production, automation | Any |
| Web UI | ⚡⚡⚡ | Interactive work, no install needed | Any (SSH tunnel) |
| X11 Forwarding | ⚡⚡ | Interactive GUI work | LAN/Local WiFi |
| X11 Forwarding | ⚡ | Light use only | Internet |
| VNC/NoMachine | ⚡⚡ | Extended sessions, poor connections | Any |
Key hyperparameters for optimal performance:
{
"img_height": 128, # Target image height
"batch_size": 32, # GPU-optimized (44GB VRAM)
"num_epochs": 250, # With early stopping
"learning_rate": 0.0003,
"early_stopping_patience": 15,
"augment_train": True, # Data augmentation
"device": "cuda:0"
}
model_name: "kazars24/trocr-base-handwritten-ru"
data_root: "./processed_data"
batch_size: 16
epochs: 10
cache_images: true # 10-50x faster data loading
fp16: true # Mixed precision training
Contributions welcome! Areas of interest:
- New HTR engines: Add plugins for other HTR systems
- Model training: Share trained models for new scripts/languages
- Bug fixes: Especially inference/GUI issues
- Documentation: Improve guides and examples
MIT License
Copyright (c) 2025 Achim Rabus
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- PyLaia (architecture inspiration for CRNN-CTC engine): https://github.com/jpuigcerver/PyLaia
- TrOCR: Microsoft's Transformer-based OCR: https://huggingface.co/microsoft/trocr-base-handwritten
- LightOnOCR: Lightweight VLM for OCR: https://huggingface.co/lightonai/LightOnOCR-2-1B-base
- Party: PAge-wise Recognition of Text-y: https://github.com/mittagessen/party/
- Transkribus: Transcription, training, and inference platform: https://app.transkribus.org/
- Qwen3-VL: Alibaba's Vision-Language Model: https://github.com/QwenLM/Qwen3-VL
- William Mattingly: Support with VLM fine-tuning and Church Slavonic models: https://huggingface.co/wjbmattingly
For questions, bug reports, or collaboration inquiries:
- GitHub Issues: Create an issue
Aspect Ratio Preservation is CRITICAL for high aspect ratio line images:
# ALWAYS use --preserve-aspect-ratio for manuscript lines
python3 transkribus_parser.py \
--preserve-aspect-ratio \
--target-height 128 \
# ...other args
Without this, TrOCR's ViT encoder forcibly resizes to 384×384, causing 10.6x width compression for Ukrainian lines (4077×357 → 384×384). Characters shrink from ~80px to ~7px wide, making recognition nearly impossible.
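The arithmetic behind this is easy to verify: scaling to a fixed height while preserving the aspect ratio keeps the width proportional instead of squeezing everything into 384 pixels:

```python
def scaled_width(orig_w, orig_h, target_h=128):
    """Width after resizing to target_h with aspect ratio preserved."""
    return round(orig_w * target_h / orig_h)

# The 4077x357 Ukrainian line from the example above:
print(scaled_width(4077, 357))   # stays ~1462 px wide at height 128
print(round(4077 / 384, 1))      # vs the ~10.6x width compression at 384x384
```

At 1462 px the per-character width is preserved, which is why --preserve-aspect-ratio matters so much for long manuscript lines.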
- KALDI Format Vocabulary: Train/inference scripts now auto-detect the format
- <space> vs <SPACE>: Both cases are handled correctly
- Vocabulary File Mismatch: Training scripts auto-copy the vocabulary to the model directory
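The <space>/<SPACE> handling can be as simple as normalizing the symbol table on load; a sketch assuming a one-token-per-line symbols.txt (the real scripts' internals may differ):

```python
def load_symbols(lines):
    """Map symbols to indices, accepting <space>/<SPACE> in either case."""
    table = {}
    for idx, tok in enumerate(lines):
        tok = tok.strip()
        if tok.upper() == "<SPACE>":
            tok = " "  # normalize both spellings to a literal space
        table[tok] = idx
    return table

# e.g. a tiny Cyrillic symbol table:
table = load_symbols(["<ctc>", "<space>", "а", "б"])
```

Normalizing at load time means the decoder never has to know which spelling the training run used.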