UB-Mannheim/polyscriptor

Multi-Engine HTR Training & Comparison Tool

A comprehensive toolkit for training and comparing Handwritten Text Recognition (HTR) engines on historical manuscript datasets. Supports TrOCR, CRNN-CTC, Qwen3-VL, LightOnOCR, Churro, Party, Kraken, and PaddleOCR engines through a unified GUI.

Primary Focus: Cyrillic manuscripts (Russian, Ukrainian, Church Slavonic, Glagolitic)


🎯 Features

Multiple HTR Engines

  • TrOCR: Transformer-based OCR (line-level)
  • CRNN-CTC: Puigcerver CRNN with CTC decoding (line-level, PyLaia-inspired)
  • Qwen3-VL: Vision-Language Model (line/page-level, custom prompts)
  • LightOnOCR: Lightweight VLM (~4GB VRAM, line-level, fine-tuned variants)
  • Churro: Qwen fork, experimental (line/page-level, custom prompts)
  • Party: Transformer-based HTR (line-level, multilingual)
  • Kraken: Segmentation & recognition
  • PaddleOCR: Multi-language printed/mixed text detection + recognition (whole-page, subprocess-isolated)
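
The CTC decoding used by the CRNN-CTC engine can be illustrated with a minimal greedy decoder: collapse repeated per-frame predictions, then drop CTC blanks. This is an illustrative sketch, not the project's actual implementation; the symbol table and blank index are assumptions.

```python
def ctc_greedy_decode(frame_ids, symbols, blank=0):
    """Collapse repeated frame predictions and drop CTC blanks.

    frame_ids: per-frame argmax indices from the CRNN output.
    symbols: index -> character table (index `blank` is the CTC blank).
    """
    out, prev = [], None
    for idx in frame_ids:
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != blank and idx != prev:
            out.append(symbols[idx])
        prev = idx
    return "".join(out)

# e.g. frames [1, 1, 0, 1, 2, 2] with {1: 'a', 2: 'b'} decode to "aab":
# the blank at position 2 separates the two 'a's.
```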

Commercial & Local Vision Models

  • Commercial APIs: Google Gemini, Anthropic Claude Vision (via API keys)
  • Local LLMs: OpenWebUI integration for local vision models
  • Unified interface: All models accessible through same engine plugin system

Core Capabilities

  • Plugin GUI: Compare engines side-by-side with unified interface
  • Model management: Easy switching between trained models and API providers
  • Export formats: TXT, CSV, PAGE XML

Training Pipelines (GPU required)

  • CRNN-CTC: Custom CRNN training with PAGE XML support
  • TrOCR: Fine-tuning pipeline with image caching (10-50x faster)
  • Data preparation: Transkribus PAGE XML parser

Key Capabilities

  • Line segmentation (automatic or PAGE XML-based)
  • Custom prompt support (Qwen3-VL)
  • Batch processing
  • PAGE XML import/export
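
PAGE XML import works from TextLine coordinates. A minimal sketch of pulling line bounding boxes out of a PAGE file with the standard library (the namespace is the 2013 PAGE content schema; everything beyond TextLine/Coords is simplified and not taken from this project's parser):

```python
import xml.etree.ElementTree as ET

# PAGE content schema namespace (2013-07-15 revision).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def line_bboxes(page_xml: str):
    """Yield (x0, y0, x1, y1) bounding boxes for each TextLine polygon."""
    root = ET.fromstring(page_xml)
    for line in root.iter(f"{{{NS['pc']}}}TextLine"):
        coords = line.find("pc:Coords", NS)
        # Coords/@points is a space-separated list of "x,y" pairs.
        pts = [tuple(map(int, p.split(","))) for p in coords.get("points").split()]
        xs, ys = zip(*pts)
        yield min(xs), min(ys), max(xs), max(ys)
```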

🚀 Quick Start

1. Installation

# Clone repository
git clone https://github.com/achimrabus/polyscriptor.git
cd polyscriptor

# Create virtual environment
python3 -m venv htr_env
source htr_env/bin/activate  # Linux/Mac
# or: htr_env\Scripts\activate  # Windows

GPU install (CUDA 12.1, Linux/Windows with NVIDIA GPU):

# Install CUDA torch first, then the rest
pip install -r requirements-gpu.txt --extra-index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

CPU-only install (no GPU required):

# Linux/Mac:
pip install -r requirements.txt

# Windows (avoids a torch DLL load error on CPU-only machines):
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu

2. Launch GUI for inference

Local usage (Linux/Mac):

source htr_env/bin/activate
python3 transcription_gui_plugin.py

Local usage (Windows):

htr_env\Scripts\activate
python transcription_gui_plugin.py

Note: The plugin GUI requires PyQt6 (included in requirements.txt). The web UI (uvicorn web.polyscriptor_server:app) works without PyQt6.

Remote server usage (GUI over X11):

# See REMOTE_GUI_GUIDE.md for detailed setup
# Quick test: X11 forwarding with MobaXterm
ssh -X user@server
cd ~/htr_gui/dhlab-slavistik
source htr_env/bin/activate
python3 transcription_gui_plugin.py

Recommended for remote: CLI batch processing

# More efficient than GUI for server workflows
python3 batch_processing.py \
    --input-folder HTR_Images/my_folder \
    --engine crnn-ctc \
    --model-path models/crnn_ctc_model/best_model.pt \
    --use-pagexml

📖 See REMOTE_GUI_GUIDE.md for comprehensive remote access options (X11, VNC, CLI workflows)

3. Train a Model (CLI, CRNN-CTC Example)

# Step 1: Parse Transkribus PAGE XML export → CSV format
python3 transkribus_parser.py \
    --input_dir /path/to/transkribus_export \
    --output_dir ./data/my_dataset \
    --preserve-aspect-ratio \
    --target-height 128

# Step 2: Convert CSV → CRNN-CTC format (required!)
python3 convert_to_pylaia.py \
    --input_csv ./data/my_dataset/train.csv \
    --output_dir ./data/crnn_train

python3 convert_to_pylaia.py \
    --input_csv ./data/my_dataset/val.csv \
    --output_dir ./data/crnn_val

# Step 3: Train CRNN-CTC model
python3 train_pylaia.py \
    --train_dir ./data/crnn_train \
    --val_dir ./data/crnn_val \
    --output_dir ./models/my_model \
    --batch_size 32 \
    --epochs 250

πŸ“ Repository Structure

.
├── train_pylaia.py                  # CRNN-CTC training script
├── inference_pylaia_native.py       # CRNN-CTC inference (native Linux)
├── inference_page.py                # Line segmentation + OCR pipeline
├── transcription_gui_plugin.py      # Main GUI application
├── polyscriptor_batch_gui.py        # Batch processing GUI
├── batch_processing.py              # Batch processing CLI
├── htr_engine_base.py               # HTR engine interface
│
├── engines/                         # HTR engine plugins
│   ├── trocr_engine.py              # TrOCR transformer
│   ├── pylaia_engine.py             # CRNN-CTC (Puigcerver CRNN)
│   ├── qwen3_engine.py              # Qwen3-VL (local)
│   ├── lighton_ocr_engine.py        # LightOnOCR VLM (lightweight)
│   ├── churro_engine.py             # Churro (Qwen fork)
│   ├── party_engine.py              # Party multilingual HTR
│   ├── kraken_engine.py             # Kraken segmentation
│   ├── commercial_api_engine.py     # Google Gemini, OpenAI GPT & Anthropic Claude APIs
│   ├── openwebui_engine.py          # OpenWebUI local LLMs
│   ├── paddle_engine.py             # PaddleOCR (subprocess, isolated venv)
│   └── paddle_worker.py             # PaddleOCR worker (runs inside venv_paddle)
│
├── optimized_training.py            # TrOCR fine-tuning script
├── transkribus_parser.py            # PAGE XML data preparation
├── alto_parser.py                   # ALTO XML data preparation
├── page_xml_exporter.py             # Export results to PAGE XML
├── qwen3_prompts.py                 # Custom prompts for Qwen3-VL
│
├── requirements.txt                 # Python dependencies
│
├── web/                             # Browser-based web interface
│   ├── polyscriptor_server.py       # FastAPI backend (SSE streaming)
│   ├── static/
│   │   ├── index.html               # Single-page app
│   │   ├── app.js                   # State management, event bus
│   │   ├── app.css                  # Styles
│   │   └── components/              # ES6 modules (engine, viewer, transcription, batch)
│   └── tests/
│       └── test_server.py           # API tests (pytest + FastAPI TestClient)
│
└── models/                          # Trained models (excluded from git)
    ├── pylaia_*/                    # CRNN-CTC model checkpoints
    └── trocr_*/                     # TrOCR fine-tuned models

🎓 Typical Workflow

Training a CRNN-CTC Model

  1. Export data from Transkribus (PAGE XML format)
  2. Parse with preprocessing:
    python3 transkribus_parser.py \
        --input_dir ./transkribus_export \
        --output_dir ./data/my_dataset \
        --preserve-aspect-ratio \
        --target-height 128
  3. Convert to CRNN-CTC format:
    python3 convert_to_pylaia.py \
        --input_csv ./data/my_dataset/train.csv \
        --output_dir ./data/crnn_train
    python3 convert_to_pylaia.py \
        --input_csv ./data/my_dataset/val.csv \
        --output_dir ./data/crnn_val
  4. Train model:
    python3 train_pylaia.py \
        --train_dir ./data/crnn_train \
        --val_dir ./data/crnn_val \
        --output_dir ./models/my_model \
        --batch_size 32 \
        --epochs 250
  5. Use in GUI: Model will appear in the CRNN-CTC engine dropdown

Using Trained Models

Trained models can be loaded in the GUI:

  • CRNN-CTC models: Select from dropdown or browse to model directory
  • TrOCR models: Specify HuggingFace Hub ID or local checkpoint path
  • Commercial APIs: Enter API keys in engine configuration

πŸ› οΈ Command-Line Inference

CRNN-CTC (Single Line)

python3 inference_pylaia_native.py \
    --checkpoint models/my_model/best_model.pt \
    --syms models/my_model/symbols.txt \
    --image line_image.png

CRNN-CTC (Full Page with Segmentation)

python3 inference_page.py \
    --image page.jpg \
    --checkpoint models/my_model/best_model.pt \
    --num-beams 4

📦 Batch Processing

Batch Processing GUI

For processing multiple images or folders, use the batch processing GUI:

python3 polyscriptor_batch_gui.py

Features:

  • Process entire folders of images
  • Automatic PAGE XML detection (uses existing segmentation if available)
  • Progress tracking with live output
  • Export results to TXT, CSV, or PAGE XML
  • Resume interrupted processing

Batch Processing CLI

For scripted/automated workflows:

python3 batch_processing.py \
    --input-folder ./images \
    --engine crnn-ctc \
    --model-path models/my_model/best_model.pt \
    --segmentation-method kraken \
    --output-folder ./output \
    --use-pagexml

Key options:

  • --engine: crnn-ctc, TrOCR, Qwen3-VL, LightOnOCR, Party, Kraken, PaddleOCR
  • --segmentation-method: kraken (recommended), hpp (fast), none (pre-segmented)
  • --use-pagexml: Auto-detect and use existing PAGE XML segmentation
  • --resume: Skip already-processed files
  • --dry-run: Test without writing output
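
The --resume behavior above boils down to an output-existence check before processing each image. A minimal sketch, assuming a one-transcription-file-per-image layout (the real CLI's file naming and extensions may differ):

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def pending_images(input_folder, output_folder, resume=True):
    """List images that still need processing. With resume=True, skip any
    image whose transcription (.txt) already exists in the output folder."""
    out = Path(output_folder)
    todo = []
    for img in sorted(Path(input_folder).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue  # ignore non-image files
        if resume and (out / f"{img.stem}.txt").exists():
            continue  # already processed in a previous run
        todo.append(img)
    return todo
```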

πŸ–¨οΈ PaddleOCR Engine

PaddleOCR performs its own text detection + recognition on whole pages, so no pre-segmented lines are needed. It runs in an isolated venv_paddle to avoid OpenCV conflicts with the main environment.
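
The subprocess isolation can be sketched as a JSON-over-stdin/stdout round trip to a worker launched from the isolated venv. This is an illustrative sketch; the actual protocol between paddle_engine.py and paddle_worker.py may differ, and the request fields shown are hypothetical.

```python
import json
import subprocess
import sys

def call_worker(worker_cmd, request):
    """Run a worker process, send one JSON request on stdin, and parse the
    last JSON line it prints. Keeps PaddleOCR's heavy dependencies (and its
    OpenCV build) out of the calling process."""
    proc = subprocess.run(worker_cmd, input=json.dumps(request).encode(),
                          capture_output=True, check=True)
    return json.loads(proc.stdout.decode().strip().splitlines()[-1])

# Hypothetical call shape; venv_paddle's interpreter replaces sys.executable:
# call_worker(["venv_paddle/bin/python", "engines/paddle_worker.py"],
#             {"image": "page.jpg", "lang": "ru"})
```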

Setup

python3 -m venv venv_paddle
source venv_paddle/bin/activate
# CPU only:
pip install paddlepaddle paddleocr
# GPU (CUDA 12.x):
pip install paddleocr
pip install paddlepaddle-gpu==3.0.0 -f https://www.paddlepaddle.org.cn/packages/stable/cu126/
deactivate

Language codes

PaddleOCR uses ISO language codes (not script names). Enter the code in the "Language code" field of the engine config. Common examples:

Code         Language / Script
en           English
ch           Chinese + English (strongest general model)
de / german  German
fr / french  French
ru           Russian (Cyrillic)
uk           Ukrainian (Cyrillic)
bg           Bulgarian (Cyrillic)
la           Latin (classical)
ar           Arabic
japan        Japanese
korean       Korean

Note: Models download automatically on first use (~50–200 MB per script group). Only en is fetched during initial setup. Other language models cache in ~/.paddlex/official_models/. Full language list: https://paddlepaddle.github.io/PaddleOCR/main/en/ppocr/blog/multi_languages.html


🌐 Web UI (Browser-Based Interface)

Polyscriptor includes a browser-based web interface: run inference locally or on a remote server and interact from any browser. No X11 forwarding is needed; when running on a remote server, no local Python install is required either.

Quick Start

# Web dependencies are included in requirements.txt, so no extra install is needed.

# Activate your virtual environment first:
source htr_env/bin/activate    # Linux/Mac
# or: htr_env\Scripts\activate  # Windows

# Start the server (run from the project root)
uvicorn web.polyscriptor_server:app --host 0.0.0.0 --port 8765

# Open in browser
# Local: http://localhost:8765
# Remote: use SSH tunnel (see below)

Works without a GPU. Commercial APIs (Gemini, Claude, OpenAI) and TrOCR run on CPU. CRNN-CTC also runs on CPU; inference is slower (~1–2 min/page) but fully functional, and our published Church Slavonic, Ukrainian, and Glagolitic models all work this way. Only Qwen3-VL and LightOnOCR require a GPU.

Remote Access via SSH Tunnel

# On your laptop: tunnel port 8765 through SSH
ssh -L 8765:localhost:8765 user@your.server.edu

# Then open: http://localhost:8765
# No firewall issues; works on any university network

Features

  • Engine selection with dynamic configuration forms (CRNN-CTC, TrOCR, Kraken, etc.)
  • Image upload: drag-and-drop, file picker, or PDF upload (multi-page PDFs become batch items)
  • Segmentation: Kraken (neural blla or classical) with color-coded region overlay
  • Live transcription: Server-Sent Events stream; lines appear as they are processed
  • Batch queue: multi-image queue, drag-to-reorder, cancel, prev/next navigation
  • Inline editing: double-click any transcription line to correct it
  • Confidence filter: slider to dim low-confidence lines
  • Export: TXT, CSV, PAGE XML (single image or ZIP for an entire batch)
  • Font selector: Monomakh Unicode (recommended for Church Slavonic), Old Standard TT, and others
  • Kraken model presets: 12 Zenodo community models with one-click download
  • Resizable panels: drag handles to adjust column widths, saved across sessions
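
The live transcription feature relies on the Server-Sent Events wire format. A minimal sketch of how one transcribed line could be framed as an SSE event (illustrative only; the event names and payload fields used by polyscriptor_server.py are assumptions):

```python
import json

def sse_event(data, event=None):
    """Serialize a payload as one Server-Sent Events frame: an optional
    'event:' line, a 'data:' line, and a blank-line terminator."""
    frame = ""
    if event:
        frame += f"event: {event}\n"
    frame += f"data: {json.dumps(data)}\n\n"
    return frame

# A browser EventSource subscribed to "line" events could receive, e.g.:
# sse_event({"index": 3, "text": "..."}, event="line")
```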

Running Tests

source htr_env/bin/activate
pip install pytest httpx
pytest web/tests/test_server.py -v

🖥️ Remote Server Usage

Running on a remote Linux server without GUI? You have several options:

Option 1: CLI Batch Processing

Best for: Production workflows, processing many images

# Process entire folders efficiently
python3 batch_processing.py \
    --input-folder HTR_Images/manuscripts \
    --engine crnn-ctc \
    --model-path models/crnn_ctc_model/best_model.pt \
    --use-pagexml \
    --output-folder output

Benefits: faster than GUI methods, no display overhead, scriptable

Option 2: X11 Forwarding (Interactive Work)

Best for: Interactive GUI work, visual parameter tuning, model comparison

Using MobaXterm on Windows:

  1. Install MobaXterm (X server auto-starts)
  2. SSH with X11 forwarding enabled
  3. Test: xclock & (should show clock window)
  4. Launch GUI: python3 transcription_gui_plugin.py

Performance: Good over LAN/local WiFi, slower over internet connections. Enable compression for best results.

Option 3: VNC (Alternative for Slow Connections)

Best for: When X11 is too slow (poor internet), extended GUI sessions, session persistence

# On server
vncserver :1 -geometry 1920x1080

# Connect from Windows using VNC viewer to: server:5901

Benefits: Better compression than X11, survives disconnects, works well over internet

Comparison

Method                Speed  Best For                             Network Type
CLI Batch Processing  ⚡⚡⚡    Production, automation               Any
Web UI                ⚡⚡⚡    Interactive work, no install needed  Any (SSH tunnel)
X11 Forwarding        ⚡⚡     Interactive GUI work                 LAN/Local WiFi
X11 Forwarding        ⚡      Light use only                       Internet
VNC/NoMachine         ⚡⚡     Extended sessions, poor connections  Any

βš™οΈ Configuration

CRNN-CTC Training Parameters

Key hyperparameters for optimal performance:

{
    "img_height": 128,           # Target image height
    "batch_size": 32,            # GPU-optimized (44GB VRAM)
    "num_epochs": 250,           # With early stopping
    "learning_rate": 0.0003,
    "early_stopping_patience": 15,
    "augment_train": True,       # Data augmentation
    "device": "cuda:0"
}
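
The early_stopping_patience setting above can be sketched as a counter over the validation metric (an illustrative sketch; the trainer's actual stopping logic may differ):

```python
class EarlyStopping:
    """Stop training once validation CER has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_cer):
        """Record one epoch's validation CER; return True when training
        should stop."""
        if val_cer < self.best:
            self.best = val_cer
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```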

TrOCR Training Configuration

model_name: "kazars24/trocr-base-handwritten-ru"
data_root: "./processed_data"
batch_size: 16
epochs: 10
cache_images: true             # 10-50x faster data loading
fp16: true                     # Mixed precision training
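
The cache_images speed-up comes from doing the expensive decode/preprocess work once per image and reusing the result in every later epoch. A minimal in-memory cache sketch (illustrative; the real pipeline's loader and cache key scheme are assumptions):

```python
class CachedLoader:
    """Cache preprocessed samples so each image is decoded only once;
    subsequent epochs hit the cache instead of re-reading from disk."""

    def __init__(self, preprocess):
        self.preprocess = preprocess  # expensive: decode + resize + normalize
        self.cache = {}
        self.misses = 0

    def get(self, path):
        if path not in self.cache:
            self.misses += 1
            self.cache[path] = self.preprocess(path)
        return self.cache[path]
```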

🤝 Contributing

Contributions welcome! Areas of interest:

  1. New HTR engines: Add plugins for other HTR systems
  2. Model training: Share trained models for new scripts/languages
  3. Bug fixes: Especially inference/GUI issues
  4. Documentation: Improve guides and examples

πŸ“ License

MIT License

Copyright (c) 2025 Achim Rabus

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


πŸ™ Acknowledgments


📧 Contact

For questions, bug reports, or collaboration inquiries:


🔬 Technical Notes

Critical Preprocessing for CRNN-CTC

Aspect Ratio Preservation is CRITICAL for high-aspect-ratio line images:

# ALWAYS use --preserve-aspect-ratio for manuscript lines
python3 transkribus_parser.py \
    --preserve-aspect-ratio \
    --target-height 128 \
    # ...other args

Without this, TrOCR's ViT encoder resizes inputs to a fixed 384×384, causing 10.6x width compression for Ukrainian lines (4077×357 → 384×384). Characters shrink from ~80px to ~7px wide, making recognition nearly impossible.
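
Aspect-ratio-preserving resizing reduces to one width calculation: scale the width by the same factor that brings the height to the target. A sketch using the --target-height 128 from the command above (illustrative, not the parser's actual code):

```python
def scaled_size(width, height, target_height=128):
    """Scale a line image to a fixed height while keeping its aspect
    ratio: the width shrinks by the same factor as the height."""
    scale = target_height / height
    return max(1, round(width * scale)), target_height

# A 4077x357 line becomes 1462x128: characters keep their proportions
# instead of being squashed into a fixed square.
```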

Known Bugs (Fixed)

  1. KALDI Format Vocabulary: Train/inference scripts now auto-detect format
  2. <space> vs <SPACE>: Both cases handled correctly
  3. Vocabulary File Mismatch: Training scripts auto-copy vocabulary to model directory
