Merged

44 commits
86392d2
Fix PIL DecompressionBomb error for large manuscript images
achimrabus Nov 13, 2025
82b8ac0
Fix TrOCR model loading from local checkpoint with .safetensors file
achimrabus Nov 13, 2025
205331a
Update .gitignore: exclude output_* dirs and versioned plan docs
achimrabus Nov 14, 2025
4f62e7f
Gemini 3 adjustments: reasoning token detection, early fallback trigg…
achimrabus Nov 20, 2025
9fded11
Add comprehensive documentation for Gemini 3 enhancements
achimrabus Nov 20, 2025
7f762a8
Add validation script for Gemini 3 enhancements
achimrabus Nov 20, 2025
8d54121
Add quick start guide for Gemini 3 enhancements
achimrabus Nov 20, 2025
46eb70d
Add implementation summary
achimrabus Nov 20, 2025
cd1dcba
Add documentation index for easy navigation
achimrabus Nov 20, 2025
714588f
Improve Gemini Advanced UI: add defaults, symmetrical layout
achimrabus Nov 20, 2025
036a5ab
Add cost controls: cap fallback at 8192, add debug logging, warn abou…
achimrabus Nov 20, 2025
9d06c27
Add comprehensive cost control guide
achimrabus Nov 20, 2025
a802595
Add Ukrainian V2b training preparation and fix debug logging bug
achimrabus Nov 21, 2025
f392048
Fix GUI crash: defer thread ref cleanup until thread.finished
achimrabus Nov 22, 2025
39b6848
Fix neural segmentation: use blla's actual region polygons
achimrabus Nov 22, 2025
f50f2ec
Fix EXIF orientation issues in batch processing and inference
achimrabus Nov 24, 2025
20c1072
Add EXIF rotation fix verification test
achimrabus Nov 24, 2025
d374c34
Document Transkribus export structure and EXIF rotation handling
achimrabus Nov 24, 2025
894f9e9
Add PAGE XML to text converter with reading order support
achimrabus Nov 24, 2025
73d093f
Ukrainian V2c: Fix EXIF+case-sensitivity bugs, achieve 4.76% CER
achimrabus Nov 25, 2025
6d6c460
Add OpenWebUI integration to batch GUI with model dropdown
achimrabus Dec 2, 2025
7b6c9d0
Initial plan
Copilot Dec 3, 2025
3ffec86
Initial plan
Copilot Dec 3, 2025
8bcd060
Initial plan
Copilot Dec 3, 2025
67b630a
Initial plan
Copilot Dec 3, 2025
57f3afe
Initial plan
Copilot Dec 3, 2025
37b167b
Move env loading to instance method, remove import-time print statements
Copilot Dec 3, 2025
29882c4
Remove print statements executing during module import in openwebui_e…
Copilot Dec 3, 2025
51a9236
Improve env loading: move import to module level, enhance docstring
Copilot Dec 3, 2025
7101c1c
Fix trailing whitespace
Copilot Dec 3, 2025
90f0419
Make hardcoded paths configurable via command-line arguments
Copilot Dec 3, 2025
7c06211
Enhance docstring to document env vars and fallback behavior
Copilot Dec 3, 2025
6cad758
Remove unused variables and imports, fix exception handling
Copilot Dec 3, 2025
77b4fac
Update documentation to reflect configurable paths
Copilot Dec 3, 2025
79a7a2f
Add validation for arguments and improve error handling
Copilot Dec 3, 2025
e98f574
Merge pull request #3 from achimrabus/copilot/sub-pr-2
achimrabus Dec 3, 2025
d68920a
Address code review feedback: remove redundant validation and add sec…
Copilot Dec 3, 2025
20b189c
Merge pull request #4 from achimrabus/copilot/sub-pr-2-again
achimrabus Dec 3, 2025
3b08acb
Merge pull request #5 from achimrabus/copilot/sub-pr-2-another-one
achimrabus Dec 3, 2025
ae07487
Merge pull request #6 from achimrabus/copilot/sub-pr-2-yet-again
achimrabus Dec 3, 2025
e84c387
Merge pull request #7 from achimrabus/copilot/sub-pr-2-one-more-time
achimrabus Dec 3, 2025
6bd27e1
Initial plan
Copilot Dec 3, 2025
da4242e
Remove unnecessary pass statement from inference_commercial_api.py
Copilot Dec 3, 2025
782c2d7
Merge pull request #8 from achimrabus/copilot/sub-pr-2-please-work
achimrabus Dec 3, 2025
13 changes: 13 additions & 0 deletions .gitignore
@@ -34,6 +34,7 @@ Thumbs.db
# Model checkpoints and outputs
models/
output/
output_*/
seq2seq_model_handwritten/
seq2seq_*/
cyrillic_seq2seq_*/
@@ -137,6 +138,7 @@ CLAUDE.md

# Internal planning documents (not for public repo)
*_PLAN.md
*_PLAN_*.md
IMPLEMENTATION_SUMMARY.md
PARTY_FIX_TESTING.md
PARTY_POC_VS_PLUGIN_COMPARISON.md
@@ -197,3 +199,14 @@ debug_*.py
SERVER_ENV.md

htr_gui/

# Training logs
training_ukrainian_v2c.log
nohup.out

# Diagnostic and inspection scripts (temporary)
diagnose_exif_mismatch.py
inspect_*.ipynb

# Gabelsberger shorthand preparation (work in progress)
prepare_gabelsberger_shorthand.py
143 changes: 143 additions & 0 deletions COST_CONTROL_GUIDE.md
@@ -0,0 +1,143 @@
# Cost Control & Performance Guide

## Problem Identified

Your transcription showed:
```
[tokens] prompt=1147 candidates=0 total=7290
⏱️ Early reasoning fallback triggered: internal=6143 (100% of budget)
Fallback max_output_tokens=12288
✅ Fallback succeeded (527 chars)
Time: 290 seconds (~5 minutes)
```

**Issues**:
1. ❌ **Extremely expensive**: 12,288 token fallback for just 527 characters
2. ❌ **Very slow**: 290 seconds for one page is unsustainable
3. ❌ **gemini-3-pro-preview** burning all tokens on internal reasoning

## Changes Made

### 1. **Capped Fallback at 8192 Tokens**
**Before**: `fallback_tokens = max(8192, max_output_tokens * 2)`
- With 6144 initial → fallback to 12,288 tokens

**After**: `fallback_tokens = 8192` (fixed cap)
- **Saves ~33% tokens** on fallback attempts
- Console shows: `Fallback max_output_tokens=8192 (capped for cost control)`
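
The cap is a one-line change; a minimal sketch of the before/after logic (function and variable names are illustrative, not the actual code):

```python
FALLBACK_CAP = 8192  # hard ceiling on fallback output tokens

def fallback_budget(initial_max_tokens: int) -> int:
    """Return the token budget for a fallback attempt.

    Old behaviour: max(8192, initial_max_tokens * 2), which ballooned
    to 12,288 when the initial budget was 6,144.
    New behaviour: a fixed cap, for cost control.
    """
    return FALLBACK_CAP

# Old vs. new for the 6144-token case described above:
old = max(8192, 6144 * 2)    # 12288
new = fallback_budget(6144)  # 8192
print(f"Fallback max_output_tokens={new} (capped for cost control)")
```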

### 2. **Added Debug Logging**
Now shows:
```
🔧 LOW thinking mode: overriding max_output_tokens to 6144
📊 Final settings: thinking_mode=low, max_output_tokens=6144, temp=1.0
Using max_output_tokens=6144 (from config)
```
This confirms your LOW-mode token setting is being applied.

### 3. **Restriction Prompt Injection (Replacing Prior Banner)**
Automatic injection for preview models:
```
INSTRUCTION: Provide ONLY the direct diplomatic transcription ... (see code)
```
This replaces the prior GUI warning banner and focuses on reducing hidden reasoning token burn without forcing model switches that are unsuitable for Church Slavonic.
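
The injection pattern can be sketched as follows. The exact instruction wording is elided in this guide, so the string below is a placeholder, and the model-name check is an assumption about how preview models are detected:

```python
# Placeholder text -- the real instruction lives in the code ("see code" above).
RESTRICTION_PROMPT = (
    "INSTRUCTION: Provide ONLY the direct diplomatic transcription "
    "of the page. Do not explain your reasoning or add commentary."
)

def build_prompt(user_prompt: str, model_name: str) -> str:
    """Prepend the restriction instruction for preview models only."""
    if "preview" in model_name:  # assumed detection heuristic
        return f"{RESTRICTION_PROMPT}\n\n{user_prompt}"
    return user_prompt
```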

## Recommended Solutions

### 🎯 **Primary Strategy: Preview Model + Restriction Prompt**
Church Slavonic manuscripts require `gemini-3-pro-preview` for acceptable accuracy. Instead of switching models, we now:
1. Inject a restriction instruction to reduce internal reasoning token consumption.
2. Use LOW thinking + fast-direct for early emission.
3. Trigger early fallback if internal reasoning reaches threshold with no output.

### 🔄 **Alternate Strategy: High Reasoning Pass (If Low Underproduces)**
If LOW mode still burns tokens without output, switch to HIGH thinking with an 8192 cap and keep restriction prompt. This can yield better completeness at the cost of time.

---

## Cost Comparison

| Model | Time/Page | Tokens/Page | Notes |
|-------|-----------|-------------|-------|
| **gemini-3-pro-preview (LOW + restriction)** | 40-120s | ~4,000–8,000 | Balanced; early fallback + restriction reduce waste |
| **gemini-3-pro-preview (HIGH)** | 90-180s | ~6,000–8,192 | Use if LOW fails to emit; higher completeness |
| *(Other models)* | — | — | Not used (insufficient Church Slavonic fidelity) |

*Approximate; varies by content and API pricing.*

---

## Debugging Your Current Setup

### Check if LOW-mode override is working:

Look for these lines in console:
```
🔧 LOW thinking mode: overriding max_output_tokens to 6144
📊 Final settings: thinking_mode=low, max_output_tokens=6144, temp=1.0
Using max_output_tokens=6144 (from config)
```

**If you see**:
```
Increasing max_output_tokens from 2048 to 4096 for preview model
```
→ Your GUI field is empty or invalid. Check that the Advanced panel's "Low-mode tokens" field is set to `6144`.

### Why preview model is slow:

Preview models have internal "reasoning" that:
1. Consumes tokens invisibly (`total - prompt - candidates` = internal)
2. Adds latency (model is "thinking" but not outputting)
3. Doesn't guarantee better output for simple text

Your log: `internal=6143` out of 6143 budget = **100% wasted**

---

## Action Plan

### Immediate (Next Transcription)
1. Ensure restriction prompt injection message appears in console.
2. Use LOW thinking + fast-direct early exit.
3. If MAX_TOKENS hit with no parts → fallback auto-escalates to 8192.
4. If still empty, rerun with HIGH thinking (restriction stays).

### If Output Truncated
- Disable early exit; enable auto continuation (2 passes)
- Raise low-mode tokens (e.g., 7168) within 8192 cap

---

## Technical Notes

### Why Capping Fallback Helps
- Preview model fallback was `max(8192, 6144*2)` = 12,288
- Token budget scales quadratically with reasoning depth
- Capping at 8192 forces model to be concise
- If 8192 fails → likely need different model, not more tokens

### Early Fallback Trigger
Your log shows trigger worked:
```
⏱️ Early reasoning fallback triggered: internal=6143 (100% of budget)
```
This is GOOD: the system detected the waste early and aborted the stream.
Without it, you would wait even longer before getting the fallback result.
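
The trigger arithmetic can be reproduced from the log values above, using the `total - prompt - candidates` formula for hidden reasoning. A sketch with illustrative function names:

```python
def internal_reasoning_tokens(prompt: int, candidates: int, total: int) -> int:
    """Hidden reasoning tokens = total - prompt - candidates."""
    return total - prompt - candidates

def should_trigger_early_fallback(prompt: int, candidates: int, total: int,
                                  budget: int, threshold: float = 1.0) -> bool:
    """Abort the stream once internal reasoning consumes the budget
    with no visible output (threshold=1.0 means 100% of budget)."""
    internal = internal_reasoning_tokens(prompt, candidates, total)
    return candidates == 0 and internal >= budget * threshold

# Values from the log above: prompt=1147, candidates=0, total=7290, budget=6143
print(should_trigger_early_fallback(1147, 0, 7290, 6143))  # True
```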

### Future Enhancement
Could add **model auto-switching**:
- Try flash first (15s timeout)
- On failure/poor quality → escalate to pro
- On repeated failure → preview as last resort
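
A sketch of what such auto-switching could look like. This is a future enhancement, so everything here is hypothetical: the model names, the `transcribe()` callable, and the quality check are placeholders:

```python
# Hypothetical escalation ladder, cheapest model first.
ESCALATION = ["gemini-flash", "gemini-pro", "gemini-3-pro-preview"]

def transcribe_with_escalation(page, transcribe, is_acceptable):
    """Try cheaper models first; escalate on failure or poor quality."""
    for model in ESCALATION:
        try:
            result = transcribe(page, model=model)
        except Exception:
            continue  # hard failure -> escalate to the next model
        if is_acceptable(result):
            return model, result
        # poor quality -> also escalate
    raise RuntimeError("All models failed or produced poor output")
```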

---

## Summary

✅ **Fallback capped** at 8192 (was 12,288)
✅ **Debug logging** added for transparency
✅ **Restriction prompt** active for preview models
✅ **Banner removed** that recommended alternative models (unsuitable for Church Slavonic)

**Bottom line**: For Church Slavonic, preview model + restriction prompt + early fallback is the current best-performing path; alternative models underperform in fidelity.
81 changes: 81 additions & 0 deletions CRITICAL_VENV_REQUIREMENT.md
@@ -0,0 +1,81 @@
# CRITICAL: Virtual Environment Requirement

## Always Use htr_gui Virtual Environment

**MANDATORY FOR ALL PYTHON SCRIPTS**:
```bash
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate
```

**CORRECT PATH**: `/home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate`

## Why This Matters

- All dependencies (PyTorch, PIL, lxml, etc.) are installed in this venv
- System Python may not have required packages
- Different Python versions may cause compatibility issues
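
As a defensive measure, a script can check the venv itself before importing heavy dependencies. A minimal sketch; the name-based check is an assumption about the venv layout, not existing project code:

```python
import sys
from pathlib import Path

def in_expected_venv(expected_name: str = "htr_gui") -> bool:
    """Return True if running inside a venv whose directory is named as expected.

    A venv interpreter reports a sys.prefix different from sys.base_prefix.
    """
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return in_venv and Path(sys.prefix).name == expected_name

if not in_expected_venv():
    print("WARNING: htr_gui venv not active; dependencies may be missing")
```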

## Usage Patterns

### Direct Script Execution
```bash
# WRONG - will fail with missing dependencies
python3 script.py

# CORRECT - activate venv first
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate
python3 script.py
```

### Shell Scripts
Always add at the top of any shell script:
```bash
#!/bin/bash
# CRITICAL: Always activate htr_gui virtual environment
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate

# ... rest of script
```

### Long-Running Processes (Preprocessing, Training)
Use `nohup` to prevent disconnection from stopping the process:
```bash
#!/bin/bash
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate

nohup python3 training_script.py > training.log 2>&1 &
echo "Process started with PID: $!"
```

## Current Scripts Updated

All Prosta Mova scripts now include venv activation:
- ✅ `preprocess_prosta_mova.sh` - includes venv activation
- ✅ `run_preprocess_prosta_mova.sh` - nohup wrapper with venv
- ✅ `run_pylaia_prosta_mova_training.sh` - nohup wrapper with venv
- ✅ `start_pylaia_prosta_mova_training.py` - run via shell wrapper

## Monitoring Long-Running Processes

```bash
# Start preprocessing in background
./run_preprocess_prosta_mova.sh

# Monitor progress
tail -f preprocess_prosta_mova.log

# Start training in background (after preprocessing completes)
./run_pylaia_prosta_mova_training.sh

# Monitor training
tail -f training_prosta_mova.log
```

## Reference Scripts

See successful examples:
- Church Slavonic training: Used venv + nohup
- Ukrainian PyLaia training: Used venv + nohup
- Glagolitic PyLaia training: Used venv + nohup

**NEVER forget to activate the venv - it will save hours of debugging!**
116 changes: 116 additions & 0 deletions DATASET_FORMAT_FIX.md
@@ -0,0 +1,116 @@
# Dataset Format Fix - PyLaiaDataset Update

## Problem

Training script crashed with:
```
FileNotFoundError: [Errno 2] No such file or directory:
'/home/achimrabus/htr_gui/dhlab-slavistik/data/pylaia_prosta_mova_v4_train/images/line_images/...'
```

## Root Cause

`PyLaiaDataset` class was designed for old format:
- **Old format**: `lines.txt` contains just sample IDs (e.g., `0001`)
- Images: `images/0001.png`
- Ground truth: `gt/0001.txt`

But V4 dataset uses new format:
- **New format**: `lines.txt` contains full paths with text (e.g., `line_images/0001.png text here`)
- Images: directly in `line_images/`
- No separate `gt/` directory

The dataset loader was looking for `images/line_images/...` (double-pathing error).
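
The double-pathing can be reproduced in a few lines; the paths here are illustrative:

```python
from pathlib import Path

data_dir = Path("data/pylaia_prosta_mova_v4_train")
entry = "line_images/0001.png some ground truth text"  # one lines.txt entry
img_rel_path, text = entry.split(" ", 1)

# Old loader: assumed an images/ subdirectory, so relative paths from
# lines.txt got doubled into images/line_images/... (FileNotFoundError).
old_path = data_dir / "images" / img_rel_path

# Fixed loader: paths in lines.txt are already relative to data_dir.
new_path = data_dir / img_rel_path

print(old_path.as_posix())  # data/pylaia_prosta_mova_v4_train/images/line_images/0001.png
print(new_path.as_posix())  # data/pylaia_prosta_mova_v4_train/line_images/0001.png
```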

## Solution

Updated `train_pylaia.py` PyLaiaDataset class (lines 53-164):

### Changes Made

1. **Removed old directory structure** (lines 53-55):
```python
# OLD:
self.images_dir = self.data_dir / "images"
self.gt_dir = self.data_dir / "gt"

# NEW: (removed, paths come from lines.txt)
```

2. **Parse new format** (lines 57-71):
```python
# Load list of samples (new format: "image_path text")
self.samples = [] # List of (image_path, text) tuples
with open(list_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if not line:
continue
# Split on first space: image_path text
parts = line.split(' ', 1)
if len(parts) == 2:
img_path, text = parts
self.samples.append((img_path, text))
```

3. **Updated __getitem__** (lines 131-164):
```python
def __getitem__(self, idx):
img_rel_path, text = self.samples[idx]

# Load image (relative to data_dir)
img_path = self.data_dir / img_rel_path
image = Image.open(img_path).convert('L')

# ... processing ...

return image, torch.LongTensor(target), text, img_rel_path
```

4. **Updated __len__** (line 129):
```python
def __len__(self):
return len(self.samples) # Was: len(self.sample_ids)
```

## Verification

Tested both datasets successfully:

### Training Dataset
```
✓ Loaded 58,843 samples
✓ Vocabulary size: 187 characters
✓ First sample image shape: torch.Size([1, 128, 1464])
✓ Image path: line_images/0955_Suprasliensis_KlimentStd-0042_r1l26.png
```

### Validation Dataset
```
✓ Loaded 2,588 samples
✓ Vocabulary size: 187 characters
✓ First sample image shape: torch.Size([1, 128, 1974])
✓ Image path: ../pylaia_prosta_mova_v4_val/line_images/0027_bibliasiriechkni01luik_orig_0442_region_1567088695198_394l30.png
```

## Impact

✅ **Training script now works** with V4 dataset format
✅ **No data conversion needed** - relative paths work correctly
✅ **Backward compatible** - old datasets can be converted by writing a `lines.txt` in the same format
✅ **Validation dataset** correctly references `../pylaia_prosta_mova_v4_val/` directory

## Ready for Training

All systems green:
- ✅ Dataset loading fixed
- ✅ Train CER logging added
- ✅ Hyperparameters optimized
- ✅ 58,843 training + 2,588 validation samples
- ✅ EXIF rotation bug fixed
- ✅ nohup launch script created

Training command:
```bash
./run_pylaia_prosta_mova_v4_training.sh
```