Merged

44 commits
86392d2
Fix PIL DecompressionBomb error for large manuscript images
achimrabus Nov 13, 2025
82b8ac0
Fix TrOCR model loading from local checkpoint with .safetensors file
achimrabus Nov 13, 2025
205331a
Update .gitignore: exclude output_* dirs and versioned plan docs
achimrabus Nov 14, 2025
4f62e7f
Gemini 3 adjustments: reasoning token detection, early fallback trigg…
achimrabus Nov 20, 2025
9fded11
Add comprehensive documentation for Gemini 3 enhancements
achimrabus Nov 20, 2025
7f762a8
Add validation script for Gemini 3 enhancements
achimrabus Nov 20, 2025
8d54121
Add quick start guide for Gemini 3 enhancements
achimrabus Nov 20, 2025
46eb70d
Add implementation summary
achimrabus Nov 20, 2025
cd1dcba
Add documentation index for easy navigation
achimrabus Nov 20, 2025
714588f
Improve Gemini Advanced UI: add defaults, symmetrical layout
achimrabus Nov 20, 2025
036a5ab
Add cost controls: cap fallback at 8192, add debug logging, warn abou…
achimrabus Nov 20, 2025
9d06c27
Add comprehensive cost control guide
achimrabus Nov 20, 2025
a802595
Add Ukrainian V2b training preparation and fix debug logging bug
achimrabus Nov 21, 2025
f392048
Fix GUI crash: defer thread ref cleanup until thread.finished
achimrabus Nov 22, 2025
39b6848
Fix neural segmentation: use blla's actual region polygons
achimrabus Nov 22, 2025
f50f2ec
Fix EXIF orientation issues in batch processing and inference
achimrabus Nov 24, 2025
20c1072
Add EXIF rotation fix verification test
achimrabus Nov 24, 2025
d374c34
Document Transkribus export structure and EXIF rotation handling
achimrabus Nov 24, 2025
894f9e9
Add PAGE XML to text converter with reading order support
achimrabus Nov 24, 2025
73d093f
Ukrainian V2c: Fix EXIF+case-sensitivity bugs, achieve 4.76% CER
achimrabus Nov 25, 2025
6d6c460
Add OpenWebUI integration to batch GUI with model dropdown
achimrabus Dec 2, 2025
7b6c9d0
Initial plan
Copilot Dec 3, 2025
3ffec86
Initial plan
Copilot Dec 3, 2025
8bcd060
Initial plan
Copilot Dec 3, 2025
67b630a
Initial plan
Copilot Dec 3, 2025
57f3afe
Initial plan
Copilot Dec 3, 2025
37b167b
Move env loading to instance method, remove import-time print statements
Copilot Dec 3, 2025
29882c4
Remove print statements executing during module import in openwebui_e…
Copilot Dec 3, 2025
51a9236
Improve env loading: move import to module level, enhance docstring
Copilot Dec 3, 2025
7101c1c
Fix trailing whitespace
Copilot Dec 3, 2025
90f0419
Make hardcoded paths configurable via command-line arguments
Copilot Dec 3, 2025
7c06211
Enhance docstring to document env vars and fallback behavior
Copilot Dec 3, 2025
6cad758
Remove unused variables and imports, fix exception handling
Copilot Dec 3, 2025
77b4fac
Update documentation to reflect configurable paths
Copilot Dec 3, 2025
79a7a2f
Add validation for arguments and improve error handling
Copilot Dec 3, 2025
e98f574
Merge pull request #3 from achimrabus/copilot/sub-pr-2
achimrabus Dec 3, 2025
d68920a
Address code review feedback: remove redundant validation and add sec…
Copilot Dec 3, 2025
20b189c
Merge pull request #4 from achimrabus/copilot/sub-pr-2-again
achimrabus Dec 3, 2025
3b08acb
Merge pull request #5 from achimrabus/copilot/sub-pr-2-another-one
achimrabus Dec 3, 2025
ae07487
Merge pull request #6 from achimrabus/copilot/sub-pr-2-yet-again
achimrabus Dec 3, 2025
e84c387
Merge pull request #7 from achimrabus/copilot/sub-pr-2-one-more-time
achimrabus Dec 3, 2025
6bd27e1
Initial plan
Copilot Dec 3, 2025
da4242e
Remove unnecessary pass statement from inference_commercial_api.py
Copilot Dec 3, 2025
782c2d7
Merge pull request #8 from achimrabus/copilot/sub-pr-2-please-work
achimrabus Dec 3, 2025
13 changes: 13 additions & 0 deletions .gitignore
@@ -34,6 +34,7 @@ Thumbs.db
# Model checkpoints and outputs
models/
output/
output_*/
seq2seq_model_handwritten/
seq2seq_*/
cyrillic_seq2seq_*/
@@ -137,6 +138,7 @@ CLAUDE.md

# Internal planning documents (not for public repo)
*_PLAN.md
*_PLAN_*.md
IMPLEMENTATION_SUMMARY.md
PARTY_FIX_TESTING.md
PARTY_POC_VS_PLUGIN_COMPARISON.md
@@ -197,3 +199,14 @@ debug_*.py
SERVER_ENV.md

htr_gui/

# Training logs
training_ukrainian_v2c.log
nohup.out

# Diagnostic and inspection scripts (temporary)
diagnose_exif_mismatch.py
inspect_*.ipynb

# Gabelsberger shorthand preparation (work in progress)
prepare_gabelsberger_shorthand.py
143 changes: 143 additions & 0 deletions COST_CONTROL_GUIDE.md
@@ -0,0 +1,143 @@
# Cost Control & Performance Guide

## Problem Identified

Your transcription showed:
```
[tokens] prompt=1147 candidates=0 total=7290
⏱️ Early reasoning fallback triggered: internal=6143 (100% of budget)
Fallback max_output_tokens=12288
✅ Fallback succeeded (527 chars)
Time: 290 seconds (~5 minutes)
```

**Issues**:
1. ❌ **Extremely expensive**: 12,288 token fallback for just 527 characters
2. ❌ **Very slow**: 290 seconds for one page is unsustainable
3. ❌ **gemini-3-pro-preview** burning all tokens on internal reasoning

## Changes Made

### 1. **Capped Fallback at 8192 Tokens**
**Before**: `fallback_tokens = max(8192, max_output_tokens * 2)`
- With 6144 initial → fallback to 12,288 tokens

**After**: `fallback_tokens = 8192` (fixed cap)
- **Saves ~33% tokens** on fallback attempts
- Console shows: `Fallback max_output_tokens=8192 (capped for cost control)`
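
The cap is a one-line change; a minimal sketch of the before/after logic (function and variable names are illustrative, not the actual code):

```python
FALLBACK_CAP = 8192  # hard ceiling on fallback output tokens

def fallback_budget(initial_max_tokens: int) -> int:
    """Return the token budget for a fallback attempt.

    Old behaviour: max(8192, initial_max_tokens * 2), which ballooned
    to 12,288 when the initial budget was 6,144.
    New behaviour: a fixed cap, for cost control.
    """
    return FALLBACK_CAP

# Old vs. new for the 6144-token case described above:
old = max(8192, 6144 * 2)    # 12288
new = fallback_budget(6144)  # 8192
print(f"Fallback max_output_tokens={new} (capped for cost control)")
```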

### 2. **Added Debug Logging**
Now shows:
```
🔧 LOW thinking mode: overriding max_output_tokens to 6144
📊 Final settings: thinking_mode=low, max_output_tokens=6144, temp=1.0
Using max_output_tokens=6144 (from config)
```
This confirms your LOW-mode token setting is being applied.

### 3. **Restriction Prompt Injection (Replacing Prior Banner)**
Automatic injection for preview models:
```
INSTRUCTION: Provide ONLY the direct diplomatic transcription ... (see code)
```
This replaces the prior GUI warning banner and focuses on reducing hidden reasoning token burn without forcing model switches that are unsuitable for Church Slavonic.
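
The injection pattern can be sketched as follows. The exact instruction wording is elided in this guide, so the string below is a placeholder, and the model-name check is an assumption about how preview models are detected:

```python
# Placeholder text -- the real instruction lives in the code ("see code" above).
RESTRICTION_PROMPT = (
    "INSTRUCTION: Provide ONLY the direct diplomatic transcription "
    "of the page. Do not explain your reasoning or add commentary."
)

def build_prompt(user_prompt: str, model_name: str) -> str:
    """Prepend the restriction instruction for preview models only."""
    if "preview" in model_name:  # assumed detection heuristic
        return f"{RESTRICTION_PROMPT}\n\n{user_prompt}"
    return user_prompt
```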

## Recommended Solutions

### 🎯 **Primary Strategy: Preview Model + Restriction Prompt**
Church Slavonic manuscripts require `gemini-3-pro-preview` for acceptable accuracy. Instead of switching models, we now:
1. Inject a restriction instruction to reduce internal reasoning token consumption.
2. Use LOW thinking + fast-direct for early emission.
3. Trigger early fallback if internal reasoning reaches threshold with no output.

### 🔄 **Alternate Strategy: High Reasoning Pass (If Low Underproduces)**
If LOW mode still burns tokens without output, switch to HIGH thinking with an 8192 cap and keep restriction prompt. This can yield better completeness at the cost of time.

---

## Cost Comparison

| Model | Time/Page | Tokens/Page | Notes |
|-------|-----------|-------------|-------|
| **gemini-3-pro-preview (LOW + restriction)** | 40-120s | ~4,000–8,000 | Balanced; early fallback + restriction reduce waste |
| **gemini-3-pro-preview (HIGH)** | 90-180s | ~6,000–8,192 | Use if LOW fails to emit; higher completeness |
| *(Other models)* | — | — | Not used (insufficient Church Slavonic fidelity) |

*Approximate; varies by content and API pricing.*

---

## Debugging Your Current Setup

### Check if LOW-mode override is working:

Look for these lines in console:
```
🔧 LOW thinking mode: overriding max_output_tokens to 6144
📊 Final settings: thinking_mode=low, max_output_tokens=6144, temp=1.0
Using max_output_tokens=6144 (from config)
```

**If you see**:
```
Increasing max_output_tokens from 2048 to 4096 for preview model
```
→ Your GUI field is empty or invalid. Check that the Advanced panel's "Low-mode tokens" field is set to `6144`.

### Why preview model is slow:

Preview models have internal "reasoning" that:
1. Consumes tokens invisibly (`total - prompt - candidates` = internal)
2. Adds latency (model is "thinking" but not outputting)
3. Doesn't guarantee better output for simple text

Your log: `internal=6143` out of 6143 budget = **100% wasted**

---

## Action Plan

### Immediate (Next Transcription)
1. Ensure restriction prompt injection message appears in console.
2. Use LOW thinking + fast-direct early exit.
3. If MAX_TOKENS hit with no parts → fallback auto-escalates to 8192.
4. If still empty, rerun with HIGH thinking (restriction stays).

### If Output Truncated
- Disable early exit; enable auto continuation (2 passes)
- Raise low-mode tokens (e.g., 7168) within 8192 cap

---

## Technical Notes

### Why Capping Fallback Helps
- Preview model fallback was `max(8192, 6144*2)` = 12,288
- Token budget scales quadratically with reasoning depth
- Capping at 8192 forces model to be concise
- If 8192 fails → likely need different model, not more tokens

### Early Fallback Trigger
Your log shows trigger worked:
```
⏱️ Early reasoning fallback triggered: internal=6143 (100% of budget)
```
This is GOOD: the system detected the waste early and aborted the stream.
Without it, you would wait even longer before getting the fallback result.
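
The trigger arithmetic can be reproduced from the log values above, using the `total - prompt - candidates` formula for hidden reasoning. A sketch with illustrative function names:

```python
def internal_reasoning_tokens(prompt: int, candidates: int, total: int) -> int:
    """Hidden reasoning tokens = total - prompt - candidates."""
    return total - prompt - candidates

def should_trigger_early_fallback(prompt: int, candidates: int, total: int,
                                  budget: int, threshold: float = 1.0) -> bool:
    """Abort the stream once internal reasoning consumes the budget
    with no visible output (threshold=1.0 means 100% of budget)."""
    internal = internal_reasoning_tokens(prompt, candidates, total)
    return candidates == 0 and internal >= budget * threshold

# Values from the log above: prompt=1147, candidates=0, total=7290, budget=6143
print(should_trigger_early_fallback(1147, 0, 7290, 6143))  # True
```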

### Future Enhancement
Could add **model auto-switching**:
- Try flash first (15s timeout)
- On failure/poor quality → escalate to pro
- On repeated failure → preview as last resort
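
A sketch of what such auto-switching could look like. This is a future enhancement, so everything here is hypothetical: the model names, the `transcribe()` callable, and the quality check are placeholders:

```python
# Hypothetical escalation ladder, cheapest model first.
ESCALATION = ["gemini-flash", "gemini-pro", "gemini-3-pro-preview"]

def transcribe_with_escalation(page, transcribe, is_acceptable):
    """Try cheaper models first; escalate on failure or poor quality."""
    for model in ESCALATION:
        try:
            result = transcribe(page, model=model)
        except Exception:
            continue  # hard failure -> escalate to the next model
        if is_acceptable(result):
            return model, result
        # poor quality -> also escalate
    raise RuntimeError("All models failed or produced poor output")
```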

---

## Summary

✅ **Fallback capped** at 8192 (was 12,288)
✅ **Debug logging** added for transparency
✅ **Restriction prompt** active for preview models
✅ **Banner removed** that recommended alternative models (unsuitable for Church Slavonic)

**Bottom line**: For Church Slavonic, preview model + restriction prompt + early fallback is the current best-performing path; alternative models underperform in fidelity.
81 changes: 81 additions & 0 deletions CRITICAL_VENV_REQUIREMENT.md
@@ -0,0 +1,81 @@
# CRITICAL: Virtual Environment Requirement

## Always Use htr_gui Virtual Environment

**MANDATORY FOR ALL PYTHON SCRIPTS**:
```bash
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate
```

**CORRECT PATH**: `/home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate`

## Why This Matters

- All dependencies (PyTorch, PIL, lxml, etc.) are installed in this venv
- System Python may not have required packages
- Different Python versions may cause compatibility issues
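
As a defensive measure, a script can check the venv itself before importing heavy dependencies. A minimal sketch; the name-based check is an assumption about the venv layout, not existing project code:

```python
import sys
from pathlib import Path

def in_expected_venv(expected_name: str = "htr_gui") -> bool:
    """Return True if running inside a venv whose directory is named as expected.

    A venv interpreter reports a sys.prefix different from sys.base_prefix.
    """
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return in_venv and Path(sys.prefix).name == expected_name

if not in_expected_venv():
    print("WARNING: htr_gui venv not active; dependencies may be missing")
```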

## Usage Patterns

### Direct Script Execution
```bash
# WRONG - will fail with missing dependencies
python3 script.py

# CORRECT - activate venv first
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate
python3 script.py
```

### Shell Scripts
Always add at the top of any shell script:
```bash
#!/bin/bash
# CRITICAL: Always activate htr_gui virtual environment
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate

# ... rest of script
```

### Long-Running Processes (Preprocessing, Training)
Use `nohup` to prevent disconnection from stopping the process:
```bash
#!/bin/bash
source /home/achimrabus/htr_gui/dhlab-slavistik/htr_gui/bin/activate

nohup python3 training_script.py > training.log 2>&1 &
echo "Process started with PID: $!"
```

## Current Scripts Updated

All Prosta Mova scripts now include venv activation:
- ✅ `preprocess_prosta_mova.sh` - includes venv activation
- ✅ `run_preprocess_prosta_mova.sh` - nohup wrapper with venv
- ✅ `run_pylaia_prosta_mova_training.sh` - nohup wrapper with venv
- ✅ `start_pylaia_prosta_mova_training.py` - run via shell wrapper

## Monitoring Long-Running Processes

```bash
# Start preprocessing in background
./run_preprocess_prosta_mova.sh

# Monitor progress
tail -f preprocess_prosta_mova.log

# Start training in background (after preprocessing completes)
./run_pylaia_prosta_mova_training.sh

# Monitor training
tail -f training_prosta_mova.log
```

## Reference Scripts

See successful examples:
- Church Slavonic training: Used venv + nohup
- Ukrainian PyLaia training: Used venv + nohup
- Glagolitic PyLaia training: Used venv + nohup

**NEVER forget to activate the venv - it will save hours of debugging!**
116 changes: 116 additions & 0 deletions DATASET_FORMAT_FIX.md
@@ -0,0 +1,116 @@
# Dataset Format Fix - PyLaiaDataset Update

## Problem

Training script crashed with:
```
FileNotFoundError: [Errno 2] No such file or directory:
'/home/achimrabus/htr_gui/dhlab-slavistik/data/pylaia_prosta_mova_v4_train/images/line_images/...'
```

## Root Cause

`PyLaiaDataset` class was designed for old format:
- **Old format**: `lines.txt` contains just sample IDs (e.g., `0001`)
- Images: `images/0001.png`
- Ground truth: `gt/0001.txt`

But V4 dataset uses new format:
- **New format**: `lines.txt` contains full paths with text (e.g., `line_images/0001.png text here`)
- Images: directly in `line_images/`
- No separate `gt/` directory

The dataset loader was looking for `images/line_images/...` (double-pathing error).
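
The double-pathing can be reproduced in a few lines; the paths here are illustrative:

```python
from pathlib import Path

data_dir = Path("data/pylaia_prosta_mova_v4_train")
entry = "line_images/0001.png some ground truth text"  # one lines.txt entry
img_rel_path, text = entry.split(" ", 1)

# Old loader: assumed an images/ subdirectory, so relative paths from
# lines.txt got doubled into images/line_images/... (FileNotFoundError).
old_path = data_dir / "images" / img_rel_path

# Fixed loader: paths in lines.txt are already relative to data_dir.
new_path = data_dir / img_rel_path

print(old_path.as_posix())  # data/pylaia_prosta_mova_v4_train/images/line_images/0001.png
print(new_path.as_posix())  # data/pylaia_prosta_mova_v4_train/line_images/0001.png
```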

## Solution

Updated `train_pylaia.py` PyLaiaDataset class (lines 53-164):

### Changes Made

1. **Removed old directory structure** (lines 53-55):
```python
# OLD:
self.images_dir = self.data_dir / "images"
self.gt_dir = self.data_dir / "gt"

# NEW: (removed, paths come from lines.txt)
```

2. **Parse new format** (lines 57-71):
```python
# Load list of samples (new format: "image_path text")
self.samples = [] # List of (image_path, text) tuples
with open(list_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if not line:
continue
# Split on first space: image_path text
parts = line.split(' ', 1)
if len(parts) == 2:
img_path, text = parts
self.samples.append((img_path, text))
```

3. **Updated __getitem__** (lines 131-164):
```python
def __getitem__(self, idx):
img_rel_path, text = self.samples[idx]

# Load image (relative to data_dir)
img_path = self.data_dir / img_rel_path
image = Image.open(img_path).convert('L')

# ... processing ...

return image, torch.LongTensor(target), text, img_rel_path
```

4. **Updated __len__** (line 129):
```python
def __len__(self):
return len(self.samples) # Was: len(self.sample_ids)
```

## Verification

Tested both datasets successfully:

### Training Dataset
```
✓ Loaded 58,843 samples
✓ Vocabulary size: 187 characters
✓ First sample image shape: torch.Size([1, 128, 1464])
✓ Image path: line_images/0955_Suprasliensis_KlimentStd-0042_r1l26.png
```

### Validation Dataset
```
✓ Loaded 2,588 samples
✓ Vocabulary size: 187 characters
✓ First sample image shape: torch.Size([1, 128, 1974])
✓ Image path: ../pylaia_prosta_mova_v4_val/line_images/0027_bibliasiriechkni01luik_orig_0442_region_1567088695198_394l30.png
```

## Impact

✅ **Training script now works** with V4 dataset format
✅ **No data conversion needed** - relative paths work correctly
✅ **Backward compatible** - old datasets can be converted by writing a `lines.txt` in the same format
✅ **Validation dataset** correctly references `../pylaia_prosta_mova_v4_val/` directory

## Ready for Training

All systems green:
- ✅ Dataset loading fixed
- ✅ Train CER logging added
- ✅ Hyperparameters optimized
- ✅ 58,843 training + 2,588 validation samples
- ✅ EXIF rotation bug fixed
- ✅ nohup launch script created

Training command:
```bash
./run_pylaia_prosta_mova_v4_training.sh
```