Fix PIL DecompressionBomb error for large manuscript images

Problem:
- Batch processing failed on 7/31 images with error: "Image size (180060192 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack."
- High-resolution manuscript scans (~13,417 × 13,417 pixels) exceed PIL's default 178MP safety limit

Root Cause:
- PIL/Pillow has built-in protection against decompression bomb attacks
- Default limit: Image.MAX_IMAGE_PIXELS = 178,956,970 (~178 megapixels)
- Large manuscript scans routinely exceed this (180MP+)
- This is a safety feature, not an actual threat in our use case

Solution:
- Disable decompression bomb protection: Image.MAX_IMAGE_PIXELS = None
- Applied to all image loading entry points:
  * batch_processing.py (batch CLI)
  * inference_page.py (single page inference)
  * transcription_gui_plugin.py (GUI)

Safety Consideration:
- Legitimate use case: high-resolution manuscript digitization
- Images come from trusted sources (the user's own scans)
- Not a security risk in this context (no untrusted images are processed)

Impact:
- ✅ All 31 images will now process successfully
- ✅ No artificial size limitations for manuscript scans
- ✅ Applies to both batch processing and GUI

Alternative Approaches Considered:
- Increase the limit to a specific value (e.g., 300MP): too arbitrary
- Check and warn per image: adds complexity, still requires an override
- Current approach: simplest and most user-friendly

Testing:
- Images with 180MP will now load without error
- No performance impact (the limit exists only for DoS protection)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
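The fix itself is a one-liner at each entry point; a minimal sketch of its placement (the `load_page_image` helper is hypothetical, added only to illustrate usage):

```python
from PIL import Image

# Disable Pillow's decompression-bomb guard. Safe here because the
# high-resolution manuscript scans come from trusted sources; never
# do this when opening untrusted images.
Image.MAX_IMAGE_PIXELS = None

def load_page_image(source):
    """Open a (possibly >178 MP) scan without triggering
    DecompressionBombError. Hypothetical helper."""
    return Image.open(source)
```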
Fix TrOCR model loading from local checkpoint with .safetensors file

Problem:
- TrOCR batch processing failed with error:
"Repo id must be in the form 'repo_name' or 'namespace/repo_name':
'/path/to/model.safetensors'. Use repo_type argument if needed."
- User provided full path to model.safetensors file
- HuggingFace from_pretrained() expects directory, not file path
Root Cause:
- TrOCRInference accepted model_path as either:
* HuggingFace repo ID (e.g., "kazars24/trocr-base-handwritten-ru")
* Local directory path (e.g., "models/my_model/")
- But if user provided specific file path (model.safetensors),
from_pretrained() interpreted it as invalid repo ID
Solution:
- Check if model_path is a file (using Path.is_file())
- If file: use parent directory for from_pretrained()
- If directory: use as-is (existing behavior)
- Add debug logging to clarify which path is used
Code Changes (inference_page.py:510-523):
```python
# If model_path points to a specific file (e.g., model.safetensors),
# use the parent directory for from_pretrained()
if self.checkpoint_path.is_file():
    model_dir = self.checkpoint_path.parent
    print(f"Model path is a file, using directory: {model_dir}")
else:
    model_dir = self.checkpoint_path
```
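Factored out as a helper, the same logic looks like this (a sketch; `resolve_model_dir` is a hypothetical name, since the real check is inline in `inference_page.py`):

```python
from pathlib import Path

def resolve_model_dir(model_path):
    """Return a path/repo ID suitable for from_pretrained().

    HuggingFace repo IDs and directories pass through unchanged;
    a path to a weights file (model.safetensors, pytorch_model.bin)
    resolves to its parent directory.
    """
    p = Path(model_path)
    if p.is_file():
        return str(p.parent)  # from_pretrained() wants the directory
    return model_path  # directory or repo ID: use as-is
```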
Impact:
- ✅ Supports both directory and file paths for local checkpoints
- ✅ Works with: "/path/to/model/" (directory)
- ✅ Works with: "/path/to/model/model.safetensors" (file)
- ✅ Works with: "/path/to/model/pytorch_model.bin" (file)
- ✅ Backward compatible with existing code
Testing:
- TrOCR can now load from:
* Directory: models/TrOCR_pstroe_Glagolitic/
* File: models/TrOCR_pstroe_Glagolitic/model.safetensors
* Both resolve to same model directory
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Update .gitignore: exclude output_* dirs and versioned plan docs

- Add output_*/ pattern to ignore all output directories
- Add *_PLAN_*.md pattern for versioned plan documents
- Keep the existing output/ pattern (already working)
Gemini 3 adjustments: reasoning token detection, early fallback trigger, GUI controls (min new chars, low-mode tokens, fallback threshold), continuation tuning, stats CSV logging
Add Ukrainian V2b training preparation and fix debug logging bug

Changes:
- Remove excessive debug logging from train_pylaia.py (lines 402-412)
  - Fixes training hang caused by 2,058 log writes per epoch
  - Ukrainian V2b now trains at 4-7 batches/second
- Add EXIF rotation prevention documentation:
  - PREPROCESSING_CHECKLIST.md: 10-point mandatory checklist
  - INVESTIGATION_SUMMARY.md: EXIF bug case study
  - TRAIN_CER_LOGGING_EXPLANATION.md: training monitoring guide
- Update transkribus_parser.py with 18-line EXIF warning comment
- Add convert_ukrainian_to_pylaia.py for dataset format conversion
- Add inspect_ukrainian_v2.ipynb for visual inspection (30 train + 15 val)

Ukrainian V2b training status:
- Dataset: 21,944 train + 814 val lines (EXIF-verified, polygon masked)
- Training started: 2025-11-21 17:14:06 on GPU 0
- Model: models/pylaia_ukrainian_v2b_20251121_171406/
- Target: <10% CER (beat Ukrainian V1's 10.80%)
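The underlying logging fix is a standard throttling pattern: emit only every N-th batch instead of every batch. A minimal sketch (the function name and interval are illustrative, not the actual `train_pylaia.py` code):

```python
def log_every(n_batches, log_fn=print):
    """Return a logger that emits only every n_batches-th call,
    avoiding thousands of per-epoch writes that can stall training."""
    state = {"count": 0}

    def maybe_log(msg):
        state["count"] += 1
        if state["count"] % n_batches == 0:
            log_fn(msg)

    return maybe_log
```

Wrapping the batch loop's logger with, say, `log_every(100)` would cut 2,058 writes per epoch to about 20.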
Fix GUI crash: defer thread ref cleanup until thread.finished

- Root cause: worker/thread refs were cleared in _on_finished() while the thread was still shutting down
- Solution: added an _on_thread_finished() callback connected to the thread.finished signal
- Refs are now cleared only after the thread has truly stopped, preventing the QThread destruction crash
- Confirmed polygons render correctly (convex hulls working as expected)
Fix neural segmentation: use blla's actual region polygons

- Root cause: code computed a convex hull from line bboxes instead of using blla's region boundaries
- Solution: extract and track blla region objects during line assignment and use their boundary polygons
- Fallback: compute a convex hull only when no blla polygon is available (e.g., after column clustering)
- Result: proper region boundaries instead of one giant irregular polygon
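For the fallback branch, the hull can be computed with Andrew's monotone-chain algorithm; a self-contained sketch (the project may instead rely on a library such as shapely or OpenCV — an assumption):

```python
def convex_hull(points):
    """Andrew's monotone chain: return hull vertices in CCW order.
    A fallback region boundary when no segmentation polygon exists."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop each chain's last point (it repeats the other chain's first)
    return lower[:-1] + upper[:-1]
```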
Fix EXIF orientation issues in batch processing and inference

- Root cause: images were loaded without applying EXIF rotation metadata
- Result: PAGE XML dimensions mismatched (e.g., 2848x4272 vs 4272x2848, swapped)
- Solution: added ImageOps.exif_transpose() to all image loading points:
  * batch_processing.py: image loading and dimension checking
  * inference_page.py: CLI inference
  * inference_page_gui.py: GUI inference
- Matches the approach already used in prepare_gabelsberger_shorthand.py
- Fixes the ~32% of Ukrainian validation data that carries EXIF rotation tags
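A minimal sketch of the pattern applied at each loading point (`load_oriented` is an illustrative wrapper; the actual call used is Pillow's `ImageOps.exif_transpose`):

```python
from PIL import Image, ImageOps

def load_oriented(source):
    """Open an image and bake in its EXIF orientation, so pixel
    dimensions match what PAGE XML (and most viewers) report."""
    img = Image.open(source)
    # Rotates/flips pixels per EXIF orientation tag 274 and removes
    # the tag; images without the tag pass through unchanged.
    return ImageOps.exif_transpose(img)
```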
Add EXIF rotation fix verification test

- Tests images with and without EXIF rotation metadata
- Verifies dimensions match PAGE XML after ImageOps.exif_transpose()
- Confirms the fix works on problematic Cyrillic-named files (Лист 021.JPG etc.)
- All 5 files with EXIF rotation (4272x2848 → 2848x4272) now match the XML
Document Transkribus export structure and EXIF rotation handling

- Clarified that PAGE XML files live in the page/ subdirectory (Transkribus standard)
- Updated all examples to show the correct path: page/image.xml
- Added a folder structure diagram showing the transkribus_export/page/ layout
- Documented the EXIF rotation fix for dimension mismatch errors
- Added a troubleshooting section for EXIF issues
- Recommended batch_processing.py for Transkribus exports
Ukrainian V2c: fix EXIF + case-sensitivity bugs, achieve 4.76% CER

Critical Bug Fixes:
1. EXIF rotation bug (transkribus_parser.py line 232)
   - V2b data was extracted BEFORE ImageOps.exif_transpose() was added
   - Impact: all 99 Лист files with EXIF tag 8 (270° rotation) excluded
   - Result: coordinates out of bounds, files skipped silently
2. Case-sensitivity bug (transkribus_parser.py line 344)
   - Only lowercase extensions (.jpg, .jpeg, .png) were checked
   - Impact: all Лист files with a .JPG extension skipped on Linux
   - Fix: added uppercase variants (.JPG, .JPEG, .PNG, .TIF, .TIFF)

Dataset Improvements (V2b → V2c):
- Training: 21,944 → 24,706 lines (+12.6%)
- Validation: 814 → 970 lines (+19.2%)
- Лист training: 0 → 2,772 lines (printed text)
- Лист validation: 0 → 156 lines
- Vocabulary: 181 → 185 symbols

Training Results:
- Best val CER: 4.76% (vs V2b's 5.53%)
- Improvement: 0.77pp (13.9% relative)
- Training time: 101 epochs, 2.7 hours
- Model: models/pylaia_ukrainian_v2c_20251124_180634/best_model.pt

Expected Impact:
- Лист files: 90%+ CER (V2b) → <5% CER (V2c)
- Overall: better generalization on printed + handwritten text

Files Added:
- UKRAINIAN_V2C_TRAINING_GUIDE.md: complete workflow documentation
- reextract_ukrainian_v2c.py: re-extraction with EXIF fix
- convert_ukrainian_v2c_to_pylaia.py: PyLaia format conversion
- train_pylaia_ukrainian_v2c.py: training script (batch 64, GPU 1)
- start_ukrainian_v2c_training.sh: nohup launcher with monitoring
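The extension fix generalizes to a case-insensitive suffix check; a sketch of that variant (the shipped fix instead enumerates the uppercase extensions explicitly):

```python
from pathlib import Path

# Extensions accepted by the parser, compared case-insensitively
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def is_image_file(path):
    """Case-insensitive extension check, so 'Лист 021.JPG' matches
    on case-sensitive filesystems (Linux) as well as macOS/Windows."""
    return Path(path).suffix.lower() in IMAGE_EXTENSIONS
```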
Add OpenWebUI integration to batch GUI with model dropdown

- Added OpenWebUI engine to batch_processing.py CLI with --api-key and --max-tokens arguments
- Added OpenWebUI to the batch GUI with API key input and model dropdown
- Model dropdown fetches available models from the OpenWebUI API via a "Refresh" button
- Auto-loads the API key from the OPENWEBUI_API_KEY environment variable
- Fixed batch processing bug: the OpenWebUI engine now stores config from load_model() and uses it during transcription
- Added validation requiring an API key and model selection for OpenWebUI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
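Populating the model dropdown reduces to one authenticated GET plus response parsing. A hedged sketch: the `/api/models` path and the OpenAI-style `{"data": [{"id": ...}]}` response shape are assumptions about the OpenWebUI API, not verified against it:

```python
import json
import urllib.request

def parse_model_ids(payload):
    """Extract sorted model IDs from an OpenAI-style model listing
    ({"data": [{"id": ...}, ...]}); assumed response shape."""
    return sorted(m["id"] for m in payload.get("data", []))

def fetch_openwebui_models(base_url, api_key):
    """GET the model list for the dropdown; the API key would come
    from the GUI field or the OPENWEBUI_API_KEY environment variable."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/models",  # assumed endpoint path
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_model_ids(json.load(resp))
```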
Pull request overview
This pull request titled "Gemini 3 adjustments" encompasses a substantial set of changes across multiple areas of the codebase, including Gemini API enhancements, data preprocessing improvements (particularly EXIF rotation fixes), new training scripts for Ukrainian datasets, PAGE XML processing tools, and various GUI and engine updates.
Key Changes:
- Enhanced Gemini API integration with thinking modes, continuation support, and improved error handling
- Critical EXIF rotation bug fix in transkribus_parser.py affecting image preprocessing
- New Ukrainian V2c training pipeline with improved data extraction (+15% more training data)
- PAGE XML batch segmentation tool with neural/classical modes and QC metrics
- Multiple utility scripts for visualization, validation, and dataset analysis
Reviewed changes
Copilot reviewed 52 out of 57 changed files in this pull request and generated 30 comments.
| File | Description |
|---|---|
| visualize_line_comparison.py | New script for comparing Church Slavonic and Prosta Mova line image segmentation with visual overlays |
| validate_gemini_enhancements.py | Validation script testing Gemini 3 parameter signatures, CSV logging, and GUI controls |
| transkribus_parser.py | Critical fix: Added ImageOps.exif_transpose() to handle EXIF rotation tags and case-insensitive file extension matching |
| transcription_gui_plugin.py | Added Image.MAX_IMAGE_PIXELS = None to handle large manuscript images |
| train_pylaia_ukrainian_v2c.py | New training script for Ukrainian V2c dataset with EXIF and case-sensitivity bug fixes |
| train_pylaia.py | Enhanced with training CER calculation for overfitting detection and improved data loading format |
| tighten_page_xml.py | New utility for tightening PAGE XML polygon segmentation to actual ink extent |
| test_exif_fix.py | Validation script verifying EXIF rotation fix works correctly |
| inference_commercial_api.py | Major enhancements: thinking modes, auto-continuation, streaming, fallback logic, and improved safety handling |
| engines/commercial_api_engine.py | GUI integration of new Gemini parameters (thinking mode, temperature, continuation settings) |
| engines/openwebui_engine.py | Added .env file support for API key loading and improved batch processing configuration |
| polyscriptor_batch_gui.py | Extended with OpenWebUI support, Qwen3-VL adapter/line-mode options, and improved model validation |
| batch_processing.py | Added EXIF rotation fix, OpenWebUI engine support, and --line-mode flag for page-based engines |
| pagexml/* | New standalone PAGE XML batch segmentation module with classical/neural/auto modes and QC metrics export |
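The auto-continuation and early-fallback behavior summarized for `inference_commercial_api.py` can be sketched engine-agnostically (all names and thresholds below are illustrative; `generate` stands in for the real API call):

```python
def transcribe_with_continuation(generate, prompt, max_rounds=5, min_new_chars=20):
    """Keep asking the model to continue while output is truncated,
    stopping early once a round adds fewer than min_new_chars
    characters (the early-fallback trigger).

    generate(prompt, so_far) -> (text, truncated) is a stand-in for
    the real API call.
    """
    so_far = ""
    for _ in range(max_rounds):
        chunk, truncated = generate(prompt, so_far)
        if len(chunk) < min_new_chars and so_far:
            break  # model is stalling: stop looping, let fallback run
        so_far += chunk
        if not truncated:
            break  # model finished on its own
    return so_far
```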
@achimrabus I've opened a new pull request, #3, to work on those changes. Once the pull request is ready, I'll request review from you.

@achimrabus I've opened a new pull request, #7, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: achimrabus <67736443+achimrabus@users.noreply.github.com>
Remove import-time print statements from OpenWebUI engine
Address code review feedback: remove redundant validation and add security warning
Remove print statements executing during openwebui_engine module import
Remove hardcoded absolute paths from visualize_line_comparison.py
[WIP] Update changes based on review comments for Gemini 3
Clean up unused code and improve exception handling
Pull request overview
Copilot reviewed 52 out of 57 changed files in this pull request and generated 1 comment.
@copilot open a new pull request to apply changes based on the comments in this thread

@achimrabus I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: achimrabus <67736443+achimrabus@users.noreply.github.com>
Remove unnecessary pass statement in Gemini retry exception handler
* Fix PIL DecompressionBomb error for large manuscript images
* Fix TrOCR model loading from local checkpoint with .safetensors file
* Update .gitignore: exclude output_* dirs and versioned plan docs
* Gemini 3 adjustments: reasoning token detection, early fallback trigger, GUI controls (min new chars, low-mode tokens, fallback threshold), continuation tuning, stats CSV logging
* Add comprehensive documentation for Gemini 3 enhancements
* Add validation script for Gemini 3 enhancements
* Add quick start guide for Gemini 3 enhancements
* Add implementation summary
* Add documentation index for easy navigation
* Improve Gemini Advanced UI: add defaults, symmetrical layout
* Add cost controls: cap fallback at 8192, add debug logging, warn about preview models
* Add comprehensive cost control guide
* Add Ukrainian V2b training preparation and fix debug logging bug
* Fix GUI crash: defer thread ref cleanup until thread.finished
* Fix neural segmentation: use blla's actual region polygons
* Fix EXIF orientation issues in batch processing and inference
* Add EXIF rotation fix verification test
* Document Transkribus export structure and EXIF rotation handling
* Add PAGE XML to text converter with reading order support
* Ukrainian V2c: Fix EXIF+case-sensitivity bugs, achieve 4.76% CER
* Add OpenWebUI integration to batch GUI with model dropdown
* Initial plan
* Initial plan
* Initial plan
* Initial plan
* Initial plan
* Move env loading to instance method, remove import-time print statements
* Remove print statements executing during module import in openwebui_engine
* Improve env loading: move import to module level, enhance docstring
* Fix trailing whitespace
* Make hardcoded paths configurable via command-line arguments
* Enhance docstring to document env vars and fallback behavior
* Remove unused variables and imports, fix exception handling
* Update documentation to reflect configurable paths
* Add validation for arguments and improve error handling
* Address code review feedback: remove redundant validation and add security warning
* Initial plan
* Remove unnecessary pass statement from inference_commercial_api.py

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: achimrabus <67736443+achimrabus@users.noreply.github.com>