
Lightweight Agentic OCR Document Extraction

A lightweight, agentic OCR pipeline to extract text and structured fields from document images using Tesseract OCR.

Features

  • Multiple Preprocessing Variants: Automatically generates and tests multiple image preprocessing variants (grayscale, thresholding, sharpening, denoising, resizing, CLAHE, morphological operations)
  • Multiple PSM Modes: Tests various Tesseract page segmentation modes to find optimal results
  • Intelligent Candidate Scoring: Ranks OCR results by average confidence, word count, and text length
  • Structured Field Extraction: Extracts common document fields including:
    • DOI, ISSN, ISBN, PMID, arXiv ID
    • Volume, Issue, Pages, Year
    • Received/Accepted/Published dates
    • Title, Authors, Abstract, Keywords
    • Email addresses
  • Parallel Processing: Uses thread pools for faster processing of multiple variants
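
The structured field extraction listed above is rule-based. A minimal sketch of how such regex-driven extraction can work is shown below; the pattern names and weights are illustrative assumptions, not the actual patterns used by `extract_fields` in `agentic_ocr_extractor.py`:

```python
import re

# Illustrative patterns only; the library's real regexes may be stricter.
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/\S+\b"),           # standard DOI prefix form
    "issn": re.compile(r"\bISSN[:\s]*\d{4}-\d{3}[\dXx]\b"),
    "year": re.compile(r"\b(19|20)\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def extract_fields_sketch(text: str) -> dict:
    """Return the first match for each known field, or None if absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        fields[name] = m.group(0) if m else None
    return fields
```

Applied to OCR output such as `"Published 2021. DOI: 10.1000/xyz123"`, this returns a dictionary with the year and DOI populated and the unmatched fields set to `None`.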

Installation

Prerequisites

  1. Install Tesseract OCR:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y tesseract-ocr

macOS:

brew install tesseract

Windows: Download and install from: https://github.com/UB-Mannheim/tesseract/wiki

  2. Install Python dependencies:
pip install pytesseract opencv-python-headless pillow numpy

Usage

Command Line

# Basic usage
python agentic_ocr_extractor.py document.jpg

# Save outputs to files
python agentic_ocr_extractor.py document.png -o output.txt -j fields.json

# Custom scale factor and PSM modes
python agentic_ocr_extractor.py scan.jpg --scale 2.0 --psm 3 6 11

# Quiet mode (suppress progress output)
python agentic_ocr_extractor.py document.jpg -q

Python API

from agentic_ocr_extractor import process_image, run_agent, extract_fields
import cv2

# Full processing pipeline
cleaned_text, fields, best_candidate = process_image(
    'document.jpg',
    output_text_path='extracted_text.txt',
    output_json_path='extracted_fields.json',
    scale_factor=1.5,
    verbose=True
)

# Access extracted fields
print(f"Title: {fields.title}")
print(f"Authors: {fields.authors}")
print(f"DOI: {fields.doi}")

# Or use individual components
bgr = cv2.imread('document.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

# Run agentic OCR
best = run_agent(rgb, psms=[3, 4, 6, 11], scale_factor=1.5)
print(f"Best variant: {best.variant}, PSM: {best.psm}, Confidence: {best.avg_conf}")

# Extract fields from text
fields = extract_fields(best.text)
print(fields.to_dict())

How It Works

  1. Image Loading: Reads the input image and converts to RGB format

  2. Preprocessing: Generates multiple variants of the image:

    • Raw (original)
    • Upscaled (1.5x by default)
    • Grayscale
    • Otsu threshold
    • Adaptive threshold
    • Denoised
    • Sharpened
    • CLAHE (Contrast Limited Adaptive Histogram Equalization)
    • Morphological closing
  3. OCR Execution: Runs Tesseract OCR on each variant with multiple page segmentation modes (PSM 3, 4, 6, 11 by default)

  4. Candidate Scoring: Scores each OCR result based on:

    • Average word confidence
    • Text length (penalizes very short outputs)
    • Word count (bonus for reasonable counts)
  5. Best Selection: Selects the candidate with the highest combined score

  6. Text Cleaning: Cleans the OCR output by:

    • Normalizing Unicode characters
    • Fixing common OCR artifacts
    • Cleaning whitespace
  7. Field Extraction: Uses rule-based regex patterns to extract structured fields
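
The candidate scoring in step 4 can be sketched as a single function over per-candidate OCR statistics. The weights and thresholds below are assumptions for illustration; the actual values used by `agentic_ocr_extractor.py` may differ:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    variant: str     # name of the preprocessing variant (e.g. "clahe")
    psm: int         # Tesseract page segmentation mode used
    text: str        # raw OCR output for this variant/PSM pair
    avg_conf: float  # mean per-word confidence reported by Tesseract (0-100)

def score(candidate: Candidate) -> float:
    """Combine confidence, text length, and word count into one score.
    Thresholds and weights here are illustrative, not the library's values."""
    s = candidate.avg_conf
    if len(candidate.text) < 20:                  # penalize very short outputs
        s -= 25.0
    if 10 <= len(candidate.text.split()) <= 5000: # bonus for a plausible word count
        s += 5.0
    return s

def pick_best(candidates: list[Candidate]) -> Candidate:
    """Step 5: select the candidate with the highest combined score."""
    return max(candidates, key=score)
```

With this heuristic, a long, moderately confident result can outrank a very short result with slightly higher raw confidence, which is the intended behavior of the length penalty.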

Tips for Better Results

  • Image Quality: Higher resolution images generally produce better results
  • Scale Factor: Try increasing the scale factor (e.g., 2.0) for images with small text
  • PSM Modes: Different document layouts may benefit from different PSM modes:
    • PSM 3: Fully automatic page segmentation
    • PSM 4: Assume a single column of text
    • PSM 6: Assume a single uniform block of text
    • PSM 11: Sparse text, find as much text as possible
  • Cropping: For PDFs rendered to images, cropping margins often improves results
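
Margin cropping from the last tip is a simple array slice. A sketch, assuming a NumPy image array as returned by `cv2.imread`; the 5% margin fraction is an assumption to tune per document:

```python
import numpy as np

def crop_margins(image: np.ndarray, frac: float = 0.05) -> np.ndarray:
    """Trim `frac` of the height and width from every edge of an
    RGB or grayscale image array before passing it to OCR."""
    h, w = image.shape[:2]
    dy, dx = int(h * frac), int(w * frac)
    return image[dy:h - dy, dx:w - dx]
```

The cropped array can then be passed directly to `run_agent` in place of the full-page image.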

License

MIT License - see LICENSE for details.
