📂 Improving Local Search with Computer Vision

Technology
Dec 2025

The Challenge of Unsearchable Files

While reorganizing my file system, I found that many documents, such as receipts, handwritten notes, and older paper scans, exist only as plain image files (JPG, PNG, or PDFs without a text layer). The machine sees these files as pictures, so their content is invisible to standard desktop search tools.

This project addresses that gap by implementing a robust, containerized solution to process these image-based documents and overlay a precise, searchable text layer.

The Dockerized OCR Worker Architecture

The solution is a Python worker service packaged in a minimal Docker container. This architecture offers portability (it runs on Linux, Windows, or Mac) and isolation from the host system. The core process is built around established open-source OCR tools.

The worker uses OCRmyPDF with the high-fidelity tessdata_best Tesseract models. This CPU-heavy approach favors accuracy over speed, and the resulting PDF keeps its original appearance while gaining an invisible, searchable text layer.
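As a rough sketch of how such an OCR step might be invoked from Python, the snippet below builds an `ocrmypdf` command line and points Tesseract at a model directory through the `TESSDATA_PREFIX` environment variable. The helper names, the `/opt/tessdata_best` path, and the choice of flags are assumptions for illustration, not the project's actual configuration.

```python
import os
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Return an ocrmypdf command that adds a text layer to src and writes dst."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language model to use
        "--skip-text",           # leave pages that already contain text untouched
        str(src),
        str(dst),
    ]


def run_ocr(src: Path, dst: Path, tessdata_dir: str = "/opt/tessdata_best") -> None:
    """Run OCR; tessdata_dir is a hypothetical location of the tessdata_best models."""
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_dir)
    # check=True raises CalledProcessError when ocrmypdf exits with an error.
    subprocess.run(build_ocr_command(src, dst), check=True, env=env)
```

Keeping command construction separate from execution makes the flags easy to inspect or log before any CPU-heavy work starts.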

To prevent crashes on large, high-resolution scans, the worker resizes images conditionally (only files wider than 1200px are scaled down) and relies on Docker resource limits. This keeps quality high without exhausting system memory.
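The conditional resize could look roughly like the sketch below. The 1200px threshold comes from the description above; the function names, the aspect-ratio-preserving scale, and the LANCZOS filter are assumptions rather than the worker's confirmed behavior.

```python
def target_size(width: int, height: int, max_width: int = 1200) -> tuple[int, int]:
    """Return the size to use: unchanged if narrow enough, else scaled to max_width."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    # Keep the aspect ratio and never let the height collapse to zero.
    return max_width, max(1, round(height * scale))


def resize_if_needed(src: str, dst: str, max_width: int = 1200) -> None:
    """Downscale src with Pillow only when it exceeds max_width, then save to dst."""
    from PIL import Image  # third-party dependency: pip install Pillow

    with Image.open(src) as img:
        new_size = target_size(img.width, img.height, max_width)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst)
```

For example, a 2400x1800 scan would be reduced to 1200x900, while an 800x600 image would pass through at its original size.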

How the System Works

  1. Input: You drop files (Images or PDFs) into the designated /input folder.
  2. Processing: The worker automatically detects the new file and begins processing. Images are standardized, then OCR is performed.
  3. Output: Upon success, the processed file is named with a unique, sortable timestamp (e.g., receipt_YYYYMMDD_HHMMSSMS.pdf) and moved to the /output_searchable directory.
  4. Archival: The original file is safely moved to the /output_originals folder. Errors are logged, and failed files are moved to /output_errors.
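The folder queue described in the steps above can be sketched with the standard library alone. This is a minimal illustration: the set of accepted extensions, the helper names, and the reading of the trailing `MS` in the filename pattern as milliseconds are all assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".pdf"}  # assumed set of accepted extensions


def pending_files(input_dir: Path) -> list[Path]:
    """List supported files waiting in the input folder, in name order."""
    return sorted(p for p in input_dir.iterdir()
                  if p.is_file() and p.suffix.lower() in SUPPORTED)


def searchable_name(stem: str, now: datetime) -> str:
    """Build a sortable name; the trailing 'MS' is read here as milliseconds."""
    return f"{stem}_{now:%Y%m%d_%H%M%S}{now.microsecond // 1000:03d}.pdf"


def archive_original(src: Path, root: Path, ok: bool) -> Path:
    """Move the original to output_originals on success, output_errors on failure."""
    dest_dir = root / ("output_originals" if ok else "output_errors")
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(dest_dir / src.name)))
```

A worker loop would call `pending_files` periodically, run OCR on each result, write the output under `searchable_name(...)`, and then call `archive_original` with the outcome.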

Because every input ends up in exactly one output folder, this folder-based queue keeps files from being silently lost and leaves a clean, searchable digital archive.

Aligning with the Unix Philosophy

While implemented in Python and Docker, this OCR worker adheres closely to the Unix philosophy, prioritizing simplicity, modularity, and clarity:

  • Do one thing and do it well: The worker's sole purpose is to convert image data to searchable text layers. It avoids complex database management or web interfaces.
  • Everything is a file: The entire process relies on standard file operations—reading from an input directory and writing to an output directory. This ensures the service is easily integrated with existing file managers and backup tools.
  • Modular Components: The container uses a pipeline of specialized tools (Pillow for image preparation, img2pdf for image-to-PDF, and Tesseract/OCRmyPDF for OCR) that work together seamlessly, rather than one monolithic application.
  • Chaining Programs: The file system acts as the "pipe." The output of one step (a standardized image) becomes the input for the next (the OCR process), creating a predictable and reliable workflow.
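The "file system as pipe" idea in the list above can be illustrated with a small sketch in which every stage reads one file and writes another. The two stages here are stand-ins that operate on text files; real stages would call Pillow, img2pdf, and OCRmyPDF, and all names below are hypothetical.

```python
from pathlib import Path
from typing import Callable

Stage = Callable[[Path, Path], Path]  # (input file, work dir) -> output file


def run_pipeline(src: Path, work_dir: Path, stages: list[Stage]) -> Path:
    """Feed each stage's output file to the next stage and return the final file."""
    current = src
    for stage in stages:
        current = stage(current, work_dir)
    return current


def standardize(src: Path, work_dir: Path) -> Path:
    """Stand-in for image preparation: writes a normalized copy of the input."""
    out = work_dir / f"{src.stem}.std"
    out.write_text(src.read_text().strip())
    return out


def add_text_layer(src: Path, work_dir: Path) -> Path:
    """Stand-in for the OCR step: writes the input plus a marker to a new file."""
    out = work_dir / f"{src.stem}.out"
    out.write_text(src.read_text() + " [text layer]")
    return out
```

Because each stage only agrees on "a path in, a path out", a stage can be swapped or rerun on its intermediate file without touching the others.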

Explore the full source code, deployment instructions, and configuration details on GitHub:

View Project on GitHub
