Async Batch Processing for Multi-Site Structural Reports

You manage a portfolio of several hundred towers, and every night a mixed batch of structural inspection reports lands from third-party engineering firms, municipal inspectors, and legacy lease administrators. The files are heterogeneous — modern digital PDFs, 300-DPI scanned variance forms, and regionally inconsistent templates — and a naïve one-file-at-a-time loop stalls on the first multi-hundred-megabyte photogrammetry PDF while the municipal submission window quietly expires. This page is the hands-on build guide for the concurrency layer described in Async Batch Processing Pipelines: a bounded async worker pool that ingests every site’s report concurrently, streams each document so peak memory stays flat, falls back to OCR for scanned legacy forms, and writes a tamper-evident audit hash for every result. By the end you will have a runnable Python module that turns a chaotic overnight batch into an SLA-bound ingestion service.

Prerequisites & Context

Before running the code below, have the following in place:

Python 3.11+. The example uses asyncio.to_thread (added in 3.9) and the walrus operator (:=, added in 3.8) for chunked hashing, and it constructs asyncio.Semaphore before the event loop starts, which is only safe from 3.10 onward.
Extraction dependencies. pip install aiofiles pdfplumber pytesseract plus the Tesseract binary on the host (apt install tesseract-ocr). The digital-PDF path is handled by the same table-and-word extraction covered in PDFplumber Extraction Workflows; the scanned-form fallback follows the confidence-thresholding approach from OCR for Legacy Inspection Forms.
A canonical site identifier per file. Each report must be keyed to an antenna structure — a TWR-#### site ID, its lease reference, and where available the FCC Antenna Structure Registration (ASR) number. Those identifiers are what let extracted data cross-reference against lease and zoning records downstream.
A regulatory baseline. The audit hash exists to satisfy FCC ASR documentation retention and to give municipal zoning auditors tamper-evident lineage; treat it as mandatory, not decorative.

The core difficulty is reconciling format drift against hard regulatory deadlines. Engineering firms shift table coordinates, rename headers, and embed high-resolution imagery that inflates file size, so the pipeline must classify, route, and stream rather than assume a single schema.

Step-by-Step Implementation

Each step maps to a specific compliance or automation concern, not just a coding convenience.

Step 1 — Model each site’s report as a typed job. Wrap every file in a StructuralReport dataclass carrying site_id (TWR-8842), lease_id, the source path, an audit_hash slot, an extracted_data payload, a compliance_status, and an error_log. Because the full state travels with the job, a quarantined report can be inspected and replayed without re-deriving where it came from.

Step 2 — Bound concurrency to the downstream store. Guard extraction with asyncio.Semaphore(max_concurrency) sized to the compliance database’s connection pool. Unbounded gather over 500 files opens thousands of handles and reproduces the out-of-memory crash faster; a fixed slot count is what keeps the batch inside the machine’s limits.
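The effect of the fixed slot count can be checked in isolation. The sketch below is illustrative only: `run_bounded` and `fake_extract` are hypothetical stand-ins for the real extraction coroutine, and the short sleep simulates I/O-bound work. It records the highest number of jobs that were ever inside the semaphore at the same time.

```python
import asyncio


async def run_bounded(job_count: int, max_concurrency: int) -> int:
    """Run job_count dummy jobs and return the peak number running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)
    active = 0
    peak = 0

    async def fake_extract(job_id: int) -> None:
        # Hypothetical stand-in for per-report extraction.
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # simulate I/O-bound work
            active -= 1

    await asyncio.gather(*(fake_extract(i) for i in range(job_count)))
    return peak


if __name__ == "__main__":
    # 50 queued jobs, but never more than 8 in flight.
    print(asyncio.run(run_bounded(job_count=50, max_concurrency=8)))
```

All 50 coroutines are created up front, but only `max_concurrency` of them hold a slot at any moment; the rest wait cheaply on the semaphore.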

Step 3 — Hash the file for audit immutability before extraction. Stream the raw bytes through hashlib.sha256 in 64 KB chunks with aiofiles, so even a gigabyte photogrammetry PDF never loads whole. The digest is written on the ingestion-day record; regenerating it later proves the stored report is byte-for-byte unchanged.
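Regenerating the digest later is the other half of this step. The following is a minimal sketch of that check, written synchronously with the standard library so it runs without aiofiles; the helper names `sha256_streamed` and `is_unchanged` are assumptions for illustration, not part of the module below.

```python
import hashlib
import tempfile
from pathlib import Path

CHUNK_SIZE = 65536  # 64 KB, the chunk size described in this step


def sha256_streamed(path: Path) -> str:
    """Hash a file chunk by chunk so it is never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_hash: str) -> bool:
    """Re-hash the stored file and compare with the ingestion-day digest."""
    return sha256_streamed(path) == recorded_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        report = Path(tmp) / "twr8842_audit.pdf"  # placeholder file, not a real PDF
        report.write_bytes(b"%PDF-1.7 placeholder bytes")
        recorded = sha256_streamed(report)
        print(is_unchanged(report, recorded))   # file untouched
        report.write_bytes(b"%PDF-1.7 tampered bytes")
        print(is_unchanged(report, recorded))   # file modified after ingestion
```

Any single changed byte produces a different digest, which is what makes the stored hash useful as lineage evidence.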

Step 4 — Stream pages in a worker thread. pdfplumber is synchronous and CPU-bound, so run per-page extraction through asyncio.to_thread and iterate pages one at a time. This keeps the event loop responsive for other sites’ I/O and caps peak memory at roughly one page’s working set rather than the whole document.
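To see why the thread offload matters, the sketch below runs a blocking function through asyncio.to_thread while a heartbeat task ticks on the event loop. `parse_page_blocking` is a hypothetical stand-in for a synchronous pdfplumber page parse (it only sleeps), and `heartbeat` is an assumption added purely to observe loop responsiveness.

```python
import asyncio
import time


def parse_page_blocking(page_idx: int) -> str:
    """Hypothetical stand-in for a synchronous page parse."""
    time.sleep(0.05)
    return f"page-{page_idx}"


async def heartbeat(ticks: list) -> None:
    # Lives on the event loop; it only advances while the loop is not blocked.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.005)


async def extract_pages(page_count: int) -> tuple[list[str], int]:
    """Parse pages one at a time in a worker thread; return pages and tick count."""
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    pages = []
    for i in range(page_count):
        pages.append(await asyncio.to_thread(parse_page_blocking, i))
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return pages, len(ticks)


if __name__ == "__main__":
    pages, tick_count = asyncio.run(extract_pages(3))
    print(pages, tick_count)
```

If `parse_page_blocking` were called directly inside the coroutine instead, the heartbeat could not run at all until every page had finished.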

Step 5 — Fall back to OCR with confidence thresholding. When a page yields almost no embedded text, it is a scanned legacy municipal form. Render it at 300 DPI, run pytesseract.image_to_data, discard Tesseract’s -1 non-text boxes, and reject the page below the confidence floor so low-quality recognition never silently pollutes a compliance record.
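The confidence filter can be exercised without Tesseract installed. The sketch below works on a dictionary shaped like the result of pytesseract.image_to_data with Output.DICT (parallel "text" and "conf" lists); the sample values are hand-written for illustration, and `score_ocr_words` and `page_confidence` are hypothetical helper names.

```python
from typing import Any, Dict, List, Tuple


def score_ocr_words(ocr: Dict[str, List[Any]]) -> List[Tuple[str, float]]:
    """Keep (word, confidence) pairs, dropping -1 sentinels and blank tokens."""
    scored = []
    for word, conf in zip(ocr["text"], ocr["conf"]):
        try:
            value = float(conf)  # conf may arrive as int, float, or string
        except (TypeError, ValueError):
            continue
        if value < 0 or not str(word).strip():
            continue  # layout box with no recognised text
        scored.append((word, value))
    return scored


def page_confidence(scored: List[Tuple[str, float]]) -> float:
    """Mean confidence over real words; 0.0 when nothing was recognised."""
    return sum(c for _, c in scored) / len(scored) if scored else 0.0


if __name__ == "__main__":
    # Hand-written sample, not real Tesseract output.
    sample = {"text": ["", "TOWER", "  ", "PASS", "x"],
              "conf": [-1, 96, -1, "91.5", 12]}
    words = score_ocr_words(sample)
    print(words, page_confidence(words))
```

Parsing the confidence with float() and rejecting negatives is slightly more tolerant than string-matching "-1", since the value's type can differ between pytesseract and Tesseract versions.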

Step 6 — Set a compliance status and emit a structured audit line. Scan the assembled text for the clauses that mark a clean structural pass, classify the report as VALID, FLAGGED, or REQUIRES_REVIEW, and log the outcome with site_id, lease_id, status, and audit_hash as structured fields for the audit trail.
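The three-way outcome is easiest to reason about as a pure function. This is a sketch under stated assumptions: the `Status` enum mirrors the statuses named above, the clause list is illustrative rather than a definitive compliance vocabulary, and `classify` is a hypothetical helper.

```python
from enum import Enum
from typing import Sequence


class Status(Enum):
    VALID = "valid"
    FLAGGED = "flagged"
    REQUIRES_REVIEW = "requires_review"


PASS_CLAUSES = ("STRUCTURAL PASS", "NO DEFICIENCIES")  # illustrative clause list


def classify(full_text: str, page_errors: Sequence[str]) -> Status:
    """Any page-level error flags the report; otherwise look for a pass clause."""
    if page_errors:
        return Status.FLAGGED
    upper = full_text.upper()
    if any(clause in upper for clause in PASS_CLAUSES):
        return Status.VALID
    return Status.REQUIRES_REVIEW


if __name__ == "__main__":
    print(classify("Inspection result: structural pass", []).value)
    print(classify("Inspection result: structural pass", ["Low OCR confidence"]).value)
    print(classify("Guy wire tension recorded", []).value)
```

Keeping the decision free of I/O means it can be unit-tested against each status without a PDF on disk.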

Step 7 — Gather the batch with per-report isolation. Run all jobs under asyncio.gather(..., return_exceptions=True) so one corrupt PDF raising an exception never aborts the 380 files that parsed cleanly.
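A stripped-down illustration of that isolation follows. `parse` is a hypothetical stand-in for the per-report coroutine, and the site ID TWR-0000 is an invented marker for a corrupt file; the point is only how return_exceptions=True changes what gather hands back.

```python
import asyncio
from typing import Dict, List, Tuple


async def parse(site_id: str) -> str:
    # Hypothetical stand-in for per-report processing.
    await asyncio.sleep(0)
    if site_id == "TWR-0000":  # invented ID simulating a corrupt PDF
        raise ValueError(f"corrupt PDF for {site_id}")
    return site_id


async def run_isolated(site_ids: List[str]) -> Tuple[List[str], Dict[str, Exception]]:
    """Return the sites that parsed and a map of failed site -> exception."""
    results = await asyncio.gather(
        *(parse(s) for s in site_ids), return_exceptions=True
    )
    ok: List[str] = []
    failed: Dict[str, Exception] = {}
    # gather returns results in submission order, so zip pairs them correctly.
    for site_id, result in zip(site_ids, results):
        if isinstance(result, Exception):
            failed[site_id] = result
        else:
            ok.append(result)
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(run_isolated(["TWR-8842", "TWR-0000", "TWR-9017"]))
    print(ok, list(failed))
```

Pairing each result with its input is what lets a failure be attributed to a specific site rather than reported as an anonymous exception.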

Figure: the overnight batch as a bounded pipeline — a size-capped queue, an N-slot semaphore, format-aware workers, and a per-report SHA-256 hash feeding the compliance ledger.

Complete Runnable Example

The following self-contained module implements every step above with realistic telecom identifiers, stdlib structured logging, a custom exception hierarchy, and a SHA-256 audit hash per report. The sequence below traces one report through the async path before the code.

Figure: per-report async sequence across extraction, OCR fallback, and audit logging.

python

# pip install aiofiles pdfplumber pytesseract
import asyncio
import gc
import hashlib
import logging
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List

import aiofiles
import pdfplumber
import pytesseract

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",  # drop the default ",mmm" milliseconds suffix
)
logger = logging.getLogger("async_report_processor")

# --- Error categorization hierarchy ---
class ExtractionError(Exception):
    """Base exception for document processing failures."""

class FormatDriftError(ExtractionError):
    """Layout coordinates or expected headers deviate beyond tolerance."""

class OCRError(ExtractionError):
    """Text-recognition confidence fell below the compliance threshold."""

class ComplianceStatus(Enum):
    VALID = "valid"
    FLAGGED = "flagged"
    REQUIRES_REVIEW = "requires_review"

@dataclass
class StructuralReport:
    site_id: str                 # e.g. TWR-8842
    lease_id: str                # e.g. LSE-2026-0417
    asr_number: str              # FCC ASR, e.g. 1290342
    file_path: Path
    audit_hash: str = ""
    extracted_data: Dict[str, Any] = field(default_factory=dict)
    compliance_status: ComplianceStatus = ComplianceStatus.REQUIRES_REVIEW
    error_log: List[str] = field(default_factory=list)

class AsyncReportProcessor:
    # Only clauses that assert a clean structural pass belong here.
    PASS_CLAUSES = ("STRUCTURAL PASS", "NO DEFICIENCIES")

    def __init__(self, max_concurrency: int = 8, ocr_confidence_threshold: int = 75):
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.ocr_threshold = ocr_confidence_threshold

    async def compute_audit_hash(self, file_path: Path) -> str:
        """SHA-256 over streamed bytes for an immutable audit trail."""
        sha256 = hashlib.sha256()
        async with aiofiles.open(file_path, mode="rb") as f:
            while chunk := await f.read(65536):
                sha256.update(chunk)
        return sha256.hexdigest()

    def _extract_page_sync(self, pdf_path: str, page_idx: int, site_id: str) -> Dict[str, Any]:
        """Synchronous extraction run in a worker thread to avoid blocking the loop."""
        with pdfplumber.open(pdf_path) as pdf:
            page = pdf.pages[page_idx]
            tables = page.extract_tables()
            text = page.extract_text() or ""

            # OCR fallback for scanned legacy inspection forms. This runs before
            # the empty-page check: a scanned page has no embedded text or
            # tables, so checking first would reject it without ever trying OCR.
            if len(text.strip()) < 50:
                img = page.to_image(resolution=300)
                ocr = pytesseract.image_to_data(img.original, output_type=pytesseract.Output.DICT)
                scored = []
                for w, c in zip(ocr["text"], ocr["conf"]):
                    try:
                        conf = float(c)
                    except (TypeError, ValueError):
                        continue  # blank or malformed confidence cell
                    # Tesseract reports -1 for layout boxes that hold no text.
                    if conf >= 0 and w.strip():
                        scored.append((w, conf))
                avg_conf = sum(c for _, c in scored) / max(len(scored), 1)
                if avg_conf < self.ocr_threshold:
                    raise OCRError(f"Low OCR confidence ({avg_conf:.1f}%) on page {page_idx}")
                text = " ".join(w for w, c in scored if c > self.ocr_threshold)

            if not text.strip() and not tables:
                raise FormatDriftError(f"Empty/unparseable page {page_idx} for {site_id}")

            return {"page_idx": page_idx, "text": text, "tables": tables, "site_id": site_id}

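    def _count_pages_sync(self, pdf_path: str) -> int:
        """Read the page count in a worker thread; pdfplumber.open blocks."""
        with pdfplumber.open(pdf_path) as pdf:
            return len(pdf.pages)

    async def process_report(self, report: StructuralReport) -> StructuralReport:
        async with self.semaphore:
            report.audit_hash = await self.compute_audit_hash(report.file_path)
            try:
                total_pages = await asyncio.to_thread(
                    self._count_pages_sync, str(report.file_path)
                )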
                for i in range(total_pages):
                    try:
                        data = await asyncio.to_thread(
                            self._extract_page_sync, str(report.file_path), i, report.site_id
                        )
                        report.extracted_data[f"page_{i + 1}"] = data
                    except (FormatDriftError, OCRError) as e:
                        report.error_log.append(str(e))
                        report.compliance_status = ComplianceStatus.FLAGGED

                full_text = " ".join(d.get("text", "") for d in report.extracted_data.values()).upper()
                if report.compliance_status is not ComplianceStatus.FLAGGED and any(
                    clause in full_text for clause in self.PASS_CLAUSES
                ):
                    report.compliance_status = ComplianceStatus.VALID
                elif report.compliance_status is not ComplianceStatus.FLAGGED:
                    report.compliance_status = ComplianceStatus.REQUIRES_REVIEW
            finally:
                gc.collect()
                logger.info(
                    "report_processed site=%s lease=%s asr=%s status=%s hash=%s",
                    report.site_id, report.lease_id, report.asr_number,
                    report.compliance_status.value, report.audit_hash[:12],
                )
            return report

    async def run_batch(self, reports: List[StructuralReport]) -> List[StructuralReport]:
        results = await asyncio.gather(
            *(self.process_report(r) for r in reports), return_exceptions=True
        )
        processed: List[StructuralReport] = []
        # gather preserves submission order, so each result lines up with its report.
        for report, result in zip(reports, results):
            if isinstance(result, Exception):
                logger.error("batch_task_failed site=%s error=%s", report.site_id, result)
            else:
                processed.append(result)
        return processed

if __name__ == "__main__":
    batch = [
        StructuralReport("TWR-8842", "LSE-2026-0417", "1290342", Path("twr8842_audit.pdf")),
        StructuralReport("TWR-9017", "LSE-2026-0388", "1288104", Path("twr9017_variance.pdf")),
    ]
    processor = AsyncReportProcessor(max_concurrency=4, ocr_confidence_threshold=75)
    done = asyncio.run(processor.run_batch(batch))
    for rep in done:
        print(f"{rep.site_id}: {rep.compliance_status.value} | hash={rep.audit_hash[:12]}")

Verification & Expected Output

With two valid PDFs present, the processor emits one structured audit line per report and a compact status summary. Successful console output looks like this:

text

2026-03-05 02:14:07 | INFO | report_processed site=TWR-9017 lease=LSE-2026-0388 asr=1288104 status=requires_review hash=9f2a1c04b7e3
2026-03-05 02:14:08 | INFO | report_processed site=TWR-8842 lease=LSE-2026-0417 asr=1290342 status=valid hash=41c8de77a0b9
TWR-8842: valid | hash=41c8de77a0b9
TWR-9017: requires_review | hash=9f2a1c04b7e3

Note that the audit lines arrive out of submission order — that is the concurrency working, with the faster file finishing first. To assert behaviour in a test, check the invariants rather than timing: assert all(r.audit_hash for r in done) confirms every report was hashed, and assert done[0].compliance_status in ComplianceStatus confirms classification ran. A failure on a single file surfaces as a batch_task_failed error line while the sibling reports still complete; if run_batch returns fewer items than it was given, inspect the logged exception rather than assuming the whole batch died.

Gotchas & Edge Cases

Tesseract’s -1 confidence sentinel. image_to_data returns -1 for layout boxes that contain no text. Averaging those raw values drags the mean far below the real recognition quality and triggers spurious OCRErrors on perfectly good scans. Filter conf == -1 (and blank tokens) before computing avg_conf, exactly as the example does.
Unicode in legacy lease PDFs. Older municipal forms embed non-breaking spaces, ligatures, and degree signs that break naïve keyword matching: "STRUCTURAL PASS" typed with an ordinary space will not match "STRUCTURAL\u00a0PASS", where U+00A0 is a non-breaking space that renders identically. Normalise with unicodedata.normalize("NFKC", text) and collapse whitespace before scanning for compliance clauses, or a clean report gets misclassified as REQUIRES_REVIEW.
FCC ASR number format variation. ASR registrations appear as bare integers (1290342), zero-padded strings, or prefixed (ASR-1290342) depending on the issuing template. Store the digits-only canonical form and validate against the 7-digit range; matching on the raw string will scatter one tower across several audit records.
Reopening the PDF for every page. The example opens the file once to count pages and once more per page inside the worker thread, so an N-page report is opened N + 1 times. On network-mounted volumes that multiplies open latency by the page count; for large portfolios, cache the page count from a single metadata pass or keep a page handle pool if your storage is slow.
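Two of the gotchas above, Unicode normalisation and ASR canonicalisation, reduce to small pure helpers. The sketch below is one possible shape, not the pipeline's definitive implementation: `normalise_text` and `canonical_asr` are hypothetical names, and stripping leading zeros assumes that zero-padding is the only reason a registration would start with 0.

```python
import re
import unicodedata


def normalise_text(text: str) -> str:
    """NFKC-fold compatibility characters, then collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def canonical_asr(raw: str) -> str:
    """Return the digits-only ASR number, or raise if it is not 7 digits."""
    # Assumption: leading zeros are padding, never significant digits.
    digits = re.sub(r"\D", "", raw).lstrip("0")
    if len(digits) != 7:
        raise ValueError(f"unexpected ASR number format: {raw!r}")
    return digits


if __name__ == "__main__":
    # A non-breaking space (U+00A0) between the words, plus a trailing newline.
    page_text = "Result: STRUCTURAL\u00a0PASS\n"
    print("STRUCTURAL PASS" in page_text)                  # raw text
    print("STRUCTURAL PASS" in normalise_text(page_text))  # after normalisation
    for raw in ("1290342", "0001290342", "ASR-1290342"):
        print(canonical_asr(raw))
```

Running the normaliser once on the assembled text, before any clause matching, keeps the matching logic itself free of special cases.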

FAQ

How do I size max_concurrency for a multi-site batch?

Size it to the narrowest downstream bottleneck, not the CPU count. The most common ceiling is the compliance store’s connection pool — if it accepts 8 concurrent writers, set max_concurrency to 8 so the semaphore never lets more reports finish and flush than the database can absorb. Because each worker streams pages one at a time, memory scales with the worker count rather than the batch size, so raising concurrency trades RAM for throughput predictably.

Why run pdfplumber in a thread instead of awaiting it directly?

pdfplumber is synchronous and CPU-bound, so calling it directly on the event loop would block every other site’s I/O until the page finished parsing. asyncio.to_thread offloads that work to a thread pool, letting the loop keep hashing, queuing, and logging other reports concurrently. The extraction itself is unchanged; only its scheduling moves off the critical path.

What happens to a single corrupt report inside a large batch?

Nothing spreads. asyncio.gather(..., return_exceptions=True) captures the exception as a return value instead of cancelling its siblings, so run_batch logs a batch_task_failed line for the bad file and still returns every report that parsed cleanly. Per-page FormatDriftError and OCRError are caught even more locally — they flag just that page’s report as FLAGGED and route it for review while the rest of the document continues.

Related

Up to the parent architecture: Async Batch Processing Pipelines
Sibling task: Extract Bolt Torque Data from PDF Reports with pdfplumber
The parent architecture: Automated Structural Report Parsing & Document Ingestion
