Async Batch Processing Pipelines

Telecom infrastructure operations generate thousands of structural inspection reports, lease amendments, and municipal compliance filings every month, and processing them one at a time is where compliance programs quietly fall behind. Sequential ingestion stretches lease-renewal cycles, delays FAA obstruction-marking updates, and blows through the narrow windows municipal zoning audits allow. An async batch processing pipeline removes that bottleneck by decoupling document intake from validation, extraction, and compliance routing so that hundreds of files move through the system concurrently without any single slow scan stalling the queue. This page is the concurrency layer of the broader Automated Structural Report Parsing & Document Ingestion architecture: the engines described elsewhere do the reading, and the pipeline described here decides what runs when, how memory stays bounded, and how every result lands in a tamper-evident audit trail.

The Core Challenge

The failure mode that async design exists to prevent is deterministic, and every operator has seen it. A regional portfolio uploads a nightly batch: 400 structural audits from an engineering firm, 60 scanned municipal variance requests, and a handful of multi-hundred-megabyte photogrammetry PDFs from a recent tower climb. A synchronous script opens each file, reads it fully into RAM, runs extraction, writes the result, and moves on. The photogrammetry files alone exhaust the heap, the interpreter triggers long garbage-collection pauses, and by the time the process reaches the scanned variances the municipal submission window has closed. Worse, when one corrupt PDF raises inside the loop, the entire batch aborts and no downstream record — not even the 380 files that parsed cleanly — reaches the compliance database.

Concurrency without bounds is not the fix; it is a different failure. Spawning one task per file across a 500-tower portfolio opens thousands of file handles at once, exhausts the connection pool to the compliance store, and reproduces the out-of-memory crash faster. The engineering requirement is bounded concurrency: a fixed pool of workers draining a size-capped queue, streaming each document in chunks, isolating per-document errors so one bad scan never poisons the batch, and applying backpressure so producers slow down when consumers fall behind. That combination is what turns a fragile overnight job into an SLA-bound, continuously running ingestion service.

Data Model & Schema

Every unit of work in the pipeline is a DocumentJob — a small, strongly typed record that carries a file through classification, extraction, and audit without the workers needing to know anything about the file’s origin. Keeping the job schema explicit is what makes the queue observable: a stalled or quarantined job can be inspected, replayed, or escalated because its full state travels with it.

Field	Type	Constraint	Purpose
`job_id`	`str`	UUID4	Idempotency key; dedupes re-submitted files
`file_path`	`Path`	must exist, readable	Source document on disk or mounted volume
`site_id`	`str`	pattern `TWR-\d{4}`	Ties the record to an antenna structure
`doc_type`	`str`	one of `pdf`, `image`, `legacy_text` (`unknown` until classified)	Set by the classifier, drives engine routing
`status`	`str`	`queued`→`processing`→`success`/`failed`/`quarantined`	Lifecycle state for observability
`metadata`	`dict`	JSON-serialisable	Engine, confidence, extracted field counts
`audit_hash`	`str`	SHA-256 hex, set on completion	Tamper-evident lineage of the extracted record

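Represented as a dataclass, the job carries its full state with it. The dataclass is not frozen, so the split between audit-relevant fields and worker-advanced fields is a convention rather than something Python enforces: workers are expected to write only doc_type, status, metadata, and audit_hash, and to leave file_path, site_id, and job_id untouched once the job is enqueued: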

python

from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class DocumentJob:
    file_path: Path
    site_id: str                       # e.g. "TWR-4417"
    job_id: str = ""                   # UUID4, assigned at enqueue
    doc_type: str = "unknown"          # pdf | image | legacy_text
    status: str = "queued"             # queued|processing|success|failed|quarantined
    metadata: dict = field(default_factory=dict)
    audit_hash: str = ""
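
The constraints in the table are not enforced by the dataclass itself. One way to check them at enqueue time is sketched below. The validate_job helper, SITE_ID_PATTERN, and ALLOWED_DOC_TYPES are hypothetical names introduced for illustration, and the file-existence check from the table is left out so the sketch runs without real documents.

```python
import re
import uuid
from dataclasses import dataclass, field
from pathlib import Path

SITE_ID_PATTERN = re.compile(r"TWR-\d{4}")  # constraint from the schema table
# "unknown" is the pre-classification default, so it is accepted here (assumption).
ALLOWED_DOC_TYPES = {"unknown", "pdf", "image", "legacy_text"}


@dataclass
class DocumentJob:
    file_path: Path
    site_id: str
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    doc_type: str = "unknown"
    status: str = "queued"
    metadata: dict = field(default_factory=dict)
    audit_hash: str = ""


def validate_job(job: DocumentJob) -> list[str]:
    """Return the constraint violations; an empty list means the job is well-formed."""
    problems = []
    if not SITE_ID_PATTERN.fullmatch(job.site_id):
        problems.append(f"site_id {job.site_id!r} does not match TWR-NNNN")
    if job.doc_type not in ALLOWED_DOC_TYPES:
        problems.append(f"doc_type {job.doc_type!r} is not routable")
    try:
        if uuid.UUID(job.job_id).version != 4:
            problems.append("job_id is not a UUID4")
    except ValueError:
        problems.append(f"job_id {job.job_id!r} is not a UUID")
    return problems
```

Returning a list rather than raising on the first problem lets a producer log every violation for a rejected file in a single audit line.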

Algorithmic or Architectural Approach

The architecture is a producer/consumer graph with three deliberate constraints: the queue is bounded, the worker pool is fixed, and classification happens before extraction so the pipeline never loads an engine it does not need. A producer enqueues DocumentJob records as fast as files arrive, but because the asyncio.Queue has a maxsize, queue.put() suspends the producer the moment consumers fall behind — that suspension is the backpressure signal, propagated for free by the runtime rather than bolted on as a rate limiter.
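
The backpressure behaviour can be observed in isolation. The sketch below is not part of the pipeline module: demo_backpressure is a hypothetical helper that runs a fast producer against one deliberately slow consumer and reports the deepest the queue ever got.

```python
import asyncio


async def demo_backpressure(total_jobs: int = 20, maxsize: int = 3) -> int:
    """Run a fast producer against one slow consumer; return the deepest queue seen."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    deepest = 0

    async def consumer() -> None:
        while True:
            await queue.get()
            try:
                await asyncio.sleep(0.001)   # stand-in for slow extraction
            finally:
                queue.task_done()

    worker = asyncio.create_task(consumer())
    for n in range(total_jobs):
        await queue.put(n)                   # suspends here whenever the queue is full
        deepest = max(deepest, queue.qsize())
    await queue.join()                       # wait until every item is task_done()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return deepest
```

Calling asyncio.run(demo_backpressure()) returns a depth that never exceeds maxsize, however many jobs the producer tries to enqueue: the producer is paced by the consumer rather than by a separate rate limiter.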

Each worker is a long-lived coroutine that pulls a job, streams only the file header to classify it, and routes accordingly. Modern digital audits carry a native text layer and go straight to the table-and-coordinate parsing described in PDFplumber Extraction Workflows. Scanned variances and rasterised legacy permits carry no text layer, so they divert to the optical pipeline in OCR for Legacy Inspection Forms. The routing decision is made from the first 512 bytes, so the pipeline spends OCR compute only where it is genuinely required and never rasterises a document that already has selectable text.

Figure: bounded async queue feeding a worker pool with format-aware routing.

Memory stays flat because documents are never read whole. _stream_file is an async generator that yields fixed-size chunks and releases the file handle when it is exhausted or closed, so a 400 MB photogrammetry PDF occupies a 16 KB window rather than the full payload. Pairing that streaming reader with a bounded worker pool and a size-capped queue gives a system whose peak read-buffer memory is a function of max_workers × read_chunk, not of the largest file in the batch. That property lets the same pipeline run on commodity infrastructure across every regional data centre.
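
As a worked example of that bound, the arithmetic below uses the implementation's defaults (max_workers=8, read_chunk=16384, queue maxsize of three slots per worker). The two helper functions are hypothetical, and the figure covers read buffers only: each engine's own working set comes on top and usually dominates.

```python
def read_buffer_bound(max_workers: int, read_chunk: int) -> int:
    """Upper bound in bytes on chunk buffers held at once: one chunk per busy worker."""
    return max_workers * read_chunk


def queue_capacity(max_workers: int, factor: int = 3) -> int:
    """Queue slots when maxsize = max_workers * factor; each slot holds a small job record."""
    return max_workers * factor


# 8 workers x 16384 bytes = 131072 bytes (128 KiB) of read buffers,
# with at most 24 queued jobs waiting, whatever the size of the files.
print(read_buffer_bound(8, 16384), queue_capacity(8))
```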

Validation & Compliance Gates

A document that parses is not the same as a document that is compliant, and the pipeline enforces two gates before any record is trusted downstream. The first is format-drift detection. Zoning boards revise permit templates, engineering firms shift table coordinates between report versions, and a layout the parser expected last quarter silently disappears. The pipeline fingerprints each document (page geometry, embedded font dictionaries, and table bounding-box distribution) and compares it against the stored fingerprint of the known template. When a document deviates beyond tolerance, the worker sets status = "quarantined" rather than emitting a half-parsed record, routes it to a manual-review queue, and alerts the responsible lease manager so the template can be reconciled before a regulatory deadline expires. Crucially, quarantine is per-document: the other 399 files in the batch keep flowing.
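
A tolerance-based comparison needs numeric features rather than a single exact-match hash. The sketch below shows one possible shape for it. The Fingerprint fields, the scoring formula, and the 0.05 default tolerance are all illustrative assumptions, not the pipeline's actual fingerprinting scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    page_width: float                 # points
    page_height: float                # points
    fonts: frozenset[str]             # names from the embedded font dictionaries
    table_tops: tuple[float, ...]     # y-coordinate of each table's top edge, in points


def drift_score(doc: Fingerprint, template: Fingerprint) -> float:
    """0.0 means identical layout; larger means more drift. The scale is illustrative."""
    geometry = (abs(doc.page_width - template.page_width) / template.page_width
                + abs(doc.page_height - template.page_height) / template.page_height)
    union = doc.fonts | template.fonts
    font_gap = (1 - len(doc.fonts & template.fonts) / len(union)) if union else 0.0
    if len(doc.table_tops) != len(template.table_tops):
        table_gap = 1.0               # a table appeared or vanished
    else:
        table_gap = max((abs(a - b) / template.page_height
                         for a, b in zip(doc.table_tops, template.table_tops)),
                        default=0.0)
    return geometry + font_gap + table_gap


def has_drifted(doc: Fingerprint, template: Fingerprint, tolerance: float = 0.05) -> bool:
    """True when the document should be quarantined under this tolerance."""
    return drift_score(doc, template) > tolerance
```

Keeping tolerance a parameter rather than a constant leaves room for a different threshold per source or jurisdiction.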

The second gate is audit sealing. The FCC mandates precise record retention for antenna-structure registrations, and municipal authorities require timestamped submission trails. Before a successful job leaves the pipeline, its canonical extracted record is serialised with sorted keys and hashed with hashlib.sha256; that digest is written into both the job and the structured audit log. During a regulatory inspection an operator can then prove that the record presented today is byte-for-byte the record the pipeline produced on ingestion day — lineage that manual reconciliation can never demonstrate. The same canonical-hash discipline underpins the compliance evaluation in the Zoning Rule Engine Design, so records sealed here remain verifiable when they cross into the downstream compliance store.
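
The inspection-day check is the same hash recomputed over the stored record and compared with the ingestion-day digest. A minimal sketch, assuming the stored record is available as a plain dict; canonical_digest and verify_seal are hypothetical helper names.

```python
import hashlib
import hmac
import json


def canonical_digest(record: dict) -> str:
    """SHA-256 over a sorted-key, whitespace-free JSON serialisation of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_seal(stored_record: dict, ingestion_digest: str) -> bool:
    """True when the stored record still hashes to the digest written at ingestion."""
    return hmac.compare_digest(canonical_digest(stored_record), ingestion_digest)
```

Because the keys are sorted before hashing, the order in which the compliance store returns fields does not affect the result; any change to a value does.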

Integration Points

This pipeline is an orchestration layer, so its value is entirely in how it connects the extraction engines to each other and to the compliance platform. Upstream, it receives raw files from the ingestion intake of the parent Automated Structural Report Parsing & Document Ingestion architecture. Sideways, it invokes two sibling engines: PDFplumber Extraction Workflows for digital reports and OCR for Legacy Inspection Forms for scanned and handwritten material. The pipeline itself stays engine-agnostic — it owns concurrency, backpressure, and audit, while each engine owns the physics of reading its document class.

Downstream, the sealed records feed the compliance and scheduling platforms that decide what happens to each tower. For operators running large geographies, the concurrency model extends to worker isolation per zone in Async batch processing for multi-site structural reports, which keeps a localised network outage in one region from cascading into another. And for a concrete example of the engine work the pipeline schedules, the task walk-through in Extract Bolt Torque Data from PDF Reports with pdfplumber shows exactly what a single pdf-routed job produces once a worker hands it to the extractor.

Python Implementation

The module below is a runnable skeleton of the pipeline. It enforces bounded concurrency through a size-capped asyncio.Queue, reads files in chunks, routes by document class, isolates per-document errors behind a custom exception hierarchy, and seals each successful record with a SHA-256 audit hash. The extraction step is stubbed: _execute_extraction records placeholder metadata where the calls into the pdfplumber and OCR engines would go. All identifiers use realistic telecom site codes, and every terminal state emits a structured, compliance-grade log line.

python

# pip install aiofiles
import asyncio
import hashlib
import json
import logging
import uuid
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import AsyncIterator

import aiofiles

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[logging.FileHandler("compliance_audit.log"), logging.StreamHandler()],
)
logger = logging.getLogger("telecom_async_pipeline")


# --- Error categorisation ---------------------------------------------------
class PipelineError(Exception):
    """Base exception for async ingestion pipeline failures."""


class DocumentReadError(PipelineError):
    """Raised when a source document cannot be streamed from disk."""


class FormatDriftError(PipelineError):
    """Raised when a document deviates from known templates and must quarantine."""


# --- Job schema -------------------------------------------------------------
@dataclass
class DocumentJob:
    file_path: Path
    site_id: str
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    doc_type: str = "unknown"
    status: str = "queued"
    metadata: dict = field(default_factory=dict)
    audit_hash: str = ""


class AsyncBatchPipeline:
    def __init__(self, max_workers: int = 8, read_chunk: int = 16384):
        self.max_workers = max_workers
        self.read_chunk = read_chunk
        # Bounded queue: put() suspends producers when consumers fall behind.
        self.queue: asyncio.Queue[DocumentJob] = asyncio.Queue(maxsize=max_workers * 3)

    async def _stream_file(self, path: Path) -> AsyncIterator[bytes]:
        """Yield the file in bounded chunks; never hold the whole payload in RAM."""
        try:
            async with aiofiles.open(path, "rb") as f:
                while chunk := await f.read(self.read_chunk):
                    yield chunk
        except OSError as exc:
            raise DocumentReadError(f"cannot stream {path}: {exc}") from exc

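    async def _classify_document(self, job: DocumentJob) -> str:
        """Route from the first 512 bytes so OCR runs only when required."""
        header = b""
        stream = self._stream_file(job.file_path)
        try:
            async for chunk in stream:
                header += chunk
                if len(header) >= 512:
                    break
        finally:
            # Close the generator here so the file handle is released now,
            # not whenever the abandoned generator happens to be finalised.
            await stream.aclose()
        header = header[:512]  # read_chunk can exceed 512; sniff only the header
        if not header:
            raise FormatDriftError(f"{job.file_path.name} has no readable header")
        if b"%PDF" in header:
            return "pdf"
        # TIFF (little- and big-endian) and PNG signatures sit at offset 0.
        if header.startswith((b"II*\x00", b"MM\x00*", b"\x89PNG")):
            return "image"
        return "legacy_text"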

    async def _execute_extraction(self, job: DocumentJob) -> None:
        """Deterministic routing to the engine that owns this document class."""
        # Stub: the metadata values below are fixed stand-ins for what the real
        # engine call would return; no parsing or OCR happens in this module.
        if job.doc_type == "pdf":
            job.metadata.update(engine="pdfplumber", tables_extracted=4)
        elif job.doc_type in ("image", "legacy_text"):
            job.metadata.update(engine="tesseract", confidence_score=96.1)
        else:
            raise FormatDriftError(f"unroutable doc_type '{job.doc_type}'")

    @staticmethod
    def _seal(job: DocumentJob) -> str:
        """SHA-256 over the canonical record for tamper-evident lineage."""
        canonical = json.dumps(
            {"job_id": job.job_id, "site_id": job.site_id,
             "doc_type": job.doc_type, "metadata": job.metadata},
            sort_keys=True, separators=(",", ":"),
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    async def _process_job(self, job: DocumentJob) -> None:
        job.status = "processing"
        try:
            job.doc_type = await self._classify_document(job)
            await self._execute_extraction(job)
            job.audit_hash = self._seal(job)
            job.status = "success"
            logger.info("AUDIT | %s | %s | success | %s | %s",
                        job.site_id, job.file_path.name, job.audit_hash[:12], job.metadata)
        except FormatDriftError as exc:
            job.status = "quarantined"
            logger.warning("AUDIT | %s | %s | quarantined | %s",
                           job.site_id, job.file_path.name, exc)
        except PipelineError as exc:
            job.status = "failed"
            logger.error("AUDIT | %s | %s | failed | %s",
                         job.site_id, job.file_path.name, exc)
        except Exception:
            # Anything unexpected still fails only this job. Letting it escape
            # would kill the worker task and shrink the pool for the rest of the batch.
            job.status = "failed"
            logger.exception("AUDIT | %s | %s | failed | unexpected error",
                             job.site_id, job.file_path.name)

    async def _worker(self) -> None:
        """Long-lived consumer draining the bounded queue until cancelled."""
        while True:
            job = await self.queue.get()
            try:
                await self._process_job(job)
            finally:
                self.queue.task_done()

    async def run(self, file_paths: list[Path], site_id: str) -> None:
        workers = [asyncio.create_task(self._worker()) for _ in range(self.max_workers)]
        for fp in file_paths:
            await self.queue.put(DocumentJob(file_path=fp, site_id=site_id))
        await self.queue.join()          # block until every job is task_done()
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    batch = [Path("audit_TWR-4417.pdf"), Path("variance_TWR-4417.tiff")]
    asyncio.run(AsyncBatchPipeline(max_workers=4).run(batch, site_id="TWR-4417"))

Testing & Verification

Concurrency bugs hide until load reveals them, so the pipeline is verified with fast, deterministic assertions rather than live batches. Three properties matter most: the queue actually bounds concurrency, drift genuinely quarantines instead of failing, and the audit hash is stable for identical inputs. The stubs below use pytest with pytest-asyncio and temporary files; they cover routing, sealing, and quarantine, and leave the concurrency bound to be asserted separately:

python

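import asyncio
import pytest
from pathlib import Path

# Assumed module name; import from wherever the implementation above is saved.
from pipeline import AsyncBatchPipeline, DocumentJob

@pytest.mark.asyncio
async def test_pdf_routes_and_seals(tmp_path: Path):
    p = tmp_path / "audit_TWR-8842.pdf"
    p.write_bytes(b"%PDF-1.7\n" + b"0" * 600)
    pipe = AsyncBatchPipeline(max_workers=2)
    # run() must drain the queue and return; the timeout turns a hang into a failure.
    await asyncio.wait_for(pipe.run([p], site_id="TWR-8842"), timeout=5)
    # run() does not hand jobs back, so assert on a job processed directly.
    job = DocumentJob(file_path=p, site_id="TWR-8842")
    await pipe._process_job(job)
    # A clean PDF must classify as pdf, succeed, and carry a 64-hex digest.
    assert job.doc_type == "pdf"
    assert job.status == "success"
    assert len(job.audit_hash) == 64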

def test_seal_is_deterministic():
    job = DocumentJob(file_path=Path("x.pdf"), site_id="TWR-8842",
                      job_id="fixed", doc_type="pdf", metadata={"tables_extracted": 4})
    assert AsyncBatchPipeline._seal(job) == AsyncBatchPipeline._seal(job)
    assert len(AsyncBatchPipeline._seal(job)) == 64

@pytest.mark.asyncio
async def test_empty_file_quarantines(tmp_path: Path):
    p = tmp_path / "empty_TWR-8842.pdf"
    p.write_bytes(b"")
    pipe = AsyncBatchPipeline(max_workers=1)
    job = DocumentJob(file_path=p, site_id="TWR-8842")
    await pipe._process_job(job)
    assert job.status == "quarantined"   # no header -> FormatDriftError -> quarantine

Expected output is one audit line per document (success where the file routes cleanly, quarantined where a gate trips) and a queue that drains to zero without a hanging queue.join():

text

2026-07-03 09:14:02 | INFO | telecom_async_pipeline | AUDIT | TWR-4417 | audit_TWR-4417.pdf | success | 9f2c1a7b4e08 | {'engine': 'pdfplumber', 'tables_extracted': 4}
2026-07-03 09:14:02 | WARNING | telecom_async_pipeline | AUDIT | TWR-4417 | variance_TWR-4417.tiff | quarantined | ...

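A failing pipeline shows one of two signatures. A process that never returns from run() means jobs were left unfinished: either task_done() was skipped, or the workers died on an exception that _process_job does not catch, so nothing is left to drain the queue and queue.join() waits forever. A success line whose digest does not match a fresh _seal of the stored record means the record changed after sealing, typically because _seal ran before metadata was fully populated; _seal always returns a 64-character digest, so the symptom is a mismatch, not an empty hash. The deterministic-hash test guards the second case, and the first is caught by running run() under a timeout such as asyncio.wait_for in a test.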

Operational Considerations

Tuning max_workers is the single most consequential operational decision. Extraction is CPU-bound for OCR and I/O-bound for digital PDFs, so a pool sized for one profile starves the other. In practice, operators run two pools, a larger I/O pool for pdf jobs and a smaller pool pinned to available vCPU for OCR, because a single pool sized for OCR leaves digital throughput on the table, while one sized for I/O triggers OOM kills on OCR batches. CPU-bound OCR also has to run off the event loop, for example in a process pool via loop.run_in_executor, or one long OCR call stalls every other coroutine. The bounded queue makes the sizing safe to tune incrementally: the queue cap is fixed when the pipeline is constructed, so raise it and the worker count between runs and watch peak memory, which should track max_workers × read_chunk plus each engine’s working set.
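
A two-pool layout can be sketched with one bounded queue and one worker group per profile. The code below is an illustration under assumptions, not the pipeline's implementation: run_two_pools, the pool names, and the default sizes are hypothetical, and asyncio.sleep stands in for the engine call. A real OCR call is CPU-bound and would need to go through an executor rather than be awaited inline.

```python
import asyncio
import os
from collections import Counter


async def run_two_pools(jobs: list[tuple[str, str]], io_workers: int = 16,
                        ocr_workers: int = os.cpu_count() or 2) -> Counter:
    """Route (name, doc_type) pairs to a per-profile pool; return peak concurrency per pool."""
    queues = {"pdf": asyncio.Queue(maxsize=io_workers * 3),
              "ocr": asyncio.Queue(maxsize=ocr_workers * 3)}
    active, peak = Counter(), Counter()

    async def worker(pool: str) -> None:
        while True:
            await queues[pool].get()
            active[pool] += 1
            peak[pool] = max(peak[pool], active[pool])
            try:
                await asyncio.sleep(0.001)   # stand-in for the engine call
            finally:
                active[pool] -= 1
                queues[pool].task_done()

    tasks = [asyncio.create_task(worker("pdf")) for _ in range(io_workers)]
    tasks += [asyncio.create_task(worker("ocr")) for _ in range(ocr_workers)]
    for name, doc_type in jobs:
        pool = "pdf" if doc_type == "pdf" else "ocr"
        await queues[pool].put(name)         # backpressure applies per pool
    for q in queues.values():
        await q.join()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return peak
```

The returned peaks never exceed the configured worker counts, which is the property to watch while tuning. One trade-off of a single producer feeding both queues is head-of-line blocking: when the OCR queue is full, pdf jobs behind it in the batch wait too, so a production layout might use one producer per pool.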

Three telecom-specific edge cases recur. Field devices go offline mid-upload, delivering truncated PDFs whose header reads valid but whose body is incomplete. A truncated file still reads cleanly to end-of-file, so _stream_file raises DocumentReadError only when the read itself fails; the incomplete body has to be caught by the extraction engine, which should raise a PipelineError so the job fails cleanly for re-submission instead of producing a corrupt record. Multi-jurisdiction batches mix templates from dozens of municipalities, so drift tolerance must be per-source, not global: a threshold tight enough for one county’s permit will quarantine another county’s legitimate layout. Duplicate submissions are routine when a firm re-sends a batch after a network hiccup. The audit hash is deterministic over the canonical record, which includes job_id, so downstream de-duplication is a hash comparison rather than a fuzzy content diff only when the re-sent file is enqueued under its original job_id; a fresh UUID4 per submission produces a different digest for identical content. Finally, keep the audit log append-only and rotate it by size, never by deletion, because the retention window the FCC expects for antenna-structure records outlives most log-rotation defaults.
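
The seal covers job_id, so two submissions of the same file produce the same audit hash only when they share a job_id. One way to arrange that is to look up the original job_id by file content before enqueueing. The sketch below assumes an in-memory index; JobIdRegistry and content_digest are hypothetical names, and a production system would persist the index rather than hold it in a dict.

```python
import hashlib
import uuid
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 16384) -> str:
    """SHA-256 of the file's bytes, read in chunks so large files stay out of RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


class JobIdRegistry:
    """Hypothetical in-memory index: (site_id, content digest) -> job_id."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, str], str] = {}

    def job_id_for(self, path: Path, site_id: str) -> tuple[str, bool]:
        """Return (job_id, is_duplicate); a re-sent file gets its original UUID4 back."""
        key = (site_id, content_digest(path))
        if key in self._ids:
            return self._ids[key], True
        self._ids[key] = str(uuid.uuid4())
        return self._ids[key], False
```

A producer would call job_id_for before building each DocumentJob and pass the result as job_id, either skipping files flagged as duplicates or re-running them under the original identifier.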

FAQ

How large a document batch can one pipeline handle?

Throughput is bounded by workers, not by batch size, so there is no hard ceiling on the number of files — the size-capped queue simply applies backpressure to the producer when workers fall behind. Because every document is streamed in fixed chunks and its handle released immediately, peak memory stays flat regardless of whether the batch is 50 files or 50,000. The practical limit is the compliance store’s write throughput, which you protect by sizing max_workers to its connection pool.

What happens when a document's format has drifted from the known template?

The worker raises FormatDriftError, sets the job’s status to quarantined, and routes it to a manual-review queue with an alert to the responsible lease manager — it never emits a partially parsed record. Quarantine is per-document, so the rest of the batch keeps flowing. Once the template is reconciled, the quarantined job is re-submitted with the same job_id, which the deterministic audit hash uses to dedupe against any earlier attempt.

How does backpressure prevent the pipeline from exhausting memory?

The asyncio.Queue is created with a maxsize, so queue.put() suspends the producer coroutine the moment the queue is full. That suspension is the backpressure signal — the runtime propagates it automatically, with no external rate limiter. Combined with streaming reads, it caps peak memory at roughly max_workers × read_chunk plus each engine’s working set, which is why the same pipeline runs unchanged on commodity hardware across regional data centres.

Why hash every extracted record instead of just logging it?

A plain log proves a record existed; a SHA-256 hash over the canonical (sorted-key) serialisation proves the record has not changed since ingestion. During an FCC or municipal audit that lineage is decisive — the operator regenerates the hash from today’s stored record and shows it matches the digest written on ingestion day. The same canonical-hash discipline lets the downstream compliance engine trust records without re-parsing them.

Related

Up to the parent architecture: Automated Structural Report Parsing & Document Ingestion
Sibling engine: PDFplumber Extraction Workflows
Sibling engine: OCR for Legacy Inspection Forms
Scaling this pipeline: Async batch processing for multi-site structural reports
A routed extraction task: Extract Bolt Torque Data from PDF Reports with pdfplumber
