Extracting Bolt Torque Data from PDF Inspection Reports with pdfplumber

You have a directory of structural inspection PDFs from a tower climb, and buried in each one is the data that actually matters for lease compliance: the measured torque on every flange, guy-wire anchor, and base-plate bolt. The reports are machine-generated but wildly inconsistent — one vendor lays the values out in a clean specification matrix, the next embeds them in narrative prose, and a third rotates the whole page 90 degrees. Copying those numbers by hand across a portfolio is where torque verification quietly falls apart. This page is the hands-on build guide for one extraction task inside the coordinate-aware PDFplumber Extraction Workflows: a small, runnable module that confines parsing to the torque zone, reads the table when there is one, falls back to a spatial word scan when there is not, rejects any value outside engineering safety bounds, and seals every extracted record with a tamper-evident audit hash. By the end you will have a script you can point at a real report and trust in front of an auditor.

Prerequisites & Context

Before running the code below, have the following in place:

Python 3.10+ and pdfplumber. Install with pip install pdfplumber. The library, built on pdfminer.six, exposes per-character x0/x1/top coordinates and table bounding boxes, which is the precision this task depends on.
A native text layer. This workflow assumes the report carries selectable text. If the file is a 300-DPI scan of a legacy climb sheet with no text layer, pdfplumber returns empty strings and you must route it to the optical pipeline in OCR for Legacy Inspection Forms first — the “Gotchas” section shows how to detect that case.
A canonical site identifier per file. Key every report to an antenna structure — a TWR-#### site ID and, where available, its FCC Antenna Structure Registration (ASR) number — so extracted torque values cross-reference against the correct tower downstream.
A torque acceptance range. Know the engineering bounds for the bolts you inspect (this example uses 0 < value <= 5000 ft-lb as a coarse sanity gate). Values outside that band are almost always OCR noise or a misread column, not a real reading.
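
The site-key prerequisite can be enforced before any parsing starts. The following is a minimal sketch, assuming that TWR-#### means the literal prefix TWR- followed by exactly four digits and that vendors embed the ID in the filename; the helper name site_id_from_filename and the filename convention are hypothetical, not part of the original workflow.

```python
import re
from pathlib import Path

# Assumption: a site ID is "TWR-" plus exactly four digits, and the filename
# carries it as "twr8842", "TWR-8842", or "twr_8842".
_SITE_RE = re.compile(r"twr[-_]?(\d{4})(?!\d)", re.IGNORECASE)


def site_id_from_filename(pdf_path: Path) -> str:
    """Return the canonical TWR-#### key, or raise if the file cannot be keyed."""
    match = _SITE_RE.search(pdf_path.stem)
    if match is None:
        raise ValueError(f"no TWR-#### site id in filename: {pdf_path.name}")
    return f"TWR-{match.group(1)}"


print(site_id_from_filename(Path("twr8842_flange_audit.pdf")))  # TWR-8842
```

Failing loudly on an unkeyable file keeps an orphaned report from being attributed to the wrong tower later in the pipeline.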

When you need to run this across hundreds of towers at once rather than one file at a time, the same extractor drops into the concurrency layer described in Async Batch Processing Pipelines; this page stays focused on getting a single document parsed correctly.

Step-by-Step Implementation

Each step maps to a specific compliance or reliability concern, not just a coding convenience.

Step 1 — Confine extraction to the torque zone. Rather than flatten the whole page to text, call page.crop((x0, top, x1, bottom)) to narrow parsing to the region where the torque matrix lives. Cropping first is what stops a Torque: 150 ft-lb string quoted in a narrative footnote from being mistaken for a certified reading in the specification block.
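
A fixed zone can extend past the edge of a page that is smaller or oriented differently than the template it was tuned on, and page.crop may raise rather than trim in that case (behaviour assumed from recent pdfplumber releases; verify against your installed version). The sketch below intersects the zone with the page box first. The helper name clamp_zone is hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom) in PDF points


def clamp_zone(zone: BBox, page_bbox: BBox) -> BBox:
    """Intersect a crop zone with the page box; raise if they do not overlap."""
    x0 = max(zone[0], page_bbox[0])
    top = max(zone[1], page_bbox[1])
    x1 = min(zone[2], page_bbox[2])
    bottom = min(zone[3], page_bbox[3])
    if x0 >= x1 or top >= bottom:
        raise ValueError(f"zone {zone} does not overlap page {page_bbox}")
    return (x0, top, x1, bottom)


# US Letter portrait is 612 x 792 points; the zone fits unchanged.
print(clamp_zone((50, 100, 550, 700), (0, 0, 612, 792)))  # (50, 100, 550, 700)
# On a 792 x 612 landscape page the bottom edge is trimmed to the page.
print(clamp_zone((50, 100, 550, 700), (0, 0, 792, 612)))  # (50, 100, 550, 612)
```

With pdfplumber this would be used as page.crop(clamp_zone(TORQUE_ZONE, page.bbox)). A trimmed zone is a hint that the template changed, so it is worth logging when the result differs from the input.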

Step 2 — Parse the table first. Call find_tables() on the cropped region and read each row with table.extract(). A well-formed specification matrix gives you the bolt identifier and its value in known columns, which is the highest-confidence path and the one to try before anything heuristic.
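
When column positions differ between vendors, the "known columns" can be located from the header row instead of hard-coded. This is a sketch under stated assumptions: the header labels containing "bolt" and "torque", the sample rows, and the helper names locate_columns and bolt_cells are all hypothetical. The input has the shape table.extract() returns, a list of rows whose empty cells are None.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]  # table.extract() yields None for empty cells


def locate_columns(header: Row) -> Tuple[int, int]:
    """Return (bolt_col, torque_col) by matching header text, case-insensitively."""
    names = [(cell or "").strip().lower() for cell in header]
    bolt_col = next((i for i, n in enumerate(names) if "bolt" in n), None)
    torque_col = next((i for i, n in enumerate(names) if "torque" in n), None)
    if bolt_col is None or torque_col is None:
        raise ValueError(f"header lacks bolt/torque columns: {header}")
    return bolt_col, torque_col


def bolt_cells(rows: List[Row]) -> List[Tuple[str, str]]:
    """Pair each bolt id with its raw torque cell, skipping blank or short rows."""
    bolt_col, torque_col = locate_columns(rows[0])
    pairs = []
    for row in rows[1:]:
        if len(row) <= max(bolt_col, torque_col):
            continue
        bolt, torque = row[bolt_col], row[torque_col]
        if bolt and torque:
            pairs.append((bolt.strip(), torque.strip()))
    return pairs


rows = [
    ["Bolt ID", "Spec", "Measured Torque"],   # made-up sample data
    ["FLANGE-1", "A325", "150 ft-lb"],
    [None, None, "(retest)"],                 # continuation row with no bolt id
    ["FLANGE-2", "A325", "148.5 ft-lb"],
]
print(bolt_cells(rows))
# [('FLANGE-1', '150 ft-lb'), ('FLANGE-2', '148.5 ft-lb')]
```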

Step 3 — Fall back to a coordinate-aware spatial scan. When find_tables() returns nothing, the values are laid out as free text. Pull extract_words(), find the bolt identifiers with a pattern, then look rightward for a numeric token whose x0 sits within a small tolerance of the identifier’s x1. Pairing by coordinate — not by reading order — is what survives multi-column layouts.
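
The pairing rule can be sketched on plain dictionaries shaped like the entries extract_words() returns (text, x0, x1, top). This variation is an assumption rather than the module's exact logic: it adds a same-line check on top and only accepts tokens to the right of the identifier. The helper name value_right_of, the y_tol default, and the sample coordinates are hypothetical.

```python
import re
from typing import Any, Dict, List, Optional

Word = Dict[str, Any]  # subset of a pdfplumber word dict: text, x0, x1, top

NUMBER_RE = re.compile(r"^\d+(?:\.\d+)?$")


def value_right_of(anchor: Word, words: List[Word],
                   x_tol: float = 12.0, y_tol: float = 3.0) -> Optional[float]:
    """Return the nearest numeric token right of `anchor` on the same text line."""
    best = None
    for w in words:
        gap = w["x0"] - anchor["x1"]                      # horizontal distance
        same_line = abs(w["top"] - anchor["top"]) <= y_tol
        if same_line and 0 <= gap <= x_tol and NUMBER_RE.match(w["text"]):
            if best is None or gap < best[0]:
                best = (gap, float(w["text"]))
    return None if best is None else best[1]


# Two-column layout: both bolt ids come first in reading order, then the values.
words = [
    {"text": "FLANGE-1", "x0": 60, "x1": 110, "top": 200},
    {"text": "GUY-A", "x0": 300, "x1": 335, "top": 200},
    {"text": "150", "x0": 116, "x1": 134, "top": 200.5},
    {"text": "220", "x0": 341, "x1": 359, "top": 200.4},
    {"text": "999", "x0": 116, "x1": 134, "top": 230},    # next line, ignored
]
print(value_right_of(words[0], words))  # 150.0
print(value_right_of(words[1], words))  # 220.0
```

Because the match is driven by x and top rather than list position, the interleaved reading order in the sample does not cause FLANGE-1 to pick up GUY-A's value.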

Step 4 — Validate against engineering bounds. Coerce the matched number to a float and reject anything at or below zero or above the safety ceiling by raising a custom TorqueDriftError. A silent null two weeks before a renewal window is far more expensive than a loud exception during ingestion.
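
The boundary behaviour of the gate is easy to get wrong by one comparison, so here is a standalone sketch of it. It mirrors the 0 < value <= 5000 rule from the prerequisites; the helper name coerce_torque and the choice to wrap unparseable input in the same exception are assumptions for illustration.

```python
class TorqueDriftError(Exception):
    """Raised when a torque value falls outside engineering safety bounds."""


def coerce_torque(raw: str, low: float = 0.0, high: float = 5000.0) -> float:
    """Parse a numeric string and enforce low < value <= high."""
    try:
        value = float(raw.strip())
    except ValueError as exc:
        raise TorqueDriftError(f"not a number: {raw!r}") from exc
    # NaN fails every comparison, so it is rejected here as well.
    if not low < value <= high:
        raise TorqueDriftError(f"torque {value} outside ({low}, {high}]")
    return value


print(coerce_torque("150"))    # 150.0
print(coerce_torque("5000"))   # 5000.0 (the ceiling itself is accepted)
for bad in ("0", "1500000"):
    try:
        coerce_torque(bad)
    except TorqueDriftError as err:
        print("rejected:", err)
```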

Step 5 — Tag every record with its extraction method. Stamp each TorqueRecord with table_zone or spatial_scan so a reviewer can instantly see which readings came from the trusted structured path and which from the heuristic fallback.
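
One way a reviewer might use the method tag is to summarise how much of a report came from each path. The sketch below works on the tag strings alone; the helper name method_share is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def method_share(methods: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of records produced by each extraction method."""
    counts = Counter(methods)
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()} if total else {}


print(method_share(["table_zone", "table_zone", "table_zone", "spatial_scan"]))
# {'table_zone': 0.75, 'spatial_scan': 0.25}
```

Against real output this would be called as method_share(r.method for r in recs). A report that is entirely spatial_scan is a candidate for a closer look at the crop zone.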

Step 6 — Hash the canonical record set for audit immutability. Serialise the records with sorted keys and hash the result with hashlib.sha256. Regenerating that digest later proves the torque data presented today is byte-for-byte what the extractor produced on ingestion day.
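
Regenerating the digest later looks roughly like this. The sketch hashes plain dictionaries so it runs on its own; the verify helper and the sample record are hypothetical, and the constant-time comparison via hmac.compare_digest is a precaution added here, not something the original module does.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, List


def audit_hash(rows: List[Dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(rows: List[Dict[str, Any]], stored_digest: str) -> bool:
    """Recompute the digest and compare it with the one stored at ingestion."""
    return hmac.compare_digest(audit_hash(rows), stored_digest)


rows = [{"site_id": "TWR-8842", "bolt_id": "FLANGE-1", "torque_value": 150.0}]
sealed = audit_hash(rows)

print(verify(rows, sealed))          # True
rows[0]["torque_value"] = 151.0      # simulate a later edit
print(verify(rows, sealed))          # False
```

Two design points follow from how json.dumps works. sort_keys orders the keys inside each record but not the records themselves, so the same readings in a different order produce a different digest; sort the list (for example by page and bolt id) before hashing if extraction order can vary. And a bare hash only proves integrity if the digest is stored somewhere an editor of the records cannot also rewrite.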

Complete Runnable Example

The module below implements every step with realistic telecom identifiers, stdlib structured logging, a custom exception, and a SHA-256 audit hash over the extracted set. The diagram traces one page through the table-first path and its spatial fallback before the code.

Figure: bolt torque extraction with table-first parsing and a coordinate-aware spatial fallback.

python

# pip install pdfplumber
import hashlib
import json
import logging
import re
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Iterator, List

import pdfplumber

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("torque_extractor")

TORQUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(ft-lb|ft-lbs|N·m|Nm|in-lb)?", re.IGNORECASE)
BOLT_RE = re.compile(r"^(B|BOLT|FLANGE|GUY|BASE|ANCHOR)[\s\-]?(\d+|[A-Z])?$", re.IGNORECASE)
TORQUE_ZONE = (50, 100, 550, 700)   # torque matrix region; tune per vendor template
COORD_TOLERANCE = 12.0              # points of drift allowed between a bolt id and its value


class TorqueDriftError(Exception):
    """Raised when a torque value falls outside engineering safety bounds."""


@dataclass
class TorqueRecord:
    site_id: str
    bolt_id: str
    torque_value: float
    unit: str
    method: str
    page: int


def _validate(site_id: str, bolt_id: str, value: float) -> None:
    if not 0 < value <= 5000:
        raise TorqueDriftError(f"{site_id} bolt {bolt_id}: torque {value} out of safety bounds")


def _parse_row(site_id: str, row: List[str | None], page: int) -> Iterator[TorqueRecord]:
    # Expect the bolt id in column 0 and the torque reading in column 2 or later;
    # skip short rows and continuation rows whose bolt id cell is empty.
    if len(row) < 3 or not row[0]:
        return
    bolt_id = row[0].strip()
    match = TORQUE_RE.search(" ".join(str(c) for c in row[2:] if c))
    if not match:
        return
    value = float(match.group(1))
    _validate(site_id, bolt_id, value)
    yield TorqueRecord(site_id, bolt_id, value, match.group(2) or "ft-lb", "table_zone", page)


def extract_torque(site_id: str, pdf_path: Path) -> List[TorqueRecord]:
    records: List[TorqueRecord] = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            crop = page.crop(TORQUE_ZONE)
            tables = crop.find_tables()
            if tables:
                for table in tables:
                    for row in table.extract():
                        records.extend(_parse_row(site_id, row, i))
                continue
            # Spatial fallback: no table on this page, pair words by coordinate.
            logger.warning("no table on page %d for %s; spatial scan", i, site_id)
            words = crop.extract_words()
            for j, word in enumerate(words):
                if not BOLT_RE.match(word["text"]):
                    continue
                for nxt in words[j + 1:j + 4]:
                    match = TORQUE_RE.search(nxt["text"])
                    if match and abs(nxt["x0"] - word["x1"]) < COORD_TOLERANCE:
                        value = float(match.group(1))
                        _validate(site_id, word["text"], value)
                        records.append(TorqueRecord(
                            site_id, word["text"], value, match.group(2) or "ft-lb", "spatial_scan", i))
                        break
    return records


def audit_hash(records: List[TorqueRecord]) -> str:
    canonical = json.dumps([asdict(r) for r in records], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    recs = extract_torque("TWR-8842", Path("twr8842_flange_audit.pdf"))
    digest = audit_hash(recs)
    for r in recs:
        logger.info("TORQUE | %s | bolt=%s | %.1f %s | %s | p%d",
                    r.site_id, r.bolt_id, r.torque_value, r.unit, r.method, r.page)
    logger.info("AUDIT | TWR-8842 | %d records | %s", len(recs), digest[:12])
    print(f"TWR-8842: {len(recs)} torque records | audit={digest[:12]}")

Verification & Expected Output

Against a clean report whose flange-bolt matrix parses from the table path, the extractor emits one structured line per reading plus a sealing audit line:

text

2026-06-10 08:41:12 | INFO | TORQUE | TWR-8842 | bolt=FLANGE-1 | 150.0 ft-lb | table_zone | p1
2026-06-10 08:41:12 | INFO | TORQUE | TWR-8842 | bolt=FLANGE-2 | 148.5 ft-lb | table_zone | p1
2026-06-10 08:41:12 | INFO | TORQUE | TWR-8842 | bolt=GUY-A | 220.0 ft-lb | table_zone | p2
2026-06-10 08:41:12 | INFO | AUDIT | TWR-8842 | 3 records | 9f2c1a7b4e08
TWR-8842: 3 torque records | audit=9f2c1a7b4e08

To assert behaviour in a test, check invariants rather than exact values: assert all(r.torque_value > 0 for r in recs) confirms validation ran, and assert audit_hash(recs) == audit_hash(recs) confirms the digest is deterministic for identical input. Two failure signatures matter. A no table on page N warning followed by zero records means the crop zone missed the matrix — widen TORQUE_ZONE and re-run. A TorqueDriftError naming a value like 1500000 almost always means a column merged two fields (a part number ran into the torque cell); inspect that row’s raw table.extract() output rather than loosening the safety bound.

Gotchas & Edge Cases

Rotated inspection pages. Climb-sheet exporters frequently emit landscape torque matrices as a portrait page with page.rotation == 90, so the fixed TORQUE_ZONE coordinates no longer line up with the matrix: the crop comes back empty or, depending on the pdfplumber version, page.crop(...) raises because the box falls outside the rotated page. Check page.rotation and swap the bbox axes (or keep a separate zone per rotation) before cropping. A silently empty zone is the most common cause of a report parsing to zero records.
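Unit ambiguity and Unicode. The same bolt may be specced in ft-lb, N·m, or in-lb, and legacy templates sometimes render the middle dot as a look-alike code point (for example the dot operator U+22C5 instead of the middle dot U+00B7), so a visually identical N·m is not captured as a unit and the reading falls back to the default. Normalise with unicodedata.normalize("NFKC", text) before the regex and map any remaining dot look-alikes explicitly, since NFKC does not fold all of them. Never assume a default unit silently: the example's match.group(2) or "ft-lb" fallback is a convenience, and a base-plate value recorded as ft-lb when it was really N·m overstates the torque by roughly 1.36x and can still pass a bounds check while being physically wrong.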
Wrapped multi-column bolt tables. When a torque cell wraps to a second visual line, table.extract() can split one reading across two rows, leaving the value row with an empty bolt id. Guard on row[0] (as the example does) so a headerless continuation row is skipped rather than paired with the wrong bolt.
No text layer at all. A scan with no embedded text yields extract_words() == [] on every page and produces zero records with no error. Detect it early (if not page.chars: route_to_ocr(...)) and hand the file to the OCR pipeline instead of reporting a clean tower with no torque data.
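
For the unit-ambiguity item above, a normalisation step might look like the following. This is a sketch, not the module's behaviour: the alias table, the set of dot look-alikes, and the helper names canonical_unit and to_ft_lb are assumptions, and the conversion uses 1 ft-lb = 1.3558179483 N·m = 12 in-lb.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"

# Dot look-alikes that NFKC leaves untouched; map them to U+00B7 explicitly.
_DOTS = {"\u22c5": MIDDLE_DOT, "\u2219": MIDDLE_DOT, "\u2022": MIDDLE_DOT}

# Hypothetical alias table covering spellings seen in the regex above.
_ALIASES = {"ft-lbs": "ft-lb", "nm": "n" + MIDDLE_DOT + "m", "in-lbs": "in-lb"}

# Divisors that convert one unit of each kind into ft-lb.
_PER_FT_LB = {"ft-lb": 1.0, "n" + MIDDLE_DOT + "m": 1.3558179483, "in-lb": 12.0}


def canonical_unit(raw: str) -> str:
    """Fold Unicode variants and spelling aliases into one unit key."""
    text = unicodedata.normalize("NFKC", raw)
    for src, dst in _DOTS.items():
        text = text.replace(src, dst)
    text = text.strip().lower()
    text = _ALIASES.get(text, text)
    if text not in _PER_FT_LB:
        raise ValueError(f"unrecognised torque unit: {raw!r}")
    return text


def to_ft_lb(value: float, raw_unit: str) -> float:
    """Convert a reading to ft-lb; unknown units raise instead of defaulting."""
    return value / _PER_FT_LB[canonical_unit(raw_unit)]


print(round(to_ft_lb(200, "N\u22c5m"), 1))   # 147.5
print(round(to_ft_lb(1800, "in-lbs"), 1))    # 150.0
```

Raising on an unrecognised unit, rather than falling back to ft-lb, keeps a mislabelled reading from passing the bounds check unnoticed.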

FAQ

What if the inspection report is a scanned image with no text layer?

pdfplumber only reads text that already exists in the PDF object stream, so a flat 300-DPI scan of a paper climb sheet returns empty words and tables on every page. Detect that case by checking page.chars or the total extracted text length, and route the file to the optical path in OCR for Legacy Inspection Forms, which rasterises the page, runs recognition with a confidence threshold, and hands back a searchable text layer. Only then does this table-first and spatial-scan workflow apply.
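
The detection can be reduced to a decision over per-page character counts. In this sketch the counts would come from len(page.chars) for each page; the helper name needs_ocr and the 20-character threshold (meant to ignore a stray stamp or page number on an otherwise scanned page) are assumptions.

```python
from typing import Sequence


def needs_ocr(chars_per_page: Sequence[int], min_chars: int = 20) -> bool:
    """True when no page carries at least `min_chars` extracted characters."""
    return all(count < min_chars for count in chars_per_page)


print(needs_ocr([0, 0, 0]))     # True: flat scan, route to OCR
print(needs_ocr([3, 0]))        # True: only stray characters
print(needs_ocr([0, 1843]))     # False: at least one page has real text
```

With an open document this would be called as needs_ocr([len(p.chars) for p in pdf.pages]). A mixed file, where some pages are scans and others are native text, passes this document-level check, so a per-page variant may be needed for those.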

How do I tune the crop zone for a new vendor template?

Open one representative report and print page.find_tables() bounding boxes, or render page.to_image().debug_tablefinder() to see exactly where the torque matrix sits in PDF points. Set TORQUE_ZONE to enclose that region with a little margin. Because the extractor tags each record with its method, you can validate the new zone by confirming the readings come back as table_zone rather than falling through to spatial_scan — a template that only ever hits the fallback usually means the crop missed the matrix.
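
Turning the printed bounding boxes into a zone is a small calculation. The sketch below takes boxes in the (x0, top, x1, bottom) form that a pdfplumber table exposes as its bbox attribute; the helper name enclosing_zone, the 10-point margin, and the sample boxes are hypothetical.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom) in PDF points


def enclosing_zone(boxes: Iterable[BBox], margin: float = 10.0) -> BBox:
    """Return the smallest box covering all inputs, padded by `margin` points."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("no table bounding boxes to enclose")
    return (
        min(b[0] for b in boxes) - margin,
        min(b[1] for b in boxes) - margin,
        max(b[2] for b in boxes) + margin,
        max(b[3] for b in boxes) + margin,
    )


# Two stacked tables on a sample page (made-up coordinates).
print(enclosing_zone([(72, 140, 540, 380), (72, 400, 540, 610)]))
# (62.0, 130.0, 550.0, 620.0)
```

On a real page this would be fed with enclosing_zone(t.bbox for t in page.find_tables()). If the margin pushes the result past the page edge, trim it to page.bbox before using it as TORQUE_ZONE.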

Why parse the table first and keep a spatial scan as a fallback?

Structured table extraction is the highest-confidence path because the bolt identifier and its value arrive in known columns, so it should always be tried first. But not every vendor emits a real table — some lay torque values out as positioned free text with no ruling lines, and find_tables() returns nothing. The coordinate-aware spatial scan recovers those readings by pairing a bolt identifier with the nearest numeric token to its right, so a layout without tables degrades to a slightly lower-confidence method instead of silently losing data.

Related

Up to the parent workflow: PDFplumber Extraction Workflows
Sibling task: Async Batch Processing for Multi-Site Structural Reports
The full architecture: Automated Structural Report Parsing & Document Ingestion
