yfedoseev/pdf_oxide
2026-05-18 00:14:03
pdf_oxide

v0.3.50 | True destructive PDF redaction, PAdES-B-T/B-LT long-term-validation signatures, a runtime cryptographic algorithm-governance policy, and split-PDF-by-bookmarks across all seven bindings, plus a signature-date correctness fix.

Added

  • True destructive redaction (#231) — the prior "redaction" only drew a filled rectangle over content whose bytes survived (recoverable by copy-paste / pdftotext / a hex editor). Redaction is now destructive: the text under each region is physically removed from the content stream — every glyph whose ISO 32000-1:2008 §9.4.4 text-rendering box intersects the (edge-padded) region is deleted, survivors are re-emitted with a fresh absolute Tm and no TJ deltas so neither the glyphs nor a width/shift side channel (Bland et al., PETS 2023) remain; the page is rewritten so the original content object is dropped by the garbage-collected full rewrite (no residual recoverable bytes); an opaque overlay marks the area (ISO 32000-1:2008 §12.5.6.23, "remove all traces … clipping shall not be used"). Composite/Type0/unknown fonts are refused rather than risk a silent under-redaction (fail-closed). New DocumentEditor::add_redaction / redaction_count / apply_redactions_destructive plus the pdf_redaction_add/count/apply/scrub_metadata C ABI and Python, WASM, Node, C#, Go bindings and a pdf-oxide redact INPUT --rect PAGE:x0,y0,x1,y1 [--from-annotations] [--fill R,G,B] [--no-scrub-metadata] CLI. The legacy apply_page_redactions/apply_all_redactions keep their signatures. Standalone document sanitization (DocumentEditor::sanitize_document, the live pdf_redaction_scrub_metadata C ABI, Python sanitize_document, WASM sanitizeDocument, and the already-wired Node/C#/Go scrub paths) strips the /Info dictionary, the catalog XMP /Metadata stream, document JavaScript (/OpenAction, /AA, /Names/JavaScript) and /Names/EmbeddedFiles; the removed object subtrees are hard-excluded from the rewritten file so a secret cannot survive even as a GC-missed orphan (G6). Geometric image/path/XObject pruning remains roadmap; composite-font text and encrypted documents are refused (not under-redacted).
  • PAdES long-term-validation signatures (#235) — signing now produces ETSI EN 319 142-1 PAdES baseline signatures, not just bare adbe.pkcs7.detached: B-B embeds the RFC 5035 ESS signing-certificate-v2 signed attribute; B-T adds an RFC 3161 signature-time-stamp unsigned attribute over the signature value; B-LT appends a Document Security Store (ISO 32000-2:2020 §12.8.4.3 — certs/CRLs/OCSPs + a per-signature /VRI keyed by the uppercase-hex SHA-1 of the signature's /Contents) as an append-only second incremental update, so the original signature's byte range is untouched and stays Valid. Read side: read_dss parses a /DSS and classify_pades_level reports a signature's level (B-B/B-T/B-LT). New sign_pdf_bytes_pades / PadesLevel / RevocationMaterial / DocumentSecurityStore in core, the pdf_sign_bytes_pades / pdf_signature_get_pades_level / pdf_document_get_dss / pdf_dss_* C ABI, and Python, WASM, Node, C#, Go bindings. B-LTA is also produced: a /Type /DocTimeStamp (/SubFilter /ETSI.RFC3161) RFC 3161 timestamp over the whole file including the DSS, appended as a third incremental update so the archival timestamp covers the signature and its validation material; has_document_timestamp is the document-scoped reader signal (classify_pades_level stays signature-scoped and tops out at B-LT by design — the frozen pdf_signature_get_pades_level C ABI has no document handle). The legacy sign_pdf_bytes adbe.pkcs7.detached path is byte-for-byte unchanged. Final ETSI conformance is gated on the EU DSS demonstration-validator release check (online TSA fetch is CGo/native-only — WASM takes a pre-fetched RFC 3161 token).
  • Runtime crypto-governance policy (#230) — a process-wide crypto::SecurityPolicy (modes compat / strict / fips-strict, plus an allow:/deny:<alg>@<read|write> override grammar) layered as an orthogonal, set-once decorator over the existing CryptoProvider. Read/write asymmetry lets a deployment read legacy RC4/MD5 PDFs while forbidding weak crypto on write or new signatures; fail-closed throughout (unknown algorithm / unparseable spec ⇒ deny). Includes a content-keyed inventory() governance report and a pluggable AuditSink. Exposed across all seven surfaces (Rust, Python, C ABI, Go, C#, WASM, Node) as set_crypto_policy / crypto_policy / crypto_inventory. Default (compat) behaviour is byte-for-byte unchanged. The residual password-key-derivation MD5 (ISO 32000-1 §7.6.3 Algorithm 1/2/3/5/7) is now also routed through the governed provider, so a strict/fips-strict policy denies legacy R≤4 at the primitive level, not only the operation gate — closing the gap noted in the v0.3.50 slice. The hashing is byte-identical under compat (existing encrypted PDFs still decrypt; newly written ones are bit-for-bit unchanged). Non-security opaque MD5 (file identifier, embedded-file /CheckSum) is deliberately left direct so a strict policy still permits AES-256 writes. A machine-readable CycloneDX 1.6 Cryptographic Bill of Materials of the algorithms a run actually exercised is exported via crypto_cbom (core cbom_json + C ABI / Python / WASM / Go / Node / C# bindings) — the structured complement to crypto_inventory for CBOM/SPDX-crypto governance. 
The policy now also recognises and governs post-quantum algorithms: PolicyMode::Cnsa2 (CNSA 2.0 — new crypto must be FIPS-approved and 192-bit-class or stronger; 128-bit classical and L1/L2 PQC denied for write) and PolicyMode::PqcReady (Strict semantics that additionally recognise/permit ML-DSA/ML-KEM for classical+PQC dual-stacking during migration), plus ML-DSA-44/65/87 (FIPS 204) and ML-KEM-512/768/1024 (FIPS 203) AlgorithmIds in inventory()/CBOM/the policy grammar. This is governance vocabulary (the policy decides; the actual ML-DSA/ML-KEM primitives are a separate provider concern — a sign attempt fails closed until they land). Set via the string grammar (crypto_policy("cnsa2")), so all seven bindings get it with no API change; frozen AlgorithmId bit indices are preserved (PQC ids appended). A governed RSA modulus-size floor is also enforced for signing: SecurityPolicy::min_rsa_modulus_bits (per-mode default — Compat 0, Strict/PqcReady 2048, FipsStrict/Cnsa2 3072 per NIST SP 800-131A / CNSA 2.0) makes sign_pdf_bytes/sign_pdf_bytes_pades fail closed with a weak RSA key — the key-strength gate the algorithm-level min_security_bits cannot see. Default compat keeps no floor (byte-for-byte unchanged). (Finer X.509 cert-policy governance — keyUsage / extendedKeyUsage / validity-window enforcement for the signing certificate — is the remaining #230 roadmap item, tracked as a focused follow-up. Per-document policy override (Phase G) was design-assessed and deliberately deferred: the active policy is set-once specifically because a mid-flight downgrade is an attack vector, so a runtime widening override (e.g. relax-for-one-document) cannot be added safely; the only sound shape is an explicit per-document policy threaded through every crypto call site — a large cross-cutting change, tracked as a separate follow-up, not a set-once relaxation.)
  • Split a PDF by bookmarks (#482) — new pdf-oxide split --by-bookmarks [--bookmark-prefix P] [--bookmark-level N] [--ignore-case] [--no-front-matter] CLI, plus plan_split_by_bookmarks / split_by_bookmarks* in core and every binding (Python, WASM, C ABI, Go, C#, Node). Splits at outline boundaries into one PDF per (optionally prefix-filtered) bookmark, with collision-free, filesystem-safe filenames. Outline parsing now resolves named destinations (catalog /Dests dictionary and the /Names/Dests name tree, ISO 32000-1 §12.3.2.3 / §7.9.6), bounded against malformed/cyclic name trees. Plain per-page split is unchanged (backward compatible).
  • Full idiomatic cross-binding parity for #230/#231/#235/#482 — every feature is now exposed idiomatically in all supported bindings (Rust, Python, C ABI, WASM, C#, Go-cgo, Go-purego, Node/TS):
    • A new additive C ABI pdf_document_has_timestamp(doc) exposes the document-scoped PAdES-B-LTA reader signal that pdf_signature_get_pades_level (signature-scoped, ≤B-LT by design) cannot report; surfaced as Python has_document_timestamp, WASM hasDocumentTimestamp, C# PdfDocument.HasDocumentTimestamp, Go (*PdfDocument).HasDocumentTimestamp, and Node PdfDocument.hasDocumentTimestamp / SignatureManager.
    • Python now re-exports the entire signing/PAdES surface (sign_pdf_bytes, sign_pdf_bytes_pades, Certificate, Signature, PadesLevel, RevocationMaterial, Dss) plus crypto_cbom from the top-level pdf_oxide package under idiomatic names (the functions were previously reachable only as py_-prefixed symbols on the private extension module).
    • The standalone document sanitization entrypoint (#231) is now a first-class SanitizeDocument() on the C# and Go (cgo + purego) DocumentEditor (previously the live pdf_redaction_scrub_metadata C ABI had no managed/Go wrapper).
    • The Go purego (CGO-free) backend, previously read-side only, now covers crypto-governance (#230), destructive redaction + sanitize (#231), PAdES signing + DSS read + B-LTA (#235), and split-by-bookmarks (#482) with signatures identical to the cgo backend.
    • Node/TS gains idiomatic signPdfBytesPades, PadesLevel, PdfDocument.getDocumentSecurityStore/hasDocumentTimestamp/planSplitByBookmarks, setCryptoPolicy/cryptoPolicy/cryptoInventory/cryptoCbom, and SecurityManager / SignatureManager / OutlineManager methods, all with generated TypeScript declarations. Behaviour and the frozen PadesLevel integer mapping are unchanged.
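The glyph-selection rule in the destructive-redaction entry (delete every glyph whose box intersects the edge-padded region) can be pictured as a rectangle-intersection test. The following is a minimal Python sketch, not pdf_oxide's implementation: Box, split_glyphs, and the 0.5 pt padding are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def padded(self, pad: float) -> "Box":
        # Grow the region on every side so glyphs just past the edge are caught.
        return Box(self.x0 - pad, self.y0 - pad, self.x1 + pad, self.y1 + pad)

    def intersects(self, other: "Box") -> bool:
        # Touching edges count as an intersection (fail-closed).
        return not (self.x1 < other.x0 or other.x1 < self.x0
                    or self.y1 < other.y0 or other.y1 < self.y0)

def split_glyphs(glyphs, region, pad=0.5):
    """Partition (char, Box) pairs into (survivors, removed)."""
    target = region.padded(pad)
    survivors, removed = [], []
    for ch, box in glyphs:
        (removed if box.intersects(target) else survivors).append((ch, box))
    return survivors, removed

# Hypothetical glyph boxes on one text line; the third starts 0.4 pt past the region.
glyphs = [
    ("S", Box(10.0, 10.0, 16.0, 20.0)),
    ("S", Box(16.0, 10.0, 22.0, 20.0)),
    ("N", Box(22.4, 10.0, 28.0, 20.0)),
    ("x", Box(40.0, 10.0, 46.0, 20.0)),
]
region = Box(10.0, 10.0, 22.0, 20.0)
survivors, removed = split_glyphs(glyphs, region)
```

With the padding, the third glyph is removed as well; with pad=0.0 it would survive, which is the kind of borderline case edge padding exists to close.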

Fixed

  • Wrong dates in digital-signature timestamps: format_pdf_date hard-coded the month/day to 0101 and approximated the year as 1970 + days/365, so every signature /M value (and every document timestamp) was an incorrect January 1 of a leap-drifted year (ISO 32000-1 §7.9.4). Replaced with a single leap-year-correct implementation; the two divergent copies are gone.
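To make the date bug concrete, here is a minimal Python sketch of a leap-year-correct conversion from Unix seconds to a PDF date string, next to the drifting year approximation the entry describes. It is an illustration only: the function names are hypothetical (the Python format_pdf_date below merely borrows the name) and the day-to-date arithmetic is a standard civil-calendar algorithm, not necessarily the one pdf_oxide uses.

```python
from datetime import datetime, timezone

def civil_from_days(days: int):
    """Days since 1970-01-01 -> (year, month, day), proleptic Gregorian."""
    z = days + 719468                      # shift epoch to 0000-03-01
    era = z // 146097                      # 400-year cycles
    doe = z - era * 146097                 # day of era, 0..146096
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # day of March-based year
    mp = (5 * doy + 2) // 153              # March-based month, 0..11
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day

def format_pdf_date(unix_seconds: int) -> str:
    """Format a UTC timestamp as a PDF date string, e.g. D:YYYYMMDDHHmmSSZ."""
    days, rem = divmod(unix_seconds, 86400)
    year, month, day = civil_from_days(days)
    hh, rem = divmod(rem, 3600)
    mm, ss = divmod(rem, 60)
    return f"D:{year:04d}{month:02d}{day:02d}{hh:02d}{mm:02d}{ss:02d}Z"

def naive_year(unix_seconds: int) -> int:
    """The approximation described above: every year treated as 365 days."""
    return 1970 + (unix_seconds // 86400) // 365

def reference_pdf_date(unix_seconds: int) -> str:
    """Cross-check against the standard library."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return dt.strftime("D:%Y%m%d%H%M%SZ")
```

For 2026-12-25 00:00:00 UTC (1798156800), format_pdf_date returns D:20261225000000Z, while naive_year already reports 2027 because the accumulated leap days push days/365 over the year boundary.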

Security

  • Redaction now actually removes content (#231) — the Node editing-manager redaction methods previously called native pdf_redaction_* symbols that did not exist (silently no-op'ing — a security-critical operation pretending to succeed while removing nothing). Those C ABI symbols now exist and perform true destructive redaction (see Added); the binding gap is closed across all surfaces. A [BLOCK] integration test builds a real PDF containing a secret, redacts it through the public API, and asserts the secret is absent from both re-extracted text and the raw saved bytes (idempotent).
  • PAdES long-term-validation signatures (#235) — PDF signatures can now carry the ESS signing-certificate-v2 binding (RFC 5035, defeats certificate-substitution), an RFC 3161 timestamp (B-T), and a Document Security Store for offline long-term validation (B-LT). The DSS is added as an append-only incremental update so pre-existing signatures provably remain Valid (asserted by the I1–I7 integrity-invariant suite in tests/pades_ltv.rs); a tampered signed region still fails verification (negative test). See Added for scope and the EU-DSS conformance gate.
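The per-signature /VRI key mentioned under Added (the uppercase-hex SHA-1 of the signature's /Contents) is simple to compute. The sketch below is a hedged Python illustration using only hashlib; vri_key is a hypothetical helper name, and exactly which bytes of the /Contents entry are hashed is an assumption to check against the implementation and ISO 32000-2.

```python
import hashlib

def vri_key(signature_contents: bytes) -> str:
    """Uppercase hex SHA-1 digest, used as the dictionary key under /DSS /VRI."""
    return hashlib.sha1(signature_contents).hexdigest().upper()

# Hypothetical signature bytes; a real value would be the decoded /Contents string.
example_contents = b"abc"
key = vri_key(example_contents)
```

SHA-1 serves here only as a lookup key for validation material, not as a security primitive.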

Thanks

  • @Suleman-Elahi for requesting split-by-bookmarks (#482).
  • @jedzill4 for volunteering on destructive redaction (#231).

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform | Architecture | Archive
Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz
Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz
macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz
macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz
Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

2026-05-16 14:56:23
pdf_oxide

v0.3.49 | Off-byte-0 PDF header recovery, sparse-trailer Catalog discovery, a render-path thread-safety fix, and release-automation hardening.

Fixed

  • Linearized PDFs with a non-zero %PDF- header offset (#509): files whose %PDF- header is preceded by leading bytes (e.g. a captive-portal HTML redirect injected ahead of a Linearized PDF) are now read instead of rejected with Trailer missing /Root entry. The xref-offset shift for header-offset PDFs no longer requires the final trailer to carry /Root; xref reconstruction now rejects a parsed-but-/Root-less trailer and falls through to Catalog discovery; and catalog() scans for /Type /Catalog when the trailer omits /Root (matching Poppler / PDFium behaviour, ISO 32000-2 §7.5.2 / 1.7 Implementation Note G.6).

  • Render-path data race under concurrent rendering (#505) — the process-wide embedded-font classification cache keyed on Arc::as_ptr could return a stale (is_byte_indexed, has_unicode_cmap) for an unrelated font when an allocation address was recycled across threads, intermittently surfacing as ParseException [1000] from RenderPage / RenderPageFit under Parallel.ForEach. The unsound global cache is removed; the cmap classification is now computed locally per call (a cheap ttf_parser table probe), so concurrent renders can no longer collide.

  • Test helper make_type0_font used a non-production Encoding variant (#504): the helper now maps Identity-H / Identity-V to Encoding::Identity exactly as the real font parser does, so the affected Type0 tests exercise the production code path instead of a variant production never produces. Purely test-correctness; no user-facing behaviour change.
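The header-offset recovery in #509 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the library's parser: the 1024-byte search window, the function names, and the regex-based Catalog scan are all hypothetical simplifications.

```python
import re

HEADER = b"%PDF-"

def find_header_offset(data: bytes, window: int = 1024) -> int:
    """Offset of the %PDF- header, searching only the first `window` bytes."""
    idx = data.find(HEADER, 0, window)
    if idx < 0:
        raise ValueError("no %PDF- header in the search window")
    return idx

def shift_offsets(xref_offsets, header_offset):
    """xref offsets are relative to the header, so add its position in the file."""
    return [off + header_offset for off in xref_offsets]

def find_catalog(data: bytes):
    """Fallback when the trailer has no /Root: find an object with /Type /Catalog.

    Returns (object_number, generation) or None. The tempered pattern stops at
    'endobj' so a match never spans two objects.
    """
    pattern = rb"(\d+)\s+(\d+)\s+obj\b(?:(?!endobj).)*?/Type\s*/Catalog\b"
    m = re.search(pattern, data, re.DOTALL)
    return (int(m.group(1)), int(m.group(2))) if m else None

# Hypothetical file: junk bytes injected ahead of a tiny PDF body.
prefix = b"<html>redirect</html>\n"
body = (b"%PDF-1.7\n"
        b"1 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n"
        b"2 0 obj\n<< /Type /Catalog /Pages 1 0 R >>\nendobj\n")
data = prefix + body
```

Here find_header_offset(data) returns len(prefix), and find_catalog(data) returns (2, 0) because object 1 is skipped when the scan reaches its endobj without seeing /Type /Catalog.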

CI / Infrastructure

  • Release-notes title extraction hardened (#506): extract-release-notes.sh now bounds the subtitle scan to the requested version's section (no longer silently inheriting an older version's > blockquote), concatenates multi-line blockquotes instead of truncating at the first line, and fails loudly when the version section or its subtitle is missing. A validate-changelog PR/release-branch gate plus a release-title sanity check stop a malformed CHANGELOG from ever reaching the publish step, and a self-contained regression test covers the missing-section, missing-subtitle, multi-line, and cross-version false-scrape cases.

  • GitHub Deployments visibility for regular publishes (#493): each publish job in release.yml (crates.io, PyPI, npm, npm-native, NuGet, Homebrew/Scoop) now declares an environment:, so standard-pipeline publishes appear under the Deployments view with their artifact URL, matching what the FIPS pipeline already did.
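The three behaviours described for the release-notes extractor (scan bounded to one version's section, multi-line blockquotes joined, loud failure when something is missing) can be sketched as follows. This is a Python illustration, not the actual extract-release-notes.sh: the "## [X.Y.Z]" heading layout and the function name are assumptions about the CHANGELOG format.

```python
import re

def extract_subtitle(changelog: str, version: str) -> str:
    """Return the '>' blockquote subtitle of one version's section."""
    lines = changelog.splitlines()
    heading = re.compile(r"^##\s+\[?v?" + re.escape(version) + r"\]?(\s|$)")
    any_heading = re.compile(r"^##\s")  # level-2 only; '###' subsections do not match

    start = next((i for i, line in enumerate(lines) if heading.match(line)), None)
    if start is None:
        raise LookupError(f"no CHANGELOG section for version {version}")

    quote = []
    for line in lines[start + 1:]:
        if any_heading.match(line):
            break                      # never read into the next (older) section
        if line.startswith(">"):
            quote.append(line[1:].strip())
        elif quote:
            break                      # the blockquote has ended

    subtitle = " ".join(part for part in quote if part)
    if not subtitle:
        raise LookupError(f"no subtitle blockquote for version {version}")
    return subtitle

# Hypothetical changelog: the newest section has no subtitle yet.
sample = (
    "# Changelog\n\n"
    "## [0.3.50]\n\n### Added\n\n"
    "## [0.3.49] - 2026-05-16\n\n"
    "> Header recovery,\n> and CI hardening.\n\n### Fixed\n"
)
```

On this sample, asking for 0.3.49 yields the two blockquote lines joined with a space, while asking for 0.3.50 raises instead of inheriting the older section's subtitle.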

Thanks

  • @Goldziher (kreuzberg-dev) — opened #509 with a clean standalone reproducer (no app code), a pinned test file, a full multi-engine cross-check against Poppler, and a 156-PDF corpus survey that isolated this as the single legitimate file the parser rejected. That report turned a vague "Linearized PDF fails" into a precise header-offset + sparse-trailer root cause.

The remaining fixes (#506, #505, #504, #493) were surfaced internally while reviewing the v0.3.45–v0.3.47 release automation, the post-merge main CI runs, and the v0.3.47 PR review.


2026-05-16 05:12:00
pdf_oxide

v0.3.48 | Pluggable cryptographic provider — FIPS 140-3 compliance for

This release lands the office converter integration (#159): bidirectional PDF ↔ DOCX/PPTX/XLSX round-trip with layout-preserving fidelity, exposed through all seven bindings (Rust, Python, Node, WASM, C FFI, C#, Go). Typical text-heavy PDFs round-trip through an Office file and back at near-pixel parity to the source. The corpus harness used to validate the integration covers 26 PDFs spanning academic papers, hymnals, multi-column newspapers, slide decks, government forms, and policy documents.

Closes the v0.3.14-milestone feature request "PDF to Word/DOCX export": text styling (fonts / sizes / colours) preserved via layout-mode writers + Unicode/CJK system-font fallback; paragraphs / headings / lists preserved via positional frame anchors; image placement preserved via raster Image XObject + Form XObject rasterization. Tables flow through positional shapes (grid-aware reconstruction is still follow-up work).

Added

  • Bidirectional PDF ↔ DOCX/PPTX/XLSX conversion (#159): new OfficeConverter API converts in both directions across DOCX, PPTX, and XLSX. Layout-preserving writers (src/converters/{docx,pptx,xlsx}_layout.rs) emit one positionally-anchored shape / frame per PDF text span; the back-direction render path (render_positional_ir / render_pptx_positional) reproduces the source page near-identically. Available on every binding via the 09-new-features/office_conversion/ examples.

  • Unicode + CJK system-font fallback for office round-trip (src/fonts/unicode_fallback.rs): when the source PDF embeds a CID-only font subset the writer can't re-embed, a system Unicode face (DejaVu Sans → FreeSans → Noto Sans → Tinos / Arimo) and a CJK face (DroidSansFallbackFull → IPAGothic → NanumGothic → Unifont) are registered automatically. needs_unicode_fallback is WinAnsi-aware (curly quotes / em-en dashes / bullet / ellipsis / trademark stay on the source font); CJK ranges (Han / Hiragana / Katakana / Hangul / Compatibility Forms / Halfwidth–Fullwidth) route to the CJK face first. Restores Hebrew, Arabic, Latin Extended, Chinese, Japanese, and Korean characters that previously rendered as ? glyphs across all three formats.

  • Music-notation region detection + rasterization (src/converters/music_region_finder.rs): hymnals and sheet-music PDFs (Finale Maestro, SMuFL Bravura, Sibelius Petrucci / Opus, Adobe Sonata, LilyPond Emmentaler, …) are detected by combining a music-font allowlist with a 5-line staff-clustering pass on extract_paths. Detected music systems are rasterized once at 150 DPI and embedded as positioned PNGs; the source spans / shapes inside each music region are suppressed so glyph substitutions don't overlay the bitmap. Hymnal-style PDFs now round-trip with their staves and noteheads preserved instead of emitting random Latin characters from the missing music face.

  • Form XObject + inline-image rasterizer shared helper (src/converters/form_xobject_finder.rs::rasterize_form_and_inline_regions): the layout-mode writers and the flow-mode pdf_to_ir path share one helper that renders each page once at 150 DPI and crops per region. Vector figures (academic-paper charts, agency logos drawn as Form XObjects) survive the office round-trip; the prior per-region full-page render was replaced.

  • Per-run text colour preservation — PDF→DOCX/PPTX/XLSX now emits <w:color> / <a:solidFill> for spans carrying explicit colour; the back-render path drops to rich_paragraph instead of text_in_rect when any inline run has a colour so the colour survives the PDF render. Sibling office_oxide parser changes expose the colour on TextSpan for the docx, pptx slide, and pptx shape paths.
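The routing described for the Unicode + CJK fallback (WinAnsi-encodable characters stay on the source font, CJK ranges go to the CJK face, everything else goes to the Unicode face) can be sketched as a per-character classifier. This Python sketch is an assumption-laden illustration: pick_face is a hypothetical name, Python's cp1252 codec stands in for WinAnsiEncoding (close but not identical), and the range table is a plausible reading of the listed blocks rather than the library's actual table.

```python
CJK_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Han)
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFE30, 0xFE4F),  # CJK Compatibility Forms
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]

def pick_face(ch: str) -> str:
    """Return 'cjk', 'source', or 'unicode' for a single character."""
    cp = ord(ch)
    # CJK ranges are checked first, mirroring "route to the CJK face first".
    if any(lo <= cp <= hi for lo, hi in CJK_RANGES):
        return "cjk"
    try:
        ch.encode("cp1252")   # stand-in for WinAnsiEncoding
        return "source"
    except UnicodeEncodeError:
        return "unicode"

def faces_for(text: str):
    """Classify every character of a string."""
    return [pick_face(ch) for ch in text]
```

Curly quotes, dashes, the bullet, the ellipsis, and the trademark sign all encode in cp1252, so they stay on the source font, while Hebrew or Latin Extended letters fall through to the Unicode face.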

Fixed

  • Rotated-text watermark filter (src/converters/pdf_to_ir.rs::span_overlaps_rotated_chars): page-edge arXiv:NNNN.NNNNN [cat] DATE watermarks were leaking into the office round-trip as horizontal text strips mid-page. The new origin-based filter matches each span to its nearest extract_chars glyph by (origin_x, origin_y) distance and uses that glyph's rotation_degrees to decide whether to drop the span. Gated by a page-level chars_horizontal_dominant heuristic (≥75 % chars at ~0°) so PDFs whose text-matrix decomposition spuriously reports rotation = 90° for every glyph (Finale slide-mode decks) are left alone. Catches the watermark family across multiple arXiv papers.

  • Multi-column page handling in layout-mode line grouping (src/converters/layout_lines.rs::group_spans_into_lines) — refuses to merge a candidate span into the active line when its bbox.x sits more than max_font_size * 4 past the line's right edge. Threshold (~36-48 pt for body text) is wider than any justified inter-word gap but narrower than typical column gutters (60+ pt). Fixes German multi-column newspapers and 2-column arxiv papers where columns previously merged into one frame.

  • Drop-cap guard for layout-mode line grouping: group_spans_into_lines rejects merges when the candidate span's font size differs from the line's existing spans by > 2×. This anchors Nature-Methods-style drop-cap "A" wraps at the correct visual position instead of fusing them into a single heading-class frame with the body text below.

  • OpenType / CFF cmap rebuild and injection (src/fonts/cmap_injector.rs, src/document.rs) — two real bugs in the cmap-injection path that produced corrupted lowercase glyphs on strict OS renderers:

    • build_format4_cmap over-reported subtable length by 2 bytes (double-counted the reservedPad field). Strict ttf-parser / CoreText paths silently rejected the cmap; some Win/macOS renderers then mapped the affected codepoints to the wrong glyph.
    • extract_embedded_fonts_with_unicode_maps_and_widths was driving its Unicode→GID table off char_to_unicode, whose CID-as-Unicode fallback overwrote authoritative ToUnicode entries with identity mappings on Identity-H fonts. Now reads the ToUnicode CMap directly and filters U+FFFD plus C0 controls.
  • Shape-artefact filter for layout-mode DOCX (src/converters/docx_layout.rs) — drop solid-black rects > 25% page area (slide-background artefacts), solid-white rects > 50% page area (page-background rects emitted before text — would occlude the rendered text in the back-PDF), and rects > 1.2× page extent (extractor noise that wiped the entire frame).

  • XLSX layout-mode page count gate raised (src/document.rs::to_xlsx_bytes): LAYOUT_MAX_PAGES raised 30 → 200. The 134-page arXiv dissertation was being routed to flow-mode ir_to_xlsx, whose column-A row-N layout collapses the centered cover page into the top of column A. Layout-mode handles 100+ page documents fine; the gate now triggers only for very large reports.
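The format 4 length bug listed above is easy to reproduce on paper: the subtable is a 14-byte header, four parallel arrays of segCount 16-bit entries, one 2-byte reservedPad, and an optional glyphIdArray. The Python sketch below builds a minimal subtable (no glyphIdArray) and checks that the declared length equals the emitted byte count. It borrows the name build_format4_cmap for illustration only and is not the library's Rust implementation.

```python
import struct

def build_format4_cmap(segments):
    """Build a cmap format 4 subtable.

    segments: sorted list of (start_code, end_code, id_delta); the last entry
    must be the required terminator (0xFFFF, 0xFFFF, 1). idRangeOffset is 0
    for every segment, so there is no glyphIdArray.
    """
    seg_count = len(segments)
    entry_selector = seg_count.bit_length() - 1      # floor(log2(segCount))
    search_range = 2 * (1 << entry_selector)
    range_shift = 2 * seg_count - search_range
    # 14-byte header + 2-byte reservedPad (counted once) + 4 arrays of 2*segCount bytes
    length = 14 + 2 + 8 * seg_count

    out = struct.pack(">7H", 4, length, 0, 2 * seg_count,
                      search_range, entry_selector, range_shift)
    out += struct.pack(f">{seg_count}H", *(end for _, end, _ in segments))
    out += struct.pack(">H", 0)                      # reservedPad
    out += struct.pack(f">{seg_count}H", *(start for start, _, _ in segments))
    out += struct.pack(f">{seg_count}H", *(d & 0xFFFF for _, _, d in segments))
    out += struct.pack(f">{seg_count}H", *([0] * seg_count))  # idRangeOffset
    return out

# Map 'a'..'z' to glyph IDs 1..26 (idDelta = 1 - 0x61), plus the terminator.
table = build_format4_cmap([(0x61, 0x7A, 1 - 0x61), (0xFFFF, 0xFFFF, 1)])
declared_length = struct.unpack(">H", table[2:4])[0]
```

For two segments the table is 32 bytes and declares 32; counting reservedPad twice would declare 34, which is the mismatch strict parsers reject.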

Performance

  • ExtGState resolve cache: 75× speedup on vector-heavy PDFs (src/rendering/page_renderer.rs). Previously, apply_ext_g_state deep-cloned the per-Form ExtGState HashMap on every gs operator. Vector figures (scatter / contour plots emitted as Form XObjects) trigger this thousands of times per page; a typical academic paper with a dense plot can hit ~10 000 gs ops with 10 000+ unique ExtGState names. The clone dominated render time. The resource dict is now resolved once at the top of execute_operators and parsed-effect (ParsedExtGState) results are cached per dict_name. Measured on a ~10-page vector-heavy arXiv paper: PDF→DOCX dropped from 263 s to 3 s.

  • Debug-only path-rasterizer clones gated by log level (src/rendering/path_rasterizer.rs): path.clone().transform was unconditional, used only to populate pixel_bounds in a log::debug! line. The same vector figures hit this path tens of thousands of times per page. Now gated behind log::log_enabled!(Level::Debug).
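The shape of the ExtGState fix (resolve the resource dictionary once, then cache the parsed result per name instead of cloning on every gs operator) is sketched below in Python. The class and method names are hypothetical and the parsed fields are a small illustrative subset; the real renderer is Rust and its ParsedExtGState is not reproduced here.

```python
class ExtGStateResolver:
    """Caches parsed graphics-state parameters per ExtGState name."""

    def __init__(self, resources: dict):
        # Held by reference and resolved once; never copied per operator.
        self._resources = resources
        self._cache = {}
        self.parse_calls = 0   # instrumentation for the demonstration only

    def _parse(self, raw: dict) -> dict:
        self.parse_calls += 1
        # /LW, /ca and /CA are standard ExtGState keys (line width, fill and stroke alpha).
        return {
            "line_width": raw.get("LW"),
            "fill_alpha": raw.get("ca"),
            "stroke_alpha": raw.get("CA"),
        }

    def resolve(self, dict_name: str) -> dict:
        parsed = self._cache.get(dict_name)
        if parsed is None:
            parsed = self._parse(self._resources[dict_name])
            self._cache[dict_name] = parsed
        return parsed

# Simulate a page that issues the same `gs` operator 10 000 times.
resolver = ExtGStateResolver({"GS1": {"LW": 2.0, "ca": 0.5}})
for _ in range(10_000):
    state = resolver.resolve("GS1")
```

After the loop, parse_calls is 1: the work scales with the number of distinct names rather than the number of gs operators.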


2026-05-13 11:27:27
pdf_oxide

v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix.

This release closes the remaining bugs surfaced by the kreuzberg integration (issue #484) and ships the related text-extraction quality fixes. Word-F1 against the pdftotext-derived ground truth corpus now meets the kreuzberg quality floor for every PDF in the issue 484 set.

Fixed

  • kreuzberg regression suite: all 24 PDFs now meet the F1 floor (#484). extract_text previously failed three documents reported by @Goldziher on the kreuzberg corpus: pdfa_039.pdf (swimming-results table) returned F1 0.810, pr-136-example.pdf (CJK financial document) returned F1 0.709, and annotations.pdf returned F1 0.545. Three separate root-cause fixes restore them to F1 ≥ 0.85:

    • eliminate duplicate emission of multi-row table labels — the text-only spatial fallback in detect_tables_with_lines now requires config.text_fallback=true (which extract_text does not pass) so report-style PDFs with decorative ruling lines no longer get their cell content emitted twice; span_in_table adds a text-match fallback to catch label spans whose font ascent extends slightly above the cell's ink box (issue-53-example.pdf F1 0.867 → 0.992).
    • tighten cross-font glue and decimal merge for CJK + Latin layouts: cross_font_word_glue no longer fires on a CJK ↔ non-CJK boundary (CJK ideographs satisfy is_alphabetic() per Unicode and were being concatenated with adjacent Latin); the decimal_merge heuristic requires a column-boundary-sized gap (gap > 0.4 em) so per-glyph Tj operators in CJK documents stop mangling "2013" into "201.3" (pr-136 F1 0.709 → 0.884).
    • narrow CJK boundary forced-space to script glyphs only: should_insert_space now actively inserts a space at the CJK ↔ non-CJK boundary to match pdftotext tokenisation, but restricted to actual script glyphs (ideographs, kana, hangul); fullwidth ASCII operators like < > = μ stay inline with adjacent digits/Latin so compound tokens like "60000≤Q<80000" are preserved (issue-336 text quality gate stays at PASS). Reported by @Goldziher.
  • extract_spans now exposes a merge_tm_tj_runs opt-out (#488): same-line Tm+Tj runs were unconditionally batched into a single TextSpan, throwing away the per-Tm positioning that downstream layout-analysis code (e.g. column-aware table detection) needs. Setting SpanMergingConfig::merge_tm_tj_runs to false (the default remains true for backward compatibility) flushes the span buffer at every Tm operator, so callers can opt in to one span per Tm+Tj group, matching the granularity of pdftotext -bbox-layout. Reported by @haberman.

  • saveEncryptedToBytes no longer panics in browser WASM (#492): generate_file_id (per ISO 32000-1 §14.4) called std::time::SystemTime::now(), which is unimplemented on wasm32-unknown-unknown. The call is now cfg-gated so the WASM build derives the file identifier from uuid::Uuid::new_v4() only, which is still a unique opaque 16-byte ID per the spec. Reported by @eersis-byte.

  • CJK fullwidth operator spacing in to_markdown / to_html (#485) — Four coordinated changes restore issue-336-example.pdf to PASS on all three quality gates (text, markdown, html):

    • pipeline/converters/has_horizontal_gap suppresses space insertion when one side is CJK and the other is CJK or a fullwidth/math operator (≤, <, >, =, μ, etc.), mirroring the text-extraction CJK-pair suppression.
    • extract_cell_text no longer inserts an unconditional space between adjacent spans on the same row of a table cell — uses the same gap-aware separator rules as the inline-flow path so multi-span cells like 60000≤Q<80000 (rendered as 5 separate Tj operators) keep their compound tokens intact.
    • consolidate_adjacent_table_fragments (new helper in spatial_table_detector) merges vertically-adjacent tables that share an identical column structure. The line-based detector emits one fragment per ruling-rule strip on PDFs that draw a horizontal rule between every pair of rows; each fragment was failing is_real_grid and falling through to paragraph flow with column-based reading order, producing orphan <p>40000≤Q</p> / <p><55000</p> pairs. Consolidating before the filter lets the merged multi-row table survive.
    • is_real_grid accepts wide consolidated tables that have dense data rows alongside sparse header / multi-row-label rows — the strict 70 % dense-ratio gate was rejecting real tables whose column headers split across multiple visual rows. Score improvements on issue-336-example.pdf: text 0.612 → 0.820, markdown 0.577 → 0.863, html 0.632 → 0.646 (all PASS their thresholds).
  • Text-only spatial table fallback for line-less tables in to_markdown (#486): partial fix. extract_page_tables now opts in to a relaxed text-only detection when the caller is a converter (text_fallback=true), with the column ceiling raised from 15 to 25 so that sailing-score grids with 16-18 score columns are no longer rejected outright. The fragmented-table consolidation from #485 also kicks in here, recovering most of the row labels and identifier columns. nougat_018.pdf markdown still trails its threshold (0.656 vs 0.90) because the score columns themselves (variable-width sparse cells with parenthesised drop-scores) evade column detection; that is the remaining piece tracked separately.

  • HTML table cell rendering aligned with markdown (#487) — partial fix. to_html now uses the same span-walking and bold/italic preservation as to_markdown's render_table_markdown. Three of four affected docs improved by 1-4 % Jaccard but two (nougat_018, nougat_026) still trail the threshold pending the table-fragmentation work above.

  • RTL inline emphasis stripping in markdown extraction (#459): RTL detection now strips <strong> / <em> markers from visually-reversed runs in to_markdown consistently with the plain-text path; spec basis ISO 32000-1 §14.8.2.3.3 (Reverse-Order Show Strings). 46 unit tests in tests/test_rtl_script_support.rs cover the detector, BiDi algorithm, and inline-flow integration.

  • Multi-byte CMap parsing and array-form beginbfrange (§9.7.5): beginbfrange ... endbfrange array notation <src> <src> [<dst1> <dst2> ...] was not fully covered; the CMap parser now matches the spec's allowed grammar so multi-byte CIDs map correctly through ToUnicode CMaps.

  • /StructTreeRoot-only tagged PDFs (§14.7.4) — Documents that declare /StructTreeRoot in the catalog without a /MarkInfo dictionary (PDF 1.4 documents, valid per the spec) now correctly use the structure tree for table-cell content extraction. Resolves /OBJR content-item references during tree traversal so OBJR-referenced annotations and XObjects are no longer lost.

  • Indirect references in MediaBox/CropBox accessors (§7.7.3.4) — Page attribute accessors now resolve /MediaBox and /CropBox through indirect references and the /Pages inheritance chain. This is what made the Bucket A errors in the issue 484 retest comment (annotations*.pdf, pdfa_039.pdf) parse successfully.

  • CTM-aware cache key for Form XObject span extraction — Form XObject spans were cached by XObject reference alone, returning stale coordinates for the same XObject reused on multiple pages with different CTM transforms. Cache key now includes the CTM so repeated XObjects produce correctly-positioned spans on each invocation.

  • notdefrange U+FFFD no longer blocks the CID-as-Unicode fallback (§9.10.2) — Per the spec, U+FFFD (REPLACEMENT CHARACTER) signals "no proper Unicode mapping", so a notdefrange hit must not stop the priority list. The Identity CID-as-Unicode fallback (Priority 3) now fires correctly for composite fonts whose ToUnicode CMap returns U+FFFD.

  • ToUnicode Priority-3 fallback guarded for composite fonts (§9.10.2) — The CID-as-Unicode fallback is now only applied to fonts whose CMap is one of the predefined composite-font CMaps or whose CIDFont uses one of the Adobe character collections, matching the spec's enumeration; misapplication on other fonts could produce mojibake on previously-working files.

  • Reject prose / TOC / underline-annotation false-positive tables in to_html and to_markdown — Wide pages of ordinary paragraph text were sometimes detected as multi-column tables: word x-positions cluster into "columns" by accident, and decorative horizontal rules (newsletter mastheads, annotation underlines, page borders) tricked the line-based detector into treating two adjacent lines as a header + data row. The detection pipeline now applies several post-is_real_grid guards that look at the shape of the candidate's cell content rather than just its grid geometry:

    • looks_like_prose_table rejects a candidate when more than 12 % of cells end with a mid-sentence , or ;, more than 25 % of cells start with a lowercase ASCII letter (continuation fragments like "and", "the", "to"), or more than 10 % of cells are pure leader dots (the . . . . . . runs in tables of contents).
    • The text-only spatial fallback and the horizontal-rule-bounded path both now require ≥ 3 rows of evidence. A title plus a wrapped body line is the signature of prose, not a table; only the line-based intersection / cluster paths (which have authoritative visual evidence) still accept 2-row tables.
    • should_insert_space no longer forces a space at the CJK ↔ ASCII-punctuation boundary. The boundary forced-space added in v0.3.47 was correctly inserting a space at "神鹰集团" + "2015" but was wrongly producing "する ." instead of "する." in Japanese technical text; ASCII clause punctuation hugs the preceding token in every script, so the rule is now suppressed when the transitioning glyph IS the punctuation.
    • text_fallback defaults back to true on TableDetectionConfig. The new prose-shape filter replaces the gate-based protection added earlier in the cycle, so the public extract_tables API again detects line-less data tables out of the box.
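
The prose-shape guard described above can be sketched in a few lines of Python. This is only an illustration of the three ratio checks: the thresholds (12 %, 25 %, 10 %) come from the note, while the function body, the leader-dot regex, and the choice to compute ratios over non-empty cells are assumptions, not the library's implementation.

```python
import re

# Thresholds quoted in the release note; everything else is illustrative.
MAX_MID_SENTENCE_RATIO = 0.12     # cells ending with "," or ";"
MAX_LOWERCASE_START_RATIO = 0.25  # cells starting with a lowercase ASCII letter
MAX_LEADER_DOT_RATIO = 0.10       # cells that are only ". . . ." runs

# Assumed shape of a leader-dot cell: two or more dots, optionally spaced.
_LEADER_DOTS = re.compile(r"^(?:\.\s*){2,}$")


def looks_like_prose_table(cells):
    """Return True when the cell texts look like wrapped prose or a TOC."""
    texts = [c.strip() for c in cells if c.strip()]
    if not texts:
        return False
    total = len(texts)
    mid_sentence = sum(t.endswith((",", ";")) for t in texts)
    lowercase_start = sum("a" <= t[0] <= "z" for t in texts)
    leader_dots = sum(bool(_LEADER_DOTS.match(t)) for t in texts)
    return (
        mid_sentence / total > MAX_MID_SENTENCE_RATIO
        or lowercase_start / total > MAX_LOWERCASE_START_RATIO
        or leader_dots / total > MAX_LEADER_DOT_RATIO
    )
```

A candidate whose cells are sentence fragments trips the first two checks, a table of contents trips the third, and a plain data grid passes all three.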

Notes

  • tests/test_corpus_extraction_quality.rs now strips markdown formatting markers (**bold**, *italic*, | separators, ---|---|--- rule, # heading, ``` fences) before computing Jaccard against the plain-text GT — mirrors the HTML test's existing strip_html step so the score reflects text content rather than formatting markup.
  • All 19 quality-gate Jaccard tests in tests/test_corpus_extraction_quality.rs now pass (up from 13 at the start of this branch). The kreuzberg issue 484 corpus passes its F1 floor on every PDF.
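
To make the scoring change concrete, here is a rough Python sketch of stripping markdown markers before a word-level Jaccard comparison. The real test lives in Rust in tests/test_corpus_extraction_quality.rs; the regexes and the helper names `strip_markdown` and `jaccard` below are assumptions chosen for illustration.

```python
import re


def strip_markdown(text):
    """Remove the formatting markers listed in the note, keeping the words."""
    # Code-fence lines (three backticks, optionally followed by a language tag).
    text = re.sub(r"^`{3}.*$", " ", text, flags=re.MULTILINE)
    # Table rule lines such as ---|---|---
    text = re.sub(r"^[ \t]*\|?[ \t:|-]*-{3,}[ \t:|-]*$", " ", text, flags=re.MULTILINE)
    # Leading heading markers (# .. ######).
    text = re.sub(r"^[ \t]*#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # **bold** and *italic* wrappers.
    text = re.sub(r"\*{1,2}([^*\n]+)\*{1,2}", r"\1", text)
    # Remaining table cell separators.
    return text.replace("|", " ")


def jaccard(a, b):
    """Jaccard similarity over whitespace-separated word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Without the stripping step, tokens such as `**Name**` or `---|---|---` count as mismatches against the plain-text ground truth even when every word was extracted correctly.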

Thanks

This release was driven entirely by community bug reports and the kreuzberg integration test feedback loop:

  • @Goldziher (kreuzberg-dev) — opened #484 with a calibrated 166-PDF regression suite and follow-up retest comments that turned every remaining gap into a focused root-cause fix
  • @haberman — opened #488 with a minimal Rust reproducer for the Tm+Tj merging issue
  • @eersis-byte — opened #492 with the WASM SystemTime panic backtrace

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform | Architecture          | Archive
Linux    | x86_64 (glibc)        | pdf_oxide-linux-x86_64-*.tar.gz
Linux    | x86_64 (musl)         | pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux    | ARM64                 | pdf_oxide-linux-aarch64-*.tar.gz
macOS    | x86_64 (Intel)        | pdf_oxide-macos-x86_64-*.tar.gz
macOS    | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz
Windows  | x86_64                | pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

2026-05-11 11:52:29
pdf_oxide

v0.3.46 | Pluggable cryptographic provider — FIPS 140-3 compliance for

Added

  • Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering (#446, #481) — page.render_pixmap() (Python), renderToPixmap() (Node.js / Go), and Page.RenderToRgba() (C#) expose the premultiplied RGBA8888 buffer directly from tiny_skia::Pixmap::data(), eliminating the encode→decode roundtrip for callers that need raw pixels (PIL, sharp, System.Drawing.Bitmap, image.RGBA). Downscaling is now SIMD-accelerated via fast_image_resize (ARM NEON, x86 AVX2), replacing the previous bilinear path. Concurrent render_* calls on the same PdfDocument are now safe: all rendering functions take &PdfDocument (shared reference) and all interior-mutable state is already guarded by per-field Mutex, so the FFI layer no longer produces aliased &mut references and concurrent renders run without a global serialisation bottleneck. Requested by @mara004 and @potatochipcoconut.

  • ConversionOptions::exclude_regions / include_region (#484) — New spatial filtering fields allow callers to exclude rectangular regions from extraction output or restrict extraction to a single bounding rectangle. Backed by SpatialCollectionFiltering trait methods filter_by_rect / exclude_rects.

  • PageFontStats (#484) — New layout::PageFontStats struct computed in O(n) over spans; exposes dominant_em, dominant_line_height, dominant_char_width, and body_font_name. All layout heuristics now derive absolute thresholds from these measurements instead of hardcoded constants, improving correctness across a wider range of font sizes.
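
Because the raw buffer from the rendering entry above is premultiplied RGBA8888, consumers that expect straight (non-premultiplied) alpha need a conversion step. The pure-Python sketch below shows the arithmetic only; `unpremultiply_rgba` is a hypothetical helper, not part of pdf_oxide, and it assumes tightly packed 4-byte pixels with no row padding.

```python
def unpremultiply_rgba(buf):
    """Convert premultiplied RGBA8888 bytes to straight-alpha RGBA8888."""
    if len(buf) % 4 != 0:
        raise ValueError("RGBA8888 buffer length must be a multiple of 4")
    out = bytearray(buf)
    for i in range(0, len(out), 4):
        alpha = out[i + 3]
        if alpha == 0 or alpha == 255:
            # Fully transparent pixels carry no recoverable colour and fully
            # opaque pixels are already identical in both representations.
            continue
        for c in range(3):
            # Divide by alpha with round-to-nearest; clamp malformed input.
            out[i + c] = min(255, (out[i + c] * 255 + alpha // 2) // alpha)
    return bytes(out)
```

For large pages a vectorised implementation would be preferable; the loop form is kept here so the per-pixel formula is easy to read.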

Fixed

  • JBIG2-compressed scanner PDFs render as blank pages (#332) — The pass-through Jbig2Decoder returned compressed bytes unchanged, causing a dimension mismatch and a silent image drop. Integrates hayro-jbig2 v0.3 (pure-Rust, Apache-2.0 OR MIT); embedded JBIG2 bitstreams are decoded via hayro_jbig2::Image::new_embedded, with JBIG2Globals loaded from /DecodeParms when present. BitsPerComponent is overridden to 8 post-decode so to_dynamic_image() does not attempt CCITT bilevel decompression of already-decoded pixels. Reported by @frederikhors, who also confirmed the original vertical-flip / glyph-substitution symptom is resolved in v0.3.45.

  • add_text on existing PDF produces blank or discarded content (#483) — DocumentEditor::add_text on a page of an existing PDF either blanked the page or (when combined with select_pages) silently returned the unmodified original. Root causes: the storage-side page-index mapping after select_pages was off by one, and add_text failed to preserve the existing content stream when writing the new text layer. Both are fixed; an end-to-end regression suite is added. Reported by @stephenjudkins.

  • Text extraction corpus quality improvements across 166 PDFs (#484) — Systematic audit driven by @Goldziher's calibrated 166-document corpus (the kreuzberg test suite), which provides per-document ground-truth .txt files and a word-F1 harness. Multiple extraction failures identified and fixed:

    • Newline/CR-only spans treated as line breaks — Spans consisting entirely of \n or \r bytes are now emitted as a single newline rather than verbatim byte sequences, eliminating spurious blank lines from some PDF generators.
    • Annotation text double-emitted — append_non_widget_annotation_text was called after the main span assembly pass even though annotation_content_spans() already inlined annotation /Contents into the span list. The redundant call is removed.
    • Markup annotation /Contents correctly filtered — Per ISO 32000-1 §12.5.6.2, /Contents on Highlight, Underline, StrikeOut, Squiggly, Caret, Ink, FileAttachment, and Redact annotations is popup/tooltip text, not page content. These subtypes are now excluded from annotation_content_spans and append_non_widget_annotation_text.
    • No space inserted between adjacent CJK characters — should_insert_space now returns false when both the trailing and leading characters are CJK (Hiragana, Katakana, CJK Unified Ideographs, Hangul, CJK Extension B).
    • Unicode ligatures preserved; adjacent CJK spans merged — Latin ligatures (U+FB00–U+FB06) are now preserved in the span stream rather than dropped. Adjacent CJK spans from the same run are merged into a single span, eliminating inter-character noise.
    • Lower→upper CID range boundary split restored — The CID range boundary split now consistently applies the lower→upper ordering correction that was accidentally dropped; the fix propagates to Markdown and HTML output paths.
    • Non-adjacent subscript/superscript spans merged — merge_sub_superscript_spans handles spans separated by intervening content, using em-relative thresholds [-0.1×em, +0.25×em] instead of hardcoded absolute values so detection scales with body font size.
    • Column-spanning decimals split at table cell boundaries — Decimal numbers that span two adjacent table cells are split at the cell boundary rather than merged into a single token.
    • Position-aware space insertion between adjacent MCID spans — Spaces between MCID-tagged spans are inserted based on actual rendered x-positions rather than always or never.
    • Boundary split on letter→digit transition only — char_widths_boundary_split now splits only at a letter-to-digit boundary (e.g. Theorem1), removing false splits on UpperCamelCase terms that previously broke word-shape heuristics.
    • Same-line threshold formula fixed — same_line_threshold now uses (min_fs × 1.2).max(max_fs × 0.3), handling mixed-size lines (heading + caption on the same line) without cliff effects.
    • Bare-word identifiers and corrupt StructTreeRoot handled — Parser now tolerates bare-word tokens as dictionary values; a corrupt or absent StructTreeRoot no longer aborts extraction.
    • Standard-14 font matching strips SUBSET+ prefix; accepts canonical PostScript aliases — Per ISO 32000-1 §9.6.2.2 Annex D, standard font names are matched after stripping any ABCDEF+ prefix. HelveticaOblique (no hyphen) is now accepted alongside Helvetica-Oblique.
    • Explicit /DW tracked in FontInfo — has_explicit_dw: bool added; has_explicit_widths() returns true when /DW is explicitly present, enabling correct width lookup for CIDFonts that declare only /DW (no /W array).
    • CIDFont width fallback corrected — When /DW is absent and a CID is not in the /W array, get_glyph_width now falls through to default_width rather than cid_default_width, matching real-world PDF behaviour.
    • Word extractor honours split_boundary_before — Words that straddle a table-cell or column boundary are no longer merged.
    • Ligature expansion option — ConversionOptions gains expand_ligatures: bool (default false). When enabled, Latin ligatures (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st) are expanded to component letters.
    • Extraction warnings API — PdfDocument::warnings() (clones) and take_warnings() (drains) expose non-fatal extraction warnings (missing MCIDs, encrypted-PDF fallback) accumulated during a run.
  • Same-line span reorder: x-gap validation guard (#413) — After the row-aware sort, mixed-baseline glyphs (superscripts, subscripts) could appear before their base glyphs. The reorder_same_line_runs helper now validates that a candidate run is horizontally contiguous before X-sorting it; runs with a large X gap are left in row-aware order, preventing disjoint footer/header content from being collapsed into a fake same-line sequence. Fixes "8th" ordering (was "th8"). Contributed by @RolandWArnold in PR #413.

  • Layout word-merge O(n²) → O(n) — The word-merge pass previously re-scanned the entire accumulator for every candidate span; it is now O(n) via an index map.

  • Wide spatial false-positive tables rejected via dense-row-ratio — Table detection now computes the fraction of rows with dense (≥50%) column coverage and rejects candidates below the threshold, eliminating false positives on wide but sparsely populated layouts.

  • Bare-identifier lexer leniency confined to dict-value position — The lexer's tolerance for bare (unquoted) name-like tokens is now restricted to dictionary value positions, preventing mis-tokenisation of content streams where the same byte sequences are valid operators.

  • Typographic Unicode spaces normalised in extracted spans — Non-breaking, thin, en, em, and other Unicode space variants in span text are normalised to ASCII space before the word-spacing heuristics run, eliminating invisible gaps in the extracted output.
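
The dense-row-ratio rejection mentioned in the list above can be pictured with a short Python sketch. The 50 % per-row coverage figure is from this release's note and the 70 % gate is the value quoted in the later v0.3.47 entry; the function names and the list-of-strings table model are assumptions for illustration only.

```python
def dense_row_ratio(rows, n_cols, min_coverage=0.5):
    """Fraction of rows whose non-empty cells cover >= min_coverage of columns."""
    if not rows or n_cols <= 0:
        return 0.0
    dense = sum(
        1
        for row in rows
        if sum(1 for cell in row if cell.strip()) / n_cols >= min_coverage
    )
    return dense / len(rows)


def passes_dense_gate(rows, n_cols, min_ratio=0.7):
    """Accept a table candidate only when enough of its rows are dense."""
    return dense_row_ratio(rows, n_cols) >= min_ratio
```

A wide layout where most rows populate one or two columns yields a low ratio and is rejected, which is the false-positive class the note describes.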

Performance

  • Rendering: per-segment font re-parsing eliminated — The text rasterizer no longer re-parses font data on every span segment; Arc clones across the hot render loop and redundant CJK subsetter invocations are also eliminated, reducing CPU time for text-heavy pages by 30–60%.

Dependencies

  • fast_image_resize added (#454) — New dependency enabling SIMD-accelerated (ARM NEON, x86 AVX2) image downscaling for the raw-RGBA render path.

CI

  • FIPS release workflow now validates on pull requests — release-fips.yml now triggers on PRs to main that touch source, language-binding, or workflow files. The full build across all five platforms and all four language bindings runs without publishing, so the tag push is a pure deployment step after a confirmed-green PR.
  • macOS x86_64 FIPS builds moved to free runners — All four macos-13-xlarge runners (paid Intel Larger Runner, causing indefinite queue waits on plans without access) are replaced with macos-latest (free ARM runner cross-compiling to x86_64-apple-darwin).
  • Cargo registry caching added to all 20 FIPS build jobs — Per-target cache keys ($runner_os-$target-fips-cargo-$lock_hash) are restored before each build, substantially reducing re-run time on warm caches.

Community contributors

  • @RolandWArnold — contributed the same-line x-gap validation fix in PR #413. Roland diagnosed that reorder_same_line_runs was collapsing disjoint footer/header spans into a fake same-line sequence and designed the horizontal-contiguity guard that prevents it. The fix also correctly handles superscript/subscript ordering ("8th" instead of "th8").
  • @Goldziher (Na'aman Hirschfeld) — filed #484 with a calibrated 166-document corpus, per-document ground-truth .txt files, and a word-F1 harness, providing the systematic test bed that drove the bulk of the extraction improvements in this release.
  • @stephenjudkins (Stephen Judkins) — filed #483 with a minimal, precisely-scoped reproduction of the add_text regression that made the root-cause analysis straightforward.
  • @mara004 and @potatochipcoconut — requested the raw RGBA pixel buffer API in comments on #325 with clear use cases across PIL, sharp, System.Drawing.Bitmap, and Go's image.RGBA, and engaged on the pixel-format details (premultiplied vs straight alpha, tiny-skia format constraints) that shaped the final API design.
  • @frederikhors — reported the JBIG2 blank-page symptom in a comment on #332 and confirmed that both the JBIG2 fix and the earlier vertical-flip regression are resolved.

2026-05-07 13:31:27
pdf_oxide

v0.3.45 | Pluggable cryptographic provider — FIPS 140-3 compliance for

Fixed

  • CJK OTF (CFF) font subsetter corrupts glyph order (#449) — OTF fonts with CFF outlines (SFNT magic OTTO) were embedded as FontFile2 / CIDFontType2 (the TrueType path), causing PDF readers to misparse the CFF data and render wrong glyphs. Writer now detects CFF magic post-subsetting and emits the correct PDF object graph: FontFile3 (with /Subtype /CIDFontType0C) + CIDFontType0 (no CIDToGIDMap).
  • AwsLcProvider::verify_rsa_pkcs1v15 now fully implemented (#475) — Changed SignatureVerifier::verify_rsa_pkcs1v15 to accept the raw message bytes (consistent with verify_rsa_pss / verify_ecdsa). Under the default RustCryptoProvider the hash is now computed inside the trait implementation. Under AwsLcProvider (FIPS) the new call path uses aws-lc-rs's RSA_PKCS1_2048_8192_SHA{256,384,512} verifiers — RSA-PKCS#1 v1.5 signature verification now works under FIPS instead of returning SignerVerify::Unknown.
  • render_page_fit produces images smaller than the requested box (#480) — Integer-DPI conversion via floor() lost up to 3 pixels from the constrained dimension (e.g. a 1040 px fit yielded 1037 px on Letter). The renderer now computes a float scale directly (fit_px / page_pt) and stores it in the crate-private RenderOptions::scale_override field, bypassing the DPI round-trip entirely. The constrained dimension is now exact for all integer pixel inputs. Reported by @gevorgter.
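
The 1040 px to 1037 px example above follows directly from the arithmetic. The Python sketch below reproduces it under stated assumptions: a US Letter page width of 612 pt, 72 pt per inch, and a final round() to whole pixels (the renderer's actual rounding is not described in the note). The function names are illustrative.

```python
import math

POINTS_PER_INCH = 72.0
LETTER_WIDTH_PT = 612.0  # US Letter is 8.5 in wide


def width_via_integer_dpi(fit_px, page_pt):
    """Old-style path: derive an integer DPI with floor(), then rasterise."""
    dpi = math.floor(fit_px * POINTS_PER_INCH / page_pt)
    return round(page_pt * dpi / POINTS_PER_INCH)


def width_via_float_scale(fit_px, page_pt):
    """New-style path: keep the scale as a float (fit_px / page_pt)."""
    scale = fit_px / page_pt
    return round(page_pt * scale)
```

For a 1040 px fit, the exact DPI is about 122.35; flooring to 122 gives 8.5 in × 122 = 1037 px, while the float scale lands on 1040 px.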

Added

  • legacy-crypto compile-time feature flag (default-on) (#230) — New default-on Cargo feature that gates MD5 key-derivation and RC4 cipher support for PDF Standard Security R≤4 documents. Downstream crates that must not load legacy cryptography can opt out with default-features = false; they will receive a clear Error::InvalidPdf instead of silently accepting RC4/MD5-encrypted PDFs. The md-5 crate is now an optional dependency gated behind this feature. RC4 (pure Rust, no crate) is also disabled: both RustCryptoProvider::rc4() and rc4_crypt_impl are compiled out, and the provider returns AlgorithmNotPermitted at runtime when the feature is absent. Phase A of Issue #230.

Changed

  • Stub parity gate for Python wheels (#464) — rylai.toml now uses --features python only (matching the released wheel) so generated .pyi stubs no longer include symbols from office or other optional features. A new CI step (Verify stub symbol parity) checks that every stub symbol exists in the installed wheel.
  • TypeScript 6 + @types/node 25 upgrade for JS bindings (#438, #440) — JS dev dependencies bumped to TypeScript ^6.0.3 and @types/node ^25.6.0. tsconfig.json gains "types": ["node"] (required by @types/node 25's ambient-global model) and "ignoreDeprecations": "6.0" (to acknowledge the TS6-deprecated moduleResolution: node — full migration to node16 deferred until the import-path audit is done).

2026-05-06 11:42:11
pdf_oxide

v0.3.44 | Pluggable cryptographic provider — FIPS 140-3 compliance for

Highlights

  • pdf_oxide::crypto::CryptoProvider trait — new abstraction that decouples PDF encryption and signature paths from any one cryptography crate. Two providers ship out of the box:
    • RustCryptoProvider (default): pure-Rust stack as before (sha2, aes, rsa, p256, p384, getrandom, md-5, sha1). Permits every algorithm PDF specs reference, including the legacy MD5+RC4 path required by ISO 32000-1 R≤4 documents.
    • AwsLcProvider (opt-in via --features fips): backed by aws-lc-rs, FIPS 140-3 validated since 2024. Refuses MD5 / SHA-1-for-signing / RC4 with Error::AlgorithmNotPermitted and a clear remediation message.
  • Single source of randomness. src/encryption/algorithms.rs's former SHA-256(uuid_v4 || timestamp_ns || …) cascade is replaced with crypto::active().random_bytes() — under the default provider this is getrandom::fill() (OS entropy pool); under FIPS it's aws_lc_rs::rand::SystemRandom. Cryptographically suitable for AES-256 file keys and salts; auditable.
  • Closes #236.

Architecture

Three sub-traits compose into CryptoProvider:

  • Hasher — incremental hashing (update / finalize).
  • SymmetricCipher — AES-128/256-CBC (PKCS#7 + no-padding) and RC4.
  • SignatureVerifier — RSA-PKCS#1-v1.5, RSA-PSS, ECDSA P-256/P-384.

Plus an opaque Signer handle so HSM / PKCS#11 / Cloud KMS backends can plug in via SigningKeyMaterial (which is #[non_exhaustive] — future variants for HSM slots etc. are not breaking changes).

The is_legacy_allowed() policy bit lets each provider declare whether MD5 / SHA-1-sign / RC4 are permitted. PDF Standard Security R≤4 documents are gated at EncryptionHandler::new: under a FIPS provider they fail with a remediation message ("re-encrypt at R=6 or build pdf_oxide without the 'fips' feature so the default 'rust-crypto' provider stays active") rather than panic deep inside the cipher path.
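
The early gate described above can be illustrated with a small Python analogue. The library itself is Rust; `Provider`, `check_security_revision`, and the exception class below are hypothetical stand-ins for the trait method and error variant named in the text, not pdf_oxide APIs.

```python
class AlgorithmNotPermitted(Exception):
    """Stand-in for the Error::AlgorithmNotPermitted variant named above."""


class Provider:
    """Toy provider that only carries the legacy-policy bit."""

    def __init__(self, name, legacy_allowed):
        self.name = name
        self._legacy_allowed = legacy_allowed

    def is_legacy_allowed(self):
        return self._legacy_allowed


def check_security_revision(provider, revision):
    """Fail up front for R<=4 documents when the provider forbids MD5/RC4."""
    if revision <= 4 and not provider.is_legacy_allowed():
        raise AlgorithmNotPermitted(
            f"Standard Security R={revision} needs MD5/RC4, which provider "
            f"'{provider.name}' does not permit; re-encrypt at R=6 or use a "
            "provider that allows legacy algorithms"
        )
```

Checking the policy once, before any cipher is constructed, is what lets the failure surface as a readable error instead of a panic deeper in the decryption path.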

Usage

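use std::sync::Arc;
use pdf_oxide::crypto::{set_provider, AwsLcProvider};

// Requires a build with `--features fips`. The `?` operators assume the
// enclosing function returns a compatible Result.
set_provider(Arc::new(AwsLcProvider::new()))?;
// Documents opened afterwards go through the FIPS-backed provider.
let doc = pdf_oxide::PdfDocument::open("encrypted-r6.pdf")?;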

See docs/CRYPTO_PROVIDERS.md for the algorithm coverage matrix, custom-provider walkthrough (sovereign-jurisdiction algorithms, HSMs), and the legacy-PDF policy table.

CI

  • New fips job in .github/workflows/ci.yml builds with --features fips, runs the 11-test AwsLcProvider suite including a cross_provider_aes_compat check that asserts the FIPS and rust-crypto AES paths produce byte-identical output, and enforces clippy -D warnings under the FIPS feature.

Release

  • New .github/workflows/release-fips.yml workflow (manually triggered) builds and publishes parallel FIPS distributions on every package index, all from the same Rust source compiled with --features fips so each binary contains only AWS-LC's FIPS-validated module:

    Ecosystem | Package                                | Install
    PyPI      | pdf_oxide_fips                         | pip install pdf_oxide_fips==0.3.44
    npm       | pdf-oxide-fips                         | npm install pdf-oxide-fips@0.3.44
    NuGet     | PdfOxide.Fips                          | dotnet add package PdfOxide.Fips --version 0.3.44
    Go        | github.com/yfedoseev/pdf_oxide/go-fips | go get github.com/yfedoseev/pdf_oxide/go-fips@v0.3.44

    Platform matrix in v0.3.44 (every binding × every platform):

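    Platform       | Python | npm | NuGet | Go
    Linux x86_64   | yes    | yes | yes   | yes
    Linux aarch64  | yes    | yes | yes   | yes
    macOS x86_64   | yes    | yes | yes   | yes
    macOS arm64    | yes    | yes | yes   | yes
    Windows x86_64 | yes    | yes | yes   | yes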

    All distributions move in lockstep with the regular release — FIPS and default variants of the same release tag are byte-equal in their non-crypto code paths. Per-platform smoke tests in the workflow confirm the FIPS provider is reachable AND crypto_use_fips() (or equivalent) flips the active provider as expected — catches API mismatches before publishing.

    Why pdf_oxide_fips (underscore) for Python: PyPI normalizes hyphens / underscores to the same canonical form per PEP 503 (pip install pdf_oxide_fips and pip install pdf-oxide-fips resolve to the same package). Using underscore in pyproject.toml makes the wheel filename and the import pdf_oxide path identical to the default distribution — only the package name differs.

    Why parallel distributions instead of pip install pdf_oxide[fips]: Python extras (PEP 508) can add Python dependencies but cannot swap the compiled .so baked inside a wheel. The industry pattern (cryptography, pyOpenSSL) ships separate FIPS distributions; we follow suit.

    Why a go-fips submodule path: Go modules are import-path-bound, so users pick at go get time:

    go get github.com/yfedoseev/pdf_oxide/go            # default
    go get github.com/yfedoseev/pdf_oxide/go-fips       # FIPS
    

    Both submodules re-export the same Go API; only the linked native static lib differs.
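
The PyPI naming point above rests on the name-normalisation rule from PEP 503, which can be checked in a couple of lines. The `normalize` function below is the one given in the PEP itself; applying it to these two spellings is the only addition.

```python
import re


def normalize(name):
    """PEP 503 normalisation: runs of '-', '_' and '.' collapse to '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Both spellings resolve to the same canonical project name.
same = normalize("pdf_oxide_fips") == normalize("pdf-oxide-fips")
```

This is why either spelling works on the pip command line while the import name stays pdf_oxide.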

Fixes

  • Restore manylinux_2_28 glibc floor for Python wheels. 0.3.42 and 0.3.43 published only manylinux_2_35 Linux glibc wheels because the release workflow ran maturin build directly on ubuntu-latest (Ubuntu 24.04, glibc 2.39), letting the runner's glibc set the wheel tag. That excluded Amazon Linux 2023 / AWS Lambda Python (glibc 2.34), RHEL 8, Ubuntu 20.04 and Debian 11 — pip rejected the wheel and fell back to a source build that OOM-killed rustup-init inside the Lambda build container. Reported by @potatochipcoconut on PR #463. Both release.yml (default wheels) and release-fips.yml (pdf_oxide_fips wheels) now build the Linux glibc wheels via PyO3/maturin-action inside the manylinux_2_28 container, and a CI guard step fails the job if a manylinux_2_28 wheel is not produced for either Linux target — preventing this regression from recurring. The 0.3.21 baseline (originally added in #284) is restored.

Performance — extract_pages_to_bytes 12–54× faster

Extraction of page ranges from large PDFs is now bound by serialisation work instead of redundant document rebuilds and tree walks. Closes #474, reported by community contributor @potatochipcoconut, whose careful root-cause writeup (chunk-by-chunk timings, comparison against PyMuPDF's doc.select(), and a profiling-grade reproduction case from an AWS Lambda IDP pipeline) made this fix possible.

Measured on the public 1112-page / 38 MB Artificial Intelligence — A Modern Approach corpus (pdfs_slow2/) on an idle laptop:

Workload                       | 0.3.43              | 0.3.44             | Speedup
extract_pages_to_bytes(0..300) | 7301 ms / 36 MB out | 382 ms / 12 MB out | 19× + 3× smaller
extract_pages_to_bytes(0..50)  | 7983 ms / 36 MB out | 155 ms / 4 MB out  | 51× + 9× smaller
Sequential 23 × 50-page chunks | ~3 min              | 1542 ms total      | ~120×

Extrapolating to the reporter's 12k-page / 50 MB document chunked into five 3000-page slices: an AWS Lambda invocation that previously timed out at 900 s after two chunks now finishes the entire five-chunk batch in roughly 30 s.

Root causes

All in src/editor/document_editor.rs + src/document.rs:

  1. Triple full-document rewrite. extract_pages_to_bytes serialised the whole doc, re-parsed the bytes, removed pages one at a time, and serialised again — three full passes when one would do. Replaced with a non-mutating in-place trimmed page_order, restored after the save (even on Err).
  2. Garbage collector walked the original page tree. The trimmed /Pages dict was rebuilt locally inside write_full_to_writer, but collect_reachable_ids() started its BFS from the unmodified catalog and pulled in every dropped page's resources — so the output never shrank no matter how few pages were kept. Fixed by staging the trimmed /Pages dict in modified_objects before the save; the GC walker already prefers staged dicts over source.
  3. get_page_ref(i) in a 0..n loop is O(n²). Each call walks the page tree from the root and stops at the i-th leaf, so collecting all n leaf refs walks 1 + 2 + … + n nodes. New helper PdfDocument::all_page_refs() does it in one DFS. The flat-tree common case (root /Pages whose /Count matches Kids.len()) reads the ref array straight out of /Kids without touching individual leaves at all.
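
Root cause 2 is easier to see on a toy object graph. The Python sketch below models objects as a mapping from object id to the ids it references and runs a breadth-first reachability walk in which staged objects shadow source objects. Only the names `collect_reachable_ids` and `modified_objects` come from the note; the data model and function signature are assumptions for illustration.

```python
from collections import deque


def collect_reachable_ids(root_id, source_objects, modified_objects):
    """BFS over object references, preferring staged (modified) objects."""
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        obj_id = queue.popleft()
        # A staged object shadows the one stored in the source file.
        refs = modified_objects.get(obj_id, source_objects.get(obj_id, []))
        for ref in refs:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# 1 = catalog, 2 = /Pages, 3..5 = pages, 6..8 = per-page resources.
source = {1: [2], 2: [3, 4, 5], 3: [6], 4: [7], 5: [8]}
unstaged = collect_reachable_ids(1, source, {})
staged = collect_reachable_ids(1, source, {2: [3]})  # trimmed /Pages keeps page 3
```

With nothing staged the walk reaches every page and resource, so the output cannot shrink; once the trimmed /Pages entry is staged, only the kept page and its resource remain reachable.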

The same n² loop pattern was lurking in four other call sites on the reporter's hot path (their pipeline does PDF/A validate + convert before the chunked extract). All five collapsed to a single all_page_refs() call:

  • src/outline.rs — find_page_index (O(n²) per outline entry → O(n³) on documents with bookmarks).
  • src/editor/document_editor.rs line ~4275 — page-ref → index map for partial form-flatten.
  • src/editor/document_editor.rs line ~4505 — same map for get_form_fields().
  • src/compliance/validators.rs — validate_fonts (doc.validate_pdf_a('2b')).
  • src/compliance/converter.rs — per-page /AA strip (doc.convert_to_pdfa('2b')).
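
The quadratic pattern behind root cause 3 and these call sites can be reproduced on a toy page tree. In the Python sketch below, intermediate nodes are dicts with a "Kids" list and leaves are plain integers; the function names mirror the ones in the note, but the bodies and the visit counter are illustrative assumptions rather than the library's code.

```python
def iter_leaves(node):
    """Depth-first traversal yielding leaf page refs in document order."""
    stack = [node]
    while stack:
        cur = stack.pop()
        if isinstance(cur, dict):            # intermediate /Pages node
            stack.extend(reversed(cur["Kids"]))
        else:                                # leaf page reference
            yield cur


def get_page_ref(root, index, counter):
    """Walk from the root and stop at the index-th leaf (counts visits)."""
    for i, leaf in enumerate(iter_leaves(root)):
        counter[0] += 1
        if i == index:
            return leaf
    raise IndexError(index)


def all_page_refs(root, counter):
    """Collect every leaf ref in a single traversal (counts visits)."""
    refs = []
    for leaf in iter_leaves(root):
        counter[0] += 1
        refs.append(leaf)
    return refs


root = {"Kids": [{"Kids": list(range(0, 50))}, {"Kids": list(range(50, 100))}]}
slow, fast = [0], [0]
refs_slow = [get_page_ref(root, i, slow) for i in range(100)]
refs_fast = all_page_refs(root, fast)
```

On this 100-page tree the per-index loop visits 1 + 2 + … + 100 = 5050 leaves, while the single traversal visits 100.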

New API

Two additions, both directly requested by @potatochipcoconut in #474; both available in Rust and Python (the other bindings can be added on demand):

# Batch extraction — same single-call efficiency, ergonomic for
# the chunked-for-OCR / chunked-for-S3 pattern.
chunks = doc.extract_page_ranges_to_bytes(
    [(0, 3000), (3000, 6000), (6000, 9000), (9000, 12000)]
)

# In-place selection — equivalent to PyMuPDF's doc.select(...).
# After this call, the document holds only the listed pages,
# in the order given. doc.save() / doc.save_to_bytes() then
# emit only those pages with garbage-collected resources.
doc.select_pages([1, 4, 7, 99])
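
When the chunk boundaries are regular, the range list for the batch call above can be generated rather than typed out. The helper below is a hypothetical convenience, not part of pdf_oxide, and it assumes the ranges are half-open (start, end) tuples, as the example values suggest.

```python
def chunk_ranges(page_count, chunk_size):
    """Split [0, page_count) into consecutive half-open (start, end) ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]
```

Under that assumption, chunk_ranges(12000, 3000) produces the same four ranges as the literal list in the example, and the final range is shortened automatically when the page count is not a multiple of the chunk size.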

Known limitation

PDFs whose /Pages root publishes shared /Resources used by all leaf pages (typical of high-resolution book scans, atypical of office documents with subset fonts) still produce full-size chunk output: GC correctly preserves resources reachable from kept pages, and a single shared resource pool stays reachable as long as any kept page references it. The principled fix is per-page resource sub-setting — parsing each kept page's content stream to determine which fonts / XObjects are actually used and emitting a minimal /Resources for that page. That is a feature, not a bug fix, and is deferred from this release. The wall-clock speedup (12–54×) holds regardless.

Tests

  • 5050 lib tests pass under --features python,fips (5039 default + 11 FIPS-only).
  • 119 encryption tests still pass byte-equal post-rewire to the trait.
  • 69 signatures tests still pass byte-equal post-rewire.
  • Hash vectors validated against NIST FIPS 180-4 for SHA-256/384/512 and RFC 1321 / 3174 for MD5 / SHA-1.
  • New regression tests cover the issue #474 workflow: test_extract_pages_chunked_sequential (4 sequential chunks on the same DocumentEditor, source observably unchanged between calls), test_extract_pages_non_sequential (out-of-order indices [3, 0, 4]), test_extract_page_ranges_to_bytes_batch, test_select_pages_in_place, and test_select_pages_out_of_range.

Known follow-ups (v0.3.45)

  • AwsLcProvider RSA-PKCS#1 v1.5 verify-from-digest (#475): AwsLcProvider::verify_rsa_pkcs1v15 is currently a stub; PDF/CMS signatures using RSA-PKCS#1 v1.5 return SignerVerify::Unknown instead of verifying under FIPS. Blocked on aws-lc-rs exposing a stable RSA_PKCS1_PRIM_VERIFY API. RustCryptoProvider (default) is not affected.
  • AwsLcProvider signing wiring — signing calls are currently routed to RustCryptoProvider. Full AWS-LC signing integration lands in v0.3.45.
  • musllinux Python wheels for the FIPS variant — FIPS musllinux wheels (Alpine / musl libc) require a musl-targeted aws-lc-fips-sys build; work in progress.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform | Architecture | Archive
Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz
Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz
macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz
macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz
Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

2026-05-04 03:47:01
pdf_oxide

v0.3.43 | Cross-binding parity, WASI build target, and a basket of issue fixes.

Highlights

  • render_page_fit() now ships in all five bindings (Rust core + Python, Node.js / TypeScript, C#, Go). Picks the largest DPI such that both rendered dimensions fit inside a target pixel box, preserving aspect ratio. No more "what DPI hits 1024×768?" math on the caller's side. Fixes #441, closes #448.
  • Idiomatic page iteration parity across bindings. Rust gets page_indices(), Python gets .pages, Node.js gets [Symbol.asyncIterator] (the sync [Symbol.iterator] was already there). C# Pages and Go Pages() were already shipped. Closes #447.
  • WASI build target: cargo build --target wasm32-wasip1 now builds the lib cleanly on stable Rust. Unblocks @RALaBarge's external pdf-oxide-wasi stdin→stdout wrapper and any other consumer wanting to embed pdf_oxide in a sandboxed WASI runtime. CI now gates that the WASI build stays green. Closes #214.
  • Spurious-table fix on dense word grids — Roland's #405 lands via cherry-pick. A new has_split_modal_column_groups validator inspects the column co-occurrence graph across modal rows and rejects candidates whose populated columns split into two or more disconnected components — the signature of two adjacent text flows mis-clustered as one table. Composes cleanly with v0.3.42's Table::is_real_grid filter. Validated against the 86-PDF cross-build corpus: 888 / 888 byte-equal — zero observable change on common documents, the gate's value is in the safety net for adversarial cases.
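
The connected-component idea behind the split-column check can be sketched in a few lines. The function names below are hypothetical, the input is assumed to be the already-selected modal rows (each given as the set of populated column indices), and the real has_split_modal_column_groups may differ in every detail.

```python
from typing import Dict, Iterable, Set

def count_column_components(rows: Iterable[Set[int]]) -> int:
    """Count connected components of the column co-occurrence graph.

    Two columns are connected when some row populates both of them.
    """
    parent: Dict[int, int] = {}

    def find(col: int) -> int:
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for populated in rows:
        cols = sorted(populated)
        for col in cols:
            find(col)  # register the column even if it stands alone
        for other in cols[1:]:
            parent[find(other)] = find(cols[0])

    return len({find(col) for col in list(parent)})

def looks_like_two_flows(rows: Iterable[Set[int]]) -> bool:
    """Flag candidates whose columns split into 2+ disconnected groups."""
    return count_column_components(rows) >= 2
```

Rows that alternate between columns {0, 1} and {2, 3} never link the two groups, so the candidate is flagged; a grid where any row spans both groups is not.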

Fixes

  • #456: PdfDocument::open(path) now populates source_bytes, unblocking convert_to_pdf_a(), the C FFI pdf_document_get_source_bytes, and any other API that re-reads the in-memory copy. Path-loaded documents previously got an empty Vec<u8> and hit "Invalid PDF header: File is empty (0 bytes read)" from the PDF/A converter. Reported by @potatochipcoconut on PR #445.
  • #451: Standard14 PostScript fonts with no open-source equivalent (Symbol, ZapfDingbats) are now downgraded from hard FontNotEmbedded errors to a new KnownUnembeddableFont warning during PDF/A conversion. A document that's otherwise compliant no longer fails solely because of one symbolic font.
  • #395: closed; verified that the off-by-one C# ExceptionMapper fix in v0.3.38 resolves the reported SignatureException [8500] on RenderPage. Added a Rust regression test that opens @gevorgter's exact reproducer PDF and asserts render_page succeeds. The fixture is pinned in pdf_oxide_tests.
  • #462: dropped the scripts/modernize_stubs.py post-processor and the python_version = "3.8" setting from rylai.toml. Rylai's default already emits PEP-585 / PEP-604 syntax with from __future__ import annotations at the top, so the post-processor and the setting were pulling in opposite directions. Runtime support for Python 3.8/3.9 is unaffected, because .pyi stubs are type-checker artifacts that are never imported at runtime. Reported by @monchin with a clean diagnosis of the root cause.

Behavior changes

  • PdfDocument::open(path) now reads the file once into memory rather than streaming via BufReader<File>. The doc comment already promised "Reads the entire file into memory"; this makes it true. Memory usage on open() is now equivalent to from_bytes(std::fs::read(path)?). Required by #456; the streaming reader was a partial optimisation no caller could rely on (every code path that touched source_bytes already required the in-memory copy).
  • PdfReader enum collapsed to a single in-memory variant — removed unused File variant. std::io::{Read, Seek, BufRead, …} imports are no longer cfg-gated, which is what unblocked the wasm32-wasip1 build target.

Dependencies

  • Batch-applied 9 dependabot bumps onto release/v0.3.43: CI workflows (golangci-lint-action v7→v9, setup-go 5.5→6.4, setup-node 4.4→6.4, github-script SHA refresh, scorecard-action 2.4.0→2.4.3), Go (testify 1.8→1.11 — was declared but unimported, dropped entirely), JS (rimraf 5→6 — @types/node deferred to a follow-up after a TypeScript-strict shake-out), Python (onnx ≥1.14→≥1.19.1).
  • The RustCrypto 0.8 stack (pkcs8 0.11, spki 0.8, der 0.8, digest 0.11, crypto-common 0.2, block-buffer 0.12) stays pinned — rsa 0.10 and p256/p384 0.14 are still RC upstream. See the existing pin note at Cargo.toml:185-187.

Internal

  • New wasm32-wasip1 build smoke check in .github/workflows/ci.yml alongside the existing wasm32-unknown-unknown job.
  • Regenerated SBOMs (pdf_oxide_cli/sbom.cdx.json, pdf_oxide_mcp/sbom.cdx.json) for 0.3.43.
  • New regression tests:
    • tests/test_issue_456_path_open_source_bytes.rs
    • tests/test_issue_447_page_indices.rs
    • tests/test_issue_395_render_page.rs
  • New unit tests on compliance::converter::downgrade_known_unembeddable_fonts.

Validation

86-PDF stratified corpus comparison (academic, mixed, forms, government, newspapers, theses, plus the three #211 fixtures), 888 sampled (pdf, page, method) triples across extract_text, to_plain_text, to_markdown, to_html:

  • v0.3.43 vs v0.3.42 — 888 / 888 byte-equal, zero deltas
  • v0.3.43 vs PyPI v0.3.41 — 860 equal, 28 reorder/de-dup, 0 real content losses (same profile as v0.3.42's regression report)

Community contributors

This release exists because of the community. Special thanks to:

  • @RolandWArnold landed the spurious-table fix in #405. After iterating away from an earlier density-gate framing, the shipped form is has_split_modal_column_groups: a connected-component check on the column co-occurrence graph across modal rows that flags two-flow grids the regular-row-ratio gate accepts. Roland's doc-comment explicitly flags it as a heuristic, making it easy to revisit later. The fix composes with v0.3.42's struct-tree-aware reading-order rewire without any merge conflict.
  • @RALaBarge — built an external WASI binary wrapper for pdf_oxide (pdf-oxide-wasi) and reported in #214 that it required nightly Rust because of an internal ceil_char_boundary call. That call was already removed; this release fixes the second hidden blocker (cfg-gated std::io imports) and adds CI gating so the WASI target stays green.
  • @gevorgter — flagged two rendering-area gaps: the C# binding's misleading SignatureException on RenderPage (#395, fixed in v0.3.38, regression-guarded here) and the lack of a pixel-dimension render API (#441, closed by render_page_fit shipping in all five bindings).
  • @potatochipcoconut — surfaced the convert_to_pdf_a failure on path-loaded documents while testing PR #445; the investigation traced it to the empty source_bytes field and produced the one-line fix in this release (#456).
  • @monchin — pointed out (#462) that scripts/modernize_stubs.py was redundant work because rylai itself controls the typing flavour via its python_version setting, and noted that office/barcodes/ocr feature alignment between rylai.toml and the released wheel is worth a follow-up. The cleaner stub pipeline ships in this release.
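
The DPI selection that render_page_fit performs for the caller (the largest DPI at which both rendered dimensions fit a target pixel box) reduces to a small calculation. The sketch below is not the library's code: `fit_dpi` is a hypothetical name, it assumes page dimensions in PDF points at 72 points per inch, and it ignores any rounding to whole pixels.

```python
def fit_dpi(page_width_pt: float, page_height_pt: float,
            max_width_px: int, max_height_px: int) -> float:
    """Largest DPI such that the rendered page fits inside the pixel box."""
    if min(page_width_pt, page_height_pt, max_width_px, max_height_px) <= 0:
        raise ValueError("all dimensions must be positive")
    # pixels = points / 72 * dpi, so dpi = pixels * 72 / points per axis;
    # the tighter axis decides.
    return min(max_width_px * 72.0 / page_width_pt,
               max_height_px * 72.0 / page_height_pt)

# US Letter (612 x 792 pt) into a 1024 x 768 box: height is the limit.
dpi = fit_dpi(612, 792, 1024, 768)
rendered_width = 612 / 72.0 * dpi
rendered_height = 792 / 72.0 * dpi
```

Because one DPI value is applied to both axes, the aspect ratio is preserved by construction.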


2026-05-03 14:39:44
pdf_oxide

v0.3.42 | Text-extraction reading-order rewire — fixes [#211](https://github.com/yfedoseev/pdf_oxide/issues/211)

Highlights

  • extract_words and extract_text_lines now honor the structure tree on tagged PDFs (per ISO 32000-1:2008 §14.7 / §14.8.2.3) instead of applying XY-Cut block partitioning. On the three #211 fixtures from pdfplumber's public test corpus this restores correct reading order for centered titles above body text (Quebec municipal minutes case) and stops splitting prose lines across phantom column gutters in form-style layouts (US child-welfare report case).
  • Spurious markdown / HTML tables on form-style layouts (label-colon-value pairs) are gone: spatial table detection is now gated on a real-grid validator (≥2 rows × ≥2 cols, ≥50% of rows with at least two non-empty cells).
  • New include_artifacts kwarg on extract_words / extract_text_lines (Python) gates the spec-correct behavior of excluding /Artifact-tagged content (running headers, footers, page numbers, watermarks; ISO 32000-1:2008 §14.8.2.2.1). Default is True — preserves pre-0.3.42 behavior so existing scripts don't lose content. Pass include_artifacts=False to opt into the spec-correct exclude. The default may flip in a future major release once the artifact-detection heuristic is hardened against false positives on docs whose body text recurs across pages.
  • The default API surface is now knob-free: region, word_gap_threshold, line_gap_threshold, profile are deprecated on extract_words / extract_text_lines (Python). They still work but emit DeprecationWarning; they will move to a separate extract_*_advanced surface in a future release.
  • ~6× faster on extract_words / extract_text_lines because the XY-Cut partition is no longer in the hot path.

Fixes

  • #211 — extract_words / extract_text_lines produce wrong reading order on tagged PDFs. Headings and prose lines that XY-Cut had moved out of position now appear where the document author marked them via the /StructTreeRoot MCID order. Reported by @ankursri494 against pdfplumber's pdf_structure.pdf, 2023-06-20-PV.pdf, and 150109DSP-Milw-505-90D.pdf test fixtures.

Behavior changes

  • extract_words(page) / extract_text_lines(page) gain an include_artifacts kwarg (default True — backward-compatible). Pass include_artifacts=False to drop spans tagged as artifacts per ISO 32000-1:2008 §14.8.2.2.1. Word counts on documents with running headers / footers will decrease in that mode.
  • Multi-column reading-order detection on untagged PDFs is now conservative: column-aware mode opts in only when the page presents ≥3 distinct vertical gutters, each ≥median_char_width × 4 wide, with text on both sides. 1- and 2-column synthetic layouts default to row-aware top-to-bottom ordering — matches pdfplumber. Tagged multi-column PDFs are unaffected: they reach the column-aware path via the structure tree.
  • to_markdown(page) / to_html(page) no longer emit <table> for layout-only structures detected by the spatial heuristic. Real tables (<Table> in the struct tree, or grids ≥2×2 with ≥50% of rows populating ≥2 cells) still render as tables.
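
The multi-column criterion above can be illustrated on a horizontal projection of the page's text. This sketch assumes gutters are found as gaps between merged x-intervals of text, which is one plausible reading and not necessarily how the library does it; the function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple

def qualifying_gutters(spans: List[Tuple[float, float]],
                       char_widths: List[float]) -> int:
    """Count gaps between text runs that are at least 4x the median char width."""
    if not spans or not char_widths:
        return 0
    min_gap = 4 * median(char_widths)
    merged: List[List[float]] = []
    for x0, x1 in sorted(spans):
        if merged and x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)  # overlapping runs fuse
        else:
            merged.append([x0, x1])
    # A gap between two merged runs has text on both sides by construction.
    gaps = [right[0] - left[1] for left, right in zip(merged, merged[1:])]
    return sum(1 for gap in gaps if gap >= min_gap)

def use_column_aware(spans: List[Tuple[float, float]],
                     char_widths: List[float]) -> bool:
    """Opt in to column-aware ordering only with three or more wide gutters."""
    return qualifying_gutters(spans, char_widths) >= 3
```

Under this reading a two-column page has a single gutter and stays on row-aware ordering.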

Refactor #457 — internal

  • New pdf_oxide::pipeline::page_reading_order(doc, page) helper: single source of truth for canonical reading-order span sequence. Tagged + struct tree (no /Suspects) → walks the tree; otherwise → geometric top-to-bottom + y-tolerance. Companion variant page_reading_order_no_artifacts strips spans tagged as /Artifact for the spec-correct exclude case.
  • extract_words_with_thresholds and extract_text_lines_with_thresholds delegate through the helper for the default code path (artifacts retained). New extract_words_with_thresholds_no_artifacts and extract_text_lines_with_thresholds_no_artifacts surfaces are available for the spec-correct artifact-excluded behavior. The profile=Some(...) path retains its previous XY-Cut behavior pending the planned removal of the profile kwarg.
  • GeometricStrategy now defaults to row-aware top-to-bottom ordering; column-aware mode gated by the strict multi-column criterion above.
  • Table::is_real_grid() introduced as the real-table validator; extract_page_tables filters the spatial heuristic's output through it.
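
The real-grid rule stated in these notes (at least 2 rows and 2 columns, with at least 50% of rows having two or more non-empty cells) is easy to express directly. The function below is a Python sketch under that reading, with a hypothetical name and a list-of-lists table representation; it is not Table::is_real_grid itself.

```python
from typing import List, Optional

def looks_like_real_grid(cells: List[List[Optional[str]]]) -> bool:
    """Apply the >=2x2 and >=50%-populated-rows rule to a candidate table."""
    if len(cells) < 2:
        return False
    if max(len(row) for row in cells) < 2:
        return False
    populated_rows = sum(
        1 for row in cells
        if sum(1 for cell in row if cell and cell.strip()) >= 2
    )
    return populated_rows * 2 >= len(cells)

form_like = [["Name: Ann", ""], ["Address: 1 Main St", ""], ["Phone", "555"]]
grid = [["Q1", "10"], ["Q2", "12"], ["Total", ""]]
```

Here form_like is rejected (one populated row out of three) while grid passes (two out of three).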

Validation

75-PDF stratified-sample corpus (academic, mixed, forms, government, newspapers, theses, plus the three #211 fixtures) compared between 0.3.41 and 0.3.42 across all eight extraction methods on the first 3 pages of each PDF — 1592 comparisons total. Zero content regressions: every word the baseline extracted is also extracted by 0.3.42; only ordering / line-grouping / table-rendering changed.

Dependencies

  • #453 — drop the unused lzw direct dependency. LzwDecoder already routed through weezl plus a custom fallback; the lzw crate was declared in Cargo.toml but never imported. Silences RUSTSEC-2020-0144 (unmaintained advisory) for downstream cargo-deny consumers as a side-effect.
  • #454 (partial): a cargo update lockfile refresh bumps fax 0.2.6 → 0.2.7, imageproc 0.26.1 → 0.26.2, js-sys / web-sys 0.3.95 → 0.3.97, pdfium-render 0.9.0 → 0.9.1, rustls 0.23.39 → 0.23.40, wasm-bindgen family 0.2.118 → 0.2.120, plus 12 other transitive patch / minor bumps. The remaining major-version items in #454 (the RustCrypto 0.8 stack: pkcs8 0.11, spki 0.8, der 0.8, digest 0.11, crypto-common 0.2, block-buffer 0.12) stay pinned because rsa 0.10 and p256 0.14 / p384 0.14 are still RC upstream as of 2026-04 (see the existing pin note in Cargo.toml:185-187).

Community contributors

This release exists because of the community. Special thanks to:

  • @ankursri494 — reported #211 with three carefully chosen pdfplumber-corpus fixtures (pdf_structure.pdf, 2023-06-20-PV.pdf, 150109DSP-Milw-505-90D.pdf) that isolate three distinct failure modes — wrong reading order on tagged PDFs, dropped document headings, and prose-line splits at form gutters. They also kept the issue alive through two rounds of "is this still broken on the latest version?", which forced the deeper investigation that ultimately exposed the architectural gap behind #457. Without that persistence and that specific repro set, this rewire would not have shipped.
  • @lingcoder — flagged the unmaintained lzw advisory in #453 with a precise pointer to RUSTSEC-2020-0144 and the weezl migration path; the investigation surfaced that the dep was unreferenced entirely, turning it into a one-line cleanup.


2026-05-01 22:44:51
pdf_oxide

v0.3.41 | Real PDF/A conversion, LaTeX symbolic-font glyph rendering fix, and

Community contributors

This release exists because of the community. Special thanks to:

  • @FireMasterK — reported #307 with a precise reproduction case: a LaTeX-generated PDF where accented characters and ligatures (ú, á, fi) rendered as blank gaps across all pages. The report identified the exact document class (DC/EC TrueType fonts with Mac Roman cmap, no /Encoding dict), which made the root cause in render_cid_direct() straightforward to isolate and fix.

  • @sparkyandrew — followed up on #425 with #443, noticing that the output PDF was 2.32 MB when the two source images summed to under 1.6 MB — even after the #425 image-pipeline fix. That single observation pinpointed the missing XObject deduplication: the same image data encoded twice produced two independent compressed streams. Fixed.

  • @potatochipcoconut: #418, the original PDF/A binding-completeness report that drove the full implementation in #442. convert_to_pdf_a() existed in Rust but was a no-op: it recorded actions and returned success while leaving the document bytes untouched. The report surfaced this silently-broken state across all seven bindings.

  • @nickpetrovic — filed #444 with a precise four-row reproduction table showing ligature glyphs in subset Calibri fonts decoded to wrong Unicode codepoints (tiO, tf[, fte). The report included the exact PDF and the per-font-subset mapping failures, which led directly to the ICCBased color-space warn spam fix and the rowspan-label reading-order scramble fix.

  • @RubberDuckShobe — reported #450: any PDF containing a PNG with an alpha channel showed a diagonal stripe through the image. A minimal reproduction confirmed the bug was reproducible across Acrobat, Preview, and browser PDF viewers. The report made the scope unambiguous — every image with transparency was affected — and led directly to the missing DecodeParms fix in build_soft_mask_dict().

  • @truffle-dev — first code contribution to the project: completed the CLI output-path fix for #412 in #452. The original audit in #412 covered all 11 CLI commands with exact line references and two proposed design options; the PR was clean on first submission. Picks up the four commands (crop, decrypt, delete, reorder) missed by the earlier partial fix, and also enforces -o/--output for merge instead of silently defaulting to the first input's directory.

Scope at-a-glance

  • Real PDF/A conversion — XMP metadata stream, pdfaid:part/conformance identification, OutputIntents (sRGB), language tag, JavaScript removal; all 7 bindings (#418, #442).
  • Symbolic TrueType glyph rendering — non-ASCII bytes (ú=0xFA, á=0xE1, fi=0x85) in DC/EC-style LaTeX fonts with Mac Roman cmap no longer suppressed as spaces (partially fixes #307; follow-up cases reported by FireMasterK on 2026-04-29 remain open).
  • Image XObject deduplication — same image embedded twice no longer re-encoded as two separate compressed streams; PDF size matches the sum of source images (#443).
  • Diagonal-line artifact in transparent images fixed — missing DecodeParms in the soft-mask XObject caused a visible diagonal stripe in any PNG with an alpha channel (#450).
  • Barcode SVG generation: pdf_barcode_get_svg no longer returns ERR_UNSUPPORTED; it generates real SVG for all supported barcode types, the eight 1D formats plus QR (#421).
  • CLI output routing: crop, decrypt, delete, and reorder now write default output beside the input file instead of the current working directory; merge now requires -o/--output and errors up front instead of silently defaulting to the first input's directory. Completes #412.

Real PDF/A conversion (#418, #442)

convert_to_pdf_a() previously recorded conversion actions and returned success, but the document bytes were unchanged — the XMP metadata stream was constructed in memory and then discarded. This release rewrites the conversion core end-to-end:

  • XMP metadata stream — a standards-compliant XMP packet is serialised and written as an indirect object, then wired into the document catalog as /Metadata. pdfaid:part and pdfaid:conformance are set per level: A1b → 1/B, A2b → 2/B, A2u → 2/U, A3b → 3/B.
  • OutputIntents — a GTS_PDFA1 output intent referencing sRGB is injected when none is present. Idempotent: a second call detects the existing intent and does not duplicate it.
  • Language tag: /Lang is written to the catalog when the validator raises MissingLanguage.
  • JavaScript removal: /Names/JavaScript entries are stripped when present.
  • Source bytes patched: doc.source_bytes is updated in place; the document is immediately re-parseable after conversion.
  • Font embedding (rendering feature) — embed_font() now resolves the 14 standard PDF Type1 PostScript names (Helvetica, Courier, Times-Roman, …) to the metrically-equivalent URW Base 35 open-source fonts shipped by default on Linux (Nimbus Sans, Nimbus Mono PS, Nimbus Roman). With --features rendering all B-level PDFs convert to 0 remaining errors, including FontNotEmbedded. Three bugs were fixed in the embedding pipeline:
    • try_fix_error dedup applied to error codes, so only the first FontNotEmbedded error was processed; remaining fonts were skipped — fixed to dedup per-error-code for non-font errors only.
    • write_full_to_writer wrote font objects from the original source instead of preferring staged modified_objects — fixed to use the same priority order as the general object sweep.
    • add_structure() only added /StructTreeRoot but not /MarkInfo /Marked true; the validator requires both for PDF/A-*a conformance — fixed.
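
The level-to-identification mapping listed above for the XMP metadata stream can be written as a lookup table. The Python sketch below only illustrates that mapping; the function name and the element-form serialisation are assumptions, and a real XMP packet carries more structure (namespaces, rdf:Description, packet wrapper) than is shown here.

```python
from typing import Dict, Tuple

# Level -> (pdfaid:part, pdfaid:conformance), as listed in these notes.
PDFA_IDENTIFICATION: Dict[str, Tuple[int, str]] = {
    "A1b": (1, "B"),
    "A2b": (2, "B"),
    "A2u": (2, "U"),
    "A3b": (3, "B"),
}

def pdfaid_elements(level: str) -> str:
    """Return the two pdfaid identification elements for a level."""
    part, conformance = PDFA_IDENTIFICATION[level]
    return (
        f"<pdfaid:part>{part}</pdfaid:part>"
        f"<pdfaid:conformance>{conformance}</pdfaid:conformance>"
    )

fragment = pdfaid_elements("A2u")
```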

Test coverage — 17 new end-to-end roundtrip tests in tests/test_pdfa_roundtrip.rs verify every fixable scenario (validate → convert → validate). The showcase_pdfa_conversion CI example is rewritten to assert correctness and panics on any regression.

All seven bindings expose the updated function:

Binding | API
Rust | convert_to_pdf_a(&mut doc, PdfALevel::A2b)?
Python | pdf_oxide.convert_to_pdf_a(doc, "A2b")
WASM | convertToPdfA(doc, "A2b")
C FFI | pdf_oxide_convert_to_pdf_a(doc, level, &out)
C# | Compliance.ConvertToPdfA(doc, PdfALevel.A2b)
Go | compliance.ConvertToPdfA(doc, compliance.PdfALevelA2b)
Node.js | compliance.convertToPdfA(doc, "A2b")

Symbolic TrueType glyph rendering fix (#307)

LaTeX-generated PDFs using DC/EC fonts (Dcr10, Dcsl10, etc.) embed symbolic TrueType fonts with these characteristics:

  • /Flags has the symbolic bit set (bit 3 = 4)
  • No /Encoding dictionary
  • Mac Roman format-0 cmap (platform 1, encoding 0): byte code → glyph ID
  • No Windows Unicode cmap

pdf_oxide correctly routes these through the render_cid_direct() path, which resolves each content-stream byte to a glyph ID via the Mac Roman cmap. The bug was one line in the space-detection guard:

// Before — bytes without a Unicode mapping fell through to unwrap_or(' ')
let char_at_pos = char_str.chars().next().unwrap_or(' ');
if char_at_pos.is_whitespace() { /* skip draw */ }

Any byte whose Unicode mapping returned None — including ú (0xFA → GID 85), á (0xE1 → GID 83), and fi (0x85 → GID 75) — was treated as a space, so the is_whitespace() guard blocked glyph drawing entirely.

// After — '\0' is not whitespace; GID ≠ 0 glyphs are drawn correctly
let char_at_pos = char_str.chars().next().unwrap_or('\0');

Verified pixel-perfect against Poppler and MuPDF on the #307 reproduction PDF. Regression-tested across 69 PDFs (120 page comparisons) — zero regressions in rendering, plain text, Markdown, and HTML extraction.

Text extraction fixes (#444)

Two issues surfaced while investigating #444 (Calibri ligature mis-mapping, which is an upstream macOS Quartz PDF producer bug with no fix possible on our side):

ICCBased color space warn spam — PDF producers that register ICCBased profiles under user-defined names (e.g. Cs1, Cs2) caused the text extractor to fire a WARN log on every sc/SC/scn/SCN operator that used such a name. The catch-all _ branch in the color-space handler did not know how to handle named references, so it logged and left the color unchanged. The fix: apply a component-count fallback in that branch (1 component → gray, 3 → RGB, 4 → CMYK) and demote the log to DEBUG. Affected PDFs with large amounts of colored text (like typical Office documents) emitted 96+ spurious warnings per page; now silent.
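
The component-count fallback can be pictured as a lookup on the number of operands. This is a Python sketch with a hypothetical function name; mapping the three cases onto the Device color space names is an illustrative choice, and the actual handler is Rust code inside the text extractor.

```python
from typing import Optional, Sequence

def guess_color_space(components: Sequence[float]) -> Optional[str]:
    """Pick a color space from the operand count of sc/SC/scn/SCN."""
    by_count = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}
    # Any other count is left unresolved rather than guessed.
    return by_count.get(len(components))
```

For example, three operands such as (0.2, 0.4, 0.6) are treated as RGB, while a two-operand call yields no guess.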

Text span reading-order scrambling: reorder_rowspan_labels, a function that promotes vertically-centered table row labels to sort at the top of their row block, was incorrectly activating on single-column prose documents (resumes, reports). It identified spans at rightward X positions as a "sparse column" and promoted them to wrong Y coordinates, causing line-continuation text like "to assess technical needs and" or "-making." to appear before the earlier line they followed.

Root cause: the label-candidate filter did not exclude spans whose Y-band already appears in the dense column. Genuine rowspan labels are vertically between data rows, so their Y-band is absent from the dense column. Line-continuation spans share the Y-band of the main column text and must not be treated as labels. The fix adds that exclusion:

// Before — any sparse-column span in the data Y range
y > data_bot && y < data_top

// After — additionally exclude spans that align with a dense-column row
y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))

The original rowspan-label behavior for actual table layouts (CJK lab reports, mixed-column tables) is preserved; the existing test confirms that genuine between-row labels are still promoted correctly.

Image XObject deduplication (#443)

When the same image data was passed to page.image() or from_bytes() on multiple pages, pdf_oxide encoded it as independent XObjects — each carrying the full compressed pixel data. A 760 KB PNG embedded twice contributed 1.52 MB instead of 760 KB; the #443 reproduction produced 2.32 MB from images totalling under 1.6 MB.

The fix hashes the normalised stream bytes after calling image_content_to_xobject_stream(). Hashing before normalisation failed across API paths: an image supplied via page.image() (which accepts raw file bytes and decodes them internally) and the same image supplied via ImageContent::from_bytes() produced different pre-encoding byte strings but identical post-normalisation compressed streams. Hashing after normalisation ensures the key is stable regardless of which API path the caller used. The key is (hash, byte_length) over the compressed pixel data; if a matching entry is already registered in the document's XObject map, the existing reference is reused and no new stream is written.
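
A content-keyed registry of this kind might look like the following Python sketch. The class and method names are hypothetical, and SHA-256 is an assumption, since the notes do not say which hash is used; only the (hash, byte_length) key shape and the reuse-on-match behaviour come from the description above.

```python
import hashlib
from typing import Dict, List, Tuple

class XObjectRegistry:
    """Reuse an existing image stream when identical bytes are registered again."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[bytes, int], int] = {}
        self.streams: List[bytes] = []

    def register(self, normalised_stream: bytes) -> int:
        """Return an object id, writing a new stream only for unseen content."""
        key = (hashlib.sha256(normalised_stream).digest(), len(normalised_stream))
        existing = self._by_key.get(key)
        if existing is not None:
            return existing
        self.streams.append(normalised_stream)
        object_id = len(self.streams)  # ids start at 1 in this toy model
        self._by_key[key] = object_id
        return object_id

registry = XObjectRegistry()
first = registry.register(b"compressed-pixels")
second = registry.register(b"compressed-pixels")
other = registry.register(b"different-pixels")
```

Keying on the normalised bytes is what makes the lookup independent of which API path supplied the image.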

Diagonal-line artifact in images with transparency (#450)

PDFs with PNG images that have an alpha channel displayed a diagonal stripe across the image when opened in Acrobat, Preview, and most other viewers.

Root cause: compress_image_data() prepends a PNG None-filter byte (0x00) before every scanline before Flate-compressing the pixel data. This is required by FlateDecode with DecodeParms/Predictor=15. The main image XObject carried the correct DecodeParms dictionary — but build_soft_mask_dict(), which builds the /SMask XObject for the alpha channel, emitted no DecodeParms at all. Viewers therefore decompressed the raw Flate stream, then treated the leading 0x00 filter byte of each row as an alpha pixel, shifting every row one byte to the right. The cumulative horizontal offset over hundreds of rows appears as a diagonal stripe.

Fixed by adding the same DecodeParms dictionary to the soft-mask stream:

DecodeParms { Predictor=15, Colors=1, BitsPerComponent=8, Columns=<width> }

Reported by @RubberDuckShobe in #450. Any PDF built with page.image() or ImageContent::from_bytes() where the source PNG has an alpha channel was affected; the fix is purely in the soft-mask stream header and does not change pixel data.
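
The stripe is easy to reproduce on a toy alpha channel. The sketch below uses only zlib and handles only PNG filter type 0 (None), which is the filter the notes say the encoder emits; the function names are hypothetical and a real predictor decoder must handle the other filter types too.

```python
import zlib
from typing import List

def encode_alpha(rows: List[List[int]]) -> bytes:
    """Prefix each scanline with the PNG None-filter byte, then deflate."""
    return zlib.compress(b"".join(b"\x00" + bytes(row) for row in rows))

def decode_with_predictor(stream: bytes, width: int) -> List[List[int]]:
    """Decoder told about the predictor: skips one filter byte per row."""
    raw = zlib.decompress(stream)
    stride = width + 1
    return [list(raw[i + 1:i + stride]) for i in range(0, len(raw), stride)]

def decode_without_predictor(stream: bytes, width: int,
                             height: int) -> List[List[int]]:
    """Decoder with no DecodeParms: assumes exactly `width` bytes per row."""
    raw = zlib.decompress(stream)
    return [list(raw[r * width:(r + 1) * width]) for r in range(height)]

alpha = [[255] * 4 for _ in range(4)]  # fully opaque 4x4 mask
stream = encode_alpha(alpha)
correct = decode_with_predictor(stream, 4)
shifted = decode_without_predictor(stream, 4, 4)
```

In `shifted`, the stray 0x00 lands one column further right on each row, which is the diagonal described above.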

Barcode SVG generation (#421)

pdf_barcode_get_svg was a stub returning ERR_UNSUPPORTED. Two root causes were blocking a real implementation:

  1. Format sentinel collision: pdf_generate_qr_code stored FfiBarcodeImage.format = 0, the same value as pdf_generate_barcode with format = 0 (Code128). The get_svg function had no way to distinguish QR from Code128. Fixed: QR codes now use the internal sentinel value 100 (outside the 0–7 range of 1D barcode types); the public pdf_barcode_get_format return value for QR codes changes from 0 to 100 accordingly.

  2. Missing SVG rendering path: barcoders 2.0 ships barcoders::generators::svg::SVG (enabled by default via features = ["svg"]), so no new dependency was required. For 1D barcodes, the encoding step is now factored into a private encode_1d helper shared by both generate_1d (PNG) and the new generate_1d_svg (SVG). For QR codes, generate_qr_svg rebuilds the code matrix from qrcode::QrCode::to_colors() and emits a compact inline SVG with <rect> elements, with no raster stage.

pdf_barcode_get_svg now returns a valid SVG string for all supported barcode types (Code128, Code39, EAN-13, EAN-8, UPC-A, ITF, Code93, Codabar, QR) when the barcodes feature is enabled.
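
Emitting SVG straight from a module matrix, as described for the QR path, can be sketched as follows. This Python version is illustrative only: `matrix_to_svg` and `module_px` are hypothetical names, the matrix is a plain list of booleans rather than the qrcode crate's output, and no quiet zone is added.

```python
from typing import List

def matrix_to_svg(matrix: List[List[bool]], module_px: int = 4) -> str:
    """Render dark modules as <rect> elements with no raster step."""
    height = len(matrix) * module_px
    width = (len(matrix[0]) if matrix else 0) * module_px
    rects = [
        f'<rect x="{x * module_px}" y="{y * module_px}" '
        f'width="{module_px}" height="{module_px}"/>'
        for y, row in enumerate(matrix)
        for x, dark in enumerate(row)
        if dark
    ]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">'
        + "".join(rects)
        + "</svg>"
    )

svg = matrix_to_svg([[True, False], [False, True]])
```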

CLI output routing (#412, #452)

A previous partial fix (commit 9dd94c0) introduced output_beside() / output_dir_beside() helpers and converted five commands (watermark, compress, flatten, rotate, split). Four binary-output commands were missed and continued resolving the default output path relative to the current working directory:

  • crop — now writes <stem>_cropped.pdf beside the input file.
  • decrypt — now writes <stem>_decrypted.pdf beside the input file.
  • delete — now writes <stem>_deleted.pdf beside the input file.
  • reorder — now writes <stem>_reordered.pdf beside the input file.
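
The naming rule in the list above (<stem>_<suffix>.pdf in the input's own directory) can be sketched with pathlib. The Python function below borrows the output_beside name from the helper mentioned above, but its signature is an assumption and it is not the CLI's Rust code.

```python
from pathlib import PurePosixPath

def output_beside(input_path: str, suffix: str) -> PurePosixPath:
    """Default output path: same directory as the input, stem plus suffix."""
    src = PurePosixPath(input_path)
    return src.with_name(f"{src.stem}_{suffix}.pdf")

cropped = output_beside("/data/in/report.pdf", "cropped")
```

The result depends only on the input path, never on the current working directory.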

merge previously silently defaulted to writing merged.pdf in the directory of the first input file when -o/--output was omitted. This silent fallback was the riskiest behavior in the CLI: callers who expected output beside a specific file got a surprise in a potentially unrelated directory. merge now requires -o/--output and exits with a clear error message if it is missing.

No library code was changed — all five files are in pdf_oxide_cli.


