yfedoseev/pdf_oxide
1 day ago
pdf_oxide

v0.3.41 | Real PDF/A conversion, LaTeX symbolic-font glyph rendering fix, and

Community contributors

This release exists because of the community. Special thanks to:

  • @FireMasterK — reported #307 with a precise reproduction case: a LaTeX-generated PDF where accented characters and ligatures (ú, á, fi) rendered as blank gaps across all pages. The report identified the exact document class (DC/EC TrueType fonts with Mac Roman cmap, no /Encoding dict), which made the root cause in render_cid_direct() straightforward to isolate and fix.

  • @sparkyandrew — followed up on #425 with #443, noticing that the output PDF was 2.32 MB when the two source images summed to under 1.6 MB — even after the #425 image-pipeline fix. That single observation pinpointed the missing XObject deduplication: the same image data encoded twice produced two independent compressed streams. Fixed.

  • @potatochipcoconut — reported #418, the original PDF/A binding-completeness report that drove the full implementation in #442. convert_to_pdf_a() existed in Rust but was a no-op: it recorded actions and returned success while leaving the document bytes untouched. The report surfaced this silently-broken state across all seven bindings.

  • @nickpetrovic — filed #444 with a precise four-row reproduction table showing ligature glyphs in subset Calibri fonts decoded to wrong Unicode codepoints (tiO, tf[, fte). The report included the exact PDF and the per-font-subset mapping failures, which led directly to the ICCBased color-space warn spam fix and the rowspan-label reading-order scramble fix.

  • @RubberDuckShobe — reported #450: any PDF containing a PNG with an alpha channel showed a diagonal stripe through the image. A minimal reproduction confirmed the bug was reproducible across Acrobat, Preview, and browser PDF viewers. The report made the scope unambiguous — every image with transparency was affected — and led directly to the missing DecodeParms fix in build_soft_mask_dict().

  • @truffle-dev — first code contribution to the project: completed the CLI output-path fix for #412 in #452. The original audit in #412 covered all 11 CLI commands with exact line references and two proposed design options; the PR was clean on first submission. It picks up the four commands (crop, decrypt, delete, reorder) missed by the earlier partial fix and also enforces -o/--output for merge instead of silently defaulting to the first input's directory.

Scope at-a-glance

  • Real PDF/A conversion — XMP metadata stream, pdfaid:part/conformance identification, OutputIntents (sRGB), language tag, JavaScript removal; all 7 bindings (#418, #442).
  • Symbolic TrueType glyph rendering — non-ASCII bytes (ú=0xFA, á=0xE1, fi=0x85) in DC/EC-style LaTeX fonts with Mac Roman cmap no longer suppressed as spaces (partially fixes #307; follow-up cases reported by FireMasterK on 2026-04-29 remain open).
  • Image XObject deduplication — same image embedded twice no longer re-encoded as two separate compressed streams; PDF size matches the sum of source images (#443).
  • Diagonal-line artifact in transparent images fixed — missing DecodeParms in the soft-mask XObject caused a visible diagonal stripe in any PNG with an alpha channel (#450).
  • Barcode SVG generation — pdf_barcode_get_svg no longer returns ERR_UNSUPPORTED; generates real SVG for all 8 barcode types including QR (#421).
  • CLI output routing — crop, decrypt, delete, and reorder now write default output beside the input file instead of the current working directory; merge now requires -o/--output and errors up front instead of silently defaulting to the first input's directory. Completes #412.

Real PDF/A conversion (#418, #442)

convert_to_pdf_a() previously recorded conversion actions and returned success, but the document bytes were unchanged — the XMP metadata stream was constructed in memory and then discarded. This release rewrites the conversion core end-to-end:

  • XMP metadata stream — a standards-compliant XMP packet is serialised and written as an indirect object, then wired into the document catalog as /Metadata. pdfaid:part and pdfaid:conformance are set per level: A1b → 1/B, A2b → 2/B, A2u → 2/U, A3b → 3/B.
  • OutputIntents — a GTS_PDFA1 output intent referencing sRGB is injected when none is present. Idempotent: a second call detects the existing intent and does not duplicate it.
  • Language tag — /Lang is written to the catalog when the validator raises MissingLanguage.
  • JavaScript removal — /Names/JavaScript entries are stripped when present.
  • Source bytes patched — doc.source_bytes is updated in-place; the document is immediately re-parseable after conversion.
  • Font embedding (rendering feature) — embed_font() now resolves the 14 standard PDF Type1 PostScript names (Helvetica, Courier, Times-Roman, …) to the metrically-equivalent URW Base 35 open-source fonts shipped by default on Linux (Nimbus Sans, Nimbus Mono PS, Nimbus Roman). With --features rendering all B-level PDFs convert to 0 remaining errors, including FontNotEmbedded. Three bugs were fixed in the embedding pipeline:
    • try_fix_error dedup applied to error codes, so only the first FontNotEmbedded error was processed; remaining fonts were skipped — fixed to dedup per-error-code for non-font errors only.
    • write_full_to_writer wrote font objects from the original source instead of preferring staged modified_objects — fixed to use the same priority order as the general object sweep.
    • add_structure() only added /StructTreeRoot but not /MarkInfo /Marked true; the validator requires both for PDF/A-*a conformance — fixed.
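
The pdfaid:part / pdfaid:conformance mapping listed above can be sketched as a small lookup. This is an illustration only: the enum below is a local stand-in for the library's PdfALevel, and pdfaid_for / pdfaid_fragment are hypothetical helper names, not pdf_oxide functions. The fragment covers just the two identification properties, not a complete XMP packet.

```rust
/// Local stand-in for the conformance levels named in the release notes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PdfALevel {
    A1b,
    A2b,
    A2u,
    A3b,
}

/// Returns (pdfaid:part, pdfaid:conformance) for a level.
/// Hypothetical helper; the mapping itself is taken from the notes above.
pub fn pdfaid_for(level: PdfALevel) -> (u8, char) {
    match level {
        PdfALevel::A1b => (1, 'B'),
        PdfALevel::A2b => (2, 'B'),
        PdfALevel::A2u => (2, 'U'),
        PdfALevel::A3b => (3, 'B'),
    }
}

/// Renders only the two identification properties as XML elements.
pub fn pdfaid_fragment(level: PdfALevel) -> String {
    let (part, conformance) = pdfaid_for(level);
    format!(
        "<pdfaid:part>{}</pdfaid:part><pdfaid:conformance>{}</pdfaid:conformance>",
        part, conformance
    )
}
```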

Test coverage — 17 new end-to-end roundtrip tests in tests/test_pdfa_roundtrip.rs verify every fixable scenario (validate → convert → validate). The showcase_pdfa_conversion CI example is rewritten to assert correctness and panics on any regression.

All seven bindings expose the updated function:

Binding   API
Rust      convert_to_pdf_a(&mut doc, PdfALevel::A2b)?
Python    pdf_oxide.convert_to_pdf_a(doc, "A2b")
WASM      convertToPdfA(doc, "A2b")
C FFI     pdf_oxide_convert_to_pdf_a(doc, level, &out)
C#        Compliance.ConvertToPdfA(doc, PdfALevel.A2b)
Go        compliance.ConvertToPdfA(doc, compliance.PdfALevelA2b)
Node.js   compliance.convertToPdfA(doc, "A2b")

Symbolic TrueType glyph rendering fix (#307)

LaTeX-generated PDFs using DC/EC fonts (Dcr10, Dcsl10, etc.) embed symbolic TrueType fonts with these characteristics:

  • /Flags has the symbolic bit set (bit 3 = 4)
  • No /Encoding dictionary
  • Mac Roman format-0 cmap (platform 1, encoding 0): byte code → glyph ID
  • No Windows Unicode cmap

pdf_oxide correctly routes these through the render_cid_direct() path, which resolves each content-stream byte to a glyph ID via the Mac Roman cmap. The bug was one line in the space-detection guard:

// Before — bytes without a Unicode mapping fell through to unwrap_or(' ')
let char_at_pos = char_str.chars().next().unwrap_or(' ');
if char_at_pos.is_whitespace() { /* skip draw */ }

Any byte whose Unicode mapping returned None — including ú (0xFA → GID 85), á (0xE1 → GID 83), and fi (0x85 → GID 75) — was treated as a space, so the is_whitespace() guard blocked glyph drawing entirely.

// After — '\0' is not whitespace; GID ≠ 0 glyphs are drawn correctly
let char_at_pos = char_str.chars().next().unwrap_or('\0');

Verified pixel-perfect against Poppler and MuPDF on the #307 reproduction PDF. Regression-tested across 69 PDFs (120 page comparisons) — zero regressions in rendering, plain text, Markdown, and HTML extraction.
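
The effect of the one-character change can be reproduced in isolation. The sketch below is not the code in render_cid_direct(); the two function names are hypothetical and only model the guard. An empty char_str stands for a byte with no Unicode mapping.

```rust
/// Models the old guard: a missing Unicode mapping defaults to ' ',
/// which is whitespace, so the glyph is skipped.
pub fn skips_draw_before(char_str: &str) -> bool {
    char_str.chars().next().unwrap_or(' ').is_whitespace()
}

/// Models the new guard: a missing mapping defaults to '\0',
/// which is not whitespace, so the glyph is still drawn.
pub fn skips_draw_after(char_str: &str) -> bool {
    char_str.chars().next().unwrap_or('\0').is_whitespace()
}
```

A genuine space still short-circuits in both versions; only the unmapped case changes.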

Text extraction fixes (#444)

Two issues surfaced while investigating #444 (Calibri ligature mis-mapping, which is an upstream macOS Quartz PDF producer bug with no fix possible on our side):

ICCBased color space warn spam — PDF producers that register ICCBased profiles under user-defined names (e.g. Cs1, Cs2) caused the text extractor to fire a WARN log on every sc/SC/scn/SCN operator that used such a name. The catch-all _ branch in the color-space handler did not know how to handle named references, so it logged and left the color unchanged. The fix: apply a component-count fallback in that branch (1 component → gray, 3 → RGB, 4 → CMYK) and demote the log to DEBUG. Affected PDFs with large amounts of colored text (like typical Office documents) emitted 96+ spurious warnings per page; now silent.
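
The component-count fallback described above amounts to a three-way match. A minimal sketch, assuming the operand count of the sc/SC/scn/SCN operator is already known; FallbackSpace and fallback_space are hypothetical names, not the extractor's actual types.

```rust
/// Device colour space chosen when a named colour space cannot be resolved.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FallbackSpace {
    Gray,
    Rgb,
    Cmyk,
}

/// Picks a fallback from the number of colour operands:
/// 1 component is gray, 3 is RGB, 4 is CMYK. Any other count has no fallback.
pub fn fallback_space(component_count: usize) -> Option<FallbackSpace> {
    match component_count {
        1 => Some(FallbackSpace::Gray),
        3 => Some(FallbackSpace::Rgb),
        4 => Some(FallbackSpace::Cmyk),
        _ => None,
    }
}
```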

Text span reading-order scrambling — reorder_rowspan_labels, a function that promotes vertically-centered table row labels to sort at the top of their row block, was incorrectly activating on single-column prose documents (resumes, reports). It identified spans at rightward X positions as a "sparse column" and promoted them to wrong Y coordinates, causing line-continuation text like "to assess technical needs and" or "-making." to appear before the earlier line they followed.

Root cause: the label-candidate filter did not exclude spans whose Y-band already appears in the dense column. Genuine rowspan labels are vertically between data rows, so their Y-band is absent from the dense column. Line-continuation spans share the Y-band of the main column text and must not be treated as labels. The fix adds that exclusion:

// Before — any sparse-column span in the data Y range
y > data_bot && y < data_top

// After — additionally exclude spans that align with a dense-column row
y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))

The original rowspan-label behavior for actual table layouts (CJK lab reports, mixed-column tables) is preserved; the existing test confirms that genuine between-row labels are still promoted correctly.
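
To make the exclusion concrete, here is a self-contained sketch of the predicate. The names band_of and dense_bands come from the snippet above, but their definitions here are assumptions: band_of is modelled as rounding to a fixed 4 pt band, and is_label_candidate is a hypothetical wrapper around the condition.

```rust
use std::collections::HashSet;

/// Assumed band height in points; the real quantisation is not documented here.
const BAND_HEIGHT: f32 = 4.0;

/// Assumed definition: quantise a Y coordinate into an integer band.
pub fn band_of(y: f32) -> i32 {
    (y / BAND_HEIGHT).round() as i32
}

/// A sparse-column span is a label candidate only if it lies inside the
/// data Y range and does not share a band with any dense-column row.
pub fn is_label_candidate(y: f32, data_bot: f32, data_top: f32, dense_bands: &HashSet<i32>) -> bool {
    y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))
}
```

With dense rows at y = 700, 680 and 660, a span at y = 691 sits between rows and stays a candidate, while a span at y = 680 shares a row's band and is excluded.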

Image XObject deduplication (#443)

When the same image data was passed to page.image() or from_bytes() on multiple pages, pdf_oxide encoded it as independent XObjects — each carrying the full compressed pixel data. A 760 KB PNG embedded twice contributed 1.52 MB instead of 760 KB; the #443 reproduction produced 2.32 MB from images totalling under 1.6 MB.

The fix hashes the normalised stream bytes after calling image_content_to_xobject_stream(). Hashing before normalisation failed across API paths: an image supplied via page.image() (which accepts raw file bytes and decodes them internally) and the same image supplied via ImageContent::from_bytes() produced different pre-encoding byte strings but identical post-normalisation compressed streams. Hashing after normalisation ensures the key is stable regardless of which API path the caller used. The key is (hash, byte_length) over the compressed pixel data; if a matching entry is already registered in the document's XObject map, the existing reference is reused and no new stream is written.
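
The (hash, byte_length) keying can be sketched with the standard library alone. This is an assumption-laden illustration: XObjectRegistry and register are hypothetical names, object IDs are plain counters, and DefaultHasher stands in for whatever hash the library actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical registry mapping (hash, byte_length) to an XObject id.
pub struct XObjectRegistry {
    by_key: HashMap<(u64, usize), u32>,
    next_id: u32,
}

impl XObjectRegistry {
    pub fn new() -> Self {
        XObjectRegistry { by_key: HashMap::new(), next_id: 1 }
    }

    /// Takes the normalised (post-encoding) stream bytes and returns
    /// (object id, true if a new stream had to be written).
    pub fn register(&mut self, normalised: &[u8]) -> (u32, bool) {
        let mut hasher = DefaultHasher::new();
        normalised.hash(&mut hasher);
        let key = (hasher.finish(), normalised.len());
        if let Some(&id) = self.by_key.get(&key) {
            return (id, false); // reuse the existing reference
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_key.insert(key, id);
        (id, true)
    }
}
```

Because the key is computed from the normalised bytes, it does not matter which API path produced them.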

Diagonal-line artifact in images with transparency (#450)

PDFs with PNG images that have an alpha channel displayed a diagonal stripe across the image when opened in Acrobat, Preview, and most other viewers.

Root cause: compress_image_data() prepends a PNG None-filter byte (0x00) to every scanline before Flate-compressing the pixel data. This is required by FlateDecode with DecodeParms/Predictor=15. The main image XObject carried the correct DecodeParms dictionary, but build_soft_mask_dict(), which builds the /SMask XObject for the alpha channel, emitted no DecodeParms at all. Viewers therefore decompressed the raw Flate stream, then treated the leading 0x00 filter byte of each row as an alpha pixel, shifting each row one byte further right than the row above it. The accumulating horizontal offset over hundreds of rows appears as a diagonal stripe.

Fixed by adding the same DecodeParms dictionary to the soft-mask stream:

DecodeParms { Predictor=15, Colors=1, BitsPerComponent=8, Columns=<width> }

Reported by @RubberDuckShobe in #450. Any PDF built with page.image() or ImageContent::from_bytes() where the source PNG has an alpha channel was affected; the fix is purely in the soft-mask stream header and does not change pixel data.
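
The diagonal can be reproduced on a tiny mask without any compression involved. The sketch below models only the scanline framing; all three function names are hypothetical and none of this is the library's code. It assumes an 8-bit, single-channel mask and columns > 0.

```rust
/// Prepends the PNG None-filter byte (0x00) to each scanline.
pub fn prepend_none_filter(pixels: &[u8], columns: usize) -> Vec<u8> {
    assert!(columns > 0, "columns must be non-zero");
    let mut out = Vec::with_capacity(pixels.len() + pixels.len() / columns);
    for row in pixels.chunks(columns) {
        out.push(0x00);
        out.extend_from_slice(row);
    }
    out
}

/// What a viewer does when DecodeParms is present: drop the filter byte per row.
pub fn read_with_predictor(filtered: &[u8], columns: usize) -> Vec<u8> {
    filtered
        .chunks(columns + 1)
        .flat_map(|row| row[1..].iter().cloned())
        .collect()
}

/// What a viewer does when DecodeParms is missing: treat every byte as a pixel
/// and read rows * columns of them.
pub fn read_without_predictor(filtered: &[u8], columns: usize) -> Vec<u8> {
    let rows = filtered.len() / (columns + 1);
    filtered[..rows * columns].to_vec()
}
```

For a fully opaque 3×3 mask (nine 0xFF bytes), the misread result is 00 FF FF / FF 00 FF / FF FF 00: the stray filter bytes land one column further right on each row.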

Barcode SVG generation (#421)

pdf_barcode_get_svg was a stub returning ERR_UNSUPPORTED. Two root causes were blocking a real implementation:

  1. Format sentinel collision — pdf_generate_qr_code stored FfiBarcodeImage.format = 0, the same value as pdf_generate_barcode with format = 0 (Code128). The get_svg function had no way to distinguish QR from Code128. Fixed: QR codes now use the internal sentinel value 100 (outside the 0–7 range of 1D barcode types); the public pdf_barcode_get_format return value for QR codes changes from 0 to 100 accordingly.

  2. Missing SVG rendering path — barcoders 2.0 ships barcoders::generators::svg::SVG (enabled by default via features = ["svg"]), so no new dependency was required. For 1D barcodes, the encoding step is now factored into a private encode_1d helper shared by both generate_1d (PNG) and the new generate_1d_svg (SVG). For QR codes, generate_qr_svg rebuilds the code matrix from qrcode::QrCode::to_colors() and emits a compact inline SVG with <rect> elements — no raster stage.

pdf_barcode_get_svg now returns a valid SVG string for all supported barcode types (Code128, Code39, EAN-13, EAN-8, UPC-A, ITF, Code93, Codabar, QR) when the barcodes feature is enabled.
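
The matrix-to-<rect> step for QR codes can be illustrated without the qrcode crate. This is a sketch under stated assumptions, not generate_qr_svg itself: matrix_to_svg is a hypothetical name, the module matrix is passed as a flat row-major slice of booleans (true = dark), and the output omits quiet zone and styling.

```rust
/// Emits one 1x1 <rect> per dark module of a row-major module matrix.
pub fn matrix_to_svg(modules: &[bool], width: usize) -> String {
    assert!(width > 0 && modules.len() % width == 0, "matrix must be rectangular");
    let height = modules.len() / width;
    let mut svg = format!(
        "<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 {} {}\">",
        width, height
    );
    for (i, &dark) in modules.iter().enumerate() {
        if dark {
            let (x, y) = (i % width, i / width);
            svg.push_str(&format!(
                "<rect x=\"{}\" y=\"{}\" width=\"1\" height=\"1\"/>",
                x, y
            ));
        }
    }
    svg.push_str("</svg>");
    svg
}
```

Working in module units and letting viewBox handle scaling keeps the string compact and avoids any raster stage.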

CLI output routing (#412, #452)

A previous partial fix (commit 9dd94c0) introduced output_beside() / output_dir_beside() helpers and converted five commands (watermark, compress, flatten, rotate, split). Four binary-output commands were missed and continued resolving the default output path relative to the current working directory:

  • crop — now writes <stem>_cropped.pdf beside the input file.
  • decrypt — now writes <stem>_decrypted.pdf beside the input file.
  • delete — now writes <stem>_deleted.pdf beside the input file.
  • reorder — now writes <stem>_reordered.pdf beside the input file.

merge previously silently defaulted to writing merged.pdf in the directory of the first input file when -o/--output was omitted. This silent fallback was the riskiest behavior in the CLI: callers who expected output beside a specific file got a surprise in a potentially unrelated directory. merge now requires -o/--output and exits with a clear error message if it is missing.

No library code was changed — all five files are in pdf_oxide_cli.
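
The "beside the input file" rule can be sketched with std::path. The helper name output_beside appears in the notes above, but its real signature is not shown there; the two-argument form and the "output" fallback stem below are assumptions made for illustration.

```rust
use std::path::{Path, PathBuf};

/// Builds <dir-of-input>/<stem>_<suffix>.pdf.
/// Assumed signature; only the naming scheme is taken from the release notes.
pub fn output_beside(input: &Path, suffix: &str) -> PathBuf {
    let stem = input
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output"); // assumption: fallback for paths with no usable stem
    input.with_file_name(format!("{}_{}.pdf", stem, suffix))
}
```

Deriving the result from the input path rather than the process's working directory is what keeps the output next to the source file regardless of where the CLI is invoked.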



Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform   Architecture             Archive
Linux      x86_64 (glibc)           pdf_oxide-linux-x86_64-*.tar.gz
Linux      x86_64 (musl)            pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux      ARM64                    pdf_oxide-linux-aarch64-*.tar.gz
macOS      x86_64 (Intel)           pdf_oxide-macos-x86_64-*.tar.gz
macOS      ARM64 (Apple Silicon)    pdf_oxide-macos-aarch64-*.tar.gz
Windows    x86_64                   pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

3 days ago
pdf_oxide

v0.3.40 | Image rendering fixes, dashed stroke + streaming table batch, digital

Community contributors

This release exists because of the community. Special thanks to:

  • @sparkyandrew — six detailed bug reports (#382, #385, #386, #397, #401, #425) that drove the CJK font subsetter, encryption, font-name handling, and now the image rendering overhaul. Every report came with a reproduction case. Issue #425 specifically identified four separate rendering bugs and raised the API design question that led to ImageContent::from_bytes() and the new image() method across all bindings.

  • @potatochipcoconut — three well-targeted reports (#409, #416, #417) that directly drove the manylinux glibc fix, the OCR wheel fix, and the discovery of the missing in-memory encrypted save API. Terse, precise, actionable every time.

Scope at-a-glance

  • Image rendering — four bugs fixed in PNG/JPEG embed path (#425).
  • New image API — ImageContent::from_bytes() + plain image() on all bindings; no pixel dims needed (#425).
  • Dashed stroke + streaming table batch — StrokeRectDashed/StrokeLineDashed + StreamingTable bounded-batch API across all 7 bindings (#400).
  • Digital signature verification — real RSA-PSS / ECDSA / TSA cryptographic checks (#420).
  • Binding completeness — encrypted bytes (#423), barcode via C FFI (#421), Node.js validation (#424) and page extraction (#384), Python/Go convert_to_pdfa (#418/#419).
  • Platform fixes — Python glibc 2.34 compat (#416), OCR wheels (#417), WASM rendering (#422), CLI output path.
  • Security & hygiene — unsafe audit, dep freshness, SLSA provenance, SBOM, CodeQL, DCO (#415).

Image rendering fixes (#425)

This release closes the image-rendering bugs reported in #425 by @sparkyandrew. Four bugs, all in the same family of incorrect assumptions in image_handler.rs / pdf_writer.rs:

  • PNG color corruption (Predictor=15 mismatch) — FlateDecode with DecodeParms/Predictor=15 promises PNG-style per-scanline filter bytes. The encoder was compressing raw pixels without prepending the required 0x00 (None-filter) byte before each row; viewers applied PNG unfiltering to raw data, corrupting every pixel. Fixed: compress_image_data() now prepends one 0x00 per scanline before Flate compression.

  • Blank PNG via ImageContent::new() — image_content_to_xobject_stream() assumed data was already decoded pixel bytes. Passing raw PNG file bytes caused the PNG header to be treated as pixels — blank / garbage output. Fixed: magic-byte detection (89 50 4E 47) routes raw bytes through ImageData::from_png().

  • JPEG zoom / wrong dimensions — same root cause; JPEG file bytes were not routed through ImageData::from_jpeg(), so the pixel dimensions stored in the XObject were wrong. Fixed by the same FF D8 magic-byte detection.

  • Soft-mask (alpha) lost — PNG transparency was discarded when raw bytes were passed through ImageContent::new(). The new auto-detect path correctly threads the alpha channel through to the PDF /SMask XObject.
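
The magic-byte routing used by the PNG and JPEG fixes can be sketched as a prefix check. SniffedFormat and sniff_format are hypothetical names; only the byte signatures (89 50 4E 47 for PNG, FF D8 for JPEG) come from the notes above.

```rust
/// Container format guessed from the first bytes of a buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SniffedFormat {
    Png,
    Jpeg,
}

/// Returns the format if the buffer starts with a known signature,
/// or None if it should be treated as already-decoded pixel data.
pub fn sniff_format(data: &[u8]) -> Option<SniffedFormat> {
    if data.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
        Some(SniffedFormat::Png)
    } else if data.starts_with(&[0xFF, 0xD8]) {
        Some(SniffedFormat::Jpeg)
    } else {
        None
    }
}
```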

New image API — from_bytes() and image() (#425)

The bug report also identified a legitimate API design problem: every other PDF library (ReportLab, fpdf2, iText, PDFBox, PDFKit, printpdf, Prawn) auto-detects pixel dimensions from the image header — users only specify where the image appears on the page. ImageContent::new() required passing width and height explicitly, which callers typically had to look up from a separate decode step.

// Before — pixel dims required even though the library could read them itself
let img = ImageContent::new(bbox, ImageFormat::Png, raw_bytes, width, height);

// After — just bytes + on-page display rect; everything else auto-detected
let img = ImageContent::from_bytes(bbox, raw_bytes)?;

from_bytes() detects JPEG/PNG by magic number and reads width, height, color_space, bits_per_component, and the soft-mask channel from the image header. A plain image() method (no accessibility wrapper) was also missing from Go, C#, and Node.js — added to all three:

Binding   Method
Rust      ImageContent::from_bytes(bbox, data)?
Go        page.Image(bytes, x, y, w, h)
C#        page.Image(bytes, x, y, w, h)
Node.js   page.image(bytes, x, y, w, h)
Python    page.image_from_bytes(bytes, x, y, w, h) (pre-existing)
WASM      page.image_from_bytes(bytes, x, y, w, h) (pre-existing)

Use imageWithAlt / ImageWithAlt for PDF/UA-1 accessible figures and imageArtifact / ImageArtifact for decorative images.
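
As an illustration of why explicit pixel dimensions are unnecessary, a PNG's width and height can be read straight from its IHDR chunk: an 8-byte signature, a 4-byte chunk length, the 4-byte type "IHDR", then width and height as big-endian u32 values. The function name png_dimensions is hypothetical and this is not the from_bytes() implementation.

```rust
/// Reads (width, height) from the IHDR chunk of a PNG file, if present.
pub fn png_dimensions(data: &[u8]) -> Option<(u32, u32)> {
    const SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if data.len() < 24 || data[..8] != SIGNATURE || &data[12..16] != b"IHDR" {
        return None;
    }
    // Width occupies bytes 16..20 and height bytes 20..24, both big-endian.
    let width = u32::from_be_bytes([data[16], data[17], data[18], data[19]]);
    let height = u32::from_be_bytes([data[20], data[21], data[22], data[23]]);
    Some((width, height))
}
```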

Dashed stroke + streaming table batch (#400)

Two FluentPageBuilder additions shipping across all 7 bindings:

  • stroke_rect_dashed / stroke_line_dashed — stroke a rectangle or line with an explicit dash pattern (&[f32] on/off lengths + phase) and RGB colour. Complements the existing solid stroke_rect / stroke_line.

  • StreamingTable bounded-batch API — set_batch_size(n), pending_row_count(), batch_count(), flush() — lets callers control how many rows accumulate in memory before being flushed to the PDF content stream. Useful when streaming very large tables from a source that itself has natural chunk boundaries.

Both surfaces are available in Rust, Python, WASM, Go, C#, and Node.js / TypeScript. New examples/*/09-new-features/dashed_stroke/ examples ship in all four binding example directories.
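
In PDF content streams a dash pattern is set with the d operator, written as a dash array followed by a phase. The sketch below shows how on/off lengths plus a phase map onto that operator; dash_operator is a hypothetical helper, and whether pdf_oxide formats the numbers this way is an assumption.

```rust
/// Formats a PDF dash-pattern operator: "[on off ...] phase d".
/// An empty pattern yields "[] 0 d", which PDF treats as a solid line.
pub fn dash_operator(pattern: &[f32], phase: f32) -> String {
    let lengths: Vec<String> = pattern.iter().map(|v| v.to_string()).collect();
    format!("[{}] {} d", lengths.join(" "), phase)
}
```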

Digital signature verification (#420)

SignatureInfo.verify() now performs real cryptographic verification instead of returning a stub result:

  • RSA-PSS and RSA-PKCS#1 v1.5 — verified against the embedded certificate public key via the rsa + sha2 crates.
  • ECDSA (P-256 / P-384) — verified via the p256 / p384 crates.
  • TSA timestamp (Timestamp.verify()) — full RFC 3161 countersignature verification: CMS structure, signer certificate, and TSTInfo hash match.

Binding completeness sweep

Several APIs present in the Rust core and some bindings were missing from others. All are now consistent across all 7 bindings:

  • In-memory encrypted save (#423) — PdfDocument.to_bytes_encrypted(user_pw, owner_pw) saves with AES-256 encryption directly to bytes / Buffer / Vec<u8> without touching disk. Available in Python, Node.js, C#, Go, and the C FFI. Driven by @potatochipcoconut in #409.

  • Barcode via C FFI (#421) — pdf_add_barcode_to_page() embeds a generated barcode PNG onto a page at a given rect. Previously the function returned ERR_UNSUPPORTED; it now calls the new DocumentEditor::add_image_bytes_to_page() helper internally. C FFI only in this release — Go and C# wrappers are follow-up work.

  • PDF/A, PDF/X, PDF/UA validation on Node.js (#424) — PdfDocument.validatePdfA(), .validatePdfX(), .validatePdfUA() now available in the Node.js binding, matching Python, Go, C#, WASM, and Rust.

  • Page extraction in Node.js (#384) — DocumentEditor.extractPagesToBytes(pageIndices) splits a multi-page PDF into per-chunk Buffer objects entirely in memory, no temp files needed.

    const chunk = editor.extractPagesToBytes([0, 1, 2]); // → Buffer
  • PDF/A conversion (#418/#419) — PdfDocument.convert_to_pdfa(output_path, level) exposed in Python; pdf_convert_to_pdfa() C FFI + Go ConvertToPdfA().

Platform fixes

  • Python glibc 2.34 compatibility (#416) — LLVM emits __memcmpeq (a glibc 2.35 symbol) in some optimised builds; wheels built against glibc 2.35 failed to load on Amazon Linux 2023 (glibc 2.34) and similar systems. Fixed by adding a global_asm! weak-symbol alias in src/lib.rs that maps __memcmpeq → memcmp. This works with both GNU ld and lld (unlike --defsym which lld rejects for PLT-resolved symbols). Reported by @potatochipcoconut.

  • Python OCR wheels (#417) — published wheels omitted the ocr feature, so pip install pdf-oxide[ocr] installed silently but failed at runtime. Wheels now compile with --features ocr; ORT library path auto-detected on import. Reported by @potatochipcoconut.

  • WASM rendering (#422) — wasm-pack builds were missing the rendering feature flag, producing blank page images. All WASM targets now build with --features rendering.

  • CLI binary output path — pdf-oxide render, pdf-oxide thumbnail, and other commands that produce binary output were writing to the current working directory instead of next to the input file when no explicit output path was given. Fixed.

Security & hygiene (#415)

  • #[forbid(unsafe_code)] on all modules that have no FFI business being unsafe; remaining unsafe consolidated into audited FFI helpers with handle_mut! / handle_ref! macros.
  • lazy_static replaced with std::sync::OnceLock throughout.
  • cargo update dep freshness sweep; lock file refreshed.
  • cargo-geiger unsafe audit + cargo-outdated dependency check added to CI (both run monthly).
  • CI: action SHAs pinned, OIDC publish, SLSA provenance level 3, SBOM (CycloneDX), OpenSSF Scorecard, CodeQL static analysis, DCO enforcement.
  • Dependabot configured for all three ecosystems (cargo, npm, github-actions).
  • SPDX licence headers added to source files; CODEOWNERS and CONTRIBUTING (DCO) added.


3 days ago
pdf_oxide

v0.3.40

See CHANGELOG for full release notes.

6 days ago
pdf_oxide

v0.3.39 | Tables (streaming + buffered), PDF/UA-1, digital signing (CMS/PKCS#7),

Scope at-a-glance

v0.3.39 originally shipped as a single release themed around table generation (issue #393). Mid-release we expanded the scope to close the broader post-#393 programmatic-builder gap audit (docs/v0.3.39/design/builder_gaps_plan.md, 26 items in 4 tiers). The release now delivers:

  • Bundle C — shape primitives (circle, ellipse, polygon, arc, bezier_curve) + dash patterns on LineStyle.
  • Bundle A — image placement (image_from_file / _from_bytes / _with) + 2D affine transforms (rotated, scaled, translated, with_transform; v0.3.39 scope text-only, path/image/table in v0.3.40).
  • Bundle B — document outline (bookmark, bookmark_tree), page labels (with_page_labels), ToC auto-generator (insert_toc).
  • Bundle D (partial) — list_box form widget, fluent field metadata (required / read_only / tooltip), page tab_order (TabOrder::{Row, Column, Structure}).
  • Bundle E + F (research) — RFCs for rich-text accumulator (docs/v0.3.39/design/e_rich_text_rfc.md) and PDF/UA compliance (docs/v0.3.39/research/e_pdf_ua_compliance.md). Implementation deferred to v0.3.40 (#400).
  • Bundle D (deferred) — signature_field widget, barcode-bound fields, JS-action field validation, calculated fields, XFA write-side → v0.3.40 (#400).

DocumentBuilder tables (original #393 scope)

This release closes issue #393. Users who previously had to build giant HTML strings or drop to PdfSharp (the .NET community's canonical pain point — MigraDoc halts around 30 k rows with an O(rows²) autosize) can now stream tables of arbitrary size directly through DocumentBuilder. The release gate is a criterion benchmark that proves linear scaling from 1 k → 30 k rows; see the "Release gate" section below.

Design + research anchors live under docs/v0.3.39/:

  • research/a_table_api_landscape.md — survey of 20 OSS PDF libraries across 6 ecosystems.
  • research/b_scalable_layout_algorithms.md — why MigraDoc fails at 30 k rows + how to not repeat it.
  • research/c_api_ergonomics.md — idiomatic API shape per binding.
  • research/d_builder_gap_analysis.md — primitives we were missing to make tables compose.
  • design/393_tables_decision.md — synthesis + scope split v0.3.39 / v0.3.40.

Two table surfaces, one type vocabulary

  • Buffered Table (page.table(Table::new(rows).with_header_row()...)) — takes the full row matrix, supports colspan / rowspan / rich per-cell styling, splits at row boundaries, emits ContentElements so the v0.3.38 subsetter continues re-keying CJK glyph IDs. Best for tables under ~1 k rows.

  • Streaming StreamingTable (page.streaming_table(StreamingTableConfig::new().column(...).column(...))) — row-at-a-time, TableMode::Fixed only (explicit widths, zero look-ahead), O(cols) persistent memory, auto page-break with repeat-header. Best for 1 k → ∞ rows. Solves the motivating MigraDoc 30 k-row failure directly.

use pdf_oxide::writer::{
    CellAlign, DocumentBuilder, StreamingColumn, StreamingTableConfig,
};

let mut doc = DocumentBuilder::new();
let page = doc.letter_page().font("Helvetica", 10.0).at(72.0, 720.0);

let mut t = page.streaming_table(
    StreamingTableConfig::new()
        .column(StreamingColumn::new("SKU").width_pt(72.0))
        .column(StreamingColumn::new("Item").width_pt(240.0))
        .column(
            StreamingColumn::new("Qty")
                .width_pt(48.0)
                .align(CellAlign::Right),
        )
        .repeat_header(true),
);

for record in huge_dataset {       // never materialised
    t.push_row(|r| {
        r.cell(&record.sku);
        r.cell(&record.name);
        r.cell(record.qty.to_string());
    })?;
}
t.finish().done();

Both surfaces ship with idiomatic per-binding wrappers (Python, WASM, C#, Go, Node/TS). See each binding's README / guide for the native shape.

Four supporting FluentPageBuilder primitives

Shipped alongside tables because a credible table API needs them:

  • measure(&str) -> f32 — text width in points for the current font/size. Pure query; used to pick explicit column widths.
  • text_in_rect(rect, text, align) — wraps text to rect.width, aligns each line horizontally per TextAlign::{Left, Center, Right}. Cursor is deliberately NOT advanced — the rect has its own geometry. Finally honours TextConfig.align which was a dead field for seven releases.
  • stroke_rect(x, y, w, h, LineStyle) + stroke_line(p1, p2, LineStyle) — stroke with explicit width + RGB colour. Previously rect() and line() only stroked at 1 pt black. LineStyle { width, color } is the new public type.
  • remaining_space() + new_page_same_size() — the missing page-break signal. remaining_space() returns vertical points from cursor to the bottom margin; new_page_same_size() commits pending annotations and opens a fresh page with the same dimensions + carried text_config.

Release gate

A criterion benchmark at benches/streaming_table_scaling.rs runs StreamingTable at 1 k / 5 k / 10 k / 30 k rows. Local numbers on the contributor machine (--quick):

Size      Time       Throughput
1 000     21.7 ms    46.0 K rows/sec
10 000    217.0 ms   46.0 K rows/sec

10× rows → 10× time → O(rows). MigraDoc's failure mode would have shown ~100× time at 10× input. Run the benchmark with cargo bench --bench streaming_table_scaling.
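
The scaling claim can be checked arithmetically from the two published data points. The helpers below are hypothetical and not part of the benchmark: rows_per_sec recomputes throughput, and scaling_exponent estimates k in time ∝ rows^k from two measurements (k ≈ 1 is linear, k ≈ 2 is quadratic).

```rust
/// Throughput in rows per second from a row count and elapsed milliseconds.
pub fn rows_per_sec(rows: f64, millis: f64) -> f64 {
    rows / (millis / 1000.0)
}

/// Estimates the exponent k in time ~ rows^k from two (rows, millis) samples.
pub fn scaling_exponent(rows_a: f64, ms_a: f64, rows_b: f64, ms_b: f64) -> f64 {
    (ms_b / ms_a).ln() / (rows_b / rows_a).ln()
}
```

Plugging in 1 000 rows at 21.7 ms and 10 000 rows at 217.0 ms gives roughly 46 K rows/sec at both sizes and an exponent of about 1. A 100× slowdown at 10× input would give an exponent of 2.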

Rendering-correctness fixes surfaced during the refactor

  • Multi-line cell rendering. The existing src/writer/table_renderer.rs computed row heights from wrapped text (wrap_text at :817) but only emitted the first line on render (:968-969 — flagged as // Simple single-line rendering for now). Fixed by pre-computing wrapped lines + per-line widths once inside TableLayout.cell_layouts and looping them at render time.
  • Per-line alignment. Center and Right alignment used cell_x + content_width / 2 and cell_x + content_width as the drawn-from x, which placed the text's left edge at the centre or right edge of the cell (so centre text was offset, right text was pushed off-cell). Fixed by using each wrapped line's measured width: cell_x + (content_width - line_width) / 2 for Centre, cell_x + content_width - line_width for Right.
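
The corrected alignment formulas can be written as one small function. Align and line_start_x are hypothetical names used for illustration; the arithmetic is exactly the per-line formula quoted above.

```rust
/// Horizontal alignment of a wrapped line inside a cell.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Align {
    Left,
    Center,
    Right,
}

/// X coordinate of the line's left edge, given the line's measured width.
pub fn line_start_x(cell_x: f32, content_width: f32, line_width: f32, align: Align) -> f32 {
    match align {
        Align::Left => cell_x,
        Align::Center => cell_x + (content_width - line_width) / 2.0,
        Align::Right => cell_x + content_width - line_width,
    }
}
```

For a cell at x = 100 with 200 pt of content width and a 50 pt line, the old Center formula would have started drawing at 200 (the cell's midpoint); the corrected one starts at 175 so the line's own midpoint lands there.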

Expansion bundles — builder-gap closure

Bundle A — images + transforms

page.image_from_file("logo.png", Rect::new(72.0, 720.0, 120.0, 40.0))?
    .rotated(15.0, |p| p.text("tilted caption"))
    .scaled(1.5, 1.5, |p| p.text("enlarged footnote"));
  • image_from_file(path, rect) / image_from_bytes(&[u8], rect) / image_with(ImageData, rect) — auto-detect JPEG + PNG, alpha channels become /SMask XObjects for transparent placement.
  • rotated(deg, |p| ...), scaled(sx, sy, |p| ...), translated(tx, ty, |p| ...), with_transform([a b c d e f], |p| ...) — closure-scoped 2D affine transforms. Compose naturally (translated(50, 100, |p| p.rotated(45, |p| p.text("tilted"))) produces the expected composed matrix). v0.3.39 scope is text-only — Path / Image / Table elements gain a matrix field in v0.3.40. Rotated watermarks + stamps + captions are the common-case target today.

Bundle B — navigation + document structure

doc.bookmark("Intro", 0)
   .bookmark_tree(|o| {
       o.add_item(OutlineItem::new("Chapter 1", 1));
       o.add_child(OutlineItem::new("Section 1.1", 2));
   })
   .with_page_labels(
       PageLabelsBuilder::new()
           .add_range(PageLabelRange::new(0).with_style(PageLabelStyle::RomanLower))
           .add_range(PageLabelRange::new(4).with_style(PageLabelStyle::Decimal)),
   )
   .insert_toc(0, "Table of Contents");
  • bookmark(title, page_index) + bookmark_tree(|b| ...) — outline / bookmarks emitted as the catalog /Outlines tree. Pre-existing OutlineBuilder was unused; this release is the fluent wiring + the end-to-end catalog emission it was missing.
  • with_page_labels(PageLabelsBuilder) — Roman preface + Arabic body or any PageLabelStyle mix, emitted as /PageLabels number-tree.
  • insert_toc(insert_at, title) — walks the bookmark tree and renders an indented ToC page with right-aligned page numbers. v0.3.39 limitation: doesn't auto-renumber existing bookmark targets (call before further bookmarks, or re-issue after).
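To make the Roman-preface-plus-Arabic-body case concrete, here is a rough sketch of how a page label can be derived from a list of label ranges. It assumes each range restarts numbering at 1 (the default start value for a PDF page-label range) and that ranges are sorted by first page index. Style, roman_lower and page_label are hypothetical names and do not mirror the internals of PageLabelsBuilder.

```rust
/// Hypothetical numbering styles for this sketch.
#[derive(Clone, Copy, Debug)]
enum Style {
    Decimal,
    RomanLower,
}

/// Lowercase Roman numeral for n >= 1 (empty string for 0).
fn roman_lower(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for &(value, symbol) in TABLE.iter() {
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    out
}

/// Label for a zero-based page index. `ranges` holds (first_page_index, style)
/// pairs sorted by index; numbering restarts at 1 in every range.
fn page_label(ranges: &[(usize, Style)], page_index: usize) -> Option<String> {
    // The governing range is the last one that starts at or before the page.
    let &(start, style) = ranges.iter().rev().find(|range| range.0 <= page_index)?;
    let n = (page_index - start + 1) as u32;
    Some(match style {
        Style::Decimal => n.to_string(),
        Style::RomanLower => roman_lower(n),
    })
}
```

With ranges starting at page 0 (RomanLower) and page 4 (Decimal), as in the snippet above, pages 0 to 3 are labelled i to iv and page 4 becomes 1.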

Bundle C — shapes + dash patterns

page.circle(cx, cy, r, Some(LineStyle::new(1.5, 0.1, 0.2, 0.3)), None)
    .ellipse(cx, cy, rx, ry, None, Some((0.9, 0.1, 0.1)))
    .polygon(&points, Some(LineStyle::default()), Some((0.5, 0.5, 0.9)))
    .arc(cx, cy, r, start, end, LineStyle::new(1.0, 0.0, 0.0, 0.0))
    .bezier_curve(x0, y0, c1x, c1y, c2x, c2y, x3, y3, style, None)
    .stroke_line(10, 100, 500, 100, LineStyle::new(0.5, 0, 0, 0).with_dash(&[3.0, 2.0], 0.0));
  • circle, ellipse, polygon, arc, bezier_curve — five fluent shape primitives, each emitting one ContentElement::Path with optional stroke + fill. circle reuses PathContent::circle; ellipse / arc / bezier_curve build their quarter-Bezier approximations inline.
  • LineStyle::with_dash(&[f32], phase) / .solid() — dash patterns propagate into PathContent.dash_pattern, emitted as [...] phase d before stroke and reset to solid after.
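For readers unfamiliar with quarter-Bézier approximation, the standard construction is sketched below: each quarter of a circle is one cubic Bézier whose control points sit at a distance of 4/3 · (√2 − 1) · r from the end points. This illustrates the general technique only; it is not taken from the crate's source, and circle_segments and bezier_point are names invented for the example.

```rust
type Point = (f64, f64);

/// Control-point offset factor for a quarter circle: 4/3 * (sqrt(2) - 1).
fn kappa() -> f64 {
    4.0 / 3.0 * (2.0_f64.sqrt() - 1.0)
}

/// Four cubic segments [p0, p1, p2, p3] approximating a circle,
/// counter-clockwise, starting at (cx + r, cy).
fn circle_segments(cx: f64, cy: f64, r: f64) -> [[Point; 4]; 4] {
    let k = kappa() * r;
    [
        [(cx + r, cy), (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)],
        [(cx, cy + r), (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)],
        [(cx - r, cy), (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)],
        [(cx, cy - r), (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)],
    ]
}

/// Evaluate one cubic Bézier segment at parameter t in [0, 1].
fn bezier_point(seg: &[Point; 4], t: f64) -> Point {
    let u = 1.0 - t;
    let weights = [u * u * u, 3.0 * u * u * t, 3.0 * u * t * t, t * t * t];
    let mut x = 0.0;
    let mut y = 0.0;
    for (w, p) in weights.iter().zip(seg.iter()) {
        x += w * p.0;
        y += w * p.1;
    }
    (x, y)
}
```

The factor is chosen so that the midpoint of each segment lies exactly on the circle; elsewhere the curve deviates from the true radius by well under 0.1 %. An ellipse follows from the same control points with separate x and y radii.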

Bundle D — form fields (partial)

page.list_box("interests", 72, 600, 200, 80,
              vec!["Hiking".into(), "Reading".into(), "Coding".into()],
              Some("Coding".into()), true /* multi_select */)
    .required()
    .tooltip("Pick one or more")
    .text_field("email", 72, 500, 200, 20, None)
    .required()
    .read_only()
    .tab_order(TabOrder::Column);
  • list_box(name, x, y, w, h, options, selected, multi_select) — wires the existing ListBoxWidget (fully implemented in form_fields/choice_fields.rs) through the public fluent surface.
  • .required() / .read_only() / .tooltip(text) — chainable metadata that mutates the most-recently-added form field on the current page (no-op if no field has been added yet).
  • page.tab_order(TabOrder::{Row, Column, Structure}) — emits /Tabs on the page dict for reader tab-navigation order. Structure requires tagged PDF (Bundle F) to be meaningful.

Bundle E (partial) — layout primitives

page.heading(1, "Shopping list")
    .bullet_list(&["Apples", "Bananas", "Cherries"])
    .space(12.0)
    .numbered_list(&["First chapter", "Second chapter"], ListStyle::Decimal)
    .code_block("rust", "fn main() {\n    println!(\"hi\");\n}");
  • page.bullet_list(items) — bullets (•) with indent + per-item wrapping.
  • page.numbered_list(items, ListStyle::{Decimal, RomanLower, AlphaLower}) — Arabic, lowercase Roman, or lowercase alpha markers.
  • page.code_block(language, source) — monospace text over a light-grey filled rectangle. language reserved for Bundle F accessibility tagging; no syntax highlighting in v0.3.39.
  • Helpers: to_roman_lower(n) and to_alpha_lower(n) exposed internally.
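As an illustration of the alpha marker style, the sketch below shows one plausible way to produce lowercase alphabetic markers. It assumes spreadsheet-style numbering (a … z, aa, ab, …); whether to_alpha_lower behaves this way past 26 is not stated in these notes, so alpha_lower and alpha_markers are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for an alpha marker helper:
/// 1 -> "a", 26 -> "z", 27 -> "aa", 28 -> "ab". Returns "" for 0.
fn alpha_lower(mut n: u32) -> String {
    let mut letters = Vec::new();
    while n > 0 {
        n -= 1; // shift to 0-based so that 26 maps to 'z' rather than wrapping
        letters.push((b'a' + (n % 26) as u8) as char);
        n /= 26;
    }
    letters.iter().rev().collect()
}

/// Prefix each item with its marker, e.g. "a. First chapter".
fn alpha_markers(items: &[&str]) -> Vec<String> {
    items
        .iter()
        .enumerate()
        .map(|(i, item)| format!("{}. {}", alpha_lower(i as u32 + 1), item))
        .collect()
}
```

This is bijective base-26: there is no zero digit, which is why the value is decremented before each division.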

Inline rich text (ParagraphBuilder with .bold() / .italic() / .color()), multi-column flow, and footnotes remain deferred to v0.3.40 — see the E-0 RFC at docs/v0.3.39/design/e_rich_text_rfc.md.

Bundles E + F — RFC + research only

  • docs/v0.3.39/design/e_rich_text_rfc.md — RFC for v0.3.40 inline-styling ParagraphBuilder with .bold() / .italic() / .color(rgb, text) cascading runs. ~770 LOC estimated for v0.3.40.
  • docs/v0.3.39/research/e_pdf_ua_compliance.md — PDF/UA-1 compliance audit. Repo has ~40 % of the plumbing (StructureElement, MCID counter, ArtifactType) but MCIDs are orphaned — no StructTreeRoot emission. Bundle F lands in v0.3.40 as ~490 Rust LoC + 1,450 across 6 bindings.

FFI / bindings

  • C FFI (include/pdf_oxide_c/pdf_oxide.h) — six new entry points: pdf_page_builder_stroke_rect, _stroke_line, _text_in_rect, _new_page_same_size, _table (buffered), and the streaming trio _streaming_table_begin / _push_row / _finish. Handle-lifetime contract documented inline.
  • Python (pyo3) — new classes Align, Column, Table, StreamingTable; new FluentPageBuilder methods mirroring the Rust surface. align kwargs accept string, enum, or raw int interchangeably.
  • WASM (wasm-bindgen) — Align enum + StreamingTable class; buffered table({columns, rows, hasHeader}) via serde-wasm-bindgen; stroke_rect, stroke_line, text_in_rect, new_page_same_size, measure, remaining_space on the page builder.
  • C#: Alignment, Column, TableSpec, StreamingTable : IDisposable; fluent methods on PageBuilder, including a managed-side streaming buffer that flushes on .Build().
  • Go (cgo) — Alignment, Column, TableSpec, StreamingTableConfig under go/types.go; fluent methods on *PageBuilder; managed streaming adapter. Purego backend untouched (table surface is cgo-only in v0.3.39).
  • Node/TS: Align enum + StreamingTable class in js/src/builders/streaming-table.ts with pushRow, pushAll (sync + async iterables), finish. All new types in js/index.d.ts.

Scope deferred to v0.3.40 (tracked in #400)

Tables

  • TableMode::Sample — measure first N rows, freeze widths, stream the rest.
  • TableMode::AutoAll — opt-in O(rows × cols) with documentation warning.
  • Cross-page cell splitting for tall rich cells.
  • Bounded-lookahead rowspan in streaming mode.
  • Arrow-style bounded batching on binding StreamingTables (current impl buffers all rows managed-side between begin and finish).
  • Mixed-font exact metrics inside a single table (currently measures against the table default font).
  • Pandas DataFrame first-class adapter in Python.

Transforms

  • TableContent-as-a-whole matrix (individual cells compose naturally through their own TextContent / PathContent matrix fields, which now ship — but wrapping an entire Table in one transform needs a new field on TableContent itself).

Forms (rest of Bundle D)

  • Signature-field form widget (coordinates with #208 signing half).
  • Barcode-bound form field (auto-generate from another field's value at fill time).
  • Field validation — regex mask, numeric range, JavaScript actions.

Layout (Bundle E) — blocked on E-0 RFC which ships in v0.3.39

  • Inline rich-text styling (ParagraphBuilder with .bold() / .italic() / .color()).
  • Multi-column flow on DocumentBuilder (currently only available through Pdf::from_html_css).
  • Footnotes / endnotes.

Accessibility (Bundle F) — blocked on F-0 research which ships in v0.3.39

  • Tagged PDF / logical structure tree emission.
  • /Lang per content run.
  • /Artifact marking for headers/footers on the write side.
  • /RoleMap for non-standard structure types.

Advanced forms (Bundle G) — pick up on concrete customer demand

  • Calculated fields / JavaScript actions.
  • XFA write-side.

Bug fixes

  • #401 — Encrypted PDFs were missing embedded-font sub-objects (/Widths, /FontDescriptor, /FontFile2); they are now included and referenced correctly. Reported by @sparkyandrew.
  • #402 / #406 — Systemic UTF-8 encoding loss: every PDF string object (metadata titles, annotation contents, bookmark titles, content streams) was written as raw UTF-8 bytes instead of PDFDocEncoding (Latin-1 code point for chars ≤ U+00FF) or UTF-16BE with BOM (for chars > U+00FF). Reported by @AngeloBestetti (#402) and internally audited as #406.
  • #407 — L4 font cache cross-contamination: when two pages share the same /Font resource key (e.g. both use key F1), the CMap of the first-loaded face silently overwrote the second's glyph mapping, causing glyphs to be dropped or mis-decoded. Fixed by keying the combined-font hash over all font objects. Reported by @ChadThackray.
  • #395: SignatureException on PdfDocument.open() for PDFs containing digital signatures. Fixed as a side-effect of the signing infrastructure (#208). Reported by @gevorgter.
  • #398 — Native PDF parser was non-reentrant: concurrent FFI reads on the same handle returned spurious parse errors. Resolved by the interior-mutability refactor (Mutex<…> on internal caches).
  • #409 — Python (and all bindings) lacked to_bytes() / in-memory output; compress and garbage_collect were not wired into the write path. Reported by @potatochipcoconut.
  • #411: p12 = "0.6" (yanked / unmaintained) replaced with p12-keystore = "0.2.1" (RustCrypto-ecosystem, pure Rust, actively maintained). No public API change; SigningCredentials::from_pkcs12 behaviour is unchanged.
  • StreamingTable rowspan flush: finish() was silently dropping the in-progress rowspan group if the table ended mid-span. Added a flush of any partial rowspan_buf before finalising the page.
  • draw_rowspan_group bounds guard — accessing rows[0][col_idx].rowspan was not guarded against col_idx ≥ rows[0].len(), causing a panic on narrow tables with rowspan cells. Added the bounds check col_idx < rows[0].len().
  • scan_root_ref anchoring — the digital-signature helper scanned the entire document for /Root, so a /Root reference embedded inside an annotation value or stream body could silently win over the real XRef /Root at the end of the file. Now mirrors scan_startxref by restricting the search to the last 4 KB of the file.
  • Signature reason/location PDFDocEncoding: /Reason and /Location entries in CMS-signature dictionaries were written as raw UTF-8 bytes, bypassing the encode_pdf_text_string path. Non-ASCII characters (accents, CJK, etc.) were stored as illegal UTF-8 sequences in the PDF string. Now uses the same hex-encoded PDFDocEncoding/UTF-16BE path as all other string objects, closing the last #402-class gap in the signing path.
  • #394 — Mixed-size inline runs (superscripts, footnote markers) were incorrectly split onto separate lines because the newline gate used a hard-coded 2 pt Y-tolerance. Replaced with PdfDocument::same_line_threshold — a font-size-relative helper (max(prev_fs, cur_fs) × 0.5) shared across all seven Tagged-PDF assembly paths and should_insert_space. A forward-gap guard was added to prevent the widened threshold from merging spans across column gutters. Contributed by @RolandWArnold (#394).
  • #403 — Simple fonts without an explicit /Widths array fell back to a uniform 0.55 em default for every glyph. For standard-14 fonts (Helvetica, Times, Courier, etc.) this inflated span widths by up to 40 %, collapsing inter-column gaps from real values (e.g. 47 pt) to near-zero (5 pt) and breaking gap-dependent layout heuristics. The fast path now populates the byte-to-width table from get_standard_font_width when /Widths is absent; non-standard fonts and unmapped codepoints still fall back to the generic default. Contributed by @RolandWArnold (#403).
  • #404 — Span right-edges could drift ~0.02 pt outside the detected table bbox due to float accumulation in upstream width arithmetic. The strict Rect::contains_rect check then rejected those spans from the table's retain set, so they were emitted via both the table path and the flow path, producing duplicated text. Introduced a 0.1 pt tolerance at the two retain call sites in document.rs via PdfDocument::contains_rect_with_tolerance; the geometry primitive itself remains strict. Contributed by @RolandWArnold (#404).
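The #404 fix keeps the strict geometry primitive and applies the tolerance only at the call site. A minimal sketch of that idea follows, assuming a simple x/y/width/height rectangle; the Rect type and both functions are simplified stand-ins written for this example and are not the crate's definitions.

```rust
/// Simplified rectangle for this sketch (not the crate's Rect).
#[derive(Clone, Copy, Debug)]
struct Rect {
    x: f32,
    y: f32,
    width: f32,
    height: f32,
}

impl Rect {
    /// Strict containment: every edge of `other` must lie inside `self`.
    fn contains_rect(&self, other: &Rect) -> bool {
        other.x >= self.x
            && other.y >= self.y
            && other.x + other.width <= self.x + self.width
            && other.y + other.height <= self.y + self.height
    }
}

/// Tolerant containment: grow the outer rectangle by `tol` on every side,
/// then reuse the strict check unchanged.
fn contains_rect_with_tolerance(outer: &Rect, inner: &Rect, tol: f32) -> bool {
    let grown = Rect {
        x: outer.x - tol,
        y: outer.y - tol,
        width: outer.width + 2.0 * tol,
        height: outer.height + 2.0 * tol,
    };
    grown.contains_rect(inner)
}
```

A span whose right edge drifts 0.02 pt past the table bbox fails the strict check but passes with a 0.1 pt tolerance, while a span that is a full point outside is still rejected.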

CI / test-suite fixes

  • Resolved all Clippy, rustfmt, and cargo check failures that were blocking CI (fix(ci) commit 6c95bada): unused-mut across 80+ files after the interior-mutability refactor, late-init variables, doc-comment ordering, non-minimal boolean conditions, deprecated function references.
  • Renamed six test files from issue-number / benchmark-code names to functional descriptive names (refactor(tests) commit fa071380): test_b1_* → test_shared_form_xobject_per_page_ctm, test_b3_* → test_running_header_first_occurrence_kept, test_b4_* → test_two_column_reading_order, test_b7_* → test_stroke_fill_duplicate_text_dedup, test_issue_346_* → test_extract_text_sort_comparator_stability, test_issue_395_* → test_signed_pdf_opens_and_renders.
  • Example smoke-tests in CI — all code examples are now compiled and executed on every CI run, catching binding API drift before it reaches a release. A dedicated rust-examples job runs all 13 Rust examples (tutorial_* + showcase_*). The Python, Go, Node.js, and C# binding jobs each gained an equivalent step that runs the per-language examples against tests/fixtures/simple.pdf. This means any breaking change to a public binding API will fail CI immediately rather than being discovered post-release by users.
  • Example restructuring — the single monolithic 09-new-features showcase file per language was replaced with one standalone file per feature (streaming-table, pdf-ua-image, in-memory-roundtrip, pkcs12-signing, rfc3161-timestamp) across all 5 languages. Each file is a self-contained runnable program. The tutorial examples 01-08 were also repaired: Go examples gained go.mod + go.sum and had three API-drift regressions fixed (OpenEditor, pdf.Save, RowCount/CellText); JavaScript examples were migrated from CommonJS require() to ESM import; C# examples gained .csproj files referencing the local PdfOxide project.

Community Contributors

  • @RolandWArnold — First contribution to PDFOxide, and a substantial one at that. Roland identified three independent text-extraction correctness issues, traced each one to its root cause in the Rust source, wrote focused fixes with synthetic PdfWriter-based regression tests, and documented the behaviour thoroughly in PR descriptions that made review straightforward. #394 fixes the long-standing mixed-size inline run / superscript line-grouping problem; #403 restores correct span widths for standard-14 fonts without /Widths; #404 eliminates duplicate text caused by sub-pixel float drift at the table-retain boundary. Thank you, Roland — we look forward to more! 🚀

  • @AngeloBestetti — Filed #402 with the concrete word "Lógico": a Portuguese term that, when saved to PDF, came back as mojibake because every accented byte was being stored as raw UTF-8. That single reproducer uncovered a systemic encoding bug — all PDF string objects (metadata titles, annotation contents, bookmark labels, content-stream text) were silently corrupted for any non-ASCII character. The internal audit that followed produced #406 and a full rewrite of write_escaped_string + encode_pdf_text_string to emit PDFDocEncoding for chars ≤ U+00FF and UTF-16BE with BOM for anything above. Thank you.

  • @sparkyandrew — Filed #401 after discovering that AES-256 encrypted PDFs built with DocumentBuilder opened successfully but rendered blank — the embedded font was gone. The root cause: collect_reachable_ids followed the top-level Font dictionary but stopped there, so /Widths, /FontDescriptor, and /FontFile2 were garbage-collected as "unreachable" during the encrypted write pass. The fix traces the full font sub-object graph before encryption so the complete font survives. Thank you.

  • @ChadThackray — Filed #407 after noticing that glyphs from one page silently replaced those of another whenever two pages shared the same /Font resource-key name (both using key F1 but mapped to different faces). The L4 cache was keying the combined glyph-map on a spot-check of a single font object; the fix computes a combined hash over the complete font set, so any change to any face invalidates the entry. Thank you.

  • @gevorgter — Filed #395 after a SignatureException from RenderPage on a 9-page signed PDF — the renderer was propagating a signature-parse failure as the page-render verdict even though no interactive widget lived on that page. The fix treats unparseable signature-field metadata as non-fatal at render time. @gevorgter also supplied the reproducer PDF that became the regression fixture (tests/test_signed_pdf_opens_and_renders.rs), ensuring this class of error can never silently return. Thank you.

  • @potatochipcoconut — Asked #409 how to get a PdfDocument as raw bytes from Python without writing to disk, and whether compress and garbage_collect were available. Neither worked. The question drove the to_bytes() / SaveOptions kwargs work that shipped in-memory output, compression, and garbage-collection across all 7 bindings, plus 18 missing DocumentEditor methods. Thank you.
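The encoding rule behind the #402 / #406 fix described above can be sketched in a few lines. This is an approximation under stated assumptions: it follows the rule as worded in these notes (one byte per character up to U+00FF, otherwise UTF-16BE with a BOM) and treats PDFDocEncoding as Latin-1, although the real PDFDocEncoding table differs from Latin-1 at a handful of code points. encode_text_string and to_hex_string are illustrative names, not the crate's encode_pdf_text_string.

```rust
/// Encode a text string the way the notes describe: single bytes when every
/// character is <= U+00FF, otherwise UTF-16BE prefixed with the FE FF BOM.
/// Assumption: PDFDocEncoding is approximated as Latin-1 here.
fn encode_text_string(text: &str) -> Vec<u8> {
    if text.chars().all(|c| (c as u32) <= 0xFF) {
        text.chars().map(|c| c as u32 as u8).collect()
    } else {
        let mut out = vec![0xFE, 0xFF]; // byte-order mark
        for unit in text.encode_utf16() {
            out.extend_from_slice(&unit.to_be_bytes());
        }
        out
    }
}

/// Render bytes as a PDF hex string literal, e.g. <4CF3>.
fn to_hex_string(bytes: &[u8]) -> String {
    let mut out = String::from("<");
    for b in bytes {
        out.push_str(&format!("{:02X}", b));
    }
    out.push('>');
    out
}
```

For "Lógico" this yields six bytes with ó stored as the single byte F3, rather than the two-byte UTF-8 sequence C3 B3 that caused the mojibake.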


Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |

Changelog

See CHANGELOG.md for full details.

9 days ago
pdf_oxide

v0.3.38 | DocumentBuilder fluent API across every language binding, real font subsetting, DocumentBuilder encryption, multi-target WASM packaging, and the first cryptographic slice of PDF signature verification

This release closes the "Rust-only DocumentBuilder gap": the fluent write-side builder, embedded fonts, the HTML+CSS pipeline, annotations, form-field creation, and low-level graphics primitives are now reachable from Python, WASM, C#, Go, and Node/TypeScript — the Rust implementation is the single source of truth and every binding is a thin translation layer. On top of that it lands the first cryptographic signature-verification path (RSA-PKCS#1 v1.5) across every binding and a pdf.js-parity fix for scanned / bilevel pages rendered under a Multiply-blended overlay.

Write-side API × every binding (#384)

Every binding now exposes the full DocumentBuilder fluent API:

# Python — the same shape ships in WASM, C#, Go, and Node/TS
font = EmbeddedFont.from_file("DejaVuSans.ttf")
(DocumentBuilder()
  .register_embedded_font("DejaVu", font)
  .a4_page()
    .font("DejaVu", 12).at(72, 720).text("Привет, мир!")
    .highlight((1.0, 1.0, 0.0))
    .text_field("name", 150, 680, 200, 20, "Jane Doe")
    .checkbox("subscribe", 72, 650, 15, 15, True)
    .rect(50, 50, 500, 700)
  .done()
  .build())

Surface shipped in all 6 bindings:

  • DocumentBuilder + FluentPageBuilder + EmbeddedFont — multi-page construction with CJK / Cyrillic / Greek support (closes #382 cross-language).
  • HTML+CSS pipeline: Pdf.from_html_css(...) and from_html_css_with_fonts(...) for multi-font cascades.
  • 15 annotation methods — link (URL / page / named), highlight, underline, strikeout, squiggly, sticky note, stamp (14 standard types + custom), free text, watermark (custom / DRAFT / CONFIDENTIAL).
  • 5 AcroForm widget types — text_field, checkbox, combo_box, radio_group, push_button.
  • Graphics primitives: rect, filled_rect, line.
  • AES-256 encryption: save_encrypted / to_bytes_encrypted on every binding.

Per-binding regression tests for every capability above; ~70 new integration tests pass across Python (20), C FFI (11), C# (11), Go (11), Node/TS (10), and WASM (9).

Real font subsetting on the write path (#385 — FONT-3b)

Documents that embed a CJK face now ship a subset, not the full font. A PDF with 5 characters from NotoSansCJKtc-Regular.otf (~17 MB original) is typically under 100 KB. Content streams, /W widths, and ToUnicode CMap are all re-keyed onto the subset GID space; extract_text round-trips unchanged.

Breaking (v0.3.x semver-acceptable): EmbeddedFont::encode_string / encode_shaped_run now return Vec<u16> instead of a hex String, and build_embedded_font_objects returns a GlyphRemapper that callers must pass to ContentStreamBuilder::build_with_remappers. Internal writer-library consumers only — no change to high-level APIs.
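The re-keying step is easier to picture with a toy remapper. The sketch below is an assumption-laden illustration, not the crate's GlyphRemapper: it assigns new glyph IDs in first-use order, reserves GID 0 for .notdef, and then rewrites a run of original glyph IDs into subset glyph IDs before hex-encoding them as two-byte codes for an Identity-H font. SketchRemapper, rekey and to_hex_operand are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Toy remapper (hypothetical): original GID -> dense subset GID.
struct SketchRemapper {
    old_to_new: BTreeMap<u16, u16>,
    new_to_old: Vec<u16>,
}

impl SketchRemapper {
    fn new() -> Self {
        let mut remapper = SketchRemapper {
            old_to_new: BTreeMap::new(),
            new_to_old: Vec::new(),
        };
        remapper.remap(0); // keep .notdef at GID 0 in the subset
        remapper
    }

    /// Return the subset GID for `old_gid`, allocating the next free one on first use.
    fn remap(&mut self, old_gid: u16) -> u16 {
        if let Some(&new_gid) = self.old_to_new.get(&old_gid) {
            return new_gid;
        }
        let new_gid = self.new_to_old.len() as u16;
        self.old_to_new.insert(old_gid, new_gid);
        self.new_to_old.push(old_gid);
        new_gid
    }
}

/// Rewrite a run of original glyph IDs onto the subset GID space.
fn rekey(remapper: &mut SketchRemapper, gids: &[u16]) -> Vec<u16> {
    gids.iter().map(|&gid| remapper.remap(gid)).collect()
}

/// Hex string operand with one 4-digit code per glyph, e.g. <00010002>.
fn to_hex_operand(gids: &[u16]) -> String {
    let mut out = String::from("<");
    for gid in gids {
        out.push_str(&format!("{:04X}", gid));
    }
    out.push('>');
    out
}
```

Deferring the hex encoding until after remapping is one reason an encode step would return glyph IDs rather than a finished hex string: the final codes are only known once the subset's glyph order is fixed.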

DocumentBuilder encryption (#386)

AES-256 encryption is now available on programmatically-built PDFs:

DocumentBuilder::new()
    .a4_page().text("secret").done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw")?;

Also: save_with_encryption (custom algorithm + permissions) and to_bytes_encrypted for in-memory output.

Multi-target WASM packaging (#392)

pdf-oxide-wasm now ships three builds side-by-side and routes each consumer through package.json conditional exports:

| Environment | Build |
|---|---|
| Node.js | nodejs/ |
| Bundlers (Vite, webpack, Rollup, esbuild, Bun) | bundler/ |
| Browsers / Deno / Cloudflare Workers | web/ |

Fixes ReferenceError: Can't find variable: __dirname thrown in any browser bundler. Subpath imports (pdf-oxide-wasm/web etc.) are also available for manual routing.

Digital signature verification (#208 — verification half)

First cryptographically-backed signature surface on the reader side. Every binding (Signature.verify() / .verifyDetached() / equivalents) now runs the RFC 5652 §5.4 signer-attributes check against the embedded certificate and the §11.2 messageDigest check against the caller's document bytes:

for sig in doc.signatures():
    print(sig.signer_name, "→", sig.verify())            # signer-attrs only
    print("detached ok =", sig.verify_detached(pdf_bytes))  # + content hash
  • RSA-PKCS#1 v1.5 over SHA-1 / SHA-256 / SHA-384 / SHA-512 — the padding used by effectively every signed PDF in the wild — returns Valid / Invalid.
  • RSA-PSS and ECDSA surface as Unknown / UnsupportedFeatureException for now; callers that need those can still read the signer certificate via Signature.GetCertificate() and drive their own check.
  • SignatureVerifier::verify (Rust) also stamps the verification result with trust-root lookup, expiry window, and signer DN pulled from the embedded certificate.

Supporting surface shipped alongside:

  • Certificate — DER inspection (subject, issuer, serial, validity, is_valid) via x509-parser (every binding).
  • Signature — enumerate + inspect + .GetCertificate() (every binding).
  • Timestamp — RFC 3161 TSTInfo parsing (time, serial, policy, TSA name, hash algorithm, message imprint) — every binding.
  • TsaClient — RFC 3161 HTTP POST with nonce + HTTP Basic auth, behind a new tsa-client Cargo feature — every binding except WASM. Intentionally not wired on WASM (ureq is wasm-incompatible).
  • DocumentEditor::set_producer / set_creation_date — metadata writers.
  • render_page_region / render_page_fit — clipped / fitted rendering surface.
  • Bicubic image filtering (pdf.js#19978 parity) — scanned / bilevel pages with a Multiply-blended overlay no longer collapse their grayscale range on downscale.

Signing (as opposed to verification) is not covered by this release; #208 remains open for the signing half.

Binding parity follow-ups

Five thin-wrapper commits closed the last coverage holes in this release's signature surface — Python/Go/WASM Certificate inspect, Node Timestamp parse+verify, Node TsaClient HTTP. Every capability in the Supporting Surface list above is now the language-idiomatic shape across all six non-Rust bindings (modulo the principled WASM-TsaClient omission).

Go binding — purego backend + cache-dir install

Go users can now build with CGO_ENABLED=0 via a second backend that uses ebitengine/purego to dlopen libpdf_oxide.{so,dylib,dll} at runtime — no C toolchain required. Backend selection is automatic via Go's built-in cgo tag (//go:build cgo → full CGo API, //go:build !cgo → purego).

The purego backend covers the read-side PdfDocument surface — open (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, fonts, annotations, page elements, search, page dimensions, logging — plus PdfCreator.FromMarkdown for test fixtures. Editor, DocumentBuilder, barcode, signature, TSA, rendering, OCR, and forms stay CGo-only; using them under !cgo is a compile-time error. Full parity is tracked for a follow-up.

Installer:

  • New -shared flag fetches the cdylib instead of the staticlib and prints CGO_ENABLED=0 + PDF_OXIDE_LIB_PATH=… to export.
  • Install dir moved to os.UserCacheDir(): ~/.cache/pdf_oxide on Linux, ~/Library/Caches/pdf_oxide on macOS, %LocalAppData%\pdf_oxide on Windows. Matches Go's own GOCACHE convention; existing installs re-fetch once into the new path.

Release assets now include pdf_oxide-go-ffi-shared-<platform>.tar.gz for every Tier-1 platform alongside the existing staticlib archives.

Bug fixes

  • #395: doc.RenderPage(0, 0) raised PdfOxide.Exceptions.SignatureException: '[8500] Signature error...' on a specific 9-page PDF reported by @gevorgter. The failure was the renderer propagating a signature-parse error up as the page-render verdict even though the page itself had no interactive signature widget on it. Fixed by treating unparseable signature-field metadata as non-fatal at render time; pinned by tests/test_issue_395_render_signature_exception.rs + the C# regression test so this can't silently come back.

Thanks

Reports and feature requests from @sparkyandrew (#382 CJK via DocumentBuilder, #385 subsetter), @arthurlassagne (#392 browser build breakage), and @gevorgter (#395 RenderPage SignatureException). All three surfaced the gaps that drove this release.



11 days ago
pdf_oxide

v0.3.37 | HTML + CSS → PDF (issue #248) — first credible pure-Rust pipeline

API — Pdf::from_html_css (#248)

let font = std::fs::read("DejaVuSans.ttf")?;
let mut pdf = Pdf::from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font,
)?;
pdf.save("out.pdf")?;

The whole feature: pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), extract_text round-trips byte-equal so produced PDFs participate in the existing test infrastructure.

End-to-end test suite at tests/test_html_to_pdf_e2e.rs covers simple paragraph, multi-paragraph, nested HTML, CSS-styled text, and Unicode (Latin + Latin-Extended + Cyrillic + symbols) round-trips.

Phase FONT — embedded TTF/OTF subsystem

  • Subsetter wrapper around the subsetter crate (Typst's, MIT/Apache): crate::fonts::subset_font_bytes(bytes, used_glyphs) produces a subset face, and EmbeddedFont tracks used glyph IDs via the FontSubsetter type. The writer path currently embeds the full font face in FontFile2 (full-face embedding + Identity-H is valid PDF 1.7 and round-trips correctly); switching to the subsetter's output requires remapping glyph IDs in the already-emitted content streams, which lands as a later follow-up. The standalone API + glyph tracking still ship so callers that use the subsetter directly (e.g. CLI tools shelling out to subset_font_bytes) get the size benefit today.
  • Type 0 / CIDFontType2 / Identity-H / ToUnicode emission wired into PdfWriter so add_embedded_text(text, x, y, "EFn", size) produces a font dict graph that PDF readers handle correctly. Round-trip via extract_text returns the input string for Latin, Cyrillic, Greek, Hebrew, Arabic.
  • System font discovery via fontdb (RazrFalcon, MIT). New system-fonts feature gates discovery + shaping; default-on for language bindings, off for WASM and the bare Rust crate.
  • Text shaping via rustybuzz (HarfBuzz port, MIT). Returns positioned glyph runs with cluster info so the inline formatter can map glyphs back to source bytes.

Phase CSS — hand-rolled engine

10 modules, ~6,500 LoC, no MPL anywhere:

  • Tokenizer (CSS Syntax L3) with full token coverage including CDO/CDC, hex+named entities resolution in url(), source locations.
  • Parser producing Stylesheet { rules: Vec<Rule> } with forgiving recovery per spec.
  • Selectors L3 + L4 subset: :is/:where/:not/:has, structural pseudo-classes, attribute matchers with i/s flags, specificity computation packed into a sortable u32.
  • Matcher with Element trait so the engine isn't tied to one DOM implementation.
  • Cascade with origin/specificity/source-order sorting, inheritance from parent for the spec's inherited-property list, inline-style merge, custom-property storage.
  • calc() / min() / max() / clamp() evaluator with mixed-unit math against a CalcContext.
  • var() substitution with DFS cycle detection.
  • Typed property values for colour (~150 named, hex, rgb/rgba/hsl), length (every CSS Values L4 unit), display, font-size/weight/style/family, margin/padding shorthand expansion, line-height, etc.
  • At-rules: @media print always-true + (min/max-width) predicates, @page with :first/:left/:right/:blank selectors and margin boxes, @font-face descriptor extraction, @import URL forwarding, @supports against our supported set.
  • Counters (counter/counters/counter-reset/-increment/-set with Roman/Greek/alpha numbering) and pseudo-element content evaluation.
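Packing selector specificity into a sortable u32, as mentioned in the selectors bullet above, could look roughly like this. The bit layout is an assumption made for the example (10 bits each for ID, class-level and type-level counts, saturating at 1023); the engine's actual layout is not described in these notes.

```rust
/// Pack (ids, classes, types) into one u32 so that plain integer comparison
/// matches specificity order. Assumed layout: 10 bits per component,
/// each count saturating at 1023 so it cannot overflow into its neighbour.
fn pack_specificity(ids: u32, classes: u32, types: u32) -> u32 {
    const MAX: u32 = 0x3FF;
    (ids.min(MAX) << 20) | (classes.min(MAX) << 10) | types.min(MAX)
}

/// Sort (specificity, source_order) pairs the way a cascade would within one
/// origin: lower specificity first, later source order winning ties.
fn sort_for_cascade(rules: &mut Vec<(u32, usize)>) {
    rules.sort_by_key(|&(specificity, source_order)| (specificity, source_order));
}
```

Because the ID count occupies the highest bits, a single ID selector outranks any number of classes, and the saturation keeps a pathological selector with thousands of classes from spilling into the ID field.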

Phase HTML

  • HTML5 tokenizer with attribute parsing (quoted/unquoted/bare), void-element implicit self-closing, <style>/<script> raw-text contexts, named + numeric entity decoding, comments, DOCTYPE.
  • Flat arena DOM implementing the CSS-4 Element trait so the cascade matches against real document nodes. Implicit close handling for the common <p> and <li> cases.
  • Stylesheet extraction: <style> blocks, <link rel="stylesheet"> (URL forwarded; media attribute preserved), per-element inline style="...".
  • Resource extraction: <img> with srcset DPR selection, <picture>/<source> first-match, <a href> (internal anchor detection).

Phase LAYOUT

  • Box tree from DOM × ComputedStyles with display-split (outer/inner), anonymous-block insertion per CSS 2.1 §9.2.1.1, display: none/contents handling, UA default display table for common HTML elements.
  • Taffy integration for block / flex / grid layout (Dioxus, MIT, default-features-off + only the features we need).
  • Inline formatting with greedy line breaker via UAX #14 (unicode-linebreak), text-align/white-space modes, hard breaks, atomic inline boxes.
  • Float scaffolding with line-shortening helpers.
  • Margin collapsing per CSS 2.1 §8.3.1.
  • Multi-column distribution (column-count/column-width/column-gap with greedy line distribution).
  • Tables with auto + fixed column-width algorithms, row-group classification (header/body/footer for paginator repetition).
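
The margin-collapsing rule from CSS 2.1 §8.3.1 can be sketched in a few lines of Python. This illustrates the rule itself, not pdf_oxide's code, and collapse_margins is a hypothetical name: the collapsed margin is the largest positive margin plus the most negative one.

```python
def collapse_margins(*margins: float) -> float:
    """Combine adjoining vertical margins per CSS 2.1 section 8.3.1."""
    positives = [m for m in margins if m > 0]
    negatives = [m for m in margins if m < 0]
    # Largest positive margin, reduced by the largest-magnitude negative margin.
    return max(positives, default=0.0) + min(negatives, default=0.0)
```

For example, a 20 px bottom margin meeting a 30 px top margin collapses to 30 px, not 50 px.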

Phase PAGINATE

  • Slices a positioned box tree across pages at floor(box.y / content_height) boundaries.
  • Multi-page boxes emit one PaginatedBox per page with the visible y-slice; preserves source IDs so PAINT can look up styles.
  • A4 portrait (96dpi) and Letter (8.5×11) page presets.
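
The slicing rule above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the Slice type and paginate_box function are hypothetical stand-ins for PaginatedBox and the real paginator, and only the y-axis is modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class Slice:
    page: int       # zero-based page index
    y: float        # top of the visible slice, relative to that page's content area
    height: float   # height of the visible slice


def paginate_box(box_y: float, box_height: float, content_height: float) -> list[Slice]:
    """Split one positioned box into per-page slices."""
    first = math.floor(box_y / content_height)
    if box_height <= 0:
        # Zero-height boxes still land on exactly one page.
        return [Slice(first, box_y - first * content_height, 0.0)]
    bottom = box_y + box_height
    slices = []
    page = first
    while page * content_height < bottom:
        top = max(box_y, page * content_height)
        end = min(bottom, (page + 1) * content_height)
        slices.append(Slice(page, top - page * content_height, end - top))
        page += 1
    return slices
```

A box starting at y=900 with height 300 on 1000-unit pages yields a 100-unit slice at the bottom of page 0 and a 200-unit slice at the top of page 1.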

Phase PAINT

  • Walks each PageFragment and emits text + borders into the existing PdfWriter / PageBuilder.
  • HTML→PDF Y-flip applied once at emission time so all internal coordinates stay top-down.

Corner-case fixes and follow-ups

After the initial cut of the HTML+CSS pipeline, corner-case validation surfaced a set of regressions and missing features. All of the below also ship in v0.3.37:

  • Tokenizer char-boundary safety. The CSS tokenizer's ignore_case lookahead indexed raw byte offsets on multi-byte characters, panicking on any CSS source that put non-ASCII inside a keyword-adjacent position. Fixed.
  • Block sizing for inline-text flow. Block boxes with only-inline children were given zero intrinsic height, so paint-time y-coordinates collapsed; multi-paragraph documents dropped every paragraph but the first, and long single paragraphs retained only ~20 % of their words. run_layout now reserves intrinsic height from the body font size and the inline run count.
  • Arabic / RTL shaping. Paint now routes RTL paragraphs through the rustybuzz shaper (feature system-fonts) so contextual forms, ligatures, and visual reordering all work.
  • Multi-font cascade. New Pdf::from_html_css_with_fonts(html, css, Vec<(family, bytes)>). CSS font-family on any element resolves against the registered families (case-insensitive, with/without quotes); unknown families fall back to the first registered font. Walks up the box tree so inline children inherit their ancestor's family.
  • Page breaks. page-break-before: always and page-break-after: always now open a fresh page, both via CSS rules and via inline style="...". Multiple breaks accumulate.
  • ::before / ::after generated content. New cascade::pseudo_content_for(ss, element, PseudoKind::{Before,After}). Literal strings, attr(name), and open-quote/close-quote all resolve.
  • Opacity + transform: translate*(). opacity <= 0.01 on any ancestor hides an element and all its text descendants. transform: translateX/Y/translate(…) applies as a pre-paint offset on the box's x/y.
  • <img> data-URI embedding. <img src="data:image/png;base64,…"> (and data:image/jpeg;…, percent-encoded plain payloads) now decode to a real PDF Image XObject. The paint pipeline emits /Do operators against a per-page /XObject resource dictionary which PdfWriter::finish() now serializes — the missing resource-dict wiring was why prior page.add_element(Image(…)) calls rendered as silent no-ops. External URLs / filesystem paths return None from decode_image_src so callers can resolve those themselves.
  • List markers. <ul> items get • (U+2022) and <ol> items get N. numbering, painted in the gutter to the left of the <li>'s content box. Nested lists work on both levels.
  • <a href> link annotations. Every anchor box with a non-empty href emits a PDF /Link annotation carrying a /URI action; inline text inside the anchor inherits the link by walking up the box tree. Anchors with no href emit no annotation.
  • Embedded fonts via DocumentBuilder (#382). New DocumentBuilder::register_embedded_font(name, EmbeddedFont). Text emitted through the fluent builder (FluentPageBuilder::font(name, size).text(...), or any ContentElement::Text whose FontSpec.name matches a registered embedded font — including template headers/footers) is now routed through the Type-0 / CIDFontType2 path instead of silently falling back to Helvetica. CJK, Cyrillic, Greek, Hebrew, Arabic text emitted via the high-level API now actually embeds and renders. Unregistered font names continue to resolve against the base-14 set. Reported by @sparkyandrew.

Bug fixes surfaced during pre-release review

  • Base-14 bold text rendered non-bold. The page /Resources /Font dictionary keyed entries with dashes stripped (HelveticaBold) while content streams emitted Tf /Helvetica-Bold. PDF readers silently fell back to the default font, so every bold or italic base-14 run came out regular. Resource-dict keys now match the Tf operator names exactly.
  • TTC system fonts (Helvetica.ttc, msgothic.ttc, …). fontdb surfaces collection fonts as Source::SharedFile(path, …), which the resolver previously rejected as NoPath. SharedFile entries are now read the same way as regular files, so a huge swathe of macOS/Windows system fonts become resolvable.
  • Unquoted multi-word font-family. font-family: DejaVu Sans, sans-serif tokenises as two separate Idents, so the registered-family lookup never matched them as a single name. The resolver now collects consecutive idents (whitespace-separated) into one candidate and flushes at top-level commas, so quoted and unquoted forms behave the same.
  • Memory leak in Pdf::from_html_css / from_html_css_with_fonts. The factories leaked the combined CSS source, parsed stylesheet, DOM, and family map on every call (four Box::leak sites). Long-running processes (HTTP servers, batch converters) grew unbounded. The downstream APIs all accept non-'static references; the function now holds them in locals scoped to the call.
  • PNG alpha / soft-mask now renders. ImageData::from_png already decoded and compressed the alpha channel, but ImageContent had no field for it and the XObject emitter hard-coded SMask = None. ImageContent gains a soft_mask field, the html_css paint pipeline propagates it, and the XObject path actually emits a /SMask stream.
  • Shaped text round-trips via extract_text. The shaped path (add_shaped_embedded_text) only recorded glyph IDs in the subsetter, leaving shaped runs absent from the ToUnicode CMap, so their text could not be copied or extracted. The new encode_shaped_run maps glyph clusters back to source codepoints so the ToUnicode entries are complete for simple scripts and exact-leading-char for ligatures.
  • Reproducible PDF output. PdfWriter::finish iterated embedded_fonts directly from the HashMap, randomising object-ID order across runs. Embedded fonts are now emitted in registration order via an explicit embedded_font_order vector.
  • Embedded-font name collisions. Registering two fonts with the same display name silently overwrote the first. embedded_fonts is keyed by its EFn resource name (unique, monotonic) so registrations are independent regardless of display name.
  • fontdb Mutex serialised on slow disks. SystemFontDb::resolve held the fontdb lock across the font-bytes fs::read. Concurrent resolve calls are now lock-free during I/O — the lock is released once the face path + PostScript metadata are picked.
  • Misleading docs corrected. Module documentation previously claimed background-color rendered as a filled rect (currently a no-op stub) and that the writer embedded a subset of the face (currently embeds the full face + Identity-H, subsetter output is a later follow-up). Both are now reflected accurately in the relevant docstrings.
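
The font-family fix above (collect consecutive idents, flush at top-level commas, match case-insensitively, fall back to the first registered font) can be sketched roughly like this. It is a character-level Python approximation, not the real token-based resolver; family_candidates and resolve_family are hypothetical names.

```python
def family_candidates(value: str) -> list[str]:
    """Split a font-family value into candidate names at top-level commas."""
    candidates: list[str] = []
    current: list[str] = []
    quote = None  # the open quote character while inside a quoted name
    for ch in value:
        if quote:
            if ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in "\"'":
            quote = ch
        elif ch == ",":
            # Flush: whitespace-separated idents collapse into one name.
            candidates.append(" ".join("".join(current).split()))
            current = []
        else:
            current.append(ch)
    candidates.append(" ".join("".join(current).split()))
    return [c for c in candidates if c]


def resolve_family(value: str, registered: list[str]) -> str:
    """Return the first registered family named in value, else the first registered font."""
    by_lower = {name.lower(): name for name in registered}
    for candidate in family_candidates(value):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return registered[0]
```

With this shape, `DejaVu Sans, sans-serif` and `"DejaVu Sans", sans-serif` produce the same candidate list, which is the behaviour the fix describes.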

Tests added in the corner-case pass

  • E2E (tests/test_html_to_pdf_e2e.rs): 36 tests (was 14), covering every feature above plus a kitchen-sink document that exercises ::before, list markers, page-break, opacity, translate, and <a href> in a single round-trip.
  • Unit: 4 cascade pseudo-element tests, 7 paint tests (opacity / translate / data-URI decode), 3 inline-text sizing tests, 1 RTL shaper test, 1 multi-font cascade test, 1 tokenizer multi-byte regression test.
  • Total test count: 4772 lib + 36 e2e; 168 integration suites all green, 0 regressions on the existing corpus.

Limits

The supported CSS surface is documented in detail in docs/HTML_TO_PDF_GUIDE.md. Out of scope: CSS filters, 3D transforms, animations, SVG-in-HTML (every viable Rust SVG crate is MPL), MathML, hyphens: auto, shape-outside, JavaScript execution, full-matrix transform (scale/rotate), gradients, and box-shadow.

Licence audit

cargo deny check licenses passes with zero MPL transitive dependencies. The Mozilla CSS stack (cssparser, selectors, html5ever, lightningcss, stylo) is all MPL-2.0; v0.3.37 hand-rolls the equivalents to keep pdf_oxide entirely under MIT/Apache.

Community Contributors

  • @jmriebold: filed #248 ("CSS support"). That single issue is the root of this release's entire HTML+CSS→PDF pipeline. The hand-rolled CSS engine, the HTML5 tokenizer + arena DOM, Taffy-backed layout, the ::before/::after, page-break-*, <img> data-URI, multi-font cascade, opacity / transform, <a href> link, and RTL shaping work all exist because he asked for it. Thank you.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform  Architecture            Archive
Linux     x86_64 (glibc)          pdf_oxide-linux-x86_64-*.tar.gz
Linux     x86_64 (musl)           pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux     ARM64                   pdf_oxide-linux-aarch64-*.tar.gz
macOS     x86_64 (Intel)          pdf_oxide-macos-x86_64-*.tar.gz
macOS     ARM64 (Apple Silicon)   pdf_oxide-macos-aarch64-*.tar.gz
Windows   x86_64                  pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

12 days ago
pdf_oxide

v0.3.36 | Markdown structural extraction quality vs pdfium — Tagged-PDF

Markdown structural extraction (#377)

The headline change of this release. to_markdown() previously consumed only the MCID order from /StructTreeRoot and then re-derived heading levels from font-size heuristics and list markers from glyph detection. For Word/Acrobat tagged PDFs whose body and heading text share a point size, this dropped every heading; for tagged lists where LI → LBody → MCR nests the actual content under a Span/P, this dropped every bullet; for tagged paragraphs whose inter-paragraph gap was less than 1.5× line height, this merged adjacent paragraphs.

This release wires the structure tree directly into the markdown pipeline:

  • Heading and list emission from /StructTreeRoot. New StructRole (Heading(1..6), ListItem, ListItemLabel, ListItemBody) attached to every span via the per-MCID lookup map. The converter prefers the explicit role over font-size heuristics so Word-tagged documents recover their full heading hierarchy. Lists emit - item with paragraph breaks at every role transition. (D1)
  • Heading / list role propagated through nested MCRs. Tagged PDFs commonly wrap heading content as H1 → Span → MCR and list bodies as LI → LBody → Span → MCR. The traversal now threads InheritedContext { heading_level, list_role } down both traverse_element and traverse_element_all_pages, so deeply nested MCRs carry the right semantic role. (D8b)
  • Per-/StructTreeRoot block boundary forces paragraph break. New OrderedContent.block_id increments on every entry into a block element (/P, /H1..6, /LI, /Lbl, /LBody, /Sect, /Div, /Art, /TR, /TH, /TD, /Note, /Reference, /BibEntry, /Code); the converter splits paragraphs whenever this changes between adjacent spans. Tight-gap layouts (pdfa_049-style) no longer merge. (D5)
  • Same-baseline gate against form-heading over-fragmentation. D5 alone over-split horizontal heading bands like # Form / # 1040 / # U.S. Individual Income Tax Return into three separate headings. The block-id transition now fires only when the spans are also on different visual lines; same-baseline pieces re-join into one heading. (D5b)
  • Multi-column gutter detection. Two spans on the same baseline separated by a horizontal gap > max(3 × font_size, 30 pt) are treated as belonging to different columns even when their block_ids would say otherwise — newspapers and two-column academic papers no longer concatenate cross-column tokens. (D5c)
  • Backward-x reading-order wrap detection. When the structure tree's reading order goes column-major (last span of column 1 at x=976 immediately followed by first span of column 2 at x=192, same baseline), the converter now recognises the wrap as a paragraph break instead of joining the two into a nonsense token like constitutionAssailing. (D5d)
  • Geometric heading + list-prefix detection for untagged docs. Bold + 5 % size bump promotes to H4. New is_ordered_list_marker(text) -> Option<u32> recognises 1. / 12. / a) / iv. / A. while conservatively rejecting figure captions (1.1 Foo) and years (1986). Bullet or ordered marker on a new line forces a paragraph break regardless of the geometric gap. (D2 / D3 / D4)
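
The multi-column gutter rule (D5c) boils down to one comparison. The Python sketch below assumes coordinates in points and measures the gap from the right edge of the left span to the left edge of the right span; crosses_gutter is a hypothetical name, not the converter's function.

```python
def crosses_gutter(left_end_x: float, right_start_x: float, font_size: float) -> bool:
    """True when two same-baseline spans are far enough apart to be separate columns."""
    gap = right_start_x - left_end_x
    # Threshold from the release notes: max(3 x font_size, 30 pt).
    return gap > max(3.0 * font_size, 30.0)
```

The 30 pt floor keeps small-font text from being split on ordinary word gaps, while the 3 × font_size term scales the threshold up for large type.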

RTL text — safe-by-default

  • Spurious **bold** markers around Arabic contextual glyphs are now stripped. Initial / medial / final shape transitions routinely flipped the font-weight detector and emitted single-letter emphasis runs; the converter now recognises and removes them.
  • Bidi reorder is OFF by default. An earlier draft of D7 ran unicode-bidi's visual→logical reorder on every RTL line; that broke previously-correct logical-order PDFs (Hebrew name בנימין was being reversed to ןימינב). Without a reliable signal for source order, the safer behaviour is to preserve the input ordering. The reorder helper remains exported from text::bidi::reorder_visual_to_logical for callers that know their input is in visual order.

Markdown output

  • Inline-image base64 data URIs capped at 200 KB. PDFs with high-resolution diagrams previously inflated markdown output by 10–20× (one 1.9 MB academic paper produced 11.3 MB of markdown). Images that exceed the cap now emit an HTML-comment placeholder noting the suppression and the original size. File-based image output (image_output_dir) is unaffected.
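
A rough Python sketch of the inline-image cap follows. The assumptions are labelled in the code: the cap is applied to the base64 payload length, 1 KB is taken as 1024 bytes, and the placeholder wording is invented for illustration. markdown_image is a hypothetical helper, not a pdf_oxide function.

```python
import base64

# Assumption: the 200 KB cap applies to the encoded payload, with 1 KB = 1024 bytes.
MAX_DATA_URI_BYTES = 200 * 1024


def markdown_image(alt: str, png_bytes: bytes) -> str:
    """Emit an inline data-URI image, or an HTML-comment placeholder when over the cap."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    if len(encoded) > MAX_DATA_URI_BYTES:
        # Placeholder text is illustrative; it records the suppression and original size.
        return f"<!-- image '{alt}' omitted: {len(png_bytes)} bytes exceeds inline cap -->"
    return f"![{alt}](data:image/png;base64,{encoded})"
```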

Tests

  • 80+ new unit tests in pipeline::converters::markdown::tests, structure::traversal::tests, and text::bidi::tests covering every defect with TDD-shaped RED→GREEN cases plus parametrised variations (all six heading levels, all three list roles, edge cases like clamped levels, baseline jitter, three-column layouts, the IA_0047 backward-x reproducer, etc.).

Empirical impact

Validated against v0.3.35 baseline on a 369-PDF regression spanning academic, government, forms, newspapers, technical, theses, IRS, pdfium, pdfjs, safedocs, and slow-corpus subsets:

  • 0 catastrophic regressions (no HEAD_FAIL, no SHRUNK_BIG on real content; the three sub-50-byte SHRUNK cases are pdfjs test fixtures where D5b same-line joining suppresses geometric heading detection on minimal content).
  • Token Jaccard vs pdfium and pdftotext: median 1.000 (perfect), ≥0.95 on 95/106 fixtures.
  • Token Jaccard vs pymupdf4llm: median 0.978, ≥0.95 on 65/106 fixtures.
  • ~2× more headings emitted than pymupdf4llm across the corpus — the structure-tree wiring lets pdf_oxide pick up section titles that font-only heuristics miss.
  • Per fixture (issue #377): nougat_002 0→4 H1s + 5→34 bullets; nougat_011 64→266 lines; word365_structure 0→1 H1 + 2→3 bullets; 2023-06-20-PV 0→4 H + 0→5 bullets.

Community Contributors

  • @Goldziher (kreuzberg) — filed #377 with a 727-document benchmark methodology (block-level SF1 + token-level TF1) comparing pdf_oxide against pdfium, plus 9 reproducer PDFs covering the worst structural-extraction regressions. The clarity of that report (per-pattern bucketing, per-fixture gaps, and an explicit "TF1 within ±3 % so text content is fine, structure is the issue" framing) made the entire investigation tractable. The single-PR unlock that drove this release was identifying that pdf_oxide had a complete structure-tree parser whose output the markdown converter was discarding — that framing came directly from the issue.

13 days ago
pdf_oxide

v0.3.35 | Narrow-glyph doublet preservation in text extraction

Text extraction correctness

  • Adjacent narrow-glyph doublets no longer collapsed at small font sizes (#378, PR #379). TextExtractor::deduplicate_overlapping_chars and deduplicate_overlapping_spans used a hardcoded 2 pt absolute threshold to detect duplicate glyphs from stroke+fill render passes. For narrow glyphs (l, r, I, i) in compact fonts at small sizes the per-glyph advance width drops to ≤ 2 pt (Helvetica l ≈ 2.0 pt at 9 pt), so legitimate adjacent doublets one full advance apart fell inside the dedup window and one of the two glyphs was silently dropped. Visible corruption included controller → controler, billed → biled, warranty → waranty, following → folowing, and VIII → VII. Builds on prior #102 / #253, which added same-text and same-character identity guards but kept the 2 pt threshold; this fix addresses the residual case where both glyphs are identical (passing the identity check) yet still legitimate neighbours. The threshold now scales with each glyph's own advance_width (fallback bbox.width) as min(advance_width * 0.30, 2.0). Real render-pass duplicates sit well under 5 % of one advance apart and continue to collapse; the heaviest kerning observed in the wild is ≤ 20 % of advance, so legitimate kerned neighbours are preserved. Tunables are hoisted to the TextExtractor::DEDUP_OVERLAP_RATIO / DEDUP_OVERLAP_CAP_PT associated constants so both dedup paths share one source of truth. Regression coverage spans the matrix of four narrow glyphs × three small body-text sizes (7 / 9 / 11 pt) on both the per-char and per-span paths, plus positive cases proving stroke+fill duplicates at ~0 pt offset still collapse.
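
The advance-scaled threshold can be illustrated with a short Python sketch. The constant names mirror the ones in the note; the helper functions and the strict less-than comparison are assumptions made for illustration.

```python
DEDUP_OVERLAP_RATIO = 0.30   # fraction of one advance width
DEDUP_OVERLAP_CAP_PT = 2.0   # absolute ceiling, in points


def dedup_threshold(advance_width: float) -> float:
    """Maximum x-offset at which two identical glyphs count as one render-pass duplicate."""
    return min(advance_width * DEDUP_OVERLAP_RATIO, DEDUP_OVERLAP_CAP_PT)


def is_render_pass_duplicate(same_glyph: bool, x_offset: float, advance_width: float) -> bool:
    # Assumption: strict comparison; the real code may use <= instead.
    return same_glyph and abs(x_offset) < dedup_threshold(advance_width)
```

For a narrow glyph with a 2.0 pt advance the window shrinks to about 0.6 pt, so a neighbour one full advance away survives while a stroke+fill duplicate at near-zero offset still collapses. For wide glyphs the 2.0 pt cap keeps the old behaviour.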

Community Contributors

  • @Hugues-DTANKOUO — Reported #378 with a precise root-cause analysis (the 2 pt absolute threshold falling below one advance width for narrow glyphs in compact fonts at small sizes) and authored PR #379 with the advance-scaled threshold and a parametrised regression matrix covering the four narrow glyphs across three body-text sizes.

14 days ago
pdf_oxide

v0.3.34 | Idiomatic page API, structured tables, column-order, image, and ICC colour fixes

API — Page abstraction (#371)

All four language bindings now expose a page object so callers can iterate a document and call extraction methods on the page directly. Named consistently as Page in Python, Node.js, C#, and Go.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    for page in doc:           # len(doc), doc[i], doc[-1] also work
        text = page.text
        md   = page.markdown(detect_headings=True)
  • Python: Page with lazy properties: text, chars, words, lines, spans, tables, images, paths, annotations; methods: markdown(), plain_text(), html(), render(), search(), region(). The pre-existing editor PdfPage is unchanged.
  • Node.js: Page with cached width/height/rotation and extraction methods. [Symbol.iterator] and page(index) added to PdfDocument. Six previously native-only methods wired into the TS layer: extractWords, extractTextLines, extractTables, extractPaths, getEmbeddedImages, ocrExtractText.
  • C#: Page with full sync + async surface. doc.Pages (IReadOnlyList<Page>) and doc[i] indexer added to PdfDocument.
  • Go: Page struct with full method surface. doc.Page(i) and doc.Pages() added to PdfDocument.

API — Structured table extraction with consistent naming (#289)

extract_tables() returns structured data — rows, cells with text and bounding boxes — not just Markdown. Available on both PdfDocument and the new Page objects across all bindings, with a single consistent type name Table:

Language  Type               Cell access
Rust      Table              iterate rows[i].cells[j]
Python    dict               row["cells"][i]["text"]
Go        Table              table.CellText(row, col)
C#        Table              table.CellText(row, col)
Node.js   Table (interface)  table.cells[row][col]

C# previously returned only (int RowCount, int ColCount) tuples — now returns a proper Table[] with cell text accessors, matching Go and Rust.

Text extraction correctness

  • Multi-column reading-order interleaving fixed (#319). On untagged multi-column PDFs (academic textbooks, genetics references), extract_text was applying XY-cut column ordering inside extract_spans() and then re-sorting with row-aware sort in extract_text_with_options, undoing the column structure. Result: garbled fragments like accompaally (= "accompa" from column 1 + "ally" from column 2). Fix: skip the row-aware re-sort when the page is genuinely multi-column. Verified on Hartwell Genetics, Murphy ML, and Kandel Neural Science textbooks — all known garbled tokens eliminated.
  • XY-cut column-detection improvements for mixed-layout pages (table + body text). Wide spans (>55% of region width) excluded from the projection density so tab-expanded table rows no longer fill the column gutter. Single-character spans (table cell values like G, T) excluded from projection so they don't scatter across the gutter. Coverage check uses character-count estimate rather than bbox width so tab-padded rows don't masquerade as dense body text.
  • Sparse-layout false-positive guard for is_multi_column_page. Copyright pages, title pages, and colophons can produce two X-center peaks with only 7-10 spans per "column" — these are no longer treated as multi-column, preventing XY-cut from splitting sentences whose halves are at different X positions on the same line.
  • Font-aware column-shape gate in is_multi_column_page. Fax-style and scattered-fragment layouts (each row built from several individually positioned word fragments) used to clear every prior multi-column check and routed through XY-cut, which then read the page column-major and could reverse fragments within a row. The new gate measures the fraction of side-spans falling into the largest X-cluster (cluster gap derived from the page's dominant em); body text scores ≥ 0.5 while scattered layouts score < 0.4. Pages that fail either side fall back to row-aware sort, so scanned-fax PDFs again read left-to-right line-by-line. Per-page font statistics are computed once via the new pdf_oxide::layout::PageFontStats type and reused by every threshold the layout pipeline derives.
  • Newline insertion on backwards-X jumps in span join. When the upstream sort handed the join loop two same-baseline spans whose X positions went backwards (a multi-column page whose XY-cut routing groups column-side spans across rows so adjacent iteration items share a Y band but belong to different visual rows), no separator was being inserted and texts glued together — producing tokens like instancesinstancesinstances from three table-header cells in a stats grid. Same-baseline pairs whose delta-x is more negative than 3 em now emit a newline.
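
The backwards-X rule in the last item can be sketched as below. This is a simplified Python illustration: it measures delta-x between span start positions and uses a plain space as the default separator, both of which are assumptions, and join_same_baseline is a hypothetical name.

```python
def join_same_baseline(spans: list[tuple[float, str]], em: float) -> str:
    """Join (x, text) spans from one Y band, breaking the line on large backward jumps."""
    out: list[str] = []
    prev_x = None
    for x, text in spans:
        if prev_x is not None:
            # A jump more negative than 3 em means the span starts a new visual row.
            out.append("\n" if (x - prev_x) < -3.0 * em else " ")
        out.append(text)
        prev_x = x
    return "".join(out)
```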

Distribution

  • Node.js Linux prebuild now portable across glibc 2.35+ systems. Previous builds were dynamically linked against libstdc++.so.6 requiring GLIBCXX_3.4.31 (GCC 13+), failing to load on Debian 12 stable, Ubuntu 22.04, and RHEL 8/9. Fix: binding.gyp now passes -static-libstdc++ and -static-libgcc, and the Linux runner is pinned to ubuntu-22.04 / ubuntu-22.04-arm (glibc 2.35). The resulting .node is fully self-contained for C++ runtime — ldd shows only libm/libc. Size impact: +210 KB.
  • Go installer documents @latest. go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest is now the recommended install command (the installer auto-resolves the matching version via runtime/debug.ReadBuildInfo()).
  • pkg.go.dev now shows Go documentation. The Go module (rooted at go/go.mod with module path github.com/yfedoseev/pdf_oxide/go) was returning Documentation not displayed due to license restrictions because pkg.go.dev's licensecheck only inspects the module's own subtree — it does not walk up to the repo root where LICENSE-APACHE + LICENSE-MIT live. Fix: duplicate both files into go/LICENSE-APACHE and go/LICENSE-MIT, filenames both on pkg.go.dev's accepted list. Takes effect on the next tag.
  • npm, NuGet, and PyPI packages now embed both licence files. Same class of gap as the Go fix: js/package.json's files list, the C# .csproj, and the maturin [tool.maturin] include all omitted the licence text so shipped artifacts lacked the notice MIT requires. js/package.json's license field also flattened to "MIT", contradicting the crate's declared MIT OR Apache-2.0; corrected to match. The C# csproj carried a deprecated <LicenseUrl> alongside <PackageLicenseExpression> that NuGet warns on — removed.
  • LICENSE-MIT copyright corrected. All four LICENSE-MIT copies (root, go/, js/, csharp/PdfOxide/) carried Copyright (c) The Rust Project Contributors left over from the cargo init template. Updated to Copyright (c) 2025-present Yury Fedoseev. Verified with google/licensecheck — all four still classify as 100% MIT, so pkg.go.dev / NuGet / npm license detection is unaffected.

CI

  • Free-disk-space step added to all Ubuntu jobs that do heavy Rust + Python builds. A v0.3.33 release-pipeline failure (No space left on device on actions-runner log writes) traced to GitHub Ubuntu runners filling up at the maturin build --release step. Now applied to python.yml test job (was only one fixed initially), ci.yml Python Bindings + WASM Build jobs, and release.yml Python wheel build matrix (Linux targets only via if: runner.os == 'Linux' guard).

Image extraction correctness

  • 4-bit-per-component Indexed images no longer decode to vertical-stripe noise (#375). The PNG predictor decoder was honouring the numeric /Predictor value from /DecodeParms instead of the per-row filter tag byte written into each row. ISO 32000-1:2008 §7.4.4.4 makes the per-row tag authoritative: a producer may declare /Predictor 12 (Up) on the parameters and still write tag 0 (None) on every row. Reading the declared predictor instead produced Up-cascade on raw index bytes, rendering a 710×1012 scanned-book page as a diagonal-stripe noise pattern. Reported by @Charltsing.
  • Indexed palette streams whose first byte is 0x0D (CR) or 0x0A (LF) no longer decode to solid black (#375). decode_stream_data was running a post-parse trim_leading_stream_whitespace pass that stripped CR/LF bytes from the start of every unencrypted stream. The parser already consumes exactly one EOL after the stream keyword per ISO 32000-1:2008 §7.3.8.1, so re-trimming corrupted binary streams that legitimately start with those bytes. For an Indexed-backed image, shrinking a 4-byte CMYK palette 0d 0c 0c 04 to 3 bytes pushed every lookup into the expander's out-of-range branch, producing (0,0,0) for every pixel. Reported by @Charltsing.
  • DeviceCMYK → DeviceRGB fallback now matches ISO 32000-1:2008 §10.3.5 (#375). All CMYK→RGB paths — image-level bulk conversion, Indexed-CMYK palette expansion, content-stream fill/stroke colour state, JPEG CMYK decoding — now use the spec's additive-clamp formula R = 1 − min(1, C + K). Four inline copies and three helper functions were collapsed onto this single form; the common multiplicative (1-C)(1-K) variant differed on heavily-inked samples and was the default we inherited from imaging libraries, not what the spec specifies.
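
To make the CMYK difference concrete, here is a small Python comparison of the spec's additive-clamp form against the multiplicative variant. Components are floats in 0.0..1.0; both function names are hypothetical and the code is an illustration, not the library's implementation.

```python
def cmyk_to_rgb_spec(c: float, m: float, y: float, k: float) -> tuple[float, float, float]:
    """Additive-clamp form from ISO 32000-1:2008 section 10.3.5."""
    return (
        1.0 - min(1.0, c + k),
        1.0 - min(1.0, m + k),
        1.0 - min(1.0, y + k),
    )


def cmyk_to_rgb_multiplicative(c: float, m: float, y: float, k: float) -> tuple[float, float, float]:
    """The common (1 - C)(1 - K) variant, shown only for comparison."""
    return ((1.0 - c) * (1.0 - k), (1.0 - m) * (1.0 - k), (1.0 - y) * (1.0 - k))
```

The two agree when either the colour channel or K is zero, but diverge on heavily inked samples: for C = 0.6, K = 0.6 the spec form gives R = 0.0 while the multiplicative form gives R = 0.16.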

Colour management (new)

  • Real ICC profile-driven colour conversion via qcms (#375; opt-out icc feature, on by default). When a PDF's /ICCBased colour space or /OutputIntents → DestOutputProfile provides an ICC profile, image extraction now compiles it to a qcms::Transform and routes CMYK samples through the CMM instead of the §10.3.5 fallback. RGB- and gray-ICCBased profiles use the same pipeline. The graphics-state rendering intent (/Intent on image dictionaries, /RI, or the ri operator) is honoured; unrecognised intent names fall through to RelativeColorimetric per §8.6.5.8. qcms is pure Rust (no C/FFI) so WASM and C# AOT builds keep working; opt out with default-features = false. Reported by @Charltsing.
  • New pdf_oxide::color module exposes IccProfile, IccHeader, RenderingIntent, and Transform for consumers that want to drive colour conversion directly.
  • Measured impact on a representative CMYK-heavy fixture (218 images, /ICCBased 4 throughout): mean PSNR vs poppler's reference rendering improved from 27.9 dB (§10.3.5 fallback) to 39.2 dB (qcms). Worst-case PSNR rose from 16.4 dB ("visibly wrong saturation") to 33.8 dB ("perceptually indistinguishable"). A representative blue swatch shifted from RGB(62, 142, 252) to RGB(58, 123, 190) vs the ICC reference's RGB(62, 124, 191).

Community Contributors

  • @SeanPedersen — Proposed the page-first API (#371) with lazy evaluation and sequence semantics. Python follows his design exactly; extended to Node.js, C#, and Go.
  • @pdenapo — Requested structured table extraction returning data structures rather than Markdown (#289), which prompted the cell-text API surfacing in C# / Node.js and the Table rename for cross-language consistency.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform   Architecture            Archive
Linux      x86_64 (glibc)          pdf_oxide-linux-x86_64-*.tar.gz
Linux      x86_64 (musl)           pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux      ARM64                   pdf_oxide-linux-aarch64-*.tar.gz
macOS      x86_64 (Intel)          pdf_oxide-macos-x86_64-*.tar.gz
macOS      ARM64 (Apple Silicon)   pdf_oxide-macos-aarch64-*.tar.gz
Windows    x86_64                  pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.

15 days ago
pdf_oxide

v0.3.33 | Text extraction, image correctness, and memory safety fixes

Text extraction correctness

  • ToUnicode CMap miss returns U+FFFD instead of ASCII ciphertext (#363). Subset Type0 fonts whose ToUnicode CMap doesn't cover a CID now emit the replacement character instead of falling through to the Identity-H cid-as-Unicode path that produced strings like %B+$%8A//$2*%01*1%6APP.
  • Intra-word TJ kerning no longer splits words (#365). Letter-pair kerning of 0.10–0.20 em inside single words ([(diffe) -150 (rent)]) no longer triggers space insertion. Validated on 5 Kreuzberg fixtures — zero split-word patterns.
  • Cyrillic / non-Latin text recovered from UTF-8 mojibake (#317). Fonts with Latin-only encoding and no ToUnicode CMap that carry raw UTF-8 byte sequences now decode correctly. Validated on issue20232.pdf — Russian engineering text readable.
  • FlateDecode partial-recovery rejects garbage output (#364). MS Reporting Services PDFs (nougat_026.pdf) whose content streams failed mid-decompress were returning 128 bytes of pseudo-random data. Partial-recovery paths now validate output via looks_like_real_stream before accepting. Pages 1/2/5 go from 0 → 848/792/321 bytes.
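
The ToUnicode change in the first item above can be sketched as a lookup that substitutes U+FFFD on a miss instead of reusing the CID as a code point. The `ToUnicode` alias and `decode_cids` function are hypothetical stand-ins for illustration; the crate's real CMap types are not shown in these notes.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed ToUnicode CMap: CID -> Unicode text.
type ToUnicode = HashMap<u16, String>;

/// Decode a run of CIDs, emitting U+FFFD for any CID the CMap does not cover.
fn decode_cids(cids: &[u16], cmap: &ToUnicode) -> String {
    let mut out = String::new();
    for cid in cids {
        match cmap.get(cid) {
            // A CMap entry may expand to several characters (e.g. a ligature).
            Some(text) => out.push_str(text),
            // Miss: do not reinterpret the CID as a Unicode code point.
            None => out.push('\u{FFFD}'),
        }
    }
    out
}
```

With an identity fallback, an unmapped CID such as 0x25 would surface as a literal `%`, which is how subset fonts ended up producing printable but meaningless ASCII.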

Image extraction

  • Indexed + ICCBased palette correctly resolves component count (#373). Unresolved ICC stream references inside the Indexed base array caused /N to default to 3 instead of reading the actual value (4 = CMYK), producing diagonal-stripe artifacts. Reported by @Charltsing.
  • Lab-base Indexed palettes converted to sRGB (#337). Palette bytes in CIE L*a*b* are now converted through Lab→XYZ→sRGB instead of being reinterpreted as raw RGB.
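
As a rough illustration of the Lab→XYZ→sRGB chain mentioned above, the sketch below converts one L*a*b* triple to 8-bit sRGB. It assumes a D65 white point to keep the example short; a PDF /Lab colour space declares its own /WhitePoint, and a different white point would need chromatic adaptation that is omitted here. All function names are hypothetical and not taken from pdf_oxide.

```rust
/// D65 reference white (an assumption for this sketch; see the note above).
const WHITE_D65: [f64; 3] = [0.95047, 1.0, 1.08883];

/// Inverse of the CIE L*a*b* companding function.
fn lab_f_inv(t: f64) -> f64 {
    const DELTA: f64 = 6.0 / 29.0;
    if t > DELTA {
        t * t * t
    } else {
        3.0 * DELTA * DELTA * (t - 4.0 / 29.0)
    }
}

/// Linear-light value -> gamma-encoded 8-bit sRGB channel.
fn srgb_encode(v: f64) -> u8 {
    let v = v.clamp(0.0, 1.0);
    let e = if v <= 0.0031308 {
        12.92 * v
    } else {
        1.055 * v.powf(1.0 / 2.4) - 0.055
    };
    (e * 255.0).round() as u8
}

/// Convert L* in 0..=100 and a*, b* (roughly -128..=127) to sRGB.
fn lab_to_srgb(l: f64, a: f64, b: f64) -> [u8; 3] {
    // Step 1: Lab -> XYZ relative to the reference white.
    let fy = (l + 16.0) / 116.0;
    let x = WHITE_D65[0] * lab_f_inv(fy + a / 500.0);
    let y = WHITE_D65[1] * lab_f_inv(fy);
    let z = WHITE_D65[2] * lab_f_inv(fy - b / 200.0);
    // Step 2: XYZ -> linear sRGB (standard D65 matrix).
    let r = 3.2404542 * x - 1.5371385 * y - 0.4985314 * z;
    let g = -0.9692660 * x + 1.8760108 * y + 0.0415560 * z;
    let bl = 0.0556434 * x - 0.2040259 * y + 1.0572252 * z;
    // Step 3: gamma-encode and quantise.
    [srgb_encode(r), srgb_encode(g), srgb_encode(bl)]
}
```

Reinterpreting the same three bytes as raw RGB skips all three steps, which is why Lab palettes previously came out with the wrong colours.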

Memory and performance

  • All internal caches bounded (PR #369, #354). Object cache (64 MB), font caches (256–512 entries), XObject span/image caches (1024 entries), and global CMap cache (1024 entries) all use FIFO eviction. Cache utilities extracted to src/cache.rs.
  • Path extraction OOM on chart-heavy PDFs fixed (PR #369). Added CTM-aware processed_xobjects dedup — same XObject at same position is deduplicated, same XObject at different positions processes separately.
  • Mutex poison resilience. MutexExt::lock_or_recover() replaces 72 .lock().unwrap() calls.
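
The two ideas above, entry-bounded FIFO caches and poison-tolerant locking, can be sketched together. The `MutexExt` trait name and `lock_or_recover()` method come from the notes, but the body below is a guess at the behaviour rather than the crate's source, and `FifoCache` is a hypothetical type that bounds by entry count only (the object cache is described as bounded by size, which this sketch does not model).

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;
use std::sync::{Mutex, MutexGuard};

/// Extension trait in the spirit of `MutexExt::lock_or_recover()`.
trait MutexExt<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T>;
}

impl<T> MutexExt<T> for Mutex<T> {
    fn lock_or_recover(&self) -> MutexGuard<'_, T> {
        // A poisoned lock still holds the data; take the guard instead of panicking.
        self.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
    }
}

/// Entry-count-bounded cache with first-in, first-out eviction (hypothetical).
struct FifoCache<K, V> {
    capacity: usize,
    order: VecDeque<K>,
    map: HashMap<K, V>,
}

impl<K: Hash + Eq + Clone, V> FifoCache<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), map: HashMap::new() }
    }

    fn insert(&mut self, key: K, value: V) {
        // Overwriting an existing key keeps its original queue position.
        if self.map.insert(key.clone(), value).is_some() {
            return;
        }
        self.order.push_back(key);
        // Evict the oldest insertion once the bound is exceeded.
        if self.order.len() > self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.map.get(key)
    }

    fn len(&self) -> usize {
        self.map.len()
    }
}
```

FIFO eviction needs no bookkeeping on reads, so `get` can take `&self`; an LRU policy would have to mutate the order on every hit.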

Dependencies

  • RustCrypto cipher 0.5 ecosystem (PRs #352, #295, #291). aes 0.8→0.9, cbc 0.1→0.2, sha2 0.10→0.11, sha1 0.10→0.11, md-5 0.10→0.11.

Test suite

  • 13 dead/stale ignored tests removed; 3 previously-ignored tests fixed and un-ignored.
  • Regression tests added: ToUnicode CID-miss (3 tests), FlateDecode stream boundary framing (4 variants), TJ intra-word kerning, Cyrillic encoding and UTF-8 sniff (2 tests), dedup flow-prose preference, reading-order glyph sort stability (2 tests), Indexed Lab palette conversion.
  • Suite: 6,300 passed, 0 failed, 228 ignored.
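
To give a sense of what the TJ intra-word kerning regression case checks, here is a minimal sketch of threshold-based space insertion over a TJ array. The `TjItem` type, the `join_tj` function, and the 0.25 em threshold are all assumptions for illustration; the notes only say that kerning of 0.10–0.20 em must no longer insert a space, not which threshold the extractor uses.

```rust
/// One element of a TJ array: a string, or a numeric adjustment expressed
/// in thousandths of a text-space unit.
enum TjItem<'a> {
    Text(&'a str),
    Adjust(f32),
}

/// Hypothetical threshold: gaps narrower than this fraction of an em are
/// treated as kerning, not as a word break. Chosen only so that the
/// 0.10-0.20 em range stays inside a word.
const SPACE_THRESHOLD_EM: f32 = 0.25;

/// Concatenate the strings of a TJ array, inserting a space only where
/// the adjustment opens a gap at least as wide as the threshold.
fn join_tj(items: &[TjItem]) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            TjItem::Adjust(n) => {
                // A negative adjustment moves the next glyph right, widening the gap.
                let gap_em = -*n / 1000.0;
                if gap_em >= SPACE_THRESHOLD_EM && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

Under these assumptions the array `[(diffe) -150 (rent)]` from the notes joins to a single word, while a -300 adjustment is wide enough to count as a word break.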

Community Contributors

Thank you to everyone who reported issues, filed reproducers, or contributed code for this release!

  • @Charltsing — Reported the Indexed + CMYK image extraction failure (#373) with a reproduction PDF and screenshot comparison against pdfimages (xpdf), which exposed the unresolved ICC stream reference bug that had been silently producing garbled diagonal-stripe artifacts since the Indexed palette support landed in v0.3.27.
  • @ddxtanx — Reported the unbounded memory growth during multi-page extraction (#354) with profiling data that showed object and font caches consuming 200 MB+ on a 609 KB arXiv PDF. This drove the bounded-cache work in PR #369.
  • @andrewjradcliffe — Authored PR #369 implementing bounded FIFO caches for all internal caches, CTM-aware XObject dedup for the path extractor OOM, MutexExt poison-recovery trait, Python binding hardening, and markdown inter-group spacing. The PR also included comprehensive unit tests for all new cache types.
