v0.3.41 | Real PDF/A conversion, LaTeX symbolic-font glyph rendering fix, and
This release exists because of the community. Special thanks to:
- @FireMasterK: reported #307 with a precise reproduction case, a LaTeX-generated PDF where accented characters and ligatures (ú, á, fi) rendered as blank gaps across all pages. The report identified the exact document class (DC/EC TrueType fonts with Mac Roman cmap, no /Encoding dict), which made the root cause in render_cid_direct() straightforward to isolate and fix.
- @sparkyandrew: followed up on #425 with #443, noticing that the output PDF was 2.32 MB when the two source images summed to under 1.6 MB, even after the #425 image-pipeline fix. That single observation pinpointed the missing XObject deduplication: the same image data encoded twice produced two independent compressed streams. Fixed.
- @potatochipcoconut: #418, the original PDF/A binding-completeness report that drove the full implementation in #442. convert_to_pdf_a() existed in Rust but was a no-op: it recorded actions and returned success while leaving the document bytes untouched. The report surfaced this silently-broken state across all seven bindings.
- @nickpetrovic: filed #444 with a precise four-row reproduction table showing ligature glyphs in subset Calibri fonts decoded to wrong Unicode codepoints (ti→O, tf→[, ft→e). The report included the exact PDF and the per-font-subset mapping failures, which led directly to the ICCBased color-space warn spam fix and the rowspan-label reading-order scramble fix.
- @RubberDuckShobe: reported #450, in which any PDF containing a PNG with an alpha channel showed a diagonal stripe through the image. A minimal reproduction confirmed the bug across Acrobat, Preview, and browser PDF viewers. The report made the scope unambiguous (every image with transparency was affected) and led directly to the missing DecodeParms fix in build_soft_mask_dict().
- @truffle-dev: first code contribution to the project; completed the CLI output-path fix for #412 in #452. The original audit in #412 covered all 11 CLI commands with exact line references and two proposed design options; the PR was clean on first submission. It picks up the four commands (crop, decrypt, delete, reorder) missed by the earlier partial fix, and also enforces -o/--output for merge instead of silently defaulting to the first input's directory.
- Real PDF/A conversion: XMP metadata stream, pdfaid:part/conformance identification, OutputIntents (sRGB), language tag, JavaScript removal; all 7 bindings (#418, #442).
- Symbolic TrueType glyph rendering: non-ASCII bytes (ú=0xFA, á=0xE1, fi=0x85) in DC/EC-style LaTeX fonts with Mac Roman cmap are no longer suppressed as spaces (partially fixes #307; follow-up cases reported by FireMasterK on 2026-04-29 remain open).
- Image XObject deduplication: the same image embedded twice is no longer re-encoded as two separate compressed streams; PDF size matches the sum of source images (#443).
- Diagonal-line artifact in transparent images fixed: missing DecodeParms in the soft-mask XObject caused a visible diagonal stripe in any PNG with an alpha channel (#450).
- Barcode SVG generation: pdf_barcode_get_svg no longer returns ERR_UNSUPPORTED; it generates real SVG for all 8 1D barcode types plus QR (#421).
- CLI output routing: crop, decrypt, delete, and reorder now write default output beside the input file instead of the current working directory; merge now requires -o/--output and errors up front instead of silently defaulting to the first input's directory. Completes #412.
convert_to_pdf_a() previously recorded conversion actions and returned success, but the document bytes were unchanged — the XMP metadata stream was constructed in memory and then discarded. This release rewrites the conversion core end-to-end:
- XMP metadata stream: a standards-compliant XMP packet is serialised and written as an indirect object, then wired into the document catalog as /Metadata. pdfaid:part and pdfaid:conformance are set per level: A1b → 1/B, A2b → 2/B, A2u → 2/U, A3b → 3/B.
- OutputIntents: a GTS_PDFA1 output intent referencing sRGB is injected when none is present. Idempotent: a second call detects the existing intent and does not duplicate it.
- Language tag: /Lang is written to the catalog when the validator raises MissingLanguage.
- JavaScript removal: /Names/JavaScript entries are stripped when present.
- Source bytes patched: doc.source_bytes is updated in-place; the document is immediately re-parseable after conversion.
- Font embedding (rendering feature): embed_font() now resolves the 14 standard PDF Type1 PostScript names (Helvetica, Courier, Times-Roman, …) to the metrically-equivalent URW Base 35 open-source fonts shipped by default on Linux (Nimbus Sans, Nimbus Mono PS, Nimbus Roman). With --features rendering all B-level PDFs convert to 0 remaining errors, including FontNotEmbedded. Three bugs were fixed in the embedding pipeline:
  - try_fix_error dedup applied to error codes, so only the first FontNotEmbedded error was processed and the remaining fonts were skipped; fixed to dedup per-error-code for non-font errors only.
  - write_full_to_writer wrote font objects from the original source instead of preferring staged modified_objects; fixed to use the same priority order as the general object sweep.
  - add_structure() only added /StructTreeRoot but not /MarkInfo /Marked true; the validator requires both for PDF/A-*a conformance. Fixed.
Test coverage — 17 new end-to-end roundtrip tests in tests/test_pdfa_roundtrip.rs verify every fixable scenario (validate → convert → validate). The showcase_pdfa_conversion CI example is rewritten to assert correctness and panics on any regression.
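The level-to-identifier mapping listed above can be sketched in a few lines. This is an illustrative stand-in, not the library's code: the `Level` enum and both function names are hypothetical, and writing the two properties as attributes is just one of the serialisations XMP allows.

```rust
/// Hypothetical mirror of the conformance levels named in these notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    A1b,
    A2b,
    A2u,
    A3b,
}

/// Returns (pdfaid:part, pdfaid:conformance) for a level.
pub fn pdfaid(level: Level) -> (u8, char) {
    match level {
        Level::A1b => (1, 'B'),
        Level::A2b => (2, 'B'),
        Level::A2u => (2, 'U'),
        Level::A3b => (3, 'B'),
    }
}

/// Renders the two identification properties as XMP attribute text.
pub fn pdfaid_attributes(level: Level) -> String {
    let (part, conformance) = pdfaid(level);
    format!("pdfaid:part=\"{part}\" pdfaid:conformance=\"{conformance}\"")
}
```

Keeping the mapping in one `match` means adding a level without choosing its part and conformance is a compile error.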
All seven bindings expose the updated function:
| Binding | API |
|---|---|
| Rust | convert_to_pdf_a(&mut doc, PdfALevel::A2b)? |
| Python | pdf_oxide.convert_to_pdf_a(doc, "A2b") |
| WASM | convertToPdfA(doc, "A2b") |
| C FFI | pdf_oxide_convert_to_pdf_a(doc, level, &out) |
| C# | Compliance.ConvertToPdfA(doc, PdfALevel.A2b) |
| Go | compliance.ConvertToPdfA(doc, compliance.PdfALevelA2b) |
| Node.js | compliance.convertToPdfA(doc, "A2b") |
LaTeX-generated PDFs using DC/EC fonts (Dcr10, Dcsl10, etc.) embed symbolic TrueType fonts with these characteristics:
- /Flags has the symbolic bit set (bit 3, value 4)
- No /Encoding dictionary
- Mac Roman format-0 cmap (platform 1, encoding 0): byte code → glyph ID
- No Windows Unicode cmap
pdf_oxide correctly routes these through the render_cid_direct() path, which resolves each content-stream byte to a glyph ID via the Mac Roman cmap. The bug was one line in the space-detection guard:
// Before — bytes without a Unicode mapping fell through to unwrap_or(' ')
let char_at_pos = char_str.chars().next().unwrap_or(' ');
if char_at_pos.is_whitespace() { /* skip draw */ }
Any byte whose Unicode mapping returned None — including ú (0xFA → GID 85), á (0xE1 → GID 83), and fi (0x85 → GID 75) — was treated as a space, so the is_whitespace() guard blocked glyph drawing entirely.
// After — '\0' is not whitespace; GID ≠ 0 glyphs are drawn correctly
let char_at_pos = char_str.chars().next().unwrap_or('\0');
Verified pixel-perfect against Poppler and MuPDF on the #307 reproduction PDF. Regression-tested across 69 PDFs (120 page comparisons) — zero regressions in rendering, plain text, Markdown, and HTML extraction.
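The effect of the one-character change can be checked in isolation. In the sketch below the two function names are hypothetical, and `char_str` stands for the (possibly empty) Unicode mapping of a byte; only the two fallback expressions come from the snippets above.

```rust
/// Old guard: an empty Unicode mapping falls back to ' ', so the glyph is skipped.
pub fn skips_glyph_before(char_str: &str) -> bool {
    char_str.chars().next().unwrap_or(' ').is_whitespace()
}

/// New guard: an empty mapping falls back to '\0', which is not whitespace,
/// so only genuine whitespace characters are skipped.
pub fn skips_glyph_after(char_str: &str) -> bool {
    char_str.chars().next().unwrap_or('\0').is_whitespace()
}
```

An unmapped byte, modelled here as an empty string, is skipped by the first function and drawn by the second, while a real space is still skipped by both.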
Two issues surfaced while investigating #444 (Calibri ligature mis-mapping, which is an upstream macOS Quartz PDF producer bug with no fix possible on our side):
ICCBased color space warn spam — PDF producers that register ICCBased profiles under user-defined names (e.g. Cs1, Cs2) caused the text extractor to fire a WARN log on every sc/SC/scn/SCN operator that used such a name. The catch-all _ branch in the color-space handler did not know how to handle named references, so it logged and left the color unchanged. The fix: apply a component-count fallback in that branch (1 component → gray, 3 → RGB, 4 → CMYK) and demote the log to DEBUG. Affected PDFs with large amounts of colored text (like typical Office documents) emitted 96+ spurious warnings per page; now silent.
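A minimal sketch of the component-count fallback described above. The `Color` enum and the function name are hypothetical; the real handler works on the extractor's own types.

```rust
/// Hypothetical colour value used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Color {
    Gray(f32),
    Rgb(f32, f32, f32),
    Cmyk(f32, f32, f32, f32),
}

/// Guesses the colour model from the operand count of an sc/SC/scn/SCN
/// operator when the colour space name is not recognised.
pub fn color_from_components(components: &[f32]) -> Option<Color> {
    match components {
        [g] => Some(Color::Gray(*g)),
        [r, g, b] => Some(Color::Rgb(*r, *g, *b)),
        [c, m, y, k] => Some(Color::Cmyk(*c, *m, *y, *k)),
        // Any other count leaves the current colour unchanged.
        _ => None,
    }
}
```

The guess is a heuristic: the operand count matches an ICC profile's component count, but not necessarily its exact colorimetry.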
Text span reading-order scrambling — reorder_rowspan_labels, a function that promotes vertically-centered table row labels to sort at the top of their row block, was incorrectly activating on single-column prose documents (resumes, reports). It identified spans at rightward X positions as a "sparse column" and promoted them to wrong Y coordinates, causing line-continuation text like "to assess technical needs and" or "-making." to appear before the earlier line they followed.
Root cause: the label-candidate filter did not exclude spans whose Y-band already appears in the dense column. Genuine rowspan labels are vertically between data rows, so their Y-band is absent from the dense column. Line-continuation spans share the Y-band of the main column text and must not be treated as labels. The fix adds that exclusion:
// Before — any sparse-column span in the data Y range
y > data_bot && y < data_top
// After — additionally exclude spans that align with a dense-column row
y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))
The original rowspan-label behavior for actual table layouts (CJK lab reports, mixed-column tables) is preserved; the existing test confirms that genuine between-row labels are still promoted correctly.
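The filter can be exercised on its own. In this sketch the predicate is the "after" expression shown earlier; the 2 pt band height inside `band_of` and the name `is_label_candidate` are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical quantisation: y values within the same 2 pt band count as one row.
pub fn band_of(y: f32) -> i32 {
    (y / 2.0).floor() as i32
}

/// A sparse-column span is a rowspan-label candidate only if it lies inside the
/// data Y range and does not share a Y band with any dense-column row.
pub fn is_label_candidate(
    y: f32,
    data_bot: f32,
    data_top: f32,
    dense_bands: &HashSet<i32>,
) -> bool {
    y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))
}
```

With dense rows at y = 700 and y = 680, a span at y = 690 sits between rows and stays a candidate, while a span at y = 700.5 shares a band with the first row and is rejected as line-continuation text.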
When the same image data was passed to page.image() or from_bytes() on multiple pages, pdf_oxide encoded it as independent XObjects — each carrying the full compressed pixel data. A 760 KB PNG embedded twice contributed 1.52 MB instead of 760 KB; the #443 reproduction produced 2.32 MB from images totalling under 1.6 MB.
The fix hashes the normalised stream bytes after calling image_content_to_xobject_stream(). Hashing before normalisation failed across API paths: an image supplied via page.image() (which accepts raw file bytes and decodes them internally) and the same image supplied via ImageContent::from_bytes() produced different pre-encoding byte strings but identical post-normalisation compressed streams. Hashing after normalisation ensures the key is stable regardless of which API path the caller used. The key is (hash, byte_length) over the compressed pixel data; if a matching entry is already registered in the document's XObject map, the existing reference is reused and no new stream is written.
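The lookup can be sketched with the standard library alone. `XObjectRegistry` and its methods are hypothetical names, and `DefaultHasher` stands in for whichever hash function the library actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical registry of image streams, keyed by (hash, byte_length).
#[derive(Default)]
pub struct XObjectRegistry {
    by_key: HashMap<(u64, usize), usize>,
    streams: Vec<Vec<u8>>,
}

impl XObjectRegistry {
    /// Registers a normalised stream and returns its index. An identical
    /// stream registered earlier is reused instead of being stored again.
    pub fn register(&mut self, normalised: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        normalised.hash(&mut hasher);
        let key = (hasher.finish(), normalised.len());
        if let Some(&id) = self.by_key.get(&key) {
            return id;
        }
        let id = self.streams.len();
        self.streams.push(normalised.to_vec());
        self.by_key.insert(key, id);
        id
    }

    /// Number of distinct streams actually stored.
    pub fn stream_count(&self) -> usize {
        self.streams.len()
    }
}
```

A 64-bit hash plus length can in principle collide, so a stricter variant would compare the stored bytes before reusing an entry.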
PDFs with PNG images that have an alpha channel displayed a diagonal stripe across the image when opened in Acrobat, Preview, and most other viewers.
Root cause: compress_image_data() prepends a PNG None-filter byte (0x00) to every scanline before Flate-compressing the pixel data, as required by FlateDecode with DecodeParms/Predictor=15. The main image XObject carried the correct DecodeParms dictionary, but build_soft_mask_dict(), which builds the /SMask XObject for the alpha channel, emitted no DecodeParms at all. Viewers therefore decompressed the raw Flate stream and treated the leading 0x00 filter byte of each row as an alpha pixel, so each row started one byte later than the row before it. The cumulative horizontal offset over hundreds of rows appears as a diagonal stripe.
Fixed by adding the same DecodeParms dictionary to the soft-mask stream:
DecodeParms { Predictor=15, Colors=1, BitsPerComponent=8, Columns=<width> }
Reported by @RubberDuckShobe in #450. Any PDF built with page.image() or ImageContent::from_bytes() where the source PNG has an alpha channel was affected; the fix is purely in the soft-mask stream header and does not change pixel data.
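The stripe is easy to reproduce on a tiny buffer. The sketch below is not the library's code and all three function names are hypothetical: it adds the None-filter byte to each scanline, then reads the result back with and without knowledge of the predictor. `row_bytes` must be non-zero.

```rust
/// Prepends PNG filter type 0 (None) to every scanline.
pub fn add_none_filter_bytes(pixels: &[u8], row_bytes: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(pixels.len());
    for row in pixels.chunks(row_bytes) {
        out.push(0x00);
        out.extend_from_slice(row);
    }
    out
}

/// What a viewer does without DecodeParms: it assumes rows are `row_bytes`
/// wide, so the filter bytes are read as pixel data.
pub fn read_without_predictor(filtered: &[u8], row_bytes: usize) -> Vec<Vec<u8>> {
    filtered.chunks(row_bytes).map(|r| r.to_vec()).collect()
}

/// What a viewer does with Predictor=15 declared: each row is `row_bytes + 1`
/// wide and the leading filter byte is dropped.
pub fn read_with_predictor(filtered: &[u8], row_bytes: usize) -> Vec<Vec<u8>> {
    filtered.chunks(row_bytes + 1).map(|r| r[1..].to_vec()).collect()
}
```

For a 3x3 mask with values 1 to 9, the predictor-unaware read yields rows `[0,1,2]`, `[3,0,4]`, `[5,6,0]`: the stray zero moves one column per row, which is the diagonal.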
pdf_barcode_get_svg was a stub returning ERR_UNSUPPORTED. Two root causes were blocking a real implementation:
- Format sentinel collision: pdf_generate_qr_code stored FfiBarcodeImage.format = 0, the same value as pdf_generate_barcode with format = 0 (Code128). The get_svg function had no way to distinguish QR from Code128. Fixed: QR codes now use the internal sentinel value 100 (outside the 0–7 range of 1D barcode types); the public pdf_barcode_get_format return value for QR codes changes from 0 to 100 accordingly.
- Missing SVG rendering path: barcoders 2.0 ships barcoders::generators::svg::SVG (enabled by default via features = ["svg"]), so no new dependency was required. For 1D barcodes, the encoding step is now factored into a private encode_1d helper shared by both generate_1d (PNG) and the new generate_1d_svg (SVG). For QR codes, generate_qr_svg rebuilds the code matrix from qrcode::QrCode::to_colors() and emits a compact inline SVG with <rect> elements, with no raster stage.
pdf_barcode_get_svg now returns a valid SVG string for all supported barcode types (Code128, Code39, EAN-13, EAN-8, UPC-A, ITF, Code93, Codabar, QR) when the barcodes feature is enabled.
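The QR path's "matrix to `<rect>` elements" step can be illustrated without the qrcode crate. `matrix_to_svg` is a hypothetical name; it assumes a square module matrix of booleans (true = dark) and does not reproduce the exact markup the library emits.

```rust
/// Emits one <rect> per dark module, scaled by `scale` user units.
pub fn matrix_to_svg(modules: &[Vec<bool>], scale: u32) -> String {
    let size = modules.len() as u32 * scale;
    let mut svg = format!(
        "<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 {size} {size}\">"
    );
    for (row, line) in modules.iter().enumerate() {
        for (col, &dark) in line.iter().enumerate() {
            if dark {
                let (x, y) = (col as u32 * scale, row as u32 * scale);
                svg.push_str(&format!(
                    "<rect x=\"{x}\" y=\"{y}\" width=\"{scale}\" height=\"{scale}\"/>"
                ));
            }
        }
    }
    svg.push_str("</svg>");
    svg
}
```

Because the output is pure vector markup, it scales without the blur a PNG-to-SVG wrapper would introduce.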
A previous partial fix (commit 9dd94c0) introduced output_beside() / output_dir_beside() helpers and converted five commands (watermark, compress, flatten, rotate, split). Four binary-output commands were missed and continued resolving the default output path relative to the current working directory:
- crop: now writes <stem>_cropped.pdf beside the input file.
- decrypt: now writes <stem>_decrypted.pdf beside the input file.
- delete: now writes <stem>_deleted.pdf beside the input file.
- reorder: now writes <stem>_reordered.pdf beside the input file.
merge previously silently defaulted to writing merged.pdf in the directory of the first input file when -o/--output was omitted. This silent fallback was the riskiest behavior in the CLI: callers who expected output beside a specific file got a surprise in a potentially unrelated directory. merge now requires -o/--output and exits with a clear error message if it is missing.
No library code was changed — all five files are in pdf_oxide_cli.
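The "beside the input" rule and the mandatory merge output can be sketched with std::path. Both function names below are hypothetical and are not the output_beside() helper from the CLI, whose signature is not shown in these notes.

```rust
use std::path::{Path, PathBuf};

/// Builds `<stem><suffix>.pdf` in the same directory as `input`.
pub fn default_output_beside(input: &Path, suffix: &str) -> PathBuf {
    let stem = input
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output");
    // with_file_name keeps the parent directory and swaps the last component.
    input.with_file_name(format!("{stem}{suffix}.pdf"))
}

/// Mirrors the new merge behaviour: no silent default, fail early instead.
pub fn require_output(output: Option<PathBuf>) -> Result<PathBuf, String> {
    output.ok_or_else(|| "merge requires -o/--output".to_string())
}
```

Deriving the path from the input rather than the working directory keeps results predictable when the CLI is invoked from scripts.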
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
See CHANGELOG.md for full details.
v0.3.40 | Image rendering fixes, dashed stroke + streaming table batch, digital
This release exists because of the community. Special thanks to:
- @sparkyandrew: six detailed bug reports (#382, #385, #386, #397, #401, #425) that drove the CJK font subsetter, encryption, font-name handling, and now the image rendering overhaul. Every report came with a reproduction case. Issue #425 specifically identified four separate rendering bugs and raised the API design question that led to ImageContent::from_bytes() and the new image() method across all bindings.
- @potatochipcoconut: three well-targeted reports (#409, #416, #417) that directly drove the manylinux glibc fix, the OCR wheel fix, and the discovery of the missing in-memory encrypted save API. Terse, precise, actionable every time.
- Image rendering: four bugs fixed in PNG/JPEG embed path (#425).
- New image API: ImageContent::from_bytes() + plain image() on all bindings; no pixel dims needed (#425).
- Dashed stroke + streaming table batch: StrokeRectDashed / StrokeLineDashed + StreamingTable bounded-batch API across all 7 bindings (#400).
- Digital signature verification: real RSA-PSS / ECDSA / TSA cryptographic checks (#420).
- Binding completeness: encrypted bytes (#423), barcode via C FFI (#421), Node.js validation (#424) and page extraction (#384), Python/Go convert_to_pdfa (#418/#419).
- Platform fixes: Python glibc 2.34 compat (#416), OCR wheels (#417), WASM rendering (#422), CLI output path.
- Security & hygiene: unsafe audit, dep freshness, SLSA provenance, SBOM, CodeQL, DCO (#415).
This release closes the image-rendering bugs reported in #425 by @sparkyandrew. Four bugs, all in the same family of incorrect assumptions in image_handler.rs / pdf_writer.rs:
- PNG color corruption (Predictor=15 mismatch): FlateDecode with DecodeParms/Predictor=15 promises PNG-style per-scanline filter bytes. The encoder was compressing raw pixels without prepending the required 0x00 (None-filter) byte before each row; viewers applied PNG unfiltering to raw data, corrupting every pixel. Fixed: compress_image_data() now prepends one 0x00 per scanline before Flate compression.
- Blank PNG via ImageContent::new(): image_content_to_xobject_stream() assumed data was already decoded pixel bytes. Passing raw PNG file bytes caused the PNG header to be treated as pixels, giving blank or garbage output. Fixed: magic-byte detection (89 50 4E 47) routes raw bytes through ImageData::from_png().
- JPEG zoom / wrong dimensions: same root cause; JPEG file bytes were not routed through ImageData::from_jpeg(), so the pixel dimensions stored in the XObject were wrong. Fixed by the same FF D8 magic-byte detection.
- Soft-mask (alpha) lost: PNG transparency was discarded when raw bytes were passed through ImageContent::new(). The new auto-detect path correctly threads the alpha channel through to the PDF /SMask XObject.
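The magic-byte routing in the fixes above amounts to a prefix check. The sketch uses hypothetical names (`Sniffed`, `sniff`, `png_dimensions`) and, as a bonus, reads the pixel dimensions from a PNG IHDR chunk; it says nothing about how ImageData::from_png() itself is implemented.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Sniffed {
    Png,
    Jpeg,
    Unknown,
}

/// Classifies raw file bytes by their leading magic number.
pub fn sniff(data: &[u8]) -> Sniffed {
    if data.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
        Sniffed::Png
    } else if data.starts_with(&[0xFF, 0xD8]) {
        Sniffed::Jpeg
    } else {
        Sniffed::Unknown
    }
}

/// Reads (width, height) from the IHDR chunk that follows the 8-byte PNG
/// signature: 4 bytes length, 4 bytes "IHDR", then two big-endian u32 values.
pub fn png_dimensions(data: &[u8]) -> Option<(u32, u32)> {
    if sniff(data) != Sniffed::Png || data.len() < 24 {
        return None;
    }
    if data[12..16] != b"IHDR"[..] {
        return None;
    }
    let width = u32::from_be_bytes([data[16], data[17], data[18], data[19]]);
    let height = u32::from_be_bytes([data[20], data[21], data[22], data[23]]);
    Some((width, height))
}
```

Anything that matches neither signature can then be treated as already-decoded pixel bytes or rejected.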
The bug report also identified a legitimate API design problem: every other PDF library (ReportLab, fpdf2, iText, PDFBox, PDFKit, printpdf, Prawn) auto-detects pixel dimensions from the image header — users only specify where the image appears on the page. ImageContent::new() required passing width and height explicitly, which callers typically had to look up from a separate decode step.
// Before — pixel dims required even though the library could read them itself
let img = ImageContent::new(bbox, ImageFormat::Png, raw_bytes, width, height);
// After — just bytes + on-page display rect; everything else auto-detected
let img = ImageContent::from_bytes(bbox, raw_bytes)?;
from_bytes() detects JPEG/PNG by magic number and reads width, height, color_space, bits_per_component, and the soft-mask channel from the image header. A plain image() method (no accessibility wrapper) was also missing from Go, C#, and Node.js — added to all three:
| Binding | Method |
|---|---|
| Rust | ImageContent::from_bytes(bbox, data)? |
| Go | page.Image(bytes, x, y, w, h) |
| C# | page.Image(bytes, x, y, w, h) |
| Node.js | page.image(bytes, x, y, w, h) |
| Python | page.image_from_bytes(bytes, x, y, w, h) (pre-existing) |
| WASM | page.image_from_bytes(bytes, x, y, w, h) (pre-existing) |
Use imageWithAlt / ImageWithAlt for PDF/UA-1 accessible figures and imageArtifact / ImageArtifact for decorative images.
Two FluentPageBuilder additions shipping across all 7 bindings:
- stroke_rect_dashed / stroke_line_dashed: stroke a rectangle or line with an explicit dash pattern (&[f32] on/off lengths + phase) and RGB colour. Complements the existing solid stroke_rect / stroke_line.
- StreamingTable bounded-batch API: set_batch_size(n), pending_row_count(), batch_count(), and flush() let callers control how many rows accumulate in memory before being flushed to the PDF content stream. Useful when streaming very large tables from a source that itself has natural chunk boundaries.
Both surfaces are available in Rust, Python, WASM, Go, C#, and Node.js / TypeScript. New examples/*/09-new-features/dashed_stroke/ examples ship in all four binding example directories.
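To make the bounded-batch idea concrete, here is a standalone model of the four calls named above. `BatchedRows` is a hypothetical type, and its semantics (auto-flush once the batch is full, `batch_count()` counting completed flushes) are assumptions, not documented StreamingTable behaviour.

```rust
/// Hypothetical row buffer that flushes whenever `batch_size` rows are pending.
pub struct BatchedRows {
    batch_size: usize,
    pending: Vec<Vec<String>>,
    batches_flushed: usize,
    rows_written: usize,
}

impl BatchedRows {
    pub fn new(batch_size: usize) -> Self {
        Self { batch_size: batch_size.max(1), pending: Vec::new(), batches_flushed: 0, rows_written: 0 }
    }

    pub fn set_batch_size(&mut self, n: usize) {
        self.batch_size = n.max(1);
    }

    pub fn pending_row_count(&self) -> usize {
        self.pending.len()
    }

    pub fn batch_count(&self) -> usize {
        self.batches_flushed
    }

    pub fn rows_written(&self) -> usize {
        self.rows_written
    }

    /// Buffers one row and flushes automatically when the batch is full.
    pub fn push_row(&mut self, row: Vec<String>) {
        self.pending.push(row);
        if self.pending.len() >= self.batch_size {
            self.flush();
        }
    }

    /// Writes out pending rows (modelled here as a counter) and empties the buffer.
    pub fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        self.rows_written += self.pending.len();
        self.pending.clear();
        self.batches_flushed += 1;
    }
}
```

With a batch size of 3, pushing 7 rows leaves 1 row pending after 2 automatic flushes, and a final `flush()` writes the remainder.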
SignatureInfo.verify() now performs real cryptographic verification instead of returning a stub result:
- RSA-PSS and RSA-PKCS#1 v1.5: verified against the embedded certificate public key via the rsa + sha2 crates.
- ECDSA (P-256 / P-384): verified via the p256 / p384 crates.
- TSA timestamp (Timestamp.verify()): full RFC 3161 countersignature verification covering CMS structure, signer certificate, and TSTInfo hash match.
Several APIs present in the Rust core and some bindings were missing from others. All are now consistent across all 7 bindings:
- In-memory encrypted save (#423): PdfDocument.to_bytes_encrypted(user_pw, owner_pw) saves with AES-256 encryption directly to bytes / Buffer / Vec<u8> without touching disk. Available in Python, Node.js, C#, Go, and the C FFI. Driven by @potatochipcoconut in #409.
- Barcode via C FFI (#421): pdf_add_barcode_to_page() embeds a generated barcode PNG onto a page at a given rect. Previously the function returned ERR_UNSUPPORTED; it now calls the new DocumentEditor::add_image_bytes_to_page() helper internally. C FFI only in this release; Go and C# wrappers are follow-up work.
- PDF/A, PDF/X, PDF/UA validation on Node.js (#424): PdfDocument.validatePdfA(), .validatePdfX(), .validatePdfUA() are now available in the Node.js binding, matching Python, Go, C#, WASM, and Rust.
- Page extraction in Node.js (#384): DocumentEditor.extractPagesToBytes(pageIndices) splits a multi-page PDF into per-chunk Buffer objects entirely in memory, no temp files needed. Example: const chunk = editor.extractPagesToBytes([0, 1, 2]); // → Buffer
- PDF/A conversion (#418/#419): PdfDocument.convert_to_pdfa(output_path, level) exposed in Python; pdf_convert_to_pdfa() C FFI + Go ConvertToPdfA().
- Python glibc 2.34 compatibility (#416): LLVM emits __memcmpeq (a glibc 2.35 symbol) in some optimised builds; wheels built against glibc 2.35 failed to load on Amazon Linux 2023 (glibc 2.34) and similar systems. Fixed by adding a global_asm! weak-symbol alias in src/lib.rs that maps __memcmpeq → memcmp. This works with both GNU ld and lld (unlike --defsym, which lld rejects for PLT-resolved symbols). Reported by @potatochipcoconut.
- Python OCR wheels (#417): published wheels omitted the ocr feature, so pip install pdf-oxide[ocr] installed silently but failed at runtime. Wheels now compile with --features ocr; the ORT library path is auto-detected on import. Reported by @potatochipcoconut.
- WASM rendering (#422): wasm-pack builds were missing the rendering feature flag, producing blank page images. All WASM targets now build with --features rendering.
- CLI binary output path: pdf-oxide render, pdf-oxide thumbnail, and other commands that produce binary output were writing to the working directory instead of next to the input file when no explicit output path was given. Fixed.
- #[forbid(unsafe_code)] on all modules that have no FFI business being unsafe; remaining unsafe consolidated into audited FFI helpers with handle_mut! / handle_ref! macros.
- lazy_static replaced with std::sync::OnceLock throughout.
- cargo update dep freshness sweep; lock file refreshed.
- cargo-geiger unsafe audit + cargo-outdated dependency check added to CI (both run monthly).
- CI: action SHAs pinned, OIDC publish, SLSA provenance level 3, SBOM (CycloneDX), OpenSSF Scorecard, CodeQL static analysis, DCO enforcement.
- Dependabot configured for all three ecosystems (cargo, npm, github-actions).
- SPDX licence headers added to source files; CODEOWNERS and CONTRIBUTING (DCO) added.
v0.3.39 | Tables (streaming + buffered), PDF/UA-1, digital signing (CMS/PKCS#7),
v0.3.39 originally shipped as a single release themed around table generation (issue #393). Mid-release we expanded the scope to close the broader post-#393 programmatic-builder gap audit (docs/v0.3.39/design/builder_gaps_plan.md, 26 items in 4 tiers). The release now delivers:
- Bundle C: shape primitives (circle, ellipse, polygon, arc, bezier_curve) + dash patterns on LineStyle.
- Bundle A: image placement (image_from_file / _from_bytes / _with) + 2D affine transforms (rotated, scaled, translated, with_transform; v0.3.39 scope is text-only, with path/image/table following in v0.3.40).
- Bundle B: document outline (bookmark, bookmark_tree), page labels (with_page_labels), ToC auto-generator (insert_toc).
- Bundle D (partial): list_box form widget, fluent field metadata (required / read_only / tooltip), page tab_order (TabOrder::{Row, Column, Structure}).
- Bundle E + F (research): RFCs for rich-text accumulator (docs/v0.3.39/design/e_rich_text_rfc.md) and PDF/UA compliance (docs/v0.3.39/research/e_pdf_ua_compliance.md). Implementation deferred to v0.3.40 (#400).
- Bundle D (deferred): signature_field widget, barcode-bound fields, JS-action field validation, calculated fields, XFA write-side → v0.3.40 (#400).
This release closes issue #393. Users who previously had to build giant HTML strings or drop to PdfSharp (the .NET community's canonical pain point — MigraDoc halts around 30 k rows with an O(rows²) autosize) can now stream tables of arbitrary size directly through DocumentBuilder. The release gate is a criterion benchmark that proves linear scaling from 1 k → 30 k rows; see the "Release gate" section below.
Design + research anchors live under docs/v0.3.39/:
- research/a_table_api_landscape.md: survey of 20 OSS PDF libraries across 6 ecosystems.
- research/b_scalable_layout_algorithms.md: why MigraDoc fails at 30 k rows + how to not repeat it.
- research/c_api_ergonomics.md: idiomatic API shape per binding.
- research/d_builder_gap_analysis.md: primitives we were missing to make tables compose.
- design/393_tables_decision.md: synthesis + scope split v0.3.39 / v0.3.40.
- Buffered Table (page.table(Table::new(rows).with_header_row()...)): takes the full row matrix, supports colspan / rowspan / rich per-cell styling, splits at row boundaries, emits ContentElements so the v0.3.38 subsetter continues re-keying CJK glyph IDs. Best for tables under ~1 k rows.
- Streaming StreamingTable (page.streaming_table(StreamingTableConfig::new().column(...).column(...))): row-at-a-time, TableMode::Fixed only (explicit widths, zero look-ahead), O(cols) persistent memory, auto page-break with repeat-header. Best for 1 k → ∞ rows. Solves the motivating MigraDoc 30 k-row failure directly.
use pdf_oxide::writer::{
CellAlign, DocumentBuilder, StreamingColumn, StreamingTableConfig,
};
let mut doc = DocumentBuilder::new();
let page = doc.letter_page().font("Helvetica", 10.0).at(72.0, 720.0);
let mut t = page.streaming_table(
StreamingTableConfig::new()
.column(StreamingColumn::new("SKU").width_pt(72.0))
.column(StreamingColumn::new("Item").width_pt(240.0))
.column(
StreamingColumn::new("Qty")
.width_pt(48.0)
.align(CellAlign::Right),
)
.repeat_header(true),
);
for record in huge_dataset { // never materialised
    t.push_row(|r| {
        r.cell(&record.sku);
        r.cell(&record.name);
        r.cell(record.qty.to_string());
    })?;
}
t.finish().done();
Both surfaces ship with idiomatic per-binding wrappers (Python, WASM, C#, Go, Node/TS). See each binding's README / guide for the native shape.
Shipped alongside tables because a credible table API needs them:
- measure(&str) -> f32: text width in points for the current font/size. Pure query; used to pick explicit column widths.
- text_in_rect(rect, text, align): wraps text to rect.width, aligns each line horizontally per TextAlign::{Left, Center, Right}. Cursor is deliberately NOT advanced because the rect has its own geometry. Finally honours TextConfig.align, which was a dead field for seven releases.
- stroke_rect(x, y, w, h, LineStyle) + stroke_line(p1, p2, LineStyle): stroke with explicit width + RGB colour. Previously rect() and line() only stroked at 1 pt black. LineStyle { width, color } is the new public type.
- remaining_space() + new_page_same_size(): the missing page-break signal. remaining_space() returns vertical points from cursor to the bottom margin; new_page_same_size() commits pending annotations and opens a fresh page with the same dimensions + carried text_config.
A criterion benchmark at benches/streaming_table_scaling.rs runs StreamingTable at 1 k / 5 k / 10 k / 30 k rows. Local numbers on the contributor machine (--quick):
| Size | Time | Throughput |
|---|---|---|
| 1 000 | 21.7 ms | 46.0 K rows/sec |
| 10 000 | 217.0 ms | 46.0 K rows/sec |
10× rows → 10× time → O(rows). MigraDoc's failure mode would have shown ~100× time at 10× input. Run the benchmark with cargo bench --bench streaming_table_scaling.
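The scaling claim is simple arithmetic on the two table rows, which the following sketch spells out. The function names are hypothetical and the inputs are the timings quoted above, not fresh measurements.

```rust
/// Rows processed per second for a run of `rows` rows taking `millis` ms.
pub fn rows_per_second(rows: u64, millis: f64) -> f64 {
    rows as f64 / (millis / 1000.0)
}

/// How many times longer the large run took than the small one.
pub fn time_ratio(small_ms: f64, large_ms: f64) -> f64 {
    large_ms / small_ms
}
```

For 1 000 rows in 21.7 ms and 10 000 rows in 217.0 ms, the time ratio is 10 and the throughput is the same in both runs, which is what linear scaling predicts; a quadratic algorithm would show a ratio near 100.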
- Multi-line cell rendering. The existing src/writer/table_renderer.rs computed row heights from wrapped text (wrap_text at :817) but only emitted the first line on render (:968-969, flagged as // Simple single-line rendering for now). Fixed by pre-computing wrapped lines + per-line widths once inside TableLayout.cell_layouts and looping them at render time.
- Per-line alignment. Center and Right alignment used cell_x + content_width / 2 and cell_x + content_width as the drawn-from x, which placed the text's left edge at the centre or right edge of the cell (so centre text was offset, right text was pushed off-cell). Fixed by using each wrapped line's measured width: cell_x + (content_width - line_width) / 2 for Centre, cell_x + content_width - line_width for Right.
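The corrected alignment formulas translate directly into code. `Align` and `line_start_x` are hypothetical names for this sketch; the three expressions are the ones given in the fix description.

```rust
#[derive(Debug, Clone, Copy)]
pub enum Align {
    Left,
    Center,
    Right,
}

/// X coordinate at which a wrapped line's left edge should be drawn.
pub fn line_start_x(cell_x: f32, content_width: f32, line_width: f32, align: Align) -> f32 {
    match align {
        Align::Left => cell_x,
        Align::Center => cell_x + (content_width - line_width) / 2.0,
        Align::Right => cell_x + content_width - line_width,
    }
}
```

For a cell starting at x = 100 with 200 pt of content width and a 50 pt line, the left edge lands at 100, 175, and 250 respectively, so the right-aligned line ends exactly at the cell's right edge (300).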
page.image_from_file("logo.png", Rect::new(72.0, 720.0, 120.0, 40.0))?
.rotated(15.0, |p| p.text("tilted caption"))
.scaled(1.5, 1.5, |p| p.text("enlarged footnote"));
- image_from_file(path, rect) / image_from_bytes(&[u8], rect) / image_with(ImageData, rect): auto-detect JPEG + PNG; alpha channels become /SMask XObjects for transparent placement.
- rotated(deg, |p| ...), scaled(sx, sy, |p| ...), translated(tx, ty, |p| ...), with_transform([a b c d e f], |p| ...): closure-scoped 2D affine transforms. Compose naturally (translated(50, 100, |p| p.rotated(45, |p| p.text("tilted"))) produces the expected composed matrix). v0.3.39 scope is text-only; Path / Image / Table elements gain a matrix field in v0.3.40. Rotated watermarks + stamps + captions are the common-case target today.
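For readers unfamiliar with the six-number [a b c d e f] form, the composition of nested transforms can be sketched as follows. The helper names are hypothetical, and the order (inner transform applied to the content first) is the usual PDF convention, assumed here rather than taken from the library source.

```rust
/// PDF-style affine matrix [a b c d e f]; a point maps to
/// (a*x + c*y + e, b*x + d*y + f).
pub type Matrix = [f32; 6];

pub fn translate(tx: f32, ty: f32) -> Matrix {
    [1.0, 0.0, 0.0, 1.0, tx, ty]
}

pub fn rotate(degrees: f32) -> Matrix {
    let (s, c) = degrees.to_radians().sin_cos();
    [c, s, -s, c, 0.0, 0.0]
}

/// Matrix that applies `first`, then `second`.
pub fn then(first: Matrix, second: Matrix) -> Matrix {
    let [a1, b1, c1, d1, e1, f1] = first;
    let [a2, b2, c2, d2, e2, f2] = second;
    [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]
}

pub fn apply(m: Matrix, x: f32, y: f32) -> (f32, f32) {
    (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])
}
```

Under that convention, a rotation nested inside a translation corresponds to `then(rotate(deg), translate(tx, ty))`: the content is rotated about the origin and then moved into place.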
doc.bookmark("Intro", 0)
.bookmark_tree(|o| {
o.add_item(OutlineItem::new("Chapter 1", 1));
o.add_child(OutlineItem::new("Section 1.1", 2));
})
.with_page_labels(
PageLabelsBuilder::new()
.add_range(PageLabelRange::new(0).with_style(PageLabelStyle::RomanLower))
.add_range(PageLabelRange::new(4).with_style(PageLabelStyle::Decimal)),
)
.insert_toc(0, "Table of Contents");
- bookmark(title, page_index) + bookmark_tree(|b| ...): outline / bookmarks emitted as the catalog /Outlines tree. Pre-existing OutlineBuilder was unused; this release is the fluent wiring + the end-to-end catalog emission it was missing.
- with_page_labels(PageLabelsBuilder): Roman preface + Arabic body or any PageLabelStyle mix, emitted as /PageLabels number-tree.
- insert_toc(insert_at, title): walks the bookmark tree and renders an indented ToC page with right-aligned page numbers. v0.3.39 limitation: doesn't auto-renumber existing bookmark targets (call before further bookmarks, or re-issue after).
page.circle(cx, cy, r, Some(LineStyle::new(1.5, 0.1, 0.2, 0.3)), None)
.ellipse(cx, cy, rx, ry, None, Some((0.9, 0.1, 0.1)))
.polygon(&points, Some(LineStyle::default()), Some((0.5, 0.5, 0.9)))
.arc(cx, cy, r, start, end, LineStyle::new(1.0, 0.0, 0.0, 0.0))
.bezier_curve(x0, y0, c1x, c1y, c2x, c2y, x3, y3, style, None)
.stroke_line(10, 100, 500, 100, LineStyle::new(0.5, 0, 0, 0).with_dash(&[3.0, 2.0], 0.0));
- `circle`, `ellipse`, `polygon`, `arc`, `bezier_curve`: five fluent shape primitives, each emitting one `ContentElement::Path` with optional stroke + fill. `circle` reuses `PathContent::circle`; `ellipse` / `arc` / `bezier_curve` build their quarter-Bezier approximations inline.
- `LineStyle::with_dash(&[f32], phase)` / `.solid()`: dash patterns propagate into `PathContent.dash_pattern`, emitted as `[...] phase d` before stroke and reset to solid after.
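As a rough illustration of the `[...] phase d` emission, the sketch below formats a dash pattern as a PDF content-stream operator. `dash_operator` is a hypothetical helper written for this example; the writer's real serialization code is not shown in these notes.

```rust
/// Formats a dash pattern as the PDF `d` operator, e.g. `[3 2] 0 d`.
/// An empty pattern yields `[] 0 d`, which resets the line to solid.
/// Hypothetical helper for illustration only.
pub fn dash_operator(pattern: &[f32], phase: f32) -> String {
    let items: Vec<String> = pattern.iter().map(|v| v.to_string()).collect();
    format!("[{}] {} d", items.join(" "), phase)
}
```

Under this sketch, a dashed stroke would be bracketed by `dash_operator(&[3.0, 2.0], 0.0)` before the path is stroked and `dash_operator(&[], 0.0)` afterwards, so later paths are not affected by the pattern.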
page.list_box("interests", 72, 600, 200, 80,
vec!["Hiking".into(), "Reading".into(), "Coding".into()],
Some("Coding".into()), true /* multi_select */)
.required()
.tooltip("Pick one or more")
.text_field("email", 72, 500, 200, 20, None)
.required()
.read_only()
.tab_order(TabOrder::Column);
- `list_box(name, x, y, w, h, options, selected, multi_select)`: wires the existing `ListBoxWidget` (fully implemented in `form_fields/choice_fields.rs`) through the public fluent surface.
- `.required()` / `.read_only()` / `.tooltip(text)`: chainable metadata that mutates the most-recently-added form field on the current page (no-op if no field has been added yet).
- `page.tab_order(TabOrder::{Row, Column, Structure})`: emits `/Tabs` on the page dict for reader tab-navigation order. `Structure` requires tagged PDF (Bundle F) to be meaningful.
page.heading(1, "Shopping list")
.bullet_list(&["Apples", "Bananas", "Cherries"])
.space(12.0)
.numbered_list(&["First chapter", "Second chapter"], ListStyle::Decimal)
.code_block("rust", "fn main() {\n println!(\"hi\");\n}");
- `page.bullet_list(items)`: bullets (•) with indent + per-item wrapping.
- `page.numbered_list(items, ListStyle::{Decimal, RomanLower, AlphaLower})`: Arabic, lowercase Roman, or lowercase alpha markers.
- `page.code_block(language, source)`: monospace text over a light-grey filled rectangle. `language` is reserved for Bundle F accessibility tagging; no syntax highlighting in v0.3.39.
- Helpers: `to_roman_lower(n)` and `to_alpha_lower(n)` exposed internally.
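For readers curious what the two marker helpers might look like, here is a minimal sketch. The function names match the ones listed above, but the signatures and edge-case behaviour (an empty string for 0, and `aa`, `ab`, ... after `z`) are assumptions made for this example rather than the crate's documented behaviour.

```rust
/// Lowercase Roman numeral for `n` (1 -> "i", 4 -> "iv"). Returns "" for 0.
pub fn to_roman_lower(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for (value, symbol) in TABLE {
        // Greedily take the largest symbol that still fits.
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    out
}

/// Lowercase alphabetic marker: 1 -> "a", 26 -> "z", 27 -> "aa". Returns "" for 0.
pub fn to_alpha_lower(mut n: u32) -> String {
    let mut bytes = Vec::new();
    while n > 0 {
        n -= 1; // shift to 0-based so that 26 maps to 'z', not to "a0"
        bytes.push(b'a' + (n % 26) as u8);
        n /= 26;
    }
    bytes.reverse();
    String::from_utf8(bytes).expect("only ASCII letters are pushed")
}
```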
Inline rich text (ParagraphBuilder with .bold() / .italic() / .color()), multi-column flow, and footnotes remain deferred to v0.3.40 — see the E-0 RFC at docs/v0.3.39/design/e_rich_text_rfc.md.
- `docs/v0.3.39/design/e_rich_text_rfc.md`: RFC for v0.3.40 inline-styling `ParagraphBuilder` with `.bold()` / `.italic()` / `.color(rgb, text)` cascading runs. ~770 LOC estimated for v0.3.40.
- `docs/v0.3.39/research/e_pdf_ua_compliance.md`: PDF/UA-1 compliance audit. Repo has ~40 % of the plumbing (`StructureElement`, MCID counter, `ArtifactType`) but MCIDs are orphaned (no StructTreeRoot emission). Bundle F lands in v0.3.40 as ~490 Rust LoC + 1,450 across 6 bindings.
- C FFI (`include/pdf_oxide_c/pdf_oxide.h`): six new entry points, namely `pdf_page_builder_stroke_rect`, `_stroke_line`, `_text_in_rect`, `_new_page_same_size`, `_table` (buffered), and the streaming trio `_streaming_table_begin` / `_push_row` / `_finish`. Handle-lifetime contract documented inline.
- Python (pyo3): new classes `Align`, `Column`, `Table`, `StreamingTable`; new `FluentPageBuilder` methods mirroring the Rust surface. `align` kwargs accept string, enum, or raw int interchangeably.
- WASM (wasm-bindgen): `Align` enum + `StreamingTable` class; buffered `table({columns, rows, hasHeader})` via serde-wasm-bindgen; `stroke_rect`, `stroke_line`, `text_in_rect`, `new_page_same_size`, `measure`, `remaining_space` on the page builder.
- C#: `Alignment`, `Column`, `TableSpec`, `StreamingTable : IDisposable`; fluent methods on `PageBuilder` including a managed-side streaming buffer that flushes on `.Build()`.
- Go (cgo): `Alignment`, `Column`, `TableSpec`, `StreamingTableConfig` under `go/types.go`; fluent methods on `*PageBuilder`; managed streaming adapter. Purego backend untouched (table surface is cgo-only in v0.3.39).
- Node/TS: `Align` enum + `StreamingTable` class in `js/src/builders/streaming-table.ts` with `pushRow`, `pushAll` (sync + async iterables), `finish`. All new types in `js/index.d.ts`.
Tables
- `TableMode::Sample`: measure first N rows, freeze widths, stream the rest.
- `TableMode::AutoAll`: opt-in O(rows × cols) with documentation warning.
- Cross-page cell splitting for tall rich cells.
- Bounded-lookahead rowspan in streaming mode.
- Arrow-style bounded batching on binding StreamingTables (current impl buffers all rows managed-side between `begin` and `finish`).
- Mixed-font exact metrics inside a single table (currently measures against the table default font).
- Pandas DataFrame first-class adapter in Python.
Transforms
- `TableContent`-as-a-whole matrix (individual cells compose naturally through their own `TextContent` / `PathContent` matrix fields, which now ship, but wrapping an entire Table in one transform needs a new field on `TableContent` itself).
Forms (rest of Bundle D)
- Signature-field form widget (coordinates with #208 signing half).
- Barcode-bound form field (auto-generate from another field's value at fill time).
- Field validation — regex mask, numeric range, JavaScript actions.
Layout (Bundle E) — blocked on E-0 RFC which ships in v0.3.39
- Inline rich-text styling (`ParagraphBuilder` with `.bold()` / `.italic()` / `.color()`).
- Multi-column flow on `DocumentBuilder` (currently only available through `Pdf::from_html_css`).
- Footnotes / endnotes.
Accessibility (Bundle F) — blocked on F-0 research which ships in v0.3.39
- Tagged PDF / logical structure tree emission.
- `/Lang` per content run.
- `/Artifact` marking for headers/footers on the write side.
- `/RoleMap` for non-standard structure types.
Advanced forms (Bundle G) — pick up on concrete customer demand
- Calculated fields / JavaScript actions.
- XFA write-side.
- #401: Encrypted PDFs were missing embedded-font sub-objects (`/Widths`, `/FontDescriptor`, `/FontFile2`); they are now included and referenced correctly. Reported by @sparkyandrew.
- #402 / #406: Systemic UTF-8 encoding loss. Every PDF string object (metadata titles, annotation contents, bookmark titles, content streams) was written as raw UTF-8 bytes instead of PDFDocEncoding (Latin-1 code point for chars ≤ U+00FF) or UTF-16BE with BOM (for chars > U+00FF). Reported by @AngeloBestetti (#402) and internally audited as #406.
- #407: L4 font cache cross-contamination. When two pages share the same `/Font` resource key (e.g. both use key `F1`), the CMap of the first-loaded face silently overwrote the second's glyph mapping, causing glyphs to be dropped or mis-decoded. Fixed by keying the combined-font hash over all font objects. Reported by @ChadThackray.
- #395: `SignatureException` on `PdfDocument.open()` for PDFs containing digital signatures. Fixed as a side-effect of the signing infrastructure (#208). Reported by @gevorgter.
- #398: Native PDF parser was non-reentrant; concurrent FFI reads on the same handle returned spurious parse errors. Resolved by the interior-mutability refactor (`Mutex<…>` on internal caches).
- #409: Python (and all bindings) lacked `to_bytes()` / in-memory output; `compress` and `garbage_collect` were not wired into the write path. Reported by @potatochipcoconut.
- #411: `p12 = "0.6"` (yanked / unmaintained) replaced with `p12-keystore = "0.2.1"` (RustCrypto-ecosystem, pure Rust, actively maintained). No public API change; `SigningCredentials::from_pkcs12` behaviour is unchanged.
- StreamingTable rowspan flush: `finish()` was silently dropping the in-progress rowspan group if the table ended mid-span. Added a flush of any partial `rowspan_buf` before finalising the page.
- `draw_rowspan_group` bounds guard: accessing `rows[0][col_idx].rowspan` was not guarded against `col_idx ≥ rows[0].len()`, causing a panic on narrow tables with rowspan cells. Added the bounds check `col_idx < rows[0].len()`.
- `scan_root_ref` anchoring: the digital-signature helper scanned the entire document for `/Root`, so a `/Root` reference embedded inside an annotation value or stream body could silently win over the real XRef `/Root` at the end of the file. Now mirrors `scan_startxref` by restricting the search to the last 4 KB of the file.
- Signature reason/location PDFDocEncoding: `/Reason` and `/Location` entries in CMS-signature dictionaries were written as raw UTF-8 bytes, bypassing the `encode_pdf_text_string` path. Non-ASCII characters (accents, CJK, etc.) were stored as illegal UTF-8 sequences in the PDF string. Now uses the same hex-encoded PDFDocEncoding/UTF-16BE path as all other string objects, closing the last #402-class gap in the signing path.
- #394: Mixed-size inline runs (superscripts, footnote markers) were incorrectly split onto separate lines because the newline gate used a hard-coded 2 pt Y-tolerance. Replaced with `PdfDocument::same_line_threshold`, a font-size-relative helper (`max(prev_fs, cur_fs) × 0.5`) shared across all seven Tagged-PDF assembly paths and `should_insert_space`. A forward-gap guard was added to prevent the widened threshold from merging spans across column gutters. Contributed by @RolandWArnold (#394).
- #403: Simple fonts without an explicit `/Widths` array fell back to a uniform 0.55 em default for every glyph. For standard-14 fonts (Helvetica, Times, Courier, etc.) this inflated span widths by up to 40 %, collapsing inter-column gaps from real values (e.g. 47 pt) to near-zero (5 pt) and breaking gap-dependent layout heuristics. The fast path now populates the byte-to-width table from `get_standard_font_width` when `/Widths` is absent; non-standard fonts and unmapped codepoints still fall back to the generic default. Contributed by @RolandWArnold (#403).
- #404: Span right-edges could drift ~0.02 pt outside the detected table bbox due to float accumulation in upstream width arithmetic. The strict `Rect::contains_rect` check then rejected those spans from the table's retain set, so they were emitted via both the table path and the flow path, producing duplicated text. Introduced a 0.1 pt tolerance at the two retain call sites in `document.rs` via `PdfDocument::contains_rect_with_tolerance`; the geometry primitive itself remains strict. Contributed by @RolandWArnold (#404).
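The two geometry fixes (#394 and #404) share one idea: replace a hard-coded or exact comparison with a small, explicit tolerance. The sketch below illustrates both under simplified assumptions. The `Rect` struct and free functions here are stand-ins written for the example and do not reproduce the crate's actual `Rect` or `PdfDocument` methods.

```rust
/// Axis-aligned rectangle. Simplified stand-in for the crate's own type.
#[derive(Clone, Copy, Debug)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Strict containment: `inner` must lie fully inside `outer`.
pub fn contains_rect(outer: Rect, inner: Rect) -> bool {
    contains_rect_with_tolerance(outer, inner, 0.0)
}

/// Containment that forgives up to `tol` points of drift on every edge.
pub fn contains_rect_with_tolerance(outer: Rect, inner: Rect, tol: f32) -> bool {
    inner.x >= outer.x - tol
        && inner.y >= outer.y - tol
        && inner.x + inner.width <= outer.x + outer.width + tol
        && inner.y + inner.height <= outer.y + outer.height + tol
}

/// Font-size-relative vertical threshold: max(prev_fs, cur_fs) * 0.5.
pub fn same_line_threshold(prev_fs: f32, cur_fs: f32) -> f32 {
    prev_fs.max(cur_fs) * 0.5
}

/// Two spans share a line when their baselines differ by at most the threshold.
pub fn on_same_line(prev_y: f32, cur_y: f32, prev_fs: f32, cur_fs: f32) -> bool {
    (prev_y - cur_y).abs() <= same_line_threshold(prev_fs, cur_fs)
}
```

Keeping the strict primitive and applying the tolerance only at the call sites that need it, as the fix describes, means other callers of the strict check see no behaviour change.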
- Resolved all Clippy, `rustfmt`, and `cargo check` failures that were blocking CI (`fix(ci)` commit `6c95bada`): unused-mut across 80+ files after the interior-mutability refactor, late-init variables, doc-comment ordering, non-minimal boolean conditions, deprecated function references.
- Renamed six test files from issue-number / benchmark-code names to functional descriptive names (`refactor(tests)` commit `fa071380`): `test_b1_*` → `test_shared_form_xobject_per_page_ctm`, `test_b3_*` → `test_running_header_first_occurrence_kept`, `test_b4_*` → `test_two_column_reading_order`, `test_b7_*` → `test_stroke_fill_duplicate_text_dedup`, `test_issue_346_*` → `test_extract_text_sort_comparator_stability`, `test_issue_395_*` → `test_signed_pdf_opens_and_renders`.
- Example smoke-tests in CI: all code examples are now compiled and executed on every CI run, catching binding API drift before it reaches a release. A dedicated `rust-examples` job runs all 13 Rust examples (`tutorial_*` + `showcase_*`). The Python, Go, Node.js, and C# binding jobs each gained an equivalent step that runs the per-language examples against `tests/fixtures/simple.pdf`. Any breaking change to a public binding API now fails CI immediately rather than being discovered post-release by users.
- Example restructuring: the single monolithic `09-new-features` showcase file per language was replaced with one standalone file per feature (`streaming-table`, `pdf-ua-image`, `in-memory-roundtrip`, `pkcs12-signing`, `rfc3161-timestamp`) across all 5 languages. Each file is a self-contained runnable program. The tutorial examples `01-08` were also repaired: Go examples gained `go.mod` + `go.sum` and had three API-drift regressions fixed (`OpenEditor`, `pdf.Save`, `RowCount` / `CellText`); JavaScript examples were migrated from CommonJS `require()` to ESM `import`; C# examples gained `.csproj` files referencing the local `PdfOxide` project.
- @RolandWArnold: First contribution to PDFOxide, and a substantial one at that. Roland identified three independent text-extraction correctness issues, traced each one to its root cause in the Rust source, wrote focused fixes with synthetic `PdfWriter`-based regression tests, and documented the behaviour thoroughly in PR descriptions that made review straightforward. #394 fixes the long-standing mixed-size inline run / superscript line-grouping problem; #403 restores correct span widths for standard-14 fonts without `/Widths`; #404 eliminates duplicate text caused by sub-pixel float drift at the table-retain boundary. Thank you, Roland; we look forward to more! 🚀
- @AngeloBestetti: Filed #402 with the concrete word `"Lógico"`, a Portuguese term that, when saved to PDF, came back as mojibake because every accented byte was being stored as raw UTF-8. That single reproducer uncovered a systemic encoding bug: all PDF string objects (metadata titles, annotation contents, bookmark labels, content-stream text) were silently corrupted for any non-ASCII character. The internal audit that followed produced #406 and a full rewrite of `write_escaped_string` + `encode_pdf_text_string` to emit PDFDocEncoding for chars ≤ U+00FF and UTF-16BE with BOM for anything above. Thank you.
- @sparkyandrew: Filed #401 after discovering that AES-256 encrypted PDFs built with `DocumentBuilder` opened successfully but rendered blank because the embedded font was gone. The root cause: `collect_reachable_ids` followed the top-level `Font` dictionary but stopped there, so `/Widths`, `/FontDescriptor`, and `/FontFile2` were garbage-collected as "unreachable" during the encrypted write pass. The fix traces the full font sub-object graph before encryption so the complete font survives. Thank you.
- @ChadThackray: Filed #407 after noticing that glyphs from one page silently replaced those of another whenever two pages shared the same `/Font` resource-key name (both using key `F1` but mapped to different faces). The L4 cache was keying the combined glyph-map on a spot-check of a single font object; the fix computes a combined hash over the complete font set, so any change to any face invalidates the entry. Thank you.
- @gevorgter: Filed #395 after a `SignatureException` from `RenderPage` on a 9-page signed PDF. The renderer was propagating a signature-parse failure as the page-render verdict even though no interactive widget lived on that page. The fix treats unparseable signature-field metadata as non-fatal at render time. @gevorgter also supplied the reproducer PDF that became the regression fixture (`tests/test_signed_pdf_opens_and_renders.rs`), ensuring this class of error can never silently return. Thank you.
- @potatochipcoconut: Asked in #409 how to get a `PdfDocument` as raw bytes from Python without writing to disk, and whether `compress` and `garbage_collect` were available. Neither worked. The question drove the `to_bytes()` / `SaveOptions` kwargs work that shipped in-memory output, compression, and garbage-collection across all 7 bindings, plus 18 missing `DocumentEditor` methods. Thank you.
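The encoding rule behind #402 / #406 can be illustrated with a short sketch. It is deliberately conservative and is not the crate's `encode_pdf_text_string`: the function name `encode_text_string_sketch` is invented here, and the single-byte path is limited to printable ASCII plus U+00A1..U+00FF (excluding U+00AD), a range assumed to be safe because PDFDocEncoding and Latin-1 agree there. Anything else takes the UTF-16BE path, which is also valid for a PDF text string.

```rust
/// Encodes a text string either as single bytes or as UTF-16BE with a BOM.
/// Illustrative sketch only; escaping and hex-encoding of the resulting
/// bytes inside a PDF string object are not shown.
pub fn encode_text_string_sketch(s: &str) -> Vec<u8> {
    let single_byte_ok = s.chars().all(|ch| {
        let cp = ch as u32;
        (0x20..=0x7E).contains(&cp) || ((0xA1..=0xFF).contains(&cp) && cp != 0xAD)
    });
    if single_byte_ok {
        // Each character maps to the byte with the same value as its code point.
        s.chars().map(|ch| ch as u32 as u8).collect()
    } else {
        // Byte-order mark FE FF, then big-endian UTF-16 code units.
        let mut out = vec![0xFE, 0xFF];
        for unit in s.encode_utf16() {
            out.extend_from_slice(&unit.to_be_bytes());
        }
        out
    }
}
```

For `"Lógico"` this yields six bytes with `ó` stored as the single byte `0xF3`, whereas raw UTF-8 would have written the two bytes `0xC3 0xB3` that readers then misinterpret as two separate characters.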
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
See CHANGELOG.md for full details.
v0.3.38 | DocumentBuilder fluent API across every language binding, real font subsetting, DocumentBuilder encryption, multi-target WASM packaging, and the first cryptographic slice of PDF signature verification
This release closes the "Rust-only DocumentBuilder gap": the fluent write-side builder, embedded fonts, the HTML+CSS pipeline, annotations, form-field creation, and low-level graphics primitives are now reachable from Python, WASM, C#, Go, and Node/TypeScript — the Rust implementation is the single source of truth and every binding is a thin translation layer. On top of that it lands the first cryptographic signature-verification path (RSA-PKCS#1 v1.5) across every binding and a pdf.js-parity fix for scanned / bilevel pages rendered under a Multiply-blended overlay.
Every binding now exposes the full DocumentBuilder fluent API:
# Python — the same shape ships in WASM, C#, Go, and Node/TS
font = EmbeddedFont.from_file("DejaVuSans.ttf")
(DocumentBuilder()
.register_embedded_font("DejaVu", font)
.a4_page()
.font("DejaVu", 12).at(72, 720).text("Привет, мир!")
.highlight((1.0, 1.0, 0.0))
.text_field("name", 150, 680, 200, 20, "Jane Doe")
.checkbox("subscribe", 72, 650, 15, 15, True)
.rect(50, 50, 500, 700)
.done()
.build())
Surface shipped in all 6 bindings:
- `DocumentBuilder` + `FluentPageBuilder` + `EmbeddedFont`: multi-page construction with CJK / Cyrillic / Greek support (closes #382 cross-language).
- HTML+CSS pipeline: `Pdf.from_html_css(...)` and `from_html_css_with_fonts(...)` for multi-font cascades.
- 15 annotation methods: link (URL / page / named), highlight, underline, strikeout, squiggly, sticky note, stamp (14 standard types + custom), free text, watermark (custom / DRAFT / CONFIDENTIAL).
- 5 AcroForm widget types: `text_field`, `checkbox`, `combo_box`, `radio_group`, `push_button`.
- Graphics primitives: `rect`, `filled_rect`, `line`.
- AES-256 encryption: `save_encrypted` / `to_bytes_encrypted` on every binding.
Per-binding regression tests for every capability above; ~70 new integration tests pass across Python (20), C FFI (11), C# (11), Go (11), Node/TS (10), and WASM (9).
Documents that embed a CJK face now ship a subset, not the full font. A PDF with 5 characters from NotoSansCJKtc-Regular.otf (~17 MB original) is typically under 100 KB. Content streams, /W widths, and ToUnicode CMap are all re-keyed onto the subset GID space; extract_text round-trips unchanged.
Breaking (v0.3.x semver-acceptable): EmbeddedFont::encode_string / encode_shaped_run now return Vec<u16> instead of a hex String, and build_embedded_font_objects returns a GlyphRemapper that callers must pass to ContentStreamBuilder::build_with_remappers. Internal writer-library consumers only — no change to high-level APIs.
AES-256 encryption is now available on programmatically-built PDFs:
DocumentBuilder::new()
.a4_page().text("secret").done()
.save_encrypted("out.pdf", "user-pw", "owner-pw")?;
Also: save_with_encryption (custom algorithm + permissions) and to_bytes_encrypted for in-memory output.
pdf-oxide-wasm now ships three builds side-by-side and routes each consumer through package.json conditional exports:
| Environment | Build |
|---|---|
| Node.js | nodejs/ |
| Bundlers (Vite, webpack, Rollup, esbuild, Bun) | bundler/ |
| Browsers / Deno / Cloudflare Workers | web/ |
Fixes ReferenceError: Can't find variable: __dirname thrown in any browser bundler. Subpath imports (pdf-oxide-wasm/web etc.) are also available for manual routing.
First cryptographically-backed signature surface on the reader side. Every binding (Signature.verify() / .verifyDetached() / equivalents) now runs the RFC 5652 §5.4 signer-attributes check against the embedded certificate and the §11.2 messageDigest check against the caller's document bytes:
for sig in doc.signatures():
print(sig.signer_name, "→", sig.verify()) # signer-attrs only
print("detached ok =", sig.verify_detached(pdf_bytes)) # + content hash
- RSA-PKCS#1 v1.5 over SHA-1 / SHA-256 / SHA-384 / SHA-512 (the padding used by effectively every signed PDF in the wild) returns `Valid` / `Invalid`.
- RSA-PSS and ECDSA surface as `Unknown` / `UnsupportedFeatureException` for now; callers that need those can still read the signer certificate via `Signature.GetCertificate()` and drive their own check.
- `SignatureVerifier::verify` (Rust) also stamps the verification result with trust-root lookup, expiry window, and signer DN pulled from the embedded certificate.
Supporting surface shipped alongside:
- `Certificate`: DER inspection (subject, issuer, serial, validity, `is_valid`) via `x509-parser`. Every binding.
- `Signature`: enumerate + inspect + `.GetCertificate()`. Every binding.
- `Timestamp`: RFC 3161 `TSTInfo` parsing (time, serial, policy, TSA name, hash algorithm, message imprint). Every binding.
- `TsaClient`: RFC 3161 HTTP POST with nonce + HTTP Basic auth, behind a new `tsa-client` Cargo feature. Every binding except WASM; intentionally not wired on WASM (ureq is wasm-incompatible).
- `DocumentEditor::set_producer` / `set_creation_date`: metadata writers.
- `render_page_region` / `render_page_fit`: clipped / fitted rendering surface.
- Bicubic image filtering (pdf.js#19978 parity): scanned / bilevel pages with a Multiply-blended overlay no longer collapse their grayscale range on downscale.
Signing (as opposed to verification) is not covered by this release; #208 remains open for the signing half.
Five thin-wrapper commits closed the last coverage holes in this release's signature surface — Python/Go/WASM Certificate inspect, Node Timestamp parse+verify, Node TsaClient HTTP. Every capability in the Supporting Surface list above is now the language-idiomatic shape across all six non-Rust bindings (modulo the principled WASM-TsaClient omission).
Go users can now build with CGO_ENABLED=0 via a second backend that uses ebitengine/purego to dlopen libpdf_oxide.{so,dylib,dll} at runtime — no C toolchain required. Backend selection is automatic via Go's built-in cgo tag (//go:build cgo → full CGo API, //go:build !cgo → purego).
The purego backend covers the read-side PdfDocument surface — open (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, fonts, annotations, page elements, search, page dimensions, logging — plus PdfCreator.FromMarkdown for test fixtures. Editor, DocumentBuilder, barcode, signature, TSA, rendering, OCR, and forms stay CGo-only; using them under !cgo is a compile-time error. Full parity is tracked for a follow-up.
Installer:
- New `-shared` flag fetches the cdylib instead of the staticlib and prints `CGO_ENABLED=0` + `PDF_OXIDE_LIB_PATH=…` to export.
- Install dir moved to `os.UserCacheDir()`: `~/.cache/pdf_oxide` on Linux, `~/Library/Caches/pdf_oxide` on macOS, `%LocalAppData%\pdf_oxide` on Windows. Matches Go's own `GOCACHE` convention; existing installs re-fetch once into the new path.
Release assets now include pdf_oxide-go-ffi-shared-<platform>.tar.gz for every Tier-1 platform alongside the existing staticlib archives.
- #395: `PdfOxide.Exceptions.SignatureException: '[8500] Signature error...'` raised by `doc.RenderPage(0, 0)` on a specific 9-page PDF reported by @gevorgter. The failure was the renderer propagating a signature-parse error up as the page-render verdict even though the page itself had no interactive signature widget on it. Fixed by treating unparseable signature-field metadata as non-fatal at render time; pinned by `tests/test_issue_395_render_signature_exception.rs` + the C# regression test so this can't silently come back.
Reports and feature requests from @sparkyandrew (#382 CJK via DocumentBuilder, #385 subsetter), @arthurlassagne (#392 browser build breakage), and @gevorgter (#395 RenderPage SignatureException). All three surfaced the gaps that drove this release.
v0.3.37 | HTML + CSS → PDF (issue #248) — first credible pure-Rust pipeline
let font = std::fs::read("DejaVuSans.ttf")?;
let mut pdf = Pdf::from_html_css(
"<h1>Hello</h1><p>World</p>",
"h1 { color: blue; font-size: 24pt }",
font,
)?;
pdf.save("out.pdf")?;
The whole feature: pass HTML + CSS + font bytes, get a paginated PDF back. Pure Rust, MIT/Apache only (no MPL transitive deps), extract_text round-trips byte-equal so produced PDFs participate in the existing test infrastructure.
End-to-end test suite at tests/test_html_to_pdf_e2e.rs covers simple paragraph, multi-paragraph, nested HTML, CSS-styled text, and Unicode (Latin + Latin-Extended + Cyrillic + symbols) round-trips.
- Subsetter wrapper around the `subsetter` crate (Typst's, MIT/Apache): `crate::fonts::subset_font_bytes(bytes, used_glyphs)` produces a subset face, and `EmbeddedFont` tracks used glyph IDs via the `FontSubsetter` type. The writer path currently embeds the full font face in `FontFile2` (full-face embedding + Identity-H is valid PDF 1.7 and round-trips correctly); switching to the subsetter's output requires remapping glyph IDs in the already-emitted content streams, which lands as a later follow-up. The standalone API + glyph tracking still ship so callers that use the subsetter directly (e.g. CLI tools shelling out to `subset_font_bytes`) get the size benefit today.
- Type 0 / CIDFontType2 / Identity-H / ToUnicode emission wired into `PdfWriter` so `add_embedded_text(text, x, y, "EFn", size)` produces a font dict graph that PDF readers handle correctly. Round-trip via `extract_text` returns the input string for Latin, Cyrillic, Greek, Hebrew, Arabic.
- System font discovery via `fontdb` (RazrFalcon, MIT). New `system-fonts` feature gates discovery + shaping; default-on for language bindings, off for WASM and the bare Rust crate.
- Text shaping via `rustybuzz` (HarfBuzz port, MIT). Returns positioned glyph runs with `cluster` info so the inline formatter can map glyphs back to source bytes.
10 modules, ~6,500 LoC, no MPL anywhere:
- Tokenizer (CSS Syntax L3) with full token coverage including CDO/CDC, hex+named entities resolution in `url()`, source locations.
- Parser producing `Stylesheet { rules: Vec<Rule> }` with forgiving recovery per spec.
- Selectors L3 + L4 subset: `:is` / `:where` / `:not` / `:has`, structural pseudo-classes, attribute matchers with `i` / `s` flags, specificity computation packed into a sortable u32.
- Matcher with `Element` trait so the engine isn't tied to one DOM implementation.
- Cascade with origin/specificity/source-order sorting, inheritance from parent for the spec's inherited-property list, inline-style merge, custom-property storage.
- `calc()` / `min()` / `max()` / `clamp()` evaluator with mixed-unit math against a `CalcContext`.
- `var()` substitution with DFS cycle detection.
- Typed property values for colour (~150 named, hex, rgb/rgba/hsl), length (every CSS Values L4 unit), display, font-size/weight/style/family, margin/padding shorthand expansion, line-height, etc.
- At-rules: `@media print` always-true + `(min/max-width)` predicates, `@page` with `:first` / `:left` / `:right` / `:blank` selectors and margin boxes, `@font-face` descriptor extraction, `@import` URL forwarding, `@supports` against our supported set.
- Counters (`counter` / `counters` / `counter-reset` / `-increment` / `-set` with Roman/Greek/alpha numbering) and pseudo-element content evaluation.
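One way to read "specificity computation packed into a sortable u32" is sketched below. The bit layout (10 bits per component, saturating at 1023) and the name `pack_specificity` are assumptions made for this example; the notes do not state the layout the engine actually uses.

```rust
/// Bits reserved for each specificity component (assumed layout).
const FIELD_BITS: u32 = 10;
/// Largest count a single component can hold before saturating.
const FIELD_MAX: u32 = (1 << FIELD_BITS) - 1;

/// Packs (id count, class/attribute/pseudo-class count, type/pseudo-element
/// count) into one integer whose natural ordering matches CSS specificity
/// ordering: ids outrank classes, which outrank types.
pub fn pack_specificity(ids: u32, classes: u32, types: u32) -> u32 {
    // Saturate each field so an overflowing low field can never carry
    // into (and wrongly outrank) a higher field.
    (ids.min(FIELD_MAX) << (2 * FIELD_BITS))
        | (classes.min(FIELD_MAX) << FIELD_BITS)
        | types.min(FIELD_MAX)
}
```

With the value packed this way, the cascade can compare or sort rules with a plain integer comparison instead of a three-way tuple comparison.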
- HTML5 tokenizer with attribute parsing (quoted/unquoted/bare), void-element implicit self-closing, `<style>` / `<script>` raw-text contexts, named + numeric entity decoding, comments, DOCTYPE.
- Flat arena DOM implementing the CSS-4 `Element` trait so the cascade matches against real document nodes. Implicit close handling for the common `<p>` and `<li>` cases.
- Stylesheet extraction: `<style>` blocks, `<link rel="stylesheet">` (URL forwarded; `media` attribute preserved), per-element inline `style="..."`.
- Resource extraction: `<img>` with srcset DPR selection, `<picture>` / `<source>` first-match, `<a href>` (internal anchor detection).
- Box tree from DOM × ComputedStyles with display-split (outer/inner), anonymous-block insertion per CSS 2.1 §9.2.1.1, `display: none` / `contents` handling, UA default display table for common HTML elements.
- Taffy integration for block / flex / grid layout (Dioxus, MIT, default-features-off + only the features we need).
- Inline formatting with greedy line breaker via UAX #14 (`unicode-linebreak`), `text-align` / `white-space` modes, hard breaks, atomic inline boxes.
- Float scaffolding with line-shortening helpers.
- Margin collapsing per CSS 2.1 §8.3.1.
- Multi-column distribution (`column-count` / `column-width` / `column-gap` with greedy line distribution).
- Tables with auto + fixed column-width algorithms, row-group classification (header/body/footer for paginator repetition).
- Slices a positioned box tree across pages at `floor(box.y / content_height)` boundaries.
- Multi-page boxes emit one PaginatedBox per page with the visible y-slice; preserves source IDs so PAINT can look up styles.
- A4 portrait (96dpi) and Letter (8.5×11) page presets.
- Walks each PageFragment and emits text + borders into the existing `PdfWriter` / `PageBuilder`.
- HTML→PDF Y-flip applied once at emission time so all internal coordinates stay top-down.
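The page-slicing rule described above (`floor(box.y / content_height)`, one slice per page for boxes that straddle a boundary) can be sketched as follows. `PageSlice` and `slice_box` are hypothetical names for this illustration; the real paginator also carries source IDs and style lookups that are omitted here. Coordinates are top-down and `y` is assumed to be non-negative.

```rust
/// The part of a box that is visible on one page (illustrative type).
#[derive(Debug, PartialEq)]
pub struct PageSlice {
    /// Zero-based page index.
    pub page: usize,
    /// Top of the slice, measured from the top of that page's content area.
    pub local_y: f32,
    /// Visible height of the slice on that page.
    pub height: f32,
}

/// Splits a box at absolute `y` with the given `height` into per-page slices.
pub fn slice_box(y: f32, height: f32, content_height: f32) -> Vec<PageSlice> {
    assert!(content_height > 0.0, "content height must be positive");
    let bottom = y + height;
    // The first page is the one containing the box's top edge.
    let mut page = (y / content_height).floor() as usize;
    let mut slices = Vec::new();
    loop {
        let page_top = page as f32 * content_height;
        let page_bottom = page_top + content_height;
        let top = y.max(page_top);
        let slice_bottom = bottom.min(page_bottom);
        slices.push(PageSlice { page, local_y: top - page_top, height: slice_bottom - top });
        if bottom <= page_bottom {
            break; // the box ends on this page
        }
        page += 1;
    }
    slices
}
```

Stepping the page index explicitly, rather than recomputing the floor for each slice, avoids getting stuck on a page boundary when floating-point rounding lands a value just below it.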
After the initial cut of the HTML+CSS pipeline, corner-case validation surfaced a set of regressions and missing features. All of the below also ship in v0.3.37:
- Tokenizer char-boundary safety. The CSS tokenizer's `ignore_case` lookahead indexed raw byte offsets on multi-byte characters, panicking on any CSS source that put non-ASCII inside a keyword-adjacent position. Fixed.
- Block sizing for inline-text flow. Block boxes with only-inline children were given zero intrinsic height, so paint-time `y`-coordinates collapsed; multi-paragraph documents dropped every paragraph but the first, and long single paragraphs retained only ~20 % of their words. `run_layout` now reserves intrinsic height from the body font size and the inline run count.
- Arabic / RTL shaping. Paint now routes RTL paragraphs through the rustybuzz shaper (feature `system-fonts`) so contextual forms, ligatures, and visual reordering all work.
- Multi-font cascade. New `Pdf::from_html_css_with_fonts(html, css, Vec<(family, bytes)>)`. CSS `font-family` on any element resolves against the registered families (case-insensitive, with/without quotes); unknown families fall back to the first registered font. Walks up the box tree so inline children inherit their ancestor's family.
- Page breaks. `page-break-before: always` and `page-break-after: always` now open a fresh page, both via CSS rules and via inline `style="..."`. Multiple breaks accumulate.
- `::before` / `::after` generated content. New `cascade::pseudo_content_for(ss, element, PseudoKind::{Before,After})`. Literal strings, `attr(name)`, and `open-quote` / `close-quote` all resolve.
- Opacity + `transform: translate*()`. `opacity <= 0.01` on any ancestor hides an element and all its text descendants. `transform: translateX/Y/translate(…)` applies as a pre-paint offset on the box's x/y.
- `<img>` data-URI embedding. `<img src="data:image/png;base64,…">` (and `data:image/jpeg;…`, percent-encoded plain payloads) now decode to a real PDF Image XObject. The paint pipeline emits `/Do` operators against a per-page `/XObject` resource dictionary which `PdfWriter::finish()` now serializes; the missing resource-dict wiring was why prior `page.add_element(Image(…))` calls rendered as silent no-ops. External URLs / filesystem paths return `None` from `decode_image_src` so callers can resolve those themselves.
- List markers. `<ul>` items get `•` (U+2022) and `<ol>` items get `N.` numbering, painted in the gutter to the left of the `<li>`'s content box. Nested lists work on both levels.
- `<a href>` link annotations. Every anchor box with a non-empty `href` emits a PDF `/Link` annotation carrying a `/URI` action; inline text inside the anchor inherits the link by walking up the box tree. Anchors with no `href` emit no annotation.
- Embedded fonts via `DocumentBuilder` (#382). New `DocumentBuilder::register_embedded_font(name, EmbeddedFont)`. Text emitted through the fluent builder (`FluentPageBuilder::font(name, size).text(...)`, or any `ContentElement::Text` whose `FontSpec.name` matches a registered embedded font, including template headers/footers) is now routed through the Type-0 / CIDFontType2 path instead of silently falling back to Helvetica. CJK, Cyrillic, Greek, Hebrew, Arabic text emitted via the high-level API now actually embeds and renders. Unregistered font names continue to resolve against the base-14 set. Reported by @sparkyandrew.
- Base-14 bold text rendered non-bold. The page `/Resources /Font` dictionary keyed entries with dashes stripped (`HelveticaBold`) while content streams emitted `Tf /Helvetica-Bold`. PDF readers silently fell back to the default font, so every bold or italic base-14 run came out regular. Resource-dict keys now match the `Tf` operator names exactly.
- TTC system fonts (Helvetica.ttc, msgothic.ttc, …). `fontdb` surfaces collection fonts as `Source::SharedFile(path, …)`, which the resolver previously rejected as `NoPath`. SharedFile entries are now read the same way as regular files, so a huge swathe of macOS/Windows system fonts become resolvable.
- Unquoted multi-word `font-family`. `font-family: DejaVu Sans, sans-serif` tokenises as two separate `Ident`s, so the registered-family lookup never matched them as a single name. The resolver now collects consecutive idents (whitespace-separated) into one candidate and flushes at top-level commas, so quoted and unquoted forms behave the same.
- Memory leak in `Pdf::from_html_css` / `from_html_css_with_fonts`. The factories leaked the combined CSS source, parsed stylesheet, DOM, and family map on every call (four `Box::leak` sites). Long-running processes (HTTP servers, batch converters) grew unbounded. The downstream APIs all accept non-`'static` references; the function now holds them in locals scoped to the call.
- PNG alpha / soft-mask now renders. `ImageData::from_png` already decoded and compressed the alpha channel, but `ImageContent` had no field for it and the XObject emitter hard-coded `SMask = None`. `ImageContent` gains a `soft_mask`, the html_css paint pipeline propagates it, and the XObject path actually emits a `/SMask` stream.
- Shaped text round-trips via `extract_text`. The shaped path (`add_shaped_embedded_text`) only recorded glyph IDs in the subsetter, leaving shaped runs absent from the ToUnicode CMap and uncopy-paste-able. The new `encode_shaped_run` maps glyph clusters back to source codepoints so the ToUnicode entries are complete for simple scripts and exact-leading-char for ligatures.
- Reproducible PDF output. `PdfWriter::finish` iterated `embedded_fonts` directly from the HashMap, randomising object-ID order across runs. Embedded fonts are now emitted in registration order via an explicit `embedded_font_order` vector.
- Embedded-font name collisions. Registering two fonts with the same display name silently overwrote the first. `embedded_fonts` is keyed by its `EFn` resource name (unique, monotonic) so registrations are independent regardless of display name.
- fontdb Mutex serialised on slow disks. `SystemFontDb::resolve` held the fontdb lock across the font-bytes `fs::read`. Concurrent resolve calls are now lock-free during I/O — the lock is released once the face path + PostScript metadata are picked.
- Misleading docs corrected. Module documentation previously claimed `background-color` rendered as a filled rect (currently a no-op stub) and that the writer embedded a subset of the face (currently embeds the full face + Identity-H, subsetter output is a later follow-up). Both are now reflected accurately in the relevant docstrings.
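The unquoted multi-word `font-family` fix above can be illustrated with a minimal sketch. This is not the crate's resolver: `parse_font_family` and `resolve_family` are hypothetical names, and the character-level scan below stands in for the real CSS tokenizer. It only shows the described behaviour: consecutive idents are joined into one candidate, candidates are flushed at commas, lookup is case-insensitive, and unknown families fall back to the first registered font.

```python
def parse_font_family(value):
    """Split a CSS font-family value into candidate family names.

    Hypothetical helper: whitespace-separated idents between commas are
    joined into one candidate, so quoted and unquoted names behave the same.
    """
    candidates = []
    words = []     # idents collected since the last comma
    buf = []       # characters of the ident currently being read
    quoted = None  # active quote character, if any

    def flush_word():
        if buf:
            words.append("".join(buf))
            buf.clear()

    def flush_candidate():
        flush_word()
        if words:
            candidates.append(" ".join(words))
            words.clear()

    for ch in value:
        if quoted:
            if ch == quoted:
                quoted = None
                flush_word()
            else:
                buf.append(ch)  # spaces inside quotes stay part of the name
        elif ch in "\"'":
            flush_word()
            quoted = ch
        elif ch == ",":
            flush_candidate()
        elif ch.isspace():
            flush_word()
        else:
            buf.append(ch)
    flush_candidate()
    return candidates


def resolve_family(value, registered):
    """Return the first registered family matching the list, case-insensitively.

    Falls back to the first registered font when nothing matches.
    """
    by_lower = {name.lower(): name for name in registered}
    for candidate in parse_font_family(value):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return registered[0]
```

With this sketch, `font-family: DejaVu Sans, sans-serif` and `font-family: "DejaVu Sans", sans-serif` yield the same candidate list, which is the property the fix describes.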
- E2E (`tests/test_html_to_pdf_e2e.rs`): 36 tests (was 14), covering every feature above plus a kitchen-sink document that exercises `::before`, list markers, page-break, opacity, translate, and `<a href>` in a single round-trip.
- Unit: 4 cascade pseudo-element tests, 7 paint tests (opacity / translate / data-URI decode), 3 inline-text sizing tests, 1 RTL shaper test, 1 multi-font cascade test, 1 tokenizer multi-byte regression test.
- Total test count: 4772 lib + 36 e2e; 168 integration suites all green, 0 regressions on the existing corpus.
The supported CSS surface is documented in detail in docs/HTML_TO_PDF_GUIDE.md. Out of scope: CSS filters, 3D transforms, animations, SVG-in-HTML (every viable Rust SVG crate is MPL), MathML, hyphens: auto, shape-outside, JavaScript execution, full-matrix transform (scale/rotate), gradients, and box-shadow.
`cargo deny check licenses` passes with zero MPL transitive dependencies. The Mozilla CSS stack (cssparser, selectors, html5ever, lightningcss, stylo) is all MPL-2.0; v0.3.37 hand-rolls the equivalents to keep pdf_oxide entirely under MIT/Apache.
- @jmriebold — Filed #248 ("CSS support"). That single issue is the root of this release's entire HTML+CSS→PDF pipeline — the hand-rolled CSS engine, the HTML5 tokenizer + arena DOM, Taffy-backed layout, the `::before` / `::after`, `page-break-*`, `<img>` data-URI, multi-font cascade, opacity / transform, `<a href>` link, and RTL shaping work all exist because he asked for it. Thank you.
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
See CHANGELOG.md for full details.
v0.3.36 | Markdown structural extraction quality vs pdfium — Tagged-PDF
The headline change of this release. to_markdown() previously consumed only the MCID order from /StructTreeRoot and then re-derived heading levels from font-size heuristics and list markers from glyph detection. For Word/Acrobat tagged PDFs whose body and heading text share a point size, this dropped every heading; for tagged lists where LI → LBody → MCR nests the actual content under a Span/P, this dropped every bullet; for tagged paragraphs whose inter-paragraph gap was less than 1.5× line height, this merged adjacent paragraphs.
This release wires the structure tree directly into the markdown pipeline:
- Heading and list emission from `/StructTreeRoot`. New `StructRole` (`Heading(1..6)`, `ListItem`, `ListItemLabel`, `ListItemBody`) attached to every span via the per-MCID lookup map. The converter prefers the explicit role over font-size heuristics so Word-tagged documents recover their full heading hierarchy. Lists emit `- item` with paragraph breaks at every role transition. (D1)
- Heading / list role propagated through nested MCRs. Tagged PDFs commonly wrap heading content as `H1 → Span → MCR` and list bodies as `LI → LBody → Span → MCR`. The traversal now threads `InheritedContext { heading_level, list_role }` down both `traverse_element` and `traverse_element_all_pages`, so deeply nested MCRs carry the right semantic role. (D8b)
- Per-`/StructTreeRoot` block boundary forces paragraph break. New `OrderedContent.block_id` increments on every entry into a block element (`/P`, `/H1..6`, `/LI`, `/Lbl`, `/LBody`, `/Sect`, `/Div`, `/Art`, `/TR`, `/TH`, `/TD`, `/Note`, `/Reference`, `/BibEntry`, `/Code`); the converter splits paragraphs whenever this changes between adjacent spans. Tight-gap layouts (pdfa_049-style) no longer merge. (D5)
- Same-baseline gate against form-heading over-fragmentation. D5 alone over-split horizontal heading bands like `# Form / # 1040 / # U.S. Individual Income Tax Return` into three separate headings. The block-id transition now fires only when the spans are also on different visual lines; same-baseline pieces re-join into one heading. (D5b)
- Multi-column gutter detection. Two spans on the same baseline separated by a horizontal gap > `max(3 × font_size, 30 pt)` are treated as belonging to different columns even when their block_ids would say otherwise — newspapers and two-column academic papers no longer concatenate cross-column tokens. (D5c)
- Backward-x reading-order wrap detection. When the structure tree's reading order goes column-major (last span of column 1 at x=976 immediately followed by first span of column 2 at x=192, same baseline), the converter now recognises the wrap as a paragraph break instead of joining the two into a nonsense token like `constitutionAssailing`. (D5d)
- Geometric heading + list-prefix detection for untagged docs. Bold + 5 % size bump promotes to H4. New `is_ordered_list_marker(text) -> Option<u32>` recognises `1.` / `12.` / `a)` / `iv.` / `A.` while conservatively rejecting figure captions (`1.1 Foo`) and years (`1986`). Bullet or ordered marker on a new line forces a paragraph break regardless of the geometric gap. (D2 / D3 / D4)
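As a rough illustration of the ordered-marker rule in D2 / D3 / D4, the sketch below accepts the listed forms and rejects the listed counter-examples. It is a Python approximation, not the Rust implementation: the regular expression, the three-digit limit, and the choice to return the marker's ordinal (with `None` standing in for Rust's `Option::None`) are all assumptions made for the example.

```python
import re

# Assumed grammar: up to three digits, one letter, or a lowercase roman
# numeral of two or more characters, followed by "." or ")" and then
# whitespace or end of text. Uppercase multi-letter romans are not handled.
_MARKER = re.compile(r"^\s*([0-9]{1,3}|[A-Za-z]|[ivxlcdm]{2,})([.)])(?:\s|$)")
_ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(token):
    """Convert a lowercase roman numeral using the subtractive rule."""
    values = [_ROMAN[c] for c in token]
    total = 0
    for i, v in enumerate(values):
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total


def is_ordered_list_marker(text):
    """Return the marker's ordinal, or None when text does not start with one."""
    m = _MARKER.match(text)
    if m is None:
        return None
    token = m.group(1)
    if token.isdigit():
        return int(token)
    if len(token) == 1:
        # A single letter is read alphabetically (a -> 1, b -> 2, ...).
        return ord(token.lower()) - ord("a") + 1
    return roman_to_int(token)
```

Requiring whitespace or end of text after the delimiter is what rejects `1.1 Foo`, and the three-digit limit is what rejects `1986`; both are choices of this sketch rather than documented behaviour.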
- Spurious `**bold**` markers around Arabic contextual glyphs are now stripped. Initial / medial / final shape transitions routinely flipped the font-weight detector and emitted single-letter emphasis runs; the converter now recognises and removes them.
- Bidi reorder is OFF by default. An earlier draft of D7 ran `unicode-bidi`'s visual→logical reorder on every RTL line; that broke previously-correct logical-order PDFs (Hebrew name `בנימין` was being reversed to `ןימינב`). Without a reliable signal for source order, the safer behaviour is to preserve the input ordering. The reorder helper remains exported from `text::bidi::reorder_visual_to_logical` for callers that know their input is in visual order.
- Inline-image base64 data URIs capped at 200 KB. PDFs with high-resolution diagrams previously inflated markdown output by 10–20× (one 1.9 MB academic paper produced 11.3 MB of markdown). Images that exceed the cap now emit an HTML-comment placeholder noting the suppression and the original size. File-based image output (`image_output_dir`) is unaffected.
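The cap can be pictured with a short sketch. The function name, the placeholder wording, and the decision to measure the base64 payload (rather than the raw image) against 200 × 1024 bytes are assumptions for illustration; the release note only states that oversized images become an HTML comment recording the suppression and the original size.

```python
import base64

# Assumed interpretation of "200 KB": 200 * 1024 bytes of base64 payload.
MAX_INLINE_IMAGE_BYTES = 200 * 1024


def inline_image_markdown(image_bytes, mime="image/png", alt="image"):
    """Emit a data-URI image, or a comment placeholder when over the cap."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    if len(encoded) > MAX_INLINE_IMAGE_BYTES:
        # Illustrative placeholder text; the real wording is not given in the notes.
        return f"<!-- image omitted: original size {len(image_bytes)} bytes exceeds inline cap -->"
    return f"![{alt}](data:{mime};base64,{encoded})"
```

Because base64 grows data by about a third, a cap on the encoded payload corresponds to roughly 150 KB of raw image bytes under this interpretation.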
- 80+ new unit tests in `pipeline::converters::markdown::tests`, `structure::traversal::tests`, and `text::bidi::tests` covering every defect with TDD-shaped RED→GREEN cases plus parametrised variations (all six heading levels, all three list roles, edge cases like clamped levels, baseline jitter, three-column layouts, the IA_0047 backward-x reproducer, etc.).
Validated against v0.3.35 baseline on a 369-PDF regression spanning academic, government, forms, newspapers, technical, theses, IRS, pdfium, pdfjs, safedocs, and slow-corpus subsets:
- 0 catastrophic regressions (no `HEAD_FAIL`, no `SHRUNK_BIG` on real content; the three sub-50-byte SHRUNK cases are pdfjs test fixtures where D5b same-line joining suppresses geometric heading detection on minimal content).
- Token Jaccard vs pdfium and pdftotext: median 1.000 (perfect), ≥0.95 on 95/106 fixtures.
- Token Jaccard vs pymupdf4llm: median 0.978, ≥0.95 on 65/106 fixtures.
- ~2× more headings emitted than pymupdf4llm across the corpus — the structure-tree wiring lets pdf_oxide pick up section titles that font-only heuristics miss.
- Per fixture (issue #377): nougat_002 0→4 H1s + 5→34 bullets; nougat_011 64→266 lines; word365_structure 0→1 H1 + 2→3 bullets; 2023-06-20-PV 0→4 H + 0→5 bullets.
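For readers unfamiliar with the metric, token Jaccard is the size of the intersection of two token sets divided by the size of their union. The sketch below assumes whitespace tokenisation and set (not multiset) semantics; the notes do not say how the validation harness tokenises, so treat those details as assumptions.

```python
def token_jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two texts."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty extractions are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A score of 1.000 therefore means both extractors produced the same set of tokens, regardless of order or repetition.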
- @Goldziher (kreuzberg) — filed #377 with a 727-document benchmark methodology (block-level SF1 + token-level TF1) comparing pdf_oxide against pdfium, plus 9 reproducer PDFs covering the worst structural-extraction regressions. The clarity of that report (per-pattern bucketing, per-fixture gaps, and an explicit "TF1 within ±3 % so text content is fine, structure is the issue" framing) made the entire investigation tractable. The single-PR unlock that drove this release was identifying that pdf_oxide had a complete structure-tree parser whose output the markdown converter was discarding — that framing came directly from the issue.
v0.3.35 | Narrow-glyph doublet preservation in text extraction
- Adjacent narrow-glyph doublets no longer collapsed at small font sizes (#378, PR #379). `TextExtractor::deduplicate_overlapping_chars` and `deduplicate_overlapping_spans` used a hardcoded 2 pt absolute threshold to detect duplicate glyphs from stroke+fill render passes. For narrow glyphs (`l`, `r`, `I`, `i`) in compact fonts at small sizes the per-glyph advance width drops to ≤ 2 pt (Helvetica `l` ≈ 2.0 pt at 9 pt), so legitimate adjacent doublets one full advance apart fell inside the dedup window and one of the two glyphs was silently dropped. Visible corruption included `controller → controler`, `billed → biled`, `warranty → warrnty`, `following → folowing`, and `VIII → VII`. Builds on prior #102 / #253, which added same-text and same-character identity guards but kept the 2 pt threshold — this fix addresses the residual case where both glyphs are identical (passing the identity check) yet still legitimate neighbours. Threshold now scales with each glyph's own `advance_width` (fallback `bbox.width`) as `min(advance_width * 0.30, 2.0)`. Real render-pass duplicates sit well under 5 % of one advance apart and continue to collapse; heaviest kerning observed in the wild is ≤ 20 % of advance, so legitimate kerned neighbours are preserved. Tunables hoisted to `TextExtractor::DEDUP_OVERLAP_RATIO` / `DEDUP_OVERLAP_CAP_PT` associated constants so both dedup paths share one source of truth. Regression coverage spans the matrix of four narrow glyphs × three small body-text sizes (7 / 9 / 11 pt) on both the per-char and per-span paths, plus positive cases proving stroke+fill duplicates at ~0 pt offset still collapse.
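A small numeric sketch of the advance-scaled threshold may help. The constant names mirror the ones mentioned above and their values are read off the stated formula; `is_render_pass_duplicate` is a hypothetical helper, and the strict `<` comparison and the use of start-x distance are assumptions of this example.

```python
DEDUP_OVERLAP_RATIO = 0.30   # fraction of the glyph's own advance width
DEDUP_OVERLAP_CAP_PT = 2.0   # absolute ceiling, the former fixed threshold


def dedup_threshold(advance_width):
    """Distance below which two identical glyphs count as one render-pass pair."""
    return min(advance_width * DEDUP_OVERLAP_RATIO, DEDUP_OVERLAP_CAP_PT)


def is_render_pass_duplicate(ch_a, x_a, ch_b, x_b, advance_width):
    """Hypothetical check: same character and closer than the scaled threshold."""
    return ch_a == ch_b and abs(x_b - x_a) < dedup_threshold(advance_width)


# Helvetica "l" is 222/1000 em wide, so about 1.998 pt at 9 pt.
narrow_advance = 0.222 * 9
# Two legitimate "l" glyphs one advance apart are kept...
kept = not is_render_pass_duplicate("l", 10.0, "l", 10.0 + narrow_advance, narrow_advance)
# ...while a stroke+fill pair at nearly the same position still collapses.
collapsed = is_render_pass_duplicate("l", 10.0, "l", 10.02, narrow_advance)
```

Under the old fixed 2 pt rule the first pair (1.998 pt apart) would have fallen inside the window, which is the `controller → controler` failure described above.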
- @Hugues-DTANKOUO — Reported #378 with a precise root-cause analysis (the 2 pt absolute threshold falling below one advance width for narrow glyphs in compact fonts at small sizes) and authored PR #379 with the advance-scaled threshold and a parametrised regression matrix covering the four narrow glyphs across three body-text sizes.
v0.3.34 | Idiomatic page API, structured tables, column-order, image, and ICC colour fixes
All four language bindings now expose a page object so callers can iterate a document and call extraction methods on the page directly. Named consistently as Page in Python, Node.js, C#, and Go.
```python
from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    for page in doc:  # len(doc), doc[i], doc[-1] also work
        text = page.text
        md = page.markdown(detect_headings=True)
```
- Python — `Page` with lazy properties: `text`, `chars`, `words`, `lines`, `spans`, `tables`, `images`, `paths`, `annotations`; methods: `markdown()`, `plain_text()`, `html()`, `render()`, `search()`, `region()`. The pre-existing editor `PdfPage` is unchanged.
- Node.js — `Page` with cached `width` / `height` / `rotation` and extraction methods. `[Symbol.iterator]` and `page(index)` added to `PdfDocument`. Six previously native-only methods wired into the TS layer: `extractWords`, `extractTextLines`, `extractTables`, `extractPaths`, `getEmbeddedImages`, `ocrExtractText`.
- C# — `Page` with full sync + async surface. `doc.Pages` (`IReadOnlyList<Page>`) and `doc[i]` indexer added to `PdfDocument`.
- Go — `Page` struct with full method surface. `doc.Page(i)` and `doc.Pages()` added to `PdfDocument`.
extract_tables() returns structured data — rows, cells with text and bounding boxes — not just Markdown. Available on both PdfDocument and the new Page objects across all bindings, with a single consistent type name Table:
| Language | Type | Cell access |
|---|---|---|
| Rust | `Table` | iterate `rows[i].cells[j]` |
| Python | `dict` | `row["cells"][i]["text"]` |
| Go | `Table` | `table.CellText(row, col)` |
| C# | `Table` | `table.CellText(row, col)` |
| Node.js | `Table` (interface) | `table.cells[row][col]` |
C# previously returned only (int RowCount, int ColCount) tuples — now returns a proper Table[] with cell text accessors, matching Go and Rust.
- Multi-column reading-order interleaving fixed (#319). On untagged multi-column PDFs (academic textbooks, genetics references), `extract_text` was applying XY-cut column ordering inside `extract_spans()` and then re-sorting with row-aware sort in `extract_text_with_options`, undoing the column structure. Result: garbled fragments like `accompaally` (= "accompa" from column 1 + "ally" from column 2). Fix: skip the row-aware re-sort when the page is genuinely multi-column. Verified on Hartwell Genetics, Murphy ML, and Kandel Neural Science textbooks — all known garbled tokens eliminated.
- XY-cut column-detection improvements for mixed-layout pages (table + body text). Wide spans (>55% of region width) excluded from the projection density so tab-expanded table rows no longer fill the column gutter. Single-character spans (table cell values like `G`, `T`) excluded from projection so they don't scatter across the gutter. Coverage check uses character-count estimate rather than bbox width so tab-padded rows don't masquerade as dense body text.
- Sparse-layout false-positive guard for `is_multi_column_page`. Copyright pages, title pages, and colophons can produce two X-center peaks with only 7-10 spans per "column" — these are no longer treated as multi-column, preventing XY-cut from splitting sentences whose halves are at different X positions on the same line.
- Font-aware column-shape gate in `is_multi_column_page`. Fax-style and scattered-fragment layouts (each row built from several individually positioned word fragments) used to clear every prior multi-column check and routed through XY-cut, which then read the page column-major and could reverse fragments within a row. The new gate measures the fraction of side-spans falling into the largest X-cluster (cluster gap derived from the page's dominant em); body text scores ≥ 0.5 while scattered layouts score < 0.4. Pages that fail either side fall back to row-aware sort, so scanned-fax PDFs again read left-to-right line-by-line. Per-page font statistics are computed once via the new `pdf_oxide::layout::PageFontStats` type and reused by every threshold the layout pipeline derives.
- Newline insertion on backwards-X jumps in span join. When the upstream sort handed the join loop two same-baseline spans whose X positions went backwards (a multi-column page whose XY-cut routing groups column-side spans across rows so adjacent iteration items share a Y band but belong to different visual rows), no separator was being inserted and texts glued together — producing tokens like `instancesinstancesinstances` from three table-header cells in a stats grid. Same-baseline pairs whose delta-x is more negative than 3 em now emit a newline.
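The backwards-X rule in the last item can be sketched as follows. This is a deliberately simplified join loop, not the crate's: `join_same_line_spans` is a hypothetical name, delta-x is measured between span start positions, the baseline tolerance is invented, and every other pair is joined with a single space, none of which is specified in the notes.

```python
def join_same_line_spans(spans, em, baseline_tol=0.5):
    """Join (x, baseline_y, text) spans in the order the sorter produced them.

    A newline is emitted when the baseline changes, or when two spans share
    a baseline but x jumps backwards by more than 3 em.
    """
    out = []
    prev = None
    for x, y, text in spans:
        if prev is not None:
            prev_x, prev_y = prev
            same_baseline = abs(y - prev_y) <= baseline_tol
            if not same_baseline or (x - prev_x) < -3.0 * em:
                out.append("\n")
            else:
                out.append(" ")  # simplification: always one space otherwise
        out.append(text)
        prev = (x, y)
    return "".join(out)
```

The 3 em margin leaves room for small backwards nudges such as kerning or superscripts while still catching a jump from one column back to the start of another.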
- Node.js Linux prebuild now portable across glibc 2.35+ systems. Previous builds were dynamically linked against `libstdc++.so.6` requiring `GLIBCXX_3.4.31` (GCC 13+), failing to load on Debian 12 stable, Ubuntu 22.04, and RHEL 8/9. Fix: `binding.gyp` now passes `-static-libstdc++` and `-static-libgcc`, and the Linux runner is pinned to `ubuntu-22.04` / `ubuntu-22.04-arm` (glibc 2.35). The resulting `.node` is fully self-contained for C++ runtime — `ldd` shows only `libm` / `libc`. Size impact: +210 KB.
- Go installer documents `@latest`. `go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest` is now the recommended install command (the installer auto-resolves the matching version via `runtime/debug.ReadBuildInfo()`).
- pkg.go.dev now shows Go documentation. The Go module (rooted at `go/go.mod` with module path `github.com/yfedoseev/pdf_oxide/go`) was returning `Documentation not displayed due to license restrictions` because pkg.go.dev's licensecheck only inspects the module's own subtree — it does not walk up to the repo root where `LICENSE-APACHE` + `LICENSE-MIT` live. Fix: duplicate both files into `go/LICENSE-APACHE` and `go/LICENSE-MIT`, filenames both on pkg.go.dev's accepted list. Takes effect on the next tag.
- npm, NuGet, and PyPI packages now embed both licence files. Same class of gap as the Go fix: `js/package.json`'s `files` list, the C# `.csproj`, and the maturin `[tool.maturin] include` all omitted the licence text so shipped artifacts lacked the notice MIT requires. `js/package.json`'s `license` field also flattened to `"MIT"`, contradicting the crate's declared `MIT OR Apache-2.0`; corrected to match. The C# csproj carried a deprecated `<LicenseUrl>` alongside `<PackageLicenseExpression>` that NuGet warns on — removed.
- `LICENSE-MIT` copyright corrected. All four `LICENSE-MIT` copies (root, `go/`, `js/`, `csharp/PdfOxide/`) carried `Copyright (c) The Rust Project Contributors` left over from the `cargo init` template. Updated to `Copyright (c) 2025-present Yury Fedoseev`. Verified with google/licensecheck — all four still classify as 100% MIT, so pkg.go.dev / NuGet / npm license detection is unaffected.
- Free-disk-space step added to all Ubuntu jobs that do heavy Rust + Python builds. A v0.3.33 release-pipeline failure (`No space left on device` on `actions-runner` log writes) traced to GitHub Ubuntu runners filling up at the `maturin build --release` step. Now applied to `python.yml` test job (was only one fixed initially), `ci.yml` Python Bindings + WASM Build jobs, and `release.yml` Python wheel build matrix (Linux targets only via `if: runner.os == 'Linux'` guard).
- 4-bit-per-component Indexed images no longer decode to vertical-stripe noise (#375). The PNG predictor decoder was honouring the numeric `/Predictor` value from `/DecodeParms` instead of the per-row filter tag byte written into each row. ISO 32000-1:2008 §7.4.4.4 makes the per-row tag authoritative: a producer may declare `/Predictor 12` (Up) on the parameters and still write tag 0 (None) on every row. Reading the declared predictor instead produced Up-cascade on raw index bytes, rendering a 710×1012 scanned-book page as a diagonal-stripe noise pattern. Reported by @Charltsing.
- Indexed palette streams whose first byte is `0x0D` (CR) or `0x0A` (LF) no longer decode to solid black (#375). `decode_stream_data` was running a post-parse `trim_leading_stream_whitespace` pass that stripped CR/LF bytes from the start of every unencrypted stream. The parser already consumes exactly one EOL after the `stream` keyword per ISO 32000-1:2008 §7.3.8.1, so re-trimming corrupted binary streams that legitimately start with those bytes. For an Indexed-backed image, shrinking a 4-byte CMYK palette `0d 0c 0c 04` to 3 bytes pushed every lookup into the expander's out-of-range branch, producing `(0,0,0)` for every pixel. Reported by @Charltsing.
- DeviceCMYK → DeviceRGB fallback now matches ISO 32000-1:2008 §10.3.5 (#375). All CMYK→RGB paths — image-level bulk conversion, Indexed-CMYK palette expansion, content-stream fill/stroke colour state, JPEG CMYK decoding — now use the spec's additive-clamp formula `R = 1 − min(1, C + K)`. Four inline copies and three helper functions were collapsed onto this single form; the common multiplicative `(1-C)(1-K)` variant differed on heavily-inked samples and was the default we inherited from imaging libraries, not what the spec specifies.
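To make the difference between the two CMYK→RGB forms concrete, here is a minimal sketch with components normalised to 0.0 to 1.0. The function names are invented for this example; the green and blue channels use the same formula as red with M and Y in place of C.

```python
def cmyk_to_rgb_additive(c, m, y, k):
    """Additive-clamp form: channel = 1 - min(1, ink + K)."""
    return tuple(1.0 - min(1.0, ink + k) for ink in (c, m, y))


def cmyk_to_rgb_multiplicative(c, m, y, k):
    """The common imaging-library form: channel = (1 - ink) * (1 - K)."""
    return tuple((1.0 - ink) * (1.0 - k) for ink in (c, m, y))
```

On a heavily inked sample such as C = M = Y = K = 0.5, the additive form clamps to black while the multiplicative form still yields a dark grey of 0.25 per channel, which is the kind of divergence the note refers to. With K = 0 or with no coloured ink the two forms agree.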
- Real ICC profile-driven colour conversion via qcms (#375; opt-out `icc` feature, on by default). When a PDF's `/ICCBased` colour space or `/OutputIntents → DestOutputProfile` provides an ICC profile, image extraction now compiles it to a `qcms::Transform` and routes CMYK samples through the CMM instead of the §10.3.5 fallback. RGB- and gray-ICCBased profiles use the same pipeline. The graphics-state rendering intent (`/Intent` on image dictionaries, `/RI`, or the `ri` operator) is honoured; unrecognised intent names fall through to `RelativeColorimetric` per §8.6.5.8. qcms is pure Rust (no C/FFI) so WASM and C# AOT builds keep working; opt out with `default-features = false`. Reported by @Charltsing.
- New `pdf_oxide::color` module exposes `IccProfile`, `IccHeader`, `RenderingIntent`, and `Transform` for consumers that want to drive colour conversion directly.
- Measured impact on a representative CMYK-heavy fixture (218 images, `/ICCBased 4` throughout): mean PSNR vs poppler's reference rendering improved from 27.9 dB (§10.3.5 fallback) to 39.2 dB (qcms). Worst-case PSNR rose from 16.4 dB ("visibly wrong saturation") to 33.8 dB ("perceptually indistinguishable"). A representative blue swatch shifted from `RGB(62, 142, 252)` to `RGB(58, 123, 190)` vs the ICC reference's `RGB(62, 124, 191)`.
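The PSNR figures above are reported without a formula. For orientation, the standard definition for 8-bit samples is sketched below; whether the measurement harness used exactly this form (peak of 255, mean over all channel samples) is an assumption.

```python
import math


def psnr(samples_a, samples_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(samples_a, samples_b)) / len(samples_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.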
- @SeanPedersen — Proposed the page-first API (#371) with lazy evaluation and sequence semantics. Python follows his design exactly; extended to Node.js, C#, and Go.
- @pdenapo — Requested structured table extraction returning data structures rather than Markdown (#289), which prompted the cell-text API surfacing in C# / Node.js and the `Table` rename for cross-language consistency.
v0.3.33 | Text extraction, image correctness, and memory safety fixes
- ToUnicode CMap miss returns U+FFFD instead of ASCII ciphertext (#363). Subset Type0 fonts whose ToUnicode CMap doesn't cover a CID now emit the replacement character instead of falling through to the Identity-H `cid-as-Unicode` path that produced strings like `%B+$%8A//$2*%01*1%6APP`.
- Intra-word TJ kerning no longer splits words (#365). Letter-pair kerning of 0.10–0.20 em inside single words (`[(diffe) -150 (rent)]`) no longer triggers space insertion. Validated on 5 Kreuzberg fixtures — zero split-word patterns.
- Cyrillic / non-Latin text recovered from UTF-8 mojibake (#317). Fonts with Latin-only encoding and no ToUnicode CMap that carry raw UTF-8 byte sequences now decode correctly. Validated on `issue20232.pdf` — Russian engineering text readable.
- FlateDecode partial-recovery rejects garbage output (#364). MS Reporting Services PDFs (`nougat_026.pdf`) whose content streams failed mid-decompress were returning 128 bytes of pseudo-random data. Partial-recovery paths now validate output via `looks_like_real_stream` before accepting. Pages 1/2/5 go from 0 → 848/792/321 bytes.
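The UTF-8 mojibake case (#317) can be pictured with a small round-trip sketch. It assumes the Latin-only encoding behaves like Latin-1, where every byte maps to one code point; the crate's actual sniffing logic is not described in the notes, and `recover_utf8` is a hypothetical name.

```python
def recover_utf8(decoded):
    """Re-interpret text that was decoded byte-per-character as UTF-8, if valid.

    Returns the input unchanged when it cannot be mapped back to single
    bytes or when those bytes are not well-formed UTF-8.
    """
    try:
        raw = decoded.encode("latin-1")
    except UnicodeEncodeError:
        return decoded  # contains characters outside the single-byte range
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return decoded  # genuinely Latin text, leave as-is
```

Strict UTF-8 validation is what keeps ordinary accented Latin text from being altered: a lone byte such as 0xE9 is not a complete UTF-8 sequence, so the original string is returned untouched.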
- Indexed + ICCBased palette correctly resolves component count (#373). Unresolved ICC stream references inside the Indexed base array caused `/N` to default to 3 instead of reading the actual value (4 = CMYK), producing diagonal-stripe artifacts. Reported by @Charltsing.
- Lab-base Indexed palettes converted to sRGB (#337). Palette bytes in CIE L*a*b* are now converted through Lab→XYZ→sRGB instead of being reinterpreted as raw RGB.
- All internal caches bounded (PR #369, #354). Object cache (64 MB), font caches (256–512 entries), XObject span/image caches (1024 entries), and global CMap cache (1024 entries) all use FIFO eviction. Cache utilities extracted to `src/cache.rs`.
- Path extraction OOM on chart-heavy PDFs fixed (PR #369). Added CTM-aware `processed_xobjects` dedup — same XObject at same position is deduplicated, same XObject at different positions processes separately.
- Mutex poison resilience. `MutexExt::lock_or_recover()` replaces 72 `.lock().unwrap()` calls.
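For readers unfamiliar with FIFO eviction, the following sketch shows an entry-count-bounded cache of that kind. `FifoCache` is an illustrative class, not the contents of `src/cache.rs`, and it ignores the byte-size bound used for the object cache.

```python
from collections import OrderedDict


class FifoCache:
    """Bounded cache that evicts the oldest inserted entry when full."""

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._items = OrderedDict()

    def get(self, key, default=None):
        # Reads do not reorder entries: this is FIFO, not LRU.
        return self._items.get(key, default)

    def insert(self, key, value):
        if key in self._items:
            self._items[key] = value  # update in place, keep original position
            return
        if len(self._items) >= self._max:
            self._items.popitem(last=False)  # drop the oldest insertion
        self._items[key] = value

    def __len__(self):
        return len(self._items)

    def __contains__(self, key):
        return key in self._items
```

FIFO trades some hit rate for simplicity: lookups need no bookkeeping, so they stay cheap, while memory use is still capped.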
- RustCrypto cipher 0.5 ecosystem (PRs #352, #295, #291). `aes` 0.8→0.9, `cbc` 0.1→0.2, `sha2` 0.10→0.11, `sha1` 0.10→0.11, `md-5` 0.10→0.11.
- 13 dead/stale ignored tests removed; 3 previously-ignored tests fixed and un-ignored.
- Regression tests added: ToUnicode CID-miss (3 tests), FlateDecode stream boundary framing (4 variants), TJ intra-word kerning, Cyrillic encoding and UTF-8 sniff (2 tests), dedup flow-prose preference, reading-order glyph sort stability (2 tests), Indexed Lab palette conversion.
- Suite: 6,300 passed, 0 failed, 228 ignored.
Thank you to everyone who reported issues, filed reproducers, or contributed code for this release!
- @Charltsing — Reported the Indexed + CMYK image extraction failure (#373) with a reproduction PDF and screenshot comparison against pdfimages (xpdf), which exposed the unresolved ICC stream reference bug that had been silently producing garbled diagonal-stripe artifacts since the Indexed palette support landed in v0.3.27.
- @ddxtanx — Reported the unbounded memory growth during multi-page extraction (#354) with profiling data that showed object and font caches consuming 200 MB+ on a 609 KB arXiv PDF. This drove the bounded-cache work in PR #369.
- @andrewjradcliffe — Authored PR #369 implementing bounded FIFO caches for all internal caches, CTM-aware XObject dedup for the path extractor OOM, `MutexExt` poison-recovery trait, Python binding hardening, and markdown inter-group spacing. The PR also included comprehensive unit tests for all new cache types.