v0.3.50 | True destructive PDF redaction, PAdES-B-T/B-LT long-term-validation signatures, a runtime cryptographic algorithm-governance policy, and split-PDF-by-bookmarks across all seven bindings, plus a signature-date correctness fix.
- True destructive redaction (#231) — the prior "redaction" only drew a filled rectangle over content whose bytes survived (recoverable by copy-paste, `pdftotext`, or a hex editor). Redaction is now destructive: the text under each region is physically removed from the content stream — every glyph whose ISO 32000-1:2008 §9.4.4 text-rendering box intersects the (edge-padded) region is deleted, survivors are re-emitted with a fresh absolute `Tm` and no `TJ` deltas so neither the glyphs nor a width/shift side channel (Bland et al., PETS 2023) remain; the page is rewritten so the original content object is dropped by the garbage-collected full rewrite (no residual recoverable bytes); an opaque overlay marks the area (ISO 32000-1:2008 §12.5.6.23, "remove all traces … clipping shall not be used"). Composite/Type0/unknown fonts are refused rather than risk a silent under-redaction (fail-closed). New `DocumentEditor::add_redaction/redaction_count/apply_redactions_destructive` plus the `pdf_redaction_add/count/apply/scrub_metadata` C ABI and Python, WASM, Node, C#, Go bindings and a `pdf-oxide redact INPUT --rect PAGE:x0,y0,x1,y1 [--from-annotations] [--fill R,G,B] [--no-scrub-metadata]` CLI. The legacy `apply_page_redactions`/`apply_all_redactions` keep their signatures. Standalone document sanitization (`DocumentEditor::sanitize_document`, the live `pdf_redaction_scrub_metadata` C ABI, Python `sanitize_document`, WASM `sanitizeDocument`, and the already-wired Node/C#/Go scrub paths) strips the `/Info` dictionary, the catalog XMP `/Metadata` stream, document JavaScript (`/OpenAction`, `/AA`, `/Names/JavaScript`) and `/Names/EmbeddedFiles`; the removed object subtrees are hard-excluded from the rewritten file so a secret cannot survive even as a GC-missed orphan (G6). Geometric image/path/XObject pruning remains roadmap; composite-font text and encrypted documents are refused (not under-redacted).
- PAdES long-term-validation signatures (#235) — signing now produces ETSI EN 319 142-1 PAdES baseline signatures, not just bare
`adbe.pkcs7.detached`: B-B embeds the RFC 5035 ESS `signing-certificate-v2` signed attribute; B-T adds an RFC 3161 `signature-time-stamp` unsigned attribute over the signature value; B-LT appends a Document Security Store (ISO 32000-2:2020 §12.8.4.3 — certs/CRLs/OCSPs + a per-signature `/VRI` keyed by the uppercase-hex SHA-1 of the signature's `/Contents`) as an append-only second incremental update, so the original signature's byte range is untouched and stays `Valid`. Read side: `read_dss` parses a `/DSS` and `classify_pades_level` reports a signature's level (B-B/B-T/B-LT). New `sign_pdf_bytes_pades`/`PadesLevel`/`RevocationMaterial`/`DocumentSecurityStore` in core, the `pdf_sign_bytes_pades`/`pdf_signature_get_pades_level`/`pdf_document_get_dss`/`pdf_dss_*` C ABI, and Python, WASM, Node, C#, Go bindings. B-LTA is also produced: a `/Type /DocTimeStamp` (`/SubFilter /ETSI.RFC3161`) RFC 3161 timestamp over the whole file including the DSS, appended as a third incremental update so the archival timestamp covers the signature and its validation material; `has_document_timestamp` is the document-scoped reader signal (`classify_pades_level` stays signature-scoped and tops out at B-LT by design — the frozen `pdf_signature_get_pades_level` C ABI has no document handle). The legacy `sign_pdf_bytes` `adbe.pkcs7.detached` path is byte-for-byte unchanged. Final ETSI conformance is gated on the EU DSS demonstration-validator release check (online TSA fetch is CGo/native-only — WASM takes a pre-fetched RFC 3161 token).
- Runtime crypto-governance policy (#230) — a process-wide
`crypto::SecurityPolicy` (modes `compat`/`strict`/`fips-strict`, plus an `allow:`/`deny:<alg>@<read|write>` override grammar) layered as an orthogonal, set-once decorator over the existing `CryptoProvider`. Read/write asymmetry lets a deployment read legacy RC4/MD5 PDFs while forbidding weak crypto on write or new signatures; fail-closed throughout (unknown algorithm / unparseable spec ⇒ deny). Includes a content-keyed `inventory()` governance report and a pluggable `AuditSink`. Exposed across all seven surfaces (Rust, Python, C ABI, Go, C#, WASM, Node) as `set_crypto_policy`/`crypto_policy`/`crypto_inventory`. Default (`compat`) behaviour is byte-for-byte unchanged. The residual password-key-derivation MD5 (ISO 32000-1 §7.6.3 Algorithm 1/2/3/5/7) is now also routed through the governed provider, so a `strict`/`fips-strict` policy denies legacy R≤4 at the primitive level, not only the operation gate — closing the gap noted in the v0.3.50 slice. The hashing is byte-identical under `compat` (existing encrypted PDFs still decrypt; newly written ones are bit-for-bit unchanged). Non-security opaque MD5 (file identifier, embedded-file `/CheckSum`) is deliberately left direct so a strict policy still permits AES-256 writes. A machine-readable CycloneDX 1.6 Cryptographic Bill of Materials of the algorithms a run actually exercised is exported via `crypto_cbom` (core `cbom_json` + C ABI / Python / WASM / Go / Node / C# bindings) — the structured complement to `crypto_inventory` for CBOM/SPDX-crypto governance. The policy now also recognises and governs post-quantum algorithms: `PolicyMode::Cnsa2` (CNSA 2.0 — new crypto must be FIPS-approved and 192-bit-class or stronger; 128-bit classical and L1/L2 PQC denied for write) and `PolicyMode::PqcReady` (Strict semantics that additionally recognise/permit ML-DSA/ML-KEM for classical+PQC dual-stacking during migration), plus ML-DSA-44/65/87 (FIPS 204) and ML-KEM-512/768/1024 (FIPS 203) `AlgorithmId`s in `inventory()`/CBOM/the policy grammar. This is governance vocabulary (the policy decides; the actual ML-DSA/ML-KEM primitives are a separate provider concern — a sign attempt fails closed until they land). Set via the string grammar (`crypto_policy("cnsa2")`), so all seven bindings get it with no API change; frozen `AlgorithmId` bit indices are preserved (PQC ids appended). A governed RSA modulus-size floor is also enforced for signing: `SecurityPolicy::min_rsa_modulus_bits` (per-mode default — Compat 0, Strict/PqcReady 2048, FipsStrict/Cnsa2 3072 per NIST SP 800-131A / CNSA 2.0) makes `sign_pdf_bytes`/`sign_pdf_bytes_pades` fail closed with a weak RSA key — the key-strength gate the algorithm-level `min_security_bits` cannot see. Default `compat` keeps no floor (byte-for-byte unchanged). (Finer X.509 cert-policy governance — keyUsage / extendedKeyUsage / validity-window enforcement for the signing certificate — is the remaining #230 roadmap item, tracked as a focused follow-up. Per-document policy override (Phase G) was design-assessed and deliberately deferred: the active policy is set-once specifically because a mid-flight downgrade is an attack vector, so a runtime widening override (e.g. relax-for-one-document) cannot be added safely; the only sound shape is an explicit per-document policy threaded through every crypto call site — a large cross-cutting change, tracked as a separate follow-up, not a set-once relaxation.)
- Split a PDF by bookmarks (#482) — new
`pdf-oxide split --by-bookmarks [--bookmark-prefix P] [--bookmark-level N] [--ignore-case] [--no-front-matter]` CLI, plus `plan_split_by_bookmarks`/`split_by_bookmarks*` in core and every binding (Python, WASM, C ABI, Go, C#, Node). Splits at outline boundaries into one PDF per (optionally prefix-filtered) bookmark, with collision-free, filesystem-safe filenames. Outline parsing now resolves named destinations (catalog `/Dests` dictionary and the `/Names` → `/Dests` name tree, ISO 32000-1 §12.3.2.3 / §7.9.6), bounded against malformed/cyclic name trees. Plain per-page `split` is unchanged (backward compatible).
- Full idiomatic cross-binding parity for #230/#231/#235/#482 — every feature is now exposed idiomatically in all supported bindings (Rust, Python, C ABI, WASM, C#, Go-cgo, Go-purego, Node/TS):
  - A new additive C ABI `pdf_document_has_timestamp(doc)` exposes the document-scoped PAdES-B-LTA reader signal that `pdf_signature_get_pades_level` (signature-scoped, ≤B-LT by design) cannot report; surfaced as Python `has_document_timestamp`, WASM `hasDocumentTimestamp`, C# `PdfDocument.HasDocumentTimestamp`, Go `(*PdfDocument).HasDocumentTimestamp`, and Node `PdfDocument.hasDocumentTimestamp`/`SignatureManager`.
  - Python now re-exports the entire signing/PAdES surface (`sign_pdf_bytes`, `sign_pdf_bytes_pades`, `Certificate`, `Signature`, `PadesLevel`, `RevocationMaterial`, `Dss`) plus `crypto_cbom` from the top-level `pdf_oxide` package under idiomatic names (the functions were previously reachable only as `py_`-prefixed symbols on the private extension module).
  - The standalone document sanitization entrypoint (#231) is now a first-class `SanitizeDocument()` on the C# and Go (cgo + purego) `DocumentEditor` (previously the live `pdf_redaction_scrub_metadata` C ABI had no managed/Go wrapper).
  - The Go purego (CGO-free) backend, previously read-side only, now covers crypto-governance (#230), destructive redaction + sanitize (#231), PAdES signing + DSS read + B-LTA (#235), and split-by-bookmarks (#482) with signatures identical to the cgo backend.
  - Node/TS gains idiomatic `signPdfBytesPades`, `PadesLevel`, `PdfDocument.getDocumentSecurityStore`/`hasDocumentTimestamp`/`planSplitByBookmarks`, `setCryptoPolicy`/`cryptoPolicy`/`cryptoInventory`/`cryptoCbom`, and `SecurityManager`/`SignatureManager`/`OutlineManager` methods, all with generated TypeScript declarations. Behaviour and the frozen `PadesLevel` integer mapping are unchanged.
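
The per-mode RSA floor from the crypto-governance item can be pictured as a small lookup followed by a fail-closed check. This is a minimal sketch, not the library's implementation: the `Mode` enum, the free function `min_rsa_modulus_bits`, and the `check_rsa_key` helper are hypothetical names chosen for illustration. Only the floor values (Compat 0, Strict/PqcReady 2048, FipsStrict/Cnsa2 3072) are taken from the notes above.

```rust
/// Hypothetical mirror of the policy modes named in the release notes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Compat,
    Strict,
    PqcReady,
    FipsStrict,
    Cnsa2,
}

/// Per-mode default floor in bits, using the values listed in the notes.
pub fn min_rsa_modulus_bits(mode: Mode) -> u32 {
    match mode {
        Mode::Compat => 0,
        Mode::Strict | Mode::PqcReady => 2048,
        Mode::FipsStrict | Mode::Cnsa2 => 3072,
    }
}

/// Fail closed: a key below the floor is rejected before any signing happens.
/// (Illustrative helper; the real gate lives inside the signing functions.)
pub fn check_rsa_key(mode: Mode, modulus_bits: u32) -> Result<(), String> {
    let floor = min_rsa_modulus_bits(mode);
    if modulus_bits < floor {
        return Err(format!(
            "RSA modulus of {} bits is below the {}-bit floor for {:?}",
            modulus_bits, floor, mode
        ));
    }
    Ok(())
}
```

Under this sketch a 1024-bit key passes in `Compat` (no floor) but is refused in `Strict`, and a 2048-bit key that satisfies `Strict` is still refused in `Cnsa2`.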
- Wrong dates in digital-signature timestamps — `format_pdf_date` hard-coded the month/day to `0101` and approximated the year as `1970 + days/365`, so every signature `/M` value (and every document timestamp) resolved to 1 January of a year that drifted as leap days accumulated, instead of the real ISO 32000-1 §7.9.4 date. Replaced with one leap-year-correct, de-duplicated implementation (the two divergent copies are gone).
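
For readers curious what a leap-year-correct conversion looks like, here is a small sketch using the well-known `civil_from_days` algorithm (Howard Hinnant's date algorithms). It is not the library's code: `format_pdf_date_utc` is a hypothetical name, and the output shape `D:YYYYMMDDHHmmSSZ` is simply the UTC form of the ISO 32000-1 §7.9.4 date string.

```rust
/// Convert days since 1970-01-01 into (year, month, day) in the proleptic
/// Gregorian calendar. Leap years are handled by working in 400-year eras.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // number of whole 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month index, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format a Unix timestamp (seconds, UTC) as a PDF date string.
/// Hypothetical helper name; shown only to illustrate the calendar math.
pub fn format_pdf_date_utc(unix_secs: i64) -> String {
    let days = unix_secs.div_euclid(86_400);
    let secs = unix_secs.rem_euclid(86_400);
    let (y, m, d) = civil_from_days(days);
    format!(
        "D:{:04}{:02}{:02}{:02}{:02}{:02}Z",
        y,
        m,
        d,
        secs / 3_600,
        (secs % 3_600) / 60,
        secs % 60
    )
}
```

With the old `1970 + days/365` approximation, a timestamp on 29 February 2000 could never be represented; here `format_pdf_date_utc(951_782_400)` yields `D:20000229000000Z`.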
- Redaction now actually removes content (#231) — the Node `editing-manager` redaction methods previously called native `pdf_redaction_*` symbols that did not exist (silently no-op'ing — a security-critical operation pretending to succeed while removing nothing). Those C ABI symbols now exist and perform true destructive redaction (see Added); the binding gap is closed across all surfaces. A `[BLOCK]` integration test builds a real PDF containing a secret, redacts it through the public API, and asserts the secret is absent from both re-extracted text and the raw saved bytes (idempotent).
- PAdES long-term-validation signatures (#235) — PDF signatures can now carry the ESS `signing-certificate-v2` binding (RFC 5035, defeats certificate-substitution), an RFC 3161 timestamp (B-T), and a Document Security Store for offline long-term validation (B-LT). The DSS is added as an append-only incremental update so pre-existing signatures provably remain `Valid` (asserted by the I1–I7 integrity-invariant suite in `tests/pades_ltv.rs`); a tampered signed region still fails verification (negative test). See Added for scope and the EU-DSS conformance gate.
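
The glyph-selection step of destructive redaction (delete every glyph whose box intersects the edge-padded region) can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `Rect`, `Glyph`, `surviving_chars`, and the padding value are all hypothetical, and real redaction also rewrites the content stream, which is not shown.

```rust
/// Axis-aligned rectangle in page space (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    /// Grow the rectangle by `pad` on every side.
    pub fn padded(self, pad: f64) -> Rect {
        Rect {
            x0: self.x0 - pad,
            y0: self.y0 - pad,
            x1: self.x1 + pad,
            y1: self.y1 + pad,
        }
    }

    /// True when the two rectangles share interior area.
    pub fn intersects(self, other: Rect) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.y0 < other.y1 && other.y0 < self.y1
    }
}

/// One glyph with its text-rendering box (hypothetical type).
pub struct Glyph {
    pub ch: char,
    pub bbox: Rect,
}

/// Keep only glyphs that touch none of the padded regions.
/// Partial overlap is enough to delete a glyph, which errs on the safe side.
pub fn surviving_chars(glyphs: &[Glyph], regions: &[Rect], pad: f64) -> Vec<char> {
    glyphs
        .iter()
        .filter(|g| !regions.iter().any(|r| r.padded(pad).intersects(g.bbox)))
        .map(|g| g.ch)
        .collect()
}
```

The padding matters for glyphs that merely abut the region edge: without it a neighbour sharing an edge with the rectangle survives, while a small pad removes it as well.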
- @Suleman-Elahi for requesting split-by-bookmarks (#482).
- @jedzill4 for volunteering on destructive redaction (#231).
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
See CHANGELOG.md for full details.
v0.3.49 | Off-byte-0 PDF header recovery, sparse-trailer Catalog discovery, a render-path thread-safety fix, and release-automation hardening.
- Linearized PDFs with a non-zero `%PDF-` header offset (#509) — files whose `%PDF-` header is preceded by leading bytes (e.g. a captive-portal HTML redirect injected ahead of a Linearized PDF) are now read instead of rejected with `Trailer missing /Root entry`. The xref-offset shift for header-offset PDFs no longer requires the final trailer to carry `/Root`; xref reconstruction now rejects a parsed-but-`/Root`-less trailer and falls through to Catalog discovery; and `catalog()` scans for `/Type /Catalog` when the trailer omits `/Root` (matching Poppler / PDFium behaviour, ISO 32000-2 §7.5.2 / 1.7 Implementation Note G.6).
- Render-path data race under concurrent rendering (#505) — the process-wide embedded-font classification cache keyed on `Arc::as_ptr` could return a stale `(is_byte_indexed, has_unicode_cmap)` for an unrelated font when an allocation address was recycled across threads, intermittently surfacing as `ParseException [1000]` from `RenderPage`/`RenderPageFit` under `Parallel.ForEach`. The unsound global cache is removed; the cmap classification is now computed locally per call (a cheap `ttf_parser` table probe), so concurrent renders can no longer collide.
- Test helper `make_type0_font` used a non-production `Encoding` variant (#504) — the helper now maps `Identity-H`/`Identity-V` to `Encoding::Identity` exactly as the real font parser does, so the affected Type0 tests exercise the production code path instead of a variant production never produces. Purely test-correctness; no user-facing behaviour change.
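
To make the header-offset fix (#509) concrete, the sketch below locates a `%PDF-` marker that is not at byte 0 and shifts a recorded xref offset by that amount. Both function names are hypothetical, the search window is a caller-supplied assumption, and the idea that recorded offsets are relative to the header is the interpretation this sketch takes of "xref-offset shift"; the parser's actual recovery logic is more involved.

```rust
/// Return the byte offset of the first `%PDF-` marker within the first
/// `window` bytes of `data`, or `None` when no marker is found there.
pub fn find_header_offset(data: &[u8], window: usize) -> Option<usize> {
    const MARKER: &[u8] = b"%PDF-";
    let limit = data.len().min(window);
    data[..limit]
        .windows(MARKER.len())
        .position(|w| w == MARKER)
}

/// Translate an offset recorded relative to the header into a physical
/// file offset (assumption: junk bytes were prepended after the PDF was written).
pub fn physical_offset(recorded_offset: usize, header_offset: usize) -> usize {
    recorded_offset + header_offset
}
```

For a file that starts with a 22-byte HTML fragment, `find_header_offset` returns `Some(22)`, and every recorded xref offset is then read at `recorded + 22`.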
- Release-notes title extraction hardened (#506) — `extract-release-notes.sh` now bounds the subtitle scan to the requested version's section (no longer silently inheriting an older version's `>` blockquote), concatenates multi-line blockquotes instead of truncating at the first line, and fails loudly when the version section or its subtitle is missing. A `validate-changelog` PR/release-branch gate plus a release-title sanity check stop a malformed CHANGELOG from ever reaching the publish step, and a self-contained regression test covers the missing-section, missing-subtitle, multi-line, and cross-version false-scrape cases.
- GitHub Deployments visibility for regular publishes (#493) — each publish job in `release.yml` (crates.io, PyPI, npm, npm-native, NuGet, Homebrew/Scoop) now declares an `environment:`, so standard-pipeline publishes appear under the Deployments view with their artifact URL, matching what the FIPS pipeline already did.
- @Goldziher (kreuzberg-dev) — opened #509 with a clean standalone reproducer (no app code), a pinned test file, a full multi-engine cross-check against Poppler, and a 156-PDF corpus survey that isolated this as the single legitimate file the parser rejected. That report turned a vague "Linearized PDF fails" into a precise header-offset + sparse-trailer root cause.
The remaining fixes (#506, #505, #504, #493) were surfaced internally while reviewing the v0.3.45–v0.3.47 release automation, the post-merge main CI runs, and the v0.3.47 PR review.
v0.3.48 | Pluggable cryptographic provider — FIPS 140-3 compliance for
This release lands the office converter integration (#159): bidirectional PDF ↔ DOCX/PPTX/XLSX round-trip with layout-preserving fidelity, exposed through all seven bindings (Rust, Python, Node, WASM, C FFI, C#, Go). Typical text-heavy PDFs round-trip through an Office file and back at near-pixel parity to the source. The corpus harness used to validate the integration covers 26 PDFs spanning academic papers, hymnals, multi-column newspapers, slide decks, government forms, and policy documents.
Closes the v0.3.14-milestone feature request "PDF to Word/DOCX export": text styling (fonts / sizes / colours) preserved via layout-mode writers + Unicode/CJK system-font fallback; paragraphs / headings / lists preserved via positional frame anchors; image placement preserved via raster Image XObject + Form XObject rasterization. Tables flow through positional shapes (grid-aware reconstruction is still follow-up work).
- Bidirectional PDF ↔ DOCX/PPTX/XLSX conversion (#159) — new `OfficeConverter` API converts in both directions across DOCX, PPTX, and XLSX. Layout-preserving writers (`src/converters/{docx,pptx,xlsx}_layout.rs`) emit one positionally-anchored shape / frame per PDF text span; the back-direction render path (`render_positional_ir` / `render_pptx_positional`) reproduces the source page near-identically. Available on every binding via the `09-new-features/office_conversion/` examples.
- Unicode + CJK system-font fallback for office round-trip (`src/fonts/unicode_fallback.rs`) — when the source PDF embeds a CID-only font subset the writer can't re-embed, a system Unicode face (DejaVu Sans → FreeSans → Noto Sans → Tinos / Arimo) and a CJK face (DroidSansFallbackFull → IPAGothic → NanumGothic → Unifont) are registered automatically. `needs_unicode_fallback` is WinAnsi-aware (curly quotes / em-en dashes / bullet / ellipsis / trademark stay on the source font); CJK ranges (Han / Hiragana / Katakana / Hangul / Compatibility Forms / Halfwidth–Fullwidth) route to the CJK face first. Restores Hebrew, Arabic, Latin Extended, Chinese, Japanese, and Korean characters that previously rendered as `?` glyphs across all three formats.
- Music-notation region detection + rasterization (`src/converters/music_region_finder.rs`) — hymnals and sheet-music PDFs (Finale Maestro, SMuFL Bravura, Sibelius Petrucci / Opus, Adobe Sonata, LilyPond Emmentaler, …) are detected by combining a music-font allowlist with a 5-line staff-clustering pass on `extract_paths`. Detected music systems are rasterized once at 150 DPI and embedded as positioned PNGs; the source spans / shapes inside each music region are suppressed so glyph substitutions don't overlay the bitmap. Hymnal-style PDFs now round-trip with their staves and noteheads preserved instead of emitting random Latin characters from the missing music face.
- Form XObject + inline-image rasterizer shared helper (`src/converters/form_xobject_finder.rs::rasterize_form_and_inline_regions`) — the layout-mode writers and the flow-mode `pdf_to_ir` path share one helper that renders each page once at 150 DPI and crops per region. Vector figures (academic-paper charts, agency logos drawn as Form XObjects) survive the office round-trip; the prior per-region full-page render was replaced.
- Per-run text colour preservation — PDF→DOCX/PPTX/XLSX now emits `<w:color>` / `<a:solidFill>` for spans carrying explicit colour; the back-render path drops to `rich_paragraph` instead of `text_in_rect` when any inline run has a colour so the colour survives the PDF render. Sibling `office_oxide` parser changes expose the colour on `TextSpan` for the docx, pptx slide, and pptx shape paths.
- Rotated-text watermark filter (`src/converters/pdf_to_ir.rs::span_overlaps_rotated_chars`) — page-edge `arXiv:NNNN.NNNNN [cat] DATE` watermarks were leaking into the office round-trip as horizontal text strips mid-page. The new origin-based filter matches each span to its nearest `extract_chars` glyph by `(origin_x, origin_y)` distance and uses that glyph's `rotation_degrees` to decide drop. Gated by a page-level `chars_horizontal_dominant` heuristic (≥75 % chars at ~0°) so PDFs whose text-matrix decomposition spuriously reports rotation = 90° for every glyph (Finale slide-mode decks) are left alone. Catches the watermark family across multiple arxiv papers.
- Multi-column page handling in layout-mode line grouping (`src/converters/layout_lines.rs::group_spans_into_lines`) — refuses to merge a candidate span into the active line when its `bbox.x` sits more than `max_font_size * 4` past the line's right edge. Threshold (~36-48 pt for body text) is wider than any justified inter-word gap but narrower than typical column gutters (60+ pt). Fixes German multi-column newspapers and 2-column arxiv papers where columns previously merged into one frame.
- Drop-cap guard for layout-mode line grouping — `group_spans_into_lines` rejects merges when the candidate span's font size differs from the line's existing spans by > 2×. Anchors Nature-Methods-style drop-cap "A" wraps at the correct visual position instead of fusing them into a single heading-class frame with the body text below.
- OpenType / CFF cmap rebuild and injection (`src/fonts/cmap_injector.rs`, `src/document.rs`) — two real bugs in the cmap-injection path that produced corrupted lowercase glyphs on strict OS renderers:
  - `build_format4_cmap` over-reported subtable length by 2 bytes (double-counted the `reservedPad` field). Strict ttf-parser / CoreText paths silently rejected the cmap; some Win/macOS renderers then mapped the affected codepoints to the wrong glyph.
  - `extract_embedded_fonts_with_unicode_maps_and_widths` was driving its Unicode→GID table off `char_to_unicode`, whose CID-as-Unicode fallback overwrote authoritative ToUnicode entries with identity mappings on Identity-H fonts. Now reads the ToUnicode CMap directly and filters U+FFFD plus C0 controls.
- Shape-artefact filter for layout-mode DOCX (`src/converters/docx_layout.rs`) — drop solid-black rects > 25% page area (slide-background artefacts), solid-white rects > 50% page area (page-background rects emitted before text — would occlude the rendered text in the back-PDF), and rects > 1.2× page extent (extractor noise that wiped the entire frame).
- XLSX layout-mode page count gate raised (`src/document.rs::to_xlsx_bytes`) — `LAYOUT_MAX_PAGES` raised 30 → 200. The 134-page arxiv dissertation was being routed to flow-mode `ir_to_xlsx`, whose column-A row-N layout collapses the centered cover page into the top of column A. Layout-mode handles 100+ page documents fine; the gate now triggers only for very large reports.
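
The two line-grouping guards described above (the multi-column gap limit and the drop-cap size ratio) can be sketched as a single predicate. This is an illustrative simplification, not the code in `layout_lines.rs`: the `Span` and `Line` structs and `can_merge` are hypothetical, and the drop-cap check here compares against the line's largest font size rather than every existing span.

```rust
/// Candidate span: left edge and font size in points (hypothetical type).
pub struct Span {
    pub x: f32,
    pub font_size: f32,
}

/// The line being built: its current right edge and largest font size.
pub struct Line {
    pub right_edge: f32,
    pub max_font_size: f32,
}

/// Decide whether `span` may join `line`.
pub fn can_merge(line: &Line, span: &Span) -> bool {
    // Multi-column guard: a gap wider than 4x the font size is treated as a
    // column gutter, not an inter-word space.
    let gap = span.x - line.right_edge;
    if gap > line.max_font_size * 4.0 {
        return false;
    }
    // Drop-cap guard: font sizes that differ by more than 2x do not share a line.
    let small = span.font_size.min(line.max_font_size);
    let large = span.font_size.max(line.max_font_size);
    if large > small * 2.0 {
        return false;
    }
    true
}
```

With 10 pt body text the gap limit works out to 40 pt, inside the ~36-48 pt range quoted above and below a typical 60+ pt gutter.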
- ExtGState resolve cache: 75× speedup on vector-heavy PDFs (`src/rendering/page_renderer.rs`) — `apply_ext_g_state` was deep-cloning the per-Form ExtGState HashMap on every `gs` operator. Vector figures (scatter / contour plots emitted as Form XObjects) trigger this thousands of times per page — a typical academic paper with a dense plot can hit ~10 000 `gs` ops with 10 000+ unique ExtGState names. The clone dominated render time. The resource dict is now resolved once at the top of `execute_operators` and parsed-effect (`ParsedExtGState`) results are cached per `dict_name`. Measured on a ~10-page vector-heavy arXiv paper: PDF→DOCX dropped from 263 s to 3 s.
- Debug-only path-rasterizer clones gated by log level (`src/rendering/path_rasterizer.rs`) — `path.clone().transform` was unconditional, used only to populate `pixel_bounds` in a `log::debug!` line. Same vector figures hit this path tens of thousands of times per page. Gated behind `log::log_enabled!(Level::Debug)`.
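
The ExtGState change is a classic "parse once, look up afterwards" cache. The sketch below shows that shape under stated assumptions: `ParsedState`, `GsCache`, and the `raw` map standing in for the already-resolved resource dictionary are hypothetical, and the "parse" step is reduced to copying one number so the caching behaviour is easy to see.

```rust
use std::collections::HashMap;

/// Stand-in for the parsed effect of one ExtGState entry (hypothetical).
#[derive(Clone, Debug, PartialEq)]
pub struct ParsedState {
    pub fill_alpha: f32,
}

/// Cache of parsed entries keyed by dictionary name.
pub struct GsCache {
    parsed: HashMap<String, ParsedState>,
    /// Counts how often the (simulated) parse actually ran.
    pub parse_calls: usize,
}

impl GsCache {
    pub fn new() -> Self {
        GsCache { parsed: HashMap::new(), parse_calls: 0 }
    }

    /// Return the parsed state for `name`, parsing it on first use only.
    /// `raw` plays the role of the resource dictionary resolved once up front.
    pub fn get_or_parse(&mut self, name: &str, raw: &HashMap<String, f32>) -> Option<&ParsedState> {
        if !self.parsed.contains_key(name) {
            let alpha = *raw.get(name)?; // unknown name: nothing to cache
            self.parse_calls += 1;
            self.parsed
                .insert(name.to_string(), ParsedState { fill_alpha: alpha });
        }
        self.parsed.get(name)
    }
}
```

Ten thousand `gs` operators naming the same entry then cost one parse and 9 999 hash lookups, with no per-operator clone of the whole dictionary.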
v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix.
This release closes the remaining bugs surfaced by the kreuzberg integration (issue #484) and ships the related text-extraction quality fixes. Word-F1 against the pdftotext-derived ground truth corpus now meets the kreuzberg quality floor for every PDF in the issue 484 set.
- kreuzberg regression suite — all 24 PDFs now meet the F1 floor (#484) — `extract_text` previously failed three documents reported by @Goldziher on the kreuzberg corpus: `pdfa_039.pdf` (swimming-results table) returned F1 0.810, `pr-136-example.pdf` (CJK financial document) returned F1 0.709, and `annotations.pdf` returned F1 0.545. Three separate root-cause fixes restore them to F1 ≥ 0.85:
  - eliminate duplicate emission of multi-row table labels — the text-only spatial fallback in `detect_tables_with_lines` now requires `config.text_fallback=true` (which `extract_text` does not pass) so report-style PDFs with decorative ruling lines no longer get their cell content emitted twice; `span_in_table` adds a text-match fallback to catch label spans whose font ascent extends slightly above the cell's ink box (issue-53-example.pdf F1 0.867 → 0.992).
  - tighten cross-font glue and decimal merge for CJK + Latin layouts — `cross_font_word_glue` no longer fires on a CJK ↔ non-CJK boundary (CJK ideographs satisfy `is_alphabetic()` per Unicode and were being concatenated with adjacent Latin); the `decimal_merge` heuristic requires a column-boundary-sized gap (gap > 0.4 em) so per-glyph `Tj` operators in CJK documents stop mangling "2013" into "201.3" (pr-136 F1 0.709 → 0.884).
  - narrow CJK boundary forced-space to script glyphs only — `should_insert_space` now actively inserts a space at the CJK ↔ non-CJK boundary to match pdftotext tokenisation, but restricted to actual script glyphs (ideographs, kana, hangul); fullwidth ASCII operators like < > = μ stay inline with adjacent digits/Latin so compound tokens like "60000≤Q<80000" are preserved (issue-336 text quality gate stays at PASS). Reported by @Goldziher.
- `extract_spans` now exposes a `merge_tm_tj_runs` opt-out (#488) — Same-line Tm+Tj runs were unconditionally batched into a single `TextSpan`, throwing away the per-Tm positioning that downstream layout-analysis code (e.g. column-aware table detection) needs. `SpanMergingConfig::merge_tm_tj_runs` defaults to `true` for backward compatibility; setting it to `false` flushes the span buffer at every `Tm` operator so callers can opt in to one span per Tm+Tj group, matching the granularity of `pdftotext -bbox-layout`. Reported by @haberman.
- `saveEncryptedToBytes` no longer panics in browser WASM (#492) — `generate_file_id` (per ISO 32000-1 §14.4) called `std::time::SystemTime::now()`, which is unimplemented on `wasm32-unknown-unknown`. Cfg-gated so the WASM build derives the file identifier from `uuid::Uuid::new_v4()` only — still a unique opaque 16-byte ID per the spec. Reported by @eersis-byte.
- CJK fullwidth operator spacing in `to_markdown`/`to_html` (#485) — Four coordinated changes restore `issue-336-example.pdf` to PASS on all three quality gates (text, markdown, html):
  - `pipeline/converters/has_horizontal_gap` suppresses space insertion when one side is CJK and the other is CJK or a fullwidth/math operator (≤, <, >, =, μ, etc.), mirroring the text-extraction CJK-pair suppression.
  - `extract_cell_text` no longer inserts an unconditional space between adjacent spans on the same row of a table cell — uses the same gap-aware separator rules as the inline-flow path so multi-span cells like `60000≤Q<80000` (rendered as 5 separate `Tj` operators) keep their compound tokens intact.
  - `consolidate_adjacent_table_fragments` (new helper in `spatial_table_detector`) merges vertically-adjacent tables that share an identical column structure. The line-based detector emits one fragment per ruling-rule strip on PDFs that draw a horizontal rule between every pair of rows; each fragment was failing `is_real_grid` and falling through to paragraph flow with column-based reading order, producing orphan `<p>40000≤Q</p>` / `<p><55000</p>` pairs. Consolidating before the filter lets the merged multi-row table survive.
  - `is_real_grid` accepts wide consolidated tables that have dense data rows alongside sparse header / multi-row-label rows — the strict 70 % dense-ratio gate was rejecting real tables whose column headers split across multiple visual rows.

  Score improvements on `issue-336-example.pdf`: text 0.612 → 0.820, markdown 0.577 → 0.863, html 0.632 → 0.646 (all PASS their thresholds).
- Text-only spatial table fallback for line-less tables in `to_markdown` (#486) — partial fix. `extract_page_tables` now opts in to a relaxed text-only detection when the caller is a converter (`text_fallback = true`), with the column ceiling raised from 15 to 25 so that sailing-score grids with 16-18 score columns are no longer rejected outright. The fragmented-table consolidation from #485 also kicks in here, recovering most of the row labels and identifier columns. `nougat_018.pdf` markdown still trails its threshold (0.656 vs 0.90) because the score columns themselves — variable-width sparse cells with parenthesised drop-scores — evade column detection; that is the remaining piece tracked separately.
- HTML table cell rendering aligned with markdown (#487) — partial fix. `to_html` now uses the same span-walking and bold/italic preservation as `to_markdown`'s `render_table_markdown`. Three of four affected docs improved by 1-4 % Jaccard but two (nougat_018, nougat_026) still trail the threshold pending the table-fragmentation work above.
- RTL inline emphasis stripping in markdown extraction (#459) — RTL detection now strips `<strong>`/`<em>` markers from visually-reversed runs in `to_markdown` consistently with the plain-text path; spec basis ISO 32000-1 §14.8.2.3.3 (Reverse-Order Show Strings). 46 unit tests in `tests/test_rtl_script_support.rs` cover the detector, BiDi algorithm, and inline-flow integration.
- Multi-byte CMap parsing and array-form `beginbfrange` (§9.7.5) — `beginbfrange ... endbfrange` array notation `<src> <src> [<dst1> <dst2> ...]` was not fully covered; the CMap parser now matches the spec's allowed grammar so multi-byte CIDs map correctly through ToUnicode CMaps.
- `/StructTreeRoot`-only tagged PDFs (§14.7.4) — Documents that declare `/StructTreeRoot` in the catalog without a `/MarkInfo` dictionary (PDF 1.4 documents, valid per the spec) now correctly use the structure tree for table-cell content extraction. Resolves `/OBJR` content-item references during tree traversal so OBJR-referenced annotations and XObjects are no longer lost.
- Indirect references in MediaBox/CropBox accessors (§7.7.3.4) — Page attribute accessors now resolve `/MediaBox` and `/CropBox` through indirect references and the `/Pages` inheritance chain. This is what made the Bucket A errors in the issue 484 retest comment (`annotations*.pdf`, `pdfa_039.pdf`) parse successfully.
- CTM-aware cache key for Form XObject span extraction — Form XObject spans were cached by XObject reference alone, returning stale coordinates for the same XObject reused on multiple pages with different CTM transforms. Cache key now includes the CTM so repeated XObjects produce correctly-positioned spans on each invocation.
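
A CTM-aware cache key can be sketched as a struct that hashes the XObject identifier together with the six matrix entries. The names below (`SpanCacheKey`, `SpanCache`, a `u32` object id) are hypothetical and only illustrate the idea; because `f32` is not `Hash`, this sketch stores the raw bit patterns, which means `0.0` and `-0.0` count as different transforms.

```rust
use std::collections::HashMap;

/// PDF current transformation matrix as [a, b, c, d, e, f].
pub type Ctm = [f32; 6];

/// Cache key combining the XObject identity with the active CTM.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SpanCacheKey {
    xobject_id: u32,
    ctm_bits: [u32; 6],
}

impl SpanCacheKey {
    pub fn new(xobject_id: u32, ctm: Ctm) -> Self {
        // Store bit patterns so the key can derive Eq and Hash.
        SpanCacheKey { xobject_id, ctm_bits: ctm.map(f32::to_bits) }
    }
}

/// Cached span origins per (XObject, CTM) pair; the value type is illustrative.
pub type SpanCache = HashMap<SpanCacheKey, Vec<(f32, f32)>>;
```

Keyed this way, the same Form XObject drawn at the origin on one page and translated by (72, 144) on another occupies two cache slots instead of returning the first page's coordinates for both.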
-
`notdefrange` U+FFFD no longer blocks the CID-as-Unicode fallback (§9.10.2) — Per the spec, U+FFFD (REPLACEMENT CHARACTER) signals "no proper Unicode mapping", so a `notdefrange` hit must not stop the priority list. The Identity CID-as-Unicode fallback (Priority 3) now fires correctly for composite fonts whose ToUnicode CMap returns U+FFFD.
-
ToUnicode Priority-3 fallback guarded for composite fonts (§9.10.2) — The CID-as-Unicode fallback is now only applied to fonts whose CMap is one of the predefined composite-font CMaps or whose CIDFont uses one of the Adobe character collections, matching the spec's enumeration; misapplication on other fonts could produce mojibake on previously-working files.
-
Reject prose / TOC / underline-annotation false-positive tables in `to_html` and `to_markdown` — Wide pages of ordinary paragraph text were sometimes detected as multi-column tables: word x-positions cluster into "columns" by accident, and decorative horizontal rules (newsletter mastheads, annotation underlines, page borders) tricked the line-based detector into treating two adjacent lines as a header + data row. The detection pipeline now applies several post-`is_real_grid` guards that look at the shape of the candidate's cell content rather than just its grid geometry:
  - `looks_like_prose_table` rejects a candidate when more than 12 % of cells end with a mid-sentence `,` or `;`, more than 25 % of cells start with a lowercase ASCII letter (continuation fragments like "and", "the", "to"), or more than 10 % of cells are pure leader dots (the `. . . . . .` runs in tables of contents).
  - The text-only spatial fallback and the horizontal-rule-bounded path both now require ≥ 3 rows of evidence. A title plus a wrapped body line is the signature of prose, not a table; only the line-based intersection / cluster paths (which have authoritative visual evidence) still accept 2-row tables.
  - `should_insert_space` no longer forces a space at the CJK ↔ ASCII-punctuation boundary. The boundary forced-space added in v0.3.47 was correctly inserting a space at "神鹰集团" + "2015" but was wrongly producing "する ." instead of "する." in Japanese technical text; ASCII clause punctuation hugs the preceding token in every script, so the rule is now suppressed when the transitioning glyph IS the punctuation.
  - `text_fallback` defaults back to `true` on `TableDetectionConfig`. The new prose-shape filter replaces the gate-based protection added earlier in the cycle, so the public `extract_tables` API again detects line-less data tables out of the box.
  - `tests/test_corpus_extraction_quality.rs` now strips markdown formatting markers (`**bold**`, `*italic*`, `|` separators, `---|---|---` rule, `# heading`, code fences) before computing Jaccard against the plain-text GT. This mirrors the HTML test's existing `strip_html` step, so the score reflects text content rather than formatting markup.
  - All 19 quality-gate Jaccard tests in `tests/test_corpus_extraction_quality.rs` now pass (up from 13 at the start of this branch). The kreuzberg issue 484 corpus passes its F1 floor on every PDF.
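The prose-shape thresholds quoted above (12 %, 25 %, 10 %) can be illustrated with a small sketch. This is not the library's `looks_like_prose_table`; the function names, the choice to ignore empty cells in the denominator, and the leader-dot pattern are assumptions made for illustration only.

```python
import re

# Thresholds quoted in the release note; everything else here is illustrative.
MAX_MID_SENTENCE_END = 0.12
MAX_LOWERCASE_START = 0.25
MAX_LEADER_DOTS = 0.10


def prose_shape_ratios(cells):
    """Return the three content-shape ratios over the non-empty cells."""
    texts = [c.strip() for c in cells if c.strip()]
    if not texts:
        return 0.0, 0.0, 0.0
    n = len(texts)
    mid_sentence = sum(t.endswith((",", ";")) for t in texts)
    lowercase = sum("a" <= t[0] <= "z" for t in texts)
    # Assumed definition of "pure leader dots": only dots and whitespace.
    leaders = sum(re.fullmatch(r"[.\s]+", t) is not None for t in texts)
    return mid_sentence / n, lowercase / n, leaders / n


def looks_like_prose(cells):
    """Hypothetical stand-in: True when any ratio exceeds its threshold."""
    mid, lower, dots = prose_shape_ratios(cells)
    return (
        mid > MAX_MID_SENTENCE_END
        or lower > MAX_LOWERCASE_START
        or dots > MAX_LEADER_DOTS
    )


data_table = ["Name", "Qty", "Price", "Widget", "4", "9.99", "Gadget", "2", "19.50"]
wrapped_prose = ["The quick brown fox,", "jumps over", "the lazy dog;", "and keeps running"]
toc_rows = ["Introduction", ". . . . . .", "1", "Methods", ". . . .", "7"]
```

Under these assumptions the data table is kept, while the wrapped paragraph and the table-of-contents rows are rejected.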
This release was driven entirely by community bug reports and the kreuzberg integration test feedback loop:
- @Goldziher (kreuzberg-dev) — opened #484 with a calibrated 166-PDF regression suite and follow-up retest comments that turned every remaining gap into a focused root-cause fix
- @haberman — opened #488 with a minimal Rust reproducer for the Tm+Tj merging issue
- @eersis-byte — opened #492 with the WASM `SystemTime` panic backtrace
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
See CHANGELOG.md for full details.
v0.3.46 | Pluggable cryptographic provider — FIPS 140-3 compliance for
-
Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering (#446, #481) — `page.render_pixmap()` (Python), `renderToPixmap()` (Node.js / Go), and `Page.RenderToRgba()` (C#) expose the premultiplied RGBA8888 buffer directly from `tiny_skia::Pixmap::data()`, eliminating the encode→decode roundtrip for callers that need raw pixels (PIL, sharp, `System.Drawing.Bitmap`, `image.RGBA`). Downscaling is now SIMD-accelerated via `fast_image_resize` (ARM NEON, x86 AVX2), replacing the previous bilinear path. Concurrent `render_*` calls on the same `PdfDocument` are now safe: all rendering functions take `&PdfDocument` (shared reference) and all interior-mutable state is already guarded by per-field `Mutex`, so the FFI layer no longer produces aliased `&mut` references and concurrent renders run without a global serialisation bottleneck. Requested by @mara004 and @potatochipcoconut.
-
`ConversionOptions::exclude_regions` / `include_region` (#484) — New spatial filtering fields allow callers to exclude rectangular regions from extraction output or restrict extraction to a single bounding rectangle. Backed by `SpatialCollectionFiltering` trait methods `filter_by_rect` / `exclude_rects`.
-
`PageFontStats` (#484) — New `layout::PageFontStats` struct computed in O(n) over spans; exposes `dominant_em`, `dominant_line_height`, `dominant_char_width`, and `body_font_name`. All layout heuristics now derive absolute thresholds from these measurements instead of hardcoded constants, improving correctness across a wider range of font sizes.
-
JBIG2-compressed scanner PDFs render as blank pages (#332) — The pass-through `Jbig2Decoder` returned compressed bytes unchanged, causing a dimension mismatch and a silent image drop. This release integrates `hayro-jbig2` v0.3 (pure-Rust, Apache-2.0 OR MIT); embedded JBIG2 bitstreams are decoded via `hayro_jbig2::Image::new_embedded`, with JBIG2Globals loaded from `/DecodeParms` when present. `BitsPerComponent` is overridden to 8 post-decode so `to_dynamic_image()` does not attempt CCITT bilevel decompression of already-decoded pixels. Reported by @frederikhors, who also confirmed the original vertical-flip / glyph-substitution symptom is resolved in v0.3.45.
-
`add_text` on existing PDF produces blank or discarded content (#483) — `DocumentEditor::add_text` on a page of an existing PDF either blanked the page or (when combined with `select_pages`) silently returned the unmodified original. Root causes: the storage-side page-index mapping after `select_pages` was off by one, and `add_text` failed to preserve the existing content stream when writing the new text layer. Both are fixed, and an end-to-end regression suite has been added. Reported by @stephenjudkins.
-
Text extraction corpus quality improvements across 166 PDFs (#484) — Systematic audit driven by @Goldziher's calibrated 166-document corpus (the kreuzberg test suite), which provides per-document ground-truth `.txt` files and a word-F1 harness. Multiple extraction failures identified and fixed:
  - Newline/CR-only spans treated as line breaks — Spans consisting entirely of `\n` or `\r` bytes are now emitted as a single newline rather than verbatim byte sequences, eliminating spurious blank lines from some PDF generators.
  - Annotation text double-emitted — `append_non_widget_annotation_text` was called after the main span assembly pass even though `annotation_content_spans()` already inlined annotation `/Contents` into the span list. The redundant call is removed.
  - Markup annotation `/Contents` correctly filtered — Per ISO 32000-1 §12.5.6.2, `/Contents` on Highlight, Underline, StrikeOut, Squiggly, Caret, Ink, FileAttachment, and Redact annotations is popup/tooltip text, not page content. These subtypes are now excluded from `annotation_content_spans` and `append_non_widget_annotation_text`.
  - No space inserted between adjacent CJK characters — `should_insert_space` now returns `false` when both the trailing and leading characters are CJK (Hiragana, Katakana, CJK Unified Ideographs, Hangul, CJK Extension B).
  - Unicode ligatures preserved; adjacent CJK spans merged — Latin ligatures (U+FB00–U+FB06) are now preserved in the span stream rather than dropped. Adjacent CJK spans from the same run are merged into a single span, eliminating inter-character noise.
  - Lower→upper CID range boundary split restored — The CID range boundary split now consistently applies the lower→upper ordering correction that was accidentally dropped; the fix propagates to Markdown and HTML output paths.
  - Non-adjacent subscript/superscript spans merged — `merge_sub_superscript_spans` handles spans separated by intervening content, using em-relative thresholds `[-0.1×em, +0.25×em]` instead of hardcoded absolute values so detection scales with body font size.
  - Column-spanning decimals split at table cell boundaries — Decimal numbers that span two adjacent table cells are split at the cell boundary rather than merged into a single token.
  - Position-aware space insertion between adjacent MCID spans — Spaces between MCID-tagged spans are inserted based on actual rendered x-positions rather than always or never.
  - Boundary split on letter→digit transition only — `char_widths_boundary_split` now splits only at a letter-to-digit boundary (e.g. `Theorem1`), removing false splits on UpperCamelCase terms that previously broke word-shape heuristics.
  - Same-line threshold formula fixed — `same_line_threshold` now uses `(min_fs × 1.2).max(max_fs × 0.3)`, handling mixed-size lines (heading + caption on the same line) without cliff effects.
  - Bare-word identifiers and corrupt `StructTreeRoot` handled — Parser now tolerates bare-word tokens as dictionary values; a corrupt or absent `StructTreeRoot` no longer aborts extraction.
  - Standard-14 font matching strips `SUBSET+` prefix; accepts canonical PostScript aliases — Per ISO 32000-1 §9.6.2.2 Annex D, standard font names are matched after stripping any `ABCDEF+` prefix. `HelveticaOblique` (no hyphen) is now accepted alongside `Helvetica-Oblique`.
  - Explicit `/DW` tracked in `FontInfo` — `has_explicit_dw: bool` added; `has_explicit_widths()` returns `true` when `/DW` is explicitly present, enabling correct width lookup for CIDFonts that declare only `/DW` (no `/W` array).
  - CIDFont width fallback corrected — When `/DW` is absent and a CID is not in the `/W` array, `get_glyph_width` now falls through to `default_width` rather than `cid_default_width`, matching real-world PDF behaviour.
  - Word extractor honours `split_boundary_before` — Words that straddle a table-cell or column boundary are no longer merged.
  - Ligature expansion option — `ConversionOptions` gains `expand_ligatures: bool` (default `false`). When enabled, Latin ligatures (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st) are expanded to component letters.
  - Extraction warnings API — `PdfDocument::warnings()` (clones) and `take_warnings()` (drains) expose non-fatal extraction warnings (missing MCIDs, encrypted-PDF fallback) accumulated during a run.
-
Same-line span reorder: x-gap validation guard (#413) — After the row-aware sort, mixed-baseline glyphs (superscripts, subscripts) could appear before their base glyphs. The `reorder_same_line_runs` helper now validates that a candidate run is horizontally contiguous before X-sorting it; runs with a large X gap are left in row-aware order, preventing disjoint footer/header content from being collapsed into a fake same-line sequence. Fixes `"8th"` ordering (was `"th8"`). Contributed by @RolandWArnold in PR #413.
-
Layout word-merge O(n²) → O(n) — The word-merge pass previously re-scanned the entire accumulator for every candidate span; it is now O(n) via an index map.
-
Wide spatial false-positive tables rejected via dense-row-ratio — Table detection now computes the fraction of rows with dense (≥50%) column coverage and rejects candidates below the threshold, eliminating false positives on wide but sparsely populated layouts.
-
Bare-identifier lexer leniency confined to dict-value position — The lexer's tolerance for bare (unquoted) name-like tokens is now restricted to dictionary value positions, preventing mis-tokenisation of content streams where the same byte sequences are valid operators.
-
Typographic Unicode spaces normalised in extracted spans — Non-breaking, thin, en, em, and other Unicode space variants in span text are normalised to ASCII space before the word-spacing heuristics run, eliminating invisible gaps in the extracted output.
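As a rough illustration of the space normalisation described in the last item, the sketch below maps every character in Unicode general category `Zs` (space separators) to an ASCII space. The exact set of code points the library normalises is not stated above, so the use of `Zs` and the function name `normalise_spaces` are assumptions.

```python
import unicodedata


def normalise_spaces(text):
    """Replace every Unicode space separator (category Zs) with U+0020.

    Assumption: Zs is used as the definition of "space variant". It covers
    no-break, thin, en and em spaces, and also the ideographic space U+3000.
    Tabs and newlines are control characters (Cc) and are left untouched.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


# No-break, thin, en and em spaces between single letters.
sample = "a\u00a0b\u2009c\u2002d\u2003e"
cleaned = normalise_spaces(sample)
```

Running the normalisation before any gap-based word-spacing heuristic means later stages only ever have to compare against a single space character.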
- Rendering: per-segment font re-parsing eliminated — The text rasterizer no longer re-parses font data on every span segment; `Arc` clones across the hot render loop and redundant CJK subsetter invocations are also eliminated, reducing CPU time for text-heavy pages by 30–60%.
- `fast_image_resize` added (#454) — New dependency enabling SIMD-accelerated (ARM NEON, x86 AVX2) image downscaling for the raw-RGBA render path.
- FIPS release workflow now validates on pull requests — `release-fips.yml` now triggers on PRs to `main` that touch source, language-binding, or workflow files. The full build across all five platforms and all four language bindings runs without publishing, so the tag push is a pure deployment step after a confirmed-green PR.
- macOS x86_64 FIPS builds moved to free runners — All four uses of `macos-13-xlarge` (paid Intel Larger Runner, causing indefinite queue waits on plans without access) were replaced with `macos-latest` (free ARM runner cross-compiling to `x86_64-apple-darwin`).
- Cargo registry caching added to all 20 FIPS build jobs — Per-target cache keys (`$runner_os-$target-fips-cargo-$lock_hash`) are restored before each build, substantially reducing re-run time on warm caches.
- @RolandWArnold — contributed the same-line x-gap validation fix in PR #413. Roland diagnosed that `reorder_same_line_runs` was collapsing disjoint footer/header spans into a fake same-line sequence and designed the horizontal-contiguity guard that prevents it. The fix also correctly handles superscript/subscript ordering (`"8th"` instead of `"th8"`).
- @Goldziher (Na'aman Hirschfeld) — filed #484 with a calibrated 166-document corpus, per-document ground-truth `.txt` files, and a word-F1 harness, providing the systematic test bed that drove the bulk of the extraction improvements in this release.
- @stephenjudkins (Stephen Judkins) — filed #483 with a minimal, precisely-scoped reproduction of the `add_text` regression that made the root-cause analysis straightforward.
- @mara004 and @potatochipcoconut — requested the raw RGBA pixel buffer API in comments on #325 with clear use cases across PIL, sharp, `System.Drawing.Bitmap`, and Go's `image.RGBA`, and engaged on the pixel-format details (premultiplied vs straight alpha, tiny-skia format constraints) that shaped the final API design.
- @frederikhors — reported the JBIG2 blank-page symptom in a comment on #332 and confirmed that both the JBIG2 fix and the earlier vertical-flip regression are resolved.
v0.3.45 | Pluggable cryptographic provider — FIPS 140-3 compliance for
- CJK OTF (CFF) font subsetter corrupts glyph order (#449) — OTF fonts with CFF outlines (SFNT magic `OTTO`) were embedded as `FontFile2 / CIDFontType2` (the TrueType path), causing PDF readers to misparse the CFF data and render wrong glyphs. Writer now detects CFF magic post-subsetting and emits the correct PDF object graph: `FontFile3` (with `/Subtype /CIDFontType0C`) + `CIDFontType0` (no `CIDToGIDMap`).
- `AwsLcProvider::verify_rsa_pkcs1v15` now fully implemented (#475) — Changed `SignatureVerifier::verify_rsa_pkcs1v15` to accept the raw message bytes (consistent with `verify_rsa_pss` / `verify_ecdsa`). Under the default `RustCryptoProvider` the hash is now computed inside the trait implementation. Under `AwsLcProvider` (FIPS) the new call path uses aws-lc-rs's `RSA_PKCS1_2048_8192_SHA{256,384,512}` verifiers — RSA-PKCS#1 v1.5 signature verification now works under FIPS instead of returning `SignerVerify::Unknown`.
- `render_page_fit` produces images smaller than the requested box (#480) — Integer-DPI conversion via `floor()` lost up to 3 pixels from the constrained dimension (e.g. a 1040 px fit yielded 1037 px on Letter). The renderer now computes a float scale directly (`fit_px / page_pt`) and stores it in the crate-private `RenderOptions::scale_override` field, bypassing the DPI round-trip entirely. The constrained dimension is now exact for all integer pixel inputs. Reported by @gevorgter.
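The 1040 px to 1037 px figure in the `render_page_fit` item can be reproduced with plain arithmetic. The sketch below is not the renderer's code; it assumes the 1040 px constraint applies to the 612 pt width of a US Letter page (612 × 792 pt), that 72 pt equal one inch, and that the final pixel count is rounded to the nearest integer. The helper names are hypothetical.

```python
import math

POINTS_PER_INCH = 72
LETTER_WIDTH_PT = 612  # US Letter is 612 x 792 pt


def fit_via_integer_dpi(fit_px, page_pt):
    """Old behaviour as described: floor to an integer DPI, then render."""
    dpi = math.floor(fit_px * POINTS_PER_INCH / page_pt)
    return round(page_pt * dpi / POINTS_PER_INCH)


def fit_via_float_scale(fit_px, page_pt):
    """New behaviour as described: carry the float scale fit_px / page_pt."""
    scale = fit_px / page_pt
    return round(page_pt * scale)


old_width = fit_via_integer_dpi(1040, LETTER_WIDTH_PT)    # 122 DPI -> 1037 px
new_width = fit_via_float_scale(1040, LETTER_WIDTH_PT)    # 1040 px
```

The loss comes entirely from flooring 122.35 DPI down to 122; skipping the DPI round-trip removes it.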
- `legacy-crypto` compile-time feature flag (default-on) (#230) — New default-on Cargo feature that gates MD5 key-derivation and RC4 cipher support for PDF Standard Security R≤4 documents. Downstream crates that must not load legacy cryptography can opt out with `default-features = false`; they will receive a clear `Error::InvalidPdf` instead of silently accepting RC4/MD5-encrypted PDFs. The `md-5` crate is now an optional dependency gated behind this feature. RC4 (pure Rust, no crate) is also disabled: both `RustCryptoProvider::rc4()` and `rc4_crypt_impl` are compiled out, and the provider returns `AlgorithmNotPermitted` at runtime when the feature is absent. Phase A of Issue #230.
- Stub parity gate for Python wheels (#464) — `rylai.toml` now uses `--features python` only (matching the released wheel) so generated `.pyi` stubs no longer include symbols from `office` or other optional features. A new CI step (`Verify stub symbol parity`) checks that every stub symbol exists in the installed wheel.
- TypeScript 6 + @types/node 25 upgrade for JS bindings (#438, #440) — JS dev dependencies bumped to TypeScript `^6.0.3` and `@types/node` `^25.6.0`. `tsconfig.json` gains `"types": ["node"]` (required by @types/node 25's ambient-global model) and `"ignoreDeprecations": "6.0"` (to acknowledge the TS6-deprecated `moduleResolution: node` — full migration to `node16` deferred until the import-path audit is done).
v0.3.44 | Pluggable cryptographic provider — FIPS 140-3 compliance for
- `pdf_oxide::crypto::CryptoProvider` trait — new abstraction that decouples PDF encryption and signature paths from any one cryptography crate. Two providers ship out of the box:
  - `RustCryptoProvider` (default): pure-Rust stack as before (`sha2`, `aes`, `rsa`, `p256`, `p384`, `getrandom`, `md-5`, `sha1`). Permits every algorithm PDF specs reference, including the legacy MD5+RC4 path required by ISO 32000-1 R≤4 documents.
  - `AwsLcProvider` (opt-in via `--features fips`): backed by `aws-lc-rs`, FIPS 140-3 validated since 2024. Refuses MD5 / SHA-1-for-signing / RC4 with `Error::AlgorithmNotPermitted` and a clear remediation message.
- Single source of randomness. `src/encryption/algorithms.rs`'s former `SHA-256(uuid_v4 || timestamp_ns || …)` cascade is replaced with `crypto::active().random_bytes()` — under the default provider this is `getrandom::fill()` (OS entropy pool); under FIPS it's `aws_lc_rs::rand::SystemRandom`. Cryptographically suitable for AES-256 file keys and salts; auditable.
- Closes #236.
Three sub-traits compose into `CryptoProvider`:
- `Hasher` — incremental hashing (`update` / `finalize`).
- `SymmetricCipher` — AES-128/256-CBC (PKCS#7 + no-padding) and RC4.
- `SignatureVerifier` — RSA-PKCS#1-v1.5, RSA-PSS, ECDSA P-256/P-384.
Plus an opaque Signer handle so HSM / PKCS#11 / Cloud KMS backends can plug in via SigningKeyMaterial (which is #[non_exhaustive] — future variants for HSM slots etc. are not breaking changes).
The is_legacy_allowed() policy bit lets each provider declare whether MD5 / SHA-1-sign / RC4 are permitted. PDF Standard Security R≤4 documents are gated at EncryptionHandler::new: under a FIPS provider they fail with a remediation message ("re-encrypt at R=6 or build pdf_oxide without the 'fips' feature so the default 'rust-crypto' provider stays active") rather than panic deep inside the cipher path.
use std::sync::Arc;
use pdf_oxide::crypto::{set_provider, AwsLcProvider};
set_provider(Arc::new(AwsLcProvider::new()))?;
let doc = pdf_oxide::PdfDocument::open("encrypted-r6.pdf")?;
See docs/CRYPTO_PROVIDERS.md for the algorithm coverage matrix, custom-provider walkthrough (sovereign-jurisdiction algorithms, HSMs), and the legacy-PDF policy table.
- New `fips` job in `.github/workflows/ci.yml` builds with `--features fips`, runs the 11-test AwsLcProvider suite including a `cross_provider_aes_compat` check that asserts the FIPS and rust-crypto AES paths produce byte-identical output, and enforces clippy `-D warnings` under the FIPS feature.
-
New `.github/workflows/release-fips.yml` workflow (manually triggered) builds and publishes parallel FIPS distributions on every package index, all from the same Rust source compiled with `--features fips` so each binary contains only AWS-LC's FIPS-validated module:
| Ecosystem | Package | Install |
|---|---|---|
| PyPI | `pdf_oxide_fips` | `pip install pdf_oxide_fips==0.3.44` |
| npm | `pdf-oxide-fips` | `npm install pdf-oxide-fips@0.3.44` |
| NuGet | `PdfOxide.Fips` | `dotnet add package PdfOxide.Fips --version 0.3.44` |
| Go | `github.com/yfedoseev/pdf_oxide/go-fips` | `go get github.com/yfedoseev/pdf_oxide/go-fips@v0.3.44` |
Platform matrix in v0.3.44 (every binding × every platform):
| Platform | Python | npm | NuGet | Go |
|---|---|---|---|---|
| Linux x86_64 | ✅ | ✅ | ✅ | ✅ |
| Linux aarch64 | ✅ | ✅ | ✅ | ✅ |
| macOS x86_64 | ✅ | ✅ | ✅ | ✅ |
| macOS arm64 | ✅ | ✅ | ✅ | ✅ |
| Windows x86_64 | ✅ | ✅ | ✅ | ✅ |
All distributions move in lockstep with the regular release — FIPS and default variants of the same release tag are byte-equal in their non-crypto code paths. Per-platform smoke tests in the workflow confirm that the FIPS provider is reachable and that `crypto_use_fips()` (or equivalent) flips the active provider as expected, which catches API mismatches before publishing.
Why `pdf_oxide_fips` (underscore) for Python: PyPI normalizes hyphens / underscores to the same canonical form per PEP 503 (`pip install pdf_oxide_fips` and `pip install pdf-oxide-fips` resolve to the same package). Using underscore in `pyproject.toml` makes the wheel filename and the `import pdf_oxide` path identical to the default distribution — only the package name differs.
Why parallel distributions instead of `pip install pdf_oxide[fips]`: Python extras (PEP 508) can add Python dependencies but cannot swap the compiled `.so` baked inside a wheel. The industry pattern (cryptography, pyOpenSSL) ships separate FIPS distributions; we follow suit.
Why a `go-fips` submodule path: Go modules are import-path-bound, so users pick at `go get` time:
go get github.com/yfedoseev/pdf_oxide/go # default
go get github.com/yfedoseev/pdf_oxide/go-fips # FIPS
Both submodules re-export the same Go API; only the linked native static lib differs.
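The PEP 503 point about hyphens and underscores can be checked directly. The sketch below uses the normalisation rule given in PEP 503 itself (collapse runs of `-`, `_` and `.` to a single `-`, then lowercase); the function name `normalize_project_name` is chosen here for illustration.

```python
import re


def normalize_project_name(name):
    """PEP 503 name normalisation: runs of -, _ or . become one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


underscore_form = normalize_project_name("pdf_oxide_fips")
hyphen_form = normalize_project_name("pdf-oxide-fips")
```

Both spellings normalise to `pdf-oxide-fips`, which is why either `pip install` form quoted above reaches the same project.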
- Restore `manylinux_2_28` glibc floor for Python wheels. 0.3.42 and 0.3.43 published only `manylinux_2_35` Linux glibc wheels because the release workflow ran `maturin build` directly on `ubuntu-latest` (Ubuntu 24.04, glibc 2.39), letting the runner's glibc set the wheel tag. That excluded Amazon Linux 2023 / AWS Lambda Python (glibc 2.34), RHEL 8, Ubuntu 20.04 and Debian 11 — pip rejected the wheel and fell back to a source build that OOM-killed `rustup-init` inside the Lambda build container. Reported by @potatochipcoconut on PR #463. Both `release.yml` (default wheels) and `release-fips.yml` (`pdf_oxide_fips` wheels) now build the Linux glibc wheels via `PyO3/maturin-action` inside the `manylinux_2_28` container, and a CI guard step fails the job if a `manylinux_2_28` wheel is not produced for either Linux target, preventing this regression from recurring. The 0.3.21 baseline (originally added in #284) is restored.
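Why a `manylinux_2_35` wheel is rejected on a glibc 2.34 host follows from the PEP 600 rule that a `manylinux_X_Y` wheel requires glibc X.Y or newer. The sketch below models only that version comparison; real installers also check architecture, Python tag and ABI, and the helper names are hypothetical.

```python
import re


def glibc_floor(platform_tag):
    """Extract the minimum glibc (major, minor) from a manylinux_X_Y_arch tag."""
    match = re.fullmatch(r"manylinux_(\d+)_(\d+)_(.+)", platform_tag)
    if match is None:
        raise ValueError(f"not a PEP 600 manylinux tag: {platform_tag}")
    return int(match.group(1)), int(match.group(2))


def wheel_installable(platform_tag, system_glibc):
    """True when the host glibc is at least the floor encoded in the tag."""
    return system_glibc >= glibc_floor(platform_tag)


AMAZON_LINUX_2023 = (2, 34)  # glibc version quoted in the note above
RHEL_8 = (2, 28)

rejected = wheel_installable("manylinux_2_35_x86_64", AMAZON_LINUX_2023)
accepted = wheel_installable("manylinux_2_28_x86_64", AMAZON_LINUX_2023)
```

With the 2.28 floor restored, the same comparison also passes on RHEL 8, whose glibc is exactly 2.28.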
Extraction of page ranges from large PDFs is now bound by serialisation work instead of redundant document rebuilds and tree walks. Closes #474, reported by community contributor @potatochipcoconut, whose careful root-cause writeup (chunk-by-chunk timings, comparison against PyMuPDF's doc.select(), and a profiling-grade reproduction case from an AWS Lambda IDP pipeline) made this fix possible.
Measured on the public 1112-page / 38 MB Artificial Intelligence — A Modern Approach corpus (pdfs_slow2/) on an idle laptop:
| Workload | 0.3.43 | 0.3.44 | Speedup |
|---|---|---|---|
| `extract_pages_to_bytes(0..300)` | 7301 ms / 36 MB out | 382 ms / 12 MB out | 19× + 3× smaller |
| `extract_pages_to_bytes(0..50)` | 7983 ms / 36 MB out | 155 ms / 4 MB out | 51× + 9× smaller |
| Sequential 23 × 50-page chunks | ~3 min | 1542 ms total | ~120× |
Extrapolating to the reporter's 12k-page / 50 MB document chunked into five 3000-page slices: an AWS Lambda invocation that previously timed out at 900 s after two chunks now finishes the entire five-chunk batch in roughly 30 s.
All in src/editor/document_editor.rs + src/document.rs:
- Triple full-document rewrite. `extract_pages_to_bytes` serialised the whole doc, re-parsed the bytes, removed pages one at a time, and serialised again — three full passes when one would do. Replaced with a non-mutating in-place trimmed `page_order`, restored after the save (even on `Err`).
- Garbage collector walked the original page tree. The trimmed `/Pages` dict was rebuilt locally inside `write_full_to_writer`, but `collect_reachable_ids()` started its BFS from the unmodified catalog and pulled in every dropped page's resources — so the output never shrank no matter how few pages were kept. Fixed by staging the trimmed `/Pages` dict in `modified_objects` before the save; the GC walker already prefers staged dicts over source.
- `get_page_ref(i)` in a 0..n loop is O(n²). Each call walks the page tree from the root and stops at the i-th leaf, so collecting all n leaf refs walks 1 + 2 + … + n nodes. New helper `PdfDocument::all_page_refs()` does it in one DFS. The flat-tree common case (root `/Pages` whose `/Count` matches `Kids.len()`) reads the ref array straight out of `/Kids` without touching individual leaves at all.
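The garbage-collection root cause can be pictured with a toy reachability walk. This is a sketch under stated assumptions, not the crate's `collect_reachable_ids()`: objects are modelled as a mapping from object id to the ids they reference, and `staged` plays the role the note assigns to `modified_objects`.

```python
from collections import deque


def collect_reachable_ids(root_id, source, staged):
    """Breadth-first walk over object ids, preferring staged objects over source."""
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        obj_id = queue.popleft()
        # A staged (modified) object shadows the one read from the source file.
        refs = staged[obj_id] if obj_id in staged else source[obj_id]
        for ref in refs:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# Toy object graph: id -> ids it references.
source = {
    1: [2],                  # catalog -> /Pages
    2: [3, 4, 5],            # /Pages -> three page leaves
    3: [6], 4: [7], 5: [8],  # each page -> its own resources
    6: [], 7: [], 8: [],
}

# Walking the unmodified tree keeps everything, so the output never shrinks.
untrimmed = collect_reachable_ids(1, source, staged={})

# Staging a trimmed /Pages before the walk drops pages 3 and 5 and their resources.
trimmed = collect_reachable_ids(1, source, staged={2: [4]})
```

The walk itself is unchanged between the two calls; only what it sees at the `/Pages` node differs, which matches the description of the fix as a staging change rather than a new collector.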
The same n² loop pattern was lurking in four other call sites on the reporter's hot path (their pipeline does PDF/A validate + convert before the chunked extract). All five collapsed to a single all_page_refs() call:
- `src/outline.rs` — `find_page_index` (O(n²) per outline entry → O(n³) on documents with bookmarks).
- `src/editor/document_editor.rs` line ~4275 — page-ref → index map for partial form-flatten.
- `src/editor/document_editor.rs` line ~4505 — same map for `get_form_fields()`.
- `src/compliance/validators.rs` — `validate_fonts` (`doc.validate_pdf_a('2b')`).
- `src/compliance/converter.rs` — per-page `/AA` strip (`doc.convert_to_pdfa('2b')`).
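The quadratic pattern and its single-pass replacement can be compared on a toy page tree. The node layout (dicts with `kids` for intermediate nodes and `ref` for leaves) and the visit counter are assumptions for illustration; the function names only mirror the ones mentioned above and are not the crate's implementations.

```python
def iter_leaves(node, counter):
    """Depth-first walk yielding page-leaf refs in document order."""
    if "kids" not in node:
        counter[0] += 1  # count every leaf the walk touches
        yield node["ref"]
        return
    for kid in node["kids"]:
        yield from iter_leaves(kid, counter)


def get_page_ref(root, index, counter):
    """Walk from the root and stop at the index-th leaf."""
    for i, ref in enumerate(iter_leaves(root, counter)):
        if i == index:
            return ref
    raise IndexError(index)


def all_page_refs(root, counter):
    """Collect every leaf ref in a single traversal."""
    return list(iter_leaves(root, counter))


# 10 intermediate nodes with 10 leaves each: 100 pages.
root = {
    "kids": [
        {"kids": [{"ref": group * 10 + i} for i in range(10)]}
        for group in range(10)
    ]
}

loop_visits = [0]
looped = [get_page_ref(root, i, loop_visits) for i in range(100)]

single_visits = [0]
collected = all_page_refs(root, single_visits)
```

Both approaches return the same 100 refs, but the loop touches 1 + 2 + … + 100 = 5050 leaves while the single traversal touches 100.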
Two additions, both directly requested by @potatochipcoconut in #474; both available in Rust and Python (the other bindings can be added on demand):
# Batch extraction — same single-call efficiency, ergonomic for
# the chunked-for-OCR / chunked-for-S3 pattern.
chunks = doc.extract_page_ranges_to_bytes(
[(0, 3000), (3000, 6000), (6000, 9000), (9000, 12000)]
)
# In-place selection — equivalent to PyMuPDF's doc.select(...).
# After this call, the document holds only the listed pages,
# in the order given. doc.save() / doc.save_to_bytes() then
# emit only those pages with garbage-collected resources.
doc.select_pages([1, 4, 7, 99])
PDFs whose /Pages root publishes shared /Resources used by all leaf pages (typical of high-resolution book scans, atypical of office documents with subset fonts) still produce full-size chunk output: GC correctly preserves resources reachable from kept pages, and a single shared resource pool stays reachable as long as any kept page references it. The principled fix is per-page resource sub-setting — parsing each kept page's content stream to determine which fonts / XObjects are actually used and emitting a minimal /Resources for that page. That is a feature, not a bug fix, and is deferred from this release. The wall-clock speedup (12–54×) holds regardless.
- 5050 lib tests pass under `--features python,fips` (5039 default + 11 FIPS-only).
- 119 encryption tests still pass byte-equal post-rewire to the trait.
- 69 signatures tests still pass byte-equal post-rewire.
- Hash vectors validated against NIST FIPS 180-4 for SHA-256/384/512 and RFC 1321 / 3174 for MD5 / SHA-1.
- New regression tests cover the issue #474 workflow: `test_extract_pages_chunked_sequential` (4 sequential chunks on the same `DocumentEditor`, source observably unchanged between calls), `test_extract_pages_non_sequential` (out-of-order indices `[3, 0, 4]`), `test_extract_page_ranges_to_bytes_batch`, `test_select_pages_in_place`, and `test_select_pages_out_of_range`.
- `AwsLcProvider` RSA-PKCS#1 v1.5 verify-from-digest (#475) — `AwsLcProvider::verify_rsa_pkcs1v15` is currently a stub; PDF/CMS signatures using RSA-PKCS#1 v1.5 return `SignerVerify::Unknown` instead of verifying under FIPS. Blocked on `aws-lc-rs` exposing a stable `RSA_PKCS1_PRIM_VERIFY` API. `RustCryptoProvider` (default) is not affected.
- `AwsLcProvider` signing wiring — signing calls are currently routed to `RustCryptoProvider`. Full AWS-LC signing integration lands in v0.3.45.
- musllinux Python wheels for the FIPS variant — FIPS musllinux wheels (Alpine / musl libc) require a musl-targeted `aws-lc-fips-sys` build; work in progress.
v0.3.43 | Cross-binding parity, WASI build target, and a basket of issue fixes.
- `render_page_fit()` now ships in all five bindings (Rust core + Python, Node.js / TypeScript, C#, Go). It picks the largest DPI such that both rendered dimensions fit inside a target pixel box, preserving aspect ratio. No more "what DPI hits 1024×768?" math on the caller's side. Fixes #441, closes #448.
- Idiomatic page iteration parity across bindings. Rust gets `page_indices()`, Python gets `.pages`, Node.js gets `[Symbol.asyncIterator]` (the sync `[Symbol.iterator]` was already there). C# `Pages` and Go `Pages()` were already shipped. Closes #447.
- WASI build target: `cargo build --target wasm32-wasip1` now builds the lib cleanly on stable Rust. This unblocks @RALaBarge's external `pdf-oxide-wasi` stdin→stdout wrapper and any other consumer wanting to embed pdf_oxide in a sandboxed WASI runtime. CI now gates that the WASI build stays green. Closes #214.
- Spurious-table fix on dense word grids: Roland's #405 lands via cherry-pick. A new `has_split_modal_column_groups` validator inspects the column co-occurrence graph across modal rows and rejects candidates whose populated columns split into two or more disconnected components, the signature of two adjacent text flows mis-clustered as one table. It composes cleanly with v0.3.42's `Table::is_real_grid` filter. Validated against the 86-PDF cross-build corpus: 888 / 888 byte-equal. There is zero observable change on common documents; the gate's value is as a safety net for adversarial cases.
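The DPI selection described for `render_page_fit()` reduces to taking the smaller of two per-axis limits. The following is a minimal sketch of that arithmetic only, assuming page dimensions in PDF points (72 per inch); `fit_dpi` and `rendered_size` are hypothetical helper names, not the library's API.

```rust
/// PDF user space uses 72 points per inch.
const POINTS_PER_INCH: f64 = 72.0;

/// Largest DPI at which a page of `page_w_pt` x `page_h_pt` points fits
/// inside a `max_w_px` x `max_h_px` pixel box. Hypothetical helper.
fn fit_dpi(page_w_pt: f64, page_h_pt: f64, max_w_px: u32, max_h_px: u32) -> Option<f64> {
    if page_w_pt <= 0.0 || page_h_pt <= 0.0 || max_w_px == 0 || max_h_px == 0 {
        return None; // degenerate input: no meaningful DPI
    }
    // DPI limit imposed by each axis; the tighter one wins, which keeps
    // the aspect ratio because a single scale applies to both axes.
    let dpi_w = f64::from(max_w_px) * POINTS_PER_INCH / page_w_pt;
    let dpi_h = f64::from(max_h_px) * POINTS_PER_INCH / page_h_pt;
    Some(dpi_w.min(dpi_h))
}

/// Pixel size produced at `dpi`, rounded down so it cannot exceed the box.
fn rendered_size(page_w_pt: f64, page_h_pt: f64, dpi: f64) -> (u32, u32) {
    let scale = dpi / POINTS_PER_INCH;
    ((page_w_pt * scale).floor() as u32, (page_h_pt * scale).floor() as u32)
}
```

For a US Letter page (612 × 792 pt) and a 1024 × 768 box, the height is the binding constraint, so the chosen DPI is a little under 70.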
- #456: `PdfDocument::open(path)` now populates `source_bytes`, unblocking `convert_to_pdf_a()`, the C FFI `pdf_document_get_source_bytes`, and any other API that re-reads the in-memory copy. Path-loaded documents previously got an empty `Vec<u8>` and hit `"Invalid PDF header: File is empty (0 bytes read)"` from the PDF/A converter. Reported by @potatochipcoconut on PR #445.
- #451: Standard14 PostScript fonts with no open-source equivalent (`Symbol`, `ZapfDingbats`) are now downgraded from hard `FontNotEmbedded` errors to a new `KnownUnembeddableFont` warning during PDF/A conversion. A document that's otherwise compliant no longer fails solely because of one symbolic font.
- #395: closed; verified that the off-by-one C# `ExceptionMapper` fix in v0.3.38 actually resolves the reported `RenderPage` → `SignatureException [8500]`. Added a Rust regression test that opens @gevorgter's exact reproducer PDF and asserts `render_page` succeeds. The fixture is pinned in `pdf_oxide_tests`.
- #462: dropped the `scripts/modernize_stubs.py` post-processor and the `python_version = "3.8"` setting from `rylai.toml`. Rylai's default already emits PEP-585 / PEP-604 syntax with `from __future__ import annotations` at the top, so post-processing was duplicate work in opposite directions. Runtime support for Python 3.8/3.9 is unaffected: `.pyi` stubs are type-checker artifacts, never imported at runtime. Reported by @monchin with a clean diagnosis of the root cause.
- `PdfDocument::open(path)` now reads the file once into memory rather than streaming via `BufReader<File>`. The doc comment already promised "Reads the entire file into memory"; this makes it true. Memory usage on `open()` is now equivalent to `from_bytes(std::fs::read(path)?)`. Required by #456; the streaming reader was a partial optimisation no caller could rely on (every code path that touched `source_bytes` already required the in-memory copy).
- `PdfReader` enum collapsed to a single in-memory variant: the unused `File` variant was removed. `std::io::{Read, Seek, BufRead, …}` imports are no longer cfg-gated, which is what unblocked the `wasm32-wasip1` build target.
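The "read once, then share the `from_bytes` path" shape can be sketched with the standard library alone. `Doc` below is a hypothetical stand-in, not the real `PdfDocument`, and its header check is deliberately simplistic.

```rust
use std::{fs, io, path::Path};

/// Hypothetical stand-in for a document type that retains its source bytes.
struct Doc {
    source_bytes: Vec<u8>,
}

impl Doc {
    /// Build a document from an in-memory buffer (simplified header check).
    fn from_bytes(bytes: Vec<u8>) -> io::Result<Self> {
        if !bytes.starts_with(b"%PDF-") {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid PDF header"));
        }
        Ok(Self { source_bytes: bytes })
    }

    /// One read into memory, then exactly the same path as `from_bytes`,
    /// so `source_bytes` is populated for path-loaded documents too.
    fn open(path: impl AsRef<Path>) -> io::Result<Self> {
        Self::from_bytes(fs::read(path)?)
    }
}
```

Because both constructors end in the same function, any API that later re-reads `source_bytes` behaves identically regardless of how the document was loaded.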
- Batch-applied 9 dependabot bumps onto `release/v0.3.43`: CI workflows (`golangci-lint-action` v7→v9, `setup-go` 5.5→6.4, `setup-node` 4.4→6.4, `github-script` SHA refresh, `scorecard-action` 2.4.0→2.4.3), Go (`testify` 1.8→1.11; it was declared but unimported, so it was dropped entirely), JS (`rimraf` 5→6; `@types/node` deferred to a follow-up after a TypeScript-strict shake-out), Python (`onnx` ≥1.14→≥1.19.1).
- The RustCrypto 0.8 stack (`pkcs8 0.11`, `spki 0.8`, `der 0.8`, `digest 0.11`, `crypto-common 0.2`, `block-buffer 0.12`) stays pinned: `rsa 0.10` and `p256/p384 0.14` are still RC upstream. See the existing pin note at `Cargo.toml:185-187`.
- New `wasm32-wasip1` build smoke check in `.github/workflows/ci.yml` alongside the existing `wasm32-unknown-unknown` job.
- Regenerated SBOMs (`pdf_oxide_cli/sbom.cdx.json`, `pdf_oxide_mcp/sbom.cdx.json`) for 0.3.43.
- New regression tests: `tests/test_issue_456_path_open_source_bytes.rs`, `tests/test_issue_447_page_indices.rs`, `tests/test_issue_395_render_page.rs`.
- New unit tests on `compliance::converter::downgrade_known_unembeddable_fonts`.
86-PDF stratified corpus comparison (academic, mixed, forms, government, newspapers, theses, plus the three #211 fixtures), 888 sampled (pdf, page, method) triples across extract_text, to_plain_text, to_markdown, to_html:
- v0.3.43 vs v0.3.42 — 888 / 888 byte-equal, zero deltas
- v0.3.43 vs PyPI v0.3.41 — 860 equal, 28 reorder/de-dup, 0 real content losses (same profile as v0.3.42's regression report)
This release exists because of the community. Special thanks to:
- @RolandWArnold: landed the spurious-table fix in #405. After iterating away from an earlier density-gate framing, the shipped form is `has_split_modal_column_groups`: a connected-component check on the column co-occurrence graph across modal rows that flags two-flow grids the regular-row-ratio gate accepts. Roland's doc-comment explicitly flags it as a heuristic, making it easy to revisit later. The fix composes with v0.3.42's struct-tree-aware reading-order rewire without any merge conflict.
- @RALaBarge: built an external WASI binary wrapper for pdf_oxide (pdf-oxide-wasi) and reported in #214 that it required nightly Rust because of an internal `ceil_char_boundary` call. That call was already removed; this release fixes the second hidden blocker (cfg-gated `std::io` imports) and adds CI gating so the WASI target stays green.
- @gevorgter: flagged two rendering-area gaps, namely the C# binding's misleading `SignatureException` on `RenderPage` (#395, fixed in v0.3.38, regression-guarded here) and the lack of a pixel-dimension render API (#441, closed by `render_page_fit` shipping in all five bindings).
- @potatochipcoconut: surfaced the `convert_to_pdf_a` failure on path-loaded documents while testing PR #445; the investigation traced it to the empty `source_bytes` field and produced the one-line fix in this release (#456).
- @monchin: pointed out (#462) that `scripts/modernize_stubs.py` was redundant work because rylai itself controls the typing flavour via its `python_version` setting, and noted that `office` / `barcodes` / `ocr` feature alignment between `rylai.toml` and the released wheel is worth a follow-up. The cleaner stub pipeline ships in this release.
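A connected-component check on a column co-occurrence graph can be sketched with a small union-find. This is an illustration of the technique only, not the shipped `has_split_modal_column_groups`: it assumes the caller has already selected the modal rows and reduced each one to a per-column "has text" flag, and the function name is hypothetical.

```rust
/// Union-find root lookup with path halving.
fn find(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

/// `rows[r][c]` is true when row `r` has text in column `c`.
/// Columns populated in the same row are linked; the result is true when
/// the populated columns form two or more disconnected groups.
fn has_split_column_groups(rows: &[Vec<bool>]) -> bool {
    let n_cols = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut parent: Vec<usize> = (0..n_cols).collect();
    let mut populated = vec![false; n_cols];

    for row in rows {
        // Collect the populated column indices of this row.
        let mut cols = Vec::new();
        for (c, &on) in row.iter().enumerate() {
            if on {
                populated[c] = true;
                cols.push(c);
            }
        }
        // Chain-link neighbours; that is enough to connect the whole row.
        for pair in cols.windows(2) {
            let a = find(&mut parent, pair[0]);
            let b = find(&mut parent, pair[1]);
            parent[a] = b;
        }
    }

    let mut roots: Vec<usize> = (0..n_cols)
        .filter(|&c| populated[c])
        .map(|c| find(&mut parent, c))
        .collect();
    roots.sort_unstable();
    roots.dedup();
    roots.len() >= 2
}
```

Two side-by-side text flows never share a row, so their columns end up in separate components, while a genuine table has rows that bridge its columns.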
v0.3.42 | Text-extraction reading-order rewire — fixes [#211](https://github.com/yfedoseev/pdf_oxide/issues/211)
- `extract_words` and `extract_text_lines` now honor the structure tree on tagged PDFs (per ISO 32000-1:2008 §14.7 / §14.8.2.3) instead of applying XY-Cut block partitioning. On the three #211 fixtures from pdfplumber's public test corpus this restores correct reading order for centered titles above body text (Quebec municipal minutes case) and stops splitting prose lines across phantom column gutters in form-style layouts (US child-welfare report case).
- Spurious markdown / HTML tables on form-style layouts (label-colon-value pairs) are gone: spatial table detection is now gated on a real-grid validator (≥2 rows × ≥2 cols, ≥50% of rows with at least two non-empty cells).
- New `include_artifacts` kwarg on `extract_words` / `extract_text_lines` (Python) gates the spec-correct behavior of excluding `/Artifact`-tagged content (running headers, footers, page numbers, watermarks; ISO 32000-1:2008 §14.8.2.2.1). Default is `True`, which preserves pre-0.3.42 behavior so existing scripts don't lose content. Pass `include_artifacts=False` to opt into the spec-correct exclude. The default may flip in a future major release once the artifact-detection heuristic is hardened against false positives on docs whose body text recurs across pages.
- The default API surface is now knob-free: `region`, `word_gap_threshold`, `line_gap_threshold`, `profile` are deprecated on `extract_words` / `extract_text_lines` (Python). They still work but emit `DeprecationWarning`; they will move to a separate `extract_*_advanced` surface in a future release.
- ~6× faster on `extract_words` / `extract_text_lines` because the XY-Cut partition is no longer in the hot path.
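The real-grid rule quoted above (≥2 rows × ≥2 cols, ≥50% of rows with at least two non-empty cells) is simple enough to state as a predicate. The sketch below assumes a table is represented as rows of cell strings; it is an illustration of the stated thresholds, not the library's `Table::is_real_grid` implementation.

```rust
/// True when `rows` satisfies the stated real-grid thresholds:
/// at least 2 rows, at least 2 columns, and at least half of the rows
/// carrying two or more non-empty cells.
fn is_real_grid(rows: &[Vec<String>]) -> bool {
    let n_cols = rows.iter().map(Vec::len).max().unwrap_or(0);
    if rows.len() < 2 || n_cols < 2 {
        return false;
    }
    let populated_rows = rows
        .iter()
        .filter(|row| row.iter().filter(|cell| !cell.trim().is_empty()).count() >= 2)
        .count();
    // populated / total >= 0.5, written without floating point.
    populated_rows * 2 >= rows.len()
}
```

A candidate where most rows hold a single populated cell fails the 50% test and would be rendered as plain text rather than as a table.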
- #211: `extract_words` / `extract_text_lines` produce wrong reading order on tagged PDFs. Headings and prose lines that XY-Cut had moved out of position now appear where the document author marked them via the `/StructTreeRoot` MCID order. Reported by @ankursri494 against pdfplumber's `pdf_structure.pdf`, `2023-06-20-PV.pdf`, and `150109DSP-Milw-505-90D.pdf` test fixtures.
- `extract_words(page)` / `extract_text_lines(page)` gain an `include_artifacts` kwarg (default `True`, backward-compatible). Pass `include_artifacts=False` to drop spans tagged as artifacts per ISO 32000-1:2008 §14.8.2.2.1. Word counts on documents with running headers / footers will decrease in that mode.
- Multi-column reading-order detection on untagged PDFs is now conservative: column-aware mode opts in only when the page presents ≥3 distinct vertical gutters, each ≥ `median_char_width × 4` wide, with text on both sides. 1- and 2-column synthetic layouts default to row-aware top-to-bottom ordering, which matches pdfplumber. Tagged multi-column PDFs are unaffected: they reach the column-aware path via the structure tree.
- `to_markdown(page)` / `to_html(page)` no longer emit `<table>` for layout-only structures detected by the spatial heuristic. Real tables (`<Table>` in the struct tree, or grids ≥2×2 with ≥50% of rows populating ≥2 cells) still render as tables.
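The multi-column criterion (≥3 gutters, each at least `median_char_width × 4` wide, with text on both sides) can be approximated by projecting spans onto the x axis and counting wide gaps in the covered range. This is a simplified sketch under that assumption; the real detector presumably also considers vertical extent, and `XRange`, `count_wide_gutters`, and `use_column_aware_order` are hypothetical names.

```rust
/// Horizontal extent of one text span, in page units.
#[derive(Clone, Copy)]
struct XRange {
    x0: f64,
    x1: f64,
}

/// Count gaps in the x-axis coverage that are at least 4 median character
/// widths wide. A gap between covered ranges has text on both sides by
/// construction.
fn count_wide_gutters(spans: &[XRange], median_char_width: f64) -> usize {
    let mut sorted = spans.to_vec();
    sorted.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    let min_gap = median_char_width * 4.0;

    let mut gutters = 0;
    let mut covered_to = f64::NEG_INFINITY; // right edge of merged coverage
    for s in &sorted {
        if covered_to.is_finite() && s.x0 - covered_to >= min_gap {
            gutters += 1;
        }
        covered_to = covered_to.max(s.x1);
    }
    gutters
}

/// Column-aware ordering is only chosen for three or more wide gutters.
fn use_column_aware_order(spans: &[XRange], median_char_width: f64) -> bool {
    count_wide_gutters(spans, median_char_width) >= 3
}
```

Under this rule a two-column page has a single gutter and stays on the row-aware path.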
- New `pdf_oxide::pipeline::page_reading_order(doc, page)` helper: single source of truth for canonical reading-order span sequence. Tagged + struct tree (no `/Suspects`) → walks the tree; otherwise → geometric top-to-bottom + y-tolerance. Companion variant `page_reading_order_no_artifacts` strips spans tagged as `/Artifact` for the spec-correct exclude case.
- `extract_words_with_thresholds` and `extract_text_lines_with_thresholds` delegate through the helper for the default code path (artifacts retained). New `extract_words_with_thresholds_no_artifacts` and `extract_text_lines_with_thresholds_no_artifacts` surfaces are available for the spec-correct artifact-excluded behavior. The `profile=Some(...)` path retains its previous XY-Cut behavior pending the planned removal of the `profile` kwarg.
- `GeometricStrategy` now defaults to row-aware top-to-bottom ordering; column-aware mode is gated by the strict multi-column criterion above.
- `Table::is_real_grid()` is introduced as the real-table validator; `extract_page_tables` filters the spatial heuristic's output through it.
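The geometric fallback ("top-to-bottom + y-tolerance") amounts to grouping spans into lines by baseline and sorting each line left to right. The sketch below assumes PDF coordinates (y grows upward) and a caller-supplied tolerance; `Span` and `geometric_order` are hypothetical and much simpler than a production implementation.

```rust
/// Minimal text span: content plus the position of its origin.
#[derive(Clone, Debug, PartialEq)]
struct Span {
    text: String,
    x: f64,
    y: f64,
}

/// Order spans top-to-bottom, then left-to-right within a line.
/// Spans whose `y` lies within `y_tol` of a line's first span share that line.
fn geometric_order(mut spans: Vec<Span>, y_tol: f64) -> Vec<Span> {
    // PDF y grows upward, so a larger y means higher on the page.
    spans.sort_by(|a, b| b.y.total_cmp(&a.y));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    for s in spans {
        let same_line = lines
            .last()
            .map_or(false, |line| (line[0].y - s.y).abs() <= y_tol);
        if same_line {
            lines.last_mut().unwrap().push(s);
        } else {
            lines.push(vec![s]);
        }
    }

    for line in &mut lines {
        line.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    lines.into_iter().flatten().collect()
}
```

The tolerance absorbs small baseline jitter, so two words on the same visual line are not split just because their y values differ by a fraction of a point.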
75-PDF stratified-sample corpus (academic, mixed, forms, government, newspapers, theses, plus the three #211 fixtures) compared between 0.3.41 and 0.3.42 across all eight extraction methods on the first 3 pages of each PDF — 1592 comparisons total. Zero content regressions: every word the baseline extracted is also extracted by 0.3.42; only ordering / line-grouping / table-rendering changed.
- #453: drop the unused `lzw` direct dependency. `LzwDecoder` already routed through `weezl` plus a custom fallback; the `lzw` crate was declared in `Cargo.toml` but never imported. Silences RUSTSEC-2020-0144 (unmaintained advisory) for downstream cargo-deny consumers as a side-effect.
- #454 (partial): `cargo update` lockfile refresh: `fax 0.2.6 → 0.2.7`, `imageproc 0.26.1 → 0.26.2`, `js-sys` / `web-sys` `0.3.95 → 0.3.97`, `pdfium-render 0.9.0 → 0.9.1`, `rustls 0.23.39 → 0.23.40`, `wasm-bindgen` family `0.2.118 → 0.2.120`, plus 12 other transitive patch / minor bumps. The remaining major-version items in #454 (RustCrypto 0.8 stack: `pkcs8 0.11`, `spki 0.8`, `der 0.8`, `digest 0.11`, `crypto-common 0.2`, `block-buffer 0.12`) stay pinned: `rsa 0.10` and `p256 0.14` / `p384 0.14` are still RC upstream as of 2026-04 (see the existing pin note in `Cargo.toml:185-187`).
This release exists because of the community. Special thanks to:
- @ankursri494: reported #211 with three carefully chosen pdfplumber-corpus fixtures (`pdf_structure.pdf`, `2023-06-20-PV.pdf`, `150109DSP-Milw-505-90D.pdf`) that isolate three distinct failure modes: wrong reading order on tagged PDFs, dropped document headings, and prose-line splits at form gutters. They also kept the issue alive through two rounds of "is this still broken on the latest version?", which forced the deeper investigation that ultimately exposed the architectural gap behind #457. Without that persistence and that specific repro set, this rewire would not have shipped.
- @lingcoder: flagged the unmaintained `lzw` advisory in #453 with a precise pointer to RUSTSEC-2020-0144 and the `weezl` migration path; the investigation surfaced that the dep was unreferenced entirely, turning it into a one-line cleanup.
v0.3.41 | Real PDF/A conversion, LaTeX symbolic-font glyph rendering fix, and
This release exists because of the community. Special thanks to:
- @FireMasterK: reported #307 with a precise reproduction case: a LaTeX-generated PDF where accented characters and ligatures (ú, á, fi) rendered as blank gaps across all pages. The report identified the exact document class (DC/EC TrueType fonts with Mac Roman cmap, no `/Encoding` dict), which made the root cause in `render_cid_direct()` straightforward to isolate and fix.
- @sparkyandrew: followed up on #425 with #443, noticing that the output PDF was 2.32 MB when the two source images summed to under 1.6 MB, even after the #425 image-pipeline fix. That single observation pinpointed the missing XObject deduplication: the same image data encoded twice produced two independent compressed streams. Fixed.
- @potatochipcoconut: #418, the original PDF/A binding-completeness report that drove the full implementation in #442. `convert_to_pdf_a()` existed in Rust but was a no-op: it recorded actions and returned success while leaving the document bytes untouched. The report surfaced this silently-broken state across all seven bindings.
- @nickpetrovic: filed #444 with a precise four-row reproduction table showing ligature glyphs in subset Calibri fonts decoded to wrong Unicode codepoints (`ti→O`, `tf→[`, `ft→e`). The report included the exact PDF and the per-font-subset mapping failures, which led directly to the ICCBased color-space warn spam fix and the rowspan-label reading-order scramble fix.
- @RubberDuckShobe: reported #450: any PDF containing a PNG with an alpha channel showed a diagonal stripe through the image. A minimal reproduction confirmed the bug was reproducible across Acrobat, Preview, and browser PDF viewers. The report made the scope unambiguous (every image with transparency was affected) and led directly to the missing `DecodeParms` fix in `build_soft_mask_dict()`.
- @truffle-dev: first code contribution to the project: completed the CLI output-path fix for #412 in #452. The original audit in #412 covered all 11 CLI commands with exact line references and two proposed design options; the PR was clean on first submission. It picks up the four commands (`crop`, `decrypt`, `delete`, `reorder`) missed by the earlier partial fix, and also enforces `-o/--output` for `merge` instead of silently defaulting to the first input's directory.
- Real PDF/A conversion: XMP metadata stream, `pdfaid:part` / `conformance` identification, OutputIntents (sRGB), language tag, JavaScript removal; all 7 bindings (#418, #442).
- Symbolic TrueType glyph rendering: non-ASCII bytes (ú=0xFA, á=0xE1, fi=0x85) in DC/EC-style LaTeX fonts with Mac Roman cmap are no longer suppressed as spaces (partially fixes #307; follow-up cases reported by FireMasterK on 2026-04-29 remain open).
- Image XObject deduplication: the same image embedded twice is no longer re-encoded as two separate compressed streams; PDF size matches the sum of source images (#443).
- Diagonal-line artifact in transparent images fixed: missing `DecodeParms` in the soft-mask XObject caused a visible diagonal stripe in any PNG with an alpha channel (#450).
- Barcode SVG generation: `pdf_barcode_get_svg` no longer returns `ERR_UNSUPPORTED`; it generates real SVG for all 8 1D barcode types and QR (#421).
- CLI output routing: `crop`, `decrypt`, `delete`, and `reorder` now write default output beside the input file instead of the current working directory; `merge` now requires `-o/--output` and errors up front instead of silently defaulting to the first input's directory. Completes #412.
convert_to_pdf_a() previously recorded conversion actions and returned success, but the document bytes were unchanged — the XMP metadata stream was constructed in memory and then discarded. This release rewrites the conversion core end-to-end:
- XMP metadata stream: a standards-compliant XMP packet is serialised and written as an indirect object, then wired into the document catalog as `/Metadata`. `pdfaid:part` and `pdfaid:conformance` are set per level: A1b → `1/B`, A2b → `2/B`, A2u → `2/U`, A3b → `3/B`.
- OutputIntents: a `GTS_PDFA1` output intent referencing sRGB is injected when none is present. Idempotent: a second call detects the existing intent and does not duplicate it.
- Language tag: `/Lang` is written to the catalog when the validator raises `MissingLanguage`.
- JavaScript removal: `/Names/JavaScript` entries are stripped when present.
- Source bytes patched: `doc.source_bytes` is updated in-place; the document is immediately re-parseable after conversion.
- Font embedding (`rendering` feature): `embed_font()` now resolves the 14 standard PDF Type1 PostScript names (Helvetica, Courier, Times-Roman, …) to the metrically-equivalent URW Base 35 open-source fonts shipped by default on Linux (`Nimbus Sans`, `Nimbus Mono PS`, `Nimbus Roman`). With `--features rendering` all B-level PDFs convert to 0 remaining errors, including `FontNotEmbedded`. Three bugs were fixed in the embedding pipeline:
  - `try_fix_error` dedup applied to error codes, so only the first `FontNotEmbedded` error was processed and the remaining fonts were skipped. Fixed to dedup per-error-code for non-font errors only.
  - `write_full_to_writer` wrote font objects from the original source instead of preferring staged `modified_objects`. Fixed to use the same priority order as the general object sweep.
  - `add_structure()` only added `/StructTreeRoot` but not `/MarkInfo /Marked true`; the validator requires both for PDF/A-*a conformance. Fixed.
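The per-level `pdfaid` values listed above form a small lookup table. The sketch below uses a local stand-in enum (the real `PdfALevel` may carry more variants) and emits the two identification properties in XMP element form; the helper names are hypothetical.

```rust
/// Local stand-in for the conformance levels named in the list above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PdfALevel {
    A1b,
    A2b,
    A2u,
    A3b,
}

/// (pdfaid:part, pdfaid:conformance) for each level.
fn pdfaid(level: PdfALevel) -> (u8, &'static str) {
    match level {
        PdfALevel::A1b => (1, "B"),
        PdfALevel::A2b => (2, "B"),
        PdfALevel::A2u => (2, "U"),
        PdfALevel::A3b => (3, "B"),
    }
}

/// The two identification properties as XMP elements (fragment only; a full
/// packet also needs the rdf:Description wrapper and namespace declaration).
fn pdfaid_fragment(level: PdfALevel) -> String {
    let (part, conformance) = pdfaid(level);
    format!(
        "<pdfaid:part>{part}</pdfaid:part><pdfaid:conformance>{conformance}</pdfaid:conformance>"
    )
}
```

Keeping the mapping in one `match` means adding a level later is a compile-time-checked change rather than a string edit scattered across the serializer.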
Test coverage — 17 new end-to-end roundtrip tests in tests/test_pdfa_roundtrip.rs verify every fixable scenario (validate → convert → validate). The showcase_pdfa_conversion CI example is rewritten to assert correctness and panics on any regression.
All seven bindings expose the updated function:
| Binding | API |
|---|---|
| Rust | convert_to_pdf_a(&mut doc, PdfALevel::A2b)? |
| Python | pdf_oxide.convert_to_pdf_a(doc, "A2b") |
| WASM | convertToPdfA(doc, "A2b") |
| C FFI | pdf_oxide_convert_to_pdf_a(doc, level, &out) |
| C# | Compliance.ConvertToPdfA(doc, PdfALevel.A2b) |
| Go | compliance.ConvertToPdfA(doc, compliance.PdfALevelA2b) |
| Node.js | compliance.convertToPdfA(doc, "A2b") |
LaTeX-generated PDFs using DC/EC fonts (Dcr10, Dcsl10, etc.) embed symbolic TrueType fonts with these characteristics:
- `/Flags` has the symbolic bit set (bit 3 = 4)
- No `/Encoding` dictionary
- Mac Roman format-0 cmap (platform 1, encoding 0): byte code → glyph ID
- No Windows Unicode cmap
pdf_oxide correctly routes these through the render_cid_direct() path, which resolves each content-stream byte to a glyph ID via the Mac Roman cmap. The bug was one line in the space-detection guard:
// Before — bytes without a Unicode mapping fell through to unwrap_or(' ')
let char_at_pos = char_str.chars().next().unwrap_or(' ');
if char_at_pos.is_whitespace() { /* skip draw */ }
Any byte whose Unicode mapping returned None — including ú (0xFA → GID 85), á (0xE1 → GID 83), and fi (0x85 → GID 75) — was treated as a space, so the is_whitespace() guard blocked glyph drawing entirely.
// After — '\0' is not whitespace; GID ≠ 0 glyphs are drawn correctly
let char_at_pos = char_str.chars().next().unwrap_or('\0');
Verified pixel-perfect against Poppler and MuPDF on the #307 reproduction PDF. Regression-tested across 69 PDFs (120 page comparisons) — zero regressions in rendering, plain text, Markdown, and HTML extraction.
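The effect of the fallback character can be checked in isolation with the standard library. The helper below is hypothetical and only mirrors the guard expression: an empty `char_str` stands for a byte with no Unicode mapping.

```rust
/// True when an `is_whitespace()` guard would skip drawing the glyph,
/// given the fallback character used for unmapped bytes.
fn skipped_as_space(char_str: &str, fallback: char) -> bool {
    char_str.chars().next().unwrap_or(fallback).is_whitespace()
}
```

With `' '` as the fallback every unmapped byte looks like a space and is skipped; with `'\0'` it is not, because U+0000 does not have the Unicode White_Space property. Real spaces are still skipped in both cases.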
Two issues surfaced while investigating #444 (Calibri ligature mis-mapping, which is an upstream macOS Quartz PDF producer bug with no fix possible on our side):
ICCBased color space warn spam — PDF producers that register ICCBased profiles under user-defined names (e.g. Cs1, Cs2) caused the text extractor to fire a WARN log on every sc/SC/scn/SCN operator that used such a name. The catch-all _ branch in the color-space handler did not know how to handle named references, so it logged and left the color unchanged. The fix: apply a component-count fallback in that branch (1 component → gray, 3 → RGB, 4 → CMYK) and demote the log to DEBUG. Affected PDFs with large amounts of colored text (like typical Office documents) emitted 96+ spurious warnings per page; now silent.
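A component-count fallback of the kind described can be written as a slice match. This is a sketch with a hypothetical `Color` type, not the extractor's actual color handling.

```rust
/// Hypothetical device color, chosen purely from the operand count.
#[derive(Debug, PartialEq)]
enum Color {
    Gray(f32),
    Rgb(f32, f32, f32),
    Cmyk(f32, f32, f32, f32),
}

/// Interpret the operands of a color operator when the color space name is
/// not recognised: 1 component is gray, 3 is RGB, 4 is CMYK.
fn color_from_components(components: &[f32]) -> Option<Color> {
    match components {
        [g] => Some(Color::Gray(*g)),
        [r, g, b] => Some(Color::Rgb(*r, *g, *b)),
        [c, m, y, k] => Some(Color::Cmyk(*c, *m, *y, *k)),
        // Any other count is left for the caller to log at debug level.
        _ => None,
    }
}
```

The guess is not colorimetrically exact for an ICC profile, but for text extraction it gives a usable color instead of leaving the previous one in place.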
Text span reading-order scrambling — reorder_rowspan_labels, a function that promotes vertically-centered table row labels to sort at the top of their row block, was incorrectly activating on single-column prose documents (resumes, reports). It identified spans at rightward X positions as a "sparse column" and promoted them to wrong Y coordinates, causing line-continuation text like "to assess technical needs and" or "-making." to appear before the earlier line they followed.
Root cause: the label-candidate filter did not exclude spans whose Y-band already appears in the dense column. Genuine rowspan labels are vertically between data rows, so their Y-band is absent from the dense column. Line-continuation spans share the Y-band of the main column text and must not be treated as labels. The fix adds that exclusion:
// Before — any sparse-column span in the data Y range
y > data_bot && y < data_top
// After — additionally exclude spans that align with a dense-column row
y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))
The original rowspan-label behavior for actual table layouts (CJK lab reports, mixed-column tables) is preserved; the existing test confirms that genuine between-row labels are still promoted correctly.
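To make the corrected predicate concrete, here is a self-contained version of it. The band quantisation (`BAND_HEIGHT` and this `band_of`) is an assumption made for illustration; the library's actual banding rule is not shown in these notes.

```rust
use std::collections::HashSet;

/// Assumed band height used to quantise Y coordinates (illustrative value).
const BAND_HEIGHT: f64 = 2.0;

/// Map a Y coordinate to a band index so nearby baselines compare equal.
fn band_of(y: f64) -> i64 {
    (y / BAND_HEIGHT).round() as i64
}

/// A sparse-column span is a rowspan-label candidate only if it lies inside
/// the data Y range and does not share a band with a dense-column row.
fn is_label_candidate(y: f64, data_bot: f64, data_top: f64, dense_bands: &HashSet<i64>) -> bool {
    y > data_bot && y < data_top && !dense_bands.contains(&band_of(y))
}
```

A span sitting between two data rows passes the check, while a line-continuation span that shares a baseline with the main column is rejected.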
When the same image data was passed to page.image() or from_bytes() on multiple pages, pdf_oxide encoded it as independent XObjects — each carrying the full compressed pixel data. A 760 KB PNG embedded twice contributed 1.52 MB instead of 760 KB; the #443 reproduction produced 2.32 MB from images totalling under 1.6 MB.
The fix hashes the normalised stream bytes after calling image_content_to_xobject_stream(). Hashing before normalisation failed across API paths: an image supplied via page.image() (which accepts raw file bytes and decodes them internally) and the same image supplied via ImageContent::from_bytes() produced different pre-encoding byte strings but identical post-normalisation compressed streams. Hashing after normalisation ensures the key is stable regardless of which API path the caller used. The key is (hash, byte_length) over the compressed pixel data; if a matching entry is already registered in the document's XObject map, the existing reference is reused and no new stream is written.
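A registry keyed on `(hash, byte_length)` of the normalised stream can be sketched as follows. The type and method names are hypothetical, and the standard library's `DefaultHasher` stands in for whatever hash the library uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical registry of image streams, deduplicated by (hash, length).
struct XObjectRegistry {
    by_key: HashMap<(u64, usize), u32>,
    streams: Vec<Vec<u8>>,
}

impl XObjectRegistry {
    fn new() -> Self {
        Self { by_key: HashMap::new(), streams: Vec::new() }
    }

    /// Register an already-normalised stream and return its object id.
    /// An identical stream registered earlier yields the earlier id and
    /// no second copy is stored.
    fn register(&mut self, stream: Vec<u8>) -> u32 {
        let mut hasher = DefaultHasher::new();
        stream.hash(&mut hasher);
        let key = (hasher.finish(), stream.len());

        if let Some(&id) = self.by_key.get(&key) {
            return id;
        }
        let id = self.streams.len() as u32;
        self.streams.push(stream);
        self.by_key.insert(key, id);
        id
    }
}
```

A 64-bit hash plus length makes accidental collisions very unlikely but not impossible; a stricter variant would compare the bytes on a key match before reusing the reference.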
PDFs with PNG images that have an alpha channel displayed a diagonal stripe across the image when opened in Acrobat, Preview, and most other viewers.
Root cause: compress_image_data() prepends a PNG None-filter byte (0x00) before every scanline before Flate-compressing the pixel data. This is required by FlateDecode with DecodeParms/Predictor=15. The main image XObject carried the correct DecodeParms dictionary — but build_soft_mask_dict(), which builds the /SMask XObject for the alpha channel, emitted no DecodeParms at all. Viewers therefore decompressed the raw Flate stream, then treated the leading 0x00 filter byte of each row as an alpha pixel, shifting every row one byte to the right. The cumulative horizontal offset over hundreds of rows appears as a diagonal stripe.
Fixed by adding the same DecodeParms dictionary to the soft-mask stream:
DecodeParms { Predictor=15, Colors=1, BitsPerComponent=8, Columns=<width> }
Reported by @RubberDuckShobe in #450. Any PDF built with page.image() or ImageContent::from_bytes() where the source PNG has an alpha channel was affected; the fix is purely in the soft-mask stream header and does not change pixel data.
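The one-byte-per-row drift is easy to reproduce on a tiny buffer. The helpers below are hypothetical and handle only the PNG None filter (type 0); a real Predictor 15 decoder must also undo the other PNG filter types.

```rust
/// Prepend the PNG "None" filter byte (0x00) to every scanline.
fn add_none_filter(pixels: &[u8], row_bytes: usize) -> Vec<u8> {
    assert!(row_bytes > 0, "row_bytes must be non-zero");
    let mut out = Vec::with_capacity(pixels.len() + pixels.len() / row_bytes);
    for row in pixels.chunks(row_bytes) {
        out.push(0x00);
        out.extend_from_slice(row);
    }
    out
}

/// What a viewer does when DecodeParms declares the predictor:
/// drop the leading filter byte of each (row_bytes + 1)-byte row.
fn strip_filter_bytes(data: &[u8], row_bytes: usize) -> Vec<u8> {
    data.chunks(row_bytes + 1)
        .flat_map(|row| row[1..].iter().copied())
        .collect()
}

/// What a viewer does without DecodeParms: read plain rows of `row_bytes`,
/// so each filter byte is mistaken for a pixel.
fn misread_without_predictor(data: &[u8], row_bytes: usize) -> Vec<Vec<u8>> {
    assert!(row_bytes > 0, "row_bytes must be non-zero");
    data.chunks(row_bytes).map(|row| row.to_vec()).collect()
}
```

For a 3 × 3 alpha plane, the misread rows come out as `[0,1,2]`, `[3,0,4]`, `[5,6,0]`: the stray zero moves one column per row, which is the diagonal stripe at full scale.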
pdf_barcode_get_svg was a stub returning ERR_UNSUPPORTED. Two root causes were blocking a real implementation:
- Format sentinel collision: `pdf_generate_qr_code` stored `FfiBarcodeImage.format = 0`, the same value as `pdf_generate_barcode` with `format = 0` (Code128). The `get_svg` function had no way to distinguish QR from Code128. Fixed: QR codes now use the internal sentinel value `100` (outside the 0–7 range of 1D barcode types); the public `pdf_barcode_get_format` return value for QR codes changes from `0` to `100` accordingly.
- Missing SVG rendering path: `barcoders` 2.0 ships `barcoders::generators::svg::SVG` (enabled by default via `features = ["svg"]`), so no new dependency was required. For 1D barcodes, the encoding step is now factored into a private `encode_1d` helper shared by both `generate_1d` (PNG) and the new `generate_1d_svg` (SVG). For QR codes, `generate_qr_svg` rebuilds the code matrix from `qrcode::QrCode::to_colors()` and emits a compact inline SVG with `<rect>` elements, with no raster stage.
pdf_barcode_get_svg now returns a valid SVG string for all supported barcode types (Code128, Code39, EAN-13, EAN-8, UPC-A, ITF, Code93, Codabar, QR) when the barcodes feature is enabled.
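Emitting one `<rect>` per dark module is the core of a raster-free QR renderer. The sketch below takes a plain boolean matrix (true = dark) rather than `qrcode` crate types, omits the quiet zone, and uses a hypothetical function name; it is not the library's `generate_qr_svg`.

```rust
/// Render a module matrix as inline SVG, one <rect> per dark module.
/// `module_px` is the edge length of a module in SVG user units.
fn matrix_to_svg(modules: &[Vec<bool>], module_px: u32) -> String {
    let height = modules.len() as u32 * module_px;
    let width = modules.first().map_or(0, Vec::len) as u32 * module_px;

    let mut svg = format!(
        "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"{width}\" height=\"{height}\" viewBox=\"0 0 {width} {height}\">"
    );
    for (row_idx, row) in modules.iter().enumerate() {
        for (col_idx, &dark) in row.iter().enumerate() {
            if dark {
                let x = col_idx as u32 * module_px;
                let y = row_idx as u32 * module_px;
                svg.push_str(&format!(
                    "<rect x=\"{x}\" y=\"{y}\" width=\"{module_px}\" height=\"{module_px}\"/>"
                ));
            }
        }
    }
    svg.push_str("</svg>");
    svg
}
```

Merging horizontally adjacent dark modules into a single wider `<rect>` would shrink the output further; that optimisation is left out here for clarity.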
A previous partial fix (commit 9dd94c0) introduced output_beside() / output_dir_beside() helpers and converted five commands (watermark, compress, flatten, rotate, split). Four binary-output commands were missed and continued resolving the default output path relative to the current working directory:
- `crop`: now writes `<stem>_cropped.pdf` beside the input file.
- `decrypt`: now writes `<stem>_decrypted.pdf` beside the input file.
- `delete`: now writes `<stem>_deleted.pdf` beside the input file.
- `reorder`: now writes `<stem>_reordered.pdf` beside the input file.
merge previously silently defaulted to writing merged.pdf in the directory of the first input file when -o/--output was omitted. This silent fallback was the riskiest behavior in the CLI: callers who expected output beside a specific file got a surprise in a potentially unrelated directory. merge now requires -o/--output and exits with a clear error message if it is missing.
No library code was changed — all five files are in pdf_oxide_cli.
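Deriving a default output path beside the input comes down to swapping the file name while keeping the parent directory. The helper below is a hypothetical sketch of that rule using `std::path`; it is not the CLI's actual `output_beside()` signature.

```rust
use std::path::{Path, PathBuf};

/// Build `<dir>/<stem>_<suffix>.pdf` next to `input`, independent of the
/// current working directory. Hypothetical helper.
fn beside_input(input: &Path, suffix: &str) -> PathBuf {
    let stem = input
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("output");
    // `with_file_name` keeps the parent directory and replaces the last component.
    input.with_file_name(format!("{stem}_{suffix}.pdf"))
}
```

For example, `beside_input(Path::new("/docs/report.pdf"), "cropped")` yields `/docs/report_cropped.pdf`, wherever the command was launched from.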