pdf_oxide - 米舟开源

yfedoseev/pdf_oxide

4 days ago

pdf_oxide

yfedoseev

v0.3.77 | Search-index control lands in every first-party binding: `prepare_search()`/`clear_search_index()` (added to the Rust core in 0.3.76 alongside the new per-page search-index cache) can now be called from Python, JavaScript/WASM, Java/Kotlin/

Added

prepare_search()/clear_search_index() exposed across every language binding — callers can now build the search-index cache at a controlled point instead of paying for it on the first search() call, and free it before heavy extraction on the same document object, from any binding, not just Rust. Also fills in a pre-existing gap in the PHP binding, which had no public search() method at all (#952).
include_artifacts option on extract_text()/to_markdown()/to_markdown_all()/to_plain_text()/to_plain_text_all() (Python; ConversionOptions.include_artifacts in Rust) — matches the include_artifacts parameter extract_words()/extract_text_lines() already had, default true.

Fixed

extract_text(), to_markdown()/to_markdown_all(), and to_plain_text()/to_plain_text_all() unconditionally dropped content tagged /Artifact (ISO 32000-1:2008 §14.8.2.2.1: running headers/footers, page numbers, watermarks), with no override. This differed from extract_words()/extract_text_lines(), which already defaulted to including artifact-tagged content for backward compatibility. On documents that tag a repeated footer carrying real information (e.g. a section identifier on every page of an engineering spec) as an artifact, that content was silently dropped from the vast majority of pages. All five methods now default to including artifact-tagged content, with include_artifacts=False available for the spec-correct exclusion behavior (#954).

Contributors

Issues reported by:

@ankursri494 — #952 (prepare_search()/clear_search_index() missing from every binding but Rust)
@tealtonyplanhub — #954 (extract_text() silently drops /Artifact-tagged content with no override)

Thank you!

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform	Architecture	Archive
Linux	x86_64 (glibc)	`pdf_oxide-linux-x86_64-*.tar.gz`
Linux	x86_64 (musl)	`pdf_oxide-linux-x86_64-musl-*.tar.gz`
Linux	ARM64	`pdf_oxide-linux-aarch64-*.tar.gz`
macOS	x86_64 (Intel)	`pdf_oxide-macos-x86_64-*.tar.gz`
macOS	ARM64 (Apple Silicon)	`pdf_oxide-macos-aarch64-*.tar.gz`
Windows	x86_64	`pdf_oxide-windows-x86_64-*.zip`

Changelog

See CHANGELOG.md for full details.

5 days ago

pdf_oxide

yfedoseev

v0.3.76 | Redaction/editor persistence and rendering-accuracy release: DOM edits and `add_text()` overlays on source-loaded pages now survive `save()` without leaving dangling `/Contents` references; multi-input (DeviceN) Type 0 tint transforms, non-

Added

PDF/A conversion exposed in the Java binding, with idiomatic Kotlin/Scala/Clojure facades (PdfAConverter, ConversionResult/ConversionAction/ActionType/ConversionError in fyi.oxide.pdf.compliance) — completes PDF/A conversion coverage across every first-party binding (#948).

Fixed

Editor / redaction persistence

PdfDocument.save() silently discarded DOM edits (set_text(), remove_element(), erase_header()/erase_footer()/erase_artifacts()) made to a page loaded from an existing PDF — the overlay-merge path that reconciles DOM edits back into a page's /Contents on save had a defect that dropped the edit entirely; edits made via the DOM API now persist through save()/save_page() (#940).
A page whose original /Contents was already a multi-entry array (ISO 32000-1 §7.7.3.3) kept a dangling reference to its other original streams after destructive redaction — the merge step that replaces /Contents with the redacted stream assumed a single original stream at array position 0; it now replaces every original content reference, keeping only genuine overlay/addition streams. The same defect class was independently hit through two different triggers: set_text()/remove_element() on a page whose /Contents was already an array, and add_text() combined with apply_redactions_destructive() on the same page (#940, #941, #799).
add_text() overlay font was registered under the raw font name while the content stream emitted the map_font_name()-transformed name, leaving a dangling /Tf resource for bold/italic overlays, generic family names (Arial, sans-serif), and Symbol/ZapfDingbats. Registration now keys off the exact name the Tf operator emits, and /Resources and /Resources/Font are resolved through indirect references (common for pages loaded from existing documents) (#941).
save_page() discarded overlay text staged by a prior save_page() call on the same page — overlay_additions now accumulates across calls instead of being overwritten (#941).

Rendering

CCITT Group 3/4 /ImageMask XObjects were treated as raw stencil rows instead of decoded — compressed fax data is now actually decompressed; DecodeParms/filter-chain handling, K < 0 Group 4 semantics, and allocation on oversized/malformed input are also corrected (#935, #939).
Separation/DeviceN colour spaces with a genuinely multi-channel (N>1) Type 0 (sampled) tint transform rendered black or dropped shapes entirely — the sampled-function evaluator only ever handled a single input dimension, forwarding just the first scn operand and silently falling back to gray = 1 − components[0]; it now performs full N-dimensional multilinear interpolation across the sample grid (ISO 32000-1 §7.10.2's general algorithm, of which the prior 1-D case is the N=1 special case), gated by a MAX_SAMPLED_FUNCTION_DIMS = 8 bound against a pathological /Size array (#849, #859).
Non-CCITT 1-bit /DeviceGray images (uncompressed or FlateDecode) were force-fed through the CCITT decompressor and silently dropped when decompression failed — the CCITT path is now gated on the XObject's filter actually being CCITTFaxDecode; the non-CCITT case is unpacked directly, folding /Decode [1 0] inversion the same way the CCITT path already does (#860).
Inline images (BI…ID…EI) were parsed and classified as a paint operator but never actually painted — the renderer's operator dispatch had no match arm for Operator::InlineImage; it now expands the abbreviated dictionary keys and routes through the same render_image/render_image_mask path used for Do-invoked image XObjects (#860).
Spatial table-cell ownership misattributed text near cell boundaries — stale per-glyph offsets outside a text span's bounding box are now rejected for singleton spans, and exact half-open internal grid intervals keep superscripts and boundary-adjacent text in their correct geometric cell; outer-edge tolerance near the table boundary is unchanged (#937, #938).
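The N-dimensional multilinear interpolation mentioned in the Separation/DeviceN fix above can be illustrated with a small standalone sketch. It is not pdf_oxide's evaluator: the function name `interpolate` is hypothetical, inputs are assumed to be already normalised to [0, 1] (the /Domain, /Encode and /Decode mappings are omitted), and only one output channel is computed. The sample layout (first input varying fastest) follows the sampled-function convention in ISO 32000-1 §7.10.2, and the dimension bound reuses the value named in the note.

```rust
/// Bound on the number of inputs; the value 8 is the one named in the release note.
const MAX_SAMPLED_FUNCTION_DIMS: usize = 8;

/// Multilinear interpolation over an N-dimensional sample grid.
/// `sizes[i]` is the number of samples along input i (each >= 1),
/// `samples` is laid out with input 0 varying fastest, and `x[i]` is in [0, 1].
fn interpolate(sizes: &[usize], samples: &[f64], x: &[f64]) -> f64 {
    let n = sizes.len();
    assert!(n <= MAX_SAMPLED_FUNCTION_DIMS, "too many input dimensions");
    assert!(sizes.iter().all(|&s| s >= 1), "every dimension needs a sample");
    assert_eq!(x.len(), n);
    assert_eq!(samples.len(), sizes.iter().product::<usize>());

    // Per-dimension lower grid index and fractional position inside the cell.
    let mut lo = vec![0usize; n];
    let mut frac = vec![0.0f64; n];
    for i in 0..n {
        let pos = x[i].clamp(0.0, 1.0) * (sizes[i] - 1) as f64;
        lo[i] = (pos.floor() as usize).min(sizes[i].saturating_sub(2));
        frac[i] = pos - lo[i] as f64;
    }

    // Sum the 2^n corners of the enclosing cell, each weighted by the product
    // of per-dimension weights (frac for the upper corner, 1 - frac for the lower).
    let mut total = 0.0;
    for corner in 0..(1usize << n) {
        let mut weight = 1.0;
        let mut index = 0usize;
        let mut stride = 1usize;
        for i in 0..n {
            let upper = ((corner >> i) & 1) == 1;
            weight *= if upper { frac[i] } else { 1.0 - frac[i] };
            // Clamp so a single-sample dimension never indexes out of range.
            let idx = (lo[i] + upper as usize).min(sizes[i] - 1);
            index += idx * stride;
            stride *= sizes[i];
        }
        total += weight * samples[index];
    }
    total
}
```

With `sizes = [n]` the loop visits two corners and reduces to ordinary 1-D linear interpolation, which is the N=1 special case the note refers to.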

Performance

Repeated search()/search_page() calls on the same document re-extracted and re-postprocessed every page's full spans on every call — a multi-pattern scan over one document cost O(searches × full extraction) with no benefit from repetition, since the existing span cache is bounded to 8 entries. An unbounded, lighter-weight per-page search index (page text + span bounding boxes only, no font/glyph data) is now built lazily on first use and reused across calls; prepare_search()/clear_search_index() give callers control over eager population and memory reclamation. Measured: repeat search() calls on a 60-page document dropped from ~30ms to ~20-50µs per call after the first (#936).
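The caching pattern described above can be sketched in a few lines. This is an illustration only, not the library's code: the `Doc` type, its fields, and the `extractions` counter are hypothetical, and the method names merely echo the ones in the note. Page extraction is stood in for by a lowercase copy of the page text.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a document. `extractions` counts how often the
/// expensive per-page extraction step runs.
struct Doc {
    pages: Vec<String>,
    index: HashMap<usize, String>, // page number -> cached searchable text
    extractions: usize,
}

impl Doc {
    fn new(pages: Vec<String>) -> Self {
        Doc { pages, index: HashMap::new(), extractions: 0 }
    }

    /// Build the cache entry for one page if it is missing.
    fn ensure_indexed(&mut self, page: usize) {
        if !self.index.contains_key(&page) {
            self.extractions += 1; // stands in for full span extraction
            self.index.insert(page, self.pages[page].to_lowercase());
        }
    }

    /// Eagerly index every page at a point the caller chooses.
    fn prepare_search(&mut self) {
        for page in 0..self.pages.len() {
            self.ensure_indexed(page);
        }
    }

    /// Drop the index to reclaim memory before other heavy work.
    fn clear_search_index(&mut self) {
        self.index.clear();
    }

    /// Return the page numbers whose text contains `needle` (case-insensitive).
    /// Pages are indexed lazily on first use and reused afterwards.
    fn search(&mut self, needle: &str) -> Vec<usize> {
        let needle = needle.to_lowercase();
        let mut hits = Vec::new();
        for page in 0..self.pages.len() {
            self.ensure_indexed(page);
            if self.index[&page].contains(&needle) {
                hits.push(page);
            }
        }
        hits
    }
}
```

In this sketch the first `search` pays for indexing every page, later searches only scan cached text, and `clear_search_index` followed by `prepare_search` moves that cost to a moment the caller controls.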

Changed / Dependencies

Test/CI stability: two ocr-feature-gated CCITT diagnostic test files still called fax 0.2's u16 decode_g4 signature after the crate's 0.3 API bump (u32), breaking cargo clippy --all-targets --features rendering,barcodes,signatures,ocr — exactly the combo CONTRIBUTING.md's recommended pre-commit hook uses (#945).

Contributors

Rendering fixes from @Goldziher — CCITT /ImageMask decoding (#935, #939) and spatial table-cell ownership (#937, #938). Redaction/overlay persistence from @thomnico — the add_text()/destructive-redaction interaction (#941, #799, superseding an earlier iteration). Thank you!

Issues reported by:

@lightedlogic — #940 (save() doesn't persist DOM edits made via set_text()/remove_element())
@ankursri494 — #936 (repeated search() calls on the same document don't get faster)
@ultrasaurus — #945 (recommended pre-commit hook fails on main)
@bfchiheb — #947 (PDF/A conversion missing from the Java binding)

8 days ago

pdf_oxide

yfedoseev

v0.3.75 | Rendering-accuracy and extraction-fidelity release. DeviceCMYK now renders through measured process inks; page rotation (including `/Rotate 270` and negative values) and Separation/DeviceN tint transforms are honoured; inline images and CMY

Added

ReadingOrder::Structure — order a Tagged PDF by a pre-order traversal of its /StructTreeRoot (ISO 32000-1 §14.8.2.3), fixing table and complex-layout order where a geometric XY-cut guesses; falls back to ColumnAware when the structure tree is absent or not trustworthy (#877).
Mapping provenance — extract_spans reports which ISO 32000-1 §9.10.2 tier (ToUnicode / encoding / heuristic) produced each span's text, surfaced across all bindings (#893).
extract_spans_filtered_with_reading_order — reading-order extraction combined with optional-content (OCG) and ink-coverage filtering (#883).
Type 0 (sampled) and Type 3 (stitching) tint-transform evaluation in the renderer for Separation/DeviceN colours (#849).
CJK vertical-writing running header/footer detection — recognizes folios in the left/right side bands on tategaki pages (#889).
resolve_named_destination is now public (#881); a remove_artifacts markdown-conversion example was added (#845).

Fixed

Rendering

DeviceCMYK rendered via the naive 1−(C+K) additive clamp — now converts through measured SWOP-style process inks across every composite / vector / text / image path, so 0 0 0 1 k black renders #231F20 and process cyan renders #00ADEF instead of over-saturated additive values (#861).
/Rotate 270 rendered MIRRORED rather than rotated; negative /Rotate values were mishandled (% instead of rem_euclid); real /Rotate and indirect Separation alternates are now resolved correctly (#862, #848, #854).
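The `%` versus `rem_euclid` distinction in the rotation fix is easy to see in isolation. The two helper names below are hypothetical; only the standard-library operators are real.

```rust
/// Normalise a /Rotate value (possibly negative) into the range 0..360.
/// `rem_euclid` always returns a non-negative remainder for a positive modulus.
fn normalize_rotation(rotate: i32) -> i32 {
    rotate.rem_euclid(360)
}

/// The truncating remainder keeps the sign of the dividend, so a negative
/// /Rotate stays negative and misses any `match` on 0 / 90 / 180 / 270.
fn truncating_rotation(rotate: i32) -> i32 {
    rotate % 360
}
```

For example, a page with `/Rotate -90` normalises to 270 with `rem_euclid`, while the truncating remainder leaves it at -90.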

Text extraction

Form-XObject (Do) text went missing when stray operands preceded the name — a dropped/malformed cm left dangling numeric operands, so Do read operands[0] (a stray number) and resolved to an empty name; it now reads the operand immediately preceding the operator per ISO 32000-1 §7.8.2 (#914).
Pages were lost when /Count-based page counting returned 0 — now recovered by walking the page tree (#909).
Inline images (BI/ID/EI) were parsed but never decoded — /Subtype is now supplied and the Table 92 abbreviated keys/values expanded (#863).
Unique /PlacedPDF bodies are kept instead of suppressed (#896); spans entirely outside the MediaBox are dropped from extract_spans_with_reading_order (#894); the top-level fill colour set before BT is preserved (scn/cs/rg no longer dropped) (#857).
Running-header/footer detection now requires position-consistency before removing repeated text as chrome, and recognizes non-Latin folio digits (#888, #887).
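The Form-XObject fix at the top of this list comes down to which operand the `Do` handler reads. A minimal sketch, assuming a hypothetical `Operand` enum and helper names (none of these are pdf_oxide types):

```rust
/// Hypothetical operand type for a content-stream interpreter.
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Number(f64),
    Name(String),
}

/// Old behaviour as described in the note: assume the name is the first operand.
/// Stray numbers left on the stack make this miss the name.
fn xobject_name_first(operands: &[Operand]) -> Option<&str> {
    match operands.first() {
        Some(Operand::Name(n)) => Some(n.as_str()),
        _ => None,
    }
}

/// Fixed behaviour: read the operand immediately preceding the operator.
fn xobject_name_last(operands: &[Operand]) -> Option<&str> {
    match operands.last() {
        Some(Operand::Name(n)) => Some(n.as_str()),
        _ => None,
    }
}
```

With operands `1 0 /Fm0` left before `Do`, the first-operand reader finds a number and resolves no name, while the last-operand reader still finds `Fm0`.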

Images

/Decode [1 0] is honoured for 1-bit CCITT/DeviceGray images (#856); the jpeg-decoder Adobe inversion is undone for CMYK JPEGs (#855).
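As a rough illustration of what honouring `/Decode [1 0]` means for a 1-bit DeviceGray image, the sketch below unpacks one row of packed samples into 8-bit gray. The function name and the `invert` flag are hypothetical; the sketch assumes samples are packed high-order bit first and that the default `/Decode [0 1]` maps sample 0 to black and 1 to white.

```rust
/// Unpack one row of 1-bit samples (most significant bit first) into 8-bit gray.
/// `invert` models /Decode [1 0], which swaps black and white.
fn unpack_1bit_row(packed: &[u8], width: usize, invert: bool) -> Vec<u8> {
    assert!(packed.len() * 8 >= width, "row is shorter than the stated width");
    (0..width)
        .map(|x| {
            let bit = (packed[x / 8] >> (7 - (x % 8))) & 1;
            let sample = if invert { 1 - bit } else { bit };
            sample * 255 // 0 -> black, 1 -> white
        })
        .collect()
}
```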

Recovery / parsing

A truncated file that lost its own Catalog is recovered by rebuilding one from the surviving pages (#890); a file padded after %%EOF is no longer rejected outright as 0 pages (#875).

Fonts

The referenced /Encoding /Differences is folded into the font identity hash, and subset choice in extract_embedded_fonts is made deterministic (#878, #853).

Writer

Coloured text on a registered embedded font rendered black: FluentPageBuilder::inline_color (and any TextStyle.color) was silently dropped for embedded fonts. PdfWriter::add_element routed embedded-font text through the deliberately colour-agnostic add_embedded_text (the HTML painter sets and resets the fill colour around its own calls) without ever emitting the element's fill colour, so no rg operator reached the content stream and the glyphs painted in whatever fill colour was last set (default black). The base-14 path (add_text_content) always emits rg from style.color; the embedded path now matches it by emitting fill_color before the glyph run. No restore is needed, because every text element sets its own colour, mirroring the base-14 branch's "always set explicitly" contract.

Performance

extract_words / extract_text spent most of their time re-deriving per-glyph facts that cannot change (#882). Text extraction asked each font for its weight and slant once per glyph, inside the show-text loop. Both answers are name-derived: the weight lowercases the base font name (allocating) and runs up to a dozen substring searches, and the slant lowercases it again for two more. A 13,234-page document therefore repeated ~14 substring scans and 2 allocations 48.7M times for a value fixed at font-load. A sampling profile put str::contains and friends at ~38% of all samples. The Standard-14 width lookup had the same shape, re-stripping the subset prefix and re-scanning the 15-name table per glyph purely to choose a width table. Both are now resolved once per font and memoized, mirroring the existing byte-width-table memo.

Alongside:
- postprocess_spans rescanned every glyph on the page per span to find its baseline (O(spans × chars)); it now uses a bracketed y-sorted index.
- The page's characters were re-parsed for span post-processing and re-copied on every access; they are now cached and shared.
- The word and line paths materialized every glyph twice.
- Article threads, a document-wide parse that walks the whole page tree, were re-parsed per page.
- The glyph dedup rebuilt the whole array to drop a handful.
- The word-merge loop re-derived RTL-ness from an accumulating buffer, costing O(k²) characters per merge chain (the exact blow-up the backtrack guard above it exists to prevent).

Measured on the reporter's PDFs: extract_words 156.9s → ~121s and extract_spans 88.7s → ~54s on a 13,234-page document; extract_text 6.03s → 3.94s on a 2,124-page one. Output is byte-identical, verified across a 419-PDF corpus for extract_spans/extract_chars/extract_words/extract_text_lines (including geometry and per-glyph x-offsets) and for text/markdown/HTML.
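The "resolve once per font and memoize" idea can be shown with a small cache. Everything here is a hypothetical sketch: `FontTraits`, `derive_traits`, the substring list, and `TraitCache` are illustrative names and rules, not the library's. The `derivations` counter exists only to make the saving visible.

```rust
use std::collections::HashMap;

/// Name-derived facts that cannot change after the font is loaded.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FontTraits {
    bold: bool,
    italic: bool,
}

/// Hypothetical name-based classifier: lowercases the name (one allocation)
/// and runs several substring scans, which is wasteful to repeat per glyph.
fn derive_traits(base_font: &str) -> FontTraits {
    let lower = base_font.to_lowercase();
    FontTraits {
        bold: lower.contains("bold") || lower.contains("black") || lower.contains("heavy"),
        italic: lower.contains("italic") || lower.contains("oblique"),
    }
}

/// Memo keyed by base font name; `derivations` counts cache misses.
struct TraitCache {
    by_font: HashMap<String, FontTraits>,
    derivations: usize,
}

impl TraitCache {
    fn new() -> Self {
        TraitCache { by_font: HashMap::new(), derivations: 0 }
    }

    /// Return the cached traits, deriving them only on the first request.
    fn traits(&mut self, base_font: &str) -> FontTraits {
        if let Some(t) = self.by_font.get(base_font) {
            return *t;
        }
        self.derivations += 1;
        let t = derive_traits(base_font);
        self.by_font.insert(base_font.to_string(), t);
        t
    }
}
```

Calling `traits` once per glyph then costs a hash lookup instead of an allocation plus a series of substring scans.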

Changed / Dependencies

Text shaping migrated from rustybuzz to harfrust (#899); fax 0.2 → 0.3 (#873); the ttf-parser migration decision for RUSTSEC-2026-0192 is documented (#900).
office_oxide bumped to 0.1.8 (#904, #932). A combined dependency roundup (crates + CI actions + Go), plus routine crate and CI-action bumps (#931, #907, #898, #872, #869, #870, #871, #864, #865, #866, #867, #868, #874, #891).
Test/CI stability: the flaky structured_warnings round-trip test is fixed (#912); misc test guards and binding-format fixes (#846, #897, #880).

Contributors

The majority of this release was contributed by @ajbufort — rendering (/Rotate handling and Separation/DeviceN tint transforms #848, #849, #854, #862; DeviceCMYK process inks #861), image decode (#855, #856), text extraction (#857, #863, #894, #896, #909), fonts (#853, #878), recovery/parsing (#875, #890), and reading-order / span filtering (#877, #881, #883). Additional fixes from @norbusan (#911) and @ultrasaurus (#845, #846). Thank you!

Issues reported by:

@ankursri494 — #882 (extract_words/extract_text far slower than extract_spans on large PDFs)
@tobocop2 — #876 (signal to callers when a page's text cannot be extracted)
@ultrasaurus — #879 (remove_footers removed real content on IRS forms)
@norbusan — #913 (text inside a Form XObject missing from extraction)

18 days ago

pdf_oxide

yfedoseev

v0.3.74 | Scientific and print-era PDF extraction fixes — per-glyph advance now folds `TJ` kerning per the spec so it matches the renderer (poppler/PDFium/pymupdf), fixing word spacing on justified and kerned text; displayed-math tokens no longer fus

Fixed

Displayed-math tokens fused into one word — dx/dt = extracted as =dt, and whole equations could collapse into a single token (#830, #836) — the word-gap merge's backtrack check (gap ≤ font_size × 0.15) had no lower bound, so a run backtracking far behind the previous word's origin (a fraction bar returning to typeset the denominator, a relation sign closing an equation) still satisfied "a large negative gap ≤ a small positive threshold" and merged; because the merge is incremental, a chain of such backtracks could collapse an entire displayed equation — and in the worst corpus case, the start of the following sentence — into one word. The fix landed in two stages: the composed-text emitter (extract_text/to_markdown/to_html) was guarded first, then the identical guard (real baseline offset, origin-or-left backtrack, multi-em overlap, gated off for RTL) was applied to extract_words_inner's post-clustering merge so word geometry is correct too. Because table detection consumes word geometry, the detector was hardened in the same change so the corrected words no longer fabricate phantom tables out of ordinary wrapped captions.
Subscript index numbers extracted as decimals — P₁,₀ became P1.0 (#816) — the decimal-merge rule joins two adjacent pure-digit runs with a . (a heuristic for split-box dollar amounts where the whole part and cents print in separate fixed-width boxes, 123456 + 72 → 123456.72). Its upper gap bound was too permissive: real split-box amounts sit ~0.8–1.0× the font size apart, but subscript index digits are a smaller font spaced ~1.5–1.7× apart, so the old 2.0× ceiling let the rule invent decimals the document never contained. The ceiling is tightened to 1.3× the font size, separating genuine integer/cents boxes from widely spaced subscripts.
Born-digital pages were classified as Scanned and routed to pages_needing_ocr — 13.7% of a 6,269-page corpus (#840) — OCR-ing a page that already carries good native text replaces it with worse output, so a wrong Scanned verdict is actively harmful. The dominant cause: gather_page_signals and a second text_quality_gate call site both built their word-fragmentation input by joining raw content-stream spans with a forced space after each one. Math typesetting draws each atom — a parenthesis, an operator, a subscript — as its own span, so (∞) became three one-character "words"; on a dense LaTeX page this inflated the fragmented-word ratio and collapsed average word length until the quality gate mistook it for a scan and overrode an otherwise-correct TextLayer verdict. Both call sites now build their word list from extract_words — the same glyph/span clustering extract_text relies on, including the new math-backtrack guard — instead of one token per span.
extract_words split single string literals into fragments (module → m|odu|le) via phantom glyph gaps (#811) — TJ-offset space spans (ISO 32000-1 §9.4.4) were created with one char but an empty char_widths, and the span merge kept the widths in lockstep by tail-append + tail-resize. Whenever a width-less span contributed chars anywhere but the tail, every subsequent width shifted one slot, so per-glyph decomposition paired each glyph's accurate x-origin with its neighbor's nominal width — phantom ~0.3 em intra-word gaps that the word-gap clusterer split on. Space spans now carry their advance from creation, and the merge normalizes every contribution at its own position — inserted separators get the real geometric gap they stand in for.
PathContent geometry ignored stroke_width, so stroke-width-encoded table rules extracted as 1×0 pt specks (#812) — print-era generators draw a table's vertical rule as a ~1 pt segment stroked as wide as the table is tall (430 w … 0 0 m .998 0 l S). The geometric bbox of that path bears no resemblance to the rendered bar, so is_table_primitive() and the line-based table detector missed the whole grid and its text extracted column-major. stroke_width is now CTM-scaled at extraction (§8.4.3.2 — the line width transforms like all other geometry), the new PathContent::rendered_bbox() exposes the stroke-inflated extents (exact perpendicular + cap inflation for straight segments, conservative half-width outset otherwise), and line classification, clustering, and the per-row/column separator checks all judge rendered extents. The geometric bbox is unchanged for every other consumer. Also exposed as pdf_oxide_path_get_rendered_bbox (C FFI) and rendered_bbox in the Python/WASM path dicts, and threaded through the go, ruby, php, swift, csharp, dart, elixir, zig, julia, r, objc, cpp, and node bindings.
90°-rotated pages extracted in portrait order and words carried no rotation metadata (#813) — landscape tables typeset on portrait pages (text-matrix rotation, no /Rotate key) came out as interleaved word salad: the reading-order pipeline re-sorted spans with portrait-frame comparators, the plain-text assembler grouped lines in the portrait frame, and rotation_degrees was dropped at both TextSpan::to_chars and word assembly. A dominant-rotation vote (half-or-more of the page's non-whitespace spans sharing one quadrant rotation, mirroring the tategaki vote) now orders the whole page in its rotated reading frame — coordinates are restored afterwards, so callers keep true page space — and minority rotated runs (margin stamps, figure labels) are ordered upright per rotation group and appended after the horizontal flow, matching the span path's existing firewall. Runs sharing a ±90° rotation no longer span-merge across rotated lines. rotation_degrees now flows span → char → Word and is exposed on Word (Rust/serde), PyWord, the WASM word JSON, and pdf_oxide_word_get_rotation (C FFI), and surfaced on the word type of the go, ruby, php, swift, csharp, dart, elixir, zig, julia, r, objc, cpp, and node bindings, plus the JVM TextWord (Java, inherited by the Kotlin/Scala/Clojure wrappers).
Scanned Hebrew/Arabic OCR text layers extracted reversed — every word both letter- and word-order-reversed (#826) — scanned RTL PDFs whose invisible OCR text layer emits one TJ array per recognized word (the standard OCR-sandwich shape, e.g. Tesseract-style producers) had two compounding bugs in the Tj/TJ buffer-flush path. flush_tj_buffer (the default WordBoundaryMode::Tiebreaker path) never received the confidence-gated geometric direction detector, so it still used the old accumulated_width > 0.0 heuristic — true for nearly every non-empty RTL buffer — and reversed unconditionally instead of detecting direction; all three flush sites now route through one shared decision point (bidi::apply_rtl_verdict). And because already-logical invisible-OCR text and genuinely visual-order text have identical geometric signatures, text render mode is now threaded through so invisible runs (Tr 3/7) skip the geometric heuristics entirely and trust extraction order as-is.
FluentPageBuilder::rich_paragraph drew consecutive TextRuns flush together — TextRun::bold("Text Run 1") + TextRun::normal("Text Run 2") extracted as Text Run 1Text Run 2 (#837) — each run word-wraps and emits its own text, then advances cursor_x by exactly the emitted width, with nothing separating one run's end from the next's start, so a run boundary falling mid-line drew the next run against the previous one. Consecutive runs on the same line are now separated.
Stacked two-line column/table-header cells fused into one token — Comparison over rate extracted as Comparisonrate (#847) — when the structure-tree (tagged-content) assembler linearizes a header cell drawn as two stacked rows, the rows arrive as consecutive spans that horizontally overlap (negative gap) at a baseline drop sitting just under the same-line threshold, so the assembler treats them as one line and defers to the space decision — which, seeing a negative gap, returned no space and glued them. A negative gap combined with a genuine baseline shift is two stacked tokens, never intra-word kerning (which shares a baseline), so a separator is now inserted. Scoped to the tagged/structure-tree path so main-flow inputs (e.g. LaTeX math fraction stacks, already handled by dedicated line-break branches) stay byte-identical; a 419-PDF sweep confirmed the change is isolated to tagged tables/forms with only glyph-preserving spacing gains.
Per-glyph advance drifted behind the true rendered position on kerned/justified text, manufacturing phantom inter-glyph gaps (#847) — a sub-threshold TJ positioning number (ISO 32000-1 §9.4.4) advanced the text matrix but was dropped from the run's stored per-glyph advance (char_widths/accumulated width), so on a line drawn as one continuous buffer the many small post-space kerning offsets accumulated into a multi-point undershoot: the reconstructed glyph positions fell behind where the glyphs actually render. Poppler/PDFium/pymupdf all agree on the true position because they fold the offset into the advance; pdf_oxide was the sole outlier (−2.3 pt over one measured line, concentrated at word gaps). The stored advance now folds the exact §9.4.4 displacement — −Tj/1000 × Tfs × Th — into the run, so per-glyph geometry equals the text-matrix position by construction (closing ~72% of the drift on the worst case; the residual is the /Widths-vs-substitute-font-metric difference, a separate axis). This is a generic positioning fix, not a heuristic — it is the same advance the renderer uses — and it is what lets the narrow-word-gap rescue below operate on true gaps instead of phantom ones (a phantom ~0.15 em gap is what previously over-split matched → match ed, forcing this rescue to be held back). A companion guard tightens the cross-font single-letter glue ceiling from 0.25 em to 0.12 em: 0.25 em is a full word space, so a word followed by a single-letter variable set in a different font run (roman solution → math-italic U) was wrongly glued into solutionU; drop-caps and small-caps initials — the glue's real target — sit tight against their word at ~0 em, so 0.12 em keeps them while releasing genuine word→variable boundaries (poppler and PDFium keep the space).
Condensed headings and tracked runs typeset with no space glyph fused adjacent words — conformance test plans → conformancetestplans (#847) — a bold heading or a running header whose word separation is pure Td/TJ positioning (no 0x20 glyph) opens inter-word gaps of only ~0.18 em, below the intra-word kerning guard (0.75× the space-glyph advance), so the words glued. Because it now runs on the accurate per-glyph advance above, the gap distribution reflects the real render rather than the old undershoot. A fixed magnitude can't separate a 0.18 em word gap from ~0.15 em kerning — but within one line the intra-word glyph gaps cluster near zero while the inter-word gaps form a distinct larger cluster. A per-line multi-level bimodal split of the gap distribution now pins the word boundary regardless of absolute magnitude — splitting at every gap level above the intra-word cluster, so a condensed running footer (© ISO 2021 – All rights reserved) recovers its ~0.10 em word gaps too, matching the advance-aware extractors (pdfminer, poppler, Adobe Acrobat) that pymupdf/pdfplumber miss. It only ever adds a space, only when the suppression came from the geometric kerning guard (a new SpaceSource::IntraWordKerning marker) — never the semantic no-space rules, so complex-script text (Devanagari, Bengali, …), CJK, ligatures, and RTL are untouched. Two guards keep it off dense math, whose sub/superscript gaps are the same ~0.10 em magnitude: it never fires across a super/subscript baseline shift, nor when another glyph's ink occupies the gap (a subscript drawn between a variable and the next symbol, λᵢr → keeps λᵢr, never λ i r). 419-PDF sweep: glyph-preserving spacing gains on headings/footers/condensed runs, zero fusions, zero over-segmentation of math or complex scripts. (A word boundary whose two glyphs overlap — negative advance, e.g. the rights reserved seam — carries no geometric signal and is recovered by no extractor, Adobe included.)
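The per-line gap split in the last item can be approximated with a much simpler technique than the one the note describes. The sketch below is a two-level, largest-jump split, not the library's multi-level algorithm: it sorts one line's inter-glyph gaps, cuts at the biggest jump between neighbouring values, and treats gaps above the cut as word boundaries. The function names, the `min_jump_em` parameter, and the sample gap values are all hypothetical.

```rust
/// Find a threshold separating intra-word from inter-word gaps on one line by
/// cutting at the largest jump in the sorted gap values. Returns None when no
/// jump reaches `min_jump_em`, i.e. the gaps show no clear second cluster.
fn word_gap_threshold(gaps_em: &[f64], min_jump_em: f64) -> Option<f64> {
    if gaps_em.len() < 2 {
        return None;
    }
    let mut sorted = gaps_em.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mut best: Option<(f64, f64)> = None; // (jump size, midpoint of the jump)
    for pair in sorted.windows(2) {
        let jump = pair[1] - pair[0];
        if jump >= min_jump_em && best.map_or(true, |(j, _)| jump > j) {
            best = Some((jump, (pair[0] + pair[1]) / 2.0));
        }
    }
    best.map(|(_, mid)| mid)
}

/// Rebuild a line's text, inserting a space wherever the gap before a glyph
/// exceeds the threshold. `gaps_em[i]` is the gap between glyph i and glyph i+1,
/// so `glyphs` must be non-empty and one longer than `gaps_em`.
fn split_words(glyphs: &[char], gaps_em: &[f64], min_jump_em: f64) -> String {
    assert_eq!(gaps_em.len() + 1, glyphs.len());
    let threshold = word_gap_threshold(gaps_em, min_jump_em);
    let mut out = String::new();
    for (i, &g) in glyphs.iter().enumerate() {
        if i > 0 {
            if let Some(t) = threshold {
                if gaps_em[i - 1] > t {
                    out.push(' ');
                }
            }
        }
        out.push(g);
    }
    out
}
```

Because the cut is relative to the line's own gap distribution, a 0.18 em word gap is separated from near-zero kerning without a fixed magnitude threshold. The guards the note lists (baseline shifts, ink occupying the gap, semantic no-space rules) are deliberately left out of this sketch.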

Security

Bumped crossbeam-epoch to 0.9.20 (RUSTSEC-2026-0204) (#827).

Internal

Pinned the Go toolchain to 1.26.5 in CI (GO-2026-5856) (#834).
Consolidated the July 2026 Dependabot cargo + github-actions updates (#835).
Added tests that remove_footers preserves body content (#800).
Fixed the broken --all-features test commands in the PR template and dev guide (#838).
Bumped office_oxide to 0.1.6.

Contributors

Community fixes merged this release:

@tobocop2 — reported and submitted the fixes for the fragmented-word, stroke-encoded table-rule, and rotated-page bugs (#811, #812, #813 → #814), the subscript-decimal bug (#816 → #817), and the displayed-math relation-sign fusion (#830 → #831); also reported the word-layer math fusion (#836) and the born-digital misclassification (#840). A standout contribution across the whole release.
@ultrasaurus (Sarah Allen) — contributed the remove_footers content-preservation tests (#800).

Issues reported by:

@tobocop2 — #811, #812, #813, #816, #830, #836, #840
@RubberDuckShobe — #837 (rich_paragraph run spacing)
@palmoni5 — #826 (Hebrew OCR-sandwich reversal)
@Goldziher (Na'aman Hirschfeld) — #847 (word fusion on positioned runs)

Thank you all — reporters and fixers alike.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform	Architecture	Archive
Linux	x86_64 (glibc)	`pdf_oxide-linux-x86_64-*.tar.gz`
Linux	x86_64 (musl)	`pdf_oxide-linux-x86_64-musl-*.tar.gz`
Linux	ARM64	`pdf_oxide-linux-aarch64-*.tar.gz`
macOS	x86_64 (Intel)	`pdf_oxide-macos-x86_64-*.tar.gz`
macOS	ARM64 (Apple Silicon)	`pdf_oxide-macos-aarch64-*.tar.gz`
Windows	x86_64	`pdf_oxide-windows-x86_64-*.zip`

Changelog

See CHANGELOG.md for full details.

26 days ago

pdf_oxide

yfedoseev

v0.3.73 | Two independent reading-order sort panics fixed — a non-transitive vertical-CJK (tategaki) column comparator and an oversized-literal lexer overflow — so malformed and scanned PDFs no longer crash extraction instead of returning text.

Fixed

Reading-order sort could panic on malformed or scanned PDFs instead of returning text (#807). Rust's sort_by/sort_unstable_by (1.81+) can detect a comparator that violates total order and panic with "does not correctly implement a total order". That panic cannot be caught across the FFI boundary, so it aborted the host process in every binding. Two independent causes were fixed:
- Tategaki (vertical-writing) column grouping (ISO 32000-1 §9.7.4.3, WMode 1). sort_spans_vertical_tategaki and its two duplicated call sites (postprocess_spans's tategaki intercept, TategakiStrategy) decided "same column" with a pairwise |a - b| <= tol check on each span's X-center. That check is not transitive: a chain of spans each within tol of its neighbor can span far more than tol end to end, so the comparator can claim A<B, B<C, and C<A all at once. This is exactly what a scanned vertical-CJK OCR layer produces — hundreds of single-glyph, sub-point-wide spans whose X-centers step by a fraction of the column pitch. Columns are now found by single-linkage clustering of X-centers (order right-to-left, start a new column when the gap to the previous center exceeds the tolerance), then sorted by (column, Y) — a genuine total order, and more accurate than quantizing each center into a fixed-size band independently, which can split two spans only a couple points apart into different columns if they straddle a band boundary.
- Oversized real-number literals silently overflowing to Infinity. ISO 32000-1:2008 Annex C.2 bounds real values to approximately ±3.403×10^38, but the lexer parsed real literals via f64::from_str, which saturates an all-digit literal past that limit to f64::INFINITY rather than erroring. Combined with a degenerate content-stream matrix (a zero CTM/Tm component), 0.0 × Infinity produced a NaN glyph coordinate that could panic the same class of sort elsewhere in the pipeline. Oversized literals are now clamped to the spec's implementation limit at parse time, so an out-of-range literal can no longer poison downstream arithmetic into NaN.
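The column clustering from the first bullet can be sketched as follows. This is a simplified illustration, not the code in sort_spans_vertical_tategaki: `tategaki_order` is a hypothetical name, spans are reduced to `(x_center, y)` pairs, and the sketch assumes PDF user space with y increasing upward, so "top to bottom" means descending y.

```rust
/// Returns span indices in vertical-writing reading order.
/// `centers[i]` is the (x_center, y) of span `i`; `tol` is the largest
/// x gap (in points) still treated as the same column.
fn tategaki_order(centers: &[(f32, f32)], tol: f32) -> Vec<usize> {
    // 1. Order spans right-to-left by x center.
    let mut by_x: Vec<usize> = (0..centers.len()).collect();
    by_x.sort_by(|&a, &b| centers[b].0.total_cmp(&centers[a].0));

    // 2. Single-linkage clustering: start a new column only when the gap
    //    to the previous center exceeds the tolerance.
    let mut keyed: Vec<(usize, usize)> = Vec::with_capacity(by_x.len()); // (column, span index)
    let mut column = 0usize;
    for (pos, &idx) in by_x.iter().enumerate() {
        if pos > 0 {
            let prev = by_x[pos - 1];
            if centers[prev].0 - centers[idx].0 > tol {
                column += 1;
            }
        }
        keyed.push((column, idx));
    }

    // 3. Sort by (column, y descending). Both keys are totally ordered,
    //    so the comparator is a genuine total order.
    keyed.sort_by(|&(ca, a), &(cb, b)| {
        ca.cmp(&cb).then(centers[b].1.total_cmp(&centers[a].1))
    });
    keyed.into_iter().map(|(_, idx)| idx).collect()
}
```

Because the column index is assigned once per span and then compared as an integer, a chain of centers that each sit within the tolerance of their neighbor lands in one column instead of producing contradictory pairwise answers.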
@tobocop2 reported this, root-caused it, and submitted a working fix (#808) using single-linkage column clustering, along with a minimal repro and three real-world vertical-Japanese novels to stress-test against. We folded that clustering approach directly into this fix (verified byte-identical output against #808 on all three novels) alongside the separate lexer fix above, so #808 was closed in favor of this PR.
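For the lexer half of the fix, a minimal sketch of clamping at parse time might look like this. `parse_pdf_real` and `PDF_REAL_MAX` are hypothetical names, and the sketch leans on Rust's own `str::parse::<f64>`, which accepts more syntax than a PDF lexer would, so treat it as an illustration of the clamp only.

```rust
/// Approximate implementation limit for real values quoted in the note above.
const PDF_REAL_MAX: f64 = 3.403e38;

/// Parses a real literal and clamps it into the implementation limit,
/// so an oversized literal can never become an infinity.
fn parse_pdf_real(literal: &str) -> Option<f64> {
    let value: f64 = literal.parse().ok()?;
    if value.is_nan() {
        // Not a number a PDF real can express; reject rather than clamp.
        return None;
    }
    Some(value.clamp(-PDF_REAL_MAX, PDF_REAL_MAX))
}
```

With the clamp in place, multiplying the parsed value by a zero matrix component gives 0.0, whereas the unclamped infinity would give NaN.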

Thanks to @tobocop2 (#807, #808) for finding, root-causing, and fixing this.

27 days ago

pdf_oxide

yfedoseev

v0.3.72 | Rotated-page text extraction & a transitive-dependency security patch — the spatial extractors no longer garble text on rotated pages, and the optional Office-export path clears an untrusted-XML denial-of-service advisory.

Security

office_oxide 0.1.2 → 0.1.3 (clears RUSTSEC-2026-0194 / RUSTSEC-2026-0195) — the optional Office-document export path depended on office_oxide 0.1.2, whose transitive quick-xml 0.40 has an unbounded per-xmlns heap allocation in NsReader::push that a crafted DOCX/XLSX/PPTX could use to exhaust memory (a denial-of-service on untrusted input). office_oxide 0.1.3 upgrades to quick-xml 0.41, which bounds the allocation. pdf_oxide's own quick-xml was already 0.41; this bump closes the remaining transitive path so the dependency tree is advisory-clean.

Fixed

extract_words / extract_spans / extract_text_lines garbled text on rotated pages (#804) — on rotated pages the spatial extractors clustered along the wrong axis and fused unrelated cells into giant tokens (a whole column returned as a single 1000+ character "word", separate rows fused into one line). Two independent root causes were fixed:
- Page /Rotate 90/270 (§7.7.3.3). Span bounding boxes were mapped into the page's displayed frame before word/line clustering, but a span decomposes into characters by laying glyphs horizontally along its bbox with their raw advance widths — a representation that cannot express a run whose visual direction has become vertical. Every raw text row therefore collapsed onto one displayed band and perpendicular columns fused. Because the horizontal clustering is already correct in raw user space (and extract_chars already reports raw coordinates), 90°/270° pages now keep their span geometry in raw space; all four spatial APIs agree. (180° pages, where text stays horizontal, keep their existing mirror.)
- Rotated text matrices (rotation_degrees = ±90 — vertical column headers, chart-axis labels). A run drawn with a rotated text matrix advances along a rotated axis, but the extractor stores a span bbox flattened onto the x-axis (width = Σ advances, height = font), so adjacent rotated columns overlap and the reading-order word merge and y-band line grouping fused them. Rotated runs are now excluded from both the cross-span word merge and the line grouping — each stays its own word(s) and its own line.
Thanks @ankursri494 for the report and the public, PII-free reproducers.

Thanks to @ankursri494 (#804) for reporting the issue that drove this release.

27 days ago

pdf_oxide

yfedoseev

v0.3.71 | Spec-alignment & extraction-leadership release — the renderer gains tiling patterns, Type 3 fonts, and mesh shadings; the markdown converter gains first-class tables, images, links, headings, nested lists, running header/footer removal, and

Added

Renderer spec alignment (ISO 32000-1) — the CPU rasteriser now paints several previously-unsupported constructs: tiling patterns (PatternType 1, §8.7.3), Type 3 font glyphs (CharProcs executed under the font matrix with d0/d1, §9.6.5), mesh shadings (free-form and lattice-form Gouraud triangle meshes and Coons/tensor patches — types 4/5/6/7 — plus function-based type 1, §8.7.4.5), text rendering modes 4–7 (glyph-outline clip accumulation across BT/ET, §9.3.6), and colour-key masking (/Mask [ranges], §8.9.6.4). JPEG 2000 images with chroma-subsampled components are now upsampled and decoded rather than skipped.
First-class tables in the markdown/HTML converters — the pipeline converter renders detected tables directly (pipe tables with header rows and colspan handling), replacing the fragile text-post-processing path.
Images, links, and document structure in markdown — figures are emitted as ![](…), /Link annotations become [text](uri) / <a href> (with a safe-scheme gate), heading hierarchy is inferred as #–######, indentation-based nested lists are preserved, cross-page running headers/footers are detected and filtered, and superscript-marker + page-bottom footnotes become [^n] references.
Hybrid-reference files (/XRefStm, §7.5.8.4) — a classic trailer's cross-reference-stream supplement is now parsed and merged, so hybrid PDFs resolve all objects.

Fixed

Per-glyph coordinates in extract_words / extract_spans / extract_text_lines drifted on CID/Type 0 fonts (#780, part 2) — these APIs reconstructed each glyph's x-position by summing nominal advance widths, which omits the ISO 32000-1 §9.4.3 TJ-array kerning, so positions drifted cumulatively along a line (up to tens of points) versus extract_chars. Each glyph's x now comes from the accurate content-stream position (matching extract_chars and Poppler's pdftotext -bbox); on the reporter's repro, glyphs within 0.5 pt of the reference went from 15 % to 97 %. Word segmentation is unchanged (the char-width array is untouched), so complex-script extraction does not regress. Thanks @ankursri494 for the report and reproducer.
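The drift is easy to reproduce with arithmetic. The sketch below is not the extractor's code: `TjItem` and `glyph_x_positions` are hypothetical names, and it assumes zero character and word spacing and 100 % horizontal scaling, so each glyph advances by `width / 1000 × font_size` and each TJ number moves the pen back by `amount / 1000 × font_size`.

```rust
/// One element of a TJ array: a glyph with its width in glyph-space units
/// (thousandths of an em), or a numeric position adjustment.
enum TjItem {
    Glyph { width: f32 },
    Adjust(f32),
}

/// Returns the x origin of every glyph, in text-space units.
/// With `apply_adjustments = false` the TJ numbers are ignored, which
/// mimics summing nominal advance widths only.
fn glyph_x_positions(items: &[TjItem], font_size: f32, apply_adjustments: bool) -> Vec<f32> {
    let mut x = 0.0_f32;
    let mut positions = Vec::with_capacity(items.len());
    for item in items {
        match item {
            TjItem::Glyph { width } => {
                positions.push(x);
                x += width / 1000.0 * font_size;
            }
            TjItem::Adjust(amount) => {
                if apply_adjustments {
                    // A positive TJ number moves the pen left, a negative one right.
                    x -= amount / 1000.0 * font_size;
                }
            }
        }
    }
    positions
}
```

For three 500-unit glyphs at 10 pt separated by `-200` adjustments, the third glyph sits at 14 pt with the adjustments applied and at 10 pt without them; the error grows with every further adjustment on the line.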
Valid ICC profiles reported as [XCOLOR-005] … not a valid stream in validate_pdf_x (#797) — an ICCBased colour space embeds its profile as a stream (§8.6.5.5, [ /ICCBased stream ]), but the validator only accepted a bare dictionary and flagged every conforming profile (including the Ghent Workgroup PDF/X-4 suite). It now reads /N from the stream dictionary. Thanks @takoportal for the detailed report and repro.
Structure-tree parsing dropped large trees under a hard-coded budget (#801) — parse_structure_tree imposed a 200 ms wall-clock budget and a 10 000-element cap and returned no structure tree at all when either was exceeded (e.g. the 756-page ISO 32000-1 specification), which is non-deterministic across machines and silently loses data. The default now parses the complete tree; callers that need to bound the work can opt in via the new parse_structure_tree_with_budget(&doc, Option<Duration>) (and doc.structure_tree_with_budget(…)). The redundant post-parse size check is removed. Thanks @bjorn3 for the report and proposed API.
Inter-word spaces dropped on justified TJ-positioned text (#803) — on documents whose words are positioned with TJ/Td offsets in embedded Type 0 / Identity-H subset fonts (e.g. the 214-page ISO 21111-10 standard), whole runs extracted glued together — All rights reserved came out as Allrightsreserved. The word-gap detector derives its threshold from the font's space-glyph advance, but under Identity-H character code 0x20 maps to CID 32 — an arbitrary glyph, not the space (ISO 32000-2 §9.7.5.2, §9.10.2: the space is reached through the font's CMap/ToUnicode, never code 0x20). Reading that ~0.56 em glyph advance as the space width inflated the threshold so far that genuine ~0.25 em word gaps fell below it and were suppressed. Identity-encoded Type 0 fonts now fall back to the 0.25 em typographic default; non-Identity CMaps that legitimately place a space at 0x20 still use their explicit /W entry. Thanks @Goldziher for the precise report and geometry.
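A worked example makes the threshold inflation concrete. The names below are hypothetical, and the 0.75 factor is an assumption borrowed from the kerning guard described in a later release note; this entry does not state the factor the detector used at the time.

```rust
/// Assumed guard factor: a gap counts as a word gap above 0.75x the space advance.
const KERNING_GUARD_FACTOR: f32 = 0.75;
/// Typographic default space width, in em.
const DEFAULT_SPACE_EM: f32 = 0.25;

/// Picks the space advance the word-gap threshold is derived from.
/// `w_entry_for_0x20` is the /W width found at code 0x20, in em, if any.
fn space_advance_em(identity_encoded: bool, w_entry_for_0x20: Option<f32>) -> f32 {
    if identity_encoded {
        // Under Identity-H, code 0x20 is CID 32, an arbitrary glyph: ignore it.
        DEFAULT_SPACE_EM
    } else {
        w_entry_for_0x20.unwrap_or(DEFAULT_SPACE_EM)
    }
}

/// True when a gap (in em) is wide enough to be a word boundary.
fn is_word_gap(gap_em: f32, space_advance_em: f32) -> bool {
    gap_em > KERNING_GUARD_FACTOR * space_advance_em
}
```

Under these assumptions a 0.56 em "space" gives a 0.42 em threshold, so a genuine 0.25 em gap is suppressed; with the 0.25 em default the threshold drops to about 0.19 em and the same gap is kept.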
Numeric median selection — heading/base-font-size statistics now use select_nth_unstable_by (exact O(n)) instead of a full sort.
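As a rough illustration of the selection-based median, assuming font sizes held as `f32` (the `median` helper is a hypothetical name, and it returns the upper median for even-length input):

```rust
/// Returns the median of `values` without fully sorting them.
/// The slice is reordered in place; for an even length the upper median is returned.
fn median(values: &mut [f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    let mid = values.len() / 2;
    // Partitions the slice so the element at `mid` is the one a full sort
    // would place there.
    let (_, m, _) = values.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    Some(*m)
}
```

Using `total_cmp` keeps the comparator a total order even if a NaN slips into the input.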

Thanks to @ankursri494 (#780), @takoportal (#797), @bjorn3 (#801), and @Goldziher (#803) for reporting the issues that drove this release.

29 days ago

pdf_oxide

yfedoseev

v0.3.70 | Extraction-fidelity release — kerning-split words rejoined in plain text, table/form line cells split consistently regardless of word width, resolved `/BaseFont` names on the span/word APIs, and content-stream order exposed on extracted spa

Added

Content-stream order exposed on extracted spans and words (#779) — extract_words and extract_text_lines now carry the originating span's sequence (the content-stream emission order). It is surfaced idiomatically on the word/span types of every language binding — Python, Node.js and WASM, Go, the JVM (Java/Kotlin/Scala/Clojure), C#, Ruby, PHP, C and C++, Objective-C, Swift, Dart, R, Julia, Zig, and Elixir — via the new C-ABI accessor pdf_oxide_word_get_sequence. This lets consumers tell genuinely-consecutive draw calls apart from spatially-close-but-stream-distant ones (e.g. table cells vs. overlays), independent of the final reading order. Thanks @ankursri494 for the request.

Fixed

A word split by a spurious space when its glyph runs overlap slightly (#791) — a single word drawn as two adjacent same-font runs whose glyphs overlap by a fraction of a point (ordinary tight kerning, e.g. (PLANAL) then (TINA) positioned just inside PLANAL's right edge) was extracted as PLANAL TINA. The plain-text assembler now recognises this case — a negative inter-run gap, same font/weight/style, word characters on both sides, real (varying) per-glyph metrics, and not a lowercase→uppercase word boundary — and joins the runs with no inserted space, reconstructing PLANALTINA, matching pdftotext / PyMuPDF / lopdf on the same file. The spans are left unmerged, so page layout, reading order, and table detection are unaffected. Thanks @schelip for the report and minimal repro.
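The join conditions can be read as a single predicate. The sketch below is a reduced version under stated assumptions: `Run` and `should_join_without_space` are hypothetical names, font identity stands in for the font/weight/style comparison, and the "real (varying) per-glyph metrics" check is omitted.

```rust
/// A run of text with its font name and horizontal extent in points.
struct Run<'a> {
    text: &'a str,
    font: &'a str,
    x0: f32,
    x1: f32,
}

/// True when `right` continues the word started by `left`.
fn should_join_without_space(left: &Run<'_>, right: &Run<'_>) -> bool {
    let (Some(last), Some(first)) = (left.text.chars().last(), right.text.chars().next()) else {
        return false; // an empty run has nothing to join
    };
    let gap = right.x0 - left.x1;
    gap < 0.0                                              // runs overlap
        && left.font == right.font                         // same font
        && last.is_alphanumeric()                          // word character on the left
        && first.is_alphanumeric()                         // word character on the right
        && !(last.is_lowercase() && first.is_uppercase())  // not a lower→upper boundary
}
```

With `PLANAL` ending at x = 140.2 and `TINA` starting at x = 140.0 in the same font, the predicate holds and the runs join as PLANALTINA.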
extract_text --format lines merged table/form cells across column gaps inconsistently (#792) — a flat 50 pt column-gap threshold made cell splitting depend on how wide each row's words happened to be, so a header row of short values (CEP/Cidade/UF) split into one line per cell while the value row directly below it (73751-452/PLANALTINA/GO, wider words, same gutters) merged into a single line. The threshold in line clustering is now font-relative ((font_size × 3).max(30 pt)), so rows sharing the same columns split the same way. Thanks @schelip for the report.
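A small sketch of the font-relative rule, assuming words arrive as `(x0, x1)` extents already sorted left to right (`column_gap_threshold` and `split_row` are hypothetical names; only the `(font_size × 3).max(30 pt)` expression comes from the note above):

```rust
/// Column-gap threshold in points: three times the font size, at least 30 pt.
fn column_gap_threshold(font_size_pt: f32) -> f32 {
    (font_size_pt * 3.0).max(30.0)
}

/// Groups word indices into cells; a gap wider than the threshold starts a new cell.
fn split_row(words: &[(f32, f32)], font_size_pt: f32) -> Vec<Vec<usize>> {
    let threshold = column_gap_threshold(font_size_pt);
    let mut cells: Vec<Vec<usize>> = Vec::new();
    for (i, &(x0, _)) in words.iter().enumerate() {
        let starts_new_cell = match i.checked_sub(1) {
            Some(prev) => x0 - words[prev].1 > threshold,
            None => true, // the first word always opens a cell
        };
        if starts_new_cell {
            cells.push(Vec::new());
        }
        cells
            .last_mut()
            .expect("a cell was pushed for the first word")
            .push(i);
    }
    cells
}
```

In a row whose wide words leave only a 45 pt gutter, a flat 50 pt threshold would merge the first two cells, while the 30 pt threshold for 10 pt text keeps them separate.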
Span-derived APIs reported unresolved (alias) font names (#780, part 1) — extract_spans, extract_words, and extract_text_lines reported the page's /Resources/Font alias (e.g. F1) rather than the resolved /BaseFont (e.g. Helvetica, CIDFont+F1). They now resolve to the base font, matching extract_chars and pdfminer.six / pdfplumber. (The second part of #780 — per-glyph coordinate drift on CID/Type0 fonts in extract_words — is tracked for a follow-up release.) Thanks @ankursri494 for the report.
Cased and caseless non-Latin prose no longer mis-detected as spatial tables — the no-rulings table detector's prose-paragraph guard now recognises sentence boundaries in cased non-Latin scripts and treats the Bengali/Devanagari danda (।, ॥) as a sentence terminator, so complex-script running prose that happens to align into columns is not extracted as a table grid.

Thanks to @schelip (#791, #792) and @ankursri494 (#779, #780) for reporting the issues that drove this release.

2026-06-27 12:21:41

pdf_oxide

yfedoseev

v0.3.69 | Language-bindings release — idiomatic bindings for C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir, each over the stable C ABI, with per-language CI, package-registry publishing, cross-language regressio

Added

Eleven new language bindings, each with an idiomatic wrapper, an api-coverage test (one assertion per public method), runnable CI-asserted examples, a README with install coordinates, and a dedicated CI workflow (Linux+macOS) running the same verification set:
- C++ (cpp/) — header-only C++17 RAII wrapper; CMake with install/export targets and a Conan recipe.
- Swift (swift/) — SwiftPM package + C module map.
- Kotlin (kotlin/) — thin facade over the Java JNI binding.
- Dart/Flutter (dart/) — dart:ffi.
- R (r/) — .Call C shim, external-pointer handles.
- Julia (julia/) — ccall.
- Zig (zig/) — @cImport.
- Scala (scala/) — thin facade over the Java JNI binding (Scala 3).
- Clojure (clojure/) — direct Java interop over the JNI binding.
- Objective-C (objc/) — NSObject wrappers over the C ABI.
- Elixir (elixir/) — dirty-scheduler NIF (CPU-bound work never blocks the BEAM).
Package-registry publishing wired into the release pipeline for the new bindings: Maven Central (Kotlin, Scala), Clojars (Clojure), Hex.pm (Elixir), and pub.dev (Dart, via GitHub OIDC). Objective-C ships as a Trunk-free CocoaPods binary pod — an xcframework + podspec uploaded as release assets and installed via a :podspec URL — since CocoaPods Trunk goes read-only on 2026-12-02. C++ (vcpkg/Conan), R (CRAN), Julia (General registry), and Swift/Zig (git tag) are documented in docs/RELEASING-bindings.md.
Cross-language regression examples — alongside each binding's basic example, three shared-scenario examples (HTML extraction, word geometry, table extraction) run with output assertions in every binding's CI workflow.
Single-source version management — scripts/sync_version.py propagates the canonical Cargo.toml version into every binding manifest and version/parity assert (--check verifies, --set X.Y.Z bumps everything). A Version Consistency CI workflow fails if any binding drifts.

Fixed

Non-Identity-ordered Type0 fonts no longer emit a wrong character for CIDs missing from /ToUnicode (#773, #775) — for an embedded Type0 font whose /ToUnicode CMap omits some drawn CIDs (e.g. a ligature glyph with no single Unicode codepoint), the decode path fell back to a numeric guess — the GID via the standard glyph-name table → AGL, or the CID itself as a code point (char::from_u32) — emitting a plausible-but-wrong, content-like character that varied per subset (e.g. a ti ligature → : / D, so notificacao → no:ficacao). The glyph has no Unicode anywhere in the file (no /ToUnicode entry, no post name, no GSUB), so the letters are unrecoverable, but substituting a wrong character is silent corruption. When a usable /ToUnicode is present, the GID→AGL guess is now suppressed for all Type0 fonts, and the CID-as-Unicode guess is suppressed for fonts whose CIDSystemInfo ordering is not Identity, so an uncovered CID there decodes to U+FFFD instead. For Identity-ordered (Adobe-Identity-0) fonts the CID-as-Unicode guess is restricted to whitespace (U+0020 → space, which producers routinely omit and is reliably CID == codepoint); any other uncovered CID likewise decodes to U+FFFD. A font with no /ToUnicode still uses the CID-as-Unicode heuristic exactly as before, and the authoritative embedded-cmap/post lookups are unchanged. This also resolves the opt-in-flag request (#775) by making the detectable-gap behaviour the default rather than a configuration flag. Thanks @schelip for reporting both issues and contributing the fix.
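The resulting fallback policy for a CID that /ToUnicode does not cover can be summarised in a few lines. This is a paraphrase of the rules above, not the decoder itself: `decode_uncovered_cid` and its two flags are hypothetical, and the authoritative embedded-cmap/post lookups that run before this point are not modelled.

```rust
/// Decides what an uncovered CID decodes to once the authoritative lookups have failed.
fn decode_uncovered_cid(cid: u32, has_to_unicode: bool, identity_ordering: bool) -> char {
    if !has_to_unicode {
        // No /ToUnicode at all: the CID-as-Unicode heuristic stays as before.
        return char::from_u32(cid).unwrap_or(char::REPLACEMENT_CHARACTER);
    }
    if identity_ordering && cid == 0x20 {
        // Identity-ordered fonts: only the routinely omitted space is guessed.
        return ' ';
    }
    // Anything else is a detectable gap, reported as U+FFFD instead of a wrong letter.
    char::REPLACEMENT_CHARACTER
}
```

Under this policy an uncovered ligature CID in a font that does carry /ToUnicode yields U+FFFD rather than a plausible-looking `:` or `D`.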

2026-06-24 16:40:27

pdf_oxide

yfedoseev

v0.3.68 | Extraction fidelity release — symbolic TrueType character mis-decoding corrected via the `(3,0)`/`(1,0)` cmap, same-row span ordering preserved in plain-text output, JPEG 2000 (`JPXDecode`) image XObjects decoded via OpenJPEG, and RTL Farsi

Added

JPEG 2000 (JPXDecode) image XObjects decoded via OpenJPEG — render_page previously skipped image XObjects whose stream was compressed with /JPXDecode, silently dropping page content. The OpenJPEG library (via jpeg2k) now decodes them at render time; multi-component images are colour-managed and alpha-composited exactly as other image types. Thanks @potatochipcoconut for the report.

Fixed

RTL Farsi body text recovered from tagged Type0/CID PDFs (#758) — Type0/CID composite fonts with a valid /ToUnicode CMap had ~92% of their body text silently dropped in v0.3.66 on RTL (Farsi) documents. The tagged-structure traversal now correctly assembles CID-encoded spans before the RTL reconstruction pass, recovering the full body. Thanks @Goldziher for the report.
Symbolic TrueType fonts no longer mis-decode characters (#760) — a simple symbolic TrueType font (FontDescriptor Flags bit 3, no /Encoding, no /ToUnicode) decoded its content bytes by treating each byte directly as a glyph ID, producing wrong-but-plausible characters (e.g. Ç → Ê, SOLUÇÃO → SOLUÊÃO). The fix parses the embedded font's (3,0) symbol (or (1,0) Macintosh) cmap subtable into a byte→GID map so the correct byte→GID→Unicode hop is applied; fonts without such a subtable still use the byte as the GID. Thanks @schelip for the report and fix.
Same-row spans no longer reordered or split in plain-text output (#752) — when one logical line was emitted as spans at the same Y in different reading-order groups whose boxes overlapped by a fraction of a point, to_plain_text interleaved the overlapping group as a vertical column (hoisting a fragment to the front) and forced a space between the overlapping fragments (splitting a word). A group whose spans share a Y row is now excluded from columnar detection, and the cross-group same-Y space rule is replaced by the standard has_horizontal_gap threshold used by the other converters. Thanks @schelip for the report and fix.

Documentation

macOS/Rust OCR setup guide corrected — ORT_LIB_LOCATION is inert with the load-dynamic ONNX Runtime feature; the guide now documents ORT_DYLIB_PATH, the variable actually read at runtime.

Dependencies

pyo3 0.28 → 0.29 — fixes two security vulnerabilities: a missing Sync bound on PyCFunction::new_closure closures, and a possible out-of-bounds read in BoundTupleIterator::nth_back / BoundListIterator::nth_back.
phf 0.13 → 0.14, bytes 1.11 → 1.12, log 0.4.32 → 0.4.33, p12-keystore 0.3.0 → 0.3.1.
GitHub Actions: actions/checkout v7.0.0, actions/setup-java v5.3.0, softprops/action-gh-release v3.0.1, ruby/setup-ruby v1.314.0, taiki-e/install-action v2.82.2.
Patch/minor updates for rustls, time, zerocopy, zeroize, wasm-bindgen, wide, and ~35 other transitive crates.
