v4.4.6
- dBASE (.dbf) format support: Extract table data from dBASE files as markdown tables with field type support.
- Hangul Word Processor (.hwp/.hwpx) support: Extract text content from HWP 5.0 documents (standard Korean document format).
- Office template/macro format variants: Added support for `.docm`, `.dotx`, `.dotm`, `.dot` (Word), `.potx`, `.potm`, `.pot` (PowerPoint), and `.xltx`, `.xlt` (Excel) formats.
- DOCX image placeholders missing (#484): Extracting `.docx` files with `extract_images=True` no longer produced image placeholders in the output. The default plain text output path was stripping image references. Image extraction now forces markdown output so placeholders are always included.
- Format count updated to 88+: Documentation across all READMEs, docs, and package manifests updated to reflect expanded format support (previously 75+).
Benchmark Results 2026-03-13 (31ae19c)
Comparative benchmark results from workflow run 23042674034.
Commit: 31ae19c77ebfd0cd6e809078c28a7cfb2388edeb Date: 2026-03-13
v4.4.5
- PDF markdown garbles positioned text (#431): PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
- Node worker pool password bug: `extractFileInWorker` was passing the `password` argument as `mime_type` to `extract_file_sync`, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into `config.pdf_options.passwords`.
- Unused import in kreuzberg-node: Removed unused `use serde_json::Value` import in `result.rs` that caused clippy warnings.
- WASM Deno OCR test hang: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime. OCR fixtures are now skipped for the wasm-deno target.
- WASM camelCase config deserialization: JS consumers send camelCase config keys (e.g. `includeDocumentStructure`) but `serde` expects snake_case. Added `camel_to_snake` transform in `parse_config()` so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
- PHP 8.5 array coercion on macOS: On PHP 8.5 + macOS, ext-php-rs coerces `#[php_class]` return values to arrays instead of objects. Added `normalizeExtractionResult()` wrapper that transparently converts arrays via `ExtractionResult::fromArray()`.
- PHP 8.5 support: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
- Vendoring scripts missing path deps: Ruby and R vendoring scripts failed when workspace dependencies use `path` instead of `version`. Added path field handling to `format_dependency()` and kreuzberg-ffi fixup block to the Ruby vendoring script.
- pdfium-render clippy lints: Fixed clippy warnings in kreuzberg-pdfium-render crate.
- CLI `--pdf-password` flag: New `--pdf-password` option on `extract` and `batch` commands for encrypted PDF support. Can be specified multiple times.
- MCP `pdf_password` parameter: Added `pdf_password` field to `extract_file`, `extract_bytes`, and `batch_extract_files` MCP tool params for better discoverability.
- API `pdf_password` multipart field: The HTTP API extract endpoint now accepts a `pdf_password` multipart field for encrypted PDFs.
- `PdfConfig` `Default` impl: Added `Default` implementation for `PdfConfig` to support ergonomic config construction.
- Binding crate clippy in CI: Added clippy steps to `ci-node`, `ci-python`, and `ci-wasm` workflows (gated to Linux). Added `node:clippy`, `python:clippy`, and `wasm:clippy` task commands.
- E2E password-protected PDF fixture: Added `pdf_password_protected` fixture testing copy-protected PDF extraction across all bindings.
- All binding crates linted in pre-commit: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm from pre-commit config.
- golangci-lint v2.11.3: Upgraded from v2.9.0 across Taskfile, CI workflows, and install scripts.
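
The positioned-text detection from #431 can be illustrated with a small Python sketch. This is not the library's implementation (which lives in the Rust core); the `Line` type, the `tolerance` used to decide whether a line "reaches" the right margin, and the function names are all assumptions made for illustration. Only the 30% threshold comes from the note above.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    right_edge: float  # x-coordinate where the line's text ends (hypothetical field)


def is_positioned_page(lines, right_margin, tolerance=5.0, threshold=0.30):
    """Return True when fewer than `threshold` of the lines reach the right margin."""
    if not lines:
        return False
    reaching = sum(1 for ln in lines if ln.right_edge >= right_margin - tolerance)
    return reaching / len(lines) < threshold


def group_paragraphs(lines, right_margin, tolerance=5.0):
    """Join lines into paragraphs; on positioned pages a short line ends its paragraph."""
    positioned = is_positioned_page(lines, right_margin, tolerance)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln.text)
        is_short = ln.right_edge < right_margin - tolerance
        if positioned and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

On a CV-like page where most lines stop well before the margin, each short line stays its own paragraph; on a page of flowing prose the lines are still merged as before.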
Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.4.4...v4.4.5
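
As a rough illustration of the camelCase config fix in this release, the sketch below converts config keys to snake_case before deserialization. It is written in Python for brevity and is an assumption about the general technique, not the actual `camel_to_snake` code in `parse_config()`; the `normalize_keys` helper and the sample keys other than `includeDocumentStructure` are hypothetical.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def camel_to_snake(key: str) -> str:
    """Convert a camelCase key to snake_case; snake_case input is unchanged."""
    return _BOUNDARY.sub("_", key).lower()


def normalize_keys(value):
    """Recursively rename dict keys so nested config objects are converted too."""
    if isinstance(value, dict):
        return {camel_to_snake(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value
```

Applying the transform recursively matters because nested option objects sent by JS consumers use camelCase keys as well.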
Release v4.4.4
- CLI test app fixes: Fixed broken symlinks in CLI test documents, corrected `--format` to `--output-format` flag usage, fixed multipart form field name (`file=` → `files=`) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection.
- Publish idempotency check scripts: Fixed `check_nuget.sh` and `check-nuget-version.sh` using bash 4+ `${var,,}` syntax incompatible with bash 3.x. Fixed `check_pypi.sh` and `check_packagist.sh` writing to `$GITHUB_OUTPUT` internally instead of stdout. Fixed `check-rubygems-version.sh` false negatives for native gems by switching from `gem search` to RubyGems JSON API. Fixed `check-rubygems-version-python.sh` Python operator precedence bug. Fixed `check-maven-version.sh` using unreliable Solr search API.
- CLI install with all features: CLI test install script now uses `--all-features` flag.
- Publish workflow republish support: Added `republish` input to publish workflow for clean retag + full republish.
- C# lint exclusion: Excluded test_apps from C# lint script to avoid chicken-and-egg NuGet version resolution failures.
See CHANGELOG.md for full details.
v4.4.3
- PDF image placeholder toggle: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting image references into the markdown content.
- Token reduction not applied (#436): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is configured.
- Nested HTML table extraction: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
- hOCR plain text output: hOCR conversion now correctly produces plain text when `OutputFormat::Plain` is requested, instead of silently falling back to Markdown.
- PDF garbled text for positioned/tabular content (#431): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds `0.8 × avg_font_size`.
- Chunk page metadata drift with overlap (#439): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled.
- Node.js metadata casing: Standardized all `Metadata` and `EmailMetadata` fields to `camelCase` in the Node.js/TypeScript bindings. Also corrected pluralization for `authors` and `keywords`.
- WASM build failure on Windows CI: CMake try-compile checks on Windows used the host MSVC compiler (`cl.exe`), which rejected GCC/Clang flags like `-Wno-implicit-function-declaration`. Added `CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY` to WASM cross-compilation builds.
- WASM OCR build panic when `git`/`patch` unavailable: The tesseract WASM patch application panicked when both `git apply` and `patch` commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes via idempotent string replacements.
Release v4.4.2
- E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#
- Ruby PDF annotation extraction: Fixed `PdfAnnotation` and `PdfAnnotationBoundingBox` autoload and bounding box field name mismatch
- WASM OCR blocking event loop: OCR now runs in a worker thread, keeping the main thread responsive
- JPEG 2000 OCR decode failure: Shared `load_image_for_ocr()` helper with `hayro-jpeg2000`/`hayro-jbig2` decoders across all OCR backends
- WASM PDF empty content: PDFium initialization now properly awaited during `initWasm()`
- OMML-to-LaTeX math conversion for DOCX: Mathematical equations converted to LaTeX notation
- Plain text output paths for all extractors: DOCX, PPTX, ODT, FB2, DocBook, RTF, Jupyter produce clean plain text when requested
- `cells_to_text()` shared utility: Tab-separated plain text table formatter
- CLI includes all features: `kreuzberg-cli` now uses `full` feature set including archives
See CHANGELOG.md for full details.
Benchmark Results 2026-03-03 (8b7e35c)
Comparative benchmark results from workflow run 22610076103.
Commit: 8b7e35c641e3a918146d73e4b954c6d3a5cb6bdf Date: 2026-03-03
Benchmark Results 2026-03-01 (978102f)
Comparative benchmark results from workflow run 22521432636.
Commit: 978102f360273632db02791b07f4106af9af7408 Date: 2026-03-01
v4.4.1
- OCR table inlining into markdown content (#421): When `output_format = Markdown` and OCR detects tables, the markdown pipe tables are now inlined into `result.content` at their correct vertical positions instead of only appearing in `result.tables`. Adds `OcrTableBoundingBox` to `OcrTable` for spatial positioning. Sets `metadata.output_format = "markdown"` to signal pre-formatted content and skip re-conversion.
- OCR table bounding boxes: OCR-detected tables now include bounding box coordinates (pixel-level) computed from TSV word positions, propagated through all bindings as `Table.bounding_box`.
- OCR table test images: Added balance sheet and financial table test images from issue #421 for integration testing.
- OCR `test_tsv_row_to_element` used wrong Tesseract level: Test specified `level: 4` (Line) but asserted `Word`. Fixed to `level: 5` (correct Tesseract word level).
- MSG recipients missing email addresses: The MSG extractor read `PR_DISPLAY_TO` which contains only display names (e.g. "John Jennings"), losing email addresses entirely. Now reads recipient substorages (`__recip_version1.0_#XXXXXXXX`) with `PR_EMAIL_ADDRESS` and `PR_RECIPIENT_TYPE` to produce full `"Name" <email>` output with correct To/CC/BCC separation.
- MSG date missing or incorrect: Date was parsed from `PR_TRANSPORT_MESSAGE_HEADERS` which is absent in many MSG files. Now reads `PR_CLIENT_SUBMIT_TIME` FILETIME directly from the MAPI properties stream, with fallback to transport headers.
- EML date mangled for non-standard formats: `mail_parser` parsed ISO 8601 dates (e.g. `2025-07-29T12:42:06.000Z`) into garbled output (`2000-00-20T00:00:00Z`) and replaced invalid dates with `2000-00-00T00:00:00Z`. Now extracts the raw `Date:` header text from the email bytes, preserving the original value.
- EML/MSG attachments line pollutes text output: `build_email_text_output()` appended an `Attachments: ...` line that doesn't represent message content. Removed from text output; attachment names remain in metadata.
- HTML script/style tags leak in email fallback: The regex-based HTML cleaner for email bodies used `.*?` which doesn't match across newlines, allowing multiline `<script>`/`<style>` content to leak into extracted text. Added `(?s)` flag for dotall matching.
- SVG CData content leaks JavaScript/CSS: `Event::CData` handler in the XML extractor didn't check SVG mode, causing `<script>` and `<style>` CDATA blocks to appear in SVG text output.
- RTF parser leaks metadata noise into text: The RTF extractor did not skip known destination groups (`fonttbl`, `stylesheet`, `colortbl`, `info`, `themedata`, etc.) or ignorable destinations (`{\*\...}`), causing ~17KB of font tables, color definitions, and internal metadata to appear in extracted text.
- RTF `\u` control word mishandled: Control words like `\ul` (underline) and `\uc1` were incorrectly interpreted as Unicode escapes (`\u` + numeric param), producing garbage characters instead of being treated as formatting commands.
- RTF paragraph breaks collapsed to spaces: `\par` control words emitted a single space instead of newlines, causing all paragraphs to merge into a single line. Now correctly emits double newlines for paragraph separation.
- RTF whitespace normalization destroys paragraph structure: `normalize_whitespace()` treated newlines as whitespace and collapsed them to spaces. Rewritten to preserve newlines while collapsing runs of spaces within lines.
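
Two of the text-cleanup fixes above, the `(?s)` dotall flag and the newline-preserving whitespace normalization, can be demonstrated with a short Python sketch. Python's `re` module shares the relevant behaviour with the regex syntax described in the notes: `.` does not match a newline unless `(?s)` is set. The patterns and function bodies here are assumptions for illustration, not the extractor's actual code; only the name `normalize_whitespace` appears in the release notes.

```python
import re

# Without (?s), "." stops at "\n", so a multiline <script> body never matches.
_NAIVE = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.IGNORECASE)
# With (?s), "." also matches newlines and the whole element is removed.
_DOTALL = re.compile(r"(?s)<(script|style)[^>]*>.*?</\1>", re.IGNORECASE)


def strip_script_style(html: str, dotall: bool = True) -> str:
    """Remove <script>/<style> elements; dotall=False reproduces the leak."""
    pattern = _DOTALL if dotall else _NAIVE
    return pattern.sub("", html)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs within each line, keeping newlines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines)
```

Splitting on newlines before collapsing is what keeps the double newlines emitted for paragraph breaks intact.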