kreuzberg - 米舟开源

kreuzberg-dev/kreuzberg

4 days ago

kreuzberg

kreuzberg-dev

v4.4.6

Added

dBASE (.dbf) format support: Extract table data from dBASE files as markdown tables with field type support.
Hangul Word Processor (.hwp/.hwpx) support: Extract text content from HWP 5.0 documents (standard Korean document format).
Office template/macro format variants: Added support for .docm, .dotx, .dotm, .dot (Word), .potx, .potm, .pot (PowerPoint), .xltx, .xlt (Excel) formats.

Fixed

DOCX image placeholders missing (#484): Extracting .docx files with extract_images=True did not produce ![](image) placeholders in the output because the default plain text output path stripped image references. Image extraction now forces markdown output so placeholders are always included.

Changed

Format count updated to 88+: Documentation across all READMEs, docs, and package manifests updated to reflect expanded format support (previously 75+).

5 days ago

Benchmark Results 2026-03-13 (31ae19c)

Comparative benchmark results from workflow run 23042674034.

Commit: 31ae19c77ebfd0cd6e809078c28a7cfb2388edeb Date: 2026-03-13

7 days ago

v4.4.5

Fixed

PDF markdown garbles positioned text (#431): PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
Node worker pool password bug: extractFileInWorker was passing the password argument as mime_type to extract_file_sync, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into config.pdf_options.passwords.
Unused import in kreuzberg-node: Removed unused use serde_json::Value import in result.rs that caused clippy warnings.
WASM Deno OCR test hang: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime. OCR fixtures are now skipped for the wasm-deno target.
WASM camelCase config deserialization: JS consumers send camelCase config keys (e.g. includeDocumentStructure) but serde expects snake_case. Added camel_to_snake transform in parse_config() so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
PHP 8.5 array coercion on macOS: On PHP 8.5 + macOS, ext-php-rs coerces #[php_class] return values to arrays instead of objects. Added normalizeExtractionResult() wrapper that transparently converts arrays via ExtractionResult::fromArray().
PHP 8.5 support: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
Vendoring scripts missing path deps: Ruby and R vendoring scripts failed when workspace dependencies use path instead of version. Added path field handling to format_dependency() and kreuzberg-ffi fixup block to the Ruby vendoring script.
pdfium-render clippy lints: Fixed clippy warnings in kreuzberg-pdfium-render crate.
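
The positioned-text detection described for #431 can be sketched as follows. This is a minimal Python illustration, not kreuzberg's Rust implementation: the `Line` type, the `margin_tolerance` parameter, and the rule that a line "reaches" the right margin when its right edge lies within that tolerance of the widest line on the page are all assumptions. Only the 30% threshold comes from the release note.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    right_edge: float  # x-coordinate where the line's text ends (assumed field)


def group_paragraphs(lines, threshold=0.30, margin_tolerance=0.05):
    """Group a page's lines into paragraphs.

    If fewer than `threshold` of the lines reach the right margin, the page is
    treated as positioned text and every short line closes its paragraph.
    Otherwise all lines flow into a single paragraph.
    """
    if not lines:
        return []
    # Assumption: the right margin is the right edge of the widest line.
    right_margin = max(line.right_edge for line in lines)
    cutoff = right_margin * (1.0 - margin_tolerance)
    full = [line.right_edge >= cutoff for line in lines]
    positioned = sum(full) / len(lines) < threshold

    if not positioned:
        return [" ".join(line.text for line in lines)]

    paragraphs, current = [], []
    for line, is_full in zip(lines, full):
        current.append(line.text)
        if not is_full:  # a short line ends the current paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

With an address block where only one of four lines reaches the margin, each short line stays on its own; a page of mostly full-width lines is still merged into running text.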

Added

CLI --pdf-password flag: New --pdf-password option on extract and batch commands for encrypted PDF support. Can be specified multiple times.
MCP pdf_password parameter: Added pdf_password field to extract_file, extract_bytes, and batch_extract_files MCP tool params for better discoverability.
API pdf_password multipart field: The HTTP API extract endpoint now accepts a pdf_password multipart field for encrypted PDFs.
PdfConfig Default impl: Added Default implementation for PdfConfig to support ergonomic config construction.
Binding crate clippy in CI: Added clippy steps to ci-node, ci-python, and ci-wasm workflows (gated to Linux). Added node:clippy, python:clippy, and wasm:clippy task commands.
E2E password-protected PDF fixture: Added pdf_password_protected fixture testing copy-protected PDF extraction across all bindings.

Changed

All binding crates linted in pre-commit: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm from pre-commit config.
golangci-lint v2.11.3: Upgraded from v2.9.0 across Taskfile, CI workflows, and install scripts.

Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/compare/v4.4.4...v4.4.5

10 days ago

Release v4.4.4

Fixed

CLI test app fixes: Fixed broken symlinks in CLI test documents, corrected --format to --output-format flag usage, fixed multipart form field name (file= → files=) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection.
Publish idempotency check scripts: Fixed check_nuget.sh and check-nuget-version.sh using bash 4+ ${var,,} syntax incompatible with bash 3.x. Fixed check_pypi.sh and check_packagist.sh writing to $GITHUB_OUTPUT internally instead of stdout. Fixed check-rubygems-version.sh false negatives for native gems by switching from gem search to RubyGems JSON API. Fixed check-rubygems-version-python.sh Python operator precedence bug. Fixed check-maven-version.sh using unreliable Solr search API.

Changed

CLI install with all features: CLI test install script now uses --all-features flag.
Publish workflow republish support: Added republish input to publish workflow for clean retag + full republish.
C# lint exclusion: Excluded test_apps from C# lint script to avoid chicken-and-egg NuGet version resolution failures.

See CHANGELOG.md for full details.

12 days ago

v4.4.3

Added

PDF image placeholder toggle: New inject_placeholders option on ImageExtractionConfig (default: true). Set to false to extract images as data without injecting ![image](...) references into the markdown content.

Fixed

Token reduction not applied (#436): Token reduction config was accepted but never executed during extraction. The pipeline now applies reduce_tokens() when token_reduction.mode is configured.
Nested HTML table extraction: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
hOCR plain text output: hOCR conversion now correctly produces plain text when OutputFormat::Plain is requested, instead of silently falling back to Markdown.
PDF garbled text for positioned/tabular content (#431): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds 0.8 × avg_font_size.
Chunk page metadata drift with overlap (#439): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled.
Node.js metadata casing: Standardized all Metadata and EmailMetadata fields to camelCase in the Node.js/TypeScript bindings. Also corrected pluralization for authors and keywords.
WASM build failure on Windows CI: CMake try-compile checks on Windows used the host MSVC compiler (cl.exe), which rejected GCC/Clang flags like -Wno-implicit-function-declaration. Added CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY to WASM cross-compilation builds.
WASM OCR build panic when git/patch unavailable: The tesseract WASM patch application panicked when both git apply and patch commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes via idempotent string replacements.
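
The gap-based space insertion from the #431 fix above can be illustrated with a short Python sketch. The `Char` type and its fields are hypothetical, and computing `avg_font_size` as the mean font size over the characters passed in is an assumption. The release note only states the 0.8 × avg_font_size threshold.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x_start: float    # left edge of the glyph (assumed field)
    x_end: float      # right edge of the glyph (assumed field)
    font_size: float


def join_with_gap_spaces(chars, gap_factor=0.8):
    """Concatenate characters, inserting a space wherever the horizontal gap
    between consecutive characters exceeds gap_factor * avg_font_size."""
    if not chars:
        return ""
    # Assumption: average font size is taken over the given characters.
    avg_font_size = sum(c.font_size for c in chars) / len(chars)
    threshold = gap_factor * avg_font_size
    out = [chars[0].text]
    for prev, cur in zip(chars, chars[1:]):
        if cur.x_start - prev.x_end > threshold:
            out.append(" ")
        out.append(cur.text)
    return "".join(out)
```

Two glyphs that touch are joined directly, while a glyph placed 18 units after the previous one at font size 10 (threshold 8) is preceded by a space.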

14 days ago

Release v4.4.2

Fixed

E2E element type assertions: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#
Ruby PDF annotation extraction: Fixed PdfAnnotation and PdfAnnotationBoundingBox autoload and bounding box field name mismatch
WASM OCR blocking event loop: OCR now runs in a worker thread, keeping the main thread responsive
JPEG 2000 OCR decode failure: All OCR backends now share a load_image_for_ocr() helper with hayro-jpeg2000/hayro-jbig2 decoders
WASM PDF empty content: PDFium initialization now properly awaited during initWasm()

Added

OMML-to-LaTeX math conversion for DOCX: Mathematical equations converted to LaTeX notation
Plain text output paths for all extractors: DOCX, PPTX, ODT, FB2, DocBook, RTF, Jupyter produce clean plain text when requested
cells_to_text() shared utility: Tab-separated plain text table formatter
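
A tab-separated table formatter of the kind described for cells_to_text() might look like the sketch below. This is a hypothetical Python analogue, not the Rust utility itself. Collapsing whitespace inside a cell so that embedded tabs and newlines cannot break the layout is an assumption of this sketch.

```python
def cells_to_text(cells):
    """Render a table (list of rows, each a list of cells) as plain text:
    cells separated by tabs, rows separated by newlines."""

    def clean(cell):
        # Assumption: collapse internal whitespace so a cell never contains
        # a tab or newline that would corrupt the row/column structure.
        return " ".join(str(cell).split())

    return "\n".join("\t".join(clean(c) for c in row) for row in cells)
```

For example, `cells_to_text([["Name", "Qty"], ["Widget", "2"]])` yields two lines, each with two tab-separated fields.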

Changed

CLI includes all features: kreuzberg-cli now uses full feature set including archives

See CHANGELOG.md for full details.

14 days ago

Benchmark Results 2026-03-03 (8b7e35c)

Comparative benchmark results from workflow run 22610076103.

Commit: 8b7e35c641e3a918146d73e4b954c6d3a5cb6bdf Date: 2026-03-03

17 days ago

Benchmark Results 2026-03-01 (978102f)

Comparative benchmark results from workflow run 22521432636.

Commit: 978102f360273632db02791b07f4106af9af7408 Date: 2026-03-01

18 days ago

v4.4.1

Added

OCR table inlining into markdown content (#421): When output_format = Markdown and OCR detects tables, the markdown pipe tables are now inlined into result.content at their correct vertical positions instead of only appearing in result.tables. Adds OcrTableBoundingBox to OcrTable for spatial positioning. Sets metadata.output_format = "markdown" to signal pre-formatted content and skip re-conversion.
OCR table bounding boxes: OCR-detected tables now include bounding box coordinates (pixel-level) computed from TSV word positions, propagated through all bindings as Table.bounding_box.
OCR table test images: Added balance sheet and financial table test images from issue #421 for integration testing.
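
Computing a pixel-level bounding box from TSV word positions, as the bounding-box entry above describes, can be sketched like this. The column names follow Tesseract's standard TSV output, where level 5 rows are words. Two things here are assumptions rather than kreuzberg's actual API: the returned keys (`left`, `top`, `right`, `bottom`) and the simplification that the TSV passed in contains only the table's words.

```python
import csv
import io


def table_bounding_box(tsv_text):
    """Return the union of all word boxes in a Tesseract TSV string,
    or None when the TSV contains no non-empty words."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Level 5 rows are individual words; skip rows without text.
    words = [
        row for row in reader
        if row["level"] == "5" and (row.get("text") or "").strip()
    ]
    if not words:
        return None
    return {
        "left": min(int(w["left"]) for w in words),
        "top": min(int(w["top"]) for w in words),
        "right": max(int(w["left"]) + int(w["width"]) for w in words),
        "bottom": max(int(w["top"]) + int(w["height"]) for w in words),
    }
```

The result is the smallest axis-aligned rectangle that encloses every recognized word.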

Fixed

OCR test_tsv_row_to_element used wrong Tesseract level: Test specified level: 4 (Line) but asserted Word. Fixed to level: 5 (correct Tesseract word level).
MSG recipients missing email addresses: The MSG extractor read PR_DISPLAY_TO which contains only display names (e.g. "John Jennings"), losing email addresses entirely. Now reads recipient substorages (__recip_version1.0_#XXXXXXXX) with PR_EMAIL_ADDRESS and PR_RECIPIENT_TYPE to produce full "Name" <email> output with correct To/CC/BCC separation.
MSG date missing or incorrect: Date was parsed from PR_TRANSPORT_MESSAGE_HEADERS which is absent in many MSG files. Now reads PR_CLIENT_SUBMIT_TIME FILETIME directly from the MAPI properties stream, with fallback to transport headers.
EML date mangled for non-standard formats: mail_parser parsed ISO 8601 dates (e.g. 2025-07-29T12:42:06.000Z) into garbled output (2000-00-20T00:00:00Z) and replaced invalid dates with 2000-00-00T00:00:00Z. Now extracts the raw Date: header text from the email bytes, preserving the original value.
EML/MSG attachments line pollutes text output: build_email_text_output() appended an Attachments: ... line that doesn't represent message content. Removed from text output; attachment names remain in metadata.
HTML script/style tags leak in email fallback: The regex-based HTML cleaner for email bodies used .*? which doesn't match across newlines, allowing multiline <script>/<style> content to leak into extracted text. Added (?s) flag for dotall matching.
SVG CData content leaks JavaScript/CSS: Event::CData handler in the XML extractor didn't check SVG mode, causing <script> and <style> CDATA blocks to appear in SVG text output.
RTF parser leaks metadata noise into text: The RTF extractor did not skip known destination groups (fonttbl, stylesheet, colortbl, info, themedata, etc.) or ignorable destinations ({\*\...}), causing ~17KB of font tables, color definitions, and internal metadata to appear in extracted text.
RTF \u control word mishandled: Control words like \ul (underline) and \uc1 were incorrectly interpreted as Unicode escapes (\u + numeric param), producing garbage characters instead of being treated as formatting commands.
RTF paragraph breaks collapsed to spaces: \par control words emitted a single space instead of newlines, causing all paragraphs to merge into a single line. Now correctly emits double newlines for paragraph separation.
RTF whitespace normalization destroys paragraph structure: normalize_whitespace() treated newlines as whitespace and collapsed them to spaces. Rewritten to preserve newlines while collapsing runs of spaces within lines.
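
The script/style leak fixed by the (?s) flag is easy to reproduce. The sketch below uses Python's `re` module, where `.` likewise does not match newlines unless dotall mode is enabled. The pattern and sample HTML are illustrative assumptions, not the expression used in kreuzberg.

```python
import re

html = "<p>Hello</p>\n<script>\nvar x = 1;\n</script>\n<p>World</p>"

# Without dotall, `.` stops at newlines, so a multiline <script> never matches.
without_dotall = re.compile(r"<script[^>]*>.*?</script>")
# The inline (?s) flag lets `.` match newlines as well.
with_dotall = re.compile(r"(?s)<script[^>]*>.*?</script>")

leaky = without_dotall.sub("", html)
clean = with_dotall.sub("", html)
```

Here `leaky` still contains the script body, while `clean` has the whole `<script>` element removed.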