kreuzberg-dev/kreuzberg
1 day ago
kreuzberg

Release v4.2.6

Fixed

Python Bindings

  • Added output_format, result_format, elements, and djot_content fields to ExtractionResult
  • Created proper PyChunk pyclass with attribute access (chunk.content) instead of raw dicts

Benchmark Harness

  • Unified output format: merged consolidated.json + aggregated.json into single results.json (schema v2.0.0)
  • Added F1 token-based quality scoring with ground truth support
  • Added OCR coverage for docling, unstructured, tika, mineru
  • Naming normalization: strip -sync/-async suffixes
  • Safety: eliminated unsafe set_var, added NaN sanitization, bounds checks, input validation
  • Fixed zero-duration throughput inflation in batch results
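
The token-based F1 score mentioned above can be sketched as follows. This is a minimal illustration, not the harness's actual implementation: the function name `token_f1`, the lowercase whitespace tokenization, and the handling of empty inputs are all assumptions.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between extracted text and ground-truth text.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Assumption: two empty texts match perfectly; one empty text is a miss.
        return 1.0 if pred_tokens == ref_tokens else 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For "the quick brown fox" against "the quick red fox", three of four tokens overlap on each side, so precision, recall, and F1 are all 0.75.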

See CHANGELOG.md for full details.

1 day ago
kreuzberg

v4.2.5

[4.2.5] - 2026-01-30

Fixed

Python Bindings

  • Missing OutputFormat/ResultFormat exports with Python 3.10 compatibility
  • .pyi stub alignment with Rust core (missing elements, Element, BoundingBox, PageHierarchy types)

PHP Bindings

  • Config alignment with Rust core (field names, defaults, removed phantom parameters)
  • Serialization test fixes and PHPStan compliance

TypeScript/Node Bindings

  • Missing elements field in NAPI-RS bindings
  • outputFormat and resultFormat now correctly passed through config normalizer
  • Serialization test import path fix

Java Bindings

  • asMap() null handling fix: absent config sections no longer incorrectly deserialized as default objects

C# Bindings

  • Enum serialization and test exception alignment

Elixir Bindings

  • Windows CI fix and e2e-generator support

Node Bindings

  • Bun runtime support

Changed

  • Benchmark CI artifact size reduced from ~1.5GB to essential files only
  • PageContent field parity across all language bindings

2 days ago
kreuzberg

v4.2.4

Fixed

TypeScript/Node Bindings

  • Added Element, ElementType, BoundingBox, and ElementMetadata types to the TypeScript API surface
  • Fixed batchExtractFile → batchExtractFiles export name

Rust Core

  • Added #[serde(default)] to KeywordConfig fields for partial config deserialization

C# Bindings

  • Added JsonSerializable attributes for Element types (element_based result format)

Go Bindings

  • Removed deprecated Success field references

Elixir Bindings

  • Derived Jason.Encoder for ExtractionConfig struct

See CHANGELOG.md for full details.

3 days ago
kreuzberg

v4.2.3

See CHANGELOG.md for details.

3 days ago
kreuzberg

v4.2.3

Fixed

Elixir Bindings

  • API parity: Added ExtractionConfig.new/0 and new/1 constructors for consistent struct creation
  • Chunk field alignment: Changed text field to content for API parity with Rust core

C# Bindings

  • Error type fix: File-not-found errors now throw KreuzbergIOException instead of KreuzbergValidationException
    • Aligns with Rust error handling where file access issues are I/O errors

Go Bindings

  • Test alignment: Removed references to deprecated WithEmbedding() API and Chunking.Embedding field
  • Test fixes: Updated config_comprehensive_test, config_result_test, embeddings_test, memory_safety_test

Java Bindings

  • ExtractionConfig expansion: Added embedding() and imagePreprocessing() builder methods
  • Default value alignment: Fixed test assertions to expect enableQualityProcessing=true (matches Rust default)

Ruby Bindings

  • Rubocop compliance: Fixed Style/EmptyClassDefinition offenses in api_proxy.rb, cli_proxy.rb, mcp_proxy.rb

3 days ago
kreuzberg

v4.2.2

Changed

API Alignment - 1:1 Parity Across All Bindings

  • Strict API parity enforcement: All 9 language bindings now have exact 1:1 field parity with Rust core
    • Verification script (scripts/verify_api_parity.py) now runs in STRICT mode, failing on ANY field differences
    • Added to ci-validate.yaml workflow to prevent future API drift
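
The idea of a strict parity check can be sketched as a set comparison. This is not the code of scripts/verify_api_parity.py: `check_parity`, the binding names, and the behavior of the non-strict mode (reporting only missing fields) are assumptions for illustration.

```python
def check_parity(canonical: set[str], bindings: dict[str, set[str]],
                 strict: bool = True) -> list[str]:
    """Compare each binding's field set against the canonical field set.

    Assumption: in non-strict mode only missing fields are reported;
    in strict mode any difference (missing or extra) is reported.
    """
    problems = []
    for name in sorted(bindings):
        missing = sorted(canonical - bindings[name])
        extra = sorted(bindings[name] - canonical)
        if missing:
            problems.append(f"{name}: missing {missing}")
        if extra and strict:
            problems.append(f"{name}: extra {extra}")
    return problems


# Illustrative field sets; the binding names are hypothetical.
canonical = {"output_format", "result_format", "elements", "djot_content"}
bindings = {
    "binding_a": canonical | {"success"},        # carries a non-canonical field
    "binding_b": canonical - {"djot_content"},   # lacks a canonical field
}
problems = check_parity(canonical, bindings, strict=True)
```

A CI job could fail whenever the returned list is non-empty, which is one way to get the "fail on any field difference" behavior described above.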

PHP Bindings

  • ExtractionConfig alignment: Removed 5 non-canonical fields and fixed defaults
    • Removed: embedding, extractImages, extractTables, preserveFormatting, outputEncoding
    • Fixed defaults: useCache → true, enableQualityProcessing → true, maxConcurrentExtractions → null
    • Updated ExtractionConfigBuilder to match canonical API
    • All 16 fields now match Rust canonical source exactly

Go Bindings

  • ExtractionResult alignment: Removed Success field (not in Rust canonical)
  • PageInfo alignment: Removed Visible and ContentType fields (not in Rust canonical)
  • Updated 14 test files to remove references to removed fields

Ruby Bindings

  • Default value fix: Changed enable_quality_processing default from false to true to match Rust

Java Bindings

  • Default value fix: Changed enableQualityProcessing default from false to true to match Rust

TypeScript Bindings

  • Type exports cleanup: Removed non-existent type exports (EmbeddingConfig, EmbeddingModelType, HierarchyConfig, ImagePreprocessingConfig) from index.ts

Fixed

Elixir Bindings

  • Hex package compilation: Fixed force_build: true causing production installs to fail (#333)
    • Changed to force_build: Mix.env() in [:test, :dev] to only build from source in development
    • Production installs now correctly use precompiled NIF binaries from GitHub releases

Docker Images

  • Tesseract OCR plugin initialization: Fixed "OCR backend 'tesseract' not registered" error in published Docker images
  • Embeddings plugin initialization: Fixed "Failed to initialize embedding model" error

API

  • JSON error responses: Added custom JsonApi extractor for consistent JSON error responses
  • OpenAPI schema improvements: Enhanced schema validation constraints
  • Chunking validation: Added validation that overlap must be less than max_characters
  • Embed validation: Added validation that all text entries must be non-empty strings
  • Default embedding model: EmbeddingConfig.model now defaults to "balanced" preset
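
The two validation rules above can be expressed compactly. The following is a sketch only: the function names and error messages are hypothetical, and whether whitespace-only strings count as empty is not stated in the release notes (here only the empty string is rejected).

```python
def validate_chunking(max_characters: int, overlap: int) -> None:
    """Reject chunking settings where overlap is not less than max_characters."""
    if overlap >= max_characters:
        raise ValueError(
            f"overlap ({overlap}) must be less than max_characters ({max_characters})"
        )


def validate_embed_texts(texts: list) -> None:
    """Reject any entry that is not a non-empty string."""
    for index, text in enumerate(texts):
        # Assumption: only "" is treated as empty; whitespace is accepted.
        if not isinstance(text, str) or text == "":
            raise ValueError(f"texts[{index}] must be a non-empty string")
```

The overlap rule matters because a chunker that advances by `max_characters - overlap` characters per step would never move forward if the overlap were equal to or larger than the chunk size.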

CI/CD

  • Schemathesis API contract testing: Added schemathesis to Docker CI workflow

Rust Core

  • XLSX OOM with Excel Solver files: Fixed out-of-memory issue when processing XLSX files with sparse data at extreme cell positions (#331)
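
The release notes do not describe how the fix works. As a rough illustration of why sparse data at extreme cell positions can exhaust memory, the sketch below contrasts the size of a dense grid spanning the used range with a sparse, per-cell representation. All names are hypothetical, and the sparse approach is one possible mitigation, not necessarily the one Kreuzberg uses.

```python
from typing import Dict, Iterator, List, Tuple

Cell = Tuple[int, int]  # zero-based (row, column)


def dense_cell_count(cells: Dict[Cell, str]) -> int:
    """Number of slots a dense grid covering the used range would allocate."""
    max_row = max(row for row, _ in cells)
    max_col = max(col for _, col in cells)
    return (max_row + 1) * (max_col + 1)


def iter_rows_sparse(cells: Dict[Cell, str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield only populated rows, with their values in column order."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for (row, col), value in cells.items():
        rows.setdefault(row, []).append((col, value))
    for row in sorted(rows):
        yield row, [value for _, value in sorted(rows[row])]


# Two cells: A1 and the last cell an XLSX sheet allows (XFD1048576).
cells = {(0, 0): "a", (1_048_575, 16_383): "z"}
```

With these two cells, a dense grid would need 1,048,576 × 16,384 slots, while the sparse iteration touches only the two populated rows.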

4 days ago
kreuzberg

v4.2.1

Patch Release: API Parity Fixes and CI Reliability Improvements

This patch release fixes API validation issues, adds missing format aliases, and improves backward compatibility across all language bindings.

Fixed

Rust Core

  • PPTX image page numbers: Fixed reversed page numbers when extracting images from PPTX files (#329)
    • Images on slide 1 were incorrectly reported with page_number=2 due to unsorted slide paths from presentation.xml.rels
    • Now sorts slide paths after parsing to ensure correct ordering regardless of XML element order
  • Plugin registry error logging: Added comprehensive error logging for silent plugin failures (#328)
    • OCR registry now logs errors and warnings when plugins fail to initialize
    • Extractor registry logs plugin load failures for troubleshooting
    • PostProcessor registry tracks plugin status changes
    • Validator registry records plugin validation errors
    • New `startup_validation.rs` module provides plugin status verification
    • Server startup logs all active plugins and their initialization status (fixes Kubernetes deployment visibility)
  • Output format validation: Extended `VALID_OUTPUT_FORMATS` to include all valid aliases (`plain`, `text`, `markdown`, `md`, `djot`, `html`)
  • Error type consistency: `validate_file_exists()` now returns `Io` error instead of `Validation` error for file-not-found cases
  • C# pre-commit hooks: Added dotnet restore to format/lint check tasks to fix failures in clean environments
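
For the PPTX ordering fix above, sorting slide paths might look like the sketch below. The notes only say that slide paths are sorted after parsing; sorting by the numeric suffix (so that slide10.xml follows slide2.xml) and the helper names are assumptions.

```python
import re
from typing import List

_SLIDE_RE = re.compile(r"slide(\d+)\.xml$")


def slide_number(path: str) -> int:
    """Extract the numeric suffix from a path such as 'slides/slide12.xml'."""
    match = _SLIDE_RE.search(path)
    if match is None:
        raise ValueError(f"not a slide path: {path!r}")
    return int(match.group(1))


def sort_slide_paths(paths: List[str]) -> List[str]:
    """Sort numerically so the result does not depend on XML element order."""
    return sorted(paths, key=slide_number)


# Paths in the arbitrary order a relationships file might list them.
unsorted_paths = ["slides/slide2.xml", "slides/slide10.xml", "slides/slide1.xml"]
ordered = sort_slide_paths(unsorted_paths)
```

A plain lexicographic sort would place slide10.xml before slide2.xml, which is why the sketch compares the parsed integer instead of the raw string.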

Go Bindings

  • Format constants: Added `OutputFormatText` and `OutputFormatMd` as aliases for `plain` and `markdown`
  • Documentation: Fixed default format comment (default is `plain`, not `markdown`)

Elixir Bindings

  • Format validation: Added `text` and `md` aliases to `validate_output_format` function
  • Config validation: Updated error messages to list all valid format options
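
Alias handling of this kind can be sketched as a lookup followed by a membership check. The format names and the `text`/`md` aliases come from the notes above; the function name, the error message wording, and the case-sensitive comparison are assumptions.

```python
# Canonical output formats and their accepted aliases, per the notes above.
CANONICAL_FORMATS = ("plain", "markdown", "djot", "html")
FORMAT_ALIASES = {"text": "plain", "md": "markdown"}


def normalize_output_format(value: str) -> str:
    """Map an alias to its canonical name, or raise listing all valid options."""
    canonical = FORMAT_ALIASES.get(value, value)
    if canonical not in CANONICAL_FORMATS:
        valid = ", ".join(sorted(set(CANONICAL_FORMATS) | set(FORMAT_ALIASES)))
        raise ValueError(f"invalid output format {value!r}; valid options: {valid}")
    return canonical
```

Resolving aliases in one place keeps the rest of the code working with canonical names only, and the error message lists every accepted spelling.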

Ruby Bindings

  • CLI backward compatibility: `extract` and `detect` methods now accept both positional and keyword arguments
  • Config field naming: Renamed `image_extraction` to `images` (canonical name) with backward-compatible alias
  • Spec fixes: Updated test expectations to match actual implementation behavior

PHP Bindings

  • Config field naming: Renamed fields to canonical names (`images`, `pages`, `pdfOptions`, `postprocessor`, `tokenReduction`)
  • API parity: Added missing `postprocessor` and `tokenReduction` fields

Java Bindings

  • API parity: Added `getImages()` and `images()` builder methods as aliases for `getImageExtraction()`

WASM Bindings

  • TypeScript types: Added `outputFormat`, `resultFormat`, and `htmlOptions` to `ExtractionConfig` interface

Python E2E Tests

  • Case sensitivity: Fixed tests to use lowercase format strings (`plain`, `unified`, `element_based`)
  • API usage: Updated to use module-level functions (`config_to_json`, `config_merge`) instead of instance methods

CI/CD

  • Go test app: Fixed build by adding `-tags kreuzberg_dev` flag for FFI linking
  • Go tests: Fixed flawed pointer test that made incorrect assumptions about Go's memory model

Changed

API Verification

  • Parity script: Improved `scripts/verify_api_parity.py` to correctly parse all language bindings
    • TypeScript: Better handling of multi-line interfaces with JSDoc
    • Python: Correct parsing of `.pyi` stub files
    • Java: Extract field names from `toMap()` serialization
    • C#: Extract `JsonPropertyName` attributes for canonical names
    • WASM: Dedicated extractor for TypeScript type definitions

Documentation

  • Kubernetes deployment guide: New comprehensive guide for deploying Kreuzberg in Kubernetes (#328)
    • Complete K8s architecture overview with StatefulSet, Service, and ConfigMap examples
    • Health check configuration for plugin readiness and liveness probes
    • Logging aggregation best practices for plugin status visibility
    • Troubleshooting section for silent plugin failures in containerized environments
    • Updated Docker guide with K8s deployment references
    • Location: `docs/guides/kubernetes.md`

5 days ago
kreuzberg

v4.2.0: Complete API Consistency Across All 10 Language Bindings

Major Release: Complete API Consistency Across All 10 Language Bindings

This release achieves 100% API parity across Rust, Python, TypeScript, Ruby, Java, Go, PHP, C#, Elixir, and WebAssembly bindings. Every ExtractionConfig field is now available in all languages with consistent naming conventions and type safety.

📊 Release Stats:

  • 70 files changed
  • 11,839 lines added
  • 300+ new tests across all bindings
  • 4,500+ total tests passing
  • Complete backward compatibility for all SDK APIs

What's New

MCP Interface

  • Full config parameter support: All ExtractionConfig options now available via MCP tools
  • Enables complete configuration pass-through from AI agents to Rust core
  • Standardizes parameter handling across all MCP tools

CLI Improvements

  • --output-format flag: Canonical replacement for --content-format
  • --result-format flag: Controls result structure (unified, element_based)
  • --config-json flag: Inline JSON configuration
  • --config-json-base64 flag: Base64-encoded configuration support
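
A value for --config-json or --config-json-base64 could be prepared as in the sketch below. The snake_case key names and the use of standard (not URL-safe) Base64 are assumptions; check the CLI documentation for the exact schema and encoding.

```python
import base64
import json


def encode_config(config: dict) -> tuple:
    """Return (inline JSON, Base64-encoded JSON) for a config dictionary."""
    config_json = json.dumps(config, separators=(",", ":"))
    # Assumption: the CLI expects standard Base64 over UTF-8 encoded JSON.
    config_b64 = base64.b64encode(config_json.encode("utf-8")).decode("ascii")
    return config_json, config_b64


# Assumed key names; the values are formats named in these release notes.
config = {"output_format": "markdown", "result_format": "element_based"}
config_json, config_b64 = encode_config(config)
```

The Base64 form avoids shell quoting problems that inline JSON tends to cause, since it contains no quotes, braces, or spaces.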

Language Bindings

  • All 10 bindings now expose identical API surface
  • Added missing fields to PHP, Go, and Java bindings
  • Fixed Ruby batch chunking operations
  • Complete type safety across all languages

Testing & Documentation

  • 300+ new tests across all bindings
  • API consistency validator with CI integration
  • Comprehensive migration guide (757 lines)
  • Deprecation guide with code examples

Breaking Changes ⚠️

MCP Interface Only (AI agents only, no end-user impact):

  • enable_ocr and force_ocr moved under config object
  • Old parameter names still work with deprecation warnings

Deprecations

  • CLI: --content-format → use --output-format instead
  • Environment: KREUZBERG_CONTENT_FORMAT → use KREUZBERG_OUTPUT_FORMAT

All deprecated features remain functional in v4.2.0 and will be removed in v5.0.0.
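
A wrapper script that needs to honor both environment variables during the deprecation window could follow the pattern below. This is a sketch, not Kreuzberg's own logic: the precedence of the new variable over the old one, the `plain` fallback, and the function name are assumptions.

```python
import os
import warnings
from typing import Mapping, Optional


def resolve_output_format(env: Optional[Mapping[str, str]] = None,
                          default: str = "plain") -> str:
    """Prefer KREUZBERG_OUTPUT_FORMAT; fall back to the deprecated variable."""
    env = os.environ if env is None else env
    current = env.get("KREUZBERG_OUTPUT_FORMAT")
    if current:
        return current
    legacy = env.get("KREUZBERG_CONTENT_FORMAT")
    if legacy:
        warnings.warn(
            "KREUZBERG_CONTENT_FORMAT is deprecated; use KREUZBERG_OUTPUT_FORMAT",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy
    return default
```

Passing a plain dictionary as `env` makes the function easy to test without modifying the real process environment.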

Migration

See the Migration Guide for complete upgrade instructions.


Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#420---2026-01-26

6 days ago
kreuzberg

v4.1.2

Added

Language Bindings

  • Ruby: Ruby 4.0 support
    • Updated gemspec to support Ruby 3.2.0 through 4.x
    • Tested with Ruby 4.0.1: all tests pass with Magnus bindings
    • No breaking changes required in binding code

Fixed

Language Bindings

  • Ruby: Fixed gem native extension build failure
    • Vendor script now correctly updates native Cargo.toml paths to use vendored crates
    • Fixed sed pattern matching (5 parent directories, not 6)
  • Go: Fixed Windows timeout in Go tests
    • Removed init() function in helpers_test.go that caused FFI mutex deadlock on Windows
    • Now uses lazy initialization via sync.Once pattern

Changed

Dependencies

  • Updated dependencies across the project