Release v4.2.6
- Added `output_format`, `result_format`, `elements`, and `djot_content` fields to `ExtractionResult`
- Created proper `PyChunk` pyclass with attribute access (`chunk.content`) instead of raw dicts
- Unified output format: merged `consolidated.json` + `aggregated.json` into single `results.json` (schema v2.0.0)
- Added F1 token-based quality scoring with ground truth support
- Added OCR coverage for docling, unstructured, tika, mineru
- Naming normalization: strip `-sync`/`-async` suffixes
- Safety: eliminated unsafe `set_var`, added NaN sanitization, bounds checks, input validation
- Fixed zero-duration throughput inflation in batch results
See CHANGELOG.md for full details.
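As a rough illustration of the F1 token-based quality scoring mentioned in this release, the sketch below compares extracted text against a ground-truth reference. It assumes lowercase whitespace tokenization and multiset overlap; the function name `token_f1` is hypothetical and the project's actual scorer may tokenize or weight differently.

```python
from collections import Counter


def token_f1(extracted: str, ground_truth: str) -> float:
    """Token-level F1 between extracted text and a ground-truth reference.

    Assumption: tokens are lowercase whitespace-separated words, and overlap
    is counted as a multiset intersection. This is a sketch, not the
    project's implementation.
    """
    pred = Counter(extracted.lower().split())
    gold = Counter(ground_truth.lower().split())
    if not pred and not gold:
        return 1.0  # both empty: treat as a perfect match
    overlap = sum((pred & gold).values())  # tokens present in both, with counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

Precision penalizes tokens the extractor invented, recall penalizes tokens it dropped, and F1 balances the two, which is why a ground-truth file is needed for this kind of score.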
v4.2.5
- Missing `OutputFormat`/`ResultFormat` exports with Python 3.10 compatibility
- `.pyi` stub alignment with Rust core (missing `elements`, `Element`, `BoundingBox`, `PageHierarchy` types)
- Config alignment with Rust core (field names, defaults, removed phantom parameters)
- Serialization test fixes and PHPStan compliance
- Missing `elements` field in NAPI-RS bindings
- `outputFormat` and `resultFormat` now correctly passed through config normalizer
- Serialization test import path fix
- `asMap()` null handling fix: absent config sections no longer incorrectly deserialized as default objects
- Enum serialization and test exception alignment
- Windows CI fix and e2e-generator support
- Bun runtime support
- Benchmark CI artifact size reduced from ~1.5GB to essential files only
- PageContent field parity across all language bindings
v4.2.4
- Added `Element`, `ElementType`, `BoundingBox`, and `ElementMetadata` types to the TypeScript API surface
- Fixed `batchExtractFile` → `batchExtractFiles` export name
- Added `#[serde(default)]` to `KeywordConfig` fields for partial config deserialization
- Added `JsonSerializable` attributes for Element types (element_based result format)
- Removed deprecated `Success` field references
- Derived `Jason.Encoder` for `ExtractionConfig` struct
See CHANGELOG.md for full details.
v4.2.3
See CHANGELOG.md for details.
- API parity: Added `ExtractionConfig.new/0` and `new/1` constructors for consistent struct creation
- Chunk field alignment: Changed `text` field to `content` for API parity with Rust core
- Error type fix: File-not-found errors now throw `KreuzbergIOException` instead of `KreuzbergValidationException`
- Aligns with Rust error handling where file access issues are I/O errors
- Test alignment: Removed references to deprecated `WithEmbedding()` API and `Chunking.Embedding` field
- Test fixes: Updated config_comprehensive_test, config_result_test, embeddings_test, memory_safety_test
- ExtractionConfig expansion: Added `embedding()` and `imagePreprocessing()` builder methods
- Default value alignment: Fixed test assertions to expect `enableQualityProcessing=true` (matches Rust default)
- Rubocop compliance: Fixed `Style/EmptyClassDefinition` offenses in api_proxy.rb, cli_proxy.rb, mcp_proxy.rb
v4.2.2
- Strict API parity enforcement: All 9 language bindings now have exact 1:1 field parity with Rust core
- Verification script (`scripts/verify_api_parity.py`) now runs in STRICT mode, failing on ANY field differences
- Added to `ci-validate.yaml` workflow to prevent future API drift
- ExtractionConfig alignment: Removed 5 non-canonical fields and fixed defaults
- Removed: `embedding`, `extractImages`, `extractTables`, `preserveFormatting`, `outputEncoding`
- Fixed defaults: `useCache` → true, `enableQualityProcessing` → true, `maxConcurrentExtractions` → null
- Updated `ExtractionConfigBuilder` to match canonical API
- All 16 fields now match Rust canonical source exactly
- ExtractionResult alignment: Removed `Success` field (not in Rust canonical)
- PageInfo alignment: Removed `Visible` and `ContentType` fields (not in Rust canonical)
- Updated 14 test files to remove references to removed fields
- Default value fix: Changed `enable_quality_processing` default from `false` to `true` to match Rust
- Default value fix: Changed `enableQualityProcessing` default from `false` to `true` to match Rust
- Type exports cleanup: Removed non-existent type exports (`EmbeddingConfig`, `EmbeddingModelType`, `HierarchyConfig`, `ImagePreprocessingConfig`) from index.ts
- Hex package compilation: Fixed `force_build: true` causing production installs to fail (#333)
- Changed to `force_build: Mix.env() in [:test, :dev]` to only build from source in development
- Production installs now correctly use precompiled NIF binaries from GitHub releases
- Tesseract OCR plugin initialization: Fixed "OCR backend 'tesseract' not registered" error in published Docker images
- Embeddings plugin initialization: Fixed "Failed to initialize embedding model" error
- JSON error responses: Added custom `JsonApi` extractor for consistent JSON error responses
- OpenAPI schema improvements: Enhanced schema validation constraints
- Chunking validation: Added validation that `overlap` must be less than `max_characters`
- Embed validation: Added validation that all text entries must be non-empty strings
- Default embedding model: `EmbeddingConfig.model` now defaults to "balanced" preset
- Schemathesis API contract testing: Added schemathesis to Docker CI workflow
- XLSX OOM with Excel Solver files: Fixed out-of-memory issue when processing XLSX files with sparse data at extreme cell positions (#331)
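The chunking and embed validation rules listed in this release can be sketched as follows. This is a minimal illustration, not the server's code: the function names `validate_chunking` and `validate_embed_texts` and the error messages are hypothetical, and only the two constraints stated in the notes are checked.

```python
from typing import Sequence


def validate_chunking(max_characters: int, overlap: int) -> None:
    """Reject configs where overlap is not strictly smaller than the chunk size.

    With a sliding window, each chunk advances by (max_characters - overlap)
    characters, so overlap >= max_characters would leave no forward progress.
    """
    if overlap >= max_characters:
        raise ValueError(
            f"overlap ({overlap}) must be less than max_characters ({max_characters})"
        )


def validate_embed_texts(texts: Sequence[object]) -> None:
    """Require every entry to be a non-empty string."""
    for index, text in enumerate(texts):
        if not isinstance(text, str) or text == "":
            raise ValueError(f"texts[{index}] must be a non-empty string")
```

Checking these constraints at the API boundary lets the server return a clear validation error instead of failing later inside the chunker or the embedding model.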
v4.2.1
Patch Release: API Parity Fixes and CI Reliability Improvements
This patch release fixes API validation issues, adds missing format aliases, and improves backward compatibility across all language bindings.
- PPTX image page numbers: Fixed reversed page numbers when extracting images from PPTX files (#329)
- Images on slide 1 were incorrectly reported with `page_number=2` due to unsorted slide paths from presentation.xml.rels
- Now sorts slide paths after parsing to ensure correct ordering regardless of XML element order
- Plugin registry error logging: Added comprehensive error logging for silent plugin failures (#328)
- OCR registry now logs errors and warnings when plugins fail to initialize
- Extractor registry logs plugin load failures for troubleshooting
- PostProcessor registry tracks plugin status changes
- Validator registry records plugin validation errors
- New `startup_validation.rs` module provides plugin status verification
- Server startup logs all active plugins and their initialization status (fixes Kubernetes deployment visibility)
- Output format validation: Extended `VALID_OUTPUT_FORMATS` to include all valid aliases (`plain`, `text`, `markdown`, `md`, `djot`, `html`)
- Error type consistency: `validate_file_exists()` now returns `Io` error instead of `Validation` error for file-not-found cases
- C# pre-commit hooks: Added dotnet restore to format/lint check tasks to fix failures in clean environments
- Format constants: Added `OutputFormatText` and `OutputFormatMd` as aliases for `plain` and `markdown`
- Documentation: Fixed default format comment (default is `plain`, not `markdown`)
- Format validation: Added `text` and `md` aliases to `validate_output_format` function
- Config validation: Updated error messages to list all valid format options
- CLI backward compatibility: `extract` and `detect` methods now accept both positional and keyword arguments
- Config field naming: Renamed `image_extraction` to `images` (canonical name) with backward-compatible alias
- Spec fixes: Updated test expectations to match actual implementation behavior
- Config field naming: Renamed fields to canonical names (`images`, `pages`, `pdfOptions`, `postprocessor`, `tokenReduction`)
- API parity: Added missing `postprocessor` and `tokenReduction` fields
- API parity: Added `getImages()` and `images()` builder methods as aliases for `getImageExtraction()`
- TypeScript types: Added `outputFormat`, `resultFormat`, and `htmlOptions` to `ExtractionConfig` interface
- Case sensitivity: Fixed tests to use lowercase format strings (`plain`, `unified`, `element_based`)
- API usage: Updated to use module-level functions (`config_to_json`, `config_merge`) instead of instance methods
- Go test app: Fixed build by adding `-tags kreuzberg_dev` flag for FFI linking
- Go tests: Fixed flawed pointer test that made incorrect assumptions about Go's memory model
- Parity script: Improved `scripts/verify_api_parity.py` to correctly parse all language bindings
- TypeScript: Better handling of multi-line interfaces with JSDoc
- Python: Correct parsing of `.pyi` stub files
- Java: Extract field names from `toMap()` serialization
- C#: Extract `JsonPropertyName` attributes for canonical names
- WASM: Dedicated extractor for TypeScript type definitions
- Kubernetes deployment guide: New comprehensive guide for deploying Kreuzberg in Kubernetes (#328)
- Complete K8s architecture overview with StatefulSet, Service, and ConfigMap examples
- Health check configuration for plugin readiness and liveness probes
- Logging aggregation best practices for plugin status visibility
- Troubleshooting section for silent plugin failures in containerized environments
- Updated Docker guide with K8s deployment references
- Location: `docs/guides/kubernetes.md`
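The output format aliases described in this release (`text` for `plain`, `md` for `markdown`) amount to a small normalization step before validation. The sketch below shows one way this could look. The function name `normalize_output_format` and the error wording are hypothetical, and case-sensitive matching is an assumption based on the note that tests use lowercase format strings.

```python
# Alias table taken from the release notes: `text` -> `plain`, `md` -> `markdown`.
_ALIASES = {"text": "plain", "md": "markdown"}
_CANONICAL = ("plain", "markdown", "djot", "html")


def normalize_output_format(value: str) -> str:
    """Map an accepted alias to its canonical name, or raise on unknown input.

    Assumption: matching is case-sensitive and lowercase-only.
    """
    canonical = _ALIASES.get(value, value)
    if canonical not in _CANONICAL:
        # List every accepted spelling so the caller can fix the config.
        valid = ", ".join(sorted(set(_CANONICAL) | set(_ALIASES)))
        raise ValueError(f"invalid output format {value!r}; valid options: {valid}")
    return canonical
```

Resolving aliases once, at the config boundary, means the rest of the pipeline only ever sees the four canonical names.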
v4.2.0: Complete API Consistency Across All 10 Language Bindings
Major Release: Complete API Consistency Across All 10 Language Bindings
This release achieves 100% API parity across Rust, Python, TypeScript, Ruby, Java, Go, PHP, C#, Elixir, and WebAssembly bindings. Every ExtractionConfig field is now available in all languages with consistent naming conventions and type safety.
📊 Release Stats:
- 70 files changed
- 11,839 lines added
- 300+ new tests across all bindings
- 4,500+ total tests passing
- Complete backward compatibility for all SDK APIs
- Full `config` parameter support: All `ExtractionConfig` options now available via MCP tools
- Enables complete configuration pass-through from AI agents to Rust core
- Standardizes parameter handling across all MCP tools
- `--output-format` flag: Canonical replacement for `--content-format`
- `--result-format` flag: Controls result structure (unified, element_based)
- `--config-json` flag: Inline JSON configuration
- `--config-json-base64` flag: Base64-encoded configuration support
- All 10 bindings now expose identical API surface
- Added missing fields to PHP, Go, and Java bindings
- Fixed Ruby batch chunking operations
- Complete type safety across all languages
- 300+ new tests across all bindings
- API consistency validator with CI integration
- Comprehensive migration guide (757 lines)
- Deprecation guide with code examples
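For the `--config-json` and `--config-json-base64` flags, the value has to be produced somewhere, typically by a script that wraps the CLI. The sketch below builds both forms from a Python dict. The helper names are hypothetical, the config keys are illustrative only, and standard (not URL-safe) Base64 is an assumption; check the CLI documentation for the exact keys and encoding it expects.

```python
import base64
import json
from typing import Any, Dict


def encode_config(config: Dict[str, Any]) -> str:
    """Serialize a config dict to compact JSON, then to standard Base64 text."""
    compact = json.dumps(config, separators=(",", ":"))
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")


def decode_config(encoded: str) -> Dict[str, Any]:
    """Inverse of encode_config, useful for checking what will be sent."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


# Illustrative keys only; the real ExtractionConfig field names may differ.
example = {"output_format": "markdown", "use_cache": True}
inline_value = json.dumps(example)       # candidate value for --config-json
encoded_value = encode_config(example)   # candidate value for --config-json-base64
```

The Base64 form avoids shell quoting problems with braces and double quotes, which is the usual reason to offer it alongside inline JSON.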
MCP Interface Only (AI agents only, no end-user impact):
- `enable_ocr` and `force_ocr` moved under `config` object
- Old parameter names still work with deprecation warnings
- CLI: `--content-format` → use `--output-format` instead
- Environment: `KREUZBERG_CONTENT_FORMAT` → use `KREUZBERG_OUTPUT_FORMAT`
All deprecated features remain functional in v4.2.0 and will be removed in v5.0.0.
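The deprecation pattern described here, where an old name keeps working but emits a warning, can be sketched for the environment variable rename. This is an illustration only: the function name `resolve_output_format` is hypothetical, and giving the new variable precedence when both are set is an assumption, not documented behavior.

```python
import os
import warnings
from typing import Mapping, Optional

NEW_VAR = "KREUZBERG_OUTPUT_FORMAT"
OLD_VAR = "KREUZBERG_CONTENT_FORMAT"


def resolve_output_format(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Read the output format, preferring the new variable over the old one."""
    if NEW_VAR in env:
        return env[NEW_VAR]
    if OLD_VAR in env:
        # The old name still works, but the caller is told to migrate.
        warnings.warn(
            f"{OLD_VAR} is deprecated; use {NEW_VAR} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[OLD_VAR]
    return None
```

Passing the environment as a mapping keeps the function easy to test without mutating the real process environment.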
See the Migration Guide for complete upgrade instructions.
Full Changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md#420---2026-01-26
v4.1.2
- Ruby: Ruby 4.0 support
- Updated gemspec to support Ruby 3.2.0 through 4.x
- Tested with Ruby 4.0.1: all tests pass with Magnus bindings
- No breaking changes required in binding code
- Ruby: Fixed gem native extension build failure
- Vendor script now correctly updates native Cargo.toml paths to use vendored crates
- Fixed sed pattern matching (5 parent directories, not 6)
- Go: Fixed Windows timeout in Go tests
- Removed `init()` function in helpers_test.go that caused FFI mutex deadlock on Windows
- Now uses lazy initialization via `sync.Once` pattern
- Updated dependencies across the project