v5.0.0-rc.14
- Publish: unblock minimumReleaseAge supply-chain gate. Set `minimumReleaseAge: 0` in `pnpm-workspace.yaml` so first-party `@kreuzberg/*` platform packages are no longer rejected by pnpm's default 24h supply-chain delay during the publish workflow's Build Node bindings stage.
- Publish: version drift in root manifests. Bumped `package.json` (root) and `crates/kreuzberg-py/src/pyproject.toml` from rc.12 to rc.14; previously missed by `alef sync-versions`.
- Mobile/ARM build: dart binding crate now compiles when upstream variants are cfg-gated. Bumps alef pin to v0.25.8, which fixes both E0004 (non-exhaustive From-impl matches) and E0599 (mirror enum cfg-gated variants vs frb-generated unconditional refs).

See CHANGELOG for details.
v5.0.0-rc.13
- `ImageOutputFormat::Svg` variant (wire tag `"svg"`). Gated by new `svg` Cargo feature; included in `no-ort-target`, `formats`, `full`. Ships in WASM + Android.
- `SvgOptions { sanitize: bool, render_dpi: f32 }`. Defaults: `sanitize = true`, `render_dpi = 96.0` (clamped 1.0–600.0).
- SVG → PNG / JPEG / WebP / HEIF rasterization via resvg + usvg + tiny-skia.
- SVG → SVG sanitize on `Native` target strips `<script>`, external `xlink:href` / `href`, `<foreignObject>`, and JS event handlers.
- Raster → SVG returns new `EncodeWarning::UnsupportedDirection`. No auto-vectorization.
- Security caps: input ≤ 10 MB, render output ≤ 16384² pixels (~1 GB peak), `render_dpi` clamped. usvg `image_href_resolver` no-op blocks SSRF / filesystem reads.
- Sync pipeline path now also applies the image output format pass (closes WASM bypass).
- `EncoderUnavailable` gated on `heic` (silences `-D warnings` dead-code on 9+ publish builds).
- Three `Dockerfile.musl-*` builds: `cargo build --locked` → `--offline` so the sed-trimmed manifest reconciles with the lock (unblocks 4 musl native builds).
- ci-e2e: prepend `/usr/local/lib` to `LD_LIBRARY_PATH` so the source-built libheif 1.23 takes precedence over apt's older version on the Zig job.
- Elixir + Ruby NIF Cargo.toml: pin `alloc-stdlib = "=0.2.2"` (brotli 8 trait drift fix).
- 0.25.3: R extendr struct field escaping (`pub r#type` for serde-tagged enums).
- 0.25.4: Dart per-target feature `#[cfg(feature = "X")]` guards on enum variants (mobile Android/iOS cargo check now sees the correct `ImageOutputFormat` variant set).
- 0.25.5: trait-bridge adapter regression coverage across all 14 language bindings; before-hooks documentation and test.
- `kreuzberg-libheif` bootstrapped on crates.io.
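The `SvgOptions` defaults and DPI clamp listed above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the `effective_dpi` helper, the constant names, and the NaN fallback are assumptions made for the example; only the field names, the defaults, and the 1.0–600.0 range come from the notes.

```rust
/// Illustrative mirror of the `SvgOptions` struct named in the notes.
#[derive(Debug, Clone, PartialEq)]
pub struct SvgOptions {
    pub sanitize: bool,
    pub render_dpi: f32,
}

/// Clamp bounds taken from the release notes (constant names are assumed).
pub const MIN_RENDER_DPI: f32 = 1.0;
pub const MAX_RENDER_DPI: f32 = 600.0;
pub const DEFAULT_RENDER_DPI: f32 = 96.0;

impl Default for SvgOptions {
    fn default() -> Self {
        Self {
            sanitize: true,
            render_dpi: DEFAULT_RENDER_DPI,
        }
    }
}

impl SvgOptions {
    /// Hypothetical helper: the DPI a renderer would use after clamping.
    /// Falling back to the default for NaN is an assumption of this sketch.
    pub fn effective_dpi(&self) -> f32 {
        if self.render_dpi.is_nan() {
            DEFAULT_RENDER_DPI
        } else {
            self.render_dpi.clamp(MIN_RENDER_DPI, MAX_RENDER_DPI)
        }
    }
}
```

Under these assumptions, a caller who requests `render_dpi: 5000.0` would render at 600 DPI, which bounds raster output size together with the pixel cap above.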
v5.0.0-rc.12
Release candidate 12 — full retry @ 33a76190cb. Adds image-output normalization (ImageOutputFormat::{Native, Png, Jpeg, Webp, Heif}) with regen pipeline pass, bumps tslp to 1.9.0-rc.44 (parsers.json present), alef 0.25.2 with swift accessor + assertion fixes, brotli/alloc-stdlib unpin via cargo-stdlib 0.2.2, R Linux $ORIGIN rpath, libnuma-dev for ORT aarch64, and --locked on every cargo invocation. Three lockfiles previously gitignored are now tracked. See CHANGELOG.md for full details.
v5.0.0-rc.12
- New `windows-target` aggregate feature in `crates/kreuzberg/Cargo.toml`. Mirrors the curated FFI-on-Windows list the publish workflow already used and drops `heic` along with the ORT-dependent capabilities (paddle-ocr, layout-detection, embeddings, reranker, ner-llm). The FFI and Dart Rust crates pick it up via per-crate `[[crates.<x>.target_dep_overrides]] cfg = 'target_os = "windows"'` blocks in `alef.toml`. The pyo3 / napi-rs / magnus / ext-php-rs / rustler / swift-bridge / jni scaffolders do not yet honor `target_dep_overrides`, so the python/node/ruby/php/elixir/swift/kotlin_android Windows wheels still need an alef-side scaffolder fix to drop `heic` on Windows. The `alef.toml` entries for those crates are no-ops today but document the intent; they activate once the upstream scaffolders pick up the override block.
- CI / publish: source-build libheif 1.23.0 to satisfy `libheif-sys 5.3 >= 1.21`. Ubuntu Noble's apt ships `libheif 1.17.6` and Alpine 3.21 ships `libheif 1.19.5`; both are rejected by `libheif-sys` with `Package 'libheif' has version '1.17.6', required version is '>= 1.21'`. Without a fix, every Linux build job (Build FFI matrix, CLI binaries, Go FFI, C# natives, Zig package, Python wheels, C FFI distribution, Java natives, Kotlin Android, Dart, Swift, manylinux_2_28 wheels) failed at the `libheif-sys` build script. Three coordinated fixes: `scripts/ci/install-system-deps/install-linux.sh` drops apt `libheif-dev`, installs `libde265-dev libaom-dev libx265-dev libdav1d-dev`, then downloads + builds + installs libheif 1.23.0 via cmake into `/usr/local`, exporting `PKG_CONFIG_PATH` + `LD_LIBRARY_PATH` via `GITHUB_ENV`. Cached via a new `cache-libheif-linux` step in `install-system-deps/action.yml`. `docker/Dockerfile.musl-{build,ffi,rustler}` now pull `libheif-dev` and codec headers from `alpine/edge/community` + `edge/main` (libheif 1.23.0) instead of Alpine 3.21 main. `kreuzberg-dev/actions/build-python-wheels@v1.8.64` source-builds libheif inside `manylinux_2_28` via `CIBW_BEFORE_ALL_LINUX` (AlmaLinux 8 + EPEL ships codec subsets only; we install what's available and let libheif compile without missing codec features). `python-wheels` job in `publish.yaml` now runs `install-system-deps` before `build-python-wheels` so Windows wheels pick up vcpkg-built libheif.
- Publish workflow: fixed dead-code warnings as errors on Windows / `reranker-presets`-only builds. `rerank_via_llm` and `extract_rerank_usage` in `crates/kreuzberg/src/llm/rerank.rs` were gated by `any(feature = "reranker-presets", feature = "reranker")`, but every caller was gated by `feature = "reranker"`. On the Windows feature combo (which uses `reranker-presets` + `liter-llm` without `reranker`), the functions compiled but had no callers, so clippy `-D warnings` failed every Windows binding build (C#, Go, Java, CLI, Node, C FFI). Tightened the cfg to `feature = "reranker"` on both functions, the related `use` statements, and the test module so the functions track their callers exactly. Also removed the orphaned `ContentLayer::is_default` helper (made dead by the rc.11 removal of `#[serde(skip_serializing_if = ...)]` on `content_layer`).
- Publish workflow: install libheif on Linux + macOS runners. `libheif-sys` is a transitive dependency of `kreuzberg-libheif` (gated behind the `heic` feature, included in `full`), and the publish workflow's per-language build jobs (Swift / Zig / C-FFI / C# / Go / Java / Kotlin-Android / CLI / Node / Dart / Python wheels) invoked `kreuzberg-dev/actions/build-*` without first installing system dependencies. `libheif-sys`'s pkg-config probe then failed with `Package libheif was not found in the pkg-config search path`. Added `libheif-dev` to `scripts/ci/install-system-deps/install-linux.sh`, `brew install libheif` to `install-macos.sh`, and injected a `./.github/actions/install-system-deps` step before every `build-*` action invocation in `.github/workflows/publish.yaml` (17 sites + the `publish-crates` job, since `kreuzberg-libheif` is also published from that job).
- Publish workflow: publish `kreuzberg-libheif` to crates.io. The new path-only `kreuzberg-libheif` crate is a build dependency of `kreuzberg` via the `heic` feature. Without publishing it first, `cargo publish -p kreuzberg` aborts with `no matching package named 'kreuzberg-libheif' found (location searched: crates.io index)`. Added it as the first entry in the `publish-crates` invocation's `crates:` list so it lands on crates.io before `kreuzberg`.
- Android (any arch) + iOS: widened libheif/ORT exclusion gate. The C FFI crate's `target_dep_overrides` previously narrowed the `android-target` feature set to `cfg(all(target_os = "android", target_arch = "x86_64"))`, on the assumption that `aarch64-linux-android` could use the full ORT-enabled set via pyke prebuilts. In practice `libheif-sys` still blocked the cross-compile on the NDK for both architectures, and iOS hit the same wall. Widened the cfg to `target_os = "android"` (covers both arches) and added a parallel `target_os = "ios"` override that also routes to `android-target`. Same treatment applied to the Dart Rust crate (`packages/dart/rust/Cargo.toml`).
- Docs strict build: broken anchor in `docs/concepts/reranking.md:10`. The opening "Bi-encoders vs cross-encoders" paragraph linked `[Embeddings](architecture.md#embeddings)`, but `architecture.md` has no `## Embeddings` heading, so `task docs:build:strict` aborted with exit code 1. Reworded the parenthetical to drop the link target.
- musl Docker builds: include `kreuzberg-libheif` crate + install `libheif-dev`. The three `docker/Dockerfile.musl-{ffi,rustler,build}` images each copy a subset of workspace crates into the build context; the new `kreuzberg-libheif` crate (required transitively when `kreuzberg` is built with the `full` features that include `heic`) was missing, so cargo aborted with `failed to read /build/crates/kreuzberg-libheif/Cargo.toml`. Added the COPY entry to each Dockerfile and added `libheif-dev` to the `apk add` list so libheif-sys's pkg-config probe finds Alpine's package.
- CI Lint: install system dependencies before running clippy. `ci-lint.yaml` set up the Rust toolchain but never called `./.github/actions/install-system-deps`, so clippy on the Linux runner failed to find `libheif.pc`. Added the install-system-deps step right after `setup-rust`.
- Windows binding wheels: install libheif via vcpkg. `libheif-sys`'s build.rs uses `vcpkg::Config::new().find_package("libheif")` on Windows MSVC. The `windows-latest` runner image ships vcpkg pre-installed at `C:\vcpkg` but the libheif port is not. Added a vcpkg install step (`vcpkg install libheif:x64-windows-static-md`) to `scripts/ci/install-system-deps/install-windows.ps1`, exposed `VCPKG_ROOT` to the build environment, and added an `actions/cache` entry on `C:\vcpkg\installed\x64-windows-static-md` so subsequent runs hit the cache (~10s) instead of the cold ~20-min vcpkg build. Unblocks Python / Node / Ruby / PHP / Elixir / Swift / JNI Windows wheels, which still use the `full` feature set because their alef scaffolders do not yet honor `target_dep_overrides`.
- Elixir cargo `cc` version conflict. The native Rustler NIF lockfile at `packages/elixir/native/kreuzberg_nif/Cargo.lock` pinned `cc 1.2.63`, while `kreuzberg-tesseract` requires `cc ^1.2.64`. Cargo failed to resolve. Ran `cargo update -p cc` in that sub-project to lift it to `1.2.64`; this also picked up `html-to-markdown-rs 3.6.2` and `pdf_oxide 0.3.64` in both lockfiles to keep them in sync.
- Rust e2e generator: enum variant accessors emit method-call syntax. The fixture path `metadata.format.excel.sheet_count` resolved against `FormatMetadata` (a tagged enum with a `pub fn excel(&self) -> Option<&ExcelMetadata>` convenience accessor). The Rust e2e renderer was emitting `result.metadata.format.as_ref().unwrap().excel.as_ref().unwrap().sheet_count`, which is a compile error because `FormatMetadata` has no `excel` field. Added `"metadata.format.excel"` to `fields_method_calls` in `alef.toml` so the Rust renderer (and the Zig one) appends `()` after the segment instead of dot-accessing it as a field.
- Rust e2e generator: `not_empty` on `String` leaves the leaf concrete. Fixed in alef v0.25.0 (`src/e2e/codegen/rust/assertion_helpers.rs`). When a path like `summary.text` crosses an `Option<Summary>` parent on the way down, the resolver registers the path as optional; the renderer's `is_opt` branch previously emitted `accessor.is_some()`, but the accessor already auto-unwrapped the parent (`...summary.as_ref().unwrap().text`), so the final expression has type `String` and `.is_some()` is a compile error. The renderer now checks for the trailing `.as_ref().unwrap().` marker and emits `!accessor.is_empty()` for the concrete-leaf case while preserving `.is_some()` for true `Option<T>` leaves.
- Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (`src/e2e/field_access/resolver.rs::rust_unwrap_binding`). The pre-pass that lifts string-equals assertions into local `let _name = result.<path>.as_ref().map(...).unwrap_or_default();` bindings called the basic `render_accessor` instead of `render_rust_with_optionals`, so a path like `summary.strategy` emitted `result.summary.strategy`, a compile error because `Option<Summary>` has no `strategy` field. Switched the binding to `render_rust_with_optionals` so intermediate optional segments produce `.as_ref().unwrap()`.
- Removed `summary.text` and `summary.strategy` from `fields_optional`. They are not optional on `DocumentSummary` (`pub text: String`, `pub strategy: SummaryStrategy`); only the parent `summary: Option<DocumentSummary>` is. Listing the leaves as optional caused the e2e renderer to emit `.as_ref()` and `.as_deref()` against concrete types. The parent stays in the set.
- `SummaryStrategy` now implements `Display` matching the snake-case serde wire form (`extractive` / `abstractive`). The Rust e2e renderer's string-equality let-binding pre-pass needs `to_string()` on the enum value to compare against the fixture's literal string. Added a minimal `Display` impl.
- Dead-code `is_heif_container` and `extract_exif_data` under reranker-only builds. The `Live HF preset tests` CI job builds `kreuzberg` with `--features "reranker,reranker-presets,tokio-runtime"`. The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (in `extraction::image` and `extractors::image`) is gated out, so clippy `-D warnings` then surfaced both functions as `dead_code`. Added `#[allow(dead_code)]` to both definitions with comments documenting the unconditional-compile intent.
- `text::classification::classify_text` stub added. The real implementation lives behind the `classification` feature; alef-generated bindings call `kreuzberg::text::classification::classify_text` unconditionally, so the existing stub module needed a matching no-op for the path that returns `Err(KreuzbergError::Other("classification feature not available on this target"))`. Required for the iOS / Android `android-target` builds, which drop the classification feature.
- `LlmBackend` and `GlineBackend` stubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 / `android+x86_64` triple cfg. With the binding crates now widening their target gates to `target_os = "android"` and `target_os = "ios"` (both arches), `aarch64-apple-ios` and `aarch64-linux-android` were failing to compile against `kreuzberg::LlmBackend` / `kreuzberg::GlineBackend`. Simplified the gate to `#[cfg(not(feature = "ner-llm"))]` and `#[cfg(not(feature = "ner-onnx"))]` respectively, so any config that drops the feature now gets the stub regardless of target.
- Swift Rust crate now honours `target_dep_overrides`. The Swift cargo emitter (`alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml`) called `crate::scaffold::render_core_dep` directly, ignoring the `[[crates.swift.target_dep_overrides]]` block. Extended `SwiftConfig` with a `target_dep_overrides: Vec<SwiftTargetDepOverride>` field (mirrors `DartTargetDepOverride`) and refactored `emit_cargo_toml` to emit `[target.'cfg(not(any(...)))'.dependencies]` + per-override `[target.'cfg(...)'.dependencies]` blocks when overrides are present, matching the FFI and Dart patterns. With `alef.toml` now declaring iOS, Android, and Windows overrides for `[crates.swift]`, `packages/swift/rust/Cargo.toml` correctly routes iOS to `android-target`, Android to `android-target`, Windows to `windows-target`, and the default (macOS host) to the full feature set. The same `target_dep_overrides` config gap exists for the python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
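The `SummaryStrategy` and `fields_optional` entries above both come down to one shape: the parent `summary` is optional while its leaves are concrete. A minimal sketch of that shape follows, assuming the two variants named in the notes; the `ExtractionResult` wrapper and the two helper functions are hypothetical names invented for the example, not the generator's real output.

```rust
use std::fmt;

/// Variants assumed from the wire forms listed in the notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SummaryStrategy {
    Extractive,
    Abstractive,
}

impl fmt::Display for SummaryStrategy {
    /// Writes the snake-case wire form so `to_string()` can be compared
    /// against a fixture's literal string.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            SummaryStrategy::Extractive => "extractive",
            SummaryStrategy::Abstractive => "abstractive",
        })
    }
}

/// Leaves are concrete; only the parent is optional.
pub struct DocumentSummary {
    pub text: String,
    pub strategy: SummaryStrategy,
}

/// Hypothetical result wrapper used only for this sketch.
pub struct ExtractionResult {
    pub summary: Option<DocumentSummary>,
}

/// String-equality binding: go through the optional parent, then call
/// `to_string()` on the concrete leaf.
pub fn strategy_label(result: &ExtractionResult) -> String {
    result
        .summary
        .as_ref()
        .map(|s| s.strategy.to_string())
        .unwrap_or_default()
}

/// `not_empty` on a concrete `String` leaf: `is_empty()`, not `is_some()`.
pub fn summary_text_not_empty(result: &ExtractionResult) -> bool {
    result
        .summary
        .as_ref()
        .map(|s| !s.text.is_empty())
        .unwrap_or(false)
}
```

Calling `.is_some()` on `s.text` here would not compile, which is the class of error the renderer fixes address.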
- `list_supported_formats()` is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (`kreuzberg formats`), REST API (`GET /formats`), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091)
- [v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level `rerank` / `rerank_async` API, `RerankerConfig` with Preset/Custom/Llm/Plugin variants, `RerankerBackend` plugin trait + registry, `POST /rerank` HTTP endpoint, and per-language bindings via alef. Gated behind the new `reranker` + `reranker-presets` Cargo features; reranker-presets is WASM/Android-safe.
- [v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries: `bge-reranker-base` (BAAI/bge-reranker-base, EN+ZH), `bge-reranker-v2-m3` (rozgo/bge-reranker-v2-m3 with the required `model.onnx.data` sibling, multilingual), `jina-reranker-v1-turbo-en`, and `jina-reranker-v2-base-multilingual`. Friendly aliases `fast` / `balanced` / `quality` / `multilingual` resolve to catalog entries, so existing call sites keep working.
- [v5.0.0] `RerankerPreset` + `RerankerModelType::Custom` gained `additional_files: Vec<String>`. Lets multi-blob ONNX exports (notably `rozgo/bge-reranker-v2-m3`, which splits weights into `model.onnx` + `model.onnx.data`) actually load.
- [v5.0.0] `RerankerModelType::Custom` gained `model_file: Option<String>`. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to `"onnx/model.onnx"` when omitted.
- [v5.0.0] CI `live-hf` job. Always-on reranker preset-path validation on every PR via a new `.github/actions/cache-hf-fastembed` composite action that caches `~/.cache/huggingface/hub` keyed on the catalog literal; any preset path change triggers a fresh download so we catch drift the moment it happens. Tests cover all four presets plus a `Preset` / `Custom`-equivalent crosscheck and `top_k` truncation.
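How `model_file` and `additional_files` might combine into the list of files a loader fetches can be sketched like this. It is an illustration under assumptions: `CustomModel`, `files_to_fetch`, the `usize` type for `max_length`, and the example paths are hypothetical; only the four field names and the `"onnx/model.onnx"` default come from the entries above.

```rust
/// Default model path named in the notes.
pub const DEFAULT_MODEL_FILE: &str = "onnx/model.onnx";

/// Hypothetical flattened stand-in for the `Custom` variant's fields.
#[derive(Debug, Clone, Default)]
pub struct CustomModel {
    pub model_id: String,
    pub max_length: usize,
    pub model_file: Option<String>,
    pub additional_files: Vec<String>,
}

impl CustomModel {
    /// Assumed helper: the primary model file (or the default when omitted),
    /// followed by any sibling blobs such as external weight data.
    pub fn files_to_fetch(&self) -> Vec<String> {
        let primary = self
            .model_file
            .clone()
            .unwrap_or_else(|| DEFAULT_MODEL_FILE.to_string());
        let mut files = vec![primary];
        files.extend(self.additional_files.iter().cloned());
        files
    }
}
```

Keeping the sibling files in a separate list means configs written before these fields existed still resolve to the single default file.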
- [v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (`Xenova/ms-marco-MiniLM-L-6-v2`, `Xenova/bge-reranker-base`, `Xenova/bge-reranker-large`, `BAAI/bge-reranker-v2-m3`) are removed. The hand-curated `model_file` paths were not verified against HF and would have 404'd at runtime. Three of four upstream paths were wrong. Callers using the friendly aliases (`fast` / `balanced` / `quality` / `multilingual`) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically need `ms-marco-MiniLM` or `bge-reranker-large` can pass them via `Custom { model_id, ... }` or a registered `Plugin` backend.
- [v5.0.0] BREAKING: `RerankerModelType::Custom` is no longer a two-field tuple. Exhaustive Rust matches on `Custom { model_id, max_length }` need to add `model_file` and `additional_files`. Serde defaults keep existing TOML / JSON configs valid without migration.
- rendering: fixed panic when a non-`Item` block element appears directly under a `List` node before any `ListItem`. The comrak AST builder now synthesises an implicit `Item` wrapper instead of falling back onto the bare `List`, which violated CommonMark's `List → Item`-only constraint and panicked in debug builds. (#1096)
- pdf: `result.pages[*].isBlank` now reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existing `PageContent` entries, `is_blank` was never recalculated; it retained the stale value from native text extraction, which is always `Some(true)` for pages with no text layer. All four write sites in the OCR page-assembly block now call `is_page_text_blank` after every content mutation. (#1095)
- reranker: `RerankError` migrated to `thiserror`. Matches the rest of the library and `rust-conventions`.
- reranker: `shutdown_all` now best-effort. Continues invoking `shutdown()` on every backend even after one fails, returns the first error, drops subsequent ones (logged at `warn`). Previously stopped on the first failure, leaving the registry in a half-shutdown state.
- reranker: synchronous `rerank()` returns a clear error instead of panicking on a current-thread Tokio runtime. `block_in_place` requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect `RuntimeFlavor::CurrentThread` and ask the caller to use `rerank_async()` or build a multi-thread runtime.
- reranker: stronger sigmoid coverage in `tests/api_rerank.rs`. Happy-path test now uses mixed-sign logits (`-2.0`, `3.0`, `0.5`) and asserts both the sigmoid output range `[0, 1]` and the sign-→-side property, so silently dropping the sigmoid would break the test.
- reranker: engine tests share the production `sigmoid_f32` instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
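The best-effort `shutdown_all` behaviour described above (visit every backend, return the first error, log the rest) can be sketched as below. The `Backend` trait, `DemoBackend`, and the `String` error type are simplified stand-ins invented for this example, and `eprintln!` substitutes for the `warn`-level logging mentioned in the notes.

```rust
/// Simplified stand-in for a reranker backend (hypothetical trait).
pub trait Backend {
    fn name(&self) -> &str;
    fn shutdown(&mut self) -> Result<(), String>;
}

/// Test double that records whether shutdown was attempted.
pub struct DemoBackend {
    pub name: String,
    pub fail: bool,
    pub attempted: bool,
}

impl DemoBackend {
    pub fn new(name: &str, fail: bool) -> Self {
        Self { name: name.to_string(), fail, attempted: false }
    }
}

impl Backend for DemoBackend {
    fn name(&self) -> &str {
        &self.name
    }

    fn shutdown(&mut self) -> Result<(), String> {
        self.attempted = true;
        if self.fail {
            Err(format!("{} failed", self.name))
        } else {
            Ok(())
        }
    }
}

/// Best-effort shutdown: never stops early, keeps the first error,
/// and only logs the ones that follow.
pub fn shutdown_all<B: Backend>(backends: &mut [B]) -> Result<(), String> {
    let mut first_error: Option<String> = None;
    for backend in backends.iter_mut() {
        if let Err(err) = backend.shutdown() {
            if first_error.is_none() {
                first_error = Some(err);
            } else {
                // Stand-in for warn-level logging of subsequent errors.
                eprintln!("warn: backend {} failed to shut down: {}", backend.name(), err);
            }
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}
```

With backends `a` (ok), `b` (fails), `c` (fails), `d` (ok), this sketch returns `b`'s error while still attempting `c` and `d`, so nothing is left half-shut-down.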
- pdf: table extraction failures are now visible at `warn` log level. `extract_tables_native` and `extract_tables_bordered` silently caught `pdf_oxide::extract_tables_with_config` errors at `tracing::debug!`, making per-page failures invisible at the default log level. Promoted to `tracing::warn!` to match the existing behaviour of the TATR and SLANeXT inference paths. The three `unwrap_or_default()` call sites in `extraction.rs` that silently swallowed function-level errors are also replaced with `unwrap_or_else(|e| { tracing::warn!(...); Vec::new() })` so that a `page_count()` failure is equally visible. (#1097)
- publish.yaml `trigger-pubdev` job: explicit `permissions: actions: write`. Since the `a8f8597e45` migration to the `kreuzberg-dev-publisher` App-token, the `gh workflow run publish-pubdev.yaml` step has 403'd with "Resource not accessible by integration" because the App's installation token didn't carry `actions: write`. Adding job-level `permissions: { actions: write, contents: read }` covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need `actions: write` configured on github.com.
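The move from `unwrap_or_default()` to a logging `unwrap_or_else` in the table-extraction entry can be sketched as follows. The function name and the `String` element and error types are placeholders chosen for the example, and `eprintln!` stands in for `tracing::warn!` so the sketch has no dependencies.

```rust
/// Hypothetical wrapper: return the extracted tables, or log the failure
/// and fall back to an empty list instead of swallowing the error silently.
pub fn tables_or_warn(result: Result<Vec<String>, String>, page: usize) -> Vec<String> {
    result.unwrap_or_else(|e| {
        // The notes use tracing::warn! here; eprintln! is a stand-in.
        eprintln!("warn: table extraction failed on page {page}: {e}");
        Vec::new()
    })
}
```

The return value is the same as with `unwrap_or_default()`; the difference is that the error is reported before it is discarded.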
- Root Taskfile now includes test-apps task namespace. Added `test-apps` to the root `Taskfile.yml` includes block with proper `{{.ROOT_DIR}}` path resolution. Smoke and comprehensive test tasks for all 11 languages are now accessible via `task test-apps:smoke:*`.
- swift e2e: removed erroneous `async = false` override on `extract_file` for swift. `Kreuzberg.extractFile(_:_:_:)` is async in the Swift binding. The override in `alef.toml` forced `is_async = false` for fixtures that explicitly set `"call": "extract_file"` (e.g. `api_batch_bytes_async`), generating non-async test methods that called the async binding without `await`, which are compile errors. Fixtures without an explicit `call` fall through `resolve_call_for_fixture` to the global default and got `is_async = true` correctly, which is why `testApiExtractFileAsync` compiled but `testApiBatchBytesAsync` and 4 siblings did not. Dropping the override aligns both code paths.
- r: fix macOS dylib rpath so ORT loads at R extension runtime. `packages/r/src/rust/build.rs` now adds the `-Wl,-rpath,@loader_path` linker flag on macOS, enabling the final R extension `.so` to locate transitively-linked dylibs like `libonnxruntime.dylib` at load time. Without this, R's `dyn.load` via `library.dynam2` failed with `undefined symbol: OrtGetApiBase` in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791.
- Publish Release WASM job now non-blocking via `continue-on-error`. The Build WASM package job consistently hits GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on `ubuntu-latest` (8GB RAM). Runner preemption ("runner shutdown signal") terminates the job before the timeout expires. Set `continue-on-error: true` so the Publish-Final job proceeds regardless of WASM build outcome. Deeper fix (runner upsizing or two-stage build) deferred to rc.11+. The WASM package still publishes if the npm cache was warm from a previous rc or from parallel `publish-wasm` job success.
- `FormatMetadata::Code` now serializes correctly. The `#[serde(skip)]` annotation on the `Code` variant caused `serde_json::to_string` (and every `*_to_json` FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation; the inner `CodeMetadataInner(ProcessResult)` wrapper already derives `Serialize` / `Deserialize` via the upstream `serde` feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction.
- Python sdist publish step now uses split-layout invocation. `.github/workflows/publish.yaml` passed `manifest-path: crates/kreuzberg-py/Cargo.toml` to `build-python-sdist@v1`. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs `maturin sdist`, but the kreuzberg layout keeps `pyproject.toml` in `packages/python/`, so maturin failed with `Failed to build source distribution, pyproject.toml not found`. PyPI publish was then skipped for rc.10. Dropped the `manifest-path` input so the action falls through to the default `package-dir: packages/python` split-layout fallback, which cd's into the package dir and lets maturin resolve `manifest-path` from pyproject.toml's `[tool.maturin]` section itself.
- Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking `kreuzberg_php.dll` / `kreuzberg_nif.dll` on `x86_64-pc-windows-msvc` failed with `LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease`. `libkreuzberg_tesseract.rlib` is built by cmake-rs, which defaults to `/MD`; `libesaxx_rs.rlib` (transitively pulled in by `gliner` → `tokenizers` → `esaxx-rs`) is built by cc-rs, which fell back to `/MT`. `alef.toml` `[crates.scaffold.cargo.env]` now sets `CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD"` and the matching `CXXFLAGS_*`, propagated to `.cargo/config.toml [env]`. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. The same fix unblocks the Elixir NIF Windows build (recurring failure since rc.7).
- captioning: captioning was a no-op for all image paths because `CaptioningProcessor` never received image bytes. Two root causes: (1) `ImageExtractor` built `extracted_image` on the OCR path but passed `None` to `build_image_internal_document`, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on `config.images.extract_images`, so setting only `config.captioning` left them with empty data. Fix: add `ExtractionConfig::needs_image_data()`, which is true when `images.extract_images` or `captioning` is set, and use it in every extractor image gate and in `needs_image_processing()`. Also emits a `ProcessingWarning` when captioning is configured but `result.images` is `None`. (#732)
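The `needs_image_data()` gate from the captioning entry can be sketched as a single predicate. The struct layouts here are assumptions made for the example (in particular, modelling `images` and `captioning` as `Option`s and leaving `CaptioningConfig` empty); only the method name and the rule "true when `images.extract_images` or `captioning` is set" come from the notes.

```rust
/// Assumed shape of the image settings.
#[derive(Debug, Default, Clone)]
pub struct ImageConfig {
    pub extract_images: bool,
}

/// Placeholder captioning settings; real fields are not shown in the notes.
#[derive(Debug, Default, Clone)]
pub struct CaptioningConfig;

/// Hypothetical, trimmed-down extraction config for this sketch.
#[derive(Debug, Default, Clone)]
pub struct ExtractionConfig {
    pub images: Option<ImageConfig>,
    pub captioning: Option<CaptioningConfig>,
}

impl ExtractionConfig {
    /// True when image bytes must be kept: either image extraction is
    /// enabled, or captioning is configured and needs the bytes as input.
    pub fn needs_image_data(&self) -> bool {
        let extract_images = self
            .images
            .as_ref()
            .map(|images| images.extract_images)
            .unwrap_or(false);
        extract_images || self.captioning.is_some()
    }
}
```

Routing every extractor's image gate through one predicate like this avoids the earlier situation where enabling captioning alone left extractors with no image bytes.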
v5.0.0-rc.12
- New
windows-targetaggregate feature incrates/kreuzberg/Cargo.toml. Mirrors the curated FFI-on-Windows list the publish workflow already used and dropsheicalong with the ORT-dependent capabilities (paddle-ocr, layout-detection, embeddings, reranker, ner-llm). The FFI and Dart Rust crates pick it up via per-crate[[crates.<x>.target_dep_overrides]] cfg = 'target_os = "windows"'blocks inalef.toml. The pyo3 / napi-rs / magnus / ext-php-rs / rustler / swift-bridge / jni scaffolders do not yet honortarget_dep_overrides, so the python/node/ruby/php/elixir/swift/kotlin_android Windows wheels still need an alef-side scaffolder fix to dropheicon Windows. The alef.toml entries for those crates are no-ops today but document the intent; they activate once the upstream scaffolders pick up the override block.
-
CI / publish: source-build libheif 1.23.0 to satisfy
libheif-sys 5.3 >= 1.21. Ubuntu Noble's apt shipslibheif 1.17.6and Alpine 3.21 shipslibheif 1.19.5— both rejected bylibheif-syswithPackage 'libheif' has version '1.17.6', required version is '>= 1.21'. Without a fix, every Linux build job (Build FFI matrix, CLI binaries, Go FFI, C# natives, Zig package, Python wheels, C FFI distribution, Java natives, Kotlin Android, Dart, Swift, manylinux_2_28 wheels) failed at thelibheif-sysbuild script. Three coordinated fixes:scripts/ci/install-system-deps/install-linux.shdrops aptlibheif-dev, installslibde265-dev libaom-dev libx265-dev libdav1d-dev, then downloads + builds + installs libheif 1.23.0 via cmake into/usr/local, exportingPKG_CONFIG_PATH+LD_LIBRARY_PATHviaGITHUB_ENV. Cached via a newcache-libheif-linuxstep ininstall-system-deps/action.yml.docker/Dockerfile.musl-{build,ffi,rustler}now pulllibheif-devand codec headers fromalpine/edge/community+edge/main(libheif 1.23.0) instead of Alpine 3.21 main.kreuzberg-dev/actions/build-python-wheels@v1.8.64source-builds libheif insidemanylinux_2_28viaCIBW_BEFORE_ALL_LINUX(AlmaLinux 8 + EPEL ships codec subsets only — we install what's available and let libheif compile without missing codec features).python-wheelsjob inpublish.yamlnow runsinstall-system-depsbeforebuild-python-wheelsso Windows wheels pick up vcpkg-built libheif.
-
Publish workflow: fixed dead-code warnings as errors on Windows /
reranker-presets-only builds.rerank_via_llmandextract_rerank_usageincrates/kreuzberg/src/llm/rerank.rswere gated byany(feature = "reranker-presets", feature = "reranker"), but every caller was gated byfeature = "reranker". On the Windows feature combo (which usesreranker-presets+liter-llmwithoutreranker), the functions compiled but had no callers — clippy-D warningsfailed every Windows binding build (C#, Go, Java, CLI, Node, C FFI). Tightened the cfg tofeature = "reranker"on both functions, the relatedusestatements, and the test module so the function tracks its callers exactly. Also removed the orphanedContentLayer::is_defaulthelper (made dead by the rc.11 removal of#[serde(skip_serializing_if = ...)]oncontent_layer). -
Publish workflow: install libheif on Linux + macOS runners.
libheif-sysis a transitive dependency ofkreuzberg-libheif(gated behind theheicfeature, included infull), and the publish workflow's per-language build jobs (Swift / Zig / C-FFI / C# / Go / Java / Kotlin-Android / CLI / Node / Dart / Python wheels) invokedkreuzberg-dev/actions/build-*without first installing system dependencies.libheif-sys's pkg-config probe then failed withPackage libheif was not found in the pkg-config search path. Addedlibheif-devtoscripts/ci/install-system-deps/install-linux.sh,brew install libheiftoinstall-macos.sh, and injected a./.github/actions/install-system-depsstep before everybuild-*action invocation in.github/workflows/publish.yaml(17 sites + thepublish-cratesjob, sincekreuzberg-libheifis also published from that job). -
Publish workflow: publish
kreuzberg-libheifto crates.io. The new path-onlykreuzberg-libheifcrate is a build dependency ofkreuzbergvia theheicfeature. Without publishing it first,cargo publish -p kreuzbergaborts withno matching package named 'kreuzberg-libheif' found (location searched: crates.io index). Added it as the first entry in thepublish-cratesinvocation'scrates:list so it lands on crates.io beforekreuzberg. -
Android (any arch) + iOS: widened libheif/ORT exclusion gate. The C FFI crate's
target_dep_overridespreviously narrowed theandroid-targetfeature set tocfg(all(target_os = "android", target_arch = "x86_64")), on the assumption thataarch64-linux-androidcould use the full ORT-enabled set via pyke prebuilts. In practicelibheif-sysstill blocked the cross-compile on the NDK for both architectures, and iOS hit the same wall. Widened the cfg totarget_os = "android"(covers both arches) and added a paralleltarget_os = "ios"override that also routes toandroid-target. Same treatment applied to the Dart Rust crate (packages/dart/rust/Cargo.toml). -
Docs strict build: broken anchor in
docs/concepts/reranking.md:10. The opening "Bi-encoders vs cross-encoders" paragraph linked[Embeddings](architecture.md#embeddings), butarchitecture.mdhas no## Embeddingsheading —task docs:build:strictaborted with exit code 1. Reworded the parenthetical to drop the link target. -
musl Docker builds: include
kreuzberg-libheifcrate + installlibheif-dev. The threedocker/Dockerfile.musl-{ffi,rustler,build}images each copy a subset of workspace crates into the build context; the newkreuzberg-libheifcrate (required transitively whenkreuzbergis built with thefullfeatures that includeheic) was missing, so cargo aborted withfailed to read /build/crates/kreuzberg-libheif/Cargo.toml. Added the COPY entry to each Dockerfile and addedlibheif-devto theapk addlist so libheif-sys's pkg-config probe finds Alpine's package. -
CI Lint: install system dependencies before running clippy.
ci-lint.yamlset up the Rust toolchain but never called./.github/actions/install-system-deps, so clippy on the Linux runner failed to findlibheif.pc. Added the install-system-deps step right aftersetup-rust. -
Windows binding wheels: install libheif via vcpkg.
libheif-sys's build.rs usesvcpkg::Config::new().find_package("libheif")on Windows MSVC. Thewindows-latestrunner image ships vcpkg pre-installed atC:\vcpkgbut the libheif port is not. Added a vcpkg install step (vcpkg install libheif:x64-windows-static-md) toscripts/ci/install-system-deps/install-windows.ps1, exposedVCPKG_ROOTto the build environment, and added anactions/cacheentry onC:\vcpkg\installed\x64-windows-static-mdso subsequent runs hit the cache (~10s) instead of the cold ~20-min vcpkg build. Unblocks Python / Node / Ruby / PHP / Elixir / Swift / JNI Windows wheels which still use thefullfeature set because their alef scaffolders do not yet honortarget_dep_overrides. -
Elixir cargo cc version conflict. The native Rustler NIF lockfile at packages/elixir/native/kreuzberg_nif/Cargo.lock pinned cc 1.2.63, while kreuzberg-tesseract requires cc ^1.2.64, so Cargo failed to resolve. Ran cargo update -p cc in that sub-project to lift it to 1.2.64; this also picked up html-to-markdown-rs 3.6.2 and pdf_oxide 0.3.64 in both lockfiles to keep them in sync. -
Rust e2e generator: enum variant accessors emit method-call syntax. The fixture path metadata.format.excel.sheet_count resolved against FormatMetadata (a tagged enum with a pub fn excel(&self) -> Option<&ExcelMetadata> convenience accessor). The Rust e2e renderer was emitting result.metadata.format.as_ref().unwrap().excel.as_ref().unwrap().sheet_count, which is a compile error because FormatMetadata has no excel field. Added "metadata.format.excel" to fields_method_calls in alef.toml so the Rust renderer (and the Zig one) appends () after the segment instead of dot-accessing it as a field. -
Rust e2e generator: not_empty on String leaves the leaf concrete. Fixed in alef v0.25.0 (src/e2e/codegen/rust/assertion_helpers.rs). When a path like summary.text crosses an Option<Summary> parent on the way down, the resolver registers the path as optional; the renderer's is_opt branch previously emitted accessor.is_some(), but the accessor already auto-unwrapped the parent (...summary.as_ref().unwrap().text), so the final expression has type String and .is_some() is a compile error. The renderer now checks for the trailing .as_ref().unwrap(). marker and emits !accessor.is_empty() for the concrete-leaf case while preserving .is_some() for true Option<T> leaves. -
Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (src/e2e/field_access/resolver.rs::rust_unwrap_binding). The pre-pass that lifts string-equals assertions into local let _name = result.<path>.as_ref().map(...).unwrap_or_default(); bindings called the basic render_accessor instead of render_rust_with_optionals, so a path like summary.strategy emitted result.summary.strategy, a compile error because Option<Summary> has no strategy field. Switched the binding to render_rust_with_optionals so intermediate optional segments produce .as_ref().unwrap(). -
Removed summary.text and summary.strategy from fields_optional. They are not optional on DocumentSummary (pub text: String, pub strategy: SummaryStrategy); only the parent summary: Option<DocumentSummary> is. Listing the leaves as optional caused the e2e renderer to emit .as_ref() and .as_deref() against concrete types. The parent stays in the set. -
SummaryStrategy now implements Display matching the snake-case serde wire form (extractive / abstractive). The Rust e2e renderer's string-equality let-binding pre-pass needs to_string() on the enum value to compare against the fixture's literal string. Added a minimal Display impl. -
Dead-code is_heif_container and extract_exif_data under reranker-only builds. The Live HF preset tests CI job builds kreuzberg with --features "reranker,reranker-presets,tokio-runtime". The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (in extraction::image and extractors::image) is gated out, so clippy -D warnings surfaced both functions as dead_code. Added #[allow(dead_code)] to both definitions with comments documenting the unconditional-compile intent. -
text::classification::classify_text stub added. The real implementation lives behind the classification feature; alef-generated bindings call kreuzberg::text::classification::classify_text unconditionally, so the existing stub module needed a matching function at that path, one that returns Err(KreuzbergError::Other("classification feature not available on this target")). Required for the iOS / Android android-target builds, which drop the classification feature. -
LlmBackend and GlineBackend stubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 / android+x86_64 triple cfg. With the binding crates now widening their target gates to target_os = "android" and target_os = "ios" (both arches), aarch64-apple-ios and aarch64-linux-android were failing to compile against kreuzberg::LlmBackend / kreuzberg::GlineBackend. Simplified the gate to #[cfg(not(feature = "ner-llm"))] and #[cfg(not(feature = "ner-onnx"))] respectively, so any config that drops the feature now gets the stub regardless of target. -
Swift Rust crate now honours target_dep_overrides. The Swift cargo emitter (alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml) called crate::scaffold::render_core_dep directly, ignoring the [[crates.swift.target_dep_overrides]] block. Extended SwiftConfig with a target_dep_overrides: Vec<SwiftTargetDepOverride> field (mirrors DartTargetDepOverride) and refactored emit_cargo_toml to emit [target.'cfg(not(any(...)))'.dependencies] + per-override [target.'cfg(...)'.dependencies] blocks when overrides are present, matching the FFI and Dart patterns. With alef.toml now declaring iOS, Android, and Windows overrides for [crates.swift], packages/swift/rust/Cargo.toml correctly routes iOS to android-target, Android to android-target, Windows to windows-target, and the default (macOS host) to the full feature set. The same target_dep_overrides config gap exists for the python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
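The three Rust e2e generator entries above (method-call accessor, concrete String leaf under an Option parent, and the Display impl for SummaryStrategy) can be pictured with a minimal sketch. The type shapes below are trimmed-down, hypothetical stand-ins built only from the names in those entries; the struct layouts, `sample_result`, and `check` are assumptions for illustration, not the crate's actual definitions.

```rust
use std::fmt;

// Hypothetical stand-ins; field sets are reduced to what the entries mention.
pub struct ExcelMetadata {
    pub sheet_count: usize,
}

pub enum FormatMetadata {
    Excel(ExcelMetadata),
    Other,
}

impl FormatMetadata {
    // Convenience accessor: a method, so `format.excel` (field syntax) cannot compile.
    pub fn excel(&self) -> Option<&ExcelMetadata> {
        match self {
            FormatMetadata::Excel(meta) => Some(meta),
            _ => None,
        }
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SummaryStrategy {
    Extractive,
    Abstractive,
}

// Display mirrors the snake-case wire form so to_string() can be compared to fixture literals.
impl fmt::Display for SummaryStrategy {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            SummaryStrategy::Extractive => "extractive",
            SummaryStrategy::Abstractive => "abstractive",
        })
    }
}

pub struct DocumentSummary {
    pub text: String,              // concrete leaf, not Option
    pub strategy: SummaryStrategy, // concrete leaf, not Option
}

pub struct Metadata {
    pub format: Option<FormatMetadata>,
}

pub struct ExtractionResult {
    pub metadata: Metadata,
    pub summary: Option<DocumentSummary>, // only the parent is optional
}

// Hypothetical fixture value used to exercise the accessors.
pub fn sample_result() -> ExtractionResult {
    ExtractionResult {
        metadata: Metadata {
            format: Some(FormatMetadata::Excel(ExcelMetadata { sheet_count: 3 })),
        },
        summary: Some(DocumentSummary {
            text: "Quarterly totals".to_string(),
            strategy: SummaryStrategy::Extractive,
        }),
    }
}

// Expressions in the shape the fixed renderer is described as emitting.
pub fn check(result: &ExtractionResult) -> (usize, bool, String) {
    // Method call `excel()` instead of field access.
    let sheet_count = result.metadata.format.as_ref().unwrap().excel().as_ref().unwrap().sheet_count;
    // Parent unwrapped, leaf is a String: use is_empty(), not is_some().
    let text_not_empty = !result.summary.as_ref().unwrap().text.is_empty();
    // Pre-assertion binding: optional parent mapped, enum rendered through Display.
    let strategy = result.summary.as_ref().map(|s| s.strategy.to_string()).unwrap_or_default();
    (sheet_count, text_not_empty, strategy)
}
```

Calling `check(&sample_result())` exercises all three access patterns in one place.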
-
list_supported_formats() is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (kreuzberg formats), REST API (GET /formats), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091) -
[v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level rerank / rerank_async API, RerankerConfig with Preset/Custom/Llm/Plugin variants, RerankerBackend plugin trait + registry, POST /rerank HTTP endpoint, and per-language bindings via alef. Gated behind the new reranker + reranker-presets Cargo features; reranker-presets is WASM/Android-safe. -
[v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries: bge-reranker-base (BAAI/bge-reranker-base, EN+ZH), bge-reranker-v2-m3 (rozgo/bge-reranker-v2-m3 with the required model.onnx.data sibling, multilingual), jina-reranker-v1-turbo-en, and jina-reranker-v2-base-multilingual. Friendly aliases fast / balanced / quality / multilingual resolve to catalog entries, so existing call sites keep working. -
[v5.0.0] RerankerPreset + RerankerModelType::Custom gained additional_files: Vec<String>. Lets multi-blob ONNX exports (notably rozgo/bge-reranker-v2-m3, which splits weights into model.onnx + model.onnx.data) actually load. -
[v5.0.0] RerankerModelType::Custom gained model_file: Option<String>. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to "onnx/model.onnx" when omitted. -
[v5.0.0] CI live-hf job. Always-on reranker preset-path validation on every PR via a new .github/actions/cache-hf-fastembed composite action that caches ~/.cache/huggingface/hub keyed on the catalog literal, so any preset path change triggers a fresh download and drift is caught the moment it happens. Tests cover all four presets plus a Preset / Custom-equivalent crosscheck and top_k truncation.
-
[v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (Xenova/ms-marco-MiniLM-L-6-v2, Xenova/bge-reranker-base, Xenova/bge-reranker-large, BAAI/bge-reranker-v2-m3) are removed. The hand-curated model_file paths were never verified against HF; three of the four were wrong and would have 404'd at runtime. Callers using the friendly aliases (fast / balanced / quality / multilingual) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically need ms-marco-MiniLM or bge-reranker-large can pass them via Custom { model_id, ... } or a registered Plugin backend. -
[v5.0.0] BREAKING: RerankerModelType::Custom no longer has just two fields. Exhaustive Rust matches on Custom { model_id, max_length } need to add model_file and additional_files. Serde defaults keep existing TOML / JSON configs valid without migration.
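A minimal sketch of how a caller's match might absorb the two new fields, under stated assumptions: the enum below is a hypothetical mirror of the variant described above (the `Preset` shape, the `max_length` type, and the `files_to_fetch` helper are invented for illustration). Only the field names and the "onnx/model.onnx" default come from the entries above.

```rust
// Hypothetical mirror of the enum; the real definition may differ.
pub enum RerankerModelType {
    Preset { name: String },
    Custom {
        model_id: String,
        max_length: usize,             // type is an assumption
        model_file: Option<String>,    // new field
        additional_files: Vec<String>, // new field
    },
}

// Default taken from the changelog entry for model_file.
const DEFAULT_MODEL_FILE: &str = "onnx/model.onnx";

// Hypothetical helper: lists the files a Custom model would need.
pub fn files_to_fetch(model: &RerankerModelType) -> Vec<String> {
    match model {
        RerankerModelType::Preset { .. } => Vec::new(),
        // Naming only the fields you use and ending with `..` keeps the
        // pattern compiling if further fields are added later.
        RerankerModelType::Custom { model_file, additional_files, .. } => {
            let primary = model_file
                .clone()
                .unwrap_or_else(|| DEFAULT_MODEL_FILE.to_string());
            let mut files = vec![primary];
            files.extend(additional_files.iter().cloned());
            files
        }
    }
}
```

Patterns that spell out every field must add `model_file` and `additional_files`; patterns that end in `..` are unaffected by the change.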
-
rendering: fixed panic when a non-Item block element appears directly under a List node before any ListItem. The comrak AST builder now synthesises an implicit Item wrapper instead of falling back onto the bare List, which violated CommonMark's List → Item-only constraint and panicked in debug builds. (#1096) -
pdf: result.pages[*].isBlank now reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existing PageContent entries, is_blank was never recalculated; it retained the stale value from native text extraction, which is always Some(true) for pages with no text layer. All four write sites in the OCR page-assembly block now call is_page_text_blank after every content mutation. (#1095) -
reranker: RerankError migrated to thiserror. Matches the rest of the library and rust-conventions. -
reranker: shutdown_all is now best-effort. It continues invoking shutdown() on every backend even after one fails, returns the first error, and drops subsequent ones (logged at warn). Previously it stopped on the first failure, leaving the registry in a half-shutdown state. -
reranker: synchronous rerank() returns a clear error instead of panicking on a current-thread Tokio runtime. block_in_place requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect RuntimeFlavor::CurrentThread and ask the caller to use rerank_async() or build a multi-thread runtime. -
reranker: stronger sigmoid coverage in tests/api_rerank.rs. The happy-path test now uses mixed-sign logits (-2.0, 3.0, 0.5) and asserts both the sigmoid output range [0, 1] and the sign-to-side property (negative logits score below 0.5, positive logits above), so silently dropping the sigmoid would break the test. -
reranker: engine tests share the production sigmoid_f32 instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
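The best-effort shutdown_all behaviour described above (visit every backend, return the first error, log the rest) can be sketched as follows. The `Backend` trait, the `FakeBackend` test double, and the `String` error type are hypothetical simplifications, and `eprintln!` stands in for the warn-level log call; none of this is the library's actual registry code.

```rust
// Hypothetical minimal backend interface for illustration only.
pub trait Backend {
    fn name(&self) -> &str;
    fn shutdown(&mut self) -> Result<(), String>;
}

// Test double that records whether shutdown() was reached.
pub struct FakeBackend {
    pub name: &'static str,
    pub fail: bool,
    pub stopped: bool,
}

impl Backend for FakeBackend {
    fn name(&self) -> &str {
        self.name
    }

    fn shutdown(&mut self) -> Result<(), String> {
        self.stopped = true;
        if self.fail {
            Err(format!("{} refused to stop", self.name))
        } else {
            Ok(())
        }
    }
}

// Best-effort shutdown: never stops early, keeps only the first error.
pub fn shutdown_all<B: Backend>(backends: &mut [B]) -> Result<(), String> {
    let mut first_error: Option<String> = None;
    for backend in backends.iter_mut() {
        if let Err(err) = backend.shutdown() {
            if first_error.is_none() {
                first_error = Some(err);
            } else {
                // Later errors are reported and dropped (stand-in for a warn-level log).
                eprintln!("warn: dropping shutdown error from {}: {err}", backend.name());
            }
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}
```

The point of the design is that a failure in one backend no longer prevents the remaining backends from being asked to shut down.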
-
pdf: table extraction failures are now visible at warn log level. extract_tables_native and extract_tables_bordered silently caught pdf_oxide::extract_tables_with_config errors at tracing::debug!, making per-page failures invisible at the default log level. Promoted to tracing::warn! to match the existing behaviour of the TATR and SLANeXT inference paths. The three unwrap_or_default() call sites in extraction.rs that silently swallowed function-level errors are also replaced with unwrap_or_else(|e| { tracing::warn!(...); Vec::new() }) so that a page_count() failure is equally visible. (#1097) -
publish.yaml trigger-pubdev job: explicit permissions: actions: write. Since the a8f8597e45 migration to the kreuzberg-dev-publisher App-token, the gh workflow run publish-pubdev.yaml step has 403'd with "Resource not accessible by integration" because the App's installation token didn't carry actions: write. Adding job-level permissions: { actions: write, contents: read } covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need actions: write configured on github.com.
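The unwrap_or_default() to unwrap_or_else(...) change in the table extraction entry above follows a common pattern: keep the empty fallback, but surface the error first. A minimal dependency-free sketch, assuming a hypothetical fallible `extract_tables` function and using `eprintln!` where the changelog describes tracing::warn!:

```rust
// Hypothetical stand-in for a fallible per-page table extractor.
pub fn extract_tables(page: usize) -> Result<Vec<String>, String> {
    if page == 0 {
        Err("page index out of range".to_string())
    } else {
        Ok(vec![format!("table on page {page}")])
    }
}

// Same fallback value as unwrap_or_default(), but the error is no longer silent.
pub fn tables_or_empty(page: usize) -> Vec<String> {
    extract_tables(page).unwrap_or_else(|e| {
        // The real code is described as calling tracing::warn! here;
        // eprintln! keeps this sketch free of external crates.
        eprintln!("warn: table extraction failed on page {page}: {e}");
        Vec::new()
    })
}
```

Callers see the same return type either way; only the visibility of the failure changes.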
- Root Taskfile now includes test-apps task namespace. Added test-apps to the root Taskfile.yml includes block with proper {{.ROOT_DIR}} path resolution. Smoke and comprehensive test tasks for all 11 languages are now accessible via task test-apps:smoke:*.
-
swift e2e: removed erroneous async = false override on extract_file for swift. Kreuzberg.extractFile(_:_:_:) is async in the Swift binding. The override in alef.toml forced is_async = false for fixtures that explicitly set "call": "extract_file" (e.g. api_batch_bytes_async), generating non-async test methods that called the async binding without await, which are compile errors. Fixtures without an explicit call fell through resolve_call_for_fixture to the global default and got is_async = true correctly, which is why testApiExtractFileAsync compiled but testApiBatchBytesAsync and 4 siblings did not. Dropping the override aligns both code paths. -
r: fix macOS dylib rpath so ORT loads at R extension runtime. packages/r/src/rust/build.rs now adds the -Wl,-rpath,@loader_path linker flag on macOS, enabling the final R extension .so to locate transitively-linked dylibs like libonnxruntime.dylib at load time. Without this, R's dyn.load via library.dynam2 failed with undefined symbol: OrtGetApiBase in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791. -
Publish Release WASM job now non-blocking via continue-on-error. The Build WASM package job consistently hits a GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on ubuntu-latest (8GB RAM). Runner preemption ("runner shutdown signal") terminates the job before the timeout expires. Set continue-on-error: true so the Publish-Final job proceeds regardless of the WASM build outcome. A deeper fix (runner upsizing or two-stage build) is deferred to rc.11+. The WASM package still publishes if the npm cache was warm from a previous rc or if the parallel publish-wasm job succeeds. -
FormatMetadata::Code now serializes correctly. The #[serde(skip)] annotation on the Code variant caused serde_json::to_string (and every *_to_json FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation; the inner CodeMetadataInner(ProcessResult) wrapper already derives Serialize / Deserialize via the upstream serde feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction. -
Python sdist publish step now uses split-layout invocation. .github/workflows/publish.yaml passed manifest-path: crates/kreuzberg-py/Cargo.toml to build-python-sdist@v1. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs maturin sdist, but the kreuzberg layout keeps pyproject.toml in packages/python/, so maturin failed with Failed to build source distribution, pyproject.toml not found. PyPI publish was then skipped for rc.10. Dropped the manifest-path input so the action falls through to the default package-dir: packages/python split-layout fallback, which cd's into the package dir and lets maturin resolve manifest-path from pyproject.toml's [tool.maturin] section itself. -
Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking kreuzberg_php.dll / kreuzberg_nif.dll on x86_64-pc-windows-msvc failed with LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease. libkreuzberg_tesseract.rlib is built by cmake-rs, which defaults to /MD; libesaxx_rs.rlib (transitively pulled in by gliner → tokenizers → esaxx-rs) is built by cc-rs, which fell back to /MT. alef.toml [crates.scaffold.cargo.env] now sets CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD" and the matching CXXFLAGS_*, propagated to .cargo/config.toml [env]. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. The same fix unblocks the Elixir NIF Windows build (a recurring failure since rc.7). -
captioning: captioning was a no-op for all image paths because CaptioningProcessor never received image bytes. Two root causes: (1) ImageExtractor built extracted_image on the OCR path but passed None to build_image_internal_document, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on config.images.extract_images, so setting only config.captioning left them with empty data. Fix: add ExtractionConfig::needs_image_data(), which is true when images.extract_images or captioning is set, and use it in every extractor image gate and in needs_image_processing(). Also emits a ProcessingWarning when captioning is configured but result.images is None. (#732)
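The needs_image_data() gate from the captioning fix above is a single predicate over two config fields. A minimal sketch under stated assumptions: the struct layouts below (a plain `ImageConfig`, an empty `CaptioningConfig`, and `captioning` as an `Option`) are hypothetical and only the field and method names come from the entry.

```rust
// Hypothetical config shapes; the real structs carry more fields.
#[derive(Default)]
pub struct ImageConfig {
    pub extract_images: bool,
}

#[derive(Default)]
pub struct CaptioningConfig;

#[derive(Default)]
pub struct ExtractionConfig {
    pub images: ImageConfig,
    pub captioning: Option<CaptioningConfig>, // Option is an assumption
}

impl ExtractionConfig {
    // True when either consumer of image bytes is enabled, so extractors
    // keep binary image data even if only captioning is configured.
    pub fn needs_image_data(&self) -> bool {
        self.images.extract_images || self.captioning.is_some()
    }
}
```

Routing every extractor's image gate through one predicate means a future consumer of image bytes only needs to be added in one place.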
v5.0.0-rc.12
- New windows-target aggregate feature in crates/kreuzberg/Cargo.toml. Mirrors the curated FFI-on-Windows list the publish workflow already used and drops heic along with the ORT-dependent capabilities (paddle-ocr, layout-detection, embeddings, reranker, ner-llm). The FFI and Dart Rust crates pick it up via per-crate [[crates.<x>.target_dep_overrides]] cfg = 'target_os = "windows"' blocks in alef.toml. The pyo3 / napi-rs / magnus / ext-php-rs / rustler / swift-bridge / jni scaffolders do not yet honor target_dep_overrides, so the python / node / ruby / php / elixir / swift / kotlin_android Windows wheels still need an alef-side scaffolder fix to drop heic on Windows. The alef.toml entries for those crates are no-ops today but document the intent; they activate once the upstream scaffolders pick up the override block.
-
CI / publish: source-build libheif 1.23.0 to satisfy the libheif-sys 5.3 requirement of >= 1.21. Ubuntu Noble's apt ships libheif 1.17.6 and Alpine 3.21 ships libheif 1.19.5; both are rejected by libheif-sys with Package 'libheif' has version '1.17.6', required version is '>= 1.21'. Without a fix, every Linux build job (Build FFI matrix, CLI binaries, Go FFI, C# natives, Zig package, Python wheels, C FFI distribution, Java natives, Kotlin Android, Dart, Swift, manylinux_2_28 wheels) failed at the libheif-sys build script. Three coordinated fixes: (1) scripts/ci/install-system-deps/install-linux.sh drops apt libheif-dev, installs libde265-dev libaom-dev libx265-dev libdav1d-dev, then downloads, builds, and installs libheif 1.23.0 via cmake into /usr/local, exporting PKG_CONFIG_PATH + LD_LIBRARY_PATH via GITHUB_ENV. Cached via a new cache-libheif-linux step in install-system-deps/action.yml. (2) docker/Dockerfile.musl-{build,ffi,rustler} now pull libheif-dev and codec headers from alpine/edge/community + edge/main (libheif 1.23.0) instead of Alpine 3.21 main. (3) kreuzberg-dev/actions/build-python-wheels@v1.8.64 source-builds libheif inside manylinux_2_28 via CIBW_BEFORE_ALL_LINUX (AlmaLinux 8 + EPEL ships codec subsets only, so we install what is available and let libheif compile without the missing codec features). The python-wheels job in publish.yaml now runs install-system-deps before build-python-wheels so Windows wheels pick up vcpkg-built libheif.
-
Publish workflow: fixed dead-code warnings as errors on Windows / reranker-presets-only builds. rerank_via_llm and extract_rerank_usage in crates/kreuzberg/src/llm/rerank.rs were gated by any(feature = "reranker-presets", feature = "reranker"), but every caller was gated by feature = "reranker". On the Windows feature combo (which uses reranker-presets + liter-llm without reranker), the functions compiled but had no callers, so clippy -D warnings failed every Windows binding build (C#, Go, Java, CLI, Node, C FFI). Tightened the cfg to feature = "reranker" on both functions, the related use statements, and the test module so the functions track their callers exactly. Also removed the orphaned ContentLayer::is_default helper (made dead by the rc.11 removal of #[serde(skip_serializing_if = ...)] on content_layer). -
Publish workflow: install libheif on Linux + macOS runners. libheif-sys is a transitive dependency of kreuzberg-libheif (gated behind the heic feature, included in full), and the publish workflow's per-language build jobs (Swift / Zig / C-FFI / C# / Go / Java / Kotlin-Android / CLI / Node / Dart / Python wheels) invoked kreuzberg-dev/actions/build-* without first installing system dependencies. libheif-sys's pkg-config probe then failed with Package libheif was not found in the pkg-config search path. Added libheif-dev to scripts/ci/install-system-deps/install-linux.sh, brew install libheif to install-macos.sh, and injected a ./.github/actions/install-system-deps step before every build-* action invocation in .github/workflows/publish.yaml (17 sites + the publish-crates job, since kreuzberg-libheif is also published from that job). -
Publish workflow: publish kreuzberg-libheif to crates.io. The new path-only kreuzberg-libheif crate is a build dependency of kreuzberg via the heic feature. Without publishing it first, cargo publish -p kreuzberg aborts with no matching package named 'kreuzberg-libheif' found (location searched: crates.io index). Added it as the first entry in the publish-crates invocation's crates: list so it lands on crates.io before kreuzberg. -
Android (any arch) + iOS: widened libheif/ORT exclusion gate. The C FFI crate's
target_dep_overridespreviously narrowed theandroid-targetfeature set tocfg(all(target_os = "android", target_arch = "x86_64")), on the assumption thataarch64-linux-androidcould use the full ORT-enabled set via pyke prebuilts. In practicelibheif-sysstill blocked the cross-compile on the NDK for both architectures, and iOS hit the same wall. Widened the cfg totarget_os = "android"(covers both arches) and added a paralleltarget_os = "ios"override that also routes toandroid-target. Same treatment applied to the Dart Rust crate (packages/dart/rust/Cargo.toml). -
Docs strict build: broken anchor in
docs/concepts/reranking.md:10. The opening "Bi-encoders vs cross-encoders" paragraph linked[Embeddings](architecture.md#embeddings), butarchitecture.mdhas no## Embeddingsheading —task docs:build:strictaborted with exit code 1. Reworded the parenthetical to drop the link target. -
musl Docker builds: include
kreuzberg-libheifcrate + installlibheif-dev. The threedocker/Dockerfile.musl-{ffi,rustler,build}images each copy a subset of workspace crates into the build context; the newkreuzberg-libheifcrate (required transitively whenkreuzbergis built with thefullfeatures that includeheic) was missing, so cargo aborted withfailed to read /build/crates/kreuzberg-libheif/Cargo.toml. Added the COPY entry to each Dockerfile and addedlibheif-devto theapk addlist so libheif-sys's pkg-config probe finds Alpine's package. -
CI Lint: install system dependencies before running clippy.
ci-lint.yamlset up the Rust toolchain but never called./.github/actions/install-system-deps, so clippy on the Linux runner failed to findlibheif.pc. Added the install-system-deps step right aftersetup-rust. -
Windows binding wheels: install libheif via vcpkg.
libheif-sys's build.rs usesvcpkg::Config::new().find_package("libheif")on Windows MSVC. Thewindows-latestrunner image ships vcpkg pre-installed atC:\vcpkgbut the libheif port is not. Added a vcpkg install step (vcpkg install libheif:x64-windows-static-md) toscripts/ci/install-system-deps/install-windows.ps1, exposedVCPKG_ROOTto the build environment, and added anactions/cacheentry onC:\vcpkg\installed\x64-windows-static-mdso subsequent runs hit the cache (~10s) instead of the cold ~20-min vcpkg build. Unblocks Python / Node / Ruby / PHP / Elixir / Swift / JNI Windows wheels which still use thefullfeature set because their alef scaffolders do not yet honortarget_dep_overrides. -
Elixir cargo
ccversion conflict. The native Rustler NIF lockfile atpackages/elixir/native/kreuzberg_nif/Cargo.lockpinnedcc 1.2.63, whilekreuzberg-tesseractrequirescc ^1.2.64. Cargo failed to resolve. Rancargo update -p ccin that sub-project to lift it to1.2.64; this also picked uphtml-to-markdown-rs 3.6.2andpdf_oxide 0.3.64in both lockfiles to keep them in sync. -
Rust e2e generator: enum variant accessors emit method-call syntax. The fixture path
metadata.format.excel.sheet_countresolved againstFormatMetadata(a tagged enum with apub fn excel(&self) -> Option<&ExcelMetadata>convenience accessor). The Rust e2e renderer was emittingresult.metadata.format.as_ref().unwrap().excel.as_ref().unwrap().sheet_count, which is a compile error becauseFormatMetadatahas noexcelfield. Added"metadata.format.excel"tofields_method_callsinalef.tomlso the Rust renderer (and the Zig one) appends()after the segment instead of dot-accessing it as a field. -
Rust e2e generator:
not_emptyonStringleaves the leaf concrete. Fixed in alef v0.25.0 (src/e2e/codegen/rust/assertion_helpers.rs). When a path likesummary.textcrosses anOption<Summary>parent on the way down, the resolver registers the path as optional; the renderer'sis_optbranch previously emittedaccessor.is_some(), but the accessor already auto-unwrapped the parent (...summary.as_ref().unwrap().text), so the final expression has typeStringand.is_some()is a compile error. The renderer now checks for the trailing.as_ref().unwrap().marker and emits!accessor.is_empty()for the concrete-leaf case while preserving.is_some()for trueOption<T>leaves. -
Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (
src/e2e/field_access/resolver.rs::rust_unwrap_binding). The pre-pass that lifts string-equals assertions into locallet _name = result.<path>.as_ref().map(...).unwrap_or_default();bindings called the basicrender_accessorinstead ofrender_rust_with_optionals, so a path likesummary.strategyemittedresult.summary.strategy— a compile error becauseOption<Summary>has nostrategyfield. Switched the binding torender_rust_with_optionalsso intermediate optional segments produce.as_ref().unwrap(). -
Removed
summary.textandsummary.strategyfromfields_optional. They are not optional onDocumentSummary(pub text: String,pub strategy: SummaryStrategy); only the parentsummary: Option<DocumentSummary>is. Listing the leaves as optional caused the e2e renderer to emit.as_ref()and.as_deref()against concrete types. The parent stays in the set. -
SummaryStrategynow implementsDisplaymatching the snake-case serde wire form (extractive/abstractive). The Rust e2e renderer's string-equality let-binding pre-pass needsto_string()on the enum value to compare against the fixture's literal string. Added a minimalDisplayimpl. -
Dead-code
is_heif_containerandextract_exif_dataunder reranker-only builds. TheLive HF preset testsCI job buildskreuzbergwith--features "reranker,reranker-presets,tokio-runtime". The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (inextraction::imageandextractors::image) is gated out — clippy-D warningsthen surfaced both functions asdead_code. Added#[allow(dead_code)]to both definitions with comments documenting the unconditional-compile intent. -
text::classification::classify_textstub added. The real implementation lives behind theclassificationfeature; alef-generated bindings callkreuzberg::text::classification::classify_textunconditionally, so the existing stub module needed a matching no-op for the path that returnsErr(KreuzbergError::Other("classification feature not available on this target")). Required for the iOS / Androidandroid-targetbuilds, which drop the classification feature. -
LlmBackendandGlineBackendstubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 /android+x86_64triple cfg. With the binding crates now widening their target gates totarget_os = "android"andtarget_os = "ios"(both arches),aarch64-apple-iosandaarch64-linux-androidwere failing to compile againstkreuzberg::LlmBackend/kreuzberg::GlineBackend. Simplified the gate to#[cfg(not(feature = "ner-llm"))]and#[cfg(not(feature = "ner-onnx"))]respectively — any config that drops the feature now gets the stub regardless of target. -
Swift Rust crate now honours
target_dep_overrides. The Swift cargo emitter (alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml) calledcrate::scaffold::render_core_depdirectly, ignoring the[[crates.swift.target_dep_overrides]]block. ExtendedSwiftConfigwith atarget_dep_overrides: Vec<SwiftTargetDepOverride>field (mirrorsDartTargetDepOverride) and refactoredemit_cargo_tomlto emit[target.'cfg(not(any(...)))'.dependencies]+ per-override[target.'cfg(...)'.dependencies]blocks when overrides are present, matching the FFI and Dart patterns. Withalef.tomlnow declaring iOS, Android, and Windows overrides for[crates.swift],packages/swift/rust/Cargo.tomlcorrectly routes iOS toandroid-target, Android toandroid-target, Windows towindows-target, and the default (macOS host) to the full feature set. Sametarget_dep_overridesconfig gap exists for python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
-
list_supported_formats()is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (kreuzberg formats), REST API (GET /formats), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091) -
[v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level
rerank/rerank_asyncAPI,RerankerConfigwith Preset/Custom/Llm/Plugin variants,RerankerBackendplugin trait + registry,POST /rerankHTTP endpoint, and per-language bindings via alef. Gated behind the newreranker+reranker-presetsCargo features; reranker-presets is WASM/Android-safe. -
[v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries:
bge-reranker-base(BAAI/bge-reranker-base, EN+ZH),bge-reranker-v2-m3(rozgo/bge-reranker-v2-m3 with the requiredmodel.onnx.datasibling, multilingual),jina-reranker-v1-turbo-en, andjina-reranker-v2-base-multilingual. Friendly aliasesfast/balanced/quality/multilingualresolve to catalog entries, so existing call sites keep working. -
[v5.0.0]
RerankerPreset+RerankerModelType::Customgainedadditional_files: Vec<String>. Lets multi-blob ONNX exports (notablyrozgo/bge-reranker-v2-m3, which splits weights intomodel.onnx+model.onnx.data) actually load. -
[v5.0.0]
RerankerModelType::Customgainedmodel_file: Option<String>. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to"onnx/model.onnx"when omitted. -
[v5.0.0] CI
live-hfjob. Always-on reranker preset-path validation on every PR via a new.github/actions/cache-hf-fastembedcomposite action that caches~/.cache/huggingface/hubkeyed on the catalog literal — any preset path change triggers a fresh download so we catch drift the moment it happens. Tests cover all four presets plus aPreset/Custom-equivalent crosscheck andtop_ktruncation.
-
[v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (
Xenova/ms-marco-MiniLM-L-6-v2,Xenova/bge-reranker-base,Xenova/bge-reranker-large,BAAI/bge-reranker-v2-m3) are removed. The hand-curatedmodel_filepaths were not verified against HF and would have 404'd at runtime. Three of four upstream paths were wrong. Callers using the friendly aliases (fast/balanced/quality/multilingual) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically needms-marco-MiniLMorbge-reranker-largecan pass them viaCustom { model_id, ... }or a registeredPluginbackend. -
[v5.0.0] BREAKING:
RerankerModelType::Customis no longer a two-field tuple. Exhaustive Rust matches onCustom { model_id, max_length }need to addmodel_fileandadditional_files. Serde defaults keep existing TOML / JSON configs valid without migration.
-
rendering: fixed panic when a non-
Itemblock element appears directly under aListnode before anyListItem. The comrak AST builder now synthesises an implicitItemwrapper instead of falling back onto the bareList, which violated CommonMark'sList → Item-onlyconstraint and panicked in debug builds. (#1096) -
pdf:
result.pages[*].isBlanknow reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existingPageContententries,is_blankwas never recalculated — it retained the stale value from native text extraction, which is alwaysSome(true)for pages with no text layer. All four write sites in the OCR page-assembly block now callis_page_text_blankafter every content mutation. (#1095) -
reranker:
RerankErrormigrated tothiserror. Matches the rest of the library andrust-conventions. -
reranker:
shutdown_allnow best-effort. Continues invokingshutdown()on every backend even after one fails, returns the first error, drops subsequent ones (logged atwarn). Previously stopped on the first failure, leaving the registry in a half-shutdown state. -
reranker: synchronous rerank() returns a clear error instead of panicking on a current-thread Tokio runtime. block_in_place requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect RuntimeFlavor::CurrentThread and ask the caller to use rerank_async() or build a multi-thread runtime. -
reranker: stronger sigmoid coverage in tests/api_rerank.rs. The happy-path test now uses mixed-sign logits (-2.0, 3.0, 0.5) and asserts both the sigmoid output range [0, 1] and the sign-→-side property, so silently dropping the sigmoid would break the test. -
reranker: engine tests share the production sigmoid_f32 instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
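
The sigmoid checks described in the reranker test notes can be sketched as follows. This is an illustration under assumptions, not the code in tests/api_rerank.rs: the body of `sigmoid_f32` is the textbook logistic function standing in for the production one, `scores_look_sigmoided` is a hypothetical helper, and the "sign → side" property is read here as "a positive logit lands above 0.5, a negative one below".

```rust
/// Stand-in for the production `sigmoid_f32`; the real body is an assumption.
fn sigmoid_f32(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical helper: true only if every score is in [0, 1] and sits on
/// the side of 0.5 that matches the sign of its logit.
fn scores_look_sigmoided(logits: &[f32], scores: &[f32]) -> bool {
    logits.len() == scores.len()
        && logits.iter().zip(scores).all(|(&logit, &score)| {
            let in_range = (0.0..=1.0).contains(&score);
            let sign_matches_side = (logit > 0.0) == (score > 0.5);
            in_range && sign_matches_side
        })
}

/// Mixed-sign logits from the changelog entry.
fn mixed_sign_logits() -> [f32; 3] {
    [-2.0, 3.0, 0.5]
}
```

With these logits, passing the raw values through unchanged fails the range check (-2.0 and 3.0 both fall outside [0, 1]), which is why mixed-sign inputs catch a dropped sigmoid where all-small-positive inputs might not.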
-
pdf: table extraction failures are now visible at warn log level. extract_tables_native and extract_tables_bordered silently caught pdf_oxide::extract_tables_with_config errors at tracing::debug!, making per-page failures invisible at the default log level. Promoted to tracing::warn! to match the existing behaviour of the TATR and SLANeXT inference paths. The three unwrap_or_default() call sites in extraction.rs that silently swallowed function-level errors are also replaced with unwrap_or_else(|e| { tracing::warn!(...); Vec::new() }) so that a page_count() failure is equally visible. (#1097) -
publish.yaml trigger-pubdev job: explicit permissions: actions: write. Since the a8f8597e45 migration to the kreuzberg-dev-publisher App-token, the gh workflow run publish-pubdev.yaml step has 403'd with "Resource not accessible by integration" because the App's installation token didn't carry actions: write. Adding job-level permissions: { actions: write, contents: read } covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need actions: write configured on github.com.
- Root Taskfile now includes the test-apps task namespace. Added test-apps to the root Taskfile.yml includes block with proper {{.ROOT_DIR}} path resolution. Smoke and comprehensive test tasks for all 11 languages are now accessible via task test-apps:smoke:*.
-
swift e2e: removed erroneous async = false override on extract_file for swift. Kreuzberg.extractFile(_:_:_:) is async in the Swift binding. The override in alef.toml forced is_async = false for fixtures that explicitly set "call": "extract_file" (e.g. api_batch_bytes_async), generating non-async test methods that called the async binding without await, which produced compile errors. Fixtures without an explicit call fall through resolve_call_for_fixture to the global default and got is_async = true correctly, which is why testApiExtractFileAsync compiled but testApiBatchBytesAsync and 4 siblings did not. Dropping the override aligns both code paths. -
r: fix macOS dylib rpath so ORT loads at R extension runtime. packages/r/src/rust/build.rs now adds the -Wl,-rpath,@loader_path linker flag on macOS, enabling the final R extension .so to locate transitively-linked dylibs like libonnxruntime.dylib at load time. Without this, R's dyn.load via library.dynam2 failed with undefined symbol: OrtGetApiBase in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791. -
Publish Release WASM job now non-blocking via continue-on-error. The Build WASM package job consistently hits GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on ubuntu-latest (8GB RAM). Runner preemption ("runner shutdown signal") terminates the job before the timeout expires. Set continue-on-error: true so the Publish-Final job proceeds regardless of the WASM build outcome. A deeper fix (runner upsizing or two-stage build) is deferred to rc.11+. The WASM package still publishes if the npm cache was warm from a previous rc or if the parallel publish-wasm job succeeded. -
FormatMetadata::Code now serializes correctly. The #[serde(skip)] annotation on the Code variant caused serde_json::to_string (and every *_to_json FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation; the inner CodeMetadataInner(ProcessResult) wrapper already derives Serialize / Deserialize via the upstream serde feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction. -
Python sdist publish step now uses split-layout invocation. .github/workflows/publish.yaml passed manifest-path: crates/kreuzberg-py/Cargo.toml to build-python-sdist@v1. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs maturin sdist, but the kreuzberg layout keeps pyproject.toml in packages/python/, so maturin failed with Failed to build source distribution, pyproject.toml not found. PyPI publish was then skipped for rc.10. Dropped the manifest-path input so the action falls through to the default package-dir: packages/python split-layout fallback, which cd's into the package dir and lets maturin resolve manifest-path from pyproject.toml's [tool.maturin] section itself. -
Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking kreuzberg_php.dll / kreuzberg_nif.dll on x86_64-pc-windows-msvc failed with LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease. libkreuzberg_tesseract.rlib is built by cmake-rs which defaults to /MD; libesaxx_rs.rlib (transitively pulled in by gliner → tokenizers → esaxx-rs) is built by cc-rs which fell back to /MT. alef.toml [crates.scaffold.cargo.env] now sets CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD" and the matching CXXFLAGS_*, propagated to .cargo/config.toml [env]. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. The same fix unblocks the Elixir NIF Windows build (recurring failure since rc.7). -
captioning: captioning was a no-op for all image paths because CaptioningProcessor never received image bytes. Two root causes: (1) ImageExtractor built extracted_image on the OCR path but passed None to build_image_internal_document, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on config.images.extract_images, so setting only config.captioning left them with empty data. Fix: add ExtractionConfig::needs_image_data(), which is true when images.extract_images or captioning is set, and use it in every extractor image gate and in needs_image_processing(). Also emits a ProcessingWarning when captioning is configured but result.images is None. (#732)
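
The gate described in the captioning fix can be sketched roughly as below. The struct shapes are assumptions made so the example compiles on its own: `ImageExtractionConfig`, `CaptioningConfig`, and the use of `Option` for both config fields are hypothetical, and only the names `ExtractionConfig`, `needs_image_data`, `images`, `extract_images`, and `captioning` come from the entry above.

```rust
// Hypothetical config shapes; the real structs carry more fields.
#[derive(Default)]
struct ImageExtractionConfig {
    extract_images: bool,
}

#[derive(Default)]
struct CaptioningConfig;

#[derive(Default)]
struct ExtractionConfig {
    images: Option<ImageExtractionConfig>,
    captioning: Option<CaptioningConfig>,
}

impl ExtractionConfig {
    /// True when either feature needs raw image bytes from the extractor:
    /// explicit image extraction, or captioning (which consumes the bytes).
    fn needs_image_data(&self) -> bool {
        let wants_images = self
            .images
            .as_ref()
            .map_or(false, |images| images.extract_images);
        wants_images || self.captioning.is_some()
    }
}
```

Routing every extractor's image gate through one predicate means a new consumer of image bytes only has to be added in one place, rather than in each of the DOCX, PDF, PPTX, HTML, and Markdown extractors.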