kreuzberg-dev/kreuzberg
2 days ago
kreuzberg

v5.0.0-rc.14

Fixed

  • Publish: unblock minimumReleaseAge supply-chain gate. Set minimumReleaseAge: 0 in pnpm-workspace.yaml so first-party @kreuzberg/* platform packages are no longer rejected by pnpm's default 24h supply-chain delay during the publish workflow's Build Node bindings stage.
  • Publish: version drift in root manifests. Bumped package.json (root) and crates/kreuzberg-py/src/pyproject.toml from rc.12 to rc.14; previously missed by alef sync-versions.
  • Mobile/ARM build: Dart binding crate now compiles when upstream variants are cfg-gated. Bumps the alef pin to v0.25.8, which fixes both E0004 (non-exhaustive From-impl matches) and E0599 (mirror enum cfg-gated variants vs frb-generated unconditional refs).

See CHANGELOG for details.

2 days ago
kreuzberg

v5.0.0-rc.13

SVG support (new)

  • ImageOutputFormat::Svg variant (wire tag "svg"). Gated by new svg Cargo feature; included in no-ort-target, formats, full. Ships in WASM + Android.
  • SvgOptions { sanitize: bool, render_dpi: f32 } — defaults: sanitize = true, render_dpi = 96.0 (clamped 1.0–600.0).
  • SVG → PNG / JPEG / WebP / HEIF rasterization via resvg + usvg + tiny-skia.
  • SVG → SVG sanitize on Native target strips <script>, external xlink:href/href, <foreignObject>, and JS event handlers.
  • Raster → SVG returns new EncodeWarning::UnsupportedDirection. No auto-vectorization.
  • Security caps: input ≤ 10 MB, render output ≤ 16384² pixels (~1 GB peak), render_dpi clamped. usvg image_href_resolver no-op blocks SSRF / filesystem reads.
  • Sync pipeline path now also applies the image output format pass (closes WASM bypass).
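
The defaults and caps listed above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the field names and numeric limits come from the notes, while `effective_dpi`, `within_caps`, the reading of "10 MB" as 10 MiB, and the fallback for non-finite DPI values are assumptions made for the example.

```rust
/// Mirror of the `SvgOptions` shape described in the notes (sketch only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SvgOptions {
    pub sanitize: bool,
    pub render_dpi: f32,
}

impl Default for SvgOptions {
    fn default() -> Self {
        // Defaults as stated in the notes.
        Self { sanitize: true, render_dpi: 96.0 }
    }
}

// Limits taken from the notes; treating "10 MB" as 10 MiB is an assumption.
const MIN_DPI: f32 = 1.0;
const MAX_DPI: f32 = 600.0;
const MAX_INPUT_BYTES: usize = 10 * 1024 * 1024;
const MAX_RENDER_PIXELS: u64 = 16_384 * 16_384;

impl SvgOptions {
    /// Hypothetical helper: the DPI that would be used after clamping.
    /// Falling back to the default for NaN or infinity is this sketch's choice.
    pub fn effective_dpi(&self) -> f32 {
        if self.render_dpi.is_finite() {
            self.render_dpi.clamp(MIN_DPI, MAX_DPI)
        } else {
            Self::default().render_dpi
        }
    }
}

/// Hypothetical helper: checks the input-size and output-pixel caps.
pub fn within_caps(input_len: usize, width: u32, height: u32) -> bool {
    let pixels = u64::from(width) * u64::from(height);
    input_len <= MAX_INPUT_BYTES && pixels <= MAX_RENDER_PIXELS
}
```

At 4 bytes per pixel, 16384 × 16384 pixels is exactly 1 GiB, which matches the "~1 GB peak" figure in the notes.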

CI / publish fixes

  • EncoderUnavailable gated on heic (silences -D warnings dead-code on 9+ publish builds).
  • Three Dockerfile.musl-* builds: cargo build --locked → --offline so the sed-trimmed manifest reconciles with the lock (unblocks 4 musl native builds).
  • ci-e2e: prepend /usr/local/lib to LD_LIBRARY_PATH so the source-built libheif 1.23 takes precedence over apt's older version on the Zig job.
  • Elixir + Ruby NIF Cargo.toml: pin alloc-stdlib = "=0.2.2" (brotli 8 trait drift fix).

alef pin 0.25.2 → 0.25.5

  • 0.25.3: R extendr struct field escaping (pub r#type for serde-tagged enums).
  • 0.25.4: Dart per-target feature #[cfg(feature = "X")] guards on enum variants (mobile Android/iOS cargo check now sees the correct ImageOutputFormat variant set).
  • 0.25.5: trait-bridge adapter regression coverage across all 14 language bindings; before-hooks documentation and test.
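
The per-feature guards mentioned for 0.25.4 can be illustrated as follows. This is a hedged sketch of the general pattern, not alef's generated code: `MirrorImageOutputFormat` is a hypothetical name for a binding-side mirror enum, and only three variants are shown.

```rust
/// Upstream enum: the `Svg` variant exists only when the `svg` feature is on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageOutputFormat {
    Native,
    Png,
    #[cfg(feature = "svg")]
    Svg,
}

/// Hypothetical binding-side mirror; it carries the same guard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MirrorImageOutputFormat {
    Native,
    Png,
    #[cfg(feature = "svg")]
    Svg,
}

impl From<ImageOutputFormat> for MirrorImageOutputFormat {
    fn from(value: ImageOutputFormat) -> Self {
        match value {
            ImageOutputFormat::Native => Self::Native,
            ImageOutputFormat::Png => Self::Png,
            // The arm is guarded exactly like the variant it names.
            #[cfg(feature = "svg")]
            ImageOutputFormat::Svg => Self::Svg,
        }
    }
}
```

If the arm were unguarded, a build without the feature would reference a variant that does not exist (E0599). If the arm were missing, a build with the feature would have a non-exhaustive match (E0004).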

Other

  • kreuzberg-libheif bootstrapped on crates.io.
2 days ago
kreuzberg

v5.0.0-rc.12

Release candidate 12 — full retry @ 33a76190cb. Adds image-output normalization (ImageOutputFormat::{Native, Png, Jpeg, Webp, Heif}) with regen pipeline pass, bumps tslp to 1.9.0-rc.44 (parsers.json present), alef 0.25.2 with swift accessor + assertion fixes, brotli/alloc-stdlib unpin via cargo-stdlib 0.2.2, R Linux $ORIGIN rpath, libnuma-dev for ORT aarch64, and --locked on every cargo invocation. Three lockfiles previously gitignored are now tracked. See CHANGELOG.md for full details.

2 days ago
kreuzberg

v5.0.0-rc.12

Added

  • New windows-target aggregate feature in crates/kreuzberg/Cargo.toml. Mirrors the curated FFI-on-Windows list the publish workflow already used and drops heic along with the ORT-dependent capabilities (paddle-ocr, layout-detection, embeddings, reranker, ner-llm). The FFI and Dart Rust crates pick it up via per-crate [[crates.<x>.target_dep_overrides]] cfg = 'target_os = "windows"' blocks in alef.toml. The pyo3 / napi-rs / magnus / ext-php-rs / rustler / swift-bridge / jni scaffolders do not yet honor target_dep_overrides, so the python/node/ruby/php/elixir/swift/kotlin_android Windows wheels still need an alef-side scaffolder fix to drop heic on Windows. The alef.toml entries for those crates are no-ops today but document the intent; they activate once the upstream scaffolders pick up the override block.

Fixed

  • CI / publish: source-build libheif 1.23.0 to satisfy libheif-sys 5.3 >= 1.21. Ubuntu Noble's apt ships libheif 1.17.6 and Alpine 3.21 ships libheif 1.19.5; both are rejected by libheif-sys with Package 'libheif' has version '1.17.6', required version is '>= 1.21'. Without a fix, every Linux build job (Build FFI matrix, CLI binaries, Go FFI, C# natives, Zig package, Python wheels, C FFI distribution, Java natives, Kotlin Android, Dart, Swift, manylinux_2_28 wheels) failed at the libheif-sys build script. Four coordinated fixes:

    • scripts/ci/install-system-deps/install-linux.sh drops apt libheif-dev, installs libde265-dev libaom-dev libx265-dev libdav1d-dev, then downloads + builds + installs libheif 1.23.0 via cmake into /usr/local, exporting PKG_CONFIG_PATH + LD_LIBRARY_PATH via GITHUB_ENV. Cached via a new cache-libheif-linux step in install-system-deps/action.yml.
    • docker/Dockerfile.musl-{build,ffi,rustler} now pull libheif-dev and codec headers from alpine/edge/community + edge/main (libheif 1.23.0) instead of Alpine 3.21 main.
    • kreuzberg-dev/actions/build-python-wheels@v1.8.64 source-builds libheif inside manylinux_2_28 via CIBW_BEFORE_ALL_LINUX (AlmaLinux 8 + EPEL ships codec subsets only — we install what's available and let libheif compile without missing codec features).
    • python-wheels job in publish.yaml now runs install-system-deps before build-python-wheels so Windows wheels pick up vcpkg-built libheif.
  • Publish workflow: fixed dead-code warnings as errors on Windows / reranker-presets-only builds. rerank_via_llm and extract_rerank_usage in crates/kreuzberg/src/llm/rerank.rs were gated by any(feature = "reranker-presets", feature = "reranker"), but every caller was gated by feature = "reranker". On the Windows feature combo (which uses reranker-presets + liter-llm without reranker), the functions compiled but had no callers — clippy -D warnings failed every Windows binding build (C#, Go, Java, CLI, Node, C FFI). Tightened the cfg to feature = "reranker" on both functions, the related use statements, and the test module so the function tracks its callers exactly. Also removed the orphaned ContentLayer::is_default helper (made dead by the rc.11 removal of #[serde(skip_serializing_if = ...)] on content_layer).

  • Publish workflow: install libheif on Linux + macOS runners. libheif-sys is a transitive dependency of kreuzberg-libheif (gated behind the heic feature, included in full), and the publish workflow's per-language build jobs (Swift / Zig / C-FFI / C# / Go / Java / Kotlin-Android / CLI / Node / Dart / Python wheels) invoked kreuzberg-dev/actions/build-* without first installing system dependencies. libheif-sys's pkg-config probe then failed with Package libheif was not found in the pkg-config search path. Added libheif-dev to scripts/ci/install-system-deps/install-linux.sh, brew install libheif to install-macos.sh, and injected a ./.github/actions/install-system-deps step before every build-* action invocation in .github/workflows/publish.yaml (17 sites + the publish-crates job, since kreuzberg-libheif is also published from that job).

  • Publish workflow: publish kreuzberg-libheif to crates.io. The new path-only kreuzberg-libheif crate is a build dependency of kreuzberg via the heic feature. Without publishing it first, cargo publish -p kreuzberg aborts with no matching package named 'kreuzberg-libheif' found (location searched: crates.io index). Added it as the first entry in the publish-crates invocation's crates: list so it lands on crates.io before kreuzberg.

  • Android (any arch) + iOS: widened libheif/ORT exclusion gate. The C FFI crate's target_dep_overrides previously narrowed the android-target feature set to cfg(all(target_os = "android", target_arch = "x86_64")), on the assumption that aarch64-linux-android could use the full ORT-enabled set via pyke prebuilts. In practice libheif-sys still blocked the cross-compile on the NDK for both architectures, and iOS hit the same wall. Widened the cfg to target_os = "android" (covers both arches) and added a parallel target_os = "ios" override that also routes to android-target. Same treatment applied to the Dart Rust crate (packages/dart/rust/Cargo.toml).

  • Docs strict build: broken anchor in docs/concepts/reranking.md:10. The opening "Bi-encoders vs cross-encoders" paragraph linked [Embeddings](architecture.md#embeddings), but architecture.md has no ## Embeddings heading — task docs:build:strict aborted with exit code 1. Reworded the parenthetical to drop the link target.

  • musl Docker builds: include kreuzberg-libheif crate + install libheif-dev. The three docker/Dockerfile.musl-{ffi,rustler,build} images each copy a subset of workspace crates into the build context; the new kreuzberg-libheif crate (required transitively when kreuzberg is built with the full features that include heic) was missing, so cargo aborted with failed to read /build/crates/kreuzberg-libheif/Cargo.toml. Added the COPY entry to each Dockerfile and added libheif-dev to the apk add list so libheif-sys's pkg-config probe finds Alpine's package.

  • CI Lint: install system dependencies before running clippy. ci-lint.yaml set up the Rust toolchain but never called ./.github/actions/install-system-deps, so clippy on the Linux runner failed to find libheif.pc. Added the install-system-deps step right after setup-rust.

  • Windows binding wheels: install libheif via vcpkg. libheif-sys's build.rs uses vcpkg::Config::new().find_package("libheif") on Windows MSVC. The windows-latest runner image ships vcpkg pre-installed at C:\vcpkg but the libheif port is not. Added a vcpkg install step (vcpkg install libheif:x64-windows-static-md) to scripts/ci/install-system-deps/install-windows.ps1, exposed VCPKG_ROOT to the build environment, and added an actions/cache entry on C:\vcpkg\installed\x64-windows-static-md so subsequent runs hit the cache (~10s) instead of the cold ~20-min vcpkg build. Unblocks Python / Node / Ruby / PHP / Elixir / Swift / JNI Windows wheels which still use the full feature set because their alef scaffolders do not yet honor target_dep_overrides.

  • Elixir cargo cc version conflict. The native Rustler NIF lockfile at packages/elixir/native/kreuzberg_nif/Cargo.lock pinned cc 1.2.63, while kreuzberg-tesseract requires cc ^1.2.64. Cargo failed to resolve. Ran cargo update -p cc in that sub-project to lift it to 1.2.64; this also picked up html-to-markdown-rs 3.6.2 and pdf_oxide 0.3.64 in both lockfiles to keep them in sync.

  • Rust e2e generator: enum variant accessors emit method-call syntax. The fixture path metadata.format.excel.sheet_count resolved against FormatMetadata (a tagged enum with a pub fn excel(&self) -> Option<&ExcelMetadata> convenience accessor). The Rust e2e renderer was emitting result.metadata.format.as_ref().unwrap().excel.as_ref().unwrap().sheet_count, which is a compile error because FormatMetadata has no excel field. Added "metadata.format.excel" to fields_method_calls in alef.toml so the Rust renderer (and the Zig one) appends () after the segment instead of dot-accessing it as a field.

  • Rust e2e generator: not_empty on String leaves the leaf concrete. Fixed in alef v0.25.0 (src/e2e/codegen/rust/assertion_helpers.rs). When a path like summary.text crosses an Option<Summary> parent on the way down, the resolver registers the path as optional; the renderer's is_opt branch previously emitted accessor.is_some(), but the accessor already auto-unwrapped the parent (...summary.as_ref().unwrap().text), so the final expression has type String and .is_some() is a compile error. The renderer now checks for the trailing .as_ref().unwrap(). marker and emits !accessor.is_empty() for the concrete-leaf case while preserving .is_some() for true Option<T> leaves.

  • Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (src/e2e/field_access/resolver.rs::rust_unwrap_binding). The pre-pass that lifts string-equals assertions into local let _name = result.<path>.as_ref().map(...).unwrap_or_default(); bindings called the basic render_accessor instead of render_rust_with_optionals, so a path like summary.strategy emitted result.summary.strategy — a compile error because Option<Summary> has no strategy field. Switched the binding to render_rust_with_optionals so intermediate optional segments produce .as_ref().unwrap().

  • Removed summary.text and summary.strategy from fields_optional. They are not optional on DocumentSummary (pub text: String, pub strategy: SummaryStrategy); only the parent summary: Option<DocumentSummary> is. Listing the leaves as optional caused the e2e renderer to emit .as_ref() and .as_deref() against concrete types. The parent stays in the set.

  • SummaryStrategy now implements Display matching the snake-case serde wire form (extractive / abstractive). The Rust e2e renderer's string-equality let-binding pre-pass needs to_string() on the enum value to compare against the fixture's literal string. Added a minimal Display impl.

  • Dead-code is_heif_container and extract_exif_data under reranker-only builds. The Live HF preset tests CI job builds kreuzberg with --features "reranker,reranker-presets,tokio-runtime". The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (in extraction::image and extractors::image) is gated out — clippy -D warnings then surfaced both functions as dead_code. Added #[allow(dead_code)] to both definitions with comments documenting the unconditional-compile intent.

  • text::classification::classify_text stub added. The real implementation lives behind the classification feature; alef-generated bindings call kreuzberg::text::classification::classify_text unconditionally, so the existing stub module needed a matching no-op for the path that returns Err(KreuzbergError::Other("classification feature not available on this target")). Required for the iOS / Android android-target builds, which drop the classification feature.

  • LlmBackend and GlineBackend stubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 / android+x86_64 triple cfg. With the binding crates now widening their target gates to target_os = "android" and target_os = "ios" (both arches), aarch64-apple-ios and aarch64-linux-android were failing to compile against kreuzberg::LlmBackend / kreuzberg::GlineBackend. Simplified the gate to #[cfg(not(feature = "ner-llm"))] and #[cfg(not(feature = "ner-onnx"))] respectively — any config that drops the feature now gets the stub regardless of target.

  • Swift Rust crate now honours target_dep_overrides. The Swift cargo emitter (alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml) called crate::scaffold::render_core_dep directly, ignoring the [[crates.swift.target_dep_overrides]] block. Extended SwiftConfig with a target_dep_overrides: Vec<SwiftTargetDepOverride> field (mirrors DartTargetDepOverride) and refactored emit_cargo_toml to emit [target.'cfg(not(any(...)))'.dependencies] + per-override [target.'cfg(...)'.dependencies] blocks when overrides are present, matching the FFI and Dart patterns. With alef.toml now declaring iOS, Android, and Windows overrides for [crates.swift], packages/swift/rust/Cargo.toml correctly routes iOS to android-target, Android to android-target, Windows to windows-target, and the default (macOS host) to the full feature set. Same target_dep_overrides config gap exists for python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
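
The e2e accessor fixes and the SummaryStrategy Display change above fit together, and a small sketch may make the types easier to follow. The struct fields and the two wire strings come from the notes. `ExtractionResult`, `text_not_empty`, and `strategy_binding` are hypothetical names, and the generated test code may well look different.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SummaryStrategy {
    Extractive,
    Abstractive,
}

impl fmt::Display for SummaryStrategy {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Snake-case wire form, as described in the notes.
        f.write_str(match self {
            Self::Extractive => "extractive",
            Self::Abstractive => "abstractive",
        })
    }
}

/// Leaves are concrete; only the parent field is optional.
pub struct DocumentSummary {
    pub text: String,
    pub strategy: SummaryStrategy,
}

/// Hypothetical stand-in for the extraction result type.
pub struct ExtractionResult {
    pub summary: Option<DocumentSummary>,
}

/// `not_empty` on a concrete leaf: the parent is unwrapped first, so the
/// expression has type `String` and `.is_some()` would not compile.
/// Like the generated accessor, this panics when `summary` is `None`.
pub fn text_not_empty(result: &ExtractionResult) -> bool {
    !result.summary.as_ref().unwrap().text.is_empty()
}

/// String-equality pre-binding: goes through the optional parent and
/// relies on `Display` to obtain a comparable string.
pub fn strategy_binding(result: &ExtractionResult) -> String {
    result
        .summary
        .as_ref()
        .map(|s| s.strategy.to_string())
        .unwrap_or_default()
}
```

Writing `result.summary.strategy` directly would fail to compile because `Option<DocumentSummary>` has no such field, which is the error the notes describe.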

Added

  • list_supported_formats() is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (kreuzberg formats), REST API (GET /formats), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091)

  • [v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level rerank / rerank_async API, RerankerConfig with Preset/Custom/Llm/Plugin variants, RerankerBackend plugin trait + registry, POST /rerank HTTP endpoint, and per-language bindings via alef. Gated behind the new reranker + reranker-presets Cargo features; reranker-presets is WASM/Android-safe.

  • [v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries: bge-reranker-base (BAAI/bge-reranker-base, EN+ZH), bge-reranker-v2-m3 (rozgo/bge-reranker-v2-m3 with the required model.onnx.data sibling, multilingual), jina-reranker-v1-turbo-en, and jina-reranker-v2-base-multilingual. Friendly aliases fast / balanced / quality / multilingual resolve to catalog entries, so existing call sites keep working.

  • [v5.0.0] RerankerPreset + RerankerModelType::Custom gained additional_files: Vec<String>. Lets multi-blob ONNX exports (notably rozgo/bge-reranker-v2-m3, which splits weights into model.onnx + model.onnx.data) actually load.

  • [v5.0.0] RerankerModelType::Custom gained model_file: Option<String>. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to "onnx/model.onnx" when omitted.

  • [v5.0.0] CI live-hf job. Always-on reranker preset-path validation on every PR via a new .github/actions/cache-hf-fastembed composite action that caches ~/.cache/huggingface/hub keyed on the catalog literal — any preset path change triggers a fresh download so we catch drift the moment it happens. Tests cover all four presets plus a Preset / Custom-equivalent crosscheck and top_k truncation.
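
How model_file and additional_files might combine when resolving which files to fetch is sketched below. Only the two field names and the "onnx/model.onnx" default come from the notes. The `CustomReranker` struct, the `max_length` type, and `files_to_fetch` are assumptions for illustration, and the real loader may resolve paths differently.

```rust
/// Hypothetical flattened view of a custom reranker configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct CustomReranker {
    pub model_id: String,
    pub max_length: usize, // type assumed for the sketch
    pub model_file: Option<String>,
    pub additional_files: Vec<String>,
}

/// Default stated in the notes for an omitted `model_file`.
const DEFAULT_MODEL_FILE: &str = "onnx/model.onnx";

impl CustomReranker {
    /// Hypothetical helper: the primary model file first, then any
    /// sibling blobs that have to sit next to it.
    pub fn files_to_fetch(&self) -> Vec<String> {
        let primary = self
            .model_file
            .clone()
            .unwrap_or_else(|| DEFAULT_MODEL_FILE.to_string());
        let mut files = vec![primary];
        files.extend(self.additional_files.iter().cloned());
        files
    }
}
```

A split export would list its external-data blob in `additional_files`, while a single-file model leaves the list empty and falls back to the default path.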

Changed

  • [v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (Xenova/ms-marco-MiniLM-L-6-v2, Xenova/bge-reranker-base, Xenova/bge-reranker-large, BAAI/bge-reranker-v2-m3) are removed. The hand-curated model_file paths were not verified against HF and would have 404'd at runtime. Three of four upstream paths were wrong. Callers using the friendly aliases (fast / balanced / quality / multilingual) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically need ms-marco-MiniLM or bge-reranker-large can pass them via Custom { model_id, ... } or a registered Plugin backend.

  • [v5.0.0] BREAKING: RerankerModelType::Custom is no longer a two-field variant. Exhaustive Rust matches on Custom { model_id, max_length } need to add model_file and additional_files (or a trailing ..). Serde defaults keep existing TOML / JSON configs valid without migration.
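
For call sites affected by the Custom change, one possible migration is a rest pattern. The enum below is a cut-down, hypothetical stand-in: the `Preset { name }` variant and the `max_length` type are assumptions, and the real type may have other variants.

```rust
/// Hypothetical, reduced version of the enum for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum RerankerModelType {
    Preset { name: String },
    Custom {
        model_id: String,
        max_length: usize,
        model_file: Option<String>,
        additional_files: Vec<String>,
    },
}

pub fn describe(model: &RerankerModelType) -> String {
    match model {
        RerankerModelType::Preset { name } => format!("preset:{name}"),
        // `..` skips model_file and additional_files, so this arm keeps
        // compiling if more fields are added later.
        RerankerModelType::Custom { model_id, max_length, .. } => {
            format!("custom:{model_id} (max_length={max_length})")
        }
    }
}
```

Code that needs the new fields can name them explicitly instead of using `..`.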

Fixed

  • rendering: fixed panic when a non-Item block element appears directly under a List node before any ListItem. The comrak AST builder now synthesises an implicit Item wrapper instead of falling back onto the bare List, which violated CommonMark's List → Item-only constraint and panicked in debug builds. (#1096)

  • pdf: result.pages[*].isBlank now reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existing PageContent entries, is_blank was never recalculated — it retained the stale value from native text extraction, which is always Some(true) for pages with no text layer. All four write sites in the OCR page-assembly block now call is_page_text_blank after every content mutation. (#1095)

  • reranker: RerankError migrated to thiserror. Matches the rest of the library and rust-conventions.

  • reranker: shutdown_all now best-effort. Continues invoking shutdown() on every backend even after one fails, returns the first error, drops subsequent ones (logged at warn). Previously stopped on the first failure, leaving the registry in a half-shutdown state.

  • reranker: synchronous rerank() returns a clear error instead of panicking on a current-thread Tokio runtime. block_in_place requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect RuntimeFlavor::CurrentThread and ask the caller to use rerank_async() or build a multi-thread runtime.

  • reranker: stronger sigmoid coverage in tests/api_rerank.rs. Happy-path test now uses mixed-sign logits (-2.0, 3.0, 0.5) and asserts both the sigmoid output range [0, 1] and the sign-→-side property — silently dropping the sigmoid would break the test.

  • reranker: engine tests share the production sigmoid_f32 instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
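
The best-effort shutdown_all behaviour described above can be sketched with plain trait objects. This is an illustration under assumptions, not the registry's code: the trait shape, the `String` error type, and `DemoBackend` are invented for the example, and `eprintln!` stands in for warn-level logging.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical, simplified backend trait.
pub trait RerankerBackend {
    fn name(&self) -> &str;
    fn shutdown(&mut self) -> Result<(), String>;
}

/// Calls `shutdown()` on every backend, returns the first error, and only
/// logs the later ones instead of stopping early.
pub fn shutdown_all(backends: &mut [Box<dyn RerankerBackend>]) -> Result<(), String> {
    let mut first_error: Option<String> = None;
    for backend in backends.iter_mut() {
        if let Err(err) = backend.shutdown() {
            if first_error.is_none() {
                first_error = Some(err);
            } else {
                // Stand-in for warn-level logging.
                eprintln!("warn: {} also failed to shut down: {err}", backend.name());
            }
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}

/// Demo backend that counts shutdown calls through a shared counter.
pub struct DemoBackend {
    pub name: &'static str,
    pub fail: bool,
    pub calls: Rc<Cell<usize>>,
}

impl RerankerBackend for DemoBackend {
    fn name(&self) -> &str {
        self.name
    }

    fn shutdown(&mut self) -> Result<(), String> {
        self.calls.set(self.calls.get() + 1);
        if self.fail {
            Err(format!("{} failed", self.name))
        } else {
            Ok(())
        }
    }
}
```

With three demo backends where the first and third fail, all three are still asked to shut down and the caller sees the first backend's error.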

Fixed

  • pdf: table extraction failures are now visible at warn log level. extract_tables_native and extract_tables_bordered silently caught pdf_oxide::extract_tables_with_config errors at tracing::debug!, making per-page failures invisible at the default log level. Promoted to tracing::warn! to match the existing behaviour of the TATR and SLANeXT inference paths. The three unwrap_or_default() call sites in extraction.rs that silently swallowed function-level errors are also replaced with unwrap_or_else(|e| { tracing::warn!(...); Vec::new() }) so that a page_count() failure is equally visible. (#1097)

  • publish.yaml trigger-pubdev job: explicit permissions: actions: write. Since the a8f8597e45 migration to the kreuzberg-dev-publisher App-token, the gh workflow run publish-pubdev.yaml step has 403'd with "Resource not accessible by integration" — the App's installation token didn't carry actions: write. Adding job-level permissions: { actions: write, contents: read } covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need actions: write configured on github.com.
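
The table-extraction change above swaps a silent unwrap_or_default() for a fallback that logs first. A minimal sketch of that pattern follows. `Table`, `tables_or_warn`, and the `String` error type are hypothetical, and `eprintln!` stands in for `tracing::warn!` so that the example needs no external crates.

```rust
/// Hypothetical placeholder for an extracted table.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub page: usize,
}

/// Returns the extracted tables, or an empty list after reporting the error.
/// `unwrap_or_default()` would produce the same empty list but say nothing.
pub fn tables_or_warn(page: usize, outcome: Result<Vec<Table>, String>) -> Vec<Table> {
    outcome.unwrap_or_else(|e| {
        // Stand-in for tracing::warn!(...).
        eprintln!("warn: table extraction failed on page {page}: {e}");
        Vec::new()
    })
}
```

The return value is unchanged for callers; only the visibility of the failure differs.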

Changed

  • Root Taskfile now includes test-apps task namespace. Added test-apps to the root Taskfile.yml includes block with proper {{.ROOT_DIR}} path resolution. Smoke and comprehensive test tasks for all 11 languages now accessible via task test-apps:smoke:*.

Fixed

  • swift e2e: removed erroneous async = false override on extract_file for swift. Kreuzberg.extractFile(_:_:_:) is async in the Swift binding. The override in alef.toml forced is_async = false for fixtures that explicitly set "call": "extract_file" (e.g. api_batch_bytes_async), generating non-async test methods that called the async binding without await — compile errors. Fixtures without an explicit call fall through resolve_call_for_fixture to the global default and got is_async = true correctly, which is why testApiExtractFileAsync compiled but testApiBatchBytesAsync and 4 siblings did not. Dropping the override aligns both code paths.

  • r: fix macOS dylib rpath so ORT loads at R extension runtime. packages/r/src/rust/build.rs now adds -Wl,-rpath,@loader_path linker flag on macOS, enabling the final R extension .so to locate transitively-linked dylibs like libonnxruntime.dylib at load time. Without this, R's dyn.load via library.dynam2 failed with undefined symbol: OrtGetApiBase in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791.

  • Publish Release WASM job now non-blocking via continue-on-error. Build WASM package job consistently hits GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on ubuntu-latest (8GB RAM). Runner preemption ("runner shutdown signal") terminates the job before timeout expires. Set continue-on-error: true so Publish-Final job proceeds regardless of WASM build outcome. Deeper fix (runner upsizing or two-stage build) deferred to rc.11+. WASM package still publishes if npm cache was warm from previous rc or from parallel publish-wasm job success.

  • FormatMetadata::Code now serializes correctly. The #[serde(skip)] annotation on the Code variant caused serde_json::to_string (and every *_to_json FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation — the inner CodeMetadataInner(ProcessResult) wrapper already derives Serialize/Deserialize via the upstream serde feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction.

  • Python sdist publish step now uses split-layout invocation. .github/workflows/publish.yaml passed manifest-path: crates/kreuzberg-py/Cargo.toml to build-python-sdist@v1. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs maturin sdist — but the kreuzberg layout keeps pyproject.toml in packages/python/, so maturin failed with Failed to build source distribution, pyproject.toml not found. PyPI publish then skipped for rc.10. Dropped the manifest-path input so the action falls through to the default package-dir: packages/python split-layout fallback, which cd's into the package dir and lets maturin resolve manifest-path from pyproject.toml's [tool.maturin] section itself.

  • Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking kreuzberg_php.dll / kreuzberg_nif.dll on x86_64-pc-windows-msvc failed with LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease. libkreuzberg_tesseract.rlib is built by cmake-rs, which defaults to /MD; libesaxx_rs.rlib (transitively pulled in by gliner → tokenizers → esaxx-rs) is built by cc-rs, which fell back to /MT. alef.toml [crates.scaffold.cargo.env] now sets CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD" and the matching CXXFLAGS_*, propagated to .cargo/config.toml [env]. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. Same fix unblocks Elixir NIF Windows build (recurring failure since rc.7).

  • captioning: captioning was a no-op for all image paths — CaptioningProcessor never received image bytes. Two root causes: (1) ImageExtractor built extracted_image on the OCR path but passed None to build_image_internal_document, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on config.images.extract_images, so setting only config.captioning left them with empty data. Fix: add ExtractionConfig::needs_image_data() — true when images.extract_images or captioning is set — and use it in every extractor image gate and in needs_image_processing(). Also emits a ProcessingWarning when captioning is configured but result.images is None. (#732)
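
The needs_image_data() gate from the captioning fix can be sketched as below. The method name and the two conditions come from the notes. The config struct shapes, including `captioning` being an `Option`, are assumptions made so that the example compiles on its own.

```rust
/// Hypothetical, reduced config types.
#[derive(Debug, Default, Clone)]
pub struct ImageConfig {
    pub extract_images: bool,
}

#[derive(Debug, Default, Clone)]
pub struct CaptioningConfig;

#[derive(Debug, Default, Clone)]
pub struct ExtractionConfig {
    pub images: ImageConfig,
    pub captioning: Option<CaptioningConfig>,
}

impl ExtractionConfig {
    /// True when image bytes are needed downstream: either image
    /// extraction is requested or captioning is configured.
    pub fn needs_image_data(&self) -> bool {
        self.images.extract_images || self.captioning.is_some()
    }
}
```

An extractor gate would then check `config.needs_image_data()` rather than `config.images.extract_images` alone, so a captioning-only configuration still receives image bytes.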


  • Rust e2e generator: not_empty on String leaves the leaf concrete. Fixed in alef v0.25.0 (src/e2e/codegen/rust/assertion_helpers.rs). When a path like summary.text crosses an Option<Summary> parent on the way down, the resolver registers the path as optional; the renderer's is_opt branch previously emitted accessor.is_some(), but the accessor already auto-unwrapped the parent (...summary.as_ref().unwrap().text), so the final expression has type String and .is_some() is a compile error. The renderer now checks for the trailing .as_ref().unwrap(). marker and emits !accessor.is_empty() for the concrete-leaf case while preserving .is_some() for true Option<T> leaves.

  • Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (src/e2e/field_access/resolver.rs::rust_unwrap_binding). The pre-pass that lifts string-equals assertions into local let _name = result.<path>.as_ref().map(...).unwrap_or_default(); bindings called the basic render_accessor instead of render_rust_with_optionals, so a path like summary.strategy emitted result.summary.strategy — a compile error because Option<Summary> has no strategy field. Switched the binding to render_rust_with_optionals so intermediate optional segments produce .as_ref().unwrap().

  • Removed summary.text and summary.strategy from fields_optional. They are not optional on DocumentSummary (pub text: String, pub strategy: SummaryStrategy); only the parent summary: Option<DocumentSummary> is. Listing the leaves as optional caused the e2e renderer to emit .as_ref() and .as_deref() against concrete types. The parent stays in the set.

  • SummaryStrategy now implements Display matching the snake-case serde wire form (extractive / abstractive). The Rust e2e renderer's string-equality let-binding pre-pass needs to_string() on the enum value to compare against the fixture's literal string. Added a minimal Display impl.

  • Dead-code is_heif_container and extract_exif_data under reranker-only builds. The Live HF preset tests CI job builds kreuzberg with --features "reranker,reranker-presets,tokio-runtime". The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (in extraction::image and extractors::image) is gated out — clippy -D warnings then surfaced both functions as dead_code. Added #[allow(dead_code)] to both definitions with comments documenting the unconditional-compile intent.

  • text::classification::classify_text stub added. The real implementation lives behind the classification feature; alef-generated bindings call kreuzberg::text::classification::classify_text unconditionally, so the existing stub module needed a matching no-op at that path, one that returns Err(KreuzbergError::Other("classification feature not available on this target")). Required for the iOS / Android android-target builds, which drop the classification feature.

  • LlmBackend and GlineBackend stubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 / android+x86_64 triple cfg. With the binding crates now widening their target gates to target_os = "android" and target_os = "ios" (both arches), aarch64-apple-ios and aarch64-linux-android were failing to compile against kreuzberg::LlmBackend / kreuzberg::GlineBackend. Simplified the gate to #[cfg(not(feature = "ner-llm"))] and #[cfg(not(feature = "ner-onnx"))] respectively — any config that drops the feature now gets the stub regardless of target.

  • Swift Rust crate now honours target_dep_overrides. The Swift cargo emitter (alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml) called crate::scaffold::render_core_dep directly, ignoring the [[crates.swift.target_dep_overrides]] block. Extended SwiftConfig with a target_dep_overrides: Vec<SwiftTargetDepOverride> field (mirrors DartTargetDepOverride) and refactored emit_cargo_toml to emit [target.'cfg(not(any(...)))'.dependencies] + per-override [target.'cfg(...)'.dependencies] blocks when overrides are present, matching the FFI and Dart patterns. With alef.toml now declaring iOS, Android, and Windows overrides for [crates.swift], packages/swift/rust/Cargo.toml correctly routes iOS to android-target, Android to android-target, Windows to windows-target, and the default (macOS host) to the full feature set. Same target_dep_overrides config gap exists for python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
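
The summary-related e2e generator fixes in this list come down to one typing fact: only the parent summary is optional, while its leaves are concrete. A minimal sketch of the accessor shapes the renderer is described as emitting, using hypothetical mirror types (ExtractionResult and the two helper functions are illustrative names, not the crate's API):

```rust
use std::fmt;

/// Hypothetical mirror of the enum named in the changelog.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SummaryStrategy {
    Extractive,
    Abstractive,
}

impl fmt::Display for SummaryStrategy {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Same spelling as the snake-case wire form.
        f.write_str(match self {
            SummaryStrategy::Extractive => "extractive",
            SummaryStrategy::Abstractive => "abstractive",
        })
    }
}

/// Both leaves are concrete types, as stated in the changelog.
#[derive(Debug, Clone)]
pub struct DocumentSummary {
    pub text: String,
    pub strategy: SummaryStrategy,
}

/// Illustrative stand-in for the extraction result; only `summary` is optional.
#[derive(Debug, Clone, Default)]
pub struct ExtractionResult {
    pub summary: Option<DocumentSummary>,
}

/// `not_empty` on `summary.text`: unwrap the optional parent, then test the
/// concrete `String` with `is_empty()` (calling `.is_some()` would not compile).
pub fn summary_text_not_empty(result: &ExtractionResult) -> bool {
    !result.summary.as_ref().unwrap().text.is_empty()
}

/// String-equality binding for `summary.strategy`: map through the optional
/// parent and rely on `Display` to get a comparable `String`.
pub fn summary_strategy_string(result: &ExtractionResult) -> String {
    result
        .summary
        .as_ref()
        .map(|s| s.strategy.to_string())
        .unwrap_or_default()
}
```

With a populated summary, summary_strategy_string returns the wire-form string; with summary set to None it falls back to the empty string through unwrap_or_default().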

Added

  • list_supported_formats() is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (kreuzberg formats), REST API (GET /formats), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091)

  • [v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level rerank / rerank_async API, RerankerConfig with Preset/Custom/Llm/Plugin variants, RerankerBackend plugin trait + registry, POST /rerank HTTP endpoint, and per-language bindings via alef. Gated behind the new reranker + reranker-presets Cargo features; reranker-presets is WASM/Android-safe.

  • [v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries: bge-reranker-base (BAAI/bge-reranker-base, EN+ZH), bge-reranker-v2-m3 (rozgo/bge-reranker-v2-m3 with the required model.onnx.data sibling, multilingual), jina-reranker-v1-turbo-en, and jina-reranker-v2-base-multilingual. Friendly aliases fast / balanced / quality / multilingual resolve to catalog entries, so existing call sites keep working.

  • [v5.0.0] RerankerPreset + RerankerModelType::Custom gained additional_files: Vec<String>. Lets multi-blob ONNX exports (notably rozgo/bge-reranker-v2-m3, which splits weights into model.onnx + model.onnx.data) actually load.

  • [v5.0.0] RerankerModelType::Custom gained model_file: Option<String>. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to "onnx/model.onnx" when omitted.

  • [v5.0.0] CI live-hf job. Always-on reranker preset-path validation on every PR via a new .github/actions/cache-hf-fastembed composite action that caches ~/.cache/huggingface/hub keyed on the catalog literal — any preset path change triggers a fresh download so we catch drift the moment it happens. Tests cover all four presets plus a Preset / Custom-equivalent crosscheck and top_k truncation.
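
To illustrate how model_file and additional_files fit together, here is a minimal sketch with a hypothetical mirror struct. The field types for model_id and max_length, the struct name, and the files_to_fetch helper are assumptions for illustration; only the "onnx/model.onnx" default and the two new field types come from the notes above.

```rust
/// Hypothetical mirror of the fields listed for `RerankerModelType::Custom`.
#[derive(Debug, Clone)]
pub struct CustomRerankerModel {
    pub model_id: String,
    pub max_length: usize, // assumed type
    pub model_file: Option<String>,
    pub additional_files: Vec<String>,
}

/// Default ONNX path stated in the release notes.
pub const DEFAULT_MODEL_FILE: &str = "onnx/model.onnx";

/// Lists the repository-relative files a loader would need:
/// the main model file first, then any sibling blobs.
pub fn files_to_fetch(model: &CustomRerankerModel) -> Vec<String> {
    let main = model
        .model_file
        .clone()
        .unwrap_or_else(|| DEFAULT_MODEL_FILE.to_string());
    let mut files = vec![main];
    files.extend(model.additional_files.iter().cloned());
    files
}
```

A split-weights export would set model_file to the main graph and list the weights blob in additional_files, so both files end up in the fetch list.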

Changed

  • [v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (Xenova/ms-marco-MiniLM-L-6-v2, Xenova/bge-reranker-base, Xenova/bge-reranker-large, BAAI/bge-reranker-v2-m3) are removed. The hand-curated model_file paths were not verified against HF and would have 404'd at runtime. Three of four upstream paths were wrong. Callers using the friendly aliases (fast / balanced / quality / multilingual) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically need ms-marco-MiniLM or bge-reranker-large can pass them via Custom { model_id, ... } or a registered Plugin backend.

  • [v5.0.0] BREAKING: RerankerModelType::Custom now has four fields instead of two. Exhaustive Rust matches on Custom { model_id, max_length } need to add model_file and additional_files (or a .. rest pattern). Serde defaults keep existing TOML / JSON configs valid without migration.
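
For callers updating their matches, a rest pattern keeps the code compiling when a struct variant gains fields. The sketch below uses a hypothetical mirror enum: the Preset variant's shape, the field types, and the describe helper are assumptions, and only the four Custom field names come from the notes above.

```rust
/// Hypothetical mirror enum; not the crate's actual definition.
#[derive(Debug, Clone)]
pub enum RerankerModelType {
    Preset { name: String },
    Custom {
        model_id: String,
        max_length: usize,
        model_file: Option<String>,
        additional_files: Vec<String>,
    },
}

/// Matches only the fields it needs.
pub fn describe(model: &RerankerModelType) -> String {
    match model {
        RerankerModelType::Preset { name } => format!("preset:{name}"),
        // `..` ignores `model_file` and `additional_files`, and would also
        // absorb any fields added to the variant later.
        RerankerModelType::Custom { model_id, max_length, .. } => {
            format!("custom:{model_id} (max_length={max_length})")
        }
    }
}
```

Code that constructs the variant still has to name every field; the rest pattern only helps on the matching side.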

Fixed

  • rendering: fixed panic when a non-Item block element appears directly under a List node before any ListItem. The comrak AST builder now synthesises an implicit Item wrapper instead of falling back onto the bare List, which violated CommonMark's List → Item-only constraint and panicked in debug builds. (#1096)

  • pdf: result.pages[*].isBlank now reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existing PageContent entries, is_blank was never recalculated — it retained the stale value from native text extraction, which is always Some(true) for pages with no text layer. All four write sites in the OCR page-assembly block now call is_page_text_blank after every content mutation. (#1095)

  • reranker: RerankError migrated to thiserror. Matches the rest of the library and rust-conventions.

  • reranker: shutdown_all now best-effort. Continues invoking shutdown() on every backend even after one fails, returns the first error, drops subsequent ones (logged at warn). Previously stopped on the first failure, leaving the registry in a half-shutdown state.

  • reranker: synchronous rerank() returns a clear error instead of panicking on a current-thread Tokio runtime. block_in_place requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect RuntimeFlavor::CurrentThread and ask the caller to use rerank_async() or build a multi-thread runtime.

  • reranker: stronger sigmoid coverage in tests/api_rerank.rs. Happy-path test now uses mixed-sign logits (-2.0, 3.0, 0.5) and asserts both the sigmoid output range [0, 1] and the sign-to-side property (negative logits land below 0.5, positive ones above), so silently dropping the sigmoid would break the test.

  • reranker: engine tests share the production sigmoid_f32 instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
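
The two sigmoid properties that the strengthened test is described as asserting can be sketched as follows. The body of sigmoid_f32 here is the standard logistic function and is an assumption about the production helper, which these notes do not show; sigmoid_properties_hold is an illustrative name.

```rust
/// Standard logistic sigmoid; an assumed stand-in for the production `sigmoid_f32`.
pub fn sigmoid_f32(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Checks, for every logit, that the score lies in [0, 1] and that the sign
/// of the logit decides which side of 0.5 the score falls on.
pub fn sigmoid_properties_hold(logits: &[f32]) -> bool {
    logits.iter().all(|&logit| {
        let score = sigmoid_f32(logit);
        let in_range = (0.0..=1.0).contains(&score);
        let right_side = if logit < 0.0 {
            score < 0.5
        } else if logit > 0.0 {
            score > 0.5
        } else {
            score == 0.5
        };
        in_range && right_side
    })
}
```

Mixed-sign inputs matter because a raw logit such as -2.0 already fails the range check, so a code path that skipped the sigmoid could not pass.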

Fixed

  • pdf: table extraction failures are now visible at warn log level. extract_tables_native and extract_tables_bordered silently caught pdf_oxide::extract_tables_with_config errors at tracing::debug!, making per-page failures invisible at the default log level. Promoted to tracing::warn! to match the existing behaviour of the TATR and SLANeXT inference paths. The three unwrap_or_default() call sites in extraction.rs that silently swallowed function-level errors are also replaced with unwrap_or_else(|e| { tracing::warn!(...); Vec::new() }) so that a page_count() failure is equally visible. (#1097)

  • publish.yaml trigger-pubdev job: explicit permissions: actions: write. Since the a8f8597e45 migration to the kreuzberg-dev-publisher App-token, the gh workflow run publish-pubdev.yaml step has 403'd with "Resource not accessible by integration" — the App's installation token didn't carry actions: write. Adding job-level permissions: { actions: write, contents: read } covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need actions: write configured on github.com.
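
The unwrap_or_default() to unwrap_or_else(...) change in the table-extraction fix above follows a common pattern: keep the empty fallback, but report the error before discarding it. A dependency-free sketch, in which Table, TableError, extract_tables, and tables_or_empty are hypothetical names and eprintln! stands in for tracing::warn!:

```rust
/// Hypothetical table type, standing in for the real extraction output.
#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub rows: usize,
}

/// Hypothetical error type.
#[derive(Debug)]
pub struct TableError(pub String);

/// Stand-in for a fallible per-page extraction call.
pub fn extract_tables(page: usize) -> Result<Vec<Table>, TableError> {
    if page == 0 {
        Err(TableError("page index out of range".to_string()))
    } else {
        Ok(vec![Table { rows: 3 }])
    }
}

/// Falls back to an empty list, but reports the error first.
/// `eprintln!` stands in for `tracing::warn!` so the sketch needs no crates.
pub fn tables_or_empty(page: usize) -> Vec<Table> {
    extract_tables(page).unwrap_or_else(|e| {
        eprintln!("WARN table extraction failed on page {page}: {e:?}");
        Vec::new()
    })
}
```

The caller sees the same value as with unwrap_or_default(); the difference is that the failure now leaves a trace at the default log level.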

Changed

  • Root Taskfile now includes test-apps task namespace. Added test-apps to the root Taskfile.yml includes block with proper {{.ROOT_DIR}} path resolution. Smoke and comprehensive test tasks for all 11 languages now accessible via task test-apps:smoke:*.

Fixed

  • swift e2e: removed erroneous async = false override on extract_file for swift. Kreuzberg.extractFile(_:_:_:) is async in the Swift binding. The override in alef.toml forced is_async = false for fixtures that explicitly set "call": "extract_file" (e.g. api_batch_bytes_async), generating non-async test methods that called the async binding without await — compile errors. Fixtures without an explicit call fall through resolve_call_for_fixture to the global default and got is_async = true correctly, which is why testApiExtractFileAsync compiled but testApiBatchBytesAsync and 4 siblings did not. Dropping the override aligns both code paths.

  • r: fix macOS dylib rpath so ORT loads at R extension runtime. packages/r/src/rust/build.rs now adds -Wl,-rpath,@loader_path linker flag on macOS, enabling the final R extension .so to locate transitively-linked dylibs like libonnxruntime.dylib at load time. Without this, R's dyn.load via library.dynam2 failed with undefined symbol: OrtGetApiBase in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791.

  • Publish Release WASM job now non-blocking via continue-on-error. The Build WASM package job consistently hits a GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on ubuntu-latest (8GB RAM); runner preemption ("runner shutdown signal") terminates the job before the timeout expires. Set continue-on-error: true so the Publish-Final job proceeds regardless of the WASM build outcome. A deeper fix (runner upsizing or a two-stage build) is deferred to rc.11+. The WASM package still publishes if the npm cache was warm from a previous rc or if the parallel publish-wasm job succeeds.

  • FormatMetadata::Code now serializes correctly. The #[serde(skip)] annotation on the Code variant caused serde_json::to_string (and every *_to_json FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation — the inner CodeMetadataInner(ProcessResult) wrapper already derives Serialize/Deserialize via the upstream serde feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction.

  • Python sdist publish step now uses split-layout invocation. .github/workflows/publish.yaml passed manifest-path: crates/kreuzberg-py/Cargo.toml to build-python-sdist@v1. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs maturin sdist — but the kreuzberg layout keeps pyproject.toml in packages/python/, so maturin failed with Failed to build source distribution, pyproject.toml not found. PyPI publish then skipped for rc.10. Dropped the manifest-path input so the action falls through to the default package-dir: packages/python split-layout fallback, which cd's into the package dir and lets maturin resolve manifest-path from pyproject.toml's [tool.maturin] section itself.

  • Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking kreuzberg_php.dll / kreuzberg_nif.dll on x86_64-pc-windows-msvc failed with LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease. libkreuzberg_tesseract.rlib is built by cmake-rs which defaults to /MD; libesaxx_rs.rlib (transitively pulled in by gliner → tokenizers → esaxx-rs) is built by cc-rs which fell back to /MT. alef.toml [crates.scaffold.cargo.env] now sets CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD" and the matching CXXFLAGS_*, propagated to .cargo/config.toml [env]. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. Same fix unblocks Elixir NIF Windows build (recurring failure since rc.7).

  • captioning: captioning was a no-op for all image paths — CaptioningProcessor never received image bytes. Two root causes: (1) ImageExtractor built extracted_image on the OCR path but passed None to build_image_internal_document, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on config.images.extract_images, so setting only config.captioning left them with empty data. Fix: add ExtractionConfig::needs_image_data() — true when images.extract_images or captioning is set — and use it in every extractor image gate and in needs_image_processing(). Also emits a ProcessingWarning when captioning is configured but result.images is None. (#732)
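
The needs_image_data() gate from the captioning fix can be sketched as below. The config shapes are assumptions: these notes say only that the method is true when images.extract_images or captioning is set, so the Option wrappers and the ImageConfig / CaptioningConfig structs are hypothetical.

```rust
/// Hypothetical image options; only `extract_images` is named in the notes.
#[derive(Debug, Clone, Default)]
pub struct ImageConfig {
    pub extract_images: bool,
}

/// Hypothetical captioning options (contents not relevant here).
#[derive(Debug, Clone, Default)]
pub struct CaptioningConfig;

/// Illustrative stand-in for the extraction config.
#[derive(Debug, Clone, Default)]
pub struct ExtractionConfig {
    pub images: Option<ImageConfig>,
    pub captioning: Option<CaptioningConfig>,
}

impl ExtractionConfig {
    /// True when image bytes must be extracted: either the caller asked for
    /// images explicitly, or captioning needs them as input.
    pub fn needs_image_data(&self) -> bool {
        let extract = self
            .images
            .as_ref()
            .map_or(false, |images| images.extract_images);
        extract || self.captioning.is_some()
    }
}
```

Routing every extractor's image gate through one predicate means that a config with only captioning set still produces image bytes.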

2 days ago
kreuzberg

v5.0.0-rc.12

Added

  • New windows-target aggregate feature in crates/kreuzberg/Cargo.toml. Mirrors the curated FFI-on-Windows list the publish workflow already used and drops heic along with the ORT-dependent capabilities (paddle-ocr, layout-detection, embeddings, reranker, ner-llm). The FFI and Dart Rust crates pick it up via per-crate [[crates.<x>.target_dep_overrides]] cfg = 'target_os = "windows"' blocks in alef.toml. The pyo3 / napi-rs / magnus / ext-php-rs / rustler / swift-bridge / jni scaffolders do not yet honor target_dep_overrides, so the python/node/ruby/php/elixir/swift/kotlin_android Windows wheels still need an alef-side scaffolder fix to drop heic on Windows. The alef.toml entries for those crates are no-ops today but document the intent; they activate once the upstream scaffolders pick up the override block.

Fixed

  • CI / publish: source-build libheif 1.23.0 to satisfy the libheif-sys 5.3 requirement of libheif >= 1.21. Ubuntu Noble's apt ships libheif 1.17.6 and Alpine 3.21 ships libheif 1.19.5; both are rejected by libheif-sys with Package 'libheif' has version '1.17.6', required version is '>= 1.21'. Without a fix, every Linux build job (Build FFI matrix, CLI binaries, Go FFI, C# natives, Zig package, Python wheels, C FFI distribution, Java natives, Kotlin Android, Dart, Swift, manylinux_2_28 wheels) failed at the libheif-sys build script. Four coordinated fixes:

    • scripts/ci/install-system-deps/install-linux.sh drops apt libheif-dev, installs libde265-dev libaom-dev libx265-dev libdav1d-dev, then downloads + builds + installs libheif 1.23.0 via cmake into /usr/local, exporting PKG_CONFIG_PATH + LD_LIBRARY_PATH via GITHUB_ENV. Cached via a new cache-libheif-linux step in install-system-deps/action.yml.
    • docker/Dockerfile.musl-{build,ffi,rustler} now pull libheif-dev and codec headers from alpine/edge/community + edge/main (libheif 1.23.0) instead of Alpine 3.21 main.
    • kreuzberg-dev/actions/build-python-wheels@v1.8.64 source-builds libheif inside manylinux_2_28 via CIBW_BEFORE_ALL_LINUX (AlmaLinux 8 + EPEL ships codec subsets only — we install what's available and let libheif compile without missing codec features).
    • python-wheels job in publish.yaml now runs install-system-deps before build-python-wheels so Windows wheels pick up vcpkg-built libheif.

  • Publish workflow: fixed dead-code warnings as errors on Windows / reranker-presets-only builds. rerank_via_llm and extract_rerank_usage in crates/kreuzberg/src/llm/rerank.rs were gated by any(feature = "reranker-presets", feature = "reranker"), but every caller was gated by feature = "reranker". On the Windows feature combo (which uses reranker-presets + liter-llm without reranker), the functions compiled but had no callers — clippy -D warnings failed every Windows binding build (C#, Go, Java, CLI, Node, C FFI). Tightened the cfg to feature = "reranker" on both functions, the related use statements, and the test module so the function tracks its callers exactly. Also removed the orphaned ContentLayer::is_default helper (made dead by the rc.11 removal of #[serde(skip_serializing_if = ...)] on content_layer).

  • Publish workflow: install libheif on Linux + macOS runners. libheif-sys is a transitive dependency of kreuzberg-libheif (gated behind the heic feature, included in full), and the publish workflow's per-language build jobs (Swift / Zig / C-FFI / C# / Go / Java / Kotlin-Android / CLI / Node / Dart / Python wheels) invoked kreuzberg-dev/actions/build-* without first installing system dependencies. libheif-sys's pkg-config probe then failed with Package libheif was not found in the pkg-config search path. Added libheif-dev to scripts/ci/install-system-deps/install-linux.sh, brew install libheif to install-macos.sh, and injected a ./.github/actions/install-system-deps step before every build-* action invocation in .github/workflows/publish.yaml (17 sites + the publish-crates job, since kreuzberg-libheif is also published from that job).

  • Publish workflow: publish kreuzberg-libheif to crates.io. The new path-only kreuzberg-libheif crate is a build dependency of kreuzberg via the heic feature. Without publishing it first, cargo publish -p kreuzberg aborts with no matching package named 'kreuzberg-libheif' found (location searched: crates.io index). Added it as the first entry in the publish-crates invocation's crates: list so it lands on crates.io before kreuzberg.

  • Android (any arch) + iOS: widened libheif/ORT exclusion gate. The C FFI crate's target_dep_overrides previously narrowed the android-target feature set to cfg(all(target_os = "android", target_arch = "x86_64")), on the assumption that aarch64-linux-android could use the full ORT-enabled set via pyke prebuilts. In practice libheif-sys still blocked the cross-compile on the NDK for both architectures, and iOS hit the same wall. Widened the cfg to target_os = "android" (covers both arches) and added a parallel target_os = "ios" override that also routes to android-target. Same treatment applied to the Dart Rust crate (packages/dart/rust/Cargo.toml).

  • Docs strict build: broken anchor in docs/concepts/reranking.md:10. The opening "Bi-encoders vs cross-encoders" paragraph linked [Embeddings](architecture.md#embeddings), but architecture.md has no ## Embeddings heading — task docs:build:strict aborted with exit code 1. Reworded the parenthetical to drop the link target.

  • musl Docker builds: include kreuzberg-libheif crate + install libheif-dev. The three docker/Dockerfile.musl-{ffi,rustler,build} images each copy a subset of workspace crates into the build context; the new kreuzberg-libheif crate (required transitively when kreuzberg is built with the full features that include heic) was missing, so cargo aborted with failed to read /build/crates/kreuzberg-libheif/Cargo.toml. Added the COPY entry to each Dockerfile and added libheif-dev to the apk add list so libheif-sys's pkg-config probe finds Alpine's package.

  • CI Lint: install system dependencies before running clippy. ci-lint.yaml set up the Rust toolchain but never called ./.github/actions/install-system-deps, so clippy on the Linux runner failed to find libheif.pc. Added the install-system-deps step right after setup-rust.

  • Windows binding wheels: install libheif via vcpkg. libheif-sys's build.rs uses vcpkg::Config::new().find_package("libheif") on Windows MSVC. The windows-latest runner image ships vcpkg pre-installed at C:\vcpkg but the libheif port is not. Added a vcpkg install step (vcpkg install libheif:x64-windows-static-md) to scripts/ci/install-system-deps/install-windows.ps1, exposed VCPKG_ROOT to the build environment, and added an actions/cache entry on C:\vcpkg\installed\x64-windows-static-md so subsequent runs hit the cache (~10s) instead of the cold ~20-min vcpkg build. Unblocks Python / Node / Ruby / PHP / Elixir / Swift / JNI Windows wheels which still use the full feature set because their alef scaffolders do not yet honor target_dep_overrides.

  • Elixir cargo cc version conflict. The native Rustler NIF lockfile at packages/elixir/native/kreuzberg_nif/Cargo.lock pinned cc 1.2.63, while kreuzberg-tesseract requires cc ^1.2.64. Cargo failed to resolve. Ran cargo update -p cc in that sub-project to lift it to 1.2.64; this also picked up html-to-markdown-rs 3.6.2 and pdf_oxide 0.3.64 in both lockfiles to keep them in sync.

  • Rust e2e generator: enum variant accessors emit method-call syntax. The fixture path metadata.format.excel.sheet_count resolved against FormatMetadata (a tagged enum with a pub fn excel(&self) -> Option<&ExcelMetadata> convenience accessor). The Rust e2e renderer was emitting result.metadata.format.as_ref().unwrap().excel.as_ref().unwrap().sheet_count, which is a compile error because FormatMetadata has no excel field. Added "metadata.format.excel" to fields_method_calls in alef.toml so the Rust renderer (and the Zig one) appends () after the segment instead of dot-accessing it as a field.

  • Rust e2e generator: not_empty on String leaves the leaf concrete. Fixed in alef v0.25.0 (src/e2e/codegen/rust/assertion_helpers.rs). When a path like summary.text crosses an Option<Summary> parent on the way down, the resolver registers the path as optional; the renderer's is_opt branch previously emitted accessor.is_some(), but the accessor already auto-unwrapped the parent (...summary.as_ref().unwrap().text), so the final expression has type String and .is_some() is a compile error. The renderer now checks for the trailing .as_ref().unwrap(). marker and emits !accessor.is_empty() for the concrete-leaf case while preserving .is_some() for true Option<T> leaves.

  • Rust e2e generator: pre-assertion let-binding uses optional-aware accessor. Fixed in alef v0.25.0 (src/e2e/field_access/resolver.rs::rust_unwrap_binding). The pre-pass that lifts string-equals assertions into local let _name = result.<path>.as_ref().map(...).unwrap_or_default(); bindings called the basic render_accessor instead of render_rust_with_optionals, so a path like summary.strategy emitted result.summary.strategy — a compile error because Option<Summary> has no strategy field. Switched the binding to render_rust_with_optionals so intermediate optional segments produce .as_ref().unwrap().

  • Removed summary.text and summary.strategy from fields_optional. They are not optional on DocumentSummary (pub text: String, pub strategy: SummaryStrategy); only the parent summary: Option<DocumentSummary> is. Listing the leaves as optional caused the e2e renderer to emit .as_ref() and .as_deref() against concrete types. The parent stays in the set.

  • SummaryStrategy now implements Display matching the snake-case serde wire form (extractive / abstractive). The Rust e2e renderer's string-equality let-binding pre-pass needs to_string() on the enum value to compare against the fixture's literal string. Added a minimal Display impl.

  • Dead-code is_heif_container and extract_exif_data under reranker-only builds. The Live HF preset tests CI job builds kreuzberg with --features "reranker,reranker-presets,tokio-runtime". The HEIF sniffer is always compiled by design (12-byte magic check, zero deps) and EXIF extraction stubs out when no ocr/ocr-wasm/heic feature is enabled, but with the reranker-only feature set every caller (in extraction::image and extractors::image) is gated out — clippy -D warnings then surfaced both functions as dead_code. Added #[allow(dead_code)] to both definitions with comments documenting the unconditional-compile intent.

  • text::classification::classify_text stub added. The real implementation lives behind the classification feature; alef-generated bindings call kreuzberg::text::classification::classify_text unconditionally, so the existing stub module needed a matching no-op at that path, one that returns Err(KreuzbergError::Other("classification feature not available on this target")). Required for the iOS / Android android-target builds, which drop the classification feature.

  • LlmBackend and GlineBackend stubs widened to all non-ner-llm / non-ner-onnx targets. Both stubs previously listed an explicit Windows / wasm32 / android+x86_64 triple cfg. With the binding crates now widening their target gates to target_os = "android" and target_os = "ios" (both arches), aarch64-apple-ios and aarch64-linux-android were failing to compile against kreuzberg::LlmBackend / kreuzberg::GlineBackend. Simplified the gate to #[cfg(not(feature = "ner-llm"))] and #[cfg(not(feature = "ner-onnx"))] respectively — any config that drops the feature now gets the stub regardless of target.

  • Swift Rust crate now honours target_dep_overrides. The Swift cargo emitter (alef::backends::swift::gen_rust_crate::cargo::emit_cargo_toml) called crate::scaffold::render_core_dep directly, ignoring the [[crates.swift.target_dep_overrides]] block. Extended SwiftConfig with a target_dep_overrides: Vec<SwiftTargetDepOverride> field (mirrors DartTargetDepOverride) and refactored emit_cargo_toml to emit [target.'cfg(not(any(...)))'.dependencies] + per-override [target.'cfg(...)'.dependencies] blocks when overrides are present, matching the FFI and Dart patterns. With alef.toml now declaring iOS, Android, and Windows overrides for [crates.swift], packages/swift/rust/Cargo.toml correctly routes iOS to android-target, Android to android-target, Windows to windows-target, and the default (macOS host) to the full feature set. Same target_dep_overrides config gap exists for python / node / ruby / php / elixir / jni / kotlin_android scaffolders; addressing those is tracked separately.
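
The enum-accessor fix in this list (metadata.format.excel.sheet_count) is easier to see with types in hand. A minimal sketch with hypothetical mirror types: the variant layout, the Other variant, the Metadata wrapper, and the usize type for sheet_count are assumptions, and only the excel() accessor signature comes from the notes.

```rust
/// Hypothetical mirror of the Excel metadata; `sheet_count`'s type is assumed.
#[derive(Debug, Clone)]
pub struct ExcelMetadata {
    pub sheet_count: usize,
}

/// Hypothetical mirror of the tagged enum; only the accessor matters here.
#[derive(Debug, Clone)]
pub enum FormatMetadata {
    Excel(ExcelMetadata),
    Other,
}

impl FormatMetadata {
    /// Convenience accessor with the signature quoted in the changelog.
    pub fn excel(&self) -> Option<&ExcelMetadata> {
        match self {
            FormatMetadata::Excel(meta) => Some(meta),
            _ => None,
        }
    }
}

/// Illustrative parent struct with an optional `format`.
#[derive(Debug, Clone, Default)]
pub struct Metadata {
    pub format: Option<FormatMetadata>,
}

/// The path `metadata.format.excel.sheet_count` rendered with a method call:
/// `.excel` as a field would not compile, because an enum has no fields.
pub fn sheet_count(metadata: &Metadata) -> usize {
    metadata.format.as_ref().unwrap().excel().unwrap().sheet_count
}
```

Listing the segment in fields_method_calls is what tells the renderer to append () at that step.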

Added

  • list_supported_formats() is now part of the public crate root and every language binding. Returns every file extension Kreuzberg recognizes with its corresponding MIME type, so callers can derive ingestion policy from the library instead of maintaining their own extension whitelists. The function already backed the CLI (kreuzberg formats), REST API (GET /formats), and MCP server; it is now exported from the crate root and exposed in every binding via the alef catalog. (#1091)

  • [v5.0.0] reranking: cross-encoder reordering with optional liter-llm wiring. New top-level rerank / rerank_async API, RerankerConfig with Preset/Custom/Llm/Plugin variants, RerankerBackend plugin trait + registry, POST /rerank HTTP endpoint, and per-language bindings via alef. Gated behind the new reranker + reranker-presets Cargo features; reranker-presets is WASM/Android-safe.

  • [v5.0.0] reranker preset catalog now mirrors fastembed-rs verbatim. Four verified entries: bge-reranker-base (BAAI/bge-reranker-base, EN+ZH), bge-reranker-v2-m3 (rozgo/bge-reranker-v2-m3 with the required model.onnx.data sibling, multilingual), jina-reranker-v1-turbo-en, and jina-reranker-v2-base-multilingual. Friendly aliases fast / balanced / quality / multilingual resolve to catalog entries, so existing call sites keep working.

  • [v5.0.0] RerankerPreset + RerankerModelType::Custom gained additional_files: Vec<String>. Lets multi-blob ONNX exports (notably rozgo/bge-reranker-v2-m3, which splits weights into model.onnx + model.onnx.data) actually load.

  • [v5.0.0] RerankerModelType::Custom gained model_file: Option<String>. Lets callers point at non-default ONNX paths (e.g. quantized variants) without falling back to the plugin escape hatch. Defaults to "onnx/model.onnx" when omitted.

  • [v5.0.0] CI live-hf job. Always-on reranker preset-path validation on every PR via a new .github/actions/cache-hf-fastembed composite action that caches ~/.cache/huggingface/hub keyed on the catalog literal — any preset path change triggers a fresh download so we catch drift the moment it happens. Tests cover all four presets plus a Preset / Custom-equivalent crosscheck and top_k truncation.
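
As a rough illustration of how the model_file default and the additional_files list from the entries above combine, here is a minimal sketch. The helper name files_to_fetch and its signature are hypothetical and not part of the Kreuzberg API; only the "onnx/model.onnx" default and the field types come from the notes above.

```rust
/// Default ONNX path used when `model_file` is omitted (taken from the entry above).
const DEFAULT_MODEL_FILE: &str = "onnx/model.onnx";

/// Hypothetical helper: list every repository file a custom reranker needs.
/// The primary ONNX file comes first, followed by any sibling blobs.
fn files_to_fetch(model_file: Option<&str>, additional_files: &[String]) -> Vec<String> {
    let mut files = vec![model_file.unwrap_or(DEFAULT_MODEL_FILE).to_string()];
    files.extend(additional_files.iter().cloned());
    files
}
```

Calling files_to_fetch(None, &["model.onnx.data".to_string()]) yields the default model path followed by the sibling data file, which is the shape a multi-blob export needs.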

Changed

  • [v5.0.0] BREAKING: reranker preset paths replaced. The four unverified Xenova / BAAI paths that shipped in earlier rc builds (Xenova/ms-marco-MiniLM-L-6-v2, Xenova/bge-reranker-base, Xenova/bge-reranker-large, BAAI/bge-reranker-v2-m3) are removed. The hand-curated model_file paths were not verified against HF and would have 404'd at runtime. Three of four upstream paths were wrong. Callers using the friendly aliases (fast / balanced / quality / multilingual) keep working; callers who hardcoded the old catalog names need to switch to the new short-names listed above. Users who specifically need ms-marco-MiniLM or bge-reranker-large can pass them via Custom { model_id, ... } or a registered Plugin backend.

  • [v5.0.0] BREAKING: RerankerModelType::Custom is no longer a two-field variant. Exhaustive Rust matches on Custom { model_id, max_length } need to add model_file and additional_files, or end the pattern with .. to ignore them. Serde defaults keep existing TOML / JSON configs valid without migration.
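
The match migration for RerankerModelType::Custom might look like the sketch below. The enum is an illustrative mirror, not the real definition: the Preset variant and the types of model_id and max_length are assumptions, while model_file: Option<String> and additional_files: Vec<String> follow the Added entries.

```rust
/// Illustrative mirror of the variant; not the real Kreuzberg definition.
/// `Preset` and the types of `model_id` / `max_length` are assumptions.
#[derive(Debug, Clone)]
enum RerankerModelType {
    Preset { name: String },
    Custom {
        model_id: String,
        max_length: usize,
        model_file: Option<String>,
        additional_files: Vec<String>,
    },
}

/// Option 1: name the two new fields explicitly in the pattern.
fn describe(model: &RerankerModelType) -> String {
    match model {
        RerankerModelType::Preset { name } => format!("preset {name}"),
        RerankerModelType::Custom { model_id, max_length, model_file, additional_files } => {
            format!(
                "{model_id} (max_length={max_length}, file={}, extra={})",
                model_file.as_deref().unwrap_or("onnx/model.onnx"),
                additional_files.len()
            )
        }
    }
}

/// Option 2: ignore unused fields with `..`, which also tolerates
/// fields added to the variant later.
fn identifier(model: &RerankerModelType) -> &str {
    match model {
        RerankerModelType::Preset { name } => name.as_str(),
        RerankerModelType::Custom { model_id, .. } => model_id.as_str(),
    }
}
```

The second form is usually the lower-maintenance choice for call sites that only read one or two fields.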

Fixed

  • rendering: fixed panic when a non-Item block element appears directly under a List node before any ListItem. The comrak AST builder now synthesises an implicit Item wrapper instead of falling back onto the bare List, which violated CommonMark's List → Item-only constraint and panicked in debug builds. (#1096)

  • pdf: result.pages[*].isBlank now reflects OCR content for scanned/rasterized PDFs. When OCR (including VLM) wrote text into existing PageContent entries, is_blank was never recalculated — it retained the stale value from native text extraction, which is always Some(true) for pages with no text layer. All four write sites in the OCR page-assembly block now call is_page_text_blank after every content mutation. (#1095)

  • reranker: RerankError migrated to thiserror. Matches the rest of the library and rust-conventions.

  • reranker: shutdown_all now best-effort. Continues invoking shutdown() on every backend even after one fails, returns the first error, drops subsequent ones (logged at warn). Previously stopped on the first failure, leaving the registry in a half-shutdown state.

  • reranker: synchronous rerank() returns a clear error instead of panicking on a current-thread Tokio runtime. block_in_place requires a multi-thread scheduler; the previous code path would panic rather than refuse the call. The LLM and Plugin synchronous branches now detect RuntimeFlavor::CurrentThread and ask the caller to use rerank_async() or build a multi-thread runtime.

  • reranker: stronger sigmoid coverage in tests/api_rerank.rs. Happy-path test now uses mixed-sign logits (-2.0, 3.0, 0.5) and asserts both the sigmoid output range [0, 1] and the sign-to-side property (negative logits score below 0.5, positive logits above), so silently dropping the sigmoid would break the test.

  • reranker: engine tests share the production sigmoid_f32 instead of duplicating it locally. Keeps the test signal honest if the production function ever changes.
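
The best-effort shutdown_all behaviour described in the list above can be sketched roughly as follows. The Backend trait and FakeBackend type are hypothetical stand-ins (the real RerankerBackend trait and registry are not shown in these notes), and eprintln! stands in for the warn-level log.

```rust
/// Hypothetical stand-in for the plugin trait; the real `RerankerBackend`
/// has more methods and its `shutdown` signature may differ.
trait Backend {
    fn name(&self) -> &str;
    fn shutdown(&mut self) -> Result<(), String>;
}

/// Best-effort shutdown: call `shutdown()` on every backend, keep the first
/// error, and only log later ones instead of returning early.
fn shutdown_all<B: Backend>(backends: &mut [B]) -> Result<(), String> {
    let mut first_error: Option<String> = None;
    for backend in backends.iter_mut() {
        if let Err(err) = backend.shutdown() {
            if first_error.is_none() {
                first_error = Some(err);
            } else {
                // Stand-in for a warn-level log call.
                eprintln!("warn: {} also failed to shut down: {err}", backend.name());
            }
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}

/// Test double that records how often `shutdown` ran.
struct FakeBackend {
    name: &'static str,
    fail: bool,
    calls: u32,
}

impl Backend for FakeBackend {
    fn name(&self) -> &str {
        self.name
    }
    fn shutdown(&mut self) -> Result<(), String> {
        self.calls += 1;
        if self.fail {
            Err(format!("{} failed", self.name))
        } else {
            Ok(())
        }
    }
}
```

The key difference from the earlier behaviour is the absence of an early return inside the loop: every backend sees exactly one shutdown() call regardless of earlier failures.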

Fixed

  • pdf: table extraction failures are now visible at warn log level. extract_tables_native and extract_tables_bordered silently caught pdf_oxide::extract_tables_with_config errors at tracing::debug!, making per-page failures invisible at the default log level. Promoted to tracing::warn! to match the existing behaviour of the TATR and SLANeXT inference paths. The three unwrap_or_default() call sites in extraction.rs that silently swallowed function-level errors are also replaced with unwrap_or_else(|e| { tracing::warn!(...); Vec::new() }) so that a page_count() failure is equally visible. (#1097)

  • publish.yaml trigger-pubdev job: explicit permissions: actions: write. Since the a8f8597e45 migration to the kreuzberg-dev-publisher App-token, the gh workflow run publish-pubdev.yaml step has 403'd with "Resource not accessible by integration" — the App's installation token didn't carry actions: write. Adding job-level permissions: { actions: write, contents: read } covers the case where GITHUB_TOKEN is used as a fallback, and documents that the App's permissions also need actions: write configured on github.com.
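
To make the unwrap_or_default() to unwrap_or_else(...) change from the table-extraction entry concrete, here is a minimal sketch. The extract_tables function is a hypothetical stand-in for the fallible extractor, and eprintln! replaces tracing::warn! so the example needs only the standard library; the real log message and fields are not shown in these notes.

```rust
/// Hypothetical stand-in for a fallible per-page table extractor.
fn extract_tables(page: usize) -> Result<Vec<String>, String> {
    if page % 2 == 0 {
        Ok(vec![format!("table on page {page}")])
    } else {
        Err(format!("page {page}: extraction failed"))
    }
}

/// Before: the error is discarded and the caller only sees an empty list.
fn tables_silent(page: usize) -> Vec<String> {
    extract_tables(page).unwrap_or_default()
}

/// After: same fallback value, but the failure is reported first.
fn tables_logged(page: usize) -> Vec<String> {
    extract_tables(page).unwrap_or_else(|e| {
        // Stand-in for tracing::warn!.
        eprintln!("warn: table extraction failed: {e}");
        Vec::new()
    })
}
```

Both variants return the same value on failure; the change only affects whether the failure is observable in logs.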

Changed

  • Root Taskfile now includes test-apps task namespace. Added test-apps to the root Taskfile.yml includes block with proper {{.ROOT_DIR}} path resolution. Smoke and comprehensive test tasks for all 11 languages now accessible via task test-apps:smoke:*.

Fixed

  • swift e2e: removed erroneous async = false override on extract_file for swift. Kreuzberg.extractFile(_:_:_:) is async in the Swift binding. The override in alef.toml forced is_async = false for fixtures that explicitly set "call": "extract_file" (e.g. api_batch_bytes_async), generating non-async test methods that called the async binding without await — compile errors. Fixtures without an explicit call fall through resolve_call_for_fixture to the global default and got is_async = true correctly, which is why testApiExtractFileAsync compiled but testApiBatchBytesAsync and 4 siblings did not. Dropping the override aligns both code paths.

  • r: fix macOS dylib rpath so ORT loads at R extension runtime. packages/r/src/rust/build.rs now adds -Wl,-rpath,@loader_path linker flag on macOS, enabling the final R extension .so to locate transitively-linked dylibs like libonnxruntime.dylib at load time. Without this, R's dyn.load via library.dynam2 failed with undefined symbol: OrtGetApiBase in CI on arm64-apple-darwin, blocking all R e2e tests. This matches the pattern applied to C# FFI in commit b5bc5d7791.

  • Publish Release WASM job now non-blocking via continue-on-error. Build WASM package job consistently hits GHA runner OOM during the linker stage (~9 min in) due to cold-build memory pressure on ubuntu-latest (8GB RAM). Runner preemption ("runner shutdown signal") terminates the job before timeout expires. Set continue-on-error: true so Publish-Final job proceeds regardless of WASM build outcome. Deeper fix (runner upsizing or two-stage build) deferred to rc.11+. WASM package still publishes if npm cache was warm from previous rc or from parallel publish-wasm job success.

  • FormatMetadata::Code now serializes correctly. The #[serde(skip)] annotation on the Code variant caused serde_json::to_string (and every *_to_json FFI call) to return an error whenever tree-sitter code extraction produced metadata. Removed the annotation — the inner CodeMetadataInner(ProcessResult) wrapper already derives Serialize/Deserialize via the upstream serde feature. Affects Java, Go, C#, Dart, and all other FFI consumers that exercise code-file extraction.

  • Python sdist publish step now uses split-layout invocation. .github/workflows/publish.yaml passed manifest-path: crates/kreuzberg-py/Cargo.toml to build-python-sdist@v1. That input routes the action into its single-tree branch, which cd's into the Rust crate directory and runs maturin sdist — but the kreuzberg layout keeps pyproject.toml in packages/python/, so maturin failed with Failed to build source distribution, pyproject.toml not found. PyPI publish then skipped for rc.10. Dropped the manifest-path input so the action falls through to the default package-dir: packages/python split-layout fallback, which cd's into the package dir and lets maturin resolve manifest-path from pyproject.toml's [tool.maturin] section itself.

  • Windows MSVC CRT mismatch in PHP and Elixir cdylibs. Linking kreuzberg_php.dll / kreuzberg_nif.dll on x86_64-pc-windows-msvc failed with LNK1319: mismatch detected for 'RuntimeLibrary': MT_StaticRelease vs MD_DynamicRelease. libkreuzberg_tesseract.rlib is built by cmake-rs which defaults to /MD; libesaxx_rs.rlib (transitively pulled in by gliner → tokenizers → esaxx-rs) is built by cc-rs which fell back to /MT. alef.toml [crates.scaffold.cargo.env] now sets CFLAGS_{x86_64,i686}_pc_windows_msvc = "/MD" and the matching CXXFLAGS_*, propagated to .cargo/config.toml [env]. cc-rs honors these target-suffixed env vars only when actually building for that target, so non-Windows builds are unaffected. Same fix unblocks Elixir NIF Windows build (recurring failure since rc.7).

  • captioning: captioning was a no-op for all image paths — CaptioningProcessor never received image bytes. Two root causes: (1) ImageExtractor built extracted_image on the OCR path but passed None to build_image_internal_document, discarding the bytes; (2) all document extractors (DOCX, PDF, PPTX, HTML, Markdown) gated binary image extraction on config.images.extract_images, so setting only config.captioning left them with empty data. Fix: add ExtractionConfig::needs_image_data() — true when images.extract_images or captioning is set — and use it in every extractor image gate and in needs_image_processing(). Also emits a ProcessingWarning when captioning is configured but result.images is None. (#732)
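
The needs_image_data() gate from the captioning fix can be sketched as below. The config structs are simplified stand-ins: the real ExtractionConfig has many more fields, and modelling "captioning is set" as an Option is an assumption made for this example.

```rust
/// Simplified stand-ins for the real config types (assumed shapes).
#[derive(Default)]
struct ImageConfig {
    extract_images: bool,
}

#[derive(Default)]
struct CaptioningConfig;

#[derive(Default)]
struct ExtractionConfig {
    images: ImageConfig,
    captioning: Option<CaptioningConfig>,
}

impl ExtractionConfig {
    /// True when extractors must keep image bytes: either images were
    /// requested explicitly, or captioning needs them as input.
    fn needs_image_data(&self) -> bool {
        self.images.extract_images || self.captioning.is_some()
    }
}
```

With a single predicate like this, every extractor gate asks the same question, so enabling only captioning no longer leaves the captioning step without input.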