harshankur/officeParser

2026-06-05 05:10:13

officeParser

harshankur

v7.2.0

v7.2.0: 🏗️ Parser Enhancements, Granular HTML Generator Controls, and Strict AST Typings

I am thrilled to announce the release of officeParser v7.2.0! This major update brings a massive architectural upgrade to the AST, empowering developers with deeper insight into document layout, embedded metadata, and bulletproof TypeScript integrations.

As we pave the way for building advanced RAG architectures, deep-document search systems, and robust AI parsing pipelines on top of officeParser, v7.2.0 guarantees that every piece of document intelligence—from slide masters to hidden footnotes—is logically structured and heavily typed.

[!WARNING] Soft Breaking Change: Notes Placement
If your application iterates over ast.content to manually extract footnotes, endnotes, or slide speaker notes, you will need to update your logic. These nodes are no longer appended to the main content array. They are now structurally nested inside the notes[] array of their logical parent or preceding text node.

🌟 Key Pillars of the v7.2.0 Update

1. Structural Notes Attachment

Previously, footnotes, endnotes, and slide speaker notes were flattened and appended to the end of the document content. In v7.2.0, these notes are now strictly attached to their logical parent or preceding sibling nodes via a new node.notes[] array. Note: The legacy putNotesAtLast config flag is now deprecated.
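
For code that used to scan the end of ast.content for notes, the migration amounts to walking the tree and reading each node's notes[] array instead. The following is a minimal sketch under stated assumptions: only `type` and `notes` are named in these release notes, while the `text` and `children` fields, the `'note'` type value, and the `SketchNode` and `collectNotes` names are hypothetical and used purely for illustration.

```typescript
// Hypothetical, simplified node shape. Only `type` and `notes` come from the
// release notes; `text` and `children` are assumptions for this sketch.
interface SketchNode {
  type: string;
  text?: string;
  children?: SketchNode[];
  notes?: SketchNode[];
}

// Depth-first walk that gathers every note attached to any node.
function collectNotes(nodes: SketchNode[]): SketchNode[] {
  const found: SketchNode[] = [];
  for (const node of nodes) {
    if (node.notes) {
      found.push(...node.notes);
    }
    if (node.children) {
      found.push(...collectNotes(node.children));
    }
  }
  return found;
}

// Hand-written sample data standing in for a parsed document.
const sampleContent: SketchNode[] = [
  {
    type: 'paragraph',
    text: 'Body text with a reference.',
    notes: [{ type: 'note', text: 'Footnote attached to the paragraph.' }],
  },
  {
    type: 'slide',
    children: [{ type: 'paragraph', text: 'Slide body.' }],
    notes: [{ type: 'note', text: 'Speaker note attached to the slide.' }],
  },
];

const allNotes = collectNotes(sampleContent);
console.log(allNotes.map((note) => note.text));
```

A real traversal would follow whatever child property the actual AST typings define; check them before adapting this.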

2. Auxiliary Content (Headers, Footers, Slide Masters)

The new ast.auxiliary property unlocks out-of-band document templates! officeParser now automatically extracts headers and footers from Word documents (ast.auxiliary.headers / footers), and Slide Masters from PowerPoint presentations (ast.auxiliary.slideMasters). These are neatly separated from the main sequential document flow.

3. Native & Custom Document Properties

The OfficeMetadata interface has been radically upgraded. Alongside canonical metadata fields (title, author, dates), officeParser now exposes format-specific verbatim metadata via ast.metadata.nativeProperties (e.g., <meta> tags in HTML, app.xml stats in DOCX, XMP dicts in PDF) and user-defined variables via ast.metadata.customProperties.

4. Discriminated Unions & Strict AST Typings

The generic OfficeContentNode interface has been completely refactored into a strict TypeScript Discriminated Union. This unlocks precise, compile-time type narrowing per node.type (e.g., safely accessing SlideMetadata only when type === 'slide'), eliminating the need for generic fallback assertions across your application.
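
To make the narrowing concrete, here is a small sketch of how a discriminated union behaves in TypeScript. The `SlideSketch`, `ParagraphSketch`, and `NodeSketch` types are hypothetical stand-ins, far smaller than the library's real union; only the `'slide'` type value and `metadata.pageNumber` appear in the example further down these notes.

```typescript
// Hypothetical, heavily simplified union. The real node union has more
// variants and fields than shown here.
interface SlideSketch {
  type: 'slide';
  metadata: { pageNumber: number };
}

interface ParagraphSketch {
  type: 'paragraph';
  text: string;
}

type NodeSketch = SlideSketch | ParagraphSketch;

function describeNode(node: NodeSketch): string {
  switch (node.type) {
    case 'slide':
      // Narrowed to SlideSketch: metadata.pageNumber is available here.
      return `slide ${node.metadata.pageNumber}`;
    case 'paragraph':
      // Narrowed to ParagraphSketch: text is available here.
      return `paragraph "${node.text}"`;
    default: {
      // Compile-time exhaustiveness check: adding a new variant without a
      // matching case makes this assignment fail to type-check.
      const unreachable: never = node;
      return unreachable;
    }
  }
}

console.log(describeNode({ type: 'slide', metadata: { pageNumber: 3 } }));
console.log(describeNode({ type: 'paragraph', text: 'Hello' }));
```

Accessing `node.text` inside the `'slide'` branch would be a compile error, which is the practical benefit of the stricter typings.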

5. Interactive HTML Spreadsheet Layouts & DOM Injections

The HTML Generator just got significantly smarter:

Interactive Spreadsheets: Spreadsheets generated from Excel or CSV files now render with desktop-class interactivity, featuring native draggable boundary handles (.col-resizer) to dynamically resize rows and columns in the browser.
Granular Layout Controls: Expanded HtmlGeneratorConfig with containerWidth, customCss, and DOM injections (head/body hook insertions).

🛠 Getting Started

npm install officeparser@7.2.0

Example of using the new Discriminated Unions, Auxiliary nodes, and Structural Notes:

import { parseOffice } from 'officeparser';

const ast = await parseOffice('presentation.pptx', {
  ignoreSlideMasters: false
});

// Access Slide Masters from the new auxiliary AST branch
const masterSlides = ast.auxiliary?.slideMasters || [];
console.log(`Found ${masterSlides.length} master slides!`);

// Confidently narrow types using Discriminated Unions!
for (const node of ast.content) {
  if (node.type === 'slide') {
    // TypeScript now explicitly knows this is a Slide node.
    // Slide Notes are now structurally nested under the slide!
    const noteCount = node.notes?.length || 0;
    console.log(`Slide ${node.metadata.pageNumber} has ${noteCount} notes attached.`);
  }
}

🔗 Full Changelog: View v7.2.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com

❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained as a voluntary project, growing to support over 10 million downloads and 300,000+ weekly installations.

As I build the ultimate document-to-AI pipeline, I seek professional sustainability to fund officeParser's next milestones:

Core Sustainability: Keeping up with dependency updates, test coverage, and performance tuning.
Multi-Runtime Excellence: Official support for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
Enterprise Connectors: Dedicated integrations with LangChain, LlamaIndex, and Haystack.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee

Changes: v7.1.0..v7.2.0

2026-05-26 05:38:59

officeParser

harshankur

v7.1.0

v7.1.0: 🛡️ Cancellation Control, Thread Safety & Robust Entity Decoding

I am excited to announce the release of officeParser v7.1.0! Following the massive paradigm shift of v7.0.0, this release is dedicated to enterprise-grade reliability, memory leak prevention, and precision parsing.

As officeParser scales to support millions of production workloads and AI pipelines, v7.1.0 introduces critical safety guards, cancellation capabilities, and robustness improvements for heavy-duty document processing.

🌟 Key Pillars of the v7.1.0 Update

1. Native Cancellation with `AbortSignal`

You can now pass an abortSignal in both OfficeParserConfig and OcrConfig (as well as specific configurations for PdfGenerator and ChunkingGenerator). This allows you to immediately interrupt:

Document loading and parsing loops.
Background Puppeteer browsers.
Active OCR worker recognition tasks.
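
As an illustration of what interrupting a parsing loop can look like, the sketch below checks an AbortSignal between units of work. It is a generic pattern, not officeParser's internal code: `processPages` and the uppercase "work" are hypothetical, and the thrown error simply mirrors the `AbortError` name that the example in Getting Started tests for.

```typescript
// Hypothetical stand-in for a parsing loop that cooperates with cancellation.
function processPages(pages: string[], signal: AbortSignal): string[] {
  const results: string[] = [];
  for (const page of pages) {
    // Check the signal between units of work so a cancel takes effect promptly.
    if (signal.aborted) {
      const error = new Error('The operation was aborted.');
      error.name = 'AbortError';
      throw error;
    }
    results.push(page.toUpperCase()); // placeholder for real per-page work
  }
  return results;
}

const controller = new AbortController();

// Before abort(): the loop runs to completion.
const completed = processPages(['a', 'b'], controller.signal);

// After abort(): the next call stops at its first check.
controller.abort();
let abortedName = '';
try {
  processPages(['c'], controller.signal);
} catch (error) {
  abortedName = error instanceof Error ? error.name : '';
}

console.log(completed, abortedName);
```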

2. Consolidated Timeouts & Memory Safety

To prevent execution stalls and hanging resources in serverless or containerized environments:

Consolidated OCR Timeouts: Timeout options have been unified under a structured timeout configuration (workerLoad, recognition, and autoTerminate in OcrTimeoutConfig).
Generator Timeouts: Added robust timeouts for PdfGenerator and ChunkingGenerator tasks.
Resource Leak Prevention: If a generator or parser execution fails, cancels, or times out, Puppeteer browser instances and Tesseract workers are forcefully terminated and evicted, ensuring no dangling resources are left behind.
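
The combination of a timeout with guaranteed cleanup can be sketched with `Promise.race` and a `finally` block. This is an assumption-laden illustration of the general technique, not the library's implementation: `Terminable`, `runWithTimeout`, and the fake worker are all hypothetical names.

```typescript
// Hypothetical resource with an explicit cleanup hook, standing in for a
// browser instance or an OCR worker.
interface Terminable {
  terminate(): void;
}

// Runs `task`, rejects after `ms` milliseconds, and always terminates the
// resource whether the task succeeds, fails, or times out.
async function runWithTimeout<T>(
  resource: Terminable,
  task: () => Promise<T>,
  ms: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    clearTimeout(timer); // no stray timer keeps the process alive
    resource.terminate(); // cleanup runs on every exit path
  }
}

let terminated = 0;
const fakeWorker: Terminable = {
  terminate: () => {
    terminated += 1;
  },
};

// A task that never settles, to force the timeout path.
const neverFinishes = () => new Promise<string>(() => {});

const outcome = runWithTimeout(fakeWorker, neverFinishes, 20).then(
  () => 'resolved',
  (error: Error) => error.message,
);
outcome.then((message) => console.log(message, terminated));
```

Note that the race only stops waiting; the underlying task is not interrupted, so real cancellation still relies on terminating the resource or on an abort signal.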

3. Robust XLSX Parsing & Entity Decoding

XML Entity Decoding: Resolved bugs where decimal, hex, and named XML entities (e.g., &#38;, &#x26;, &lt;) in Excel sheets were parsed as raw strings.
inlineStr Attribute Support: Fixed inlineStr tag attribute matching to correctly process inline spreadsheet strings.
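
For readers unfamiliar with the three entity forms, the sketch below decodes decimal references, hexadecimal references, and the five predefined XML named entities in a single pass. It is a minimal illustration, not the parser's actual decoder; `decodeXmlEntities` and `NAMED_ENTITIES` are hypothetical names.

```typescript
// The five entities predefined by XML. A Map avoids accidental hits on
// inherited object properties such as "constructor".
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeXmlEntities(input: string): string {
  // One pass over the input, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return input.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g,
    (match: string, body: string) => {
      if (body.startsWith('#x')) {
        return String.fromCodePoint(parseInt(body.slice(2), 16)); // hex reference
      }
      if (body.startsWith('#')) {
        return String.fromCodePoint(parseInt(body.slice(1), 10)); // decimal reference
      }
      const named = NAMED_ENTITIES.get(body);
      return named !== undefined ? named : match; // leave unknown names untouched
    },
  );
}

const decodedSample = decodeXmlEntities('A &#38; B &#x26; C &lt; D');
console.log(decodedSample);
```

The sketch does not validate code points, so a reference beyond U+10FFFF would throw a RangeError; a production decoder would need to handle that case.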

4. Visualizer Panel Upgrades & Compliance

Timeout & Cancellation Controls: The web visualizer config drawer now exposes granular controls for OCR and generator timeouts.
ESM CSP Compliance: Replaced legacy dynamic module loading with direct ESM-native import() to comply with strict Content Security Policies.

🛠 Getting Started

npm install officeparser@7.1.0

Example of using the new AbortSignal and timeout suite:

const { parseOffice } = require('officeparser');

const controller = new AbortController();

// CommonJS has no top-level await, so run the parse inside an async function.
async function main() {
  try {
    const ast = await parseOffice('large-file.docx', {
      abortSignal: controller.signal,
      ocr: {
        enable: true,
        // Consolidated OCR timeouts
        timeout: {
          workerLoad: 10000,   // Max time to load OCR worker (ms)
          recognition: 30000,  // Max time for text recognition (ms)
          autoTerminate: 60000 // Inactivity cleanup (ms)
        }
      }
    });
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Parsing task was aborted successfully.');
    } else {
      console.error('Parsing failed:', error);
    }
  }
}

main();

// Cancel the operation at any point:
// controller.abort();

🔗 Full Changelog: View v7.1.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com

2026-05-14 05:30:07

officeParser

harshankur

v7.0.0

v7.0.0: 🚀 Dual-Purpose Office Parser & Generator with Native RAG Suite

We are thrilled to announce the release of officeParser v7.0.0, a milestone version that redefines document processing for the AI era.

Since 2019, officeParser has been a trusted utility for simple text extraction. Today, we are evolving into a comprehensive document knowledge engine designed specifically for the next generation of AI-first infrastructure.

🌟 Key Pillars of the v7.0.0 Revolution

1. The Generation Revolution: `OfficeGenerator`

officeParser is now a dual-purpose engine. Beyond parsing, you can now generate high-fidelity outputs from the unified Office AST.

Universal Serialization: Transform any document into Markdown, HTML, CSV, RTF, or Layout-Aware Text.
The StyleMapper Engine: A new semantic translation layer that preserves formatting (bold, italic, colors, tables) across all output formats using a robust DSL.

2. The `OfficeConverter` & Fluent `.to()` API

v7.0.0 introduces OfficeConverter, our new flagship API for one-step document transformations.

Streamlined convert: A single method to go from any source file to any target format with automatic configuration sync.
Fluent AST Interface: The AST now features an asynchronous .to() method, allowing you to chain transformations effortlessly: await ast.to('markdown'), await ast.to('html'), or await ast.to('chunks').

3. Native AI/RAG Infrastructure

We’ve built the "Knowledge Bridge" required to turn messy, unstructured office files into high-precision data for your AI agents.

Native RAG Chunking Suite: No more external dependencies. Split documents using fixed-size (recursive), structural (hierarchy-aware), or semantic strategies.
Metadata-Aware: Every chunk retains its structural context, ensuring your Vector DB retrieval is more accurate than ever.
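
To give a feel for what fixed-size chunking with retained context involves, here is a plain sliding-window sketch. It is deliberately simpler than the recursive, structural, or semantic strategies named above and is not the library's algorithm; `ChunkSketch`, `chunkFixedSize`, and the `index`/`start` metadata fields are hypothetical.

```typescript
// Hypothetical chunk shape: the text plus enough metadata to locate it again.
interface ChunkSketch {
  text: string;
  index: number; // position of the chunk in the output sequence
  start: number; // character offset of the chunk in the source text
}

// Splits `text` into windows of `size` characters, where consecutive windows
// share `overlap` characters.
function chunkFixedSize(text: string, size: number, overlap: number): ChunkSketch[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('Require size > 0 and 0 <= overlap < size');
  }
  const chunks: ChunkSketch[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ text: text.slice(start, start + size), index: chunks.length, start });
    // Stop once a window has reached the end, to avoid a trailing fragment
    // that is fully contained in the previous chunk.
    if (start + size >= text.length) break;
  }
  return chunks;
}

const sampleChunks = chunkFixedSize('abcdefghij', 4, 1);
console.log(sampleChunks);
```

With a size of 4 and an overlap of 1, the ten-character sample yields three chunks starting at offsets 0, 3, and 6.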

4. Unified Document Intelligence

New Parser Extensions: We now natively ingest CSV, HTML, and Markdown, treating them as first-class citizens in our unified Office AST.
Redesigned AST: Support for complex table structures (vertical/horizontal merging), nested lists, and format-specific metadata.

5. Engineering Excellence & Performance

Extreme Speedups: We eliminated $O(n^2)$ bottlenecks in RTF parsing and achieved up to 23x speedups in OpenOffice (ODP) processing.
Memory Efficiency: Re-engineered Excel parsing with matchAll iteration, preventing execution stalls on massive spreadsheets.
DOCX Fidelity: Full support for w:vMerge and w:gridSpan, ensuring table structures are preserved exactly as they appear in Word.

🛠 Getting Started

npm install officeparser

The new API makes complex transformations trivial:

const { parseOffice, convert } = require('officeparser');

// CommonJS has no top-level await, so the calls run inside an async function.
async function main() {
  // Option 1: One-step conversion (High-level)
  // Convert any file to Markdown, HTML, CSV, etc. in one line.
  const { value } = await convert('proposal.docx', 'md');
  console.log(value); // The generated Markdown string

  // Option 2: Parse once, convert many (Fluent API)
  // Ideal for multi-format export or RAG chunking.
  const ast = await parseOffice('data.xlsx');
  const { value: html } = await ast.to('html');
  const { value: chunks } = await ast.to('chunks');
}

main().catch(console.error);

🔗 Full Changelog: View v7.0.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com

❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained by a single person as a voluntary project, growing from a simple utility to a critical piece of infrastructure with over 10 million downloads and 300,000+ weekly installations.

As we pivot towards the "Super-Tool" era, I am seeking professional sustainability to fund the next phase of the roadmap:

Core Sustainability: Maintaining 100% test coverage and dependency health for my global user base.
Multi-Runtime Excellence: Official support and drivers for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
Enterprise Connectivity: High-performance connectors for LangChain, LlamaIndex, and Haystack, alongside intelligent chart-to-JSON extraction.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee

2026-04-29 05:51:54

officeParser

harshankur

v6.1.1

v6.1.1: Sequential Content Ordering & AST Structural Fidelity

Changes: v6.1.0..v6.1.1

I am excited to announce v6.1.1, a significant update focused on Structural Fidelity and AST Precision. This release standardizes how complex document layouts—like soft line breaks and nested lists—are represented across all major office formats.

✨ Key Highlights

📐 Advanced Layout Analysis (DOCX)

Break Node Support: Comprehensive extraction of w:br, w:cr, and w:lastRenderedPageBreak. Your AST now understands physical document breaks perfectly.
Indentation Metadata: Now extracting <w:ind> properties, allowing you to reconstruct paragraph layouts with high accuracy.

📊 High-Fidelity Presentations (PPTX)

Sequential Parsing Engine: We've migrated to an iterative child-processing model. This guarantees that text runs, soft breaks, and fields are captured in their exact visual order.
Dynamic Field Extraction: Support for <a:fld> elements ensures slide numbers, dates, and other dynamic content are no longer lost.

📝 Perfect Lists (ODP & PPTX)

Soft Break Handling: Standardized handling of Shift+Enter within list items. Interruptions are now intelligently split into independent paragraph nodes, maintaining clean numbering continuity.
Nested List Correction: Fixed a stateful indexing bug in ODP to ensure perfect sequential numbering even in deeply nested structures.

🛡️ Stability & Reliability

Excel Multi-line Fix: Resolved edge cases in XLSX parsing where complex multi-line cells could cause parser failures.
RTF Encoding Resilience: Improved byte-buffering logic to resolve character dropouts (like smart quotes) in legacy RTF streams.
Security Hardening: Upgraded @xmldom/xmldom to 0.9.10 to ensure a secure parsing environment.

2026-04-14 21:51:57

officeParser

harshankur

v6.1.0

v6.1.0: Infrastructure Stability & Smart OCR Scheduling

Changes: v6.0.7..v6.1.0

This release marks a major milestone in the technical maturity of officeParser, moving from a monolithic parser to a robust, resource-aware infrastructure. v6.1.0 focuses on stability, performance, and developer experience without breaking core backward compatibility.

🔥 Major Highlights

📦 Modern Module System & Nomenclature Change

Since v6.0.7, we have standardized our browser distribution to support modern development workflows. This includes a major change in bundle naming:

Nomenclature Change: officeparser.browser.js or officeParserBundle@${VERSION}.js has been renamed to officeparser.browser.iife.js to explicitly indicate its format.
Dual-Bundle System: We now ship two distinct browser packages (append @${VERSION} after officeparser in the file name if you are using a release asset):
1. officeparser.browser.iife.js: Standard IIFE bundle for direct <script> tag usage (Global officeParser namespace).
2. officeparser.browser.mjs: A native ESM bundle for modern browsers, Vite, and Webpack 5.
Node.js: Full native ESM support with Node16 resolution.

🧠 Smart OCR Worker Pool

We’ve completely rewritten the OCR engine to handle internal resource management intelligently.

Lazy Initialization: OCR workers and tesseract.js are now lazy-loaded. Simply requiring or importing the library at the top level no longer spawns any background processes, resolving the long-standing "process leak" issue.
Worker Pooling: Workers are now pooled and reused across parallel parsing requests, providing up to 3x faster processing for documents with multiple images.
Auto-Termination: A new idle-timer automatically cleans up worker processes after 10 seconds of inactivity (configurable via ocrConfig.autoTerminateTimeout).
Manual Control: Exported a new terminateOcr() function for snappy CLI script exits.
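
The lazy-initialization, reuse, idle-cleanup, and manual-termination behaviors described above follow a common pattern, sketched below for a single resource. This is not officeParser's scheduler: `LazyResource` and its `acquire`, `release`, and `terminate` methods are hypothetical names chosen for the illustration.

```typescript
// Hypothetical lazy, self-cleaning holder for one expensive resource.
class LazyResource<T> {
  private instance: T | undefined;
  private idleTimer: ReturnType<typeof setTimeout> | undefined;
  private readonly create: () => T;
  private readonly destroy: (resource: T) => void;
  private readonly idleMs: number;

  constructor(create: () => T, destroy: (resource: T) => void, idleMs: number) {
    this.create = create;
    this.destroy = destroy;
    this.idleMs = idleMs;
  }

  // Creates the resource on first use and reuses it afterwards.
  acquire(): T {
    clearTimeout(this.idleTimer);
    this.idleTimer = undefined;
    let current = this.instance;
    if (current === undefined) {
      current = this.create();
      this.instance = current;
    }
    return current;
  }

  // Starts the idle countdown; the resource is destroyed unless re-acquired in time.
  release(): void {
    clearTimeout(this.idleTimer);
    this.idleTimer = setTimeout(() => this.terminate(), this.idleMs);
  }

  // Destroys the resource immediately, for example before a CLI script exits.
  terminate(): void {
    clearTimeout(this.idleTimer);
    this.idleTimer = undefined;
    if (this.instance !== undefined) {
      this.destroy(this.instance);
      this.instance = undefined;
    }
  }
}

let created = 0;
let destroyed = 0;
const pool = new LazyResource(
  () => {
    created += 1;
    return { id: created };
  },
  () => {
    destroyed += 1;
  },
  10000,
);

const createdBeforeUse = created; // nothing is created at construction time
const first = pool.acquire();
pool.release();
const second = pool.acquire(); // re-acquired before the idle timer fires
pool.terminate(); // manual cleanup also cancels any pending idle timer

console.log(createdBeforeUse, created, first === second, destroyed);
```

The pending idle timer is what keeps a Node.js process alive after work finishes, which is why an explicit terminate call gives an immediate exit.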

📦 Infrastructure Migration

Fflate Integration: Replaced legacy yauzl with fflate for zero-memory-buffer zip extraction, significantly improving performance on massive spreadsheets and low-memory environments like Edge Functions.
PDF.js v5: Upgraded the internal PDF engine to the latest stable release (v5.6.205) with improved text-layer coordinate alignment.

🏷️ Custom Property Extraction

You can now extract user-defined custom metadata (e.g., custom tags, proprietary fields) across almost all formats:

OOXML: Standard custom document properties (docProps/custom.xml).
ODF: User-defined fields in OpenOffice/LibreOffice metadata.
PDF: Custom key-value pairs from the PDF Info dictionary.

📄 Documentation & Branding Overhaul

Premium SPA Docs: A completely redesigned live documentation site with persistent fragments.
Interactive Visualizer: Test any file in-browser to see the hierarchical AST and real-time preview.
Troubleshooting Guide: A new, comprehensive debugging guide covering everything from process hangs to PDF worker resolution.
Metric Verification: Standardized all download metrics to 260k+ weekly installs with dynamic Shields.io badges and live verification links via npm-stat.com.

🛠️ API & Configuration Changes

New Configuration Options (`OfficeParserConfig`)

ocrConfig: New object for fine-grained OCR control.
ocrConfig.autoTerminateTimeout: Duration (ms) to keep workers alive before cleanup.
ocrConfig.workerPath, ocrConfig.corePath, ocrConfig.langPath: Full support for air-gapped/offline local Tesseract hosting.

Deprecations

ocrLanguage: This string property is now deprecated. Use ocrConfig.language instead. (Note: Existing code using ocrLanguage will continue to work perfectly in v6.1.0).
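
Putting the option names above together, an offline OCR setup might look like the fragment below. The property names come from this section; the value types, the `'eng'` language code, and every path are illustrative assumptions rather than values taken from the library's typings, and the object would presumably be passed as the parser configuration.

```typescript
// Option names are taken from the list above; values and paths are hypothetical.
const offlineOcrOptions = {
  ocrConfig: {
    language: 'eng', // replaces the deprecated top-level ocrLanguage
    autoTerminateTimeout: 10000, // ms of inactivity before workers are cleaned up
    workerPath: './vendor/tesseract/worker.min.js', // hypothetical local path
    corePath: './vendor/tesseract/core', // hypothetical local path
    langPath: './vendor/tesseract/lang', // hypothetical local path
  },
};

console.log(Object.keys(offlineOcrOptions.ocrConfig));
```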

🐛 Bug Fixes & Refinements

PDF: Fixed hierarchical alignment where links sometimes drifted from their corresponding text nodes.
ODT/RTF: Improved list parsing to accurately reflect nested indentation levels in the AST.
CLI: The command-line interface now automatically calls terminateOcr() for a faster return to the prompt.
Sponsorship: Integrated funding.json and .well-known manifest support for community sustainability.

⚠️ Migration Note for v6.0.7 Users

Nomenclature: If you were using a script tag, update your source from officeparser.browser.js or officeParserBundle@${VERSION}.js to officeparser.browser.iife.js.
Process Lifecycle: Node.js scripts using OCR may stay alive for 10s after finishing due to the worker pool. Call await terminateOcr() for an immediate exit.

❤️ Contributors

A huge shoutout to @carlosb1504 for their massive contributions, specifically replacing yauzl with fflate for improved performance and implementing the core custom property extraction logic.

Full Changelog: v6.0.7...v6.1.0

2026-03-24 19:25:03

officeParser

harshankur

v6.0.7

v6.0.7 - 24.03.2026

Changes: v6.0.6..v6.0.7

This release focuses on improving the developer experience for browser-side integration, upgrading core dependencies for better security/performance, and stabilizing the CI/CD pipeline with modern OIDC publishing standards.

✨ New Features & Improvements

📦 Bundled Browser Typings: Added dist/officeparser.browser.d.ts. This is a single, self-contained declaration file designed specifically for developers using the browser bundle directly. It provides full IntelliSense without needing any node_modules.
🚀 Robust OIDC Publishing:
- Upgraded the release pipeline to Node.js 24 for better native support of modern NPM features.
- Implemented explicit NPM Provenance using enhanced environment configurations and publishConfig in package.json.
🧹 Cleaner Assets: Disabled source maps in the browser bundle (officeparser.browser.js) to provide a cleaner and more lightweight production asset.
🔗 Updated Homepage: The project homepage has been moved to officeparser.harshankur.com.

⬆️ Dependency Upgrades

file-type: Upgraded to ^21.3.4 for improved file detection and security.
typescript: Upgraded to ^6.0.2.
@types/node: Upgraded to ^22.15.5.

⚙️ Technical Changes

Added build:browser:types script using dts-bundle-generator.
Refactored build-and-publish.yml to trigger on release: published, ensuring stable asset attachment.
Added publishConfig to package.json to codify public access and provenance rules.

Full Changelog: v6.0.6...v6.0.7

2026-03-24 07:19:40

officeParser

harshankur

v6.0.6

v6.0.6 - 23.03.2026

Changes: v6.0.1..v6.0.6

This release introduces significant dependency upgrades, a modernized CI/CD pipeline with enhanced security, and several key improvements to RTF and PDF parsing.

🚀 Major Dependency Upgrades (v6 Core)

Engine Upgrade: All core parsing libraries have been bumped to their latest major versions for enhanced performance and security.
- file-type v19 (ESM support)
- tesseract.js v7
- pdfjs-dist v5.5
- yauzl v3
- @xmldom/xmldom v0.8.11

✨ New Features

OIDC Passwordless Publishing: The library now uses GitHub's OpenID Connect (OIDC) trust with NPM. This "Passwordless" flow eliminates the need for manually managed NPM tokens in CI/CD, significantly increasing supply chain security.
Smart PDF Worker Sync: Added a new runtime synchronization system that automatically matches the PDF worker version with the library version, preventing "API/Worker version mismatch" errors in browser environments.
Improved Module Loading: Introduced a robust moduleLoader to handle complex ESM/CJS interop for modern dependencies.
AST Visualizer & Docs: Launched a new documentation website and an interactive AST visualizer to help developers inspect parsed document structures.

🔧 Bug Fixes & Refinements

RTF Parser:
- Fixed a logic error where RTF endnotes were incorrectly identified as footnotes.
- Improved structure detection for complex RTF headers.
PDF Parser: Enhanced reliability of require-based loading for Node.js environments.
CI/CD Reliability: Overhauled the build pipeline to be "failure-aware"—the release process now correctly halts if parser validation tests fail, ensuring only stable versions are published.

📦 Maintenance

Synchronized package-lock.json and cleaned up build scripts.
Updated documentation and README with the latest versioning and security best practices.

2026-01-02 22:49:15

officeParser

harshankur

v6.0.1

v6.0.1 - 02.01.2026

Changes: v5.2.2..v6.0.1

We are thrilled to announce the release of officeParser v6.0.1, a major overhaul that transforms the library from a simple text extractor into a powerful, format-agnostic document analysis engine.

🌟 Key Highlights (v6.0.0+)

🌳 Abstract Syntax Tree (AST) Output

The core parsing engine now produces a rich, hierarchical Abstract Syntax Tree. This allows you to traverse documents structurally—accessing paragraphs, headings, tables, and lists with their original nesting and metadata preserved.

🖼️ OCR & Attachment Extraction

Integrated OCR: Use Tesseract.js to extract text from images and scanned PDF documents automatically. (Fixes #57)
Base64 Attachments: Extract images and charts directly as Base64 strings from all supported formats. (Fixes #68)

📄 New Format Support & Improvements

RTF Support: Added full support for Rich Text Format (.rtf) files, including complex nested tables and lists. (Fixes #54)
Hierarchical PDF Parsing: PDFs are now split into logical page nodes, matching the structure of slides and sheets.
PowerPoint & Excel Nodes: Introduced dedicated slide and sheet delimiter nodes for cleaner visualization and processing. (Fixes #64)

🔗 Enhanced Hyperlinks

Extract Link Addresses: External hyperlinks are now correctly extracted and tagged in the AST. (Fixes #50)
Clickable Visualizer Links: The built-in visualizer now renders external links as clickable <a> tags.

🛠️ Bug Fixes & Refinements

Word List Preservation: Fixed issues where numbered elements and indentation levels were lost in .docx parsing. (Fixes #29)
Robust PDF Parsing: Added graceful error handling for corrupt PDF files and bad XRef entries, preventing parser crashes. (Fixes #44)
Formatting Parity: Expanded support for bold, italic, underline, colors, and fonts across all parsers (Docx, Pptx, Xlsx, Odp, Odt, Ods, Pdf, Rtf).
Strict Typing: Full TypeScript rewrite providing comprehensive interfaces for the entire AST structure.

🎨 Interactive AST Visualizer (v6.0.1 Fix)

The Live Visualizer has been revamped and fixed for stable deployment:

Color-Coded Sections: Blue for Pages, Green for Sheets, and Orange for Slides.
Premium UI: New card-based layout with interactive previews and deep-linked metadata.
Deployment: Migrated to the /docs folder for standard GitHub Pages hosting at the repository's root. (Fixed in v6.0.1)

⚠️ Breaking Changes

The library now returns an OfficeParserAST object instead of a raw string.
To get the old behavior (plain text), call ast.toText() on the returned object.

2025-11-12 19:10:11

officeParser

harshankur

v5.2.2

v5.2.2 - 12.11.2025

Changes: v5.2.1..v5.2.2

Fixed https://github.com/harshankur/officeParser/issues/69, where Excel numbers were parsed as integers even when they were not indexes into the sharedStrings array. Also improved the precision of floating-point numbers extracted from OpenOffice files.
Fixed https://github.com/harshankur/officeParser/issues/66, where the order of text was scrambled when part of the text in a cell had different formatting.
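
The sharedStrings distinction behind the first fix can be sketched as follows. In SpreadsheetML, a cell whose type attribute is `s` stores an index into the shared-strings table, while an untyped cell stores the literal value. The `resolveCell` function is a hypothetical illustration of that rule, not the parser's code.

```typescript
// Resolves the stored text of an XLSX cell to its logical value.
function resolveCell(
  cellType: string | undefined,
  raw: string,
  sharedStrings: string[],
): string | number {
  if (cellType === 's') {
    // Only typed shared-string cells are treated as table indexes.
    const shared = sharedStrings[Number(raw)];
    if (shared === undefined) {
      throw new RangeError(`No shared string at index ${raw}`);
    }
    return shared;
  }
  const numeric = Number(raw);
  // Keep the full numeric value, decimals included; otherwise return the raw text.
  return raw.trim() !== '' && !Number.isNaN(numeric) ? numeric : raw;
}

const strings = ['alpha', 'beta'];
console.log(resolveCell('s', '1', strings)); // shared-string lookup
console.log(resolveCell(undefined, '1', strings)); // plain number, not an index
console.log(resolveCell(undefined, '3.14159', strings)); // decimals preserved
```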

2025-09-30 05:21:58

officeParser

harshankur

v5.2.1

v5.2.1 - 29.09.2025

Changes: v5.2.0..v5.2.1

Fixed #58 by merging PR #63, which imports pdfjs conditionally in browser environments so that it is not loaded when it is not required. This is a temporary fix.
Fixed #61 by updating the generated typings file to accept a JS ArrayBuffer as input when parsing office files.
