harshankur/officeParser
2 days ago
officeParser

v7.1.0

v7.1.0: 🛡️ Cancellation Control, Thread Safety & Robust Entity Decoding

I am excited to announce the release of officeParser v7.1.0! Following the massive paradigm shift of v7.0.0, this release is dedicated to enterprise-grade reliability, memory leak prevention, and precision parsing.

As officeParser scales to support millions of production workloads and AI pipelines, v7.1.0 introduces critical safety guards, cancellation capabilities, and robustness improvements for heavy-duty document processing.


🌟 Key Pillars of the v7.1.0 Update

1. Native Cancellation with AbortSignal

You can now pass an abortSignal in both OfficeParserConfig and OcrConfig (as well as specific configurations for PdfGenerator and ChunkingGenerator). This allows you to immediately interrupt:

  • Document loading and parsing loops.
  • Background Puppeteer browsers.
  • Active OCR worker recognition tasks.
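
As a rough illustration of how cooperative cancellation with an AbortSignal generally works, here is a minimal sketch. It is not officeParser's internal code: processPages and its input are hypothetical names, and the uppercase call merely stands in for real parsing work.

```javascript
// Hypothetical helper (not part of officeParser): processes items one by one
// and checks the signal before each step, so an abort stops the loop early.
function processPages(pages, signal) {
  const results = [];
  for (const page of pages) {
    // Throws signal.reason (an AbortError by default) once abort() was called.
    signal.throwIfAborted();
    results.push(page.toUpperCase()); // stand-in for real parsing work
  }
  return results;
}

const controller = new AbortController();
const completed = processPages(['a', 'b'], controller.signal);

controller.abort();
let abortName = null;
try {
  processPages(['c'], controller.signal);
} catch (error) {
  abortName = error.name;
}
console.log(completed, abortName);
```

In real code the loop body would usually await I/O, which is what gives a call to controller.abort() the chance to run between iterations.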

2. Consolidated Timeouts & Memory Safety

To prevent execution stalls and hanging resources in serverless or containerized environments:

  • Consolidated OCR Timeouts: Timeout options have been unified under a structured timeout configuration (workerLoad, recognition, and autoTerminate in OcrTimeoutConfig).
  • Generator Timeouts: Added robust timeouts for PdfGenerator and ChunkingGenerator tasks.
  • Resource Leak Prevention: If a generator or parser execution fails, cancels, or times out, Puppeteer browser instances and Tesseract workers are forcefully terminated and evicted, ensuring no dangling resources are left behind.
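
The pattern behind "timeout plus guaranteed cleanup" can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: withTimeout, the 50 ms limit, and the cleanup callback are hypothetical, and the cleanup stands in for closing a browser or terminating a worker.

```javascript
// Hypothetical helper: rejects if `task` takes longer than `ms`, and always
// runs `cleanup` (e.g. closing a browser or worker) whatever the outcome.
async function withTimeout(task, ms, cleanup) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    clearTimeout(timer); // do not keep the event loop alive needlessly
    await cleanup();     // runs on success, failure, and timeout alike
  }
}

// Usage: a task that never settles is cut off after 50 ms.
const cleanupLog = [];
const timedOut = withTimeout(
  () => new Promise(() => {}),
  50,
  async () => { cleanupLog.push('resource released'); }
).catch((error) => error.message);

timedOut.then((message) => console.log(message, cleanupLog));
```

Putting the cleanup in a finally block is the design choice that matters here: the resource is released on every exit path, not only on the happy path.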

3. Robust XLSX Parsing & Entity Decoding

  • XML Entity Decoding: Resolved bugs where decimal, hex, and named XML entities (e.g., &#38;, &#x26;, &lt;) in Excel sheets were parsed as raw strings.
  • inlineStr Attribute Support: Fixed inlineStr tag attribute matching to correctly process inline spreadsheet strings.
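
For readers curious what decoding the three entity forms involves, here is a minimal sketch. It is not the library's actual code: decodeXmlEntities is a hypothetical helper, and it only knows the five named entities predefined by XML.

```javascript
// Named entities predefined by the XML specification.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Hypothetical helper: decodes decimal (&#38;), hex (&#x26;) and named (&lt;)
// references in a single pass, so "&amp;lt;" becomes "&lt;" and not "<".
function decodeXmlEntities(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g, (match, body) => {
    if (body.startsWith('#x')) {
      return String.fromCodePoint(parseInt(body.slice(2), 16));
    }
    if (body.startsWith('#')) {
      return String.fromCodePoint(parseInt(body.slice(1), 10));
    }
    // Unknown names are left untouched rather than guessed.
    const known = Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body);
    return known ? NAMED_ENTITIES[body] : match;
  });
}

const decoded = decodeXmlEntities('R&#38;D &#x26; Q&amp;A: 1 &lt; 2');
console.log(decoded);
```

The sketch does not guard against numeric references outside the Unicode range, which would make String.fromCodePoint throw; production code would need to handle that case.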

4. Visualizer Panel Upgrades & Compliance

  • Timeout & Cancellation Controls: The web visualizer config drawer now exposes granular controls for OCR and generator timeouts.
  • ESM CSP Compliance: Replaced legacy dynamic module loading with direct ESM-native import() to comply with strict Content Security Policies.

🛠 Getting Started

npm install officeparser@7.1.0

Example of using the new AbortSignal and timeout suite:

const { parseOffice } = require('officeparser');

const controller = new AbortController();

// CommonJS modules do not support top-level await, so wrap the call in an async function.
async function main() {
  try {
    const ast = await parseOffice('large-file.docx', {
      abortSignal: controller.signal,
      ocr: {
        enable: true,
        // Consolidated OCR timeouts
        timeout: {
          workerLoad: 10000,   // Max time to load OCR worker (ms)
          recognition: 30000,  // Max time for text recognition (ms)
          autoTerminate: 60000 // Inactivity cleanup (ms)
        }
      }
    });
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log('Parsing task was aborted successfully.');
    } else {
      console.error('Parsing failed:', error);
    }
  }
}

main();

// Cancel the operation at any point:
// controller.abort();

🔗 Full Changelog: View v7.1.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com


❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained as a voluntary project, growing to support over 10 million downloads and 300,000+ weekly installations.

As I build the ultimate document-to-AI pipeline, I seek professional sustainability to fund officeParser's next milestones:

  • Core Sustainability: Keeping up with dependency updates, test coverage, and performance tuning.
  • Multi-Runtime Excellence: Official support for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
  • Enterprise Connectors: Dedicated integrations with LangChain, LlamaIndex, and Haystack.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee


14 days ago
officeParser

v7.0.0

v7.0.0: 🚀 Dual-Purpose Office Parser & Generator with Native RAG Suite

We are thrilled to announce the release of officeParser v7.0.0, a milestone version that redefines document processing for the AI era.

Since 2019, officeParser has been a trusted utility for simple text extraction. Today, we are evolving into a comprehensive document knowledge engine designed specifically for the next generation of AI-first infrastructure.


🌟 Key Pillars of the v7.0.0 Revolution

1. The Generation Revolution: OfficeGenerator

officeParser is now a dual-purpose engine. Beyond parsing, you can now generate high-fidelity outputs from the unified Office AST.

  • Universal Serialization: Transform any document into Markdown, HTML, CSV, RTF, or Layout-Aware Text.
  • The StyleMapper Engine: A new semantic translation layer that preserves formatting (bold, italic, colors, tables) across all output formats using a robust DSL.

2. The OfficeConverter & Fluent .to() API

v7.0.0 introduces OfficeConverter, our new flagship API for one-step document transformations.

  • Streamlined convert: A single method to go from any source file to any target format with automatic configuration sync.
  • Fluent AST Interface: The AST now features an asynchronous .to() method, allowing you to chain transformations effortlessly: await ast.to('markdown'), await ast.to('html'), or await ast.to('chunks').

3. Native AI/RAG Infrastructure

We’ve built the "Knowledge Bridge" required to turn messy, unstructured office files into high-precision data for your AI agents.

  • Native RAG Chunking Suite: No more external dependencies. Split documents using fixed-size (recursive), structural (hierarchy-aware), or semantic strategies.
  • Metadata-Aware: Every chunk retains its structural context, ensuring your Vector DB retrieval is more accurate than ever.
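
To make the idea of metadata-carrying chunks concrete, here is a minimal sketch of plain fixed-size chunking with overlap. It is not the library's chunker (which the notes describe as recursive, structural, or semantic): chunkText, its parameters, and the metadata shape are all hypothetical.

```javascript
// Hypothetical helper: splits text into chunks of at most `size` characters,
// repeating the last `overlap` characters of each chunk at the start of the next.
function chunkText(text, size, overlap, metadata) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('Require size > 0 and 0 <= overlap < size');
  }
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({
      text: text.slice(start, end),
      // Each chunk carries its position and the caller-supplied context.
      metadata: { ...metadata, start, end },
    });
    if (end === text.length) break; // final chunk reached
  }
  return chunks;
}

const chunks = chunkText('abcdefghij', 4, 1, { heading: 'Intro' });
console.log(chunks.map((chunk) => chunk.text));
```

Keeping the offsets and heading on every chunk is what lets a retrieval step point back to where in the document a match came from.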

4. Unified Document Intelligence

  • New Parser Extensions: We now natively ingest CSV, HTML, and Markdown, treating them as first-class citizens in our unified Office AST.
  • Redesigned AST: Support for complex table structures (vertical/horizontal merging), nested lists, and format-specific metadata.

5. Engineering Excellence & Performance

  • Extreme Speedups: We eliminated O(n²) bottlenecks in RTF parsing and achieved up to 23x speedups in OpenOffice (ODP) processing.
  • Memory Efficiency: Re-engineered Excel parsing with matchAll iteration, preventing execution stalls on massive spreadsheets.
  • DOCX Fidelity: Full support for w:vMerge and w:gridSpan, ensuring table structures are preserved exactly as they appear in Word.
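
The matchAll approach mentioned above can be illustrated with a small sketch. The XML sample, the regular expression, and iterateCells are hypothetical and far simpler than a real worksheet parser; the point is only that matchAll yields matches lazily.

```javascript
// Hypothetical sample of worksheet XML; real sheets are far larger.
const sheetXml =
  '<row r="1"><c r="A1"><v>10</v></c><c r="B1"><v>20</v></c></row>' +
  '<row r="2"><c r="A2"><v>30</v></c></row>';

// matchAll returns a lazy iterator, so matches are produced one at a time
// instead of being collected into one large array up front.
function* iterateCells(xml) {
  const cellPattern = /<c r="([A-Z]+[0-9]+)"[^>]*><v>([^<]*)<\/v><\/c>/g;
  for (const match of xml.matchAll(cellPattern)) {
    yield { ref: match[1], value: match[2] };
  }
}

const cells = [...iterateCells(sheetXml)];
console.log(cells);
```

Spreading into an array is done here only to print the result; a consumer that handles one cell at a time can iterate the generator directly.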

🛠 Getting Started

npm install officeparser

The new API makes complex transformations trivial:

const { parseOffice, convert } = require('officeparser');

// CommonJS modules do not support top-level await, so wrap the calls in an async function.
async function main() {
  // Option 1: One-step conversion (High-level)
  // Convert any file to Markdown, HTML, CSV, etc. in one line.
  const { value } = await convert('proposal.docx', 'md');
  console.log(value); // The generated Markdown string

  // Option 2: Parse once, convert many (Fluent API)
  // Ideal for multi-format export or RAG chunking.
  const ast = await parseOffice('data.xlsx');
  const { value: html } = await ast.to('html');
  const { value: chunks } = await ast.to('chunks');
}

main().catch(console.error);

🔗 Full Changelog: View v7.0.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com


❤️ Supporting the Future of Document Infrastructure

Since 2019, officeParser has been maintained by a single person as a voluntary project, growing from a simple utility to a critical piece of infrastructure with over 10 million downloads and 300,000+ weekly installations.

As we pivot towards the "Super-Tool" era, I am seeking professional sustainability to fund the next phase of the roadmap:

  • Core Sustainability: Maintaining 100% test coverage and dependency health for my global user base.
  • Multi-Runtime Excellence: Official support and drivers for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
  • Enterprise Connectivity: High-performance connectors for LangChain, LlamaIndex, and Haystack, alongside intelligent chart-to-JSON extraction.

If officeParser powers your production workflows or AI pipelines, please consider supporting its development:

👉 GitHub Sponsors
👉 Buy Me A Coffee


29 days ago
officeParser

v6.1.1

v6.1.1: Sequential Content Ordering & AST Structural Fidelity

Changes: v6.1.0..v6.1.1

I am excited to announce v6.1.1, a significant update focused on Structural Fidelity and AST Precision. This release standardizes how complex document layouts—like soft line breaks and nested lists—are represented across all major office formats.

✨ Key Highlights

📐 Advanced Layout Analysis (DOCX)

  • Break Node Support: Comprehensive extraction of w:br, w:cr, and w:lastRenderedPageBreak. Your AST now understands physical document breaks perfectly.
  • Indentation Metadata: Now extracting <w:ind> properties, allowing you to reconstruct paragraph layouts with high accuracy.

📊 High-Fidelity Presentations (PPTX)

  • Sequential Parsing Engine: We've migrated to an iterative child-processing model. This guarantees that text runs, soft breaks, and fields are captured in their exact visual order.
  • Dynamic Field Extraction: Support for <a:fld> elements ensures slide numbers, dates, and other dynamic content are no longer lost.

📝 Perfect Lists (ODP & PPTX)

  • Soft Break Handling: Standardized handling of Shift+Enter within list items. Interruptions are now intelligently split into independent paragraph nodes, maintaining clean numbering continuity.
  • Nested List Correction: Fixed a stateful indexing bug in ODP to ensure perfect sequential numbering even in deeply nested structures.

🛡️ Stability & Reliability

  • Excel Multi-line Fix: Resolved edge cases in XLSX parsing where complex multi-line cells could cause parser failures.
  • RTF Encoding Resilience: Improved byte-buffering logic to resolve character dropouts (like smart quotes) in legacy RTF streams.
  • Security Hardening: Upgraded @xmldom/xmldom to 0.9.10 to ensure a secure parsing environment.

2026-04-14 21:51:57
officeParser

v6.1.0

v6.1.0: Infrastructure Stability & Smart OCR Scheduling

Changes: v6.0.7..v6.1.0

This release marks a major milestone in the technical maturity of officeParser, moving from a monolithic parser to a robust, resource-aware infrastructure. v6.1.0 focuses on stability, performance, and developer experience without breaking core backward compatibility.


🔥 Major Highlights

📦 Modern Module System & Nomenclature Change

Since v6.0.7, we have standardized our browser distribution to support modern development workflows. This includes a major change in bundle naming:

  • Nomenclature Change: officeparser.browser.js or officeParserBundle@${VERSION}.js has been renamed to officeparser.browser.iife.js to explicitly indicate its format.
  • Dual-Bundle System: We now ship two distinct browser packages (add @${VERSION} after officeparser if you are using a release asset):
    1. officeparser.browser.iife.js: Standard IIFE bundle for direct <script> tag usage (Global officeParser namespace).
    2. officeparser.browser.mjs: A native ESM bundle for modern browsers, Vite, and Webpack 5.
  • Node.js: Full native ESM support with Node16 resolution.

🧠 Smart OCR Worker Pool

We’ve completely rewritten the OCR engine to handle internal resource management intelligently.

  • Lazy Initialization: OCR workers and tesseract.js are now lazy-loaded. Simply requiring or importing the library at the top level no longer spawns any background processes, which resolves the long-standing "process leak" issue.
  • Worker Pooling: Workers are now pooled and reused across parallel parsing requests, providing up to 3x faster processing for documents with multiple images.
  • Auto-Termination: A new idle-timer automatically cleans up worker processes after 10 seconds of inactivity (configurable via ocrConfig.autoTerminateTimeout).
  • Manual Control: Exported a new terminateOcr() function for snappy CLI script exits.
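
The lifecycle described in this list (create on first use, reuse, clean up after idle time, allow manual shutdown) can be sketched in a few lines. This is a simplified single-resource illustration, not the library's worker pool: createLazyResource and its callbacks are hypothetical, and pooling across parallel requests is not modeled.

```javascript
// Hypothetical lazy resource: created on first use, reused afterwards,
// and disposed after `idleMs` without activity.
function createLazyResource(create, dispose, idleMs) {
  let instance = null;
  let idleTimer = null;

  function terminate() {
    clearTimeout(idleTimer);
    idleTimer = null;
    if (instance !== null) {
      dispose(instance);
      instance = null;
    }
  }

  function acquire() {
    if (instance === null) instance = create(); // lazy initialization
    clearTimeout(idleTimer);                    // activity resets the idle clock
    idleTimer = setTimeout(terminate, idleMs);
    return instance;
  }

  return { acquire, terminate };
}

let created = 0;
let disposed = 0;
const worker = createLazyResource(
  () => ({ id: ++created }),
  () => { disposed += 1; },
  10000
);

const first = worker.acquire();
const second = worker.acquire(); // reuses the same instance
worker.terminate();              // manual shutdown instead of waiting for the idle timer
console.log(created, disposed, first === second);
```

Without the explicit terminate() call, the pending idle timer would keep a Node.js process alive until it fires, which is why a manual shutdown function is useful for short-lived scripts.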

📦 Infrastructure Migration

  • Fflate Integration: Replaced legacy yauzl with fflate for zero-memory-buffer zip extraction, significantly improving performance on massive spreadsheets and low-memory environments like Edge Functions.
  • PDF.js v5: Upgraded the internal PDF engine to the latest stable release (v5.6.205) with improved text-layer coordinate alignment.

🏷️ Custom Property Extraction

You can now extract user-defined custom metadata (e.g., custom tags, proprietary fields) across almost all formats:

  • OOXML: Standard custom document properties (docProps/custom.xml).
  • ODF: User-defined fields in OpenOffice/LibreOffice metadata.
  • PDF: Custom key-value pairs from the PDF Info dictionary.
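
For orientation, OOXML custom properties live in docProps/custom.xml as property elements with a name attribute and a typed vt: child. The sketch below reads them with a regular expression purely for illustration; readCustomProperties and the sample XML are hypothetical, and a real parser would use a proper XML parser instead.

```javascript
// Hypothetical, trimmed-down docProps/custom.xml content (namespaces omitted).
const customXml =
  '<Properties>' +
  '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Project">' +
  '<vt:lpwstr>Apollo</vt:lpwstr></property>' +
  '<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Reviewed">' +
  '<vt:bool>true</vt:bool></property>' +
  '</Properties>';

// Hypothetical helper: maps each property name to its vt: type and raw text value.
function readCustomProperties(xml) {
  const pattern = /<property\b[^>]*\bname="([^"]*)"[^>]*>\s*<vt:(\w+)>([^<]*)<\/vt:\2>/g;
  const properties = {};
  for (const [, name, type, value] of xml.matchAll(pattern)) {
    properties[name] = { type, value };
  }
  return properties;
}

const customProperties = readCustomProperties(customXml);
console.log(customProperties);
```

Values are kept as raw strings alongside their declared type so that the caller decides how to convert them.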

📄 Documentation & Branding Overhaul

  • Premium SPA Docs: A completely redesigned live documentation site with persistent fragments.
  • Interactive Visualizer: Test any file in-browser to see the hierarchical AST and real-time preview.
  • Troubleshooting Guide: A new, comprehensive debugging guide covering everything from process hangs to PDF worker resolution.
  • Metric Verification: Standardized all download metrics to 260k+ weekly installs with dynamic Shields.io badges and live verification links via npm-stat.com.

🛠️ API & Configuration Changes

New Configuration Options (OfficeParserConfig)

  • ocrConfig: New object for fine-grained OCR control.
  • ocrConfig.autoTerminateTimeout: Duration (ms) to keep workers alive before cleanup.
  • ocrConfig.workerPath, ocrConfig.corePath, ocrConfig.langPath: Full support for air-gapped/offline local Tesseract hosting.

Deprecations

  • ocrLanguage: This string property is now deprecated. Use ocrConfig.language instead. (Note: Existing code using ocrLanguage will continue to work perfectly in v6.1.0).

🐛 Bug Fixes & Refinements

  • PDF: Fixed hierarchical alignment where links sometimes drifted from their corresponding text nodes.
  • ODT/RTF: Improved list parsing to accurately reflect nested indentation levels in the AST.
  • CLI: The command-line interface now automatically calls terminateOcr() for a faster return to the prompt.
  • Sponsorship: Integrated funding.json and .well-known manifest support for community sustainability.

⚠️ Migration Note for v6.0.7 Users

  1. Nomenclature: If you were using a script tag, update your source from officeparser.browser.js or officeParserBundle@${VERSION}.js to officeparser.browser.iife.js.
  2. Process Lifecycle: Node.js scripts using OCR may stay alive for 10s after finishing due to the worker pool. Call await terminateOcr() for an immediate exit.

❤️ Contributors

A huge shoutout to @carlosb1504 for their massive contributions, specifically replacing yauzl with fflate for improved performance and implementing the core custom property extraction logic.

Full Changelog: v6.0.7...v6.1.0

2026-03-24 19:25:03
officeParser

v6.0.7

v6.0.7 - 24.03.2026

Changes: v6.0.6..v6.0.7

This release focuses on improving the developer experience for browser-side integration, upgrading core dependencies for better security/performance, and stabilizing the CI/CD pipeline with modern OIDC publishing standards.


✨ New Features & Improvements

  • 📦 Bundled Browser Typings: Added dist/officeparser.browser.d.ts. This is a single, self-contained declaration file designed specifically for developers using the browser bundle directly. It provides full IntelliSense without needing any node_modules.
  • 🚀 Robust OIDC Publishing:
    • Upgraded the release pipeline to Node.js 24 for better native support of modern NPM features.
    • Implemented explicit NPM Provenance using enhanced environment configurations and publishConfig in package.json.
  • 🧹 Cleaner Assets: Disabled source maps in the browser bundle (officeparser.browser.js) to provide a cleaner and more lightweight production asset.
  • 🔗 Updated Homepage: The project homepage has been moved to officeparser.harshankur.com.

⬆️ Dependency Upgrades

  • file-type: Upgraded to ^21.3.4 for improved file detection and security.
  • typescript: Upgraded to ^6.0.2.
  • @types/node: Upgraded to ^22.15.5.

⚙️ Technical Changes

  • Added build:browser:types script using dts-bundle-generator.
  • Refactored build-and-publish.yml to trigger on release: published, ensuring stable asset attachment.
  • Added publishConfig to package.json to codify public access and provenance rules.

Full Changelog: v6.0.6...v6.0.7

2026-03-24 07:19:40
officeParser

v6.0.6

v6.0.6 - 23.03.2026

Changes: v6.0.1..v6.0.6

This release introduces significant dependency upgrades, a modernized CI/CD pipeline with enhanced security, and several key improvements to RTF and PDF parsing.

🚀 Major Dependency Upgrades (v6 Core)

  • Engine Upgrade: All core parsing libraries have been bumped to their latest major versions for enhanced performance and security.
    • file-type v19 (ESM support)
    • tesseract.js v7
    • pdfjs-dist v5.5
    • yauzl v3
    • @xmldom/xmldom v0.8.11

✨ New Features

  • OIDC Passwordless Publishing: The library now uses GitHub's OpenID Connect (OIDC) trust with NPM. This "Passwordless" flow eliminates the need for manually managed NPM tokens in CI/CD, significantly increasing supply chain security.
  • Smart PDF Worker Sync: Added a new runtime synchronization system that automatically matches the PDF worker version with the library version, preventing "API/Worker version mismatch" errors in browser environments.
  • Improved Module Loading: Introduced a robust moduleLoader to handle complex ESM/CJS interop for modern dependencies.
  • AST Visualizer & Docs: Launched a new documentation website and an interactive AST visualizer to help developers inspect parsed document structures.

🔧 Bug Fixes & Refinements

  • RTF Parser:
    • Fixed a logic error where RTF endnotes were incorrectly identified as footnotes.
    • Improved structure detection for complex RTF headers.
  • PDF Parser: Enhanced reliability of require-based loading for Node.js environments.
  • CI/CD Reliability: Overhauled the build pipeline to be "failure-aware"—the release process now correctly halts if parser validation tests fail, ensuring only stable versions are published.

📦 Maintenance

  • Synchronized package-lock.json and cleaned up build scripts.
  • Updated documentation and README with the latest versioning and security best practices.

2026-01-02 22:49:15
officeParser

v6.0.1

v6.0.1 - 02.01.2026

Changes: v5.2.2..v6.0.1

We are thrilled to announce the release of officeParser v6.0.1, a major overhaul that transforms the library from a simple text extractor into a powerful, format-agnostic document analysis engine.

🌟 Key Highlights (v6.0.0+)

🌳 Abstract Syntax Tree (AST) Output

The core parsing engine now produces a rich, hierarchical Abstract Syntax Tree. This allows you to traverse documents structurally—accessing paragraphs, headings, tables, and lists with their original nesting and metadata preserved.

🖼️ OCR & Attachment Extraction

  • Integrated OCR: Use Tesseract.js to extract text from images and scanned PDF documents automatically. (Fixes #57)
  • Base64 Attachments: Extract images and charts directly as Base64 strings from all supported formats. (Fixes #68)

📄 New Format Support & Improvements

  • RTF Support: Added full support for Rich Text Format (.rtf) files, including complex nested tables and lists. (Fixes #54)
  • Hierarchical PDF Parsing: PDFs are now split into logical page nodes, matching the structure of slides and sheets.
  • PowerPoint & Excel Nodes: Introduced dedicated slide and sheet delimiter nodes for cleaner visualization and processing. (Fixes #64)

🔗 Enhanced Hyperlinks

  • Extract Link Addresses: External hyperlinks are now correctly extracted and tagged in the AST. (Fixes #50)
  • Clickable Visualizer Links: The built-in visualizer now renders external links as clickable <a> tags.

🛠️ Bug Fixes & Refinements

  • Word List Preservation: Fixed issues where numbered elements and indentation levels were lost in .docx parsing. (Fixes #29)
  • Robust PDF Parsing: Added graceful error handling for corrupt PDF files and bad XRef entries, preventing parser crashes. (Fixes #44)
  • Formatting Parity: Expanded support for bold, italic, underline, colors, and fonts across all parsers (Docx, Pptx, Xlsx, Odp, Odt, Ods, Pdf, Rtf).
  • Strict Typing: Full TypeScript rewrite providing comprehensive interfaces for the entire AST structure.

🎨 Interactive AST Visualizer (v6.0.1 Fix)

The Live Visualizer has been revamped and fixed for stable deployment:

  • Color-Coded Sections: Blue for Pages, Green for Sheets, and Orange for Slides.
  • Premium UI: New card-based layout with interactive previews and deep-linked metadata.
  • Deployment: Migrated to the /docs folder for standard GitHub Pages hosting at the repository's root. (Fixed in v6.0.1)

⚠️ Breaking Changes

  • The library now returns an OfficeParserAST object instead of a raw string.
  • To get the old behavior (plain text), call ast.toText() on the returned object.

2025-11-12 19:10:11
officeParser

v5.2.2

v5.2.2 - 12.11.2025

Changes: v5.2.1..v5.2.2

2025-09-30 05:21:58
officeParser

v5.2.1

v5.2.1 - 29.09.2025

Changes: v5.2.0..v5.2.1

  • Fixed #58 by merging PR #63, which introduced a conditional import of pdfjs in browser environments so that it is loaded only when required. This acts as a temporary fix.
  • Fixed #61 by updating the generated typing file to allow a JS ArrayBuffer as an accepted type for parsing office files.

2025-07-06 03:55:36
officeParser

v5.2.0

v5.2.0 - 05.07.2025

Changes: v5.1.1..v5.2.0

  • Fixed #36 by upgrading pdfjs-dist version to the latest v5.3.31.
  • Added pdfjs-dist as an npm dependency instead of using an older local copy, which unnecessarily increased this library's size.

Breaking Changes:

  • The new version of pdfjs-dist requires node >= v18. Please ensure you upgrade node before using this version.
  • The browser bundle of officeParser does not work for PDF files in this release. Please use the artifacts from the previous release v5.1.0, including the worker file. Text extraction for all other supported file types works fine in browsers with the bundle artifact of this release as well.