v7.1.0
I am excited to announce the release of officeParser v7.1.0! Following the massive paradigm shift of v7.0.0, this release is dedicated to enterprise-grade reliability, memory leak prevention, and precision parsing.
As officeParser scales to support millions of production workloads and AI pipelines, v7.1.0 introduces critical safety guards, cancellation capabilities, and robustness improvements for heavy-duty document processing.
You can now pass an abortSignal in both OfficeParserConfig and OcrConfig (as well as specific configurations for PdfGenerator and ChunkingGenerator). This allows you to immediately interrupt:
- Document loading and parsing loops.
- Background Puppeteer browsers.
- Active OCR worker recognition tasks.
To prevent execution stalls and hanging resources in serverless or containerized environments:
- Consolidated OCR Timeouts: Timeout options have been unified under a structured timeout configuration (workerLoad, recognition, and autoTerminate in OcrTimeoutConfig).
- Generator Timeouts: Added robust timeouts for PdfGenerator and ChunkingGenerator tasks.
- Resource Leak Prevention: If a generator or parser execution fails, cancels, or times out, Puppeteer browser instances and Tesseract workers are forcefully terminated and evicted, ensuring no dangling resources are left behind.
- XML Entity Decoding: Resolved bugs where decimal, hex, and named XML entities (e.g., the entity-encoded forms of & and <) in Excel sheets were parsed as raw strings.
- inlineStr Attribute Support: Fixed inlineStr tag attribute matching to correctly process inline spreadsheet strings.
- Timeout & Cancellation Controls: The web visualizer config drawer now exposes granular controls for OCR and generator timeouts.
- ESM CSP Compliance: Replaced legacy dynamic module loading with direct ESM-native import() to comply with strict Content Security Policies.
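To make the XML entity fix above concrete, here is a minimal sketch of decoding decimal, hex, and named entities in a single pass. It is an illustration only, not officeParser's actual implementation; the function name decodeXmlEntities and the entity table are assumptions made for this example.

```javascript
// Hypothetical helper (not part of officeParser): decodes the five predefined
// XML named entities plus decimal (&#38;) and hex (&#x26;) character references.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeXmlEntities(text) {
  // A single replace pass avoids double decoding: "&amp;lt;" becomes "&lt;".
  return text.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g, (match, body) => {
    if (body.startsWith('#')) {
      const isHex = body[1] === 'x';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Out-of-range references are left as-is instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown named entities are left untouched rather than guessed.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

With this sketch, a cell containing R&#38;D, R&#x26;D, or R&amp;D decodes to the same visible text.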
npm install officeparser@7.1.0
Example of using the new AbortSignal and timeout suite:
const { parseOffice } = require('officeparser');

const controller = new AbortController();

// await is not allowed at the top level of a CommonJS module,
// so the call is wrapped in an async function.
async function main() {
    try {
        const ast = await parseOffice('large-file.docx', {
            abortSignal: controller.signal,
            ocr: {
                enable: true,
                // Consolidated OCR timeouts
                timeout: {
                    workerLoad: 10000,   // Max time to load OCR worker (ms)
                    recognition: 30000,  // Max time for text recognition (ms)
                    autoTerminate: 60000 // Inactivity cleanup (ms)
                }
            }
        });
    } catch (error) {
        if (error.name === 'AbortError') {
            console.log('Parsing task was aborted successfully.');
        } else {
            console.error('Parsing failed:', error);
        }
    }
}

main();

// Cancel the operation at any point:
// controller.abort();
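
The resource leak guarantee described in this release (workers and browsers are terminated whether a task fails, is cancelled, or times out) follows a common pattern that can be sketched as below. This is a generic illustration, not officeParser's internal code; withTimeoutAndCleanup and the resource.terminate() shape are assumptions for the example.

```javascript
// Hypothetical helper (not part of officeParser): runs `task` against a
// resource and always disposes of the resource, whether the task resolves,
// rejects, or exceeds `ms` milliseconds.
async function withTimeoutAndCleanup(resource, task, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins; a slow task is abandoned, not awaited.
    return await Promise.race([task(resource), timeout]);
  } finally {
    clearTimeout(timer);
    // Stand-in for closing a browser instance or terminating an OCR worker.
    await resource.terminate();
  }
}
```

The finally block is what prevents dangling resources: it runs on success, on error, and on timeout alike.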
🔗 Full Changelog: View v7.1.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com
Since 2019, officeParser has been maintained as a voluntary project, growing to support over 10 million downloads and 300,000+ weekly installations.
As I build the ultimate document-to-AI pipeline, I seek professional sustainability to fund officeParser's next milestones:
- Core Sustainability: Keeping up with dependency updates, test coverage, and performance tuning.
- Multi-Runtime Excellence: Official support for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
- Enterprise Connectors: Dedicated integrations with LangChain, LlamaIndex, and Haystack.
If officeParser powers your production workflows or AI pipelines, please consider supporting its development:
👉 GitHub Sponsors 👉 Buy Me A Coffee
v7.0.0
We are thrilled to announce the release of officeParser v7.0.0, a milestone version that redefines document processing for the AI era.
Since 2019, officeParser has been a trusted utility for simple text extraction. Today, we are evolving into a comprehensive document knowledge engine designed specifically for the next generation of AI-first infrastructure.
officeParser is now a dual-purpose engine. Beyond parsing, you can generate high-fidelity outputs from the unified Office AST.
- Universal Serialization: Transform any document into Markdown, HTML, CSV, RTF, or Layout-Aware Text.
- The StyleMapper Engine: A new semantic translation layer that preserves formatting (bold, italic, colors, tables) across all output formats using a robust DSL.
v7.0.0 introduces OfficeConverter, our new flagship API for one-step document transformations.
- Streamlined convert: A single method to go from any source file to any target format with automatic configuration sync.
- Fluent AST Interface: The AST now features an asynchronous .to() method, allowing you to chain transformations effortlessly: await ast.to('markdown'), await ast.to('html'), or await ast.to('chunks').
We’ve built the "Knowledge Bridge" required to turn messy, unstructured office files into high-precision data for your AI agents.
- Native RAG Chunking Suite: No more external dependencies. Split documents using fixed-size (recursive), structural (hierarchy-aware), or semantic strategies.
- Metadata-Aware: Every chunk retains its structural context, ensuring your Vector DB retrieval is more accurate than ever.
- New Parser Extensions: We now natively ingest CSV, HTML, and Markdown, treating them as first-class citizens in our unified Office AST.
- Redesigned AST: Support for complex table structures (vertical/horizontal merging), nested lists, and format-specific metadata.
- Extreme Speedups: We eliminated $O(n^2)$ bottlenecks in RTF parsing and achieved up to 23x speedups in OpenOffice (ODP) processing.
- Memory Efficiency: Re-engineered Excel parsing with matchAll iteration, preventing execution stalls on massive spreadsheets.
- DOCX Fidelity: Full support for w:vMerge and w:gridSpan, ensuring table structures are preserved exactly as they appear in Word.
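
As an illustration of the matchAll approach mentioned above, the sketch below walks over cells in a sheet XML string lazily, one match at a time. It is a simplified assumption of how such iteration can look, not the library's parser; the CELL_PATTERN regex and the iterateCells name are hypothetical and ignore many real-world cell variants.

```javascript
// Hypothetical, simplified cell pattern; real sheet XML has more variants
// (inline strings, formulas, empty cells) that this sketch does not cover.
const CELL_PATTERN = /<c r="([A-Z]+[0-9]+)"[^>]*>\s*<v>([^<]*)<\/v>\s*<\/c>/g;

function* iterateCells(sheetXml) {
  // String.prototype.matchAll returns a lazy iterator, so matches are
  // produced one at a time instead of being collected into one large array.
  for (const match of sheetXml.matchAll(CELL_PATTERN)) {
    yield { ref: match[1], value: match[2] };
  }
}
```

Because iterateCells is a generator, a caller can stop early or process cells in a streaming fashion without holding every match in memory.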
npm install officeparser
The new API makes complex transformations trivial:
const { parseOffice, convert } = require('officeparser');

// await is not allowed at the top level of a CommonJS module,
// so the calls are wrapped in an async function.
async function main() {
    // Option 1: One-step conversion (High-level)
    // Convert any file to Markdown, HTML, CSV, etc. in one line.
    const { value } = await convert('proposal.docx', 'md');
    console.log(value); // The generated Markdown string

    // Option 2: Parse once, convert many (Fluent API)
    // Ideal for multi-format export or RAG chunking.
    const ast = await parseOffice('data.xlsx');
    const { value: html } = await ast.to('html');
    const { value: chunks } = await ast.to('chunks');
}

main().catch(console.error);
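
For intuition about what a fixed-size chunking strategy produces, here is a minimal sliding-window sketch over plain text. It is deliberately simpler than the recursive strategy the library describes and is not its implementation; chunkFixedSize, its parameters, and the chunk shape are assumptions for this example.

```javascript
// Hypothetical sketch (not officeParser's chunker): fixed-size chunking with
// overlap. Each chunk keeps its character offsets as minimal metadata.
function chunkFixedSize(text, size, overlap = 0) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('Require size > 0 and 0 <= overlap < size');
  }
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break; // the last window reached the end
  }
  return chunks;
}
```

Keeping offsets on every chunk is a small-scale version of the metadata-aware idea: a retrieved chunk can always be traced back to its position in the source.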
🔗 Full Changelog: View v7.0.0 Details
🔗 Documentation & Visualizer: officeparser.harshankur.com
Since 2019, officeParser has been maintained by a single person as a voluntary project, growing from a simple utility to a critical piece of infrastructure with over 10 million downloads and 300,000+ weekly installations.
As we pivot towards the "Super-Tool" era, I am seeking professional sustainability to fund the next phase of the roadmap:
- Core Sustainability: Maintaining 100% test coverage and dependency health for my global user base.
- Multi-Runtime Excellence: Official support and drivers for Bun, Deno, and Edge (Cloudflare Workers, Vercel).
- Enterprise Connectivity: High-performance connectors for LangChain, LlamaIndex, and Haystack, alongside intelligent chart-to-JSON extraction.
If officeParser powers your production workflows or AI pipelines, please consider supporting its development:
👉 GitHub Sponsors 👉 Buy Me A Coffee
v6.1.1
Changes: v6.1.0..v6.1.1
I am excited to announce v6.1.1, a significant update focused on Structural Fidelity and AST Precision. This release standardizes how complex document layouts—like soft line breaks and nested lists—are represented across all major office formats.
- Break Node Support: Comprehensive extraction of w:br, w:cr, and w:lastRenderedPageBreak. Your AST now understands physical document breaks perfectly.
- Indentation Metadata: Now extracting <w:ind> properties, allowing you to reconstruct paragraph layouts with high accuracy.
- Sequential Parsing Engine: We've migrated to an iterative child-processing model. This guarantees that text runs, soft breaks, and fields are captured in their exact visual order.
- Dynamic Field Extraction: Support for <a:fld> elements ensures slide numbers, dates, and other dynamic content are no longer lost.
- Soft Break Handling: Standardized handling of Shift+Enter within list items. Interruptions are now intelligently split into independent paragraph nodes, maintaining clean numbering continuity.
- Nested List Correction: Fixed a stateful indexing bug in ODP to ensure perfect sequential numbering even in deeply nested structures.
- Excel Multi-line Fix: Resolved edge cases in XLSX parsing where complex multi-line cells could cause parser failures.
- RTF Encoding Resilience: Improved byte-buffering logic to resolve character dropouts (like smart quotes) in legacy RTF streams.
- Security Hardening: Upgraded @xmldom/xmldom to 0.9.10 to ensure a secure parsing environment.
v6.1.0
Changes: v6.0.7..v6.1.0
This release marks a major milestone in the technical maturity of officeParser, moving from a monolithic parser to a robust, resource-aware infrastructure. v6.1.0 focuses on stability, performance, and developer experience without breaking core backward compatibility.
Since v6.0.7, we have standardized our browser distribution to support modern development workflows. This includes a major change in bundle naming:
- Nomenclature Change: officeparser.browser.js or officeParserBundle@${VERSION}.js has been renamed to officeparser.browser.iife.js to explicitly indicate its format.
- Dual-Bundle System: We now ship two distinct browser packages (add @${VERSION} after officeparser if using a release asset):
  - officeparser.browser.iife.js: Standard IIFE bundle for direct <script> tag usage (Global officeParser namespace).
  - officeparser.browser.mjs: A native ESM bundle for modern browsers, Vite, and Webpack 5.
- Node.js: Full native ESM support with Node16 resolution.
We’ve completely rewritten the OCR engine to handle internal resource management intelligently.
- Lazy Initialization: OCR workers and tesseract.js are now lazy-loaded. Simply requiring or importing the library at the top level no longer spawns any background processes, resolving the long-standing "process leak" issue.
- Worker Pooling: Workers are now pooled and reused across parallel parsing requests, providing up to 3x faster processing for documents with multiple images.
- Auto-Termination: A new idle-timer automatically cleans up worker processes after 10 seconds of inactivity (configurable via ocrConfig.autoTerminateTimeout).
- Manual Control: Exported a new terminateOcr() function for snappy CLI script exits.
- Fflate Integration: Replaced legacy yauzl with fflate for zero-memory-buffer zip extraction, significantly improving performance on massive spreadsheets and low-memory environments like Edge Functions.
- PDF.js v5: Upgraded the internal PDF engine to the latest stable release (v5.6.205) with improved text-layer coordinate alignment.
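
The lazy initialization and idle auto-termination behaviour described above can be pictured with the sketch below. It is a generic pattern, not the library's OCR engine; the LazyWorker class, its createWorker factory, and the worker.terminate() shape are all assumptions for illustration.

```javascript
// Hypothetical sketch (not officeParser's OCR engine): a worker that is
// created on first use, shared by concurrent jobs, and terminated after a
// period of inactivity.
class LazyWorker {
  constructor(createWorker, idleMs) {
    this.createWorker = createWorker; // factory, called only on first use
    this.idleMs = idleMs;
    this.workerPromise = null;
    this.active = 0;
    this.timer = null;
  }

  async run(job) {
    clearTimeout(this.timer);
    this.active += 1;
    if (!this.workerPromise) {
      // Concurrent callers share one creation; a failed creation is forgotten.
      this.workerPromise = this.createWorker().catch((err) => {
        this.workerPromise = null;
        throw err;
      });
    }
    try {
      return await job(await this.workerPromise);
    } finally {
      this.active -= 1;
      if (this.active === 0) {
        // Arm the idle timer only once no job is running.
        this.timer = setTimeout(() => this.terminate().catch(() => {}), this.idleMs);
      }
    }
  }

  async terminate() {
    clearTimeout(this.timer);
    const pending = this.workerPromise;
    this.workerPromise = null;
    if (pending) await (await pending).terminate();
  }
}
```

Note that a pending idle timer keeps a Node.js process alive until it fires, which is why an explicit terminate call gives a faster exit in short-lived scripts.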
You can now extract user-defined custom metadata (e.g., custom tags, proprietary fields) across almost all formats:
- OOXML: Standard custom document properties (docProps/custom.xml).
- ODF: User-defined fields in OpenOffice/LibreOffice metadata.
- PDF: Custom key-value pairs from the PDF Info dictionary.
- Premium SPA Docs: A completely redesigned live documentation site with persistent fragments.
- Interactive Visualizer: Test any file in-browser to see the hierarchical AST and real-time preview.
- Troubleshooting Guide: A new, comprehensive debugging guide covering everything from process hangs to PDF worker resolution.
- Metric Verification: Standardized all download metrics to 260k+ weekly installs with dynamic Shields.io badges and live verification links via npm-stat.com.
- ocrConfig: New object for fine-grained OCR control.
  - ocrConfig.autoTerminateTimeout: Duration (ms) to keep workers alive before cleanup.
  - ocrConfig.workerPath, ocrConfig.corePath, ocrConfig.langPath: Full support for air-gapped/offline local Tesseract hosting.
- ocrLanguage: This string property is now deprecated. Use ocrConfig.language instead. (Note: Existing code using ocrLanguage will continue to work perfectly in v6.1.0).
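
If you want to move existing configuration objects from the deprecated ocrLanguage property to ocrConfig.language, a small helper along these lines may be enough. This is a sketch under assumptions: migrateOcrConfig is a hypothetical function of my own, not something the library exports, and it only touches the two property names mentioned above.

```javascript
// Hypothetical migration helper (not part of officeParser): moves the
// deprecated ocrLanguage string into ocrConfig.language.
function migrateOcrConfig(config) {
  const { ocrLanguage, ...rest } = config;
  if (ocrLanguage === undefined) return { ...rest };
  return {
    ...rest,
    // An explicit ocrConfig.language wins over the deprecated property.
    ocrConfig: { language: ocrLanguage, ...rest.ocrConfig },
  };
}
```

For example, migrateOcrConfig({ ocrLanguage: 'eng' }) yields a config whose ocrConfig.language is 'eng', with all other keys carried over unchanged.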
- PDF: Fixed hierarchical alignment where links sometimes drifted from their corresponding text nodes.
- ODT/RTF: Improved list parsing to accurately reflect nested indentation levels in the AST.
- CLI: The command-line interface now automatically calls terminateOcr() for a faster return to the prompt.
- Sponsorship: Integrated funding.json and .well-known manifest support for community sustainability.
- Nomenclature: If you were using a script tag, update your source from officeparser.browser.js or officeParserBundle@${VERSION}.js to officeparser.browser.iife.js.
- Process Lifecycle: Node.js scripts using OCR may stay alive for 10s after finishing due to the worker pool. Call await terminateOcr() for an immediate exit.
A huge shoutout to @carlosb1504 for their massive contributions, specifically replacing yauzl with fflate for improved performance and implementing the core custom property extraction logic.
Full Changelog: v6.0.7...v6.1.0
v6.0.7
Changes: v6.0.6..v6.0.7
This release focuses on improving the developer experience for browser-side integration, upgrading core dependencies for better security/performance, and stabilizing the CI/CD pipeline with modern OIDC publishing standards.
- 📦 Bundled Browser Typings: Added dist/officeparser.browser.d.ts. This is a single, self-contained declaration file designed specifically for developers using the browser bundle directly. It provides full IntelliSense without needing any node_modules.
- 🚀 Robust OIDC Publishing:
  - Upgraded the release pipeline to Node.js 24 for better native support of modern NPM features.
  - Implemented explicit NPM Provenance using enhanced environment configurations and publishConfig in package.json.
- 🧹 Cleaner Assets: Disabled source maps in the browser bundle (officeparser.browser.js) to provide a cleaner and more lightweight production asset.
- 🔗 Updated Homepage: The project homepage has been moved to officeparser.harshankur.com.
- file-type: Upgraded to ^21.3.4 for improved file detection and security.
- typescript: Upgraded to ^6.0.2.
- @types/node: Upgraded to ^22.15.5.
- Added build:browser:types script using dts-bundle-generator.
- Refactored build-and-publish.yml to trigger on release: published, ensuring stable asset attachment.
- Added publishConfig to package.json to codify public access and provenance rules.
Full Changelog: v6.0.6...v6.0.7
v6.0.6
Changes: v6.0.1..v6.0.6
This release introduces significant dependency upgrades, a modernized CI/CD pipeline with enhanced security, and several key improvements to RTF and PDF parsing.
- Engine Upgrade: All core parsing libraries have been bumped to their latest major versions for enhanced performance and security.
  - file-type v19 (ESM support)
  - tesseract.js v7
  - pdfjs-dist v5.5
  - yauzl v3
  - @xmldom/xmldom v0.8.11
- OIDC Passwordless Publishing: The library now uses GitHub's OpenID Connect (OIDC) trust with NPM. This "Passwordless" flow eliminates the need for manually managed NPM tokens in CI/CD, significantly increasing supply chain security.
- Smart PDF Worker Sync: Added a new runtime synchronization system that automatically matches the PDF worker version with the library version, preventing "API/Worker version mismatch" errors in browser environments.
- Improved Module Loading: Introduced a robust moduleLoader to handle complex ESM/CJS interop for modern dependencies.
- AST Visualizer & Docs: Launched a new documentation website and an interactive AST visualizer to help developers inspect parsed document structures.
- RTF Parser:
  - Fixed a logic error where RTF endnotes were incorrectly identified as footnotes.
  - Improved structure detection for complex RTF headers.
- PDF Parser: Enhanced reliability of require-based loading for Node.js environments.
- CI/CD Reliability: Overhauled the build pipeline to be "failure-aware": the release process now correctly halts if parser validation tests fail, ensuring only stable versions are published.
- Synchronized package-lock.json and cleaned up build scripts.
- Updated documentation and README with the latest versioning and security best practices.
v6.0.1
Changes: v5.2.2..v6.0.1
We are thrilled to announce the release of officeParser v6.0.1, a major overhaul that transforms the library from a simple text extractor into a powerful, format-agnostic document analysis engine.
The core parsing engine now produces a rich, hierarchical Abstract Syntax Tree. This allows you to traverse documents structurally—accessing paragraphs, headings, tables, and lists with their original nesting and metadata preserved.
- Integrated OCR: Use Tesseract.js to extract text from images and scanned PDF documents automatically. (Fixes #57)
- Base64 Attachments: Extract images and charts directly as Base64 strings from all supported formats. (Fixes #68)
- RTF Support: Added full support for Rich Text Format (.rtf) files, including complex nested tables and lists. (Fixes #54)
- Hierarchical PDF Parsing: PDFs are now split into logical page nodes, matching the structure of slides and sheets.
- PowerPoint & Excel Nodes: Introduced dedicated slide and sheet delimiter nodes for cleaner visualization and processing. (Fixes #64)
- Extract Link Addresses: External hyperlinks are now correctly extracted and tagged in the AST. (Fixes #50)
- Clickable Visualizer Links: The built-in visualizer now renders external links as clickable <a> tags.
- Word List Preservation: Fixed issues where numbered elements and indentation levels were lost in .docx parsing. (Fixes #29)
- Robust PDF Parsing: Added graceful error handling for corrupt PDF files and bad XRef entries, preventing parser crashes. (Fixes #44)
- Formatting Parity: Expanded support for bold, italic, underline, colors, and fonts across all parsers (Docx, Pptx, Xlsx, Odp, Odt, Ods, Pdf, Rtf).
- Strict Typing: Full TypeScript rewrite providing comprehensive interfaces for the entire AST structure.
The Live Visualizer has been revamped and fixed for stable deployment:
- Color-Coded Sections: Blue for Pages, Green for Sheets, and Orange for Slides.
- Premium UI: New card-based layout with interactive previews and deep-linked metadata.
- Deployment: Migrated to the /docs folder for standard GitHub Pages hosting at the repository's root. (Fixed in v6.0.1)
- The library now returns an OfficeParserAST object instead of a raw string.
- To get the old behavior (plain text), call ast.toText() on the returned object.
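
For code that relied on the old string result, a thin wrapper like the one below may ease the migration. It is a sketch, not an official shim: parseOfficeToText is a hypothetical name, and the parse function is passed in as an argument so the example makes no assumption about how the library is imported in your project.

```javascript
// Hypothetical compatibility wrapper (not part of officeParser): restores the
// pre-v6 "plain text" result by calling toText() on the returned AST.
async function parseOfficeToText(parseOffice, file, config) {
  const ast = await parseOffice(file, config);
  return ast.toText();
}

// Possible usage, assuming parseOffice is the library's exported function:
//   const { parseOffice } = require('officeparser');
//   parseOfficeToText(parseOffice, 'report.docx').then(console.log);
```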
v5.2.2
Changes: v5.2.1..v5.2.2
- Fixed https://github.com/harshankur/officeParser/issues/69 where Excel numbers were parsed as Int even when they did not represent an index in the sharedStrings array. In addition, float numbers extracted from OpenOffice files were not precise enough; that is fixed too.
- Fixed https://github.com/harshankur/officeParser/issues/66 where the order of text would get mixed up when part of the text in a cell differed in formatting.
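
The Excel number fix above hinges on a detail of the spreadsheet format: only cells marked with t="s" store an index into the sharedStrings array, while numeric cells store the number itself. A minimal sketch of that distinction follows; resolveCellValue is a hypothetical function written for this note, not the library's code.

```javascript
// Hypothetical sketch (not officeParser's code): resolve the text of a cell's
// <v> element based on the cell's t attribute.
function resolveCellValue(type, rawValue, sharedStrings) {
  if (type === 's') {
    // Only t="s" cells hold an index into the sharedStrings array.
    return sharedStrings[Number.parseInt(rawValue, 10)];
  }
  if (type === undefined || type === 'n') {
    // Numeric cells (the default type) keep their full value,
    // including any fractional part.
    return Number(rawValue);
  }
  return rawValue; // other cell types are passed through unchanged here
}
```

Under this sketch, a numeric cell containing 1 stays the number 1 instead of being looked up as the second shared string.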
v5.2.1
Changes: v5.2.0..v5.2.1
- Fixed #58 by merging PR #63 which introduced conditional import of pdfjs in browser environments for cases where it is not required. This acts as a temporary fix.
- Fixed #61 by updating the generated typing file to allow JS ArrayBuffer as an accepted type for parsing office files.
v5.2.0
Changes: v5.1.1..v5.2.0
- Fixed #36 by upgrading pdfjs-dist version to the latest v5.3.31.
- Added pdfjs-dist as an npm dependency instead of using an older local library, which unnecessarily increased this library's size.
- The new version of pdfjs-dist requires node >= v18. Please ensure you upgrade node before using this version.
- The browser bundle of officeParser does not work for PDF files with this release. Please use the artifacts from the previous release v5.1.0, including the worker file. Text extraction for all other supported files works fine in browsers with the bundle artifact of this release as well.