Benchmarks

Measured results comparing document processing pipelines for AI workloads.

Results Summary

We benchmarked four pipelines across 15 documents and 3 LLM tasks using Gemini 2.5 Flash. All pipelines ran against the same model for an apples-to-apples comparison.

Key Findings

Sidedoc uses 1,524x fewer prompt tokens than Raw OOXML, measured as total prompt tokens across all successful LLM task runs
Raw OOXML is not just expensive, it's unreliable: 9 of 45 Raw OOXML runs failed (rate limits / errors), while all other pipelines completed 45/45
Sidedoc uses about 12% fewer prompt tokens than Pandoc (8,531 vs 9,726 in total) while preserving formatting that Pandoc loses
At Claude Sonnet 4 pricing, OOXML costs $2.78/doc vs Sidedoc's $0.07/doc — a 41x cost difference

Why This Matters

Every document processing pipeline faces a fundamental tradeoff: token efficiency vs format preservation. Raw text extraction is cheapest but loses all formatting. Raw OOXML preserves everything but needs roughly 360,000 prompt tokens per run in this benchmark. Sidedoc resolves this tradeoff with clean markdown for the LLM, formatting metadata stored separately, and lossless round-trip reconstruction.

LLM Task Token Usage

Actual API token counts from Gemini 2.5 Flash across 3 tasks and 15 documents. All numbers are measured, not estimated.

Summarization Task

Pipeline	Avg Prompt Tokens	Avg Completion Tokens	Avg Total	Successful Runs
Sidedoc	106	813	919	15/15
Pandoc	122	790	912	15/15
Raw Text	58	605	663	15/15
Raw OOXML	359,706	1,020	360,726	12/15

Single Edit Task

Pipeline	Avg Prompt Tokens	Avg Completion Tokens	Avg Total	Successful Runs
Sidedoc	115	1,267	1,382	15/15
Pandoc	130	1,100	1,230	15/15
Raw Text	67	940	1,007	15/15
Raw OOXML	359,865	4,092	363,957	13/15

Multi-Turn Edit Task (3 rounds)

Pipeline	Avg Prompt Tokens	Avg Completion Tokens	Avg Total	Successful Runs
Sidedoc	348	2,268	2,616	15/15
Pandoc	397	2,012	2,409	15/15
Raw Text	208	1,534	1,742	15/15
Raw OOXML	363,915	10,001	373,916	11/15

Aggregate Token Usage

Pipeline	Runs	Total Prompt	Total Completion	Grand Total	vs Sidedoc (prompt)
Sidedoc	45/45	8,531	65,228	73,759	1.0x
Pandoc	45/45	9,726	58,527	68,253	1.1x
Raw Text	45/45	4,994	46,194	51,188	0.6x
Raw OOXML	36/45	12,997,779	175,449	13,173,228	1,524x
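
The "vs Sidedoc (prompt)" column can be recomputed from the Total Prompt column. The sketch below does that arithmetic; the dictionary and function names are illustrative and are not taken from the benchmark code.

```python
# Total prompt tokens per pipeline, copied from the aggregate table.
# Names below are illustrative, not part of the benchmark code.
TOTAL_PROMPT = {
    "Sidedoc": 8_531,
    "Pandoc": 9_726,
    "Raw Text": 4_994,
    "Raw OOXML": 12_997_779,
}


def ratio_vs_sidedoc(pipeline: str) -> float:
    """Prompt tokens of `pipeline` divided by Sidedoc's prompt tokens."""
    return TOTAL_PROMPT[pipeline] / TOTAL_PROMPT["Sidedoc"]


def percent_fewer(a: str, b: str) -> float:
    """How many percent fewer prompt tokens pipeline `a` uses than `b`."""
    return (1 - TOTAL_PROMPT[a] / TOTAL_PROMPT[b]) * 100


if __name__ == "__main__":
    for name in TOTAL_PROMPT:
        print(f"{name}: {ratio_vs_sidedoc(name):,.1f}x")
    print(f"Sidedoc vs Pandoc: {percent_fewer('Sidedoc', 'Pandoc'):.1f}% fewer")
```

The Raw OOXML total covers only its 36 successful runs, while the other pipelines cover 45, so a per-run comparison would show a larger gap than the ratio of totals.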

Content Representation

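How many tokens does each pipeline need to represent document content? Measured with the cl100k_base tokenizer before sending to the LLM. These counts cover document content only and come from a different tokenizer than the model's own, so they are not directly comparable to the API-reported prompt tokens above.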

Pipeline	Avg Tokens/Doc	Total (15 docs)	vs Sidedoc
Sidedoc	74	1,117	1.0x
Pandoc	89	1,336	1.2x
Raw Text	34	505	0.5x
Raw OOXML	325,715	4,885,730	4,374x

Raw text extraction uses fewer tokens but cannot reconstruct the document — all formatting, structure, and metadata are lost.

Cost Analysis

Gemini 2.5 Flash ($0.15/M input, $0.60/M output)

Pipeline	Total Cost	Cost per Document	vs Sidedoc
Sidedoc	$0.04	$0.003	1.0x
Pandoc	$0.04	$0.002	0.9x
Raw Text	$0.03	$0.002	0.7x
Raw OOXML	$2.05	$0.137	51x

Claude Sonnet 4 ($3/M input, $15/M output)

Pipeline	Total Cost	Cost per Document	vs Sidedoc
Sidedoc	$1.00	$0.07	1.0x
Pandoc	$0.91	$0.06	0.9x
Raw Text	$0.71	$0.05	0.7x
Raw OOXML	$41.63	$2.78	41x
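
The cost figures are consistent with multiplying the aggregate token totals by the listed per-million prices and dividing by the 15 documents. A minimal sketch of that arithmetic, with a hypothetical helper name `llm_cost`:

```python
def llm_cost(prompt_tokens: int, completion_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD for the given token counts and per-million-token prices."""
    return (prompt_tokens * input_price_per_m
            + completion_tokens * output_price_per_m) / 1_000_000


# Aggregate token totals from the benchmark tables, at Claude Sonnet 4 prices.
sidedoc = llm_cost(8_531, 65_228, 3.00, 15.00)
ooxml = llm_cost(12_997_779, 175_449, 3.00, 15.00)

DOCUMENTS = 15  # size of the test corpus

print(f"Sidedoc:   ${sidedoc:.2f} total, ${sidedoc / DOCUMENTS:.2f} per document")
print(f"Raw OOXML: ${ooxml:.2f} total, ${ooxml / DOCUMENTS:.2f} per document")
print(f"Ratio:     {ooxml / sidedoc:.0f}x")
```

Because the Raw OOXML totals include only 36 successful runs, its cost figures are a lower bound on what 45 completed runs would have cost.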

Format Fidelity

Format fidelity measures what each pipeline preserves on a round-trip: extract content, rebuild the document, and compare the result against the original at the XML level. Only the Sidedoc and Pandoc pipelines support document rebuild. The Raw Text pipeline discards the formatting needed to reconstruct a document, and the Raw OOXML pipeline is a baseline that does not implement rebuild.

Scoring Dimensions

Dimension	What It Measures
Structure	Heading levels at positions, paragraph/list/table counts
Formatting	Bold, italic, underline, font name, font size per run across all paragraphs
Tables	Row/col counts, merged cells, cell backgrounds, table styles
Hyperlinks	Link text + URL pair preservation
Track Changes	Insertion/deletion counts and author preservation

Scores are 0–100 per dimension. Dimensions not present in the original document (e.g., no tables) are excluded from the total. The total score is the mean of applicable dimensions.
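
The aggregation rule above can be sketched as follows. This is an illustration of the described rule, not the benchmark's scoring code; the function name, the use of `None` for a dimension that is not applicable, and the example scores are all assumptions.

```python
from statistics import mean
from typing import Mapping, Optional


def total_fidelity(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Mean of the applicable per-dimension scores (each 0-100).

    A value of None marks a dimension that is absent from the original
    document, so it is excluded from the total.
    """
    applicable = [s for s in scores.values() if s is not None]
    return mean(applicable) if applicable else None


# Made-up scores for a document with no tables and no tracked changes.
example = {
    "structure": 100.0,
    "formatting": 90.0,
    "tables": None,
    "hyperlinks": 80.0,
    "track_changes": None,
}
print(total_fidelity(example))  # mean of 100, 90 and 80
```

Excluding absent dimensions instead of scoring them as 0 or 100 keeps a plain-text document from being penalized or rewarded for features it never had.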

How to Run

# Run fidelity scoring
python -m benchmarks.run_benchmark --pipeline sidedoc --pipeline pandoc --corpus synthetic --fidelity

# Generate report with fidelity table
python -m benchmarks.generate_report benchmarks/results/benchmark-latest.json

Pipeline Comparison

Capability	Sidedoc	Pandoc	Raw Text	Raw OOXML
Extract content	Yes	Yes	Yes	Yes
Preserve formatting metadata	Yes	No	No	Yes
Rebuild document	Yes	Partial	No	No*
Lossless round-trip	Yes	No	No	No
Token efficient	Yes	Yes	Best	Worst
Reliable (0 errors)	Yes	Yes	Yes	No (20% failure rate)
Tables preserved	Yes	Partial	No	Yes
Track changes support	Yes	No	No	Yes

Methodology

Test Corpus

15 synthetic documents from tests/fixtures/, covering: simple text, formatted text, hyperlinks, images, lists, tables (simple, complex, formatted, merged), and track changes (simple, headings, lists, multi-author, paragraph)
Every document is processed identically by each pipeline

Pipelines

Pipeline	Description
Sidedoc	AI-native format — extracts to clean markdown + formatting metadata, enables lossless round-trip
Pandoc	Universal converter — `docx -> markdown` via pypandoc, loses most formatting on round-trip
Raw Text	Baseline — extracts paragraph text via python-docx, no formatting, no rebuild capability
Raw OOXML	Full XML content from the .docx archive (document.xml + styles.xml + numbering.xml + theme + rels) — what an LLM would need for format-preserving round-trip without an intermediate format

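*Applies to "Rebuild document" for Raw OOXML in the Pipeline Comparison table: OOXML theoretically supports reconstruction, but this pipeline is a baseline comparison tool only and does not implement rebuild.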
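
To make the Raw OOXML description concrete, here is a minimal sketch of pulling those XML parts out of a .docx archive with the standard library. The part paths are the conventional locations inside a .docx file; which parts the benchmark pipeline actually reads, and how it joins them, is an assumption, and `read_raw_ooxml` is a hypothetical name.

```python
import zipfile

# Conventional part names inside a .docx archive. The exact set read by the
# benchmark's Raw OOXML pipeline is assumed from its description.
OOXML_PARTS = [
    "word/document.xml",
    "word/styles.xml",
    "word/numbering.xml",
    "word/theme/theme1.xml",
    "word/_rels/document.xml.rels",
]


def read_raw_ooxml(docx_path) -> str:
    """Concatenate the XML parts present in the archive into one string."""
    chunks = []
    with zipfile.ZipFile(docx_path) as archive:
        names = set(archive.namelist())
        for part in OOXML_PARTS:
            if part in names:  # numbering and theme parts are optional
                xml = archive.read(part).decode("utf-8")
                chunks.append(f"<!-- {part} -->\n{xml}")
    return "\n".join(chunks)
```

Style and theme parts are included in full even for a one-paragraph document, which is one plausible reason the Raw OOXML token counts stay near 360,000 regardless of document size.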

Tasks

Task	Description	LLM Calls
`summarize`	Generate 3-5 bullet point summary	1
`edit_single`	Apply a single edit instruction ("Make the text more concise")	1
`edit_multiturn`	Apply 3 sequential edits (concise, add summary, fix grammar)	3
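
A rough sketch of how a multi-turn task can accumulate token counts over its three calls. It assumes each round sends one instruction plus the current document text and feeds the reply into the next round; the benchmark may structure its prompts differently. The `complete` callable is a hypothetical stand-in for the real LLM call, and the wording of the second and third instructions is assumed.

```python
from typing import Callable, List, Tuple

# (reply_text, prompt_tokens, completion_tokens): stand-in for an LLM response.
Completion = Tuple[str, int, int]


def run_multiturn(document: str, instructions: List[str],
                  complete: Callable[[str], Completion]) -> Tuple[str, int, int]:
    """Apply edits in sequence, feeding each round's output into the next."""
    prompt_total = completion_total = 0
    current = document
    for instruction in instructions:
        reply, p_tok, c_tok = complete(f"{instruction}\n\n{current}")
        current = reply
        prompt_total += p_tok
        completion_total += c_tok
    return current, prompt_total, completion_total


def fake_complete(prompt: str) -> Completion:
    """Offline stand-in: echoes the prompt and counts words, not real tokens."""
    return prompt, len(prompt.split()), len(prompt.split())


if __name__ == "__main__":
    edits = ["Make the text more concise", "Add a summary", "Fix grammar"]
    _, p, c = run_multiturn("Some document text.", edits, fake_complete)
    print(f"prompt tokens: {p}, completion tokens: {c}")
```

Under this assumption the document is sent once per round, which matches the multi-turn prompt totals in the results being roughly three times the single-edit totals.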

Token Counting

Content representation: `cl100k_base` tokenizer via tiktoken
LLM task tokens: Actual `prompt_tokens` and `completion_tokens` from API responses
Model: Gemini 2.5 Flash via LiteLLM (all 4 pipelines on the same model for fair comparison)
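
For the content-representation counts, a minimal sketch of counting tokens with tiktoken's `cl100k_base` encoding. The function name `count_tokens` and the optional `encode` parameter are illustrative; the parameter exists only so the function can be exercised without tiktoken installed.

```python
from typing import Callable, Optional, Sequence


def count_tokens(text: str,
                 encode: Optional[Callable[[str], Sequence]] = None) -> int:
    """Return the number of tokens `encode` produces for `text`.

    With no encoder given, fall back to tiktoken's cl100k_base encoding.
    """
    if encode is None:
        import tiktoken  # third-party: pip install tiktoken
        encode = tiktoken.get_encoding("cl100k_base").encode
    return len(encode(text))


if __name__ == "__main__":
    # Whitespace splitting as a crude offline stand-in for a real tokenizer.
    print(count_tokens("# Title\n\nSome **bold** text.", encode=str.split))
```

Calling `count_tokens(markdown_text)` with no second argument uses the real tokenizer, which needs tiktoken installed and may download the encoding on first use.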

Benchmark Date

March 2026. Full results in benchmarks/results/benchmark-latest.json.

Run It Yourself

For full setup instructions and troubleshooting, see benchmarks/README.md.

Prerequisites

Python 3.11+
Pandoc: `brew install pandoc` (macOS) or `sudo apt install pandoc` (Ubuntu)
API Key: `ANTHROPIC_API_KEY` or `GEMINI_API_KEY`, depending on the model

Installation

git clone https://github.com/jgardner04/sidedoc.git
cd sidedoc
pip install -r benchmarks/requirements.txt

Running Benchmarks

# Run all pipelines against synthetic corpus
for p in sidedoc pandoc raw_docx ooxml; do
  python -m benchmarks.run_benchmark --pipeline "$p" --corpus synthetic \
    --model gemini/gemini-2.5-flash \
    --output "benchmarks/results/benchmark-${p}.json"
done

# Generate report
python -m benchmarks.generate_report benchmarks/results/benchmark-latest.json

Filtering

# Single pipeline
python -m benchmarks.run_benchmark --pipeline sidedoc

# Single task
python -m benchmarks.run_benchmark --task summarize

# Combine filters
python -m benchmarks.run_benchmark --pipeline sidedoc --task summarize --corpus synthetic

# Use a different model
python -m benchmarks.run_benchmark --model claude-sonnet-4-20250514

Environment Variables

Variable	Required	Description
`ANTHROPIC_API_KEY`	For Claude	API key for Anthropic models
`GEMINI_API_KEY`	For Gemini	API key for Google Gemini models
`AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT`	No	Azure DI endpoint (for docint pipeline)
`AZURE_DOCUMENT_INTELLIGENCE_KEY`	No	Azure DI API key
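
A small pre-flight check can catch a missing key before a long benchmark run starts. The sketch below maps a model name to the variable it needs using a simple name heuristic; this is an assumption for illustration, not the benchmark tool's own logic, and both function names are hypothetical.

```python
import os
from typing import Mapping


def required_key_for(model: str) -> str:
    """Guess which API key variable a model name needs (heuristic only)."""
    name = model.lower()
    if "gemini" in name:
        return "GEMINI_API_KEY"
    if "claude" in name:
        return "ANTHROPIC_API_KEY"
    raise ValueError(f"No known API key variable for model: {model}")


def check_environment(model: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the key variable name, or raise if it is unset or empty."""
    key = required_key_for(model)
    if not env.get(key):
        raise RuntimeError(f"{key} must be set to run benchmarks with {model}")
    return key


if __name__ == "__main__":
    try:
        print("Found", check_environment("gemini/gemini-2.5-flash"))
    except RuntimeError as err:
        print(err)
```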