Benchmarks
Measured results comparing document processing pipelines for AI workloads.
Results Summary
We benchmarked four pipelines across 15 documents and 3 LLM tasks using Gemini 2.5 Flash. All pipelines ran against the same model for an apples-to-apples comparison.
Key Findings
- Raw OOXML uses 1,524x as many prompt tokens as Sidedoc (12,997,779 vs 8,531 in total), measured across all successful LLM task runs
- OOXML is not just expensive, it is unreliable: 9 of 45 OOXML runs failed (rate limits or other errors), while every other pipeline completed 45/45
- Sidedoc uses about 12% fewer prompt tokens than Pandoc (8,531 vs 9,726 in total) while preserving formatting that Pandoc loses
- At Claude Sonnet 4 pricing, OOXML costs $2.78/doc vs Sidedoc's $0.07/doc — a 41x cost difference
Why This Matters
Every document processing pipeline faces a fundamental tradeoff: token efficiency vs format preservation. Raw text extraction is cheapest but loses all formatting. Raw OOXML preserves everything but is absurdly expensive in tokens. Sidedoc resolves this tradeoff — clean markdown for the LLM, formatting metadata stored separately, and lossless round-trip reconstruction.
LLM Task Token Usage
Actual API token counts from Gemini 2.5 Flash across 3 tasks and 15 documents. All numbers are measured, not estimated.
Summarization Task
| Pipeline | Avg Prompt Tokens | Avg Completion Tokens | Avg Total | Successful Runs |
| --- | --- | --- | --- | --- |
| Sidedoc | 106 | 813 | 919 | 15/15 |
| Pandoc | 122 | 790 | 912 | 15/15 |
| Raw Text | 58 | 605 | 663 | 15/15 |
| Raw OOXML | 359,706 | 1,020 | 360,726 | 12/15 |
Single Edit Task
| Pipeline | Avg Prompt Tokens | Avg Completion Tokens | Avg Total | Successful Runs |
| --- | --- | --- | --- | --- |
| Sidedoc | 115 | 1,267 | 1,382 | 15/15 |
| Pandoc | 130 | 1,100 | 1,230 | 15/15 |
| Raw Text | 67 | 940 | 1,007 | 15/15 |
| Raw OOXML | 359,865 | 4,092 | 363,957 | 13/15 |
Multi-Turn Edit Task (3 rounds)
| Pipeline | Avg Prompt Tokens | Avg Completion Tokens | Avg Total | Successful Runs |
| --- | --- | --- | --- | --- |
| Sidedoc | 348 | 2,268 | 2,616 | 15/15 |
| Pandoc | 397 | 2,012 | 2,409 | 15/15 |
| Raw Text | 208 | 1,534 | 1,742 | 15/15 |
| Raw OOXML | 363,915 | 10,001 | 373,916 | 11/15 |
Aggregate Token Usage
| Pipeline | Runs | Total Prompt | Total Completion | Grand Total | vs Sidedoc (prompt) |
| --- | --- | --- | --- | --- | --- |
| Sidedoc | 45/45 | 8,531 | 65,228 | 73,759 | 1.0x |
| Pandoc | 45/45 | 9,726 | 58,527 | 68,253 | 1.1x |
| Raw Text | 45/45 | 4,994 | 46,194 | 51,188 | 0.6x |
| Raw OOXML | 36/45 | 12,997,779 | 175,449 | 13,173,228 | 1,524x |
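The "vs Sidedoc (prompt)" column is each pipeline's total prompt tokens divided by Sidedoc's. A minimal sketch of that arithmetic, using the totals from the table above (the function name `ratio_vs_baseline` is illustrative, not part of the benchmark code):

```python
# Total prompt tokens per pipeline, copied from the Aggregate Token Usage table.
TOTAL_PROMPT_TOKENS = {
    "Sidedoc": 8_531,
    "Pandoc": 9_726,
    "Raw Text": 4_994,
    "Raw OOXML": 12_997_779,
}


def ratio_vs_baseline(totals, baseline="Sidedoc"):
    """Return each pipeline's total divided by the baseline pipeline's total."""
    base = totals[baseline]
    return {name: tokens / base for name, tokens in totals.items()}


ratios = ratio_vs_baseline(TOTAL_PROMPT_TOKENS)
for name, r in ratios.items():
    # Small ratios keep one decimal; large ones are shown as whole numbers.
    label = f"{r:.1f}x" if r < 10 else f"{r:,.0f}x"
    print(f"{name:<10} {label}")
```

Note that the Raw OOXML total covers only its 36 successful runs, while the other pipelines cover 45, so this ratio of totals understates the per-run gap.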
Content Representation
How many tokens does each pipeline need to represent document content? Measured with the cl100k_base tokenizer before sending to the LLM.
| Pipeline | Avg Tokens/Doc | Total (15 docs) | vs Sidedoc |
| --- | --- | --- | --- |
| Sidedoc | 74 | 1,117 | 1.0x |
| Pandoc | 89 | 1,336 | 1.2x |
| Raw Text | 34 | 505 | 0.5x |
| Raw OOXML | 325,715 | 4,885,730 | 4,374x |
Raw text extraction uses fewer tokens but cannot reconstruct the document — all formatting, structure, and metadata are lost.
Cost Analysis
At Gemini 2.5 Flash pricing (the benchmarked model):

| Pipeline | Total Cost | Cost per Document | vs Sidedoc |
| --- | --- | --- | --- |
| Sidedoc | $0.04 | $0.003 | 1.0x |
| Pandoc | $0.04 | $0.002 | 0.9x |
| Raw Text | $0.03 | $0.002 | 0.7x |
| Raw OOXML | $2.05 | $0.137 | 51x |

At Claude Sonnet 4 pricing:

| Pipeline | Total Cost | Cost per Document | vs Sidedoc |
| --- | --- | --- | --- |
| Sidedoc | $1.00 | $0.07 | 1.0x |
| Pandoc | $0.91 | $0.06 | 0.9x |
| Raw Text | $0.71 | $0.05 | 0.7x |
| Raw OOXML | $41.63 | $2.78 | 41x |
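The cost figures above can be reproduced, to the cent, from the aggregate token totals. The sketch below assumes per-million-token prices of $0.15 input / $0.60 output for the first table and $3.00 input / $15.00 output for the Claude Sonnet 4 table. These prices are not stated in this document; they are assumptions chosen because they match the published totals.

```python
# (prompt_tokens, completion_tokens) per pipeline, from the Aggregate Token Usage table.
TOKENS = {
    "Sidedoc": (8_531, 65_228),
    "Pandoc": (9_726, 58_527),
    "Raw Text": (4_994, 46_194),
    "Raw OOXML": (12_997_779, 175_449),
}

# ASSUMED prices in USD per million tokens (input, output). Not given in the
# benchmark text; picked because they reproduce the published cost totals.
PRICES = {
    "gemini-2.5-flash": (0.15, 0.60),
    "claude-sonnet-4": (3.00, 15.00),
}

NUM_DOCS = 15


def total_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Cost in USD given token counts and per-million-token prices."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000


for model, (inp, out) in PRICES.items():
    baseline = total_cost(*TOKENS["Sidedoc"], inp, out)
    for name, (prompt, completion) in TOKENS.items():
        cost = total_cost(prompt, completion, inp, out)
        print(f"{model:<17} {name:<10} ${cost:6.2f} total  "
              f"${cost / NUM_DOCS:.3f}/doc  {cost / baseline:.1f}x")
```

Because completion tokens are priced higher than prompt tokens and are similar across pipelines, the cost gap (41x to 51x) is much smaller than the prompt-token gap.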
Format Fidelity
Format fidelity measures what each pipeline preserves on a round-trip: extract content, rebuild the document, and compare the result against the original at the XML level. Only Sidedoc and Pandoc support document rebuild; Raw Text and Raw OOXML cannot reconstruct documents from their extracted content.
Scoring Dimensions
| Dimension | What It Measures |
| --- | --- |
| Structure | Heading levels at positions, paragraph/list/table counts |
| Formatting | Bold, italic, underline, font name, font size per run across all paragraphs |
| Tables | Row/col counts, merged cells, cell backgrounds, table styles |
| Hyperlinks | Link text + URL pair preservation |
| Track Changes | Insertion/deletion counts and author preservation |
Scores are 0–100 per dimension. Dimensions not present in the original document (e.g., no tables) are excluded from the total. The total score is the mean of applicable dimensions.
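The aggregation rule described above (exclude dimensions that do not apply, then take the mean) might look like the following sketch. The function name, the use of `None` for a non-applicable dimension, and the example scores are all illustrative assumptions, not the benchmark's actual code or results.

```python
from statistics import mean
from typing import Optional


def total_fidelity(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Mean of the applicable (non-None) dimension scores, each in 0-100.

    Returns None when no dimension applies to the document.
    """
    applicable = [s for s in scores.values() if s is not None]
    return mean(applicable) if applicable else None


# Hypothetical document with no tables and no tracked changes.
example = {
    "structure": 100.0,
    "formatting": 90.0,
    "tables": None,         # not present in the original, so excluded
    "hyperlinks": 80.0,
    "track_changes": None,  # not present in the original, so excluded
}
print(total_fidelity(example))  # 90.0
```

Excluding absent dimensions rather than scoring them as 0 or 100 keeps a plain-text document from being penalized or rewarded for features it never had.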
How to Run
# Run fidelity scoring
python -m benchmarks.run_benchmark --pipeline sidedoc --pipeline pandoc --corpus synthetic --fidelity
# Generate report with fidelity table
python -m benchmarks.generate_report benchmarks/results/benchmark-latest.json
Pipeline Comparison
| Capability | Sidedoc | Pandoc | Raw Text | Raw OOXML |
| --- | --- | --- | --- | --- |
| Extract content | Yes | Yes | Yes | Yes |
| Preserve formatting metadata | Yes | No | No | Yes |
| Rebuild document | Yes | Partial | No | No* |
| Lossless round-trip | Yes | No | No | No |
| Token efficient | Yes | Yes | Best | Worst |
| Reliable (0 errors) | Yes | Yes | Yes | No (20% failure rate) |
| Tables preserved | Yes | Partial | No | Yes |
| Track changes support | Yes | No | No | Yes |
Methodology
Test Corpus
- 15 synthetic documents from tests/fixtures/, covering: simple text, formatted text, hyperlinks, images, lists, tables (simple, complex, formatted, merged), and track changes (simple, headings, lists, multi-author, paragraph)
- All documents processed through each pipeline identically
Pipelines
| Pipeline | Description |
| --- | --- |
| Sidedoc | AI-native format: extracts to clean markdown + formatting metadata, enables lossless round-trip |
| Pandoc | Universal converter: docx -> markdown via pypandoc, loses most formatting on round-trip |
| Raw Text | Baseline: extracts paragraph text via python-docx, no formatting, no rebuild capability |
| Raw OOXML | Full XML content from the .docx archive (document.xml + styles.xml + numbering.xml + theme + rels), which is what an LLM would need for format-preserving round-trip without an intermediate format |
*OOXML theoretically supports reconstruction, but this pipeline is a baseline comparison tool only.
Tasks
| Task | Description | LLM Calls |
| --- | --- | --- |
| summarize | Generate 3-5 bullet point summary | 1 |
| edit_single | Apply a single edit instruction ("Make the text more concise") | 1 |
| edit_multiturn | Apply 3 sequential edits (concise, add summary, fix grammar) | 3 |
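To make the edit_multiturn flow concrete, here is a minimal sketch in which each round sends the current document plus one instruction and feeds the result into the next round. The `complete` callable stands in for a real LLM call, and the wording of the second and third instructions is hypothetical; the source only names the three edits.

```python
from typing import Callable

# Only the first instruction's wording appears in the task table;
# the other two are hypothetical phrasings of "add summary" and "fix grammar".
EDIT_INSTRUCTIONS = [
    "Make the text more concise",
    "Add a one-paragraph summary at the top",
    "Fix any grammar errors",
]


def run_multiturn(
    document: str,
    complete: Callable[[str], str],
    instructions: list[str] = EDIT_INSTRUCTIONS,
) -> tuple[str, int]:
    """Apply each instruction in order; return the final text and the call count."""
    calls = 0
    for instruction in instructions:
        prompt = f"{instruction}\n\n{document}"
        document = complete(prompt)  # output of one round is input to the next
        calls += 1
    return document, calls


def fake_llm(prompt: str) -> str:
    """Stub LLM for illustration: echoes the body and tags it with the instruction."""
    instruction, _, body = prompt.partition("\n\n")
    return f"{body} [{instruction}]"


edited, calls = run_multiturn("Hello world.", fake_llm)
print(calls, edited)
```

Because the document is re-sent on every round, prompt tokens for this task are roughly three times those of a single edit, which is consistent with the Sidedoc averages (348 vs 115).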
Token Counting
- Content representation: cl100k_base tokenizer via tiktoken
- LLM task tokens: Actual prompt_tokens and completion_tokens from API responses
- Model: Gemini 2.5 Flash via LiteLLM (all 4 pipelines on the same model for fair comparison)
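The per-task tables report averages over successful runs only. A sketch of that aggregation is below; the `Run` record and its field layout are assumptions for illustration (only the `prompt_tokens` and `completion_tokens` names come from the text above), and the sample values are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One benchmark run; a hypothetical record shape, not the real results schema."""
    prompt_tokens: int
    completion_tokens: int
    ok: bool = True


def summarize_runs(runs: list[Run]) -> dict:
    """Average token usage over successful runs and report the success count."""
    good = [r for r in runs if r.ok]
    if not good:
        return {"avg_prompt": None, "avg_completion": None,
                "avg_total": None, "successful": f"0/{len(runs)}"}
    avg_prompt = round(mean(r.prompt_tokens for r in good))
    avg_completion = round(mean(r.completion_tokens for r in good))
    return {
        "avg_prompt": avg_prompt,
        "avg_completion": avg_completion,
        "avg_total": avg_prompt + avg_completion,
        "successful": f"{len(good)}/{len(runs)}",
    }


# Illustrative values only; a failed run contributes no tokens to the averages.
sample = [Run(100, 800), Run(112, 826), Run(0, 0, ok=False)]
print(summarize_runs(sample))
```

Excluding failed runs keeps the averages meaningful, but it also means a pipeline with many failures (such as Raw OOXML here) is averaged over fewer documents than the others.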
Benchmark Date
March 2026. Full results in benchmarks/results/benchmark-latest.json.
Run It Yourself
For full setup instructions and troubleshooting, see benchmarks/README.md.
Prerequisites
- Python 3.11+
- Pandoc: brew install pandoc (macOS) or sudo apt install pandoc (Ubuntu)
- API Key: ANTHROPIC_API_KEY or GEMINI_API_KEY depending on model
Installation
git clone https://github.com/jgardner04/sidedoc.git
cd sidedoc
pip install -r benchmarks/requirements.txt
Running Benchmarks
# Run all pipelines against synthetic corpus
for p in sidedoc pandoc raw_docx ooxml; do
  python -m benchmarks.run_benchmark --pipeline "$p" --corpus synthetic \
    --model gemini/gemini-2.5-flash \
    --output "benchmarks/results/benchmark-${p}.json"
done
# Generate report
python -m benchmarks.generate_report benchmarks/results/benchmark-latest.json
Filtering
# Single pipeline
python -m benchmarks.run_benchmark --pipeline sidedoc
# Single task
python -m benchmarks.run_benchmark --task summarize
# Combine filters
python -m benchmarks.run_benchmark --pipeline sidedoc --task summarize --corpus synthetic
# Use a different model
python -m benchmarks.run_benchmark --model claude-sonnet-4-20250514
Environment Variables
| Variable | Required | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | For Claude | API key for Anthropic models |
| GEMINI_API_KEY | For Gemini | API key for Google Gemini models |
| AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT | No | Azure DI endpoint (for docint pipeline) |
| AZURE_DOCUMENT_INTELLIGENCE_KEY | No | Azure DI API key |