PDF stage 2.3: form XObjects #540

Merged
andiwand merged 7 commits into main from pdf-stage-2.3-form-xobjects
Jun 17, 2026

andiwand (Member) commented Jun 16, 2026

🤖 Generated with Claude Code

Implements stage 2.3 of the in-house PDF roadmap: form XObjects.

What it does

  • Parsing (pdf_document_parser, pdf_document_element): a resource dictionary's /XObject subdictionary is parsed into Resources::x_object. Each /Subtype /Form becomes an XObject element carrying its /Matrix (default identity), its eagerly decoded content stream, and its own parsed /Resources (nullptr ⇒ inherit the invoking scope). Image XObjects are recognized but deferred to stage 4; unknown subtypes are treated as non-executable.
  • Execution (pdf_page_text): Do saves the graphics state, concatenates the form /Matrix onto the CTM, runs the form content against the form's resources (falling back to the enclosing scope), then restores — so text inside forms is placed correctly. Added GraphicsState::save()/restore()/concat_matrix(), reused by q/Q/cm.
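
The Do sequence described above (save, concatenate /Matrix, run against the form's resources or the enclosing ones, restore) can be sketched as follows. This is a minimal illustration, not the code in this PR: `Matrix`, `Resources`, `Form`, `invoke_form` and the callback are hypothetical stand-ins, and the sketch saves only the CTM where a real graphics state carries much more.

```cpp
#include <array>
#include <vector>

// Hypothetical stand-ins for illustration; not the PR's actual types.
// A PDF matrix [a b c d e f] is an affine transform on row vectors.
using Matrix = std::array<double, 6>;

constexpr Matrix kIdentity{1.0, 0.0, 0.0, 1.0, 0.0, 0.0};

// Returns m x n: apply m first, then n (PDF row-vector convention).
Matrix multiply(const Matrix &m, const Matrix &n) {
  return {m[0] * n[0] + m[1] * n[2],
          m[0] * n[1] + m[1] * n[3],
          m[2] * n[0] + m[3] * n[2],
          m[2] * n[1] + m[3] * n[3],
          m[4] * n[0] + m[5] * n[2] + n[4],
          m[4] * n[1] + m[5] * n[3] + n[5]};
}

struct GraphicsState {
  Matrix ctm = kIdentity;
  std::vector<Matrix> stack; // this sketch saves only the CTM

  void save() { stack.push_back(ctm); }

  void restore() {
    if (stack.empty()) {
      return; // tolerate an unbalanced restore
    }
    ctm = stack.back();
    stack.pop_back();
  }

  // Same rule as the cm operator: new CTM = m x old CTM.
  void concat_matrix(const Matrix &m) { ctm = multiply(m, ctm); }
};

struct Resources {};

struct Form {
  Matrix matrix = kIdentity;            // /Matrix, identity by default
  const Resources *resources = nullptr; // nullptr: inherit the invoking scope
};

// Do on a form XObject: save, concatenate /Matrix, run the content, restore.
template <typename RunContent>
void invoke_form(GraphicsState &gs, const Form &form,
                 const Resources &enclosing, RunContent run_content) {
  const Resources &scope =
      form.resources != nullptr ? *form.resources : enclosing;
  gs.save();
  gs.concat_matrix(form.matrix);
  run_content(gs, scope); // stands in for interpreting the content stream
  gs.restore();
}
```

With a page CTM that scales by 2 and a form /Matrix that translates by (10, 20), the content callback sees a translation of (20, 40), and the page CTM is unchanged once `invoke_form` returns.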

Cycles & sharing

The parser memoizes XObjects by reference (ObjectReference → XObject*, registered before recursing into /Resources), so:

  • a form shared across pages is parsed once, and
  • a cyclic form reference resolves to the existing element — the in-memory graph mirrors the file (cycles included) rather than clipping them.
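
Why registering before recursing matters can be seen in a small sketch. Everything here is an assumption made for illustration: `ObjectReference` is reduced to an `int`, `RawForm` models only the references found in a form's /Resources /XObject table, and `Parser` is not the PR's pdf_document_parser.

```cpp
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical simplifications; the real types are richer.
using ObjectReference = int;

// What the "file" stores for a form: the XObjects its /Resources refer to.
struct RawForm {
  std::vector<ObjectReference> children;
};

// The parsed in-memory element.
struct XObject {
  std::vector<XObject *> children;
};

class Parser {
public:
  explicit Parser(std::map<ObjectReference, RawForm> file)
      : m_file(std::move(file)) {}

  XObject *parse_x_object(ObjectReference ref) {
    auto it = m_cache.find(ref);
    if (it != m_cache.end()) {
      return it->second; // shared form or back-edge: reuse the element
    }

    std::unique_ptr<XObject> owned(new XObject());
    XObject *element = owned.get();
    m_elements.push_back(std::move(owned));

    // Register BEFORE recursing, so a cycle finds this entry instead of
    // parsing the same object again without end.
    m_cache[ref] = element;
    ++m_parse_count;

    for (ObjectReference child : m_file.at(ref).children) {
      element->children.push_back(parse_x_object(child));
    }
    return element;
  }

  int parse_count() const { return m_parse_count; }

private:
  std::map<ObjectReference, RawForm> m_file;
  std::map<ObjectReference, XObject *> m_cache;
  std::vector<std::unique_ptr<XObject>> m_elements; // owns the elements
  int m_parse_count = 0;
};
```

If the cache entry were written only after the loop, two forms that reference each other would recurse until the stack overflowed. Writing it first turns the second visit into a lookup, and the back-edge ends up pointing at the element that is still being filled in.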

A render-time active-set guard cuts cyclic invocation. The spec forbids form cycles (ISO 32000-1 §8.10.1), but real files contain them.
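
A minimal sketch of such an active-set guard, assuming a hypothetical `TextExtractor` in which each form carries some text and a list of forms it invokes (the real content-stream interpreter is considerably more involved):

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model: a form emits some text and invokes other forms.
struct XObject {
  std::string text;
  std::vector<const XObject *> invoked; // may point back at itself
};

class TextExtractor {
public:
  void invoke(const XObject &form) {
    // insert() reports false if the form is already executing further up
    // the call chain; skip the invocation to cut the cycle.
    if (!m_active.insert(&form).second) {
      return;
    }
    m_output += form.text;
    for (const XObject *child : form.invoked) {
      invoke(*child);
    }
    m_active.erase(&form); // done: later, non-nested invocations are allowed
  }

  const std::string &output() const { return m_output; }

private:
  std::set<const XObject *> m_active; // forms currently being executed
  std::string m_output;
};
```

The set tracks only forms that are currently on the call chain, so a form used twice in sequence still runs twice; only re-entry while it is already running is suppressed. A production version would likely release the entry through an RAII guard so that it is also removed when an exception propagates.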

Deferred

/BBox clipping is deferred (extraction is text-only for now). The form machinery will be reused by stage 4 (tiling patterns) and stage 5 (annotation appearances).

Tests

  • Parser: cyclic /Resources represented faithfully via the cache (back-edge points at the same element); a form shared by two pages parsed once.
  • Page-text: invocation, /Matrix placement, state restoration, scoped resources, nested forms, image/unknown XObjects ignored, and a self-referential form terminating via the active-set guard.

All PDF unit tests + odr-engine PDF end-to-end tests pass locally.

andiwand and others added 3 commits June 16, 2026 13:44
Parse a resource dictionary's /XObject table and execute `Do` on form
XObjects: save state, concatenate the form /Matrix onto the CTM, run the
form content against its own /Resources (falling back to the enclosing
scope), then restore. Image XObjects are recognized but deferred to
stage 4; /BBox clipping is deferred.

Form content, /Matrix and nested /Resources are read eagerly at parse
time so text extraction needs no parser handle. The parser memoizes
XObjects by reference (ObjectReference -> XObject*, registered before
recursing), so a form shared across pages is parsed once and a cyclic
form reference resolves to the existing element -- the in-memory graph
mirrors the file rather than clipping cycles. A render-time active-set
guard cuts cyclic invocation (forbidden by ISO 32000-1 8.10.1 but
present in real files).

Add GraphicsState save()/restore()/concat_matrix(), reused by q/Q/cm.

Tests: cyclic /Resources represented faithfully via the cache, a form
shared by two pages parsed once, and (page-text) invocation, /Matrix
placement, state restoration, scoped resources, nested forms,
image/unknown XObjects ignored, and a self-referential form terminating
via the active-set guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand marked this pull request as ready for review June 16, 2026 20:17

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d8fdbc0af

Comment thread src/odr/internal/pdf/pdf_document_parser.cpp Outdated
Comment thread src/odr/internal/pdf/pdf_document_parser.cpp
@andiwand andiwand enabled auto-merge (squash) June 17, 2026 17:12
@andiwand andiwand merged commit d278f38 into main Jun 17, 2026
11 checks passed
@andiwand andiwand deleted the pdf-stage-2.3-form-xobjects branch June 17, 2026 17:35