PDF stage 2.3: form XObjects #540
Merged
Parse a resource dictionary's /XObject table and execute `Do` on form XObjects: save state, concatenate the form /Matrix onto the CTM, run the form content against its own /Resources (falling back to the enclosing scope), then restore. Image XObjects are recognized but deferred to stage 4; /BBox clipping is deferred. Form content, /Matrix and nested /Resources are read eagerly at parse time so text extraction needs no parser handle.

The parser memoizes XObjects by reference (ObjectReference -> XObject*, registered before recursing), so a form shared across pages is parsed once and a cyclic form reference resolves to the existing element -- the in-memory graph mirrors the file rather than clipping cycles. A render-time active-set guard cuts cyclic invocation (forbidden by ISO 32000-1 8.10.1 but present in real files).

Add GraphicsState save()/restore()/concat_matrix(), reused by q/Q/cm.

Tests: cyclic /Resources represented faithfully via the cache, a form shared by two pages parsed once, and (page-text) invocation, /Matrix placement, state restoration, scoped resources, nested forms, image/unknown XObjects ignored, and a self-referential form terminating via the active-set guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d8fdbc0af
🤖 Generated with Claude Code
Implements stage 2.3 of the in-house PDF roadmap: form XObjects.
What it does
- `pdf_document_parser`, `pdf_document_element`: a resource dictionary's `/XObject` subdictionary is parsed into `Resources::x_object`. Each `/Subtype /Form` becomes an `XObject` element carrying its `/Matrix` (default identity), its eagerly decoded content stream, and its own parsed `/Resources` (`nullptr` ⇒ inherit the invoking scope). Image XObjects are recognized but deferred to stage 4; unknown subtypes are not executed.
- `pdf_page_text`: `Do` saves the graphics state, concatenates the form `/Matrix` onto the CTM, runs the form content against the form's resources (falling back to the enclosing scope), then restores, so text inside forms is placed correctly. Added `GraphicsState::save()`/`restore()`/`concat_matrix()`, reused by `q`/`Q`/`cm`.

Cycles & sharing
The parser memoizes XObjects by reference (`ObjectReference → XObject*`, registered before recursing into `/Resources`), so:
- a form shared across pages is parsed once;
- a cyclic form reference resolves to the existing element, so the in-memory graph mirrors the file rather than clipping cycles.

A render-time active-set guard cuts cyclic invocation. The spec forbids form cycles (ISO 32000-1 §8.10.1), but real files contain them.
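For readers unfamiliar with the register-before-recursing pattern, here is a minimal sketch of how such a cache can make a cyclic reference resolve to the element already under construction. It is not the code from this PR: `Parser`, `File`, `parse_x_object`, `parse_count` and the simplified `ObjectReference`/`XObject` definitions are hypothetical stand-ins, and the "file" is a toy map rather than a real PDF.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real ObjectReference / XObject types differ.
using ObjectReference = std::pair<int, int>; // (object number, generation)

struct XObject {
  std::map<std::string, XObject *> resources; // name -> nested XObject
};

// Toy "file": for each object, the XObjects named in its /Resources.
using File =
    std::map<ObjectReference, std::map<std::string, ObjectReference>>;

class Parser {
public:
  explicit Parser(const File &file) : m_file(file) {}

  XObject *parse_x_object(const ObjectReference &ref) {
    if (auto it = m_cache.find(ref); it != m_cache.end()) {
      return it->second; // shared form or back-edge: reuse the element
    }
    m_storage.push_back(std::make_unique<XObject>());
    XObject *element = m_storage.back().get();
    m_cache[ref] = element; // register BEFORE recursing into /Resources
    ++parse_count;
    for (const auto &[name, child] : m_file.at(ref)) {
      element->resources[name] = parse_x_object(child);
    }
    return element;
  }

  int parse_count = 0; // how many distinct objects were actually parsed

private:
  const File &m_file;
  std::map<ObjectReference, XObject *> m_cache;
  std::vector<std::unique_ptr<XObject>> m_storage; // owns the elements
};
```

With two toy objects that name each other in their resources, parsing either one should visit each object once, and following the back-edge should lead to the same pointer that the parse started from. Registering only after the recursion returns would instead recurse without end on such input.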
Deferred
`/BBox` clipping (text-only for now). The form machinery is reused by stage 4 (tiling patterns) and stage 5 (annotation appearances).

Tests
- Cyclic `/Resources` represented faithfully via the cache (back-edge points at the same element); a form shared by two pages parsed once.
- (page-text) invocation, `/Matrix` placement, state restoration, scoped resources, nested forms, image/unknown XObjects ignored, and a self-referential form terminating via the active-set guard.

All PDF unit tests + odr-engine PDF end-to-end tests pass locally.
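To make the `Do` sequence and the active-set guard concrete, the following is a rough sketch under simplifying assumptions, not the implementation in this PR. Only the method names `save()`, `restore()` and `concat_matrix()` come from the description above. `Matrix`, `multiply`, `Form`, `invoke` and the `trace` vector are hypothetical, the graphics state is reduced to the CTM, and resource scoping is left out.

```cpp
#include <array>
#include <set>
#include <vector>

// Hypothetical, simplified types for illustration only.
using Matrix = std::array<double, 6>; // PDF order: [a b c d e f]

// Row-vector convention used by PDF: result = m x n (m is applied first).
Matrix multiply(const Matrix &m, const Matrix &n) {
  return {m[0] * n[0] + m[1] * n[2],
          m[0] * n[1] + m[1] * n[3],
          m[2] * n[0] + m[3] * n[2],
          m[2] * n[1] + m[3] * n[3],
          m[4] * n[0] + m[5] * n[2] + n[4],
          m[4] * n[1] + m[5] * n[3] + n[5]};
}

class GraphicsState {
public:
  void save() { m_stack.push_back(m_ctm); }
  void restore() {
    if (!m_stack.empty()) { // tolerate unbalanced restores
      m_ctm = m_stack.back();
      m_stack.pop_back();
    }
  }
  // Same rule as the `cm` operator: CTM' = m x CTM.
  void concat_matrix(const Matrix &m) { m_ctm = multiply(m, m_ctm); }
  const Matrix &ctm() const { return m_ctm; }

private:
  Matrix m_ctm{1.0, 0.0, 0.0, 1.0, 0.0, 0.0};
  std::vector<Matrix> m_stack;
};

struct Form {
  Matrix matrix{1.0, 0.0, 0.0, 1.0, 0.0, 0.0}; // /Matrix, default identity
  std::vector<const Form *> invokes; // stands in for `Do` ops in the content
};

// Runs one form; `trace` records the CTM in effect while its content runs.
void invoke(const Form &form, GraphicsState &state,
            std::set<const Form *> &active, std::vector<Matrix> &trace) {
  if (!active.insert(&form).second) {
    return; // form is already being executed: cut the cycle
  }
  state.save();
  state.concat_matrix(form.matrix);
  trace.push_back(state.ctm());
  for (const Form *child : form.invokes) {
    invoke(*child, state, active, trace);
  }
  state.restore();
  active.erase(&form);
}
```

In this sketch a form whose content invokes itself runs exactly once, and the CTM is back to its previous value afterwards. The guard tracks only forms currently on the invocation stack, so invoking the same form twice in sequence remains allowed.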