Skip to content

Report HTML document-loader failures clearly#913

Open
dahlia wants to merge 4 commits into
fedify-dev:mainfrom
dahlia:bugfix/html-document-loader
Open

Report HTML document-loader failures clearly#913
dahlia wants to merge 4 commits into
fedify-dev:mainfrom
dahlia:bugfix/html-document-loader

Conversation

@dahlia

@dahlia dahlia commented Jun 30, 2026

Copy link
Copy Markdown
Member

Fedify already treats HTML as a possible discovery page, not as an immediate failure. That is useful for servers that expose an ActivityPub representation through an alternate link, so this PR keeps that behavior. The problem is the fallback after discovery fails: once the loader has seen an HTML/XHTML response and found no usable ActivityPub alternate, passing the same body to the JSON parser turns a remote HTML page into a generic SyntaxError.

This moves the failure to the point where the document loader still has the right context. In packages/vocab-runtime/src/docloader.ts, the HTML/XHTML branch now clones the response before reading the body, keeps the existing header and in-document alternate discovery, then throws FetchError if no distinct ActivityPub or JSON-LD alternate can be found. Oversized HTML follows the same typed error path because Fedify cannot scan it safely. In both cases the error keeps the document URL, the response, and the Content-Type in the message.

The tests in packages/vocab-runtime/src/docloader.test.ts cover a real HTML page with no ActivityPub alternate and assert that the error is a FetchError with the original response metadata. The ReDoS regression test still checks that hostile HTML completes quickly, but now expects the same document-loader error instead of relying on the JSON parser. I also narrowed the malformed Link-header test to a JSON-LD response, so it continues to cover Link parser recovery without depending on the old HTML fallback.

CHANGES.md records this under @fedify/vocab-runtime for the next minor release.

Fixes #912.

Reject HTML and XHTML document-loader responses that do not advertise an
ActivityPub alternate document before they reach the JSON parser.  This
surfaces remote HTML pages as FetchError failures with response context
instead of generic SyntaxError crashes.

Add regression coverage for HTML responses without ActivityPub alternates
and keep the ReDoS regression test aligned with the new error path.

Fixes fedify-dev#912

Assisted-by: Codex:gpt-5.5
@dahlia dahlia added this to the Fedify 2.4 milestone Jun 30, 2026
@dahlia dahlia self-assigned this Jun 30, 2026
@dahlia dahlia requested a review from 2chanhaeng as a code owner June 30, 2026 23:55
@dahlia dahlia added the component/vocab Activity Vocabulary related label Jun 30, 2026
@netlify

netlify Bot commented Jun 30, 2026

Copy link
Copy Markdown

Deploy Preview for fedify-json-schema canceled.

Name Link
🔨 Latest commit 359f270
🔍 Latest deploy log https://app.netlify.com/projects/fedify-json-schema/deploys/6a44e96b48f70a0008ab3052

@dahlia

dahlia commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies getDocumentLoader() in @fedify/vocab-runtime to throw a FetchError when HTML/XHTML responses are too large or lack an ActivityPub alternate link, rather than attempting to parse the HTML as JSON. This improves error handling by surfacing document loading failures with descriptive messages instead of generic JSON parser crashes. Tests have been updated to assert this new behavior. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c6750f718

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/vocab-runtime/src/docloader.ts Outdated
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9871ceac-28f2-47f0-b976-5b6b514e8624

📥 Commits

Reviewing files that changed from the base of the PR and between c4db929 and 359f270.

📒 Files selected for processing (2)
  • packages/vocab-runtime/src/docloader.test.ts
  • packages/vocab-runtime/src/docloader.ts
💤 Files with no reviewable changes (1)
  • packages/vocab-runtime/src/docloader.test.ts

📝 Walkthrough

Walkthrough

getRemoteDocument() in @fedify/vocab-runtime now bounds HTML/XHTML reads and throws FetchError for oversized responses or HTML without an ActivityPub alternate link. Tests and the changelog were updated to match the new failure mode.

Changes

HTML document loader error handling

Layer / File(s) Summary
Bounded HTML error path
packages/vocab-runtime/src/docloader.ts
HTML/XHTML responses are read through a bounded helper, oversized responses fail with FetchError, alternate-link scanning uses the bounded text, and JSON fallback is now guarded.
FetchError coverage
packages/vocab-runtime/src/docloader.test.ts
Adjusts mocks and adds coverage for missing alternates, oversized HTML, canceled bodies, reader fallback, and the ReDoS regression.
Changelog note
CHANGES.md
Documents the new getDocumentLoader() behavior for HTML/XHTML responses without an ActivityPub alternate link.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Suggested reviewers: 2chanhaeng, sij411

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: improving HTML document-loader failure reporting.
Description check ✅ Passed The description is directly about the docloader HTML fallback error handling changes and related tests.
Linked Issues check ✅ Passed The changes satisfy #912 by replacing JSON parse crashes with typed FetchErrors and preserving response context.
Out of Scope Changes check ✅ Passed The test updates and changelog note are aligned with the issue and no unrelated changes stand out.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/vocab-runtime/src/docloader.ts`:
- Around line 205-217: The HTML size guard in docloader.ts is applied too late
because alternate-link scanning in the response handling path still buffers the
full body via response.text() after cloning the stream. Update the document
fetch logic in the relevant docloader flow (around the same response-processing
code used for the FetchError path) so MAX_HTML_SIZE is enforced before
buffering, either by short-circuiting on Content-Length or by streaming/reading
incrementally with an early stop, and keep the existing logger.warn and
FetchError behavior intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9dc5289d-6a6e-423f-a8da-b8aae6a2bb84

📥 Commits

Reviewing files that changed from the base of the PR and between e2224ef and 4c6750f.

📒 Files selected for processing (3)
  • CHANGES.md
  • packages/vocab-runtime/src/docloader.test.ts
  • packages/vocab-runtime/src/docloader.ts

Comment thread packages/vocab-runtime/src/docloader.ts Outdated
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.17647% with 6 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/vocab-runtime/src/docloader.ts 91.17% 3 Missing and 3 partials ⚠️
Files with missing lines Coverage Δ
packages/vocab-runtime/src/docloader.ts 93.93% <91.17%> (+1.14%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Keep the HTML document-loader fallback narrow by parsing responses that
look like JSON even when they are served with an HTML Content-Type.  This
preserves compatibility with misconfigured ActivityPub endpoints while
still reporting real HTML pages without alternates as FetchError failures.

Read HTML responses through a bounded stream instead of buffering the full
body before checking the scan limit.  Oversized responses now short-circuit
from Content-Length when available, or while reading when the stream crosses
the limit.

fedify-dev#913 (comment)
fedify-dev#913 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b00ea874a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/vocab-runtime/src/docloader.ts

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates getDocumentLoader() to reject HTML and XHTML responses that do not advertise an ActivityPub alternate document with a FetchError instead of attempting to parse them as JSON. It introduces a helper function readBoundedText to safely read response bodies up to a 1MB limit to prevent ReDoS attacks and excessive memory usage. The feedback suggests adding a defensive fallback to response.text() in readBoundedText to ensure compatibility with environments or test mocks where response.body.getReader is not available.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/vocab-runtime/src/docloader.ts
Cancel oversized response bodies when Content-Length already exceeds the
HTML scan limit, so document-loader failures do not leave the fetch stream
open while returning early.

Fall back to response.text() when a custom or non-standard response body
lacks getReader().  This keeps the bounded reader usable with fetch mocks
and older runtimes while preserving the streaming path where available.

fedify-dev#913 (comment)
fedify-dev#913 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates @fedify/vocab-runtime to reject HTML and XHTML responses that do not advertise an ActivityPub alternate document with a FetchError instead of attempting to parse the HTML as JSON. It introduces a MAX_HTML_SIZE limit of 1MB and helper functions like readBoundedText to safely stream and read HTML responses up to this limit, preventing JSON parser crashes on remote HTML error pages. Additionally, corresponding unit tests have been added to verify these behaviors, including handling oversized HTML and fallback scenarios. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

Reviewed commit: c4db929f49

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dahlia dahlia requested a review from sij411 July 1, 2026 01:40
sij411
sij411 previously approved these changes Jul 1, 2026

@sij411 sij411 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve this since compatibility concern was already handled.

@2chanhaeng 2chanhaeng left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions?

Comment thread packages/vocab-runtime/src/docloader.ts Outdated
Comment thread packages/vocab-runtime/src/docloader.ts Outdated
Comment thread packages/vocab-runtime/src/docloader.ts Outdated
Comment thread packages/vocab-runtime/src/docloader.ts Outdated
Comment thread packages/vocab-runtime/src/docloader.ts Outdated
The bounded HTML path can rely on the standard Response body type instead
of defensive assertions for non-standard mocks.  The parser now preserves
HTML-labelled JSON compatibility by attempting JSON.parse() directly and
converts syntax failures into the same FetchError used for ordinary HTML.

fedify-dev#913 (comment)
fedify-dev#913 (comment)
fedify-dev#913 (comment)
fedify-dev#913 (comment)
fedify-dev#913 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

Reviewed commit: 359f27073b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates @fedify/vocab-runtime to reject HTML and XHTML responses that do not advertise an ActivityPub alternate document with a FetchError instead of attempting to parse the HTML as JSON. It introduces a MAX_HTML_SIZE limit of 1MB and implements bounded text reading (readBoundedText) to safely cancel response bodies that exceed this limit, preventing potential ReDoS or memory issues. Additionally, unit tests have been updated and expanded to cover these new error conditions and body cancellation behaviors. There are no review comments provided, so I have no feedback to address.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@dahlia dahlia requested review from 2chanhaeng and sij411 July 1, 2026 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/vocab Activity Vocabulary related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Report HTML document-loader failures without JSON parser crashes

3 participants