feat: add fastCRW tool block #5025

Open
us wants to merge 2 commits into simstudioai:main from us:feat/add-fastcrw

Conversation

us commented Jun 13, 2026

What

Adds fastCRW as a tool block (scrape / crawl / map / search), mirroring the existing Firecrawl block.

Why

fastCRW is a Firecrawl-API-compatible web engine in a single ~8MB binary — self-host free or managed cloud. Flat pricing (1 credit = 1 page; no 4x stealth surcharge, no billed-on-failure) and free anti-bot stealth — a drop-in alternative to the Firecrawl block for Sim workflows.

Changes (additive only)

  • apps/sim/tools/crw/: scrape/crawl/map/search + types (mirrors tools/firecrawl/).
  • apps/sim/blocks/blocks/crw.ts + registered in blocks/registry.ts, tools/registry.ts.
  • Icon, CSP allowlist entry, BYOK key entry, integrations.json — every place Firecrawl is registered.

Config

CRW_API_KEY from https://fastcrw.com/dashboard (free tier); base URL overridable for self-host.
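A minimal sketch of how the base URL override might resolve. The helper name `resolveCrwBaseUrl` comes from the review summaries in this thread; the body is an assumption, as is the default of `https://fastcrw.com/api` (the managed endpoint the reviewers mention). The real helper in `apps/sim/tools/crw/base-url.ts` may differ.

```typescript
// Assumed default: the managed cloud endpoint mentioned by the reviewers.
const DEFAULT_CRW_BASE_URL = 'https://fastcrw.com/api'

// Hypothetical implementation; only the function name appears in the PR discussion.
function resolveCrwBaseUrl(baseUrl?: string): string {
  const trimmed = baseUrl?.trim()
  // Blank or missing override falls back to the managed endpoint.
  if (!trimmed) return DEFAULT_CRW_BASE_URL
  // Strip trailing slashes so `${base}/v1/scrape` never contains a double slash.
  return trimmed.replace(/\/+$/, '')
}
```

With this shape, a self-hosted value such as `http://localhost:3000/` would resolve to `http://localhost:3000`, and an empty field would keep the managed default.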

Why fastCRW — beyond Firecrawl compatibility

The common assumption: "Firecrawl is open-source, so self-hosting it gets you the same thing." It doesn't.

Firecrawl's real anti-bot and Cloudflare-bypass path runs through fire-engine, which is cloud-only — the self-hosted build falls back to plain fetch / Playwright and cannot get past Cloudflare or most JS-heavy, protected sites. It also requires a multi-service stack (Redis + workers + Playwright) to run at all.

fastCRW ships the full capability set in its open core (AGPL): Cloudflare JS-challenge handling, UA rotation, SPA rendering, BYO-proxy with rotation, and an HTTP → headless → proxy fallback ladder — one binary, no cloud dependency, no asterisks.

Practical upshot for Sim users: self-hosted Sim + fastCRW is a genuinely complete, cloud-free scrape/crawl/search stack — something you cannot actually build today with Firecrawl's OSS. For the managed path, flat per-page pricing (no stealth surcharge, no charge on failure) makes cost predictable for crawl- and search-heavy workflows.


Happy to adjust — I maintain the integration and can provide free credits.

vercel Bot commented Jun 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Jun 13, 2026 7:40pm |

cursor Bot commented Jun 13, 2026

PR Summary

Low Risk
Additive integration with BYOK-only outbound API calls; crawl polling could hold a workflow slot until timeout, similar to other async crawl providers.

Overview
Adds fastCRW as a Firecrawl-compatible web data integration so workflows can scrape, search, crawl, and map sites via managed cloud (https://fastcrw.com/api) or a self-hosted Base URL.

New crw workflow block routes operations to four tools (crw_scrape, crw_search, crw_crawl, crw_map) with BYOK API keys (CRW_API_KEY / workspace BYOK provider crw). Crawl creates an async job and polls until completion or the execution timeout.
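The create-then-poll behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `postProcess` code: `pollCrawl`, `CrawlStatus`, and the injected `getStatus` callback are hypothetical names, and only the `completed` and `failed` states are taken from the review discussion.

```typescript
// Hypothetical status shape; only 'completed' and 'failed' are named in the review.
interface CrawlStatus {
  status: string
  data?: unknown[]
  error?: string
}

type PollResult =
  | { success: true; pages: unknown[]; total: number }
  | { success: false; error: string }

// getStatus would wrap a GET to `${baseUrl}/v1/crawl/${jobId}` in a real tool.
async function pollCrawl(
  getStatus: () => Promise<CrawlStatus>,
  timeoutMs: number,
  intervalMs: number
): Promise<PollResult> {
  const deadline = Date.now() + timeoutMs
  while (Date.now() < deadline) {
    const job = await getStatus()
    if (job.status === 'completed') {
      const pages = job.data ?? []
      return { success: true, pages, total: pages.length }
    }
    if (job.status === 'failed') {
      return { success: false, error: job.error ?? 'Crawl failed' }
    }
    // Any other status is treated as "still running": wait, then poll again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
  }
  return { success: false, error: `Crawl timed out after ${timeoutMs}ms` }
}
```

Injecting `getStatus` keeps the loop testable without network access. The deadline also makes the risk flagged in the summary concrete: a job that never completes occupies the caller for the full `timeoutMs`.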

Wiring is additive across block/tool registries, integrations.json, icon mapping, CSP connect-src for fastcrw.com, and the Search & web BYOK settings section. Vitest coverage exercises URL resolution, request shaping, and response mapping.

Reviewed by Cursor Bugbot for commit 056eac2. Bugbot is set up for automated code reviews on this repo. Configure here.

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 2964aed. Configure here.

pages: [],
total: 0,
},
}

Crawl job id wrong path

High Severity

The crawl create handler sets jobId from the top-level id on the JSON body, but fastCRW’s documented POST /v1/crawl response puts the job id under the nested data object. jobId stays undefined, so status polling hits /v1/crawl/undefined and the crawl operation fails end-to-end.
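One way to make the job id lookup tolerant of both response shapes is sketched below. The nested `data.id` shape is the one this comment attributes to fastCRW, and the top-level `id` is the shape the handler currently reads; neither is verified here, and `extractCrawlJobId` and `crawlStatusUrl` are hypothetical helper names.

```typescript
// Assumed response shape for POST /v1/crawl; both id locations are optional.
interface CrawlCreateResponse {
  success?: boolean
  id?: string
  data?: { id?: string }
}

// Prefer the nested id, then fall back to the top-level one.
function extractCrawlJobId(body: CrawlCreateResponse): string | undefined {
  return body.data?.id ?? body.id
}

// Build the status URL only from a known id, so "undefined" never reaches the path.
function crawlStatusUrl(baseUrl: string, jobId: string): string {
  return `${baseUrl}/v1/crawl/${encodeURIComponent(jobId)}`
}
```

Requiring a `string` in `crawlStatusUrl` lets the type checker force callers to handle the missing-id case before polling.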

Comment thread apps/sim/tools/crw/crawl.ts
formats: params.formats || ['markdown'],
onlyMainContent: params.onlyMainContent || false,
},
}

Crawl sends maxPages not limit

Medium Severity

The crawl request body sends maxPages, while fastCRW’s Firecrawl-compatible POST /v1/crawl expects limit for the page cap. The block’s Max Pages value is ignored and the service falls back to its default crawl size.
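If the endpoint does expect `limit`, the body builder could keep the block-facing `maxPages` name and translate it at the boundary. The sketch below assumes that `formats` and `onlyMainContent` sit under a `scrapeOptions` key, as in Firecrawl's v1 API; the quoted hunk does not show the enclosing key, and `buildCrawlBody` and `CrwCrawlParams` are hypothetical names.

```typescript
// Hypothetical parameter shape for the crawl tool.
interface CrwCrawlParams {
  url: string
  maxPages?: number
  formats?: string[]
  onlyMainContent?: boolean
}

function buildCrawlBody(params: CrwCrawlParams): Record<string, unknown> {
  const body: Record<string, unknown> = {
    url: params.url,
    // Assumption: nested under scrapeOptions, as in Firecrawl's v1 crawl request.
    scrapeOptions: {
      formats: params.formats ?? ['markdown'],
      onlyMainContent: params.onlyMainContent ?? false,
    },
  }
  // Send the page cap under the wire name `limit`; omit it when unset
  // so the service applies its own default.
  if (params.maxPages !== undefined) body.limit = params.maxPages
  return body
}
```

Note that the Greptile summary further down describes `maxPages` as an intentional fastCRW difference, so the wire name should be confirmed against fastCRW's own API reference before changing it.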

greptile-apps Bot commented Jun 13, 2026

Contributor

Greptile Summary

This PR adds fastCRW as a new tool block (scrape / crawl / map / search), mirroring the existing Firecrawl block. The integration is additive-only: new files under tools/crw/ and blocks/blocks/crw.ts, plus registration in the block/tool registries, BYOK keys, CSP allowlist, icon, and integrations.json.

  • Four tool configs (crw_scrape, crw_search, crw_crawl, crw_map) mirror Firecrawl's structure with fastCRW-specific differences: maxPages instead of limit for crawl, a dynamic baseUrl param for self-hosting, and a resolveCrwBaseUrl helper.
  • Registration is complete across all required locations (BYOK schema, type union, CSP, icon mapping, integrations JSON), and a test file covers URL construction, body building, and response transformation for all four operations.

Confidence Score: 4/5

The change is purely additive and isolated to new files; no existing functionality is modified. The three tools with hardcoded success responses will silently swallow API-level errors, but they won't cause data corruption or affect other blocks.

Three of the four new tools (scrape, search, crawl) always return success: true from transformResponse even when the API body indicates failure — the crawl case is the worst because an undefined jobId leads the poll loop to request /v1/crawl/undefined, masking the real error. The fourth tool (map) handles this correctly, making the inconsistency self-contained within this PR. No other part of the codebase is touched.

apps/sim/tools/crw/scrape.ts, apps/sim/tools/crw/search.ts, apps/sim/tools/crw/crawl.ts — the transformResponse functions in all three need to check data.success before reporting a successful result.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/blocks/blocks/crw.ts | New block config mirroring Firecrawl; routes scrape/search/crawl/map to the correct crw_* tools, formats params, and exposes baseUrl for self-hosting. Clean and consistent with existing block patterns. |
| apps/sim/tools/crw/scrape.ts | Scrape tool is structurally correct but hardcodes success: true in transformResponse regardless of API-level errors, unlike map.ts which properly checks data.success. |
| apps/sim/tools/crw/search.ts | Search tool also hardcodes success: true in transformResponse; same inconsistency with map.ts. Additionally, limit and sources params are used in the body builder but not declared in the tool's params definition (though this mirrors the Firecrawl search pattern). |
| apps/sim/tools/crw/crawl.ts | Crawl tool implements async polling correctly, but transformResponse ignores data.success: if job creation returns HTTP 200 with success:false, postProcess will poll /v1/crawl/undefined, leading to a confusing 404 error instead of the real failure. |
| apps/sim/tools/crw/map.ts | Map tool correctly checks data.success in transformResponse and handles missing links with a fallback array. Well-structured and complete. |
| apps/sim/tools/crw/types.ts | Comprehensive type definitions and output property constants. Clean mirror of the Firecrawl types, with appropriate additions for fastCRW-specific fields. |
| apps/sim/tools/crw/crw.test.ts | Good coverage of URL construction, body building, and response transformation for all four operations. Tests document the expected API response shapes clearly. |
| apps/sim/lib/core/security/csp.ts | Adds https://fastcrw.com to connect-src allowlist. Covers the full domain/origin, which is sufficient since the API lives at /api/v1/* on the same origin. |
| apps/sim/tools/crw/base-url.ts | Clean utility for resolving the base URL, with trailing-slash stripping and a sensible default. Well-tested. |
| apps/sim/lib/api/contracts/byok-keys.ts | Correctly adds 'crw' to the BYOK provider ID zod schema enum. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CrwBlock - crw.ts] -->|operation=scrape| B[crw_scrape tool]
    A -->|operation=search| C[crw_search tool]
    A -->|operation=crawl| D[crw_crawl tool]
    A -->|operation=map| E[crw_map tool]

    B --> F["POST /v1/scrape\n(fastcrw.com/api)"]
    C --> G["POST /v1/search\n(fastcrw.com/api)"]
    D --> H["POST /v1/crawl\n(fastcrw.com/api)"]
    E --> I["POST /v1/map\n(fastcrw.com/api)"]

    D -->|async job| J[postProcess polling loop]
    J --> K["GET /v1/crawl/{jobId}"]
    K -->|completed| L[Return pages + total]
    K -->|failed| M[Return error]
    K -->|timeout| N[Return timeout error]

    B --> O[transformResponse - always success:true]
    C --> P[transformResponse - always success:true]
    E --> Q[transformResponse - checks data.success]

Comments Outside Diff (1)

  1. apps/sim/tools/crw/crawl.ts, lines 623-634

    P2 transformResponse ignores API-level job creation failure

    If the crawl POST returns HTTP 200 with { success: false, error: "…" }, transformResponse still returns success: true with jobId: undefined. postProcess then checks if (!result.success) (passes), and proceeds to poll ${baseUrl}/v1/crawl/undefined, which returns a 404 and surfaces a confusing "Failed to get crawl status: Not Found" error rather than the original creation error. Guard against this by checking data.success (or at least data.id) in transformResponse before the poll loop begins.
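The guard suggested here might look like the following sketch. It works on the already-parsed JSON body rather than a `Response`, and the names `toCrawlCreateResult` and `CrawlCreateBody` are hypothetical; the actual `transformResponse` signature and output envelope in the PR may differ.

```typescript
// Assumed body shape for the crawl-creation response.
interface CrawlCreateBody {
  success?: boolean
  id?: string
  data?: { id?: string }
  error?: string
}

type CrawlCreateResult =
  | { success: true; output: { jobId: string } }
  | { success: false; output: Record<string, never>; error: string }

function toCrawlCreateResult(body: CrawlCreateBody): CrawlCreateResult {
  const jobId = body.data?.id ?? body.id
  // Fail fast on an explicit API failure or a missing id, so the poll loop
  // never starts and the original creation error is what the user sees.
  if (body.success === false || !jobId) {
    return {
      success: false,
      output: {},
      error: body.error ?? 'Crawl job creation returned no job id',
    }
  }
  return { success: true, output: { jobId } }
}
```

Because the success variant carries a non-optional `jobId`, a caller that checks `result.success` first cannot build a `/v1/crawl/undefined` URL by accident.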

Reviews (1): Last reviewed commit: "feat: add fastCRW tool block" | Re-trigger Greptile

Comment on lines +88 to +100
    const result = data.data ?? data

    return {
      success: true,
      output: {
        markdown: result.markdown,
        html: result.html,
        metadata: result.metadata,
      },
    }
  },

  outputs: {

P2 Scrape/search always report success: true regardless of API error body

Both scrape.ts and search.ts hardcode success: true in transformResponse. The map.ts counterpart correctly propagates data.success. When the fastCRW API returns HTTP 200 with { success: false, error: "…" } (e.g., invalid URL or auth error), the scrape and search tools will still emit success: true with undefined output fields, masking the failure from downstream blocks. map.ts shows the correct pattern: return success: data.success and reflect it in the output envelope.
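A sketch of the propagation pattern described here, applied to the scrape case. It takes the parsed JSON body as input; `toScrapeResult` and the two interfaces are hypothetical names, and the `data.data ?? data` fallback is carried over from the hunk under review.

```typescript
interface ScrapePayload {
  markdown?: string
  html?: string
  metadata?: Record<string, unknown>
}

// Assumed envelope: payload either nested under `data` or at the top level.
interface ScrapeBody extends ScrapePayload {
  success?: boolean
  error?: string
  data?: ScrapePayload
}

function toScrapeResult(body: ScrapeBody) {
  // Surface an API-level failure instead of emitting undefined output fields.
  if (body.success === false) {
    return { success: false as const, output: {}, error: body.error ?? 'Scrape failed' }
  }
  const result: ScrapePayload = body.data ?? body
  return {
    success: true as const,
    output: {
      markdown: result.markdown,
      html: result.html,
      metadata: result.metadata,
    },
  }
}
```

The check uses `success === false` rather than a truthiness test so that a body which omits the flag entirely is still treated as a success. That is a design choice; mirroring map.ts exactly with `success: data.success` would be stricter.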

Comment on lines +66 to +75
  transformResponse: async (response: Response) => {
    const data = await response.json()

    return {
      success: true,
      output: {
        data: data.data,
      },
    }
  },

P2 Search always reports success: true on API-level failures

Same issue as scrape.ts: transformResponse always returns success: true without checking data.success. The map.ts tool in this same PR correctly checks data.success. If the search API returns { success: false, error: "…" } with HTTP 200, downstream blocks see a successful result with data: undefined rather than a proper error.

scrape, search, and crawl transformResponse hardcoded success: true, masking HTTP 200 responses with { success: false, error }. They now reflect data.success and surface the error, matching map.ts. Crawl additionally fails fast when job creation has no id, preventing a poll loop against /v1/crawl/undefined. Adds error-path tests.