feat: add fastCRW reader#23
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new Crw loader class in internagent/mas/agents/dr_agents/camel/loaders/crw_reader.py to integrate the fastCRW web scraper, exposing it in the package's __init__.py. The review feedback suggests simplifying the parameter unpacking in the crawl method to use **(params or {}) for consistency, and adding a type check in structured_scrape to ensure data is a dictionary before calling .get() to prevent potential AttributeErrors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| crawl_response = self.app.crawl( | ||
| url=url, | ||
| **({} if params is None else params), | ||
| **kwargs, | ||
| ) |
There was a problem hiding this comment.
For consistency with the scrape and map_site methods in this class, use **(params or {}) instead of the more verbose inline conditional expression.
| crawl_response = self.app.crawl( | |
| url=url, | |
| **({} if params is None else params), | |
| **kwargs, | |
| ) | |
| crawl_response = self.app.crawl( | |
| url=url, | |
| **(params or {}), | |
| **kwargs, | |
| ) |
| data = self.app.scrape( | ||
| url, | ||
| formats=['json'], | ||
| jsonSchema=response_format.model_json_schema(), | ||
| ) | ||
| return data.get('json', {}) |
There was a problem hiding this comment.
If self.app.scrape returns None or a non-dictionary response (e.g., due to an unexpected API failure or empty response), calling .get() on it will raise an AttributeError. Adding a type check ensures robustness.
| data = self.app.scrape( | |
| url, | |
| formats=['json'], | |
| jsonSchema=response_format.model_json_schema(), | |
| ) | |
| return data.get('json', {}) | |
| data = self.app.scrape( | |
| url, | |
| formats=['json'], | |
| jsonSchema=response_format.model_json_schema(), | |
| ) | |
| return data.get('json', {}) if isinstance(data, dict) else {} |
What
Adds fastCRW as a web scrape/search provider, alongside the existing Firecrawl integration — additive, mirrors the Firecrawl wiring (Firecrawl untouched).
Why
fastCRW is a genuinely different engine, not a Firecrawl alias:
Runs 100% locally with stealth + JS rendering in the open core. Anti-bot handling (Cloudflare JS-challenge, UA rotation), BYO-proxy + rotation, SPA rendering, and an HTTP→headless→proxy fallback ladder all ship in the AGPL binary. Firecrawl's OSS self-host gates its stealth engine (
fire-engine) behind a cloud-only flag, so its self-hosted instance falls back to plain fetch/Playwright and can't reach protected or JS-heavy sites. fastCRW's open core can.Faster and higher recall on Firecrawl's own benchmark dataset. Truth-recall 63.74% vs 56.04%; median latency ~1.9 s vs ~2.3 s (p50). Ships as a single ~8 MB Rust binary using ~6 MB RAM — no multi-service stack.
Search is built on SearXNG, not an alternative to it. SearXNG is the metasearch aggregator underneath; crw adds a quality layer on top: query expansion (multi-variant rewrite), content-aware reranking (re-scoring by fetched content instead of SearXNG's content-blind ordering), and category routing (research queries fan out to arxiv/semantic scholar/Google Scholar, code queries to GitHub). You get SearXNG's breadth plus a measurable accuracy layer — all open-source (AGPL) and self-hostable with configurable engines.
Flat pricing: 1 credit = 1 page, no stealth surcharge, no billed-on-failure. Key via
CRW_API_KEY(free tier at https://fastcrw.com/dashboard); self-host base URL supported.The integration diff is small precisely because fastCRW exposes a Firecrawl-compatible API — that compatibility is the reason this is additive rather than a larger change, not the reason to use it.
Happy to adjust to your conventions — I maintain the integration and can provide free credits to evaluate.