codefrydev · PrashantUnity · Jun 20, 2026 · Jun 18, 2026
diff --git a/.coveragerc b/.coveragerc
@@ -8,6 +8,7 @@ omit =
     */website_profiling/integrations/bing/*
     */website_profiling/integrations/crux/*
     */website_profiling/integrations/serp/*
+    */website_profiling/integrations/ai_citations/*
     */website_profiling/integrations/links/third_party_csv.py
     */website_profiling/lighthouse/*
     */website_profiling/reporting/*

diff --git a/.gitignore b/.gitignore
@@ -31,4 +31,5 @@ pipeline-config.txt
 .coverage
 .agents/
 skills-lock.json
-crawl_results.csv
+crawl_results.csv
+commit.*
diff --git a/AGENT.md b/AGENT.md
@@ -43,6 +43,7 @@ Developer reference for agents and contributors. User-facing overview: [README.m
 | Local analysis | `analysis/local.py`, `requirements.txt` |
 | AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
 | Audit query tools (MCP + chat) | `tools/audit_tools/`, `mcp/server.py`, `mcp/http_server.py`, `commands/chat_cmd.py` |
+| Agent readiness checks | `tools/audit_tools/agent_readiness.py`, `tools/audit_tools/_aeo_helpers.py` |
 | Config / CLI | `config.py` (`load_config`, `load_config_from_db`), `cli.py`, `input.txt.example` |
 | UI pipeline schema | `web/src/lib/pipelineConfigSchema.ts` |
 | UI LLM schema | `web/src/lib/llmConfigSchema.ts` |

diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,39 @@
+# Agent instructions — Site Audit (WebsiteProfiling)
+
+> Developer reference for AI coding agents and contributors.
+
+This file is the canonical entry point for agents. For full detail see [AGENT.md](AGENT.md).
+
+**What it is:** Self-hosted SEO crawl and technical audit platform — `python -m src` from repo root. Stack: Python (crawl + analysis + MCP), Next.js (web UI), PostgreSQL.
+
+**Key paths**
+
+- `src/website_profiling/` — core Python package
+  - `cli.py`, `config.py`, `crawl/`, `db/`, `reporting/`, `analysis/`, `llm/`, `tools/`
+- `web/` — Next.js frontend
+- `alembic/` — DB migrations
+- `docs/` — documentation index
+- `tests/` — pytest suite
+
+**Run / dev**
+
+```bash
+./local-run          # Start Postgres (Docker) + Next.js
+./local-test         # Run all three coverage gates
+python -m src        # Run audit pipeline
+python -m website_profiling.mcp   # Start MCP server (stdio)
+```
+
+**MCP:** 340 read-only audit tools via Model Context Protocol. See [docs/MCP.md](docs/MCP.md).
+
+**Edit targets**
+
+| Task | Where |
+|------|-------|
+| Crawl | `src/website_profiling/crawl/` |
+| Report | `src/website_profiling/reporting/` |
+| GEO / AEO / Agent readiness | `src/website_profiling/tools/audit_tools/geo_tools.py`, `agent_readiness.py` |
+| DB schema | `alembic/versions/` |
+| UI | `web/src/views/`, `web/app/` |
+
+**Common pitfalls:** See [AGENT.md](AGENT.md) for the full footguns checklist (React context, Python local imports, psycopg dict rows, coverage gates).
diff --git a/README.md b/README.md
@@ -58,13 +58,13 @@ Site Audit focuses on **honest, self-hosted technical SEO**. It is not a drop-in
 
 - **No live backlink index** — Backlink tools read **Google Search Console Links CSV imports** (and optional third-party CSV overlays). There is no Ahrefs, Semrush, Moz, or Majestic API integration.
 - **No daily rank tracking** — Keyword positions come from **GSC snapshots** on your connected property, not a proprietary SERP tracker or rank-history database.
-- **No live AI citation checks** — GEO/AEO tools use **on-site heuristics**; they do not query ChatGPT, Perplexity, or other AI search engines in real time.
+- **Live AI citation checks are opt-in** — GEO/AEO tools default to **on-site heuristics** (no API required). Optional live checks via `check_ai_citations_live` require a BYO API key (`PERPLEXITY_API_KEY`, `OPENAI_API_KEY`, etc.) and explicit `opt_in=true`; they are not called automatically.
 - **No third-party keyword volume APIs** — Keyword explorer uses **on-site frequency plus Search Console**; difficulty and SERP feature overlays are estimated unless you supply your own data.
 - **No managed cloud** — You run it (Docker or local dev). This repo is not a hosted multi-tenant SaaS.
 - **No substitute for Google access** — Search Console, Analytics, and Bing Webmaster require **your credentials**; missing or stale integrations show empty states with provenance labels, not fabricated metrics.
 - **Not a ranking guarantee** — Category scores (0–100) are **internal audit scores**, not Google rankings or predicted traffic impact.
 
-**Planned extensions** (not yet shipped): full backlink index beyond GSC import, SERP rank tracking beyond GSC snapshots, and live AI citation APIs. See [docs/MCP.md](docs/MCP.md#future-pipeline-items).
+**Planned extensions** (not yet shipped): full backlink index beyond GSC import and SERP rank tracking beyond GSC snapshots. See [docs/MCP.md](docs/MCP.md#future-pipeline-items).
 
 ## Features
 
@@ -207,6 +207,18 @@ CI also runs a **Docker** job (image build, browser pytest in container, compose
 
 Connect Google Search Console and Analytics via **Integrations** (gear icon) in the application UI.
 
+### Google Ads Keyword Planner (optional)
+
+Adds official search volume and competition data from the Google Ads API to the Keywords explorer. Requires:
+
+1. A [Google Ads developer token](https://developers.google.com/google-ads/api/docs/first-call/dev-token) (Basic access is sufficient for keyword research).
+2. A Google Ads manager account customer ID (login customer ID).
+3. An existing Google OAuth connection (via Integrations) — users must re-consent after the `adwords` scope is added.
+
+In **Integrations → Google Ads Keyword Planner**, enter the developer token and login customer ID. Then enable `enable_google_keyword_planner` in audit settings.
+
+The overlay enriches keywords that have no Search Console impressions with `planner_avg_monthly_searches` and `planner_competition`, labelled "Google Keyword Planner" to distinguish them from real GSC data. GSC-ranked keywords are never overwritten. Set `enable_keyword_forecast = true` to additionally attach click/conversion forecasts to the top 50 keywords.
+
 ### JavaScript crawl (optional)
 
 In Audit settings, set **Crawl rendering** to `javascript` (always headless Chromium) or `auto` (static first, browser when SPA heuristics match). Requires Playwright from `requirements.txt` and Chromium on `PATH` or `CHROME_PATH` (included in Docker). The UI preflights via `GET /api/crawl/browser-status` before runs when JS or auto mode is selected.
@@ -224,7 +236,9 @@ Ask questions about audit data at [http://localhost:3000/chat](http://localhost:
 | **Groq**                   | API key in AI settings or `GROQ_API_KEY`; official Groq Python SDK; native tool calling with streaming. Default model `openai/gpt-oss-120b`.                               |
 
 
-The agent uses the same **340 read-only audit tools** as the MCP server ([docs/MCP.md](docs/MCP.md)), with **dynamic routing** (~45 tools per turn). Responses stream over SSE (`POST /api/chat`). Sessions persist per property (`chat_sessions` / `chat_messages`).
+The agent uses the same **342 read-only audit tools** as the MCP server ([docs/MCP.md](docs/MCP.md)), with **dynamic routing** (~45 tools per turn). Responses stream over SSE (`POST /api/chat`). Sessions persist per property (`chat_sessions` / `chat_messages`).
+
+**Read-only SQL chat tool (opt-in):** Set `CHAT_SQL_TOOL_ENABLED=true` to expose `get_sql_schema` and `run_sql_query` to the LLM. The agent can then answer arbitrary data questions by generating and executing a single read-only SELECT. Queries are validated by a four-layer guard (regex pre-filter → `sqlglot` AST + table allowlist → `BEGIN TRANSACTION READ ONLY` → optional least-privilege DB role); DELETE/UPDATE/INSERT/DDL and non-allowlisted tables are always blocked. In multi-property deployments, scope-binding CTEs are automatically injected to enforce tenant isolation. See [docs/OPS.md](docs/OPS.md#read-only-sql-chat-tool) for setup including the recommended `audit_readonly` Postgres role and optional RLS configuration.
 
 ### Content studio (optional, Experimental)
 

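Reviewer sketch: the first layer of the guard described in the README hunk above (the regex pre-filter) might look roughly like this. It is not the repository's implementation. `looks_like_single_select` and the keyword deny-list are hypothetical, and the table name in the usage lines is made up. The real guard also relies on the `sqlglot` AST check, the read-only transaction, and the optional DB role.

```python
import re

# Hypothetical deny-list; the project's actual pre-filter may differ.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy)\b",
    re.IGNORECASE,
)


def looks_like_single_select(sql: str) -> bool:
    """Cheap first pass: one statement, starts with SELECT/WITH, no write keywords."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False  # empty input, or more than one statement
    if not re.match(r"(?i)^(select|with)\b", stripped):
        return False
    return _FORBIDDEN.search(stripped) is None


# Usage (table name is illustrative only)
print(looks_like_single_select("SELECT url FROM pages LIMIT 5;"))
print(looks_like_single_select("SELECT 1; DROP TABLE pages"))
```

A filter like this is deliberately conservative: it also rejects a harmless SELECT whose string literal happens to contain a write keyword, which is one reason the later AST and transaction layers should make the real decision.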
diff --git a/alembic/versions/021_google_ads_planner_settings.py b/alembic/versions/021_google_ads_planner_settings.py
@@ -0,0 +1,30 @@
+"""Add developer_token and login_customer_id to google_app_settings for Keyword Planner.
+
+Revision ID: 021_google_ads_planner_settings
+Revises: 020_crawl_run_pause_state
+Create Date: 2026-06-19
+"""
+from alembic import op
+
+revision = "021_google_ads_planner_settings"
+down_revision = "020_crawl_run_pause_state"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    op.execute(
+        "ALTER TABLE google_app_settings ADD COLUMN IF NOT EXISTS developer_token TEXT"
+    )
+    op.execute(
+        "ALTER TABLE google_app_settings ADD COLUMN IF NOT EXISTS login_customer_id TEXT"
+    )
+
+
+def downgrade() -> None:
+    op.execute(
+        "ALTER TABLE google_app_settings DROP COLUMN IF EXISTS developer_token"
+    )
+    op.execute(
+        "ALTER TABLE google_app_settings DROP COLUMN IF EXISTS login_customer_id"
+    )
diff --git a/alembic/versions/022_dashboards.py b/alembic/versions/022_dashboards.py
@@ -0,0 +1,36 @@
+"""Custom dashboards — property-scoped dashboard builder with JSONB layout.
+
+Revision ID: 022_dashboards
+Revises: 021_google_ads_planner_settings
+Create Date: 2026-06-19
+"""
+from __future__ import annotations
+
+from alembic import op
+
+revision = "022_dashboards"
+down_revision = "021_google_ads_planner_settings"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    op.execute("""
+        CREATE TABLE dashboards (
+            id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
+            property_id BIGINT NOT NULL REFERENCES properties(id) ON DELETE CASCADE,
+            name TEXT NOT NULL DEFAULT 'Untitled dashboard',
+            layout_json JSONB NOT NULL DEFAULT '{}'::jsonb,
+            is_default BOOLEAN NOT NULL DEFAULT false,
+            created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+            updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
+        )
+    """)
+    op.execute("""
+        CREATE INDEX dashboards_property_updated_idx
+        ON dashboards(property_id, updated_at DESC)
+    """)
+
+
+def downgrade() -> None:
+    op.execute("DROP TABLE IF EXISTS dashboards")
diff --git a/alembic/versions/023_crawl_page_markdown.py b/alembic/versions/023_crawl_page_markdown.py
@@ -0,0 +1,44 @@
+"""Add crawl_page_markdown table for per-URL extracted markdown storage.
+
+Revision ID: 023_crawl_page_markdown
+Revises: 022_dashboards
+"""
+from __future__ import annotations
+
+from alembic import op
+
+revision = "023_crawl_page_markdown"
+down_revision = "022_dashboards"
+branch_labels = None
+depends_on = None
+
+
+def upgrade() -> None:
+    op.execute("""
+        CREATE TABLE crawl_page_markdown (
+            crawl_run_id BIGINT NOT NULL REFERENCES crawl_runs(id) ON DELETE CASCADE,
+            url TEXT NOT NULL,
+            property_id BIGINT REFERENCES properties(id) ON DELETE SET NULL,
+            title TEXT,
+            markdown TEXT NOT NULL,
+            word_count INTEGER NOT NULL DEFAULT 0,
+            strategy TEXT NOT NULL DEFAULT 'main_only',
+            source_byte_length INTEGER NOT NULL DEFAULT 0,
+            extracted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
+            PRIMARY KEY (crawl_run_id, url)
+        )
+    """)
+    op.execute("""
+        CREATE INDEX idx_crawl_page_markdown_run
+            ON crawl_page_markdown (crawl_run_id)
+    """)
+    op.execute("""
+        CREATE INDEX idx_crawl_page_markdown_property
+            ON crawl_page_markdown (property_id)
+    """)
+
+
+def downgrade() -> None:
+    op.execute("DROP INDEX IF EXISTS idx_crawl_page_markdown_property")
+    op.execute("DROP INDEX IF EXISTS idx_crawl_page_markdown_run")
+    op.execute("DROP TABLE IF EXISTS crawl_page_markdown")
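Reviewer sketch: one way the derived columns of `crawl_page_markdown` could be filled before an insert. The helper name `build_markdown_row`, the whitespace-based word count, and the UTF-8 byte count are all assumptions. The diff does not show how the pipeline computes `word_count` or `source_byte_length`.

```python
from __future__ import annotations


def build_markdown_row(
    crawl_run_id: int,
    url: str,
    source_html: str,
    markdown: str,
    title: str | None = None,
    strategy: str = "main_only",
) -> dict:
    """Assemble column values for one crawl_page_markdown row (hypothetical helper)."""
    return {
        "crawl_run_id": crawl_run_id,
        "url": url,
        "title": title,
        "markdown": markdown,
        # Assumption: words are whitespace-separated tokens of the markdown.
        "word_count": len(markdown.split()),
        "strategy": strategy,
        # Assumption: byte length of the UTF-8 encoded source HTML.
        "source_byte_length": len(source_html.encode("utf-8")),
    }


row = build_markdown_row(1, "https://example.com/", "<h1>Hello</h1>", "# Hello world")
print(row["word_count"], row["source_byte_length"])
```

Because the primary key is `(crawl_run_id, url)`, a caller that re-extracts the same URL within one run would need to update the existing row rather than insert a second one.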
diff --git a/docs/GLOSSARY.md b/docs/GLOSSARY.md
@@ -42,6 +42,9 @@ This glossary maps agency-facing UI terms to internal keys, database tables, and
 | Moz / Majestic overlay | `third_party_overlays` on `gsc_links`, `/api/backlinks/third-party-import` | CSV export upload | Referring-domain comparison vs GSC sample |
 | Bing backlinks | `bing_backlinks`, Integrations sync | Bing Webmaster API (optional) | Secondary link source |
 | SERP competition overlay | `serp_estimated_competition` on keywords | SerpAPI (optional) | Estimated SERP difficulty |
+| Keyword Planner overlay | `planner_avg_monthly_searches`, `planner_competition`, `planner_competition_index`, `planner_provenance` on keyword rows | Google Ads API `KeywordPlanIdeaService` (optional; `enable_google_keyword_planner`) | Official market-level search volume + competition — does not overwrite GSC impressions |
+| Keyword Planner discovery | New keyword rows with `sources: ["planner"]` | `GenerateKeywordIdeas` | Brand-new keywords not yet in crawl or GSC |
+| Keyword Planner forecast | `planner_forecast_clicks`, `planner_forecast_conversions` on top rows | `GenerateKeywordForecastMetrics` v24 (`enable_keyword_forecast`) | Paid-campaign click/conversion forecast — clearly labelled, not organic traffic |
 | Scheduled audits | `properties.schedule_cron`, `/api/schedule/check` | Cron + pipeline spawn | Recurring site audit — see [OPS.md](OPS.md) |
 | Property alerts | `alert_webhook_url`, `/api/alerts/check` | Health snapshot rules | Operations notifications |
 | Content brief | Keywords Brief button, `/api/keywords/content-brief` | LLM or deterministic | Content planning |

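Reviewer sketch: the "does not overwrite GSC impressions" rule in the Keyword Planner overlay row can be illustrated as below. The function `apply_planner_overlay`, the row keys `keyword` and `impressions`, and the shape of the planner lookup are assumptions made for illustration. Only the `planner_*` field names and the "Google Keyword Planner" label come from the diff.

```python
from __future__ import annotations


def apply_planner_overlay(rows: list[dict], planner: dict[str, dict]) -> list[dict]:
    """Attach planner_* fields to rows that have no GSC impressions (hypothetical)."""
    out = []
    for row in rows:
        merged = dict(row)  # copy so the input rows are left untouched
        metrics = planner.get(row["keyword"])
        # Assumption: a row counts as GSC-ranked when "impressions" is > 0.
        if metrics and not row.get("impressions"):
            merged["planner_avg_monthly_searches"] = metrics["avg_monthly_searches"]
            merged["planner_competition"] = metrics["competition"]
            merged["planner_provenance"] = "Google Keyword Planner"
        out.append(merged)
    return out


rows = [
    {"keyword": "seo audit", "impressions": 120},
    {"keyword": "site crawler", "impressions": 0},
]
planner = {
    "seo audit": {"avg_monthly_searches": 5000, "competition": "HIGH"},
    "site crawler": {"avg_monthly_searches": 900, "competition": "LOW"},
}
for merged_row in apply_planner_overlay(rows, planner):
    print(merged_row)
```

Keeping the planner values in separate `planner_*` keys, instead of writing into the GSC fields, is what lets the UI label them by provenance.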
diff --git a/docs/MCP.md b/docs/MCP.md
@@ -288,6 +288,26 @@ Size-based tools require `probe_image_inventory=true` in pipeline config. Relate
 
 `get_geo_readiness_score`, `get_aeo_content_signals_for_url`, `get_llms_txt_status`, `draft_llms_txt`, `get_faq_schema_coverage`, `list_pages_missing_faq_schema`, `get_eeat_signals_summary`, `get_internal_link_suggestions`, `check_ai_citation_presence`
 
+### Agent documentation readiness (agentic-seo parity)
+
+`get_agent_readiness_score` — 5-category composite score (0-100, A-F grade): discovery, content structure, token economics, capability signaling, UX bridge.
+
+**Discovery:** `get_agents_md_status`, `get_skill_md_status`, `get_agent_permissions_status`
+
+**Token economics:** `get_token_budget_summary`, `list_oversized_pages_for_agents`
+
+**Content structure:** `get_content_structure_aeo_summary`, `get_markdown_availability_summary`, `list_pages_agent_unfriendly`
+
+**UX bridge:** `get_copy_for_ai_signals`, `list_pages_missing_copy_for_ai`
+
+**Generator:** `generate_agent_readiness_bundle` — draft AGENTS.md, skill.md, agent-permissions.json
+
+**Example prompts:**
+- "Score this site's agent documentation readiness"
+- "Which pages are over the 8k token limit for AI agents?"
+- "Does this site have an AGENTS.md or skill.md?"
+- "Generate agent readiness files for my site"
+
 ### Integrations
 
 `get_bing_index_status` (requires `bing_webmaster_api_key` in audit settings)
@@ -306,6 +326,8 @@ In-app chat uses **dynamic tool routing**: each turn loads Tier 0 router tools p
 
 Responses stream over SSE via `POST /api/chat`. Sessions persist per property in `chat_sessions` and `chat_messages`.
 
+**Optional crawl actions:** When **Allow chat to start crawls** is enabled under **Run audit → Settings → Content & AI → Chat agent**, the chat agent can guide crawl setup and call `prepare_audit_run` to show an in-chat confirm card. The user must authorize crawling and click **Run audit** — the agent never spawns jobs directly. MCP tools remain read-only; `prepare_audit_run` is chat-only and excluded from MCP bundles.
+
 ---
 
 ## Provider notes