Merged
1 change: 1 addition & 0 deletions .coveragerc
@@ -8,6 +8,7 @@ omit =
*/website_profiling/integrations/bing/*
*/website_profiling/integrations/crux/*
*/website_profiling/integrations/serp/*
*/website_profiling/integrations/ai_citations/*
*/website_profiling/integrations/links/third_party_csv.py
*/website_profiling/lighthouse/*
*/website_profiling/reporting/*
3 changes: 2 additions & 1 deletion .gitignore
@@ -31,4 +31,5 @@ pipeline-config.txt
.coverage
.agents/
skills-lock.json
crawl_results.csv
crawl_results.csv
commit.*
1 change: 1 addition & 0 deletions AGENT.md
@@ -43,6 +43,7 @@ Developer reference for agents and contributors. User-facing overview: [README.m
| Local analysis | `analysis/local.py`, `requirements.txt` |
| AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
| Audit query tools (MCP + chat) | `tools/audit_tools/`, `mcp/server.py`, `mcp/http_server.py`, `commands/chat_cmd.py` |
| Agent readiness checks | `tools/audit_tools/agent_readiness.py`, `tools/audit_tools/_aeo_helpers.py` |
| Config / CLI | `config.py` (`load_config`, `load_config_from_db`), `cli.py`, `input.txt.example` |
| UI pipeline schema | `web/src/lib/pipelineConfigSchema.ts` |
| UI LLM schema | `web/src/lib/llmConfigSchema.ts` |
39 changes: 39 additions & 0 deletions AGENTS.md
@@ -0,0 +1,39 @@
# Agent instructions — Site Audit (WebsiteProfiling)

> Developer reference for AI coding agents and contributors.

This file is the canonical entry point for agents. For full detail see [AGENT.md](AGENT.md).

**What it is:** Self-hosted SEO crawl and technical audit platform — `python -m src` from repo root. Stack: Python (crawl + analysis + MCP), Next.js (web UI), PostgreSQL.

**Key paths**

- `src/website_profiling/` — core Python package
  - `cli.py`, `config.py`, `crawl/`, `db/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `web/` — Next.js frontend
- `alembic/` — DB migrations
- `docs/` — documentation index
- `tests/` — pytest suite

**Run / dev**

```bash
./local-run # Start Postgres (Docker) + Next.js
./local-test # Run all three coverage gates
python -m src # Run audit pipeline
python -m website_profiling.mcp # Start MCP server (stdio)
```

**MCP:** 340 read-only audit tools via Model Context Protocol. See [docs/MCP.md](docs/MCP.md).

**Edit targets**

| Task | Where |
|------|-------|
| Crawl | `src/website_profiling/crawl/` |
| Report | `src/website_profiling/reporting/` |
| GEO / AEO / Agent readiness | `src/website_profiling/tools/audit_tools/geo_tools.py`, `agent_readiness.py` |
| DB schema | `alembic/versions/` |
| UI | `web/src/views/`, `web/app/` |

**Common pitfalls:** See [AGENT.md](AGENT.md) for the full footguns checklist (React context, Python local imports, psycopg dict rows, coverage gates).
20 changes: 17 additions & 3 deletions README.md
@@ -58,13 +58,13 @@ Site Audit focuses on **honest, self-hosted technical SEO**. It is not a drop-in

- **No live backlink index** — Backlink tools read **Google Search Console Links CSV imports** (and optional third-party CSV overlays). There is no Ahrefs, Semrush, Moz, or Majestic API integration.
- **No daily rank tracking** — Keyword positions come from **GSC snapshots** on your connected property, not a proprietary SERP tracker or rank-history database.
- **No live AI citation checks** — GEO/AEO tools use **on-site heuristics**; they do not query ChatGPT, Perplexity, or other AI search engines in real time.
- **Live AI citation checks are opt-in** — GEO/AEO tools default to **on-site heuristics** (no API required). Optional live checks via `check_ai_citations_live` require a BYO API key (`PERPLEXITY_API_KEY`, `OPENAI_API_KEY`, etc.) and explicit `opt_in=true`; they are not called automatically.
- **No third-party keyword volume APIs** — Keyword explorer uses **on-site frequency plus Search Console**; difficulty and SERP feature overlays are estimated unless you supply your own data.
- **No managed cloud** — You run it (Docker or local dev). This repo is not a hosted multi-tenant SaaS.
- **No substitute for Google access** — Search Console, Analytics, and Bing Webmaster require **your credentials**; missing or stale integrations show empty states with provenance labels, not fabricated metrics.
- **Not a ranking guarantee** — Category scores (0–100) are **internal audit scores**, not Google rankings or predicted traffic impact.
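
The opt-in rule for live AI citation checks can be pictured as a small gate. The following is a minimal sketch, not the project's implementation: the function name `live_citation_check_allowed`, the `PROVIDER_KEYS` mapping, and the returned reason strings are hypothetical. Only the two environment variable names and the `opt_in` flag come from the text above.

```python
import os

# Hypothetical provider -> environment variable mapping. Only these two
# variable names appear in the README; everything else here is assumed.
PROVIDER_KEYS = {
    "perplexity": "PERPLEXITY_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def live_citation_check_allowed(provider, opt_in=False, env=None):
    """Return (allowed, reason) for a live AI citation check."""
    env = os.environ if env is None else env
    if not opt_in:
        # Default path: no network call, on-site heuristics only.
        return False, "opt_in not set; using on-site heuristics"
    key_name = PROVIDER_KEYS.get(provider)
    if key_name is None:
        return False, f"unknown provider: {provider}"
    if not env.get(key_name):
        return False, f"{key_name} is not configured"
    return True, "ok"
```

Under these assumptions a live call needs both conditions at once: having a key configured is not enough without `opt_in`, and `opt_in` alone is not enough without a key.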

**Planned extensions** (not yet shipped): full backlink index beyond GSC import, SERP rank tracking beyond GSC snapshots, and live AI citation APIs. See [docs/MCP.md](docs/MCP.md#future-pipeline-items).
**Planned extensions** (not yet shipped): full backlink index beyond GSC import, SERP rank tracking beyond GSC snapshots. See [docs/MCP.md](docs/MCP.md#future-pipeline-items).

## Features

@@ -207,6 +207,18 @@ CI also runs a **Docker** job (image build, browser pytest in container, compose

Connect Google Search Console and Analytics via **Integrations** (gear icon) in the application UI.

### Google Ads Keyword Planner (optional)

Adds official search volume and competition data from the Google Ads API to the Keywords explorer. Requires:

1. A [Google Ads developer token](https://developers.google.com/google-ads/api/docs/first-call/dev-token) (Basic access is sufficient for keyword research).
2. A Google Ads manager account customer ID (login customer ID).
3. An existing Google OAuth connection (via Integrations) — users must re-consent after the `adwords` scope is added.

In **Integrations → Google Ads Keyword Planner**, enter the developer token and login customer ID. Then enable `enable_google_keyword_planner` in audit settings.

The overlay enriches keywords that have no Search Console impressions with `planner_avg_monthly_searches` and `planner_competition`, labelled "Google Keyword Planner" to distinguish them from real GSC data. GSC-ranked keywords are never overwritten. Set `enable_keyword_forecast = true` to additionally attach click/conversion forecasts to the top 50 keywords.
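
The "never overwrite GSC data" rule described above could look roughly like the sketch below. It is illustrative only: the function name `apply_planner_overlay`, the row keys `keyword` and `gsc_impressions`, and the shape of `planner_metrics` are assumptions. Storing the label in `planner_provenance` is also a guess based on the field name listed in the glossary.

```python
def apply_planner_overlay(keywords, planner_metrics):
    """Attach planner fields to keyword rows that have no GSC impressions.

    `keywords` is a list of dicts; `planner_metrics` maps keyword text to
    a dict with `avg_monthly_searches` and `competition` (assumed shape).
    """
    enriched = []
    for row in keywords:
        row = dict(row)  # copy so the caller's rows are not mutated
        metrics = planner_metrics.get(row["keyword"])
        has_gsc_data = (row.get("gsc_impressions") or 0) > 0
        if metrics and not has_gsc_data:
            row["planner_avg_monthly_searches"] = metrics["avg_monthly_searches"]
            row["planner_competition"] = metrics["competition"]
            # Label the source so the UI can tell it apart from GSC data.
            row["planner_provenance"] = "Google Keyword Planner"
        enriched.append(row)
    return enriched
```

Rows with Search Console impressions pass through unchanged, so planner estimates never replace measured data.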

### JavaScript crawl (optional)

In Audit settings, set **Crawl rendering** to `javascript` (always headless Chromium) or `auto` (static first, browser when SPA heuristics match). Requires Playwright from `requirements.txt` and Chromium on `PATH` or `CHROME_PATH` (included in Docker). The UI preflights via `GET /api/crawl/browser-status` before runs when JS or auto mode is selected.
@@ -224,7 +236,9 @@ Ask questions about audit data at [http://localhost:3000/chat](http://localhost:
| **Groq** | API key in AI settings or `GROQ_API_KEY`; official Groq Python SDK; native tool calling with streaming. Default model `openai/gpt-oss-120b`. |


The agent uses the same **340 read-only audit tools** as the MCP server ([docs/MCP.md](docs/MCP.md)), with **dynamic routing** (~45 tools per turn). Responses stream over SSE (`POST /api/chat`). Sessions persist per property (`chat_sessions` / `chat_messages`).
The agent uses the same **342 read-only audit tools** as the MCP server ([docs/MCP.md](docs/MCP.md)), with **dynamic routing** (~45 tools per turn). Responses stream over SSE (`POST /api/chat`). Sessions persist per property (`chat_sessions` / `chat_messages`).

**Read-only SQL chat tool (opt-in):** Set `CHAT_SQL_TOOL_ENABLED=true` to expose `get_sql_schema` and `run_sql_query` to the LLM. The agent can then answer arbitrary data questions by generating and executing a single read-only SELECT. Queries are validated by a four-layer guard (regex pre-filter → `sqlglot` AST + table allowlist → `BEGIN TRANSACTION READ ONLY` → optional least-privilege DB role); DELETE/UPDATE/INSERT/DDL and non-allowlisted tables are always blocked. In multi-property deployments, scope-binding CTEs are automatically injected to enforce tenant isolation. See [docs/OPS.md](docs/OPS.md#read-only-sql-chat-tool) for setup including the recommended `audit_readonly` Postgres role and optional RLS configuration.
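
To make the first guard layer concrete, here is a minimal sketch of what a regex pre-filter might look like. It uses only the standard library and is not the project's actual guard: the name `passes_prefilter` and the keyword deny-list are assumptions. The later layers (`sqlglot` AST with table allowlist, read-only transaction, least-privilege role) are deliberately left out.

```python
import re

# Assumed deny-list; the real guard's patterns are not shown in this diff.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy)\b",
    re.IGNORECASE,
)


def passes_prefilter(sql):
    """Cheap first-layer check: one statement, SELECT/WITH only, no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    if not re.match(r"(?i)(select|with)\b", statement):
        return False
    return _FORBIDDEN.search(statement) is None
```

A filter like this is intentionally conservative. It rejects a harmless query whose string literal contains a semicolon or the word `delete`, which is acceptable for a pre-filter because the stricter layers behind it make the final decision.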

### Content studio (optional, Experimental)

30 changes: 30 additions & 0 deletions alembic/versions/021_google_ads_planner_settings.py
@@ -0,0 +1,30 @@
"""Add developer_token and login_customer_id to google_app_settings for Keyword Planner.

Revision ID: 021_google_ads_planner_settings
Revises: 020_crawl_run_pause_state
Create Date: 2026-06-19
"""
from alembic import op

revision = "021_google_ads_planner_settings"
down_revision = "020_crawl_run_pause_state"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.execute(
        "ALTER TABLE google_app_settings ADD COLUMN IF NOT EXISTS developer_token TEXT"
    )
    op.execute(
        "ALTER TABLE google_app_settings ADD COLUMN IF NOT EXISTS login_customer_id TEXT"
    )


def downgrade() -> None:
    op.execute(
        "ALTER TABLE google_app_settings DROP COLUMN IF EXISTS developer_token"
    )
    op.execute(
        "ALTER TABLE google_app_settings DROP COLUMN IF EXISTS login_customer_id"
    )
36 changes: 36 additions & 0 deletions alembic/versions/022_dashboards.py
@@ -0,0 +1,36 @@
"""Custom dashboards — property-scoped dashboard builder with JSONB layout.

Revision ID: 022_dashboards
Revises: 021_google_ads_planner_settings
Create Date: 2026-06-19
"""
from __future__ import annotations

from alembic import op

revision = "022_dashboards"
down_revision = "021_google_ads_planner_settings"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.execute("""
        CREATE TABLE dashboards (
            id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
            property_id BIGINT NOT NULL REFERENCES properties(id) ON DELETE CASCADE,
            name TEXT NOT NULL DEFAULT 'Untitled dashboard',
            layout_json JSONB NOT NULL DEFAULT '{}'::jsonb,
            is_default BOOLEAN NOT NULL DEFAULT false,
            created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
            updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
        )
    """)
    op.execute("""
        CREATE INDEX dashboards_property_updated_idx
        ON dashboards(property_id, updated_at DESC)
    """)


def downgrade() -> None:
    op.execute("DROP TABLE IF EXISTS dashboards")
44 changes: 44 additions & 0 deletions alembic/versions/023_crawl_page_markdown.py
@@ -0,0 +1,44 @@
"""Add crawl_page_markdown table for per-URL extracted markdown storage.

Revision ID: 023_crawl_page_markdown
Revises: 022_dashboards
"""
from __future__ import annotations

from alembic import op

revision = "023_crawl_page_markdown"
down_revision = "022_dashboards"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.execute("""
        CREATE TABLE crawl_page_markdown (
            crawl_run_id BIGINT NOT NULL REFERENCES crawl_runs(id) ON DELETE CASCADE,
            url TEXT NOT NULL,
            property_id BIGINT REFERENCES properties(id) ON DELETE SET NULL,
            title TEXT,
            markdown TEXT NOT NULL,
            word_count INTEGER NOT NULL DEFAULT 0,
            strategy TEXT NOT NULL DEFAULT 'main_only',
            source_byte_length INTEGER NOT NULL DEFAULT 0,
            extracted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
            PRIMARY KEY (crawl_run_id, url)
        )
    """)
    op.execute("""
        CREATE INDEX idx_crawl_page_markdown_run
        ON crawl_page_markdown (crawl_run_id)
    """)
    op.execute("""
        CREATE INDEX idx_crawl_page_markdown_property
        ON crawl_page_markdown (property_id)
    """)


def downgrade() -> None:
    op.execute("DROP INDEX IF EXISTS idx_crawl_page_markdown_property")
    op.execute("DROP INDEX IF EXISTS idx_crawl_page_markdown_run")
    op.execute("DROP TABLE IF EXISTS crawl_page_markdown")
3 changes: 3 additions & 0 deletions docs/GLOSSARY.md
@@ -42,6 +42,9 @@ This glossary maps agency-facing UI terms to internal keys, database tables, and
| Moz / Majestic overlay | `third_party_overlays` on `gsc_links`, `/api/backlinks/third-party-import` | CSV export upload | Referring-domain comparison vs GSC sample |
| Bing backlinks | `bing_backlinks`, Integrations sync | Bing Webmaster API (optional) | Secondary link source |
| SERP competition overlay | `serp_estimated_competition` on keywords | SerpAPI (optional) | Estimated SERP difficulty |
| Keyword Planner overlay | `planner_avg_monthly_searches`, `planner_competition`, `planner_competition_index`, `planner_provenance` on keyword rows | Google Ads API `KeywordPlanIdeaService` (optional; `enable_google_keyword_planner`) | Official market-level search volume + competition — does not overwrite GSC impressions |
| Keyword Planner discovery | New keyword rows with `sources: ["planner"]` | `GenerateKeywordIdeas` | Brand-new keywords not yet in crawl or GSC |
| Keyword Planner forecast | `planner_forecast_clicks`, `planner_forecast_conversions` on top rows | `GenerateKeywordForecastMetrics` v24 (`enable_keyword_forecast`) | Paid-campaign click/conversion forecast — clearly labelled, not organic traffic |
| Scheduled audits | `properties.schedule_cron`, `/api/schedule/check` | Cron + pipeline spawn | Recurring site audit — see [OPS.md](OPS.md) |
| Property alerts | `alert_webhook_url`, `/api/alerts/check` | Health snapshot rules | Operations notifications |
| Content brief | Keywords Brief button, `/api/keywords/content-brief` | LLM or deterministic | Content planning |
22 changes: 22 additions & 0 deletions docs/MCP.md
@@ -288,6 +288,26 @@ Size-based tools require `probe_image_inventory=true` in pipeline config. Relate

`get_geo_readiness_score`, `get_aeo_content_signals_for_url`, `get_llms_txt_status`, `draft_llms_txt`, `get_faq_schema_coverage`, `list_pages_missing_faq_schema`, `get_eeat_signals_summary`, `get_internal_link_suggestions`, `check_ai_citation_presence`

### Agent documentation readiness (agentic-seo parity)

`get_agent_readiness_score` — 5-category composite score (0-100, A-F grade): discovery, content structure, token economics, capability signaling, UX bridge.

**Discovery:** `get_agents_md_status`, `get_skill_md_status`, `get_agent_permissions_status`

**Token economics:** `get_token_budget_summary`, `list_oversized_pages_for_agents`

**Content structure:** `get_content_structure_aeo_summary`, `get_markdown_availability_summary`, `list_pages_agent_unfriendly`

**UX bridge:** `get_copy_for_ai_signals`, `list_pages_missing_copy_for_ai`

**Generator:** `generate_agent_readiness_bundle` — draft AGENTS.md, skill.md, agent-permissions.json

**Example prompts:**
- "Score this site's agent documentation readiness"
- "Which pages are over the 8k token limit for AI agents?"
- "Does this site have an AGENTS.md or skill.md?"
- "Generate agent readiness files for my site"
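
As a rough illustration of how a 5-category composite with an A-F grade might be computed, the sketch below averages the five categories with equal weight. The equal weighting, the 90/80/70/60 grade cut-offs, the snake_case category keys, and the function name `agent_readiness` are all assumptions. The actual `get_agent_readiness_score` tool may weight and grade differently.

```python
# Category keys mirror the five categories named above; the snake_case
# spelling is an assumption.
CATEGORIES = (
    "discovery",
    "content_structure",
    "token_economics",
    "capability_signaling",
    "ux_bridge",
)


def agent_readiness(scores):
    """Average five 0-100 category scores and map the result to an A-F grade."""
    missing = [name for name in CATEGORIES if name not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total = round(sum(scores[name] for name in CATEGORIES) / len(CATEGORIES))
    # Assumed cut-offs (90/80/70/60); the tool's real thresholds may differ.
    for floor, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if total >= floor:
            return total, grade
    return total, "F"
```

With equal weights, one weak category (for example, no `AGENTS.md` lowering discovery) costs at most 20 points of the composite.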

### Integrations

`get_bing_index_status` (requires `bing_webmaster_api_key` in audit settings)
@@ -306,6 +326,8 @@ In-app chat uses **dynamic tool routing**: each turn loads Tier 0 router tools p

Responses stream over SSE via `POST /api/chat`. Sessions persist per property in `chat_sessions` and `chat_messages`.

**Optional crawl actions:** When **Allow chat to start crawls** is enabled under **Run audit → Settings → Content & AI → Chat agent**, the chat agent can guide crawl setup and call `prepare_audit_run` to show an in-chat confirm card. The user must authorize crawling and click **Run audit** — the agent never spawns jobs directly. MCP tools remain read-only; `prepare_audit_run` is chat-only and excluded from MCP bundles.

---

## Provider notes