35 changes: 34 additions & 1 deletion AGENT.md
@@ -43,14 +43,47 @@ Developer reference for agents and contributors. User-facing overview: [README.m
| Local analysis | `analysis/local.py`, `requirements.txt` |
| AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
| Audit query tools (MCP + chat) | `tools/audit_tools/`, `mcp/server.py`, `mcp/http_server.py`, `commands/chat_cmd.py` |
-| Agent readiness checks | `tools/audit_tools/agent_readiness.py`, `tools/audit_tools/_aeo_helpers.py` |
+| Agent readiness checks | `tools/audit_tools/geo/agent_readiness.py`, `tools/audit_tools/_aeo_helpers.py` |
| Config / CLI | `config.py` (`load_config`, `load_config_from_db`), `cli.py`, `input.txt.example` |
| UI pipeline schema | `web/src/lib/pipelineConfigSchema.ts` |
| UI LLM schema | `web/src/lib/llmConfigSchema.ts` |
| UI config I/O | `web/src/server/pipelineConfig.ts`, `web/src/server/llmConfig.ts` |
| D3 charts (custom / compare / overview) | `web/src/components/charts/d3/`, `web/src/lib/viz/` |
| Chart.js charts (standard bar/line/doughnut) | `web/src/utils/chartJsDefaults.ts`, `react-chartjs-2` in views under `web/src/views/`, `web/src/components/searchPerformance/`, `web/src/components/traffic/` |

Schema changes: add Alembic migration (`alembic revision`).

**Charts — Chart.js + D3 (hybrid)**

The web UI uses **both** Chart.js and D3.js. Pick the library that fits each chart; do not migrate everything to one stack.

| Prefer **Chart.js** when… | Prefer **D3** when… |
|---------------------------|---------------------|
| Standard bar, line, or doughnut with typical legend/tooltip/responsive canvas | Custom layout (grouped compare bars, dual lines with null gaps, arc gauges) |
| Quick add with minimal custom SVG | Tight theme control via CSS vars (`--chart-grid`, `--chart-title`, etc.) |
| Page already on Chart.js (GSC, GA4, Links, Content Analytics) | Reusing shared components in `web/src/components/charts/d3/` |
| Chart.js plugins or defaults are enough | Neutral data types + adapters in `web/src/lib/viz/` |

**Decision rule:** If a D3 component already exists (`D3GroupedBarChart`, `D3DualLineChart`, `D3VerticalBarChart`, `D3DonutChart`, compact charts, `arcGauge.ts`), reuse it. If it is a one-off standard chart on a Chart.js page, stay on Chart.js unless D3 clearly wins.

**Current split (indicative)**

| Area | Library |
|------|---------|
| Overview dashboard (`/dashboard`) | D3 |
| Compare (`/compare`) | D3 |
| Content analytics — Analytics tab (`/content-analytics?tab=analytics`) | D3 |
| GSC / GA4 / scatter (`GscCharts`, `Ga4Charts`) | Chart.js |
| Links explorer, Content Analytics, Text Content Analysis | Chart.js |
| Score rings, distribution donuts, compact sparklines | D3 |

**Conventions (both stacks)**

- Wrap charts in `ChartPanel`, `ChartAccessibleFallback`, and/or `ChartCard` where applicable.
- Theme helpers live in `web/src/utils/chartJsDefaults.ts` (`getGridColor`, `getChartTitleColor`, `truncateChartLabel`) — use them from D3 as well as Chart.js.
- Keep chart-library types out of data-prep: use neutral shapes (`BarChartData`, `DualSeriesChartData` in `web/src/lib/viz/types.ts` and `web/src/lib/compareChartData.ts`); convert at the render layer via `web/src/lib/viz/adapters.ts` when needed.
- Migrate page-by-page when D3 is the better fit; do not remove `chart.js` from `package.json` until all consumers are migrated.

**Company standards:** UI copy in `web/src/strings.json` (Site Audit, Properties, Run audit). Data provenance on `report_meta` in report payload. Docs: `docs/COMPANY_STANDARDS.md`, `docs/GLOSSARY.md`. Migration `003_company_standards` (properties, pipeline_jobs, audit_log). Durable jobs in `web/src/server/pipelineJobsDb.ts`. Export: `GET /api/report/export`, `src/website_profiling/tools/export_audit.py`.

**Common footguns (check before finishing web or DB work)**
5 changes: 4 additions & 1 deletion AGENTS.md
@@ -32,8 +32,11 @@ python -m website_profiling.mcp # Start MCP server (stdio)
|------|-------|
| Crawl | `src/website_profiling/crawl/` |
| Report | `src/website_profiling/reporting/` |
-| GEO / AEO / Agent readiness | `src/website_profiling/tools/audit_tools/geo_tools.py`, `agent_readiness.py` |
+| GEO / AEO / Agent readiness | `src/website_profiling/tools/audit_tools/geo/geo_tools.py`, `geo/agent_readiness.py` |
| DB schema | `alembic/versions/` |
| UI | `web/src/views/`, `web/app/` |
| Charts | D3: `web/src/components/charts/d3/`, `web/src/lib/viz/` · Chart.js: GSC/GA4/Links etc. — see [AGENT.md](AGENT.md) § Charts |

**Charts:** Use **both** Chart.js and D3 — choose per chart (Overview/Compare → D3; standard GSC/GA4 bars → Chart.js). Full rules in [AGENT.md](AGENT.md).

**Common pitfalls:** See [AGENT.md](AGENT.md) for the full footguns checklist (React context, Python local imports, psycopg dict rows, coverage gates).
28 changes: 27 additions & 1 deletion README.md
@@ -25,6 +25,7 @@

<p align="center">
<a href="#getting-started">Quick start</a> ·
<a href="#seo-feedback-loop">Feedback loop</a> ·
<a href="#features">Features</a> ·
<a href="#scope-and-limitations">Limitations</a> ·
<a href="#architecture">Structure</a> ·
@@ -50,8 +51,33 @@ Site Audit is a **developer-friendly SEO audit** tool: self-hosted, transparent,
- Content writing and optimization with live SEO scoring
- Search Console, GA4, and Bing Webmaster integration
- Agency portfolio management and run comparison
- **Closed-loop SEO workflow** — audit, report, feed data to IDE agents via MCP, fix in code, review and compare
- Optional AI-assisted analysis over audit data via MCP-compatible tools

## SEO feedback loop

Site Audit is built for a **continuous improve-and-verify cycle**, not one-off dashboard checks. Crawl your site, generate reports, expose audit data to AI agents in **Cursor, Claude Code, or Copilot** via [340 MCP tools](docs/MCP.md), fix issues in your repository, then **review** the next run to compare health scores and issue deltas.

```text
Audit → Report → MCP → Fix → Review → (repeat)
```

<p align="center">
<img src="docs/assets/seo-feedback-loop.png" alt="Site Audit SEO feedback loop — Audit, Report, MCP, Fix, Review" width="920">
</p>

**How each step maps to the product**

| Step | What you do | In Site Audit |
|------|-------------|---------------|
| **Audit** | Crawl and score the site | Pipeline (`python -m src`), Lighthouse, on-page checks |
| **Report** | Export and prioritize fixes | PDF/HTML/CSV exports, issue board, fix roadmap |
| **MCP** | Pull audit context into your IDE | `python -m website_profiling.mcp` — read-only tools for Cursor / Claude Desktop |
| **Fix** | Ship changes in your codebase | Your PR workflow (MCP does not write to the site) |
| **Review** | Prove improvement | Compare runs, category deltas, GSC metric changes |

See [docs/MCP.md](docs/MCP.md) for MCP setup and example prompts (e.g. compare two reports, export issue diffs).

## Scope and limitations

Site Audit focuses on **honest, self-hosted technical SEO**. It is not a drop-in replacement for every paid SaaS data product.
@@ -93,7 +119,7 @@ Site Audit focuses on **honest, self-hosted technical SEO**. It is not a drop-in
</tr>
</table>

-Also included: **AI chat** over audit data (optional), **Content studio** (write &amp; optimize with live SEO scoring), **340 MCP tools** (local stdio or remote Streamable HTTP), image SEO, GEO/AEO readiness, keyword explorer (GSC + on-site), backlinks (GSC Links import), compare runs, and portfolio management for agencies.
+Also included: **AI chat** over audit data (optional), **Content studio** (write &amp; optimize with live SEO scoring), **340 MCP tools** (local stdio or remote Streamable HTTP), image SEO, GEO/AEO readiness, keyword explorer (GSC + on-site), backlinks (GSC Links import), compare runs, portfolio management for agencies, and the **agent-driven feedback loop** above.

<img src="docs/assets/social-preview.png" alt="Site Audit — developer-friendly SEO audit preview" width="100%">

30 changes: 30 additions & 0 deletions alembic/versions/024_app_settings.py
@@ -0,0 +1,30 @@
"""Add app_settings table for generic application-level key-value settings.

Used to persist appearance customisations (custom color palette, etc.) and
any future app-level preferences that have no dedicated table.

Revision ID: 024_app_settings
Revises: 023_crawl_page_markdown
"""
from __future__ import annotations

from alembic import op

revision = "024_app_settings"
down_revision = "023_crawl_page_markdown"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.execute("""
        CREATE TABLE app_settings (
            key TEXT NOT NULL PRIMARY KEY,
            value TEXT NOT NULL DEFAULT '',
            updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
        )
    """)


def downgrade() -> None:
    op.execute("DROP TABLE IF EXISTS app_settings")
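A minimal sketch of how a key-value table like `app_settings` might be read and written. It is not code from this PR: it uses the standard-library `sqlite3` module as a stand-in for Postgres (so `TIMESTAMPTZ`/`now()` become `TEXT`/`CURRENT_TIMESTAMP`), and the helper names `set_setting` / `get_setting` and the key `appearance.palette` are hypothetical.

```python
import sqlite3

# Stand-in schema: SQLite has no TIMESTAMPTZ or now(), so CURRENT_TIMESTAMP is used.
SCHEMA = """
CREATE TABLE app_settings (
    key        TEXT NOT NULL PRIMARY KEY,
    value      TEXT NOT NULL DEFAULT '',
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def set_setting(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Hypothetical helper: insert the key, or overwrite its value if it exists."""
    conn.execute(
        """
        INSERT INTO app_settings (key, value) VALUES (?, ?)
        ON CONFLICT (key) DO UPDATE
        SET value = excluded.value, updated_at = CURRENT_TIMESTAMP
        """,
        (key, value),
    )


def get_setting(conn: sqlite3.Connection, key: str, default: str = "") -> str:
    """Hypothetical helper: return the stored value, or `default` if the key is absent."""
    row = conn.execute(
        "SELECT value FROM app_settings WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
set_setting(conn, "appearance.palette", '{"primary": "#336699"}')
set_setting(conn, "appearance.palette", '{"primary": "#003366"}')  # overwrites, no second row
print(get_setting(conn, "appearance.palette"))
```

Because `key` is the primary key, the upsert keeps one row per setting. The same `INSERT ... ON CONFLICT (key) DO UPDATE` form also exists in Postgres, where `now()` would replace `CURRENT_TIMESTAMP`.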
1 change: 1 addition & 0 deletions docs/README.md
@@ -25,6 +25,7 @@ Marketing and README assets are stored in [assets/](assets/):
| Asset | Purpose |
|-------|---------|
| `readme-banner.png` | README header banner |
| `seo-feedback-loop.png` | SEO feedback loop diagram (Audit → Report → MCP → Fix → Review) |
| `social-preview.png` | Application screenshot for README and social previews |
| `banner.svg` | Source artwork for the banner |
| `logo.svg`, `logo-icon.svg` | Product logo and icon |
Binary file added docs/assets/seo-feedback-loop.png
2 changes: 2 additions & 0 deletions local-prod
@@ -0,0 +1,2 @@
#!/usr/bin/env bash
exec "$(cd "$(dirname "$0")" && pwd)/scripts/local-prod.sh" "$@"
114 changes: 114 additions & 0 deletions scripts/local-prod.sh
@@ -0,0 +1,114 @@
#!/usr/bin/env bash
# Local prod: same Postgres as ./local-run, Next.js build + start (NODE_ENV=production).
# Usage: ./local-prod [command]
# (default) start — DB, migrations, npm run build, npm run start
# build — npm run build only
# help — show commands
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$ROOT"

PG_CONTAINER="${WP_PG_CONTAINER:-wp-pg}"
PG_PORT="${WP_PG_PORT:-5432}"
PG_USER="${WP_PG_USER:-postgres}"
PG_PASSWORD="${WP_PG_PASSWORD:-dev}"
PG_DB="${WP_PG_DB:-website_profiling}"

export DATABASE_URL="${DATABASE_URL:-postgres://${PG_USER}:${PG_PASSWORD}@127.0.0.1:${PG_PORT}/${PG_DB}}"
export DATA_DIR="${DATA_DIR:-$ROOT/data}"
export PYTHON="${PYTHON:-$ROOT/.venv/bin/python}"
export WEBSITE_PROFILING_ROOT="$ROOT"
export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$ROOT/src"
export NODE_ENV=production

WEB="$ROOT/web"
LOCAL_RUN="$ROOT/scripts/local-run.sh"

log() { printf '\033[1;36m→\033[0m %s\n' "$*"; }
die() { printf '\033[1;31m✗\033[0m %s\n' "$*" >&2; exit 1; }

need_cmd() {
  command -v "$1" >/dev/null 2>&1 || die "Missing required command: $1"
}

cmd_web_deps() {
  need_cmd npm
  if [[ ! -d "$WEB/node_modules" ]]; then
    log "Installing web dependencies (npm ci)"
    (cd "$WEB" && npm ci)
  fi
}

cmd_build() {
  cmd_web_deps
  log "Building Next.js (production)"
  (cd "$WEB" && npm run build)
}

cmd_start() {
  local skip_build=0
  for arg in "$@"; do
    case "$arg" in
      --skip-build) skip_build=1 ;;
    esac
  done

  mkdir -p "$DATA_DIR"
  log "Ensuring Postgres and migrations (via ./local-run migrate)"
  "$LOCAL_RUN" migrate
  if [[ "$skip_build" -eq 0 ]]; then
    cmd_build
  else
    cmd_web_deps
    log "Skipping build (--skip-build)"
  fi
  log "Starting Next.js production server (Ctrl+C to stop)"
  log "DATABASE_URL=$DATABASE_URL"
  log "DATA_DIR=$DATA_DIR"
  log "PYTHON=$PYTHON"
  log "NODE_ENV=$NODE_ENV"
  cd "$WEB"
  export DATABASE_URL DATA_DIR PYTHON WEBSITE_PROFILING_ROOT PYTHONPATH NODE_ENV
  exec npm run start
}

cmd_help() {
  cat <<EOF
Local prod runner — same Postgres as ./local-run, Next.js in production mode

  ./local-prod                      Same as: start
  ./local-prod start                DB + migrations + build + npm run start
  ./local-prod start --skip-build   Start without rebuilding (reuse .next)
  ./local-prod build                npm run build only
  ./local-prod help                 Show this help

Environment overrides (optional):
  DATABASE_URL   (default: postgres://postgres:dev@127.0.0.1:5432/website_profiling)
  DATA_DIR       (default: <repo>/data)
  AUTH_SECRET    (optional — enables login when set)
  WP_PG_CONTAINER, WP_PG_PORT, WP_PG_USER, WP_PG_PASSWORD, WP_PG_DB

After start, open: http://localhost:3000/home
Use localhost (not 127.0.0.1) for pipeline APIs.

Dev mode with hot reload: ./local-run start
EOF
}

main() {
  local cmd="${1:-start}"
  case "$cmd" in
    start)
      shift || true
      cmd_start "$@"
      ;;
    build) cmd_build ;;
    help|-h|--help) cmd_help ;;
    *)
      die "Unknown command: $cmd (try: ./local-prod help)"
      ;;
  esac
}

main "$@"
2 changes: 2 additions & 0 deletions scripts/local-run.sh
@@ -172,6 +172,8 @@ Environment overrides (optional):
After start, open: http://localhost:3000/home
Run audits via sidebar "Run audit" (bottom-right FAB).

Production Next.js (same Postgres, no hot reload): ./local-prod start

Run CI-style tests: ./local-test (see ./local-test help). JS crawl integration: ./local-test browser.
EOF
}
15 changes: 13 additions & 2 deletions src/website_profiling/analysis/local.py
@@ -41,7 +41,11 @@ def _cfg_int(cfg: dict[str, str] | None, key: str, default: int) -> int:


def _tokenize_simhash(text: str) -> list[str]:
-    return re.findall(r"[a-z0-9]{3,}", text.lower())
+    # `[^\W_]` is word chars minus underscore: identical to the old `[a-z0-9]`
+    # for ASCII (input is lowercased) but ALSO matches Unicode letters/digits, so
+    # CJK / Cyrillic / Arabic / Greek pages no longer tokenize to nothing and
+    # collapse to SimHash 0 (which falsely clustered them all as duplicates).
+    return re.findall(r"[^\W_]{3,}", text.lower(), re.UNICODE)


def _stable_token_hash(token: str) -> int:
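
A small standalone comparison of the two tokenizer patterns from the hunk above may help show the effect of the change. The function names `tokenize_ascii` and `tokenize_unicode` and the sample strings are illustrative only. `re.UNICODE` is omitted here because it is already the default for `str` patterns in Python 3.

```python
import re


def tokenize_ascii(text: str) -> list[str]:
    # Old pattern: runs of 3+ ASCII lowercase letters or digits.
    return re.findall(r"[a-z0-9]{3,}", text.lower())


def tokenize_unicode(text: str) -> list[str]:
    # New pattern: runs of 3+ word characters, excluding underscore.
    return re.findall(r"[^\W_]{3,}", text.lower())


samples = ["Hello world_2024", "Привет мир", "東京都の天気"]
for sample in samples:
    print(repr(sample), tokenize_ascii(sample), tokenize_unicode(sample))
```

For the ASCII sample both functions return the same tokens. For the Cyrillic and Japanese samples the old pattern returns an empty list, which is the input that previously collapsed to SimHash 0.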
@@ -123,6 +127,11 @@ def compute_duplicate_groups(

    bucket: dict[int, list[str]] = defaultdict(list)
    for u, h in url_to_sh.items():
        # SimHash 0 means "no tokenizable content", not "identical content".
        # Bucketing those together unioned every untokenizable page as a single
        # giant duplicate group — skip them.
        if h == 0:
            continue
        bucket[h].append(u)

    fuzz = _import_rapidfuzz()
@@ -163,7 +172,9 @@ def union(a: str, b: str, method: str) -> None:
union(base, m, "simhash")

    if hamming_max > 0 and len(urls) <= simhash_max_urls:
-        sh_list = [(u, url_to_sh[u]) for u in urls]
+        # Exclude SimHash-0 (untokenizable) pages — every pair of them has
+        # Hamming distance 0 and would be wrongly merged as duplicates.
+        sh_list = [(u, url_to_sh[u]) for u in urls if url_to_sh[u] != 0]
        for i, (u1, h1) in enumerate(sh_list):
            for u2, h2 in sh_list[i + 1 :]:
                if _hamming(h1, h2) <= hamming_max:
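
The SimHash-0 guard in the two hunks above can be illustrated with a toy example. This is a sketch under stated assumptions, not the project's implementation: `toy_simhash` is a simplified 64-bit SimHash built on `hashlib.blake2b`, `bucket_exact` mirrors only the exact-match bucketing loop, and the URLs are made up.

```python
import hashlib
from collections import defaultdict


def toy_simhash(tokens: list[str], bits: int = 64) -> int:
    """Toy SimHash: per-bit majority vote over token hashes; no tokens gives 0."""
    weights = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def bucket_exact(url_to_sh: dict[str, int]) -> dict[int, list[str]]:
    """Group URLs by identical SimHash, skipping the 'no content' value 0."""
    bucket: dict[int, list[str]] = defaultdict(list)
    for url, h in url_to_sh.items():
        if h == 0:  # untokenizable page, not evidence of duplication
            continue
        bucket[h].append(url)
    return dict(bucket)


pages = {
    "/a": ["same", "text", "here"],
    "/b": ["same", "text", "here"],
    "/empty-1": [],
    "/empty-2": [],
}
url_to_sh = {url: toy_simhash(tokens) for url, tokens in pages.items()}
groups = [urls for urls in bucket_exact(url_to_sh).values() if len(urls) > 1]
print(groups)
```

Without the `h == 0` check, `/empty-1` and `/empty-2` would share bucket 0 and be reported as duplicates of each other even though neither has any comparable content.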
2 changes: 1 addition & 1 deletion src/website_profiling/analysis/page.py
@@ -91,7 +91,7 @@ def walk(obj: object) -> bool:
    "corporation",
    "store",
    "restaurant",
-    "professionalService",
+    "professionalservice",
    "newsmediaorganization",
})
_CONTACT_CAP = 10
3 changes: 3 additions & 0 deletions src/website_profiling/cli.py
@@ -9,6 +9,7 @@
    enrich_cmd,
    google_cmd,
    gsc_links_cmd,
    help_cmd,
    keywords_cmd,
    lighthouse_cmd,
    page_coach_cmd,
@@ -46,6 +47,8 @@ def main() -> None:
        chat_cmd.run(cfg, args)
    elif args.command == "page-markdown":
        page_markdown_cmd.run(cfg, args)
    elif args.command == "help":
        help_cmd.run(cfg, args)
    else:
        pipeline_cmd.run(cfg, args)

3 changes: 2 additions & 1 deletion src/website_profiling/commands/config_resolve.py
@@ -281,6 +281,7 @@ def build_parser() -> argparse.ArgumentParser:
            "page-coach",
            "chat",
            "page-markdown",
            "help",
        ],
        help="Run only this step (default: run all steps according to config)",
    )
@@ -394,7 +395,7 @@ def build_parser() -> argparse.ArgumentParser:
        "--stdin-json",
        action="store_true",
        dest="stdin_json",
-        help="For 'chat' command: read JSON payload from stdin and emit NDJSON events.",
+        help="For 'chat' and 'help' commands: read JSON payload from stdin and emit NDJSON events.",
    )
    parser.add_argument(
        "--resume-run-id",
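
To show how the new `help` choice and the `--stdin-json` flag fit together, here is a reduced `argparse` sketch. It assumes the step is an optional positional named `command` (inferred from `args.command` in `cli.py`); the real parser has more choices and options, and `build_parser_sketch` is a hypothetical name.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="website-profiling-sketch")
    # Assumption: the step is an optional positional; omitting it runs the full pipeline.
    parser.add_argument(
        "command",
        nargs="?",
        default=None,
        choices=["chat", "page-markdown", "help"],
        help="Run only this step (default: run all steps according to config)",
    )
    parser.add_argument(
        "--stdin-json",
        action="store_true",
        dest="stdin_json",
        help="For 'chat' and 'help' commands: read JSON payload from stdin and emit NDJSON events.",
    )
    return parser


args = build_parser_sketch().parse_args(["help", "--stdin-json"])
print(args.command, args.stdin_json)
```

With this shape, a dispatcher can branch on `args.command == "help"` and then require `args.stdin_json`, which matches the check at the top of `help_cmd.run`.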
41 changes: 41 additions & 0 deletions src/website_profiling/commands/help_cmd.py
@@ -0,0 +1,41 @@
"""CLI: help --stdin-json — single-turn help chat (NDJSON events on stdout)."""
from __future__ import annotations

import argparse
import json
import sys

from ..text_sanitize import sanitize_unicode_deep
from ..llm.help_agent import run_help_turn


def run(_cfg: dict, args: argparse.Namespace) -> None:
    if not getattr(args, "stdin_json", False):
        print("Error: help requires --stdin-json", file=sys.stderr)
        sys.exit(1)

    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError as e:
        print(json.dumps({"type": "error", "message": f"Invalid stdin JSON: {e}"}))
        sys.exit(1)

    messages = payload.get("messages") or []
    if not isinstance(messages, list):
        messages = []

    def on_event(event: dict) -> None:
        print(json.dumps(sanitize_unicode_deep(event), default=str), flush=True)

    try:
        result = run_help_turn(messages, on_event=on_event)
    except Exception as e:
        msg = str(e).strip() or type(e).__name__
        print(json.dumps({"type": "error", "message": msg}), flush=True)
        sys.exit(1)

    if not result.get("ok"):
        err = result.get("error", "Help agent failed")
        print(json.dumps({"type": "error", "message": err}), flush=True)
        sys.exit(1)
    sys.exit(0)