Site Audit

Site Audit — Developer-friendly SEO crawl & audit
Self-hosted technical SEO for developers: your infrastructure, your data.

Developer-friendly SEO audit platform — open-source crawl and technical audit tooling built with React, Python, PostgreSQL, and .NET. The stack is split into focused services: a Python FastAPI backend (crawl, pipeline, chat, integrations), a .NET BFF as the browser-facing API gateway, a .NET Data read service (report payloads, portfolio, issue status), and a .NET FileService for PDF/Excel export.

Overview

Site Audit is a developer-friendly SEO audit tool: self-hosted, transparent, and built for engineers who want crawl data, issue reports, and integrations in their own stack — not another opaque SaaS dashboard. It runs on your infrastructure, stores data in PostgreSQL, and produces actionable technical reports with no subscription tiers or gated exports.

Use cases

Developer-friendly SEO audits for owned or client properties
Crawl analysis with static and JavaScript rendering
Content writing and optimization with live SEO scoring
Search Console, GA4, and Bing Webmaster integration
Agency portfolio management and run comparison
Closed-loop SEO workflow — audit, report, feed data to IDE agents via MCP, fix in code, review and compare
Optional AI-assisted analysis over audit data via MCP-compatible tools

Demo

Demo video: "Site Audit, Open Source SEO Crawl & Audit"

SEO feedback loop

Site Audit is built for a continuous improve-and-verify cycle, not one-off dashboard checks. Crawl your site, generate reports, expose audit data to AI agents in Cursor, Claude Code, or Copilot via 340 MCP tools, fix issues in your repository, then review the next run to compare health scores and issue deltas.

Audit → Report → MCP → Fix → Review → (repeat)

How each step maps to the product

Step	What you do	In Site Audit
Audit	Crawl and score the site	Pipeline (`python -m src`), Lighthouse, on-page checks
Report	Export and prioritize fixes	PDF, Excel workbook, CSV, and JSON exports; issue board; fix roadmap
MCP	Pull audit context into your IDE	`python -m website_profiling.mcp` — read-only tools for Cursor / Claude Desktop
Fix	Ship changes in your codebase	Your PR workflow (MCP does not write to the site)
Review	Prove improvement	Compare runs, category deltas, GSC metric changes

See docs/MCP.md for MCP setup and example prompts (e.g. compare two reports, export issue diffs).
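
The Review step comes down to diffing two runs. The sketch below is one way to do that outside the UI. It assumes each run has been exported to JSON as a list of issue records with `url` and `code` fields; those field names are hypothetical, so check the real export schema before reusing it.

```python
from typing import Iterable

Issue = tuple[str, str]  # (url, issue code); hypothetical record shape


def issue_keys(issues: Iterable[dict]) -> set[Issue]:
    """Reduce issue records to comparable (url, code) pairs."""
    return {(item["url"], item["code"]) for item in issues}


def issue_delta(previous: Iterable[dict], current: Iterable[dict]) -> dict[str, list[Issue]]:
    """Classify issues as fixed, new, or still open between two runs."""
    before, after = issue_keys(previous), issue_keys(current)
    return {
        "fixed": sorted(before - after),  # present before, gone now
        "new": sorted(after - before),    # introduced since the last run
        "open": sorted(before & after),   # unchanged between runs
    }


if __name__ == "__main__":
    run_a = [{"url": "/pricing", "code": "missing-title"},
             {"url": "/blog", "code": "broken-link"}]
    run_b = [{"url": "/blog", "code": "broken-link"},
             {"url": "/about", "code": "missing-h1"}]
    for bucket, items in issue_delta(run_a, run_b).items():
        print(bucket, items)
```

Keying on the (url, code) pair keeps the diff stable when the export order changes between runs.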

Scope and limitations

Site Audit focuses on honest, self-hosted technical SEO. It is not a drop-in replacement for every paid SaaS data product.

No live backlink index — Backlink tools read Google Search Console Links CSV imports (and optional third-party CSV overlays). There is no Ahrefs, Semrush, Moz, or Majestic API integration.
No daily rank tracking — Keyword positions come from GSC snapshots on your connected property, not a proprietary SERP tracker or rank-history database.
Live AI citation checks are opt-in — GEO/AEO tools default to on-site heuristics (no API required). Optional live checks via check_ai_citations_live require a BYO API key (PERPLEXITY_API_KEY, OPENAI_API_KEY, etc.) and explicit opt_in=true; they are not called automatically.
No third-party keyword volume APIs — Keyword explorer uses on-site frequency plus Search Console; difficulty and SERP feature overlays are estimated unless you supply your own data.
No managed cloud — You run it (Docker or local dev). This repo is not a hosted multi-tenant SaaS.
No substitute for Google access — Search Console, Analytics, and Bing Webmaster require your credentials; missing or stale integrations show empty states with provenance labels, not fabricated metrics.
Not a ranking guarantee — Category scores (0–100) are internal audit scores, not Google rankings or predicted traffic impact.
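
The opt-in rule for live AI citation checks in the list above can be read as a two-part gate: an explicit `opt_in=true` and at least one configured API key. The sketch below illustrates that reading only. The function `live_check_allowed` is hypothetical and is not the project's implementation, and only the two key names spelled out above are checked.

```python
import os
from typing import Mapping, Optional

# Env var names taken from the limitations list; the "etc." there implies more exist.
KNOWN_KEY_VARS = ("PERPLEXITY_API_KEY", "OPENAI_API_KEY")


def live_check_allowed(opt_in: bool, env: Optional[Mapping[str, str]] = None) -> tuple[bool, str]:
    """Return (allowed, reason) for a live citation check (hypothetical helper)."""
    env = os.environ if env is None else env
    if not opt_in:
        return False, "opt_in is not true; using on-site heuristics"
    for name in KNOWN_KEY_VARS:
        if env.get(name):
            return True, f"live check enabled via {name}"
    return False, "no API key configured; using on-site heuristics"


if __name__ == "__main__":
    print(live_check_allowed(opt_in=False))
    print(live_check_allowed(opt_in=True, env={"OPENAI_API_KEY": "sk-example"}))
```

Checking `opt_in` first means a configured key alone never triggers a paid API call.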

Planned extensions (not yet shipped): full backlink index beyond GSC import, SERP rank tracking beyond GSC snapshots. See docs/MCP.md.

Features

Site crawl: static and JS rendering, sitemap export, crawl maps
Technical audit: issues, Lighthouse, accessibility (axe), on-page checks
Integrations: Google Search Console, GA4, Bing Webmaster
Self-hosted: Docker or local dev; your data stays yours

Also included: AI chat over audit data (optional), Content studio (write & optimize with live SEO scoring), 340 MCP tools (local stdio or remote Streamable HTTP), image SEO, GEO/AEO readiness, keyword explorer (GSC + on-site), backlinks (GSC Links import), compare runs, portfolio management for agencies, and the agent-driven feedback loop above.

Architecture

Browser  →  web (:3000)  →  bff (:8090)  →  fastapi (:8001)   crawl, pipeline, chat, integrations
                              │              data (:8091)     report reads, portfolio, issue status, filters
                              │              files (:8080)    PDF + Excel export
                              worker          background pipeline jobs (same Python image)
                              postgres        audit data store

WebsiteProfiling/
├── src/website_profiling/     # Python audit engine (CLI: python -m src)
│   ├── api/                   # FastAPI app (uvicorn :8001)
│   ├── worker/                # Background pipeline job runner
│   ├── crawl/                 # Crawler, fetchers, JS rendering
│   ├── reporting/             # Report builder, issue categories
│   ├── analysis/              # On-page / local analysis
│   ├── content_studio/        # Content writing + live SEO scoring
│   ├── lighthouse/            # Lighthouse runner
│   ├── integrations/          # Google Search Console, GA4, Bing, CrUX
│   ├── llm/                   # AI enrich + chat agent
│   ├── tools/                 # Exports, audit query tools, MCP helpers
│   ├── mcp/                   # MCP server (stdio + remote HTTP, domain bundles)
│   ├── db/                    # PostgreSQL storage layer
│   ├── commands/              # CLI subcommands
│   ├── cli.py                 # Pipeline entrypoint
│   └── config.py              # Config load (DB + shadow file)
├── web/                       # Vite + React SPA (nginx in prod)
│   ├── src/AppRoutes.tsx      # React Router routes
│   ├── src/components/        # React UI components
│   ├── src/views/             # Report views (overview, links, issues, …)
│   ├── src/lib/               # Client helpers, BFF apiUrl/apiFetch
│   └── public/                # Static assets (logo, favicon)
├── services/Bff/              # .NET BFF — auth + /api/* proxy (port 8090)
├── services/Data/             # .NET read service — report/portfolio/issue reads (port 8091)
├── services/FileService/      # .NET PDF + Excel workbook export (port 8080)
├── alembic/versions/          # PostgreSQL schema migrations
├── tests/                     # pytest suite + fixtures
├── docs/                      # Glossary, MCP, ops, brand assets
├── scripts/                   # local-run.sh, local-test.sh, local-prod.sh
├── .github/workflows/         # CI (Python, web, .NET, Docker)
├── docker-compose.yml         # Full dev stack (see Getting started)
├── docker-compose.prod.yml    # Production stack (requires AUTH_SECRET; optional MCP profile)
├── docker-compose.pull.yml    # Pre-built BACKEND_IMAGE + WEB_IMAGE smoke layout
├── Dockerfile                 # Python backend image (fastapi + worker roles)
├── local-run                  # Dev setup & start script
├── local-prod                 # Production build + preview (no hot reload)
├── local-test                 # Full test suite (CI parity)
├── requirements.txt           # Python dependencies
└── pipeline-config.example.txt

Path	Purpose
`src/website_profiling/`	Crawl, analyze, report, Lighthouse, integrations, AI — run via `python -m src`
`src/website_profiling/api/`	FastAPI HTTP layer — pipeline, chat, integrations, crawl control
`services/Bff/`	Browser API gateway — auth, CORS, routes `/api/*` to FastAPI, Data, FileService
`services/Data/`	.NET read/mutation service for report payloads, portfolio, issue status, saved filters
`services/FileService/`	PDF and Excel workbook export — see services/FileService/README.md
`web/src/lib/publicBase.ts`	BFF base URL (`VITE_BFF_BASE_URL`) and `apiFetch` / `apiUrl`
`web/src/lib/pipelineConfigSchema.ts`	Audit settings schema (UI ↔ PostgreSQL)
`alembic/versions/`	Database migrations — run `./local-run migrate`
`tests/`	Backend tests; `./local-test browser` for Playwright crawl integration
`docs/MCP.md`	MCP server setup for IDE and agent integrations
`data/`	Local secrets and shadow `pipeline-config.txt` (gitignored)

For layout details and common development patterns, see AGENT.md.

Getting started

Prerequisites

Tool	Used for
Docker	Postgres container (local dev) and full-stack compose
Python 3.12+	Audit engine, FastAPI, pipeline worker, tests
Node 20+	Vite + React SPA
.NET SDK 10+	BFF, Data, and FileService (required for `./local-run`; optional if you only use Docker)

Docker

Build and run the full dev stack from source:

docker compose up --build

Services: postgres, fastapi (:8001, internal), worker, data (:8091, internal), bff (:8090), web (:3000), files (:8080, internal).

Open http://localhost:3000/home. The browser talks only to the BFF (:8090); the BFF proxies to FastAPI, the Data service (report reads and portfolio routes), and FileService (PDF/workbook export).
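
If the page does not load, a quick way to see which service is missing is to probe the ports listed above. This is a minimal sketch using only the standard library. It assumes the services are reachable on localhost, which holds for `./local-run`; in the Docker dev stack the ports marked internal may not be published to the host and will show as down even when healthy.

```python
import socket

# Ports as listed in the service overview above.
SERVICES = {
    "web": 3000,
    "bff": 8090,
    "fastapi": 8001,
    "data": 8091,
    "files": 8080,
}


def is_listening(port: int, host: str = "localhost", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict[str, int] = SERVICES) -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name:8} :{SERVICES[name]:<5} {'up' if up else 'down'}")
```

A TCP connect only shows that something is listening; it does not confirm that the service behind the port is healthy.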

Production deployment: docker-compose.prod.yml — set POSTGRES_USER, POSTGRES_PASSWORD, AUTH_SECRET, BFF_ALLOWED_ORIGINS, and BFF_PUBLIC_URL. Optional remote MCP: docker compose -f docker-compose.prod.yml --profile mcp up. Pre-built images: docker-compose.pull.yml (BACKEND_IMAGE, WEB_IMAGE).

Local development

./local-run setup   # First time: Postgres, Python venv, Playwright/Chromium, migrations, npm deps
./local-run         # Start full dev stack → http://localhost:3000/home
./local-run db      # Postgres only (no app)
./local-run migrate # Apply Alembic migrations only
./local-run stop    # Stop Postgres container
./local-prod        # Same DB, Vite production build + preview (no hot reload)

./local-run starts (in order): FileService :8080, Data :8091, pipeline worker, FastAPI :8001, BFF :8090, and Vite :3000. Use localhost (not 127.0.0.1) for pipeline APIs so CORS and cookies match the BFF origin.

Default local DATABASE_URL: postgres://postgres:dev@127.0.0.1:5432/website_profiling (Docker Compose dev stack uses profiling:profiling).
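
To see which parts of the connection string differ between the two setups, the URL can be split with the standard library. This is a small illustration rather than project code; `describe_database_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Default local value quoted above.
DEFAULT_DATABASE_URL = "postgres://postgres:dev@127.0.0.1:5432/website_profiling"


def describe_database_url(url: str) -> dict[str, object]:
    """Break a postgres:// URL into its components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    for key, value in describe_database_url(DEFAULT_DATABASE_URL).items():
        print(f"{key:9} {value}")
```

For the Docker Compose dev stack, only the user and password components change (profiling:profiling).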

requirements.txt pins direct Python dependencies to versions verified by ./local-test python. Re-run the full test suite after intentional upgrades.

Pipeline job timeouts

Setting	Default	Description
`PIPELINE_JOB_STALE_HOURS`	1 hour	Reconciles stuck `running` rows
`PIPELINE_JOB_ORPHAN_MINUTES`	5 minutes	Clears orphan jobs with no live server process

Increase PIPELINE_JOB_STALE_HOURS for crawls that routinely exceed one hour.
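
The sketch below illustrates how the two thresholds might interact. It is not the worker's actual logic: the `last_heartbeat` and `process_alive` inputs are hypothetical, and it assumes both settings are plain numbers read from the environment.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def load_timeouts(env: Optional[Mapping[str, str]] = None) -> tuple[timedelta, timedelta]:
    """Read both thresholds, falling back to the documented defaults."""
    env = os.environ if env is None else env
    stale = timedelta(hours=float(env.get("PIPELINE_JOB_STALE_HOURS", "1")))
    orphan = timedelta(minutes=float(env.get("PIPELINE_JOB_ORPHAN_MINUTES", "5")))
    return stale, orphan


def classify_job(
    started_at: datetime,
    last_heartbeat: datetime,  # hypothetical field
    process_alive: bool,       # hypothetical liveness signal
    now: datetime,
    stale: timedelta,
    orphan: timedelta,
) -> str:
    """Label a `running` job as 'orphan', 'stale', or still 'running'."""
    if not process_alive and now - last_heartbeat > orphan:
        return "orphan"  # no live process for longer than the orphan window
    if now - started_at > stale:
        return "stale"   # running longer than the stale window
    return "running"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale, orphan = load_timeouts()
    print(classify_job(now - timedelta(hours=2), now, True, now, stale, orphan))
```

Under this reading, a healthy crawl that runs for two hours is labelled stale at the default setting, which is why the stale window should be raised for long crawls.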

Testing

./local-test              # Full CI parity: Python + web + .NET (Data, Bff, FileService)
./local-test python       # Backend: three 100% coverage gates + browser pytest + CLI smoke
./local-test browser      # JS crawl integration tests (skips if Chromium unavailable)
./local-test web          # Frontend: build, typecheck, lint, vitest
./local-test dotnet       # dotnet test Data + Bff + FileService + BFF OpenAPI drift gate
./local-test quick        # Fast loop; requires DB already running (no coverage gate)
./local-test all --no-cov # Full run without pytest coverage gate

CI runs separate jobs for Python (coverage gates), web, Data, Bff, FileService, and Docker (image build, browser pytest in container, compose smoke). See .github/workflows/ci.yml.

Configuration

Integrations

Connect Google Search Console and Analytics via Integrations (gear icon) in the application UI.

Google Ads Keyword Planner (optional)

Adds official search volume and competition data from the Google Ads API to the Keywords explorer. Requires:

A Google Ads developer token (Basic access is sufficient for keyword research).
A Google Ads manager account customer ID (login customer ID).
An existing Google OAuth connection (via Integrations) — users must re-consent after the adwords scope is added.

In Integrations → Google Ads Keyword Planner, enter the developer token and login customer ID. Then set enable_google_keyword_planner = true in audit settings.

The overlay enriches keywords that have no Search Console impressions with planner_avg_monthly_searches and planner_competition, labelled "Google Keyword Planner" to distinguish them from real GSC data. GSC-ranked keywords are never overwritten. Set enable_keyword_forecast = true to additionally attach click/conversion forecasts to the top 50 keywords.
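
The overlay rule (enrich only keywords without GSC impressions, never overwrite GSC data) can be sketched as below. The `planner_avg_monthly_searches` and `planner_competition` fields and the "Google Keyword Planner" label come from the description above. The row shape, the `gsc_impressions` and `volume_source` keys, and the planner input format are assumptions made for illustration.

```python
PLANNER_SOURCE = "Google Keyword Planner"


def overlay_planner(keywords: list[dict], planner: dict[str, dict]) -> list[dict]:
    """Return copies of keyword rows, enriched only where GSC has no impressions."""
    enriched = []
    for row in keywords:
        row = dict(row)  # copy so the caller's rows stay untouched
        data = planner.get(row["keyword"])
        if data and not row.get("gsc_impressions"):
            row["planner_avg_monthly_searches"] = data["avg_monthly_searches"]
            row["planner_competition"] = data["competition"]
            row["volume_source"] = PLANNER_SOURCE  # hypothetical provenance key
        enriched.append(row)
    return enriched


if __name__ == "__main__":
    rows = [
        {"keyword": "seo audit", "gsc_impressions": 120},
        {"keyword": "crawl budget", "gsc_impressions": 0},
    ]
    planner = {
        "seo audit": {"avg_monthly_searches": 5400, "competition": "HIGH"},
        "crawl budget": {"avg_monthly_searches": 880, "competition": "LOW"},
    }
    for row in overlay_planner(rows, planner):
        print(row)
```

In the example, only "crawl budget" picks up planner fields; the GSC-ranked "seo audit" row is returned unchanged.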

JavaScript crawl (optional)

In Audit settings, set Crawl rendering to javascript (always headless Chromium) or auto (static first, browser when SPA heuristics match). Requires Playwright from requirements.txt and Chromium on PATH or CHROME_PATH (included in Docker). The UI preflights via GET /api/crawl/browser-status before runs when JS or auto mode is selected.
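
The difference between the two modes can be sketched as a decision function. The SPA heuristic here (an empty mount node plus very little visible text) is an illustrative assumption, not the crawler's actual rule, and any mode value other than `javascript` or `auto` is treated as static.

```python
import re

# Hypothetical heuristic: an empty, typical SPA mount node in the static HTML.
SPA_ROOT = re.compile(r'<div[^>]+id=["\'](?:root|app|__next)["\'][^>]*>\s*</div>', re.I)
SCRIPTS = re.compile(r"<(script|style)\b.*?</\1>", re.I | re.S)
TAGS = re.compile(r"<[^>]+>")


def looks_like_spa(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether static HTML is an unrendered single-page-app shell."""
    visible = TAGS.sub(" ", SCRIPTS.sub(" ", html))
    text = " ".join(visible.split())
    return bool(SPA_ROOT.search(html)) and len(text) < min_text_chars


def needs_browser(mode: str, static_html: str) -> bool:
    """Decide whether a page should be fetched with headless Chromium."""
    if mode == "javascript":
        return True  # always render
    if mode == "auto":
        return looks_like_spa(static_html)  # static first, browser on match
    return False  # any other value: static fetch only


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
    print(needs_browser("auto", shell))
```

In `auto` mode the static fetch always happens first, so only pages that match the heuristic pay the cost of a browser render.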

AI chat (optional)

Ask questions about audit data at http://localhost:3000/chat. Enable a provider under Run audit → AI settings (llm_enabled, provider, model). ./local-run setup installs Python deps from requirements.txt (including httpx, OpenAI, Anthropic, and Groq SDKs; Gemini uses httpx via REST).

Provider	Notes
Ollama	Local daemon at `http://127.0.0.1:11434`. Chat UI lists installed models plus the live Ollama cloud catalog. Native tool calling when supported; ReAct fallback otherwise.
OpenAI / Anthropic	API key in AI settings or env (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`); native tool calling with streaming.
Google Gemini	API key in AI settings or `GEMINI_API_KEY`; REST via `httpx`.
Groq	API key in AI settings or `GROQ_API_KEY`; official Groq Python SDK; native tool calling with streaming. Default model `openai/gpt-oss-120b`.

The agent uses the same 340 read-only audit tools as the MCP server (docs/MCP.md), with dynamic routing (~45 tools per turn). Responses stream over SSE (POST /api/chat). Sessions persist per property (chat_sessions / chat_messages).
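
Because responses stream over SSE, a client outside the bundled UI has to reassemble events from text lines. The parser below follows the standard SSE framing (`data:` fields, blank line as event terminator). The payload carried inside each event is not documented here, so it is yielded as an opaque string.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient: flush a trailing event even without a closing blank line
        # (the SSE specification would discard it).
        yield "\n".join(buffer)


if __name__ == "__main__":
    stream = ["data: hello\n", "\n", ": ping\n", "data: a\n", "data: b\n", "\n"]
    for payload in iter_sse_data(stream):
        print(repr(payload))
```

With httpx, the lines could come from `response.iter_lines()` inside a `client.stream("POST", ...)` block; the request body that `/api/chat` expects is not covered here.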

Read-only SQL chat tool (opt-in): Set CHAT_SQL_TOOL_ENABLED=true to expose get_sql_schema and run_sql_query to the LLM. The agent can then answer arbitrary data questions by generating and executing a single read-only SELECT. Queries are validated by a four-layer guard (regex pre-filter → sqlglot AST + table allowlist → BEGIN TRANSACTION READ ONLY → optional least-privilege DB role); DELETE/UPDATE/INSERT/DDL and non-allowlisted tables are always blocked. In multi-property deployments, scope-binding CTEs are automatically injected to enforce tenant isolation. See docs/OPS.md for setup including the recommended audit_readonly Postgres role and optional RLS configuration.
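
To make the first guard layer concrete, here is a sketch of what a regex pre-filter could look like. It is an illustration of the idea only, not the project's guard: the keyword list is an assumption, and it does nothing that the later layers (sqlglot AST checks, the read-only transaction, the database role) are described as doing.

```python
import re

# Assumed deny-list; the real guard's keyword set is not documented here.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy|merge|call)\b",
    re.IGNORECASE,
)


def prefilter_sql(query: str) -> tuple[bool, str]:
    """Cheap first-pass check: a single statement, SELECT/WITH only, no write keywords."""
    stripped = query.strip().rstrip(";").strip()
    if not stripped:
        return False, "empty query"
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not re.match(r"(?is)^(select|with)\b", stripped):
        return False, "only SELECT queries are allowed"
    hit = FORBIDDEN.search(stripped)
    if hit:
        return False, f"forbidden keyword: {hit.group(1).lower()}"
    return True, "ok"


if __name__ == "__main__":
    # `pages` is a made-up table name for the example.
    print(prefilter_sql("SELECT url FROM pages WHERE updated_at > now()"))
    print(prefilter_sql("WITH x AS (DELETE FROM pages RETURNING *) SELECT * FROM x"))
```

A regex layer like this is deliberately conservative: it can reject a valid query that mentions a keyword inside a string literal, and it cannot prove a query safe. That is why it only makes sense as the first of several layers.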

Content studio (optional, experimental)

Write and optimize content at http://localhost:3000/write with live SEO scoring from Search Console and on-page heuristics. Drafts persist per property; an optional AI assist (same providers as AI chat) drafts and rewrites copy. Backed by /api/content-drafts, /api/content/score, and /api/content/analyze.

Contributing

Contributions are welcome. See CONTRIBUTING.md for setup and pull request guidelines.

CODE_OF_CONDUCT.md — community standards
SECURITY.md — report vulnerabilities privately

Documentation

Document	Description
docs/README.md	Documentation index and brand assets
AGENT.md	Repository layout and development commands
docs/GLOSSARY.md	UI terminology
docs/COMPANY_STANDARDS.md	Data and security policy
docs/MCP.md	MCP server setup
docs/OPS.md	Scheduled audits, alerts, production ops
