Merged
38 changes: 35 additions & 3 deletions .github/workflows/ci.yml
@@ -53,16 +53,19 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
- name: Build backend image
run: docker build -t website-profiling:ci .
- name: Build web image
run: docker build -t website-profiling-web:ci ./web --build-arg VITE_BFF_BASE_URL=http://localhost:8090
- name: Browser crawl tests in image
run: |
docker run --rm \
website-profiling:ci \
/opt/venv/bin/pytest tests/test_crawl_fetchers.py tests/test_crawler_browser_e2e.py -m browser -q -o addopts=
- name: Compose smoke (postgres + web)
- name: Compose smoke (postgres + fastapi + web)
env:
WEB_IMAGE: website-profiling:ci
BACKEND_IMAGE: website-profiling:ci
WEB_IMAGE: website-profiling-web:ci
run: |
docker compose -f docker-compose.pull.yml up -d --wait
curl -fsS http://127.0.0.1:3000/home
@@ -82,6 +85,8 @@ jobs:
cache-dependency-path: web/package-lock.json
- name: Install
run: npm ci
- name: Build
run: npm run build
- name: Typecheck
run: npm run typecheck
- name: Lint
@@ -98,3 +103,30 @@ jobs:
dotnet-version: '10.0.x'
- name: Test FileService
run: dotnet test services/FileService/FileService.slnx

bff:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-dotnet@v4
with:
dotnet-version: '10.0.x'
- name: Test BFF
run: dotnet test services/Bff/Bff.slnx
- name: Generated client drift gate
run: |
dotnet tool install -g NSwag.ConsoleCore
export PATH="$PATH:$HOME/.dotnet/tools"
(cd services/Bff && nswag run nswag.json)
git diff --exit-code services/Bff/src/Bff.Application/Generated/FastApiClient.g.cs \
|| (echo "::error::FastApiClient.g.cs is stale — run services/Bff/generate-client.sh and commit." && exit 1)

data:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-dotnet@v4
with:
dotnet-version: '10.0.x'
- name: Test Data service
run: dotnet test services/Data/Data.slnx
21 changes: 11 additions & 10 deletions .gitignore
@@ -1,4 +1,4 @@
# Project (root) only. Python: src/.gitignore. Next.js: web/.gitignore. .NET: services/FileService/
# Project (root) only. Python: src/.gitignore. Web UI: web/.gitignore. .NET: services/*/

# Next.js UI: generated pipeline configs from the runner modal (repo root; must match Python cwd for paths)
.website-profiling-ui-*.txt
@@ -33,12 +33,13 @@ pipeline-config.txt
skills-lock.json
crawl_results.csv
commit.*

# .NET FileService — build output and IDE artifacts
services/FileService/**/bin/
services/FileService/**/obj/
services/FileService/.vs/
services/FileService/**/*.user
services/FileService/**/*.suo
services/FileService/**/TestResults/
services/FileService/**/*.DotSettings.user
.cursor/
.claude/
# .NET services (FileService, Bff, …) — build output and IDE artifacts
services/**/bin/
services/**/obj/
services/**/.vs/
services/**/*.user
services/**/*.suo
services/**/TestResults/
services/**/*.DotSettings.user
34 changes: 18 additions & 16 deletions AGENT.md
@@ -6,16 +6,17 @@ Developer reference for agents and contributors. User-facing overview: [README.m

**LLM / AI:** Settings live in **`llm_config`** table in PostgreSQL. Providers: OpenAI, Google Gemini, Anthropic, Groq, Ollama (`web/src/lib/llmConfigSchema.ts`). Configure only via web UI **AI** tab (`GET/PUT /api/llm-config`, localhost). Never in `pipeline-config.txt` or `--config`.

**Frontend:** **`web/`** (Next.js) -- server reads PostgreSQL via `/api/report/*`.
**Frontend:** **`web/`** (Vite + React SPA) — browser calls **`services/Bff/`** for all `/api/*`; BFF proxies to FastAPI and FileService.

**Key paths**

- `src/website_profiling/` -- `cli.py`, `config.py`, `crawl/`, `db/storage.py`, `lighthouse/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `services/Bff/` -- .NET BFF (auth, CORS, `/api/*` proxy)
- `services/FileService/` -- .NET PDF + Excel workbook export (HTTP-only; see [README](services/FileService/README.md))
- `web/app/` -- routes; `web/src/` -- React; pipeline: `PipelineRunnerFab`, `server/pipelineJobs.ts`, `server/pipelineConfig.ts`, `server/llmConfig.ts`, `server/db.ts`
- `web/src/` -- React SPA (`AppRoutes.tsx`, `views/`, `components/`); pipeline UI: `PipelineRunnerFab`, `PipelineContext`
- `alembic/` -- schema migrations

**Local dev:** `./local-run` (Postgres in Docker `wp-pg`, FileService on `:8080`, Next.js on host; default `DATABASE_URL`: `postgres://postgres:dev@127.0.0.1:5432/website_profiling`). See `scripts/local-run.sh`. **Local tests:** `./local-test` runs **three** Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks — mirrors CI **python** and **web** jobs; Docker CI is separate (see `.github/workflows/ci.yml`). `./local-test browser` for `@pytest.mark.browser` integration tests — see `scripts/local-test.sh`. Mocked browser unit tests: `tests/test_browser_fetcher_unit.py`.
**Local dev:** `./local-run` (Postgres in Docker `wp-pg`, FileService on `:8080`, FastAPI on `:8001`, BFF on `:8090`, Vite on `:3000`; default `DATABASE_URL`: `postgres://postgres:dev@127.0.0.1:5432/website_profiling`). See `scripts/local-run.sh`. **Local tests:** `./local-test` runs **three** Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks — mirrors CI **python** and **web** jobs; Docker CI is separate (see `.github/workflows/ci.yml`). `./local-test browser` for `@pytest.mark.browser` integration tests — see `scripts/local-test.sh`. Mocked browser unit tests: `tests/test_browser_fetcher_unit.py`.

**JavaScript crawl (optional):** Config keys `crawl_render_mode` (`static` | `javascript` | `auto`) and `crawl_js_*` in pipeline config / `pipelineConfigSchema.ts`. JS/auto crawls can capture browser console errors and uncaught exceptions (`crawl_js_capture_console`, stored under `page_analysis.browser`). **Auto mode** uses static-first fetch, pre-parse SPA heuristics (`needs_js_render`), then post-parse low-outlink fallback (`needs_js_render_after_parse`) in `crawler.py`. **Preflight:** `GET /api/crawl/browser-status` (localhost) spawns Python `browser_status()`; Run audit settings/run validation calls it when render mode is `javascript` or `auto`. Browser deps: Playwright from `requirements.txt` (installed by `./local-run setup` and `./local-test`). Runtime needs Chromium on `PATH` or `CHROME_PATH` (Docker sets `CHROME_PATH=/usr/bin/chromium`). Integration tests: `@pytest.mark.browser` — excluded by default in `pytest.ini`; Docker CI runs `tests/test_crawl_fetchers.py` and `tests/test_crawler_browser_e2e.py -m browser`; locally `./local-test browser`.

@@ -26,21 +27,21 @@ Developer reference for agents and contributors. User-facing overview: [README.m
- **`preserve_crawl_history`** (default true): append crawls; `false` truncates crawl tables but restores `report_payload`, Lighthouse, `google_data`, `keyword_data`, `keyword_history`, `keyword_suggest_cache`, and `crawl_runs`
- **`DATABASE_URL`** env: PostgreSQL connection string (required). **`DATA_DIR`**: secrets + shadow config (Docker: `/data`).
- **Pipeline storage** (crawl, edges, nodes, report payload, Lighthouse, keywords, warnings) lives in **PostgreSQL only**. Deliverables use the Export view, `GET /api/report/export`, or MCP `export_*` tools — not files written by the main pipeline step.
- **Pool tuning:** `DB_POOL_MIN` / `DB_POOL_MAX` (Python), `PGPOOL_MAX` (Node). Bulk crawl writes via `executemany`; optional **`crawl_stream_to_db`** streams rows during fetch. Per-URL raw HTML: `crawl_page_html` table (migration `015`); API `GET/POST /api/crawl/page-html` (localhost).
- **`web/` APIs:** `/api/report/*` read routes (payload, meta, history — not localhost-guarded; protect with `AUTH_*` when exposed); `/api/run` spawns Python (localhost); `/api/jobs`, `/api/jobs/[id]`, `/api/jobs/[id]/cancel` (localhost); `/api/crawl/browser-status`, `/api/crawl/page-html` (localhost); `/api/pipeline-config` GET/PUT; `/api/llm-config` GET/PUT; `/api/chat` POST (SSE); `/api/chat/sessions` GET/POST; `/api/ollama/status` (localhost); `/api/properties/{id}/google/links/import` POST; `PipelineRunnerFab` saves pipeline + LLM state before each run. Full route list: `web/app/api/**/route.ts`.
- **Pool tuning:** `DB_POOL_MIN` / `DB_POOL_MAX` (Python). Bulk crawl writes via `executemany`; optional **`crawl_stream_to_db`** streams rows during fetch. Per-URL raw HTML: `crawl_page_html` table (migration `015`); API `GET/POST /api/crawl/page-html`.
- **Browser API (BFF):** All `/api/*` routes are served by `services/Bff/` (proxied to FastAPI / FileService). Notable: `/api/report/*`, `/api/run`, `/api/jobs/*`, `/api/pipeline-config`, `/api/llm-config`, `/api/chat` (SSE), `/api/integrations/google/*` (OAuth callback on BFF origin). `PipelineRunnerFab` saves pipeline + LLM state before each run. OpenAPI: `web/openapi.json`; BFF client: `services/Bff/src/Bff.Application/Generated/`.
- **MCP:** `python -m website_profiling.mcp` (stdio) or `python -m website_profiling.mcp.http` (remote Streamable HTTP). Configure at **`/mcp`** in the web UI. See `docs/MCP.md`.
- **AI Chat UI:** `/chat` — property-scoped chat with saved sessions (`chat_sessions`, `chat_messages`; migration `012_chat_sessions`).
- **Job store:** PostgreSQL `pipeline_jobs` when `DATABASE_URL` is set (`pipelineJobsDb.ts` — status, timestamps, truncated logs). In-memory map in `pipelineJobs.ts` holds live log tail and child process handles; stale rows reconciled via `PIPELINE_JOB_STALE_HOURS`.
- **Job store:** PostgreSQL `pipeline_jobs` (FastAPI); live job status via `/api/jobs/*` through the BFF.
- **Schema head:** `015_crawl_page_html` (recent: `013` link_edges/discovery, `014` job log truncation, `015` per-URL HTML storage).
- **Docker:** `Dockerfile` + `docker-compose.yml` (postgres + web + FileService); **`docker-compose.prod.yml`** (production + remote MCP on `:8000`); **`docker-compose.pull.yml`** for pre-built images (`WEB_IMAGE`); **`LIGHTHOUSE_CHROME_FLAGS`**
- **Docker:** Root `Dockerfile` (Python backend); `web/Dockerfile` (Vite SPA + nginx); `docker-compose.yml` (postgres + fastapi + worker + bff + web + FileService); **`docker-compose.prod.yml`** (production + optional MCP on `:8000`); **`docker-compose.pull.yml`** for pre-built images (`BACKEND_IMAGE`, `WEB_IMAGE`); **`LIGHTHOUSE_CHROME_FLAGS`**

**Where to edit**

| Task | Where |
|------|--------|
| Crawl | `crawl/crawler.py`, `crawl/fetchers/` |
| Report | `reporting/builder.py`, `reporting/categories.py` |
| PDF / workbook export | `services/FileService/` (rendering); Next.js proxies in `web/src/server/proxyToFileService.ts` |
| PDF / workbook export | `services/FileService/` (rendering); BFF routes `/api/report/export` and `/api/report/export-workbook` to FileService |
| DB schema | `alembic/versions/` |
| Local analysis | `analysis/local.py`, `requirements.txt` |
| AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
@@ -49,7 +50,7 @@ Developer reference for agents and contributors. User-facing overview: [README.m
| Config / CLI | `config.py` (`load_config`, `load_config_from_db`), `cli.py`, `input.txt.example` |
| UI pipeline schema | `web/src/lib/pipelineConfigSchema.ts` |
| UI LLM schema | `web/src/lib/llmConfigSchema.ts` |
| UI config I/O | `web/src/server/pipelineConfig.ts`, `web/src/server/llmConfig.ts` |
| Browser API client | `web/src/lib/publicBase.ts` (`apiUrl`, `apiFetch`, `VITE_BFF_BASE_URL`) |
| D3 charts (custom / compare / overview) | `web/src/components/charts/d3/`, `web/src/lib/viz/` |
| Chart.js charts (standard bar/line/doughnut) | `web/src/utils/chartJsDefaults.ts`, `react-chartjs-2` in views under `web/src/views/`, `web/src/components/searchPerformance/`, `web/src/components/traffic/` |

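As a rough illustration of the **Browser API client** row above, the following is a minimal sketch of joining a BFF base URL with an `/api/*` path. The function name `joinApiUrl` and its behaviour are assumptions made for illustration only; the actual `apiUrl` / `apiFetch` helpers in `web/src/lib/publicBase.ts` are not shown in this diff and may work differently.

```ts
// Hypothetical sketch: NOT the contents of web/src/lib/publicBase.ts.
// `base` stands in for whatever VITE_BFF_BASE_URL resolves to at build time;
// an empty string is assumed to mean "same origin".
function joinApiUrl(base: string, path: string): string {
  // Drop trailing slashes on the base so the result never contains "//api".
  const trimmedBase = base.replace(/\/+$/, "");
  // Ensure the path starts with exactly one leading slash.
  const normalizedPath = path.startsWith("/") ? path : `/${path}`;
  return `${trimmedBase}${normalizedPath}`;
}
```

Under these assumptions, with the CI build argument `VITE_BFF_BASE_URL=http://localhost:8090`, `joinApiUrl("http://localhost:8090", "/api/llm-config")` would yield `http://localhost:8090/api/llm-config`.
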
@@ -86,23 +87,24 @@ The web UI uses **both** Chart.js and D3.js. Pick the library that fits each cha
- Keep chart-library types out of data-prep: use neutral shapes (`BarChartData`, `DualSeriesChartData` in `web/src/lib/viz/types.ts` and `web/src/lib/compareChartData.ts`); convert at the render layer via `web/src/lib/viz/adapters.ts` when needed.
- Migrate page-by-page when D3 is the better fit; do not remove `chart.js` from `package.json` until all consumers are migrated.

**Company standards:** UI copy in `web/src/strings.json` (Site Audit, Properties, Run audit). Data provenance on `report_meta` in report payload. Docs: `docs/COMPANY_STANDARDS.md`, `docs/GLOSSARY.md`. Migration `003_company_standards` (properties, pipeline_jobs, audit_log). Durable jobs in `web/src/server/pipelineJobsDb.ts`. **Export:** PDF/workbook via FileService (`FILE_SERVICE_URL` on web/MCP; `REPORT_API_URL` on FileService); CSV/JSON via `GET /api/report/export` and `src/website_profiling/tools/export_audit.py`.
**Company standards:** UI copy in `web/src/strings.json` (Site Audit, Properties, Run audit). Data provenance on `report_meta` in report payload. Docs: `docs/COMPANY_STANDARDS.md`, `docs/GLOSSARY.md`. Migration `003_company_standards` (properties, pipeline_jobs, audit_log). **Export:** PDF/workbook via FileService (`FILE_SERVICE_URL` on MCP; `REPORT_API_URL` on FileService); CSV/JSON via `GET /api/report/export` and `src/website_profiling/tools/export_audit.py`.

**Common footguns (check before finishing web or DB work)**

These recur when adding features. Verify explicitly — do not assume tests caught them.

1. **React context — `useReport` / `ReportProvider`**
- Report views call `useReport()`. That only works inside `ReportAppClient` → `ReportProvider`.
- **Do:** Render report views via `ReportShell` (wraps `ReportAppClient` internally).
- **Don't:** Import a view directly in `app/*/page.tsx` without `ReportShell`.
- Standalone routes under `web/app/` (e.g. `log-analyzer`, `indexation`) are **not** auto-wrapped by `(reports)/layout`.
- **Do:** Render report views via `ReportShell` inside `ReportLayout` (`AppRoutes.tsx` → `/:slug`).
- **Don't:** Mount a report view outside `ReportAppClient` / `ReportProvider`.
- Standalone routes (`/pipeline`, `/chat`, `/write`, etc.) are defined in `web/src/AppRoutes.tsx`, not wrapped by `ReportLayout`.

```tsx
// ✅
// ✅ ReportSlugPage in web/src/pages/ReportSlugPage.tsx
import ReportShell from '@/ReportShell';
export default function Page() {
return <ReportShell slug="log-analyzer" />;
export default function ReportSlugPage() {
const { slug } = useParams();
return <ReportShell slug={slug!} />;
}
```

16 changes: 9 additions & 7 deletions AGENTS.md
@@ -4,23 +4,25 @@

This file is the canonical entry point for agents. For full detail see [AGENT.md](AGENT.md).

**What it is:** Self-hosted SEO crawl and technical audit platform — `python -m src` from repo root. Stack: Python (crawl + analysis + MCP), Next.js (web UI), PostgreSQL.
**What it is:** Self-hosted SEO crawl and technical audit platform — `python -m src` from repo root. Stack: Python (crawl + analysis + MCP + FastAPI), Vite + React SPA (web UI), .NET BFF (browser API), .NET Data (report reads), PostgreSQL.

**Key paths**

- `src/website_profiling/` — core Python package
- `cli.py`, `config.py`, `crawl/`, `db/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `web/` — Next.js frontend
- `services/FileService/` — .NET PDF + Excel workbook export (port 8080). HTTP-only via `REPORT_API_URL`; no Postgres. Profiles: `executive|standard|full|premium`. Details: [services/FileService/README.md](services/FileService/README.md). Env: `FILE_SERVICE_URL` (Next.js/MCP), `REPORT_API_URL` (FileService).
- `cli.py`, `config.py`, `api/`, `worker/`, `crawl/`, `db/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `web/` — Vite + React SPA (static nginx in prod); browser calls `services/Bff/` for all `/api/*`
- `services/Bff/` — .NET BFF (auth, CORS, proxy to FastAPI + Data + FileService)
- `services/Data/` — .NET read service (report payloads, portfolio, issue status, filters; port 8091)
- `services/FileService/` — .NET PDF + Excel workbook export (port 8080). HTTP-only via `REPORT_API_URL`; no Postgres. Profiles: `executive|standard|full|premium`. Details: [services/FileService/README.md](services/FileService/README.md). Env: `FILE_SERVICE_URL` (MCP), `REPORT_API_URL` (FileService).
- `alembic/` — DB migrations
- `docs/` — documentation index
- `tests/` — pytest suite

**Run / dev**

```bash
./local-run # Start Postgres + FileService + Next.js
./local-test # Run all three coverage gates
./local-run # Start Postgres + FileService + Data + worker + FastAPI + BFF + Vite dev server
./local-test # Python + web + .NET tests (CI parity)
python -m src # Run audit pipeline
python -m website_profiling.mcp # Start MCP server (stdio)
```
@@ -35,7 +37,7 @@ python -m website_profiling.mcp # Start MCP server (stdio)
| Report | `src/website_profiling/reporting/` |
| GEO / AEO / Agent readiness | `src/website_profiling/tools/audit_tools/geo/geo_tools.py`, `geo/agent_readiness.py` |
| DB schema | `alembic/versions/` |
| UI | `web/src/views/`, `web/app/` |
| UI | `web/src/views/`, `web/src/pages/`, `web/src/AppRoutes.tsx` |
| Charts | D3: `web/src/components/charts/d3/`, `web/src/lib/viz/` · Chart.js: GSC/GA4/Links etc. — see [AGENT.md](AGENT.md) § Charts |

**Charts:** Use **both** Chart.js and D3 — choose per chart (Overview/Compare → D3; standard GSC/GA4 bars → Chart.js). Full rules in [AGENT.md](AGENT.md).
14 changes: 3 additions & 11 deletions Dockerfile
@@ -1,5 +1,6 @@
# syntax=docker/dockerfile:1
# WebsiteProfiling: Next.js UI + FastAPI (port 8001) + Python worker + pipeline.
# WebsiteProfiling: FastAPI (port 8001) + Python worker + pipeline.
# Web UI is a separate image: web/Dockerfile (Vite SPA + nginx).
# Build from repository root: docker build -t website-profiling .
# BuildKit cache mounts (default in Docker Desktop) reuse pip/npm downloads across rebuilds.

@@ -32,7 +33,6 @@ RUN apt-get update && apt-get install -y --no-install-recommends \

ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
NEXT_TELEMETRY_DISABLED=1 \
WEBSITE_PROFILING_ROOT=/app \
DATA_DIR=/data \
PYTHON=/opt/venv/bin/python \
@@ -57,27 +57,19 @@ RUN --mount=type=cache,target=/root/.npm \

WORKDIR /app

# Next.js install + build (layer cache)
COPY web/package.json web/package-lock.json /app/web/
RUN --mount=type=cache,target=/root/.npm \
cd /app/web && npm ci

# Application source
COPY pytest.ini /app/pytest.ini
COPY src /app/src
COPY tests /app/tests
COPY web /app/web
COPY alembic /app/alembic
COPY alembic.ini /app/alembic.ini
COPY docker-entrypoint.sh /app/docker-entrypoint.sh

RUN cd /app/web && npm run build && npm prune --omit=dev

ENV NODE_ENV=production

# Persisted data directory (secrets + shadow config)
RUN mkdir -p /data && chmod +x /app/docker-entrypoint.sh

EXPOSE 3000
EXPOSE 8001

CMD ["/app/docker-entrypoint.sh"]