Merged
5 changes: 5 additions & 0 deletions .coveragerc
@@ -19,6 +19,11 @@ omit =
*/website_profiling/llm_config.py
*/website_profiling/cli.py
*/website_profiling/commands/enrich_cmd.py
# FastAPI server — tested via integration tests, not unit tests
*/website_profiling/api/*
*/website_profiling/clients/*
# Worker process — requires running DB and subprocess for meaningful tests
*/website_profiling/worker/*

[report]
show_missing = True
4 changes: 3 additions & 1 deletion .coveragerc.tools
@@ -1,5 +1,7 @@
[run]
-source = website_profiling.tools
+source =
+    website_profiling.tools
+    website_profiling.clients
omit =
*/website_profiling/tools/keywords.py
*/website_profiling/tools/plot.py
12 changes: 11 additions & 1 deletion .github/workflows/ci.yml
@@ -43,7 +43,7 @@ jobs:
--cov-report=term-missing --cov-fail-under=100 -q -o addopts=
- name: Pytest (tools coverage gate)
run: |
-pytest tests/tools/ \
+pytest tests/tools/ tests/clients/ \
--cov=website_profiling.tools --cov-config=.coveragerc.tools \
--cov-report=term-missing --cov-fail-under=100 -q -o addopts=
- name: CLI smoke
@@ -88,3 +88,13 @@ jobs:
        run: npm run lint
      - name: Test
        run: npm test

  files:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'
      - name: Test FileService
        run: dotnet test services/FileService/FileService.slnx
13 changes: 11 additions & 2 deletions .gitignore
@@ -1,4 +1,4 @@
-# Project (root) only. Python: src/.gitignore. Next.js: web/.gitignore
+# Project (root) only. Python: src/.gitignore. Next.js: web/.gitignore. .NET: services/FileService/

# Next.js UI: generated pipeline configs from the runner modal (repo root; must match Python cwd for paths)
.website-profiling-ui-*.txt
@@ -32,4 +32,13 @@ pipeline-config.txt
.agents/
skills-lock.json
crawl_results.csv
commit.*

# .NET FileService — build output and IDE artifacts
services/FileService/**/bin/
services/FileService/**/obj/
services/FileService/.vs/
services/FileService/**/*.user
services/FileService/**/*.suo
services/FileService/**/TestResults/
services/FileService/**/*.DotSettings.user
8 changes: 5 additions & 3 deletions AGENT.md
@@ -11,10 +11,11 @@ Developer reference for agents and contributors. User-facing overview: [README.m
**Key paths**

- `src/website_profiling/` -- `cli.py`, `config.py`, `crawl/`, `db/storage.py`, `lighthouse/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `services/FileService/` -- .NET PDF + Excel workbook export (HTTP-only; see [README](services/FileService/README.md))
- `web/app/` -- routes; `web/src/` -- React; pipeline: `PipelineRunnerFab`, `server/pipelineJobs.ts`, `server/pipelineConfig.ts`, `server/llmConfig.ts`, `server/db.ts`
- `alembic/` -- schema migrations

-**Local dev:** `./local-run` (Postgres in Docker `wp-pg`, Next.js on host; default `DATABASE_URL`: `postgres://postgres:dev@127.0.0.1:5432/website_profiling`). See `scripts/local-run.sh`. **Local tests:** `./local-test` runs **three** Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks — mirrors CI **python** and **web** jobs; Docker CI is separate (see `.github/workflows/ci.yml`). `./local-test browser` for `@pytest.mark.browser` integration tests — see `scripts/local-test.sh`. Mocked browser unit tests: `tests/test_browser_fetcher_unit.py`.
+**Local dev:** `./local-run` (Postgres in Docker `wp-pg`, FileService on `:8080`, Next.js on host; default `DATABASE_URL`: `postgres://postgres:dev@127.0.0.1:5432/website_profiling`). See `scripts/local-run.sh`. **Local tests:** `./local-test` runs **three** Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks — mirrors CI **python** and **web** jobs; Docker CI is separate (see `.github/workflows/ci.yml`). `./local-test browser` for `@pytest.mark.browser` integration tests — see `scripts/local-test.sh`. Mocked browser unit tests: `tests/test_browser_fetcher_unit.py`.

**JavaScript crawl (optional):** Config keys `crawl_render_mode` (`static` | `javascript` | `auto`) and `crawl_js_*` in pipeline config / `pipelineConfigSchema.ts`. JS/auto crawls can capture browser console errors and uncaught exceptions (`crawl_js_capture_console`, stored under `page_analysis.browser`). **Auto mode** uses static-first fetch, pre-parse SPA heuristics (`needs_js_render`), then post-parse low-outlink fallback (`needs_js_render_after_parse`) in `crawler.py`. **Preflight:** `GET /api/crawl/browser-status` (localhost) spawns Python `browser_status()`; Run audit settings/run validation calls it when render mode is `javascript` or `auto`. Browser deps: Playwright from `requirements.txt` (installed by `./local-run setup` and `./local-test`). Runtime needs Chromium on `PATH` or `CHROME_PATH` (Docker sets `CHROME_PATH=/usr/bin/chromium`). Integration tests: `@pytest.mark.browser` — excluded by default in `pytest.ini`; Docker CI runs `tests/test_crawl_fetchers.py` and `tests/test_crawler_browser_e2e.py -m browser`; locally `./local-test browser`.

@@ -31,14 +32,15 @@ Developer reference for agents and contributors. User-facing overview: [README.m
- **AI Chat UI:** `/chat` — property-scoped chat with saved sessions (`chat_sessions`, `chat_messages`; migration `012_chat_sessions`).
- **Job store:** PostgreSQL `pipeline_jobs` when `DATABASE_URL` is set (`pipelineJobsDb.ts` — status, timestamps, truncated logs). In-memory map in `pipelineJobs.ts` holds live log tail and child process handles; stale rows reconciled via `PIPELINE_JOB_STALE_HOURS`.
- **Schema head:** `015_crawl_page_html` (recent: `013` link_edges/discovery, `014` job log truncation, `015` per-URL HTML storage).
-- **Docker:** `Dockerfile` + `docker-compose.yml` (postgres + web); **`docker-compose.prod.yml`** (production + remote MCP on `:8000`); **`docker-compose.pull.yml`** for pre-built images (`WEB_IMAGE`); **`LIGHTHOUSE_CHROME_FLAGS`**
+- **Docker:** `Dockerfile` + `docker-compose.yml` (postgres + web + FileService); **`docker-compose.prod.yml`** (production + remote MCP on `:8000`); **`docker-compose.pull.yml`** for pre-built images (`WEB_IMAGE`); **`LIGHTHOUSE_CHROME_FLAGS`**

**Where to edit**

| Task | Where |
|------|--------|
| Crawl | `crawl/crawler.py`, `crawl/fetchers/` |
| Report | `reporting/builder.py`, `reporting/categories.py` |
| PDF / workbook export | `services/FileService/` (rendering); Next.js proxies in `web/src/server/proxyToFileService.ts` |
| DB schema | `alembic/versions/` |
| Local analysis | `analysis/local.py`, `requirements.txt` |
| AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
@@ -84,7 +86,7 @@ The web UI uses **both** Chart.js and D3.js. Pick the library that fits each cha
- Keep chart-library types out of data-prep: use neutral shapes (`BarChartData`, `DualSeriesChartData` in `web/src/lib/viz/types.ts` and `web/src/lib/compareChartData.ts`); convert at the render layer via `web/src/lib/viz/adapters.ts` when needed.
- Migrate page-by-page when D3 is the better fit; do not remove `chart.js` from `package.json` until all consumers are migrated.

-**Company standards:** UI copy in `web/src/strings.json` (Site Audit, Properties, Run audit). Data provenance on `report_meta` in report payload. Docs: `docs/COMPANY_STANDARDS.md`, `docs/GLOSSARY.md`. Migration `003_company_standards` (properties, pipeline_jobs, audit_log). Durable jobs in `web/src/server/pipelineJobsDb.ts`. Export: `GET /api/report/export`, `src/website_profiling/tools/export_audit.py`.
+**Company standards:** UI copy in `web/src/strings.json` (Site Audit, Properties, Run audit). Data provenance on `report_meta` in report payload. Docs: `docs/COMPANY_STANDARDS.md`, `docs/GLOSSARY.md`. Migration `003_company_standards` (properties, pipeline_jobs, audit_log). Durable jobs in `web/src/server/pipelineJobsDb.ts`. **Export:** PDF/workbook via FileService (`FILE_SERVICE_URL` on web/MCP; `REPORT_API_URL` on FileService); CSV/JSON via `GET /api/report/export` and `src/website_profiling/tools/export_audit.py`.

**Common footguns (check before finishing web or DB work)**

3 changes: 2 additions & 1 deletion AGENTS.md
@@ -11,14 +11,15 @@ This file is the canonical entry point for agents. For full detail see [AGENT.md
- `src/website_profiling/` — core Python package
- `cli.py`, `config.py`, `crawl/`, `db/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `web/` — Next.js frontend
- `services/FileService/` — .NET PDF + Excel workbook export (port 8080). HTTP-only via `REPORT_API_URL`; no Postgres. Profiles: `executive|standard|full|premium`. Details: [services/FileService/README.md](services/FileService/README.md). Env: `FILE_SERVICE_URL` (Next.js/MCP), `REPORT_API_URL` (FileService).
- `alembic/` — DB migrations
- `docs/` — documentation index
- `tests/` — pytest suite

**Run / dev**

```bash
-./local-run # Start Postgres (Docker) + Next.js
+./local-run # Start Postgres + FileService + Next.js
./local-test # Run all three coverage gates
python -m src # Run audit pipeline
python -m website_profiling.mcp # Start MCP server (stdio)
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# syntax=docker/dockerfile:1
-# WebsiteProfiling: Next.js web UI + Python pipeline (spawned from /api/run).
+# WebsiteProfiling: Next.js UI + FastAPI (port 8001) + Python worker + pipeline.
# Build from repository root: docker build -t website-profiling .
# BuildKit cache mounts (default in Docker Desktop) reuse pip/npm downloads across rebuilds.

11 changes: 7 additions & 4 deletions README.md
@@ -19,6 +19,7 @@
<p align="center">
<img src="https://img.shields.io/badge/Next.js-000?logo=next.js&logoColor=white" alt="Next.js">
<img src="https://img.shields.io/badge/Python-3776AB?logo=python&logoColor=white" alt="Python">
<img src="https://img.shields.io/badge/.NET-512BD4?logo=dotnet&logoColor=white" alt=".NET">
<img src="https://img.shields.io/badge/PostgreSQL-4169E1?logo=postgresql&logoColor=white" alt="PostgreSQL">
<img src="https://img.shields.io/badge/Docker-2496ED?logo=docker&logoColor=white" alt="Docker">
</p>
@@ -84,7 +85,7 @@ Audit → Report → MCP → Fix → Review → (repeat)
| Step | What you do | In Site Audit |
|------|-------------|---------------|
| **Audit** | Crawl and score the site | Pipeline (`python -m src`), Lighthouse, on-page checks |
-| **Report** | Export and prioritize fixes | PDF/HTML/CSV exports, issue board, fix roadmap |
+| **Report** | Export and prioritize fixes | PDF, Excel workbook, CSV, and JSON exports; issue board; fix roadmap |
| **MCP** | Pull audit context into your IDE | `python -m website_profiling.mcp` — read-only tools for Cursor / Claude Desktop |
| **Fix** | Ship changes in your codebase | Your PR workflow (MCP does not write to the site) |
| **Review** | Prove improvement | Compare runs, category deltas, GSC metric changes |
@@ -159,12 +160,13 @@ WebsiteProfiling/
│ ├── src/views/ # Report views (overview, links, issues, …)
│ ├── src/server/ # Server-side DB, pipeline jobs, config I/O
│ └── public/ # Static assets (logo, favicon)
├── services/FileService/ # .NET PDF + Excel workbook export (port 8080)
├── alembic/versions/ # PostgreSQL schema migrations
├── tests/ # pytest suite + fixtures
├── docs/ # Glossary, MCP, ops, brand assets
├── scripts/ # local-run.sh, local-test.sh helpers
├── .github/workflows/ # CI (Python + web + browser crawl)
-├── docker-compose.yml # Dev stack (Postgres + web)
+├── docker-compose.yml # Dev stack (Postgres + web + FileService)
├── docker-compose.prod.yml # Production stack (requires AUTH_SECRET)
├── docker-compose.pull.yml # Pre-built WEB_IMAGE
├── Dockerfile # Production image
@@ -178,6 +180,7 @@ WebsiteProfiling/
| Path | Purpose |
| ------------------------------------- | ------------------------------------------------------------------------------ |
| `src/website_profiling/` | Crawl, analyze, report, Lighthouse, integrations, AI — run via `python -m src` |
| `services/FileService/` | PDF and Excel workbook export — see [services/FileService/README.md](services/FileService/README.md) |
| `web/app/api/` | REST APIs: report data, pipeline runs, chat (SSE), Google/Bing sync |
| `web/src/lib/pipelineConfigSchema.ts` | Audit settings schema (UI ↔ PostgreSQL) |
| `alembic/versions/` | Database migrations — run `./local-run migrate` |
@@ -198,15 +201,15 @@ Build and run from source:
docker compose up --build
```

-Open [http://localhost:3000/home](http://localhost:3000/home).
+Open [http://localhost:3000/home](http://localhost:3000/home). PDF and workbook exports require the **FileService** container (`files`, port 8080).

Production deployment: `docker-compose.prod.yml` — set `POSTGRES_USER`, `POSTGRES_PASSWORD`, and `AUTH_SECRET`. Pre-built images: `docker-compose.pull.yml` (`WEB_IMAGE`).

### Local development

```bash
./local-run setup # First time: Postgres, Python venv, migrations, npm deps
-./local-run # Start DB + Next.js dev server → http://localhost:3000/home
+./local-run # Start DB + FileService + Next.js → http://localhost:3000/home
./local-run db # Postgres only (no app)
./local-run migrate # Apply Alembic migrations only
./local-run stop # Stop Postgres container
35 changes: 35 additions & 0 deletions alembic/versions/025_pipeline_job_queue.py
@@ -0,0 +1,35 @@
"""Add pipeline job queue columns for the Python worker.

Adds: command, cancel_requested, pause_requested, worker_pid, status='pending'.

Revision ID: 025_pipeline_job_queue
Revises: 024_app_settings
"""
from __future__ import annotations

from alembic import op

revision = "025_pipeline_job_queue"
down_revision = "024_app_settings"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.execute("""
        ALTER TABLE pipeline_jobs
            ADD COLUMN IF NOT EXISTS command TEXT,
            ADD COLUMN IF NOT EXISTS cancel_requested BOOLEAN NOT NULL DEFAULT false,
            ADD COLUMN IF NOT EXISTS pause_requested BOOLEAN NOT NULL DEFAULT false,
            ADD COLUMN IF NOT EXISTS worker_pid INTEGER;
    """)


def downgrade() -> None:
    op.execute("""
        ALTER TABLE pipeline_jobs
            DROP COLUMN IF EXISTS worker_pid,
            DROP COLUMN IF EXISTS pause_requested,
            DROP COLUMN IF EXISTS cancel_requested,
            DROP COLUMN IF EXISTS command;
    """)
25 changes: 24 additions & 1 deletion docker-compose.prod.yml
@@ -23,20 +23,29 @@ services:
condition: service_healthy
ports:
- '${WEB_PORT:-3000}:3000'
- '${FASTAPI_PORT:-8001}:8001'
environment:
WEBSITE_PROFILING_ROOT: /app
DATABASE_URL: postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB:-website_profiling}
DATA_DIR: /data
AUTH_SECRET: ${AUTH_SECRET:?set AUTH_SECRET}
AUTH_PASSWORD: ${AUTH_PASSWORD:-}
NODE_ENV: production
FASTAPI_URL: http://127.0.0.1:8001
FASTAPI_ALLOWED_ORIGINS: ${FASTAPI_ALLOWED_ORIGINS:-http://localhost:3000}
FILE_SERVICE_URL: http://files:8080
PYTHON: /opt/venv/bin/python
CHROME_PATH: /usr/bin/chromium
LIGHTHOUSE_PATH: /usr/local/bin/lighthouse
LIGHTHOUSE_CHROME_FLAGS: --headless --no-sandbox --disable-dev-shm-usage --disable-gpu
volumes:
- profiling-data:/data
healthcheck:
test: ['CMD', 'node', '-e', "require('http').get('http://127.0.0.1:3000/api/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]
interval: 30s
timeout: 5s
retries: 3
start_period: 30s

worker:
build:
@@ -45,7 +54,7 @@ services:
depends_on:
postgres:
condition: service_healthy
-command: ['python', '-m', 'src', 'report']
+command: ['/opt/venv/bin/python', '-m', 'website_profiling.worker']
environment:
WEBSITE_PROFILING_ROOT: /app
DATABASE_URL: postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB:-website_profiling}
@@ -55,17 +64,31 @@ services:
profiles:
- worker

files:
build:
context: ./services/FileService
ports:
- '${FILE_SERVICE_PORT:-8080}:8080'
environment:
REPORT_API_URL: http://web:8001
depends_on:
web:
condition: service_started

mcp:
build:
context: .
dockerfile: Dockerfile
depends_on:
postgres:
condition: service_healthy
files:
condition: service_started
command: ['python', '-m', 'website_profiling.mcp.http']
environment:
WEBSITE_PROFILING_ROOT: /app
DATABASE_URL: postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB:-website_profiling}
FILE_SERVICE_URL: http://files:8080
WP_MCP_HTTP_HOST: 0.0.0.0
WP_MCP_HTTP_PORT: 8000
WP_MCP_TOKEN: ${WP_MCP_TOKEN:?set WP_MCP_TOKEN}
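
Reviewer sketch (not part of this PR): the `web` healthcheck added in this file is a dense Node one-liner that exits 0 when `/api/health` answers 200 and 1 otherwise. The Python function below is a rough equivalent, shown only to make the exit-code logic easier to read. `probe` is a hypothetical name. One difference to be aware of: `urllib` follows redirects, while the Node `http.get` call does not.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the URL answers HTTP 200, else 1 (exit-code convention)."""
    # Ignore proxy environment variables: a container-local probe
    # should reach the address directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Non-2xx responses, refused connections, timeouts, and
        # malformed URLs all count as unhealthy.
        return 1
```

A wrapper script could end with `raise SystemExit(probe("http://127.0.0.1:3000/api/health"))` to mirror the compose healthcheck, using the 5 second timeout configured above.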
13 changes: 13 additions & 0 deletions docker-compose.pull.yml
@@ -33,6 +33,8 @@ services:
CHROME_PATH: /usr/bin/chromium
LIGHTHOUSE_PATH: /usr/local/bin/lighthouse
LIGHTHOUSE_CHROME_FLAGS: --headless --no-sandbox --disable-dev-shm-usage --disable-gpu
FASTAPI_URL: http://127.0.0.1:8001
FILE_SERVICE_URL: http://files:8080
volumes:
- profiling-data:/data
healthcheck:
@@ -42,6 +44,17 @@ services:
retries: 3
start_period: 15s

files:
build:
context: ./services/FileService
ports:
- "8080:8080"
environment:
REPORT_API_URL: http://web:8001
depends_on:
web:
condition: service_started

volumes:
pg-data:
profiling-data:
15 changes: 15 additions & 0 deletions docker-compose.yml
@@ -23,6 +23,7 @@ services:
condition: service_healthy
ports:
- "3000:3000"
- "8001:8001"
environment:
WEBSITE_PROFILING_ROOT: /app
DATABASE_URL: postgres://profiling:profiling@postgres:5432/website_profiling
@@ -32,6 +33,9 @@ services:
CHROME_PATH: /usr/bin/chromium
LIGHTHOUSE_PATH: /usr/local/bin/lighthouse
LIGHTHOUSE_CHROME_FLAGS: --headless --no-sandbox --disable-dev-shm-usage --disable-gpu
FASTAPI_URL: http://127.0.0.1:8001
FASTAPI_ALLOWED_ORIGINS: "http://localhost:3000"
FILE_SERVICE_URL: http://files:8080
volumes:
- profiling-data:/data
healthcheck:
@@ -41,6 +45,17 @@ services:
retries: 3
start_period: 15s

files:
build:
context: ./services/FileService
ports:
- "8080:8080"
environment:
REPORT_API_URL: http://web:8001
depends_on:
web:
condition: service_started
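
Reviewer sketch (not part of this PR): the compose changes wire three new variables into `web` (`FASTAPI_URL`, `FASTAPI_ALLOWED_ORIGINS`, `FILE_SERVICE_URL`). The snippet below shows one way a Python process might read and validate them at startup. The defaults mirror the values in this file. `ServiceUrls` and `load_service_urls` are hypothetical names, and treating `FASTAPI_ALLOWED_ORIGINS` as a comma-separated list is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ServiceUrls:
    fastapi_url: str
    file_service_url: str
    allowed_origins: Tuple[str, ...]


def _require_http_url(name: str, value: str) -> str:
    """Reject values that are not absolute http(s) URLs; strip a trailing slash."""
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"{name} must be an absolute http(s) URL, got {value!r}")
    return value.rstrip("/")


def load_service_urls(env: Mapping[str, str]) -> ServiceUrls:
    """Read the service wiring from an environment-like mapping."""
    fastapi = _require_http_url(
        "FASTAPI_URL", env.get("FASTAPI_URL", "http://127.0.0.1:8001")
    )
    files = _require_http_url(
        "FILE_SERVICE_URL", env.get("FILE_SERVICE_URL", "http://files:8080")
    )
    # Assumption: multiple origins are separated by commas.
    raw = env.get("FASTAPI_ALLOWED_ORIGINS", "http://localhost:3000")
    origins = tuple(o.strip() for o in raw.split(",") if o.strip())
    return ServiceUrls(fastapi, files, origins)
```

Calling `load_service_urls(os.environ)` once at startup would surface a mistyped URL immediately, before the first export request fails.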

# Optional remote MCP (Streamable HTTP). Uncomment and set WP_MCP_TOKEN / WP_MCP_ALLOWED_HOSTS.
# mcp:
# build: