feat: cargo registry implementation [CM-1264] #4236
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Pull request overview
This PR adds initial Cargo (crates.io) registry ingestion support to services/apps/packages_worker, introducing a dedicated Temporal worker/task-queue that downloads the crates.io DB dump, stages it into a schema in packages-db, enriches package metadata, and records per-field audit changes.
Changes:
- Added a new Cargo Temporal workflow + schedule + worker entrypoint (cargo-worker) to run a daily registry sync.
- Implemented dump download/extract + high-volume COPY FROM STDIN load into a staging schema, plus set-based enrichment SQL phases.
- Updated local/dev wiring (Docker Compose + workspace scripts) and introduced pg-copy-streams for streaming COPY.
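The COPY FROM STDIN load mentioned above starts from a COPY statement for each staging table. A minimal sketch of building such a statement follows; the helper names (quoteIdent, buildCopySql) and the schema, table, and column names in the usage note are hypothetical and not taken from the PR, and the CSV options are an assumption about the dump format.

```typescript
// Hypothetical helper: quote a PostgreSQL identifier by wrapping it in double
// quotes and doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"'
}

// Hypothetical helper: build a COPY ... FROM STDIN statement for one staging table.
// Assumes the input is CSV with a header row; adjust the options if the dump differs.
function buildCopySql(schema: string, table: string, columns: string[]): string {
  if (columns.length === 0) {
    throw new Error('at least one column is required')
  }
  const cols = columns.map(quoteIdent).join(', ')
  return `COPY ${quoteIdent(schema)}.${quoteIdent(table)} (${cols}) FROM STDIN WITH (FORMAT csv, HEADER true)`
}
```

With pg-copy-streams, a statement like buildCopySql('cargo_staging', 'crates', ['id', 'name']) would typically be passed to the library's from() function, and the resulting writable stream fed from a file read stream. How the PR wires this up in loadDump.ts is not shown here.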
Reviewed changes
Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| services/apps/packages_worker/tmp-test-versions.ts | Adds a temporary local script for manually running enrichVersions (needs cleanup before merge). |
| services/apps/packages_worker/src/workflows/index.ts | Exports the new cargoSyncWorkflow from the shared workflows index. |
| services/apps/packages_worker/src/db.ts | Adds getPackagesDbConnection() to expose the raw pg-promise connection for COPY streaming. |
| services/apps/packages_worker/src/config.ts | Adds getCargoConfig() (currently only CARGO_DUMP_URL). |
| services/apps/packages_worker/src/cargo/workflows.ts | Defines the cargoSyncWorkflow orchestration and activity timeouts/retries. |
| services/apps/packages_worker/src/cargo/types.ts | Adds Cargo-specific result/config types used across activities/enrichment. |
| services/apps/packages_worker/src/cargo/schedule.ts | Registers a daily Temporal Schedule to run the Cargo sync workflow on cargo-worker queue. |
| services/apps/packages_worker/src/cargo/loadDump.ts | Implements staging schema DDL + CSV COPY loading + aggregation/denormalization into enrich tables. |
| services/apps/packages_worker/src/cargo/enrich.ts | Implements enrichment phases (packages/versions/repos/maintainers/downloads) + audit logging. |
| services/apps/packages_worker/src/cargo/dump.ts | Downloads the crates.io tarball with timeout and extracts it to /tmp/cargo-dump. |
| services/apps/packages_worker/src/cargo/activities.ts | Wires dump download/load + enrichment phases into Temporal activities and cleanup. |
| services/apps/packages_worker/src/bin/cargo-worker.ts | New worker entrypoint that initializes the service, schedules Cargo sync, and starts processing. |
| services/apps/packages_worker/src/activities.ts | Re-exports Cargo activities from the root activities index. |
| services/apps/packages_worker/package.json | Adds start/dev scripts for cargo-worker and adds pg-copy-streams deps. |
| scripts/services/cargo-worker.yaml | Adds Docker Compose definitions for cargo-worker and cargo-worker-dev. |
| scripts/builders/packages.env | Adds cargo-worker to the packages image/service build list. |
| pnpm-lock.yaml | Locks pg-copy-streams (and types) and includes incidental lockfile metadata changes. |
Files not reviewed (1)
- pnpm-lock.yaml: Generated file
import { pgpQx } from '@crowd/data-access-layer/src/queryExecutor'
import { getDbConnection } from '@crowd/database'

import { enrichVersions } from './src/cargo/enrich'

async function main() {
  const conn = await getDbConnection({
    host: 'localhost',
    port: 5434,
    database: 'packages-db',
    user: 'postgres',
    password: 'example',
  } as never)
export interface CargoConfig {
  dumpUrl: string
  dumpDir: string
}
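For illustration, a config loader that fills both fields of this interface could look like the sketch below. It is not the PR's getCargoConfig(): taking the environment as a parameter and the CARGO_DUMP_DIR variable are assumptions made for this example. Only CARGO_DUMP_URL and the /tmp/cargo-dump extraction path appear in the file summary.

```typescript
interface CargoConfig {
  dumpUrl: string
  dumpDir: string
}

// Sketch only: reads the config from an environment-like object so it can be
// exercised without touching process.env.
function getCargoConfig(env: Record<string, string | undefined>): CargoConfig {
  const dumpUrl = env.CARGO_DUMP_URL
  if (!dumpUrl) {
    throw new Error('CARGO_DUMP_URL is required')
  }
  return {
    dumpUrl,
    // CARGO_DUMP_DIR is a hypothetical override; the default mirrors the
    // extraction path named in the file summary.
    dumpDir: env.CARGO_DUMP_DIR ?? '/tmp/cargo-dump',
  }
}
```

Failing fast on a missing URL keeps a misconfigured worker from starting a sync that cannot download anything.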
`INSERT INTO audit_field_changes (worker, purl, changed_fields)
 SELECT $(worker), p.purl, array_agg(DISTINCT ac.field)
 FROM ${STAGING_SCHEMA}.audit_changes ac
 JOIN packages p ON p.id = ac.package_id
 GROUP BY p.purl`,
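The aggregation in this query (distinct changed fields grouped per purl) can be mirrored in memory, which may help when reasoning about or unit-testing the expected audit rows. The sketch below is an illustration only; the AuditChange shape and the function name are hypothetical and assume the join to packages has already resolved each change to a purl.

```typescript
// Hypothetical row shape: one changed field for one package.
interface AuditChange {
  purl: string
  field: string
}

// Mirrors array_agg(DISTINCT field) ... GROUP BY purl.
// SQL does not guarantee element order in array_agg; fields are sorted here
// only to make the result deterministic.
function aggregateChangedFields(changes: AuditChange[]): Map<string, string[]> {
  const byPurl = new Map<string, Set<string>>()
  for (const change of changes) {
    let fields = byPurl.get(change.purl)
    if (!fields) {
      fields = new Set<string>()
      byPurl.set(change.purl, fields)
    }
    fields.add(change.field)
  }
  const result = new Map<string, string[]>()
  byPurl.forEach((fields, purl) => {
    result.set(purl, Array.from(fields).sort())
  })
  return result
}
```

Each entry of the returned map corresponds to one row the INSERT would write for a given worker.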
This pull request introduces initial support for a new Cargo (Rust package ecosystem) worker, including its Docker Compose configuration, scripts, dependencies, and the main activity implementations for handling Cargo package data. The changes add new scripts and code to download, extract, load, and process the crates.io database dump, as well as to enrich and clean up package data. Several new dependencies are introduced to support these operations.
Cargo Worker Integration
- Added Docker Compose definitions (scripts/services/cargo-worker.yaml) and updated the packages.env build script to support the cargo-worker service for local development and deployment.
- Added scripts to services/apps/packages_worker/package.json to build, run, and develop the Cargo worker, including environment setup for Temporal task queueing.
Cargo Worker Implementation
- Added src/cargo/activities.ts with activity functions for downloading, loading, enriching, and cleaning up Cargo package data. These activities handle the full workflow from data ingestion to enrichment and teardown.
- Added src/cargo/dump.ts to handle downloading and extracting the crates.io database dump, including robust error handling and timeouts for large files.
Dependency Management
- Added pg-copy-streams and its type definitions to dependencies in both package.json and pnpm-lock.yaml, along with related type packages for PostgreSQL streaming support.
Minor and Supporting Changes
- Updated pnpm-lock.yaml to reflect recent upstream and local changes, ensuring compatibility and proper dependency resolution.
- uuid package in the lockfile.