feat: cargo registry implementation [CM-1264] #4236

Draft
mbani01 wants to merge 1 commit into main from feat/cargo_registry_implementation

mbani01 (Contributor) commented Jun 18, 2026

This pull request introduces initial support for a new Cargo (Rust package ecosystem) worker, including its Docker Compose configuration, scripts, dependencies, and the main activity implementations for handling Cargo package data. The changes add new scripts and code to download, extract, load, and process the crates.io database dump, as well as to enrich and clean up package data. Several new dependencies are introduced to support these operations.

Cargo Worker Integration

  • Added a new Docker Compose file (scripts/services/cargo-worker.yaml) and updated the packages.env build script to support the cargo-worker service for local development and deployment.
  • Introduced new npm scripts in services/apps/packages_worker/package.json to build, run, and develop the Cargo worker, including environment setup for Temporal task queueing.

Cargo Worker Implementation

  • Added src/cargo/activities.ts with activity functions for downloading, loading, enriching, and cleaning up Cargo package data. These activities handle the full workflow from data ingestion to enrichment and teardown.
  • Implemented src/cargo/dump.ts to handle downloading and extracting the crates.io database dump, including robust error handling and timeouts for large files.
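
As an illustration of the timeout handling mentioned for large downloads, here is a minimal sketch of a generic timeout wrapper. It is not the code in src/cargo/dump.ts; the helper name `withTimeout` and its signature are hypothetical, and it assumes a runtime with global `AbortController` (Node 18 or later).

```typescript
// Hypothetical helper (not from the PR): rejects if `work` does not settle
// within `timeoutMs`, and signals the in-flight work to stop via AbortSignal.
function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort() // ask the in-flight work to stop
      reject(new Error(`Timed out after ${timeoutMs} ms`))
    }, timeoutMs)
  })

  // Whichever promise settles first wins; the timer is always cleared afterwards.
  return Promise.race([work(controller.signal), timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer)
  })
}
```

A download could then be wrapped as `withTimeout((signal) => fetch(url, { signal }), timeoutMs)`, so that an expired timer both rejects the caller and aborts the underlying request.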

Dependency Management

  • Added pg-copy-streams and its type definitions to dependencies in both package.json and pnpm-lock.yaml, along with related type packages for PostgreSQL streaming support.

Minor and Supporting Changes

  • Updated AWS SDK dependency snapshots and some metadata in pnpm-lock.yaml to reflect recent upstream and local changes, ensuring compatibility and proper dependency resolution.
  • Minor update to the deprecation message for the uuid package in the lockfile.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
mbani01 self-assigned this Jun 18, 2026
Copilot AI review requested due to automatic review settings June 18, 2026 17:15
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI left a comment

Pull request overview

This PR adds initial Cargo (crates.io) registry ingestion support to services/apps/packages_worker, introducing a dedicated Temporal worker/task-queue that downloads the crates.io DB dump, stages it into a schema in packages-db, enriches package metadata, and records per-field audit changes.

Changes:

  • Added a new Cargo Temporal workflow + schedule + worker entrypoint (cargo-worker) to run a daily registry sync.
  • Implemented dump download/extract + high-volume COPY FROM STDIN load into a staging schema, plus set-based enrichment SQL phases.
  • Updated local/dev wiring (Docker Compose + workspace scripts) and introduced pg-copy-streams for streaming COPY.
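
To make the COPY FROM STDIN step concrete, the sketch below builds the kind of statement such a load would issue for one CSV file. It is an assumption-laden illustration, not the code in loadDump.ts: the function names, the schema name `cargo_staging` used in the usage note, and the choice of `HEADER true` (assuming each dump CSV starts with a header row) are all hypothetical.

```typescript
// Quote a PostgreSQL identifier, doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"'
}

// Hypothetical helper: builds a COPY ... FROM STDIN statement for a CSV file
// whose first row is a header. Schema, table, and column names come from the caller.
function buildCopyFromStdin(schema: string, table: string, columns: string[]): string {
  const target = `${quoteIdent(schema)}.${quoteIdent(table)}`
  const columnList = columns.map(quoteIdent).join(', ')
  return `COPY ${target} (${columnList}) FROM STDIN WITH (FORMAT csv, HEADER true)`
}
```

For example, `buildCopyFromStdin('cargo_staging', 'crates', ['id', 'name'])` yields `COPY "cargo_staging"."crates" ("id", "name") FROM STDIN WITH (FORMAT csv, HEADER true)`. With pg-copy-streams, a statement like this is typically passed to the library's `from()` function, and the resulting stream is the destination of a file read stream, so rows are streamed to the server rather than buffered in memory.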

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| services/apps/packages_worker/tmp-test-versions.ts | Adds a temporary local script for manually running enrichVersions (needs cleanup before merge). |
| services/apps/packages_worker/src/workflows/index.ts | Exports the new cargoSyncWorkflow from the shared workflows index. |
| services/apps/packages_worker/src/db.ts | Adds getPackagesDbConnection() to expose the raw pg-promise connection for COPY streaming. |
| services/apps/packages_worker/src/config.ts | Adds getCargoConfig() (currently only CARGO_DUMP_URL). |
| services/apps/packages_worker/src/cargo/workflows.ts | Defines the cargoSyncWorkflow orchestration and activity timeouts/retries. |
| services/apps/packages_worker/src/cargo/types.ts | Adds Cargo-specific result/config types used across activities/enrichment. |
| services/apps/packages_worker/src/cargo/schedule.ts | Registers a daily Temporal Schedule to run the Cargo sync workflow on the cargo-worker queue. |
| services/apps/packages_worker/src/cargo/loadDump.ts | Implements staging schema DDL + CSV COPY loading + aggregation/denormalization into enrich tables. |
| services/apps/packages_worker/src/cargo/enrich.ts | Implements enrichment phases (packages/versions/repos/maintainers/downloads) + audit logging. |
| services/apps/packages_worker/src/cargo/dump.ts | Downloads the crates.io tarball with timeout and extracts it to /tmp/cargo-dump. |
| services/apps/packages_worker/src/cargo/activities.ts | Wires dump download/load + enrichment phases into Temporal activities and cleanup. |
| services/apps/packages_worker/src/bin/cargo-worker.ts | New worker entrypoint that initializes the service, schedules Cargo sync, and starts processing. |
| services/apps/packages_worker/src/activities.ts | Re-exports Cargo activities from the root activities index. |
| services/apps/packages_worker/package.json | Adds start/dev scripts for cargo-worker and adds pg-copy-streams deps. |
| scripts/services/cargo-worker.yaml | Adds Docker Compose definitions for cargo-worker and cargo-worker-dev. |
| scripts/builders/packages.env | Adds cargo-worker to the packages image/service build list. |
| pnpm-lock.yaml | Locks pg-copy-streams (and types) and includes incidental lockfile metadata changes. |
Files not reviewed (1)
  • pnpm-lock.yaml: Generated file

Comment on lines +1 to +13
import { pgpQx } from '@crowd/data-access-layer/src/queryExecutor'
import { getDbConnection } from '@crowd/database'

import { enrichVersions } from './src/cargo/enrich'

async function main() {
  const conn = await getDbConnection({
    host: 'localhost',
    port: 5434,
    database: 'packages-db',
    user: 'postgres',
    password: 'example',
  } as never)
Comment on lines +1 to +4
export interface CargoConfig {
  dumpUrl: string
  dumpDir: string
}
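
For context on how both fields of this interface might be populated, here is a hedged sketch of a config reader. It is not the PR's getCargoConfig(), which the review summary says currently reads only CARGO_DUMP_URL; the function name `readCargoConfig`, the `CARGO_DUMP_DIR` variable, and the default directory are assumptions (the default mirrors the /tmp/cargo-dump path named in the file summary).

```typescript
interface CargoConfig {
  dumpUrl: string
  dumpDir: string
}

// Assumed default, taken from the extraction path named in the review summary.
const DEFAULT_DUMP_DIR = '/tmp/cargo-dump'

// Hypothetical reader: CARGO_DUMP_URL is required; CARGO_DUMP_DIR is an
// illustrative optional override that does not appear in the PR.
function readCargoConfig(env: Record<string, string | undefined>): CargoConfig {
  const dumpUrl = env['CARGO_DUMP_URL']
  if (!dumpUrl) {
    throw new Error('CARGO_DUMP_URL is required')
  }
  return { dumpUrl, dumpDir: env['CARGO_DUMP_DIR'] || DEFAULT_DUMP_DIR }
}
```

Taking the environment as a parameter (for example `readCargoConfig(process.env)`) keeps the function easy to exercise without mutating global state.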
Comment on lines +344 to +348
`INSERT INTO audit_field_changes (worker, purl, changed_fields)
 SELECT $(worker), p.purl, array_agg(DISTINCT ac.field)
 FROM ${STAGING_SCHEMA}.audit_changes ac
 JOIN packages p ON p.id = ac.package_id
 GROUP BY p.purl`,
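
To show what this statement computes, the sketch below reproduces the same grouping in memory: one entry per purl, holding the distinct set of changed fields, with rows dropped when no matching package exists (the effect of the inner JOIN). The types and function name are hypothetical, and the fields are sorted only to make the output deterministic; `array_agg(DISTINCT ...)` itself does not promise an order.

```typescript
// Hypothetical row shape mirroring the staging audit_changes columns used above.
interface AuditChange {
  packageId: number
  field: string
}

// Groups changed fields by purl, de-duplicating fields per package.
function aggregateChangedFields(
  changes: AuditChange[],
  purlByPackageId: Map<number, string>,
): Record<string, string[]> {
  const fieldsByPurl = new Map<string, Set<string>>()
  for (const change of changes) {
    const purl = purlByPackageId.get(change.packageId)
    if (purl === undefined) continue // inner JOIN: no matching package, no row
    let fields = fieldsByPurl.get(purl)
    if (fields === undefined) {
      fields = new Set<string>()
      fieldsByPurl.set(purl, fields)
    }
    fields.add(change.field)
  }

  const result: Record<string, string[]> = {}
  fieldsByPurl.forEach((fields, purl) => {
    result[purl] = Array.from(fields).sort()
  })
  return result
}
```

Each key of the result corresponds to one inserted audit_field_changes row, with the array playing the role of changed_fields.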