ProcessBench

Business-specific AI benchmark packs for teams that already have AI workflows.

Public model benchmarks answer, "Which model is generally strong?"

ProcessBench helps answer, "Which model, prompt, setting, or vendor still works for this business process?"

It gives teams a practical starting point for turning real workflow examples into repeatable regression checks. The goal is not to replace eval frameworks. The goal is to make the first useful benchmark easier to design.

Start Here

If you are not technical, start with this 30-minute path:

Pick one workflow that already uses AI.
Collect 5 to 10 real examples from that workflow.
Remove names, customer data, secrets, and private account details.
Write what a good answer must contain.
Write what would make the answer unacceptable.
Ask a technical teammate to put those examples into the closest folder below.
Use the report template to decide whether the model, prompt, setting, or vendor is safe to change.

The first benchmark does not need to be large. It needs to make the most important failures visible before they reach customers, operators, or internal teams.
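
As a rough illustration of steps 3 to 5, a technical teammate might store each example as a small record after scrubbing obvious identifiers. This is only a sketch: the field names (`must_contain`, `unacceptable`), the `redact` and `make_case` helpers, and the two regex patterns are assumptions made for this example, not part of the repository. Regex redaction catches only the easy cases and does not replace a manual review.

```python
import json
import re

# Hypothetical patterns; real data usually needs a reviewed, workflow-specific list.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER = re.compile(r"\b\d{6,}\b")  # account numbers, phone numbers, order IDs


def redact(text):
    """Replace obvious identifiers with placeholders before a case is stored."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def make_case(case_id, raw_input, must_contain, unacceptable):
    """Bundle one workflow example with the owner's acceptance notes."""
    return {
        "id": case_id,
        "input": redact(raw_input),
        "must_contain": must_contain,
        "unacceptable": unacceptable,
    }


case = make_case(
    "ticket-001",
    "Customer jane.doe@example.com reports account 12345678 cannot log in.",
    must_contain=["route to identity team", "priority"],
    unacceptable=["promises a refund", "asks for the password"],
)
print(json.dumps(case))
```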

Pick A Starting Point

If you were sent this repository by a product, operations, AI, or technology leader, start with the folder closest to the workflow you need to test.

| Business context | Start here | Typical workflows |
| --- | --- | --- |
| B2B software, telecom, professional services | `b2b-software-telecom-professional-services/` | ticket routing, RFP answers, delivery notes, account summaries, telecom triage |
| Manufacturing, industrial, energy | `manufacturing-industrial-energy/` | maintenance logs, SOP answers, quality incidents, engineering changes, scheduling constraints |
| Retail, e-commerce, consumer goods | `retail-ecommerce-consumer-goods/` | product content, search relevance, returns, campaign copy, merchandising checks |

Each folder contains:

README.md for the local workflow context
deliverables.md with benchmark pack ideas
promptfoo.example.yaml as a runnable or adaptable starting point
prompts/ with minimal prompt templates referenced by the example config

What You Build

The useful output is a small report:

what changed
which examples passed
which examples regressed
which failures block rollout
what should happen next
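
The five report items above can be derived from two sets of per-case results, one before the change and one after. The sketch below assumes results are plain pass/fail booleans keyed by case id. The `build_report` function and its key names are hypothetical and do not reflect the layout of the repository's report template.

```python
def build_report(change, baseline, candidate, blockers):
    """Compare per-case pass/fail results before and after a change.

    baseline and candidate map case id -> True (pass) or False (fail).
    blockers is the set of case ids whose failure must stop rollout.
    """
    passed = sorted(c for c, ok in candidate.items() if ok)
    # A regression is a case that passed before the change and fails after it.
    regressed = sorted(
        c for c, ok in candidate.items() if not ok and baseline.get(c, False)
    )
    blocking = sorted(c for c, ok in candidate.items() if not ok and c in blockers)
    if blocking:
        next_step = "hold rollout and fix blockers"
    elif regressed:
        next_step = "review regressions before rollout"
    else:
        next_step = "proceed with rollout"
    return {
        "what_changed": change,
        "passed": passed,
        "regressed": regressed,
        "blocking_failures": blocking,
        "next_step": next_step,
    }


report = build_report(
    change="prompt v3 -> v4",
    baseline={"t1": True, "t2": True, "t3": False},
    candidate={"t1": True, "t2": False, "t3": False},
    blockers={"t2"},
)
for key, value in report.items():
    print(f"{key}: {value}")
```

Case `t3` fails in both runs, so it is not counted as a regression. Whether such long-standing failures should still block rollout is a decision for the workflow owner.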

What This Is

ProcessBench sits between public model benchmarks and ad hoc spreadsheet testing.

It helps structure:

representative workflow cases
expected outputs
hard blockers
regression checks
model, prompt, setting, or vendor comparisons
report formats for rollout decisions

It is intentionally small. The benchmark packs are meant to be copied, edited, and connected to the evaluation stack already used by your team.

Who This Is For

The likely reader is one of the following:

product or operations owner
technical lead
AI engineer
solutions architect
analytics or data lead
platform engineer

You do not need to be an evals researcher. A non-technical owner can define the examples, expected outcomes, and unacceptable failures. A technical teammate can turn that into YAML or JSONL and connect it to the model provider.
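
One possible handoff, sketched under assumptions: the owner fills in a spreadsheet, exports it as CSV, and a teammate converts the rows to JSONL. The column names and the semicolon convention for multi-value cells are invented for this example. The shape the repository expects may differ, so compare the output with templates/use-case-fixture.jsonl before relying on it.

```python
import csv
import io
import json


def split_cell(cell):
    """Turn a semicolon-separated cell into a clean list (assumed convention)."""
    return [part.strip() for part in cell.split(";") if part.strip()]


def rows_to_jsonl(csv_text):
    """Convert owner-authored CSV rows into one JSON object per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(json.dumps({
            "id": row["id"].strip(),
            "input": row["input"].strip(),
            "must_contain": split_cell(row["must_contain"]),
            "unacceptable": split_cell(row["unacceptable"]),
        }))
    return "\n".join(lines)


sheet = """id,input,must_contain,unacceptable
r-1,Where is my return label?,return window;label link,invents a policy
r-2,Can I exchange a sale item?,exchange policy,promises a refund;guesses stock
"""
print(rows_to_jsonl(sheet))
```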

Quick Start

Open the folder closest to your workflow.
Read deliverables.md and pick one benchmark shape.
Open promptfoo.example.yaml.
Replace the sample cases with anonymized examples from your process.
Encode hard blockers as assertions where possible.
Run the example locally or adapt it to your existing eval stack.
Summarize results with templates/regression-report.md.
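
To make step 5 concrete, a hard blocker can be written as a named check that returns true when the output must not ship. The sketch below is tool-agnostic Python, not promptfoo configuration. The three blockers, the `queue` field, and the refund pattern are hypothetical examples for a support-routing workflow. Eval tools usually provide their own assertion types for checks like these, so treat this as a way to think about the logic rather than as the way to wire it up.

```python
import json
import re

REFUND_PROMISE = re.compile(r"\brefund (is|has been) (approved|issued)\b", re.I)


def parse_object(output):
    """Return the output as a dict, or None if it is not a JSON object."""
    try:
        value = json.loads(output)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


# Hypothetical blockers: each maps a readable name to a check that
# returns True when the output must not be rolled out.
BLOCKERS = {
    "output is not valid JSON": lambda out: parse_object(out) is None,
    "missing 'queue' field": lambda out: "queue" not in (parse_object(out) or {}),
    "promises a refund": lambda out: REFUND_PROMISE.search(out) is not None,
}


def triggered_blockers(model_output):
    """List the names of all blockers that the output trips."""
    return [name for name, hit in BLOCKERS.items() if hit(model_output)]


print(triggered_blockers('{"queue": "billing", "reply": "Your refund has been approved."}'))
print(triggered_blockers('{"queue": "identity"}'))
```

Keeping each blocker named and independent makes the eventual report easier to read: a failed case can state which rule it broke, not only that it failed.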

No-Credential Demo

Run the included offline demo first. It does not call a model. It shows the shape of the expected pass/fail report.

npm run demo

The demo reads examples/b2b-support-demo.jsonl, checks expected fields and hard blockers, and prints a small rollout report. Use it to understand the benchmark structure before connecting real model providers.
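
For readers who want to see the idea without opening the script, here is a minimal Python sketch of an offline field check over JSONL. It is not the repository's demo code, and it does not assume the real layout of examples/b2b-support-demo.jsonl. The `expected` and `output` keys and the inline fixture are made up for illustration.

```python
import json

# Inline stand-in for a fixture file; the field names here are assumptions.
FIXTURE = """\
{"id": "s-1", "expected": {"queue": "billing", "priority": "high"}, "output": {"queue": "billing", "priority": "high"}}
{"id": "s-2", "expected": {"queue": "identity", "priority": "low"}, "output": {"queue": "identity", "priority": "high"}}
"""


def check_case(case):
    """Return the expected fields whose values differ in the recorded output."""
    return [k for k, v in case["expected"].items() if case["output"].get(k) != v]


def run(jsonl_text):
    """Check every non-empty JSONL line and collect mismatches per case id."""
    results = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            case = json.loads(line)
            results[case["id"]] = check_case(case)
    return results


for case_id, mismatches in run(FIXTURE).items():
    status = "PASS" if not mismatches else f"FAIL ({', '.join(mismatches)})"
    print(f"{case_id}: {status}")
```

Because the outputs are recorded in the fixture, no model is called and no credentials are needed.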

Minimal Run Path

The example configs use promptfoo because it is easy to inspect, easy to run locally, and works well for model, prompt, and provider comparison.

Install dependencies:

npm install

Run one example:

npm run eval:b2b

You will need provider credentials for the models you keep in the config. Remove providers you do not use.

What To Change First

Replace these before trusting any result:

sample inputs
expected JSON fields
hard blockers
provider list
prompt wording
report thresholds

The bundled examples are not universal benchmarks. They are scaffolds for building business-specific ones.

What Counts As A Good First Benchmark

A useful first benchmark usually has:

5 to 20 anonymized examples
one clear workflow
expected fields or decisions
hard blockers that stop rollout
one report comparing current and proposed behavior

Do not start with every workflow. Start where a wrong answer would waste time, trigger escalation, mislead a customer, or create rework.

Builds On

ProcessBench is a workflow-pack layer on top of existing evaluation tools.

promptfoo for declarative model, prompt, and provider comparison.
DeepEval for Python-native LLM tests and custom metrics.
Ragas for retrieval-heavy workflows.
LangGraph and LangSmith for stateful agent workflows and tracing.
OpenAI Evals and Inspect for private or research-grade eval design.

Related Repositories

ProcessBench is the benchmark-pack layer. These related repositories cover adjacent parts of the same production-AI workflow.

| Repository | Where it fits |
| --- | --- |
| `skillgate` | Deterministic finish-line gates for AI coding agents. Use it when the output is code or a repo change and the acceptance criteria should block commit or publish. |
| `draftcat` | Governed AI pipelines with deterministic steps, approval, dispatch, and audit. Use it when a workflow should run with operator control. |
| `agent-approval-gate` | A smaller approval-gate pattern with JSON schemas, n8n, and email examples. Use it when you need a minimal approval workflow. |
| `agentic-task-system` | Persistent task and context memory for agents. Use it when benchmark cases, decisions, and follow-ups need to stay connected to work. |
| `foundations` | Local-first context discovery. Use it before broad search when you want to start from proven tools, notes, and code. |

The split is intentional: ProcessBench helps decide whether an AI workflow still works; the other repositories help run, gate, approve, and remember the work around that workflow.

Templates

templates/benchmark-sprint.md for planning a first benchmark pass
templates/scoring-rubric.md for defining pass, warning, fail, and blocker outcomes
templates/regression-report.md for summarizing model or prompt changes
templates/use-case-fixture.jsonl for example fixture shape
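
The four outcome levels named for the scoring rubric can be modelled as an ordered scale, so that a whole run is judged by its worst case. The sketch below assumes that ordering (pass < warning < fail < blocker). The `Outcome` enum and both helper functions are hypothetical and are not taken from templates/scoring-rubric.md.

```python
from enum import IntEnum


class Outcome(IntEnum):
    """Assumed severity order: a higher value is a worse result."""
    PASS = 0
    WARNING = 1
    FAIL = 2
    BLOCKER = 3


def overall(outcomes):
    """Judge a run by its worst case; an empty run counts as PASS."""
    return max(outcomes, default=Outcome.PASS)


def summarize(outcomes):
    """Count how many cases landed on each outcome level."""
    counts = {level.name.lower(): 0 for level in Outcome}
    for outcome in outcomes:
        counts[outcome.name.lower()] += 1
    return counts


run_outcomes = [Outcome.PASS, Outcome.WARNING, Outcome.PASS]
print(overall(run_outcomes).name, summarize(run_outcomes))
```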

Repository Layout

b2b-software-telecom-professional-services/
manufacturing-industrial-energy/
retail-ecommerce-consumer-goods/
adapters/
templates/

Scope

ProcessBench focuses on commercial and operational AI workflows where quality, consistency, throughput, and process control matter.

Finance, insurance, healthcare, and pharma are intentionally not the starting point here. Those areas often need heavier regulatory operating models than this repository provides or should imply.

