# benchflow

**Repository Path**: wild-mechanical-small-flat/benchflow

## Basic Information

- **Project Name**: benchflow
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-12
- **Last Updated**: 2026-05-12

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

BenchFlow

Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent

## What

BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle.

- **Any ACP agent**: Gemini CLI, Claude Code, Codex, OpenCode, OpenHands, OpenClaw, Pi, or your own
- **Single + multi + progressive**: single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python `BaseUser` callback
- **Sandbox backends**: Docker locally, Daytona for parallel cloud runs, Modal for serverless/GPU-backed task environments
- **Hardened verifier**: defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature

## Install

```bash
uv tool install benchflow
```

Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). Set `DAYTONA_API_KEY` for Daytona runs or configure Modal auth for Modal runs; export the relevant agent API key (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) or run `claude login` / `codex --login` for subscription auth.

## Documentation

Start with [Getting started](./docs/getting-started.md), then [Concepts](./docs/concepts.md) for the mental model.
Then by goal:

| If you want to… | Read |
|------------------|------|
| Run an eval on an existing task | [Getting started](./docs/getting-started.md) |
| Understand Trial / Scene / Role / Verifier | [Concepts](./docs/concepts.md) |
| Author a new task | [Task authoring](./docs/task-authoring.md) |
| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | [Use cases](./docs/use-cases.md) |
| Multi-round single-agent (progressive disclosure, oracle access) | [Progressive disclosure](./docs/progressive-disclosure.md) |
| Skill evaluation (when the artifact is a skill, not a workspace) | [Skill eval](./docs/skill-eval.md) |
| Understand the security model | [Sandbox hardening](./docs/sandbox-hardening.md) |
| CLI flags + commands | [CLI reference](./docs/reference/cli.md) |
| Python API surface | [Python API reference](./docs/reference/python-api.md) |

Notebooks and runnable example scripts live under [`docs/examples/`](./docs/examples/) so examples stay versioned with the docs that explain them.

## Benchmark task sources

BenchFlow's helper scripts can materialize benchmark task repos under `.ref/`. For SkillsBench, [`benchmarks/run_skillsbench.py`](./benchmarks/run_skillsbench.py) calls `ensure_tasks("skillsbench")`, which clones [`benchflow-ai/skillsbench`](https://github.com/benchflow-ai/skillsbench) from the `main` branch into `.ref/skillsbench/tasks` when the local task cache is missing.

SkillsBench itself sources BenchFlow from GitHub `main` in its [`pyproject.toml`](https://github.com/benchflow-ai/skillsbench/blob/main/pyproject.toml). After a BenchFlow change lands, run `uv lock --upgrade-package benchflow` in SkillsBench when you need its lockfile to point at the newest BenchFlow commit.

## Featured

- **Progressive disclosure on SWE-bench Pro**: the `BaseUser` abstraction drives a multi-round trial (terse round-0 prompt → failing-test hints → full spec).
  5/5 oracle on Daytona; a runnable demo is at [`docs/examples/swebench_pro_progressive_disclosure.ipynb`](./docs/examples/swebench_pro_progressive_disclosure.ipynb). It also serves as benchflow's parity answer to [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) for the no-second-LLM case. See [Progressive disclosure](./docs/progressive-disclosure.md).

## Research artifacts

Two runnable labs validate the security story:

- [`labs/benchjack-sandbox-hardening/`](./labs/benchjack-sandbox-hardening/): end-to-end demo that 0.2.1+ blocks three [BenchJack](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) exploits that flip 0.2.0's reward from 0.0 to 1.0.
- [`labs/reward-hack-matrix/`](./labs/reward-hack-matrix/): full reward-hack sweep across real benchmarks comparing 0.2.0 vs 0.2.2.

## Audience

- **Eval researchers / paper writers** → [Getting started](./docs/getting-started.md) → [Concepts](./docs/concepts.md) → [Use cases](./docs/use-cases.md)
- **Task authors** → [Task authoring](./docs/task-authoring.md) → [Sandbox hardening](./docs/sandbox-hardening.md)
- **Agent builders integrating with benchflow** → [Concepts](./docs/concepts.md) → [Python API reference](./docs/reference/python-api.md) → [`benchflow.agents.registry`](./src/benchflow/agents/registry.py)
- **Existing Harbor users migrating** → [Use cases: migration section](./docs/use-cases.md#migration-from-harbor) → [Progressive disclosure (Harbor #1316 parity)](./docs/progressive-disclosure.md#comparison-with-multi-agent-simulated-user-harbor-1316-parity)

## Contributing

PRs welcome. Open against `main`. CI runs ruff + tests on every PR; please run `ruff check .` and `pytest tests/` locally first.

For a release: bump `pyproject.toml` to the next stable version, tag `v` on main, and push the tag; CI then publishes to PyPI. Then bump main to the next `.dev0`.

## License

Apache-2.0.
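
## Example: clone-if-missing task cache

The clone-when-missing behavior described under "Benchmark task sources" can be illustrated with a minimal sketch. This is not benchflow's `ensure_tasks` implementation: `ensure_repo_cached`, its parameters, and the injectable `run` callback are hypothetical names introduced here, and the sketch simplifies by cloning a whole repository into one destination directory.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Sequence


def ensure_repo_cached(
    repo_url: str,
    dest: Path,
    branch: str = "main",
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> bool:
    """Clone repo_url into dest unless dest already has content.

    Hypothetical helper for illustration only. Returns True when a
    clone command was issued and False when the cache was reused.
    """
    # Cache hit: the directory exists and is non-empty, so do nothing.
    if dest.is_dir() and any(dest.iterdir()):
        return False

    # Default runner shells out to git; tests can inject a fake instead.
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)

    # Make sure the parent (for example `.ref/`) exists before cloning.
    dest.parent.mkdir(parents=True, exist_ok=True)
    run(["git", "clone", "--depth", "1", "--branch", branch, repo_url, str(dest)])
    return True
```

Under these assumptions, `ensure_repo_cached("https://github.com/benchflow-ai/skillsbench", Path(".ref/skillsbench"))` would clone on the first call and return `False` on later calls. Passing `run` explicitly keeps the cache check testable without network access.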