sth-bench-automation is the orchestration layer for STH benchmark runs. It
does not replace the benchmark repositories. Instead, it coordinates them into
a common run layout with structured transport profiles, bundle-aware preflight,
resumable step state, recovery artifacts, and deterministic local publish staging.
For operator handoff and agent-driven execution, see the detailed V7 run guide:
`docs/v7-run-guide.md`.
- models direct access and jump-box chains with per-host auth and retry policy
- runs a persistent workflow state machine:
  `discover` → `preflight` → `stage` → `execute` → `collect` → `publish` → `finalize`
- writes AI-friendly machine-readable run data:
  `status.json`, `events.jsonl`, `command_ledger.jsonl`, `transfer_ledger.jsonl`, `artifact_manifest.json`, `publish_ledger.json`, `recovery_summary.md`
- stages deterministic publish payloads under `publish-staging/` so resume and
  later GitHub automation can reuse completed benchmark artifacts without
  rerunning them
- prepares local per-benchmark git worktrees under `publish-worktrees/` when a
  target repo is available, so publish retries can start from a staged branch
  rather than raw run artifacts
- creates deterministic local result commits in those publish worktrees, so the
  next GitHub-facing step can push prepared branches instead of reconstructing
  changes
- records push-ready manifests when a committed publish worktree has a matching
  remote, while preserving explicit `remote_missing` or `remote_mismatch` state
  when it does not
- can also push those prepared branches when a publish target opts into
  `push_local = true`, then records a verified `pushed_local` manifest that
  resume can reuse without pushing again
- generates a deterministic PR-ready manifest and body after a successful push,
  including a compare URL and a suggested `gh pr create` command, so the last
  GitHub step is reproducible
- can optionally create the PR through `gh` when `create_pr_local = true`,
  while preserving explicit `gh_missing`, `gh_auth_missing`, and
  `pr_create_failed` states instead of hiding that work
- reuses an existing PR for the pushed branch when one is already open, so
  resume and reruns do not spam duplicate PRs or turn “already exists” into a
  noisy failure
- makes `validate --scope publish` inspect the local repo mapping, git/worktree
  state, `origin` remote match, and `gh` availability/auth ahead of the actual
  publish phase, while reporting the resolved publish target key each adapter
  is using
- resolves both publish target keys and local repo paths through adapter
  metadata, so future benchmark repos do not need duplicated wiring in both CLI
  and publish code
- uses that same adapter metadata for controller-side staging/preflight repo lookup, which keeps local bundle probes and fallback readiness aligned as new benchmarks are added
- also tracks which adapters support controller-side local dist bundles and local bundle builders in metadata, so dry-run staging and live preflight stay in sync
- prefers prebuilt benchmark bundles and records packaging provenance when a run falls back to local wrapper staging, bootstrap, or source build
- builds the latency and Geekbench local wrapper bundles through their own repo scripts before resorting to loose local-repo snapshot fallback, which makes the fallback path more reproducible and reusable
- supports structured AgentSTH topology control for:
- official methodology
- P-core and E-core supplements
- SMT physical-only vs logical mode
- socket-local runs
- pinned CPU lists
- explicit multi-agent split matrices
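For illustration, a publish target opting into local push and PR creation might look like this. Only `push_local` and `create_pr_local` are named above; the table path, repo path, and other keys are assumptions, not the real schema:

```toml
# Hypothetical fragment: key names other than push_local and
# create_pr_local are illustrative assumptions.
[publish.targets.geekbench]
repo = "../../sth-geekbench-results"
branch_prefix = "automation/"
push_local = true        # push the prepared branch, record a pushed_local manifest
create_pr_local = true   # create the PR through gh when available
```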
The available steps are `preflight`, `system-info`, `latency`, `agentsth`, and
`geekbench`. `preflight` is inserted automatically before benchmark execution
unless you explicitly run only preflight.
```shell
cd /Users/patrickkennedy/Desktop/AgentC/sth-bench-automation
. .venv/bin/activate

sth-bench-automation validate \
  --config configs/sth-lab-formal.toml \
  --host turind8 \
  --scope all

sth-bench-automation run \
  --config configs/sth-lab-formal.toml \
  --host turind8 \
  --benchmarks system-info,latency,agentsth,geekbench

sth-bench-automation status \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z

sth-bench-automation resume \
  --config configs/sth-lab-formal.toml \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z

sth-bench-automation recover \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z
```

```
automation-runs/
  2026-04-18/
    turind8/
      20260418T120000Z/
        run_manifest.json
        status.json
        events.jsonl
        command_ledger.jsonl
        transfer_ledger.jsonl
        artifact_manifest.json
        publish_ledger.json
        recovery_summary.md
        host_inventory_snapshot.json
        publish-staging/
        publish-worktrees/
        logs/
        preflight/
        benchmarks/
          system-info/
          latency/
          agentsth/
          geekbench/
```
The formal config now points at benchmark repos plus a bundle cache:

```toml
[paths]
system_info_repo = "../../sth-system-info"
latency_repo = "../../sth-cpu-latency-tool"
agentsth_repo = "../../repo"
geekbench_repo = "../../sth-geekbench"
bundle_cache_dir = "../bundle-cache"
```

Inventories can now describe jump-box routing explicitly instead of using loose
proxy commands:
```toml
[auth_refs.jumpbox-password]
type = "password_env"
value = "STH_JUMPBOX_PASSWORD"

[[hosts]]
name = "turind8"
hostname = "turind8"

[hosts.transport]
type = "jump_chain"

[[hosts.transport.hops]]
name = "thor"
hostname = "thor"
username = "sthuser"
auth_ref = "jumpbox-password"
```

The normal path is bundle-first:
- choose a compatible prebuilt bundle
- verify checksum
- stage or download the bundle
- execute
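The checksum-verification step can be sketched roughly like this; the function name and argument shapes are illustrative, not the tool's actual API:

```python
import hashlib
from pathlib import Path

def verify_bundle_checksum(bundle_path: Path, expected_sha256: str) -> bool:
    """Stream the staged bundle and compare its SHA-256 digest against the
    checksum recorded for the selected prebuilt bundle (illustrative only)."""
    digest = hashlib.sha256()
    with open(bundle_path, "rb") as fh:
        # Read in 1 MiB chunks so large bundles do not load fully into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

A mismatch here is what would push a run off the bundle-first path and onto one of the recorded fallbacks.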
`preflight` and `validate --scope staging` now verify the actual selected
bundle artifact on the controller, not just the platform match. When a cached
preflight report exists for a host, staging validation reuses that platform
snapshot, so dry runs make the same bundle decision the real workflow does. The
staging report also predicts whether each benchmark will use a prebuilt bundle,
a captured local wrapper bundle, or a fallback path. For AgentSTH it also
lists the planned run families, including the mainstream benchmark profile plus
any supplemental benchmark-profile variants and cached-topology-dependent
supplements that would run on that host, and it now reports when a compatible
captured-cache AgentSTH bundle is available before source-build fallback. The
staging report now also predicts the same readiness state live preflight would
use, including whether a benchmark is only viable because a controller-side
repo fallback exists, or is already packaging-blocked before execution.

It now also reports the expected remote runtime commands, dependency hints, and
bootstrap packages for the selected path, so host-side requirements such as
`lstopo` (hwloc) are visible before a live run.
Those staging and preflight reports now also include the derived bootstrap
install command for the detected package manager, so the fix can be copied
directly instead of reconstructed from the package list.
When `validate --scope all` has SSH credentials, it also probes those selected
runtime commands on the remote host and reports missing prerequisites directly
instead of only predicting them from config and cached platform data.
For source-build fallback paths such as AgentSTH, those reports now also
distinguish package-managed missing prerequisites from tools the workflow can
bootstrap itself, such as `cargo` via `rustup`, so a bootstrappable
source-build host does not get flattened into the same state as a genuinely
blocked one.
If the host is unreachable, the staging report now records that those remote
checks were skipped due to connectivity and preserves the expected command and
package context instead of leaving the probe result empty.
The same structured connectivity payload is now used for auth/config failures
such as missing password environment variables, so validation still returns a
usable report instead of aborting before JSON output.
When a live remote runtime probe succeeds, the staging report now also publishes
`validated_readiness_status` and `readiness_basis`, so remote dependency
failures can downgrade a previously predicted ready state.
`validate` now exits non-zero when connectivity, staging readiness, or required
publish readiness checks fail, so scripts can rely on the exit code instead of
always parsing the JSON report first.
Both staging validation and live preflight also record the required bundle
capabilities for each benchmark, so bundle mismatches are easier to debug from
the report output without chasing separate repo metadata.
AgentSTH source-build fallback tarballs now also strip macOS sidecar files and
archive pax metadata before upload, so controller-side packaging is less likely
to fail when a benchmark operator is building from a macOS workstation.
Controller-side fallback details now also carry a basis field so reports can
distinguish repo-backed fallbacks from standalone runner-backed ones such as a
Geekbench wrapper path.
Bundle matching also normalizes common architecture aliases such as
`arm64`/`aarch64` and `amd64`/`x86_64`, so configured bundles, local dist
bundles, and AgentSTH captured-cache reuse all follow the same platform-match
rules.
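A minimal sketch of that alias normalization, assuming `aarch64` and `x86_64` as the canonical names (the real canonical forms and alias table are not specified here):

```python
# Illustrative alias table: maps common architecture spellings to one
# canonical name so all bundle sources apply the same match rule.
ARCH_ALIASES = {
    "arm64": "aarch64",
    "aarch64": "aarch64",
    "amd64": "x86_64",
    "x86_64": "x86_64",
}

def normalize_arch(arch: str) -> str:
    # Unknown architectures fall through unchanged (lowercased).
    return ARCH_ALIASES.get(arch.lower(), arch.lower())

def platform_match(bundle_arch: str, host_arch: str) -> bool:
    return normalize_arch(bundle_arch) == normalize_arch(host_arch)
```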
Live preflight now preserves the same local bundle-builder metadata that dry-run
staging reports, including builder path and latest local bundle sidecar details,
instead of collapsing that controller-side packaging state down to a boolean.
The human-readable preflight, status, and recovery summaries now also surface
the local builder sidecar filename plus non-repo fallback basis values such as a
standalone Geekbench runner path, so operator-facing output matches the richer
JSON metadata more closely.
Dry-run staging and live preflight now also share the same readiness classifier,
so states like `ready_with_local_bundle`, `bundle_invalid_fallback_available`,
and `packaging_blocked` are derived from one common ruleset instead of drifting
between separate code paths.
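A minimal sketch of such a shared classifier; the input flags, the plain `ready` state, and the exact rules are assumptions, while the other state names come from the report output described above:

```python
# Illustrative shared ruleset: both dry-run staging and live preflight
# would call this one function instead of duplicating the logic.
def classify_readiness(*, bundle_valid: bool, local_bundle: bool,
                       fallback_available: bool) -> str:
    if bundle_valid:
        # A verified bundle wins; note whether it came from a local build.
        return "ready_with_local_bundle" if local_bundle else "ready"
    if fallback_available:
        return "bundle_invalid_fallback_available"
    return "packaging_blocked"
```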
Live preflight benchmark reports now also include `status_basis`, so operator
output can show whether a ready or blocked state came from a configured bundle,
captured cache reuse, a local bundle builder, controller fallback, or missing
runtime dependencies.
Configured bundles, AgentSTH captured-cache bundles, and local dist bundles now
also share one precedence resolver across dry-run staging and live preflight, so
both paths agree on which controller-side bundle source would actually win.
Operator-facing status and recovery output now also classify running-phase
heartbeats as fresh or stale using the recorded heartbeat interval, so
long-running phases are easier to distinguish from truly stalled runs. The
status CLI now uses the same structured current-position block as recovery
summaries instead of printing the active phase as a raw JSON dict, and stale
running phases are now called out explicitly with `Attention: stalled`. The
persisted `status.json` also carries structured heartbeat and attention fields
for active phases/current pointers, so agents do not need to infer stall state
from timestamps alone, and it now includes a top-level `monitoring` block with
the derived run state (`idle`, `running`, or `stalled`), a short summary, any
`stalled_phases`, and the active adapter/phase when one is running. The status
CLI, plus the resume/recover paths, now also backfill that same monitoring
summary for older run directories whose `status.json` predates the new field,
so legacy runs become agent-friendly on first touch.

Persisted status also includes a top-level `alerts` block that summarizes
failed steps, stalled phases, and blocked publish states, so agents can see the
highest-priority issues without walking the full nested structure. It now also
includes a primary alert with recommended next actions for the most urgent
current problem, plus top-level `next_action` / `next_actions` fields for
direct agent handoff. That same derived alert block now also carries
`command_hints` plus top-level `next_command` / `next_commands`, so other
agents can move straight to exact status, recover, or resume invocations
without synthesizing the CLI by hand. When config and host details are
available, connectivity, staging, and publish problems now prefer the matching
scoped `validate` command ahead of a blind `recover`. It also carries
`path_hints` plus prioritized `next_path` / `next_paths`, so the first
investigation file is explicit too; when the run has step artifacts or publish
manifests for the active alert, those are preferred ahead of the generic
ledgers. Dependency-driven step failures now also expose
`next_install_command` / `next_install_commands` when the exact bootstrap
install command is already known from the run state, so a follow-on agent can
move directly to remediation without rebuilding the package-manager invocation
from package lists. Publish issues now also expose `next_diagnostic_command` /
`next_diagnostic_commands` when the run already knows the exact repo-local
`git` or `gh` command that should be inspected next, so agents can separate
“validate the pipeline state” from “inspect the broken publish environment.”

Run state now also persists per-step manifest files under `state/steps/...`,
and those step manifests are registered as normal artifacts, so alerts can hand
another agent a concrete manifest file instead of falling back to the generic
command ledger when no more specific benchmark artifact exists yet. The status
command and the resume/recover path now also backfill those step manifest files
for older runs that only had embedded step manifests in `status.json`, so the
newer alert handoff stays useful on legacy run directories too, and they now
also rewrite the derived `monitoring` and `alerts` blocks when those persisted
snapshots are missing or still on an older schema. Those same compatibility
paths now also restore `artifact_manifest.json` when legacy runs are missing
it, so file-handoff hints keep pointing at a real manifest instead of a stale
path, and they now restore `recovery_summary.md` too, so alert handoff files
stay complete even on older run directories that predate the newer summary
contract. They also restore `publish_ledger.json` from persisted publish state,
so publish-issue file hints do not degrade into dead paths on legacy runs. They
also restore missing `command_ledger.jsonl` and `transfer_ledger.jsonl` files
as empty compatibility ledgers on first touch, so recovery summaries and file
hints keep pointing at live investigation paths even when the original ledgers
were lost. They also rebuild `run_manifest.json` from persisted run state when
older runs are missing it, so the directory regains a complete top-level index
of status, ledgers, and benchmark manifests on first touch, and they now
restore `host_inventory_snapshot.json` from persisted host/config state too, so
that top-level run metadata stays complete instead of degrading into a missing
snapshot path. For real run directories, derived `path_hints` and
`next_path(s)` now also omit generic ledger files that do not actually exist
yet, so agent handoff prefers live artifacts over placeholder paths.
Recovery summaries now include that same derived alert
summary, primary command, primary file, and primary next step when the run has
active issues.
Live preflight summaries also record the planned AgentSTH mainstream and
supplemental run families so the generated `summary.md` matches the richer
staging validation view, and they now surface controller-side latency and
Geekbench local-dist bundle reuse or local wrapper-builder availability, so
packaging readiness is described consistently across dry-run validation,
preflight, and recovery output. Live preflight also tracks whether execution is
only viable because a local controller repo fallback exists, and it now fails
early with a packaging-blocked state when that fallback path is unavailable.
Missing-command summaries now include package hints for key host prerequisites
such as `lstopo` (hwloc), so topology-tooling gaps are easier to spot before
execution.
The latency adapter now resolves or builds its bundle during `stage`, so long
controller-side packaging work is visible separately from benchmark execution
in status, recovery output, and command logs.
Fallbacks are recorded explicitly in the manifest. Today:
- `system-info` supports controller-side wrapper-bundle staging with local repo
  fallback
- `latency` supports wrapper-bundle staging for the controller-side CLI, while
  remote benchmark prerequisites and installs are still handled by the latency
  tool itself
- AgentSTH supports bundle-first with source-build fallback
- Geekbench supports wrapper-bundle staging with result and claim-link capture
`agentsth` in the automation config now separates `cpu_pool`, `threading`,
`single_agent`, `multi_agent`, and `supplemental`. This keeps the official
comparable run intact while allowing labeled supplemental P-core, E-core, SMT,
socket-local, or pinned-CPU runs.
It also supports supplemental benchmark-profile variants. For example, the
formal config can keep `v6.1-official` as the mainstream AgentSTH path while
adding `v6-supplemental-compression-2gb` as a clearly labeled secondary
single-agent and multi-agent run family for side-data collection.
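A hypothetical config fragment along these lines; the section names follow the list above, but the individual keys and values are illustrative assumptions, not the real schema:

```toml
# Illustrative only: keys and values below are assumptions.
[agentsth.cpu_pool]
mode = "socket_local"

[agentsth.threading]
smt = "physical_only"

[agentsth.single_agent]
profile = "v6.1-official"

[agentsth.multi_agent]
split_matrix = [[8, 8], [16, 16]]

[[agentsth.supplemental]]
label = "p-core-only"
pinned_cpus = "0-7"
profile = "v6-supplemental-compression-2gb"
```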