STH Bench Automation

sth-bench-automation is the orchestration layer for STH benchmark runs. It does not replace the benchmark repositories. Instead, it coordinates them into a common run layout with structured transport profiles, bundle-aware preflight, resumable step state, recovery artifacts, and deterministic local publish staging.

For operator handoff and agent-driven execution, see the detailed V7 run guide: docs/v7-run-guide.md.

What It Does

  • models direct access and jump-box chains with per-host auth and retry policy
  • runs a persistent workflow state machine:
    • discover
    • preflight
    • stage
    • execute
    • collect
    • publish
    • finalize
  • writes AI-friendly machine-readable run data:
    • status.json
    • events.jsonl
    • command_ledger.jsonl
    • transfer_ledger.jsonl
    • artifact_manifest.json
    • publish_ledger.json
    • recovery_summary.md
  • stages deterministic publish payloads under publish-staging/ so resume and later GitHub automation can reuse completed benchmark artifacts without rerunning them
  • prepares local per-benchmark git worktrees under publish-worktrees/ when a target repo is available, so publish retries can start from a staged branch rather than raw run artifacts
  • creates deterministic local result commits in those publish worktrees, so the next GitHub-facing step can push prepared branches instead of reconstructing changes
  • records push-ready manifests when a committed publish worktree has a matching remote, while preserving explicit remote_missing or remote_mismatch state when it does not
  • can also push those prepared branches when a publish target opts into push_local = true, then records a verified pushed_local manifest that resume can reuse without pushing again
  • generates a deterministic PR-ready manifest and body after a successful push, including a compare URL and suggested gh pr create command, so the last GitHub step is reproducible
  • can optionally create the PR through gh when create_pr_local = true, while preserving explicit gh_missing, gh_auth_missing, and pr_create_failed states instead of hiding that work
  • reuses an existing PR for the pushed branch when one is already open, so resume and reruns do not spam duplicate PRs or turn “already exists” into a noisy failure
  • makes validate --scope publish inspect the local repo mapping, git/worktree state, origin remote match, and gh availability/auth ahead of the actual publish phase, reporting the resolved publish target key each adapter is using
  • resolves both publish target keys and local repo paths through adapter metadata, so future benchmark repos do not need duplicated wiring in both CLI and publish code
  • uses that same adapter metadata for controller-side staging/preflight repo lookup, which keeps local bundle probes and fallback readiness aligned as new benchmarks are added
  • also tracks which adapters support controller-side local dist bundles and local bundle builders in metadata, so dry-run staging and live preflight stay in sync
  • prefers prebuilt benchmark bundles and records packaging provenance when a run falls back to local wrapper staging, bootstrap, or source build
  • builds the latency and Geekbench local wrapper bundles through their own repo scripts before resorting to loose local-repo snapshot fallback, which makes the fallback path more reproducible and reusable
  • supports structured AgentSTH topology control for:
    • official methodology
    • P-core and E-core supplements
    • SMT physical-only vs logical mode
    • socket-local runs
    • pinned CPU lists
    • explicit multi-agent split matrices
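The machine-readable run data listed above is plain JSON and JSON Lines, so agents can consume it without special tooling. A minimal sketch of reading it (the file contents and the `phase` field here are hypothetical placeholders; the real schemas come from an actual run):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample payloads; a real run writes these files itself.
run_dir = Path(tempfile.mkdtemp())
(run_dir / "status.json").write_text(json.dumps({"phase": "execute"}))
(run_dir / "events.jsonl").write_text(
    '{"event": "phase_started", "phase": "discover"}\n'
    '{"event": "phase_started", "phase": "preflight"}\n'
)

# status.json is a single JSON document; events.jsonl is one JSON object per line.
status = json.loads((run_dir / "status.json").read_text())
events = [
    json.loads(line)
    for line in (run_dir / "events.jsonl").read_text().splitlines()
    if line.strip()
]
phases = [event["phase"] for event in events]
```

The append-only JSONL ledgers (command_ledger.jsonl, transfer_ledger.jsonl) can be read the same way, one record per line.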

Current Adapters

  • preflight
  • system-info
  • latency
  • agentsth
  • geekbench

preflight is inserted automatically before benchmark execution unless you explicitly run only preflight.

Workflow Commands

cd /Users/patrickkennedy/Desktop/AgentC/sth-bench-automation
. .venv/bin/activate

sth-bench-automation validate \
  --config configs/sth-lab-formal.toml \
  --host turind8 \
  --scope all

sth-bench-automation run \
  --config configs/sth-lab-formal.toml \
  --host turind8 \
  --benchmarks system-info,latency,agentsth,geekbench

sth-bench-automation status \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z

sth-bench-automation resume \
  --config configs/sth-lab-formal.toml \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z

sth-bench-automation recover \
  --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z

Run Layout

automation-runs/
  2026-04-18/
    turind8/
      20260418T120000Z/
        run_manifest.json
        status.json
        events.jsonl
        command_ledger.jsonl
        transfer_ledger.jsonl
        artifact_manifest.json
        publish_ledger.json
        recovery_summary.md
        host_inventory_snapshot.json
        publish-staging/
        publish-worktrees/
        logs/
        preflight/
        benchmarks/
          system-info/
          latency/
          agentsth/
          geekbench/

Config Highlights

The formal config now points at benchmark repos plus a bundle cache:

[paths]
system_info_repo = "../../sth-system-info"
latency_repo = "../../sth-cpu-latency-tool"
agentsth_repo = "../../repo"
geekbench_repo = "../../sth-geekbench"
bundle_cache_dir = "../bundle-cache"

Inventories can now describe jump-box routing explicitly instead of using loose proxy commands:

[auth_refs.jumpbox-password]
type = "password_env"
value = "STH_JUMPBOX_PASSWORD"

[[hosts]]
name = "turind8"
hostname = "turind8"

[hosts.transport]
type = "jump_chain"

[[hosts.transport.hops]]
name = "thor"
hostname = "thor"
username = "sthuser"
auth_ref = "jumpbox-password"

Packaging Policy

The normal path is bundle-first:

  1. choose a compatible prebuilt bundle
  2. verify checksum
  3. stage or download the bundle
  4. execute

Staging and preflight validation:

  • preflight and validate --scope staging verify the actual selected bundle artifact on the controller, not just the platform match. When a cached preflight report exists for a host, staging validation reuses that platform snapshot so dry runs make the same bundle decision the real workflow does.
  • the staging report predicts whether each benchmark will use a prebuilt bundle, a captured local wrapper bundle, or a fallback path. For AgentSTH it also lists the planned run families, including the mainstream benchmark profile plus any supplemental benchmark-profile variants and cached-topology-dependent supplements that would run on that host, and reports when a compatible captured-cache AgentSTH bundle is available before source-build fallback.
  • the staging report predicts the same readiness state live preflight would use, including whether a benchmark is only viable because a controller-side repo fallback exists, or is already packaging-blocked before execution.
  • staging and preflight reports list the expected remote runtime commands, dependency hints, and bootstrap packages for the selected path, so host-side requirements such as lstopo (hwloc) are visible before a live run. They also include the derived bootstrap install command for the detected package manager, so the fix can be copied directly instead of reconstructed from the package list.
  • when validate --scope all has SSH credentials, it probes the selected runtime commands on the remote host and reports missing prerequisites directly instead of only predicting them from config and cached platform data.
  • for source-build fallback paths such as AgentSTH, reports distinguish package-managed missing prerequisites from tools the workflow can bootstrap itself, such as cargo via rustup, so a bootstrapable source-build host is not flattened into the same state as a genuinely blocked one.

Connectivity and probe handling:

  • if the host is unreachable, the staging report records that remote checks were skipped due to connectivity and preserves the expected command and package context instead of leaving the probe result empty. The same structured connectivity payload covers auth/config failures such as missing password environment variables, so validation still returns a usable report instead of aborting before JSON output.
  • when a live remote runtime probe succeeds, the staging report publishes validated_readiness_status and readiness_basis, so remote dependency failures can downgrade a previously predicted ready state.
  • validate exits non-zero when connectivity, staging readiness, or required publish readiness checks fail, so scripts can rely on the exit code instead of always parsing the JSON report first.
  • both staging validation and live preflight record the required bundle capabilities for each benchmark, so bundle mismatches are easier to debug from report output without chasing separate repo metadata.
  • AgentSTH source-build fallback tarballs strip macOS sidecar files and archive pax metadata before upload, so controller-side packaging is less likely to fail when a benchmark operator builds from a macOS workstation.
  • controller-side fallback details carry a basis field so reports can distinguish repo-backed fallbacks from standalone runner-backed ones, such as a Geekbench wrapper path.
  • bundle matching normalizes common architecture aliases such as arm64/aarch64 and amd64/x86_64, so configured bundles, local dist bundles, and AgentSTH captured-cache reuse all follow the same platform match rules.
  • live preflight preserves the same local bundle-builder metadata that dry-run staging reports, including builder path and latest local bundle sidecar details, instead of collapsing that controller-side packaging state down to a boolean.

Readiness classification and monitoring:

  • the human-readable preflight, status, and recovery summaries surface the local builder sidecar filename plus non-repo fallback basis values, such as a standalone Geekbench runner path, so operator-facing output matches the richer JSON metadata.
  • dry-run staging and live preflight share one readiness classifier, so states like ready_with_local_bundle, bundle_invalid_fallback_available, and packaging_blocked derive from one common ruleset instead of drifting between separate code paths.
  • live preflight benchmark reports include status_basis, so operator output can show whether a ready or blocked state came from a configured bundle, captured cache reuse, a local bundle builder, controller fallback, or missing runtime dependencies.
  • configured bundles, AgentSTH captured-cache bundles, and local dist bundles share one precedence resolver across dry-run staging and live preflight, so both paths agree on which controller-side bundle source would actually win.
  • operator-facing status and recovery output classify running-phase heartbeats as fresh or stale using the recorded heartbeat interval, so long-running phases are easy to distinguish from truly stalled runs. The status CLI uses the same structured current-position block as recovery summaries instead of printing the active phase as a raw JSON dict, and stale running phases are called out explicitly with Attention: stalled.
  • persisted status.json carries structured heartbeat and attention fields for active phases and current pointers, plus a top-level monitoring block with the derived run state (idle, running, or stalled), a short summary, any stalled_phases, and the active adapter/phase when one is running.

Agent handoff and alerts:

  • the status CLI, plus the resume/recover paths, backfill that monitoring summary for older run directories whose status.json predates the field, so legacy runs become agent-friendly on first touch.
  • persisted status includes a top-level alerts block that summarizes failed steps, stalled phases, and blocked publish states, plus a primary alert with recommended next actions for the most urgent current problem and top-level next_action / next_actions fields for direct agent handoff.
  • the derived alert block carries command_hints plus top-level next_command / next_commands, so other agents can move straight to exact status, recover, or resume invocations without synthesizing the CLI by hand. When config and host details are available, connectivity, staging, and publish problems prefer the matching scoped validate command ahead of a blind recover.
  • it also carries path_hints plus prioritized next_path / next_paths, so the first investigation file is explicit; when the run has step artifacts or publish manifests for the active alert, those are preferred ahead of the generic ledgers.
  • dependency-driven step failures expose next_install_command / next_install_commands when the exact bootstrap install command is already known from the run state, so a follow-on agent can move directly to remediation without rebuilding the package-manager invocation from package lists.
  • publish issues expose next_diagnostic_command / next_diagnostic_commands when the run already knows the exact repo-local git or gh command to inspect next, separating "validate the pipeline state" from "inspect the broken publish environment."

Legacy-run compatibility:

  • run state persists per-step manifest files under state/steps/..., and those step manifests are registered as normal artifacts, so alerts can hand another agent a concrete manifest file instead of falling back to the generic command ledger when no more specific benchmark artifact exists yet.
  • the status command and the resume/recover path backfill those step manifest files for older runs that only had embedded step manifests in status.json, and rewrite the derived monitoring and alerts blocks when those persisted snapshots are missing or still on an older schema.
  • the same compatibility paths restore artifact_manifest.json, recovery_summary.md, and publish_ledger.json from persisted state on legacy runs, so file-handoff hints keep pointing at real files instead of dead paths.
  • they restore missing command_ledger.jsonl and transfer_ledger.jsonl as empty compatibility ledgers on first touch, so recovery summaries and file hints keep pointing at live investigation paths even when the original ledgers were lost.
  • they rebuild run_manifest.json from persisted run state when older runs are missing it, so the directory regains a complete top-level index of status, ledgers, and benchmark manifests, and they restore host_inventory_snapshot.json from persisted host/config state so top-level run metadata stays complete instead of degrading into a missing snapshot path.
  • for real run directories, derived path_hints and next_path(s) omit generic ledger files that do not actually exist yet, so agent handoff prefers live artifacts over placeholder paths.

Operator summaries:

  • recovery summaries include that same derived alert summary, primary command, primary file, and primary next step when the run has active issues.
  • live preflight summaries record the planned AgentSTH mainstream and supplemental run families so the generated summary.md matches the richer staging validation view, and surface controller-side latency and Geekbench local-dist bundle reuse or local wrapper-builder availability, so packaging readiness is described consistently across dry-run validation, preflight, and recovery output.
  • live preflight tracks whether execution is only viable because a local controller repo fallback exists, and fails early with a packaging-blocked state when that fallback path is unavailable.
  • missing-command summaries include package hints for key host prerequisites such as lstopo (hwloc), so topology-tooling gaps are easier to spot before execution.
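A minimal sketch of how a follow-on agent might consume the monitoring and alerts blocks described above. The exact field names below are assumptions inferred from this description, not a schema reference:

```python
import json

# Hypothetical status.json excerpt; field names are assumed, not guaranteed.
status = json.loads("""
{
  "monitoring": {"state": "stalled",
                 "summary": "execute phase heartbeat is stale",
                 "stalled_phases": ["execute"]},
  "alerts": {"primary": {
      "next_command": "sth-bench-automation recover --run-dir automation-runs/2026-04-18/turind8/20260418T120000Z"}}
}
""")

next_command = None
if status["monitoring"]["state"] == "stalled":
    # Prefer the alert's recommended command over guessing a recovery step.
    next_command = status["alerts"]["primary"]["next_command"]
```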

The latency adapter now resolves or builds its bundle during stage, so long controller-side packaging work is visible separately from benchmark execution in status, recovery output, and command logs.

Fallbacks are recorded explicitly in the manifest. Today:

  • system-info supports controller-side wrapper-bundle staging with local repo fallback
  • latency supports wrapper-bundle staging for the controller-side CLI, while remote benchmark prerequisites and installs are still handled by the latency tool itself
  • AgentSTH supports bundle-first with source-build fallback
  • Geekbench supports wrapper-bundle staging with result and claim-link capture

AgentSTH Policy Surface

agentsth in the automation config now separates:

  • cpu_pool
  • threading
  • single_agent
  • multi_agent
  • supplemental

This keeps the official comparable run intact while allowing labeled supplemental P-core, E-core, SMT, socket-local, or pinned-CPU runs.

It also supports supplemental benchmark-profile variants. For example, the formal config can keep v6.1-official as the mainstream AgentSTH path while adding v6-supplemental-compression-2gb as a clearly labeled secondary single-agent and multi-agent run family for side-data collection.
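As a rough illustration of that separation only, a config along these lines could keep the official path intact while labeling supplemental work; every key name below is hypothetical and the real schema may differ:

```toml
# Hypothetical sketch of the agentsth policy surface; key names are illustrative.
[benchmarks.agentsth.cpu_pool]
mode = "all"                      # supplemental runs might use "p_core", "e_core", etc.

[benchmarks.agentsth.threading]
smt = "logical"                   # vs "physical_only"

[benchmarks.agentsth.single_agent]
profile = "v6.1-official"         # the mainstream, comparable run

[benchmarks.agentsth.supplemental]
profiles = ["v6-supplemental-compression-2gb"]
```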

About

Top-level orchestration for STH benchmark workflows
