feat: support multiple benchmark framework! by 123liuziming · Pull Request #200 · alibaba/loongsuite-python-agent

123liuziming · 2026-05-26T08:17:09Z

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Test A

Does This PR Require a Core Repo Change?

Yes. - Link to PR:
No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

Followed the style guidelines of this project
Changelogs have been updated
Unit tests have been added
Documentation has been updated

CLAassistant · 2026-05-26T08:17:18Z

All committers have signed the CLA.

Copilot

Pull request overview

This PR expands the LoongSuite/OpenTelemetry instrumentation surface to cover additional benchmark frameworks (e.g., WildToolBench, WideSearch, WebArena, VitaBench, slop-code-bench, OpenHands V0, BFCL v4, mini-swe-agent, claw-eval, AlgoTune), and updates the shared GenAI util types to carry more framework-level metadata.

Changes:

Extend EntryInvocation to support system_instruction and tool_definitions.
Add multiple new instrumentation packages (each with packaging metadata, utilities, and initial tests/docs) for additional benchmark frameworks.
Add/adjust framework-specific utilities, wrappers, and test scaffolding across the new instrumentation packages.

Reviewed changes

Copilot reviewed 128 out of 139 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
util/opentelemetry-util-genai/src/opentelemetry/util/genai/extended_types.py	Extends shared GenAI invocation model (adds system instruction + tool definitions to ENTRY).
packages.txt	Adds a pinned environment/package snapshot file.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_instrumentor.py	Adds WildToolBench instrumentor lifecycle tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_error_scenarios.py	Adds WildToolBench error/edge-case tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_entry_span.py	Adds WildToolBench ENTRY span tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/test_agent_span.py	Adds WildToolBench AGENT span tests.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/conftest.py	Adds WildToolBench test fixtures/exporters and env setup.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/tests/init.py	Initializes WildToolBench test package.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/version.py	Introduces WildToolBench instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/utils.py	Adds small WildToolBench helper utilities.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/package.py	Declares WildToolBench instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/src/opentelemetry/instrumentation/wildtool/init.py	Implements WildToolBench instrumentor and patch lifecycle.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/README.md	Documents WildToolBench instrumentation usage/topology.
instrumentation-loongsuite/loongsuite-instrumentation-wildtool/pyproject.toml	Adds WildToolBench packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/tests/init.py	Initializes WideSearch test package.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/version.py	Introduces WideSearch instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/utils.py	Adds WideSearch conversion/extraction helpers (invocations/messages/tools).
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/package.py	Declares WideSearch instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/src/opentelemetry/instrumentation/widesearch/init.py	Implements WideSearch instrumentor and wrap points.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/README.md	Documents WideSearch instrumentation usage.
instrumentation-loongsuite/loongsuite-instrumentation-widesearch/pyproject.toml	Adds WideSearch packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/version.py	Introduces WebArena instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/package.py	Declares WebArena instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/_state.py	Adds WebArena cross-wrapper state management via ContextVars.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/_attrs.py	Adds WebArena attribute constants + truncation/serialization helpers.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/internal/init.py	Initializes WebArena internal package.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/src/opentelemetry/instrumentation/webarena/config.py	Adds WebArena env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-webarena/pyproject.toml	Adds WebArena packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-vita/tests/conftest.py	Adds VitaBench test fixtures/exporters and env setup.
instrumentation-loongsuite/loongsuite-instrumentation-vita/tests/init.py	Initializes VitaBench test package.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/version.py	Introduces VitaBench instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/utils.py	Adds VitaBench message/tool conversion helpers.
instrumentation-loongsuite/loongsuite-instrumentation-vita/src/opentelemetry/instrumentation/vita/package.py	Declares VitaBench instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-vita/README.md	Documents VitaBench instrumentation usage and DashScope notes.
instrumentation-loongsuite/loongsuite-instrumentation-vita/pyproject.toml	Adds VitaBench packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/setup.sh	Adds VitaBench DashScope example setup script.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/README.md	Adds VitaBench DashScope example instructions.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/vitabench-dashscope/cmd.sh	Adds VitaBench DashScope example run script.
instrumentation-loongsuite/loongsuite-instrumentation-vita/examples/init.py	Initializes VitaBench examples package.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/tests/init.py	Initializes Terminus2 test package.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/test-requirements.txt	Adds Terminus2 test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/src/opentelemetry/instrumentation/terminus2/version.py	Introduces Terminus2 instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/src/opentelemetry/instrumentation/terminus2/package.py	Declares Terminus2 instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-terminus2/pyproject.toml	Adds Terminus2 packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_workflow_span.py	Adds slop-code workflow/CHAIN tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_task_span.py	Adds slop-code TASK span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_step_span.py	Adds slop-code STEP span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_llm_span.py	Adds slop-code LLM span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_hierarchy.py	Adds slop-code parent/child hierarchy tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_entry_span.py	Adds slop-code ENTRY span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/test_agent_span.py	Adds slop-code AGENT span tests.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/tests/init.py	Initializes slop-code test package.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/test-requirements.txt	Adds slop-code test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/workflow.py	Adds slop-code CHAIN/workflow wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/tool.py	Adds slop-code TOOL wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/task.py	Adds slop-code ENTRY+TASK wrapper implementation for checkpoints.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/step.py	Adds slop-code STEP wrapper implementation for ReAct rounds.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/llm.py	Adds slop-code LLM wrapper implementation for rubric judge calls.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/entry.py	Adds slop-code ENTRY wrapper implementations.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/agent.py	Adds slop-code AGENT wrapper implementation.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/wrappers/init.py	Initializes slop-code wrappers package.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/version.py	Introduces slop-code instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/utils.py	Adds slop-code helper utilities (safe getters, truncation, message schema).
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/src/opentelemetry/instrumentation/slop_code/package.py	Declares slop-code instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/README.md	Documents slop-code instrumentation span tree and usage.
instrumentation-loongsuite/loongsuite-instrumentation-slop-code/pyproject.toml	Adds slop-code packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/tests/test_v0_wrappers.py	Adds OpenHands V0 wrapper behavior tests.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/tests/init.py	Initializes OpenHands test package.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/test-requirements.txt	Adds OpenHands test requirements list.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/version.py	Introduces OpenHands instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/package.py	Declares OpenHands instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/utils.py	Adds OpenHands serialization helpers for semconv I/O/message capture.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/session_context.py	Adds OpenHands cross-thread context bridge and tool registry.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/constants.py	Adds OpenHands constant attribute keys/framework identity.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/internal/init.py	Initializes OpenHands internal package.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/src/opentelemetry/instrumentation/openhands/config.py	Adds OpenHands env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/README.rst	Documents OpenHands V0 instrumentation behavior and topology.
instrumentation-loongsuite/loongsuite-instrumentation-openhands/pyproject.toml	Adds OpenHands packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/version.py	Introduces mini-swe-agent instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/package.py	Declares mini-swe-agent instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/delegates.py	Adds mini-swe-agent TOOL span delegate (environment execute).
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/cli_wrappers.py	Adds mini-swe-agent CLI ENTRY wrapper via Typer app proxy.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/agent_wrappers.py	Adds mini-swe-agent AGENT/STEP wrappers for DefaultAgent.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/internal/init.py	Initializes mini-swe-agent internal package.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/config.py	Adds mini-swe-agent env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/src/opentelemetry/instrumentation/minisweagent/init.py	Implements mini-swe-agent instrumentor and patch lifecycle.
instrumentation-loongsuite/loongsuite-instrumentation-minisweagent/pyproject.toml	Adds mini-swe-agent packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/version.py	Introduces claw-eval instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/package.py	Declares claw-eval instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/internal/init.py	Initializes claw-eval internal package.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/src/opentelemetry/instrumentation/claw_eval/config.py	Adds claw-eval env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-claw-eval/pyproject.toml	Adds claw-eval packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/tests/test_instrumentor.py	Adds BFCL v4 smoke tests (graceful instrument/uninstrument).
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/tests/init.py	Initializes BFCL v4 test package.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/version.py	Introduces BFCL v4 instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/utils.py	Adds BFCL v4 content-capture helper utilities.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/package.py	Declares BFCL v4 instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/threading_propagation.py	Adds BFCL v4 context-propagating ThreadPoolExecutor.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/state.py	Adds BFCL v4 per-thread ReAct state via contextvars.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/provider.py	Adds BFCL v4 provider inference/mapping logic.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/attributes.py	Adds BFCL v4 attribute key constants.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/src/opentelemetry/instrumentation/bfclv4/internal/init.py	Initializes BFCL v4 internal package.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/README.md	Documents BFCL v4 instrumentation usage/topology.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/pyproject.toml	Adds BFCL v4 packaging metadata and deps.
instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/CHANGELOG.md	Adds BFCL v4 changelog for initial release.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/version.py	Introduces AlgoTune instrumentation version module.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/package.py	Declares AlgoTune instrumentation “instruments” metadata.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/internal/utils.py	Adds AlgoTune shared utilities (truncation, provider inference, STEP cleanup).
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/internal/init.py	Initializes AlgoTune internal package.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/src/opentelemetry/instrumentation/algotune/config.py	Adds AlgoTune env-var driven configuration.
instrumentation-loongsuite/loongsuite-instrumentation-algotune/pyproject.toml	Adds AlgoTune packaging metadata and deps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

sipercai

Thanks for adding support for multiple benchmark frameworks. I validated this branch in a clean Python 3.12 environment instead of relying only on static review.

A few parts look healthy: opentelemetry-util-genai passes its test suite (264 passed, 1 skipped), BFCLv4 tests pass (11 passed), WideSearch tests pass (24 passed), and the new instrumentors can import/instrument/uninstrument without crashing when optional benchmark frameworks are absent.

However, I do not think this is ready to merge yet because several changed packages fail their own focused runtime checks:

OpenHands: 6 failed, 7 passed. The failures include missing AGENT/STEP spans, broken cross-thread trace continuity, and asyncio.CancelledError being marked as an ERROR span.
Slop-code: 10 failed, 12 passed. The failures include missing STEP spans, incorrect workflow -> task parentage, span-name mismatches, and an AgentRunner.run hook target that is not present in the test fixture.
Vita: 11 failed, 2 passed. The tests require the real vita framework, but the PR does not provide a reproducible test setup for that dependency.
WildTool: test collection fails with 5 import errors because tests import wtb, but the package test extras do not install it.
Terminus2: the test directory collects 0 tests, so there is currently no focused telemetry-contract coverage for that new instrumentation.

Please fix the failing focused tests and make each new plugin's test setup reproducible before merge. I would also expect the PR to clean up the readiness issues already visible in static checks: precommit/formatting failures, readiness metadata gaps, and non-reproducible local file:// entries in packages.txt.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

sipercai · 2026-05-27T05:50:55Z

One practical way to reduce this PR's LoongSuite CI surface after #201 lands is to make the new benchmark plugins first-class LoongSuite packages in tox-loongsuite.ini.

Please do not rely on ad-hoc pytest commands for these packages. The dynamic detector added in #201 can only select a scoped matrix when the changed package is registered in tox-loongsuite.ini; otherwise the package is treated as unknown and CI falls back to the full LoongSuite matrix.

Suggested flow after #201 is merged and this branch is rebased onto the latest main:

Add one env family per new package to the envlist.

Example pattern:
```
; loongsuite-instrumentation-bfclv4
py3{10,11,12,13}-test-loongsuite-instrumentation-bfclv4
lint-loongsuite-instrumentation-bfclv4

; loongsuite-instrumentation-openhands
py3{10,11,12,13}-test-loongsuite-instrumentation-openhands
lint-loongsuite-instrumentation-openhands
```
Please repeat this pattern for each new package in this PR, choosing the supported Python versions from each package's pyproject.toml:
- loongsuite-instrumentation-algotune
- loongsuite-instrumentation-bfclv4
- loongsuite-instrumentation-claw-eval
- loongsuite-instrumentation-minisweagent
- loongsuite-instrumentation-openhands
- loongsuite-instrumentation-slop-code
- loongsuite-instrumentation-terminus2
- loongsuite-instrumentation-vita
- loongsuite-instrumentation-webarena
- loongsuite-instrumentation-widesearch
- loongsuite-instrumentation-wildtool
Add dependency selectors under [testenv] deps for each package.

Example pattern:
```
bfclv4: {[testenv]test_deps}
bfclv4: -r {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/test-requirements.txt

openhands: {[testenv]test_deps}
openhands: -r {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-openhands/test-requirements.txt
```
If a package needs framework-specific packages to run its tests, please put them in that package's test-requirements.txt (or a clearly named requirements file) so the tox env is reproducible from a clean checkout.

Add commands for each package.

Example pattern:

test-loongsuite-instrumentation-bfclv4: pytest {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-bfclv4/tests {posargs}
lint-loongsuite-instrumentation-bfclv4: python -m ruff check {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-bfclv4

test-loongsuite-instrumentation-openhands: pytest {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-openhands/tests {posargs}
lint-loongsuite-instrumentation-openhands: python -m ruff check {toxinidir}/instrumentation-loongsuite/loongsuite-instrumentation-openhands

Make sure every registered package has meaningful tests.

Registering a package with an empty test directory would reduce the CI matrix but would not validate the instrumentation. For example, a package like Terminus2 should have at least focused lifecycle/span-tree tests before being registered as passing CI.

Regenerate and validate the workflow metadata.

tox -c tox-loongsuite.ini -e py311-test-detect-loongsuite-changes
tox -c tox-loongsuite.ini -e generate-loongsuite
tox -c tox-loongsuite.ini -e py312-test-loongsuite-instrumentation-bfclv4 -- -ra
tox -c tox-loongsuite.ini -e lint-loongsuite-instrumentation-bfclv4

Repeat the focused test-* / lint-* tox envs for the other newly registered packages.

One more note: this PR currently also changes util/opentelemetry-util-genai/.... In #201's detector, util-genai is intentionally treated as shared LoongSuite surface, so that change still triggers a full LoongSuite run. If the goal is package-scoped CI only, please split or remove the util-genai change from this PR.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

…ard v4 Introduce loongsuite-instrumentation-bfclv4 covering BFCL v4 (bfcl_eval) per the design in llm-dev/bfclv4/execute.md: * ENTRY span around bfcl_eval._llm_response_generation.generate_results, with a narrow swap of that module's ThreadPoolExecutor name to a contextvars-propagating subclass so worker threads inherit the ENTRY trace context. * AGENT span around BaseHandler.inference (kind=AGENT, op=invoke_agent), picking up token usage from the metadata BFCL writes back. * STEP spans created reflectively for every concrete handler discovered via bfcl_eval.constants.model_config.MODEL_CONFIG_MAPPING; each STEP re-invokes the handler's _parse_query_response_* to harvest token counts and latency. * Per-call TOOL spans emitted from bfcl_eval.eval_checker.multi_turn_eval.multi_turn_utils.execute_multi_turn_func_call (one span per func_call entry in the batch). * Provider override mapping that routes OSSMODEL handlers to vllm/sglang based on args.backend, plus contextvars-based bfcl.turn_idx / gen_ai.react.round tracking. LLM spans are intentionally not created by this plugin; they continue to be produced by the downstream vendor SDK probes (OpenAI / Anthropic / DashScope / etc.). (cherry picked from commit cccf54b) Co-authored-by: 123liuziming <[email protected]>

(cherry picked from commit 3d08e03) Co-authored-by: 123liuziming <[email protected]>

The AGENT/ENTRY spans previously JSON-stringified BFCL's nested ``[[{...}],[{...}]]`` question/result structure into a single message content, producing the surprising "content has a serialised array inside it" pattern. Now flattens the structure one level so each role/content pair becomes its own ``{role, parts:[{type,content}]}`` message on both ``gen_ai.input.messages`` and ``gen_ai.output.messages``. Also surfaces BFCL-captured error strings (``Error during inference:``, ``Error during execution:``) and unhandled wrapped exceptions via ``span.record_exception`` so spans marked ERROR carry a visible exception event with the error message instead of just a status code. Change-Id: I372e87b683f907431889ac4d306bf6c235ec36ac Co-developed-by: Claude <[email protected]>

…ments DashScope's OpenAI-compatible streaming response can emit tool-call argument deltas with `arguments=None`, which made `"".join(tool_call.arguments)` raise `TypeError: sequence item N: expected str instance, NoneType found` during span finalization and aborted every bfclv4 benchmark run. Filter out None parts at both legacy and current stream-wrapper join sites. Change-Id: I76b8e0104dacac1a1ecebd41be74283700d46f2c Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.7 <[email protected]>

…hecker multi_turn_checker re-exports execute_multi_turn_func_call and invokes it twice per entry during evaluation: once to replay the model's tool-call trace, once to replay the ground-truth trace. Both calls run *after* inference, outside any ENTRY/AGENT/STEP context, so each one produced a trace-rooted orphan TOOL span. Drop checker from the wrap targets; the two inference-side bindings (multi_turn_utils source module + base_handler re-export) still cover every TOOL span we actually want. Change-Id: Ife24d8ba2595fc2c10a0dcfc47de5521ad67d3c7 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.7 <[email protected]>

Wrapping multi_turn_utils.execute_multi_turn_func_call replaces the source module attribute, which is fine at instrument time but causes orphan TOOL spans later: bfcl_eval/__main__.py lazily imports multi_turn_checker.py during `bfcl evaluate`, and that import resolves the wrapper into checker's local binding. Ground-truth replay then emits TOOL spans outside any ENTRY/AGENT/STEP context. Wrap only base_handler's local binding, which is set during BaseHandler.inference wrap in step 2 and is the sole inference-time caller of execute_multi_turn_func_call. Change-Id: I9145484b1fa9b8bf9cc3899b6a551aec62b856ac Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.7 <[email protected]>

* ENTRY span no longer mirrors OpenInference input.value / output.value; the same payload is still on gen_ai.input.messages / gen_ai.output.messages. * Bump _to_jsonable max_depth from 3 to 8 so AGENT message parts (and their tool_call arguments / tool_call_response results) serialize as real JSON objects instead of Python repr strings with single quotes. Change-Id: I248356773980d8688e4d76336d92f3935c2e15b2 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.7 <[email protected]>

… to AGENT span The AGENT span now records the system prompt template and tool schema, enabling full observability of agent configuration in trace data. Change-Id: I02509ebd7e6a2ff1bdaa64d257fdb231df7f36b5 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

… AGENT span ENTRY span now captures problem_names/model from CLI kwargs. Runner ENTRY span falls back to run_spec.template when problem fields are absent. AGENT span extracts system_prompt from instance.system_template, instance.system_prompt, or the first system message in _messages/_steps. Change-Id: I222c187fc780aefe96d64f452a13a22f0bef41ae Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

… STEP span When using the Python API (DefaultAgent.run() directly), no ENTRY span was produced because the existing ENTRY was only on the CLI Typer path. Now DefaultAgentRunWrapper auto-creates an ENTRY span (with input messages) when one is not already active, using a ContextVar guard to avoid double ENTRY when invoked via the `mini` CLI. Also suppress the empty STEP span that occurs when step/cost limits are already exceeded — the step immediately raises LimitsExceeded with no LLM or TOOL work, producing noise in traces. Change-Id: Iae01f70a3c0f7b3a0daa8c03e35e9d482b9fa8d8 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

…ions to AGENT span The AGENT span now records the system prompt (Current Date from env_info) and available tool schemas from the test entry, enabling full observability of agent configuration in trace data. Change-Id: I64ed458018f5f138a06787b1d8cd938e662a8284 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

WebArena's agent/__init__.py re-exports construct_agent via `from .agent import construct_agent`. When wrapt wraps the function on the agent.agent module, the cached reference in agent's namespace remains unwrapped. This causes `from agent import construct_agent` (the common import pattern in run.py) to bypass instrumentation. Add ("agent", "construct_agent") to _PATCH_TARGETS so both references are wrapped and the AGENT(create_agent) span fires regardless of import style. Change-Id: I443c244fa9bb630491852849bea9a75f66d33d58 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

…, tool.definitions per ARMS spec - Fix capture_message_content() to accept SPAN_ONLY as a valid truthy value (matching algotune/wildtool behavior) - Add gen_ai.input.messages (user intent + previous action) to AGENT span - Add gen_ai.output.messages (LLM raw prediction) to AGENT span - Add gen_ai.system_instructions (from PromptConstructor intro) to AGENT span - Add gen_ai.tool.definitions (browser action types) to AGENT spans - Add input.mime_type / output.mime_type to TASK span Implements the field requirements from: https://help.aliyun.com/zh/arms/application-monitoring/developer-reference/llm-trace-field-definition-description Change-Id: If233e99bb7354fd5f8ff6e23891d9fe4e95a2503 Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

Per ARMS spec, AGENT spans should only carry gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions, and gen_ai.tool.definitions — not the generic input.value/output.value. Change-Id: I7861f4f00a41e2955097343ba032e22a9fa42c1f Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

…CHAIN 1. ENTRY span now records gen_ai.input.messages (task intent) and gen_ai.output.messages (initial observation after env.reset) 2. CHAIN span now records output.value with step/tool summary on close 3. Config file reader handles list format (webarena uses [config] arrays) Change-Id: I12e1d4e2022a0fd1882f11edfb87bf3e25c2a84c Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

ENTRY span should only use gen_ai.input.messages, not input.value. Change-Id: I76d32679c40107adec84986ef6d8de01bfb06a0a Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

Change-Id: I6a71a6150bb92d336ef19088ef521b6dff5be2ab

Change-Id: I604506992bda330e7ff62705e50ca79173bf8f3b

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

sipercai

Re-reviewed after the latest updates.

The previously requested admission blockers have been addressed: the new benchmark packages are covered by the LoongSuite tox matrix, the focused package tests are reproducible, generated workflow/config files are aligned, and the CI checks are now passing. The latest minisweagent CI fix is test-only and does not change instrumentation behavior.

Approving from my side.

Copilot AI review requested due to automatic review settings May 26, 2026 08:17

Copilot started reviewing on behalf of 123liuziming May 26, 2026 08:17 View session

github-actions Bot assigned 123liuziming, Cirilla-zmh and ralf0131 May 26, 2026

Copilot AI reviewed May 26, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 26, 2026 09:44

Copilot started reviewing on behalf of 123liuziming May 26, 2026 09:45 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 26, 2026 11:50

Copilot started reviewing on behalf of 123liuziming May 26, 2026 12:16 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

sipercai requested changes May 27, 2026

View reviewed changes

Comment thread instrumentation-loongsuite/loongsuite-instrumentation-wildtool/pyproject.toml Outdated

Comment thread instrumentation-loongsuite/loongsuite-instrumentation-wildtool/pyproject.toml

Comment thread instrumentation-loongsuite/loongsuite-instrumentation-vita/pyproject.toml

Copilot AI review requested due to automatic review settings May 27, 2026 05:14

Copilot started reviewing on behalf of 123liuziming May 27, 2026 05:14 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 27, 2026 07:43

Copilot started reviewing on behalf of 123liuziming May 27, 2026 07:43 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 27, 2026 08:51

Copilot started reviewing on behalf of 123liuziming May 27, 2026 08:52 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 27, 2026 10:33

Copilot AI reviewed May 27, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 28, 2026 08:50

Copilot AI reviewed May 28, 2026

View reviewed changes

musi and others added 2 commits May 28, 2026 18:01

feat: support bfclv4

38bb847

(cherry picked from commit 3d08e03) Co-authored-by: 123liuziming <[email protected]>

123liuziming and others added 23 commits May 28, 2026 18:01

webarena: remove input.value from ENTRY span

3506716

ENTRY span should only use gen_ai.input.messages, not input.value. Change-Id: I76d32679c40107adec84986ef6d8de01bfb06a0a Co-developed-by: Claude <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>

add ut

acb9391

Change-Id: I6a71a6150bb92d336ef19088ef521b6dff5be2ab

test openhands

387cf53

Change-Id: I604506992bda330e7ff62705e50ca79173bf8f3b

chore: prepare benchmark instrumentation tests

5d8b911

chore: drop unrelated openai v2 change

334c176

chore: align benchmark package versions

49666b6

test: align minisweagent version expectation

c237d10

chore: regenerate benchmark bootstrap versions

71c2832

docs: align slop-code span tree

8234260

fix: preserve openai v2 stream argument guard

df155e7

sipercai force-pushed the feat/bench branch from 142ed34 to df155e7 Compare May 28, 2026 11:21

test: stabilize minisweagent trajectory mock

c6a0e3f

Copilot AI review requested due to automatic review settings May 29, 2026 02:18

Copilot AI reviewed May 29, 2026

View reviewed changes

sipercai approved these changes May 29, 2026

View reviewed changes

steverao approved these changes May 29, 2026

View reviewed changes

steverao merged commit 6c50471 into main May 29, 2026
198 checks passed

Conversation

123liuziming commented May 26, 2026

Description

Type of change

How Has This Been Tested?

Does This PR Require a Core Repo Change?

Checklist:

Uh oh!

CLAassistant commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

sipercai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

sipercai commented May 27, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

sipercai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

CLAassistant commented May 26, 2026 •

edited

Loading