feat(eval): evaluate the dashboard-summary skill via the /summary endpoint #1646

Merged: romrak merged 9 commits into master from rr/summary-endpoint-eval on Jun 8, 2026
romrak (Contributor) commented Jun 4, 2026

What

Adds a dashboard_summary test kind to gooddata-eval so the GoodData dashboard-summary feature can be evaluated end-to-end, plus two supporting fixes uncovered while running it locally.

1. dashboard_summary test kind (Path B — dedicated /summary endpoint)

  • SummaryClient: posts summary_input to POST /api/v1/ai/workspaces/{ws}/summary (AFM is executed server-side, so there is no SSE and no client-side result_id handling) and maps the JSON summary into a ChatResult.
  • DashboardSummaryEvaluator: rubric-based LLM judge. expected_output is a rubric with must_include, must_not_include, and rubric criteria. Each criterion is scored independently, so quality_score is the fraction of criteria satisfied. Passing requires every gating (must_*) criterion to be satisfied; rubric items affect the score only. must_not_include is scored by plain presence detection followed by inversion, because asking the judge to reason about "avoidance" under an EXPECTED OUTPUT label flipped its verdicts.
  • SummaryInput dataset field (only dashboard_id is required).
  • Runner ChatBackend now receives the whole DatasetItem; the CLI routes summary items to SummaryClient and everything else to ChatClient.
  • Registered as a lazy [llm-judge] evaluator (skipped without the extra, like general_question).
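
The scoring rule described for DashboardSummaryEvaluator can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: `Criterion`, `score_summary`, and `keyword_judge` are hypothetical names, the LLM judge is replaced by a substring check, and returning 0.0 for an empty rubric is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    text: str
    kind: str  # "must_include", "must_not_include", or "rubric"


def keyword_judge(summary: str, criterion_text: str) -> bool:
    # Stand-in for the LLM judge (assumption): plain presence detection.
    return criterion_text.lower() in summary.lower()


def score_summary(
    summary: str,
    criteria: List[Criterion],
    is_present: Callable[[str, str], bool],
) -> Tuple[float, bool]:
    results = []
    for criterion in criteria:
        present = is_present(summary, criterion.text)
        # must_not_include: ask only "is it present?", then invert the answer.
        satisfied = (not present) if criterion.kind == "must_not_include" else present
        results.append((criterion, satisfied))

    # quality_score is the fraction of all criteria that are satisfied.
    quality = sum(ok for _, ok in results) / len(results) if results else 0.0
    # Only must_* criteria gate the pass; rubric items affect the score only.
    passed = all(ok for criterion, ok in results if criterion.kind != "rubric")
    return quality, passed


criteria = [
    Criterion("revenue", "must_include"),
    Criterion("forecast", "must_not_include"),
    Criterion("top region", "rubric"),
]
quality, passed = score_summary(
    "Revenue grew 12% quarter over quarter.", criteria, keyword_judge
)
```

In this example both gating criteria are satisfied and the rubric item is not, so the item passes with a quality score of 2/3.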

2. Fix: resolve the active LLM provider by type, not a fixed id

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id. Reading it by the hardcoded id activeLlmProvider missed existing settings ("no active LLM provider"), and activating created a second setting of the same type (HTTP 409). Now it looks the setting up by setting_type and reuses the existing id on activate (UPDATE, not CREATE).
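
A minimal in-memory sketch of the lookup-by-type behaviour described above. The `Setting` class, the `content` shape, and the helper names are hypothetical and stand in for the real SDK calls, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

SETTING_TYPE = "ACTIVE_LLM_PROVIDER"
DEFAULT_ID = "activeLlmProvider"  # only used when no setting of this type exists yet


@dataclass
class Setting:
    id: str
    setting_type: str
    content: dict  # shape is an assumption for illustration


def find_by_type(settings: List[Setting], setting_type: str) -> Optional[Setting]:
    # Match on the type, because the id may have been generated elsewhere (e.g. by the UI).
    return next((s for s in settings if s.setting_type == setting_type), None)


def activate(settings: List[Setting], provider_id: str) -> Setting:
    existing = find_by_type(settings, SETTING_TYPE)
    if existing is not None:
        # Reuse the existing id so this is an UPDATE, not a second CREATE.
        existing.content = {"id": provider_id}
        return existing
    created = Setting(DEFAULT_ID, SETTING_TYPE, {"id": provider_id})
    settings.append(created)
    return created


settings = [Setting("ui-generated-123", SETTING_TYPE, {"id": "old-provider"})]
result = activate(settings, "new-provider")
```

Here `settings` still holds a single entry after activation, and `result.id` is the pre-existing `ui-generated-123` rather than the default id.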

3. Fix: evaluator-agnostic FAIL note in the console report

The FAIL "Notes" column hardcoded visualization check names and showed a misleading "no visualization created" for summary items. It now lists whichever boolean checks are False.
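
One way to build such a note, as a rough sketch: the function name and the check names in the usage line are hypothetical, and the real report code may format the note differently.

```python
from typing import Dict


def fail_note(checks: Dict[str, object]) -> str:
    # List every check whose value is exactly False; non-boolean entries
    # such as a numeric quality score are ignored.
    failed = [name for name, value in checks.items() if value is False]
    return ", ".join(failed)


note = fail_note({"gating_passed": False, "has_summary": True, "quality_score": 0.0})
```

Because the comparison uses `is False`, a numeric score of 0.0 is not reported as a failed check.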

Docs & examples

  • README: dashboard_summary section (rubric shape, summary_input) + supported-test-kinds row.
  • Three self-describing example cases under examples/summary_dataset/: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each rubric is aligned to the endpoint's actual output and uses a small gating set.

Testing

  • New tests: test_summary_client.py, test_summary_evaluator.py, plus workspace lookup-by-type tests.
  • ruff format/check clean; full suite 115 passed.
  • Manually verified against a local gen-ai instance: endpoint path /summary and request/response shapes confirmed; all three example cases pass with the OpenAI judge.

Notes

  • Verified the endpoint path is /summary (not the /summarize referenced in some gen-ai notebooks). SummaryClient._PATH is the single place to change if it is ever renamed.
  • The chat-based summary skill (Path A — userContext + AFM result_ids) is intentionally not covered here; it is better smoke-tested via a Playwright e2e where the UI assembles that context. See docs/summary-eval.md (separate PR) for the strategy.
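
To illustrate the first note, here is a hedged sketch of how a single `_PATH` constant can keep the endpoint path in one place. `SummaryClientSketch`, its constructor arguments, the bearer-token header, and the request body shape are assumptions for illustration; only the path template comes from this PR. The request is built but not sent.

```python
import json
from urllib.request import Request


class SummaryClientSketch:
    # Single place to change if the endpoint is ever renamed.
    _PATH = "/api/v1/ai/workspaces/{ws}/summary"

    def __init__(self, host: str, workspace_id: str, token: str) -> None:
        self.host = host.rstrip("/")
        self.workspace_id = workspace_id
        self.token = token

    def build_request(self, summary_input: dict) -> Request:
        url = self.host + self._PATH.format(ws=self.workspace_id)
        return Request(
            url,
            data=json.dumps(summary_input).encode("utf-8"),  # body shape is assumed
            method="POST",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.token}",  # auth scheme is assumed
            },
        )


client = SummaryClientSketch("https://example.test/", "demo", "secret-token")
request = client.build_request({"dashboard_id": "dash-1"})
```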

🤖 Generated with Claude Code

Roman Rakus and others added 3 commits June 4, 2026 17:26
Implements Path B: evaluate the dashboard-summary feature through the
dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary),
which executes AFM server-side — no SSE or client-side result_id wrangling.

- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include /
  rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items
  to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests
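
The routing in the fourth bullet could look roughly like this. It is a sketch under assumptions: the `DatasetItem` fields shown, the stub clients, and `make_backend` are hypothetical simplifications, not the package's real classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DatasetItem:
    id: str
    test_kind: str
    question: str = ""
    summary_input: Optional[dict] = None


class ChatClientStub:
    def run(self, item: DatasetItem) -> str:
        return f"chat:{item.question}"


class SummaryClientStub:
    def run(self, item: DatasetItem) -> str:
        if item.summary_input is None:
            raise ValueError("missing summary_input")
        return f"summary:{item.summary_input['dashboard_id']}"


def make_backend(
    chat: ChatClientStub, summary: SummaryClientStub
) -> Callable[[DatasetItem], str]:
    # The backend receives the whole item, so it can route on the test kind.
    def backend(item: DatasetItem) -> str:
        client = summary if item.test_kind == "dashboard_summary" else chat
        return client.run(item)

    return backend


backend = make_backend(ChatClientStub(), SummaryClientStub())
summary_result = backend(
    DatasetItem(id="1", test_kind="dashboard_summary", summary_input={"dashboard_id": "d1"})
)
chat_result = backend(DatasetItem(id="2", test_kind="general_question", question="hi"))
```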

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend
and may exist under any id (e.g. UI-generated). Reading it by the hardcoded
id "activeLlmProvider" missed existing settings (-> "no active LLM provider"),
and activating re-created a second setting of the same type (-> HTTP 409).

Now look it up by setting_type via list_workspace_settings, and reuse the
existing setting's id on activate so create_or_update performs an UPDATE.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

- evaluator: score must_not_include via plain presence-detection then invert,
  instead of asking the judge to reason about avoidance under an "EXPECTED
  OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean
  checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases —
  full-dashboard, selected-visualizations (scoping), and format-hint-brief
  (format adherence) — each with a small gating set and rubric aligned to the
  endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Roman Rakus and others added 2 commits June 4, 2026 19:03
…-eval

# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py

Running without --model takes the default branch, which set provider_name
but never provider_type, causing UnboundLocalError when building ResolvedModel.
Initialize it to "" alongside provider_name.
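
The failure mode and the fix can be reproduced in a few lines. This is an illustrative sketch only: `resolve` and the placeholder provider values are hypothetical, not the CLI's actual code.

```python
from typing import Optional, Tuple


def resolve(model: Optional[str]) -> Tuple[str, str]:
    provider_name = ""
    provider_type = ""  # the fix: without this line the else-branch leaves it unbound
    if model:
        # Hypothetical lookup for an explicitly requested model.
        provider_name, provider_type = "custom-provider", "OPENAI"
    else:
        # Default branch: only the name is set here.
        provider_name = "workspace-default"
    # Reading provider_type here raised UnboundLocalError before the fix.
    return provider_name, provider_type


default_result = resolve(None)
```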

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.18%. Comparing base (27021da) to head (ed3303d).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1646      +/-   ##
==========================================
+ Coverage   79.10%   79.18%   +0.08%     
==========================================
  Files         231      232       +1     
  Lines       15718    15791      +73     
==========================================
+ Hits        12433    12504      +71     
- Misses       3285     3287       +2     


Comment thread packages/gooddata-eval/examples/summary_dataset/summary_format_hint_brief.json Outdated
Roman Rakus and others added 4 commits June 5, 2026 12:37
dashboard_summary items sourced from a Langfuse dataset previously lost
summary_input (SummaryClient then failed with "missing summary_input").
Map it from the item input object, metadata, or expectedOutput so summary
datasets round-trip through Langfuse. The --langfuse/--langfuse-dataset
contract is unchanged.
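
A rough sketch of that fallback lookup, assuming the Langfuse item has already been converted to a plain dict. The function name, the dict shape, and the precedence (input, then metadata, then expectedOutput) are assumptions based on the order listed above.

```python
from typing import Optional


def extract_summary_input(item: dict) -> Optional[dict]:
    # Check each candidate location in turn and return the first dict found.
    for source_key in ("input", "metadata", "expectedOutput"):
        source = item.get(source_key)
        if isinstance(source, dict):
            candidate = source.get("summary_input")
            if isinstance(candidate, dict):
                return candidate
    return None


from_metadata = extract_summary_input(
    {"input": "Summarize the dashboard", "metadata": {"summary_input": {"dashboard_id": "d1"}}}
)
```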

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Per PR review: the dashboard_summary example items were uploaded to a
Langfuse dataset, so the local examples/summary_dataset files are no longer
needed in the repo. The README still documents the format inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

ty rejected passing dict|None where SummaryInput|None is expected (pydantic
coerces at runtime, the type checker doesn't). Build the SummaryInput via
model_validate so the static type matches DatasetItem.summary_input.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Trace metadata is not a Langfuse dashboard breakdown dimension, so per-model
charts/filters were impossible. Expose the model on the first-class trace
`version` field so dashboards can break down / filter by "Version".

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

@romrak romrak merged commit 0b20d26 into master Jun 8, 2026
13 checks passed
@romrak romrak deleted the rr/summary-endpoint-eval branch June 8, 2026 09:22
3 participants