feat(eval): evaluate the dashboard-summary skill via the /summary endpoint by romrak · Pull Request #1646 · gooddata/gooddata-python-sdk

romrak · 2026-06-04T16:57:33Z

What

Adds a dashboard_summary test kind to gooddata-eval so the GoodData dashboard-summary feature can be evaluated end-to-end, plus two supporting fixes uncovered while running it locally.

1. `dashboard_summary` test kind (Path B — dedicated `/summary` endpoint)

SummaryClient: posts summary_input to POST /api/v1/ai/workspaces/{ws}/summary (AFM is executed server-side, so there is no SSE and no client-side result_id handling) and maps the JSON summary into a ChatResult.
DashboardSummaryEvaluator: rubric-based LLM judge. expected_output is a rubric of must_include / must_not_include / rubric criteria. Each criterion is scored independently, so quality_score is the fraction of criteria satisfied. An item passes only if every gating (must_include / must_not_include) criterion is satisfied; rubric items contribute to the score but never gate the pass. must_not_include is scored by asking the judge whether the content is present and then inverting the answer, because asking it to reason about "avoidance" under an EXPECTED OUTPUT label flipped verdicts.
SummaryInput dataset field (only dashboard_id is required).
Runner ChatBackend now receives the whole DatasetItem; the CLI routes summary items to SummaryClient and everything else to ChatClient.
Registered as a lazy [llm-judge] evaluator (skipped without the extra, like general_question).
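
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: `CriterionVerdict`, `satisfied`, and `score` are hypothetical names, and the judge's per-criterion answer is assumed to arrive as a plain `present` boolean. The handling of an empty rubric is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class CriterionVerdict:
    # Hypothetical container for one judged criterion.
    kind: str      # "must_include", "must_not_include", or "rubric"
    text: str      # the criterion as written in expected_output
    present: bool  # judge's answer to "does the summary contain this?"


def satisfied(verdict: CriterionVerdict) -> bool:
    # must_not_include is judged as plain presence and inverted here,
    # so the judge never has to reason about "avoidance".
    if verdict.kind == "must_not_include":
        return not verdict.present
    return verdict.present


def score(verdicts: list[CriterionVerdict]) -> tuple[float, bool]:
    # Returns (quality_score, passed).
    if not verdicts:
        return 0.0, False  # assumption: an empty rubric does not pass
    results = [satisfied(v) for v in verdicts]
    quality_score = sum(results) / len(results)
    # Only must_* criteria gate the pass; rubric items are graded only.
    gating = [ok for v, ok in zip(verdicts, results) if v.kind.startswith("must_")]
    return quality_score, all(gating)
```

With one satisfied must_include, one absent must_not_include, and one failed rubric item, this sketch yields a quality_score of 2/3 and still passes, because the failed criterion is not gating.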

2. Fix: resolve the active LLM provider by type, not a fixed id

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id. Reading it by the hardcoded id activeLlmProvider missed existing settings ("no active LLM provider"), and activating created a second setting of the same type (HTTP 409). Now it looks the setting up by setting_type and reuses the existing id on activate (UPDATE, not CREATE).
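
A rough sketch of the lookup-by-type idea, using plain dictionaries in place of the SDK's setting objects. The `id` / `type` keys and both helper names are assumptions made for illustration; the actual change goes through the workspace settings API rather than these helpers.

```python
from typing import Optional


def find_setting_by_type(settings: list[dict], setting_type: str) -> Optional[dict]:
    # Return the first setting whose type matches, whatever its id is.
    return next((s for s in settings if s.get("type") == setting_type), None)


def resolve_setting_id(
    settings: list[dict],
    setting_type: str = "ACTIVE_LLM_PROVIDER",
    default_id: str = "activeLlmProvider",
) -> str:
    # Reuse the existing id so a later write updates that setting
    # instead of creating a second one of the same type.
    existing = find_setting_by_type(settings, setting_type)
    return existing["id"] if existing else default_id
```

If a setting of that type already exists under a UI-generated id, `resolve_setting_id` returns that id; only when none exists does it fall back to the fixed default.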

3. Fix: evaluator-agnostic FAIL note in the console report

The FAIL "Notes" column hardcoded visualization check names and showed a misleading "no visualization created" for summary items. It now lists whichever boolean checks are False.
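
One way to build such an evaluator-agnostic note is sketched below. The function name and the shape of the `checks` mapping are hypothetical; the only idea taken from the description is "list whichever boolean checks are False".

```python
def fail_note(checks: dict[str, object]) -> str:
    # Keep only entries that are literally False; numeric scores such as
    # 0 or 0.0 are not boolean checks and are left out.
    failed = [name for name, value in checks.items() if value is False]
    return ", ".join(failed)
```

Because the note is derived from whatever checks an evaluator reports, a summary item never inherits visualization-specific wording.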

Docs & examples

README: dashboard_summary section (rubric shape, summary_input) + supported-test-kinds row.
Three self-describing example cases under examples/summary_dataset/: full-dashboard, selected-visualizations (scoping), and format-hint-brief (format adherence). Each rubric is aligned to the endpoint's actual output and uses a small gating set.

Testing

New tests: test_summary_client.py, test_summary_evaluator.py, plus workspace lookup-by-type tests.
ruff format and ruff check are clean; the full suite passes (115 tests).
Manually verified against a local gen-ai instance: endpoint path /summary and request/response shapes confirmed; all three example cases pass with the OpenAI judge.

Notes

Verified the endpoint path is /summary (not the /summarize referenced in some gen-ai notebooks). SummaryClient._PATH is the single place to change if it is ever renamed.
The chat-based summary skill (Path A — userContext + AFM result_ids) is intentionally not covered here; it is better smoke-tested via a Playwright e2e where the UI assembles that context. See docs/summary-eval.md (separate PR) for the strategy.
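
To illustrate the note about keeping the endpoint path in a single place, here is a small sketch. The `_PATH` template mirrors the path quoted in this PR; the `summary_url` helper and the host handling are assumptions, not the client's real code.

```python
# Single definition of the endpoint path, as described in the notes above.
_PATH = "/api/v1/ai/workspaces/{ws}/summary"


def summary_url(host: str, workspace_id: str) -> str:
    # Hypothetical helper: join the host and the formatted path,
    # tolerating a trailing slash on the host.
    return host.rstrip("/") + _PATH.format(ws=workspace_id)
```

If the endpoint were ever renamed, only the template string would need to change.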

🤖 Generated with Claude Code

Implements Path B: evaluate the dashboard-summary feature through the dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary), which executes AFM server-side, so there is no SSE or client-side result_id wrangling.

- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include / rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id (e.g. UI-generated). Reading it by the hardcoded id "activeLlmProvider" missed existing settings (-> "no active LLM provider"), and activating re-created a second setting of the same type (-> HTTP 409). Now look it up by setting_type via list_workspace_settings, and reuse the existing setting's id on activate so create_or_update performs an UPDATE. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

- evaluator: score must_not_include via plain presence-detection then invert, instead of asking the judge to reason about avoidance under an "EXPECTED OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases (full-dashboard, selected-visualizations for scoping, and format-hint-brief for format adherence), each with a small gating set and a rubric aligned to the endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…-eval

# Conflicts:
#   packages/gooddata-eval/README.md
#   packages/gooddata-eval/src/gooddata_eval/cli/main.py
#   packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#   packages/gooddata-eval/tests/test_workspace.py

Running without --model takes the default branch, which set provider_name but never provider_type, causing UnboundLocalError when building ResolvedModel. Initialize it to "" alongside provider_name. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
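
The shape of this fix can be sketched as below. `ResolvedModel` here is a hypothetical stand-in for the class named in the commit message, and the "type/name" model string and the "workspace-default" value are invented for illustration; only the pattern of binding both variables before branching reflects the commit.

```python
from typing import NamedTuple, Optional


class ResolvedModel(NamedTuple):
    # Hypothetical stand-in for the ResolvedModel named in the commit message.
    provider_name: str
    provider_type: str
    model: Optional[str]


def resolve_model(model: Optional[str]) -> ResolvedModel:
    # Bind both names up front so every branch below leaves them defined.
    provider_name = ""
    provider_type = ""
    if model is not None:
        # Hypothetical explicit branch: "type/name" is an invented format.
        provider_type, _, provider_name = model.partition("/")
    else:
        # Default branch: only the name is set, as in the reported bug.
        provider_name = "workspace-default"
    return ResolvedModel(provider_name, provider_type, model)
```

Without the initial `provider_type = ""`, the default branch would reach the `ResolvedModel(...)` call with `provider_type` never assigned, which is what raises UnboundLocalError.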

codecov · 2026-06-04T17:11:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.18%. Comparing base (27021da) to head (ed3303d).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1646      +/-   ##
==========================================
+ Coverage   79.10%   79.18%   +0.08%     
==========================================
  Files         231      232       +1     
  Lines       15718    15791      +73     
==========================================
+ Hits        12433    12504      +71     
- Misses       3285     3287       +2

dashboard_summary items sourced from a Langfuse dataset previously lost summary_input (SummaryClient then failed with "missing summary_input"). Map it from the item input object, metadata, or expectedOutput so summary datasets round-trip through Langfuse. The --langfuse/--langfuse-dataset contract is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
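
A minimal sketch of that fallback mapping, assuming a Langfuse item is available as a plain dictionary. The key name `summary_input` inside each container, the precedence order, and the function name are all assumptions; the commit message only states that the three locations are consulted.

```python
from typing import Any, Optional


def extract_summary_input(item: dict[str, Any]) -> Optional[dict[str, Any]]:
    # Check the three locations named in the commit message, in the
    # order they are listed there (the order itself is an assumption).
    for source in ("input", "metadata", "expectedOutput"):
        container = item.get(source)
        if isinstance(container, dict):
            candidate = container.get("summary_input")
            if isinstance(candidate, dict):
                return candidate
    return None
```

An item whose `input` is a plain string is skipped rather than treated as an error, so the lookup falls through to `metadata` and then `expectedOutput`.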

Per PR review: the dashboard_summary example items were uploaded to a Langfuse dataset, so the local examples/summary_dataset files are no longer needed in the repo. The README still documents the format inline. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

ty rejected passing dict|None where SummaryInput|None is expected (pydantic coerces at runtime, the type checker doesn't). Build the SummaryInput via model_validate so the static type matches DatasetItem.summary_input. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Trace metadata is not a Langfuse dashboard breakdown dimension, so per-model charts/filters were impossible. Expose the model on the first-class trace `version` field so dashboards can break down / filter by "Version". Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Roman Rakus and others added 3 commits June 4, 2026 17:26

romrak requested review from hkad98, jaceksan, lupko and pcerny as code owners June 4, 2026 16:57

Roman Rakus and others added 2 commits June 4, 2026 19:03

hkad98 reviewed Jun 5, 2026

Comment thread packages/gooddata-eval/examples/summary_dataset/summary_format_hint_brief.json Outdated

Roman Rakus and others added 4 commits June 5, 2026 12:37

zdenekmusil-gd approved these changes Jun 8, 2026

romrak merged commit 0b20d26 into master Jun 8, 2026
13 checks passed

romrak deleted the rr/summary-endpoint-eval branch June 8, 2026 09:22

romrak merged 9 commits into master from rr/summary-endpoint-eval

3 participants
