[bot] Merge master/0b20d26c into rel/dev #1651

Merged
yenkins-admin merged 10 commits into rel/dev from snapshot-master-0b20d26c-to-rel/dev
Jun 8, 2026

@yenkins-admin
Contributor

🚀 Automated PR to perform merge from master into rel/dev with changes up to 0b20d26 (created by https://github.com/gooddata/gooddata-python-sdk/actions/runs/27128109813).

Roman Rakus and others added 10 commits June 4, 2026 17:26
Implements Path B: evaluate the dashboard-summary feature through the
dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary),
which executes AFM server-side — no SSE or client-side result_id wrangling.

- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include /
  rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items
  to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
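
The per-criterion scoring described in the commit above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: `CriterionVerdict` and `quality_score` as written here are hypothetical names, and the judge's verdicts are assumed to be already available as booleans.

```python
from dataclasses import dataclass


@dataclass
class CriterionVerdict:
    """One judged criterion (hypothetical shape, not the package's real type)."""

    kind: str  # "must_include", "must_not_include", or "rubric"
    text: str
    satisfied: bool


def quality_score(verdicts: list[CriterionVerdict]) -> float:
    """Return the fraction of criteria satisfied; 0.0 when there are none."""
    if not verdicts:
        return 0.0
    return sum(v.satisfied for v in verdicts) / len(verdicts)


# Example verdicts; in the real evaluator these would come from the LLM judge.
example_verdicts = [
    CriterionVerdict("must_include", "mentions total revenue", True),
    CriterionVerdict("must_include", "names the top region", True),
    CriterionVerdict("must_not_include", "invents a forecast", True),
    CriterionVerdict("rubric", "stays under five sentences", False),
]

print(quality_score(example_verdicts))  # 0.75
```

Scoring each criterion separately keeps the result interpretable: a score of 0.75 means exactly one of four checks failed, and the failing criterion can be reported by name.
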
The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend
and may exist under any id (e.g. UI-generated). Reading it by the hardcoded
id "activeLlmProvider" missed existing settings (-> "no active LLM provider"),
and activating re-created a second setting of the same type (-> HTTP 409).

Now look it up by setting_type via list_workspace_settings, and reuse the
existing setting's id on activate so create_or_update performs an UPDATE.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
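
The lookup-by-type fix above can be illustrated with an in-memory stand-in for the backend. Everything here is an assumption made for the sketch: the `Setting` dataclass, the `content` shape, and the list that plays the role of the workspace's settings. Only the idea (find by `setting_type`, reuse the existing id) comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

ACTIVE_LLM_PROVIDER = "ACTIVE_LLM_PROVIDER"
DEFAULT_ID = "activeLlmProvider"  # used only when no setting of this type exists


@dataclass
class Setting:
    """Hypothetical stand-in for a workspace setting."""

    id: str
    setting_type: str
    content: dict


def find_by_type(settings: list[Setting], setting_type: str) -> Optional[Setting]:
    """Return the first setting of the given type, whatever its id is."""
    return next((s for s in settings if s.setting_type == setting_type), None)


def activate_provider(settings: list[Setting], provider_id: str) -> Setting:
    """Activate a provider, updating the existing setting when one is present."""
    existing = find_by_type(settings, ACTIVE_LLM_PROVIDER)
    setting_id = existing.id if existing else DEFAULT_ID
    updated = Setting(setting_id, ACTIVE_LLM_PROVIDER, {"id": provider_id})
    # In-memory equivalent of create-or-update: same id means replace in place.
    for index, current in enumerate(settings):
        if current.id == updated.id:
            settings[index] = updated
            return updated
    settings.append(updated)
    return updated
```

With a UI-generated setting such as `Setting("a1b2c3", "ACTIVE_LLM_PROVIDER", {...})` already present, `activate_provider` overwrites that entry instead of adding a second setting of the same type, which is the situation that produced the HTTP 409.
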
- evaluator: score must_not_include via plain presence-detection then invert,
  instead of asking the judge to reason about avoidance under an "EXPECTED
  OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean
  checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases —
  full-dashboard, selected-visualizations (scoping), and format-hint-brief
  (format adherence) — each with a small gating set and rubric aligned to the
  endpoint's actual output

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
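
A minimal sketch of the two evaluator and reporting changes above, under stated assumptions: the real evaluator asks an LLM judge whether a phrase is present, whereas this sketch substitutes a plain substring check, and the function names, check names, and note format are hypothetical.

```python
from typing import Callable


def substring_present(phrase: str, summary: str) -> bool:
    """Stand-in presence detector; the real one would be an LLM judge call."""
    return phrase.lower() in summary.lower()


def score_must_not_include(
    phrases: list[str],
    summary: str,
    is_present: Callable[[str, str], bool] = substring_present,
) -> dict[str, bool]:
    """Detect presence first, then invert: a criterion passes when absent."""
    return {phrase: not is_present(phrase, summary) for phrase in phrases}


def fail_note(checks: dict[str, bool]) -> str:
    """List whichever boolean checks are False; empty string when all pass."""
    failed = [name for name, ok in checks.items() if not ok]
    return "FAIL: " + ", ".join(failed) if failed else ""


summary_text = "Revenue grew 12% in Q2, led by EMEA."
negative_results = score_must_not_include(["forecast", "EMEA"], summary_text)
print(negative_results)  # {'forecast': True, 'EMEA': False}
print(fail_note({"summary_returned": True, "must_not_include_passed": False}))
```

Asking only "is this phrase present?" gives the judge a single unambiguous question; the inversion then happens in code, where it cannot be misread.
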
…-eval

# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py
Running without --model takes the default branch, which set provider_name
but never provider_type, causing UnboundLocalError when building ResolvedModel.
Initialize provider_type to "" alongside provider_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
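
The failure mode and fix above follow a common Python pattern, sketched here with hypothetical details: the `ResolvedModel` fields, the `provider/model` string format, and the default values are assumptions for illustration, not the CLI's real logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvedModel:
    """Hypothetical shape; only the field names echo the commit message."""

    provider_name: str
    provider_type: str
    model: str


def resolve_model(model: Optional[str]) -> ResolvedModel:
    # Initialize both names up front so every branch leaves them bound.
    provider_name = ""
    provider_type = ""  # without this line the default branch raises UnboundLocalError
    if model is not None:
        # Assumed "type/model" format, e.g. "openai/some-model".
        provider_type, _, model_name = model.partition("/")
        provider_name = f"{provider_type}-provider"
    else:
        # Default branch: sets provider_name only.
        provider_name = "workspace-default"
        model_name = "default"
    return ResolvedModel(provider_name, provider_type, model_name)


print(resolve_model(None).provider_type == "")  # True
```
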
dashboard_summary items sourced from a Langfuse dataset previously lost
summary_input (SummaryClient then failed with "missing summary_input").
Map it from the item input object, metadata, or expectedOutput so summary
datasets round-trip through Langfuse. The --langfuse/--langfuse-dataset
contract is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
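
The mapping described above can be sketched as a small fallback lookup. This assumes a Langfuse dataset item is available as a plain dict with `input`, `metadata`, and `expectedOutput` keys, and that the sources are tried in the order the commit message lists them; the function name and that precedence are assumptions.

```python
from typing import Any, Optional


def extract_summary_input(item: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return summary_input from the first source that carries it, else None."""
    for source in ("input", "metadata", "expectedOutput"):
        container = item.get(source)
        # The item input may also be a plain string, so check the type first.
        if isinstance(container, dict):
            candidate = container.get("summary_input")
            if isinstance(candidate, dict):
                return candidate
    return None


langfuse_item = {
    "input": "Summarize the dashboard",
    "metadata": {"summary_input": {"dashboard_id": "sales-overview"}},
    "expectedOutput": None,
}
print(extract_summary_input(langfuse_item))  # {'dashboard_id': 'sales-overview'}
```
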
Per PR review: the dashboard_summary example items were uploaded to a
Langfuse dataset, so the local examples/summary_dataset files are no longer
needed in the repo. The README still documents the format inline.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
ty rejected passing dict|None where SummaryInput|None is expected (pydantic
coerces at runtime, the type checker doesn't). Build the SummaryInput via
model_validate so the static type matches DatasetItem.summary_input.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Trace metadata is not a Langfuse dashboard breakdown dimension, so per-model
charts/filters were impossible. Expose the model on the first-class trace
`version` field so dashboards can break down / filter by "Version".

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
feat(eval): evaluate the dashboard-summary skill via the /summary endpoint
yenkins-admin merged commit 24c41b4 into rel/dev on Jun 8, 2026
1 check passed
yenkins-admin deleted the snapshot-master-0b20d26c-to-rel/dev branch on June 8, 2026 at 09:23
@codecov

codecov Bot commented Jun 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.18%. Comparing base (73a34c6) to head (0b20d26).
⚠️ Report is 521 commits behind head on rel/dev.

Additional details and impacted files
@@           Coverage Diff            @@
##           rel/dev    #1651   +/-   ##
========================================
  Coverage    79.18%   79.18%           
========================================
  Files          232      232           
  Lines        15791    15791           
========================================
  Hits         12504    12504           
  Misses        3287     3287           
