[bot] Merge master/0b20d26c into rel/dev#1651
Merged
Implements Path B: evaluate the dashboard-summary feature through the
dedicated synchronous endpoint (POST /api/v1/ai/workspaces/{ws}/summary),
which executes AFM server-side — no SSE or client-side result_id wrangling.
- SummaryClient: posts summary_input, maps the JSON summary into ChatResult
- DashboardSummaryEvaluator: rubric-based LLM judge (must_include / must_not_include /
rubric), scored per-criterion so quality_score is the fraction satisfied
- SummaryInput dataset field; dashboard_id is the only required input
- runner ChatBackend now receives the DatasetItem; CLI routes summary items
to SummaryClient and everything else to ChatClient
- registered as a lazy [llm-judge] evaluator; docs + example dataset + tests
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
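The per-criterion scoring described above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: `CriterionVerdict` and `quality_score` are hypothetical names, and returning 0.0 for an empty criterion list is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriterionVerdict:
    """Hypothetical container for one judged criterion."""
    criterion: str
    satisfied: bool


def quality_score(verdicts: List[CriterionVerdict]) -> float:
    """Return the fraction of criteria the judge marked as satisfied."""
    if not verdicts:
        # Assumption: with nothing to judge, report 0.0 rather than divide by zero.
        return 0.0
    return sum(1 for v in verdicts if v.satisfied) / len(verdicts)


verdicts = [
    CriterionVerdict("mentions revenue trend", True),
    CriterionVerdict("names the top region", True),
    CriterionVerdict("stays under five sentences", False),
]
score = quality_score(verdicts)  # two of three criteria satisfied
```

Scoring each criterion separately keeps the number interpretable: a drop in quality_score can be traced to the specific criteria that failed.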
The workspace ACTIVE_LLM_PROVIDER setting is keyed by type on the backend and may exist under any id (e.g. UI-generated). Reading it by the hardcoded id "activeLlmProvider" missed existing settings (-> "no active LLM provider"), and activating re-created a second setting of the same type (-> HTTP 409). Now look it up by setting_type via list_workspace_settings, and reuse the existing setting's id on activate so create_or_update performs an UPDATE.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
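A rough sketch of the lookup-by-type idea, using plain data objects rather than the SDK. `Setting`, `find_by_type`, and `setting_id_for_activation` are hypothetical names introduced for illustration; the real code goes through list_workspace_settings and create_or_update, whose signatures are not shown here.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class Setting:
    """Hypothetical stand-in for a workspace setting record."""
    id: str
    setting_type: str
    content: dict = field(default_factory=dict)


def find_by_type(settings: Iterable[Setting], setting_type: str) -> Optional[Setting]:
    """Return the first setting with the given type, whatever its id is."""
    for setting in settings:
        if setting.setting_type == setting_type:
            return setting
    return None


def setting_id_for_activation(
    settings: Iterable[Setting],
    setting_type: str = "ACTIVE_LLM_PROVIDER",
    default_id: str = "activeLlmProvider",
) -> str:
    """Reuse an existing setting's id so the write is an update, not a second create."""
    existing = find_by_type(settings, setting_type)
    return existing.id if existing is not None else default_id


# A setting created through the UI may carry an arbitrary id.
ui_settings = [Setting(id="ui-generated-123", setting_type="ACTIVE_LLM_PROVIDER")]
chosen_id = setting_id_for_activation(ui_settings)
```

Falling back to the hardcoded id only when no setting of that type exists avoids creating a duplicate of the same type, which is what produced the 409.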
- evaluator: score must_not_include via plain presence-detection then invert, instead of asking the judge to reason about avoidance under an "EXPECTED OUTPUT" label (which inverted verdicts on negative criteria)
- reporting: make the FAIL note evaluator-agnostic (list whichever boolean checks are False) instead of the visualization-only "no visualization created"
- examples: replace the single template with three self-describing cases (full-dashboard, selected-visualizations for scoping, and format-hint-brief for format adherence), each with a small gating set and rubric aligned to the endpoint's actual output
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
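The "detect presence, then invert" approach for negative criteria might look roughly like this. The function names are hypothetical, and `substring_detector` is only a stand-in for the LLM judge so the sketch runs on its own.

```python
from typing import Callable, Dict, List

# A detector answers one neutral question: "is this item present in the summary?"
PresenceDetector = Callable[[str, str], bool]


def score_must_not_include(
    summary: str,
    forbidden: List[str],
    is_present: PresenceDetector,
) -> Dict[str, bool]:
    """Map each forbidden item to True when the criterion is satisfied (item absent)."""
    return {item: not is_present(summary, item) for item in forbidden}


def substring_detector(summary: str, item: str) -> bool:
    """Stand-in for the judge: case-insensitive substring match (assumption)."""
    return item.lower() in summary.lower()


summary = "Revenue grew 12% quarter over quarter, driven by the EMEA region."
results = score_must_not_include(
    summary, ["forecast", "EMEA"], substring_detector
)
```

The judge is only ever asked a positive question, so the inversion happens in code and cannot be confused by how the prompt labels the criterion.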
…-eval
# Conflicts:
#	packages/gooddata-eval/README.md
#	packages/gooddata-eval/src/gooddata_eval/cli/main.py
#	packages/gooddata-eval/src/gooddata_eval/core/workspace.py
#	packages/gooddata-eval/tests/test_workspace.py
Running without --model takes the default branch, which set provider_name but never provider_type, causing UnboundLocalError when building ResolvedModel. Initialize it to "" alongside provider_name.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
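The failure mode can be reproduced in a few lines. This is a simplified sketch, not the CLI code: `resolve_buggy`, `resolve_fixed`, and `_lookup_provider` are hypothetical, and a plain tuple stands in for ResolvedModel.

```python
from typing import Optional, Tuple


def _lookup_provider(model: str) -> Tuple[str, str]:
    """Hypothetical lookup; returns (provider_name, provider_type)."""
    return ("example-provider", "EXAMPLE")


def resolve_buggy(model: Optional[str]) -> Tuple[str, str]:
    if model is None:
        provider_name = ""  # default branch forgets provider_type
    else:
        provider_name, provider_type = _lookup_provider(model)
    # provider_type is a local that was never bound on the default branch.
    return provider_name, provider_type


def resolve_fixed(model: Optional[str]) -> Tuple[str, str]:
    # Initialize both names up front so every branch leaves them bound.
    provider_name = ""
    provider_type = ""
    if model is not None:
        provider_name, provider_type = _lookup_provider(model)
    return provider_name, provider_type


try:
    resolve_buggy(None)
    buggy_raised = False
except UnboundLocalError:
    buggy_raised = True
```

Because the name is assigned somewhere in the function, Python treats it as local everywhere in it, so reading it on the untouched branch raises UnboundLocalError rather than falling back to a global.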
dashboard_summary items sourced from a Langfuse dataset previously lost summary_input (SummaryClient then failed with "missing summary_input"). Map it from the item input object, metadata, or expectedOutput so summary datasets round-trip through Langfuse. The --langfuse/--langfuse-dataset contract is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
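One way to picture the fallback mapping is shown below. It is only a sketch under stated assumptions: the item is treated as a plain dict, the key names `input`, `metadata`, and `expectedOutput` and their precedence are inferred from the message above, and `extract_summary_input` is a hypothetical helper rather than the package's function.

```python
from typing import Any, Dict, Optional


def extract_summary_input(item: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the first summary_input found in input, metadata, or expectedOutput."""
    # Assumption: sources are checked in this order and the first hit wins.
    for key in ("input", "metadata", "expectedOutput"):
        section = item.get(key)
        if not isinstance(section, dict):
            continue  # e.g. a plain-string input carries no nested fields
        candidate = section.get("summary_input")
        if isinstance(candidate, dict):
            return candidate
    return None


item_from_metadata = {
    "input": {"question": "Summarize the dashboard"},
    "metadata": {"summary_input": {"dashboard_id": "dash-1"}},
}
found = extract_summary_input(item_from_metadata)
missing = extract_summary_input({"input": "plain text question"})
```

Returning None when nothing is found leaves it to the caller to decide whether a missing summary_input is an error for that item type.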
Per PR review: the dashboard_summary example items were uploaded to a Langfuse dataset, so the local examples/summary_dataset files are no longer needed in the repo. The README still documents the format inline.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
ty rejected passing dict|None where SummaryInput|None is expected (pydantic coerces at runtime, the type checker doesn't). Build the SummaryInput via model_validate so the static type matches DatasetItem.summary_input.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Trace metadata is not a Langfuse dashboard breakdown dimension, so per-model charts/filters were impossible. Expose the model on the first-class trace `version` field so dashboards can break down / filter by "Version".
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
feat(eval): evaluate the dashboard-summary skill via the /summary endpoint
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           rel/dev    #1651   +/-   ##
========================================
  Coverage    79.18%   79.18%
========================================
  Files          232      232
  Lines        15791    15791
========================================
  Hits         12504    12504
  Misses        3287     3287
🚀 Automated PR to perform merge from master into rel/dev with changes up to 0b20d26 (created by https://github.com/gooddata/gooddata-python-sdk/actions/runs/27128109813).