
ci(nightly): fetch API keys from Key Vault via OIDC #120

Open

miguelgfierro wants to merge 9 commits into main from fix/actions-kv-secrets

Conversation

@miguelgfierro
Contributor

Summary

  • Removes direct use of ANTHROPIC_API_KEY, EMBEDDING_BINDING_HOST, and EMBEDDING_BINDING_API_KEY as GitHub repo secrets in nightly.yml
  • Replaces them with Azure Key Vault lookups (kv-firefly-signature) after OIDC login, with ::add-mask:: applied before writing to $GITHUB_ENV (see the sketch after this list)
  • Moves to per-job permissions blocks — benchmark no longer inherits issues:write from the test job
  • Simplifies the benchmark job: removes the "check secrets available" gate since the KV fetch either succeeds or fails the job explicitly
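
Roughly, the replacement flow in nightly.yml could look like the sketch below. The step name and the loop over name pairs are illustrative; the vault name, Key Vault secret names, and environment variable names are taken from this PR, and the `azure/login` inputs follow the repository-secret list under "Infrastructure required".

```yaml
permissions:
  id-token: write   # required so azure/login can use OIDC federation
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

  - name: Fetch API keys from Key Vault
    run: |
      for pair in "ANTHROPIC_API_KEY=ANTHROPIC-API-KEY" \
                  "EMBEDDING_BINDING_HOST=EMBEDDING-BINDING-HOST" \
                  "EMBEDDING_BINDING_API_KEY=EMBEDDING-BINDING-API-KEY"; do
        env_name="${pair%%=*}"
        kv_name="${pair#*=}"
        value=$(az keyvault secret show --vault-name kv-firefly-signature \
                  --name "$kv_name" --query value -o tsv)
        echo "::add-mask::$value"                 # mask before it can appear in logs
        echo "${env_name}=${value}" >> "$GITHUB_ENV"
      done
```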

Infrastructure required (see PR description)

Before this workflow runs successfully, the following must be in place:

  1. Key Vault secrets (in kv-firefly-signature):

    • ANTHROPIC-API-KEY
    • EMBEDDING-BINDING-HOST
    • EMBEDDING-BINDING-API-KEY
  2. OIDC app registration (firefly-github-actions):

    • Federated credential subject: repo:fireflyframework/fireflyframework-agentic:ref:refs/heads/main
    • Roles: Key Vault Secrets User on the KV, AcrPush on fireflysignature, Contributor on firefly-mcp
  3. GitHub repository secrets (Actions → Secrets):

    • AZURE_CLIENT_ID → client ID of firefly-github-actions
    • AZURE_TENANT_ID → cda419af-f2d8-444c-9248-4ada4f168c8a
    • AZURE_SUBSCRIPTION_ID → e8b8063e-f842-4a59-9754-427ddb7bfb63
  4. Delete old GitHub secrets once the workflow is confirmed working:

    • ANTHROPIC_API_KEY, EMBEDDING_BINDING_HOST, EMBEDDING_BINDING_API_KEY

Replace GitHub repo secrets (ANTHROPIC_API_KEY, EMBEDDING_BINDING_HOST,
EMBEDDING_BINDING_API_KEY) with Azure Key Vault lookups. Add OIDC login
and kv-firefly-signature fetch to both the test and benchmark jobs.

Move from a single top-level permissions block to per-job permissions so
the benchmark job no longer inherits issues:write from the test job.
Separate the single shared service principal into two:
- AZURE_CI_CLIENT_ID (nightly/benchmark): Key Vault Secrets User only
- AZURE_DEPLOY_CLIENT_ID (deploy-mcp): AcrPush + Container App write only

Also hardcode tenant and subscription IDs as plain values — they are
not credentials and do not belong in the secrets store.
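
A rough sketch of the per-job layout this describes; the job names (test, benchmark, deploy-mcp) and the two client IDs come from the text above, while the exact permission sets beyond issues:write and id-token:write are assumptions:

```yaml
jobs:
  test:
    permissions:
      id-token: write   # OIDC token for azure/login
      contents: read
      issues: write     # kept on the test job only
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CI_CLIENT_ID }}   # Key Vault Secrets User only
          tenant-id: cda419af-f2d8-444c-9248-4ada4f168c8a
          subscription-id: e8b8063e-f842-4a59-9754-427ddb7bfb63
      # ... Key Vault fetch and test steps ...

  benchmark:
    permissions:
      id-token: write
      contents: read    # no issues:write inherited from test
    steps:
      - uses: azure/login@v2   # same CI principal and Key Vault fetch as the test job
        with:
          client-id: ${{ secrets.AZURE_CI_CLIENT_ID }}
          tenant-id: cda419af-f2d8-444c-9248-4ada4f168c8a
          subscription-id: e8b8063e-f842-4a59-9754-427ddb7bfb63

  deploy-mcp:
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_DEPLOY_CLIENT_ID }}   # AcrPush + Container App write only
          tenant-id: cda419af-f2d8-444c-9248-4ada4f168c8a
          subscription-id: e8b8063e-f842-4a59-9754-427ddb7bfb63
```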
@javier-alvarez
Contributor

Nightly does not work with this branch https://github.com/fireflyframework/fireflyframework-agentic/actions/runs/25543811013

Service principal error

The 25,870-row CSV converted to 24,399 chunks via MarkdownChunker
(one flat table, no heading boundaries). Embedding and storing that
many vectors flooded the SQLite WAL and corrupted the database, causing
"database disk image is malformed" on the xlsx ingest that follows.

None of the 32 benchmark queries ask about billing data; the CSV is
purely a distractor.  200 rows (190 chunks) is enough to test that
billing-ledger content is not surfaced for company-fact queries, without
overwhelming the in-memory SQLite corpus.
paraphrase_ceo_idiomatic now returns the correct doc at rank 1 but a
different chunk (distractor docs shifted relative vector scores). The
answer is still correct; only substring_match drops to 27/28 = 0.9643.
shutil.copyfile on a WAL-mode SQLite database copies only the main
file, leaving committed WAL frames behind. When two DatabaseStore
instances share the same backend path (e.g. SqliteCorpus +
SqliteVecVectorStore both pointing to corpus.sqlite), each upload
produces an etag change that triggers the other store to re-download.
The download overwrites the shared cache and leaves the WAL stale,
wiping all data written in that round.

Fix: detect SQLite files by their 16-byte magic header and use
sqlite3.Connection.backup() instead of shutil.copyfile. The backup API
walks both the main file and any unflushed WAL frames, so the uploaded
file is always a complete, consistent snapshot.
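
A sketch of that helper (the name follows the _copy_for_upload mentioned below; treat it as an approximation of this intermediate fix, which the next paragraph shows was not sufficient):

```python
import shutil
import sqlite3
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite main database file


def _copy_for_upload(src: Path, dst: Path) -> None:
    """Snapshot src into dst, using the SQLite backup API for SQLite files."""
    with open(src, "rb") as f:
        is_sqlite = f.read(16) == SQLITE_MAGIC

    if not is_sqlite:
        shutil.copyfile(src, dst)
        return

    # The backup API walks the main file plus any unflushed WAL frames,
    # so dst is a complete, consistent snapshot even in WAL mode.
    source = sqlite3.connect(src)
    dest = sqlite3.connect(dst)
    try:
        source.backup(dest)
    finally:
        dest.close()
        source.close()
```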
sqlite3.Connection.backup() opens the source without loading the
sqlite_vec extension, which makes schema parsing fail when the database
has a vec0 virtual table — producing a malformed copy.

The correct fix is to issue PRAGMA wal_checkpoint(FULL) on the existing
connection (which already has sqlite_vec loaded) immediately after each
COMMIT, so the main database file is fully up-to-date before
shutil.copyfile captures it for upload.

Reverts the _copy_for_upload / backup() helper added in the previous
attempt and restores plain shutil.copyfile in LocalBackend._upload_sync.
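
For reference, a minimal sketch of the checkpoint-then-copy approach described above (function names are illustrative; in the actual change the checkpoint runs on the store's existing connection and the copy stays in LocalBackend._upload_sync):

```python
import shutil
import sqlite3
from pathlib import Path


def commit_and_checkpoint(conn: sqlite3.Connection) -> None:
    """Commit, then fold every WAL frame back into the main database file.

    The existing connection already has sqlite_vec loaded, so vec0 virtual
    tables are no obstacle; after the checkpoint the main file on disk is a
    complete snapshot on its own.
    """
    conn.commit()
    conn.execute("PRAGMA wal_checkpoint(FULL)")


def _upload_sync(local_path: Path, remote_path: Path) -> None:
    # With the WAL checkpointed after every commit, a plain copy of the
    # main database file is already consistent.
    shutil.copyfile(local_path, remote_path)
```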