Improve context snapshots needed on ACLK connect#21904
Improve context snapshots needed on ACLK connect#21904stelfrag wants to merge 6 commits intonetdata:masterfrom
Conversation
- Introduce deferred checkpoint logic to handle pending context loads and post-processing. - Log detailed reasons and timestamps for deferred checkpoints. - Update `aclk_sync_cfg_t` struct with additional fields for deferred checkpoint tracking. - Ensure proper handling of deferred counts and reasons upon context activation.
- Extract deferred checkpoint handling into a new `rrdcontext_checkpoint_defer()` function. - Introduce limits for maximum deferrals and cumulative deferral time. - Update `aclk_sync_cfg_t` to enhance tracking of deferred checkpoints with additional fields. - Improve logging to include detailed deferral reasons and time-related data.
… consolidating logic. Add support for replaying deferred checkpoints post context processing.
- Introduce `pending_ctx_spinlock` to ensure thread-safe access to pending checkpoint fields. - Lock critical sections in `rrdcontext_checkpoint_defer()` and checkpoint processing to prevent race conditions. - Initialize spinlock during ACLK sync configuration setup.
- Return success/failure state from `rrdcontext_checkpoint_save_pending()`. - Replay deferred checkpoints if post-processing queue is drained or on timeout. - Add `pending_ctx_saved_time_s` for tracking checkpoint save time. - Enhance spinlock safety and logging for deferred checkpoint replays and timeouts.
There was a problem hiding this comment.
2 issues found across 5 files
Confidence score: 3/5
- There is moderate merge risk because both reported issues are concrete runtime behaviors (severity 7/10 and 6/10), not just style or maintainability concerns.
- In
src/database/contexts/rrdcontext.c, deferred checkpoint replay can ignoreRRDHOST_FLAG_PENDING_CONTEXT_LOAD, which may trigger replay before context loading completes and lead to incorrect initialization/order-of-operations behavior. - In
src/database/sqlite/sqlite_aclk.c, dereferencingold_aclk_host_configwithout a NULL guard around the new spinlock path can crash when another thread clears the pointer first. - Pay close attention to
src/database/contexts/rrdcontext.candsrc/database/sqlite/sqlite_aclk.c- replay ordering and concurrent NULL-dereference risk need validation.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/database/sqlite/sqlite_aclk.c">
<violation number="1" location="src/database/sqlite/sqlite_aclk.c:1187">
P2: Guard against old_aclk_host_config being NULL before dereferencing it; the new spinlock lock can crash if another thread already cleared the pointer.</violation>
</file>
<file name="src/database/contexts/rrdcontext.c">
<violation number="1" location="src/database/contexts/rrdcontext.c:374">
P1: Deferred checkpoint replay ignores `RRDHOST_FLAG_PENDING_CONTEXT_LOAD`, so replay may occur before context loading finishes.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Cloud as ACLK (Cloud)
participant RRD as RRDContext Handler
participant Host as RRDHOST (State)
participant Worker as RRDContext Worker
participant DB as Context Snapshot / Hub
Note over Cloud, DB: ACLK Checkpoint Flow (New Deferral Logic)
Cloud->>>RRD: NEW: rrdcontext_hub_checkpoint_command(cmd)
RRD->>Host: Check RRDHOST_FLAG_PENDING_CONTEXT_LOAD
RRD->>Host: Check pp_queue (post-processing) size
alt Host Busy (Loading or Queueing)
RRD->>Host: NEW: rrdcontext_checkpoint_save_pending()
Note right of Host: Save claim_id, node_id, hash,<br/>and timestamp to aclk_sync_cfg
Host-->>RRD: pending_ctx_checkpoint = true
RRD-->>Cloud: Acknowledged (Deferred)
else Host Idle
RRD->>RRD: rrdcontext_checkpoint_execute()
RRD->>DB: CHANGED: Send snapshot (if version mismatch)
RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
end
Note over Worker, Host: Periodic Worker Loop (~1s)
loop Every Iteration
Worker->>Host: rrdcontext_post_process_queued_contexts()
Worker->>>Host: NEW: rrdcontext_hub_pending_checkpoint_replay()
opt pending_ctx_checkpoint is true
alt pp_queue is empty OR timeout (>300s)
Host->>Host: NEW: Extract deferred command details
Host->>RRD: rrdcontext_checkpoint_execute()
RRD->>DB: Send snapshot (if version mismatch)
RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
Host->>Host: Reset pending_ctx_checkpoint = false
else Still Processing
Host-->>Worker: Continue deferring
end
end
end
Note over Host, DB: Cleanup on Host/ACLK Disconnect
Host->>Host: NEW: destroy_aclk_config()
Note right of Host: Acquire spinlock, free pending strings
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
@cubic-dev-ai PR review |
@stelfrag I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 5 files
Confidence score: 3/5
- Potential user-impacting risk: deferred checkpoints may starve when a context load is pending because replay happens only after the pending-load early-continue in
src/database/contexts/rrdcontext-worker.c. - Score reflects a concrete behavioral risk (severity 6/10) rather than a confirmed regression, so overall risk is moderate.
- Pay close attention to
src/database/contexts/rrdcontext-worker.c- ensure deferred checkpoints aren’t skipped during pending context loads.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/database/contexts/rrdcontext-worker.c">
<violation number="1" location="src/database/contexts/rrdcontext-worker.c:1087">
P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Cloud as Netdata Cloud (Hub)
participant ACLK as ACLK Command Thread
participant Config as Host ACLK Config (Storage)
participant Worker as RRDContext Worker Thread
participant Engine as Context Snapshot Engine
Note over Cloud,Engine: Connection / Checkpoint Initiation
Cloud->>ACLK: Checkpoint Command (claim_id, version_hash)
ACLK->>ACLK: Check if Host Busy (pp_queue or pending load)
alt Host Busy (Processing Contexts)
ACLK->>Config: NEW: Save pending checkpoint (Spinlock protected)
Note right of Config: Stores claim_id, node_id, hash, and timestamp
else Host Idle
ACLK->>Engine: CHANGED: Execute Checkpoint immediately
end
Note over Worker,Config: Background Replay Logic
loop Every Worker Tick
Worker->>Worker: Process Context Queue (pp_queue)
Worker->>Config: NEW: rrdcontext_hub_pending_checkpoint_replay()
opt Queue Empty OR 5-minute Timeout
Config-->>Worker: Return Saved Checkpoint Data
Worker->>Engine: CHANGED: Execute Checkpoint
Worker->>Config: NEW: Clear pending state (Spinlock)
end
end
Note over Engine,Cloud: Checkpoint Execution Path
Engine->>Engine: Compare Local Hash vs Command Hash
alt Hash Mismatch
Engine->>Engine: Recalculate Host Retention
Engine->>Cloud: NEW: Send Context Snapshot (Full State)
else Hash Matches
Note over Engine,Cloud: Skip Snapshot (Optimization)
end
Engine->>Config: CHANGED: Enable ACLK_STREAM_CONTEXTS flag
Note over Config: Cleanup on Disconnect
opt Host/Config Destruction
Config->>Config: NEW: Locked cleanup of pending strings
end
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| rrdcontext_post_process_queued_contexts(host); | ||
|
|
||
| // replay deferred checkpoint if post-processing drained the queue, or on timeout | ||
| rrdcontext_hub_pending_checkpoint_replay(host); |
There was a problem hiding this comment.
P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/database/contexts/rrdcontext-worker.c, line 1087:
<comment>Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</comment>
<file context>
@@ -1083,6 +1083,9 @@ void rrdcontext_main(void *ptr) {
rrdcontext_post_process_queued_contexts(host);
+ // replay deferred checkpoint if post-processing drained the queue, or on timeout
+ rrdcontext_hub_pending_checkpoint_replay(host);
+
hub_queued_contexts_for_all_hosts += rrdcontext_queue_entries(&host->rrdctx.hub_queue);
</file context>
There was a problem hiding this comment.
Pull request overview
Defers ACLK context snapshot checkpoints while a host’s context load or post-processing is still in progress, then replays the checkpoint once processing completes (or after a timeout). This aims to reduce duplicate context snapshots and avoid races with active context streaming.
Changes:
- Added per-host “pending checkpoint” state (claim_id/node_id/version_hash/timestamp) protected by a spinlock in the ACLK host config.
- Updated checkpoint handling to save/defer checkpoints during context load/post-processing and execute them later via a replay path.
- Hooked replay into the context worker loop and hardened ACLK config destruction to safely free pending checkpoint data.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/database/sqlite/sqlite_aclk.h |
Extends ACLK per-host config with spinlock-protected pending checkpoint fields. |
src/database/sqlite/sqlite_aclk.c |
Initializes the new spinlock and safely detaches/frees pending checkpoint strings during config destruction. |
src/database/contexts/rrdcontext.h |
Exposes the pending-checkpoint replay entry point for the context worker. |
src/database/contexts/rrdcontext.c |
Implements save/defer/execute/replay logic for context checkpoints, including timeout-based replay and stale-claim checks. |
src/database/contexts/rrdcontext-worker.c |
Calls the replay function after post-processing to re-enable streaming and/or send snapshots when safe. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Summary by cubic
Defers ACLK context snapshot checkpoints while context recalculation/post-processing is in progress, then replays them when processing completes or after a 5-minute timeout. This cuts duplicate snapshots and avoids races with active streaming.
New Features
Bug Fixes
Written for commit e33978b. Summary will update on new commits.