Skip to content

Improve context snapshots needed on ACLK connect#21904

Draft
stelfrag wants to merge 6 commits intonetdata:masterfrom
stelfrag:reduce_ctx_snapshots
Draft

Improve context snapshots needed on ACLK connect#21904
stelfrag wants to merge 6 commits intonetdata:masterfrom
stelfrag:reduce_ctx_snapshots

Conversation

@stelfrag
Copy link
Collaborator

@stelfrag stelfrag commented Mar 6, 2026

Summary
  • Defer context snapshots if version recalculation is pending (send snapshot when done)
    • This should reduce context snapshos sent to cloud

Summary by cubic

Defers ACLK context snapshot checkpoints while context recalculation/post-processing is in progress, then replays them when processing completes or after a 5-minute timeout. This cuts duplicate snapshots and avoids races with active streaming.

  • New Features

    • Defer and save checkpoint (claim_id, node_id, version_hash, timestamp) when context load or post-processing is active.
    • Worker replays the saved checkpoint when the post-processing queue drains, or forces replay after 5 minutes; discards if claim_id is stale.
    • Centralized execution path: compare version hash, send snapshot on mismatch, then re-enable streaming; pending state is spinlock-protected.
  • Bug Fixes

    • Add null check and locked cleanup in ACLK config destruction to avoid races and null derefs when freeing pending checkpoint data.

Written for commit e33978b. Summary will update on new commits.

stelfrag added 5 commits March 6, 2026 09:54
- Introduce deferred checkpoint logic to handle pending context loads and post-processing.
- Log detailed reasons and timestamps for deferred checkpoints.
- Update `aclk_sync_cfg_t` struct with additional fields for deferred checkpoint tracking.
- Ensure proper handling of deferred counts and reasons upon context activation.
- Extract deferred checkpoint handling into a new `rrdcontext_checkpoint_defer()` function.
- Introduce limits for maximum deferrals and cumulative deferral time.
- Update `aclk_sync_cfg_t` to enhance tracking of deferred checkpoints with additional fields.
- Improve logging to include detailed deferral reasons and time-related data.
… consolidating logic. Add support for replaying deferred checkpoints post context processing.
- Introduce `pending_ctx_spinlock` to ensure thread-safe access to pending checkpoint fields.
- Lock critical sections in `rrdcontext_checkpoint_defer()` and checkpoint processing to prevent race conditions.
- Initialize spinlock during ACLK sync configuration setup.
- Return success/failure state from `rrdcontext_checkpoint_save_pending()`.
- Replay deferred checkpoints if post-processing queue is drained or on timeout.
- Add `pending_ctx_saved_time_s` for tracking checkpoint save time.
- Enhance spinlock safety and logging for deferred checkpoint replays and timeouts.
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 5 files

Confidence score: 3/5

  • There is moderate merge risk because both reported issues are concrete runtime behaviors (severity 7/10 and 6/10), not just style or maintainability concerns.
  • In src/database/contexts/rrdcontext.c, deferred checkpoint replay can ignore RRDHOST_FLAG_PENDING_CONTEXT_LOAD, which may trigger replay before context loading completes and lead to incorrect initialization/order-of-operations behavior.
  • In src/database/sqlite/sqlite_aclk.c, dereferencing old_aclk_host_config without a NULL guard around the new spinlock path can crash when another thread clears the pointer first.
  • Pay close attention to src/database/contexts/rrdcontext.c and src/database/sqlite/sqlite_aclk.c - replay ordering and concurrent NULL-dereference risk need validation.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/database/sqlite/sqlite_aclk.c">

<violation number="1" location="src/database/sqlite/sqlite_aclk.c:1187">
P2: Guard against old_aclk_host_config being NULL before dereferencing it; the new spinlock lock can crash if another thread already cleared the pointer.</violation>
</file>

<file name="src/database/contexts/rrdcontext.c">

<violation number="1" location="src/database/contexts/rrdcontext.c:374">
P1: Deferred checkpoint replay ignores `RRDHOST_FLAG_PENDING_CONTEXT_LOAD`, so replay may occur before context loading finishes.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Cloud as ACLK (Cloud)
    participant RRD as RRDContext Handler
    participant Host as RRDHOST (State)
    participant Worker as RRDContext Worker
    participant DB as Context Snapshot / Hub

    Note over Cloud, DB: ACLK Checkpoint Flow (New Deferral Logic)

    Cloud->>>RRD: NEW: rrdcontext_hub_checkpoint_command(cmd)
    
    RRD->>Host: Check RRDHOST_FLAG_PENDING_CONTEXT_LOAD
    RRD->>Host: Check pp_queue (post-processing) size

    alt Host Busy (Loading or Queueing)
        RRD->>Host: NEW: rrdcontext_checkpoint_save_pending()
        Note right of Host: Save claim_id, node_id, hash,<br/>and timestamp to aclk_sync_cfg
        Host-->>RRD: pending_ctx_checkpoint = true
        RRD-->>Cloud: Acknowledged (Deferred)
    else Host Idle
        RRD->>RRD: rrdcontext_checkpoint_execute()
        RRD->>DB: CHANGED: Send snapshot (if version mismatch)
        RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
    end

    Note over Worker, Host: Periodic Worker Loop (~1s)

    loop Every Iteration
        Worker->>Host: rrdcontext_post_process_queued_contexts()
        Worker->>>Host: NEW: rrdcontext_hub_pending_checkpoint_replay()
        
        opt pending_ctx_checkpoint is true
            alt pp_queue is empty OR timeout (>300s)
                Host->>Host: NEW: Extract deferred command details
                Host->>RRD: rrdcontext_checkpoint_execute()
                RRD->>DB: Send snapshot (if version mismatch)
                RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
                Host->>Host: Reset pending_ctx_checkpoint = false
            else Still Processing
                Host-->>Worker: Continue deferring
            end
        end
    end

    Note over Host, DB: Cleanup on Host/ACLK Disconnect
    Host->>Host: NEW: destroy_aclk_config()
    Note right of Host: Acquire spinlock, free pending strings
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@stelfrag
Copy link
Collaborator Author

stelfrag commented Mar 6, 2026

@cubic-dev-ai PR review

@cubic-dev-ai
Copy link
Contributor

cubic-dev-ai bot commented Mar 6, 2026

@cubic-dev-ai PR review

@stelfrag I have started the AI code review. It will take a few minutes to complete.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Confidence score: 3/5

  • Potential user-impacting risk: deferred checkpoints may starve when a context load is pending because replay happens only after the pending-load early-continue in src/database/contexts/rrdcontext-worker.c.
  • Score reflects a concrete behavioral risk (severity 6/10) rather than a confirmed regression, so overall risk is moderate.
  • Pay close attention to src/database/contexts/rrdcontext-worker.c - ensure deferred checkpoints aren’t skipped during pending context loads.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/database/contexts/rrdcontext-worker.c">

<violation number="1" location="src/database/contexts/rrdcontext-worker.c:1087">
P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Cloud as Netdata Cloud (Hub)
    participant ACLK as ACLK Command Thread
    participant Config as Host ACLK Config (Storage)
    participant Worker as RRDContext Worker Thread
    participant Engine as Context Snapshot Engine

    Note over Cloud,Engine: Connection / Checkpoint Initiation

    Cloud->>ACLK: Checkpoint Command (claim_id, version_hash)
    
    ACLK->>ACLK: Check if Host Busy (pp_queue or pending load)
    
    alt Host Busy (Processing Contexts)
        ACLK->>Config: NEW: Save pending checkpoint (Spinlock protected)
        Note right of Config: Stores claim_id, node_id, hash, and timestamp
    else Host Idle
        ACLK->>Engine: CHANGED: Execute Checkpoint immediately
    end

    Note over Worker,Config: Background Replay Logic

    loop Every Worker Tick
        Worker->>Worker: Process Context Queue (pp_queue)
        
        Worker->>Config: NEW: rrdcontext_hub_pending_checkpoint_replay()
        
        opt Queue Empty OR 5-minute Timeout
            Config-->>Worker: Return Saved Checkpoint Data
            Worker->>Engine: CHANGED: Execute Checkpoint
            Worker->>Config: NEW: Clear pending state (Spinlock)
        end
    end

    Note over Engine,Cloud: Checkpoint Execution Path

    Engine->>Engine: Compare Local Hash vs Command Hash
    
    alt Hash Mismatch
        Engine->>Engine: Recalculate Host Retention
        Engine->>Cloud: NEW: Send Context Snapshot (Full State)
    else Hash Matches
        Note over Engine,Cloud: Skip Snapshot (Optimization)
    end
    
    Engine->>Config: CHANGED: Enable ACLK_STREAM_CONTEXTS flag

    Note over Config: Cleanup on Disconnect
    opt Host/Config Destruction
        Config->>Config: NEW: Locked cleanup of pending strings
    end
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

rrdcontext_post_process_queued_contexts(host);

// replay deferred checkpoint if post-processing drained the queue, or on timeout
rrdcontext_hub_pending_checkpoint_replay(host);
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/database/contexts/rrdcontext-worker.c, line 1087:

<comment>Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</comment>

<file context>
@@ -1083,6 +1083,9 @@ void rrdcontext_main(void *ptr) {
             rrdcontext_post_process_queued_contexts(host);
 
+            // replay deferred checkpoint if post-processing drained the queue, or on timeout
+            rrdcontext_hub_pending_checkpoint_replay(host);
+
             hub_queued_contexts_for_all_hosts += rrdcontext_queue_entries(&host->rrdctx.hub_queue);
</file context>
Fix with Cubic

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Defers ACLK context snapshot checkpoints while a host’s context load or post-processing is still in progress, then replays the checkpoint once processing completes (or after a timeout). This aims to reduce duplicate context snapshots and avoid races with active context streaming.

Changes:

  • Added per-host “pending checkpoint” state (claim_id/node_id/version_hash/timestamp) protected by a spinlock in the ACLK host config.
  • Updated checkpoint handling to save/defer checkpoints during context load/post-processing and execute them later via a replay path.
  • Hooked replay into the context worker loop and hardened ACLK config destruction to safely free pending checkpoint data.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/database/sqlite/sqlite_aclk.h Extends ACLK per-host config with spinlock-protected pending checkpoint fields.
src/database/sqlite/sqlite_aclk.c Initializes the new spinlock and safely detaches/frees pending checkpoint strings during config destruction.
src/database/contexts/rrdcontext.h Exposes the pending-checkpoint replay entry point for the context worker.
src/database/contexts/rrdcontext.c Implements save/defer/execute/replay logic for context checkpoints, including timeout-based replay and stale-claim checks.
src/database/contexts/rrdcontext-worker.c Calls the replay function after post-processing to re-enable streaming and/or send snapshots when safe.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants