Improve context snapshots needed on ACLK connect by stelfrag · Pull Request #21904 · netdata/netdata

stelfrag · 2026-03-06T09:24:27Z

Summary

Defer context snapshots if version recalculation is pending (send snapshot when done)
- This should reduce context snapshos sent to cloud

Summary by cubic

Defers ACLK context snapshot checkpoints while context recalculation/post-processing is in progress, then replays them when processing completes or after a 5-minute timeout. This cuts duplicate snapshots and avoids races with active streaming.

New Features
- Defer and save checkpoint (claim_id, node_id, version_hash, timestamp) when context load or post-processing is active.
- Worker replays the saved checkpoint when the post-processing queue drains, or forces replay after 5 minutes; discards if claim_id is stale.
- Centralized execution path: compare version hash, send snapshot on mismatch, then re-enable streaming; pending state is spinlock-protected.
Bug Fixes
- Add null check and locked cleanup in ACLK config destruction to avoid races and null derefs when freeing pending checkpoint data.

^{Written for commit e33978b. Summary will update on new commits.}

- Introduce deferred checkpoint logic to handle pending context loads and post-processing. - Log detailed reasons and timestamps for deferred checkpoints. - Update `aclk_sync_cfg_t` struct with additional fields for deferred checkpoint tracking. - Ensure proper handling of deferred counts and reasons upon context activation.

- Extract deferred checkpoint handling into a new `rrdcontext_checkpoint_defer()` function. - Introduce limits for maximum deferrals and cumulative deferral time. - Update `aclk_sync_cfg_t` to enhance tracking of deferred checkpoints with additional fields. - Improve logging to include detailed deferral reasons and time-related data.

… consolidating logic. Add support for replaying deferred checkpoints post context processing.

- Introduce `pending_ctx_spinlock` to ensure thread-safe access to pending checkpoint fields. - Lock critical sections in `rrdcontext_checkpoint_defer()` and checkpoint processing to prevent race conditions. - Initialize spinlock during ACLK sync configuration setup.

- Return success/failure state from `rrdcontext_checkpoint_save_pending()`. - Replay deferred checkpoints if post-processing queue is drained or on timeout. - Add `pending_ctx_saved_time_s` for tracking checkpoint save time. - Enhance spinlock safety and logging for deferred checkpoint replays and timeouts.

cubic-dev-ai

2 issues found across 5 files

Confidence score: 3/5

There is moderate merge risk because both reported issues are concrete runtime behaviors (severity 7/10 and 6/10), not just style or maintainability concerns.
In src/database/contexts/rrdcontext.c, deferred checkpoint replay can ignore RRDHOST_FLAG_PENDING_CONTEXT_LOAD, which may trigger replay before context loading completes and lead to incorrect initialization/order-of-operations behavior.
In src/database/sqlite/sqlite_aclk.c, dereferencing old_aclk_host_config without a NULL guard around the new spinlock path can crash when another thread clears the pointer first.
Pay close attention to src/database/contexts/rrdcontext.c and src/database/sqlite/sqlite_aclk.c - replay ordering and concurrent NULL-dereference risk need validation.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/database/sqlite/sqlite_aclk.c">

<violation number="1" location="src/database/sqlite/sqlite_aclk.c:1187">
P2: Guard against old_aclk_host_config being NULL before dereferencing it; the new spinlock lock can crash if another thread already cleared the pointer.</violation>
</file>

<file name="src/database/contexts/rrdcontext.c">

<violation number="1" location="src/database/contexts/rrdcontext.c:374">
P1: Deferred checkpoint replay ignores `RRDHOST_FLAG_PENDING_CONTEXT_LOAD`, so replay may occur before context loading finishes.</violation>
</file>

Architecture diagram

sequenceDiagram
    participant Cloud as ACLK (Cloud)
    participant RRD as RRDContext Handler
    participant Host as RRDHOST (State)
    participant Worker as RRDContext Worker
    participant DB as Context Snapshot / Hub

    Note over Cloud, DB: ACLK Checkpoint Flow (New Deferral Logic)

    Cloud->>>RRD: NEW: rrdcontext_hub_checkpoint_command(cmd)
    
    RRD->>Host: Check RRDHOST_FLAG_PENDING_CONTEXT_LOAD
    RRD->>Host: Check pp_queue (post-processing) size

    alt Host Busy (Loading or Queueing)
        RRD->>Host: NEW: rrdcontext_checkpoint_save_pending()
        Note right of Host: Save claim_id, node_id, hash,<br/>and timestamp to aclk_sync_cfg
        Host-->>RRD: pending_ctx_checkpoint = true
        RRD-->>Cloud: Acknowledged (Deferred)
    else Host Idle
        RRD->>RRD: rrdcontext_checkpoint_execute()
        RRD->>DB: CHANGED: Send snapshot (if version mismatch)
        RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
    end

    Note over Worker, Host: Periodic Worker Loop (~1s)

    loop Every Iteration
        Worker->>Host: rrdcontext_post_process_queued_contexts()
        Worker->>>Host: NEW: rrdcontext_hub_pending_checkpoint_replay()
        
        opt pending_ctx_checkpoint is true
            alt pp_queue is empty OR timeout (>300s)
                Host->>Host: NEW: Extract deferred command details
                Host->>RRD: rrdcontext_checkpoint_execute()
                RRD->>DB: Send snapshot (if version mismatch)
                RRD->>Host: Enable RRDHOST_FLAG_ACLK_STREAM_CONTEXTS
                Host->>Host: Reset pending_ctx_checkpoint = false
            else Still Processing
                Host-->>Worker: Continue deferring
            end
        end
    end

    Note over Host, DB: Cleanup on Host/ACLK Disconnect
    Host->>Host: NEW: destroy_aclk_config()
    Note right of Host: Acquire spinlock, free pending strings

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/database/contexts/rrdcontext.c

src/database/sqlite/sqlite_aclk.c

stelfrag · 2026-03-06T09:52:48Z

@cubic-dev-ai PR review

cubic-dev-ai · 2026-03-06T09:52:59Z

@cubic-dev-ai PR review

@stelfrag I have started the AI code review. It will take a few minutes to complete.

cubic-dev-ai

1 issue found across 5 files

Confidence score: 3/5

Potential user-impacting risk: deferred checkpoints may starve when a context load is pending because replay happens only after the pending-load early-continue in src/database/contexts/rrdcontext-worker.c.
Score reflects a concrete behavioral risk (severity 6/10) rather than a confirmed regression, so overall risk is moderate.
Pay close attention to src/database/contexts/rrdcontext-worker.c - ensure deferred checkpoints aren’t skipped during pending context loads.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/database/contexts/rrdcontext-worker.c">

<violation number="1" location="src/database/contexts/rrdcontext-worker.c:1087">
P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</violation>
</file>

Architecture diagram

sequenceDiagram
    participant Cloud as Netdata Cloud (Hub)
    participant ACLK as ACLK Command Thread
    participant Config as Host ACLK Config (Storage)
    participant Worker as RRDContext Worker Thread
    participant Engine as Context Snapshot Engine

    Note over Cloud,Engine: Connection / Checkpoint Initiation

    Cloud->>ACLK: Checkpoint Command (claim_id, version_hash)
    
    ACLK->>ACLK: Check if Host Busy (pp_queue or pending load)
    
    alt Host Busy (Processing Contexts)
        ACLK->>Config: NEW: Save pending checkpoint (Spinlock protected)
        Note right of Config: Stores claim_id, node_id, hash, and timestamp
    else Host Idle
        ACLK->>Engine: CHANGED: Execute Checkpoint immediately
    end

    Note over Worker,Config: Background Replay Logic

    loop Every Worker Tick
        Worker->>Worker: Process Context Queue (pp_queue)
        
        Worker->>Config: NEW: rrdcontext_hub_pending_checkpoint_replay()
        
        opt Queue Empty OR 5-minute Timeout
            Config-->>Worker: Return Saved Checkpoint Data
            Worker->>Engine: CHANGED: Execute Checkpoint
            Worker->>Config: NEW: Clear pending state (Spinlock)
        end
    end

    Note over Engine,Cloud: Checkpoint Execution Path

    Engine->>Engine: Compare Local Hash vs Command Hash
    
    alt Hash Mismatch
        Engine->>Engine: Recalculate Host Retention
        Engine->>Cloud: NEW: Send Context Snapshot (Full State)
    else Hash Matches
        Note over Engine,Cloud: Skip Snapshot (Optimization)
    end
    
    Engine->>Config: CHANGED: Enable ACLK_STREAM_CONTEXTS flag

    Note over Config: Cleanup on Disconnect
    opt Host/Config Destruction
        Config->>Config: NEW: Locked cleanup of pending strings
    end

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-03-06T09:58:51Z

src/database/contexts/rrdcontext-worker.c

            rrdcontext_post_process_queued_contexts(host);

+            // replay deferred checkpoint if post-processing drained the queue, or on timeout
+            rrdcontext_hub_pending_checkpoint_replay(host);


P2: Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/database/contexts/rrdcontext-worker.c, line 1087: <comment>Deferred checkpoints can starve during pending context load because replay is called only after the pending-load early-continue.</comment> <file context> @@ -1083,6 +1083,9 @@ void rrdcontext_main(void *ptr) { rrdcontext_post_process_queued_contexts(host); + // replay deferred checkpoint if post-processing drained the queue, or on timeout + rrdcontext_hub_pending_checkpoint_replay(host); + hub_queued_contexts_for_all_hosts += rrdcontext_queue_entries(&host->rrdctx.hub_queue); </file context>

Copilot

Pull request overview

Defers ACLK context snapshot checkpoints while a host’s context load or post-processing is still in progress, then replays the checkpoint once processing completes (or after a timeout). This aims to reduce duplicate context snapshots and avoid races with active context streaming.

Changes:

Added per-host “pending checkpoint” state (claim_id/node_id/version_hash/timestamp) protected by a spinlock in the ACLK host config.
Updated checkpoint handling to save/defer checkpoints during context load/post-processing and execute them later via a replay path.
Hooked replay into the context worker loop and hardened ACLK config destruction to safely free pending checkpoint data.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`src/database/sqlite/sqlite_aclk.h`	Extends ACLK per-host config with spinlock-protected pending checkpoint fields.
`src/database/sqlite/sqlite_aclk.c`	Initializes the new spinlock and safely detaches/frees pending checkpoint strings during config destruction.
`src/database/contexts/rrdcontext.h`	Exposes the pending-checkpoint replay entry point for the context worker.
`src/database/contexts/rrdcontext.c`	Implements save/defer/execute/replay logic for context checkpoints, including timeout-based replay and stale-claim checks.
`src/database/contexts/rrdcontext-worker.c`	Calls the replay function after post-processing to re-enable streaming and/or send snapshots when safe.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

stelfrag added 5 commits March 6, 2026 09:54

Simplify deferred checkpoint handling by removing deferral limits and…

1f80fb6

… consolidating logic. Add support for replaying deferred checkpoints post context processing.

github-actions bot added area/database area/aclk labels Mar 6, 2026

cubic-dev-ai bot reviewed Mar 6, 2026

View reviewed changes

src/database/contexts/rrdcontext.c Show resolved Hide resolved

src/database/sqlite/sqlite_aclk.c Show resolved Hide resolved

Add null check for old_aclk_host_config in ACLK cleanup logic

e33978b

cubic-dev-ai bot reviewed Mar 6, 2026

View reviewed changes

ilyam8 requested a review from Copilot March 6, 2026 10:06

Copilot started reviewing on behalf of ilyam8 March 6, 2026 10:07 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve context snapshots needed on ACLK connect#21904

Improve context snapshots needed on ACLK connect#21904
stelfrag wants to merge 6 commits intonetdata:masterfrom
stelfrag:reduce_ctx_snapshots

stelfrag commented Mar 6, 2026 •

edited by cubic-dev-ai bot

Loading

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

stelfrag commented Mar 6, 2026

Uh oh!

cubic-dev-ai bot commented Mar 6, 2026

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Mar 6, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stelfrag commented Mar 6, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by cubic

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

stelfrag commented Mar 6, 2026

Uh oh!

cubic-dev-ai bot commented Mar 6, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stelfrag commented Mar 6, 2026 •

edited by cubic-dev-ai bot

Loading

cubic-dev-ai bot Mar 6, 2026 •

edited

Loading