Conversation

Skipping CI for Draft Pull Request.
📝 Walkthrough

Summary by CodeRabbit

This pull request introduces a comprehensive end-to-end test infrastructure for VM scanning on OpenShift clusters. It includes CI pipeline updates, Kubernetes CNV operator setup, a full Go-based E2E test suite with supporting helpers, Prometheus observability utilities, and shell scripts for test orchestration and custom VM image building.

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

⚠️ Review ran into problems: timed out fetching pipeline failures after 30000ms.
Actionable comments posted: 11
🧹 Nitpick comments (3)
vm-image.sh (2)
51-52: Non-deterministic file selection with `find | head -1`.

If the source image contains multiple `.qcow2` or `.img` files, the selected file depends on filesystem ordering. Consider using `find ... -print -quit` for explicit first-match behavior, or adding validation that exactly one disk file exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vm-image.sh` around lines 51 - 52, The DISK_NAME assignment uses a non-deterministic pipeline ("find ... | head -1") which can pick different files; update the logic around DISK_NAME (the line that constructs DISK_NAME using WORKDIR) to use a deterministic first-match find invocation (e.g., find ... -print -quit) or, better, enumerate matches and validate there is exactly one .qcow2/.img file and error if zero or more than one; ensure the echo "==> Disk: $DISK_NAME" only runs after successful validation.
60-64: Credentials written to temporary script file.The subscription credentials in
REGISTER_ARGSare interpolated intocustomize.shwhich is written to disk. While thetrap 'rm -rf "$WORKDIR"' EXITprovides cleanup, the credentials exist in plaintext on the filesystem during execution.For CI/developer tooling this may be acceptable, but consider:
- The script could read credentials from environment variables inside the container instead of baking them into the script
- Process listing could expose the credentials when
subscription-manager registerruns🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@vm-image.sh` around lines 60 - 64, The script currently bakes plaintext credentials into customize.sh by interpolating REGISTER_ARGS; instead stop expanding REGISTER_ARGS into the file and have the script read credentials from environment variables at runtime (e.g. REGISTER_USERNAME and REGISTER_PASSWORD) and invoke subscription-manager without placing secrets on the command line (for example pipe or heredoc to subscription-manager or otherwise provide password via stdin), then explicitly unset those env vars and remove WORKDIR after use; update the code that writes customize.sh (and the script body) to reference REGISTER_USERNAME/REGISTER_PASSWORD and to avoid passing secrets as command-line args to subscription-manager (use stdin/heredoc/piping) and ensure cleanup/unset of the env vars inside the customize.sh script.

tests/testmetrics/parse.go (1)
123-143: Prefix matching may match unintended metrics.

`strings.HasPrefix(line, metricPrefix)` can match metrics that share a common prefix. For example, querying `rox_sensor_vm` would match both `rox_sensor_vm_total` and `rox_sensor_vm_reports_total`.

Consider validating that the character after the metric name is either `{` (labels) or whitespace to ensure exact name matching.

♻️ Proposed fix for exact metric name matching
```diff
 func parseCounter(body, metricPrefix string) (float64, bool) {
 	for _, line := range strings.Split(body, "\n") {
 		line = strings.TrimSpace(line)
 		if line == "" || strings.HasPrefix(line, "#") {
 			continue
 		}
 		if !strings.HasPrefix(line, metricPrefix) {
 			continue
 		}
+		// Ensure exact metric name match (next char must be '{', ' ', or end of name)
+		rest := line[len(metricPrefix):]
+		if len(rest) > 0 && rest[0] != '{' && rest[0] != ' ' {
+			continue
+		}
 		fields := strings.Fields(line)
```

Apply the same fix to `parseCounterWithLabel`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/testmetrics/parse.go` around lines 123 - 143, parseCounter currently uses strings.HasPrefix(line, metricPrefix) which can incorrectly match longer metric names; update parseCounter to ensure the character immediately after metricPrefix in the line (if any) is either '{' (labels), whitespace (space or tab), or end-of-line so you only match the exact metric name, and apply the identical validation to parseCounterWithLabel; locate the checks using metricPrefix inside parseCounter and parseCounterWithLabel and replace the simple HasPrefix acceptance with this exact-name post-character check before parsing fields and values.
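The post-prefix character rule from this finding can be sketched in isolation. The helper below, `matchesMetricName`, is a hypothetical illustration of the check (it is not the PR's code): a line matches only when the metric name is followed by a label set, whitespace, or nothing.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesMetricName reports whether line starts with exactly the metric
// name: the prefix must be followed by '{' (a label set), a space, a tab,
// or the end of the line. Longer names sharing the prefix are rejected.
func matchesMetricName(line, name string) bool {
	if !strings.HasPrefix(line, name) {
		return false
	}
	rest := line[len(name):]
	return rest == "" || rest[0] == '{' || rest[0] == ' ' || rest[0] == '\t'
}

func main() {
	fmt.Println(matchesMetricName("rox_sensor_vm_total 3", "rox_sensor_vm"))     // longer name: no match
	fmt.Println(matchesMetricName(`rox_sensor_vm{node="a"} 3`, "rox_sensor_vm")) // labeled form: match
	fmt.Println(matchesMetricName("rox_sensor_vm 3", "rox_sensor_vm"))           // plain form: match
}
```

This mirrors the diff above: the same guard would apply in both `parseCounter` and `parseCounterWithLabel` before splitting the line into fields.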
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/vm-scanning-lib.sh`:
- Around line 31-35: The verify_kvm_available function currently aggregates
devices.kubevirt.io/kvm across all nodes and uses a regex that can miss
multi-token all-zero results (e.g. "0 0") and doesn't limit to worker nodes;
update it to query only worker nodes (e.g., use a node selector/label like
node-role.kubernetes.io/worker or exclude control-plane/master labels) and then
parse the returned tokens individually to ensure at least one node has a
capacity > 0; produce a list/array from the kubectl jsonpath output for the
symbol verify_kvm_available and iterate over each token, failing (die) only if
every token is zero/empty or if no worker nodes are found.
In `@tests/testmetrics/scrape.go`:
- Around line 234-242: The ScrapeComponent function builds PodCollectOptions
without populating RestConfig, which breaks port-forward transport; update
ScrapeComponent signature to accept a *rest.Config parameter and pass it into
the PodCollectOptions.RestConfig field so that when transport ==
TransportPortForward the port-forward logic receives the required rest
configuration (modify references in ScrapeComponent and the PodCollectOptions
construction to include RestConfig).
In `@tests/vm_scanning_e2e_utils_test.go`:
- Around line 194-195: The test currently treats PodPending as ready via the
phaseReady boolean; change the readiness check to require pod.Status.Phase ==
coreV1.PodRunning AND the PodReady condition equals coreV1.ConditionTrue instead
of allowing PodPending. Update the phaseReady assignment (and the equivalent
block later) to inspect pod.Status.Conditions for coreV1.PodReady and compare
its Status to coreV1.ConditionTrue, and use that combined check when gating
hostPath evidence collection (replace references to coreV1.PodPending with the
PodReady-condition check).
In `@tests/vm_scanning_suite_test.go`:
- Around line 1223-1225: The assertion requiring acks_received > 0 contradicts
the known-gap note; update the test around requirePositive(comp, cq[5], ...) so
it is conditional: if the ACK-propagation fix/feature is enabled (use your
existing feature toggle or flag check) then assert > 0 via requirePositive(comp,
cq[5], ...), otherwise relax or skip the check (e.g., assert >= 0 or skip the
assertion) and keep the explanatory message; reference the existing call site
requirePositive, the comp variable and cq[5] context when making the conditional
change.
- Around line 762-765: When Create returns IsAlreadyExists, fetch the current
Secret via s.k8sClient.CoreV1().Secrets(s.namespace).Get(ctx, secret.Name,
metaV1.GetOptions{}), copy the existing Secret's ResourceVersion (and any other
server-set metadata you need) onto the local secret object, then call Update
with that updated secret; this ensures the Update on the secret variable
succeeds instead of being rejected for a missing/incorrect resourceVersion.
In `@tests/vm_scanning_utils_test.go`:
- Around line 60-79: Reorder loadVMScanConfig so environment-only prerequisites
(e.g., requireEnv for RHEL_ACTIVATION_ORG and RHEL_ACTIVATION_KEY) are validated
before any host-binary probes; specifically, move the requireEnv checks for
activation-related variables to occur prior to calling discoverVirtctlPath and
discoverRoxagentBinaryPath in loadVMScanConfig (and make the same change in the
similar block around lines 106-123), so missing activation env vars fail fast
and are not masked by local binary discovery errors.
- Around line 25-28: The two constants defaultRepo2CPEPrimaryURL and
defaultRepo2CPEFallbackURL are identical, so on primary fetch failure the code
just retries the same endpoint; change the fallback to be a true fallback by
either (A) requiring the ROXAGENT_REPO2CPE_FALLBACK_URL env var be set and use
that for defaultRepo2CPEFallbackURL, or (B) derive a local file path under the
repo/artifacts directory and set defaultRepo2CPEFallbackURL to that path; update
the constant(s) in tests/vm_scanning_utils_test.go (and any test setup that
depends on defaultRepo2CPEFallbackURL) so the fallback is distinct from
defaultRepo2CPEPrimaryURL and tests assert using the fallback path when the
primary URL is unreachable.
In `@tests/vmhelpers/guest.go`:
- Around line 62-68: The current branch treats any non-transport pgrep error as
"idle"; change it to only treat the pgrep exit-code-1 (no matches) as idle and
propagate all other command failures. In the block handling err (after
errors.Is(err, errSSHTransport)), detect whether err is an *exec.ExitError and
extract the exit status (use ExitError.ExitCode() or
syscall.WaitStatus.ExitStatus()); if the exit code == 1 return (true, "no RHSM
uploader processes", nil), otherwise return (false, "", err) so real pgrep
failures are surfaced. Update the logic around the err handling in the function
that invokes pgrep in tests/vmhelpers/guest.go accordingly.
In `@tests/vmhelpers/ssh_command_framework.go`:
- Around line 104-126: The retry loop in retryOnSSHTransport currently retries
any error matching errSSHTransport, causing terminal failures to spin; introduce
and use a distinct sentinel errRetryableSSHTransport for only retryable
transport glitches and update runSSHCommandWithFramework so it returns
errRetryableSSHTransport for transient/socket/connection issues but returns the
original errSSHTransport (or a wrapped error not matching
errRetryableSSHTransport) for terminal auth failures (bad key/user); then change
callers to pass the same fn into retryOnSSHTransport unchanged (so
retryOnSSHTransport continues to check errors.Is(..., errRetryableSSHTransport)
rather than errSSHTransport) ensuring terminal auth errors are returned
immediately while only genuine retryable transport errors are retried.
In `@tests/vmhelpers/virtctl_ssh.go`:
- Around line 29-39: The buildVirtctlSSHCommand function currently uses
strconv.Quote which emits Go string literals unsafe for POSIX shells; update
buildVirtctlSSHCommand to produce shell-safe single-quoted arguments (escape any
single quote by closing, inserting '"'"', and reopening) or call a known library
such as github.com/alessio/shellescape.Quote for each arg instead of
strconv.Quote so remote --command receives correct POSIX-quoted arguments; keep
the existing join behavior (strings.Join with spaces) and replace references to
strconv.Quote in buildVirtctlSSHCommand accordingly.
In `@tests/vmhelpers/virtctl.go`:
- Around line 136-157: The current formatRemoteCommandStreamsForInlineLog only
truncates stdout but not stderr, allowing unbounded stderr into logs; modify the
function to apply truncateMiddleLines(stderr, inlineLogMaxHeadTailLines) (after
trimming) before writing the "stderr:\n" block so both stderr and stdout are
subject to the same large-output guard; keep use of strings.TrimSpace, preserve
labels ("stderr:" then "stdout:"), and reuse the existing truncateMiddleLines
and inlineLogMaxHeadTailLines symbols.
---
Nitpick comments:
In `@tests/testmetrics/parse.go`:
- Around line 123-143: parseCounter currently uses strings.HasPrefix(line,
metricPrefix) which can incorrectly match longer metric names; update
parseCounter to ensure the character immediately after metricPrefix in the line
(if any) is either '{' (labels), whitespace (space or tab), or end-of-line so
you only match the exact metric name, and apply the identical validation to
parseCounterWithLabel; locate the checks using metricPrefix inside parseCounter
and parseCounterWithLabel and replace the simple HasPrefix acceptance with this
exact-name post-character check before parsing fields and values.
In `@vm-image.sh`:
- Around line 51-52: The DISK_NAME assignment uses a non-deterministic pipeline
("find ... | head -1") which can pick different files; update the logic around
DISK_NAME (the line that constructs DISK_NAME using WORKDIR) to use a
deterministic first-match find invocation (e.g., find ... -print -quit) or,
better, enumerate matches and validate there is exactly one .qcow2/.img file and
error if zero or more than one; ensure the echo "==> Disk: $DISK_NAME" only runs
after successful validation.
- Around line 60-64: The script currently bakes plaintext credentials into
customize.sh by interpolating REGISTER_ARGS; instead stop expanding
REGISTER_ARGS into the file and have the script read credentials from
environment variables at runtime (e.g. REGISTER_USERNAME and REGISTER_PASSWORD)
and invoke subscription-manager without placing secrets on the command line (for
example pipe or heredoc to subscription-manager or otherwise provide password
via stdin), then explicitly unset those env vars and remove WORKDIR after use;
update the code that writes customize.sh (and the script body) to reference
REGISTER_USERNAME/REGISTER_PASSWORD and to avoid passing secrets as command-line
args to subscription-manager (use stdin/heredoc/piping) and ensure cleanup/unset
of the env vars inside the customize.sh script.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 0e87f1b1-3de3-46b6-b925-e5a38a23e7b9
📒 Files selected for processing (47)
- .openshift-ci/begin.sh
- .openshift-ci/ci_tests.py
- .openshift-ci/dispatch.sh
- docs/superpowers/plans/2026-03-27-vm-scanning-e2e.md
- docs/superpowers/specs/2026-03-26-vm-scanning-e2e-design.md
- scripts/ci/jobs/ocp_vm_scanning_e2e_tests.py
- test.sh
- tests/Makefile
- tests/e2e/lib.sh
- tests/e2e/run-vm-scanning.sh
- tests/e2e/vm-scanning-lib.sh
- tests/e2e/yaml/cnv-operator/hyperconverged.yaml
- tests/e2e/yaml/cnv-operator/namespace.yaml
- tests/e2e/yaml/cnv-operator/operator-group.yaml
- tests/e2e/yaml/cnv-operator/subscription.yaml
- tests/testdata/vm-scanning/cloud-init.yaml.tmpl
- tests/testdata/vm-scanning/vmscanning.go
- tests/testmetrics/parse.go
- tests/testmetrics/parse_test.go
- tests/testmetrics/scrape.go
- tests/testmetrics/scrape_test.go
- tests/vm_scanning_e2e_utils_preflight_test.go
- tests/vm_scanning_e2e_utils_test.go
- tests/vm_scanning_suite_test.go
- tests/vm_scanning_test.go
- tests/vm_scanning_utils_test.go
- tests/vmhelpers/central.go
- tests/vmhelpers/central_test.go
- tests/vmhelpers/guest.go
- tests/vmhelpers/guest_ssh_internal.go
- tests/vmhelpers/guest_ssh_policy.go
- tests/vmhelpers/guest_test.go
- tests/vmhelpers/observability.go
- tests/vmhelpers/observability_test.go
- tests/vmhelpers/roxagent.go
- tests/vmhelpers/roxagent_test.go
- tests/vmhelpers/ssh_command_framework.go
- tests/vmhelpers/virtctl.go
- tests/vmhelpers/virtctl_process_nonunix.go
- tests/vmhelpers/virtctl_process_unix.go
- tests/vmhelpers/virtctl_ssh.go
- tests/vmhelpers/virtctl_test.go
- tests/vmhelpers/vm.go
- tests/vmhelpers/vm_test.go
- tests/vmhelpers/wait.go
- tests/vmhelpers/wait_test.go
- vm-image.sh
```sh
verify_kvm_available() {
  local kvm_capacity
  kvm_capacity="$(kubectl get nodes -o jsonpath='{.items[*].status.capacity.devices\.kubevirt\.io/kvm}' 2>/dev/null || true)"
  if [[ -z "$kvm_capacity" ]] || [[ "$kvm_capacity" =~ ^[[:space:]]*0*[[:space:]]*$ ]]; then
    die "No worker nodes with devices.kubevirt.io/kvm > 0. Nested virtualization may not be enabled on this cluster."
```
KVM preflight can false-pass on multi-node zero capacity and ignores worker scoping.
Line 33 collects capacities across all nodes, and Line 34’s regex can miss the all-zero multi-token case (e.g., 0 0). This can let the suite proceed on unsupported clusters.
Suggested fix
```diff
 verify_kvm_available() {
   local kvm_capacity
-  kvm_capacity="$(kubectl get nodes -o jsonpath='{.items[*].status.capacity.devices\.kubevirt\.io/kvm}' 2>/dev/null || true)"
-  if [[ -z "$kvm_capacity" ]] || [[ "$kvm_capacity" =~ ^[[:space:]]*0*[[:space:]]*$ ]]; then
+  kvm_capacity="$(kubectl get nodes -l 'node-role.kubernetes.io/worker' \
+    -o jsonpath='{range .items[*]}{.status.capacity.devices\.kubevirt\.io/kvm}{"\n"}{end}' 2>/dev/null || true)"
+  if ! awk 'NF && ($1 + 0) > 0 { found=1 } END { exit(found ? 0 : 1) }' <<<"$kvm_capacity"; then
     die "No worker nodes with devices.kubevirt.io/kvm > 0. Nested virtualization may not be enabled on this cluster."
   fi
   info "KVM device capacity on nodes: ${kvm_capacity}"
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
verify_kvm_available() {
  local kvm_capacity
  kvm_capacity="$(kubectl get nodes -l 'node-role.kubernetes.io/worker' \
    -o jsonpath='{range .items[*]}{.status.capacity.devices\.kubevirt\.io/kvm}{"\n"}{end}' 2>/dev/null || true)"
  if ! awk 'NF && ($1 + 0) > 0 { found=1 } END { exit(found ? 0 : 1) }' <<<"$kvm_capacity"; then
    die "No worker nodes with devices.kubevirt.io/kvm > 0. Nested virtualization may not be enabled on this cluster."
  fi
  info "KVM device capacity on nodes: ${kvm_capacity}"
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e/vm-scanning-lib.sh` around lines 31 - 35, The verify_kvm_available
function currently aggregates devices.kubevirt.io/kvm across all nodes and uses
a regex that can miss multi-token all-zero results (e.g. "0 0") and doesn't
limit to worker nodes; update it to query only worker nodes (e.g., use a node
selector/label like node-role.kubernetes.io/worker or exclude
control-plane/master labels) and then parse the returned tokens individually to
ensure at least one node has a capacity > 0; produce a list/array from the
kubectl jsonpath output for the symbol verify_kvm_available and iterate over
each token, failing (die) only if every token is zero/empty or if no worker
nodes are found.
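The fix above is in shell/awk, but the token-by-token logic it asks for can be sketched in Go for clarity. `anyPositiveCapacity` is a hypothetical helper, not code from the PR; it shows why a single regex misses the multi-node all-zero case like "0 0".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// anyPositiveCapacity parses whitespace-separated per-node capacity tokens
// (as returned by the kubectl jsonpath query) and reports whether at least
// one node advertises a capacity greater than zero. Empty input (no worker
// nodes matched) and non-numeric tokens count as no capacity.
func anyPositiveCapacity(raw string) bool {
	for _, tok := range strings.Fields(raw) {
		if n, err := strconv.Atoi(tok); err == nil && n > 0 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(anyPositiveCapacity("0 0"))    // all-zero multi-node cluster: not available
	fmt.Println(anyPositiveCapacity("0 1000")) // one worker exposes KVM devices
	fmt.Println(anyPositiveCapacity(""))       // no worker nodes found
}
```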
```go
func ScrapeComponent(ctx context.Context, clientset kubernetes.Interface, target ScrapeTarget, transport Transport, queries []Query) (map[string]Value, error) {
	snaps, err := CollectFromPods(ctx, clientset, PodCollectOptions{
		Namespace:     target.Namespace,
		LabelSelector: target.LabelSelector,
		FieldSelector: target.FieldSelector,
		Port:          target.MetricsPort,
		MetricsPath:   target.MetricsPath,
		Transport:     transport,
	})
```
Missing RestConfig in ScrapeComponent breaks port-forward transport.
ScrapeComponent constructs PodCollectOptions without setting RestConfig. When transport == TransportPortForward, this will always fail with the error at line 182-184 since RestConfig is required for port-forward transport.
🐛 Proposed fix
The function signature needs to accept *rest.Config and pass it through:
```diff
-func ScrapeComponent(ctx context.Context, clientset kubernetes.Interface, target ScrapeTarget, transport Transport, queries []Query) (map[string]Value, error) {
+func ScrapeComponent(ctx context.Context, clientset kubernetes.Interface, restCfg *rest.Config, target ScrapeTarget, transport Transport, queries []Query) (map[string]Value, error) {
 	snaps, err := CollectFromPods(ctx, clientset, PodCollectOptions{
 		Namespace:     target.Namespace,
 		LabelSelector: target.LabelSelector,
 		FieldSelector: target.FieldSelector,
 		Port:          target.MetricsPort,
 		MetricsPath:   target.MetricsPath,
 		Transport:     transport,
+		RestConfig:    restCfg,
 	})
```

📝 Committable suggestion
```go
func ScrapeComponent(ctx context.Context, clientset kubernetes.Interface, restCfg *rest.Config, target ScrapeTarget, transport Transport, queries []Query) (map[string]Value, error) {
	snaps, err := CollectFromPods(ctx, clientset, PodCollectOptions{
		Namespace:     target.Namespace,
		LabelSelector: target.LabelSelector,
		FieldSelector: target.FieldSelector,
		Port:          target.MetricsPort,
		MetricsPath:   target.MetricsPath,
		Transport:     transport,
		RestConfig:    restCfg,
	})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/testmetrics/scrape.go` around lines 234 - 242, The ScrapeComponent
function builds PodCollectOptions without populating RestConfig, which breaks
port-forward transport; update ScrapeComponent signature to accept a
*rest.Config parameter and pass it into the PodCollectOptions.RestConfig field
so that when transport == TransportPortForward the port-forward logic receives
the required rest configuration (modify references in ScrapeComponent and the
PodCollectOptions construction to include RestConfig).
```go
phaseReady := pod.Status.Phase == coreV1.PodRunning || pod.Status.Phase == coreV1.PodPending
if !phaseReady {
```
Do not treat PodPending as ready in VSOCK preflight.
Using pending pods as “ready” can false-pass preflight and shift failures downstream. Gate on running + PodReady=True before accepting hostPath evidence.
Suggested fix
```diff
-	phaseReady := pod.Status.Phase == coreV1.PodRunning || pod.Status.Phase == coreV1.PodPending
+	phaseReady := pod.Status.Phase == coreV1.PodRunning
+	podReady := false
+	for _, c := range pod.Status.Conditions {
+		if c.Type == coreV1.PodReady && c.Status == coreV1.ConditionTrue {
+			podReady = true
+			break
+		}
+	}
+	phaseReady = phaseReady && podReady
 	if !phaseReady {
 		fmt.Fprintf(&diag, "namespace %q pod %q: phase=%q\n", ns, pod.Name, pod.Status.Phase)
 	}
```

Also applies to: 213-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vm_scanning_e2e_utils_test.go` around lines 194 - 195, The test
currently treats PodPending as ready via the phaseReady boolean; change the
readiness check to require pod.Status.Phase == coreV1.PodRunning AND the
PodReady condition equals coreV1.ConditionTrue instead of allowing PodPending.
Update the phaseReady assignment (and the equivalent block later) to inspect
pod.Status.Conditions for coreV1.PodReady and compare its Status to
coreV1.ConditionTrue, and use that combined check when gating hostPath evidence
collection (replace references to coreV1.PodPending with the PodReady-condition
check).
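The combined gate can be shown with simplified stand-in types (in the real test these are `coreV1.Pod`, `coreV1.PodRunning`, `coreV1.PodReady`, and `coreV1.ConditionTrue` from client-go; the local structs here exist only to keep the sketch self-contained):

```go
package main

import "fmt"

// condition and podStatus are simplified stand-ins for the corev1 types.
type condition struct{ Type, Status string }
type podStatus struct {
	Phase      string
	Conditions []condition
}

// podIsReady mirrors the suggested gate: the phase must be Running AND the
// Ready condition must be True. Pending never counts as ready, so evidence
// collection is not gated on pods that may never start.
func podIsReady(st podStatus) bool {
	if st.Phase != "Running" {
		return false
	}
	for _, c := range st.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	pending := podStatus{Phase: "Pending"}
	running := podStatus{Phase: "Running", Conditions: []condition{{"Ready", "True"}}}
	fmt.Println(podIsReady(pending), podIsReady(running))
}
```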
```go
_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Create(ctx, secret, metaV1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
	_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Update(ctx, secret, metaV1.UpdateOptions{})
}
```
Fetch the existing Secret before updating it.
On the reuse path this Update sends a brand-new object without the current resourceVersion, so Kubernetes will reject it. That turns namespace reuse into a hard failure instead of idempotent setup.
Suggested fix
```diff
 _, err = s.k8sClient.CoreV1().Secrets(s.namespace).Create(ctx, secret, metaV1.CreateOptions{})
 if apierrors.IsAlreadyExists(err) {
-	_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Update(ctx, secret, metaV1.UpdateOptions{})
+	existing, getErr := s.k8sClient.CoreV1().Secrets(s.namespace).Get(ctx, vmImagePullSecretName, metaV1.GetOptions{})
+	require.NoError(s.T(), getErr, "get existing image pull secret %q in namespace %q", vmImagePullSecretName, s.namespace)
+	existing.Type = coreV1.SecretTypeDockerConfigJson
+	if existing.Data == nil {
+		existing.Data = map[string][]byte{}
+	}
+	existing.Data[coreV1.DockerConfigJsonKey] = dockerCfg
+	_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Update(ctx, existing, metaV1.UpdateOptions{})
 }
```

📝 Committable suggestion
```go
_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Create(ctx, secret, metaV1.CreateOptions{})
if apierrors.IsAlreadyExists(err) {
	existing, getErr := s.k8sClient.CoreV1().Secrets(s.namespace).Get(ctx, vmImagePullSecretName, metaV1.GetOptions{})
	require.NoError(s.T(), getErr, "get existing image pull secret %q in namespace %q", vmImagePullSecretName, s.namespace)
	existing.Type = coreV1.SecretTypeDockerConfigJson
	if existing.Data == nil {
		existing.Data = map[string][]byte{}
	}
	existing.Data[coreV1.DockerConfigJsonKey] = dockerCfg
	_, err = s.k8sClient.CoreV1().Secrets(s.namespace).Update(ctx, existing, metaV1.UpdateOptions{})
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vm_scanning_suite_test.go` around lines 762 - 765, When Create returns
IsAlreadyExists, fetch the current Secret via
s.k8sClient.CoreV1().Secrets(s.namespace).Get(ctx, secret.Name,
metaV1.GetOptions{}), copy the existing Secret's ResourceVersion (and any other
server-set metadata you need) onto the local secret object, then call Update
with that updated secret; this ensures the Update on the secret variable
succeeds instead of being rejected for a missing/incorrect resourceVersion.
```go
requirePositive(comp, cq[5],
	"compliance relay acks_received should be > 0 when Central ACKs an index report; "+
		"known gap: Sensor does not propagate ACKs back to the compliance relay yet")
```
This assertion contradicts the known-gap note.
The test currently requires acks_received > 0, but the message says ACKs are not propagated back to the compliance relay yet. As written, the suite will fail until that product gap is closed. Gate this assertion behind the feature/fix, or relax it for now.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vm_scanning_suite_test.go` around lines 1223 - 1225, The assertion
requiring acks_received > 0 contradicts the known-gap note; update the test
around requirePositive(comp, cq[5], ...) so it is conditional: if the
ACK-propagation fix/feature is enabled (use your existing feature toggle or flag
check) then assert > 0 via requirePositive(comp, cq[5], ...), otherwise relax or
skip the check (e.g., assert >= 0 or skip the assertion) and keep the
explanatory message; reference the existing call site requirePositive, the comp
variable and cq[5] context when making the conditional change.
This is known and okay like this. We are working in a parallel track to add the support for ACKs
@vikin91, understood — the assertion is intentionally failing-until-fixed, acting as a built-in gate that will start passing once the ACK-propagation work in the parallel track lands. I'll note this for future reviews.
📝 Learnings were identified and not saved (knowledge base disabled).
```go
func loadVMScanConfig() (*vmScanConfig, error) {
	cfg := &vmScanConfig{}

	var err error
	if cfg.ImageRHEL9, err = requireEnv("VM_IMAGE_RHEL9"); err != nil {
		return nil, err
	}
	if cfg.ImageRHEL10, err = requireEnv("VM_IMAGE_RHEL10"); err != nil {
		return nil, err
	}

	cfg.GuestUserRHEL9 = envOrDefault("VM_GUEST_USER_RHEL9", defaultGuestUser)
	cfg.GuestUserRHEL10 = envOrDefault("VM_GUEST_USER_RHEL10", defaultGuestUser)

	if cfg.VirtctlPath, err = discoverVirtctlPath(); err != nil {
		return nil, err
	}
	if cfg.RoxagentBinaryPath, err = discoverRoxagentBinaryPath(); err != nil {
		return nil, err
	}
```
Validate env-only prerequisites before probing local binaries.
loadVMScanConfig looks up virtctl and roxagent before it checks activation requirements. That means a missing RHEL_ACTIVATION_ORG/RHEL_ACTIVATION_KEY is masked by host-specific tool discovery failures, and the config tests in this file stop being hermetic.
💡 Proposed reordering
```diff
-	if cfg.VirtctlPath, err = discoverVirtctlPath(); err != nil {
-		return nil, err
-	}
-	if cfg.RoxagentBinaryPath, err = discoverRoxagentBinaryPath(); err != nil {
-		return nil, err
-	}
-
 	cfg.Repo2CPEPrimaryURL = envOrDefault("ROXAGENT_REPO2CPE_PRIMARY_URL", defaultRepo2CPEPrimaryURL)
 	cfg.Repo2CPEFallbackURL = envOrDefault("ROXAGENT_REPO2CPE_FALLBACK_URL", defaultRepo2CPEFallbackURL)
 	cfg.Repo2CPEPrimaryAttempts = envIntOrDefault("ROXAGENT_REPO2CPE_PRIMARY_ATTEMPTS", defaultRepo2CPEAttempts)
@@
 	if cfg.RequireActivation {
 		if cfg.ActivationOrg == "" {
 			return nil, errors.New("activation is required but RHEL_ACTIVATION_ORG is not set")
 		}
 		if cfg.ActivationKey == "" {
 			return nil, errors.New("activation is required but RHEL_ACTIVATION_KEY is not set")
 		}
 	}
+
+	if cfg.VirtctlPath, err = discoverVirtctlPath(); err != nil {
+		return nil, err
+	}
+	if cfg.RoxagentBinaryPath, err = discoverRoxagentBinaryPath(); err != nil {
+		return nil, err
+	}
```

Also applies to: 106-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vm_scanning_utils_test.go` around lines 60 - 79, Reorder
loadVMScanConfig so environment-only prerequisites (e.g., requireEnv for
RHEL_ACTIVATION_ORG and RHEL_ACTIVATION_KEY) are validated before any
host-binary probes; specifically, move the requireEnv checks for
activation-related variables to occur prior to calling discoverVirtctlPath and
discoverRoxagentBinaryPath in loadVMScanConfig (and make the same change in the
similar block around lines 106-123), so missing activation env vars fail fast
and are not masked by local binary discovery errors.
```go
if err != nil {
	if errors.Is(err, errSSHTransport) {
		return false, "", err
	}
	// pgrep exits 1 when no processes match.
	return true, "no RHSM uploader processes", nil
}
```
Only treat pgrep “no matches” as idle.
This branch converts every non-transport pgrep failure into “idle”. If pgrep itself fails for any other reason, the precheck silently succeeds and the later dnf flow runs on bad assumptions, which will make the suite flaky and much harder to diagnose. Please distinguish exit-code-1/no-match from real command failures and propagate the latter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vmhelpers/guest.go` around lines 62 - 68, The current branch treats any
non-transport pgrep error as "idle"; change it to only treat the pgrep
exit-code-1 (no matches) as idle and propagate all other command failures. In
the block handling err (after errors.Is(err, errSSHTransport)), detect whether
err is an *exec.ExitError and extract the exit status (use ExitError.ExitCode()
or syscall.WaitStatus.ExitStatus()); if the exit code == 1 return (true, "no
RHSM uploader processes", nil), otherwise return (false, "", err) so real pgrep
failures are surfaced. Update the logic around the err handling in the function
that invokes pgrep in tests/vmhelpers/guest.go accordingly.
// retryOnSSHTransport retries fn whenever it returns an errSSHTransport error.
// Non-transport errors and nil are returned immediately. The retry loop is
// bounded by ctx — callers should set an appropriate deadline/timeout.
func retryOnSSHTransport(ctx context.Context, logf func(string, ...any), desc string, fn func(ctx context.Context) error) error {
	var lastErr error
	for attempt := 1; ; attempt++ {
		lastErr = fn(ctx)
		if lastErr == nil || !errors.Is(lastErr, errSSHTransport) {
			return lastErr
		}
		if logf != nil {
			logf("%s: SSH transport error (attempt %d), retrying in %s: %v",
				desc, attempt, sshTransportRetryInterval, lastErr)
		}
		timer := time.NewTimer(sshTransportRetryInterval)
		select {
		case <-ctx.Done():
			timer.Stop()
			return fmt.Errorf("%s: context expired while retrying SSH transport error: %w (last: %v)",
				desc, ctx.Err(), lastErr)
		case <-timer.C:
		}
	}
Don’t retry terminal SSH transport failures.
retryOnSSHTransport retries any error that matches errSSHTransport, but runSSHCommandWithFramework uses that same sentinel for both retryable transport glitches and terminal failures like SSH authentication errors. A bad key/user will therefore spin until the outer context times out instead of failing immediately.
💡 Suggested direction
-var errSSHTransport = errors.New("SSH transport error")
+var (
+ errSSHTransport = errors.New("SSH transport error")
+ errRetryableSSHTransport = errors.New("retryable SSH transport error")
+)

func retryOnSSHTransport(ctx context.Context, logf func(string, ...any), desc string, fn func(ctx context.Context) error) error {
var lastErr error
for attempt := 1; ; attempt++ {
lastErr = fn(ctx)
- if lastErr == nil || !errors.Is(lastErr, errSSHTransport) {
+ if lastErr == nil || !errors.Is(lastErr, errRetryableSSHTransport) {
return lastErr
}Then wrap only the retryable branch in runSSHCommandWithFramework with errRetryableSSHTransport, while leaving terminal auth failures as plain errSSHTransport.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vmhelpers/ssh_command_framework.go` around lines 104 - 126, The retry
loop in retryOnSSHTransport currently retries any error matching
errSSHTransport, causing terminal failures to spin; introduce and use a distinct
sentinel errRetryableSSHTransport for only retryable transport glitches and
update runSSHCommandWithFramework so it returns errRetryableSSHTransport for
transient/socket/connection issues but returns the original errSSHTransport (or
a wrapped error not matching errRetryableSSHTransport) for terminal auth
failures (bad key/user); then change callers to pass the same fn into
retryOnSSHTransport unchanged (so retryOnSSHTransport continues to check
errors.Is(..., errRetryableSSHTransport) rather than errSSHTransport) ensuring
terminal auth errors are returned immediately while only genuine retryable
transport errors are retried.
// buildVirtctlSSHCommand joins shell command parts into one string with per-argument strconv.Quote quoting.
func buildVirtctlSSHCommand(command ...string) string {
	if len(command) == 0 {
		return ""
	}
	quoted := make([]string, 0, len(command))
	for _, arg := range command {
		quoted = append(quoted, strconv.Quote(arg))
	}
	return strings.Join(quoted, " ")
}
🧩 Analysis chain
🌐 Web query:
In Go, is strconv.Quote appropriate for POSIX shell argument escaping when building a single SSH remote command string (e.g., --command "...")? If not, what escaping strategy is recommended?
💡 Result:
No, strconv.Quote is not appropriate for POSIX shell argument escaping when building a single SSH remote command string (e.g., ssh host --command "..."). strconv.Quote produces Go-syntax double-quoted string literals with escapes like \t, \n, \xFF that are not valid in POSIX shell syntax, leading to syntax errors or incorrect interpretation when executed remotely via the shell (SSH runs the command through the remote user's shell with a -c invocation).
Citations:
- 1: https://go.dev/src/strconv/quote.go
- 2: https://go.dev/pkg/strconv/
- 3: https://pkg.go.dev/[email protected]
- 4: https://github.com/alessio/shellescape
- 5: https://github.com/alessio/shellescape/blob/master/shellescape.go
- 6: https://groups.google.com/g/golang-nuts/c/5oxAFSqfN9E
- 7: https://pkg.go.dev/github.com/alessio/shellescape
- 8: https://pkg.go.dev/github.com/chhongzh/shlex
- 9: https://pkg.go.dev/github.com/apparentlymart/go-shquot/shquot
Use shell-safe quoting for remote --command, not strconv.Quote.
strconv.Quote produces Go-literal syntax with escapes like \t and \n that are not valid in POSIX shell, causing parsing failures. For example, strconv.Quote("foo\tbar") produces "foo\tbar" where \t is a literal backslash-t, not a tab, breaking argument semantics when passed to the remote shell via SSH.
Replace with proper POSIX shell quoting (single quotes with ' escaped as '"'"'), or use an established library like github.com/alessio/shellescape.Quote.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vmhelpers/virtctl_ssh.go` around lines 29 - 39, The
buildVirtctlSSHCommand function currently uses strconv.Quote which emits Go
string literals unsafe for POSIX shells; update buildVirtctlSSHCommand to
produce shell-safe single-quoted arguments (escape any single quote by closing,
inserting '"'"', and reopening) or call a known library such as
github.com/alessio/shellescape.Quote for each arg instead of strconv.Quote so
remote --command receives correct POSIX-quoted arguments; keep the existing join
behavior (strings.Join with spaces) and replace references to strconv.Quote in
buildVirtctlSSHCommand accordingly.
// formatRemoteCommandStreamsForInlineLog formats stdout and stderr for multi-line completion logs.
// Large outputs are truncated to the first and last inlineLogMaxHeadTailLines lines.
func formatRemoteCommandStreamsForInlineLog(stdout, stderr string) string {
	var b strings.Builder
	stderr = strings.TrimSpace(stderr)
	stdout = strings.TrimSpace(stdout)
	if stderr != "" {
		b.WriteString("stderr:\n")
		b.WriteString(stderr)
	}
	if stdout != "" {
		if b.Len() > 0 {
			b.WriteString("\n")
		}
		b.WriteString("stdout:\n")
		b.WriteString(truncateMiddleLines(stdout, inlineLogMaxHeadTailLines))
	}
	if b.Len() == 0 {
		return "output: <empty stdout/stderr>"
	}
	return b.String()
}
Truncate stderr too, not just stdout.
The large-output guard only applies to stdout right now. A noisy failure path can still dump unbounded stderr into CI logs, which is the common case for command failures and defeats the “large outputs are truncated” contract.
💡 Proposed fix
func formatRemoteCommandStreamsForInlineLog(stdout, stderr string) string {
var b strings.Builder
stderr = strings.TrimSpace(stderr)
stdout = strings.TrimSpace(stdout)
if stderr != "" {
b.WriteString("stderr:\n")
- b.WriteString(stderr)
+ b.WriteString(truncateMiddleLines(stderr, inlineLogMaxHeadTailLines))
}
if stdout != "" {
if b.Len() > 0 {
b.WriteString("\n")
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/vmhelpers/virtctl.go` around lines 136 - 157, The current
formatRemoteCommandStreamsForInlineLog only truncates stdout but not stderr,
allowing unbounded stderr into logs; modify the function to apply
truncateMiddleLines(stderr, inlineLogMaxHeadTailLines) (after trimming) before
writing the "stderr:\n" block so both stderr and stdout are subject to the same
large-output guard; keep use of strings.TrimSpace, preserve labels ("stderr:"
then "stdout:"), and reuse the existing truncateMiddleLines and
inlineLogMaxHeadTailLines symbols.
🚀 Build Images Ready
Images are ready for commit dffe479. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-689-gdffe479fcf
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #20042 +/- ##
==========================================
- Coverage 49.67% 49.50% -0.18%
==========================================
Files 2765 2778 +13
Lines 209049 211031 +1982
==========================================
+ Hits 103855 104476 +621
- Misses 97517 98824 +1307
- Partials 7677 7731 +54
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
User request (feature): "design the end-to-end tests for the VM scanning feature."
User request (branch): "add the changes to the piotr/ROX-29577-VM4VM-e2e-tests branch."
Add spec and impl plan (Superpower artifacts).
Add a Go test_e2e VM-scanning suite that provisions RHEL 9 and RHEL 10 KubeVirt VMs, installs and runs roxagent, validates Central scan visibility, and captures progressive diagnostics for flaky-path triage.
Wire a dedicated OpenShift CI job and e2e scripts with CNV/VSOCK prerequisite checks and activation/repo2cpe guardrails.
AI-assisted: this commit was partially generated with AI assistance.
Made-with: Cursor
This change is part of the following stack: Change managed by git-spice.
61abd1f to dffe479 Compare
Description
User request (feature): "Design the end-to-end tests for the VM scanning feature."
Add a Go test_e2e VM-scanning suite that provisions RHEL 9 and RHEL 10 KubeVirt VMs, installs and runs roxagent, validates Central scan visibility, and captures progressive diagnostics for flaky-path triage.
Wire a dedicated OpenShift CI job and e2e scripts with CNV/VSOCK prerequisite checks and activation/repo2cpe guardrails.
AI-assisted: this commit was partially generated with AI assistance.
Made-with: Cursor
User-facing documentation
Testing and quality
Automated testing
How I validated my change