
feat(vmm): preserve serial logs across VM restarts #548

Merged
kvinwang merged 2 commits into master from
feat/vmm-serial-log-history
Mar 17, 2026


@kvinwang kvinwang commented Mar 17, 2026

Problem

When a dstack VM crashes (kernel panic with panic=1) or is restarted, QEMU truncates serial.log on the next boot, losing all logs from previous boots. This is especially painful in TDX VMs, where non-resettable CPUs trigger the "cpus are not resettable, terminating" error and rapid restart loops, making it impossible to diagnose the original panic.

Changes

vmm: preserve serial logs across restarts

  • Before each VM restart, append serial.log content to serial.history.log with a timestamped ===== boot @ <rfc3339> ===== separator
  • Cap serial.history.log at a configurable max size (default 4MB, trimmed from the front to keep most-recent data)
  • Add the same boot separator timestamps to stdout.log and stderr.log so individual QEMU sessions are clearly delimited
  • New config option serial_history_max_bytes (supports human-readable sizes: "4M", "512K")
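
The archive-and-cap behaviour described above can be sketched as follows. This is a minimal Python illustration, not the vmm implementation: the function names, the placement of the separator before each archived boot, and the choice to drop a partial first line after trimming are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path


def boot_separator(now=None) -> bytes:
    """Build a '===== boot @ <rfc3339> =====' line for the given (or current) UTC time."""
    now = now or datetime.now(timezone.utc)
    return f"===== boot @ {now.isoformat(timespec='seconds')} =====\n".encode()


def trim_front(data: bytes, max_bytes: int) -> bytes:
    """Keep at most max_bytes from the end of data, dropping the oldest bytes."""
    if len(data) <= max_bytes:
        return data
    tail = data[len(data) - max_bytes:]
    newline = tail.find(b"\n")
    # Assumption: skip the partial first line so the history starts on a whole line.
    if newline != -1 and newline + 1 < len(tail):
        return tail[newline + 1:]
    return tail


def archive_serial_log(serial_log: Path, history_log: Path, max_bytes: int, now=None) -> None:
    """Append the previous boot's serial.log to the history file, then cap its size."""
    if not serial_log.exists():
        return
    previous = serial_log.read_bytes()
    if not previous:
        return
    if not previous.endswith(b"\n"):
        previous += b"\n"
    existing = history_log.read_bytes() if history_log.exists() else b""
    combined = existing + boot_separator(now) + previous
    history_log.write_bytes(trim_front(combined, max_bytes))
```

Calling `archive_serial_log(Path("serial.log"), Path("serial.history.log"), 4 * 1024 * 1024)` just before QEMU is relaunched would preserve the previous boot's output while keeping the history file bounded. Reading the whole history into memory is acceptable here only because the cap is small.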

ci: fix Docker Build Check on PRs

  • github.sha for pull_request events is a temporary merge commit that only exists in GitHub's internal refs and is never pushed to the repo — Dockerfiles doing git clone && git checkout ${DSTACK_REV} fail with exit 128
  • Use github.event.pull_request.head.sha || github.sha so PRs use the actual branch HEAD while push-to-branch events continue to use github.sha
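
As a rough illustration of the fallback expression, a workflow fragment might look like the one below. Only the `github.event.pull_request.head.sha || github.sha` expression and the `DSTACK_REV` name come from this PR; the job name, step layout, and `docker build` invocation are hypothetical placeholders.

```yaml
# Hypothetical workflow fragment; job and step names are placeholders.
on:
  pull_request:
  push:

jobs:
  docker-build-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        env:
          # On PRs this is the branch HEAD; on pushes head.sha is empty,
          # so the expression falls back to github.sha.
          DSTACK_REV: ${{ github.event.pull_request.head.sha || github.sha }}
        run: docker build --build-arg DSTACK_REV="$DSTACK_REV" .
```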

Test plan

  • Restart a VM, verify serial.history.log is created with previous boot's serial output and a ===== boot @ <timestamp> ===== separator
  • Verify stdout.log and stderr.log contain boot separators between restarts
  • Verify history truncation works when exceeding configured serial_history_max_bytes — tested with 1K limit, 88KB history correctly trimmed to 991 bytes keeping tail content
  • Verify serial.log still works normally for the current boot
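
The 1K limit used in the truncation test relies on human-readable size parsing for serial_history_max_bytes. A minimal Python sketch of such a parser is shown below. It is not the vmm code: the `parse_size` name, the use of binary multiples (1K = 1024 bytes), and the optional trailing "B" are assumptions made for illustration.

```python
import re

# Assumption: binary multiples, so "1K" means 1024 bytes.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([KMG]?)B?\s*$", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert strings such as '4M', '512K', or '4096' into a byte count."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]
```

Under these assumptions, `parse_size("4M")` yields 4194304 bytes and a bare number such as `"4096"` is taken as a byte count.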

QEMU truncates serial.log on each boot, losing previous crash/panic
logs that are critical for debugging restart loops.

- Append serial.log to serial.history.log with boot timestamp separator
  before each VM restart, capped at a configurable max size (default 4MB)
- Add boot separator timestamps to stdout.log and stderr.log so
  individual boot sessions are clearly delimited
- Add `serial_history_max_bytes` config option (supports human-readable
  sizes like "4MB", "512KB")

github.sha in pull_request events is a temporary merge commit that only
exists in GitHub's internal refs (refs/pull/N/merge) and is never pushed
to the repo. Dockerfiles that do `git clone && git checkout ${DSTACK_REV}`
fail with exit 128 because this SHA is unreachable.

Use github.event.pull_request.head.sha for PR events, falling back to
github.sha for push-to-branch events where it is always a real commit.
@kvinwang kvinwang merged commit c019113 into master Mar 17, 2026
14 checks passed