From af9b1f9369b13a695d6f83bd540d5f0fb8a56bb6 Mon Sep 17 00:00:00 2001
From: Till JS
Date: Tue, 7 Apr 2026 13:19:46 +0200
Subject: [PATCH] fix(mac-mini): make startup.sh idempotent and non-destructive
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous startup.sh checked colima status via `colima status | grep
running` and, if that failed, ran `colima stop --force` unconditionally
before starting. This is destructive: a transient status mis-detection
can kill a healthy running VM, and the subsequent start often hangs
because of leftover locks/processes.

Triggered today during the ManaCore→Mana rename: reloading the
docker-startup LaunchAgent ran the script, which falsely concluded
colima was down, killed the running VM, and left 12 zombie limactl
processes plus a stale disk lock symlink. The whole production stack
(incl. Forgejo) was offline until manual cleanup.

Changes:
- Use `docker info` as the readiness check instead of `colima status` —
  it directly tests the thing we care about (docker socket reachable)
- Only do cleanup work when we actually need to start; never SIGKILL a
  running VM as a "precaution"
- When we do need to start: reap any zombie limactl/colima processes
  from prior failed runs, and clear the stale disk-in-use lock if no
  process actually holds it
- Verify successful start with `docker info`, not `colima status`

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 scripts/mac-mini/startup.sh | 38 ++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/scripts/mac-mini/startup.sh b/scripts/mac-mini/startup.sh
index c6353d37e..a6fc468bd 100755
--- a/scripts/mac-mini/startup.sh
+++ b/scripts/mac-mini/startup.sh
@@ -54,15 +54,34 @@ if [ ! -d "/Volumes/ManaData" ]; then
 fi
 
 # ─── Start Colima ───
-if colima status 2>/dev/null | grep -q "running"; then
-    log "Colima already running"
+# Use `docker info` as the source of truth for "is the runtime usable" instead
+# of `colima status`, which can mis-report and trigger a destructive restart.
+if docker info >/dev/null 2>&1; then
+    log "Colima already running (docker reachable)"
 else
+    log "Colima not reachable, preparing fresh start..."
+
+    # Reap zombie colima/limactl processes from previously failed starts.
+    # These hold locks that prevent a clean start. Do NOT touch a running VM.
+    for pat in "colima stop" "limactl stop" "colima daemon" "limactl hostagent" "limactl usernet"; do
+        pids=$(pgrep -f "$pat" || true)
+        if [ -n "$pids" ]; then
+            log "  reaping stale: $pat ($pids)"
+            echo "$pids" | xargs kill -9 2>/dev/null || true
+        fi
+    done
+    sleep 1
+
+    # Clear stale disk lock if no process actually owns it.
+    # The lock is a symlink at /Volumes/ManaData/colima-disk/in_use_by → ~/.colima/_lima/colima
+    # If the symlink exists but no limactl/vz process is running, the lock is stale.
+    LOCK="/Volumes/ManaData/colima-disk/in_use_by"
+    if [ -L "$LOCK" ] && ! pgrep -f "limactl hostagent" >/dev/null 2>&1; then
+        log "  removing stale disk lock: $LOCK"
+        rm -f "$LOCK"
+    fi
+
     log "Starting Colima..."
-
-    # Clear stale process state from hard shutdown (stop only, never delete — delete wipes all images)
-    colima stop --force 2>/dev/null || true
-    sleep 2
-
     colima start \
         --cpu 8 \
         --memory 12 \
@@ -74,8 +93,9 @@ else
         --mount /Volumes/ManaData:w \
         2>&1 | tee -a "$LOG_FILE"
 
-    if ! colima status 2>/dev/null | grep -q "running"; then
-        log "ERROR: Colima failed to start"
+    # Verify with docker info, not colima status (more reliable)
+    if ! docker info >/dev/null 2>&1; then
+        log "ERROR: Colima failed to start (docker not reachable)"
         exit 1
     fi
     log "Colima started successfully"
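
Reviewer sketch (not part of the patch): the post-start check above runs `docker info` exactly once. If the docker socket ever becomes reachable slightly after `colima start` returns, a bounded retry around the same probe might be more forgiving. The helper below is a minimal sketch under that assumption; the name `wait_until_ready` and its arguments are hypothetical and do not exist in startup.sh.

```shell
# Hypothetical helper, not part of startup.sh.
# Usage: wait_until_ready ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until_ready() {
    attempts="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Silence the probe; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Pause only if another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Assumed call site, mirroring the patch's final check:
#   wait_until_ready 15 2 docker info || { log "ERROR: ..."; exit 1; }
```

The attempt count and delay in the commented call site are placeholder values, not measurements from the Mac mini. The helper keeps `docker info` as the single source of truth, so it stays consistent with the rest of the patch.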