fix(mac-mini): make startup.sh idempotent and non-destructive

The previous startup.sh checked colima status via `colima status | grep running`
and, if that failed, ran `colima stop --force` unconditionally before starting.
This is destructive: a transient status mis-detection can kill a healthy running
VM, and the subsequent start often hangs because of leftover locks/processes.

Triggered today during the ManaCore→Mana rename: reloading the docker-startup
LaunchAgent ran the script, which falsely concluded colima was down, killed the
running VM, and left 12 zombie limactl processes plus a stale disk lock symlink.
The whole production stack (including Forgejo) was offline until manual cleanup.

Changes:
- Use `docker info` as the readiness check instead of `colima status` —
  it directly tests the thing we care about (docker socket reachable)
- Only do cleanup work when we actually need to start; never SIGKILL a
  running VM as a "precaution"
- When we do need to start: reap any zombie limactl/colima processes from
  prior failed runs, and clear the stale disk-in-use lock if no process
  actually holds it
- Verify successful start with `docker info`, not `colima status`
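
The check-then-start flow described by these changes can be sketched as a small
shell function. This is a minimal illustration, not the code in startup.sh:
`ensure_runtime` and its two arguments are hypothetical names, and the probe
and start commands are passed in so the flow can be tried without Colima or
Docker installed.

```shell
#!/bin/sh
# Sketch only: ensure_runtime, probe and start_cmd are hypothetical names
# introduced for illustration; they do not appear in the real startup.sh.

# ensure_runtime PROBE START_CMD
#   PROBE      command that succeeds when the runtime is usable
#   START_CMD  command that starts the runtime
# Both are simple command strings and are word-split on purpose.
ensure_runtime() {
    probe=$1
    start_cmd=$2

    # Healthy runtime: change nothing, so repeated runs are harmless.
    if $probe >/dev/null 2>&1; then
        echo "already running"
        return 0
    fi

    # Cleanup and start belong on this path only.
    echo "starting"
    $start_cmd >/dev/null 2>&1 || true

    # Verify with the same probe that was used for the initial check.
    if ! $probe >/dev/null 2>&1; then
        echo "ERROR: failed to start"
        return 1
    fi
    echo "started"
}
```

In the real script the probe would correspond to `docker info` and the start
command to the full `colima start` invocation, for example
`ensure_runtime "docker info" "colima start"`.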

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-07 13:19:46 +02:00
parent a9529bcf1b
commit af9b1f9369

@@ -54,15 +54,34 @@ if [ ! -d "/Volumes/ManaData" ]; then
 fi
 # ─── Start Colima ───
-if colima status 2>/dev/null | grep -q "running"; then
-    log "Colima already running"
+# Use `docker info` as the source of truth for "is the runtime usable" instead
+# of `colima status`, which can mis-report and trigger a destructive restart.
+if docker info >/dev/null 2>&1; then
+    log "Colima already running (docker reachable)"
 else
+    log "Colima not reachable, preparing fresh start..."
+    # Reap zombie colima/limactl processes from previously failed starts.
+    # These hold locks that prevent a clean start. Do NOT touch a running VM.
+    for pat in "colima stop" "limactl stop" "colima daemon" "limactl hostagent" "limactl usernet"; do
+        pids=$(pgrep -f "$pat" || true)
+        if [ -n "$pids" ]; then
+            log " reaping stale: $pat ($pids)"
+            echo "$pids" | xargs kill -9 2>/dev/null || true
+        fi
+    done
+    sleep 1
+    # Clear stale disk lock if no process actually owns it.
+    # The lock is a symlink at /Volumes/ManaData/colima-disk/in_use_by → ~/.colima/_lima/colima
+    # If the symlink exists but no limactl/vz process is running, the lock is stale.
+    LOCK="/Volumes/ManaData/colima-disk/in_use_by"
+    if [ -L "$LOCK" ] && ! pgrep -f "limactl hostagent" >/dev/null 2>&1; then
+        log " removing stale disk lock: $LOCK"
+        rm -f "$LOCK"
+    fi
     log "Starting Colima..."
-    # Clear stale process state from hard shutdown (stop only, never delete — delete wipes all images)
-    colima stop --force 2>/dev/null || true
-    sleep 2
     colima start \
         --cpu 8 \
         --memory 12 \
@@ -74,8 +93,9 @@ else
         --mount /Volumes/ManaData:w \
         2>&1 | tee -a "$LOG_FILE"
-    if ! colima status 2>/dev/null | grep -q "running"; then
-        log "ERROR: Colima failed to start"
+    # Verify with docker info, not colima status (more reliable)
+    if ! docker info >/dev/null 2>&1; then
+        log "ERROR: Colima failed to start (docker not reachable)"
         exit 1
     fi
     log "Colima started successfully"