fix(macmini): repair container auto-recovery (broken --env-file path)

Fix two unrelated bugs in scripts/mac-mini/ensure-containers-running.sh,
both caught while debugging a mana-auth crash loop on 2026-04-08:

1. The recovery path passed --env-file "$PROJECT_ROOT/.env.macmini" to
   docker compose, but that file has never existed on the server — only
   .env does, and compose auto-loads it from the working directory. The
   explicit --env-file silently caused recovered containers to start with
   empty secrets (e.g. blank MANA_AUTH_KEK), which made mana-auth crash
   the moment it came back up. The auto-recovery loop was therefore
   self-defeating: it kept "fixing" auth into the same broken state
   every 5 minutes for hours, with no notification because compose
   exited 0. Drop --env-file entirely and cd into PROJECT_ROOT so
   compose's standard .env discovery applies.

2. mana-infra-minio-init is a one-shot job container that legitimately
   sits in "exited" state after running once. The script flagged it as
   "stuck" every cycle, tried to "recover" it, and spammed the log with
   ERROR lines. Add an explicit ONESHOT_INIT_CONTAINERS allowlist and
   skip those names in both the initial scan and the post-recovery
   verification.
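
The failure mode in item 1 (recovered containers starting with blank
secrets while compose still exits 0) could also be caught before compose
runs. A minimal sketch, assuming plain unquoted KEY=value lines in the
env file; require_nonempty_vars is a hypothetical helper and is not part
of this commit:

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the script): fail loudly when the env file is
# missing or a required variable is absent or blank.
# Assumes plain, unquoted KEY=value lines; quoted values are not handled.
require_nonempty_vars() {
    local env_file="$1"; shift
    if [ ! -f "$env_file" ]; then
        echo "missing env file: $env_file" >&2
        return 1
    fi
    local var value
    for var in "$@"; do
        # Use the last KEY=value line for this key; empty if the key is absent.
        value="$(grep -E "^${var}=" "$env_file" | tail -n 1 | cut -d= -f2-)"
        if [ -z "$value" ]; then
            echo "empty or missing $var in $env_file" >&2
            return 1
        fi
    done
    return 0
}

# Possible call site before the recovery loop (names taken from this commit):
#   require_nonempty_vars "$PROJECT_ROOT/.env" MANA_AUTH_KEK || exit 1
```

With a check like this, a missing file or a blank MANA_AUTH_KEK would stop
the recovery run with a non-zero exit instead of restarting the service
into the same crash.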

Also tee compose output into the log so future failures actually leave
a breadcrumb instead of disappearing into the void.
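
One caveat when piping into tee: in bash, the exit status of a pipeline
is that of its last command unless pipefail is set, so a "|| fallback"
placed after "| tee" reacts to tee, not to compose. Whether the script
enables pipefail is not visible in this diff. A minimal sketch of a
wrapper that keeps the command's own status; run_logged is a
hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the script): run a command, append its
# combined stdout/stderr to a log file, and return the command's own
# exit status rather than tee's.
run_logged() {
    local log_file="$1"; shift
    "$@" 2>&1 | tee -a "$log_file"
    # PIPESTATUS[0] holds the status of the first pipeline stage, i.e. the
    # wrapped command. It must be read right after the pipeline.
    return "${PIPESTATUS[0]}"
}

# Possible use in the recovery loop (variables as named in the script):
#   run_logged "$LOG_FILE" docker compose -f "$COMPOSE_FILE" up -d "$SERVICE_NAME" \
#       || docker start "$container"
```

Setting "set -o pipefail" near the top of the script would be an
alternative with the same effect on the existing "|| { ... }" fallback.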

Also: promote @mlc-ai/web-llm from a transitive dep (via @mana/local-llm)
to a direct dep of @mana/web. SvelteKit's adapter-node post-build
Rollup pass uses the web app's direct deps as its externals heuristic;
without this entry it warns "@mlc-ai/web-llm ... could not be resolved
- treating it as an external dependency" on every build. Functionally
harmless (the dynamic import in LocalLLMEngine only fires in the
browser), but the warning hid a real adapter-node misconfiguration
that would have bitten us if we'd ever tried to SSR /llm-test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Till JS 2026-04-08 18:17:31 +02:00
parent 94ab125fbb
commit c5e5963cbe
3 changed files with 256 additions and 327 deletions


@@ -16,10 +16,25 @@ export PATH="/usr/local/bin:/opt/homebrew/bin:$PATH"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
COMPOSE_FILE="$PROJECT_ROOT/docker-compose.macmini.yml"
ENV_FILE="$PROJECT_ROOT/.env.macmini"
LOG_FILE="/tmp/mana-container-health.log"
RESTART_TRACKER="/tmp/mana-restart-tracker"
+# Container names that legitimately exit after a one-shot job completes.
+# These are NOT broken when in "exited" state — skip them entirely instead
+# of trying to "recover" them every 5 minutes (which both spams the log
+# and would actually re-run the init job needlessly).
+ONESHOT_INIT_CONTAINERS=(
+    mana-infra-minio-init
+)
+is_oneshot_init() {
+    local name="$1"
+    for c in "${ONESHOT_INIT_CONTAINERS[@]}"; do
+        [ "$c" = "$name" ] && return 0
+    done
+    return 1
+}
# Load notification config if exists
if [ -f "$PROJECT_ROOT/.env.notifications" ]; then
source "$PROJECT_ROOT/.env.notifications"
@@ -58,8 +73,16 @@ if ! docker info >/dev/null 2>&1; then
exit 1
fi
-# Get containers that are NOT running (Created, Exited)
-STUCK_CONTAINERS=$(docker ps -a --filter "status=created" --filter "status=exited" --format "{{.Names}}" | grep "^mana-" || true)
+# Get containers that are NOT running (Created, Exited), excluding one-shot
+# init containers that are *expected* to be in "exited" state.
+ALL_STUCK=$(docker ps -a --filter "status=created" --filter "status=exited" --format "{{.Names}}" | grep "^mana-" || true)
+STUCK_CONTAINERS=""
+for c in $ALL_STUCK; do
+    if is_oneshot_init "$c"; then
+        continue
+    fi
+    STUCK_CONTAINERS="${STUCK_CONTAINERS:+$STUCK_CONTAINERS$'\n'}$c"
+done
# Get containers that are crash-looping (Restarting)
CRASHLOOP_CONTAINERS=$(docker ps -a --filter "status=restarting" --format "{{.Names}}" | grep "^mana-" || true)
@@ -187,7 +210,14 @@ for container in $ALL_PROBLEM_CONTAINERS; do
if [ -n "$SERVICE_NAME" ]; then
log " Starting service: $SERVICE_NAME"
-docker compose -f "$COMPOSE_FILE" --env-file "$ENV_FILE" up -d "$SERVICE_NAME" 2>&1 || {
+# NOTE: do NOT pass --env-file here. docker compose auto-loads .env
+# from $PROJECT_ROOT, which is what every other compose invocation
+# in this repo relies on (build-app.sh, deploy.sh, manual ops). The
+# previous --env-file pointed at .env.macmini which never existed
+# on the server, so recoveries silently created containers with
+# blank secrets — that's how mana-auth ended up in a crash loop
+# with empty MANA_AUTH_KEK on 2026-04-08.
+(cd "$PROJECT_ROOT" && docker compose -f "$COMPOSE_FILE" up -d "$SERVICE_NAME") 2>&1 | tee -a "$LOG_FILE" || {
log " WARNING: Failed to start $SERVICE_NAME via compose, trying direct start"
docker start "$container" 2>&1 || true
}
@@ -198,7 +228,14 @@ done
sleep 10
# Verify containers are now running (check for created, exited, AND restarting)
-STILL_STUCK=$(docker ps -a --filter "status=created" --filter "status=exited" --format "{{.Names}}" | grep "^mana-" || true)
+ALL_STILL_STUCK=$(docker ps -a --filter "status=created" --filter "status=exited" --format "{{.Names}}" | grep "^mana-" || true)
+STILL_STUCK=""
+for c in $ALL_STILL_STUCK; do
+    if is_oneshot_init "$c"; then
+        continue
+    fi
+    STILL_STUCK="${STILL_STUCK:+$STILL_STUCK$'\n'}$c"
+done
STILL_CRASHING=$(docker ps -a --filter "status=restarting" --format "{{.Names}}" | grep "^mana-" || true)
ALL_STILL_BROKEN=$(echo -e "$STILL_STUCK\n$STILL_CRASHING" | grep -v "^$" | sort -u || true)
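
The allowlist filter appears twice in this diff, once for the initial
scan and once for the post-recovery verification. As a sketch of how the
two loops might share one code path, the stdin filter below reuses
is_oneshot_init from the script; filter_oneshot is a hypothetical name
and is not part of this commit:

```shell
#!/usr/bin/env bash
# Allowlist and lookup as defined in the script under change.
ONESHOT_INIT_CONTAINERS=(
    mana-infra-minio-init
)

is_oneshot_init() {
    local name="$1" c
    for c in "${ONESHOT_INIT_CONTAINERS[@]}"; do
        [ "$c" = "$name" ] && return 0
    done
    return 1
}

# Hypothetical helper: read container names from stdin, one per line, and
# print only those that are not on the one-shot allowlist.
filter_oneshot() {
    local name
    while IFS= read -r name; do
        [ -z "$name" ] && continue
        is_oneshot_init "$name" || printf '%s\n' "$name"
    done
}

# Possible use in place of either loop (docker flags as in the script):
#   STUCK_CONTAINERS=$(docker ps -a --filter "status=created" --filter "status=exited" \
#       --format "{{.Names}}" | grep "^mana-" | filter_oneshot || true)
```

Reading names line by line also avoids relying on word splitting of the
unquoted $ALL_STUCK and $ALL_STILL_STUCK variables.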