managarten/scripts/mac-mini/configure-ollama.sh
Till JS 56ffcbac39 feat: add Ollama memory optimization, LLM metrics, and chat streaming
Three improvements to the unified LLM infrastructure:

1. Ollama memory optimization (scripts/mac-mini/configure-ollama.sh):
   - OLLAMA_KEEP_ALIVE=5m → models unload after 5min idle (saves 3-16GB RAM)
   - OLLAMA_NUM_PARALLEL=1 → predictable memory usage
   - OLLAMA_MAX_LOADED_MODELS=1 → max 1 model in RAM at a time

2. Request-level metrics in @manacore/shared-llm:
   - LlmRequestMetrics interface (model, latency, tokens, fallback detection)
   - LlmMetricsCollector class with summary stats (for health endpoints)
   - Optional onMetrics callback in LlmModuleOptions
   - Automatic metrics emission in chatMessages() (success + error)

3. Chat streaming (token-by-token SSE):
   - Backend: POST /chat/completions/stream SSE endpoint
   - OllamaService.createStreamingCompletion() via llm.chatStreamMessages()
   - ChatService.createStreamingCompletion() with upfront credit consumption
   - Web: chatApi.createStreamingCompletion() SSE consumer
   - Chat store: sendMessage() now streams tokens into assistant message
   - UI updates reactively as each token arrives (a rough curl smoke test follows this list)
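
   A rough manual smoke test for the new streaming endpoint (a sketch only:
   host, port, and request body are assumptions, and any required auth header
   is omitted):

     curl -N -X POST http://localhost:3000/chat/completions/stream \
       -H "Content-Type: application/json" \
       -H "Accept: text/event-stream" \
       -d '{"messages": [{"role": "user", "content": "Hello"}]}'

   Tokens should arrive as individual SSE "data:" events rather than as one
   final JSON body, with the UI rendering them as they stream in.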

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 09:41:33 +01:00

90 lines · 3 KiB · Bash · Executable file

#!/bin/bash
# Configure Ollama for optimal memory usage on Mac Mini
#
# Sets OLLAMA_KEEP_ALIVE=5m so models unload from RAM after 5 minutes
# of inactivity. This is critical on the 16GB Mac Mini where Ollama
# models can consume 3-16 GB RAM.
#
# Run on the Mac Mini:
# ./scripts/mac-mini/configure-ollama.sh
set -e
PLIST_DIR="$HOME/Library/LaunchAgents"
OLLAMA_PLIST="$PLIST_DIR/homebrew.mxcl.ollama.plist"
echo "=== Ollama Memory Optimization ==="
echo ""
# Check if Ollama is installed
if ! command -v ollama &>/dev/null && [ ! -f /opt/homebrew/bin/ollama ]; then
  echo "ERROR: Ollama not found. Install with: brew install ollama"
  exit 1
fi
# Create an override LaunchAgent that sets the environment variables.
# A separate agent leaves the Homebrew-managed service plist untouched
# (brew can regenerate that file), while still applying the env vars at login.
OVERRIDE_PLIST="$PLIST_DIR/com.manacore.ollama-env.plist"
cat > "$OVERRIDE_PLIST" << 'PLIST'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.manacore.ollama-env</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>-c</string>
    <string>
      # Set Ollama environment variables for the user's launchd session
      launchctl setenv OLLAMA_KEEP_ALIVE 5m
      launchctl setenv OLLAMA_FLASH_ATTENTION 1
      launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
      launchctl setenv OLLAMA_NUM_PARALLEL 1
      launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
    </string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
PLIST
echo "Created: $OVERRIDE_PLIST"
# The LaunchAgent above takes effect at the next login; also apply the
# variables to the current launchd session right away (no reboot needed)
launchctl setenv OLLAMA_KEEP_ALIVE 5m
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
echo ""
echo "Environment variables set:"
echo " OLLAMA_KEEP_ALIVE=5m (unload models after 5min idle → saves 3-16GB RAM)"
echo " OLLAMA_FLASH_ATTENTION=1 (faster attention computation)"
echo " OLLAMA_KV_CACHE_TYPE=q8_0 (efficient KV cache)"
echo " OLLAMA_NUM_PARALLEL=1 (max 1 parallel request → predictable memory)"
echo " OLLAMA_MAX_LOADED_MODELS=1 (max 1 model in RAM at a time)"
echo ""
# Restart Ollama to pick up new settings
echo "Restarting Ollama..."
/opt/homebrew/bin/brew services restart ollama 2>/dev/null || {
  echo "Homebrew restart failed, trying launchctl..."
  launchctl stop homebrew.mxcl.ollama 2>/dev/null
  sleep 2
  launchctl start homebrew.mxcl.ollama 2>/dev/null
}
echo ""
echo "Done! Verify with:"
echo " ollama ps # Should show no loaded models (or model with 5m timeout)"
echo " curl localhost:11434/api/ps # Same via API"
echo ""
echo "Expected behavior:"
echo " - First request: ~2-5s cold start (model loads into RAM)"
echo " - Subsequent requests within 5min: instant (model in RAM)"
echo " - After 5min idle: model unloads, RAM freed"