feat: add Ollama memory optimization, LLM metrics, and chat streaming

Three improvements to the unified LLM infrastructure:

1. Ollama memory optimization (scripts/mac-mini/configure-ollama.sh):
   - OLLAMA_KEEP_ALIVE=5m → models unload after 5min idle (saves 3-16GB RAM)
   - OLLAMA_NUM_PARALLEL=1 → predictable memory usage
   - OLLAMA_MAX_LOADED_MODELS=1 → max 1 model in RAM at a time

2. Request-level metrics in @manacore/shared-llm:
   - LlmRequestMetrics interface (model, latency, tokens, fallback detection)
   - LlmMetricsCollector class with summary stats (for health endpoints)
   - Optional onMetrics callback in LlmModuleOptions
   - Automatic metrics emission in chatMessages() (success + error)

3. Chat streaming (token-by-token SSE):
   - Backend: POST /chat/completions/stream SSE endpoint
   - OllamaService.createStreamingCompletion() via llm.chatStreamMessages()
   - ChatService.createStreamingCompletion() with upfront credit consumption
   - Web: chatApi.createStreamingCompletion() SSE consumer
   - Chat store: sendMessage() now streams tokens into assistant message
   - UI updates reactively as each token arrives

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
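The SSE consumer mentioned in item 3 is not part of the hunks shown here. As a rough illustration of how token-by-token consumption can work, here is a minimal sketch of an incremental SSE buffer parser. The function name `parseSseBuffer`, the JSON payload shape `{"token": "..."}`, and the `[DONE]` end marker are all assumptions made for this example, not the wire format used by the commit.

```typescript
// Hypothetical sketch: the payload shape and the [DONE] sentinel are assumed,
// not taken from the commit.
interface SseParseResult {
  tokens: string[]; // tokens decoded from complete events
  done: boolean; // true once the assumed end marker was seen
  rest: string; // trailing partial event, to be prepended to the next chunk
}

function parseSseBuffer(buffer: string): SseParseResult {
  const tokens: string[] = [];
  let done = false;

  // SSE events are separated by a blank line; the last piece may be incomplete.
  const frames = buffer.split('\n\n');
  const rest = frames.pop() ?? '';

  for (const frame of frames) {
    for (const line of frame.split('\n')) {
      if (!line.startsWith('data:')) continue; // skip comments and other fields
      // Strip exactly one optional space after the colon so that tokens with
      // leading whitespace inside the JSON payload are preserved.
      const payload = line.startsWith('data: ') ? line.slice(6) : line.slice(5);
      if (payload === '[DONE]') {
        done = true;
        continue;
      }
      const parsed = JSON.parse(payload) as { token?: string };
      if (typeof parsed.token === 'string') tokens.push(parsed.token);
    }
  }
  return { tokens, done, rest };
}

// Two network chunks, the first ending in the middle of an event.
const first = parseSseBuffer('data: {"token":"Hel"}\n\ndata: {"token":"lo"}\n\ndata: {"to');
const second = parseSseBuffer(first.rest + 'ken":"!"}\n\ndata: [DONE]\n\n');
const assembled = [...first.tokens, ...second.tokens].join('');
```

Carrying `rest` over between chunks is the important part: a network read can end in the middle of an event, so only complete events should be appended to the assistant message.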
2026-05-23 21:36:41 +02:00 · 2026-03-24 09:41:33 +01:00 · 2026-03-24 09:41:33 +01:00 · 56ffcbac39
commit 56ffcbac39
parent ecda4535d8
13 changed files with 462 additions and 29 deletions
--- a/packages/shared-llm/src/interfaces/llm-options.interface.ts
+++ b/packages/shared-llm/src/interfaces/llm-options.interface.ts
@@ -1,4 +1,5 @@
 import type { ModuleMetadata, Type } from '@nestjs/common';
+import type { MetricsCallback } from '../utils/metrics';

 export interface LlmModuleOptions {
 	/** mana-llm service URL (default: http://localhost:3025) */
@@ -13,6 +14,8 @@ export interface LlmModuleOptions {
 	maxRetries?: number;
 	/** Enable debug logging (default: false) */
 	debug?: boolean;
+	/** Optional callback invoked after every LLM request with metrics */
+	onMetrics?: MetricsCallback;
 }

 export interface LlmModuleAsyncOptions extends Pick<ModuleMetadata, 'imports'> {
@@ -33,6 +36,7 @@ export interface ResolvedLlmOptions {
 	timeout: number;
 	maxRetries: number;
 	debug: boolean;
+	onMetrics?: MetricsCallback;
 }

 export function resolveOptions(options: LlmModuleOptions): ResolvedLlmOptions {
@@ -43,5 +47,6 @@ export function resolveOptions(options: LlmModuleOptions): ResolvedLlmOptions {
 		timeout: options.timeout ?? 120_000,
 		maxRetries: options.maxRetries ?? 2,
 		debug: options.debug ?? false,
+		onMetrics: options.onMetrics,
 	};
 }
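To illustrate how the new `onMetrics` option might be used, here is a minimal sketch of a callback that accumulates request metrics in memory and exposes a summary. The real `LlmRequestMetrics` and `MetricsCallback` definitions live in `../utils/metrics` and are not shown in this diff, so the field names below (`latencyMs`, `totalTokens`, `usedFallback`, `error`) and the `InMemoryMetrics` class are assumptions made for the example. It is not the package's `LlmMetricsCollector`.

```typescript
// Hypothetical shapes: stand-ins for the types in ../utils/metrics.
interface LlmRequestMetrics {
  model: string;
  latencyMs: number;
  totalTokens: number;
  usedFallback: boolean;
  error?: string;
}

type MetricsCallback = (metrics: LlmRequestMetrics) => void;

interface MetricsSummary {
  requests: number;
  errors: number;
  fallbacks: number;
  avgLatencyMs: number;
  totalTokens: number;
}

class InMemoryMetrics {
  private readonly entries: LlmRequestMetrics[] = [];

  // Arrow property, so it can be passed as onMetrics without binding `this`.
  readonly record: MetricsCallback = (metrics) => {
    this.entries.push(metrics);
  };

  summary(): MetricsSummary {
    const requests = this.entries.length;
    const totalLatency = this.entries.reduce((sum, m) => sum + m.latencyMs, 0);
    return {
      requests,
      errors: this.entries.filter((m) => m.error !== undefined).length,
      fallbacks: this.entries.filter((m) => m.usedFallback).length,
      avgLatencyMs: requests === 0 ? 0 : totalLatency / requests,
      totalTokens: this.entries.reduce((sum, m) => sum + m.totalTokens, 0),
    };
  }
}

const metrics = new InMemoryMetrics();

// Reduced stand-in for LlmModuleOptions, limited to fields seen in the diff.
const options: { debug?: boolean; onMetrics?: MetricsCallback } = {
  debug: false,
  onMetrics: metrics.record,
};

// Simulate what the library would do after one successful and one failed request.
options.onMetrics?.({ model: 'example-model', latencyMs: 120, totalTokens: 42, usedFallback: false });
options.onMetrics?.({ model: 'example-model', latencyMs: 80, totalTokens: 0, usedFallback: true, error: 'timeout' });

const summary = metrics.summary();
```

Because `onMetrics` is optional in both `LlmModuleOptions` and `ResolvedLlmOptions`, callers that do not pass it are unaffected. The optional call `options.onMetrics?.(...)` above mirrors how a library can emit metrics without requiring a callback.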