mirror of
https://github.com/Memo-2023/mana-monorepo.git
synced 2026-05-14 20:21:09 +02:00
docs(reports): geocoding self-hosting decision — recommend Photon on mana-gpu
Compares Pelias / Nominatim / Photon for self-hosting on the GPU server, with current (2026-04-28) numbers from upstream docs + GraphHopper's Photon-data downloads: Photon Europe pre-built dump: 30.6 GB, weekly refresh Photon Germany pre-built dump: 5.8 GB, weekly refresh Nominatim Germany import: ~100 GB disk, 8–12 h, 12 GB RAM Pelias DACH (current): 3 GB RAM, 4 services, JS patch hack Recommendation: Photon Europe-wide on mana-gpu. Single Java process, embedded OpenSearch, no PBF import (download a tarball, restart), weekly auto-updates from GraphHopper, integrates with the wrapper's existing PhotonProvider via just an env-var change. Once self-hosted, Photon registers as `privacy: 'local'` — the sensitive-query block (Hausarzt, Klinikum, …) gets a real local backend and no longer has to return empty results when Pelias is down. Public Photon stays in the chain as a `privacy: 'public'` last-resort fallback. Migration plan included (~3–4 h total, ~1 h waiting), with phase-by-phase risk assessment. Pelias does not return — the 3 GB RAM + multi-container + patched JS combination has no operational case once we have a self-hosted Photon that already matches our wrapper's wire format.
This commit is contained in:
parent
e4d9dc5b2e
commit
6f83fba66a
2 changed files with 219 additions and 0 deletions
219
docs/reports/geocoding-self-hosting-2026-04-28.md
Normal file
219
docs/reports/geocoding-self-hosting-2026-04-28.md
Normal file
|
|
@ -0,0 +1,219 @@
|
||||||
|
# Geocoding Self-Hosting — Decision Report
|
||||||
|
|
||||||
|
**Status:** Recommendation — pending migration
|
||||||
|
**Date:** 2026-04-28
|
||||||
|
**Context:** Pelias was retired from the Mac mini on 2026-04-28 (3 GB RAM was crushing the host into 8.6 GB swap). The wrapper now serves all queries through public Photon + Nominatim, with sensitive-query blocking + coord quantization as privacy mitigations. We need a self-hosted geocoder back in the chain so sensitive queries (`Hausarzt`, `Klinikum`, …) don't return zero results when the user actually wants them, and so we don't depend on a third party for routine address lookups.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
**Self-host [Photon](https://github.com/komoot/photon) (Europe-wide) on `mana-gpu`.**
|
||||||
|
|
||||||
|
- **Disk:** ~80 GB unpacked (we have it on the GPU server)
|
||||||
|
- **RAM:** 4–8 GB Java heap (negligible vs the Mac mini's 3 GB Pelias overhead)
|
||||||
|
- **Setup:** download a pre-built tarball from GraphHopper, `docker run`, point the wrapper at it. **No PBF import, no patching, no Elasticsearch container to babysit.**
|
||||||
|
- **Updates:** weekly re-download of the latest dump, ~30 min of cron + `docker restart`
|
||||||
|
- **Maintenance:** single Java process, no schema migration, no admin lookups, no sensitive config
|
||||||
|
|
||||||
|
This replaces Pelias entirely. Once it's running, Photon becomes a **`privacy: 'local'`** provider and the sensitive-query block now has a real local backend to fall back to — meaning users can search for medical/crisis services without hitting the public OSM at all.
|
||||||
|
|
||||||
|
Pelias does not return.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision criteria
|
||||||
|
|
||||||
|
In rough priority order:
|
||||||
|
|
||||||
|
1. **Privacy fit** — must serve sensitive queries (Hausarzt, Psychiater, …) without leaking to a third party. Means we need a `privacy: 'local'` provider.
|
||||||
|
2. **Operational cost** — every minute spent on geocoding is a minute not spent on Mana itself. Setup, updates, recovery from breakage.
|
||||||
|
3. **Resource fit** — must coexist with STT/TTS/Image-Gen/Video-Gen/Ollama on the GPU server without GPU-pass-through conflicts.
|
||||||
|
4. **DACH data quality** — German addresses + venue names. Compound-word handling ("Münsterplatz"), umlauts, postcode formats.
|
||||||
|
5. **API surface** — autocomplete (typing-fast suggestions), forward search, reverse geocoding. Categories nice-to-have.
|
||||||
|
6. **Reuse of existing wrapper code** — we already have provider adapters for Pelias, Photon, Nominatim. Anything that doesn't match one of those means new code.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Candidates
|
||||||
|
|
||||||
|
### 1. Pelias (current, retired)
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|---|---|
|
||||||
|
| **RAM** | ~3.2 GB (libpostal: 2 GB, ES: 1.2 GB, API: 100 MB) |
|
||||||
|
| **Disk** | ~5 GB ES index |
|
||||||
|
| **Setup** | 4 docker services + manual `dach-latest.osm.pbf` rename + `analysis-icu` plugin install + 30–45 min import + patched `geojsonify_place_details.js` |
|
||||||
|
| **Updates** | Manual re-import (30–45 min) every few weeks |
|
||||||
|
| **Wire format** | Multi-tag categories (`food/retail/nightlife`) — richest of the three |
|
||||||
|
| **Privacy** | `local` (self-hosted) |
|
||||||
|
| **Pre-built data** | None — must run the importer |
|
||||||
|
|
||||||
|
**Verdict:** the multi-tag taxonomy is genuinely useful but everything else is friction. The patched JS file (overriding `condition: checkCategoryParam` → `condition: () => true`) is a permanent maintenance liability — it has to be regenerated on every Pelias API image bump. There is no operational reason to bring Pelias back.
|
||||||
|
|
||||||
|
### 2. Nominatim
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|---|---|
|
||||||
|
| **RAM** | 12 GB during import for Germany alone; 2 GB minimum to run; 128 GB recommended for planet |
|
||||||
|
| **Disk** | **~100 GB for Germany alone** (per [user reports](https://github.com/mediagis/nominatim-docker/discussions/265)); 1 TB for planet |
|
||||||
|
| **Setup** | One docker-compose (Postgres + Nominatim worker), 8–12 h import for Germany |
|
||||||
|
| **Updates** | OSM replication via differential updates (continuous) |
|
||||||
|
| **Wire format** | `class:type` raw OSM tags (already mapped in our `osm-category-map.ts`) |
|
||||||
|
| **Privacy** | `local` |
|
||||||
|
| **Pre-built data** | None — must run the importer |
|
||||||
|
|
||||||
|
**Verdict:** the disk number is the killer. **100 GB for Germany alone** is wildly disproportionate for our use case (mostly DACH addresses + restaurant names), driven by the flatnode file plus the rich admin-boundary indexing Nominatim does. The 8–12 h import is also bad — every geographic data refresh becomes a half-day operation. Used by OSM itself and Wikipedia, so quality is unquestionable, but the resource fit is wrong for a side service.
|
||||||
|
|
||||||
|
### 3. Photon (recommended)
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|---|---|
|
||||||
|
| **RAM** | 4–8 GB Java heap configurable via `-Xmx`; planet-wide deployment recommends 64 GB but Europe runs comfortably on 6–8 GB |
|
||||||
|
| **Disk** | **5.8 GB for Germany dump (compressed), 30.6 GB for full Europe v1.x dump** ([GraphHopper downloads](https://download1.graphhopper.com/public/europe/index.html)). Unpacks to ~80 GB for Europe. |
|
||||||
|
| **Setup** | `docker run`, mount the unpacked dump, expose port 2322. **No PBF import.** |
|
||||||
|
| **Updates** | **Weekly pre-built dumps from GraphHopper.** Download new tar.bz2, restart. ~30 min total operator time. |
|
||||||
|
| **Wire format** | `osm_key:osm_value` raw OSM tags (already mapped) |
|
||||||
|
| **Privacy** | `local` once self-hosted |
|
||||||
|
| **Pre-built data** | **Yes — country, region, and planet, refreshed weekly** |
|
||||||
|
|
||||||
|
**Verdict:** the "pre-built index" is the deciding feature. It collapses the entire data-pipeline complexity that Pelias and Nominatim ask us to manage. Java 21 + embedded OpenSearch in a single process. The wire format already matches our existing `PhotonProvider` adapter — switching from "public Photon" to "self-hosted Photon" is literally an env-var change.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resource comparison summary
|
||||||
|
|
||||||
|
| Tool | Setup time | RAM (steady) | Disk | Update mechanism | Maintenance burden |
|
||||||
|
|---|---|---|---|---|---|
|
||||||
|
| **Pelias DACH** | 30–45 min import + patch hack | 3.2 GB | 5 GB | Manual re-import | High (4 containers, JS patch) |
|
||||||
|
| **Nominatim Germany** | 8–12 h import | 2–4 GB | **~100 GB** | OSM replication | Medium (Postgres tuning) |
|
||||||
|
| **Photon Europe** | 5–10 min download | 4–8 GB | 30 GB → 80 GB unpacked | **Weekly tarball** | Low (1 container, no DB) |
|
||||||
|
| **Photon Germany** | 2–5 min download | 2–4 GB | 5.8 GB → ~15 GB unpacked | Weekly tarball | Low |
|
||||||
|
|
||||||
|
For DACH+ scope, Photon-Germany is the lightest option that still covers all our users. Photon-Europe is the only-marginally-heavier option that future-proofs against any non-DACH user (events module, travel scenarios).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Privacy implications
|
||||||
|
|
||||||
|
Currently the wrapper has two `privacy: 'public'` providers (Photon, Nominatim) and zero `local` ones (Pelias is stopped). A sensitive query like "Hausarzt Konstanz" returns 0 results with `notice: 'sensitive_local_unavailable'` — privacy-correct but UX-painful.
|
||||||
|
|
||||||
|
**After self-hosting Photon on `mana-gpu`:**
|
||||||
|
|
||||||
|
- Photon-self-hosted is registered with `privacy: 'local'`
|
||||||
|
- The sensitive-query block now has a real backend → users get results without their query leaving our network
|
||||||
|
- Public Photon and Nominatim can stay in the chain as last-resort `privacy: 'public'` fallbacks for obscure non-DACH queries
|
||||||
|
- OR drop them entirely — we no longer need third-party fallbacks if our own Photon is reliable
|
||||||
|
|
||||||
|
**Recommendation:** keep public Photon as a third-tier `public` fallback, drop public Nominatim. The chain becomes:
|
||||||
|
|
||||||
|
```
|
||||||
|
1. self-hosted Photon (mana-gpu) privacy: local
|
||||||
|
2. public Photon (komoot.io) privacy: public ← only when self-hosted is down
|
||||||
|
AND query isn't sensitive
|
||||||
|
```
|
||||||
|
|
||||||
|
This gives us belt-and-suspenders: even if a Pelias/Photon migration breaks something, sensitive queries still hold the privacy line because the chain filters public providers in `localOnly` mode regardless of which one is up.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Migration plan
|
||||||
|
|
||||||
|
Estimated total time: **3–4 hours**, of which ~1 h is download/unpack waiting time. Most of it is one-off setup that won't be repeated.
|
||||||
|
|
||||||
|
### Phase 1 — GPU server prep (1.5 h, requires physical access)
|
||||||
|
|
||||||
|
1. Verify `mana-gpu` has ≥ 100 GB free disk on a fast SSD. Photon Java heap is GC-sensitive; spinning rust would hurt latency.
|
||||||
|
2. Install **Docker Desktop for Windows** with WSL2 backend. (WSL2 is more compatible with the Java 21 + OpenSearch stack than native Hyper-V containers.)
|
||||||
|
3. Verify existing GPU services (Ollama, image-gen, video-gen, STT, TTS) still work after Docker Desktop install — Hyper-V mode can briefly conflict with CUDA. Run a quick STT inference smoke as the canary.
|
||||||
|
4. Open inbound TCP 2322 in Windows Firewall, restricted to LAN only.
|
||||||
|
|
||||||
|
### Phase 2 — Photon container (45 min, ~30 min of which is download)
|
||||||
|
|
||||||
|
1. `mkdir D:\photon-data` (or wherever you've got space)
|
||||||
|
2. Download from GraphHopper:
|
||||||
|
```powershell
|
||||||
|
cd D:\photon-data
|
||||||
|
curl -O https://download1.graphhopper.com/public/europe/photon-db-europe-1.0-latest.tar.bz2
|
||||||
|
tar -xjf photon-db-europe-1.0-latest.tar.bz2
|
||||||
|
```
|
||||||
|
(Country-only is also viable — start with Germany if you want to get something running fast and switch to Europe later.)
|
||||||
|
3. Run Photon:
|
||||||
|
```powershell
|
||||||
|
docker run -d --name photon -p 2322:2322 `
|
||||||
|
-v D:\photon-data\photon_data:/photon/photon_data `
|
||||||
|
komoot/photon
|
||||||
|
```
|
||||||
|
4. Smoke test from the GPU server:
|
||||||
|
```powershell
|
||||||
|
curl http://localhost:2322/api?q=Konstanz`&limit=2
|
||||||
|
```
|
||||||
|
|
||||||
|
### Phase 3 — Wire it into the wrapper (30 min)
|
||||||
|
|
||||||
|
In `services/mana-geocoding/.env` (or `docker-compose.macmini.yml`'s mana-geocoding env block):
|
||||||
|
|
||||||
|
```env
|
||||||
|
GEOCODING_PROVIDERS=self_photon,photon
|
||||||
|
PHOTON_API_URL=http://192.168.178.11:2322 # self_photon points here
|
||||||
|
# Keep PHOTON_API_URL_PUBLIC=https://photon.komoot.io as last-resort
|
||||||
|
```
|
||||||
|
|
||||||
|
In `services/mana-geocoding/src/app.ts`, register a second Photon provider with `privacy: 'local'` (a small refactor — the existing `PhotonProvider` class takes config, just instantiate twice).
|
||||||
|
|
||||||
|
In `services/mana-geocoding/src/providers/photon.ts`, expose `privacy` as a constructor argument so the same class can serve both roles.
|
||||||
|
|
||||||
|
Tests: extend `chain.test.ts` to verify the order pelias-class → photon-class → public Photon → public Nominatim.
|
||||||
|
|
||||||
|
### Phase 4 — Validate + cut over (30 min)
|
||||||
|
|
||||||
|
1. Deploy the updated wrapper to mana-server.
|
||||||
|
2. Smoke: `curl https://mana.how/api/v1/geocode/search?q=Hausarzt+Konstanz` should now return real results (was empty before this work).
|
||||||
|
3. Health: `curl https://mana.how/api/v1/geocode/health/providers` should show `self_photon: healthy`.
|
||||||
|
4. Watch latency for 24 h via the existing Prometheus probes.
|
||||||
|
5. Pelias container can be deleted from Mac mini (`docker compose -f services/mana-geocoding/pelias/docker-compose.yml down -v`) — frees 5 GB disk + the Docker volume.
|
||||||
|
|
||||||
|
### Phase 5 — Maintenance baseline (10 min/week)
|
||||||
|
|
||||||
|
1. Cron job on mana-gpu: every Sunday night, download the latest Photon dump, unpack to a sibling directory, swap-symlink, restart container. ~30 min unattended.
|
||||||
|
2. Keep CLAUDE.md in `services/mana-geocoding/` updated when the topology changes.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open questions
|
||||||
|
|
||||||
|
1. **GPU server RAM** — we don't know the actual amount. If it's <16 GB, drop to Photon-Germany only and skip Europe.
|
||||||
|
2. **Backup strategy** — Photon's data is reproducible (download from GraphHopper anytime), so no backup needed. Confirm this assumption — if GraphHopper goes away, we lose the easy-update path.
|
||||||
|
3. **Reverse-geocode quality** — Photon's reverse implementation is OK but not its strongest feature. If we see degraded reverse results vs the old Pelias setup, we can layer a tiny Nominatim instance on top later. Not worth doing pre-emptively.
|
||||||
|
4. **Cross-LAN latency** — adds 5–20 ms vs the old localhost setup. Acceptable; cache TTL stays 24 h for local provider.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why not other tools
|
||||||
|
|
||||||
|
- **Mimirsbrunn** (Pelias-derived): less maintained, French/Spanish focus, smaller community. No win over Photon.
|
||||||
|
- **Gisgraphy:** Java + Postgres, similar resource profile to Nominatim, less actively maintained than either Nominatim or Photon. No win.
|
||||||
|
- **OpenAddresses + custom indexer:** months of work, and we'd be the only users. Hard pass.
|
||||||
|
- **Self-hosted Mapbox:** doesn't exist as such; their offering requires their cloud.
|
||||||
|
- **Bezahltes API als Backup-Tier (MapTiler / OpenCage):** still worth adding later as a 4th tier behind self-hosted-Photon + public-fallbacks. Not blocking.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What this avoids
|
||||||
|
|
||||||
|
- **Re-running the Pelias import pipeline.** That alone would have been 45–90 min of operator time per data refresh.
|
||||||
|
- **The libpostal RAM tax.** Photon does its own address parsing without libpostal's 2 GB model.
|
||||||
|
- **The patched JS file.** Photon returns OSM tags by default; no API patch needed.
|
||||||
|
- **A second Postgres tenant.** Nominatim would force one. Photon is fully self-contained.
|
||||||
|
- **Public-API dependency for the warm path.** Photon-self-hosted is privacy-clean for ALL queries, not just sensitive ones.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Sources
|
||||||
|
|
||||||
|
- [Photon GitHub repo & README](https://github.com/komoot/photon) — hardware requirements, Java 21+, OpenSearch backend
|
||||||
|
- [GraphHopper Photon downloads (Europe)](https://download1.graphhopper.com/public/europe/index.html) — 30.6 GB Europe v1.x; 5.8 GB Germany v1.x; weekly refresh
|
||||||
|
- [Nominatim 5.3.2 Installation docs](https://nominatim.org/release-docs/latest/admin/Installation/) — 128 GB RAM recommended planet, 1 TB disk
|
||||||
|
- [mediagis/nominatim-docker discussion #265](https://github.com/mediagis/nominatim-docker/discussions/265) — Germany-import resource reports (12 GB RAM, ~100 GB disk, 8–12 h)
|
||||||
|
- [Photon OpenSearch wiki page](https://wiki.openstreetmap.org/wiki/Photon) — region scoping, memory tuning
|
||||||
|
- Internal: [`services/mana-geocoding/CLAUDE.md`](../../services/mana-geocoding/CLAUDE.md) for the current Pelias setup we're replacing
|
||||||
Loading…
Add table
Add a link
Reference in a new issue