Infrastructure & Deployment Guide
Grio AI Education Platform
Internal Technical Document for Dev Team & DevOps
1. Infrastructure Strategy
Hybrid Multi-Phase Architecture
Grio uses a hybrid infrastructure model optimized for African education contexts, balancing cost, sovereignty, and scalability.
Phase 1: Cloud Foundation (Hetzner)
- Entry point for all new deployments
- Hetzner Cloud (Germany/Finland) for backend, DB, AI workloads
- CX/CPX instances: ~10-30 EUR/month, fast provisioning
- Rationale: GDPR-compliant, no vendor lock-in, ~50% of AWS cost, sovereign data
- Geographic selection: EU default, on-demand region expansion
Phase 2: Edge Layer (Regional)
- Uganda-based edge servers (pilot phase)
- Local curriculum cache, session storage, API response layer
- Schools/district hubs with intermittent connectivity
- Sync mechanisms: daily/weekly to central, real-time when available
Phase 3: On-Premises (Ministry/Government)
- Fully country-controlled infrastructure
- Rack servers in national data center + school clusters
- K3s enables seamless transition from cloud/edge to on-prem
- Database replication, offline-first architecture
Why Hetzner?
| Factor | Hetzner | AWS | Azure | GCP |
|---|---|---|---|---|
| Data Sovereignty | ✓ EU (GDPR) | Limited | Limited | Limited |
| Cost (CPU/RAM) | €0.005/hour | $0.01+ | $0.008+ | $0.005+ |
| No Vendor Lock | ✓ Standard APIs | ✗ Proprietary | ✗ Proprietary | ✗ Proprietary |
| Setup Speed | Minutes | Hours | Hours | Hours |
| Dedicated Support | Included | Extra cost | Extra cost | Extra cost |
2. Compute Layer
Server Sizing by Service
Backend Services (Django + API)
- Min: CX21 (2 vCPU, 4GB RAM) — dev/staging
- Prod: CPX31 (8 vCPU, 16GB RAM) — peak load ~500 concurrent users
- Auto-scaling via K3s: horizontal pod autoscaling, 2-8 replicas
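The 2-8 replica range above can be expressed as a HorizontalPodAutoscaler manifest. A sketch only: it assumes the Deployment is named django-api, and the 70% CPU target is an illustrative threshold, not a documented Grio setting.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: django-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: django-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumption: tune against real load tests
```

Pods must declare CPU requests for utilization-based scaling to work.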
AI Service (LLM Inference)
- Small models (<7B): CPX41 (16 vCPU, 32GB RAM)
- Larger models (>13B): dedicated server + GPU
- Consider vLLM + GGUF quantization to reduce VRAM needs
- Alternative: managed API endpoint (Hugging Face, Together.ai)
Vector Database (Qdrant)
- Min: CX31 (4 vCPU, 8GB RAM) for <1M embeddings
- Prod: CPX51 (24 vCPU, 48GB RAM) for 5M+ embeddings
- SSD mandatory; HDD not suitable
Database (PostgreSQL)
- Dedicated server (PX92 or Hetzner Managed DB)
- Minimum: 32GB RAM, NVMe SSD, daily backups
- Non-negotiable for production
Edge Nodes (Schools/Districts)
- Raspberry Pi 4B (8GB): cache + local sync agent
- Or mini-PC (Intel N100): full local K3s cluster
- Cost: $100-300 per site, supports 50-200 students
3. Containerization (Mandatory)
Docker Standards
All services must be containerized. No exceptions.
Base Images:
# Backend (Django)
FROM python:3.12-slim
# Frontend (Next.js)
FROM node:20-alpine
# AI (PyTorch)
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
Registry: Docker Hub (free) or self-hosted Harbor on Hetzner
Kubernetes via K3s
K3s is lightweight Kubernetes ideal for resource-constrained environments.
Installation:
# Master node
curl -sfL https://get.k3s.io | sh -
# Agent nodes
curl -sfL https://get.k3s.io | K3S_URL=https://master:6443 \
  K3S_TOKEN=<token> sh -
Why K3s over full Kubernetes:
- ~50% smaller memory footprint
- Single binary deployment
- Built-in Traefik ingress + ServiceLB
- Perfect for on-prem + edge
Environment Layers
| Layer | Tool | Replicas | SLA |
|---|---|---|---|
| Local Dev | Docker Compose | 1 | N/A |
| Staging | K3s single-node | 1 | Best effort |
| Production | K3s 3+ nodes | 3-8 | 99.5% uptime |
4. Database Infrastructure
PostgreSQL (Primary)
Hosted Options:
1. Hetzner Managed Database (simplest, automated backups)
2. Dedicated server + manual setup (full control, lower cost)
Backup Strategy:
# Daily backups to MinIO
0 2 * * * /usr/local/bin/backup-postgres.sh
# Retention: 30 days rolling
# Test restores: weekly from backup
Replication:
- Primary-Replica for HA (streaming replication)
- Replica in secondary region (Phase 2 goal)
Qdrant (Vector DB)
Self-hosted vector database for curriculum RAG.
Configuration:
# qdrant/config.yaml
wal:
  dir_path: ./storage/wal
storage:
  snapshots_path: ./storage/snapshots
Sizing:
- float32 embeddings cost 4 bytes per dimension, plus per-vector index/payload overhead
- Curriculum estimate: 100K documents × 1,024 dimensions ≈ 400MB of raw vectors; budget a few GB of SSD once index overhead and snapshots are included
Persistence:
- Daily snapshots to MinIO
- Full snapshot recovery <10 minutes
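The sizing rule of thumb above can be sanity-checked with a few lines of Python. The 2× overhead factor is an assumption for index and payload; actual Qdrant overhead depends on HNSW parameters and payload size.

```python
def estimate_vector_storage_bytes(num_vectors: int, dims: int,
                                  bytes_per_dim: int = 4,
                                  overhead_factor: float = 2.0) -> int:
    """Raw float32 vector size, times an assumed index/payload overhead."""
    return int(num_vectors * dims * bytes_per_dim * overhead_factor)

# 100K curriculum documents, 1,024-dim float32 embeddings
estimate = estimate_vector_storage_bytes(100_000, 1024)
print(f"~{estimate / 1e9:.2f} GB")  # ~0.82 GB
```

At this scale, disk usage is dominated by snapshots and payload data rather than the raw vectors themselves.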
MinIO (S3-Compatible Storage)
Stores videos, lesson PDFs, student work, media assets.
Setup (K3s):
helm install minio minio/minio \
--set rootUser=admin \
--set rootPassword=<secure-pwd> \
--set persistence.size=500Gi
Buckets:
- curriculum-assets/ — lesson videos, PDFs, images
- student-work/ — submissions, quizzes, progress
- system-logs/ — application logs (Loki)
- backups/ — database snapshots
Replication: MinIO internal mirroring or cross-server bucket sync
Redis (Caching Layer)
Session storage, curriculum cache, API response caching.
K3s Deployment:
helm install redis bitnami/redis \
--set auth.enabled=true \
--set auth.password=<secure> \
--set persistence.size=10Gi
TTL Policies:
- Sessions: 24 hours
- API cache: 1 hour
- Curriculum index: 12 hours
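The TTL policy above maps directly onto Django cache settings. A sketch only: the cache aliases, Redis DB numbers, and session engine choice are assumptions for illustration, not the actual Grio settings.py.

```python
# settings.py fragment (illustrative; aliases and DB numbers are assumptions)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://redis:6379/0",
        "TIMEOUT": 60 * 60,            # API cache: 1 hour
    },
    "curriculum": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://redis:6379/1",
        "TIMEOUT": 12 * 60 * 60,       # curriculum index: 12 hours
    },
}
# Sessions stored in the cache, expiring per the 24-hour policy
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_COOKIE_AGE = 24 * 60 * 60      # sessions: 24 hours
```

Django's built-in RedisCache backend (Django 4.0+) keeps this dependency-free on the app side.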
5. Networking & Routing
Load Balancer: Traefik
Kubernetes-native ingress controller (built into K3s).
Configuration:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grio-api
spec:
  tls:
    - hosts:
        - api.grio.local
      secretName: letsencrypt-prod
  rules:
    - host: api.grio.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: django-api
                port:
                  number: 8000
Alternative: NGINX Ingress (heavier, enterprise-grade)
CDN & Caching
Cloudflare (Recommended):
- DNS-only mode keeps traffic off Cloudflare's proxy (data sovereignty); note that CDN caching and DDoS protection apply only to proxied traffic
- If proxying is acceptable: caching rules for static assets (images, videos, CSS), DDoS protection, firewall rules
- Cost: $200/month Business plan, or $0 with DNS-only / self-managed DNS
Self-Managed: Cache-Control headers + Redis + MinIO direct serving
SSL/TLS
Let’s Encrypt via cert-manager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
Auto-renewal: cert-manager renews certificates roughly 30 days before expiry
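The Ingress in section 5 references a letsencrypt-prod secret; cert-manager needs a matching issuer to populate it. A sketch (the contact email is a placeholder assumption):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@grio.local        # assumption: real contact address required
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik          # K3s ships Traefik as the default ingress
```

With this in place, annotating an Ingress with cert-manager.io/cluster-issuer: letsencrypt-prod triggers automatic certificate issuance.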
DNS Configuration
api.grio.local A <Hetzner-IP>
app.grio.local A <Hetzner-IP>
assets.grio.local A <MinIO-IP>
vector.grio.local A <Qdrant-IP>
6. CI/CD Pipeline
GitHub Actions Workflow
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t <registry>/grio-api:${{ github.sha }} .
      - name: Run tests
        run: docker run <registry>/grio-api:${{ github.sha }} pytest
      - name: Push to registry
        run: docker push <registry>/grio-api:${{ github.sha }}
      - name: Deploy to K3s
        run: |
          kubectl set image deployment/django-api \
            django-api=<registry>/grio-api:${{ github.sha }}
          kubectl rollout status deployment/django-api
Staging vs Production
Staging: Auto-deploy on main branch, manual promotion to production
Rollback: kubectl rollout undo deployment/django-api
7. Monitoring & Observability
Prometheus (Metrics)
Scrapes metrics from all services.
# prometheus.yaml
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'django'
    static_configs:
      - targets: ['django-api:8000']
  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']
Grafana (Dashboards)
Pre-built dashboards for K3s, PostgreSQL, application metrics.
Key Metrics:
- API response time (p50, p95, p99)
- CPU/Memory utilization by pod
- Database query latency
- Cache hit rate (Redis)
- Token usage (AI workloads)
Loki (Centralized Logging)
Lightweight log aggregation.
helm install loki grafana/loki-stack \
--set promtail.enabled=true \
--set grafana.enabled=false
Log Retention: 30 days
Uptime Monitoring
Uptime Kuma (self-hosted):
docker run -d --restart unless-stopped \
-p 3001:3001 \
-v uptime-kuma:/app/data \
louislam/uptime-kuma:latest
Critical Alerts
| Alert | Threshold | Action |
|---|---|---|
| API response time | >1000ms p95 | Page on-call |
| Pod crash loop | 3+ restarts/10min | Notify devops |
| Database connection error | >10% | Critical incident |
| Disk usage | >85% | Scale volume |
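The first row of the table could be encoded as a Prometheus alerting rule. A sketch: it assumes the Django exporter exposes a request-duration histogram named http_request_duration_seconds; the metric name and severity label are assumptions, not confirmed Grio conventions.

```yaml
groups:
  - name: grio-api
    rules:
      - alert: APIResponseTimeHigh
        # p95 latency across all endpoints over the last 5 minutes
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API p95 latency above 1s"
```

The `for: 5m` clause suppresses pages on brief spikes; only sustained degradation alerts the on-call.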
8. Local Development Setup
Prerequisites
- Node.js 20.x (nvm recommended)
- Python 3.12+ with uv package manager
- PostgreSQL 15+ (local or Docker)
- Docker & Docker Compose
- Git 2.40+
Repository Structure
grio/ # Monorepo root
├── grio-api/ # Django backend
│ ├── manage.py
│ ├── pyproject.toml
│ ├── Dockerfile
│ └── ...
├── grio/ # Next.js frontend
│ ├── package.json
│ ├── next.config.js
│ └── ...
└── docker-compose.yml # Local dev orchestration
Backend Setup
cd grio-api
# Install dependencies
uv sync
# Run migrations
uv run manage.py migrate
# Load seed data
uv run manage.py seed_curriculum
# Start dev server
uv run manage.py runserver 0.0.0.0:8000
Frontend Setup
cd grio
npm install
npm run dev
# Runs on http://localhost:3000
Docker Compose (All-in-One)
docker-compose up -d
# Services start:
# - Django API: http://localhost:8000
# - Next.js App: http://localhost:3000
# - PostgreSQL: localhost:5432
# - Qdrant: http://localhost:6333
# - MinIO: http://localhost:9000 (user: minioadmin, pwd: minioadmin)
Development Sanity Checks
| Check | Command | Expected |
|---|---|---|
| Backend API | curl http://localhost:8000/health | {"status": "ok"} |
| DB connection | uv run manage.py dbshell | PostgreSQL prompt |
| Frontend build | npm run build | .next/ directory created |
| Vector DB | curl http://localhost:6333/health | {"status": "ok"} |
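The two /health checks above can be scripted. A small sketch assuming the endpoints return the JSON shown in the table; the URLs are the local dev ports, and network errors are treated as unhealthy.

```python
import json
from urllib.request import urlopen

def is_healthy(body: bytes) -> bool:
    """Interpret a /health response body per the table above."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def check(url: str) -> bool:
    """Fetch a health endpoint; any network error counts as unhealthy."""
    try:
        with urlopen(url, timeout=5) as resp:
            return is_healthy(resp.read())
    except OSError:
        return False

# e.g. check("http://localhost:8000/health") and check("http://localhost:6333/health")
```

This can run as a pre-commit or pre-push hook so broken local stacks are caught before CI.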
Common Issues
| Issue | Fix |
|---|---|
| Port 5432 already in use | lsof -i :5432, then kill <PID>, or use a different port |
| ModuleNotFoundError: No module named 'X' | Run uv sync to reinstall from the lockfile |
| next: command not found | Run npm install in grio/, or invoke via npx next |
| PostgreSQL migrations fail | Delete .venv/, run uv sync, then retry migrations |
9. Offline & Low-Bandwidth Strategy
Edge Node Architecture
Deploy lightweight K3s + SQLite mirrors in schools/districts.
Components:
- Curriculum cache (SQLite + MinIO local mirror)
- Session sync queue (RocksDB)
- Sync agent (Python daemon)
Sync Logic:
# sync_agent.py - runs on edge node
def sync_curriculum():
    if internet_connected():
        pull_latest_curriculum()       # from central Qdrant
        sync_student_submissions()     # upload to central
    else:
        use_local_cache()
        queue_offline_changes()
Bandwidth Optimization:
- Curriculum updates: delta sync only (changed documents)
- Videos: adaptive bitrate, HD available when bandwidth permits
- Sync frequency: daily if connected, manual on-demand
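Delta sync for curriculum updates can be sketched with content hashes: compare a local hash index against the central one and download only what changed. Illustrative only; the function names and document IDs are made up, and the real sync agent may track versions differently.

```python
import hashlib

def content_hash(doc: bytes) -> str:
    """Stable fingerprint of a curriculum document's bytes."""
    return hashlib.sha256(doc).hexdigest()

def changed_documents(local_index: dict[str, str],
                      remote_index: dict[str, str]) -> list[str]:
    """IDs whose central hash is new or differs from the local copy —
    only these documents are downloaded on the next sync."""
    return [doc_id for doc_id, h in remote_index.items()
            if local_index.get(doc_id) != h]

local = {"p4-math-001": "aaa", "p4-math-002": "bbb"}
remote = {"p4-math-001": "aaa", "p4-math-002": "ccc", "p4-sci-001": "ddd"}
print(changed_documents(local, remote))  # ['p4-math-002', 'p4-sci-001']
```

Shipping only the hash index (a few bytes per document) lets an edge node decide what to fetch before spending any bandwidth on content.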
Offline Capabilities:
- Read curriculum: ✓ Full access
- Submit work: ✓ Queued locally
- AI chat: ✓ Lightweight model only
- Sync: ✗ Requires connectivity
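The "queued locally" behaviour amounts to a durable FIFO that survives restarts. A sketch using SQLite to stay self-contained (the components list above names RocksDB for the real session sync queue; the schema and class name here are illustrative):

```python
import json
import sqlite3
import time

class OfflineQueue:
    """Durable FIFO for student submissions while the edge node is offline."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " created REAL NOT NULL,"
            " payload TEXT NOT NULL)")

    def enqueue(self, change: dict) -> None:
        self.db.execute("INSERT INTO queue (created, payload) VALUES (?, ?)",
                        (time.time(), json.dumps(change)))
        self.db.commit()

    def drain(self):
        """Yield queued changes oldest-first, deleting each after upload."""
        for row_id, payload in self.db.execute(
                "SELECT id, payload FROM queue ORDER BY id").fetchall():
            yield json.loads(payload)
            self.db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        self.db.commit()

q = OfflineQueue()
q.enqueue({"student": "s1", "quiz": "q7", "score": 8})
q.enqueue({"student": "s2", "quiz": "q7", "score": 6})
print([c["student"] for c in q.drain()])  # ['s1', 's2']
```

When connectivity returns, the sync agent drains the queue into the central API; anything not yet drained stays on disk for the next attempt.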
10. Security Hardening
HTTPS Everywhere
All traffic encrypted end-to-end via Let’s Encrypt.
# Verify HSTS header
curl -I https://api.grio.local | grep Strict-Transport
# Output: Strict-Transport-Security: max-age=63072000
Firewall (UFW on Hetzner)
# Allow SSH, API, web traffic only
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP (Let's Encrypt renewal)
ufw allow 443/tcp # HTTPS
ufw allow from 10.0.0.0/8 # K3s internal
ufw default deny incoming
ufw enable
VPN for Admin Access
Production Hetzner nodes are accessible only via WireGuard VPN.
# Generate keys
wg genkey | tee wg-private.key | wg pubkey > wg-public.key
# Add to Hetzner firewall
ufw allow from <vpn-client-ip>
Data Encryption at Rest
PostgreSQL:
-- Enable pgcrypto
CREATE EXTENSION pgcrypto;
-- Encrypt sensitive fields
ALTER TABLE users ADD COLUMN email_encrypted bytea;
MinIO: server-side encryption with customer-managed keys (CMK)
Audit & Access Logs
# K3s audit logging (flags are passed through to the kube-apiserver)
--kube-apiserver-arg=audit-log-path=/var/log/k3s-audit.log
--kube-apiserver-arg=audit-log-maxage=30
Centralize to: Loki or MinIO (analysis via Grafana)
11. Backup & Disaster Recovery
PostgreSQL Backup Schedule
#!/bin/bash
# backup-postgres.sh
set -e
BACKUP_DIR="/backups/postgresql"
DATE=$(date +%Y%m%d-%H%M%S)
# Full backup
pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
gzip > $BACKUP_DIR/full-$DATE.sql.gz
# Upload to MinIO
mc cp $BACKUP_DIR/full-$DATE.sql.gz \
minio/backups/postgres/
# Retention: delete local copies older than 30 days
find $BACKUP_DIR -name "full-*.sql.gz" -mtime +30 -delete
Schedule: Daily 02:00 UTC (cron)
Qdrant Snapshot Strategy
# Trigger snapshot (via API)
curl -X POST http://qdrant:6333/snapshots
# Download to MinIO
mc cp qdrant/snapshots/snapshot-* minio/backups/qdrant/
# Retention: 14 days rolling
MinIO Replication
Cross-server sync:
mc mirror --watch \
minio/curriculum-assets/ \
minio-backup/curriculum-assets/
Retention: 30-day rolling backup
Recovery Procedures
PostgreSQL Recovery:
# From backup
gunzip < $BACKUP_DIR/full-20240324-020000.sql.gz | \
psql -h $NEW_DB_HOST -U $DB_USER $DB_NAME
# Verify
psql -h $NEW_DB_HOST -U $DB_USER $DB_NAME \
  -c "SELECT COUNT(*) FROM users;"  # should match pre-backup count
Qdrant Recovery:
# Download snapshot
mc cp minio/backups/qdrant/snapshot-xyz qdrant/snapshots/
# Restart Qdrant
kubectl rollout restart statefulset/qdrant
RTO/RPO Targets:
- RTO (Recovery Time): <1 hour
- RPO (Recovery Point): <24 hours (daily backups)
Quick Reference
Production Health Check:
kubectl get nodes && \
kubectl get pods -A && \
curl https://api.grio.local/health && \
psql -h db.grio.local -c "SELECT 1"
Deploy New Version:
git push origin main # Triggers GitHub Actions
# Wait 5-10 min for auto-deploy to staging
# Manual promotion: kubectl set image deployment/... <new-image>
Emergency Rollback:
kubectl rollout undo deployment/django-api
Last Updated: 2026-03-24
Maintained by: DevOps Team