Infrastructure & Deployment Guide
Grio AI Education Platform
Internal Technical Document for Dev Team & DevOps
1. Infrastructure Strategy
Hybrid Multi-Phase Architecture
Grio uses a hybrid infrastructure model optimized for African education contexts, balancing cost, sovereignty, and scalability.
Phase 1: Cloud Foundation (Hetzner)
- Entry point for all new deployments
- Hetzner Cloud (Germany/Finland) for backend, DB, AI workloads
- CX/CPX instances: ~10-30 EUR/month, fast provisioning
- Rationale: GDPR-compliant, no vendor lock-in, ~50% of AWS cost, sovereign data
- Geographic selection: EU default, on-demand region expansion
Phase 2: Edge Layer (Regional)
- Uganda-based edge servers (pilot phase)
- Local curriculum cache, session storage, API response layer
- Schools/district hubs with intermittent connectivity
- Sync mechanisms: daily/weekly to central, real-time when available
Phase 3: On-Premises (Ministry/Government)
- Fully country-controlled infrastructure
- Rack servers in national data center + school clusters
- K3s enables seamless transition from cloud/edge to on-prem
- Database replication, offline-first architecture
Why Hetzner?
| Factor | Hetzner | AWS | Azure | GCP |
|---|---|---|---|---|
| Data Sovereignty | ✓ EU (GDPR) | Limited | Limited | Limited |
| Cost (CPU/RAM) | €0.005/hour | $0.01+ | $0.008+ | $0.005+ |
| No Vendor Lock | ✓ Standard APIs | ✗ Proprietary | ✗ Proprietary | ✗ Proprietary |
| Setup Speed | Minutes | Hours | Hours | Hours |
| Dedicated Support | Included | Extra cost | Extra cost | Extra cost |
2. Compute Layer
Server Sizing by Service
Backend Services (Django + API)
- Min: CX21 (2 vCPU, 4GB RAM) — dev/staging
- Prod: CPX31 (8 vCPU, 16GB RAM) — peak load ~500 concurrent users
- Auto-scaling via K3s: horizontal pod autoscaling, 2-8 replicas
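The 2-8 replica range above can be expressed as a HorizontalPodAutoscaler manifest. A sketch only: it assumes the Deployment is named django-api, and the 70% CPU target is an illustrative threshold, not a documented Grio setting.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: django-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: django-api
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumption: tune against real load tests
```

Pods must declare CPU requests for utilization-based scaling to work.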
AI Service (LLM Inference)
- Small models (<7B): CPX41 (16 vCPU, 32GB RAM)
- Larger models (>13B): dedicated server + GPU
- Consider vLLM + GGUF quantization to reduce VRAM needs
- Alternative: managed API endpoint (Hugging Face, Together.ai)
Vector Database (Qdrant)
- Min: CX31 (4 vCPU, 8GB RAM) for <1M embeddings
- Prod: CPX51 (24 vCPU, 48GB RAM) for 5M+ embeddings
- SSD mandatory; HDD not suitable
Database (PostgreSQL)
- Dedicated server (PX92 or Hetzner Managed DB)
- Minimum: 32GB RAM, NVMe SSD, daily backups
- Non-negotiable for production
Edge Nodes (Schools/Districts)
- Raspberry Pi 4B (8GB): cache + local sync agent
- Or mini-PC (Intel N100): full local K3s cluster
- Cost: $100-300 per site, supports 50-200 students
3. Containerization (Mandatory)
Docker Standards
All services must be containerized. No exceptions.
Base Images:
# Backend (Django)
FROM python:3.12-slim
# Frontend (Next.js)
FROM node:20-alpine
# AI (PyTorch)
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
Registry: Docker Hub (free) or self-hosted Harbor on Hetzner
Kubernetes via K3s
K3s is lightweight Kubernetes ideal for resource-constrained environments.
Installation:
# Master node
curl -sfL https://get.k3s.io | sh -
# Agent nodes
curl -sfL https://get.k3s.io | K3S_URL=https://master:6443 \
  K3S_TOKEN=<token> sh -
Why K3s over full Kubernetes:
- ~50% smaller memory footprint
- Single binary deployment
- Built-in Traefik ingress + ServiceLB
- Perfect for on-prem + edge
Environment Layers
| Layer | Tool | Replicas | SLA |
|---|---|---|---|
| Local Dev | Docker Compose | 1 | N/A |
| Staging | K3s single-node | 1 | Best effort |
| Production | K3s 3+ nodes | 3-8 | 99.5% uptime |
4. Database Infrastructure
PostgreSQL (Primary)
Hosted Options:
1. Hetzner Managed Database (simplest, automated backups)
2. Dedicated server + manual setup (full control, lower cost)
Backup Strategy:
# Daily backups to MinIO
0 2 * * * /usr/local/bin/backup-postgres.sh
# Retention: 30 days rolling
# Test restores: weekly from backup
Replication:
- Primary-Replica for HA (streaming replication)
- Replica in secondary region (Phase 2 goal)
Qdrant (Vector DB)
Self-hosted vector database for curriculum RAG.
Configuration:
# qdrant/config.yaml
wal:
  dir_path: ./storage/wal
storage:
  snapshots_path: ./storage/snapshots
Sizing:
- float32 embeddings cost 4 bytes per dimension, plus per-vector index/payload overhead
- Curriculum estimate: 100K documents × 1,024 dimensions ≈ 400MB of raw vectors; budget a few GB of SSD once index overhead and snapshots are included
Persistence:
- Daily snapshots to MinIO
- Full snapshot recovery <10 minutes
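The sizing rule of thumb above can be sanity-checked with a few lines of Python. The 2× overhead factor is an assumption for index and payload; actual Qdrant overhead depends on HNSW parameters and payload size.

```python
def estimate_vector_storage_bytes(num_vectors: int, dims: int,
                                  bytes_per_dim: int = 4,
                                  overhead_factor: float = 2.0) -> int:
    """Raw float32 vector size, times an assumed index/payload overhead."""
    return int(num_vectors * dims * bytes_per_dim * overhead_factor)

# 100K curriculum documents, 1,024-dim float32 embeddings
estimate = estimate_vector_storage_bytes(100_000, 1024)
print(f"~{estimate / 1e9:.2f} GB")  # ~0.82 GB
```

At this scale, disk usage is dominated by snapshots and payload data rather than the raw vectors themselves.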
MinIO (S3-Compatible Storage)
Stores videos, lesson PDFs, student work, media assets.
Setup (K3s):
helm install minio minio/minio \
--set rootUser=admin \
--set rootPassword=<secure-pwd> \
--set persistence.size=500Gi
Buckets:
- curriculum-assets/ — lesson videos, PDFs, images
- student-work/ — submissions, quizzes, progress
- system-logs/ — application logs (Loki)
- backups/ — database snapshots
Replication: MinIO internal mirroring or cross-server bucket sync
Redis (Caching Layer)
Session storage, curriculum cache, API response caching.
K3s Deployment:
helm install redis bitnami/redis \
--set auth.enabled=true \
--set auth.password=<secure> \
--set persistence.size=10Gi
TTL Policies:
- Sessions: 24 hours
- API cache: 1 hour
- Curriculum index: 12 hours
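The TTL policy above maps directly onto Django cache settings. A sketch only: the cache aliases, Redis DB numbers, and session engine choice are assumptions for illustration, not the actual Grio settings.py.

```python
# settings.py fragment (illustrative; aliases and DB numbers are assumptions)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://redis:6379/0",
        "TIMEOUT": 60 * 60,            # API cache: 1 hour
    },
    "curriculum": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://redis:6379/1",
        "TIMEOUT": 12 * 60 * 60,       # curriculum index: 12 hours
    },
}
# Sessions stored in the cache, expiring per the 24-hour policy
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_COOKIE_AGE = 24 * 60 * 60      # sessions: 24 hours
```

Django's built-in RedisCache backend (Django 4.0+) keeps this dependency-free on the app side.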
5. Networking & Routing
Load Balancer: Traefik
Kubernetes-native ingress controller (built into K3s).
Configuration:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grio-api
spec:
  tls:
    - hosts:
        - api.grio.local
      secretName: letsencrypt-prod
  rules:
    - host: api.grio.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: django-api
                port:
                  number: 8000
Alternative: NGINX Ingress (heavier, enterprise-grade)
CDN & Caching
Cloudflare (Recommended):
- DNS-only mode keeps traffic off Cloudflare's proxy (data sovereignty); note that CDN caching and DDoS protection apply only to proxied traffic
- If proxying is acceptable: caching rules for static assets (images, videos, CSS), DDoS protection, firewall rules
- Cost: $200/month Business plan, or $0 with DNS-only / self-managed DNS
Self-Managed: Cache-Control headers + Redis + MinIO direct serving
SSL/TLS
Let’s Encrypt via cert-manager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
Auto-renewal: cert-manager renews certificates roughly 30 days before expiry
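The Ingress in section 5 references a letsencrypt-prod secret; cert-manager needs a matching issuer to populate it. A sketch (the contact email is a placeholder assumption):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@grio.local        # assumption: real contact address required
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik          # K3s ships Traefik as the default ingress
```

With this in place, annotating an Ingress with cert-manager.io/cluster-issuer: letsencrypt-prod triggers automatic certificate issuance.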
DNS Configuration
api.grio.local A <Hetzner-IP>
app.grio.local A <Hetzner-IP>
assets.grio.local A <MinIO-IP>
vector.grio.local A <Qdrant-IP>
6. CI/CD Pipeline
GitHub Actions Workflow
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t <registry>/grio-api:${{ github.sha }} .
      - name: Run tests
        run: docker run <registry>/grio-api:${{ github.sha }} pytest
      - name: Push to registry
        run: docker push <registry>/grio-api:${{ github.sha }}
      - name: Deploy to K3s
        run: |
          kubectl set image deployment/django-api \
            django-api=<registry>/grio-api:${{ github.sha }}
          kubectl rollout status deployment/django-api
Staging vs Production
Staging: Auto-deploy on main branch, manual promotion to production
Rollback: kubectl rollout undo deployment/django-api
7. Monitoring & Observability
Prometheus (Metrics)
Scrapes metrics from all services.
# prometheus.yaml
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'django'
    static_configs:
      - targets: ['django-api:8000']
  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']
Grafana (Dashboards)
Pre-built dashboards for K3s, PostgreSQL, application metrics.
Key Metrics:
- API response time (p50, p95, p99)
- CPU/Memory utilization by pod
- Database query latency
- Cache hit rate (Redis)
- Token usage (AI workloads)
Loki (Centralized Logging)
Lightweight log aggregation.
helm install loki grafana/loki-stack \
--set promtail.enabled=true \
--set grafana.enabled=false
Log Retention: 30 days
Uptime Monitoring
Uptime Kuma (self-hosted):
docker run -d --restart unless-stopped \
-p 3001:3001 \
-v uptime-kuma:/app/data \
louislam/uptime-kuma:latest
Critical Alerts
| Alert | Threshold | Action |
|---|---|---|
| API response time | >1000ms p95 | Page on-call |
| Pod crash loop | 3+ restarts/10min | Notify devops |
| Database connection error | >10% | Critical incident |
| Disk usage | >85% | Scale volume |
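The first row of the table could be encoded as a Prometheus alerting rule. A sketch: it assumes the Django exporter exposes a request-duration histogram named http_request_duration_seconds; the metric name and severity label are assumptions, not confirmed Grio conventions.

```yaml
groups:
  - name: grio-api
    rules:
      - alert: APIResponseTimeHigh
        # p95 latency across all endpoints over the last 5 minutes
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API p95 latency above 1s"
```

The `for: 5m` clause suppresses pages on brief spikes; only sustained degradation alerts the on-call.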
8. Local Development Setup
Prerequisites
- Node.js 20.x (nvm recommended)
- Python 3.12+ with uv package manager
- PostgreSQL 15+ (local or Docker)
- Docker & Docker Compose
- Git 2.40+
Repository Structure
grio/ # Monorepo root
├── grio-api/ # Django backend
│ ├── manage.py
│ ├── pyproject.toml
│ ├── Dockerfile
│ └── ...
├── grio/ # Next.js frontend
│ ├── package.json
│ ├── next.config.js
│ └── ...
└── docker-compose.yml # Local dev orchestration
Backend Setup
cd grio-api
# Install dependencies
uv sync
# Run migrations
uv run manage.py migrate
# Load seed data
uv run manage.py seed_curriculum
# Start dev server
uv run manage.py runserver 0.0.0.0:8000
Frontend Setup
cd grio
npm install
npm run dev
# Runs on http://localhost:3000
Docker Compose (All-in-One)
docker-compose up -d
# Services start:
# - Django API: http://localhost:8000
# - Next.js App: http://localhost:3000
# - PostgreSQL: localhost:5432
# - Qdrant: http://localhost:6333
# - MinIO: http://localhost:9000 (user: minioadmin, pwd: minioadmin)
Development Sanity Checks
| Check | Command | Expected |
|---|---|---|
| Backend API | curl http://localhost:8000/health | {"status": "ok"} |
| DB connection | uv run manage.py dbshell | PostgreSQL prompt |
| Frontend build | npm run build | .next/ directory created |
| Vector DB | curl http://localhost:6333/health | {"status": "ok"} |
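The two /health checks above can be scripted. A small sketch assuming the endpoints return the JSON shown in the table; the URLs are the local dev ports, and network errors are treated as unhealthy.

```python
import json
from urllib.request import urlopen

def is_healthy(body: bytes) -> bool:
    """Interpret a /health response body per the table above."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def check(url: str) -> bool:
    """Fetch a health endpoint; any network error counts as unhealthy."""
    try:
        with urlopen(url, timeout=5) as resp:
            return is_healthy(resp.read())
    except OSError:
        return False

# e.g. check("http://localhost:8000/health") and check("http://localhost:6333/health")
```

This can run as a pre-commit or pre-push hook so broken local stacks are caught before CI.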
Common Issues
| Issue | Fix |
|---|---|
| Port 5432 already in use | lsof -i :5432, then kill <PID>, or use a different port |
| ModuleNotFoundError: No module named 'X' | Run uv sync to reinstall from the lockfile |
| next: command not found | Run npm install in grio/, or invoke via npx next |
| PostgreSQL migrations fail | Delete .venv/, run uv sync, then retry migrations |
9. Offline & Low-Bandwidth Strategy
Edge Node Architecture
Deploy lightweight K3s + SQLite mirrors in schools/districts.
Components:
- Curriculum cache (SQLite + MinIO local mirror)
- Session sync queue (RocksDB)
- Sync agent (Python daemon)
Sync Logic:
# sync_agent.py - runs on edge node
def sync_curriculum():
    if internet_connected():
        pull_latest_curriculum()       # from central Qdrant
        sync_student_submissions()     # upload to central
    else:
        use_local_cache()
        queue_offline_changes()
Bandwidth Optimization:
- Curriculum updates: delta sync only (changed documents)
- Videos: adaptive bitrate, HD available when bandwidth permits
- Sync frequency: daily if connected, manual on-demand
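Delta sync for curriculum updates can be sketched with content hashes: compare a local hash index against the central one and download only what changed. Illustrative only; the function names and document IDs are made up, and the real sync agent may track versions differently.

```python
import hashlib

def content_hash(doc: bytes) -> str:
    """Stable fingerprint of a curriculum document's bytes."""
    return hashlib.sha256(doc).hexdigest()

def changed_documents(local_index: dict[str, str],
                      remote_index: dict[str, str]) -> list[str]:
    """IDs whose central hash is new or differs from the local copy —
    only these documents are downloaded on the next sync."""
    return [doc_id for doc_id, h in remote_index.items()
            if local_index.get(doc_id) != h]

local = {"p4-math-001": "aaa", "p4-math-002": "bbb"}
remote = {"p4-math-001": "aaa", "p4-math-002": "ccc", "p4-sci-001": "ddd"}
print(changed_documents(local, remote))  # ['p4-math-002', 'p4-sci-001']
```

Shipping only the hash index (a few bytes per document) lets an edge node decide what to fetch before spending any bandwidth on content.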
Offline Capabilities:
- Read curriculum: ✓ Full access
- Submit work: ✓ Queued locally
- AI chat: ✓ Lightweight model only
- Sync: ✗ Requires connectivity
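The "queued locally" behaviour amounts to a durable FIFO that survives restarts. A sketch using SQLite to stay self-contained (the components list above names RocksDB for the real session sync queue; the schema and class name here are illustrative):

```python
import json
import sqlite3
import time

class OfflineQueue:
    """Durable FIFO for student submissions while the edge node is offline."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " created REAL NOT NULL,"
            " payload TEXT NOT NULL)")

    def enqueue(self, change: dict) -> None:
        self.db.execute("INSERT INTO queue (created, payload) VALUES (?, ?)",
                        (time.time(), json.dumps(change)))
        self.db.commit()

    def drain(self):
        """Yield queued changes oldest-first, deleting each after upload."""
        for row_id, payload in self.db.execute(
                "SELECT id, payload FROM queue ORDER BY id").fetchall():
            yield json.loads(payload)
            self.db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        self.db.commit()

q = OfflineQueue()
q.enqueue({"student": "s1", "quiz": "q7", "score": 8})
q.enqueue({"student": "s2", "quiz": "q7", "score": 6})
print([c["student"] for c in q.drain()])  # ['s1', 's2']
```

When connectivity returns, the sync agent drains the queue into the central API; anything not yet drained stays on disk for the next attempt.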
10. Security Hardening
HTTPS Everywhere
All traffic encrypted end-to-end via Let’s Encrypt.
# Verify HSTS header
curl -I https://api.grio.local | grep Strict-Transport
# Output: Strict-Transport-Security: max-age=63072000
Firewall (UFW on Hetzner)
# Allow SSH, API, web traffic only
ufw allow 22/tcp # SSH
ufw allow 80/tcp # HTTP (Let's Encrypt renewal)
ufw allow 443/tcp # HTTPS
ufw allow from 10.0.0.0/8 # K3s internal
ufw default deny incoming
ufw enable
VPN for Admin Access
Production Hetzner nodes are accessible only via WireGuard VPN.
# Generate keys
wg genkey | tee wg-private.key | wg pubkey > wg-public.key
# Add to Hetzner firewall
ufw allow from <vpn-client-ip>
Data Encryption at Rest
PostgreSQL:
-- Enable pgcrypto
CREATE EXTENSION pgcrypto;
-- Encrypt sensitive fields
ALTER TABLE users ADD COLUMN email_encrypted bytea;
MinIO: server-side encryption with customer-managed keys (CMK)
Audit & Access Logs
# K3s audit logging (flags are passed through to the kube-apiserver)
--kube-apiserver-arg=audit-log-path=/var/log/k3s-audit.log
--kube-apiserver-arg=audit-log-maxage=30
Centralize to: Loki or MinIO (analysis via Grafana)
11. Backup & Disaster Recovery
PostgreSQL Backup Schedule
#!/bin/bash
# backup-postgres.sh
set -e
BACKUP_DIR="/backups/postgresql"
DATE=$(date +%Y%m%d-%H%M%S)
# Full backup
pg_dump -h $DB_HOST -U $DB_USER $DB_NAME | \
gzip > $BACKUP_DIR/full-$DATE.sql.gz
# Upload to MinIO
mc cp $BACKUP_DIR/full-$DATE.sql.gz \
minio/backups/postgres/
# Retention: delete local copies older than 30 days
find $BACKUP_DIR -name "full-*.sql.gz" -mtime +30 -delete
Schedule: Daily 02:00 UTC (cron)
Qdrant Snapshot Strategy
# Trigger snapshot (via API)
curl -X POST http://qdrant:6333/snapshots
# Download to MinIO
mc cp qdrant/snapshots/snapshot-* minio/backups/qdrant/
# Retention: 14 days rolling
MinIO Replication
Cross-server sync:
mc mirror --watch \
minio/curriculum-assets/ \
minio-backup/curriculum-assets/
Retention: 30-day rolling backup
Recovery Procedures
PostgreSQL Recovery:
# From backup
gunzip < $BACKUP_DIR/full-20240324-020000.sql.gz | \
psql -h $NEW_DB_HOST -U $DB_USER $DB_NAME
# Verify
psql -h $NEW_DB_HOST -U $DB_USER $DB_NAME \
  -c "SELECT COUNT(*) FROM users;"  # should match pre-backup count
Qdrant Recovery:
# Download snapshot
mc cp minio/backups/qdrant/snapshot-xyz qdrant/snapshots/
# Restart Qdrant
kubectl rollout restart statefulset/qdrant
RTO/RPO Targets:
- RTO (Recovery Time): <1 hour
- RPO (Recovery Point): <24 hours (daily backups)
Quick Reference
Production Health Check:
kubectl get nodes && \
kubectl get pods -A && \
curl https://api.grio.local/health && \
psql -h db.grio.local -c "SELECT 1"
Deploy New Version:
git push origin main # Triggers GitHub Actions
# Wait 5-10 min for auto-deploy to staging
# Manual promotion: kubectl set image deployment/... <new-image>
Emergency Rollback:
kubectl rollout undo deployment/django-api
Last Updated: 2026-03-24
Maintained by: DevOps Team