Technical Documentation
Document 03 of 12

AI/RAG Engine Specification

Dev Team | March 2026 | Grio AI Education Platform

Grio AI Engine Specification v1.0

Status: Development Phase 1
Last Updated: March 24, 2026
Audience: Engineering Team, Curriculum Integration
Classification: Internal Technical Document


1. AI Architecture Overview

Grio’s AI system is built on a 3-layer curriculum-first stack, designed specifically for structured education delivery, not general-purpose chatting.

Stack Layers

┌─────────────────────────────────────────────────────┐
│  LAYER 3: Grio Tutoring Engine                      │
│  ├─ Mode Router (Teach/Explore/Practice/Revision)  │
│  ├─ Session State Manager                          │
│  ├─ Prompt Constructor                             │
│  └─ Response Validator                             │
├─────────────────────────────────────────────────────┤
│  LAYER 2: RAG Pipeline                             │
│  ├─ Query Parser                                   │
│  ├─ Vector Retrieval (Qdrant)                      │
│  ├─ Curriculum Context Injector                    │
│  └─ Chunk Reranker                                 │
├─────────────────────────────────────────────────────┤
│  LAYER 1: LLM Backend                              │
│  ├─ OpenAI GPT-4o (Phase 1 API)                    │
│  ├─ Self-hosted LLaMA (Phase 2+ fallback)          │
│  └─ Token Counter & Cost Manager                   │
└─────────────────────────────────────────────────────┘

Design Philosophy


2. LLM Strategy

Phase 1: OpenAI API (Current Target)

Model: GPT-4o

Rationale:
- Fastest to deployment
- Reliable performance on multi-step reasoning (essential for tutoring)
- Cost manageable at ~$0.03 per 1K input tokens, $0.06 per 1K output tokens
- Built-in safety guardrails

Cost Estimates:
- Assume 500 daily active students, 30 messages per student per day
- ~15K messages/day; average 800 tokens per call = ~12M tokens/day
- Rough daily cost: ~$360 (manageable for ed-tech scale)
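The arithmetic above can be reproduced in a few lines, using the document's own assumptions (projected figures, not measured values):

```python
# Back-of-envelope cost model for Phase 1, using the spec's assumptions.
STUDENTS = 500          # daily active students
MSGS_PER_STUDENT = 30   # messages per student per day
TOKENS_PER_CALL = 800   # average tokens per LLM call
RATE_PER_1K = 0.03      # blended $/1K tokens (input-rate approximation)

messages_per_day = STUDENTS * MSGS_PER_STUDENT       # 15,000 messages
tokens_per_day = messages_per_day * TOKENS_PER_CALL  # 12,000,000 tokens
daily_cost = tokens_per_day / 1000 * RATE_PER_1K     # ~$360

print(f"{messages_per_day=} {tokens_per_day=} daily_cost=${daily_cost:.2f}")
```

Re-running this model with updated rates or usage numbers keeps the budget section honest as pricing changes.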

Phase 2+: Self-Hosted Open-Source (Migration Path)

Candidates:
- LLaMA 2 (70B): Strong reasoning, good for tutoring semantics
- Mistral (7B/13B): Lightweight, fast inference on consumer GPUs
- Mixtral (8x7B): Mixture-of-experts architecture, handles diverse topics well

Hybrid Recommendation:

LLM Layer = OpenAI GPT-4o API
RAG Layer = Self-hosted on GPU servers (A100/H100)
Logic Layer = Custom Python inference engine

This keeps latency-critical LLM calls fast while offloading vector ops to dedicated hardware.

Model Selection Criteria

  1. Must handle multi-turn dialogue with coherent reasoning
  2. Must follow system prompts strictly (curriculum-locking critical)
  3. Token efficiency for cost/speed trade-off
  4. Safety: Low false-positive on topic restriction
  5. Education-specific training (if applicable)

3. RAG Pipeline (Retrieval-Augmented Generation)

RAG is the backbone. Every LLM call receives curriculum context to ensure responses stay grounded.

Data Source: LessonContent

Content originates from the Django LessonContent model:
- Textbook excerpts (chunked by section/subsection)
- Lecture slides (transcribed + visual descriptions)
- Worked examples
- Practice problems with solutions
- Exam question banks (UNEB past papers)
- Teacher notes

Embedding Pipeline

Command: uv run python manage.py embed_lesson_content

Process:
  1. Query all LessonContent objects with status='published'
  2. Chunk content by natural boundaries (sections, subsections, examples)
  3. Chunk size: 512 tokens max (balance: specificity vs. retrieval efficiency)
  4. Overlap: 64 tokens (preserve cross-section context)
  5. Embed each chunk using OpenAI text-embedding-3-small (~$0.02 per 1M tokens)
  6. Store vectors + metadata (lesson_id, subject, topic, chunk_index) in Qdrant
  7. Log completion: chunks embedded, vectors stored, version ID

Frequency: Weekly automated job or on-demand after curriculum updates.

Vector Database: Qdrant

Why Qdrant:
- Lightweight, self-hostable (Docker)
- Efficient HNSW indexing
- Built-in payload filtering (critical for topic-locking)
- Supports hybrid search (semantic + keyword)

Setup:

# docker-compose.yml snippet
qdrant:
  image: qdrant/qdrant:latest
  ports:
    - "6333:6333"  # HTTP API
  volumes:
    - qdrant_storage:/qdrant/storage
  environment:
    QDRANT_API_KEY: ${QDRANT_API_KEY}

Alternative: Weaviate (heavier, more ML-ops overhead; skip for MVP).

Retrieval Flow

  1. Query Reception: Student message + lesson context (class, subject, topic, session_id)
  2. Scope Filtering: Add metadata filter: subject == "Mathematics" AND topic == "Number Bases"
  3. Semantic Search: Query Qdrant with student’s question (embedded), retrieve top-5 chunks
  4. Ranking: Re-rank by relevance score + chunk freshness
  5. Context Assembly: Concatenate top-3 results into “Context:” block
  6. Injection: Pass to LLM as structured context in user message

Pseudo-code:

def retrieve_context(query: str, subject: str, topic: str, top_k: int = 5):
    # Embed the question with the same model used at indexing time
    query_embedding = embed(query)
    results = qdrant.search(
        collection_name="lesson_content",  # assumed collection name
        query_vector=query_embedding,
        limit=top_k,
        query_filter={
            "must": [
                {"key": "subject", "match": {"value": subject}},
                {"key": "topic", "match": {"value": topic}}
            ]
        }
    )
    # Retrieve top-5, concatenate top-3 (see Retrieval Flow, steps 3-5)
    context = "\n".join([r.payload["text"] for r in results[:3]])
    return context
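Step 4 of the retrieval flow (re-ranking by relevance score plus chunk freshness) is not covered by the pseudo-code above. A minimal sketch follows; the freshness weighting is an assumption, since the spec does not fix a formula:

```python
from datetime import datetime, timezone

def rerank(hits, freshness_weight=0.1):
    """Order hits by semantic score, nudged by how recently the chunk was indexed.

    Each hit is a dict: {"score": float, "indexed_at": datetime, "text": str}.
    Freshness decays linearly over ~365 days; the weighting is illustrative only.
    """
    now = datetime.now(timezone.utc)

    def key(hit):
        age_days = (now - hit["indexed_at"]).days
        freshness = max(0.0, 1.0 - age_days / 365.0)
        return hit["score"] + freshness_weight * freshness

    return sorted(hits, key=key, reverse=True)
```

With a small weight like 0.1, freshness only breaks near-ties; semantic relevance still dominates the ordering.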

Chunking Strategy

Rationale: Prevents context bloat while preserving semantic coherence.
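The windowing described in the embedding pipeline (512-token chunks, 64-token overlap) can be sketched as follows; whitespace splitting stands in for the embedding model's real tokenizer:

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Slide a window of max_len tokens, carrying `overlap` tokens of context forward."""
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap  # step forward, keeping `overlap` tokens of context
    return chunks

# Crude demo: whitespace "tokens" stand in for real tokenizer output
words = ("lorem " * 1000).split()
chunks = chunk_tokens(words)
# Adjacent chunks share 64 tokens, so no section boundary is lost
assert chunks[0][-64:] == chunks[1][:64]
```

A production version would count tokens with the same tokenizer as text-embedding-3-small so chunk sizes match what the model actually sees.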

Re-indexing on Content Updates

When curriculum content changes:
  1. Django signal triggers on LessonContent.save()
  2. Chunk updated content
  3. Delete old vectors from Qdrant (by lesson_id)
  4. Embed + insert new vectors
  5. Run as an async queue job (Celery) to avoid blocking the HTTP response
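The delete-then-reinsert step can be illustrated with an in-memory stand-in for the vector store (the real job would run under Celery against Qdrant; the store layout here is illustrative only):

```python
# In-memory stand-in for the vector store: point_id -> payload
store = {}

def reindex_lesson(lesson_id, new_chunks):
    """Replace all vectors for a lesson: delete by lesson_id, then insert fresh chunks."""
    # 1. Delete old vectors (mirrors a Qdrant delete-by-payload-filter)
    stale = [pid for pid, payload in store.items() if payload["lesson_id"] == lesson_id]
    for pid in stale:
        del store[pid]
    # 2. Insert new chunks under fresh ids (embedding step elided)
    for i, text in enumerate(new_chunks):
        store[f"{lesson_id}:{i}"] = {"lesson_id": lesson_id, "chunk_index": i, "text": text}

reindex_lesson("math-s1-bases", ["old chunk A", "old chunk B"])
reindex_lesson("math-s1-bases", ["new chunk A"])  # an update replaces, never duplicates
```

Deleting before inserting guarantees the index never serves a mix of old and new curriculum for the same lesson.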


4. System Prompt Construction

Every LLM call is prefaced with a dynamic system prompt assembled by the Prompt Manager. No hardcoded system prompts—all derived from session + lesson metadata.

Components Assembled Per Call

system_prompt = f"""
You are a classroom tutor for {class_level} {subject}
(Topic: {current_topic}).

Teaching Style:
- Explain step-by-step for ages {min_age}-{max_age}
- Use simple language; break complex ideas into parts
- Ask questions to check understanding
- Encourage effort; frame mistakes as learning

Curriculum Boundaries:
- Only teach content from: {current_topic}
- Do not introduce unrelated topics
- If asked off-topic: "Great question! Let's focus on {current_topic} first."

{mode_specific_rules}

Examples & Context:
- Use examples relevant to {region} (Uganda/Zambia context)
- Align explanations with {exam_board} standards
- Reference curriculum materials when possible

Response Format:
- Keep explanations concise (2-3 sentences per chunk)
- Use bullet points for lists
- Always ask "Do you understand?" or "Next?" for pacing
"""

Example (Senior 1 Mathematics, Number Bases, Teach Mode)

You are a classroom tutor for Senior 1 Mathematics (Topic: Number Bases).

Teaching Style:
- Explain step-by-step for ages 12-14
- Use concrete examples; build from decimal to binary/hex
- After each concept, pause and ask "Do you understand?"
- Celebrate effort

Curriculum Boundaries:
- Only cover: converting between bases, place value, binary/hex operations
- Do not introduce: number theory, modular arithmetic (save for S.4)
- If asked about unrelated topics: "Good question! For now, let's focus on Number Bases."

Mode: TEACH
- Follow structure: Introduction → Explanation → Example → Practice → Quiz → Recap
- Pull full lesson slides from curriculum
- After each section, wait for "Next" before proceeding
- Do not skip any step

Examples & Context:
- Use phone numbers (256...), money (UGX), memory sizes (MB, GB)
- Align with UNEB Senior 1 Mathematics syllabus
- Reference textbook: "Secondary Mathematics Book 1, Chapter 3"

Response Format:
- Keep each explanation to 2-3 sentences
- Use bullet points
- Always ask "Do you understand?" before next step

5. Conversation Flow

Message Structure Per LLM Call

Each call to the LLM includes:

messages = [
    {
        "role": "system",
        "content": system_prompt  # Generated per spec in Section 4
    },
    {
        "role": "assistant",
        "content": previous_explanation  # If continuing prior turn
    },
    {
        "role": "user",
        "content": f"Context from curriculum:\n{rag_context}\n\n---\nStudent question: {student_query}"
    }
]

Session State Management

Maintain in Redis:

session = {
    "session_id": "uuid",
    "student_id": "uuid",
    "lesson_id": "uuid",
    "subject": "Mathematics",
    "topic": "Number Bases",
    "mode": "Teach",
    "current_step": 2,  # Track pacing (Intro=1, Explanation=2, ...)
    "conversation_history": [...],  # Last 10 exchanges
    "last_rag_context": {...},  # Cache to avoid re-retrieving same query
    "tokens_used": 2400,  # For cost tracking
    "started_at": "2026-03-24T10:30:00Z"
}

Context Window Management
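The spec does not yet fix a trimming policy; one minimal approach is to drop the oldest exchanges until the history fits a token budget. The 4-characters-per-token estimate below is a rough heuristic, not a real tokenizer:

```python
def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history(history, budget=3000):
    """Keep the most recent messages whose combined size fits the token budget."""
    kept, used = [], 0
    for msg in reversed(history):        # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order
```

This pairs with the "last 10 exchanges" cap in the session object and the "summarize conversation history" fallback in Section 9.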


6. Mode-Specific AI Behavior

The AI behaves differently depending on the selected learning mode. System prompt changes per mode.

Teach Mode

Goal: Deliver structured lesson content step-by-step

Behavior:
  1. Follow rigid structure: Intro → Explanation → Example → Practice → Quiz → Recap
  2. Pull full lesson content from RAG (not snippets)
  3. Deliver step-by-step; student must click “Next” to advance
  4. Cannot skip steps (enforced by prompt + logic layer)
  5. Use “Today we’re learning about {topic}…” opening

System Prompt Addition:

Mode: TEACH
Structure: You must follow these steps in order:
1. INTRO (1-2 sentences): "Today we're learning about X"
2. EXPLANATION (3-5 sentences): Key concept, broken into parts
3. EXAMPLE (worked example): Step-by-step solution
4. PRACTICE: Give 1 easy problem, wait for student answer
5. QUIZ: Give 1 harder problem, check answer
6. RECAP (bullet points): 3-5 key takeaways

You cannot skip steps or re-order. Wait for "Next" button before advancing.

Explore Mode

Goal: Free-form Q&A, but curriculum-anchored

Behavior:
  1. Answer questions broadly within the subject area
  2. Default to curriculum content when available
  3. Can venture slightly beyond the topic if relevant to the subject
  4. Always attempt a redirect: “This ties into {topic}…”
  5. Discourage off-topic questions

System Prompt Addition:

Mode: EXPLORE
- Answer questions within {subject}
- Prefer curriculum content, but can expand if relevant
- If student asks off-topic: "That's interesting! It's related to [subject area].
  For now, let's focus on what's in our curriculum."
- Keep answers conversational but accurate

Practice Mode

Goal: Generate and validate practice problems

Behavior:
  1. Generate 3-5 problems based on the current topic
  2. Check the student’s answer (exact match or accept equivalent forms)
  3. Provide immediate feedback: correct/incorrect + explanation
  4. Adaptive difficulty (future): harder after 2+ correct, easier after 0 correct
  5. Track accuracy for the learning dashboard

System Prompt Addition:

Mode: PRACTICE
- Generate problems at level: {difficulty_level} (Easy/Medium/Hard)
- For each problem, accept equivalent answers (e.g., "2 + 3" = "5" = "5.0")
- Feedback format: "[Correct/Incorrect] Because: [explanation]"
- After 5 problems: "You got X/5. Ready for more or recap?"
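The "accept equivalent answers" rule can be approximated numerically; a sketch follows (full symbolic equivalence, such as evaluating "2 + 3", would need an expression evaluator and is out of scope here):

```python
def answers_match(student, expected, tol=1e-9):
    """Treat numerically equal answers as the same ('5' == '5.0' == '5.00')."""
    try:
        return abs(float(student) - float(expected)) <= tol
    except ValueError:
        # Non-numeric answers: compare case- and whitespace-insensitively
        return student.strip().lower() == expected.strip().lower()
```

Validating answers in code rather than asking the LLM to grade keeps feedback deterministic and cheap.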

Revision Mode

Goal: Rapid concept review + memory testing

Behavior:
  1. Generate concept summaries (bullet-point format)
  2. Rapid-fire recall questions
  3. Emphasis on key terms, definitions, formulas
  4. Short, punchy delivery
  5. “Flashcard-style” interaction

System Prompt Addition:

Mode: REVISION
- Generate concise bullet-point summaries (max 5 points)
- Followed by 5 recall questions (definition, formula, example)
- Answer format: "Q: [question]\nA: [answer]"
- After 5 Q&As: Ask if student wants more revision or move on

Exam Prep Mode (Planned)

Goal: UNEB past paper drilling with timed constraints

Behavior:
- Pull past exam questions from LessonContent
- Impose time limits (5-15 min per question type)
- Score & explain answers
- Track performance against exam standards
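The per-mode rules above fill the {mode_specific_rules} slot in the Section 4 template; a minimal router might look like this (rule strings abbreviated for illustration):

```python
# Abbreviated mode rules; the full texts live with the prompt templates (Section 6)
MODE_RULES = {
    "Teach":    "Mode: TEACH\nFollow Intro → Explanation → Example → Practice → Quiz → Recap.",
    "Explore":  "Mode: EXPLORE\nAnswer within the subject; prefer curriculum content.",
    "Practice": "Mode: PRACTICE\nGenerate problems, accept equivalent answers, give feedback.",
    "Revision": "Mode: REVISION\nBullet summaries, then 5 recall questions.",
}

def mode_specific_rules(mode):
    """Return the prompt addition for a learning mode; fail loudly on unknown modes."""
    try:
        return MODE_RULES[mode]
    except KeyError:
        raise ValueError(f"Unknown learning mode: {mode}")
```

Failing loudly on unknown modes matters here: a silent empty string would produce an unconstrained tutor, violating the enforcement rules in Section 7.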


7. AI Behavior Enforcement (Critical Rules)

These rules are non-negotiable and enforced at multiple layers (prompt + code):

1. Curriculum-First Responses

Every answer must trace back to curriculum. RAG context is mandatory.

Enforcement:
- Code layer: All LLM calls include assert rag_context is not None
- Prompt layer: “Only use content from {topic}”
- Test layer: Regex check that response includes curriculum reference

2. Topic-Locking

AI cannot leave the selected topic under any circumstance.

Enforcement:
- Metadata filter in Qdrant: only retrieve chunks matching topic == current_topic
- Prompt: “If asked outside {topic}, politely redirect: ‘Let’s focus on {topic} first.’”
- Response validator: Check that output does not mention unrelated topics
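The response-validator check could be sketched as a per-topic blocklist (the blocklist contents are hypothetical examples; the spec does not define how "unrelated topics" are detected):

```python
import re

# Hypothetical per-topic blocklist of off-syllabus terms
OFF_TOPIC_TERMS = {
    "Number Bases": ["modular arithmetic", "number theory"],
}

def is_on_topic(response, topic):
    """Reject responses that mention terms explicitly excluded for this topic."""
    for term in OFF_TOPIC_TERMS.get(topic, []):
        if re.search(re.escape(term), response, flags=re.IGNORECASE):
            return False
    return True
```

A blocklist is a coarse filter; a production validator might add an embedding-similarity check against the current topic's chunks.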

3. Age-Appropriate Explanations

Vocabulary and complexity must match student age range (metadata: min_age, max_age).

Enforcement:
- Prompt includes: “Use vocabulary appropriate for ages {min_age}-{max_age}”
- Readability checker: Flesch-Kincaid grade level must match the target age
- Avoid: complex jargon, abstract theory, unrelated tangents
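The readability gate can be sketched with the Flesch-Kincaid grade-level formula; the syllable counter below is a crude vowel-group heuristic, not a dictionary lookup:

```python
import re

def count_syllables(word):
    # Heuristic: count vowel groups; every word has at least one syllable
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

The validator would compare fk_grade(response) against the grade range implied by min_age/max_age and re-prompt when the score is too high.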

4. Localized Examples

Always use Uganda/Zambia context (currency, places, cultural references).

Enforcement:
- Prompt: “Use examples relevant to {region} (Uganda/Zambia context)”
- RAG context includes regional examples from lesson content
- Avoid: USD pricing, Western cultural references unless unavoidable

5. UNEB Exam Standard Alignment

Teaching must align with UNEB syllabus for rigorous assessment.

Enforcement:
- LessonContent metadata includes exam_board = "UNEB"
- Prompt references: “Align explanations with UNEB standards”
- RAG retrieves only UNEB-approved content

6. Off-Topic Question Handling

When student asks unrelated question, AI must politely redirect—not ignore or refuse.

Example:
- Student: “How do I make a video game?”
- AI: “That’s a cool interest! For now, let’s focus on Number Bases. After we finish, you can explore coding. Now, where were we? Do you understand place value?”

7. AI Must Never Run Without Curriculum Context

Critical: If RAG fails (Qdrant down, no matching chunks), AI does not answer.

Fallback Behavior:

if not rag_context or len(rag_context) < 100:
    return {
        "status": "error",
        "message": "I couldn't find curriculum content for that question. Please try again or ask your teacher.",
        "error_code": "RAG_UNAVAILABLE"
    }

8. Avatar & Voice Integration (Future/Planned)

Animated Avatar

Phase 2+: Lip-synced avatar to humanize tutoring experience.

Options:
- HeyGen Streaming API: Pre-recorded videos, real-time mouth-sync
- D-ID: Live avatar generation (more flexible, higher latency)

Placement Rules:
- Left or center of screen (not obstructive)
- Proportional to the classroom UI (not oversized)
- Optional: can be toggled off by the student

Voice Synthesis

Options:
- ElevenLabs: High-quality, fast, ~$0.30 per 1K characters
- OpenAI TTS: Integrated, $0.015 per 1K characters (cost-effective)

Implementation:
- Stream audio chunks as the response is generated (don’t wait for the full response)
- Sync avatar mouth movements to audio playback


9. AI Backend Services

Architecture

Django App (Grio)
├── `/api/ai/message` (POST) → Message Handler
│   ├─ Validate input + session
│   ├─ Retrieve curriculum context (RAG)
│   ├─ Build system prompt
│   ├─ Call LLM (OpenAI)
│   └─ Validate + store response
├── `/api/ai/health` (GET) → Health Check
└── `/api/ai/embed-content` (POST) → Embedding Job

LLM Prompt/Retrieval Management

Use LangChain or custom orchestration:

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationTokenBufferMemory

class GrioTutoringChain:
    def __init__(self, lesson_id, topic, mode):
        self.lesson_id = lesson_id
        self.topic = topic
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
        self.system_prompt = self.build_system_prompt(lesson_id, topic, mode)
        # Token-bounded memory keeps the running history within budget
        self.memory = ConversationTokenBufferMemory(llm=self.llm, max_token_limit=3000)

    def answer(self, query):
        context = retrieve_context(query, self.lesson_id, self.topic)
        messages = [
            ("system", self.system_prompt),
            ("human", f"Context from curriculum:\n{context}\n\n---\nStudent question: {query}"),
        ]
        return self.llm.invoke(messages)

Caching Strategy

Cache Backend: Redis, TTL-based expiration

Health Check Endpoint

GET /api/ai/health/

Response:
{
    "status": "healthy",
    "llm_status": "online",
    "qdrant_status": "online",
    "uptime_seconds": 86400,
    "last_message_processed": "2026-03-24T15:45:00Z"
}

Error Handling & Fallback

Failure                  Fallback
RAG unavailable          Return error + suggestion to contact teacher
LLM timeout (>10s)       Return cached prior response if available; else error
LLM returns off-topic    Validate + re-prompt with stricter constraints
Token limit exceeded     Summarize conversation history; continue

10. Performance & Optimization

Response Latency Targets

Caching Strategy

  1. System prompts: Keyed by (lesson_id, topic, mode); 1-hour TTL
  2. RAG results: Keyed by hash(query + filters); 30-min TTL
  3. Embeddings: Batch cache for daily re-indexing jobs
  4. LLM responses: Exact-match query cache per session; 5-min TTL
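Cache key #2 above (hash of query + filters) could be constructed like this; the key scheme is illustrative:

```python
import hashlib
import json

def rag_cache_key(query, subject, topic):
    """Stable Redis key for RAG results: hash of the normalized query plus its scope filters."""
    payload = json.dumps(
        {"q": query.strip().lower(), "subject": subject, "topic": topic},
        sort_keys=True,  # deterministic serialization -> deterministic key
    )
    return "rag:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Normalizing the query (strip + lowercase) before hashing lets trivially different phrasings of the same question hit the same 30-minute cache entry.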

Batch Embedding for Curriculum Updates

# Run weekly or after content changes
uv run python manage.py embed_lesson_content \
    --batch-size 32 \
    --workers 4 \
    --force-refresh

Performance:
- Embed 10K chunks in ~15 minutes on a standard GPU
- Async job queue (Celery) to avoid blocking the API

Cost Management (API-Based LLM)

Monthly tracking:
- Log tokens per student per lesson
- Alert if monthly spend > $10K (scale trigger)
- Optimize prompt length periodically
- Consider self-hosted fallback if costs exceed budget

Cost optimization levers:
  1. Shorter system prompts (already minimal)
  2. Smaller context windows (currently 3 chunks; could reduce to 1-2)
  3. Use GPT-3.5-turbo for non-critical modes (Explore, Practice)
  4. Batch embeddings (don’t re-embed unchanged content)


Appendix: Deployment Checklist


Document Version: 1.0
Next Review: June 2026
Contact: Engineering Lead (AI/ML)