Grio AI Engine Specification v1.0
Status: Development Phase 1
Last Updated: March 24, 2026
Audience: Engineering Team, Curriculum Integration
Classification: Internal Technical Document
1. AI Architecture Overview
Grio’s AI system is built on a 3-layer curriculum-first stack, designed specifically for structured education delivery, not general-purpose chatting.
Stack Layers
┌─────────────────────────────────────────────────────┐
│ LAYER 3: Grio Tutoring Engine │
│ ├─ Mode Router (Teach/Explore/Practice/Revision) │
│ ├─ Session State Manager │
│ ├─ Prompt Constructor │
│ └─ Response Validator │
├─────────────────────────────────────────────────────┤
│ LAYER 2: RAG Pipeline │
│ ├─ Query Parser │
│ ├─ Vector Retrieval (Qdrant) │
│ ├─ Curriculum Context Injector │
│ └─ Chunk Reranker │
├─────────────────────────────────────────────────────┤
│ LAYER 1: LLM Backend │
│ ├─ OpenAI GPT-4o (Phase 1 API) │
│ ├─ Self-hosted LLaMA (Phase 2+ fallback) │
│ └─ Token Counter & Cost Manager │
└─────────────────────────────────────────────────────┘

Design Philosophy
- Curriculum-First: AI never operates without curriculum anchoring. Every response derives from lesson content.
- Not a Chatbot: Grio is a structured tutoring system with progression rules, not a general Q&A bot.
- Teacher-Aligned: System prompts embed pedagogical rules (pacing, age-appropriateness, sequencing).
- Measurable Outcomes: Every AI interaction correlates to learning objectives and can be audited.
2. LLM Strategy
Phase 1: OpenAI API (Current Target)
Model: GPT-4o
Rationale:
- Fastest path to deployment
- Reliable performance on multi-step reasoning (essential for tutoring)
- Manageable cost at ~$0.03 per 1K input tokens and ~$0.06 per 1K output tokens
- Built-in safety guardrails
Cost Estimates:
- Assume 500 daily active students and 30 messages per student per day
- ~15K messages/day at an average of 800 tokens per call = ~12M tokens/day
- Rough daily cost: ~$360 (manageable at ed-tech scale)
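The arithmetic above can be sanity-checked with a small helper (a sketch; the function name and defaults are illustrative, and the blended $0.03/1K rate treats every token as an input token):

```python
def estimated_daily_cost(daily_students=500, msgs_per_student=30,
                         tokens_per_call=800, usd_per_1k_tokens=0.03):
    """Back-of-envelope daily LLM spend from the assumptions above."""
    daily_tokens = daily_students * msgs_per_student * tokens_per_call  # ~12M
    return daily_tokens / 1000 * usd_per_1k_tokens
```

Re-running this with projected growth numbers gives quick scale triggers for the cost alerts in Section 10.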
Phase 2+: Self-Hosted Open-Source (Migration Path)
Candidates:
- LLaMA 2 (70B): strong reasoning, good for tutoring semantics
- Mistral (7B): lightweight, fast inference on consumer GPUs
- Mixtral (8x7B): mixture of experts, handles diverse topics well
Hybrid Recommendation:
LLM Layer = OpenAI GPT-4o API
RAG Layer = Self-hosted on GPU servers (A100/H100)
Logic Layer = Custom Python inference engine

This keeps latency-critical LLM calls fast while offloading vector ops to dedicated hardware.
Model Selection Criteria
- Must handle multi-turn dialogue with coherent reasoning
- Must follow system prompts strictly (curriculum-locking critical)
- Token efficiency for cost/speed trade-off
- Safety: Low false-positive on topic restriction
- Education-specific training (if applicable)
3. RAG Pipeline (Retrieval-Augmented Generation)
RAG is the backbone. Every LLM call receives curriculum context to ensure responses stay grounded.
Data Source: LessonContent
Content originates from the Django LessonContent model:
- Textbook excerpts (chunked by section/subsection)
- Lecture slides (transcribed + visual descriptions)
- Worked examples
- Practice problems with solutions
- Exam question banks (UNEB past papers)
- Teacher notes
Embedding Pipeline
Command: uv run python manage.py embed_lesson_content
Process:
1. Query all LessonContent objects with status='published'
2. Chunk content by natural boundaries (sections, subsections, examples)
3. Chunk size: 512 tokens max (balance: specificity vs. retrieval efficiency)
4. Overlap: 64 tokens (preserves cross-section context)
5. Embed each chunk using OpenAI text-embedding-3-small (~$0.02 per 1M tokens)
6. Store vectors + metadata (lesson_id, subject, topic, chunk_index) in Qdrant
7. Log completion: chunks embedded, vectors stored, version ID
Frequency: Weekly automated job or on-demand after curriculum updates.
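Steps 3-4 of the pipeline (512-token chunks with a 64-token overlap) can be sketched as a simple sliding window; `chunk_tokens` is a hypothetical helper that operates on an already-tokenized list:

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split a token list into overlapping chunks (pipeline steps 3-4)."""
    chunks = []
    step = max_len - overlap  # advance 448 tokens per chunk
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # final chunk reached the end of the content
    return chunks
```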
Vector Database: Qdrant
Why Qdrant:
- Lightweight and self-hostable (Docker)
- Efficient HNSW indexing
- Built-in payload filtering (critical for topic-locking)
- Supports hybrid search (semantic + keyword)
Setup:
# docker-compose.yml snippet
qdrant:
  image: qdrant/qdrant:latest
  ports:
    - "6333:6333"  # HTTP API
  volumes:
    - qdrant_storage:/qdrant/storage
  environment:
    QDRANT_API_KEY: ${QDRANT_API_KEY}

Alternative: Weaviate (heavier, more ML-ops overhead; skip for MVP).
Retrieval Flow
1. Query Reception: student message + lesson context (class, subject, topic, session_id)
2. Scope Filtering: add a metadata filter:
   subject == "Mathematics" AND topic == "Number Bases"
3. Semantic Search: embed the student's question and query Qdrant for the top-5 chunks
4. Ranking: re-rank by relevance score + chunk freshness
5. Context Assembly: concatenate the top-3 results into a "Context:" block
6. Injection: pass to the LLM as structured context in the user message
Pseudo-code:
def retrieve_context(query: str, subject: str, topic: str, top_k: int = 5):
    query_embedding = embed(query)
    results = qdrant.search(
        collection_name="lesson_content",  # assumed collection name
        query_vector=query_embedding,
        limit=top_k,
        query_filter={
            "must": [
                {"key": "subject", "match": {"value": subject}},
                {"key": "topic", "match": {"value": topic}}
            ]
        }
    )
    context = "\n".join(r.payload["text"] for r in results[:3])
    return context

Chunking Strategy
- Lesson chunks: 250-400 words (roughly 1-2 paragraphs of textbook content)
- Example chunks: Include full worked example + solution
- Problem chunks: Question + answer key (for practice validation)
- Semantic boundaries: Break at topic transitions, not mid-sentence
Rationale: Prevents context bloat while preserving semantic coherence.
Re-indexing on Content Updates
When curriculum content changes:
1. A Django signal fires on LessonContent.save()
2. Chunk the updated content
3. Delete old vectors from Qdrant (by lesson_id)
4. Embed + insert the new vectors
5. Run as an async queue job (Celery) to avoid blocking the HTTP response
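To avoid re-embedding unchanged lessons (also a cost lever in Section 10), the signal handler can gate on a content hash; `needs_reindex` and the in-memory index are illustrative, with the real version persisted alongside the vector metadata:

```python
import hashlib

def needs_reindex(lesson_id, content, version_index):
    """Return True only when the lesson's content hash changed since the last embed."""
    digest = hashlib.sha256(content.encode()).hexdigest()
    if version_index.get(lesson_id) == digest:
        return False  # unchanged: skip delete/embed/insert
    version_index[lesson_id] = digest
    return True
```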
4. System Prompt Construction
Every LLM call is prefaced with a dynamic system prompt assembled by the Prompt Manager. No hardcoded system prompts—all derived from session + lesson metadata.
Components Assembled Per Call
system_prompt = f"""
You are a classroom tutor for {class_level} {subject}
(Topic: {current_topic}).
Teaching Style:
- Explain step-by-step for ages {min_age}-{max_age}
- Use simple language; break complex ideas into parts
- Ask questions to check understanding
- Encourage effort; frame mistakes as learning
Curriculum Boundaries:
- Only teach content from: {current_topic}
- Do not introduce unrelated topics
- If asked off-topic: "Great question! Let's focus on {current_topic} first."
{mode_specific_rules}
Examples & Context:
- Use examples relevant to {region} (Uganda/Zambia context)
- Align explanations with {exam_board} standards
- Reference curriculum materials when possible
Response Format:
- Keep explanations concise (2-3 sentences per chunk)
- Use bullet points for lists
- Always ask "Do you understand?" or "Next?" for pacing
"""Example (Senior 1 Mathematics, Number Bases, Teach Mode)
You are a classroom tutor for Senior 1 Mathematics (Topic: Number Bases).
Teaching Style:
- Explain step-by-step for ages 12-14
- Use concrete examples; build from decimal to binary/hex
- After each concept, pause and ask "Do you understand?"
- Celebrate effort
Curriculum Boundaries:
- Only cover: converting between bases, place value, binary/hex operations
- Do not introduce: number theory, modular arithmetic (save for S.4)
- If asked about unrelated topics: "Good question! For now, let's focus on Number Bases."
Mode: TEACH
- Follow structure: Introduction → Explanation → Example → Practice → Quiz → Recap
- Pull full lesson slides from curriculum
- After each section, wait for "Next" before proceeding
- Do not skip any step
Examples & Context:
- Use phone numbers (256...), money (UGX), memory sizes (MB, GB)
- Align with UNEB Senior 1 Mathematics syllabus
- Reference textbook: "Secondary Mathematics Book 1, Chapter 3"
Response Format:
- Keep each explanation to 2-3 sentences
- Use bullet points
- Always ask "Do you understand?" before the next step

5. Conversation Flow
Message Structure Per LLM Call
Each call to the LLM includes:
messages = [
    {
        "role": "system",
        "content": system_prompt  # Generated per spec in Section 4
    },
    {
        "role": "assistant",
        "content": previous_explanation  # If continuing a prior turn
    },
    {
        "role": "user",
        "content": f"Context from curriculum:\n{rag_context}\n\n---\nStudent question: {student_query}"
    }
]

Session State Management
Maintain in Redis:
session = {
    "session_id": "uuid",
    "student_id": "uuid",
    "lesson_id": "uuid",
    "subject": "Mathematics",
    "topic": "Number Bases",
    "mode": "Teach",
    "current_step": 2,  # Track pacing (Intro=1, Explanation=2, ...)
    "conversation_history": [...],  # Last 10 exchanges
    "last_rag_context": {...},  # Cache to avoid re-retrieving the same query
    "tokens_used": 2400,  # For cost tracking
    "started_at": "2026-03-24T10:30:00Z"
}

Context Window Management
- Store last 10 exchanges in session history (avoid infinite context)
- Summarize old exchanges if > 15 turns: “Earlier, we discussed [summary]”
- Reset context every 30 minutes or 5000 tokens (whichever first)
- Log session summaries to database for audit trail
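A minimal sketch of the trimming rule (the `trim_history` helper is hypothetical; in production the summary would come from a cheap LLM call rather than a placeholder string):

```python
def trim_history(history, max_turns=10, summarize=None):
    """Keep the last `max_turns` exchanges; fold older ones into one summary message."""
    if len(history) <= max_turns:
        return history
    old, recent = history[:-max_turns], history[-max_turns:]
    summary = summarize(old) if summarize else f"Earlier, we discussed {len(old)} exchanges."
    return [{"role": "system", "content": summary}] + recent
```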
6. Mode-Specific AI Behavior
The AI behaves differently depending on the selected learning mode. System prompt changes per mode.
Teach Mode
Goal: Deliver structured lesson content step-by-step
Behavior:
1. Follow a rigid structure: Intro → Explanation → Example → Practice → Quiz → Recap
2. Pull full lesson content from RAG (not snippets)
3. Deliver step-by-step; the student must click "Next" to advance
4. Cannot skip steps (enforced by prompt + logic layer)
5. Open with "Today we're learning about {topic}…"
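The logic-layer half of the no-skipping rule can be as small as a step counter that only moves on an explicit "Next" (a sketch; the names are illustrative):

```python
TEACH_STEPS = ["INTRO", "EXPLANATION", "EXAMPLE", "PRACTICE", "QUIZ", "RECAP"]

def advance_step(current_step, student_input):
    """Advance exactly one step on 'Next'; any other input stays put, so steps can't be skipped."""
    if student_input.strip().lower() != "next":
        return current_step
    return min(current_step + 1, len(TEACH_STEPS) - 1)
```

Because the counter lives in the Redis session (`current_step`), a prompt-level request like "skip to the quiz" cannot move the lesson forward.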
System Prompt Addition:
Mode: TEACH
Structure: You must follow these steps in order:
1. INTRO (1-2 sentences): "Today we're learning about X"
2. EXPLANATION (3-5 sentences): Key concept, broken into parts
3. EXAMPLE (worked example): Step-by-step solution
4. PRACTICE: Give 1 easy problem, wait for student answer
5. QUIZ: Give 1 harder problem, check answer
6. RECAP (bullet points): 3-5 key takeaways
You cannot skip or re-order steps. Wait for the "Next" button before advancing.

Explore Mode
Goal: Free-form Q&A, but curriculum-anchored
Behavior:
1. Answer questions broadly within the subject area
2. Default to curriculum content when available
3. Can venture slightly beyond the topic if relevant to the subject
4. Always attempt a redirect: "This ties into {topic}…"
5. Discourage off-topic questions
System Prompt Addition:
Mode: EXPLORE
- Answer questions within {subject}
- Prefer curriculum content, but can expand if relevant
- If student asks off-topic: "That's interesting! It's related to [subject area].
For now, let's focus on what's in our curriculum."
- Keep answers conversational but accurate

Practice Mode
Goal: Generate and validate practice problems
Behavior:
1. Generate 3-5 problems based on the current topic
2. Check the student's answer (exact match, or accept equivalent forms)
3. Provide immediate feedback: correct/incorrect + explanation
4. Adaptive difficulty (future): harder after 2+ correct, easier after none correct
5. Track accuracy for the learning dashboard
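Equivalent-form checking (behavior 2) can be handled outside the LLM for numeric answers; `answers_match` is a sketch that falls back to case-insensitive text comparison:

```python
def answers_match(student, expected, tol=1e-9):
    """Accept numerically equivalent forms, e.g. '5' == '5.0'; otherwise compare text."""
    try:
        return abs(float(student) - float(expected)) <= tol
    except ValueError:
        return student.strip().lower() == expected.strip().lower()
```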
System Prompt Addition:
Mode: PRACTICE
- Generate problems at level: {difficulty_level} (Easy/Medium/Hard)
- For each problem, accept equivalent answers (e.g., "2 + 3" = "5" = "5.0")
- Feedback format: "[Correct/Incorrect] Because: [explanation]"
- After 5 problems: "You got X/5. Ready for more or a recap?"

Revision Mode
Goal: Rapid concept review + memory testing
Behavior:
1. Generate concept summaries (bullet-point format)
2. Rapid-fire recall questions
3. Emphasis on key terms, definitions, and formulas
4. Short, punchy delivery
5. "Flashcard-style" interaction
System Prompt Addition:
Mode: REVISION
- Generate concise bullet-point summaries (max 5 points)
- Followed by 5 recall questions (definition, formula, example)
- Answer format: "Q: [question]\nA: [answer]"
- After 5 Q&As: ask if the student wants more revision or to move on

Exam Prep Mode (Planned)
Goal: UNEB past paper drilling with timed constraints
Behavior:
- Pull past exam questions from LessonContent
- Impose time limits (5-15 min per question type)
- Score & explain answers
- Track performance against exam standards
7. AI Behavior Enforcement (Critical Rules)
These rules are non-negotiable and enforced at multiple layers (prompt + code):
1. Curriculum-First Responses
Every answer must trace back to curriculum. RAG context is mandatory.
Enforcement:
- Code layer: all LLM calls include assert rag_context is not None
- Prompt layer: "Only use content from {topic}"
- Test layer: regex check that the response includes a curriculum reference
2. Topic-Locking
AI cannot leave the selected topic under any circumstance.
Enforcement:
- Metadata filter in Qdrant: only retrieve chunks matching topic == current_topic
- Prompt: "If asked outside {topic}, politely redirect: 'Let's focus on {topic} first.'"
- Response validator: check that the output does not mention unrelated topics
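The response-validator check can start as a plain keyword scan over a per-lesson blocklist (a deliberately simple sketch; `violates_topic_lock` and the blocklist are illustrative, and a real validator might add an embedding-similarity check):

```python
def violates_topic_lock(response, banned_terms):
    """Return the banned topics mentioned in the response (empty list = pass)."""
    lower = response.lower()
    return [term for term in banned_terms if term.lower() in lower]
```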
3. Age-Appropriate Explanations
Vocabulary and complexity must match student age range (metadata: min_age, max_age).
Enforcement:
- Prompt includes: "Use vocabulary appropriate for ages {min_age}-{max_age}"
- Readability checker: Flesch-Kincaid grade level must match the target age
- Avoid: complex jargon, abstract theory, unrelated tangents
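Until a full Flesch-Kincaid checker is wired in, a crude sentence-length proxy can gate responses; both the helper and the `max_age + 8` cap are assumptions for illustration only:

```python
def avg_sentence_length(text):
    """Average words per sentence, splitting on ., !, ?"""
    for mark in "!?":
        text = text.replace(mark, ".")
    sentences = [s for s in text.split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def age_appropriate(text, max_age):
    """Crude readability gate: assumed cap of (max_age + 8) words per sentence."""
    return avg_sentence_length(text) <= max_age + 8
```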
4. Localized Examples
Always use Uganda/Zambia context (currency, places, cultural references).
Enforcement:
- Prompt: "Use examples relevant to {region} (Uganda/Zambia context)"
- RAG context includes regional examples from lesson content
- Avoid: USD pricing and Western cultural references unless unavoidable
5. UNEB Exam Standard Alignment
Teaching must align with UNEB syllabus for rigorous assessment.
Enforcement:
- LessonContent metadata includes exam_board = "UNEB"
- Prompt references: "Align explanations with UNEB standards"
- RAG retrieves only UNEB-approved content
6. Off-Topic Question Handling
When student asks unrelated question, AI must politely redirect—not ignore or refuse.
Example:
- Student: "How do I make a video game?"
- AI: "That's a cool interest! For now, let's focus on Number Bases. After we finish, you can explore coding. Now, where were we? Do you understand place value?"
7. AI Must Never Run Without Curriculum Context
Critical: If RAG fails (Qdrant down, no matching chunks), AI does not answer.
Fallback Behavior:
if not rag_context or len(rag_context) < 100:
    return {
        "status": "error",
        "message": "I couldn't find curriculum content for that question. Please try again or ask your teacher.",
        "error_code": "RAG_UNAVAILABLE"
    }

8. Avatar & Voice Integration (Future/Planned)
Animated Avatar
Phase 2+: Lip-synced avatar to humanize tutoring experience.
Options:
- HeyGen Streaming API: pre-recorded videos, real-time mouth-sync
- D-ID: live avatar generation (more flexible, higher latency)
Placement Rules:
- Left or center of screen (not obstructive)
- Proportional to the classroom UI (not oversized)
- Optional: can be toggled off by the student
Voice Synthesis
Options:
- ElevenLabs: high quality, fast, ~$0.30 per 1K characters
- OpenAI TTS: integrated, $0.015 per 1K characters (cost-effective)
Implementation:
- Stream audio chunks as the response is generated (don't wait for the full response)
- Sync avatar mouth movements to audio playback
9. AI Backend Services
Architecture
Django App (Grio)
├── `/api/ai/message` (POST) → Message Handler
│ ├─ Validate input + session
│ ├─ Retrieve curriculum context (RAG)
│ ├─ Build system prompt
│ ├─ Call LLM (OpenAI)
│ └─ Validate + store response
├── `/api/ai/health` (GET) → Health Check
└── `/api/ai/embed-content` (POST) → Embedding Job

LLM Prompt/Retrieval Management
Use LangChain or custom orchestration:
from langchain_openai import ChatOpenAI  # langchain.chat_models in older releases
from langchain_core.messages import HumanMessage, SystemMessage

class GrioTutoringChain:
    def __init__(self, lesson_id, topic, mode):
        self.lesson_id = lesson_id
        self.topic = topic
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
        self.system_prompt = self.build_system_prompt(lesson_id, topic, mode)

    def answer(self, query):
        # Conversation history lives in the Redis session (Section 5); each call is stateless here
        context = retrieve_context(query, self.lesson_id, self.topic)
        messages = [
            SystemMessage(content=self.system_prompt),
            HumanMessage(content=f"Context from curriculum:\n{context}\n\n---\nStudent question: {query}"),
        ]
        return self.llm.invoke(messages)

Caching Strategy
- System prompt cache: Store per (lesson_id, topic, mode) tuple for 1 hour
- Context cache: Store RAG results per query + filters for 30 min
- Response cache: Cache identical queries within same session for 5 min
Cache Backend: Redis, TTL-based expiration
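The RAG-result key from Section 10 (`hash(query + filters)`) can be made deterministic regardless of filter ordering (a sketch; the `rag:` prefix is illustrative):

```python
import hashlib
import json

def rag_cache_key(query, filters):
    """Deterministic Redis key for cached RAG results; sort_keys makes filter order irrelevant."""
    payload = json.dumps({"q": query, "f": filters}, sort_keys=True)
    return "rag:" + hashlib.sha256(payload.encode()).hexdigest()
```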
Health Check Endpoint
GET /api/ai/health/
Response:
{
    "status": "healthy",
    "llm_status": "online",
    "qdrant_status": "online",
    "uptime_seconds": 86400,
    "last_message_processed": "2026-03-24T15:45:00Z"
}

Error Handling & Fallback
| Failure | Fallback |
|---|---|
| RAG unavailable | Return error + suggestion to contact teacher |
| LLM timeout (>10s) | Return cached prior response if available; else error |
| LLM returns off-topic | Validate + re-prompt with stricter constraints |
| Token limit exceeded | Summarize conversation history; continue |
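The first two rows of the table can be sketched as a thin wrapper around the LLM call (an illustrative helper; a production version would also cover the re-prompt and summarization rows):

```python
def safe_llm_call(call, cached_response=None):
    """Run the LLM call; on timeout, fall back to a cached reply or a structured error."""
    try:
        return call()
    except TimeoutError:
        if cached_response is not None:
            return cached_response  # serve the prior response rather than fail
        return {"status": "error", "error_code": "LLM_TIMEOUT"}
```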
10. Performance & Optimization
Response Latency Targets
- Teach mode: < 3 seconds (acceptable for step-by-step delivery)
- Explore mode: < 4 seconds (more context required)
- Practice mode: < 2 seconds (students impatient with feedback)
- Revision mode: < 2 seconds (rapid-fire nature)
Caching Strategy
- System prompts: Keyed by (lesson_id, topic, mode); 1-hour TTL
- RAG results: Keyed by hash(query + filters); 30-min TTL
- Embeddings: Batch cache for daily re-indexing jobs
- LLM responses: Exact-match query cache per session; 5-min TTL
Batch Embedding for Curriculum Updates
# Run weekly or after content changes
uv run python manage.py embed_lesson_content \
--batch-size 32 \
--workers 4 \
--force-refresh

Performance:
- Embed 10K chunks in ~15 minutes on a standard GPU
- Async job queue (Celery) avoids blocking the API
Cost Management (API-Based LLM)
Monthly tracking:
- Log tokens per student per lesson
- Alert if monthly spend > $10K (scale trigger)
- Optimize prompt length periodically
- Consider the self-hosted fallback if costs exceed budget
Cost optimization levers:
1. Shorter system prompts (already minimal)
2. Smaller context windows (currently 3 chunks; could reduce to 1-2)
3. Use GPT-3.5-turbo for non-critical modes (Explore, Practice)
4. Batch embeddings (don't re-embed unchanged content)
Appendix: Deployment Checklist
- Qdrant instance running (Docker/K8s)
- OpenAI API key secured (rotate monthly)
- LessonContent chunking + embedding pipeline operational
- System prompt templates defined per mode
- Curriculum context validator in code
- Health check endpoint monitored (PagerDuty)
- Session storage (Redis) with TTL policies
- Response latency monitoring (Prometheus/Grafana)
- Cost tracking dashboard (OpenAI usage)
- Fallback behavior tested (RAG/LLM failures)
- Age-appropriate language checker integrated
- Topic-locking enforced in all modes
- UNEB standards audit completed
Document Version: 1.0
Next Review: June 2026
Contact: Engineering Lead (AI/ML)