Architecture

Overview

AITA follows a three-layer architecture:

┌─────────────────────────────────────────────────────┐
│                   Frontend (Streamlit)               │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │  Login /  │  │  Chat UI     │  │  Admin Panel  │  │
│  │  Google   │  │  (multi-turn │  │  (settings,   │  │
│  │  Auth     │  │   dialogue)  │  │   analytics)  │  │
│  └──────────┘  └──────────────┘  └───────────────┘  │
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│                 RAG Pipeline                         │
│  1. Embed user query (OpenAI)                       │
│  2. Retrieve top-k chunks from FAISS (week-filtered)│
│  3. Inject current homework if relevant             │
│  4. Build prompt with system instructions + context  │
│  5. Generate response (OpenAI)                      │
└────────────────────────┬────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────┐
│                    Data Layer                        │
│  ┌─────────────┐  ┌────────────┐  ┌──────────────┐  │
│  │ FAISS Index  │  │ SQLite DB  │  │ Course Docs  │  │
│  │ (embeddings) │  │ (logs,     │  │ (PDFs, .tex) │  │
│  │              │  │  feedback)  │  │              │  │
│  └─────────────┘  └────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────┘

Request Flow

When a student asks a question, here’s what happens:

  1. Query embedding — The user’s question is embedded using OpenAI’s text-embedding-3-large model.

  2. Retrieval — The embedding is used to search the FAISS index for the top-k most similar document chunks. Chunks are filtered by max_week so only content from topics already covered is returned.

  3. Homework injection — If the query mentions “homework”, “hw”, or “assignment”, the system checks whether the current week’s homework is in the retrieved results. If not, it’s injected as the first chunk.

  4. Prompt construction — A system prompt is built with:

    • The course-specific pedagogical instructions

    • Week-awareness: current week, covered topics, future topics

    • The retrieved course material chunks

  5. LLM generation — The complete message list (system prompt + chat history + user query) is sent to the LLM.

  6. Logging — The interaction (question, response, sources, week) is saved to the SQLite database.

Week-Aware Filtering

Every document chunk has a max_week metadata field set during ingestion. This is determined by matching the filename against the mappings in CourseConfig:

  • topic_num_to_week — for handouts and slides (by number prefix)

  • hw_num_to_week — for homework files

  • lab_num_to_week — for lab files

  • study_guide_to_week — for study guides and quizzes

  • textbook_chapter_to_week — for wikibook chapters

During retrieval, chunks with max_week > current_week are excluded. This prevents the chatbot from discussing future topics.

The system prompt also lists covered and uncovered topics, instructing the LLM to redirect questions about future material.

Document Ingestion Pipeline

The ingestion pipeline (run_ingestion) processes course materials into a searchable FAISS index:

  1. Collection — Collector functions scan the course materials directory and load documents (PDFs via PDFMiner, LaTeX files, or web pages).

  2. Chunking — Each document is split into overlapping chunks (default: 2048 chars with 256 char overlap). Source metadata is prepended to each chunk.

  3. Embedding — Chunks are embedded in batches using the OpenAI API.

  4. Indexing — Embeddings are stored in a FAISS IndexFlatIP (inner product / cosine similarity) index. Chunk metadata is saved alongside in a pickle file.

  5. Backup — If a previous index exists, it’s backed up with a timestamp.

Module Dependency Graph

__init__.py
    └── config.py      (CourseConfig, set_config, get_config)
    └── app.py          (Streamlit UI)
            ├── config.py
            ├── rag.py         (RAG pipeline)
            │       ├── config.py
            │       └── [OpenAI API, FAISS]
            ├── db.py          (SQLite logging)
            │       └── config.py
            ├── admin.py       (Admin dashboard)
            │       ├── config.py
            │       └── db.py
            └── oauth_store.py (PKCE state)

ingest.py  (standalone, called from add_document.py)
    ├── config.py
    ├── utils.py
    └── [OpenAI API, FAISS, PDFMiner]