`aita_core.ingest` — Document Ingestion

The ingestion module processes course materials into a searchable FAISS vector index.

Pipeline Runner

aita_core.ingest.run_ingestion(config, collectors=None)

Run the full document ingestion pipeline.

Parameters:

config – CourseConfig instance.
collectors – Optional list of (name, callable) pairs. Each callable receives config and returns a list of document dicts. If None, the default collectors for the standard directory layout are used.
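As a rough illustration of the collectors contract described above, the following sketch shows one way a list of (name, callable) pairs could be consumed. The helper `gather_docs` and the two stand-in collectors are hypothetical and are not part of `aita_core`; the real pipeline may log, filter, or order results differently.

```python
from typing import Any, Callable

Doc = dict[str, Any]
Collector = Callable[[Any], list[Doc]]


def gather_docs(config: Any, collectors: list[tuple[str, Collector]]) -> list[Doc]:
    """Call each (name, callable) pair with config and pool the returned docs.

    Hypothetical helper: only mirrors the documented contract.
    """
    all_docs: list[Doc] = []
    for name, collect in collectors:
        docs = collect(config)
        print(f"{name}: {len(docs)} document(s)")
        all_docs.extend(docs)
    return all_docs


# Stand-in collectors for illustration only.
def fake_handouts(config: Any) -> list[Doc]:
    return [{"text": "Handout text", "metadata": {"source_label": "Handout: A.pdf"}}]


def fake_syllabus(config: Any) -> list[Doc]:
    return [{"text": "Syllabus text", "metadata": {"source_label": "Syllabus"}}]


docs = gather_docs(
    config=None,
    collectors=[("handouts", fake_handouts), ("syllabus", fake_syllabus)],
)
```

Keeping the name alongside each callable makes it straightforward to report which collector produced which documents.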

Default Collectors

These collectors work with the standard directory layout:

aita_core.ingest.collect_handouts(config): Load handout PDFs from Handouts/Handouts/.

aita_core.ingest.collect_homework(config): Load homework PDFs from Homework handouts/Homework handouts/, skipping solutions.

aita_core.ingest.collect_slides(config): Load slide content from Slides/Slides/<topic>/ (content.tex or Notes.pdf).

aita_core.ingest.collect_syllabus(config_or_dir)

Load the syllabus from its standard location.

Accepts either a CourseConfig or a course_materials_dir path for backwards compatibility.
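The dual-argument behaviour can be pictured with a small sketch. The helper `resolve_materials_dir` is hypothetical, and the sketch assumes the config exposes a `course_materials_dir` attribute; how `collect_syllabus` actually distinguishes the two cases is not specified here.

```python
import os
from types import SimpleNamespace


def resolve_materials_dir(config_or_dir) -> str:
    """Return the materials directory from either a path or a config-like object.

    Hypothetical helper: a plain path is used as-is, anything else is
    assumed to carry a course_materials_dir attribute.
    """
    if isinstance(config_or_dir, (str, os.PathLike)):
        return os.fspath(config_or_dir)
    return os.fspath(config_or_dir.course_materials_dir)


# SimpleNamespace stands in for a CourseConfig instance.
config = SimpleNamespace(course_materials_dir="/data/course")

from_config = resolve_materials_dir(config)
from_path = resolve_materials_dir("/data/course")
```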

aita_core.ingest.collect_wikibook(config)

Collect chapters from a Wikibook URL.

Requires config to have:

textbook_url: str – base URL, e.g. "https://en.wikibooks.org/wiki/Fundamentals_of_Transportation"
textbook_chapter_to_week: dict[str, int] – maps chapter slug to week, e.g. {"Trip_Generation": 2, "Route_Choice": 4}
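To show how these two settings might fit together, here is a minimal sketch that pairs each chapter URL with its week. It assumes chapters live at the base URL followed by the slug, which matches the usual Wikibooks subpage convention; the helper `chapter_urls` is hypothetical, and `collect_wikibook` may build its URLs differently.

```python
def chapter_urls(textbook_url: str, chapter_to_week: dict[str, int]) -> list[tuple[str, int]]:
    """Pair each chapter URL with the week it is assigned to.

    Assumption: a chapter page is <base URL>/<chapter slug>.
    """
    base = textbook_url.rstrip("/")
    return [(f"{base}/{slug}", week) for slug, week in chapter_to_week.items()]


pages = chapter_urls(
    "https://en.wikibooks.org/wiki/Fundamentals_of_Transportation",
    {"Trip_Generation": 2, "Route_Choice": 4},
)
for url, week in pages:
    print(week, url)
```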

Document Loaders

aita_core.ingest.load_pdf(file_path, source_label, max_week=1)

aita_core.ingest.load_tex(file_path, source_label, max_week=1)

aita_core.ingest.load_wikibook_page(url): Fetch a Wikibook page and extract clean text content.

Text Chunking

aita_core.ingest.chunk_documents(docs, chunk_size=2048, overlap=256): Split all documents into chunks, preserving metadata.

aita_core.ingest.chunk_text(text, chunk_size=2048, overlap=256): Split text into overlapping chunks.
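The idea behind overlapping chunks can be sketched with a simple sliding window. This is not the library's implementation: the functions below are hypothetical stand-ins, and they assume chunk_size and overlap are measured in characters, whereas the real functions may count tokens or split on sentence boundaries.

```python
def chunk_text_sketch(text: str, chunk_size: int = 2048, overlap: int = 256) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_documents_sketch(docs: list[dict], chunk_size: int = 2048, overlap: int = 256) -> list[dict]:
    """Chunk every document, copying its metadata onto each chunk."""
    chunks: list[dict] = []
    for doc in docs:
        for piece in chunk_text_sketch(doc["text"], chunk_size, overlap):
            chunks.append({"text": piece, "metadata": dict(doc["metadata"])})
    return chunks


print(chunk_text_sketch("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so the tail of one chunk reappears at the head of the next.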

Embedding & Indexing

aita_core.ingest.get_embeddings(texts, embedding_model='text-embedding-3-large', batch_size=100): Call OpenAI embeddings API in batches. Returns numpy array.
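Batching can be illustrated without touching the network. In the sketch below, `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names; `fake_embed` stands in for the API call, and the result is a plain list of vectors rather than the numpy array that `get_embeddings` returns.

```python
from typing import Callable, Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts batch by batch, keeping vectors in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch: Sequence[str]) -> list[list[float]]:
    # Stand-in for a real embeddings request: one 2-d vector per text.
    return [[float(len(text)), 0.0] for text in batch]


texts = [f"chunk {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in batched(texts)]
vectors = embed_in_batches(texts, fake_embed)
```

With 250 texts and the default batch size, this issues three calls instead of 250.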

aita_core.ingest.build_faiss_index(embeddings): Build a FAISS index from embeddings.

aita_core.ingest.save_index(index, chunks, faiss_dir, docs_dir, backup_dir): Save FAISS index and chunk metadata, with backup of existing data.
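The backup-before-overwrite pattern might look roughly like the sketch below. It covers only the chunk metadata, not the FAISS index itself, and everything in it is an assumption: the helper `backup_then_write`, the `chunks.json` file name, and the backup layout are hypothetical and do not describe what `save_index` writes.

```python
import json
import shutil
import tempfile
from pathlib import Path


def backup_then_write(chunks: list[dict], docs_dir: Path, backup_dir: Path) -> Path:
    """Copy any existing docs_dir into backup_dir, then write fresh chunk metadata."""
    docs_dir, backup_dir = Path(docs_dir), Path(backup_dir)
    if docs_dir.exists() and any(docs_dir.iterdir()):
        target = backup_dir / docs_dir.name
        if target.exists():
            shutil.rmtree(target)  # keep only the most recent backup
        shutil.copytree(docs_dir, target)
    docs_dir.mkdir(parents=True, exist_ok=True)
    out_path = docs_dir / "chunks.json"  # hypothetical file name
    out_path.write_text(json.dumps(chunks, indent=2), encoding="utf-8")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    backup_then_write([{"text": "old"}], root / "docs", root / "backup")
    backup_then_write([{"text": "new"}], root / "docs", root / "backup")
    current = json.loads((root / "docs" / "chunks.json").read_text(encoding="utf-8"))
    previous = json.loads((root / "backup" / "docs" / "chunks.json").read_text(encoding="utf-8"))
```

After the second call, the live directory holds the new data and the backup directory holds the previous run.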

Week Assignment

aita_core.ingest.get_week_for_filename(filename, topic_num_to_week, hw_num_to_week, lab_num_to_week, study_guide_to_week): Determine which week a document belongs to based on its filename.
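A filename-to-week lookup could work along the following lines. This sketch handles only homework files and rests on several assumptions: the filename pattern (`HW3`, `hw-12`, `Homework_2`), integer keys in hw_num_to_week, and a fallback of week 1 are all hypothetical, and the real function also consults the topic, lab, and study-guide mappings.

```python
import re


def week_from_filename_sketch(
    filename: str,
    hw_num_to_week: dict[int, int],
    default_week: int = 1,
) -> int:
    """Look up the week for a homework file from the number in its name."""
    # Assumed naming scheme: "hw" or "homework", an optional separator, then digits.
    match = re.search(r"(?:hw|homework)[ _-]?(\d+)", filename, flags=re.IGNORECASE)
    if match:
        return hw_num_to_week.get(int(match.group(1)), default_week)
    return default_week


print(week_from_filename_sketch("HW3_handout.pdf", {3: 5}))
# 5
```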

Custom Collectors

If your course materials don’t follow the standard directory layout, write custom collector functions and pass them to run_ingestion():

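import os

from aita_core.ingest import (
    collect_syllabus, get_week_for_filename, load_pdf, run_ingestion,
)


def my_collect_handouts(config):
    """Collect handout PDFs from a custom location."""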
    docs = []
    handout_dir = os.path.join(config.course_materials_dir, "my_handouts")
    for pdf in sorted(os.listdir(handout_dir)):
        if not pdf.endswith(".pdf"):
            continue
        path = os.path.join(handout_dir, pdf)
        week = get_week_for_filename(
            pdf, config.topic_num_to_week, config.hw_num_to_week,
            config.lab_num_to_week, config.study_guide_to_week,
        )
        docs.extend(load_pdf(path, f"Handout: {pdf}", max_week=week))
    return docs

# CONFIG is your course's CourseConfig instance.
run_ingestion(CONFIG, collectors=[
    ("handouts", my_collect_handouts),
    ("syllabus", collect_syllabus),
])

Each collector function receives the CourseConfig and returns a list of document dicts with the structure:

{
    "text": "The full text content...",
    "metadata": {
        "source": "/path/to/file.pdf",      # or URL
        "source_label": "Handout: Topic.pdf", # display label
        "max_week": 3,                        # week availability
    }
}
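When writing a custom collector, it can help to check its output against this structure before running the full pipeline. The validator below is a hypothetical helper, not part of `aita_core`; it assumes all three metadata keys are required and that max_week is an integer.

```python
from typing import Any

REQUIRED_METADATA_KEYS = {"source", "source_label", "max_week"}


def is_valid_doc(doc: Any) -> bool:
    """Return True if doc has a text string and the expected metadata keys."""
    if not isinstance(doc, dict):
        return False
    if not isinstance(doc.get("text"), str):
        return False
    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        return False
    if not REQUIRED_METADATA_KEYS <= metadata.keys():
        return False
    return isinstance(metadata["max_week"], int)


good = {
    "text": "The full text content...",
    "metadata": {
        "source": "/path/to/file.pdf",
        "source_label": "Handout: Topic.pdf",
        "max_week": 3,
    },
}
bad = {"text": "Missing metadata"}

print(is_valid_doc(good), is_valid_doc(bad))
# True False
```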
