aita_core.ingest — Document Ingestion
The ingestion module processes course materials into a searchable FAISS vector index.
Pipeline Runner
- aita_core.ingest.run_ingestion(config, collectors=None)
Run the full document ingestion pipeline.
- Parameters:
config – CourseConfig instance.
collectors – Optional list of (name, callable) pairs. Each callable receives config and returns a list of docs. If None, uses default collectors for the standard directory layout.
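The (name, callable) contract described above can be illustrated with a small stand-in runner. This is only a sketch of the calling convention, not the implementation of run_ingestion: FakeConfig, collect_notes, and gather_docs are hypothetical names introduced for the example, and the real pipeline also chunks, embeds, and indexes the documents.

```python
from dataclasses import dataclass


@dataclass
class FakeConfig:
    """Hypothetical stand-in for CourseConfig (the real class has more fields)."""
    course_materials_dir: str


def collect_notes(config):
    """Toy collector: returns one document dict without touching the disk."""
    return [{
        "text": "Week 1 notes",
        "metadata": {
            "source": f"{config.course_materials_dir}/notes.txt",
            "source_label": "Notes",
            "max_week": 1,
        },
    }]


def gather_docs(config, collectors):
    """Call each (name, callable) pair in order and pool the returned docs."""
    all_docs = []
    counts = {}
    for name, collect in collectors:
        docs = collect(config)      # each callable receives the config
        counts[name] = len(docs)    # the name is only used for reporting here
        all_docs.extend(docs)
    return all_docs, counts


docs, counts = gather_docs(FakeConfig("/tmp/course"), [("notes", collect_notes)])
```

Keeping the name separate from the callable lets a runner report per-collector results without inspecting the function itself.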
Default Collectors
These collectors work with the standard directory layout:
- aita_core.ingest.collect_homework(config)
Load homework PDFs from Homework handouts/Homework handouts/, skipping solutions.
- aita_core.ingest.collect_slides(config)
Load slide content from Slides/Slides/<topic>/ (content.tex or Notes.pdf).
- aita_core.ingest.collect_syllabus(config_or_dir)
Load the syllabus from its standard location.
Accepts either a CourseConfig or a course_materials_dir path for backwards compatibility.
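One common way to accept either argument type is to normalize it to a directory path first. The sketch below shows that pattern under stated assumptions: resolve_materials_dir and FakeConfig are hypothetical, and collect_syllabus may handle the two cases differently internally.

```python
import os
from dataclasses import dataclass


@dataclass
class FakeConfig:
    """Hypothetical stand-in for CourseConfig."""
    course_materials_dir: str


def resolve_materials_dir(config_or_dir):
    """Return the materials directory for either kind of argument."""
    # Plain paths (str or pathlib.Path) are returned as strings.
    if isinstance(config_or_dir, (str, os.PathLike)):
        return os.fspath(config_or_dir)
    # Anything else is treated as a config object carrying the directory.
    return config_or_dir.course_materials_dir
```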
- aita_core.ingest.collect_wikibook(config)
Collect chapters from a Wikibook URL.
- Requires config to have:
  - textbook_url: str – base URL, e.g. "https://en.wikibooks.org/wiki/Fundamentals_of_Transportation"
  - textbook_chapter_to_week: dict[str, int] – maps chapter slug to week, e.g. {"Trip_Generation": 2, "Route_Choice": 4}
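To make the two settings concrete, the sketch below pairs each chapter slug with a page URL and its week. It assumes chapter pages sit at the base URL followed by a slash and the slug, which is how Wikibooks subpages are addressed; chapter_urls is a hypothetical helper, and collect_wikibook may build its requests differently.

```python
textbook_url = "https://en.wikibooks.org/wiki/Fundamentals_of_Transportation"
textbook_chapter_to_week = {"Trip_Generation": 2, "Route_Choice": 4}


def chapter_urls(base_url, chapter_to_week):
    """Pair each chapter's assumed URL with its week, ordered by week."""
    base = base_url.rstrip("/")  # tolerate a trailing slash on the base URL
    pairs = [(f"{base}/{slug}", week) for slug, week in chapter_to_week.items()]
    return sorted(pairs, key=lambda pair: pair[1])


urls = chapter_urls(textbook_url, textbook_chapter_to_week)
```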
Document Loaders
Text Chunking
Embedding & Indexing
Week Assignment
Custom Collectors
If your course materials don’t follow the standard directory layout, write
custom collector functions and pass them to run_ingestion():
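    import os

    from aita_core.ingest import collect_syllabus, run_ingestion
    # load_pdf and get_week_for_filename are assumed to be importable from
    # aita_core.ingest as well; adjust the import to match your installation.
    from aita_core.ingest import get_week_for_filename, load_pdf


    def my_collect_handouts(config):
        """Collect handout PDFs from a custom location."""
        docs = []
        handout_dir = os.path.join(config.course_materials_dir, "my_handouts")
        for pdf in sorted(os.listdir(handout_dir)):
            if not pdf.endswith(".pdf"):
                continue
            path = os.path.join(handout_dir, pdf)
            # Map the filename to the week in which the handout is released.
            week = get_week_for_filename(
                pdf, config.topic_num_to_week, config.hw_num_to_week,
                config.lab_num_to_week, config.study_guide_to_week,
            )
            docs.extend(load_pdf(path, f"Handout: {pdf}", max_week=week))
        return docs


    # CONFIG is your CourseConfig instance.
    run_ingestion(CONFIG, collectors=[
        ("handouts", my_collect_handouts),
        ("syllabus", collect_syllabus),
    ])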
Each collector function receives the CourseConfig and returns a list of
document dicts with the structure:
    {
        "text": "The full text content...",
        "metadata": {
            "source": "/path/to/file.pdf",         # or URL
            "source_label": "Handout: Topic.pdf",  # display label
            "max_week": 3,                         # week availability
        }
    }
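When writing a custom collector, it can help to check its output against this structure before running the full pipeline. The helper below is a minimal sketch, not part of aita_core: doc_problems is a hypothetical name, and it assumes max_week is always an integer, which you may need to relax if your week lookup can return None.

```python
# Expected metadata keys and types, taken from the structure shown above.
REQUIRED_METADATA = {"source": str, "source_label": str, "max_week": int}


def doc_problems(doc):
    """Return a list of shape problems; an empty list means the doc looks valid."""
    problems = []
    if not isinstance(doc.get("text"), str):
        problems.append("text must be a string")
    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata must be a dict")
        return problems
    for key, expected in REQUIRED_METADATA.items():
        if not isinstance(metadata.get(key), expected):
            problems.append(f"metadata[{key!r}] must be {expected.__name__}")
    return problems


good = {
    "text": "The full text content...",
    "metadata": {
        "source": "/path/to/file.pdf",
        "source_label": "Handout: Topic.pdf",
        "max_week": 3,
    },
}
bad = {"text": "x", "metadata": {"source": "a"}}
```

Calling doc_problems on every item a collector returns gives a quick sanity check during development.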