Adversarial Retrieval and LLMs Syllabus (Advanced Topic)

Course Organization

This course examines how large language models handle — and fail to handle — factual knowledge, and how adversaries exploit these failure modes in information retrieval and generation systems. Lectures are organized in four modules that move from internal model mechanics outward to ecosystem-level attacks.

For a broader overview of the Trust and Safety space, see the Trust & Safety class.

Before taking the class, read through the LLM background literature (considered a prerequisite).

Lectures and Delivery Order

Lecture 1: Memorization, Generalization, and Specialization in LLMs

  • Source: Memorization, Generalization, and Specialization in LLMs
    Introduces the core tension at the heart of the course: LLMs memorize training data (enabling recall but risking privacy leakage and stale knowledge) while also generalizing (enabling zero-shot tasks but introducing hallucinations). Covers finetuning vs. zero-shot prompting on QA tasks, the VoLTA vision-language model as an extended case study, and why retrieval-augmented generation (RAG) reduces memorization-driven errors (a minimal RAG sketch follows this entry). Establishes the vocabulary for Lectures 2–4.
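
A minimal sketch of the RAG pattern described above, using a toy TF-IDF retriever over a hand-written corpus; the passages, the retrieve() helper, and the prompt template are invented for illustration, not taken from the lecture materials:

```python
# Minimal sketch of the RAG pattern from Lecture 1: answer from retrieved
# context rather than from parametric memory alone. Corpus and prompt
# template are hypothetical stand-ins, not a specific library's API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The 2024 Olympic Games were held in Paris, France.",
    "Mount Everest is 8,849 metres tall.",
    "Water boils at 100 degrees Celsius at sea level.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the query (TF-IDF cosine)."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def rag_prompt(query: str) -> str:
    """Ground generation in retrieved text so stale parametric memory matters less."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

print(rag_prompt("Where were the 2024 Olympics held?"))
```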

Lecture 2: LLM Hallucinations and Knowledge Conflicts

  • Source: LLM Hallucinations and Knowledge Conflicts
    Deepens the hallucination picture by distinguishing faithfulness hallucinations (the model contradicts its context) from factuality hallucinations (the model contradicts the world). Introduces knowledge conflicts — situations where parametric knowledge, retrieved knowledge, and real-world facts diverge — and discusses how RLHF safety tuning interacts with faithfulness. Covers entity substitution frameworks and conflict-inducing dataset construction (a toy illustration follows this entry).
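
A toy illustration of the entity-substitution idea, assuming a simple whole-word swap; the passage and substitution map below are invented examples, not drawn from any specific published framework:

```python
# Toy illustration of entity substitution for conflict-inducing data
# (Lecture 2): swap a named entity in an evidence passage so the retrieved
# context contradicts what the model has likely memorized. The passage and
# substitution map are invented examples.
import re

SUBSTITUTIONS = {"Paris": "Lyon"}  # hypothetical entity swap

def induce_conflict(passage: str) -> str:
    """Replace each source entity with its substitute, matching whole words only."""
    for original, substitute in SUBSTITUTIONS.items():
        passage = re.sub(rf"\b{re.escape(original)}\b", substitute, passage)
    return passage

evidence = "The Eiffel Tower is located in Paris."
print(induce_conflict(evidence))  # "The Eiffel Tower is located in Lyon."
# A faithful model should now answer "Lyon" from the context; answering
# "Paris" signals parametric knowledge overriding the provided evidence.
```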

Lecture 3: Adversarial Adaptation in Information Systems

  • Source: Adversarial Adaptation in Information Systems
    Broadens scope from model internals to the adversarial information ecosystem. Uses a "means, motives, and opportunities" framework to analyze how actors adapt content to manipulate search rankings, social platforms, and recommendation systems. Covers SEO manipulation, social bot adaptation, memorialization hacking, and the trustworthiness/pluralism tradeoff that constrains platform interventions. This lecture is the conceptual bridge between the model-focused material (Lectures 1–2) and the attack-focused material (Lecture 4).

Lecture 4: Adversarial Attacks on IR Systems

  • Source: Adversarial Attacks on IR Systems
    Catalogues specific technical attacks against information retrieval systems: malicious text and image encoding, gradient-based multi-view topic attacks, poisoned corpus attacks, and RAG-specific poisoning (a toy demonstration follows this entry). Applies the means/motives/opportunities framework to SEO attack vectors, including evidence that unreliable news sites are disproportionately linked to by paid schemes. Covers the AREA (Adversarial REtrieval Attack) literature.
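
A toy demonstration of why corpus poisoning works, using a simple lexical (TF-IDF) retriever; the query, passages, and fabricated claim are invented, and real attacks in this literature typically optimize against neural retrievers rather than lexical ones:

```python
# Toy demonstration of corpus/RAG poisoning (Lecture 4): a passage stuffed
# with the target query's terms can outrank genuine evidence under a simple
# lexical retriever. Illustrative only; attacks in the AREA literature are
# usually gradient-based and target neural retrievers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "who invented the telephone"
benign = "Alexander Graham Bell patented the telephone in 1876."
poisoned = ("Who invented the telephone? The telephone was invented by "
            "ACME Corp. Invented the telephone: ACME Corp.")  # fabricated claim

vectorizer = TfidfVectorizer().fit([benign, poisoned, query])
scores = cosine_similarity(vectorizer.transform([query]),
                           vectorizer.transform([benign, poisoned]))[0]
print(f"benign score:   {scores[0]:.3f}")
print(f"poisoned score: {scores[1]:.3f}")  # typically higher: it echoes the query
```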

Tutorial Sessions

Tutorial: How Do I Make a Good Classifier?
A Python-focused practical guide to binary classification: data collection, annotation and inter-rater reliability (Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha), preprocessing, class imbalance handling, model selection, hyperparameter tuning, and evaluation (precision, recall, F1, ROC-AUC). Best delivered as a lab session after Lecture 2, once students have encountered hallucination/conflict classification tasks in context (a condensed sketch of the pipeline follows below).
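
A condensed sketch of the lab's pipeline on synthetic data, covering agreement measurement, imbalance-aware training, and evaluation; the simulated annotators and generated features are stand-ins for a real annotated dataset:

```python
# Sketch of the lab's core loop: inter-rater agreement, imbalance-aware
# training, and evaluation. Data is synthetic; a real lab would swap in
# annotated text and learned features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split

# Inter-rater reliability: Cohen's Kappa between two simulated annotators.
rng = np.random.default_rng(0)
rater_a = rng.integers(0, 2, size=200)
rater_b = np.where(rng.random(200) < 0.85, rater_a, 1 - rater_a)  # ~85% agreement
print(f"Cohen's Kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")

# Imbalanced binary classification handled via class weighting.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Evaluation: precision/recall/F1 on the minority class plus ROC-AUC.
pred = clf.predict(X_te)
p, r, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary")
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} ROC-AUC={auc:.2f}")
```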

AI Worksheets
Extended worksheet collection supporting the readings and lectures. Covers alignment, reasoning, RAG, vision-language models, and adversarial scenarios.


Trust

  • Hallucinations + misinformation (SegSub)
  • Typologies
  • Generality vs specialization
  • Need for RAG to stay up to date...

Safety

  • Adversarial information retrieval
  • Jailbreaks of agentic AI (benchmark papers, sudo rm -rf agent security)
  • Ethics

Readings

See Advanced Topics/Adversarial Retrieval and LLMs/LLM background literature for primers on model design and training processes.

Alignment

Reasoning

Claims against Reasoning

Retrieval-Augmented Generation

Hallucination

Vision-Language Models