Trust & Safety Class

Curriculum

Fri, 01 May 2026 16:25:22 GMT

The three course assignments are scaffolded to follow this arc: the Discord Bot covers reactive moderation (Weeks 2–6), the Bluesky Labeler covers proactive moderation (Weeks 6–10), and the Podcast Factchecking project covers applied T&S research (Weeks 10–15).

Lecture 1: Introduction to Trust & Safety
Source: Introduction to Trust and Safety
Content: purpose and history of T&S; high-level taxonomy of abuse types (CSAM, hate speech, spam, terrorism, self-harm, fraud, etc.); reactive vs. proactive models; building T&S teams; overview of automated technologies
Supplement: 🎥 20-min lecture on intelligence in T&S by Inbal Goldberger (ActiveFence)
Short case video: 🎥 Case video / Transcript

Lecture 2: Large-Scale T&S Systems in Practice
Source: Large Scale Trust & Safety Systems
Content: how industry T&S infrastructure is organized at scale; the organizational modules and overarching problems; provides the map for the rest of the semester
Note: this lecture situates everything covered in Lectures 3–30; use it as a semester roadmap that students can refer back to throughout the course.
Reading: 📘 tsbook ch1 — "Fighting the Forever War" (Stamos, Grossman, Pfefferkorn)
Active Learning: Play Moderator Mayhem individually before class; debrief in-class on the ambiguity revealed by the game.
Note on scope: Use Dotmocracy in this first week to have students vote on which abuse types the class will cover in depth. This shapes the harm-specific weeks (Weeks 7–9, 14) and ensures the class addresses what students find most pressing.

Lecture 3: Pitfalls of Binary Classification
Source: Pitfalls of Binary Classification
Content: enforcing platform policies against concrete examples; borderline cases; contrasting policies for the same harm; why binary framing breaks down at the margins
Active learning strategies built in:
"Set It Up" — classify 3 examples against a reference policy; identify the obvious violation and the obvious non-violation
Think-pair-share — discuss borderline cases in pairs, then share with class (use clickers for voting)
Contrasting cases — given two policies for the same harm, diagnose where each breaks down (false positives, false negatives, ambiguity, multiple thresholds needed)
Reference policies: Pitfalls of Binary Classification links to Political Ad Policies case

Lecture 4: Content Moderation I — History, Models, and the Anti-Censorship Ethos
Source: Content Moderation
Content: early internet norms; anti-censorship ethos and how it shaped moderation; models for content moderation; framing effects of regulatory language; scale considerations
Assigned: Discord Bot Milestone 1 — Abuse Study and Content Policy (individual; due end of Week 4)
Reading: TSPA T&S Fundamentals handbook; TSPA library selections at instructor discretion

Lecture 5: Content Moderation II — Commercial Moderators, Community Norms, and Scale
Source: Content Moderation (continuation)
Content: what commercial content moderation looks like at platforms like Twitch; moderator actions (remove, warn, ban, shadow block); relationship between moderators and platform admins; the psychological burden on moderators
Exercise: 🔗 Content Moderation exercises doc
Short case video: 🎥 Case video / Lesson plan

Lecture 6: Community Moderation and Self-Governance
Source: 🔗 Seering (KAIST) community moderation slides
Content: community-based moderation models (Reddit, Discord, Wikipedia); how rules emerge; when community governance succeeds or fails
Supplement: Seering KAIST syllabus for framing
Reading: TSPA Handbook, Content Moderation and Operations

Lecture 7: Technical Lab — Discord Bot Setup Workshop
Source: Assignments/Discord Bot/discord_bot_assignment/ starter code
Content: what makes a good user reporting flow; tradeoffs between specificity and usability; the behind-the-scenes moderator review flow; multi-tier review; outcomes (remove, warn, shadow-block, escalate)
Lab: Python async programming; discord.py API; bot initialization; the mod channel pattern; forwarding messages; emoji reactions for moderator workflows
Lab: students fork the repo and get their bots running in their group channels; TAs available

Lecture 8: Metrics and Measurement I — What to Measure and Why
Source: Metrics and Measurement
Also available: 🔗 Google Slides (fall 2023)
Content: what is a metric; generic platform metrics (DAU, retention) vs. T&S-specific metrics; defining "success" in T&S; prevalence estimates; common industry metrics; developing new metrics
Exercise: 🔗 Metrics exercises doc — students design metrics for their chosen Discord Bot abuse type
DUE: Discord Bot M1 — Abuse Study and Content Policy
ASSIGNED: Discord Bot M2 — Content Moderation Bot (group; due end of Week 6)
Forming groups: after M1 is submitted, form interdisciplinary groups of 4–5. Ideally mix students with technical and policy backgrounds if the course is cross-listed.
Reading: Metrics exercises doc; T&S reading list Module C

Lecture 9: Governments and the Internet I — U.S. Law and Section 230
Source: Government Regulation
Content: why the U.S. became the center of the internet economy; overall U.S. approach to internet policy; Section 230 ("the 26 words that created the internet") — text, scope, limits, and contested interpretations
In-class activity: read Section 230 aloud in full; discuss its implications for T&S enforcement

Lecture 10: Governments and the Internet II — International Regulation and Copyright
Source: Government Regulation (continuation)
Supplement: 🔗 Copyright Safe Harbors and DMCA slides by Justin Francese (U of Oregon)
Content: EU Digital Services Act; Australia's eSafety Commissioner model; notice-and-takedown; copyright and safe harbor; how different regulatory frameworks shape platform behavior
Supplement: 🎥 16-min lecture on Australia's eSafety Commissioner
Reading: T&S reading list Module B, The Twenty-Six Words That Created the Internet by Jeff Kosseff

Lecture 11: Proactive Moderation — Classifiers, Hash-Matching, and Labeling Systems
Source: 📓 Assignments/Bluesky Moderation/bluesky_labeler_assignment/bluesky-lecture.ipynb
Content: text classification pipelines; perceptual hashing (PHash, PhotoDNA); the labeler architecture on Bluesky (AppView → Relay → PDS); differences between proactive automated detection and reactive user-reporting
In-class coding demo: run the Bluesky starter code; attach a test label; view it in the Bluesky UI

Lecture 12: Decentralized Moderation and the AT Protocol
Content: federated/decentralized platform architectures; how AT Protocol enables user-configurable filtering; trade-offs between centralized and decentralized moderation; third-party labelers as a governance model
Reference: Bluesky moderation architecture docs; list of labelers
DUE: Discord Bot M2 — Content Moderation Bot. Extra credit: Discord Bot M3 Smart Classifier
ASSIGNED: Bluesky Moderation M1–M4 (setup, T&S words, news citation, perceptual hashing; due end of Week 9)
Reading: Weapons of Math Destruction by Cathy O'Neil

Lecture 13: Harassment and Hate Speech I — Definitions, Scale, and Policy
Source: Harassment and Hate Speech
Content: spectrum of harassment; hate speech definitions and jurisdictional variation; which identities are most targeted; the role of anonymity and pseudonymity; platform policy evolution

Lecture 14: Harassment and Hate Speech II — Automated Detection and Exercises
Source: Harassment and Hate Speech (continuation)
Supplement: 🔗 McLester hate speech and harassment slides (UAB)
Supplement: 🎥 25-min lecture on online hate speech by Mark Schneider (UPenn)
Exercises: 🔗 Harassment and Hate Speech exercises doc
In-class structured debate: safety (public harm) vs. censorship (freedom of speech); how do different platforms' decisions reflect this tradeoff?
Reading: 📘 tsbook ch6 — Harassment (full chapter)

Lecture 15: Terrorism, Radicalization, and Extremism
Source: Terrorism Radicalization and Extremism
Also available: 🔗 Google Slides (fall 2023)
Content: definitions; radicalization models; online recruitment pipelines; counter-terrorism vs. counter-violent extremism (CVE); the live-streaming of attacks; role of platform algorithms in amplification
Supplement: 🎥 30-min lecture on terrorism, radicalization, and extremism by Marten Risius (U of Queensland)

Lecture 16: CVE, Counter-Speech, and Policy Responses
Source: Terrorism Radicalization and Extremism (exercises + discussion)
Exercises: 🔗 Terrorism, Radicalization, and Extremism exercises doc
Case study: GIFCT hash-sharing database; Jan. 6th Capitol Riot deplatforming decisions
Reading: T&S reading list Module G on Terrorism.

Lecture 17: Authentication, Identity, and Platform Manipulation
Source: Authentication, Identity, and Platform Manipulation
Content: authentication models; real-name vs. pseudonymous policies; coordinated inauthentic behavior (CIB); sockpuppet networks; astroturfing; state-sponsored information operations
Exercises: 🔗 Auth/Identity exercises doc
Supplement: 🔗 McLester investigations and intelligence lecture

Lecture 18: Spam, Fraud, and Account Integrity
Content: spam taxonomy; online fraud (scams, phishing, impersonation); account takeovers; the arms race between spammers and platforms; ML-based abuse detection at account level
Reading: 📘 tsbook ch2 — Spam and Online Fraud (Nelly Agbogu case study is an excellent discussion anchor)
DUE: Bluesky M1–M4 (T&S words labeler, news citation labeler, perceptual hash dog labeler)
ASSIGNED: Bluesky M5 — Policy Proposal Labeler (due end of Week 10)
Reading: T&S reading list Module K on Authenticity.

Lecture 19: Misinformation — Definitions, Spread, and Platform Responses
Source: Misinformation
Content: the mis/dis/mal taxonomy (Wardle & Derakhshan information disorder framework); how false information spreads through social networks; the role of algorithmic amplification; platform interventions (labels, friction, removal, amplification reduction); the fact-checking ecosystem
Supplement: 🎥 45-min lecture on misinformation by Sarah Shirazyan (Stanford Law)

Lecture 20: Misinformation Detection Tutorial
Source: Misinformation Detection Tutorial (IC2S2 2025)
Content: hands-on NLP tutorial — claim-level check-worthiness detection using the CT24 dataset; full pipeline from data loading through feature engineering, classifier training (logistic regression, BERT-based), evaluation (precision, recall, F1), and error analysis; AI-generated misinformation detection techniques; Podcast Factchecking preview using the PodChecker system (Irmetova et al., 2026)
This is a lab-style session; students should have Python and the tutorial dependencies installed before class.
Reading: T&S reading list Module F on the Information Environment.
DUE: Bluesky M5
ASSIGNED: Podcast Factchecking (data + analysis due Week 13; final report due Week 15)
Coordinating M5 with earlier work: students should implement the same abuse type they researched for Discord Bot M1, making the policy proposal in M5 a direct technical extension of the written analysis from Week 4. Consider also requiring students to apply a counter-intervention framing from Lecture 23 (Adversarial Adaptation) to their M5 policy design.

Lecture 21: Source Credibility and Misinformation Source Detection
Source: Source Credibility
Content: what makes a source credible; SEO-based misinformation source detection using CommonCrawl webgraphs; backlinking patterns as credibility signals; multi-class classification of news domains; feature importances for predicting credibility vs. political reliability; limitations (implied content, domain decay, propaganda vs. opinion)

Lecture 22: Intervention Effectiveness — Misinformation and Search Rankings
Source: Intervention Effectiveness — Misinformation and Search Rankings
Content: small-scale PageRank-based interventions; personalized PageRank and authority-based reranking; large-scale link scheme removal; "multi-category" scheme removal as a more precise intervention tool; traffic estimates from CommonCrawl vs. SimilarWeb; design principles for robust interventions; open problems and future directions
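To make the personalized PageRank idea in Lecture 22 concrete, here is a minimal sketch using plain power iteration. It is an illustration under stated assumptions, not the lecture's implementation: the toy link graph, the node names, and the choice of a single trusted seed are all hypothetical, and the damping factor of 0.85 is only the conventional default.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=200):
    """Power iteration for personalized PageRank.

    graph: dict mapping each node to the list of nodes it links to.
    seeds: set of nodes the random surfer teleports back to
           (hypothetical "trusted" domains in this sketch).
    """
    nodes = sorted(set(graph) | {v for outs in graph.values() for v in outs})
    # Teleport distribution: uniform over the seed set, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            outs = graph.get(n, [])
            if outs:
                # Split this node's damped rank evenly across its out-links.
                share = damping * rank[n] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling node: send its damped rank back to the seeds.
                for s in nodes:
                    new_rank[s] += damping * rank[n] * teleport[s]
        rank = new_rank
    return rank


# Hypothetical toy webgraph: three mutually linked news sites and an
# isolated two-node link scheme.
toy_graph = {
    "trusted_news": ["wire_service", "blog"],
    "wire_service": ["trusted_news"],
    "blog": ["trusted_news"],
    "link_farm": ["spam_site"],
    "spam_site": ["link_farm"],
}

ranks = personalized_pagerank(toy_graph, seeds={"trusted_news"})
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:13s} {score:.3f}")
```

Because the surfer only ever teleports to the seed, the link-scheme pair in this toy graph is unreachable and receives no rank. Plain PageRank, with uniform teleportation, would give it a nonzero score.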
Reading: Sample of papers from the special topic reading list on Misinformation, at least 1 paper per category.

Lecture 23: Adversarial Adaptation and the Limitations of Interventions
Source: Adversarial Adaptation and the Limitations of Interventions
Content: how adversaries adapt to interventions over time (SEO gaming, platform manipulation, bot evolution); the credibility–pluralism tradeoff — credibility-based filtering reduces source diversity; assortativity in news transition matrices; Wasserstein distances to quantify polarization effects; principles for adversarially robust policy design
This lecture ties back to the proactive moderation problem introduced in Lecture 11: reactive interventions always lag behind adversaries, which motivates automated proactive systems.

Lecture 24: Types of Attack Surfaces I — Safety Perspective
Source: Types of Attack Surfaces
Content: attack surface taxonomy; how bad actors exploit platform features; API abuse; content injection; account compromise vectors
Reading: T&S reading list Module L on Attack Surfaces

Lecture 25: Types of Attack Surfaces II — Security Perspective
Source: Types of Attack Surfaces (continuation)
Content: platform defenses; CAPTCHAs and bot detection; rate limiting; shadow-banning; detection pipelines
Exercises: 🔗 Attack Surfaces exercises doc

Lecture 26: Emerging Topics I — AI in Trust and Safety
Source: Emerging Topics — AI in Trust and Safety
Also available: 🔗 Google Slides (January 2024)
Content: AI and ML in T&S (generative AI, detection, red-teaming); AR/VR harm areas; emerging platforms and harm surfaces; T&S career pathways; what skills employers look for
Interactive demo: AI-generated content identification exercise (slides include an interactive breakout)
Exercises: 🔗 Emerging Technologies exercises doc
DUE: Podcast Factchecking data + analysis
ASSIGNED: Course paper writeup. Reference: Trust & Safety Journal for graduate-level research directions, and ICWSM / IC2S2 for computational social science.
Reading: T&S reading list Module M on Emerging Technologies

Lecture 27: Emerging Topics II — Adversarial Retrieval and LLMs
Source: Emerging Topics — Adversarial Retrieval
Content: how adversaries manipulate retrieval-augmented generation (RAG) systems and search indexes; corpus poisoning and gradient-based attacks on IR systems; SEO manipulation as an information operation; connecting the misinformation interventions from Weeks 9–10 to the LLM attack surface; adversarial means, motives, and opportunities in the AI era

Lecture 28: Emerging Topics III — LLM Hallucinations and Knowledge Conflicts
Source: Emerging Topics — LLM Hallucinations and Knowledge Conflicts
Content: faithfulness vs. factuality hallucinations; knowledge conflicts between parametric memory, retrieved context, and ground truth; RLHF safety tuning and jailbreaking; detection and mitigation approaches; implications for T&S practitioners deploying LLM-based moderation or fact-checking systems
Reading: Sample of papers from the special topic reading list on Adversarial Retrieval and LLMs, at least 1 paper per category.

Lecture 29: Project Presentations I
Format: groups present Discord Bot M3 results (or Podcast Factchecking for individually structured courses); guest judges from industry where possible (see Consortium member list in the README for potential invitees)
~8 minutes per group + Q&A; rubric focuses on policy motivation, technical implementation, testing and evaluation, and ethical reflection

Lecture 30: Project Presentations II + Course Debrief
Format: remaining presentations + open debrief
Discussion: what has changed in T&S since the semester began? What did the class get wrong? What questions remain?
Optional: play Trust & Safety Tycoon as a closing reflection on organizational complexity
DUE: Reports

⚠️ Content Warning: These topics cover deeply sensitive material. They are for further reading only and are not examinable or covered directly in class. They are included here for completeness. If this content causes any distress, please reach out for support. Links to the student centers for mental health and wellbeing are provided.

Child and Adult Sexual Exploitation
Source: 🔗 Google Slides
Content: CSAM definitions and legal landscape; PhotoDNA and hash-matching at scale; NCMEC partnerships and reporting obligations; grooming detection; sextortion; proxy content for classroom exercises
Exercises: 🔗 CASE exercises doc
Short case video: 🔗 CASE case video
Reading: 📘 tsbook ch7 — Child Sexual Exploitation

Suicide, Self-Harm, and Platform Well-Being
Source: 🔗 Google Slides (fall 2023)
Content: safe messaging guidelines; the role of algorithmic amplification in self-harm content; contagion effects; platform design for well-being; tension between supporting at-risk users and removing harmful content; mental health resources for moderators
Supplement: 🎥 45-min lecture by Katherine Keyes (Columbia University)
Exercises: 🔗 Suicide, Self-Harm, and Well-Being exercises doc
Primary textbook: 📘 tsbook — The book is a living draft; chapters on misinformation, extremism, and emerging tech may be added.
Consortium reading list: 🔗 Full reading list (Google Docs) — organized by module, aligns directly with lecture sequence above.
TSPA curriculum: T&S Fundamentals and Library
The lectures draw on the Trust & Safety Teaching Consortium materials. Likewise, assignments are credited wherever they draw on previous materials.

Course Overview

Fri, 01 May 2026 15:21:57 GMT

Ambiguities in ethics and platform policy → Try Moderator Mayhem
Complexity of organizational problems and tradeoffs associated with trust & safety → Try Trust & Safety Tycoon
Requires coding in Python; intro CS background assumed
For research components, familiarity with at least one of: HCI / user studies; ML methods; data science and statistics
If you find your interest lies in a specific subtopic of Trust & Safety, check out one of the special topics related to this class:
Misinformation
Social Network Analysis
Adversarial Retrieval and LLMs

Overall T&S:
An understanding of the most pressing challenges for online global communication platforms
Foundational knowledge of current research in Online Trust and Safety
A draft policy proposal and a working implementation of that proposal

Content Moderation:
Understand the breadth of models for content moderation (reactive, proactive, community-governed, automated)
Conceptualize different approaches to moderating a space and reflect on how these models could evolve
Critically assess how automated content moderation mechanisms handle borderline and ambiguous cases

Algorithmic Tradeoffs:
Identify critical issues and ethical dilemmas in algorithmic systems, especially in T&S
Analyze technical, social, and policy-based responses to online harms (misinformation, extremist content, harassment, and others)
Develop in-depth knowledge of at least one selected harm type across the policy, technical, and organizational dimensions

Organizational Dynamics:
Understand platform T&S operations through the following categories:
Account vs. content moderation
Methods of Access vs. Harm
Organizational vs. technical complexity (Content-Neutral Outcomes)

The three assignments are designed to build on one another: reactive moderation → proactive moderation → applied research.

Source: Trust and Safety Engineering (Stanford CS152) + Cornell Tech CS 5342
A three-milestone project in which you act as the T&S team at a social media platform:
M1 (individual, Week 4): Abuse Research Report (2000–4000 words) covering one abuse type: description, actor/victim profiles, details, relevant technologies, and specific recommendations. Plus a Policy Comparison Table for three platforms.
M2 (group, Week 6): Design and implement a user reporting flow and behind-the-scenes moderator flow as a Discord bot in Python.
M3 (group, Week 6, extra credit): Extend the bot with automated detection — a classifier trained or prompted on your chosen abuse type.
Full spec: Assignments/Discord Bot/Discord Bot.md

Source: Cornell Tech CS 5342
Build a Bluesky labeler — a service that attaches categorical labels to posts and accounts. Users who subscribe to your labeler can configure how labels affect what they see.
M1: Labeler setup (AT Protocol, Bluesky account, starter code)
M2: Label posts matching T&S-related words and domains (text matching)
M3: Label posts linking to specific news sources (domain matching)
M4: Label dog photos using perceptual hashing (image matching)
M5 (policy proposal, Week 10): Extend your labeler to handle a harm of your choice; document your process, testing, and ethical analysis in a 10-minute video
Full spec: Assignments/Bluesky Moderation/Bluesky Moderation.md
Starter code: Assignments/Bluesky Moderation/bluesky_labeler_assignment/bluesky-assign3/

Source: Irmetova, Liu, Teleki, Carragher, Zhang, & Caverlee (2026). PodChecker: An Interpretable Fact-Checking Companion for Podcasts.
Collect and analyze podcast data through the lens of a fact-checking or trust-and-safety application. The reference implementation (PodChecker) provides a claim-extraction and credibility-analysis pipeline; students may extend, replicate, or critically analyze it using a different dataset or harm type.
Full spec: Assignments/Podcast Factchecking/Podcast Factchecking.md
Code: Assignments/Podcast Factchecking/PodChecker/
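Bluesky milestone M4 above matches images by perceptual hashing. As a rough illustration of the idea, the sketch below implements average hash (aHash), a simpler relative of the DCT-based PHash named in the curriculum. It is not the assignment's starter code. It assumes images are already decoded into nested lists of grayscale values whose height and width are multiples of the hash size; the synthetic test images and the match threshold of 5 bits are hypothetical.

```python
def average_hash(pixels, hash_size=8):
    """Return an integer aHash for a grayscale image.

    pixels: 2D list of values in 0-255; height and width are assumed to be
    multiples of hash_size so the image can be downscaled by block averaging.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // hash_size, width // hash_size
    # Step 1: downscale to hash_size x hash_size by averaging each block.
    small = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block = [
                pixels[y][x]
                for y in range(by * block_h, (by + 1) * block_h)
                for x in range(bx * block_w, (bx + 1) * block_w)
            ]
            small.append(sum(block) / len(block))
    # Step 2: one bit per cell, set when the cell is brighter than the mean.
    mean = sum(small) / len(small)
    bits = 0
    for value in small:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_match(hash_a, hash_b, threshold=5):
    """Treat two images as near-duplicates if few bits differ (assumed threshold)."""
    return hamming_distance(hash_a, hash_b) <= threshold


# Synthetic 16x16 test images (hypothetical stand-ins for real photos).
horizontal = [[x * 16 for x in range(16)] for _ in range(16)]       # left-to-right gradient
brighter = [[min(255, v + 10) for v in row] for row in horizontal]  # same image, slightly brighter
vertical = [[y * 16 for _ in range(16)] for y in range(16)]         # top-to-bottom gradient

print(is_match(average_hash(horizontal), average_hash(brighter)))
print(is_match(average_hash(horizontal), average_hash(vertical)))
```

The brightened copy keeps the same hash because every cell and the mean shift together, which is the property that lets perceptual hashes survive small edits where a cryptographic hash would change completely.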
Working textbook: tsbook (Stamos, Grossman, Pfefferkorn) — available chapters: Introduction, Spam/Fraud, Harassment, Child Sexual Exploitation
TSPA resources: Handbook and Library
Consortium: reading list
The course lectures are modeled on the Trust & Safety Teaching Consortium materials.
See Curriculum for the full week-by-week schedule.

This course covers material that many students will find disturbing or personally resonant, including hate speech (Week 7) and terrorism (Week 8). Materials on child exploitation and suicide/self-harm are included as optional further reading and are not covered directly in class. Following the Consortium's guidance:
Topics are announced at least one week in advance
Students may skip any class covering sensitive content and associated readings without penalty
Assignments and homeworks cannot be skipped; however, students may choose the area of research for these
Written summaries of key policy points are provided as alternatives for skipped sessions
If you find course content affecting your well-being, please reach out — resources include the course teaching

Misinformation Syllabus (Advanced Topic)

Fri, 01 May 2026 15:12:56 GMT

This course examines online misinformation and disinformation from interdisciplinary perspectives — drawing on communication studies, political science, cognitive psychology, and computational methods. Lectures move from definitional and theoretical foundations through empirical analysis of spread and vulnerability, to computational detection techniques and platform-level interventions, and finally to the emerging challenge of AI-generated misinformation.
For a broader overview of the Trust and Safety space, see the Trust & Safety class.

Lecture 1: Defining Misinformation (Consortium Information Environment)
Source: Define Misinfo (Consortium Information Environment)
Establishes the foundational vocabulary of the course: the distinctions among misinformation, disinformation, and malinformation; the Wardle & Derakhshan information disorder framework; typologies of false and misleading content; and the information environment as the broader context in which misinformation operates. Students leave with a shared conceptual language for the rest of the course.

Lecture 2: Content Moderation Overview (Consortium)
Source: Content Moderation Overview (Consortium)
Introduces how platforms respond to misinformation through reactive and proactive moderation. Covers the spectrum of moderation models (removal, labeling, demotion, counter-speech), the role of human reviewers vs. automated systems, and the inherent tradeoffs between free expression and harm reduction. Provides operational context before the course turns to technical detection.

Lecture 3: Detection and Discovery of Misinformation Sources
Source: Detection and Discovery of Misinformation Sources
Technical lecture covering how to identify and classify misinformation-producing websites using SEO network features, backlinking patterns, and multi-class classification. Key topics: construction of the SEO network from CommonCrawl data, predictive power of network features over credibility labels, limitations of current approaches (implied content, propaganda vs. opinion, link decay). Establishes the computational approach that underpins Lectures 4 and 5.

Lecture 4: Misinformation Resilient Search Rankings
Source: Misinformation Resilient Search Rankings
Builds on the source detection methods from Lecture 3 to ask: how do we intervene at the search level? Covers small-scale interventions (PageRank, Personalized PageRank, authority-based reranking), large-scale interventions targeting link schemes, and the design principles that make interventions robust. Discusses evidence that link schemes disproportionately link to unreliable news and that "multi-category" scheme removal has higher marginal effectiveness.

Lecture 5: Credibility Pluralism Tradeoff
Source: Credibility Pluralism Tradeoff
Complicates the intervention story. Using CommonCrawl and GDELT data, this lecture demonstrates that credibility-based filtering and viewpoint diversity (pluralism) are in tension: interventions that reduce low-credibility content tend to reduce the diversity of sources users encounter. Introduces assortativity analysis and Wasserstein distances as tools for measuring polarization in news transition matrices.

Lecture 6: Media Influences — Structural Dimensions of Credibility, Bias, and Ownership
Source: Media Influences
Zooms out to the structural and economic dimensions of the media ecosystem: how source credibility, political bias, and corporate ownership interact to shape information quality. Covers multi-agent dynamic scenarios and behavioral determinants of media use. Serves as a research-synthesis and ongoing-work session, best positioned after students have seen the computational interventions of Lectures 3–5.

Tutorial Session: Fact-Checking NLP (IC2S2 2025 Tutorial)
Source: Misinformation Detection Tutorial (IC2S2_25)
Hands-on notebook using the CT24 check-worthiness dataset. Covers the full pipeline: data loading and exploration, feature engineering, training a claim-level check-worthiness classifier, evaluation (precision, recall, F1), and error analysis. Recommended placement: after Lecture 3 (Detection and Discovery), once students have the conceptual vocabulary for detection. Can optionally be extended with the LLM-based detection approaches from the reading list.
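The evaluation step of the tutorial reports precision, recall, and F1. As a quick refresher on how those three numbers relate, here is a minimal sketch that computes them from scratch. It is not the tutorial notebook's code: the function name is illustrative, and the labels are made up (1 stands for "check-worthy", 0 for "not check-worthy") rather than taken from CT24.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # correctly flagged
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # flagged in error
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up gold labels and classifier predictions for eight claims.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the classifier flags four claims, three of them correctly, and misses one of the four check-worthy claims, so precision and recall are both 3/4. In error analysis, the single false positive and the single false negative are the examples worth reading closely.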
Dartmouth: Political Misinformation and Conspiracy Theories (Brendan Nyhan)
Uni Bamberg: Misinformation, Disinformation and Other Digital Fakery (Andreas Jungherr)
UNC: Misinformation and Society (Francesca Tripodi)
UNC: Critical Disinformation Studies: A Syllabus
Zotero (from King et al., 2025): https://www.zotero.org/groups/5535941/interventions-literature-review/library
King, Catherine, Peter Carragher, and Kathleen M. Carley. "Mapping the Scientific Literature on Misinformation Interventions: A Bibliometric Review." Workshop Proceedings of the 19th International AAAI Conference on Web and Social Media. Vol. 2025. 2025. https://workshop-proceedings.icwsm.org/pdf/2025_10.pdf
Aïmeur, Esma, Sabrine Amri, and Gilles Brassard. "Fake news, disinformation and misinformation in social media: a review." Social Network Analysis and Mining 13.1 (2023): 30. https://doi.org/10.1007/s13278-023-01028-5
Altay, S., Berriche, M., Heuer, H., Farkas, J., & Rathje, S. (2023). A survey of expert views on misinformation: Definitions, determinants, solutions, and future of the field. Harvard Kennedy School Misinformation Review. https://doi.org/10.37016/mr-2020-119
Broda, E., & Strömbäck, J. (2024). Misinformation, Disinformation, and Fake News: Lessons from an Interdisciplinary, Systematic Literature Review. Annals of the International Communication Association, 48(2), 139–166. https://doi.org/10.1080/23808985.2024.2323736
Ecker, U. K. H., Tay, L. Q., Roozenbeek, J., van der Linden, S., Cook, J., Oreskes, N., & Lewandowsky, S. (2024). Why misinformation must not be ignored. American Psychologist. Advance online publication. https://doi.org/10.1037/amp0001448
Kapantai, E., Christopoulou, A., Berberidis, C., & Peristeras, V. (2020). A systematic literature review on disinformation: Toward a unified taxonomical framework. New Media & Society, 23(5), 1301-1326. https://doi.org/10.1177/1461444820959296
Murphy, G., de Saint Laurent, C., Reynolds, M., Aftab, O., Hegarty, K., Sun, Y., & Greene, C. M. (2023). What do we study when we study misinformation? A scoping review of experimental research (2016-2022). Harvard Kennedy School (HKS) Misinformation Review. https://doi.org/10.37016/mr-2020-130
Pérez-Escolar, M., Lilleker, D., & Tapia-Frade, A. (2023). A systematic literature review of the phenomenon of disinformation and misinformation. Media and communication, 11(2), 76-87. https://doi.org/10.17645/mac.v11i2.6453
Saeidnia, H. R., Hosseini, E., Lund, B., Tehrani, M. A., Zaker, S., & Molaei, S. (2025). Artificial intelligence in the battle against disinformation and misinformation: A systematic review of challenges and approaches. Knowledge and Information Systems, 67(4), 3139–3158. https://doi.org/10.1007/s10115-024-02337-7
Tandoc Jr. EC. The facts of fake news: A research review. Sociology Compass. 2019; 13:e12724. https://doi.org/10.1111/soc4.12724
Chadwick, A., & Stanyer, J. (2022). Deception as a Bridging Concept in the Study of Disinformation, Misinformation, and Misperceptions: Toward a Holistic Framework. Communication Theory, 32(1), 1–24. https://doi.org/10.1093/ct/qtab019
Freelon, D., & Wells, C. (2020). Disinformation as Political Communication. Political Communication, 37(2), 145–156. https://doi.org/10.1080/10584609.2020.1723755
Molina, M. D., Sundar, S. S., Le, T., & Lee, D. (2019). “Fake News” Is Not Simply False Information: A Concept Explication and Taxonomy of Online Content. American Behavioral Scientist, 65(2), 180-212. https://doi.org/10.1177/0002764219878224 (Original work published 2021)
Starbird, K. (2024). Facts, frames, and (mis)interpretations: Understanding rumors as collective sensemaking.
Tandoc, E. C., Lim, Z. W., & Ling, R. (2017). Defining “Fake News”: A typology of scholarly definitions. Digital Journalism, 6(2), 137–153. https://doi.org/10.1080/21670811.2017.1360143
Wardle, C., & Derakhshan, H. (2017). Information disorder: Toward an interdisciplinary framework for research and policymaking (Vol. 27, pp. 1-107). Strasbourg: Council of Europe.
Wu, L., Morstatter, F., Carley, K. M., & Liu, H. (2019). Misinformation in social media: definition, manipulation, and detection. ACM SIGKDD explorations newsletter, 21(2), 80-90. https://doi.org/10.1145/3373464.3373475
Adams, Z., Osman, M., Bechlivanidis, C., & Meder, B. (2023). (Why) Is Misinformation a Problem? Perspectives on Psychological Science, 18(6), 1436-1463. https://doi.org/10.1177/17456916221141344 (Original work published 2023)
Ecker, U., Roozenbeek, J., Van Der Linden, S., Tay, L. Q., Cook, J., Oreskes, N., & Lewandowsky, S. (2024). Misinformation poses a bigger threat to democracy than you might think. Nature, 630(8015), 29-32. https://www.nature.com/articles/d41586-024-01587-3
McKay, S., & Tenove, C. (2020). Disinformation as a Threat to Deliberative Democracy. Political Research Quarterly, 74(3), 703-717. https://doi.org/10.1177/1065912920938143 (Original work published 2021)
Woolley, S. C., & Howard, P. N. (2016). Automation, Algorithms, and Politics | Political Communication, Computational Propaganda, and Autonomous Agents — Introduction. International Journal of Communication, 10(0),
Altay, S., Berriche, M., & Acerbi, A. (2023). Misinformation on Misinformation: Conceptual and Methodological Challenges. Social Media + Society, 9(1), 20563051221150412. https://doi.org/10.1177/20563051221150412
Budak, C., Nyhan, B., Rothschild, D. M., Thorson, E., & Watts, D. J. (2024). Misunderstanding the harms of online misinformation. Nature, 630(8015), 45–53. https://doi.org/10.1038/s41586-024-07417-w
Harsin, J. (2024). Three Critiques of Disinformation (For-Hire) Scholarship: Definitional Vortexes, Disciplinary Unneighborliness, and Cryptonormativity. Social Media + Society, 10(1). https://doi.org/10.1177/20563051231224732
Nyhan, B. (2020). Facts and Myths about Misperceptions. Journal of Economic Perspectives, 34(3), 220–236. https://doi.org/10.1257/jep.34.3.220
Pasquetto, I. V., Lim, G., & Bradshaw, S. (2024). Misinformed about misinformation: On the polarizing discourse on misinformation and its consequences for the field. Harvard Kennedy School (HKS) Misinformation Review, 5(5).
Simon, F. M., Altay, S., & Mercier, H. (2023). Misinformation reloaded? Fears about the impact of generative AI on misinformation are overblown. Harvard Kennedy School Misinformation Review, 4(5).
Allen, J., Howland, B., Mobius, M., Rothschild, D., & Watts, D. J. (2020). Evaluating the fake news problem at the scale of the information ecosystem. Science Advances, 6(14), eaay3539. https://doi.org/10.1126/sciadv.aay3539
Baribi-Bartov, S., Swire-Thompson, B., & Grinberg, N. (2024). Supersharers of fake news on Twitter. Science, 384(6699), 979–982. https://doi.org/10.1126/science.adl4435
Chadwick, A., Vaccari, C., & Kaiser, J. (2022). The Amplification of Exaggerated and False News on Social Media: The Roles of Platform Use, Motivations, Affect, and Ideology. American Behavioral Scientist, 69(2), 113-130. https://doi.org/10.1177/00027642221118264
Goel, P., Green, J., Lazer, D., et al. (2025). Using co-sharing to identify use of mainstream news for promoting potentially misleading narratives. Nature Human Behaviour. https://doi.org/10.1038/s41562-025-02223-4
Ozawa, J. V., Woolley, S., & Lukito, J. (2024). Taking the power back: How diaspora community organizations are fighting misinformation spread on encrypted messaging apps. Harvard Kennedy School Misinformation Review.
Pathak, R., Spezzano, F., & Pera, M. S. (2023). Understanding the contribution of recommendation algorithms on misinformation recommendation and misinformation dissemination on social networks. ACM Transactions on the Web, 17(4), 1-26.
Renault, T., Mosleh, M., & Rand, D. G. (2025). Republicans are flagged more often than Democrats for sharing misinformation on X’s Community Notes. Proceedings of the National Academy of Sciences, 122(25), e2502053122. https://doi.org/10.1073/pnas.2502053122
Tomassi, A., Falegnami, A., & Romano, E. (2024). Mapping automatic social media information disorder. The role of bots and AI in spreading misleading information in society. PLOS ONE, 19(5), e0303183.
Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359(6380), 1146–1151. https://doi.org/10.1126/science.aap9559
Anspach, N. M., & Carlson, T. N. (2024). Not who you think? Exposure and vulnerability to misinformation. New Media & Society, 26(8), 4847–4866. https://doi.org/10.1177/14614448221130422
Altay, S., & Acerbi, A. (2024). People believe misinformation is a threat because they assume others are gullible. New Media & Society, 26(11), 6440–6461. https://doi.org/10.1177/14614448231153379
Aslett, K., Sanderson, Z., Godel, W., Persily, N., Nagler, J., & Tucker, J. A. (2024). Online searches to evaluate misinformation can increase its perceived veracity. Nature, 625(7995), 548–556. https://doi.org/10.1038/s41586-023-06883-y
Ceylan, G., Anderson, I. A., & Wood, W. (2023). Sharing of misinformation is habitual, not just lazy or biased. Proceedings of the National Academy of Sciences, 120(4), e2216614120. https://doi.org/10.1073/pnas.2216614120
Ecker, U. K. H., Lewandowsky, S., Cook, J., Schmid, P., Fazio, L. K., Brashier, N., Kendeou, P., Vraga, E. K., & Amazeen, M. A. (2022). The psychological drivers of misinformation belief and its resistance to correction. Nature Reviews Psychology, 1(1), 13–29. https://doi.org/10.1038/s44159-021-00006-y
Ecker, U. K. H., Lewandowsky, S., Fenton, O., & Martin, K. (2014). Do people keep believing because they want to? Preexisting attitudes and the continued influence of misinformation. Memory & Cognition, 42(2), 292–304. https://doi.org/10.3758/s13421-013-0358-x
Flynn, D. J., Nyhan, B., & Reifler, J. (2017). The Nature and Origins of Misperceptions: Understanding False and Unsupported Beliefs About Politics. Political Psychology, 38(S1), 127–150. https://doi.org/10.1111/pops.12394
Kunst, J. R., Gundersen, A. B., Krysińska, I., Piasecki, J., Wójtowicz, T., Rygula, R., van der Linden, S., & Morzy, M. (2024). Leveraging artificial intelligence to identify the psychological factors associated with conspiracy theory beliefs online. Nature Communications, 15(1), 7497. https://doi.org/10.1038/s41467-024-51740-9
Lazer, D. M. J., Baum, M. A., Benkler, Y., Berinsky, A. J., Greenhill, K. M., Menczer, F., Metzger, M. J., Nyhan, B., Pennycook, G., Rothschild, D., Schudson, M., Sloman, S. A., Sunstein, C. R., Thorson, E. A., Watts, D. J., & Zittrain, J. L. (2018). The science of fake news. Science, 359(6380), 1094–1096. https://doi.org/10.1126/science.aao2998
Pantazi, M., Hale, S., & Klein, O. (2021). Social and Cognitive Aspects of the Vulnerability to Political Misinformation. Political Psychology, 42(S1), 267–304. https://doi.org/10.1111/pops.12797
Pennycook, G., & Rand, D. G. (2019). Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition, 188, 39–50. https://doi.org/10.1016/j.cognition.2018.06.011
Sultan, M., Tump, A. N., Ehmann, N., Lorenz-Spreen, P., Hertwig, R., Gollwitzer, A., & Kurvers, R. H. J. M. (2024). Susceptibility to online misinformation: A systematic meta-analysis of demographic and psychological factors. Proceedings of the National Academy of Sciences, 121(47), e2409329121. https://doi.org/10.1073/pnas.2409329121
Van Bavel, J. J., Harris, E. A., Pärnamets, P., Rathje, S., Doell, K. C., & Tucker, J. A. (2021). Political Psychology in the Digital (mis)Information age: A Model of News Belief and Sharing. Social Issues and Policy Review, 15(1), 84–113. https://doi.org/10.1111/sipr.12077
Weeks, B. E. (2015). Emotions, Partisanship, and Misperceptions: How Anger and Anxiety Moderate the Effect of Partisan Bias on Susceptibility to Political Misinformation. Journal of Communication, 65(4), 699–719. https://doi.org/10.1111/jcom.12164
Arechar, A. A., Allen, J., Berinsky, A. J., Cole, R., Epstein, Z., Garimella, K., Gully, A., Lu, J. G., Ross, R. M., Stagnaro, M. N., Zhang, Y., Pennycook, G., & Rand, D. G. (2023). Understanding and combatting misinformation across 16 countries on six continents. Nature Human Behaviour, 7(9), 1502–1513. https://doi.org/10.1038/s41562-023-01641-6
Aruguete, N., Batista, F., Calvo, E., Guizzo-Altube, M., Scartascini, C., & Ventura, T. (2024). Framing fact-checks as a “confirmation” increases engagement with corrections of misinformation: A four-country study. Scientific Reports, 14(1), 3201. https://doi.org/10.1038/s41598-024-53337-0
Bak-Coleman, J. B., Kennedy, I., Wack, M., Beers, A., Schafer, J. S., Spiro, E. S., ... & West, J. D. (2022). Combining interventions to reduce the spread of viral misinformation. Nature Human Behaviour, 6(10), 1372-1380.
Ecker, U. K. H., & Ang, L. C. (2019). Political Attitudes and the Processing of Misinformation Corrections. Political Psychology, 40(2), 241–260. https://doi.org/10.1111/pops.12494
Feuerriegel, S., DiResta, R., Goldstein, J. A., Kumar, S., Lorenz-Spreen, P., Tomz, M., & Pröllochs, N. (2023). Research can help to tackle AI-generated disinformation. Nature Human Behaviour, 7(11), 1818–1821. https://doi.org/10.1038/s41562-023-01726-2
Hoes, E., Aitken, B., Zhang, J., Gackowski, T., & Wojcieszak, M. (2024). Prominent misinformation interventions reduce misperceptions but increase scepticism. Nature Human Behaviour, 8(8), 1545–1553. https://doi.org/10.1038/s41562-024-01884-x
Kozyreva, A., Lorenz-Spreen, P., Herzog, S. M., Ecker, U. K. H., Lewandowsky, S., Hertwig, R., Ali, A., Bak-Coleman, J., Barzilai, S., Basol, M., Berinsky, A. J., Betsch, C., Cook, J., Fazio, L. K., Geers, M., Guess, A. M., Huang, H., Larreguy, H., Maertens, R., … Wineburg, S. (2024). Toolbox of individual-level interventions against online misinformation. Nature Human Behaviour, 8(6), 1044–1052. https://doi.org/10.1038/s41562-024-01881-0
Lewandowsky, S., & van der Linden, S. (2021). Countering Misinformation and Fake News Through Inoculation and Prebunking. European Review of Social Psychology, 32(2), 348–384. https://doi.org/10.1080/10463283.2021.1876983
Maertens, R., Roozenbeek, J., Basol, M., & van der Linden, S. (2021). Long-term effectiveness of inoculation against misinformation: Three longitudinal experiments. Journal of Experimental Psychology: Applied, 27(1), 1.
Martel, C., & Rand, D. G. (2023). Misinformation warning labels are widely effective: A review of warning effects and their moderating features. Current Opinion in Psychology, 54, 101710. https://doi.org/10.1016/j.copsyc.2023.101710
Martel, C., & Rand, D. G. (2024). Fact-checker warning labels are effective even for those who distrust fact-checkers. Nature Human Behaviour, 8(10), 1957–1967. https://doi.org/10.1038/s41562-024-01973-x
McCabe, S. D., Ferrari, D., Green, J., et al. (2024). Post-January 6th deplatforming reduced the reach of misinformation on Twitter. Nature, 630, 132–140. https://doi.org/10.1038/s41586-024-07524-8
Nyhan, B., & Reifler, J. (2010). When Corrections Fail: The Persistence of Political Misperceptions. Political Behavior, 32(2), 303–330. https://doi.org/10.1007/s11109-010-9112-2
Nyhan, B. (2021). Why the backfire effect does not explain the durability of political misperceptions. Proceedings of the National Academy of Sciences, 118(15), e1912440117. https://doi.org/10.1073/pnas.1912440117
Pennycook, G., & Rand, D. G. (2022). Accuracy prompts are a replicable and generalizable approach for reducing the spread of misinformation. Nature Communications, 13(1), 2333. https://doi.org/10.1038/s41467-022-30073-5
van der Linden, S. (2022). Misinformation: Susceptibility, spread, and interventions to immunize the public. Nature Medicine, 28(3), 460–467. https://doi.org/10.1038/s41591-022-01713-6
Allen, J., Watts, D. J., & Rand, D. G. (2024). Quantifying the impact of misinformation and vaccine-skeptical content on Facebook. Science, 384(6699), eadk3451. https://doi.org/10.1126/science.adk3451
Lenti, J., Mejova, Y., Kalimeri, K., Panisson, A., Paolotti, D., Tizzani, M., & Starnini, M. (2023). Global misinformation spillovers in the vaccination debate before and during the COVID-19 pandemic: multilingual Twitter study. JMIR Infodemiology, 3, e44714.
Pielke Jr, R. A. (2004). When scientists politicize science: making sense of controversy over The Skeptical Environmentalist. Environmental Science & Policy, 7(5), 405-417.
Vicari, R., & Komendatova, N. (2023). Systematic meta-analysis of research on AI tools to deal with misinformation on social media during natural and anthropogenic hazards and disasters. Humanities and Social Sciences Communications, 10(1), 1-14.
West, J. D., & Bergstrom, C. T. (2021). Misinformation in and about science. Proceedings of the National Academy of Sciences, 118(15), e1912444117.

Note: We do not focus on the CS literature on fake news detection here, although there is a large body of work in that space. These selected papers focus on applications of AI in the “era” of AI.
Augenstein, I., Bakker, M., Chakraborty, T., Corney, D., Ferrara, E., Gurevych, I., Hale, S., Hovy, E., Ji, H., Larraz, I., Menczer, F., Nakov, P., Papotti, P., Sahnan, D., Warren, G., & Zagni, G. (2025). Community Moderation and the New Epistemology of Fact Checking on Social Media (No. arXiv:2505.20067). arXiv. https://doi.org/10.48550/arXiv.2505.20067
Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., & Zagni, G. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence, 6(8), 852–863. https://doi.org/10.1038/s42256-024-00881-z
Costello, T. H., Pennycook, G., & Rand, D. G. (2024). Durably reducing conspiracy beliefs through dialogues with AI. Science, 385(6714), eadq1814.
Costello, T. H., Pennycook, G., & Rand, D. (2025). Just the facts: How dialogues with AI reduce conspiracy beliefs. OSF Preprint.
Luceri, L., Salkar, T. V., Balasubramanian, A., Pinto, G., Sun, C., & Ferrara, E. (2025). Coordinated Inauthentic Behavior on TikTok: Challenges and Opportunities for Detection in a Video-First Ecosystem (No. arXiv:2505.10867). arXiv. https://doi.org/10.48550/arXiv.2505.10867
Shoaib, M. R., Wang, Z., Ahvanooey, M. T., & Zhao, J. (2023). Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models. 2023 International Conference on Computer and Applications (ICCA), 1–7. https://doi.org/10.1109/ICCA59364.2023.10401723
Schmitt, V., Villa-Arenas, L.-F., Feldhus, N., Meyer, J., Spang, R. P., & Möller, S. (2024). The Role of Explainability in Collaborative Human-AI Disinformation Detection. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2157–2174. https://doi.org/10.1145/3630106.3659031
Wang, J., Wang, X., & Yu, A. (2025). Tackling misinformation in mobile social networks a BERT-LSTM approach for enhancing digital literacy. Scientific Reports, 15(1), 1118.
Xu, D., Fan, S., & Kankanhalli, M. (2023, October). Combating misinformation in the era of generative AI models. In Proceedings of the 31st ACM International Conference on Multimedia (pp. 9291-9298).
Yang, K. C., Varol, O., Davis, C. A., Ferrara, E., Flammini, A., & Menczer, F. (2019). Arming the public with artificial intelligence to counter social bots. Human Behavior and Emerging Technologies, 1(1), 48-61.
Yi, J., Xu, Z., Huang, T., & Yu, P. (2025). Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions. In Proceedings of the 2025 2nd International Conference on Generative Artificial Intelligence and Information Security (pp. 87–93). Association for Computing Machinery. https://doi.org/10.1145/3728725.3728739
Zhang, Y., Sharma, K., Du, L., & Liu, Y. (2024). Toward Mitigating Misinformation and Social Media Manipulation in LLM Era. Companion Proceedings of the ACM Web Conference 2024, 1302–1305. https://doi.org/10.1145/3589335.3641256
Zhao, Y., Liu, B., Ding, M., Liu, B., Zhu, T., & Yu, X. (2023). Proactive deepfake defence via identity watermarking. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 4602-4611).
Chen, C., & Shu, K. (2024). Combating misinformation in the age of LLMs: Opportunities and challenges. AI Magazine, 45(3), 354-368. https://doi.org/10.1002/aaai.12188
-       LLMs Meet Misinformation (Canyu Chen and Kai Shu) (Project Website)
Chen, C., & Shu, K. (2024). Can LLM-Generated Misinformation Be Detected? In The Twelfth International Conference on Learning Representations.
-       Can LLM-Generated Misinformation Be Detected (ICLR 2024) (Github Repo)
Huang, R., Dugan, L., Yang, Y., & Callison-Burch, C. (2024, November). MiRAGeNews: Multimodal Realistic AI-Generated News Detection. In Findings of the Association for Computational Linguistics: EMNLP 2024 (pp. 16436-16448).
-       MiRAGeNews (Github Repo)
Lin, L., Gupta, N., Zhang, Y., Ren, H., Liu, C.-H., Ding, F., Wang, X., Li, X., Verdoliva, L., & Hu, S. (2025). Detecting Multimedia Generated by Large AI Models: A Survey (No. arXiv:2402.00045). arXiv. https://doi.org/10.48550/arXiv.2402.00045
Liu, A., Sheng, Q., & Hu, X. (2024). Preventing and Detecting Misinformation Generated by Large Language Models. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3001–3004. https://doi.org/10.1145/3626772.3661377
Wang, L. Z., Ma, Y., Gao, R., Guo, B., Zhu, H., Fan, W., ... & Ng, K. C. (2024). Megafake: a theory-driven dataset of fake news generated by large language models. arXiv preprint arXiv:2408.11871.
-       MegaFake Dataset (Github)
Zhou, J., Zhang, Y., Luo, Q., Parker, A. G., & De Choudhury, M. (2023, April). Synthetic lies: Understanding ai-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI conference on human factors in computing systems (pp. 1-20).
Barman, D., Guo, Z., & Conlan, O. (2024). The dark side of language models: Exploring the potential of LLMs in multimedia disinformation generation and dissemination. Machine Learning with Applications, 100545.
Calvo, P., & Saura García, C. (2024). Generative AI and Democracy: the synthetification of public opinion and its impacts. Available at SSRN 4911710.
Chu-Ke, C., & Dong, Y. (2024). Misinformation and Literacies in the Era of Generative Artificial Intelligence: A Brief Overview and a Call for Future Research. Emerging Media, 2(1), 70-85. https://doi.org/10.1177/27523543241240285
De Angelis, L., Baglivo, F., Arzilli, G., Privitera, G. P., Ferragina, P., Tozzi, A. E., & Rizzo, C. (2023). ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Frontiers in Public Health, 11, 1166120.
Ferrara, E. (2025). Charting the Landscape of Nefarious Uses of Generative Artificial Intelligence for Online Election Interference (No. arXiv:2406.01862). arXiv. https://doi.org/10.48550/arXiv.2406.01862
Garry, M., Chan, W. M., Foster, J., & Henkel, L. A. (2024). Large language models (LLMs) and the institutionalization of misinformation. Trends in cognitive sciences. https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(24)00221-3
Jaidka, K., Chen, T., Chesterman, S., Hsu, W., Kan, M.-Y., Kankanhalli, M., Lee, M. L., Seres, G., Sim, T., Taeihagh, A., Tung, A., Xiao, X., & Yue, A. (2025). Misinformation, Disinformation, and Generative AI: Implications for Perception and Policy. Digit. Gov.: Res. Pract., 6(1), 11:1-11:15. https://doi.org/10.1145/3689372
Schroeder, D. T., Cha, M., Baronchelli, A., Bostrom, N., Christakis, N. A., Garcia, D., Goldenberg, A., Kyrychenko, Y., Leyton-Brown, K., Lutz, N., Marcus, G., Menczer, F., Pennycook, G., Rand, D. G., Schweitzer, F., Summerfield, C., Tang, A., Bavel, J. V., Linden, S. van der, … Kunst, J. R. (2025). How Malicious AI Swarms Can Threaten Democracy (No. arXiv:2506.06299). arXiv. https://doi.org/10.48550/arXiv.2506.06299
Wack, M., Ehrett, C., Linvill, D., & Warren, P. (2025). Generative propaganda: Evidence of AI’s impact from a state-backed disinformation campaign. PNAS Nexus, 4(4), pgaf083. https://doi.org/10.1093/pnasnexus/pgaf083
Bashardoust, A., Feuerriegel, S., & Shrestha, Y. R. (2024). Comparing the Willingness to Share for Human-generated vs. AI-generated Fake News. Proc. ACM Hum.-Comput. Interact., 8(CSCW2), 489:1-489:21. https://doi.org/10.1145/3687028
Danry, V., Pataranutaporn, P., Groh, M., & Epstein, Z. (2025). Deceptive Explanations by Large Language Models Lead People to Change their Beliefs About Misinformation More Often than Honest Explanations. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–31. https://doi.org/10.1145/3706598.3713408
Groh, M., Sankaranarayanan, A., Singh, N., Kim, D. Y., Lippman, A., & Picard, R. (2024). Human detection of political speech deepfakes across transcripts, audio, and video. Nature Communications, 15(1), 7629. https://doi.org/10.1038/s41467-024-51998-z
Vaccari, C., & Chadwick, A. (2020). Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. Social Media + Society, 6(1), 2056305120903408.
Wittenberg, C., Epstein, Z., Péloquin-Skulski, G., Berinsky, A. J., & Rand, D. G. (2025). Labeling AI-generated media online. PNAS Nexus, 4(6), pgaf170. https://doi.org/10.1093/pnasnexus/pgaf170


Assignment 3 - Podcast Factchecking

Mon, 27 Apr 2026 21:10:04 GMT

Credits: Irmetova, A., Liu, H., Teleki, M., Carragher, P., Zhang, J., & Caverlee, J. (2026). PodChecker: An Interpretable Fact-Checking Companion for Podcasts. GitHub.

Podcasts are one of the fastest-growing media formats worldwide, yet they receive almost none of the editorial oversight applied to broadcast journalism. Hosts and guests regularly make factual claims (about science, politics, health, history, economics) without correction, rebuttal, or verification. This makes podcasts a significant and underexplored surface for trust and safety concerns: misinformation, misleading framing, unverifiable assertions, and coordinated narrative-pushing can all enter the information ecosystem through podcast audio without triggering any of the automated moderation systems that operate on text.

In this assignment, you will use PodChecker, an automated fact-checking pipeline for podcasts, to collect, analyze, and critically evaluate the factual claims made across a corpus of podcast episodes. PodChecker ingests podcast audio (via file upload or RSS feed), transcribes it using OpenAI Whisper, extracts atomic factual claims using an LLM, and fact-checks each claim using Perplexity's web-search API. The result is a claim-level credibility report, with a verdict (true / false / misleading / unverifiable) and supporting source URLs for each claim, plus an episode-level credibility score.

Recall that the Bluesky assignment focused on proactive, real-time moderation of individual posts. PodChecker asks a different question: what does applied computational research look like when the medium is audio, the content is long-form, and the platform has no built-in moderation infrastructure?
By the end of this assignment you will have hands-on experience with the full pipeline from data collection to analysis to critical evaluation, the same research cycle used in real trust and safety science.

You will select a podcast corpus relevant to a trust and safety harm of your choice, run the PodChecker pipeline across a set of episodes, and produce a written analysis of your findings. In the final milestone, you will either extend the system with a new capability, replicate and apply it to a new domain, or critically evaluate its limitations, documenting your process and findings in a short video presentation.

There is no auto-grader for this assignment. Your grade depends on the quality of your corpus selection rationale, the rigor of your analysis, and the depth of your critical reflection, not on whether PodChecker produces a particular score.
Podcasts are public media, but analysis of named hosts and guests carries ethical responsibilities. Follow the course's policy on engaging with harmful content throughout this assignment:
- Do not analyze content depicting child exploitation, solicitation of illegal activity, or other severely harmful material. If your chosen podcast unexpectedly contains such content, stop and consult an instructor before proceeding.
- Be precise in your claims about speakers. Reporting that PodChecker labeled a claim as "false" is different from asserting that the host deliberately lied. Automated fact-checking has error rates; your writeup should reflect this.
- Do not publish or publicly share individual claim-level verdicts about named people without explicit instructor approval. The analysis is for academic purposes.
- API costs are real. PodChecker consumes OpenAI Whisper (transcription) and Perplexity Sonar (fact-checking) API calls. Budget your usage: run on a small sample first, cache results aggressively, and use the MAX_AUDIO_SIZE_MB cap to control costs. Discuss API cost management in your writeup.

Deliverables:
- Analysis notebook (analysis/my_corpus_analysis.ipynb): a documented Jupyter notebook containing your corpus collection, credibility analysis, and visualizations.
- Code for any extensions (Milestone 3, Track A): well-commented Python, placed in analysis/ with a short README explaining how to run it.
- A 10-minute recorded video presentation covering all three milestones (see Presentation Guidelines).
- Your presentation slides or any other materials used in the video.

PodChecker is a research prototype with two usage modes:
- Web application: a React + Flask stack that accepts an audio file or RSS feed URL, runs the full pipeline, and renders a results table in the browser. This is the easiest way to verify that the system is working.
- Python analysis client: the PodCheckerClient class in analysis/podchecker_client.py allows you to run the pipeline programmatically across many episodes, with built-in audio and results caching. This is what you should use for your corpus analysis.

    RSS feed / audio file
      ↓
    Whisper (small.en)      ← OpenAI API
      ↓
    transcript (text)
      ↓
    Claim extraction        ← OpenAI API (GPT-4o)
      ↓
    Fact-checking loop      ← Perplexity Sonar (web search)
      ↓
    claim-level verdicts: true / false / misleading / unverifiable
      ↓
    credibility score (true=100%, false=0%, misleading=50%, unverifiable=excluded from score)

Source reliability ratings (from site/backend/filtered_attrs.csv) assign a 1–6 quality score to fact-check sources; sources rated ≥ 5 are marked "trusted" with a star prefix in results.

- Python 3.10+
- Node.js 18+ and npm
- ffmpeg (required for Whisper audio processing)
- OpenAI API key (for transcription and claim extraction)
- Perplexity API key (for web-search fact-checking)

    git clone https://github.com/annatastic/PodChecker.git
    cd PodChecker

macOS (Homebrew):

    brew install ffmpeg
Windows: Download from ffmpeg.org/download.html and add to PATH.

Ubuntu/Debian:

    sudo apt install ffmpeg

Verify:

    ffmpeg -version

    cd site/backend
    pip3 install --upgrade pip
    pip3 install pandas openai openai-whisper perplexityai feedparser requests flask flask-cors

For the analysis notebook only (no backend server needed):

    pip3 install pandas openai openai-whisper perplexityai feedparser requests matplotlib

Set your keys as environment variables (do not hard-code them in notebooks you submit):

    export OPENAI_API_KEY="sk-..."
    export PERPLEXITY_API_KEY="pplx-..."

Or use a .env file (add .env to .gitignore before committing anything):

    OPENAI_API_KEY=sk-...
    PERPLEXITY_API_KEY=pplx-...

Run the web application to confirm the full pipeline works:

    # Terminal 1: start the backend
    cd site/backend
    python3 app.py      # runs on port 8000

    # Terminal 2: start the frontend
    cd site/frontend
    npm install
    npm run dev         # runs on port 5173
Open http://localhost:5173 in a browser. Use the sample report dropdown to verify that a pre-processed result loads correctly. You should not need to call any APIs to view sample reports.

Milestone 1. Due: end of Week 14 (submit as a brief written memo, ≤ 500 words + sample output)

Select a podcast that is relevant to a trust and safety harm of your choice and verify that PodChecker can process it. Your podcast choice should be motivated by a specific T&S concern, not simply by personal interest or convenience.

Good corpus choices share these properties:
- T&S relevance: the podcast regularly discusses topics where false or misleading claims could cause real-world harm (health misinformation, political disinformation, financial fraud, extremist rhetoric, etc.).
- Public RSS feed: the podcast is accessible via a public RSS feed or individual episode audio URLs; this is what PodChecker uses to ingest content.
- Sufficient volume: the podcast has at least 15 recent episodes you can analyze; older archives are fine if they cover a coherent time period.
- English audio: Whisper's small.en model is English-only; non-English podcasts require the multilingual model (you may switch to it, but note this in your writeup).

Here are illustrative examples (you are not limited to these):

Submit a short memo covering:
- Podcast name, RSS URL, and episode count in the period you plan to analyze.
- T&S rationale: which harm type are you investigating, and why is this podcast a good data source for it?
- Sample output: run PodChecker on 1–2 episodes (use the web interface or the analysis client), and include a screenshot or table of the claim-level results.
- API cost estimate: based on your sample run, estimate the total OpenAI and Perplexity API cost for your full corpus (use the usage stats printed by the API calls). Propose a MAX_AUDIO_SIZE_MB cap if needed.
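The API cost estimate in the memo can be a simple extrapolation from the sample run. The following is a minimal sketch, assuming you have noted the per-episode spend for each service by hand from the printed usage stats. The function name, the dictionary keys, and the dollar figures are hypothetical placeholders; they are not part of PodChecker and are not real prices.

```python
from statistics import mean


def estimate_corpus_cost(sample_runs, num_episodes):
    """Extrapolate per-service API cost from a few sampled episodes.

    sample_runs: one dict per sampled episode, e.g.
                 {"openai_usd": 0.25, "perplexity_usd": 0.50}
                 (hypothetical structure, filled in by hand).
    num_episodes: number of episodes in the full corpus.
    """
    if not sample_runs:
        raise ValueError("need at least one sampled episode")
    estimate = {}
    for service in ("openai_usd", "perplexity_usd"):
        # Mean cost per sampled episode, scaled to the corpus size.
        per_episode = mean(run[service] for run in sample_runs)
        estimate[service] = per_episode * num_episodes
    estimate["total_usd"] = sum(estimate.values())
    return estimate


# Placeholder figures for two sampled episodes (not real prices).
sample_runs = [
    {"openai_usd": 0.25, "perplexity_usd": 0.50},
    {"openai_usd": 0.75, "perplexity_usd": 0.50},
]
for service, usd in estimate_corpus_cost(sample_runs, num_episodes=15).items():
    print(f"{service}: ${usd:.2f}")
```

A per-episode mean ignores differences in episode length. If your episodes vary widely in duration, scaling by minutes of audio instead may give a tighter estimate.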
Milestone 2. Due: end of Week 15 (with Milestone 3)

Run PodChecker across a corpus of at least 15 episodes and produce a rigorous quantitative analysis of claim-level credibility across your corpus.

Use analysis/episode_credibility_analysis.ipynb as your starting point. Adapt it for your corpus by:

- Pointing it at your RSS feed (or a list of episode audio URLs if no RSS is available):

    from analysis import get_recent_episodes, compute_credibility_percentage, PodCheckerClient

    RSS_PATH = "my_podcast_rss.xml"  # local copy of the RSS feed
    NUM_EPISODES = 15
    episodes = get_recent_episodes(RSS_PATH, NUM_EPISODES)

- Initializing the client (always use mode="local" for corpus analysis; it is faster and cheaper than the HTTP mode):

    import os

    # Read the keys from the environment rather than hard-coding them.
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
    PERPLEXITY_API_KEY = os.environ["PERPLEXITY_API_KEY"]

    client = PodCheckerClient(
        openai_api_key=OPENAI_API_KEY,
        perplexity_api_key=PERPLEXITY_API_KEY,
        mode="local",
        max_audio_size_mb=60,  # adjust based on your API budget
    )

- Running the analysis loop (results are cached to data/ automatically, so you can re-run the notebook without incurring API costs for already-processed episodes):

    episode_results = []
    for episode in episodes:
        result = client.analyze_episode(episode, podcast_name="MyPodcast")
        episode_results.append({'episode': episode, 'result': result})

Your notebook must include the following:

A. Episode credibility over time: a line plot of credibility score (0–100%) across episode dates, following the template in the starter notebook. Annotate any notable outliers.

B. Claim-level verdict distribution: a bar chart or table showing the proportion of claims labeled true, false, misleading, and unverifiable across the full corpus. Compute this both per-episode and aggregated.

C. Error analysis: for at least 10 claims that received a "false" or "misleading" verdict, manually verify the verdict by checking the supporting sources PodChecker provides. Report:
- How many did you agree with? Disagree? Find ambiguous?
- What patterns explain errors (hallucinated sources, out-of-date information, opinion framed as fact)?

D. Failure modes: document any episodes that failed to process (rate limits, audio access errors, truncation) and how you handled them. What fraction of your corpus is usable?

E. Cost accounting: report the actual API costs incurred (OpenAI token counts for Whisper + GPT-4o, Perplexity call count). Compare to your Milestone 1 estimate.

Milestone 3. Due: end of Week 15 (submitted together with Milestone 2)

Choose one of the three tracks below. All tracks have equivalent weight in the grading rubric. Your choice should reflect what you find most interesting and what is most tractable given your corpus.

Track A. Extend PodChecker with a new capability that addresses a gap in the current system. Examples:
- Multi-podcast comparison: run PodChecker on two or more podcasts covering the same topic (e.g., two podcasts on the same health topic with different credibility reputations) and compare their credibility profiles quantitatively.
- Claim-type taxonomy: add a claim-type classifier that categorizes claims before fact-checking (e.g., statistical claims, causal claims, identity claims, predictions) and analyze how credibility varies by claim type.
- Alternative fact-checking source: replace or supplement Perplexity with a structured fact-check database (e.g., Google Fact Check Tools API, ClaimBuster, or Community Notes data) and compare the verdicts produced by each source on the same claims.
- Temporal trend detection: identify claims that recur across episodes and analyze how their veracity changes over time (e.g., does a podcast host's credibility improve after a public correction?).

Your extension should include working code and a brief evaluation demonstrating that it produces meaningful results on your corpus.

Track B. Replicate the core PodChecker analysis from Irmetova et al. (2026) on a new podcast corpus and apply the findings to a specific T&S question. This track is appropriate if your main interest is empirical analysis rather than system development.

Your writeup should:
- Describe how your corpus differs from the paper's (podcast genre, time period, harm type).
- Report credibility scores, claim distributions, and source reliability breakdowns comparable to the paper's results.
- Apply the findings to a specific T&S question. For example: Does credibility score correlate with the podcast's media bias rating from an external source? Do episodes featuring certain guest types (politicians, scientists, activists) have systematically different verdicts?
- Critically discuss what the system gets right and what it misses for your specific harm type.

Track C. Conduct a systematic evaluation of PodChecker's accuracy, limitations, and potential for harm, without building a new extension or collecting a large new corpus. This track is appropriate if you want to focus on evaluation methodology and ethical analysis.

Your evaluation should address at least three of the following:
- Precision and recall: manually fact-check a stratified sample of claims (e.g., 30–50 claims) and compute precision, recall, and F1 against your ground-truth labels for each verdict category.
Hallucination analysis — examine the supporting source URLs PodChecker provides. How often do the URLs actually support the verdict? How often are they irrelevant or broken?
Claim extraction quality — evaluate whether the claims Whisper + GPT-4o extract are the important claims from the episode, or whether the system over- or under-samples certain claim types.
Domain sensitivity — test the system on a domain where automated fact-checking is particularly risky (contested political topics, rapidly evolving science) and analyze where human judgment would be required.
Bias and representation — does the system systematically produce different verdicts for claims made by speakers of different political affiliations, genders, or expertise levels? Design a small experiment to test this.

Record a ~10-minute video covering all milestones. You will submit the video along with your notebook and slides. Structure your presentation as follows:
Podcast corpus and T&S rationale (2 min) — introduce the podcast(s) you chose, the harm type you are investigating, and why this corpus is an interesting subject for T&S research.
System overview (1 min) — briefly explain how PodChecker works (pipeline diagram from the README is fine); you can assume your audience knows what transcription and LLMs are.
Corpus analysis results (3 min) — walk through your key findings from Milestone 2: credibility trend over time, verdict distribution, error analysis highlights. Show at least one chart.
Extension / Replication / Critical Evaluation (2 min) — present your Milestone 3 track: what you built or analyzed, what you found, and what surprised you.
Ethical reflection and limitations (1 min) — what are the risks of deploying a system like PodChecker at scale? What should a T&S practitioner know before using automated podcast fact-checking?
Future directions (30 sec) — one specific, actionable improvement you would make if you had more time.
Unlike the Bluesky assignment (which has an auto-grader), there is no single correctness score for this assignment. Instead, demonstrate rigor through: Reproducibility — your notebook should run end-to-end from a clean environment using cached data. Include a requirements.txt or environment spec. Sample size — at minimum 15 episodes with analyzable audio. Justify your sample size in the writeup. Manual verification — the error analysis in Milestone 2C is the closest thing to ground-truth evaluation; take it seriously. Spot-checking 10 claims is the minimum; more is better. Quantitative reporting — report credibility scores with summary statistics (mean, median, standard deviation); report claim counts and verdict breakdowns with exact numbers; plot over time when the corpus spans more than a week. Acknowledgment of failures — episodes that failed to process, API rate limits hit, audio truncations, and claims that were too ambiguous to verify are all expected and should be documented, not hidden.
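The quantitative reporting and per-category precision/recall mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration: the verdict names ("true", "false", "unverifiable") and the function names are assumptions made for the example, not part of PodChecker.

```python
from collections import Counter
from statistics import mean, median, stdev


def per_category_prf(gold, predicted):
    """Return {category: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[g] += 1  # true category g was missed
    scores = {}
    for cat in sorted(set(gold) | set(predicted)):
        precision = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        recall = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cat] = (precision, recall, f1)
    return scores


def summarize(values):
    """Summary statistics for a list of per-episode credibility scores."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical manual labels (gold) versus system verdicts (predicted).
gold = ["true", "false", "true", "unverifiable"]
predicted = ["true", "true", "true", "false"]
prf = per_category_prf(gold, predicted)
stats = summarize([60.0, 80.0, 100.0])
```

With a sample of only 30–50 claims, some categories will have very few examples, so it is worth reporting the raw counts next to each score.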
GitHub repository — source code, README, sample reports
analysis/episode_credibility_analysis.ipynb — starter notebook for corpus analysis
analysis/podchecker_client.py — PodCheckerClient API documentation (docstrings)
analysis/rss_utils.py — get_recent_episodes, compute_credibility_percentage functions
OpenAI API pricing — Whisper: $0.006/min of audio; GPT-4o: varies by token count
Perplexity API docs — Sonar model pricing per search call
Google Fact Check Tools API — free structured fact-check database (useful for Track A/B)
Listen Notes API — podcast search and RSS discovery
Podcast Index — open podcast RSS directory
Most major podcast platforms (Spotify, Apple Podcasts) publish RSS feeds for public shows
Course reading list — Misinformation sections (Lectures 17–21) for background on claim detection, source credibility, and intervention design
Irmetova et al. (2026) — the PodChecker paper; available in PodChecker/ folder
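For the cost accounting item, a rough estimate can be computed from the Whisper rate listed above. The sketch below is only an illustration: the GPT-4o and Perplexity rates are passed in as parameters, and the values used in the example call are placeholders, not real prices. Check the current pricing pages before reporting numbers.

```python
WHISPER_USD_PER_MINUTE = 0.006  # rate quoted in the resources list above


def estimate_cost(audio_minutes, gpt_input_tokens, gpt_output_tokens,
                  perplexity_calls, gpt_usd_per_1m_input,
                  gpt_usd_per_1m_output, perplexity_usd_per_call):
    """Return a per-service cost breakdown in USD plus a total."""
    costs = {
        "whisper": audio_minutes * WHISPER_USD_PER_MINUTE,
        "gpt_input": gpt_input_tokens / 1_000_000 * gpt_usd_per_1m_input,
        "gpt_output": gpt_output_tokens / 1_000_000 * gpt_usd_per_1m_output,
        "perplexity": perplexity_calls * perplexity_usd_per_call,
    }
    costs["total"] = sum(costs.values())
    return costs


# Example: 15 episodes of 45 minutes each.
# All rates below except the Whisper rate are PLACEHOLDERS for illustration.
costs = estimate_cost(
    audio_minutes=15 * 45,
    gpt_input_tokens=1_000_000,
    gpt_output_tokens=100_000,
    perplexity_calls=200,
    gpt_usd_per_1m_input=2.0,       # placeholder
    gpt_usd_per_1m_output=10.0,     # placeholder
    perplexity_usd_per_call=0.01,   # placeholder
)
```

Keeping the breakdown per service makes it easy to compare each line item against the Milestone 1 estimate.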

Assignment 2 - Bluesky Moderation

Mon, 27 Apr 2026 21:09:56 GMT

Credits: Cornell CS 5342, original assignment "CS5342 Automated moderator for Bluesky", GitHub

In this assignment, you will gain first-hand experience with Bluesky’s customizable approach to moderation. We’ll walk you through implementing a labeler, which is a service that attaches categorical labels to Bluesky posts and accounts. Users who subscribe to your labeler can configure how these labels are applied to the posts they see. For instance, your service may attach a label for spam or NSFW content (throughout, content will refer to both posts and accounts). Some users may wish to hide such content altogether; others may prefer that a badge be attached to it.
Recall that content moderation is not solely about blocking harmful content. It can also be about organizing and displaying content in a way that is helpful to users. Similarly, labelers are not just for marking definitely objectionable posts and accounts. Here are a few examples:
The pronouns labeler allows users to display a badge on their profile indicating their pronouns that subscribers to the labeler can see.
The US Government Contributions labeler will apply badges to the accounts of representatives with the organizations that fund them. After subscribing to this labeler, you can look up Alexandria Ocasio-Cortez’s account, and see badges indicating that her donor list includes employees of or PACs tied to Alphabet and the City of New York.
This popular labeler attempts to identify AI-generated imagery.
You can find more examples of labelers at Bluesky-labelers.io. We encourage you to try some of them out before starting the assignment.
In the first part of the assignment, you will build an automated labeler that will apply labels to Bluesky posts based on their text content. We will provide a test set of posts and their expected labels in a CSV file. You should not hard-code these labels in your implementation – we will test your code on some examples that do not appear in this test set, and a portion of your grade will be based on your labeler’s accuracy on these instances. Furthermore, if you do hard-code labels for particular posts, you will receive a 0 for the functionality score. If your labeler produces nothing for all inputs, it will also receive a 0.
In the second part of this assignment, you will implement your own automated moderation policy as a Bluesky labeler. This can be the policy you articulated in Assignment 2, but you are free to choose another topic. We expect you to comprehensively test your code for this component. The extent to which your code and testing are well-documented will constitute a portion of your grade. This part of the assignment is more open-ended, so you’ll have to demonstrate to us that you’ve thought through how you can verify that your implementation will meet your stated moderation goal. The creativity you demonstrate in your chosen problem/solution will also constitute a portion of your grade.
Throughout this class we have discussed how safety measures can in turn be abused. We encourage you to continuously check the work that you are doing for unintended consequences and follow our course’s policy on engaging with harmful content. You should be particularly careful when completing Milestone 5. This will include documenting your process carefully, clearly signposting the exercise as an academic effort, and providing a way for labeled users to express any concerns with the label.
A well-documented implementation of your labeler in Python
Your Part I implementation should be in automated_labeler.py
For Part II, create a file named policy_proposal_labeler.py – you can base that implementation on the code we provide for you.
A 10-minute video presentation describing your implementation choices, your testing approach, and an evaluation of your solution for addressing the chosen harm
Your presentation slides and any other materials you used to create your video
You can find a detailed discussion of the Bluesky moderation infrastructure here. We provide a high-level overview below. Account data is hosted at a personal data server (PDS). This data is distributed, via a relay, to an AppView service. Labelers are services that generate labels on posts and accounts. These labels are sent to the AppView. When user client devices download posts from the AppView, they obtain labels associated with the content they received, depending on what labelers they are subscribed to.

The figure above[1], from the Bluesky moderation infrastructure overview, provides a visual representation of how labels are generated and sent to the AppView. Bluesky provides its own labeler that handles platform-level moderation policies. In addition to this, users can opt into other third-party labelers for additional layers of moderation. You will implement and run one such labeler for this assignment.
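As a toy model of the last step in this pipeline, the sketch below shows how a client might keep only the labels issued by labelers the user subscribes to and then decide whether to hide or badge a post. The `Label` fields `src`, `uri`, and `val` mirror fields of the AT Protocol label record in simplified form; the preference values "hide" and "badge", the DIDs, and the function names are assumptions made for this illustration and are not the Bluesky client's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    src: str  # DID of the labeler that issued the label
    uri: str  # the post or account the label applies to
    val: str  # the label value, e.g. "spam"


def labels_for_client(all_labels, uri, subscribed):
    """Keep only labels on `uri` that come from subscribed labelers."""
    return [l for l in all_labels if l.uri == uri and l.src in subscribed]


def display_decision(labels, preferences):
    """Return ("hide", []) or ("show", badges) given per-label preferences.

    `preferences` maps (labeler DID, label value) to "hide", "badge",
    or "ignore"; unknown labels default to "badge" in this sketch.
    """
    badges = []
    for label in labels:
        action = preferences.get((label.src, label.val), "badge")
        if action == "hide":
            return "hide", []
        if action == "badge" and label.val not in badges:
            badges.append(label.val)
    return "show", badges


# Hypothetical data: two labelers have labeled the same post.
POST = "at://did:example:alice/app.bsky.feed.post/1"
all_labels = [
    Label("did:example:labeler-a", POST, "spam"),
    Label("did:example:labeler-b", POST, "ai-imagery"),
]
seen = labels_for_client(all_labels, POST, {"did:example:labeler-b"})
decision = display_decision(seen, {})
hidden = display_decision(
    seen, {("did:example:labeler-b", "ai-imagery"): "hide"}
)
```

The same post therefore looks different to different users, depending on their subscriptions and settings.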
You can access the starter code from the class GitHub, under the bluesky-assign3 directory.
In an actual production environment, your labeler would likely ingest posts from the firehose, which provides a stream of content as it is disseminated through the network. However, for the purposes of this assignment, your labeler will be ingesting posts from a CSV file. This will allow for easier testing and debugging.
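As a rough illustration of CSV-driven ingestion, the loop below reads posts and expected labels from a CSV and compares them with a labeler's output. The column names `URL` and `Labels`, the JSON encoding of the label list, and the stub labeler are all assumptions made for this sketch; check the provided files for the actual format.

```python
import csv
import io
import json

# Hypothetical sample in the assumed format: a post URL and a JSON list of labels.
SAMPLE_CSV = '''URL,Labels
https://example.invalid/post/1,"[""t-and-s""]"
https://example.invalid/post/2,[]
https://example.invalid/post/3,"[""dog""]"
'''


def evaluate(csv_file, moderate_post):
    """Return the fraction of rows where predicted labels equal expected labels."""
    total = 0
    correct = 0
    for row in csv.DictReader(csv_file):
        expected = set(json.loads(row["Labels"]))
        predicted = set(moderate_post(row["URL"]))
        total += 1
        if predicted == expected:
            correct += 1
    return correct / total if total else 0.0


def stub_moderate_post(url):
    """Stand-in for a real labeler, used here only so the loop can run."""
    return ["t-and-s"] if url.endswith("/1") else []


accuracy = evaluate(io.StringIO(SAMPLE_CSV), stub_moderate_post)
```

Comparing label sets rather than lists means the order in which labels are produced does not affect the score.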
In this assignment, your labeler consists of two components. The first is the labeling server, which interfaces with the AppView to attach labels to content. This is a JavaScript program that uses the skyware/labeler library. The second component is your labeler bot, which you will implement as a Python program that will interface with the labeling server to produce labels.
In order to create a labeler, you need a public domain to host the labeling server and a Bluesky account associated with the labeler. We will handle the hosting of the labeling server for you. If you’re interested in an additional challenge and want to own/operate your own labeling infrastructure, you can consult this guide. For the purposes of completing this assignment, you are not required to make your labeler live (i.e., emit public labels). In fact, for Part II, you should be careful to check with us before you start attaching labels to public posts. However, we give you the option to emit labels so you can see your hard work in action on the Bluesky network.
Start by creating your own GitHub repository with a clone of the starter code.
Make sure you have Node.js and Python installed. Install the skyware labeler package with the following command:
npm install @skyware/labeler
Make sure you have Python 3 installed, along with the ATProto, Dotenv, Requests, and Perception modules:
pip install atproto dotenv requests perception
Run the following command in the starter code directory to ensure that you can access posts.
python get_post_test.py
If this runs successfully, you are ready to begin the assignment. Testing that you can emit public labels (the following section) is optional.
From your browser, visit https:///xrpc/com.atproto.label.queryLabels
This will display all the labels that have been issued by your labeler. Initially, this list will be empty. Let’s change that. Run the following command to apply a label to a post from the bsky.app account:
python label.py post https://bsky.app/profile/bsky.app/post/3l6oveex3ii2l great
This applies a “great” label to the post at the specified URL. You can use the skyware/labeler command line utility to modify the labels that your labeler supports.
Subscribe to your labeler from a different account (perhaps your personal Bluesky account) and visit the post in the URL to observe that the label has been applied:

You can also use the Label Scanner tool to verify that your label was applied to the post.
At this point, you have confirmed that you can emit labels for particular posts and accounts via the command line tool. That’s a great achievement already – you have the essential infrastructure for running your own third-party moderation service on Bluesky, congratulations!
Now, you’ll automatically apply labels based on posts that meet certain criteria. Open up automated_labeler.py. You’ll notice that we provide a constructor for your labeler and a moderate_post function. This function takes as input a URL to a Bluesky post and returns a List[str]: a list of strings containing each label to be added for the post, or an empty list ([]) if there are no labels to add. You must implement your labeling logic in this function as this will interface with the auto-grader for the assignment. When you run your labeler via test-labeler.py, the output of moderate_post will be used to emit a label via the label_post function defined in label.py. You can configure whether to actually emit labels to the Bluesky network via a command-line argument for test-labeler.py – while you’re testing your code, you shouldn’t be emitting labels.
A portion of your grade will consist of your coding style – your code should be legible and well-organized. You should decompose the logic in moderate_post across different functions that you’ll define for your AutomatedLabeler.
For Part I, we will provide an isolated testing script for you to test how your code generates labels. This will be the same script that our auto-grader will use in determining the functionality score for Part I. You may find this script helpful for your testing set-up in Part II as well.
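One way to decompose moderate_post is to have it collect labels from several small check functions. The sketch below is a stand-in, not the starter code: the class name, the constructor arguments, the injected `fetch_text` callable, and the two example checks are all hypothetical, and the real constructor provided in the starter code may look quite different.

```python
from typing import Callable, Iterable, List

Check = Callable[[str], Iterable[str]]


class AutomatedLabelerSketch:
    """Hypothetical stand-in showing one way to structure moderate_post."""

    def __init__(self, fetch_text: Callable[[str], str], checks: List[Check]):
        self.fetch_text = fetch_text  # maps a post URL to its text
        self.checks = checks          # each check returns zero or more labels

    def moderate_post(self, url: str) -> List[str]:
        """Return the de-duplicated labels produced by all checks."""
        text = self.fetch_text(url)
        labels: List[str] = []
        for check in self.checks:
            for label in check(text):
                if label not in labels:
                    labels.append(label)
        return labels


def check_has_question(text: str) -> List[str]:
    return ["question"] if "?" in text else []


def check_shouting(text: str) -> List[str]:
    return ["shouting"] if text.isupper() else []


# Fake posts stand in for network access so the sketch runs offline.
FAKE_POSTS = {
    "https://example.invalid/post/1": "IS THIS REAL?",
    "https://example.invalid/post/2": "nothing to see here",
}
labeler = AutomatedLabelerSketch(
    FAKE_POSTS.__getitem__, [check_has_question, check_shouting]
)
labels_1 = labeler.moderate_post("https://example.invalid/post/1")
labels_2 = labeler.moderate_post("https://example.invalid/post/2")
```

Keeping each check in its own function also makes it straightforward to unit test the checks without touching the network.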
A common moderation technique that platforms employ is text matching against a list of known harmful text. In this part of the assignment, you will implement this technique to label posts containing Trust-and-Safety-related words/domains. We sourced the words from the TSPA glossary. You will find this list in t-and-s-words.csv and a list of T&S domains in t-and-s-domains.csv. For each post in input-posts-t-and-s.csv, apply a “t-and-s” label to those that match either list. Matching should be case-insensitive: if the word “moderate” is on the list, then a post containing the word “mOderAtE” should also be labeled.
Label posts that link to news articles with the news publications with which they are affiliated. You will have to create labels for the following publications: CNN, BBC, NYT, Washington Post, Fox News, Reuters, NPR, AP. The file news-domains.csv will contain a list of the domains you should scan for, along with the label to apply. For each post in input-posts-cite.csv, apply the appropriate label(s). Note that if there are multiple news links from different sources, then one label should be generated for each source. If there are multiple links from the same source, then only one label should be generated.
Many platforms employ a technique called perceptual hash matching in order to detect harmful or illegal images. An image is passed through a perceptual hash function, which outputs a bitstring (a sequence of 0’s and 1’s), such that two similar images should hash to similar bitstrings. In this part of the assignment, you will use this technique to identify pictures of dogs that match a known list (the dog-list).
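The three matching ideas above can each be sketched as a small pure function. This is an illustration under stated assumptions, not a reference solution: it uses substring matching for words, assumes links appear in the post text, strips a leading "www." from hostnames, represents hashes as strings of "0" and "1", and treats a distance at or below the threshold as a match. The label values "cnn" and "bbc" are placeholders; the real ones come from news-domains.csv.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def matches_word_list(text, words):
    """Case-insensitive substring match against a list of words."""
    lowered = text.casefold()
    return any(word.casefold() in lowered for word in words)


def news_labels(text, domain_to_label):
    """Return one label per distinct news source linked in the text."""
    labels = []
    for url in URL_PATTERN.findall(text):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        label = domain_to_label.get(host)
        if label and label not in labels:
            labels.append(label)
    return labels


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length bitstrings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bitstrings must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def matches_known_hash(candidate, known_hashes, threshold):
    """True if the candidate is within `threshold` bits of any known hash."""
    return any(hamming_distance(candidate, k) <= threshold for k in known_hashes)


word_hit = matches_word_list("Please mOderAtE this thread", ["moderate"])
sources = news_labels(
    "see https://www.cnn.com/a and https://cnn.com/b and https://www.bbc.com/c",
    {"cnn.com": "cnn", "bbc.com": "bbc"},  # placeholder label values
)
distance = hamming_distance("1011", "1001")
```

Substring matching will also fire on longer words that contain a listed word, so word-boundary matching may be a better fit depending on how the expected labels were generated.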
For posts in input-posts-dogs.csv containing an image matching the image dog-list (the images in the dog-list-images directory – sourced from WeRateDogs), apply the “dog” label. A match is defined as the image being within a Hamming distance of THRESH[2] of the target image’s perceptual hash. You can use the PHash implementation provided here to perform perceptual hashing. We leave it to you to figure out how to extract the image(s) contained within a post. You’ll find it helpful to consult the atproto documentation along with the PIL and requests Python modules. See if you can notice a pattern in the URLs associated with post images.
In Part I of this assignment, you gained familiarity with the AT protocol and implemented automated moderation routines. Using those skills, you’ll extend your labeler to handle a harm type of your choice. You are encouraged to implement the policy proposal you outlined in Assignment 2 because you’ll have spent significant time grappling with it, but you are also free to choose a different problem to tackle if you don’t think you can implement your solutions from Assignment 2. Your choice to build on Assignment 2 or start anew will not affect your grade. Recall that your implementation for Part II should be in a file named policy_proposal_labeler.py.
We expect your problem selection and solution to demonstrate a reasonable level of creativity, sophistication, and involvement. For instance, tackling toxicity by making a call to the Perspective API for each post and attaching a “toxic” label if it exceeds some threshold would not suffice. You will likely have to iterate on your policy proposal and implementation to achieve something reasonable. Document this process and discuss it in your presentation.
You should begin by gathering data on the harm you plan to tackle. This will inform your testing approach and solution design. You can use the ATProto SDK to crawl Bluesky and filter for content that may be relevant to the harm you address.
You can also leverage research done for Assignment 2. Part of your grade will be based on the description, execution, and efficacy of your testing setup. Depending on your approach, you may have to manually label some of the data you collect. We do not want you to deal with illegal or severely harmful content (e.g. sale/solicitation of illegal substances, CSAM, etc.).
Remember that precision in labeling at scale is difficult – and you only have a few weeks. For that reason, we encourage you to choose a labeling implementation that recognizes it is detecting potentially sensitive content rather than one that is categorical about finding the harmful material – unless you can be highly confident about the accuracy of your endeavors. You can help yourself by being precise about what you call the labeler; you can then explain why that labeler might help fight the harm you care about in your presentation / video. Here are some illustrative examples:
“Potentially soliciting financial information” is a better labeler than “Fraud posts”. You might use a combination of text-matching and LLM reasoning to label posts that include certain brand names tied to money exchange (e.g. Venmo, CashApp) and a call to action (e.g. “Send me your” or “Give me”). Recognize that people often provide this information during emergencies, for fundraising, or as tips for their online work (e.g. on Patreon).
“Addresses content that has been fact-checked before” is a better labeler than “Fake news” both because it is more precise and because you are not making a definitive judgment about the veracity of the Bluesky post, which will be very hard. To build such a labeler you may choose to lean on the Fact Check Explorer API or the open source Community Notes data.
For inspiration, we provide below a non-exhaustive list of inputs, signals, and tools you may consider using in your labeler:
Perspective API for toxicity scoring
Google fact checking tools, other fact checking databases/APIs
Analysis of the Bluesky network – looking at follower lists, number of posts/replies, and other metadata. This could be helpful for analyzing particular communities.
User input – users can message your labeler/react to posts. This can inform a collaborative voting/labeling approach
Non-profit, human rights, and/or legal groups that have categorized organizations in ways that may be useful for labeling purposes (e.g. RSF World Press Freedom Index)
LLMs, computer vision models
For full transparency, you should make it clear in the description of your labeler account that your labeler is part of an educational exercise and that it should not be trusted for complete accuracy. You should also collect and respond to any serious criticism.
Take care to consider the ethical implications of deploying your labeler and ensure that it does not lead to or provide a vector for abuse/harm. For instance, you could see how labeling non-notable individuals for their perceived political position on a hot-button topic could lead to doxxing campaigns or worse. Please feel free to reach out to us if you want to gut-check your proposed labeler for possible harmful consequences.
~10 minute recorded video
Introduce, motivate, and explain the harm you aim to mitigate, along with your proposed policy
Discuss the various approaches you tried out, explain what hurdles there were, and describe what you needed to revise in your policy so that your implementation reflects it
Give a high-level technical overview of your implementation
Provide a demo of your labeler in action
Discuss your approach to testing and evaluation
Analyze the ethical implications of deploying your labeler
Talk about future areas for improvement
We expect you to test on a reasonably large number of posts (on the order of 100), and evaluate the accuracy, precision, and recall of your labeler (for labelers highly dependent on user input, this analysis may look slightly different). You should also discuss the efficiency and performance of your labeler in terms of the amount of computation and memory it requires. Things you may consider measuring include: How long does it take your labeler to make a decision on a particular post? How much memory does it consume?
How much network communication does it require?
As you build an application that interfaces with Bluesky and the AT protocol, you’ll likely have conceptual questions about how the protocol and Python SDK work. Additionally, you may wonder about useful APIs for implementing your Part II solution. We list relevant resources below:
AT Protocol spec
AT Protocol Python SDK documentation
Bluesky developer discord
List of Bluesky labelers
Broader developer docs for Bluesky
Your grade in the assignment will be made up of the following components:
https://docs.bsky.app/blog/blueskys-moderation-architecture ↩︎
This is a constant that will be provided in the starter code.↩︎

Assignment 1 - Discord Bot

Mon, 27 Apr 2026 21:09:36 GMT

Source code: GitHub
Credits: Trust & Safety Teaching Consortium and Cornell Tech
Alex Stamos, Stanford Internet Observatory
Shelby Grossman, Stanford Internet Observatory
Jeffrey Hancock, Stanford Internet Observatory
For the course project, you and your group will be the Trust and Safety team at a major social media or consumer cloud platform. You will be assigned to a group at the end of April. The team will focus on a particular type of abuse, proposing policies to the executives at your company as well as researching and implementing relevant technological solutions within a content moderation bot in Discord. The project is split into three milestones, which will be completed over the course of the quarter. The first milestone will be completed individually. The second and third milestones will be completed with your group. The final milestone will culminate in a presentation that your group will give to the teaching team and guest judges from industry, the evening of Wednesday, June 5th.
Please do not generate the text you submit using ChatGPT or any other LLM. You will have plenty of opportunities to use these systems to generate test data or perform abuse detection in future milestones, but anything submitted by students in this assignment should be written by humans.
Percent of Final Grade: 20%
Deliverables: A PDF containing two major sections: 1) Abuse Research Report (2000-4000 words) 2) Policy Comparison Table
Submission: This first milestone will be completed independently. CS152 students should upload the document to their Canvas site and POLISCI143 students should upload the document to their Canvas site.
Note: The abuse type you choose to focus on for this milestone may or may not be the abuse type your group focuses on for milestones 2 and 3.
Description: The execs recently came to your team and tasked you with looking into a significant type of online abuse, researching the current best options available for dealing with such abuse, and making specific recommendations to the company of how to detect and mitigate it.
In writing this paper, please use citations (choose your preferred style) for factual information and feel free to add your own original interpretations and suggestions. Please follow the structure below for the paper overall, but you are welcome to add additional information in whatever sections you see fit. You are also welcome to use graphics, charts, or diagrams as long as you cite the original source.
Example Abuse Types: You are welcome to write about any of these abuse types or one of equivalent importance. If you want to move away from topics that are covered in the syllabus, please check in with the teaching team first.
Suicides driven by bullying on Instagram
Murder-Suicides on Facebook Live
Live streaming of terrorist attacks
Government propaganda against domestic minorities
Disinformation in online ads
Coordinated harassment of journalists on Twitter
Sextortion (on many platforms)
Trading of Child Sexual Abuse Materials (on many platforms)
Online cryptocurrency scams on Twitter
Hate speech on a streaming platform
Terrorist recruitment on Twitter
Fraudulent identification on Airbnb
Grooming on Snapchat or Instagram
Catfishing on dating apps
A note on choosing abuse type: You may end up working with this abuse type for the entire quarter, so we encourage you to pick a topic that you care about. Note that when you later work to implement technological solutions we will use stand-ins for illegal and/or very harmful material (e.g. pictures of kittens instead of CSAM). Don’t let the potential technical challenge of a topic scare you away from tackling it; your milestones will be graded on effort and thoughtfulness, not whether or not you’re able to effectively solve these problems. They’re still problems in the real world because they’re quite hard, after all!
Required Sections:
Description of the Abuse Type - Provide a summary of the kind of abuse, including citations to known examples. This can include linking to reporting in the media, academic research, talks by professionals, or links to legal documents like indictments.
Actor and Victim - What do you know about the people behind this kind of abuse? How about the victims? Is this something anybody can experience, or is the abuse tied to a specific part of the victim's identity? Are there forums or other platforms where these kinds of abusers congregate and can we learn more about them?
In-Depth Profile Piece on an Actor or Victim - Research someone who has personal experience (as an actor, victim, bystander, or content moderator) with the abuse type you are analyzing. What are their experiences with this type of abuse? How has the abuse shaped or influenced their post-abuse experience (if at all)? What do they have to say about the way content moderation on major tech platforms should handle this type of abuse? Include these findings in your final paper. You will receive extra credit on this section if you are able to do an “in-person” (remote) interview with a real person who experienced (or perpetrated) this kind of abuse.
Details of the Abuse - Describe the immutable aspects of this kind of abuse and how they could be detected. What might differ between attackers and victims? Dive into at least one real-world example and pull out specific moments at which the abuse could be detected or mitigated.
Relevant Technologies - What technologies currently exist that are/can be used to combat this kind of abuse? What are their strengths and weaknesses? On what platforms are they used now and to what levels of success?
Specific Recommendations - What policy, product, engineering, or operational changes do you recommend to deal with this type of abuse?
Length: 2000-4000 words. We will penalize papers that exceed this word limit.
Create a policy table outlining what platform policies are currently in effect that relate to this kind of abuse. Compile language, pulled from the policies of other platforms, that you think is relevant or appropriate. Please do this for three platforms. An example table for an investigation into coordinated harassment of journalists on Twitter is below, but you may adjust as appropriate for different abuse types. Think critically about what types of columns you think are theoretically important. Please make sure all table cells that reference policy hyperlink to the exact website. Note that there may be cases where you need to write “unclear.” That’s fine, just justify the reasoning. You can see other examples in Figure 1 here, Table 1 here, and here. In addition to the table, in two or three paragraphs, please explore any research on the effects of these policies, including whether they help mitigate the abuse or enable it.
Coordinated harassment of journalists on Twitter (Note: this table includes made up data)
Note to instructors: Including this example table will lead many students to focus on this exact abuse type.
Percent of Final Grade: 20%
This milestone has 3 components and 4 deliverables:
Design a user reporting flow and a behind-the-scenes moderator flow
Deliverable: A user reporting flow and a behind-the-scenes moderator flow, in PDF format
Implement these flows into your Discord bot
Deliverable: All code files from your backend implementation checked into a forked GitHub repository (submit the link to your repo)
Deliverable: Short video (around 6 minutes) demoing your bot’s functionality + discussing examples
Writeup
Deliverable: Writeup about the work you’ve done so far to handle your specific abuse type (~500 words)
Submission: Please have one group member from CS152 upload all documents to the Canvas CS152 site before the deadline.
Make sure to include your team number in the name of each document.
For milestone 3 you will be asked to make your bot “smart,” e.g. train a classifier on your abuse type. For this milestone, milestone 2, it is fine if the back end moderator flow is manual.
Select an abuse type to focus on for the group project. This should probably be an abuse type investigated by a member of your group for milestone 1, so that you are starting with a good understanding of the abuse and potential mitigations. While choosing an abuse type to focus on, please keep in mind that milestone 3 will ask you to create some kind of automated detection mechanism for that abuse type, so choose something for which automated detection is at least tractable. You are allowed to choose an abuse type where realistic testing is illegal or would be extremely difficult for the team, such as the trading of child sexual abuse material. In these cases, you will use a type of proxy content to train and test your detection system. For example, we have used “naked” photos of kittens (versus adult cats) as stand-ins for CSAM detection in the past. If you want to scan for extremely violent videos, you can use cartoon violence instead of content such as beheading videos. For topics where testing might require the use of upsetting content, such as racist language or misogynistic threats, make sure that the entire team is ok with the topic before proceeding with milestone 2.
Your group will design two reporting flows: one based on user reports and the other a behind-the-scenes flow for content moderators. You will implement both these flows (within Discord’s constraints) in your bot for this milestone. The best way to represent these flows is using a flowchart that includes both the user action and the system’s response. The reporting flow should have a variety of abuse types at the highest level, but the more detailed flow (after the first or second prompt) only needs to be built out for your specific abuse type.
An example of a User Reporting Flow focused on hate and harassment that received a high score last year is included here; however, as noted above, you only need to show a variety of abuse types at the highest level.
Some additional examples from industry:
Facebook: https://blog.heyo.com/wp-content/uploads/2012/06/FB-Reporting-Guide.png
Twitter: https://help.twitter.com/en/safety-and-security/report-abusive-behavior
Your user reporting flow should outline the process that a user is taken through when they attempt to report an instance of your abuse type on your platform. It should do the following:
- Offer users the ability to specify the detailed type of abuse
- Note steps of the process that require review (automated or manual)
- Clearly identify potential outcomes of a report (nothing, post is removed, shadow block, etc.)
Your manual review flow should outline the process that a content reviewer goes through when they review a piece of content submitted by a user as abusive using the flow you just created. It should do the following:
- Handle reports coming both from users and automated flagging (though automated flagging need not be complete until milestone 3)
- Outline the manual review process of flagged messages: what options are given to reviewers? What information do they have access to? Make sure to clearly identify potential outcomes of a report (nothing, post is removed, user is banned, etc.)
- Are there multiple levels of reviewers? Are there situations where a first-tier content reviewer can engage their management or a specialized investigations team?
Some questions you should think about when designing your flows:
- How many steps should there be? How does this balance warding off malicious reporters/spammy reporters while still encouraging real reports?
- How specific should the options be? What’s the tradeoff between offering many different options and only a few? How will this affect user experience?
- What characteristics make content able to be moderated automatically, and what content should go through human review?
- In a perfect world, what outcomes might exist to help keep users safe? (E.g. shadow blocking, user rehabilitation programs, etc.) How can you work those ideas into the flows?
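One way to move from a flowchart to code is to write the user reporting flow down as data before wiring it into the bot. The following is a minimal sketch under that assumption, not the starter code's approach: the step names, prompts, and abuse categories in REPORT_FLOW are hypothetical placeholders, and run_flow is an illustrative helper rather than part of any library.

```python
# Hypothetical report flow expressed as data: each step has a prompt and
# maps a user's choice to the next step. All names here are illustrative.
REPORT_FLOW = {
    "start": {
        "prompt": "Why are you reporting this message?",
        "options": {
            "spam": "done_review",
            "harassment": "harassment_detail",
            "other": "done_review",
        },
    },
    "harassment_detail": {
        "prompt": "What kind of harassment is this?",
        "options": {
            "threat": "done_urgent",
            "insult": "done_review",
        },
    },
    # Terminal steps have no options; their prompt states the outcome.
    "done_review": {
        "prompt": "Thanks. A moderator will review this report.",
        "options": {},
    },
    "done_urgent": {
        "prompt": "Thanks. This report was escalated for urgent review.",
        "options": {},
    },
}


def run_flow(flow, choices, start="start"):
    """Walk the flow with a list of user choices; return the visited step names."""
    step = start
    path = [step]
    for choice in choices:
        options = flow[step]["options"]
        if choice not in options:
            # Unknown input: stay on the same step so the bot can re-prompt.
            continue
        step = options[choice]
        path.append(step)
        if not flow[step]["options"]:
            break  # reached a terminal outcome
    return path
```

Keeping the flow in one dictionary makes it easy to compare against the flowchart and to add deeper branches only for the abuse type your group chose.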
Your company’s T&S-minded CEO has approved your reporting flows and asked that you implement them for this abuse type, commending you on the thought that went into them. Success! She gives you the green light to build out a skeleton of the system for some A/B testing with real users.
You will design and implement a reporting flow within the context of Discord. We are using Discord for the class project because their bot framework is very powerful; many communities build the functionality they want on top of Discord by using bots to greet users as they join servers, auto-respond to messages, moderate chat/ban swear words, and much more. Discord bots are written in asynchronous Python - don’t worry if you haven’t worked with it before, it shouldn’t be too hard to pick up (and the TAs are here to help!). Please consult the Discord Bot Setup Guide at the bottom of this doc.
Your group will be given two pre-configured channels: group-# for general chatting and group-#-mod to serve as a back-end for human moderators (from now on we’ll call them the “main” channel and the “mod” channel). User generated content will go in the main channel, and that content will either be manually reported or automatically flagged by your bot and potentially be sent to the mod channel for human review.
Although we have given you the setup of these two channels, you are not necessarily restricted to the context of a group chat! It is up to you to specify what context your moderation tools exist in, and style your main group-# channel accordingly. For example, you could say it is a feed of content that one specific user is scrolling through (e.g. Instagram/Twitter style), or say the channel is actually a DM between two users. There’s a lot of flexibility here, and as long as you’re clear about what you imagine and how you are simulating that within Discord, you’re welcome to adapt in whatever way makes sense for the abuse type you have chosen.
Please reach out to the TAs if you have ideas that you aren’t sure how to adapt; they can also potentially give you additional channels if it would be helpful.
For this part, your bot should be able to do the following:
- Allow users to report content (and/or other users) and follow your reporting flows to completion, including outcomes. Note: if your outcomes include banning users, you can just simulate that with messages; we won’t be banning from the 152 server in the interest of keeping it functional.
- Allow moderators to moderate reported content using your moderation flow and implement the outcomes to their completion (with the exception of banning users, which you can simulate with a message).
A note on sensitive content: the TAs can see all messages sent in your group channels, but we won’t be actively monitoring them. In order to properly do this milestone, you may have to engage with perverse and hateful content; that is the nature of this kind of abuse-fighting. It is important that you test your bots, but this is not an excuse to behave inappropriately to other students, so make sure that it is clear to your team when you are testing and don’t actually target individuals in a way that might be emotionally harmful. Please take care of yourselves and each other, and reach out to the teaching team if you’re finding that this work places any undue emotional burden.
For this milestone we have given you the skeleton of a moderation bot with the following capabilities:
- Automatically forward every message to the mod channel
- Allow users to report messages from the main channel (reports are initiated via DMs)
Please set up your bot within the first week of the assignment going out so we can catch any potential problems.
The TAs will be ready to answer any questions and help debug during section!
For this part of the milestone you will be submitting a short (up to 6 minute) video demoing all of the functionality of your Discord bot as well as talking through your edge cases. Be sure to begin with a clear description of the context you’ve chosen for your channels.
Below are some resources we think might be useful to you for this part of the milestone.
Here is the documentation for discord.py, a community-maintained Python package for writing Discord bots. It’s very thorough and fairly readable; this plus Google (in addition to the TAs) should be able to answer all of your functionality questions!
Discord bots frequently use emoji reactions as a quick way to offer users a few choices - this is especially convenient in a setting like moderation when mods may have to make potentially many consecutive choices. Check out on_raw_reaction_add() for documentation about how to do this with your bot. You also might want to look into on_raw_message_edit() to notice users editing old messages.
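As a rough illustration of the reaction-based approach, the sketch below keeps the decision logic separate from Discord itself so it can be tested without a running bot. The emoji choices, action names, and the action_for_reaction helper are all hypothetical. In discord.py, the on_raw_reaction_add(payload) handler receives a payload with emoji and user_id attributes, so a handler could pass str(payload.emoji) and payload.user_id into a function like this one.

```python
# Hypothetical mapping from a moderator's reaction emoji to an action name.
REACTION_ACTIONS = {
    "✅": "dismiss_report",
    "❌": "remove_message",
    "🚫": "simulate_ban",
    "🔼": "escalate",
}


def action_for_reaction(emoji, reactor_id, bot_id):
    """Return the action name for a reaction, or None if it should be ignored."""
    if reactor_id == bot_id:
        # The bot often adds the choices itself; those are not decisions.
        return None
    # Emoji outside the mapping are ignored rather than treated as errors.
    return REACTION_ACTIONS.get(emoji)
```

Ignoring the bot's own reactions matters if the bot seeds each mod-channel message with the available choices, since those seed reactions would otherwise trigger actions.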
Discord offers “embeds” as a way of getting a little more control over message formatting. Read more about them in this article or in the docs.
unidecode and uni2ascii-janin are two packages which can help with translating Unicode characters to their ASCII equivalents.
In approximately 500 words, summarize:
- Your abuse type
- An explanation of the user reporting flow and behind-the-scenes moderator flow, and the rationale for the decisions you made
For this milestone your group will be making your very own Discord bot. Discord bots are implemented in Python (or JavaScript) - don’t stress if you haven’t written Python before! It’s a pretty readable language, so you should be able to pick it up as you go, and the TAs are always here to help.
If you’re not familiar with Discord, that’s totally okay! Check out this short video which overviews Discord’s features and quirks.
First, every member of the team (both CS and POLISCI) should join the Discord server using this invite link: [insert link here]
Discord can be used in your web browser, although most people prefer the thick client apps.
For the next two milestones, you and your group will have two channels to test and develop your bot in: group-#, and group-#-mod, where # is your group’s number. We will give you and your bot a special role such that only you and the staff can see those channels; that way, everyone will have their own small workspace. To get the role for your group, click on the TA Bot user to bring up this window. Type in: .join # where # is replaced by your group number.
If all goes according to plan, you should receive a message back saying that you have been given a role corresponding to your group number and you should see a new role on your user in the server.
Additionally, you should be able to see two new channels under one of the “Group Channels” categories.
If you accidentally join the wrong group, just message the TA Bot .leave # to have the role removed and leave those channels. Please let [TA] know if something goes awry in this process! Note: only ONE student per group should follow the rest of these steps.
Fork and clone the GitHub repository here. For instructions on how to fork a GitHub repo, see this article. In order for your group to be able to collaborate effectively on this project, we recommend you create a shared GitHub repository; when you do, make sure you use the .gitignore file included in the starter code so that you don’t accidentally upload your tokens to GitHub. Our GitHub repository already has tokens.json in its .gitignore file. When you clone your project from there, you will have to create your own tokens.json file in the same folder as your bot.py file. The tokens.json file should look like this, replacing the “your key here” with your key. In the below sections, we explain how to obtain Discord keys.
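To make the role of tokens.json concrete, here is a minimal sketch of how a bot might read its token at startup. It assumes the file is a JSON object with a "discord" key, which is what the setup steps suggest; the starter code's actual file may hold other keys, and load_discord_token is a hypothetical helper name, not a function from the starter code.

```python
import json
from pathlib import Path


def load_discord_token(path="tokens.json"):
    """Read the Discord token from a JSON file kept out of version control."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"{path} not found; create it next to bot.py")
    with token_file.open(encoding="utf-8") as f:
        tokens = json.load(f)
    # Assumed layout: {"discord": "<token>"}; adjust if your file differs.
    token = tokens.get("discord", "")
    if not token or token == "your key here":
        raise ValueError('tokens.json needs a real token under the "discord" key')
    return token
```

Failing early with a clear message is useful here because a missing or placeholder token otherwise shows up later as a less obvious login error.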
The first thing you’ll want to do is make the bot. To do that, log in to https://discord.com/developers and click “New Application” in the top right corner.
Name your application Group # Bot, where # is replaced with your group number. So, for instance, Group 0 would name their bot like so:
It is very important that you name your bot exactly following this scheme; some parts of the bot’s code rely on this format.
- Next, you’ll want to click on the tab labeled “Bot” under “Settings.”
- Click “Copy” to copy the bot’s token. If you don’t see “Copy”, hit “Reset Token” and copy the token that appears (make sure you’re the first team member to go through these steps!)
- Open tokens.json and paste the token between the quotes on the line labeled “discord”.
- Scroll down to a region called “Privileged Gateway Intents”.
- Tick the options for “Presence Intent”, “Server Members Intent”, and “Message Content Intent”, and save your changes. See the image for what it should look like.
An aside: It’s unsafe to embed API keys in your code directly. If you put that code on GitHub, then anyone could find and use that key! (GitHub actually tries to detect code like this and forbids programmers from uploading it.) That’s why we’re storing them in a separate file which can be ignored by version control software.
Next, we’ll add the bot to the 152 Discord server! You’ll need to generate a link that the teaching team can use to invite your bot.
- Click on the tab labeled “OAuth2” under “Settings”.
- Click the tab labeled “URL Generator” under “OAuth2”.
- Check the box labeled “bot”. Once you do that, another area with a bunch of options should appear lower down on the page.
- Check these permissions, then copy the link that’s generated.
Send that link to any of the TAs via Discord (or by email) - they will use it to add your bot to the server. Once they do, your bot will appear in the #general channel and will be a part of the server!
Note that these permissions are just a starting point for your bot. We think they’ll cover most cases, but it’s entirely possible you’ll run into cases where you want to be able to do more. If you do, you’re welcome to send updated links to the teaching team to re-invite your bot with new permissions.
First things first, the starter code is written in Python. You’ll want to make sure that you have Python 3 installed on your machine. You can edit the code in a text editor of your choice.
Once you’ve done that, open a terminal in the same folder as your bot.py file. (If you haven’t used your terminal before, check out this guide!)
You’ll need to install some libraries if you don’t have them already, namely:
    python3 -m pip install requests
    python3 -m pip install discord.py
Next up, let’s take a look at what bot.py already does. To do this, run bot.py and leave it running in your terminal. Next, go into your team’s private group-# channel and try typing any message. You should see something like this pop up in the group-#-mod channel:
The default behavior of the bot is that any time it sees a message (from a user), it sends that message to the moderator channel with no possible actions. This is obviously not the final behavior you’ll want for your bot - you should update this to match your report flow. However, the infrastructure is there for your bot to automatically flag messages and (potentially) moderate them somehow.
Next up, click on your app in the right sidebar under “Online” to begin direct messaging it (or click on its name). First of all, try sending “help”. Try following its instructions from there by reporting a message from one of the channels to get a sense for the reporting flow that’s already built out for you. (Make sure to only report messages from channels that the bot is also in.)
If you look through the starter code, you’ll see the beginnings of the reporting flow that are already there. It will be up to you to build that out in whatever way your group decides is best. You’re welcome to edit any part of the starter code you’d like if you want to change what’s already there - we encourage it! This is just meant to be a starting point that you can pattern match off of.
If you’re not familiar with Python and asynchronous programming, please come to a section for an introduction. The TAs are happy to walk you through the starter code and explain anything that’s unclear.
If you’re seeing this error, it probably means that your terminal is not open in the right folder. Make sure that it is open inside the folder that contains bot.py and tokens.json. You can check this by typing in ls and verifying that the output looks something like this:
    ls
    bot.py tokens.json
Discord has a slight incompatibility with Python 3 on Mac. To solve this, navigate to your /Applications/Python 3.6/ folder and double click the Install Certificates.command. Try running the bot again; it should be able to connect now. If you’re still having trouble, try running a different version of Python (e.g. use the command python3.7 or python3.8) instead. If that doesn’t work, come to section and we’ll be happy to help!
This is an issue with the version of Discord API that is installed. Try the following steps:
- Run pip install --upgrade discord.py in the terminal, in the project folder that contains this file.
- If that does not work, try changing the line in bot.py that says intents.message_content = True to intents.messages = True.
Percent of Final Grade: 30%
Deliverables and due date:
- Poster, presented in person, on Wednesday, June 7 from 6-8pm. Your poster should be completely set up by 5:30pm. You can arrive as early as 5:00pm. You should have a video showing your bot’s functionality that you can play for judges. We recommend having this video on a tablet, as the space lacks tables and outlets.
- A PDF of your final poster and all code files are due on Tuesday, June 6 at 11:59pm PT.
Submission: Please have one group member upload the poster PDF and video to the CS 152 Canvas site. Your code files should be in the GitHub repository submitted for Milestone 2.
Please fill out the work distribution survey to assess how equally different members of each group contributed to the project. We reserve the right to change grades based on the results.
Your bot will be responsible for handling both manual reports from users (which you implemented in Milestone 2) as well as automatically detecting and flagging abusive content (the primary goal for this Milestone). This will include finding or collecting a dataset of examples to use to evaluate the efficacy of your solutions.
Here are some examples of what we envision you accomplishing for this milestone. Please note that this is not a checklist of things you must accomplish, just ideas.
- Training a classifier
- Using your collected dataset to test a few publicly available packages or APIs and noting their pros/cons (we have provided instructions on how to use a few different APIs at the bottom of this document)
- Designing and building a robust backend for logging and maintaining per-user statistics
- Building a framework by which communities can specify their own regex-like rules for content they don’t want to see
- Creating a tool that automatically detects and visits all links in a piece of text to see if they host undesirable content
- Building a system by which users can outsource unwanted content to their friends for review
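As one illustration of the community-rules idea in the list above, the sketch below compiles user-supplied regular expressions and reports which rules a message matches. It is only one possible shape for such a framework: the Rule fields, the action names, and the example patterns are hypothetical, and a real tool would also want to tell rule authors when a pattern fails to compile rather than skipping it quietly.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    action: str  # hypothetical action names such as "flag" or "remove"


def compile_rules(raw_rules):
    """Compile (name, regex, action) triples, skipping patterns that fail to compile."""
    rules = []
    for name, pattern, action in raw_rules:
        try:
            rules.append(Rule(name, re.compile(pattern, re.IGNORECASE), action))
        except re.error:
            continue  # invalid community-supplied regex is dropped in this sketch
    return rules


def match_rules(rules, text):
    """Return (name, action) for every rule whose pattern occurs in the text."""
    return [(r.name, r.action) for r in rules if r.pattern.search(text)]


# Illustrative rule set; the last entry is deliberately invalid.
community_rules = compile_rules([
    ("invite-links", r"discord\.gg/\w+", "flag"),
    ("crypto-giveaway", r"free\s+(bitcoin|crypto)", "remove"),
    ("broken-rule", r"(unclosed", "flag"),
])
```

Calling match_rules(community_rules, message_text) on each incoming message gives the bot a list of triggered rules that it can forward to the mod channel alongside the message.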
To handle multiple languages, you can use packages like google_trans_new to automatically translate everything to English or langdetect for language detection (make sure to rate limit yourself where necessary).
You should develop and implement a strategy for evaluating the efficacy of your back-end solution. If you trained a classifier or utilized external APIs, this might look like utilizing your dataset to generate a confusion matrix and figuring out whether your model is over- or under-sensitive and what kinds of problems this might cause at scale. If your solution is more user or design focused, you could conduct further user studies; for instance, you could invite friends to interact with the bot to assess the additional functionality and identify cases in which your design is clunky or might not scale well.
- What scenarios does your bot handle effectively?
- What scenarios does your bot not handle as effectively?
- What explains your bot’s strengths and weaknesses?
- With more time and resources, what would be some of your next steps?
A key final deliverable for this milestone will be a poster that you’ll display and discuss with your platform’s “executives” (the teaching team as well as guest industry judges who will stop by your poster during the poster session). You will likely want to explain how your platform currently handles your specific type of abuse (your back-end solution), and address its strengths and shortcomings, leading to clear and specific recommendations for how the platform should move forward. You will also want to answer questions from the “executives.” We encourage you all to think creatively about how to communicate your work! We encourage your group to have a prepared 5 minute pitch, and to make sure all group members who are present have the chance to participate in the pitch.
Your poster should include:
- Problem Description
- Policy Language
- Technical Back-end
- Evaluation
- Looking Forward
More details on these components are below.
Give a short description of your group’s abuse type and victim profile. You can assume that people viewing your poster have a general awareness of your abuse type.
Create a written policy, specifically targeted at your abuse type, in the kind of language you have seen in community standards and terms of service. The policy should be fewer than 400 words and understandable by a normal user.
Discuss the original goals and final state of your back-end technology in more detail, explaining the work you did to build it and what its current capabilities are. If there are things you tried which didn’t make it into the final product, be sure to mention them here, along with the reasoning behind not including them. Provide a clear analysis of how well your group’s back-end technology accomplishes what it originally set out to do. Make sure to address both the successes and the shortcomings of your current solution. Discuss any negative unintended consequences you foresee and which users may be more affected by them. Try to think about what critics/stakeholders would say about your technology. Discuss what impact you believe implementation would have on platform safety. Discuss other engineering approaches that your group didn’t pursue but that you’d want to propose going forward.
You can print your poster however you’d like. The poster session will have corkboards where you can pin up your poster without a rigid backing. We recommend that your poster be 24”x36”. It should not be larger than that, as we have limited space.
Stanford Undergraduate Research has tips for creating good posters here. We encourage you to avoid having a lot of text on your poster; hit the points you need to make, but keep the poster readable.
You are going to want to show off your excellent reporting system to our judges, friends and family! Please record a short demo video that you can show to judges and narrate.
We encourage the political science student to provide strong support for the evaluation component of this milestone, for example by leading a rigorous qualitative evaluation of the bot to supplement a quantitative evaluation. The political science student could generate a typology of abuse manifestations, and evaluate how the bot performs on them. The political science student could recruit testers and do a deep dive into why the bot did or did not respond appropriately to the testers. These are just ideas; think creatively about this! We also encourage the political science student to lead the “policy language” section. The political science student could also assist with dataset creation.

Social Network Analysis Syllabus (Advanced Topic)

Mon, 27 Apr 2026 20:46:15 GMT

This course develops social network analysis (SNA) as both a set of quantitative methods and a lens for studying adversarial behavior, information spread, and influence in complex systems. Lectures progress from foundational concepts illustrated through animal behavior, through core computational methods, into applied analysis of information operations, and finally into simulation-based modeling of social dynamics.
For a broader overview of the Trust and Safety space, see the Trust & Safety class.
Lecture 1: Animal Sociality and SNA Fundamentals
Source: sna_animal_networks
Uses animal social networks as a politically neutral entry point to introduce core SNA vocabulary: nodes, edges, weighted networks, directed vs. undirected graphs, assortative mixing, and homophily ("birds of a feather flock together"). Case studies draw from marmot fieldwork and broader ethology literature. Establishes foundational intuitions before applying methods to human social media networks.
Lecture 2: Animal Network Robustness and Node Removal
Source: sna_animal_network_robustness
Explores what happens when individuals are removed from a network, strategically or randomly, using both animal and information network examples. Introduces node-level metrics (degree, betweenness centrality) and network-level metrics (connectedness, fragmentation). Applies these concepts to misinformation source rankings, illustrating how targeted interventions affect information flow. Lays groundwork for intervention analysis in later lectures.
Lecture 3: Community Detection
Source: sna_community_detection
Core methodology lecture. Covers clustering coefficient, modularity maximization, the Louvain algorithm, and CONCOR (convergence of iterated correlations). Framed around a "Locate Groups" report assignment. Students learn to identify cohesive subgroups and evaluate the quality of detected communities using modularity scores.
Lecture 4: Stance Detection via Label Propagation
Source: sna_stance_detection
Applied method that builds directly on community structure from Lecture 3. Introduces stance detection using hashtag-seeded label propagation over retweet networks. Covers how stance labels spread from users to hashtags and back, general label propagation algorithms, confidence calibration, and the choice between propagation strategies. Optional extension covers text-based stance detection.
Lecture 5: Information Operations
Source: sna_information_operations
Conceptual overview of adversarial social behavior using the BEND framework (Boost, Engage, Neutralize, Distort). Connects community structure and stance to coordinated inauthentic behavior. Discusses dynamic multi-agent scenarios in which adversarial actors attempt to shift population-level stance. Sets up Lecture 6's detection approach.
Lecture 6: Information Operations Detection
Source: sna_information_operations_detection
Technical case study of detection using paid link schemes and SEO manipulation as the adversarial domain. Covers how to identify coordinated link schemes, distinguish paid from organic linking, and use LLMs to label the political bias of news sites at scale (case study: Iranian news network). Draws on SEO network construction and classification methods introduced in Lectures 2 and 4.
Lecture 7: Social Influence Modeling
Source: sna_social_influence_modeling
Introduces agent-based modeling (ABM) as a complement to network analysis. Develops a co-evolutionary stance-influence model where network structure and agent opinions update simultaneously. Key findings: minority stances exhibit tipping points around 25% adoption; optimal confederates target local ego-networks rather than global hubs. Discusses validation against real data and recovery from polarized states.
Lecture 8: Information Diffusion and Population Modeling
Source: sna_population_modeling
Broadest lens in the course. Applies epidemiological-style population models (SEIRM, Friedkin social influence model) to information spread and opinion dynamics. Multi-agent scenarios allow virtual experiments: given an observed information environment, what policies lead a population toward a desired trajectory? LLMs are introduced as tools for constructing action distributions and translating between the agent-level model and the real information environment.
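For readers who want a concrete picture of the Friedkin social influence model named under Lecture 8, the sketch below iterates the update in its commonly cited Friedkin-Johnsen form, x_i(t+1) = a_i * sum_j w_ij * x_j(t) + (1 - a_i) * x_i(0). Whether the lecture uses exactly this form is an assumption; the function name and the three-agent example values are hypothetical and chosen only for illustration.

```python
def friedkin_johnsen(weights, susceptibility, initial, steps=50):
    """Iterate x_i(t+1) = a_i * sum_j w_ij * x_j(t) + (1 - a_i) * x_i(0).

    weights: row-stochastic influence matrix (each row sums to 1).
    susceptibility: a_i in [0, 1]; 0 means an agent never moves from x_i(0).
    initial: starting opinions x(0), also each agent's anchor.
    """
    n = len(initial)
    x = list(initial)
    for _ in range(steps):
        x = [
            susceptibility[i] * sum(weights[i][j] * x[j] for j in range(n))
            + (1 - susceptibility[i]) * initial[i]
            for i in range(n)
        ]
    return x


# Hypothetical three-agent network: agent 0 is stubborn, agents 1 and 2 are open.
example_weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.25, 0.25],
    [0.5, 0.25, 0.25],
]
example_opinions = friedkin_johnsen(
    example_weights, [0.0, 0.9, 0.9], [1.0, 0.0, 0.0]
)
```

The susceptibility values control how far each agent can drift from its starting opinion, which is what lets the model represent stubborn actors alongside persuadable ones.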
Slides
Zotero library

Adversarial Retrieval and LLMs Syllabus (Advanced Topic)

Mon, 27 Apr 2026 20:41:13 GMT

This course examines how large language models handle, and fail to handle, factual knowledge, and how adversaries exploit these failure modes in information retrieval and generation systems. Lectures are organized in four modules that move from internal model mechanics outward to ecosystem-level attacks.
For a broader overview of the Trust and Safety space, see the Trust & Safety class.
Before taking the class, read through the LLM background literature (considered pre-requisite)
Lecture 1: Memorization, Generalization, and Specialization in LLMs
Source: Memorization, Generalization, and Specialization in LLMs
Introduces the core tension at the heart of the course: LLMs memorize training data (enabling recall but risking privacy leakage and stale knowledge) while also generalizing (enabling zero-shot tasks but introducing hallucinations). Covers finetuning vs. zero-shot prompting on QA tasks, the VoLTA vision-language model as an extended case, and why retrieval-augmented generation (RAG) reduces memorization-driven errors. Establishes the vocabulary for lectures 2–4.
Lecture 2: LLM Hallucinations and Knowledge Conflicts
Source: LLM Hallucinations and Knowledge Conflicts
Deepens the hallucination picture by distinguishing faithfulness hallucinations (model contradicts its context) from factuality hallucinations (model contradicts the world). Introduces knowledge conflicts (situations where parametric knowledge, retrieved knowledge, and real-world facts diverge) and discusses how RLHF safety tuning interacts with faithfulness. Covers entity substitution frameworks and conflict-inducing dataset construction.
Lecture 3: Adversarial Adaptation in Information Systems
Source: Adversarial Adaptation In Information Systems
Broadens scope from model internals to the adversarial information ecosystem. Uses a "means, motives, and opportunities" framework to analyze how actors adapt content to manipulate search rankings, social platforms, and recommendation systems. Covers SEO manipulation, social bot adaptation, memorialization hacking, and the trustworthiness/pluralism tradeoff that constrains platform interventions. This lecture is the conceptual bridge between the model-focused material (Lectures 1–2) and the attack-focused material (Lecture 4).
Lecture 4: Adversarial Attacks on IR Systems
Source: Adversarial Attacks on IR Systems
Catalogues specific technical attacks against information retrieval systems: malicious text and image encoding, gradient-based multi-view topic attacks, poisoned corpus attacks, and RAG-specific poisoning. Applies the means/motives/opportunities framework to SEO attack vectors, including evidence that unreliable news sites are disproportionately linked by paid schemes. Covers the AREA (Adversarial REtrieval Attack) literature.
Tutorial: How Do I Make a Good Classifier? A Python-focused practical guide to binary classification: data collection, annotation and inter-rater reliability (Cohen's Kappa, Fleiss' Kappa, Krippendorff's Alpha), preprocessing, class imbalance handling, model selection, hyperparameter tuning, and evaluation (precision, recall, F1, ROC-AUC). Best delivered as a lab session after Lecture 2, when students have encountered hallucination/conflict classification tasks in context.
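As a small companion to the metrics the tutorial names, the sketch below computes precision, recall, F1, and Cohen's Kappa for binary labels using only the standard library. It is an illustrative sketch rather than the tutorial's own code: the function names are hypothetical, labels are assumed to be 0 or 1 with 1 as the positive class, and returning 1.0 when chance agreement is total is a choice made here for a case where kappa is formally undefined.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cohens_kappa(rater_a, rater_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label; a choice, see lead-in
    return (observed - expected) / (1 - expected)
```

Precision and recall describe the classifier against ground truth, while kappa describes how much two annotators agree beyond chance, so the two are usually reported at different stages of a project.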
AI Worksheets Extended worksheet collection supporting the readings and lectures. Covers alignment, reasoning, RAG, vision-language models, and adversarial scenarios.
Zotero
- Hallucinations + misinformation (SegSub)
- Typologies
- Generality vs specialization
- Need for RAG to stay up to date...
- Adversarial information retrieval
- Jailbreaks of agentic AI (benchmark papers, sudo rm -rf agent security)
- Ethics
See Advanced Topics/Adversarial Retrieval and LLMs/LLM background literature for primers on model design and training processes.

Readings:
Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Optional:
KTO: Model Alignment as Prospect Theoretic Optimization
Constitutional AI: Harmlessness from AI Feedback
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Understanding the Effects of RLHF on LLM Generalisation and Diversity

Readings:
Large Language Models Cannot Self-Correct Reasoning Yet
Chain-of-Thought Reasoning Without Prompting
Optional:
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Large Language Models are Zero-Shot Reasoners
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Large Language Models are Better Reasoners with Self-Verification

Readings:
On the Planning Abilities of Large Language Models - A Critical Investigation
Chain of Thoughtlessness? An Analysis of CoT in Planning
Recent Trends and Developments after O1, e.g.,
https://www.arxiv.org/abs/2409.13373
https://cdn.openai.com/o1-system-card.pdf
Optional:
On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks
On the Brittle Foundations of ReAct Prompting for Agentic Large Language Models

REALM: Retrieval-Augmented Language Model Pre-Training (v)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (v)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (v)
G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively

Internal State of an LLM knows when it’s Lying
A survey on Hallucinations in LLMs
Optional:
A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly

Readings:
CLIP: Learning Transferable Visual Models From Natural Language Supervision
Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions
Optional:
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Visual Instruction Tuning (LLaVA)
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Assessing T&S Policies

Mon, 27 Apr 2026 19:47:04 GMT

Consider the following controversial political ad examples that expose tensions, inconsistencies, and edge cases in platform policies. These cases are designed to:
Show where Policy A and Policy B give different results
Highlight situations where policies produce seemingly inappropriate or controversial outcomes
Reveal ambiguities and enforcement challenges

Policy definitions: Political Ad Policies

Case 1 (Devil Horns)
A political action committee creates an ad featuring a stylized image of the opposing candidate with:
Devil horns superimposed on their head
An American flag burning in the background
Text overlay: "Don't let 'Candidate X' destroy America"
Dramatic apocalyptic imagery (crumbling buildings, dark skies)
The ad is targeted to voters in swing states who have previously engaged with content about patriotism and national security.

Policy A (Meta-based) - Likely Outcome: ✅ APPROVED
Reasoning:
Stylized imagery and negative campaigning are explicitly allowed
Not a direct call for violence
Not voter suppression
Custom audience targeting is permitted
Dramatic imagery is allowed under free political speech
Controversy: Many would argue this ad demonizes the candidate and could contribute to hostile climate, but policy prioritizes political speech

Policy B (Google-based) - Likely Outcome: ✅ APPROVED
Reasoning:
False or misleading claims about opponents are allowed
Not direct violence incitement
Not voter suppression
Would be subject to basic targeting only (can't microtarget)
Controversy: Same concerns as Policy A; the ad is inflammatory but doesn't violate technical policy provisions

Note: Both policies approve this controversial ad.
The main difference is targeting precision.

Case 2 (Mental Illness)
A campaign creates an ad stating:
"Transgender activists are mentally ill and have no place teaching our children"
Features images of a transgender school teacher
Calls for legislation to ban transgender individuals from teaching positions
Text: "Vote 'Candidate X' to protect our kids"
The ad uses custom audience lists of parents with school-age children and targets users who have engaged with content about education policy.

Policy A (Meta-based) - Likely Outcome: ✅ APPROVED (as of January 2025)
Reasoning:
Post-January 2025 policy explicitly allows "allegations of mental illness or abnormality when based on gender or sexual orientation, given political and religious discourse about transgenderism"
Policy allows "content arguing for gender-based limitations of military, law enforcement and teaching jobs"
Falls under "religious beliefs" exception
Custom audience targeting permitted
Controversy: This represents a major policy shift. Previously would have been removed as hate speech. Now explicitly permitted despite concerns about:
Real-world violence against LGBTQ+ individuals
Creating hostile environment
Spreading misinformation about transgender identity

Policy B (Google-based) - Likely Outcome: ⚠️ UNCLEAR
Reasoning:
Google's policy doesn't explicitly address attacks on protected characteristics in political ads
Not technically voter suppression or violence incitement
Would likely be approved under "political speech" but enforcement uncertain
Controversy: Lack of clear policy guidance on hate speech in political context creates enforcement ambiguity

Note: Policy A explicitly permits this after January 2025 changes; Policy B is ambiguous.
This case highlights how recent policy changes normalized anti-LGBTQ+ content.

Case 3 (AI Deepfake)
An ad features a highly realistic AI-generated video showing:
The opposing candidate appearing to say: "I don't care about working-class families"
The candidate never actually said this
The ad includes a small disclaimer at the bottom: "This ad contains AI-generated content"
Targets working-class voters in specific zip codes using custom audience data

Policy A (Meta-based) - Likely Outcome: ✅ APPROVED (with disclosure)
Reasoning:
AI-generated content depicting false statements by candidates IS allowed
Advertiser checked the disclosure box for synthetic content
Small disclaimer present (even if hard to notice)
Custom audience targeting permitted
False claims about opponents are allowed
Controversy: Even with disclosure, many viewers may believe the video is real, especially if disclaimer is small or easy to miss

Policy B (Google-based) - Likely Outcome: ✅ APPROVED (with disclosure)
Reasoning:
Synthetic content must be disclosed - advertiser complied
False claims about opponents are allowed
Would use basic targeting only (less precise than Policy A)
Controversy: Same as Policy A - disclosure may be insufficient to prevent deception

Key Insight: Both policies allow deceptive AI content with minimal disclosure requirements.
The disclosure is technically compliant but may be functionally inadequate.

Case 4 (Immigration)
An ad shows:
Images of migrants at the border
Text: "Illegal immigrants are criminals poisoning our communities"
Dramatic music and red alert-style graphics
Call to action: "Vote to stop the invasion"
Uses lookalike audiences based on people who donated to immigration restriction groups

Policy A (Meta-based) - Likely Outcome: ✅ APPROVED (as of January 2025)
Reasoning:
Post-January 2025 policy allows attacks on immigrants
"Claims about immigrants' character or fitness for participation in society" now permitted
Not direct violence incitement
Lookalike audience targeting allowed
Falls under political discourse about immigration
Controversy:
Uses dehumanizing language ("invasion", "poisoning")
Research shows this rhetoric correlates with hate crimes against immigrants
Previously might have been restricted; now explicitly allowed

Policy B (Google-based) - Likely Outcome: ✅ LIKELY APPROVED
Reasoning:
Doesn't explicitly call for violence
Political speech about immigration policy
False/misleading claims allowed
Can only use basic targeting (not lookalike audiences)
Controversy: Same concerns about dehumanizing rhetoric

Key Insight: Both approve, but Policy A allows more precise targeting of receptive audiences.
The language is inflammatory but doesn't cross the line to direct violence incitement.

Case 5
An ad from a political activist group states:
"Patriots: Show up at polling places on Election Day"
"Document suspicious voters and ask them for their papers"
"Don't let them steal our election"
Features images of confrontations at polling places
Targets users who have engaged with election fraud conspiracy content

Policy A (Meta-based) - Likely Outcome: ❌ REJECTED
Reasoning:
Violates prohibition on voter intimidation
"Coordinated calls to interfere with voting or election processes" is prohibited
Could be interpreted as encouraging confrontation at polls
Likely flagged as attempting to suppress voting through intimidation
Notes: This should be removed, but enforcement may be inconsistent

Policy B (Google-based) - Likely Outcome: ❌ REJECTED
Reasoning:
Explicitly violates prohibition on "content encouraging others to interfere with democratic processes"
Telling people to confront voters at polling locations is specifically called out as prohibited
"Calls to incite physical conflict... at polling locations to deter voting" is banned
Notes: Clear policy violation

Key Insight: Both policies reject this. However, subtle variations (e.g., "observe" instead of "confront") might create grey areas.

Case 6 (False Victory)
On election night, before votes are fully counted, a candidate runs ads stating:
"Victory! I've won the election!"
"Despite fake news media, the real numbers show I won"
"Don't believe the lying media when they say I lost"
Uses all available targeting to reach maximum audience

Policy A (Meta-based) - Likely Outcome: ⚠️ LIKELY REJECTED (but historically inconsistent)
Reasoning:
"Premature victory claims" before official certification are prohibited
However, claiming election fraud/stolen election is now allowed (fact-checking removed)
Ambiguous whether "real numbers show I won" crosses the line
Controversy:
Enforcement has been highly inconsistent
Similar content appeared in 2020 and 2024 elections
Policy was supposed to prevent this but failed in practice

Policy B (Google-based) - Likely Outcome: ⚠️ UNCLEAR
Reasoning:
Previously would have been removed
As of June 2023, false claims about election outcomes are no longer prohibited
However, "premature" claims might still violate voter suppression rules
Policy now prioritizes "free expression" over accuracy
Controversy: The 2023 policy rollback makes enforcement unclear

Key Insight: Both policies have been weakened on election misinformation.
What was once clearly prohibited is now in a grey area.

Case 7 (Ethnic Attack)
In a local election with large immigrant population, an ad states:
"[Ethnic Group] voters are destroying our neighborhood"
"They don't share our values and shouldn't have a say in our community"
Shows unflattering images of people from that ethnic group
Uses custom audience targeting to reach residents of specific neighborhoods

Policy A (Meta-based) - Likely Outcome: ⚠️ AMBIGUOUS - Likely Rejected but Uncertain
Reasoning:
Race and ethnicity are still protected characteristics under base policy
However, immigration status-based attacks are now allowed
Ambiguous whether this is "ethnic" discrimination (prohibited) or "immigration" discourse (allowed)
If framed as "immigrants" rather than ethnic group, might be approved
Enforcement would depend on exact wording
Controversy:
The line between ethnic discrimination and immigration discourse is blurry
Ads can be reworded to exploit this ambiguity
Custom targeting makes this especially harmful to specific communities

Policy B (Google-based) - Likely Outcome: ⚠️ UNCLEAR
Reasoning:
No explicit policy on ethnic/racial attacks in political ads
Not technically voter suppression
Not direct violence incitement
Likely would depend on human review
Controversy: Lack of clear policy creates inconsistent enforcement

Key Insight: Both policies have ambiguities around ethnic attacks that aren't direct violence incitement.
Clever wording can exploit these gaps.

Case 8 (Groomers)
An ad attacking LGBTQ+ school board candidates:
"Stop the groomers from getting near our children"
Features photos of LGBTQ+ candidates with ominous music
Text: "They want to indoctrinate your kids"
Links LGBTQ+ identity with child predation
Targets parents using custom audience lists from school districts

Policy A (Meta-based) - Likely Outcome: ⚠️ AMBIGUOUS (Likely Approved post-2025 changes)
Reasoning:
Post-January 2025, allows content about LGBTQ+ individuals and their "fitness" for roles involving children
"Mental illness or abnormality" allegations allowed when based on sexual orientation
Might be approved as "political discourse"
However, "groomer" rhetoric has been linked to violence against LGBTQ+ people
Enforcement highly uncertain
Controversy:
"Groomer" is a slur falsely associating LGBTQ+ people with pedophilia
This rhetoric has preceded real-world attacks on LGBTQ+ individuals
Policy changes may have emboldened this type of content

Policy B (Google-based) - Likely Outcome: ⚠️ UNCLEAR
Reasoning:
No explicit policy on LGBTQ+ attacks in political context
Not technically violence incitement (though may inspire violence)
Would likely depend on individual review
Controversy: Dangerous rhetoric in policy grey area

Key Insight: Policy A's 2025 changes may have opened door to this type of content.
Both policies struggle with indirect incitement to violence.

Case 9 (Microtargeting)
An advertiser creates dozens of variations of an ad, each tailored to specific audiences:
To Latino voters: "Your opponent wants to deport your family"
To white rural voters: "Your opponent is giving your jobs to immigrants"
To Black voters: "Your opponent supports policies that target your community"
Each version uses custom audiences, voter file data, and interest targeting
No single message seen broadly; each group sees different (contradictory) claims

Policy A (Meta-based) - Likely Outcome: ✅ APPROVED
Reasoning:
Custom audience targeting explicitly allowed
False/misleading claims about opponents permitted
Each individual ad doesn't violate policy
Ability to show different messages to different groups is a feature, not a bug
Controversy:
Prevents public accountability (no one sees all messages)
Allows contradictory claims to different audiences
Makes fact-checking nearly impossible
Enables targeted manipulation
This is exactly what Cambridge Analytica did

Policy B (Google-based) - Likely Outcome: ⚠️ PARTIALLY PREVENTED
Reasoning:
Microtargeting is prohibited, limiting precision
Can still target different geographic regions or basic demographics
False claims allowed but less precisely targeted
More likely that contradictory messages would be noticed
Controversy: Basic targeting still allows some audience segmentation with different messages

Key Insight: This case highlights the danger of microtargeting + lack of truthfulness requirements. Policy A enables this; Policy B partially prevents it.

Case 10 (Fraud Hotline)
An ad encourages viewers to:
"Report suspected voter fraud to our hotline"
"If you see something suspicious at polls, document it"
Shows examples of "suspicious behavior" that are actually normal (people helping elderly voters, non-English speakers, etc.)
Provides a phone number
Doesn't explicitly call for confrontation

Policy A (Meta-based) - Likely Outcome: ⚠️ AMBIGUOUS
Reasoning:
Doesn't explicitly call for confrontation or intimidation
Could be framed as "election integrity" efforts
However, showing normal behavior as "suspicious" could suppress voting
"Coordinated calls to interfere" might apply
Enforcement would be inconsistent
Controversy: Chilling effect on legitimate voters without explicit intimidation

Policy B (Google-based) - Likely Outcome: ⚠️ AMBIGUOUS
Reasoning:
Not technically "instructing" interference
Could be argued as voter education (though misleading)
Doesn't explicitly tell people to create long lines or confront voters
Grey area between observation and intimidation
Controversy: Same as Policy A - subtle intimidation that doesn't explicitly violate policy

Key Insight: Both policies struggle with subtle forms of voter intimidation that don't explicitly call for confrontation.

Case 2 (Mental Illness): Policy A explicitly permits (post-2025); Policy B unclear
Case 9 (Microtargeting): Policy A enables; Policy B restricts
Case 4 (Immigration): Both approve but Policy A allows more precise targeting

Case 1 (Devil Horns): Both approve inflammatory demonization
Case 3 (AI Deepfake): Both allow with minimal disclosure
Case 6 (False Victory): Both weakened enforcement on election misinformation
Case 8 (Groomers): Both struggle with indirect violence incitement

Case 7 (Ethnic Attack): Depends on exact wording
Case 10 (Fraud Hotline): Subtle intimidation in grey area

Policy Effectiveness: Which cases show where truthfulness requirements would matter most?
Targeting vs. Content: Is it worse to have precise targeting of harmful content (Policy A) or broad distribution of harmful content (Policy B)?
Recent Changes: How do Policy A's January 2025 changes affect marginalized communities? What's the trade-off between "free speech" and safety?
Enforcement Gaps: Which cases reveal that written policies don't match actual enforcement?
Indirect Harm: How do policies handle content that doesn't directly incite violence but creates conditions for violence?
Microtargeting: Why is Case 9 particularly problematic? How does it undermine democratic discourse?
AI Content: Is disclosure sufficient for AI-generated content, or do we need stronger restrictions?
Protected Characteristics: Should political speech exceptions exist for attacks on protected groups? Why or why not?

✅ Classify cases according to platform policies
✅ Identify borderline cases with ambiguous determinations
✅ Diagnose how and why policies break down
✅ Criticize automated content moderation on edge cases
✅ Propose measurement strategies for tracking failures
✅ Propose alterations to address borderline cases

Activity 1: Set It Up (LO1 - Classification)
Use Cases 1, 3, 5 (clear outcomes)
Students classify: Approve/Reject under each policy
Use clickers for real-time feedback

Activity 2: Think-Pair-Share (LO2 - Borderline Cases)
Use Cases 2, 7, 8, 10 (ambiguous)
Pairs discuss and justify their determinations
Compare reasoning

Activity 3: Contrasting Cases (LO3 - Policy Diagnosis)
Compare how Policies A and B handle Cases 2, 4, 9
Identify: ambiguity, binary classification issues, false positives/negatives
Small group discussion on different policy failures

These cases are inspired by actual ads that ran on Meta and Google platforms:
Case 1: Based on actual anti-Harris ads (2024)
Case 2: Based on anti-trans political ads (2023-2025)
Case 3: Common deepfake concern across platforms
Case 4: Common immigration ad rhetoric
Case 8: "Groomer" rhetoric seen in 2022-2024 school board races
Case 9: Cambridge Analytica-style tactics

Policy ≠ Enforcement: Written policies often fail in practice
Grey Areas: Most controversial content lives in ambiguous spaces
Harm Beyond Violence: Indirect harm is real but harder to regulate
Recent Backsliding: Both platforms weakened protections (2023-2025)
Targeting Amplifies Harm: Same content is worse with precise targeting

Course: CSPedagogy / Trust & Safety
Related: old/Trust & Safety Class Old/Quizzes/Political Ad Policies, Active Learning Resources

Political Ad Policies

Mon, 27 Apr 2026 19:44:54 GMT

This document defines two platform policies for political advertising, based on real-world approaches from major social media platforms. These policies will be used for active learning exercises on content moderation and trust & safety.

Policy A (Meta-based)

Advertiser Verification: All political advertisers must complete identity verification, providing government-issued ID and proof of location
Disclosure Requirements: All political ads must include a "Paid for by __ " disclaimer
Ad Library: All political ads stored in publicly accessible archive showing:
Ad content
Who paid for the ad
Amount spent
Targeting parameters used

Targeting:
Custom Audiences: Advertisers may upload their own customer lists to target specific individuals
Lookalike Audiences: Allowed - can target users similar to existing supporters
Geographic Targeting: Full geographic targeting available (city, state, region)
Demographic Targeting: Age, gender, and basic demographic targeting permitted
Interest-Based Targeting: Limited - advertisers may target based on general interests but NOT based on specific political, religious, or health-related content users have accessed on the platform
Exclusions: NOT allowed - advertisers cannot exclude specific groups or audiences with opposing interests (as of January 2025)

Allowed content:
Negative campaign ads criticizing opponents' policies or record
False or misleading claims about political opponents
Dramatic or stylized imagery (e.g., apocalyptic scenes, unflattering photo manipulation)
AI-generated content IF DISCLOSED - must check box indicating synthetic/digitally altered content that depicts:
A person saying/doing something they didn't do
Realistic-looking people or events that don't exist
Altered footage of real events

Prohibited content:
Voter Suppression: False information about where, when, or how to vote
False Eligibility Claims: Misleading information about who can vote
Premature Victory Claims: Calling election results before official certification
Direct Violence Incitement: Content that encourages violence against:
Election workers
Candidates
Voters
Any individuals at polling locations
Dangerous Organizations: Glorification or support of designated terrorist organizations or hate groups
Voter Intimidation: Coordinated calls to interfere with voting or election processes

Protected Characteristic Exceptions: While general attacks on protected characteristics are prohibited, the following ARE ALLOWED when based on religious or political beliefs:
Allegations that LGBTQ+ individuals are "mentally ill" or "abnormal"
Arguments for gender-based limitations in military, law enforcement, and teaching positions
Arguments for sexual orientation-based limitations in the same professions when based on religious beliefs
Claims about immigrants' character or fitness for participation in society

Enforcement:
Automated systems screen ads before publication
Community reporting available
Human review for flagged content
Non-compliance results in ad disapproval and potential account suspension

Policy B (Google-based)

Advertiser Verification: All political advertisers must complete Election Ads verification process
Disclosure Requirements: All political ads must include in-ad "Paid for by [Name]" disclosure
Visual ads: Disclosure must be visible at all times and sufficiently large for average viewer
Audio ads: Disclosure must be similar in pitch, tone, and speed to rest of ad
Transparency Report: All election ads published in Political Advertising Transparency Report with:
Ad content
Who paid for the ad
Amount spent
Targeting parameters (limited)

Targeting:
Custom Audiences: NOT allowed for granular political targeting
Microtargeting: Explicitly PROHIBITED - never allowed
Basic Political Targeting: Only the following permitted:
Public voter records
General political affiliations (left-leaning, right-leaning, independent)
Geographic Targeting: Allowed (but limited in precision)
Search-Based Targeting: Ads may appear in response to user search queries
Interest-Based Targeting: Very limited - only broad categories, no granular interests
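
The targeting rules of the two policies can be compared side by side as data. The Python sketch below is one hypothetical encoding: the key names, the three status labels, and the choice to treat any method a policy does not list as blocked are assumptions made for illustration, and are not part of either platform's policy text or API.

```python
# Status labels are a simplification chosen for this sketch (assumption).
ALLOWED, LIMITED, PROHIBITED = "allowed", "limited", "prohibited"

# Targeting methods as summarized in the policy text above.
TARGETING_RULES = {
    "A": {
        "custom_audience": ALLOWED,
        "lookalike_audience": ALLOWED,
        "geographic": ALLOWED,
        "demographic": ALLOWED,
        "interest_based": LIMITED,
        "audience_exclusion": PROHIBITED,
    },
    "B": {
        "custom_audience": PROHIBITED,
        "microtargeting": PROHIBITED,
        "basic_political": ALLOWED,
        "geographic": LIMITED,
        "search_based": ALLOWED,
        "interest_based": LIMITED,
    },
}


def blocked_methods(policy: str, requested: list[str]) -> list[str]:
    """Return the requested targeting methods that the policy does not permit.

    Methods the policy summary never mentions are treated as blocked,
    which is an assumption of this sketch rather than a stated rule.
    """
    rules = TARGETING_RULES[policy]
    return [m for m in requested if rules.get(m, PROHIBITED) == PROHIBITED]


if __name__ == "__main__":
    # A request that combines an uploaded customer list with geographic targeting.
    request = ["custom_audience", "geographic"]
    for policy in ("A", "B"):
        print(policy, blocked_methods(policy, request))
```

A table like this only captures the targeting dimension; content rules and their exceptions depend on wording and context and do not reduce to a lookup.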
Allowed content:
Negative campaign ads criticizing opponents' policies or record
False or misleading claims about political opponents' positions or record
Search ads responding to political queries
Display ads on partner websites
Video ads on platform

Prohibited content:
Voter Suppression:
False information about voting methods (e.g., "text your vote to this number")
Made-up voter eligibility requirements
Misleading information about where, when, or how to vote
False Candidate Eligibility Claims:
False claims that candidates are deceased
False claims about age or citizenship eligibility
Interference with Democratic Processes:
Instructions to create long voting lines to deter others
Instructions to hack government websites
Calls to incite physical conflict at polling locations
Manipulated Content Creating Serious Risk of Harm:
Technically manipulated content making government officials appear to say/do things they didn't
Old footage falsely presented as current events
Fabricated events creating serious risk of egregious harm
Direct Violence Incitement: Content encouraging violent acts against:
Election workers
Candidates
Voters

Synthetic Content (must be disclosed):
AI-generated or digitally altered content depicting people saying/doing things they didn't do
Synthetic content creating realistic portrayals of events that didn't happen

Not prohibited:
False claims about election outcomes (e.g., "the 2020 election was stolen")
General election misinformation that doesn't directly suppress votes

Enforcement:
Automated screening before ad approval
Human review for verification process
Ads must comply with all policies to run
Violations result in ad disapproval
Repeated violations may result in loss of verification status

Which policy provides more protection against targeted manipulation of voters?
Which policy is more permissive regarding hate speech in political contexts?
How do the targeting restrictions in Policy B affect the ability of smaller campaigns to reach specific audiences?
What are the trade-offs between free political speech and preventing harm under each policy?
How might enforcement challenges differ between these two policies?
Which policy better addresses the risk of violence incitement, and why?

These policies are simplified versions based on:
Policy A: Meta's U.S. political ads policy (as of 2025, post-January policy changes)
Policy B: Google/YouTube's political ads policy (as of 2025)

Key sources:
Meta Transparency Center: Political Advertising policies
Google Ads Policy Help: Political Content policy
Documented enforcement challenges and policy changes (2023-2025)

Both platforms banned political ads in the EU (October 2025) in response to TTPA regulation.

Created: 2025-10-22
Last Updated: 2025-10-22
Course: CSPedagogy / Trust & Safety

Pitfalls of Binary Classification

Mon, 27 Apr 2026 19:44:03 GMT

Policies: Trust & Safety Class/Quizzes/Political Ad Policies
Cases: Trust & Safety Class/Quizzes/Assessing T&S Policies

Classify easy examples of online harms according to a platform policy
Identify borderline cases where the determination of harm is ambiguous
Diagnose how and why policies begin to break down on these examples
(System) Criticize how automated content moderation mechanisms handle such cases
(System) Propose measurement strategies to track such failures
(System) Propose alterations to content moderation mechanisms to account for the identified borderline cases

Learning Objective #1: Enforcing platform policies
Strategy Name: "Set It Up" (problem solving process)
Description: consider a list of 3 example cases against a reference policy (given) to determine which is the obvious violation and which is not a violation
Expected Outcomes: Platform policies are understood as a process.
Assessment Method: Clickers

Learning Objective #2: Identifying edge cases
Strategy Name: think-pair-share
Description: discuss and action a series of borderline cases using the same policy
Expected Outcomes: Borderline case determination is improved by discussion
Assessment Method: Clickers

Learning Objective #3: Diagnosing policy issues
Strategy Name: contrasting cases
Description: given 2 example policies for the same issue, determine where and why each policy is problematic for the previous examples (ambiguity in policy, binary classification where multiple thresholds are needed, false positives, false negatives, ...)
Expected Outcomes: each small group may discuss a different set of issues
Assessment Method: Discussion

Active learning for system-based questions is left to the assignment.

Online Trust & Safety tackles a range of online harms. To decide which to cover, use Dotmocracy.

Trust & Safety is about people! Cases presented in the class could be simply portrayed using screenshots in a slide deck, but there is opportunity for more interaction.
Fishbowl method: like charades, have a small number of students act out certain cases, and the rest of the class makes the determination based on a platform policy. Extremely risky: have to be very careful about which topics, cases, and platform policies are chosen for this. Students must have the ability to choose another case or refuse to participate.

The lines drawn by platform policies are subject to intense decision-making processes (e.g. decisions to take down President Trump's social media accounts during the January 6th Capitol riot).
Structured debates on safety (public harm) vs censorship (freedom of speech)
Question: how do the decisions made by different platforms reflect the tradeoff between safety and censorship?

What active learning strategies to use for homework? How much time to set aside for this in practice?
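
Learning Objective #3 lists false positives, false negatives, and binary classification where multiple thresholds are needed as ways a policy can break down. The Python sketch below shows one possible way to count those failures. The scores and labels are invented for illustration, and the function names and threshold values are hypothetical, not taken from any platform's moderation system.

```python
# Hypothetical classifier scores in [0, 1] paired with ground-truth labels
# (True = the ad violates the policy). All values are made up for illustration.
SCORED_ADS = [
    (0.95, True), (0.80, True), (0.55, True), (0.50, False),
    (0.45, True), (0.30, False), (0.10, False), (0.05, False),
]


def binary_errors(ads, threshold):
    """Count false positives and false negatives for a single reject threshold."""
    fp = sum(1 for score, violates in ads if score >= threshold and not violates)
    fn = sum(1 for score, violates in ads if score < threshold and violates)
    return fp, fn


def three_way_errors(ads, approve_below, reject_at):
    """Approve below one threshold, reject at or above another, and route the
    band in between to human review instead of forcing a binary decision."""
    fp = sum(1 for s, v in ads if s >= reject_at and not v)
    fn = sum(1 for s, v in ads if s < approve_below and v)
    review = sum(1 for s, _ in ads if approve_below <= s < reject_at)
    return fp, fn, review


if __name__ == "__main__":
    print(binary_errors(SCORED_ADS, 0.5))          # (1, 1)
    print(three_way_errors(SCORED_ADS, 0.4, 0.6))  # (0, 0, 3)
```

On this toy data the single threshold misclassifies the two borderline ads on either side of 0.5, while the two-threshold rule makes no automated errors at the cost of sending three ads to human review. That trade-off is one concrete thing students could be asked to measure.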

02. sna_animal_network_robustness

Mon, 27 Apr 2026 18:41:28 GMT

03. sna_community_detection

Mon, 27 Apr 2026 18:39:35 GMT

04. sna_stance_detection

Mon, 27 Apr 2026 18:38:45 GMT

05. sna_information_operations

Mon, 27 Apr 2026 18:38:28 GMT

06. sna_information_operations_detection

Mon, 27 Apr 2026 18:37:42 GMT

07. sna_social_influence_modeling

Mon, 27 Apr 2026 18:36:43 GMT

08. sna_population_modeling

Mon, 27 Apr 2026 18:35:56 GMT

07. Media Influences

Mon, 27 Apr 2026 18:34:45 GMT

04. Detection and Discovery of Misinformation Sources

Mon, 27 Apr 2026 18:33:54 GMT

06. Credibility Pluralism Tradeoff

Mon, 27 Apr 2026 18:29:53 GMT

05. Misinformation Resilient Search Rankings

Mon, 27 Apr 2026 18:29:42 GMT

04. Adversarial Attacks on IR Systems

Mon, 27 Apr 2026 18:27:53 GMT

01. Memorization, Generalization, and Specialization in LLMs

Mon, 27 Apr 2026 18:27:42 GMT

02. LLM Hallucinations and Knowledge Conflicts

Mon, 27 Apr 2026 18:27:08 GMT

03. Adversarial Adaptation In Information Systems

Mon, 27 Apr 2026 18:26:40 GMT

06. Authentication, Identity, and Platform Manipulation

Mon, 20 Apr 2026 01:26:12 GMT

12. Types of Attack Surfaces

Mon, 20 Apr 2026 01:20:44 GMT

15. Emerging_Topics 3 - LLM Hallucinations and Knowledge Conflicts

Sun, 19 Apr 2026 15:33:00 GMT

14. Emerging_Topics 2 - Adversarial Retrieval

Sun, 19 Apr 2026 15:31:52 GMT

12. Adversarial Adaptation and the Limitations of Interventions

Sun, 19 Apr 2026 15:30:50 GMT

10. Source Credibility

Sun, 19 Apr 2026 15:28:51 GMT

11. Intervention Effectiveness Case Study - Misinformation and Search Rankings

Sun, 19 Apr 2026 15:20:05 GMT

LLM background literature

Sat, 18 Apr 2026 19:33:44 GMT

CSCE 689 LLMs: Course Readings, as shared by Maria Teleki

Parameter-Efficient Tuning, Compression
Readings:
LoRA: Low-Rank Adaptation of Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs
Optional:
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
DoRA: Weight-Decomposed Low-Rank Adaptation

Efficient inference
Readings:
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Fast Inference from Transformers via Speculative Decoding
Optional:
Flash-Decoding for long-context inference
Some of Andrej Karpathy’s GitHub repos
https://github.com/karpathy/nanoGPT
https://github.com/karpathy/llm.c
Some of Georgi Gerganov’s GitHub repos
https://github.com/ggerganov/llama.cpp
https://github.com/ggerganov/ggml

Model distillation
Readings:
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Distilling System 2 into System 1
Optional:
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
A Survey on Knowledge Distillation of Large Language Models
https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs

Data Efficiency in the Age of LLMs
References:
Sorscher, Ben, et al. "Beyond neural scaling laws: beating power law scaling via data pruning." NeurIPS (2022).
Abbas, Amro, et al. "Semdedup: Data-efficient learning at web-scale through semantic deduplication." arXiv preprint arXiv:2303.09540 (2023).
Sachdeva, Noveen, et al. "How to Train Data-Efficient LLMs." arXiv preprint arXiv:2402.09668 (2024).
Marion, Max, et al. "When less is more: Investigating data pruning for pretraining llms at scale." arXiv preprint arXiv:2309.04564 (2023).
Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
Xie, Sang Michael, et al. "Data selection for language models via importance resampling." NeurIPS (2023).
Engstrom, Logan, Axel Feldmann, and Aleksander Madry. "Dsdm: Model-aware dataset selection with datamodels." arXiv preprint arXiv:2401.12926 (2024).
Fadhel, et al. "Data pruning and neural scaling laws: fundamental limitations of score-based algorithms." TMLR (2023).
Guo, et al. "Deepcore: A comprehensive library for coreset selection in deep learning." arXiv preprint arXiv:2204.08499 (2022).

Tools, Agents, and MoE
Readings:
What Are Tools Anyway? A Survey from the Language Model Perspective
Toolformer: Language Models Can Teach Themselves to Use Tools
ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
Mixture-of-Agents Enhances Large Language Model Capabilities
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Optional:
Lilian’s Blog on LLM Powered Autonomous Agents
Visual Programming: Compositional Visual Reasoning Without Training
Great collection of papers on tools: https://github.com/zorazrw/awesome-tool-llm
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Long context, extending context
Readings:
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Optional:
Extending Context Window of Large Language Models via Positional Interpolation
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
TransformerFAM: Feedback attention is working memory

Scaling Laws
Readings:
Emergent Abilities of Large Language Models
Inverse scaling can become U-shaped
Optional:
Scaling Laws for Neural Language Models
Scaling Laws for Transfer
Training Compute-Optimal Large Language Models
Are Emergent Abilities of Large Language Models a Mirage?
Scaling Data-Constrained Language Models

Self-play
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Self-Rewarding Language Models

LLM Applications: Text mining, user modeling, …
TnT-LLM: Text Mining at Scale with Large Language Models
LLMs for User Interest Exploration in Large-scale Recommendation Systems
Optional:
Scaling Synthetic Data Creation with 1,000,000,000 Personas

VLM Part 2
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Model development

Transformers and New Directions (Linear Attention, Linear RNNs, State Space Models)
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Longformer: The Long-Document Transformer
Generating Long Sequences with Sparse Transformers
Linformer: Self-Attention with Linear Complexity
Efficiently Modeling Long Sequences with Structured State Spaces
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
RWKV: Reinventing RNNs for the Transformer Era
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Bias
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Semantics derived automatically from language corpora contain human-like biases
StereoSet: Measuring stereotypical bias in pretrained language models
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Diffusion Models
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Miscellaneous
Chess as a Testbed for Language Model State Tracking (Proceedings of the AAAI Conference on Artificial Intelligence)
Topic: How well can an LLM learn Conway’s Game of Life and be prompted to solve different tasks? For example: given an NxN grid, how well can it maximize still lifes, or, given the opportunity to modify the state of the system, minimize entropy while maximizing stable life? This could test how well LLMs reason with very simple rules and how far ahead they can predict a highly chaotic system. These kinds of questions are typically for mathematicians and very fast computers with lots of RAM.

13. Emerging_Topics 1 - AI in Trust and Safety

Sat, 18 Apr 2026 16:02:32 GMT

06. Harassment_and_Hate_Speech

Sat, 18 Apr 2026 16:02:28 GMT

05. Terrorism_Radicalization_and_Extremism

Sat, 18 Apr 2026 16:02:25 GMT

01. Define Misinfo (Consortium Information Environment)

Sat, 18 Apr 2026 16:02:21 GMT

08. Misinformation (information environment)

Sat, 18 Apr 2026 16:02:21 GMT

03. Metrics_and_Measurement

Sat, 18 Apr 2026 16:02:17 GMT

02. Content Moderation Overview (Consortium)

Sat, 18 Apr 2026 16:02:14 GMT

04. Content_Moderation

Sat, 18 Apr 2026 16:02:14 GMT

07. Government_Regulation

Sat, 18 Apr 2026 16:02:10 GMT

01. Introduction_to_Trust_and_Safety

Sat, 18 Apr 2026 16:02:06 GMT

02. Large Scale Trust & Safety Systems

Wed, 15 Apr 2026 18:37:31 GMT

How do I make a good classifier

Wed, 15 Apr 2026 18:12:52 GMT

Let’s do an example with binary classification.

1. Collect data (raw samples).
   - Tabular: pandas
2. Annotate data (multiple annotators → compute IRR to check reliability).
   - Annotators are the humans labeling your data (e.g., deciding whether an instance is positive or negative). Since humans can disagree, IRR (inter-rater reliability) measures how consistently annotators label the same items.
   - Common metrics:
     - Cohen’s Kappa → for two annotators, adjusts for agreement by chance. → sklearn.metrics.cohen_kappa_score
     - Fleiss’ Kappa → for multiple annotators. → statsmodels.stats.inter_rater
     - Krippendorff’s Alpha → general, supports missing labels and different data types. → krippendorff
   - High IRR means your annotators apply the labels consistently, so the labels are a more trustworthy basis for training a classifier.
3. Preprocess (tokenization, normalization, feature engineering, embeddings).
   - Text processing: nltk, spaCy
   - Basic vectorization: sklearn.feature_extraction.text (CountVectorizer, TfidfVectorizer)
   - Deep embeddings: transformers
4. Split into train/validation/test sets.
   - Tabular: pandas
5. Handle class imbalance (resample only the training data; use the validation set to tune hyperparams).
   - Downsampling → randomly reduce majority-class samples. Pros: balances quickly, smaller dataset. Cons: throws away information.
   - Upsampling → duplicate or synthetically generate minority-class samples (e.g., SMOTE). Pros: keeps all data, improves minority-class signal. Cons: may overfit (duplicates) or add artifacts (synthetic).
   - imbalanced-learn (imblearn): RandomUnderSampler, RandomOverSampler, SMOTE, ADASYN
6. Train a classifier on the training set (e.g., logistic regression, random forest, neural net, whatever).
   - Classic ML: scikit-learn (LogisticRegression, RandomForestClassifier)
   - Boosting: xgboost, lightgbm
   - Deep learning: pytorch, tensorflow
7. Tune the hyperparameters using the validation set.
8. Sweep the decision threshold on the validation set. (Strictly speaking this is a threshold sweep or sensitivity analysis; an ablation study removes or disables a component to measure its contribution.)
   - Classifiers output probabilities (e.g., P(y=1|x)). You pick a threshold to turn probabilities into binary predictions. Default = 0.5, but this may not be optimal.
   - Lower threshold → ↑ recall, ↓ precision. Higher threshold → ↑ precision, ↓ recall.
   - You can plot Precision-Recall curves, plot ROC curves (TPR vs FPR), and compare metrics at multiple thresholds to select the right trade-off (maximize F1, enforce recall, minimize false positives, etc.).
   - The best value totally depends on your problem.
9. AFTER all of this, evaluate on the test set using metrics robust to imbalance (precision, recall, F1, ROC-AUC). Leave the test set in its original class distribution (no up/downsampling the test set, leave it as is).
   - sklearn.metrics (precision, recall, F1, ROC-AUC; accuracy is also available there, but on its own it is misleading under imbalance)
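The inter-rater reliability check and the threshold comparison described above can be sketched in a few lines. This is a minimal, standard-library-only illustration, not a replacement for sklearn.metrics.cohen_kappa_score or the sklearn.metrics scoring functions. The helper names (cohen_kappa, precision_recall_f1) and all of the toy labels and probabilities are hypothetical and chosen only to make the trade-off visible.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # Undefined when p_chance == 1 (both annotators always use one label).
    return (p_observed - p_chance) / (1 - p_chance)

def precision_recall_f1(y_true, y_prob, threshold):
    """Binarize probabilities at `threshold`, then score against y_true."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels from two annotators on the same 8 items.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)

# Hypothetical validation labels and predicted P(y=1|x).
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.35, 0.1, 0.55, 0.05]
sweep = {t: precision_recall_f1(y_true, y_prob, t) for t in (0.3, 0.5, 0.7)}

print(f"kappa = {kappa:.3f}")
for t, (p, r, f1) in sweep.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, recall falls and precision rises as the threshold increases, which is the trade-off the threshold step asks you to pick a point on.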

AI Worksheets

Wed, 15 Apr 2026 18:11:59 GMT

The Fundamental Problem with Neural Networks - Vanishing Gradients
This is how to take your ML models from great to GOAT
This is why you should care about unbalanced data .. as a data scientist
What does it mean to subtract one distribution from another?
Gradient Descent : Data Science Concepts
Loss Functions : Data Science Basics
Curse of Dimensionality : Data Science Basics
The Softmax : Data Science Basics
The Sigmoid : Data Science Basics
Large Language Models explained briefly So an LLM is a mathematical function that is really good at predicting what word comes next for any piece of text. How does probability come into this? [answer goes here] What is backpropagation? Why do we need it? [answer goes here] LLMs are trained with “the goal of autocompleting a random passage of text from the internet” during pretraining. How is this different from RLHF? [answer goes here] Why are GPUs helpful? [answer goes here] Explain the first step shown above: [answer goes here] Write a list of questions you still have during/after watching this video: [answer goes here]
https://mariateleki.github.io/pdf/CAFE-Talk.pdf (these are slides from one of my talks and it was to a vet school so forgive my AI example pls, also skip slides 73 to the end) How do we represent words with numbers? [answer goes here] Why do we have multiple dimensions in neural networks? [answer goes here] Why are LLMs biased? [answer goes here, hint, see slides 66-67]
https://www.youtube.com/shorts/FJtFZwbvkI4 How do we represent words? [answer goes here] What do the directions mean? [answer goes here] Can we visualize 4D, 5D, 6D? [answer goes here]
https://www.youtube.com/shorts/9Ejh8pPZu_A How many dimensions does GPT3 have for its word embeddings? [answer goes here] How many dimensions do we usually use to “draw” embeddings when we talk about them? [answer goes here]
https://www.youtube.com/shorts/qzRyCEapjFE What data structure do we use in AI stuff? [answer goes here] What is an embedding space? Like, what is it for? [answer goes here] If text has similar meaning, is it closer together or farther apart in the embedding space? [answer goes here] What data types (e.g. text) can we use embeddings for? [answer goes here]
https://www.youtube.com/shorts/h__DQ3LplK0 What are embeddings? [answer goes here] Are similar words closer together or farther apart? [answer goes here] When someone says “embedding space” what are they talking about? [answer goes here] How do you train a word embedding model? [answer goes here] What is the input and what is the output? [answer goes here] What is this model supposed to learn? [answer goes here] List a few word embedding models: [answer goes here]
Word Embedding and Word2Vec, Clearly Explained!!! I like this video but it’s REALLY LONG – so totally up to you if you’re curious and want to watch it, but no questions from me on this one Ok what questions do u have after watching all of this about embeddings? [answer goes here]
Taking Control of LLM Outputs: An Introductory Journey into Logits (watch until 12:00) ^ So this is the whole model, and then we zoom in on the logits to look at how the model selects the next token once it has the logits: What are logits? [answer goes here] Why do we do this softmax thing? (Might need to Google/find other videos to answer this) [answer goes here] How does the model pick the next token using the logits? [answer goes here] So it seems like there are LOTS of ways that the model can pick the next token once it has these logit values… what are some of the ways he talked about in this video? [answer goes here]
What is Temperature in LLM How do LLMs generate text? [answer goes here] What are the 3 sampling techniques discussed in this video? [answer goes here] Why doesn’t the LLM give back the same response every time you give it the same input prompt? [answer goes here] How do we use the probability distribution? [answer goes here] Explain greedy sampling – how does it pick tokens? [answer goes here] Ok so this is our overall flow:
What is temperature? [answer goes here] What does a high temperature do to the probability distribution? [answer goes here] What does a low temperature do to the probability distribution? [answer goes here] If I want super stable outputs (like same prompt gives me back same output each time), should I use a high temp or a low temp? [answer goes here] Explain Top-P sampling – how does it pick tokens? [answer goes here] Explain Top-K sampling – how does it pick tokens? [answer goes here] Tbh top-p and top-k seem like overkill?? Why do you think we would use them/what are some situations where it might make sense? [answer goes here]
https://dylancastillo.co/posts/seed-temperature-llms.html#seed So from all the previous stuff, now we know that the sampling-based decoding strategies (temperature sampling, top-p/top-k sampling, etc.) need to pull a random number. Greedy decoding is the exception, since it always takes the single most likely token. So question: why do we need to set the seed? [answer goes here] What does it mean for an LLM to give deterministic outputs (in contrast to more creative/random outputs)? [answer goes here] What values do I need to fix/set/freeze/choose to get deterministic outputs? List here: [answer goes here] Ok what questions do u have after watching all of this about temperature/logits/decoding? [answer goes here]
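To experiment with these ideas while working through the questions, here is a small standard-library sketch of the decoding pieces: softmax with temperature, top-k and top-p filtering, and seeded sampling. The 4-token vocabulary, the logit values, and the function names are all hypothetical. Real LLM libraries implement these steps on tensors and may differ in details such as tie-breaking.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every token not in `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits, 4-token vocabulary

# Greedy decoding: take the argmax, no random number involved.
greedy_token = max(range(len(logits)), key=lambda i: logits[i])

cold = softmax(logits, temperature=0.5)  # sharper distribution
hot = softmax(logits, temperature=2.0)   # flatter distribution

# Same seed, same sequence of random draws, same sampled tokens.
rng_a, rng_b = random.Random(0), random.Random(0)
run_a = [sample(hot, rng_a) for _ in range(10)]
run_b = [sample(hot, rng_b) for _ in range(10)]

print("cold:", [round(p, 3) for p in cold])
print("hot: ", [round(p, 3) for p in hot])
print("top-k (k=2):", [round(p, 3) for p in top_k_filter(hot, 2)])
print("top-p (p=0.8):", [round(p, 3) for p in top_p_filter(softmax(logits), 0.8)])
print("seeded runs match:", run_a == run_b)
```

Try changing the temperature, k, p, and the seed one at a time and compare what happens to the printed distributions and sampled tokens.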
Transformer Architecture Explained tbd
Transformer Explained Why do we need to look at every token compared to every other token? [answer goes here] Why does this make LLMs so expensive? [answer goes here] Why don’t we use some of the cheaper options for this comparison (e.g. linformer, reformer, sparse attention, etc.)? [answer goes here] Why does a long context window ( = long input text) make it more expensive? [answer goes here] So, why are transformers bad at reasoning and symbolic data stuff? [answer goes here]
Attention in transformers, step-by-step | Deep Learning Chapter 6 tbd
Proximal Policy Optimization (PPO) for LLMs Explained Intuitively
DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs
Training models with only 4 bits | Fully-Quantized Training
Why do we use "e" in the Sigmoid?
Recommender Systems – this one also goes over collaborative filtering a little bit
Collaborative Filtering : Data Science Concepts What is the BIG IDEA behind collaborative filtering? Like what idea are we trying to model? [answer goes here] What is the data structure setup for collaborative filtering? [answer goes here] What are the rows? [answer goes here] What are the columns? [answer goes here] What does a rating mean? [answer goes here] What does a blank cell mean? [answer goes here] How do we figure out if U1 is more similar to U2 or U3? [answer goes here] What does cosine similarity tell us about the user relationships? [answer goes here] What does high cosine similarity mean? [answer goes here] What does low cosine similarity mean? [answer goes here] What does the hat on r^ mean? [answer goes here] What do you think about the equation we use for getting the estimated rating (r^)? Do you like it/not like it + why? [answer goes here] What are the 3 big barriers to running collaborative filtering (CF) in the real world? Like, why is it hard/when does CF suck? [answer goes here]
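A tiny user-by-item example can make the collaborative filtering questions concrete. The sketch below is only an illustration under stated assumptions: the ratings table is made up, and estimate_rating uses one common similarity-weighted average, which may not match the exact equation for r^ shown in the video.

```python
import math

# Hypothetical ratings: rows = users, columns = items, None = blank cell.
ratings = {
    "U1": {"A": 5, "B": 4, "C": None},
    "U2": {"A": 5, "B": 5, "C": 4},
    "U3": {"A": 1, "B": 2, "C": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = [i for i in u if u[i] is not None and v.get(i) is not None]
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def estimate_rating(target, item, table):
    """Similarity-weighted average of other users' ratings for `item`.

    Assumption: this weighting scheme is illustrative, not the only option.
    """
    numerator = denominator = 0.0
    for name, row in table.items():
        if name == target or row.get(item) is None:
            continue
        sim = cosine_similarity(table[target], row)
        numerator += sim * row[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None

sim_u1_u2 = cosine_similarity(ratings["U1"], ratings["U2"])
sim_u1_u3 = cosine_similarity(ratings["U1"], ratings["U3"])
r_hat = estimate_rating("U1", "C", ratings)

print(f"sim(U1, U2) = {sim_u1_u2:.3f}")
print(f"sim(U1, U3) = {sim_u1_u3:.3f}")
print(f"estimated rating of C for U1 = {r_hat:.2f}")
```

Because every rating here is positive, raw cosine similarity comes out fairly high for every pair of users. That is worth keeping in mind when answering what high and low similarity mean.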
How does Netflix recommend movies? Matrix Factorization
Every Ranking Metric : MRR, MAP, NDCG
Why is the Formula for F1-Score Unnecessarily Complicated?
Learning to Rank - The ML Problem You've Probably Never Heard Of
Ranking Methods : Data Science Concepts
Can You Solve the Ratings Problem?
A survey on large language models for recommendation TODO: feature Maria Teleki's work

03. Misinformation Detection Tutorial (IC2S2_25)

Sun, 20 Jul 2025 06:32:34 GMT

09. Misinformation Detection Tutorial (IC2S2_25)

Sun, 20 Jul 2025 06:32:34 GMT