AI Worksheets

🧠 Neural Networks

πŸ¦‹ LLMs

LLMs

    • So an LLM is a mathematical function that's really good at predicting what word comes next for any piece of text. How does probability come into this? [answer goes here]
  • What is backpropagation? Why do we need it? [answer goes here]
  • LLMs are trained with β€œthe goal of autocompleting a random passage of text from the internet” during pretraining. How is this different from RLHF? [answer goes here]
  • Why are GPUs helpful? [answer goes here]
    • Explain the first step shown above: [answer goes here]
  • Write a list of questions you still have during/after watching this video:
    • [answer goes here]

Embeddings

https://mariateleki.github.io/pdf/CAFE-Talk.pdf (these are slides from one of my talks; it was to a vet school, so forgive my AI example pls, and skip slides 73 to the end)

  • How do we represent words with numbers? [answer goes here]
  • Why do we have multiple dimensions in neural networks? [answer goes here]
  • Why are LLMs biased? [answer goes here, hint, see slides 66-67]
  • How do we represent words? [answer goes here]
  • What do the directions mean? [answer goes here]
  • Can we visualize 4D, 5D, 6D? [answer goes here]
  • How many dimensions does GPT3 have for its word embeddings? [answer goes here]
  • How many dimensions do we usually use to β€œdraw” embeddings when we talk about them? [answer goes here]
  • What data structure do we use in AI stuff? [answer goes here]
  • What is an embedding space? Like, what is it for? [answer goes here]
  • If text has similar meaning, is it closer together or farther apart in the embedding space? [answer goes here]
  • What data types (e.g. text) can we use embeddings for? [answer goes here]
  • What are embeddings? [answer goes here]
  • Are similar words closer together or farther apart? [answer goes here]
  • When someone says β€œembedding space” what are they talking about? [answer goes here]
  • How do you train a word embedding model? [answer goes here]
    • What is the input and what is the output? [answer goes here]
    • What is this model supposed to learn? [answer goes here]
    • List a few word embedding models: [answer goes here]
  • I like this video but it’s REALLY LONG – so totally up to you if you’re curious and want to watch it, but no questions from me on this one

Ok what questions do u have after watching all of this about embeddings? [answer goes here]
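To make the "closer together in embedding space" idea concrete, here's a minimal sketch using tiny made-up 3-dimensional vectors (real models use hundreds or thousands of dimensions, and these numbers are invented purely for illustration):

```python
import math

# Toy 3-dimensional "embeddings" -- made-up numbers, just for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = pointing the same
    direction (similar meaning), near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Words with similar meaning get higher cosine similarity (closer together).
cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])  # high
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])  # low
```

The point is just the geometry: "similar meaning" becomes "small angle between vectors," which is exactly what cosine similarity measures.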

The Seed, The Logits, The Probability Distribution, & The Decoding Process

  • ^ So this is the whole model, and then we zoom in on the logits to look at how the model selects the next token once it has the logits:
    • What are logits? [answer goes here]
    • Why do we do this softmax thing? (Might need to Google/find other videos to answer this) [answer goes here]
    • How does the model pick the next token using the logits? [answer goes here]
    • So it seems like there are LOTS of ways that the model can pick the next token once it has these logit values… what are some of the ways he talked about in this video? [answer goes here]
  • How do LLMs generate text? [answer goes here]
  • What are the 3 sampling techniques discussed in this video? [answer goes here]
  • Why doesn’t the LLM give back the same response every time you give it the same input prompt? [answer goes here]
  • How do we use the probability distribution? [answer goes here]
  • Explain greedy sampling – how does it pick tokens? [answer goes here]
  • Ok so this is our overall flow:
    • Pasted image 20260415141102.png
  • What is temperature? [answer goes here]
    • What does a high temperature do to the probability distribution? [answer goes here]
    • What does a low temperature do to the probability distribution? [answer goes here]
    • If I want super stable outputs (like same prompt gives me back same output each time), should I use a high temp or a low temp? [answer goes here]
  • Explain Top-P sampling – how does it pick tokens? [answer goes here]
  • Explain Top-K sampling – how does it pick tokens? [answer goes here]
  • Tbh top-p and top-k seem like overkill?? Why do you think we would use them/what are some situations where it might make sense? [answer goes here]
  • So from all the previous stuff, now we know that the different decoding strategies (greedy sampling, top-p/top-k sampling, etc.) need to pull a random number. So question: why do we need to set the seed? [answer goes here]
  • What does it mean for an LLM to give deterministic outputs (in contrast to more creative/random outputs)? [answer goes here]
  • What values do I need to fix/set/freeze/choose to get deterministic outputs? List here: [answer goes here]

Ok what questions do u have after watching all of this about temperature/logits/decoding? [answer goes here]
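A lot of the questions above (softmax, temperature, greedy vs. top-k sampling, why the seed matters) can be answered by a small sketch. The logit values below are made up, and this is a simplified version of what real decoders do, not any particular library's implementation:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.
    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more random/creative)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 4 candidate next tokens (made-up values).
tokens = ["the", "a", "cat", "pizza"]
logits = [4.0, 3.0, 1.0, 0.5]

probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # low temp: more peaked
flat = softmax(logits, temperature=2.0)   # high temp: more uniform

# Greedy sampling: always pick the highest-probability token (deterministic).
greedy = tokens[probs.index(max(probs))]

# Top-k sampling: keep only the k most likely tokens, renormalize, then draw.
def top_k_sample(tokens, probs, k, seed):
    random.seed(seed)  # fixing the seed makes the "random" draw reproducible
    ranked = sorted(zip(tokens, probs), key=lambda t: t[1], reverse=True)[:k]
    kept_tokens = [t for t, _ in ranked]
    kept_probs = [p for _, p in ranked]
    total = sum(kept_probs)
    return random.choices(kept_tokens, [p / total for p in kept_probs])[0]

sample = top_k_sample(tokens, probs, k=2, seed=42)
```

Notice that `sample` can only ever be `"the"` or `"a"` here (top-2), and running `top_k_sample` again with the same seed returns the same token, which is why fixing the seed (plus temperature and the sampling strategy) is what gets you deterministic outputs.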

The Transformer Architecture

  • tbd

Attention

  • Why do we need to look at every token compared to every other token? [answer goes here]
  • Why does this make LLMs so expensive? [answer goes here]
  • Why don’t we use some of the cheaper options for this comparison (e.g. linformer, reformer, sparse attention, etc.)? [answer goes here]
  • Why does a long context window ( = long input text) make it more expensive? [answer goes here]
  • So, why are transformers bad at reasoning and symbolic data stuff? [answer goes here]
  • tbd
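The "every token compared to every other token" cost is easiest to see in code. This is a bare-bones scaled dot-product attention sketch in pure Python with tiny made-up vectors (real implementations are batched matrix multiplies, but the n × n comparison structure is the same):

```python
import math

def attention(queries, keys, values):
    """Plain scaled dot-product attention. Every query is scored against
    every key -- that nested loop is the O(n^2) pairwise comparison that
    makes long context windows so expensive."""
    d = len(queries[0])
    outputs = []
    for q in queries:  # n queries...
        # ...each scored against all n keys: n * n scores total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over the scores
        # Output = attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# 3 tokens, 2-dimensional toy vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)  # self-attention: 3 x 3 = 9 pairwise scores
```

Double the sequence length and the number of score computations quadruples, which is exactly why long inputs cost so much more.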

Training

Pre-training: next token prediction on internet data

Post-training: fine-tuning/RLHF on specific datasets
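One way to see what "next token prediction" means as a training objective: the loss at each position is just the negative log of the probability the model assigned to the token that actually came next. The vocabulary and probabilities below are made up for illustration:

```python
import math

def cross_entropy(probs, target_idx):
    """Pretraining loss for one position: negative log-probability the model
    assigned to the actual next token. Lower loss = better prediction."""
    return -math.log(probs[target_idx])

# Made-up model distribution over a tiny 4-token vocabulary.
vocab = ["the", "cat", "sat", "mat"]
model_probs = [0.1, 0.7, 0.1, 0.1]

# Suppose the true next token in the training text is "cat" (index 1).
loss_good = cross_entropy(model_probs, 1)  # confident & correct -> small loss
loss_bad = cross_entropy(model_probs, 3)   # little mass on "mat" -> large loss
```

Training nudges the weights (via backpropagation) so the probability on the true next token goes up, i.e. so this loss goes down across the whole corpus.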

Objective Functions/Loss Functions

Tasks

Summarization

…

Etc.

Quantization

🎬 RecSys

Recommender Systems

Content-Based Filtering

Recommender Systems – this one also goes over collaborative filtering a little bit

Collaborative Filtering

  • What is the BIG IDEA behind collaborative filtering? Like what idea are we trying to model? [answer goes here]
  • What is the data structure setup for collaborative filtering? [answer goes here]
    • What are the rows? [answer goes here]
    • What are the columns? [answer goes here]
    • What does a rating mean? [answer goes here]
    • What does a blank cell mean? [answer goes here]
  • How do we figure out if U1 is more similar to U2 or U3? [answer goes here]
    • What does cosine similarity tell us about the user relationships? [answer goes here]
    • What does high cosine similarity mean? [answer goes here]
    • What does low cosine similarity mean? [answer goes here]
  • What does the hat on r̂ mean? [answer goes here]
    • What do you think about the equation we use for getting the estimated rating (r̂)? Do you like it/not like it + why? [answer goes here]
  • What are the 3 big barriers to running collaborative filtering (CF) in the real world? Like, why is it hard/when does CF suck? [answer goes here]
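Here's a minimal sketch of the whole collaborative filtering setup from the questions above: rows = users, columns = items, `None` = no rating, cosine similarity over co-rated items, and r̂ as a similarity-weighted average. The ratings are made-up toy data:

```python
import math

# Rows = users, columns = items; None = the user hasn't rated that item.
ratings = {
    "U1": [5, 4, None, 1],
    "U2": [4, 5, 3, None],
    "U3": [1, None, 4, 5],
}

def cosine_similarity(a, b):
    """Compare two users over the items they have BOTH rated.
    High = similar taste, low = different taste."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    na = math.sqrt(sum(x * x for x, _ in pairs))
    nb = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (na * nb)

def predict_rating(user, item_idx):
    """r-hat: similarity-weighted average of the ratings that OTHER
    users gave this item (users more like you count for more)."""
    num = den = 0.0
    for other, row in ratings.items():
        if other == user or row[item_idx] is None:
            continue
        sim = cosine_similarity(ratings[user], row)
        num += sim * row[item_idx]
        den += abs(sim)
    return num / den

sim_12 = cosine_similarity(ratings["U1"], ratings["U2"])  # U1 vs U2
sim_13 = cosine_similarity(ratings["U1"], ratings["U3"])  # U1 vs U3
r_hat = predict_rating("U1", 2)  # estimate U1's rating for the 3rd item
```

In this toy data U1 and U2 agree on the items they share, so `sim_12 > sim_13`, and U1's predicted rating for item 3 lands between U2's rating (3) and U3's rating (4), pulled toward U2's because U2 is the more similar user. This is the "big idea": people who agreed in the past will probably agree in the future.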

Evaluation

Learning to Rank

RecSys+LLM

🦜 Topic Models

❌ Bias

🀠 Computational Social Science

πŸ’¬ Speech

TODO: feature Maria Teleki's work