AI Worksheets
Neural Networks
LLMs
LLMs
- So an LLM is a mathematical function that is really good at predicting what word comes next for any piece of text. How does probability come into this? [answer goes here]
- What is backpropagation? Why do we need it? [answer goes here]
- LLMs are trained with "the goal of autocompleting a random passage of text from the internet" during pretraining. How is this different from RLHF? [answer goes here]
- Why are GPUs helpful? [answer goes here]
- Explain the first step shown above: [answer goes here]
- Write a list of questions you still have during/after watching this video:
- [answer goes here]
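One way to see how "predicting the next word" becomes a probability question is a toy bigram model. This is a sketch only, nothing like a real LLM (the corpus is invented and counting bigrams is far simpler than training a neural network), but it shows how raw counts turn into a probability distribution over the next word:

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration -- a real LLM trains on internet-scale text).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    """Turn raw counts into a probability distribution over the next word."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

The probabilities for any given word always sum to 1 - that's the "distribution" the LLM questions above are getting at.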
Embeddings
https://mariateleki.github.io/pdf/CAFE-Talk.pdf (these are slides from one of my talks; it was to a vet school, so forgive the AI example pls. Also, skip slides 73 to the end)
- How do we represent words with numbers? [answer goes here]
- Why do we have multiple dimensions in neural networks? [answer goes here]
- Why are LLMs biased? [answer goes here, hint, see slides 66-67]
- How do we represent words? [answer goes here]
- What do the directions mean? [answer goes here]
- Can we visualize 4D, 5D, 6D? [answer goes here]
- How many dimensions does GPT3 have for its word embeddings? [answer goes here]
- How many dimensions do we usually use to "draw" embeddings when we talk about them? [answer goes here]
- What data structure do we use in AI stuff? [answer goes here]
- What is an embedding space? Like, what is it for? [answer goes here]
- If text has similar meaning, is it closer together or farther apart in the embedding space? [answer goes here]
- What data types (e.g. text) can we use embeddings for? [answer goes here]
- What are embeddings? [answer goes here]
- Are similar words closer together or farther apart? [answer goes here]
- When someone says "embedding space," what are they talking about? [answer goes here]
- How do you train a word embedding model? [answer goes here]
- What is the input and what is the output? [answer goes here]
- What is this model supposed to learn? [answer goes here]
- List a few word embedding models: [answer goes here]
- I like this video but it's REALLY LONG, so totally up to you if you're curious and want to watch it, but no questions from me on this one
Ok what questions do you have after watching all of this about embeddings? [answer goes here]
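A tiny sketch of the ideas above: represent words as vectors of numbers and measure closeness with cosine similarity. These 2D "embeddings" are hand-invented for illustration - real models (word2vec, GloVe, GPT-3) learn hundreds or thousands of dimensions from data:

```python
import math

# Hand-made 2D "embeddings" -- the numbers here are invented, not learned.
emb = {
    "cat": [0.9, 0.1],
    "dog": [0.85, 0.2],
    "car": [0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = pointing the same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(emb["cat"], emb["dog"]))  # high: similar meaning, close in embedding space
print(cosine(emb["cat"], emb["car"]))  # low: different meaning, far apart
```

Notice that "cat" and "dog" score higher than "cat" and "car" - that's the whole point of an embedding space: similar meaning lands close together.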
The Seed, The Logits, The Probability Distribution, & The Decoding Process
Taking Control of LLM Outputs: An Introductory Journey into Logits (watch until 12:00)
- ^ So this is the whole model, and then we zoom in on the logits to look at how the model selects the next token once it has the logits:
- What are logits? [answer goes here]
- Why do we do this softmax thing? (Might need to Google/find other videos to answer this) [answer goes here]
- How does the model pick the next token using the logits? [answer goes here]
- So it seems like there are LOTS of ways that the model can pick the next token once it has these logit values… what are some of the ways he talked about in this video? [answer goes here]
- How do LLMs generate text? [answer goes here]
- What are the 3 sampling techniques discussed in this video? [answer goes here]
- Why doesn't the LLM give back the same response every time you give it the same input prompt? [answer goes here]
- How do we use the probability distribution? [answer goes here]
- Explain greedy sampling – how does it pick tokens? [answer goes here]
- Ok so this is our overall flow:
- What is temperature? [answer goes here]
- What does a high temperature do to the probability distribution? [answer goes here]
- What does a low temperature do to the probability distribution? [answer goes here]
- If I want super stable outputs (like same prompt gives me back same output each time), should I use a high temp or a low temp? [answer goes here]
- Explain Top-P sampling – how does it pick tokens? [answer goes here]
- Explain Top-K sampling – how does it pick tokens? [answer goes here]
- Tbh top-p and top-k seem like overkill?? Why do you think we would use them/what are some situations where it might make sense? [answer goes here]
- So from all the previous stuff, now we know that the different decoding strategies (greedy sampling, top-p/top-k sampling, etc.) need to pull a random number. So question: why do we need to set the seed? [answer goes here]
- What does it mean for an LLM to give deterministic outputs (in contrast to more creative/random outputs)? [answer goes here]
- What values do I need to fix/set/freeze/choose to get deterministic outputs? List here: [answer goes here]
Ok what questions do you have after watching all of this about temperature/logits/decoding? [answer goes here]
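To tie the decoding questions together, here's a toy sketch of the logits → softmax (with temperature) → sampling pipeline. The vocab and logit values are made up, and a real model has tens of thousands of tokens, but the mechanics of greedy, top-k, and top-p (nucleus) sampling, and why a fixed seed matters, are the same:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.
    Higher temperature -> flatter (more random); lower -> more peaked (more stable)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab  = ["cat", "dog", "car", "the"]    # toy vocabulary
logits = [2.0, 1.5, 0.3, -1.0]           # made-up logit values

def greedy(probs):
    """Greedy: always pick the single most likely token (deterministic)."""
    return vocab[probs.index(max(probs))]

def top_k(probs, k, rng):
    """Top-k: keep only the k most likely tokens, renormalize, sample among them."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [probs[i] for i in ranked]
    total = sum(kept)
    return vocab[rng.choices(ranked, weights=[p / total for p in kept])[0]]

def top_p(probs, p, rng):
    """Top-p (nucleus): keep the smallest set of tokens whose probs sum to >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in ranked:
        kept.append(i)
        running += probs[i]
        if running >= p:
            break
    return vocab[rng.choices(kept, weights=[probs[i] for i in kept])[0]]

probs = softmax(logits)
rng = random.Random(42)                  # fixing the seed makes the sampling repeatable
print(greedy(probs))                     # always "cat", no randomness involved
print(top_k(probs, k=2, rng=rng))        # "cat" or "dog" only
print(top_p(probs, p=0.8, rng=rng))      # smallest set covering 80% of the probability
```

Greedy never touches the random number generator, so it's deterministic on its own; top-k and top-p do, which is exactly why you have to fix the seed (on top of temperature) to get reproducible outputs from them.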
The Transformer Architecture
- tbd
Attention
- Why do we need to look at every token compared to every other token? [answer goes here]
- Why does this make LLMs so expensive? [answer goes here]
- Why don't we use some of the cheaper options for this comparison (e.g. Linformer, Reformer, sparse attention, etc.)? [answer goes here]
- Why does a long context window ( = long input text) make it more expensive? [answer goes here]
- So, why are transformers bad at reasoning and symbolic data stuff? [answer goes here]
- tbd
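The "every token compared to every other token" question can be made concrete with a minimal self-attention sketch. The vectors below are invented 2D toys (real models use learned projections and many heads), but the nested loop makes the n × n comparison - and hence the quadratic cost of long contexts - visible:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists.
    Every query is scored against EVERY key: n queries x n keys dot products.
    That n*n comparison is the quadratic cost that makes long contexts expensive."""
    d = len(queries[0])
    out = []
    for q in queries:
        # one row of scores: this token vs. every token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]        # softmax over the scores
        # output = attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Illustrative 3-token sequence of 2D vectors (numbers are made up).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)   # self-attention: 3 x 3 = 9 score computations
print(result)
```

Double the sequence length and the number of score computations quadruples - that's the expense the questions above are pointing at.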
Training
Pre-training: next token prediction on internet data
Post-training: fine-tuning/RLHF on specific datasets
Objective Functions/Loss Functions
Tasks
Summarization
β¦
Etc.
Quantization
RecSys
Recommender Systems
Content-Based Filtering
Recommender Systems β this one also goes over collaborative filtering a little bit
Collaborative Filtering
- What is the BIG IDEA behind collaborative filtering? Like what idea are we trying to model? [answer goes here]
- What is the data structure setup for collaborative filtering? [answer goes here]
- What are the rows? [answer goes here]
- What are the columns? [answer goes here]
- What does a rating mean? [answer goes here]
- What does a blank cell mean? [answer goes here]
- How do we figure out if U1 is more similar to U2 or U3? [answer goes here]
- What does cosine similarity tell us about the user relationships? [answer goes here]
- What does high cosine similarity mean? [answer goes here]
- What does low cosine similarity mean? [answer goes here]
- What does the hat on r^ mean? [answer goes here]
- What do you think about the equation we use for getting the estimated rating (r^)? Do you like it/not like it + why? [answer goes here]
- What are the 3 big barriers to running collaborative filtering (CF) in the real world? Like, why is it hard/when does CF suck? [answer goes here]
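The questions above can be sketched end-to-end with a tiny user-item matrix. The ratings are invented, and real systems use mean-centered similarities and much sparser matrices, but this shows the data structure (rows = users, columns = items, blanks = unrated), user-user cosine similarity, and a similarity-weighted estimate r̂ for a blank cell:

```python
import math

# Tiny user-item rating matrix (invented numbers). Rows = users, columns = items.
# None = the user hasn't rated that item -- that's the blank cell we try to predict.
ratings = {
    "U1": {"A": 5, "B": 3, "C": None},
    "U2": {"A": 4, "B": 2, "C": 5},
    "U3": {"A": 1, "B": 5, "C": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over the items BOTH have rated."""
    shared = [i for i in ratings[u]
              if ratings[u][i] is not None and ratings[v][i] is not None]
    a = [ratings[u][i] for i in shared]
    b = [ratings[v][i] for i in shared]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict(user, item):
    """r-hat: similarity-weighted average of the other users' ratings for this item."""
    others = [u for u in ratings if u != user and ratings[u][item] is not None]
    sims = [cosine(user, u) for u in others]
    return sum(s * ratings[u][item] for s, u in zip(sims, others)) / sum(sims)

print(cosine("U1", "U2"), cosine("U1", "U3"))  # U1 is more similar to U2 than to U3
print(predict("U1", "C"))                      # estimated rating r-hat for the blank cell
```

High cosine similarity means the two users rated shared items in the same direction, so their opinions count more in the weighted average - that's the big idea: "people like you liked this."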
Evaluation
Learning to Rank
RecSys+LLM
Topic Models
Bias
Computational Social Science
Speech
TODO: feature Maria Teleki's work
