AI Worksheets
Neural Networks
LLMs
LLMs
- So an LLM is a mathematical function that is really good at predicting what word comes next for any piece of text. How does probability come into this? [answer goes here]
- What is backpropagation? Why do we need it? [answer goes here]
- LLMs are trained with "the goal of autocompleting a random passage of text from the internet" during pretraining. How is this different from RLHF? [answer goes here]
- Why are GPUs helpful? [answer goes here]
- Explain the first step shown above: [answer goes here]
- Write a list of questions you still have during/after watching this video:
- [answer goes here]
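One way to see how "predicting the next word" becomes a probability question is a toy bigram model. This is a sketch only, nothing like a real LLM (the corpus is invented and counting bigrams is far simpler than training a neural network), but it shows how raw counts turn into a probability distribution over the next word:

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration -- a real LLM trains on internet-scale text).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(word):
    """Turn raw counts into a probability distribution over the next word."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

The probabilities for any given word always sum to 1 - that's the "distribution" the LLM questions above are getting at.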
Embeddings
https://mariateleki.github.io/pdf/CAFE-Talk.pdf (these are slides from one of my talks; it was to a vet school, so forgive the AI example pls. Also, skip slides 73 to the end)
- How do we represent words with numbers? [answer goes here]
- Why do we have multiple dimensions in neural networks? [answer goes here]
- Why are LLMs biased? [answer goes here, hint, see slides 66-67]
- How do we represent words? [answer goes here]
- What do the directions mean? [answer goes here]
- Can we visualize 4D, 5D, 6D? [answer goes here]
- How many dimensions does GPT3 have for its word embeddings? [answer goes here]
- How many dimensions do we usually use to "draw" embeddings when we talk about them? [answer goes here]
- What data structure do we use in AI stuff? [answer goes here]
- What is an embedding space? Like, what is it for? [answer goes here]
- If text has similar meaning, is it closer together or farther apart in the embedding space? [answer goes here]
- What data types (e.g. text) can we use embeddings for? [answer goes here]
- What are embeddings? [answer goes here]
- Are similar words closer together or farther apart? [answer goes here]
- When someone says "embedding space," what are they talking about? [answer goes here]
- How do you train a word embedding model? [answer goes here]
- What is the input and what is the output? [answer goes here]
- What is this model supposed to learn? [answer goes here]
- List a few word embedding models: [answer goes here]
- I like this video but it's REALLY LONG, so totally up to you if you're curious and want to watch it, but no questions from me on this one
Ok what questions do you have after watching all of this about embeddings? [answer goes here]
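A tiny sketch of the ideas above: represent words as vectors of numbers and measure closeness with cosine similarity. These 2D "embeddings" are hand-invented for illustration - real models (word2vec, GloVe, GPT-3) learn hundreds or thousands of dimensions from data:

```python
import math

# Hand-made 2D "embeddings" -- the numbers here are invented, not learned.
emb = {
    "cat": [0.9, 0.1],
    "dog": [0.85, 0.2],
    "car": [0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = pointing the same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(emb["cat"], emb["dog"]))  # high: similar meaning, close in embedding space
print(cosine(emb["cat"], emb["car"]))  # low: different meaning, far apart
```

Notice that "cat" and "dog" score higher than "cat" and "car" - that's the whole point of an embedding space: similar meaning lands close together.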
The Seed, The Logits, The Probability Distribution, & The Decoding Process
Taking Control of LLM Outputs: An Introductory Journey into Logits (watch until 12:00)
- ^ So this is the whole model, and then we zoom in on the logits to look at how the model selects the next token once it has the logits:
- What are logits? [answer goes here]
- Why do we do this softmax thing? (Might need to Google/find other videos to answer this) [answer goes here]
- How does the model pick the next token using the logits? [answer goes here]
- So it seems like there are LOTS of ways that the model can pick the next token once it has these logit values… what are some of the ways he talked about in this video? [answer goes here]
- How do LLMs generate text? [answer goes here]
- What are the 3 sampling techniques discussed in this video? [answer goes here]
- Why doesn't the LLM give back the same response every time you give it the same input prompt? [answer goes here]
- How do we use the probability distribution? [answer goes here]
- Explain greedy sampling – how does it pick tokens? [answer goes here]
- Ok so this is our overall flow:
- What is temperature? [answer goes here]
- What does a high temperature do to the probability distribution? [answer goes here]
- What does a low temperature do to the probability distribution? [answer goes here]
- If I want super stable outputs (like same prompt gives me back same output each time), should I use a high temp or a low temp? [answer goes here]
- Explain Top-P sampling – how does it pick tokens? [answer goes here]
- Explain Top-K sampling – how does it pick tokens? [answer goes here]
- Tbh top-p and top-k seem like overkill?? Why do you think we would use them/what are some situations where it might make sense? [answer goes here]
- So from all the previous stuff, now we know that the different decoding strategies (greedy sampling, top-p/top-k sampling, etc.) need to pull a random number. So question: why do we need to set the seed? [answer goes here]
- What does it mean for an LLM to give deterministic outputs (in contrast to more creative/random outputs)? [answer goes here]
- What values do I need to fix/set/freeze/choose to get deterministic outputs? List here: [answer goes here]
Ok what questions do you have after watching all of this about temperature/logits/decoding? [answer goes here]
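To tie the decoding questions together, here's a toy sketch of the logits → softmax (with temperature) → sampling pipeline. The vocab and logit values are made up, and a real model has tens of thousands of tokens, but the mechanics of greedy, top-k, and top-p (nucleus) sampling, and why a fixed seed matters, are the same:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution.
    Higher temperature -> flatter (more random); lower -> more peaked (more stable)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab  = ["cat", "dog", "car", "the"]    # toy vocabulary
logits = [2.0, 1.5, 0.3, -1.0]           # made-up logit values

def greedy(probs):
    """Greedy: always pick the single most likely token (deterministic)."""
    return vocab[probs.index(max(probs))]

def top_k(probs, k, rng):
    """Top-k: keep only the k most likely tokens, renormalize, sample among them."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [probs[i] for i in ranked]
    total = sum(kept)
    return vocab[rng.choices(ranked, weights=[p / total for p in kept])[0]]

def top_p(probs, p, rng):
    """Top-p (nucleus): keep the smallest set of tokens whose probs sum to >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, running = [], 0.0
    for i in ranked:
        kept.append(i)
        running += probs[i]
        if running >= p:
            break
    return vocab[rng.choices(kept, weights=[probs[i] for i in kept])[0]]

probs = softmax(logits)
rng = random.Random(42)                  # fixing the seed makes the sampling repeatable
print(greedy(probs))                     # always "cat", no randomness involved
print(top_k(probs, k=2, rng=rng))        # "cat" or "dog" only
print(top_p(probs, p=0.8, rng=rng))      # smallest set covering 80% of the probability
```

Greedy never touches the random number generator, so it's deterministic on its own; top-k and top-p do, which is exactly why you have to fix the seed (on top of temperature) to get reproducible outputs from them.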
The Transformer Architecture
- tbd
Attention
- Why do we need to look at every token compared to every other token? [answer goes here]
- Why does this make LLMs so expensive? [answer goes here]
- Why don't we use some of the cheaper options for this comparison (e.g. Linformer, Reformer, sparse attention, etc.)? [answer goes here]
- Why does a long context window ( = long input text) make it more expensive? [answer goes here]
- So, why are transformers bad at reasoning and symbolic data stuff? [answer goes here]
- tbd
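The "every token compared to every other token" question can be made concrete with a minimal self-attention sketch. The vectors below are invented 2D toys (real models use learned projections and many heads), but the nested loop makes the n × n comparison - and hence the quadratic cost of long contexts - visible:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists.
    Every query is scored against EVERY key: n queries x n keys dot products.
    That n*n comparison is the quadratic cost that makes long contexts expensive."""
    d = len(queries[0])
    out = []
    for q in queries:
        # one row of scores: this token vs. every token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]        # softmax over the scores
        # output = attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Illustrative 3-token sequence of 2D vectors (numbers are made up).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)   # self-attention: 3 x 3 = 9 score computations
print(result)
```

Double the sequence length and the number of score computations quadruples - that's the expense the questions above are pointing at.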
Training
Pre-training: next token prediction on internet data
Post-training: fine-tuning/RLHF on specific datasets
Objective Functions/Loss Functions
Tasks
Summarization
β¦
Etc.
Quantization
RecSys
Recommender Systems
Content-Based Filtering
Recommender Systems β this one also goes over collaborative filtering a little bit
Collaborative Filtering
- What is the BIG IDEA behind collaborative filtering? Like what idea are we trying to model? [answer goes here]
- What is the data structure setup for collaborative filtering? [answer goes here]
- What are the rows? [answer goes here]
- What are the columns? [answer goes here]
- What does a rating mean? [answer goes here]
- What does a blank cell mean? [answer goes here]
- How do we figure out if U1 is more similar to U2 or U3? [answer goes here]
- What does cosine similarity tell us about the user relationships? [answer goes here]
- What does high cosine similarity mean? [answer goes here]
- What does low cosine similarity mean? [answer goes here]
- What does the hat on r^ mean? [answer goes here]
- What do you think about the equation we use for getting the estimated rating (r^)? Do you like it/not like it + why? [answer goes here]
- What are the 3 big barriers to running collaborative filtering (CF) in the real world? Like, why is it hard/when does CF suck? [answer goes here]
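The questions above can be sketched end-to-end with a tiny user-item matrix. The ratings are invented, and real systems use mean-centered similarities and much sparser matrices, but this shows the data structure (rows = users, columns = items, blanks = unrated), user-user cosine similarity, and a similarity-weighted estimate r̂ for a blank cell:

```python
import math

# Tiny user-item rating matrix (invented numbers). Rows = users, columns = items.
# None = the user hasn't rated that item -- that's the blank cell we try to predict.
ratings = {
    "U1": {"A": 5, "B": 3, "C": None},
    "U2": {"A": 4, "B": 2, "C": 5},
    "U3": {"A": 1, "B": 5, "C": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over the items BOTH have rated."""
    shared = [i for i in ratings[u]
              if ratings[u][i] is not None and ratings[v][i] is not None]
    a = [ratings[u][i] for i in shared]
    b = [ratings[v][i] for i in shared]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict(user, item):
    """r-hat: similarity-weighted average of the other users' ratings for this item."""
    others = [u for u in ratings if u != user and ratings[u][item] is not None]
    sims = [cosine(user, u) for u in others]
    return sum(s * ratings[u][item] for s, u in zip(sims, others)) / sum(sims)

print(cosine("U1", "U2"), cosine("U1", "U3"))  # U1 is more similar to U2 than to U3
print(predict("U1", "C"))                      # estimated rating r-hat for the blank cell
```

High cosine similarity means the two users rated shared items in the same direction, so their opinions count more in the weighted average - that's the big idea: "people like you liked this."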
Evaluation
Learning to Rank
RecSys+LLM
Topic Models
Bias
Computational Social Science
Speech
TODO: feature Maria Teleki's work
