LLM background literature
CSCE 689 LLMs: Course Readings, as shared by Maria Teleki
Parameter-Efficient Tuning, Compression
Efficient inference
- Readings:
- Optional:
- Flash-Decoding for long-context inference
- Some of Andrej Karpathy’s github repos
- https://github.com/karpathy/nanoGPT
- https://github.com/karpathy/llm.c
- Some of Georgi Gerganov’s github repos
Model distillation
Data Efficiency in the Age of LLMs
- References:
- Sorscher, Ben, et al. "Beyond neural scaling laws: beating power law scaling via data pruning." NeurIPS (2022).
- Abbas, Amro, et al. "SemDeDup: Data-efficient learning at web-scale through semantic deduplication." arXiv preprint arXiv:2303.09540 (2023).
- Sachdeva, Noveen, et al. "How to Train Data-Efficient LLMs." arXiv preprint arXiv:2402.09668 (2024).
- Marion, Max, et al. "When less is more: Investigating data pruning for pretraining LLMs at scale." arXiv preprint arXiv:2309.04564 (2023).
- Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
- Xie, Sang Michael, et al. "Data selection for language models via importance resampling." NeurIPS (2023).
- Engstrom, Logan, Axel Feldmann, and Aleksander Madry. "DsDm: Model-aware dataset selection with datamodels." arXiv preprint arXiv:2401.12926 (2024).
- Ayed, Fadhel, and Soufiane Hayou. "Data pruning and neural scaling laws: fundamental limitations of score-based algorithms." TMLR (2023).
- Guo, Chengcheng, et al. "DeepCore: A comprehensive library for coreset selection in deep learning." arXiv preprint arXiv:2204.08499 (2022).
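As a rough illustration of the semantic-deduplication idea in the Abbas et al. (SemDeDup) reading, here is a minimal greedy sketch. The random embeddings and the 0.9 similarity threshold are placeholder assumptions; the actual method first clusters real encoder embeddings with k-means and deduplicates within clusters.

```python
# A minimal sketch of SemDeDup-style semantic deduplication: embed documents,
# then greedily drop any document whose embedding is too close to one already kept.
# Random embeddings and the 0.9 threshold are placeholder assumptions.
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of documents to keep, dropping near-duplicates."""
    # Normalize so that dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 64))
embs[1] = embs[0] + 0.01 * rng.normal(size=64)   # plant a near-duplicate
kept = semantic_dedup(embs)
print(len(kept), 1 in kept)   # the near-duplicate of doc 0 should be dropped
```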
Tools, Agents, and MoE
- Readings:
- What Are Tools Anyway? A Survey from the Language Model Perspective
- Toolformer: Language Models Can Teach Themselves to Use Tools
- ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
- Mixture-of-Agents Enhances Large Language Model Capabilities
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Optional:
- Lilian Weng's blog post on LLM Powered Autonomous Agents
- Visual Programming: Compositional Visual Reasoning Without Training
- Great collection of papers on tools: https://github.com/zorazrw/awesome-tool-llm
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
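To make the MoE reading concrete, here is a minimal NumPy sketch of Switch-Transformer-style top-1 routing. The toy sizes and weights are assumptions for illustration; the real layer also adds a load-balancing auxiliary loss and per-expert capacity limits.

```python
# A minimal sketch of Switch-Transformer-style top-1 routing (NumPy).
# Toy sizes and weight initializations are assumptions for illustration.
import numpy as np

d_model, n_experts, n_tokens = 8, 4, 10
rng = np.random.default_rng(0)
W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy expert FFNs

def switch_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to exactly one expert and scale by the router probability."""
    logits = x @ W_router                                     # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                             # top-1 expert per token
    out = np.zeros_like(x)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            out[mask] = (x[mask] @ experts[e]) * probs[mask, e:e + 1]
    return out

x = rng.normal(size=(n_tokens, d_model))
print(switch_layer(x).shape)   # (10, 8)
```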
Long context, extending context
- Readings:
- Optional:
Scaling Laws
- Readings:
- Optional:
Self-play
LLM Applications: Text mining, user modeling, …
VLM Part 2
Model development
Transformers and New Directions (Linear Attention, Linear RNNs, State Space Models)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Longformer: The Long-Document Transformer
- Generating Long Sequences with Sparse Transformers
- Linformer: Self-Attention with Linear Complexity
- Efficiently Modeling Long Sequences with Structured State Spaces
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- RWKV: Reinventing RNNs for the Transformer Era
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
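For a concrete view of the linear-attention idea in the "Transformers are RNNs" reading, here is a minimal sketch of its causal recurrence with the elu+1 feature map; the dimensions and naming are illustrative assumptions, not the paper's reference code.

```python
# A minimal sketch of the linear-attention recurrence from "Transformers are RNNs":
# with a feature map phi, causal attention becomes a running state
# S_t = S_{t-1} + phi(k_t) v_t^T and z_t = z_{t-1} + phi(k_t), giving O(1) per step.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, keeps features positive

def linear_attention(Q, K, V):
    """Causal linear attention computed as an RNN over time steps."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k) v^T
    z = np.zeros(d)                 # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))
print(linear_attention(Q, K, V).shape)   # (6, 4)
```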
Bias
- Language (Technology) is Power: A Critical Survey of “Bias” in NLP
- Semantics derived automatically from language corpora contain human-like biases
- StereoSet: Measuring stereotypical bias in pretrained language models
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
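The Caliskan et al. reading measures bias with the Word Embedding Association Test (WEAT); a minimal sketch of its association score and test statistic is below. Random vectors stand in for real word embeddings (the paper uses GloVe/word2vec), so the printed number is meaningless; only the computation is illustrated.

```python
# A minimal sketch of the WEAT score from Caliskan et al.:
# s(w, A, B) = mean_a cos(w, a) - mean_b cos(w, b), summed over two target sets.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """How much more strongly w associates with attribute set A than with B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_statistic(X, Y, A, B):
    """WEAT test statistic: total association of targets X minus targets Y."""
    return sum(association(x, A, B) for x in X) - sum(association(y, A, B) for y in Y)

rng = np.random.default_rng(0)
dim = 50
X = rng.normal(size=(4, dim))   # target words, e.g. flowers
Y = rng.normal(size=(4, dim))   # target words, e.g. insects
A = rng.normal(size=(4, dim))   # attribute words, e.g. pleasant
B = rng.normal(size=(4, dim))   # attribute words, e.g. unpleasant
print(weat_statistic(X, Y, A, B))
```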
Diffusion Models
Miscellaneous
- Chess as a Testbed for Language Model State Tracking (Proceedings of the AAAI Conference on Artificial Intelligence)
- Topic: How well can an LLM learn Conway's Game of Life and be prompted to solve different tasks? For example, given some NxN grid, how well can it maximize still life, or, when allowed to modify the state of the system, minimize entropy while maximizing stable life? This would test how well LLMs can reason from very simple rules and how far ahead they can predict a highly chaotic system. Questions of this kind are usually tackled by mathematicians and very fast computers with lots of RAM. A minimal simulation sketch follows below.
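As a reference point for the topic above, here is a minimal NumPy sketch of the Game of Life update and a still-life check; the grid size, seed, and density are arbitrary assumptions.

```python
# A minimal NumPy sketch of Conway's Game of Life on a toroidal grid,
# usable as a reference implementation when prompting an LLM on the tasks above.
import numpy as np

def step(grid: np.ndarray) -> np.ndarray:
    """One Game of Life update on a wrap-around NxN grid."""
    # Count the 8 neighbors of every cell by summing shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell lives next step if it has 3 neighbors, or 2 neighbors and is alive now.
    return ((neighbors == 3) | ((neighbors == 2) & (grid == 1))).astype(int)

def is_still_life(grid: np.ndarray) -> bool:
    """A still life is a configuration that maps to itself under one step."""
    return np.array_equal(step(grid), grid)

rng = np.random.default_rng(0)
grid = (rng.random((8, 8)) < 0.3).astype(int)
for _ in range(20):
    grid = step(grid)
print(grid.sum(), is_still_life(grid))   # live-cell count and stability after 20 steps
```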