CSE 5610: Large Language Models (2025 Fall)

Course Overview

This is an advanced, research-oriented course that teaches and discusses frontier papers on Large Language Models (language model architectures and training frameworks) as well as Large Language Model capabilities, applications, and issues.
Please be aware that this is a fast-paced, research-driven course, not an introduction to LLMs. The curriculum is tailored to advanced students (e.g., PhD candidates) conducting state-of-the-art LLM-related research. Students without a strong machine learning research background will find the pace and technical depth of the material exceptionally challenging.

Course Grading

  • 20% Preview question submissions
  • 25% Paper Presentation
  • 55% Final Project
    • 10% Project/Survey Proposal
    • 10% Mid-term Report
    • 10% Final Project Presentation (Group-based)
    • 5% Feedback on other groups’ final project presentations
    • 20% Final Project Report
    Paper Presentation

    Grading Criteria:

    Preview Questions Submission

    Each student is required to submit a preview question on a paper to be presented, due one day before each class (except the class in which you present). You are also encouraged to raise your question in class. Preview questions cannot be simple ones like "what is the aim of the paper?" or "how does this method differ from previous methods?"

    Final Project (2-3 students per group)

    Project Requirement: There are typically two types of projects.

    1. Design a novel algorithm to train a medium-sized language model (e.g., BERT or GPT-2) for a problem that you are interested in.
    2. Design a novel algorithm for inference on large language models (white-box models such as the Qwen, Llama, and DeepSeek series, or black-box models such as GPT, Gemini, Claude, etc.) to solve some type of complex problem, and analyze its limitations.

    Project Presentation: Dates: 12/2 and 12/4. You will need to sign up for a time slot near the end of the semester. Students must submit feedback scores for other groups’ presentations (through a Google Form).

    Office Hour

    Office hours are held on demand: if you find yourself needing to discuss course material or have questions at any point, feel free to send an email requesting an office hour. Based on these requests, we will organize time slots for students to schedule appointments.

    Teaching Assistants

    Zheyuan Wu (w.zheyuan@wustl.edu)

    Isle Song (s.xiaodao@wustl.edu)

    Syllabus (Course dates are tentative due to guest lectures.)

    (Each entry lists the date and topic, followed by the readings and slides.)
    Large Language Model Basics
    8/26: Course Overview
    Distributed Representations of Words and Phrases and their Compositionality (Word2Vec)
    Enriching Word Vectors with Subword Information
    Attention Is All You Need (Transformer)
    Slides
    8/28: Language Model Pre-training
    Language Models are Unsupervised Multitask Learners (GPT-2)
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
    Slides
    9/2: Scaling Laws and Emergent Behaviors
    Language Models are Few-Shot Learners (GPT-3)
    Emergent Abilities of Large Language Models
    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
    Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
    Slides
    9/4: Post-training (I): Instruction Tuning
    Multitask Prompted Training Enables Zero-Shot Task Generalization
    Cross-Task Generalization via Natural Language Crowdsourcing Instructions
    Self-Instruct: Aligning Language Models with Self-Generated Instructions
    LIMA: Less Is More for Alignment
    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
    Slides
    -----Student Presentation Starts-----
    State-of-the-Art Reasoning and Post-training
    9/9: Language Model Reasoning (I): Chain of Thought + Inference-Time Scaling
    Chain of Thought Prompting Elicits Reasoning in Large Language Models
    Self-Consistency Improves Chain of Thought Reasoning in Language Models
    Self-Refine: Iterative Refinement with Self-Feedback
    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
    9/11: Language Model Reasoning (II): Thinking in Latent Space
    Training Large Language Models to Reason in a Continuous Latent Space
    Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
    Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
    LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking
    -----Proposal Deadline: 9/15/2025-----
    9/16: Guest Lecture by Weijia Shi (University of Washington)
    9/18: Post-training (II): Reinforcement Learning from Human Feedback
    Training language models to follow instructions with human feedback
    Direct Preference Optimization: Your Language Model is Secretly a Reward Model
    SimPO: Simple Preference Optimization with a Reference-Free Reward
    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
    9/23: Post-training (III): Reinforcement Learning from Verified Rewards
    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    DAPO: An Open-Source LLM Reinforcement Learning System at Scale
    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
    Efficient Methods for Large Language Models
    9/25: Efficient Fine-Tuning
    The Power of Scale for Parameter-Efficient Prompt Tuning
    Parameter-Efficient Transfer Learning for NLP
    LoRA: Low-Rank Adaptation of Large Language Models
    Text-to-LoRA: Instant Transformer Adaption
    9/30: Efficient RLVR (Data & Computation)
    Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts
    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
    Spurious Rewards: Rethinking Training Signals in RLVR
    R-Zero: Self-Evolving Reasoning LLM from Zero Data
    10/2: Efficient Inference
    Fast Inference from Transformers via Speculative Decoding
    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
    Learning Harmonized Representations for Speculative Sampling
    10/7: Fall Break (no class)
    10/9: No Class (TBD)
    10/14: Long-Context Language Models
    Lost in the Middle: How Language Models Use Long Contexts
    RoFormer: Enhanced Transformer with Rotary Position Embedding
    LongNet: Scaling Transformers to 1B Tokens
    RULER: What's the Real Context Size of Your Long-Context Language Models?
    Large Language Model Factuality
    10/16: LLM Hallucination and Solutions
    How Language Model Hallucinations Can Snowball
    Improving Factuality and Reasoning in Language Models through Multiagent Debate
    Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
    Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation
    -----Mid-Term Report Deadline: 10/20-----
    10/21: Language Model Calibration
    Just Ask for Calibration
    Teaching Models to Express Their Uncertainty in Words
    Taming Overconfidence in LLMs: Reward Calibration in RLHF
    Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models
    Large Language Model Applications
    10/23: Retrieval-Augmented Generation
    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation
    REPLUG: Retrieval-Augmented Black-Box Language Models
    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
    10/28: Language Models as Agents
    Toolformer: Language Models Can Teach Themselves to Use Tools
    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
    ART: Automatic multi-step reasoning and tool-use for large language models
    A-MEM: Agentic Memory for LLM Agents
    10/30: Agentic RAG
    Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
    Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
    Search-o1: Agentic Search-Enhanced Large Reasoning Models
    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
    11/4: Multi-modal LLMs
    Learning Transferable Visual Models From Natural Language Supervision
    Visual Instruction Tuning
    NExT-GPT: Any-to-Any Multimodal LLM
    Evaluating Object Hallucination in Large Vision-Language Models
    Large Language Model Evaluation
    11/6: Evaluation of Language Models
    Proving Test Set Contamination in Black Box Language Models
    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
    Large Language Models are not Fair Evaluators
    Holistic Evaluation of Language Models
    11/11: Detection of LLM Generation
    DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
    GPT-who: An Information Density-based Machine-Generated Text Detector
    A Watermark for Large Language Models
    GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content
    Other Topics
    11/13: Revisiting Other Language Model Architectures
    Mixtral of Experts
    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
    RWKV: Reinventing RNNs for the Transformer Era
    Hierarchical Reasoning Model
    11/18: Language Model Bias
    Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
    Whose Opinions Do Language Models Reflect?
    “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters
    Red Teaming Language Models with Language Models
    11/20: Language Model Safety
    Multi-step Jailbreaking Privacy Attacks on ChatGPT
    Jailbreaking Black Box Large Language Models in Twenty Queries
    Quantifying Memorization Across Neural Language Models
    Poisoning Language Models During Instruction Tuning
    11/25: Guest Lecture by Bowen Jin (University of Illinois at Urbana-Champaign)
    11/27: No Class
    -----Project Presentation Deadline: 12/1-----
    12/2: Final Project Presentation I
    12/4: Final Project Presentation II
    -----Project Final Report Deadline: 12/12-----