CSE 4061: Text Mining (2026 Spring)

Course Overview

This is an advanced, research-oriented course on the fundamental techniques of text mining and natural language processing. Text mining is a rapidly evolving field at the intersection of natural language processing and machine learning. Students will gain both in-depth knowledge of the fundamental concepts and hands-on experience with practical applications.
Prerequisites: Students are expected to understand machine learning concepts (CSE 4107/5107).

Teaching Assistants

Jinyuan Li (ljinyuan@wustl.edu)
Ryan Zhang (yongyan@wustl.edu)

Course Grading

Final Project (2-3 students per group)

Project Requirement: Demonstrate that you can apply the knowledge and techniques learned in this course. The project requires a more complex implementation than the programming assignments. Topics include, but are not limited to:

  1. Investigate word embeddings and sentence embeddings for text classification problems.
  2. Train a medium-sized language model (e.g., BERT, GPT-2) for tasks that you are interested in.
  3. Run inference on large language models (white-box models such as Qwen or LLaMA, or black-box APIs such as Gemini, GPT, or Claude) to solve a class of complex problems, and analyze their limitations.
  4. Create a benchmark for new and challenging tasks and evaluate SoTA models on it.

Project Presentation Dates: 4/21 and 4/23. You will need to sign up for a time slot near the end of the semester. Each presentation will last 10-15 minutes, depending on the number of groups.

Office Hours

Instructor Office Hours: Thursday 11am-12pm at McKelvey Hall 2010E

TA Office Hours: Thursday 1-2pm at McKelvey Hall 2037 (Jinyuan Li)

TA Office Hours: Tuesday 1-2pm at McKelvey Hall 2037 (Ryan Zhang)

Course Policies

Syllabus (The content of each class is tentative.)

Week    | Date  | Topic | Assignment
Week 1  | 01/13 | Course Overview |
        | 01/15 | N-gram Models |
Week 2  | 01/20 | Bag of Words, TF-IDF |
        | 01/22 | Word Representations and Neural Word Embeddings |
Week 3  | 01/27 | Neural Word Embeddings (Cont'd) |
        | 01/29 | Document Representations | HW1 Out
Week 4  | 02/03 | Neural Sequence Modeling (RNN, LSTM) | Proposal Due
        | 02/05 | Neural Sequence Modeling and Self Attention |
Week 5  | 02/10 | Transformer architectures | HW1 Due
        | 02/12 | LLM Pre-training |
Week 6  | 02/17 | Text Mining Applications: Sentiment Analysis | HW2 Out
        | 02/19 | Text Mining Applications: Information Extraction |
Week 7  | 02/24 | Large Language Models: Pre-training and Scaling |
        | 02/26 | Advanced LLM reasoning (I): Prompting |
Week 8  | 03/03 | Instruction Tuning | HW2 Due
        | 03/05 | Reinforcement Learning with Human Feedback | HW3 Out, Mid-Term Report Due
        |       | Spring Break |
Week 10 | 03/17 | Advanced LLM reasoning (II): Reinforcement Learning with Verifiable Rewards |
        | 03/19 | Advanced LLM reasoning (III): Self-Training and Routing | HW3 Due
Week 11 | 03/24 | Language Model Factuality | HW4 Out
        | 03/26 | LLM Training Efficiency |
Week 12 | 03/31 | LLM Inference Efficiency |
        | 04/02 | LLM Multi-modality | HW4 Due
Week 13 | 04/07 | LLM Applications: Retrieval-Augmented Generation |
        | 04/09 | LLM Applications: Agents |
Week 14 | 04/14 | Future Directions |
        | 04/16 | TBD |
Week 15 | 04/21 | Final Project Presentations |
        | 04/23 | Final Project Presentations |