CSE 4061: Text Mining (2025 Spring)
Course Overview
This is an advanced, research-oriented course on the fundamental techniques of text mining and natural language processing. Text mining is a rapidly evolving field at the intersection of natural language processing and machine learning. Students will gain both in-depth knowledge of fundamental concepts and hands-on experience with practical applications.
Prerequisites: Students are expected to understand core machine learning concepts (CSE 417T/517A).
Teaching Assistants
Langlin Huang (h.langlin@wustl.edu)
Xinhang Yuan (xinhang.y@wustl.edu)
Course Grading
- 10% Class Participation
  - Regular class participation and discussion (10%)
- 60% Programming Assignments
  - 4 programming assignments, each worth 15%
  - 5 grace days in total, shared across all assignments
  - Once the grace period is used up, late assignments receive a score of 0
- 30% Final Project (Group-Based, 2-3 people)
  - 5% Project Proposal (Due: 2/4, 11:59PM)
  - 5% Mid-term Report (Due: 3/3, 11:59PM)
  - 10% Final Course Presentation (Due: 4/14, 11:59PM)
    - We will use two lectures for project presentations: 4/15 and 4/17
  - 10% Final Project Report (Due: 5/2, 11:59PM)
Final Project (2-3 students per group)
Project Requirement: Demonstrate that you are able to apply the knowledge and techniques learned in this course. The project requires a more complex implementation than the programming assignments. Topics include, but are not limited to:
- Investigate word embeddings and sentence embeddings for text classification problems.
- Train a medium-sized language model (e.g., BERT, GPT-2) for tasks that you are interested in.
- Run inference on large language models (white-box models such as LLaMA, or black-box models such as GPT-4 and Claude) to solve a class of complex problems, and analyze their limitations.
  - https://platform.openai.com/docs/introduction
  - https://docs.anthropic.com/claude/reference/getting-started-with-the-api
- Create a benchmark for new and challenging tasks and evaluate state-of-the-art (SoTA) models on it.
Project Presentation Dates: 4/15 and 4/17, 2025. You will need to sign up for a time slot near the end of the semester. Presentations will be 10-15 minutes long, depending on the number of groups.
Office Hours
Instructor Office Hours: Thursday, after class until 5pm, at McKelvey Hall 2010E
TA Office Hours: Tuesday 10-11am at McKelvey Hall 2040 (Langlin Huang)
TA Office Hours: Friday 10-11am at McKelvey Hall 2040 (Xinhang Yuan)
Course Policies
- LLM Usage Policy:
  - You may collaborate with LLMs on coding assignments and on refining your reports. However, submitting LLM-generated output without manually checking it will result in a score of 0 for the assignment.
- Extra Credit:
  - The first student to correctly answer a technical question (excluding assignment questions) raised by another student on Piazza receives 1 bonus point each time, up to 3 points in total.
Syllabus (The content of each class is tentative.)
Week | Date | Topic | Assignment |
Week 1 | 01/14 | Course Overview | |
Week 1 | 01/16 | N-gram Models | |
Week 2 | 01/21 | Bag of Words, TF-IDF | |
Week 2 | 01/23 | Word Representations and Neural Word Embeddings | |
Week 3 | 01/28 | Neural Word Embeddings (Cont'd) | |
Week 3 | 01/30 | Document Representations | HW1 Out |
Week 4 | 02/04 | Neural Sequence Modeling (RNN, LSTM) | |
Week 4 | 02/06 | Neural Sequence Modeling and Self-Attention | |
Week 5 | 02/11 | Transformer Architectures | HW1 Due |
Week 5 | 02/13 | LLM Pre-training | |
Week 6 | 02/18 | Text Mining Applications: Sentiment Analysis | HW2 Out |
Week 6 | 02/20 | Text Mining Applications: Information Extraction | |
Week 7 | 02/25 | Large Language Models: Pre-training and Scaling | |
Week 7 | 02/27 | Instruction Tuning | HW2 Due |
Week 8 | 03/04 | Advanced LLM Reasoning (I) | HW3 Out |
Week 8 | 03/06 | Advanced LLM Reasoning (II) | |
Spring Break | | | |
Week 10 | 03/18 | Reinforcement Learning from Human Feedback | HW3 Due |
Week 10 | 03/20 | Language Model Factuality | HW4 Out |
Week 11 | 03/25 | LLM Training Efficiency | |
Week 11 | 03/27 | LLM Inference Efficiency | |
Week 12 | 04/01 | LLM Applications: Retrieval-Augmented Generation | HW4 Due |
Week 12 | 04/03 | LLM Applications: Agents | |
Week 13 | 04/08 | LLM Multi-modality | |
Week 13 | 04/10 | Future Directions | |
Week 14 | 04/15 | Final Project Presentations | |
Week 14 | 04/17 | Final Project Presentations | |