CSE 4061: Text Mining (2025 Spring)

Course Overview

This is an advanced research-oriented course that teaches fundamental techniques of text mining and natural language processing. It is a rapidly evolving field at the intersection of natural language processing and machine learning. Students will gain both in-depth knowledge of fundamental concepts and hands-on experience in practical applications.
Pre-requisites: Students are expected to understand concepts in machine learning (CSE 417T/517A)

Teaching Assistants

Langlin Huang (h.langlin@wustl.edu)
Xinhang Yuan (xinhang.y@wustl.edu)

Course Grading

10% Class Participation
- Regular class participation and discussion (10%)
60% Programming Assignments
- 4 programming assignments, each account for 15%
- 5 days of grace period in total for all assignments
- After grace period is used up, late assignments receive 0 score
30% Final Project (Group-Based, 2-3 people)
- 5% Project Proposal (Due: 2/4, 11:59PM)
- 5% Mid-term Report (Due: 3/3, 11:59PM)
- 10% Final Course Presentation (Due: 4/14, 11:59PM)
- We will use two lectures for project presentation: 4/15, 4/17
- 10% Final Project Report (Due: 5/2, 11:59PM)

Final Project (2-3 students per group)

Project Requirement: Demonstrate that you are able to apply the knowledge and techniques learned from this course. The project requires more complex implementation than the programming assignments. Topics include but not limited to:

Investigate word embeddings and sentence embeddings for text classification problems.
Train a medium-sized language model (e.g., BERT, GPT-2) for tasks that you are interested in.

https://huggingface.co/models

Do inference on large language models (white box models such as LLaMA models, or black box models such as GPT-4, CLAUDE, etc.) to solve some type of complex problems, and analyze their limitations.

Create benchmark for new and challenging tasks and test it with SoTA models.

Project Presentation Date: 4/15 and 4/17, 2025. You will need to signup for a time slot near the end of the semester. Presentation length will be 10-15 minutes depending on the number of groups.

Office Hour

Instructor Office Hour: Thursday After Class - 5pm at McKelvey Hall 2010E

TA Office Hour: Tuesday 10-11am at McKelvey Hall 2040 (Langlin Huang)

TA Office Hour: Friday 10-11am at McKelvey Hall 2040 (Xinhang Yuan)

Course Policies

LLM Usage Policy:
- It is fine to collaborate with LLMs for coding assignments and refining your reports. However, directly using LLM generated outputs without manual check results in 0 score of the assignment.
Extra Credit:
- Students who first correctly answers technical questions (excluding assignment questions) raised by other students on Piazza will get 1 bonus point each time, up to 3 points in total.

Syllabus (The content of each class is tentative.)

Week	Date	Topic	Assignment
Week 1	01/14	Course Overview
Week 1	01/16	N-gram Models
Week 2	01/21	Bag of Words, TF-IDF
Week 2	01/23	Word Representations and Neural Word Embeddings
Week 3	01/28	Neural Word Embeddings (Cont'd)
Week 3	01/30	Document Representations	HW1 Out
Week 4	02/04	Neural Sequence Modeling (RNN, LSTM)
Week 4	02/06	Neural Sequence Modeling and Self Attention
Week 5	02/11	Transformer architectures	HW1 Due
Week 5	02/13	LLM Pre-training
Week 6	02/18	Text Mining Applications: Sentiment Analysis	HW2 Out
Week 6	02/20	Text Mining Applications: Information Extraction
Week 7	02/25	Large Language Models: Pre-training and Scaling
Week 7	02/27	Advanced LLM reasoning (I)
Week 8	03/04	Advanced LLM reasoning (II)	HW2 Due
Week 8	03/06	Instruction Tuning	HW3 Out
Spring Break
Week 10	03/18	Reinforcement Learning with Human Feedback
Week 10	03/20	Language Model Factuality	HW3 Due
Week 11	03/25	LLM Training Efficiency	HW4 Out
Week 11	03/27	LLM Inference Efficiency
Week 12	04/01	LLM Applications: Retrieval-Augmented Generation
Week 12	04/03	LLM Applications: Agents	HW4 Due
Week 13	04/08	LLM Multi-modality
Week 13	04/10	Future Directions
Week 14	04/15	Final Project Presentations
Week 14	04/17	Final Project Presentations

Jiaxin Huang