CSE 4061: Text Mining (2025 Spring)

Course Overview

This advanced, research-oriented course teaches fundamental techniques of text mining and natural language processing. Text mining is a rapidly evolving field at the intersection of natural language processing and machine learning. Students will gain both in-depth knowledge of fundamental concepts and hands-on experience with practical applications.
Prerequisites: Students are expected to understand machine learning concepts (CSE 417T/517A).

Teaching Assistants

Langlin Huang (h.langlin@wustl.edu)
Xinhang Yuan (xinhang.y@wustl.edu)

Course Grading

Final Project (2-3 students per group)

Project Requirement: Demonstrate that you can apply the knowledge and techniques learned in this course. The project requires a more complex implementation than the programming assignments. Topics include, but are not limited to:

  1. Investigate word embeddings and sentence embeddings for text classification problems.
  2. Train a medium-sized language model (e.g., BERT, GPT-2) for tasks that you are interested in.
  3. Run inference with large language models (white-box models such as the LLaMA models, or black-box models such as GPT-4 and Claude) to solve complex problems, and analyze their limitations.
  4. Create a benchmark for new and challenging tasks and test SoTA models on it.

    Project Presentation Dates: 4/15 and 4/17, 2025. You will need to sign up for a time slot near the end of the semester. Presentations will be 10-15 minutes long, depending on the number of groups.

    Office Hour

    Instructor Office Hour: Thursday, after class until 5pm, at McKelvey Hall 2010E

    TA Office Hour: Tuesday 10-11am at McKelvey Hall 2040 (Langlin Huang)

    TA Office Hour: Friday 10-11am at McKelvey Hall 2040 (Xinhang Yuan)

    Course Policies

    Syllabus (The content of each class is tentative.)

    Week     Date   Topic                                             Assignment
    Week 1   01/14  Course Overview
             01/16  N-gram Models
    Week 2   01/21  Bag of Words, TF-IDF
             01/23  Word Representations and Neural Word Embeddings
    Week 3   01/28  Neural Word Embeddings (Cont'd)
             01/30  Document Representations                          HW1 Out
    Week 4   02/04  Neural Sequence Modeling (RNN, LSTM)
             02/06  Neural Sequence Modeling and Self Attention
    Week 5   02/11  Transformer Architectures                         HW1 Due
             02/13  LLM Pre-training
    Week 6   02/18  Text Mining Applications: Sentiment Analysis      HW2 Out
             02/20  Text Mining Applications: Information Extraction
    Week 7   02/25  Large Language Models: Pre-training and Scaling
             02/27  Instruction Tuning                                HW2 Due
    Week 8   03/04  Advanced LLM Reasoning (I)                        HW3 Out
             03/06  Advanced LLM Reasoning (II)
    Spring Break
    Week 10  03/18  Reinforcement Learning with Human Feedback        HW3 Due
             03/20  Language Model Factuality                         HW4 Out
    Week 11  03/25  LLM Training Efficiency
             03/27  LLM Inference Efficiency
    Week 12  04/01  LLM Applications: Retrieval-Augmented Generation  HW4 Due
             04/03  LLM Applications: Agents
    Week 13  04/08  LLM Multi-modality
             04/10  Future Directions
    Week 14  04/15  Final Project Presentations
             04/17  Final Project Presentations