CIS 8538: Text Mining and Language Processing [Spring 2011]

Additional information about this course may be found on the Web at http://knight.cis.temple.edu/~yates/cis8538/sp11/.

Lecture Time: Thursdays, 5:30pm to 8:00pm in Tuttleman 1B

Instructor: Alexander Yates

DESCRIPTION

This course will give a broad overview of problems and techniques in natural language processing and text mining, and then move on to cover the latest research in selected topics. The overview part of the course will cover the foundational topics listed in the OUTLINE section below.

The in-depth part of the course will focus on the latest research in topics like domain adaptation, unsupervised and self-supervised information extraction, and knowledge acquisition.

PREREQUISITES

A general familiarity and a basic level of comfort with probability and statistics are essential and will be assumed. Successful completion of one of the following courses, or specific permission of the instructor, is required: CIS 8525, 8526, 9603, 9664; Math 8031 or Math 9031; Statistics 8001.

TEXT

For part of the course we will use the free online textbook Introduction to Information Retrieval, by Manning, Raghavan, and Schuetze.

We will also be reading extensively from the research literature.

GRADING

EXAMS AND QUIZZES

All exams and quizzes are closed book. Their content is cumulative, i.e., they address the material from the entire semester up to the day of the exam. If a student misses the midterm because of an emergency (as agreed with the instructor), there will be no makeup exam: the quizzes and final exam will become proportionally more important. If a student misses the midterm without prior agreement and without definitive proof of the medical or legal reasons, he or she will get a zero for that exam. Quizzes that are missed will not be made up.

FINAL PROJECT

Several project ideas will be suggested during the semester, but students are free to propose their own, especially ideas that relate to their current research. Students will be expected to come up with innovative, novel solutions to problems in text mining and language processing.

Course projects will be undertaken individually or in small teams (2-3 students). Each student on a team will receive the same grade for the project; it is up to the team members to divide the work fairly.

OUTLINE

Week 1: Introduction to Natural Language Processing (NLP)

Overview of the field; defining problems (machine translation, question answering, information retrieval); history of approaches to NLP (early knowledge-based approaches, heuristic techniques, statistical approaches, and approaches combining statistics and logic or knowledge); examples of why language is a tricky thing to process.

Weeks 2-5: NLP ignoring structure

The modern search engine, index construction, the vector space model, TF-IDF, methods for scoring and term weighting, pseudo-relevance feedback, collaborative filtering, dimensionality reduction with PCA and LSA, LDA, Naïve Bayes and SVM classifiers, pattern-matching for information extraction, bootstrapping and redundancy.
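
As a small taste of the vector space model and TF-IDF weighting named above, here is a minimal sketch. It is an illustration only, not course material: it assumes whitespace tokenization, raw term frequency, and idf = log(N / df), which is just one of several common weighting variants. The function names tf_idf_vectors and cosine and the three toy documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "text mining finds patterns in text".split(),
]
vectors = tf_idf_vectors(docs)
```

With these toy documents, the first two share several terms and so have a positive cosine similarity, while the first and third share none and score zero. A term that occurs in every document would get weight zero under this idf variant.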

Weeks 6-8: NLP with sequence structure

N-gram models, Hidden Markov Models, log-linear models, Conditional Random Fields, EM, supervised information extraction.
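
To make the first item in this unit concrete, the following is a minimal sketch of a bigram language model. It is an illustration only, not course material: it assumes add-one (Laplace) smoothing, which is one simple choice among many, and the names train_bigram and bigram_prob, the boundary markers, and the toy sentences are invented for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts = Counter()  # how often each token appears as a left context
    bigram_counts = Counter()
    vocab = set()
    for sent in sentences:
        # Pad with sentence-boundary markers (an assumption of this sketch).
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, cur, context_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing over the vocabulary."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))

# Toy training data (hypothetical).
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
context_counts, bigram_counts, vocab = train_bigram(sentences)
p_cat_given_the = bigram_prob("the", "cat", context_counts, bigram_counts, vocab)
```

Because of the smoothing, a bigram never seen in training, such as ("cat", "dog"), still receives a small nonzero probability, and the probabilities of all vocabulary items after a given context sum to one.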

Weeks 9-11: Trees and linguistic structure

Dependency trees, PCFGs, parsing algorithms, brief introduction to high-precision linguistic grammars, grammar induction, semantic role labeling.

Weeks 12-14: Advanced Topics

Polysemy and synonymy, coreference resolution; handling sparsity, domain adaptation and representations; combining logic and probability, Markov logic and NLP, ontology extension; Open IE, machine reading, Wikipedia processing, and learning by reading; combining language, vision, and action.

MISCELLANEOUS