Mon-Wed 4:00-5:30 pm
Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).
The goal of this course is to provide the theoretical insights behind commonly-used data-mining algorithms. Such algorithms include those designed for clustering, classification, link-analysis ranking, dimensionality reduction etc. The focus of the course will be on the algorithmic issues as well as applications of data mining to real-world problems. Students will be required to solve theoretical exercises as well as programming assignments that will help them better understand the covered material.
Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Mon 5:30-7pm, Wed 9:30-11am, or by appointment.
www.cs.bu.edu/~evimaria
Harshal Chaudhari, harshal@bu.edu
Office Hours: Tues 2:30-4pm and Thur 11-12:30pm
http://cs-people.bu.edu/harshal/
Two programming projects (25%)
Three problem sets (25%)
Two exams: one midterm (20%) and one final (30%)
Working knowledge of programming and data structures (CS 112, or equivalent). Familiarity with basic algorithmic concepts, probability, statistics and linear algebra (CS 237, CS 330 or equivalent). Programming projects will require knowledge of C (or C++), Java, Python or Matlab.
Week 1 | What is data mining? Introductory lecture | |
Week 2 | Distance functions, finding similar objects | pdf1, pdf2 |
Week 3 | Finding similar entities: locality-sensitive hashing (min-wise permutations), dimensionality reduction | |
Week 4,5 | Clustering | |
Week 6 | Hierarchical Clustering, Clustering aggregation | pdf,pdf |
Week 7 | Covering and Influence maximization | pdf,pdf |
Week 8 | Clustering: graph cuts and spectral graph partitioning | |
Week 7 | Review and midterm | |
Week 8 | Network models | |
Week 10 and 11 | Classification methods (decision trees, naive Bayes, boosting) | pdf,pdf,pdf |
Week 12 | Link Analysis Ranking, Voting Systems | pdf,pdf |
Week 13 | Time-series segmentation | |
Week 14 | Recommender systems, Matrix completion | pdf,pdf |
Although there is no required textbook for the course, we will use material from the following books throughout the class:
A. Rajaraman and J. Ullman: Mining of Massive Datasets. Cambridge University Press, 2012.
P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining. Addison-Wesley, 2006.
D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001.
Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, March 2006.
Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.
Programming Projects (2) 25%
Problem Sets (3) 25%
Midterm 20%
Final Exam 30%
Incompletes will not be given.
Late Assignment Policy: For the homeworks, there will be a penalty of 10% per day, up to three days late. After that no credit will be given. No late project submissions will be accepted.
"I don’t know" policy: In all homeworks and exams, an "I don’t know" answer to a question will receive 20% of the points assigned to that question. A wrong answer will be graded as 0.
Although you can collaborate or use the Web in order to solve many of your assignment problems, you need to specify explicitly the sources that helped you obtain your final answers.