Mon-Wed 4:00-5:30 pm
Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).
The goal of this course is to provide an introduction to the main topics in data mining, including frequent-itemset mining, clustering, classification, link-analysis ranking, and dimensionality reduction. The course will focus on algorithmic issues as well as on applications of data mining to real-world problems. Students will be required to complete small written and programming assignments that will help them better understand the material covered.
Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Mon 2pm-3:30pm and Tues 5pm-6:30pm, or by appointment.
www.cs.bu.edu/~evimaria
Harry Mavroforakis, cmav@bu.edu
Office Hours: Wed 1pm-2:30pm and Thur 4pm-5:30pm
http://cs-people.bu.edu/cmav/
Two programming projects (25%)
Three problem sets (25%)
Two exams; one midterm (20%) and one final (30%)
Working knowledge of programming and data structures (CS 112, or equivalent). Familiarity with basic algorithmic concepts, probability, statistics and linear algebra (CS 237, CS 330 or equivalent). Programming projects will require knowledge of C (or C++), Java, Python or Matlab.
Sept 4 | What is data mining? Introductory lecture | |
Sept 9, 11 | Distance functions | |
Sept 13 | Homework 1; Due Sept 30 | download |
Sept 16 | Finding similar entities: locality-sensitive hashing (min-wise permutations) | |
Sept 18 | Finding similar entities: locality-sensitive hashing and dimensionality reduction | |
Sept 23 | Project 1; Due Oct 16 | download |
Sept 23, 25 | Clustering: partition-based methods | |
Sept 30 | TA teaches a class on Project 1 | |
Oct 2 | Hierarchical Clustering | |
Sept 13 | Homework 2; Due Oct 21 | download |
Oct 7 | Clustering aggregation | |
Oct 9, 15 | Clustering: graph cuts and spectral graph partitioning | |
Oct 16 | Classification: Decision Tree Classifiers | |
Oct 21 | Midterm! | |
Oct 23 | Midterm review | |
Oct 28 | Naive Bayes Classifier | |
Oct 30 | Evaluation of classification; Intro to Link Analysis Ranking | |
Oct 30 | Project 2; Due Dec 13 | download |
Nov 4, 6 | Link Analysis Ranking: Rank Aggregation and Voting Theory | |
Nov 11, 13 | Voting; Covering problems | |
Nov 15 | Homework 3; Due Dec 2 | download |
Nov 18 | Optimization of submodular functions | |
Nov 20 | Information propagation in social networks | |
Nov 25 | Review of information propagation models | |
Nov 27 | No class; Thanksgiving break | |
Dec 2 | Algorithms for Recommendation systems; collaborative filtering | |
Dec 4 | Algorithms for co-clustering | |
Dec 9 | Time-series data analysis | |
Dec 11 | Review Class | |
Dec 17 | Final Exam | |
Although there is no required textbook for the course, we will use material from the following books throughout the class:
A. Rajaraman and J. Ullman: Mining of Massive Datasets. Cambridge University Press, 2012.
P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining. Addison-Wesley, 2006.
D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001.
Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, March 2006.
Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.
Programming Projects (2) 25%
Problem Sets (3) 25%
Midterm 20%
Final Exam 30%
Incompletes will not be given.
Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that, no credit will be given.
“I don’t know” policy: In all homework assignments and exams, an “I don’t know” answer
to a question will be given 20% of the grade that corresponds to that
question.
A wrong answer will be graded as a 0.
Although you can collaborate or use the Web in order to solve many of your
assignment problems, you need to specify explicitly the sources that helped
you obtain your final answers.