Mon-Wed 1:00-2:30 pm

Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).

The goal of this course is to provide an introduction to the main topics in data mining, including frequent-itemset mining, clustering, classification, link-analysis ranking and dimensionality reduction. The course focuses on algorithmic issues as well as on applications of data mining to real-world problems. Students will be required to solve small written and programming assignments that will help them better understand the covered material.

Evimaria Terzi, evimaria@cs.bu.edu

Office Hours: Wed 2:30pm-5:30pm or by appointment.

www.cs.bu.edu/~evimaria

Natali Ruchansky, natalir@bu.edu

Office Hours: Mon 11:30am-1pm and Thur 3:30pm-5pm

http://cs-people.bu.edu/natalir/

Two programming projects (25%)

Three problem sets (25%)

Two exams; one midterm (20%) and one final (30%)

Working knowledge of programming and data structures (CS 112, or
equivalent). Familiarity with basic algorithmic concepts, probability,
statistics and linear algebra (CS 237, CS 330 or equivalent).
Programming projects will require knowledge
of C (or C++), Java, Python or Matlab.

Sept 3 | What is data mining? Introductory lecture | |

Sept. 8 and 10 | Distance functions | |

Sept. 15 | Homework 1; due Oct 1 2014 | download |

Sept. 15 | Finding similar entities: locality-sensitive hashing (min-wise permutations) | |

Sept 17 | Dimensionality reduction | |

Sept 22 | Matrix Sketches | |

Sept 24 | Clustering : partition-based methods | |

Sept 25 | Project 1; due Oct 14 2014 | download |

Sept 29, Oct 1 | Clustering: k-means and 1-dimensional k-means | lecture on the board |

Oct 6 | Hierarchical clustering | |

Oct 8 | Clustering aggregation | |

Oct. 9 | Homework 2; due Oct 24 2014 | download |

Oct 13, 15, 22, 24 | Clustering: graph cuts and spectral graph partitioning | |

Oct 27 | Review | |

Oct 29 | Midterm! | |

Nov 3 | Decision Trees | |

Nov 5 | Naive Bayes Classifier | |

Nov 10 | Evaluation of classification | |

Nov 12 | Covering | |

Nov 17 | Information propagation | |

Nov 19 | Link Analysis Ranking | |

Nov 24 | Rank aggregation | |

Nov 24 | Homework 3; due Dec 10 2014 | download |

Nov 26 | No class – Thanksgiving break | |

Dec 1 | Rank Aggregation | |

Dec 3 | Algorithms for Recommendation systems; collaborative filtering | |

Time-series data analysis | ||

Although there is no required textbook for the course, we will use material from the following books throughout the class:

A. Rajaraman and J. Ullman: Mining of Massive Datasets. Cambridge University Press, 2012.

P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining. Addison-Wesley, 2006.

D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001.

Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, March 2006.

Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.

Programming Projects (2) 25%

Problem Sets (3) 25%

Midterm 20%

Final Exam 30%

Incompletes will not be given.

**Late Assignment Policy**: For the homeworks, there will be a penalty of 10% per day, up to
three days late. After that no credit will be given. No late project
submissions will be accepted.

**I don’t know policy**: In all homeworks and exams, an “I don’t know” answer
to a question will be given 20% of the grade that corresponds to this
question.

A wrong answer will be graded as a 0.

Although you can collaborate or use the Web to solve many of your
assignment problems, you need to specify explicitly the sources that helped
you obtain your final answers.