CAS CS 565, Data Mining Fall 2016

Schedule

Mon-Wed 4:00-5:30 pm

Course Outline

Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).

The goal of this course is to provide the theoretical insights behind commonly used data-mining algorithms. Such algorithms include those designed for clustering, classification, link-analysis ranking, dimensionality reduction, etc. The course will focus on algorithmic issues as well as applications of data mining to real-world problems. Students will be required to complete theoretical exercises and programming assignments that will help them better understand the material covered.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Mon 5:30-7pm, Wed 9:30-11am, or by appointment.
www.cs.bu.edu/~evimaria

Teaching Fellow

Harshal Chaudhari, harshal@bu.edu
Office Hours: Tues 2:30-4pm and Thur 11am-12:30pm
http://cs-people.bu.edu/harshal/

Workload

  1. Two programming projects (25%)

  2. Three problem sets (25%)

  3. Two exams: one midterm (20%) and one final (30%)

Prerequisites

Working knowledge of programming and data structures (CS 112, or equivalent). Familiarity with basic algorithmic concepts, probability, statistics and linear algebra (CS 237, CS 330 or equivalent). Programming projects will require knowledge of C (or C++), Java, Python or Matlab.

Syllabus

Week 1 What is data mining? Introductory lecture pdf
Week 2 Distance functions, finding similar objects pdf1, pdf2
Week 3 Finding similar entities: locality-sensitive hashing (min-wise permutations), dimensionality reduction pdf
Week 4,5 Clustering pdf
Week 6 Hierarchical Clustering, Clustering aggregation pdf, pdf
Week 7 Covering and Influence maximization pdf, pdf
Week 8 Clustering: graph cuts and spectral graph partitioning pdf
Week 7 Review and midterm
Week 8 Network models pdf
Week 10 and 11 Classification methods (decision trees, naive Bayes, boosting) pdf, pdf, pdf
Week 12 Link Analysis Ranking, Voting Systems pdf, pdf
Week 13 Time-series segmentation pdf
Week 14 Recommender systems, Matrix completion pdf, pdf

Textbooks

Although there is no required textbook for the course, we will use material from the following books throughout the class:

Grading Policy

Incompletes will not be given.

Late Assignment Policy: For the homeworks, there will be a penalty of 10% per day, up to three days late. After that no credit will be given. No late project submissions will be accepted.
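As an illustration only, the homework late penalty above can be sketched in a few lines of Python. The function name `late_adjusted_score` is hypothetical, and the sketch assumes the 10% is deducted from the earned score for each day late; the policy text does not say whether the penalty applies to the earned score or to the maximum score, so ask the instructor if in doubt.

```python
def late_adjusted_score(score: float, days_late: int) -> float:
    """Homework score after the late penalty.

    Assumption (not stated in the policy): each late day deducts
    10% of the earned score, and days are counted as whole days.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        # More than three days late: no credit.
        return 0.0
    # 10% off per late day, computed with integer percentages.
    return score * (100 - 10 * days_late) / 100


if __name__ == "__main__":
    for days in range(5):
        print(days, late_adjusted_score(80, days))
```

This applies to homeworks only; projects submitted late receive no credit at all.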

"I don’t know" policy: In all homeworks and exams, an "I don’t know" answer to a question will be given 20% of the grade that corresponds to this question.
A wrong answer will be graded as a 0.

Collaborations/Academic Honesty

Although you can collaborate or use the Web in order to solve many of your
assignment problems, you need to specify explicitly the sources that helped
you obtain your final answers.