CAS CS 565, Data Mining Fall 2014

Schedule

Mon-Wed 1:00-2:30 pm

Course Outline

Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).

The goal of this course is to provide an introduction to the main topics in data mining, including frequent-itemset mining, clustering, classification, link-analysis ranking, and dimensionality reduction. The course focuses on algorithmic issues as well as applications of data mining to real-world problems. Students will be required to complete short written and programming assignments that will help them better understand the covered material.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Wed 2:30pm-5:30pm or by appointment.
www.cs.bu.edu/~evimaria

Teaching Fellow

Natali Ruchansky, natalir@bu.edu
Office Hours: Mon 11:30am-1pm and Thur 3:30pm-5pm
http://cs-people.bu.edu/natalir/

Workload

  1. Two programming projects (25%)

  2. Three problem sets (25%)

  3. Two exams; one midterm (20%) and one final (30%)

Prerequisites

Working knowledge of programming and data structures (CS 112 or equivalent). Familiarity with basic algorithmic concepts, probability, statistics and linear algebra (CS 237, CS 330 or equivalent). Programming projects will require knowledge of C (or C++), Java, Python or Matlab.

Syllabus

Sept 3 What is data mining? Introductory lecture pdf
Sept 8, 10 Distance functions pdf
Sept 15 Homework 1; due Oct 1 2014 download
Sept 15 Finding similar entities: locality-sensitive hashing (min-wise permutations) pdf
Sept 17 Dimensionality reduction pdf
Sept 22 Matrix sketches pdf
Sept 24 Clustering: partition-based methods pdf
Sept 25 Project 1; due Oct 14 2014 download
Sept 29, Oct 1 Clustering: k-means and 1-dimensional k-means (lecture on the board)
Oct 6 Hierarchical clustering pdf
Oct 8 Clustering aggregation pdf
Oct 9 Homework 2; due Oct 24 2014 download
Oct 13, 15, 22, 24 Clustering: graph cuts and spectral graph partitioning pdf
Oct 27 Review
Oct 29 Midterm!
Nov 3 Decision trees pdf
Nov 5 Naive Bayes classifier pdf
Nov 10 Evaluation of classification pdf
Nov 12 Covering pdf
Nov 17 Information propagation pdf
Nov 19 Link analysis ranking pdf
Nov 24 Rank aggregation pdf
Nov 24 Homework 3; due Dec 10 2014 download
Nov 26 No class – Thanksgiving break
Dec 1 Rank aggregation
Dec 3 Algorithms for recommendation systems; collaborative filtering pdf
Time-series data analysis pdf

Textbooks

Although there is no required textbook for the course, we will use material from the following books throughout the class:

Grading Policy

Incompletes will not be given.

Late Assignment Policy: For the homeworks, there will be a penalty of 10% per day, up to three days late. After that no credit will be given. No late project submissions will be accepted.

“I don’t know” policy: In all homeworks and exams, an “I don’t know” answer to a question will receive 20% of the credit for that question.
A wrong answer will receive 0.

Collaborations/Academic Honesty

Although you may collaborate or use the Web to solve many of your
assignment problems, you must explicitly specify the sources that helped
you obtain your final answers.