CAS CS 565, Data Mining, Fall 2013

Schedule

Mon-Wed 4:00-5:30 pm

Course Outline

Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).

The goal of this course is to provide an introduction to the main topics in data mining, including frequent-itemset mining, clustering, classification, link-analysis ranking, and dimensionality reduction. The course focuses on algorithmic issues as well as applications of data mining to real-world problems. Students will be required to complete small written and programming assignments that will help them better understand the covered material.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Mon 2pm-3:30pm and Tues 5pm-6:30pm, or by appointment.
www.cs.bu.edu/~evimaria

Teaching Fellow

Harry Mavroforakis, cmav@bu.edu
Office Hours: Wed 1pm-2:30pm and Thur 4pm-5:30pm
http://cs-people.bu.edu/cmav/

Workload

  1. Two programming projects (25%)

  2. Three problem sets (25%)

  3. Two exams; one midterm (20%) and one final (30%)
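The weights above combine into a course score as a weighted average. The following is a minimal sketch of that arithmetic, not an official grade calculator: the 0-100 scale, the function name `course_score`, and the example numbers are assumptions made for illustration, and how the individual projects or problem sets are combined within a category is not specified here.

```python
# Weights come from the Workload list; everything else is illustrative.
WEIGHTS = {
    "projects": 0.25,      # two programming projects
    "problem_sets": 0.25,  # three problem sets
    "midterm": 0.20,
    "final": 0.30,
}


def course_score(components):
    """Return the weighted average of per-category scores.

    Assumption: each category score is already on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical category scores, for illustration only.
example = {"projects": 90, "problem_sets": 80, "midterm": 70, "final": 85}
print(round(course_score(example), 2))
```

With these hypothetical inputs the contributions are 22.5, 20, 14, and 25.5 points respectively.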

Prerequisites

Working knowledge of programming and data structures (CS 112 or equivalent). Familiarity with basic algorithmic concepts, probability, statistics, and linear algebra (CS 237, CS 330, or equivalent). Programming projects will require knowledge of C (or C++), Java, Python, or Matlab.

Syllabus

Sept 4 What is data mining? Introductory lecture .pdf
Sept 9, 11 Distance functions .pdf
Sept 13 Homework 1; Due Sept 30 download
Sept 16 Finding similar entities: locality-sensitive hashing (min-wise permutations) .pdf
Sept 18 Finding similar entities: locality-sensitive hashing and dimensionality reduction .pdf
Sept 23 Project 1; Due Oct 16 download
Sept 23, 25 Clustering: partition-based methods .pdf
Sept 30 TA teaches a class on Project 1
Oct 2 Hierarchical Clustering .pdf
Sept 13 Homework 2; Due Oct 21 download
Oct 7 Clustering aggregation .pdf
Oct 9, 15 Clustering: graph cuts and spectral graph partitioning .pdf
Oct 16 Classification: Decision Tree Classifiers .pdf
Oct 21 Midterm!
Oct 23 Midterm review
Oct 28 Naive Bayes Classifier .pdf
Oct 30 Evaluation of classification; Intro to Link Analysis Ranking .pdf
Oct 30 Project 2; Due Dec 13 download
Nov 4, 6 Link Analysis Ranking: Rank Aggregation and Voting Theory .pdf
Nov 11, 13 Voting; Covering problems .pdf
Nov 15 Homework 3; Due Dec 2 download
Nov 18 Optimization of submodular functions .pdf
Nov 20 Information propagation in social networks .pdf
Nov 25 Review of information propagation models again
Nov 27 No class; Thanksgiving break
Dec 2 Algorithms for Recommendation systems; collaborative filtering .pdf
Dec 4 Algorithms for co-clustering .pdf
Dec 9 Time-series data analysis .pdf
Dec 11 Review Class
Dec 17 Final Exam

Textbooks

Although there is no required textbook for the course, we will use material from the following books throughout the class:

Grading Policy

Incompletes will not be given.

Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that no credit will be given.
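As a rough illustration of the late penalty, the sketch below applies 10% per day for up to three days and gives no credit after that. Two points are assumptions, since the policy does not spell them out: the penalty is taken as a fraction of the score the assignment earned, and lateness is counted in whole days. The function name `late_adjusted_score` is hypothetical.

```python
def late_adjusted_score(raw_score, days_late):
    """Apply a 10%-per-day late penalty; no credit after three days.

    Assumptions (not stated in the policy): the penalty is a fraction of
    the earned score, and days_late is a whole number of days.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    if days_late > 3:
        return 0.0
    # Integer percentages avoid floating-point drift in 1 - 0.1 * days.
    return raw_score * (100 - 10 * days_late) / 100


# Hypothetical assignment that earned 80 points before any penalty.
for days in range(5):
    print(days, late_adjusted_score(80, days))
```

Under these assumptions, an assignment submitted three days late keeps 70% of its score, and one submitted four days late receives nothing.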

"I don't know" policy: In all homework assignments and exams, an "I don't know" answer to a question will receive 20% of the points for that question.
A wrong answer will receive 0.

Collaborations/Academic Honesty

Although you can collaborate or use the Web to solve many of your
assignment problems, you need to specify explicitly the sources that helped
you obtain your final answers.