CAS CS 565, Data Mining Fall 2014

Schedule

Mon-Wed 1:00-2:30 pm

Course Outline

Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining).

The goal of this course is to provide an introduction to the main topics in data mining, including frequent-itemset mining, clustering, classification, link-analysis ranking, and dimensionality reduction. The course focuses on algorithmic issues as well as applications of data mining to real-world problems. Students will be required to complete small written and programming assignments that will help them better understand the covered material.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Wed 2:30pm-5:30pm or by appointment
www.cs.bu.edu/~evimaria

Teaching Fellow

Natali Ruchansky, natalir@bu.edu
Office Hours: Mon 11:30am-1pm and Thur 3:30pm-5pm
http://cs-people.bu.edu/natalir/

Workload

Two programming projects (25%)
Three problem sets (25%)
Two exams; one midterm (20%) and one final (30%)

Prerequisites

Working knowledge of programming and data structures (CS 112 or equivalent). Familiarity with basic algorithmic concepts, probability, statistics, and linear algebra (CS 237, CS 330, or equivalent). Programming projects will require knowledge of C (or C++), Java, Python, or Matlab.

Syllabus

Sept 3	What is data mining / Introductory lecture	pdf
Sept 8 and 10	Distance functions	pdf
Sept 15	Homework 1; due Oct 1 2014	download
Sept 15	Finding similar entities: locality-sensitive hashing (min-wise permutations)	pdf
Sept 17	Dimensionality reduction	pdf
Sept 22	Matrix Sketches	pdf
Sept 24	Clustering: partition-based methods	pdf
Sept 25	Project 1; due Oct 14 2014	download
Sept 29, Oct 1	Clustering: k-means and 1-dimensional k-means	lecture on the board
Oct 6	Hierarchical clustering	pdf
Oct 8	Clustering aggregation	pdf
Oct 9	Homework 2; due Oct 24 2014	download
Oct 13, 15, 22, 24	Clustering: graph cuts and spectral graph partitioning	pdf
Oct 27	Review
Oct 29	Midterm!
Nov 3	Decision Trees	pdf
Nov 5	Naive Bayes Classifier	pdf
Nov 10	Evaluation of classification	pdf
Nov 12	Covering	pdf
Nov 17	Information propagation	pdf
Nov 19	Link Analysis Ranking	pdf
Nov 24	Rank aggregation	pdf
Nov 24	Homework 3; due Dec 10 2014	download
Nov 26	No class – Thanksgiving break
Dec 1	Rank aggregation
Dec 3	Algorithms for Recommendation systems; collaborative filtering	pdf
	Time-series data analysis	pdf

Textbooks

Although there is no required textbook for the course, we will use material from the following books throughout the class:

A. Rajaraman and J. Ullman: Mining of Massive Datasets. Cambridge University Press, 2012.

P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining. Addison-Wesley, 2006.

D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001.

Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, March 2006.

Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.

Grading Policy

Programming Projects (2) 25%
Problem Sets (3) 25%
Midterm 20%
Final Exam 30%

Incompletes will not be given.

Late Assignment Policy: For the homeworks, there will be a penalty of 10% per day, up to three days late. After that, no credit will be given. No late project submissions will be accepted.

“I don’t know” policy: In all homeworks and exams, an “I don’t know” answer to a question will be given 20% of the grade that corresponds to this question.
A wrong answer will be graded as a 0.

Collaborations/Academic Honesty

Although you can collaborate or use the Web in order to solve many of your
assignment problems, you need to specify explicitly the sources that helped
you obtain your final answers.