Mon-Wed 4-5:30 pm
Data mining is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst (Hand, Mannila and Smyth: Principles of Data Mining)
The goal of this course is to provide an introduction to the main topics in data mining including: frequent-itemset mining, clustering, classification, link-analysis ranking, dimensionality reduction etc. The focus of the course will be on the algorithmic issues as well as applications of data mining to real-world problems. Students will be required to solve small written and programming assignments that will help them better understand the covered material.
Evimaria Terzi, evimaria@cs.bu.edu
Office Hours: Monday 2:30 pm - 4:00 pm and Tuesday 10:30 am -12:00 or by
appointment.
www.cs.bu.edu/~evimaria
Three programming projects
Three problem sets
Two exams; one midterm and one final
Working knowledge of programming and data structures (CS 112, or equivalent). Familiarity with basic algorithmic concepts, probability, statistics and linear algebra. Programming projects will require knowledge of C (or C), java or python.
Many of these slides have been borrowed by lectures, talks and tutorials prepared by: Aris Anagnostopoulos, Petros Drineas, Piotr Indyk, George Kollios, Heikki Mannila, Panayiotis Tsaparas, Jeffrey Ullman.
The lectures on classification are using the slides from the book ‘‘Introduction to Data Mining" by Tan, Steinbach, Kumar, available here.
Sept 2 | What is datamining/ Introductory lecture | .pptx,.ppt,.pdf |
Sept 9 | Mining frequent itemsets and association rules | .pptx,.ppt,.pdf |
Sept 14 | Condensed representations of itemset collections | .pptx,.ppt,.pdf |
Sept 14 | Problem Set 1; Due Sept. 28 | download |
Sept 16 | Clustering I | .pptx,.ppt,.pdf |
Sept 21 | Programming Project 1; Due Oct 7 | download |
Sept 21 | Hierarchical clustering | .pptx,.ppt,.pdf |
Sept 23 | Model-based clustering and clustering aggregation | .pptx,.ppt,.pdf |
Sept 28 | Problem Set 2; Due October 14 | download |
Sept 28 | Impossibility theorem, subspace clustering, co-clustering | .pptx,.ppt,.pdf |
Sept 30 | Clustering validation, randomization testss | .pptx,.ppt,.pdf |
Oct 5 | MDS, FastMap, Random Projections | .pptx,.ppt,.pdf |
Oct 5 | SVD, CUR, kd-trees | .pptx,.ppt,.pdf |
Oct 13 | kd-trees and LSH | .pptx,.ppt,.pdf |
Oct 14 | Midterm | download |
Oct 19 | Review of LSH and Introduction to Decision Trees | See LSH slides |
Oct 20 | Project 2; Due November 4 | download |
Oct 21 | Review of midterm solutions | |
Oct 26 | Decision Trees | .pptx,.ppt,.pdf |
Oct 28 | Naive Bayes Classification and k-NN | .pptx,.ppt,.pdf |
Nov 2 | Support Vector Machines | .pptx,.ppt,.pdf |
Nov 4 | Evaluation and Ensembles | .pptx,.ppt,.pdf |
Nov 7 | Project 3, deadline Dec 2, 2009 | download |
Nov 9 | Introduction to Link Analysis Ranking | .pptx,.ppt,.pdf |
Nov 11 | Veteran’s day – No class | |
Nov 16 | PageRank | .pptx,.ppt,.pdf |
Nov 18 | Rank aggregation and voting systems | .pptx,.ppt,.pdf |
Nov 22 | Homework 3; Due Dec 9 | download |
Nov 23,30 | Min cuts and cuts on graphs | .pptx,.ppt,.pdf |
Dec 3 | Timeseries analysis | .pptx,.ppt,.pdf |
Dec 8, 10 | Privacy-preserving data mining | |
Week starting Dec 14 | Final Exam |
Rakesh Agrawal and Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules. In International Conference on Very Large Databases, VLDB 1994.
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo: Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, 307 - 328. AAAI Press, 1996.
Roberto J. Bayardo Jr.: Efficiently Mining Long Patterns from Databases. In SIGMOD Conference 1998: 85-93
F. Afrati, A. Gionis, H. Mannila: Approximating a Collection of Frequent Sets. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2004.
A. Gionis, H. Mannila, P. Tsaparas: Clustering Aggregation. In ACM Transactions on Knowledge Discovery from Data (TKDD), 2006.
Jon Kleinberg: An impossibility theorem for clustering. In NIPS, 2002.
A. Anagnostopoulos, A. Dasgupta, R. Kumar: Approximation algorithms for co-clustering. In PODS, 2008.
Kai Puolamaki, Sami Hanhijarvi, Gemma C. Garriga: An approximation ratio for biclustering. In Information Processing Letters, 2008.
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan: Automatic subspace clustering for high-dimensional data. In SIGMOD, 1998.
Christos Faloutsos, King-Ip Lin: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In SIGMOD 1995.
Dimitris Achlioptas: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. In Journal of Comp. & Sys. Sci.,special issue of invited papers from PODS’01.
Ella Bingham, Heikki Mannila: Random projection in dimensionality reduction: Applications to image and text data. In KDD, 2001.
For MDS read: Programming Collective Intelligence: Building Smart Web 2.0 Applications, Chapter 3, pages 49-52.
For Petros Drineas’ tutorials on linear-algebra methods for data mining see here.
For Piotr Indyk’s tutorial on Nearest-Neighbor search in low and high dimensions see here and here.
Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk: Evaluating strategies for similarity search on the web. In WWW Conference 2002.
Aristides Gionis, Piotr Indyk, Rajeev Motwani:Similarity Search in High Dimensions via Hashing. In VLDB 1999.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
S. Brin, R. Motwani, L. Page and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford Digital Libraries, 1998.
A. Borodin, J. S. Rosenthal, G. O. Roberts, P. Tsaparas: Link Analysis Ranking: Algorithms, Theory and Experimets. ACM Transactions on Internet Technologies (TOIT), Vol 5, No 1, February 2005.
David Cheng, Ravi Kannan, Santosh Vempala and Grant Wang:A Divide-and- Merge methodology for Clustering In ACM SIGMOD/PODS, 2005.
Ravi Kannan and Santosh Vempala and Andrian Vetta: On Clusterings: Good, Bad and Spectral. In the Journal of the ACM, May 2004.
E. Terzi and P. Tsaparas: Efficient algorithms for sequence segmentation. In Siam Data Mining Conference (SDM), 2006.
There is no required textbook for the course. Suggested textbooks are:
D. Hand, H. Mannila and P. Smyth: Principles of Data Mining. MIT Press, 2001
Jiawer Han and Micheline Kamber: Data Mining: Concepts and Techiques. Second Edition. Morgan Kaufmann Publishers, March 2006.
Toby Segaran: Programming Collective Intelligence: Building Smart Web 2.0 Applications. O’Reilly.
Programming Projects (3) 30%
Problem Sets (3) 20%
Midterm 20%
Final Exam 30%
Incompletes will not be given
Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that no credit will be given.
All course participants must adhere to the CAS Academic Conduct Code. All instances of adacemic dishonesty will be reported to the academic conduct committee