The goal of this course is to provide the theoretical insights behind
commonly-used data-mining algorithms. Such algorithms include those
designed for clustering, classification, link-analysis ranking,
dimensionality reduction, etc. The focus of the course will be on the
algorithmic issues as well as applications of data mining to real-world
problems. Students will be required to solve theoretical exercises as well
as programming assignments that will help them better understand the
covered material.
Prerequisites: Working knowledge of programming and data structures
(CS 112, or equivalent). Familiarity with basic algorithmic concepts,
probability, statistics and linear algebra (CS 237, CS 330 or
equivalent). Programming projects will require knowledge of C (or C++), Java,
Python, or MATLAB.
There is no required textbook for this course. We will use a mix of
book chapters and papers. I will make the reading material
for each lecture available via a link or on Piazza.
Prof. Dora Erdos Email: edori @ bu . edu
Office Hours: Tues. 12:30-2:30 (drop-in) and Wed. 9:30-10:30 (by
prior appointment) in MCS 288.
Hao Chen Email: chenh13 @ bu . edu
Office Hours: TBA (EMA 302 undergrad lab).
The class will be taught by Professor Erdos. The TF will lead the discussion
sessions, whose objective is to reinforce the concepts covered in the lectures
through problem-solving and to provide clarifications and guidance on the
homework assignments. The purpose of the office hours of the Instructor and
Teaching Fellow is to answer specific questions or clarify specific
issues. The fastest way to get an answer to most questions is via
Piazza. Office hours are not to be used to fill you in on a class you
skipped or to re-explain entire topics. Office hours are scheduled at times
that provide the most help to students who start the homework before the last minute.
Lectures
Lecture: Tues., Thurs. 3:30 - 4:45, room CAS B12.
Lab: A2 Wed. 3:35 - 4:25 CAS 204A.
A3 Wed. 4:40 - 5:30 CAS 204.
We expect students to come to class, and to come on time. Class participation
and questions are encouraged.
If you miss a class, please get the notes and work through the material
with a fellow student.
Labs are an invaluable part of the course, involving interactive
problem-solving sessions, tips on homework questions, details on projects, and
supplemental material not covered in lecture. Attendance is mandatory.
Communications
We will be using Piazza for all discussions outside of class.
The system is designed to get you answers to your questions quickly and
efficiently from classmates and instructors.
Please do not email questions to the teaching staff; post your questions
on Piazza instead.
We also encourage you to post answers to other students' questions there
(but obviously not answers to problems on the problem sets!).
Our class page is located at:
https://piazza.com/bu/fall2018/cs565.
Please go there to sign up today.
We will also use Piazza to post announcements, homework assignments, labs
and lab solutions, etc.
Grading and attendance
The course grade will break down as follows:
Programming Projects (2) 25%
Problem Sets (3) 25%
Midterm 20%
Final Exam 30%
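As a rough illustration of how the breakdown above combines into a single course grade, here is a minimal Python sketch. The names `WEIGHTS` and `course_grade` are hypothetical, and the sketch assumes each component is reported as a percentage between 0 and 100. It is not the official grade calculation.

```python
# Weights in percent, taken from the grading breakdown above.
WEIGHTS = {
    "projects": 25,      # Programming Projects (2)
    "problem_sets": 25,  # Problem Sets (3)
    "midterm": 20,
    "final": 30,
}


def course_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps each component name to a percentage (0-100).
    This is an assumption about how scores are reported.
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    example = {"projects": 80, "problem_sets": 90, "midterm": 70, "final": 100}
    print(course_grade(example))
```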
Exams
There will be one in-class midterm in the middle of the semester,
tentatively on Thursday 10/18. The final
will be held during exam week in our assigned exam slot.
Late Policy:
For homework assignments, there will be a penalty of 10% per day, up to three
days late. After that, no credit will be given. No late project submissions will
be accepted.
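The homework late penalty above can be sketched in Python as follows. This is only one reading of the policy: it assumes the 10% is deducted from the score you earned and that lateness is counted in whole days, neither of which the syllabus states. The function name `late_score` is hypothetical.

```python
def late_score(raw_score, days_late):
    """Apply the homework late penalty to an earned score.

    Assumptions (not stated in the syllabus): the 10% per day is taken
    off the earned score, and days_late is a whole number of days.
    """
    if days_late <= 0:
        return raw_score  # on time: no penalty
    if days_late > 3:
        return 0  # more than three days late: no credit
    return raw_score * (100 - 10 * days_late) / 100


if __name__ == "__main__":
    for days in range(5):
        print(days, late_score(90, days))
```

Project submissions are not covered by this sketch, since no late projects are accepted.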
Topics
This list is tentative. The exact topics covered as well as the
corresponding reading material will be updated. Slides and other handouts
can be found on Piazza.
What is data mining
Distance functions, finding similar objects
Finding similar entities: locality-sensitive hashing (min-wise permutations)
Dimensionality reduction
Clustering (2 weeks)
Hierarchical clustering, clustering aggregation
Covering and Influence maximization
Clustering: graph cuts and spectral graph partitioning
Academic standards and the code of academic conduct are taken very
seriously by our university, by the College of Arts and Sciences, and by the
Department of Computer Science. Course participants must adhere to the
CAS Academic Conduct Code. Please take the time to review this document if
you are unfamiliar with its contents.
Collaboration Policy
The collaboration policy for this class is as follows.
You are encouraged to collaborate with one another in studying the textbook
and lecture material.
As long as it satisfies the following conditions, collaboration on
the homework assignments is permitted and will not reduce your grade:
Before discussing each homework problem with anyone
else, you must give it an honest half-hour of serious thought.
You may discuss ideas and approaches with other students in the class,
but not share any written solutions. In other words, the writeups you submit
must be entirely your own work.
You must also clearly acknowledge, in the appropriate portion of your
solutions (e.g., at the top of your writeups), the people with whom you
discussed ideas for that portion.
You may get help from the TFs and Instructors for the class for
specific problems.
Don't expect them to do it for you, however.
You may not work with people outside this class (but come and talk to
us if you have a tutor), seek on-line solutions, get someone else to do it
for you, etc.
You are not permitted to collaborate on exams.
The last point is particularly important: if you don't make an honest
effort on the homework but always get ideas from others, your exam scores
(accounting for half of your grade) will reflect it.