CAS CS 591, Tools and Techniques for Data Mining and Applications

Schedule

Mon-Wed 4-5:30 pm

Office hours: Mon 4:30-7pm (Evimaria, MCS 280), Wed 9:30-11am (Evimaria, MCS 280)

Tues 1:30-3pm (Harry, Undergrad Lab), Thur 1:30-3pm (Harry, Undergrad Lab)

Instructor and TF

Instructor: Evimaria Terzi cs-people.bu.edu/evimaria

Teaching Fellow: Charalampos (Harry) Mavroforakis cs-people.bu.edu/cmav

Course Outline

The course emphasizes practical skills in working with data, while introducing students to a wide range of techniques that are commonly used in the analysis of data, such as clustering, classification, regression, and network analysis. The goal of the class is to give students a hands-on understanding of classical data analysis techniques and to develop proficiency in applying these techniques in a modern programming language (Python).

Lectures will present the fundamentals of each technique; the focus is not on the theoretical underpinnings of the methods, but rather on helping students understand the practical settings in which these methods are useful. Class discussions will study use cases and go over relevant Python packages that will enable students to perform hands-on experiments with their data.

Note that this class is different from CS 565 (Data Mining): while CS 565 focuses on the fundamental algorithmic problems behind a set of data-mining tasks and emphasizes the analysis of the algorithms for those tasks, this class will focus on how these algorithms work in practice.

Target audience

This course is targeted towards graduate or advanced undergraduate students who need to be proficient in working with and analyzing large datasets for their research, or who aim to find a job that will require data-analysis skills.

Prerequisites

Students taking this class must have some prior familiarity with programming, at the level of CS 105, 108, or 111, or equivalent. CS 112 is also helpful.

Suggested Textbooks

Workload

There will be one programming assignment each week. In these assignments students will be given datasets which they will analyze using the tools and techniques presented during that particular week. The weekly programming assignments will be very targeted and their goal will be to practice the material taught during the week.

In addition, there will be a final project. For the project the students will use a dataset of their choice and will have to extract some knowledge or conclusions from the analysis of the dataset. The analysis will be done using a subset of the methods we described in class.

The project will have three essential components: 1) a data collection piece (which may involve crawling, calls to an API, combining data from different sources, etc.), 2) a data analysis piece (which will involve applying different techniques we described in class) and 3) a conclusion component (where conclusions will be drawn from the results of the data analysis). The students will submit a 5-page report clearly explaining all three components of their project. Finally, a poster presentation will be required, where the students will be prepared to present their effort and results in front of their poster.

As an example, a student may choose to collect data from Twitter related to a specific topic (e.g., Ebola virus) and then measure the intensity of posts about that topic in different areas of the world. Other examples of projects may include (but are not limited to): analysis of MBTA data, analysis of NYC data, crawling of YouTube (or other social media data) and analysis of social behavior such as trolling or bullying.
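To make the Twitter example above more concrete, here is a minimal Python sketch of what the analysis step might look like. It assumes the posts have already been collected and reduced to (region, text) pairs; the function name `topic_intensity`, the notion of "intensity" as the share of posts mentioning the topic, and the sample records are all hypothetical and made up for illustration. A real project would need its own data collection step and would likely use a more careful measure.

```python
from collections import Counter


def topic_intensity(posts, topic):
    """Return, per region, the fraction of posts whose text mentions `topic`.

    `posts` is an iterable of (region, text) pairs. This is an assumed,
    simplified input format, not the output of any particular API.
    """
    totals = Counter()   # number of posts seen per region
    matches = Counter()  # number of posts mentioning the topic per region
    needle = topic.lower()
    for region, text in posts:
        totals[region] += 1
        if needle in text.lower():  # simple case-insensitive substring match
            matches[region] += 1
    return {region: matches[region] / totals[region] for region in totals}


# Toy, made-up records used only to exercise the function.
sample_posts = [
    ("Europe", "New Ebola case reported today"),
    ("Europe", "Great football match last night"),
    ("Africa", "Ebola response teams arrive"),
    ("Africa", "ebola vaccine trial begins"),
    ("Asia", "Monsoon season starts early"),
]

intensity = topic_intensity(sample_posts, "Ebola")
for region, share in sorted(intensity.items()):
    print(f"{region}: {share:.2f}")
```

Normalizing by the number of posts per region, rather than reporting raw counts, keeps regions with many posts overall from dominating the comparison; whether that is the right choice depends on the question the project asks.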

The project is due by the end of the exam week. The project presentations will be given in the form of a final poster explaining components 1, 2 and 3 of the project.

Students are expected to work individually on homeworks and on the final project. There will be no final exam.

Grading scheme:
Homeworks: 40%
Project: 60%

Tentative schedule

The project is due by the end of the exam week.

Collaborations/Academic Honesty

All course participants must adhere to the CAS Academic Conduct Code. All instances of academic dishonesty will be reported to the academic conduct committee.