CAS CS 591, Advanced Topics in Data Mining, Spring 2014

Data Analysis for Big and Small Data

Schedule

Tue-Thur 2:00-3:30 pm

Room: MCS B019

Course Outline

The goal of the coarse is to make students familiar with basic data-analysis techniques and real-life data analysis problems.

The format of the course will be the following: the instructor herself as well as professors from different departments will present the students with a challenging data-analysis problem and explain the challenges the problem raises as well as the impact a solution can possibly have.

The students of the class will pick one of the problems and will work towards solving it – using techniques we will present in the class.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
www.cs.bu.edu/~evimaria
Office Hours: Monday 5:00 pm - 6:30 pm and Tuesday 3:30 pm -5:00pm.
Contact me by e-mail to arrange an appointment if necessary.

Prerequisites

Working knowledge of programming and data structures. Familiarity with basic algorithmic concepts, probability, statistics and linear algebra. Programming projects will require knowledge of C (or C++), java, python.

Workload

Students that take the course for credit will be required to complete a course project. Completion of the project constitutes 50% of the overall grade. Although the main task of each student will be this one project, in the course of preparing the project the students will have to do one presentation of some papers related to the project they picked. The presentation will be 20% of the credit. Finally, a reaction paper and a final presentation that summarizes the thoughts of the students with respect to their topic will be another 20% of the total credit. Finally, active class participation will be the additional 10%.

  1. One initial class presentation (20%)

  2. One reaction paper and final class presentation (20%)

  3. One project – poster presentation (50%)

Incompletes will not be given.

Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that no credit will be given.

Initial class presentation : introduction of your project

In the initial class presentation the students will focus on the project each student will be assigned to. The student will present his/her understanding of the problem as well as some initial thoughts of approaching it.

Second presentation and reaction paper : progress on the project

As a way to get everyone thinking about the research issues underlying the course, there will be a short reaction paper of roughly 5 pages in length.

The reaction paper will be related to your project and should be structured as follows:

The reaction paper should be the summary of your second presentation about your project.

Projects

The projects you will be assigned will involve a data-mining/data-analysis real-life problem. The instructor as well as faculty members from accross the university will present a collection of projects from which you will have to pick one as your semester’s project. Your project will involve both some algorithm design as well as the application of this algorithm in some real-life datasets.

List of potential projects

Intrusion detection in network trace data

Given a collection of network trace data collected over a period of time, develop an algorithm that observes these traces over time and identifies (a) irregular behaviors and (b) identifies the tuplesusersapplications that are responsible for the observed irregularities.

References

Uncovering Interest Group Networks (with Dino P. Christenson)

Interest groups (i.e., lobbyists) use a variety of techniques to exert influence, among which coalition strategies are dominant. However, interest groups are hesitant to openly discuss their networks for a variety of reasons. Utilizing a network measure of interest group coalitions based on cosigner status to United States Supreme Court amicus curiae, or friend of the court briefs, the central players and overall characteristics of this dynamic and secretive network from 1930 to present-day can be illuminated. This project brings a new theoretical perspective to the study of interest group coalitions by applying network theory and statistics. Additional information on the project is available here: http://amicinetworks.com/

Some related problems include:

References

Noisy graph isomorphism

Given two graphs (maybe noisy versions of an archetypical graph), can you map the nodes of the one graph to the nodes of the other? Applications of this problem include mapping users of one social network (e.g., Facebook) to nodes of the other (e.g., Twitter).

References

Generating high-variance clusters and applications to team formation

(By B. Golshan)

In the traditional clustering problems the goal is to cluster a collection of multi-dimensional points into clusters such that the sum of the variances of the clusters is minimized. In this project we will consider the “maximum variance clustering” problem where the goal is to partition a set of points into k clusters such that the sum of the variances of the clusters is maximized. Motivation for this problem comes from computational education and grouping of students in groups.

References

Text mining for marketing analysis (By G. Zervas)

An emergent marketing literature has begun exploring the application of text-mining methods to problems such as uncovering market structure, identifying competitors, and segmenting consumers. For example, Netzer et al. 2012 recently demonstrated that basic text-mining techniques such as word co-occurrence counts, TF-IDF, and cosine similarities can be used to effectively uncover brand relationships in the automobile and pharmaceutical industries. In the proposed project, we seek to further explore the application of state-of-the-art text-mining methods from the computer science literature – such as topic modeling – to traditional marketing problems. As a broad starting point, we propose applying these tools to the Yelp Academic Dataset (https:www.yelp.com/academic_dataset) to explore solutions to problems such as:

Network engineering for effective shortest paths (by Dora Erdos)

Shortest paths play an important role in many applications from computer networks to route planning on roads. A way to measure the importance of a node in a network (i.e., a router or a road intersection) is to count the number of shortest paths it covers. While there has been a lot of research on developing algorithms for individual nodes, the question of performing computations for sets or groups of nodes remains a challenge. Two interesting problems here are the following

Everett, Borgatti “The centrality of groups and classes”, Journal of Mathematical Sociology 23: 181-201, 1999.

Papagelis, Bonchi, Gionis ‘‘Suggesting ghost edges for a smaller world", CIKM, 2011.

Discovering the chemistry of collaborations (with E. Terzi and S. Paparizos)

In general people perform better when they work in teams. Also some teams perform better than others. The high-level question that this project will try to address is: ‘‘what makes a team perform well?".

In order to approach this high-level question, we are going to discover successful and unsuccessful teams of scientists. For this we are going to study scientific publications and characterize them as successful (or not) based on the number of citations they have received. Then, based on that we are going to discover people (or groups of people), i.e., co-authors, that act as catalyst for successful projects.

For this we are going to use data from Microsoft Academic Search. A starting point for the data analysis will be the data that is available here.

From that point on further data collection may be required.

Reading material:

Syllabus

Jan 16 First day of classes
Jan 21 clustering and k-means.pdf
Jan 21 SVD and matrix sketches.pdf
Jan 28 Matrix Sketches .pdf
Jan 30 Project presentations by D. Erdos and G. Zervas
Feb 4 Project presentation by D. Christenson
Feb 4 Project presentation by E. Terzi (graph isomporphism) .pdf
Feb Feb 6 Project presentation by E. Terzi (Team chemistry) .pdf
Feb 11 Project presentation by E. Terzi (reviews) .pdf
Feb 13 Project presentations
Feb 18 Project presentations
Feb 20 Tensor Completion (by N. Ruchansky) presentation
Feb 20 Review Analysis (by N. Shah) presentation
Feb 25, Feb 27 No class
Mar 4 Team Formation (by R. Churchill)presentation
Mar 4 Graph De-anonymization (by B. Chen)presentation
Mar 6 Network outliers (by Sanaz Bahargam) presentation
Mar 6 Team dynamics (by Raymond Mead) presentation
Mar 11, Mar 13 Spring break
Mar 18 Mapping accross social networks: a coverage-based prespective (byHarshal Chaudhari)
Mar 18 Noisy graph isomorphism: a random-projection approach (by HarryMavroforakis) link
Mar 20 Uncovering interst groups (byRushi Ganmuthi) presentation
Mar 20 Modeling boredom in social networks (by Neda Derakhshani)
Mar 25 Mar 27 final presentations
Apr 1, Apr 3 final presentations
Apr 8, Apr 10 final presentations
Apr 15, Apr 17 final presentations
Apr 22 final presentations
Apr 24 Wed schedule
Apr 29, May 1 Poster sessions

Collaborations/Academic Honesty

All course participants must adhere to the CAS Academic Conduct Code. All instances of adacemic dishonesty will be reported to the academic conduct committee.