CAS CS 591, Advanced Topics in Data Mining, Spring 2014

Data Analysis for Big and Small Data

Schedule

Tue-Thur 2:00-3:30 pm

Room: MCS B019

Course Outline

The goal of the coarse is to make students familiar with basic data-analysis techniques and real-life data analysis problems.

The format of the course will be the following: the instructor herself as well as professors from different departments will present the students with a challenging data-analysis problem and explain the challenges the problem raises as well as the impact a solution can possibly have.

The students of the class will pick one of the problems and will work towards solving it – using techniques we will present in the class.

Instructor

Evimaria Terzi, evimaria@cs.bu.edu
www.cs.bu.edu/~evimaria
Office Hours: Monday 5:00 pm - 6:30 pm and Tuesday 3:30 pm -5:00pm.
Contact me by e-mail to arrange an appointment if necessary.

Prerequisites

Working knowledge of programming and data structures. Familiarity with basic algorithmic concepts, probability, statistics and linear algebra. Programming projects will require knowledge of C (or C++), java, python.

Workload

Students that take the course for credit will be required to complete a course project. Completion of the project constitutes 50% of the overall grade. Although the main task of each student will be this one project, in the course of preparing the project the students will have to do one presentation of some papers related to the project they picked. The presentation will be 20% of the credit. Finally, a reaction paper and a final presentation that summarizes the thoughts of the students with respect to their topic will be another 20% of the total credit. Finally, active class participation will be the additional 10%.

One initial class presentation (20%)
One reaction paper and final class presentation (20%)
One project – poster presentation (50%)

Incompletes will not be given.

Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that no credit will be given.

Initial class presentation : introduction of your project

In the initial class presentation the students will focus on the project each student will be assigned to. The student will present his/her understanding of the problem as well as some initial thoughts of approaching it.

Second presentation and reaction paper : progress on the project

As a way to get everyone thinking about the research issues underlying the course, there will be a short reaction paper of roughly 5 pages in length.

The reaction paper will be related to your project and should be structured as follows:

What is the problem you are trying to solve?
What are the key approaches you have/will try?
What were the main challenges that you encountered when you tried to implement/apply the above approaches?
What are your results? What lessons have you learned?

The reaction paper should be the summary of your second presentation about your project.

Projects

The projects you will be assigned will involve a data-mining/data-analysis real-life problem. The instructor as well as faculty members from accross the university will present a collection of projects from which you will have to pick one as your semester’s project. Your project will involve both some algorithm design as well as the application of this algorithm in some real-life datasets.

List of potential projects

Intrusion detection in network trace data

Given a collection of network trace data collected over a period of time, develop an algorithm that observes these traces over time and identifies (a) irregular behaviors and (b) identifies the tuplesusersapplications that are responsible for the observed irregularities.

References

Sanjay Chawla, Aristides Gionis: k-means-: A Unified Approach to Clustering and Outlier Detection. SDM 2013.

Theodoros Lappas, Mark Crovella, Evimaria Terzi: Selecting a characteristic set of reviews. KDD 2012.

Uncovering Interest Group Networks (with Dino P. Christenson)

Interest groups (i.e., lobbyists) use a variety of techniques to exert influence, among which coalition strategies are dominant. However, interest groups are hesitant to openly discuss their networks for a variety of reasons. Utilizing a network measure of interest group coalitions based on cosigner status to United States Supreme Court amicus curiae, or friend of the court briefs, the central players and overall characteristics of this dynamic and secretive network from 1930 to present-day can be illuminated. This project brings a new theoretical perspective to the study of interest group coalitions by applying network theory and statistics. Additional information on the project is available here: http://amicinetworks.com/

Some related problems include:

Parsing documents, emails and related databases, as well as scraping websites and publicly available for interest groups in the amici networks database.
Creating fuzzy matching algorithms to match non-unique names in the amici networks database with other databases.
Creating programs to check big data for coding errors, including commonly misspelled names and outlier categories from attribute data.
Amending MySQL database with interest group data from other databases.

References

Box-Steffensmeier, Janet M, Dino P Christenson, and Matthew P Hitt. ‘‘Quality Over Quantity: Amici Influence And Judicial Decision Making”. American Political Science Review 107, no. 3. American Political Science Review (2013): 446-460.

Box-Steffensmeier, Janet M, and Dino P Christenson. The Evolution And Formation Of Amicus Curiae Networks. Social Networks. Social Networks (2012).

Noisy graph isomorphism

Given two graphs (maybe noisy versions of an archetypical graph), can you map the nodes of the one graph to the nodes of the other? Applications of this problem include mapping users of one social network (e.g., Facebook) to nodes of the other (e.g., Twitter).

References

Pedram Pedarsani, Matthias Grossglauser On the privacy of anonymized networks. KDD 2011.

Arvind Narayanan, Vitaly Shmatikov: De-anonymizing Social Networks. IEEE Symposium on Security and Privacy 2009.

Generating high-variance clusters and applications to team formation

(By B. Golshan)

In the traditional clustering problems the goal is to cluster a collection of multi-dimensional points into clusters such that the sum of the variances of the clusters is minimized. In this project we will consider the “maximum variance clustering” problem where the goal is to partition a set of points into k clusters such that the sum of the variances of the clusters is maximized. Motivation for this problem comes from computational education and grouping of students in groups.

References

E. Fernandez, J. Kalcsics, S. Nickel: The maximum dispersion problem, Omega 2013.

S. S. Ravi, D. J. Rosenkrantz, G.K. Tayi: Heuristic and Special Case Algorithms for Dispersion Problems Operations Research, 1994.

Text mining for marketing analysis (By G. Zervas)

An emergent marketing literature has begun exploring the application of text-mining methods to problems such as uncovering market structure, identifying competitors, and segmenting consumers. For example, Netzer et al. 2012 recently demonstrated that basic text-mining techniques such as word co-occurrence counts, TF-IDF, and cosine similarities can be used to effectively uncover brand relationships in the automobile and pharmaceutical industries. In the proposed project, we seek to further explore the application of state-of-the-art text-mining methods from the computer science literature – such as topic modeling – to traditional marketing problems. As a broad starting point, we propose applying these tools to the Yelp Academic Dataset (https:www.yelp.com/academic_dataset) to explore solutions to problems such as:

Uncovering competitive relationships between businesses.

Identifying business attributes valued by consumers, similar to Ghose et al. 2011

Predicting the survival rate of businesses.

Network engineering for effective shortest paths (by Dora Erdos)

Shortest paths play an important role in many applications from computer networks to route planning on roads. A way to measure the importance of a node in a network (i.e., a router or a road intersection) is to count the number of shortest paths it covers. While there has been a lot of research on developing algorithms for individual nodes, the question of performing computations for sets or groups of nodes remains a challenge. Two interesting problems here are the following

Load balancing application: given a set of K nodes, partition it into two groups so that the difference of total number of shortest paths covered by each group is minimal.

Everett, Borgatti “The centrality of groups and classes”, Journal of Mathematical Sociology 23: 181-201, 1999.

Influence maximization application: given a set of nodes and a budget L, between what nodes should we add K new edges so that the total number of shortest paths covered by the group is maximized?

Papagelis, Bonchi, Gionis ‘‘Suggesting ghost edges for a smaller world", CIKM, 2011.

Discovering the chemistry of collaborations (with E. Terzi and S. Paparizos)

In general people perform better when they work in teams. Also some teams perform better than others. The high-level question that this project will try to address is: ‘‘what makes a team perform well?".

In order to approach this high-level question, we are going to discover successful and unsuccessful teams of scientists. For this we are going to study scientific publications and characterize them as successful (or not) based on the number of citations they have received. Then, based on that we are going to discover people (or groups of people), i.e., co-authors, that act as catalyst for successful projects.

For this we are going to use data from Microsoft Academic Search. A starting point for the data analysis will be the data that is available here.

From that point on further data collection may be required.

Reading material:

T. Lappas, K. Liu, E. Terzi: Finding a team of experts in social networks, ACM SIGKDD 2009.

E. L. Fitzpatrick, R. G. Askin: Forming effective worker teams with multi-functional skill requirements, Computer and Industrial Engineering, 2005.

A. Zakarian, A. Kusiak: Forming teams: an analytical approach, IIE Transactions, 1999.

Syllabus

Jan 16	First day of classes
Jan 21	clustering and k-means	.pdf
Jan 21	SVD and matrix sketches	.pdf
Jan 28	Matrix Sketches	.pdf
Jan 30	Project presentations by D. Erdos and G. Zervas
Feb 4	Project presentation by D. Christenson
Feb 4	Project presentation by E. Terzi (graph isomporphism)	.pdf
Feb Feb 6	Project presentation by E. Terzi (Team chemistry)	.pdf
Feb 11	Project presentation by E. Terzi (reviews)	.pdf
Feb 13	Project presentations
Feb 18	Project presentations
Feb 20	Tensor Completion (by N. Ruchansky)	presentation
Feb 20	Review Analysis (by N. Shah)	presentation
Feb 25, Feb 27	No class
Mar 4	Team Formation (by R. Churchill)	presentation
Mar 4	Graph De-anonymization (by B. Chen)	presentation
Mar 6	Network outliers (by Sanaz Bahargam)	presentation
Mar 6	Team dynamics (by Raymond Mead)	presentation
Mar 11, Mar 13	Spring break
Mar 18	Mapping accross social networks: a coverage-based prespective (byHarshal Chaudhari)
Mar 18	Noisy graph isomorphism: a random-projection approach (by HarryMavroforakis)	link
Mar 20	Uncovering interst groups (byRushi Ganmuthi)	presentation
Mar 20	Modeling boredom in social networks (by Neda Derakhshani)
Mar 25 Mar 27	final presentations
Apr 1, Apr 3	final presentations
Apr 8, Apr 10	final presentations
Apr 15, Apr 17	final presentations
Apr 22	final presentations
Apr 24	Wed schedule
Apr 29, May 1	Poster sessions

Collaborations/Academic Honesty

All course participants must adhere to the CAS Academic Conduct Code. All instances of adacemic dishonesty will be reported to the academic conduct committee.