Tue-Thur 2:00-3:30 pm
Room: MCS B019
The goal of the coarse is to make students familiar with basic data-analysis techniques and real-life data analysis problems.
The format of the course will be the following: the instructor herself as well as professors from different departments will present the students with a challenging data-analysis problem and explain the challenges the problem raises as well as the impact a solution can possibly have.
The students of the class will pick one of the problems and will work towards solving it – using techniques we will present in the class.
Evimaria Terzi, evimaria@cs.bu.edu
www.cs.bu.edu/~evimaria
Office Hours: Monday 5:00 pm - 6:30 pm and Tuesday 3:30 pm -5:00pm.
Contact me by e-mail to arrange an appointment if necessary.
Working knowledge of programming and data structures. Familiarity with basic algorithmic concepts, probability, statistics and linear algebra. Programming projects will require knowledge of C (or C++), java, python.
Students that take the course for credit will be required to complete a course project. Completion of the project constitutes 50% of the overall grade. Although the main task of each student will be this one project, in the course of preparing the project the students will have to do one presentation of some papers related to the project they picked. The presentation will be 20% of the credit. Finally, a reaction paper and a final presentation that summarizes the thoughts of the students with respect to their topic will be another 20% of the total credit. Finally, active class participation will be the additional 10%.
One initial class presentation (20%)
One reaction paper and final class presentation (20%)
One project – poster presentation (50%)
Incompletes will not be given.
Late Assignment Policy: There will be a penalty of 10% per day, up to three days late. After that no credit will be given.
In the initial class presentation the students will focus on the project each student will be assigned to. The student will present his/her understanding of the problem as well as some initial thoughts of approaching it.
As a way to get everyone thinking about the research issues underlying the course, there will be a short reaction paper of roughly 5 pages in length.
The reaction paper will be related to your project and should be structured as follows:
What is the problem you are trying to solve?
What are the key approaches you have/will try?
What were the main challenges that you encountered when you tried to implement/apply the above approaches?
What are your results? What lessons have you learned?
The reaction paper should be the summary of your second presentation about your project.
The projects you will be assigned will involve a data-mining/data-analysis real-life problem. The instructor as well as faculty members from accross the university will present a collection of projects from which you will have to pick one as your semester’s project. Your project will involve both some algorithm design as well as the application of this algorithm in some real-life datasets.
Given a collection of network trace data collected over a period of time, develop an algorithm that observes these traces over time and identifies (a) irregular behaviors and (b) identifies the tuplesusersapplications that are responsible for the observed irregularities.
Sanjay Chawla, Aristides Gionis: k-means-: A Unified Approach to Clustering and Outlier Detection. SDM 2013.
Theodoros Lappas, Mark Crovella, Evimaria Terzi: Selecting a characteristic set of reviews. KDD 2012.
Interest groups (i.e., lobbyists) use a variety of techniques to exert influence, among which coalition strategies are dominant. However, interest groups are hesitant to openly discuss their networks for a variety of reasons. Utilizing a network measure of interest group coalitions based on cosigner status to United States Supreme Court amicus curiae, or friend of the court briefs, the central players and overall characteristics of this dynamic and secretive network from 1930 to present-day can be illuminated. This project brings a new theoretical perspective to the study of interest group coalitions by applying network theory and statistics. Additional information on the project is available here: http://amicinetworks.com/
Some related problems include:
Parsing documents, emails and related databases, as well as scraping websites and publicly available for interest groups in the amici networks database.
Creating fuzzy matching algorithms to match non-unique names in the amici networks database with other databases.
Creating programs to check big data for coding errors, including commonly misspelled names and outlier categories from attribute data.
Amending MySQL database with interest group data from other databases.
Box-Steffensmeier, Janet M, Dino P Christenson, and Matthew P Hitt. ‘‘Quality Over Quantity: Amici Influence And Judicial Decision Making”. American Political Science Review 107, no. 3. American Political Science Review (2013): 446-460.
Box-Steffensmeier, Janet M, and Dino P Christenson. The Evolution And Formation Of Amicus Curiae Networks. Social Networks. Social Networks (2012).
Given two graphs (maybe noisy versions of an archetypical graph), can you map the nodes of the one graph to the nodes of the other? Applications of this problem include mapping users of one social network (e.g., Facebook) to nodes of the other (e.g., Twitter).
Pedram Pedarsani, Matthias Grossglauser On the privacy of anonymized networks. KDD 2011.
Arvind Narayanan, Vitaly Shmatikov: De-anonymizing Social Networks. IEEE Symposium on Security and Privacy 2009.
(By B. Golshan)
In the traditional clustering problems the goal is to cluster a collection of multi-dimensional points into clusters such that the sum of the variances of the clusters is minimized. In this project we will consider the “maximum variance clustering” problem where the goal is to partition a set of points into k clusters such that the sum of the variances of the clusters is maximized. Motivation for this problem comes from computational education and grouping of students in groups.
E. Fernandez, J. Kalcsics, S. Nickel: The maximum dispersion problem, Omega 2013.
S. S. Ravi, D. J. Rosenkrantz, G.K. Tayi: Heuristic and Special Case Algorithms for Dispersion Problems Operations Research, 1994.
An emergent marketing literature has begun exploring the application of text-mining methods to problems such as uncovering market structure, identifying competitors, and segmenting consumers. For example, Netzer et al. 2012 recently demonstrated that basic text-mining techniques such as word co-occurrence counts, TF-IDF, and cosine similarities can be used to effectively uncover brand relationships in the automobile and pharmaceutical industries. In the proposed project, we seek to further explore the application of state-of-the-art text-mining methods from the computer science literature – such as topic modeling – to traditional marketing problems. As a broad starting point, we propose applying these tools to the Yelp Academic Dataset (https:www.yelp.com/academic_dataset) to explore solutions to problems such as:
Uncovering competitive relationships between businesses.
Identifying business attributes valued by consumers, similar to Ghose et al. 2011
Predicting the survival rate of businesses.
Shortest paths play an important role in many applications from computer networks to route planning on roads. A way to measure the importance of a node in a network (i.e., a router or a road intersection) is to count the number of shortest paths it covers. While there has been a lot of research on developing algorithms for individual nodes, the question of performing computations for sets or groups of nodes remains a challenge. Two interesting problems here are the following
Load balancing application: given a set of K nodes, partition it into two groups so that the difference of total number of shortest paths covered by each group is minimal.
Everett, Borgatti “The centrality of groups and classes”, Journal of Mathematical Sociology 23: 181-201, 1999.
Influence maximization application: given a set of nodes and a budget L, between what nodes should we add K new edges so that the total number of shortest paths covered by the group is maximized?
Papagelis, Bonchi, Gionis ‘‘Suggesting ghost edges for a smaller world", CIKM, 2011.
In general people perform better when they work in teams. Also some teams perform better than others. The high-level question that this project will try to address is: ‘‘what makes a team perform well?".
In order to approach this high-level question, we are going to discover successful and unsuccessful teams of scientists. For this we are going to study scientific publications and characterize them as successful (or not) based on the number of citations they have received. Then, based on that we are going to discover people (or groups of people), i.e., co-authors, that act as catalyst for successful projects.
For this we are going to use data from Microsoft Academic Search. A starting point for the data analysis will be the data that is available here.
From that point on further data collection may be required.
Reading material:
T. Lappas, K. Liu, E. Terzi: Finding a team of experts in social networks, ACM SIGKDD 2009.
E. L. Fitzpatrick, R. G. Askin: Forming effective worker teams with multi-functional skill requirements, Computer and Industrial Engineering, 2005.
A. Zakarian, A. Kusiak: Forming teams: an analytical approach, IIE Transactions, 1999.
Jan 16 | First day of classes | |
Jan 21 | clustering and k-means | |
Jan 21 | SVD and matrix sketches | |
Jan 28 | Matrix Sketches | |
Jan 30 | Project presentations by D. Erdos and G. Zervas | |
Feb 4 | Project presentation by D. Christenson | |
Feb 4 | Project presentation by E. Terzi (graph isomporphism) | |
Feb Feb 6 | Project presentation by E. Terzi (Team chemistry) | |
Feb 11 | Project presentation by E. Terzi (reviews) | |
Feb 13 | Project presentations | |
Feb 18 | Project presentations | |
Feb 20 | Tensor Completion (by N. Ruchansky) | presentation |
Feb 20 | Review Analysis (by N. Shah) | presentation |
Feb 25, Feb 27 | No class | |
Mar 4 | Team Formation (by R. Churchill) | presentation |
Mar 4 | Graph De-anonymization (by B. Chen) | presentation |
Mar 6 | Network outliers (by Sanaz Bahargam) | presentation |
Mar 6 | Team dynamics (by Raymond Mead) | presentation |
Mar 11, Mar 13 | Spring break | |
Mar 18 | Mapping accross social networks: a coverage-based prespective (byHarshal Chaudhari) | |
Mar 18 | Noisy graph isomorphism: a random-projection approach (by HarryMavroforakis) | link |
Mar 20 | Uncovering interst groups (byRushi Ganmuthi) | presentation |
Mar 20 | Modeling boredom in social networks (by Neda Derakhshani) | |
Mar 25 Mar 27 | final presentations | |
Apr 1, Apr 3 | final presentations | |
Apr 8, Apr 10 | final presentations | |
Apr 15, Apr 17 | final presentations | |
Apr 22 | final presentations | |
Apr 24 | Wed schedule | |
Apr 29, May 1 | Poster sessions | |
All course participants must adhere to the CAS Academic Conduct Code. All instances of adacemic dishonesty will be reported to the academic conduct committee.