BIA 660, Web and Data Analytics

Schedule

Mon/Wed 6:15-9:00 pm in Babbio Center 319

Office hours: Mon/Wed 5:15-6:15 pm in Babbio Center 639

Instructor: Sanaz Bahargam (http://cs-people.bu.edu/bahargam/)

Course Outline

In this course, students will learn through hands-on experience how to extract data from the web and analyze web-scale data using distributed computing. Students will learn different analysis methods that are widely used across the range of internet companies, from start-ups to online giants like Amazon or Google. At the end of the course, students will apply these methods to answer a real scientific question.

Prerequisites

Students must have programming experience. It is also highly recommended that students have taken Multivariate Data Analytics (BIA 652), Data & Knowledge Management (MIS 630), Knowledge Discovery in Databases (MIS 637) and Statistical Learning & Analytics (BIA 656).

Grading Policy

  • Class work: 30%
  • Peer Evaluation: 10%
  • Mid-term Project: 20%
  • Final Project: 40%

Suggested Textbooks

Mid-term project

Collect, clean and organize online data from a website of your choice. The deliverable includes a dataset, the collection & cleaning scripts, and a presentation to be given in class.
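
As a rough illustration of what a collection & cleaning script might look like, the sketch below parses a small HTML snippet with Python's standard library, normalizes the extracted fields, and writes them out as CSV. The sample HTML, the class names `item` and `price`, and the column names are all hypothetical; a real script would download pages from the chosen website and adapt the parsing to that site's markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real script would fetch this from the chosen site.
SAMPLE_HTML = """
<ul>
  <li class="item">  Widget A <span class="price">$10.50</span></li>
  <li class="item">Widget   B <span class="price">$7</span></li>
</ul>
"""


class ItemParser(HTMLParser):
    """Collects (raw name, raw price) pairs from <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_item = False
        self._in_price = False
        self._name = []
        self._price = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "item" in classes:
            self._in_item = True
            self._name, self._price = [], []
        elif tag == "span" and self._in_item and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span" and self._in_price:
            self._in_price = False
        elif tag == "li" and self._in_item:
            self._in_item = False
            self.rows.append(("".join(self._name), "".join(self._price)))

    def handle_data(self, data):
        if self._in_price:
            self._price.append(data)
        elif self._in_item:
            self._name.append(data)


def clean_row(name, price):
    """Collapse repeated whitespace in the name and convert the price to a float."""
    return " ".join(name.split()), float(price.strip().lstrip("$"))


def scrape(html):
    """Parse the HTML and return a list of cleaned (name, price) tuples."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return [clean_row(name, price) for name, price in parser.rows]


def to_csv(rows):
    """Serialize the cleaned rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buf.getvalue()
```

Calling `to_csv(scrape(SAMPLE_HTML))` produces the organized dataset as text. Keeping collection, cleaning and output in separate functions makes each step easier to present and to rerun.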

Final project

Choose an important research question that emerges in the context of the dataset collected for the mid-term project. Develop, apply and document an analytics methodology to address your question. This work will be presented in class.

Syllabus and Tentative schedule

  • Week 1:
    Introduction to Python (basic concepts): Data types, Lists, Sets, Tuples, Dictionaries, I/O, Pandas, Parsing and Analysing data

  • Week 2:
    Scraping the web, parsing and data cleaning

  • Week 3:
    Using Python to scrape the web I (regex, Selenium, Beautiful Soup & other libraries) and data cleaning

  • Week 4:
    Text Mining with Python (NLTK), Sentiment Analysis, Visualization (matplotlib & other tools) and more on Pandas

  • Week 5:
    Machine Learning & Analytics (Linear Regression, Logistic Regression, KNN and GridSearch)

  • Week 6:
    Text Classification and Naive Bayes, LDA, tf–idf

  • Week 7:
    Feature selection
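
To give a flavor of the Week 6 material, here is a minimal sketch of tf-idf weighting using only the standard library. It assumes whitespace tokenization and the plain definition tf × log(N / df); this is one of several common variants, and library implementations often apply smoothing or normalization, so their numbers will differ. The function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter

# Illustrative toy corpus, not course data.
SAMPLE_DOCS = ["the cat sat", "the dog barked", "the cat purred"]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return scores
```

With `SAMPLE_DOCS`, a word that appears in every document (such as "the") receives weight zero, while a word unique to one document is weighted more heavily than one shared by two.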

Collaborations/Academic Honesty