6.824 2016 Lecture 1: Introduction

6.824: Distributed Systems Engineering

What is a distributed system?
  multiple cooperating computers
  DNS, P2P file sharing, big databases, MapReduce, &c
  lots of critical infrastructure is distributed!

Why distribute?
  to connect physically separate entities
  to achieve security via isolation
  to tolerate faults via replication
  to scale up throughput via parallel CPUs/mem/disk/net

But:
  complex: many concurrent parts
  must cope with partial failure
  tricky to realize performance potential

Why take this course?
  interesting -- hard problems, non-obvious solutions
  used by real systems -- driven by the rise of big Web sites
  active research area -- lots of progress + big unsolved problems
  hands-on -- you'll build serious systems in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course staff:
  Robert Morris, lecturer
  Frans Kaashoek, lecturer
  Steven Allen, TA
  Stephanie Wang, TA
  Jon Gjengset, TA
  Daniel Ziegler, TA

Course components:
  lectures
  readings
  two exams
  labs
  final project

Lectures about big ideas, papers, and labs

Readings: research papers as case studies
  please read papers before class
  otherwise boring, and you can't pick it up by listening
  each paper has a short question for you to answer
  and you must send us a question you have about the paper
  submit question & answer by 10pm the night before

Mid-term exam in class, and final exam

Lab goals:
  deeper understanding of some important techniques
  experience with distributed programming
  first lab is due a week from Friday

Lab 1: MapReduce
Lab 2: replication for fault-tolerance
Lab 3: fault-tolerant key/value store
Lab 4: sharded key/value store

Final project at the end, in groups of 2 or 3.
  You can think of a project and clear it with us.
  Or you can do a "default" project that we'll specify.

Lab grades depend on how many test cases you pass
  we give you the tests, so you know whether you'll do well
  careful: if it usually passes, but sometimes fails,
    chances are it will fail when we run it

Lab code review
  look at someone else's lab solution, send feedback
  perhaps learn about another approach

Debugging the labs can be time-consuming
  start early
  come to TA office hours
  ask questions on Piazza

MAIN TOPICS

This is a course about infrastructure, to be used by applications.
About abstractions that hide the complexity of distribution from applications.
Three big kinds of abstraction:
  Storage.
  Communication.
  Computation.
A couple of topics come up repeatedly.

Topic: implementation
  RPC, threads, concurrency control.
  (a small Go sketch of these pieces appears at the end of these notes)

Topic: performance
  The dream: scalable throughput.
    Nx servers -> Nx total throughput via parallel CPU, disk, net.
    So handling more load only requires buying more computers.
  Scaling gets progressively harder:
    Load balance, stragglers.
    "Small" non-parallelizable parts.
    Hidden shared resources, e.g. network.

Topic: fault tolerance
  1000s of servers, complex net -> always something broken
  We'd like to hide these failures from the application.
  We often want:
    Availability -- I can keep using my files despite failures
    Durability -- my data will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, client can proceed using the other.
    (see the toy failover sketch at the end of these notes)

Topic: consistency
  General-purpose infrastructure needs well-defined behavior.
    E.g. "Get(k) yields the value from the most recent Put(k,v)."
  Achieving good behavior is hard!
    Clients submit concurrent operations.
    Servers crash at awkward moments.
    Network may make live servers look dead; risk of "split brain".
  Consistency and performance are enemies.
    Consistency requires communication, e.g. to get latest Put().
    Systems with pleasing ("strict") semantics are often slow.
    Fast systems often make applications cope with complex ("relaxed") behavior.
  People have pursued many design points in this spectrum.
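
As a taste of the "implementation" and "consistency" topics above, here is a
minimal, hypothetical Go sketch (not part of the course hand-outs): a
single-server in-memory key/value store in which a mutex serializes
concurrent Puts and Gets. The type and function names are made up for this
example. On one machine, "Get(k) yields the value from the most recent
Put(k,v)" falls out of the lock; the hard part is preserving properties like
that once the data is replicated across servers.

  // Hypothetical sketch, not course code: a single-server key/value store.
  // Goroutines stand in for concurrent clients; the mutex is the concurrency
  // control that makes Get/Put behave sensibly on one machine.
  package main

  import (
      "fmt"
      "sync"
  )

  type KV struct {
      mu   sync.Mutex        // serializes concurrent Puts and Gets
      data map[string]string
  }

  func NewKV() *KV {
      return &KV{data: make(map[string]string)}
  }

  func (kv *KV) Put(k, v string) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      kv.data[k] = v
  }

  func (kv *KV) Get(k string) (string, bool) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      v, ok := kv.data[k]
      return v, ok
  }

  func main() {
      kv := NewKV()
      var wg sync.WaitGroup
      for i := 0; i < 10; i++ { // ten concurrent "clients"
          wg.Add(1)
          go func(i int) {
              defer wg.Done()
              kv.Put("x", fmt.Sprint(i))
          }(i)
      }
      wg.Wait()
      v, _ := kv.Get("x")
      fmt.Println("x =", v) // some client's Put; never a torn or lost update
  }

Without the mutex this program has a data race on the map; with it, every Get
sees exactly one client's complete Put.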
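
And a toy illustration of the "replicated servers" idea from the
fault-tolerance topic, again a hypothetical sketch rather than lab code: the
client knows several replica addresses and falls back to the next one if a
dial or call fails. It uses Go's standard net/rpc package; the server type,
method, and addresses are invented for the example, and keeping the replicas'
data consistent -- the actual hard problem -- is what the labs are about.

  // Hypothetical sketch: client-side failover across replicated servers.
  package main

  import (
      "fmt"
      "log"
      "net"
      "net/rpc"
  )

  // Made-up RPC types for this sketch.
  type GetArgs struct{ Key string }
  type GetReply struct{ Value string }

  type KVServer struct{}

  func (s *KVServer) Get(args *GetArgs, reply *GetReply) error {
      reply.Value = "value-of-" + args.Key
      return nil
  }

  // clientGet tries each replica in turn until one answers.
  func clientGet(addrs []string, key string) (string, error) {
      for _, addr := range addrs {
          c, err := rpc.Dial("tcp", addr)
          if err != nil {
              continue // this replica looks dead; try the next one
          }
          var reply GetReply
          err = c.Call("KVServer.Get", &GetArgs{Key: key}, &reply)
          c.Close()
          if err == nil {
              return reply.Value, nil
          }
      }
      return "", fmt.Errorf("no replica reachable")
  }

  func main() {
      // Start one "replica" on :9001; pretend :9000 is a crashed replica.
      if err := rpc.Register(&KVServer{}); err != nil {
          log.Fatal(err)
      }
      l, err := net.Listen("tcp", "127.0.0.1:9001")
      if err != nil {
          log.Fatal(err)
      }
      go rpc.Accept(l)

      v, err := clientGet([]string{"127.0.0.1:9000", "127.0.0.1:9001"}, "x")
      fmt.Println(v, err)
  }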