6.824 2016 Lecture 1: Introduction

6.824: Distributed Systems Engineering

What is a distributed system?
  multiple cooperating computers
  DNS, P2P file sharing, big databases, MapReduce, &c
  lots of critical infrastructure is distributed!

Why distribute?
  to connect physically separate entities
  to achieve security via isolation
  to tolerate faults via replication
  to scale up throughput via parallel CPUs/mem/disk/net

But:
  complex: many concurrent parts
  must cope with partial failure
  tricky to realize performance potential

Why take this course?
  interesting -- hard problems, non-obvious solutions
  used by real systems -- driven by the rise of big Web sites
  active research area -- lots of progress + big unsolved problems
  hands-on -- you'll build serious systems in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course staff:
  Robert Morris, lecturer
  Frans Kaashoek, lecturer
  Steven Allen, TA
  Stephanie Wang, TA
  Jon Gjengset, TA
  Daniel Ziegler, TA

Course components:
  lectures
  readings
  two exams
  labs
  final project

Lectures about big ideas, papers, and labs

Readings: research papers as case studies
  please read papers before class
  otherwise boring, and you can't pick it up by listening
  each paper has a short question for you to answer
  and you must send us a question you have about the paper
  submit question & answer by 10pm the night before

Mid-term exam in class, and final exam

Lab goals:
  deeper understanding of some important techniques
  experience with distributed programming
  first lab is due a week from Friday

Lab 1: MapReduce
Lab 2: replication for fault-tolerance
Lab 3: fault-tolerant key/value store
Lab 4: sharded key/value store

Final project at the end, in groups of 2 or 3.
  You can think of a project and clear it with us.
  Or you can do a "default" project that we'll specify.

Lab grades depend on how many test cases you pass
  we give you the tests, so you know whether you'll do well
  careful: if it usually passes, but sometimes fails,
    chances are it will fail when we run it

Lab code review
  look at someone else's lab solution, send feedback
  perhaps learn about another approach

Debugging the labs can be time-consuming
  start early
  come to TA office hours
  ask questions on Piazza

MAIN TOPICS

This is a course about infrastructure, to be used by applications.
About abstractions that hide the complexity of distribution from applications.
Three big kinds of abstraction:
  Storage.
  Communication.
  Computation.
A couple of topics come up repeatedly.

Topic: implementation
  RPC, threads, concurrency control.
  (a small Go sketch of these pieces appears at the end of these notes)

Topic: performance
  The dream: scalable throughput.
    Nx servers -> Nx total throughput via parallel CPU, disk, net.
    So handling more load only requires buying more computers.
  Scaling gets progressively harder:
    Load balance, stragglers.
    "Small" non-parallelizable parts.
    Hidden shared resources, e.g. network.

Topic: fault tolerance
  1000s of servers, complex net -> always something broken
  We'd like to hide these failures from the application.
  We often want:
    Availability -- I can keep using my files despite failures
    Durability -- my data will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, client can proceed using the other.
    (see the toy failover sketch at the end of these notes)

Topic: consistency
  General-purpose infrastructure needs well-defined behavior.
    E.g. "Get(k) yields the value from the most recent Put(k,v)."
  Achieving good behavior is hard!
    Clients submit concurrent operations.
    Servers crash at awkward moments.
    Network may make live servers look dead; risk of "split brain".
  Consistency and performance are enemies.
    Consistency requires communication, e.g. to get latest Put().
    Systems with pleasing ("strict") semantics are often slow.
    Fast systems often make applications cope with complex ("relaxed") behavior.
  People have pursued many design points in this spectrum.
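
As a taste of the "implementation" and "consistency" topics above, here is a
minimal, hypothetical Go sketch (not part of the course hand-outs): a
single-server in-memory key/value store in which a mutex serializes
concurrent Puts and Gets. The type and function names are made up for this
example. On one machine, "Get(k) yields the value from the most recent
Put(k,v)" falls out of the lock; the hard part is preserving properties like
that once the data is replicated across servers.

  // Hypothetical sketch, not course code: a single-server key/value store.
  // Goroutines stand in for concurrent clients; the mutex is the concurrency
  // control that makes Get/Put behave sensibly on one machine.
  package main

  import (
      "fmt"
      "sync"
  )

  type KV struct {
      mu   sync.Mutex        // serializes concurrent Puts and Gets
      data map[string]string
  }

  func NewKV() *KV {
      return &KV{data: make(map[string]string)}
  }

  func (kv *KV) Put(k, v string) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      kv.data[k] = v
  }

  func (kv *KV) Get(k string) (string, bool) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      v, ok := kv.data[k]
      return v, ok
  }

  func main() {
      kv := NewKV()
      var wg sync.WaitGroup
      for i := 0; i < 10; i++ { // ten concurrent "clients"
          wg.Add(1)
          go func(i int) {
              defer wg.Done()
              kv.Put("x", fmt.Sprint(i))
          }(i)
      }
      wg.Wait()
      v, _ := kv.Get("x")
      fmt.Println("x =", v) // some client's Put; never a torn or lost update
  }

Without the mutex this program has a data race on the map; with it, every Get
sees exactly one client's complete Put.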
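
And a toy illustration of the "replicated servers" idea from the
fault-tolerance topic, again a hypothetical sketch rather than lab code: the
client knows several replica addresses and falls back to the next one if a
dial or call fails. It uses Go's standard net/rpc package; the server type,
method, and addresses are invented for the example, and keeping the replicas'
data consistent -- the actual hard problem -- is what the labs are about.

  // Hypothetical sketch: client-side failover across replicated servers.
  package main

  import (
      "fmt"
      "log"
      "net"
      "net/rpc"
  )

  // Made-up RPC types for this sketch.
  type GetArgs struct{ Key string }
  type GetReply struct{ Value string }

  type KVServer struct{}

  func (s *KVServer) Get(args *GetArgs, reply *GetReply) error {
      reply.Value = "value-of-" + args.Key
      return nil
  }

  // clientGet tries each replica in turn until one answers.
  func clientGet(addrs []string, key string) (string, error) {
      for _, addr := range addrs {
          c, err := rpc.Dial("tcp", addr)
          if err != nil {
              continue // this replica looks dead; try the next one
          }
          var reply GetReply
          err = c.Call("KVServer.Get", &GetArgs{Key: key}, &reply)
          c.Close()
          if err == nil {
              return reply.Value, nil
          }
      }
      return "", fmt.Errorf("no replica reachable")
  }

  func main() {
      // Start one "replica" on :9001; pretend :9000 is a crashed replica.
      if err := rpc.Register(&KVServer{}); err != nil {
          log.Fatal(err)
      }
      l, err := net.Listen("tcp", "127.0.0.1:9001")
      if err != nil {
          log.Fatal(err)
      }
      go rpc.Accept(l)

      v, err := clientGet([]string{"127.0.0.1:9000", "127.0.0.1:9001"}, "x")
      fmt.Println(v, err)
  }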