Secure Multiparty Computation FAQ for Non-Experts
Welcome to our Secure Multiparty Computation (MPC) FAQ aimed at non-experts! This is a living document that we have compiled to support ongoing deployments of MPC analyzing the sensitive data of non-experts (e.g., collaborations between Boston University and the BWWC, and between Boston University and the MMF). As such, we include only a brief overview of the mathematical techniques used in MPC systems. If you are a student or researcher who wants to learn about MPC, Pragmatic MPC might be a good place to start!
Big thank you to Mayank Varia who helped compile this FAQ. If you want to suggest any modifications or additional questions, please send a message to kaptchuk@bu.edu!
-- Gabe Kaptchuk
What is MPC?
Secure Multiparty Computation is a set of computational techniques that allows computing on data without allowing a data analyst to actually see that data. Concretely, this means that organizations can learn insights (e.g., aggregate statistics) about data that is considered too private or sensitive to be shared. To achieve this apparent paradox, MPC distributes the trust typically placed in a single data analyst among multiple organizations/people, such that the privacy of the data is maintained as long as any one of the organizations/people behaves honestly.
What problems does MPC solve?
In cases where the data is considered to be private—either due to social consensus or legal confidentiality agreements—typical data analysis techniques (see below for more details) are unworkable. Data analysts with access to raw data can unacceptably invade the privacy of individuals whose information is contained in the data set, and a malicious data analyst could share data inappropriately. Moreover, it may be difficult to find an organization or data analyst willing to take on the responsibility (and potential liability) of safeguarding the sensitive data.
Using MPC techniques, the same reports can be created without designating a trusted party to perform the data analysis. All parties that participate in the MPC process see nothing about the data besides the output of the computation (i.e., the report). This alleviates the responsibility (legal or otherwise) that comes with being a data analyst and increases the confidence of the individuals whose information is contained in the data that their data will remain private.
How does the MPC workflow differ from typical data analysis?
Typical data analysis requires trusting a single organization or analyst with the ability to access and interrogate the raw data. It generally follows this template:
- Compile all the relevant data onto one computer (e.g., by surveying respondents or by sharing access to previously gathered data)
- Give a data analyst access to the compiled data
- The data analyst iteratively explores and analyzes the data
- The data analyst produces a report containing insights about the data that is shared for public consumption
The most significant difference with MPC is that the data analyst is replaced with a pre-selected set of computational parties and that the raw data is never compiled in the clear. Concretely, the typical MPC workflow is as follows:
- An organizer recruits a set of computational parties who agree to participate in the MPC protocol
- The organizer and computational parties agree on a set of analyses that will be carried out on the data
- Individuals (or external organizations) that hold data points submit their data to the computational parties encoded in a special format. This special format ensures that individual computational parties learn nothing about the data points
- After the encoded data has been collected, the computational parties run the MPC protocol to compute the set of analyses agreed upon in the second step. At the end of this protocol, the organizer learns exactly the output of these analyses (and nothing else about the data)
- The organizer compiles these results into a report which is then circulated for public consumption (as appropriate)
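To make the "special format" in the third step concrete, one common encoding used by MPC systems is additive secret sharing, in which each data point is split into random shares that individually look like noise but sum to the original value. The following is a minimal sketch of that idea in Python; the function names and the choice of modulus are illustrative assumptions, not the design of any particular deployed system:

```python
import random

PRIME = 2**31 - 1  # a public modulus; all arithmetic is done mod this prime

def share(value, num_parties):
    """Split `value` into random shares, one per computational party.
    Any subset of fewer than all the shares reveals nothing about `value`."""
    shares = [random.randrange(PRIME) for _ in range(num_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    """Only the sum of *all* the shares recovers the original value."""
    return sum(shares) % PRIME

# Two individuals each split their salary among three computational parties.
alice_shares = share(90_000, 3)
bob_shares = share(70_000, 3)

# Each party adds the two shares it holds, without seeing either salary.
sum_shares = [(a + b) % PRIME for a, b in zip(alice_shares, bob_shares)]

# Recombining the summed shares yields only the agreed-upon aggregate.
assert reconstruct(sum_shares) == 160_000
```

Each individual share is a uniformly random number, so no single computational party learns anything about the salaries; only the final recombination, performed on the aggregated shares, reveals the agreed-upon output.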
What are the benefits of MPC?
MPC shifts the trust paradigm of data analysis from “Trust Me,” in which a single party assumes responsibility for the confidentiality of data, to “Trust Anyone,” in which the confidentiality of the data is maintained as long as any one of the computational parties acts honestly. This shift has a dramatic impact on the concrete privacy of the data:
- Rogue data analysts cannot unilaterally release raw data publicly. While this worst-case scenario is rare in practice, the implications of such a release could be catastrophic.
- If an external attacker wishes to see or steal the raw data, they must breach the information systems of multiple computational parties. As these computational parties will likely have differing security practices, this significantly raises the difficulty for attackers.
- A data analyst will not be able to snoop through the raw data for data points that reveal personal information (e.g., about someone they know).
- The data can only be used to perform computations that are mutually agreed upon by the computational parties. This means that anyone who contributes data has increased confidence that their data will only be used for the pre-specified purposes, effectively preventing mission creep and p-hacking.
Is MPC better than anonymization?
Anonymization, a technique in which names, addresses, and other directly identifying features are removed from a dataset, is the most common approach for mitigating the privacy risks associated with data analysis. Unfortunately, there is significant documentation that anonymization (and its more powerful cousin, k-anonymity) simply doesn’t work. Latanya Sweeney famously was able to identify Massachusetts Governor William Weld’s medical records in an anonymized dataset in 1997. As commercial databases have grown more powerful, reidentification attacks have only become easier.
MPC prevents the reidentification attacks that plague anonymization. Specifically, data analysts are not able to look at any individual records in the underlying data, and therefore cannot make inferences about any individual who contributes data.
What risks still remain when using MPC?
There are two main concerns stakeholders should have when using MPC: (1) the privacy implications of reports (e.g., aggregate statistics or computed insights) produced from the data, and (2) whether the distributed trust assumptions made in the system seem reasonable.
- Privacy Implications of Reports: Although MPC prevents data analysts from seeing the raw data, disclosure of private details is still possible if the output report is not chosen carefully. As a contrived example, consider an output report containing the average income of an organization with only one employee; although the function can be described as an aggregate statistic, the report directly reveals private information about an individual. Controlling the privacy implications of aggregate reports is complex, and formally reasoning about these privacy implications requires orthogonal privacy-enhancing techniques like differential privacy.
- Distributed Trust Assumptions: MPC is able to provide strong privacy guarantees by distributing trust among many computing parties (trusting anyone rather than being forced to trust someone). The privacy of the data can only be violated if all of these computing parties collude or are compromised by an attacker. Computational parties should, therefore, be selected with care to make this assumption realistic.
Finally, we note that running MPC systems requires rendering intricate mathematics into software, which can result in imperfect implementations and security bugs. As such, it is important to ensure that the software is written carefully by professionals, following best practices for secure programming.
What can be computed using MPC?
In principle, MPC can be used to compute arbitrary functions of the input data, including all statistical tests. MPC is, however, slower than typical data analysis by orders of magnitude; computations that would usually take seconds will likely take minutes or hours. In MPC’s current state, most common statistical tests and simple aggregations are sufficiently computationally simple that MPC can be used efficiently. More computationally expensive tasks, like training complex machine learning models, remain active areas of MPC research and may be feasible in the near future.
Note that although the runtime of MPC may be slow compared to “normal” computation, most data analysis tasks that would use MPC can easily accommodate a few extra minutes (or hours) of computation time; the overhead of such computation is marginal relative to the length of the full project.
How difficult is it to use MPC?
MPC is mature enough as a technology to use today. However, the systems for deploying MPC are currently written in such a way that they require support from experts in cryptography to use. Designing MPC platforms that anyone can use is an ongoing and active area of research.
Does MPC require special systems?
No. MPC can be run using typical computers and typical software systems. Current developments of MPC endeavor to meet users where they are: data can be submitted through a normal web browser and results can be viewed using familiar software programs like Excel. As such, no special systems are required to run MPC.
Who is using MPC now?
MPC has recently found adoption in a number of public-facing academic projects and industry efforts. Starting in 2016, Boston University and the Boston Women’s Workforce Council have used MPC to study the wage gap in Boston. DARPA has experimented with using MPC to prevent collisions between satellites run by nations wary of sharing satellite location data. Finally, there are a large number of early-stage startups using MPC to address a wide variety of issues, including improving companies’ cybersecurity and securing cryptocurrency assets; these companies have formed an industry group called The MPC Alliance.
What differences are there in the planning process when using MPC?
Analyzing data with MPC requires preparation and a willingness to challenge the intuition that you might have built using typical data analysis techniques. We highlight several key differences below, but note that you might discover other differences, depending on your prior experiences.
- Plan Analysis Ahead of Time: Although it is possible to compute any function of the data using MPC, not all functions of the data are equally fast or easy to compute. As such, understanding the planned scope of analysis is critical to choosing the specific MPC system. For developers creating an MPC system, it is much more important to know the types of analysis (e.g., descriptive statistics, statistical tests, regressions, machine learning) than the exact data format.
- Recruiting Computational Parties: MPC leverages computational parties to federate trust. Because computational parties are not a typical feature of a data analysis process, recruiting organizations/parties to serve as computational parties may be a foreign experience. Moreover, as MPC is not a well-known technique, recruiting computational parties may require significant communication efforts and education.
- No Going Back: A common feature of MPC systems is unrecoverable decisions. For example, data may be encrypted under a cryptographic key; if this key is lost, the data cannot be recovered. In other situations, passwords may be used to secure data, and if a password is forgotten there is no recovery process. As such, clear communication about the importance of key files and passwords is critical.
- Data Exploration Weakens Privacy Guarantees: Typical data analysis approaches involve iterative, open-ended exploration of a data set before selecting the contents of the final report. In principle, similar exploration of the data is possible under MPC, as each step of the open-ended exploration can be computed in a privacy-preserving way. Unfortunately, such a process substantially diminishes the meaningfulness of MPC’s privacy guarantees: the more aggregations of the data are released, the easier it becomes for a data analyst to make observations about the raw data. As such, we encourage designing and finalizing the analysis plan before any computation has been executed.
- Manual Data Cleaning is Difficult: Data cleaning and sanitization (i.e., removing data that is improperly formatted) is a critical part of most data analysis efforts. Because MPC prevents data analysts from seeing the raw data, data cleaning is very difficult under MPC. As such, data should be cleaned before it is submitted to the MPC system. In the worst case, automated techniques for after-the-fact data cleaning (e.g., outlier detection and removal) can be performed using MPC, but this is likely to be an error-prone process with a significant computational cost.
Does MPC guarantee participant anonymity?
No. By default, deployments of MPC do not aim to hide the identities of the parties that contribute data. If the identities of the contributing parties should be kept secret, there exist methods for hiding them that can be added to the system. Please bring this up proactively with the designers of the MPC system if this property is required.
How does the math of MPC work?
There is no single way that MPC protocols work: the term “MPC” is shorthand for a collection of techniques that accomplish a similar goal. Detailing the full set of techniques used to design MPC systems is beyond the scope of this FAQ, but we illustrate a simple MPC protocol for privacy-preserving voting to show MPC’s basic tenets. While there are many other security properties you might want from a voting system, here we focus only on preserving privacy:
Imagine the workers within a warehouse want to vote on unionization. In the case where the unionization effort fails, the owner of the warehouse may retaliate against the workers who voted to unionize. As such, no one (not even the union organizers) wants to learn how each worker voted, as the existence of such a record is inherently dangerous. The only output that should be learned is the final vote tally.
To compute the final vote tally in a privacy-preserving way, the workers use the following voting procedure:
- The workers start by standing in a large circle (facing the middle of the circle), each holding a blank slip of paper and a pencil
- The organizer begins by selecting a random number, which we will denote with \(r\). This number could be any whole number greater than zero (e.g., 3, 1694, 9327234, 43, 531, etc.). The organizer writes \(r\) down on their slip of paper and privately passes it to the worker on their left.
- Whenever a worker receives a slip of paper (from their right), they write down a new number on their blank slip of paper and pass that new slip to their left. Let’s call the number on the slip of paper they received \(k\). The number they write on the new slip of paper is either:
- In the case that the worker wants to vote for unionization, they add one to the received number and write the result on the new slip of paper. That is, they write the value \(k+1\) on the new slip of paper.
- In the case that the worker wants to vote against unionization, they simply write the same number that they received on the new slip of paper. That is, they write the value \(k\) on the new slip of paper.
- This process proceeds clockwise around the circle, until the final slip of paper is passed to the organizer. Let’s call the number received by the organizer \(n\). The organizer subtracts \(r\) from \(n\) and announces the result (i.e., \(n-r\)). This is the number of workers who voted for unionization. To determine if the vote succeeded, everyone can check to see if \(n-r\) is greater than half the number of workers who participated in the vote.
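The voting procedure above can be simulated in a few lines of Python. This is a sketch of the arithmetic only; in the real procedure, each addition happens privately on a separate worker's slip of paper:

```python
import random

def union_vote(votes):
    """Simulate the paper-slip voting procedure.

    `votes` is a list of booleans, one per worker in the circle
    (True = a vote for unionization)."""
    r = random.randrange(1, 10**9)  # the organizer's secret random number
    slip = r
    for vote in votes:
        # Each worker adds 1 for a "yes" vote, or passes the number unchanged.
        slip = slip + 1 if vote else slip
    # The organizer subtracts r to recover the tally, and nothing else.
    return slip - r

votes = [True, False, True, True, False]
tally = union_vote(votes)
assert tally == 3
vote_passed = tally > len(votes) / 2  # majority check
```

Running `union_vote` with different random choices of \(r\) always yields the same tally, mirroring the fact that \(r\) cancels out when the organizer subtracts it at the end.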
Why is this procedure privacy preserving? Notice that the number that each worker sees is totally random, because \(r\) is selected randomly. Specifically, a worker knows nothing about how the workers before them voted. For example, let’s assume that the third worker receives the value 451. There are three possible scenarios:
- The organizer randomly selected \(r=451\) and the first two workers voted against unionization,
- The organizer randomly selected \(r=450\) and exactly one of the first two workers voted for unionization,
- The organizer randomly selected \(r=449\) and both prior workers voted for unionization.
Importantly, there is no way for the third worker to distinguish between these three scenarios! As such, we can conclude that the third worker learns nothing about the votes cast by the first two workers. The same arguments can be made about all of the workers.
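This indistinguishability can be checked directly: all three scenarios produce exactly the same slip of paper for the third worker. The following small Python check mirrors the arithmetic of the three scenarios above:

```python
# Each scenario pairs the organizer's random r with the first two workers'
# votes (True = voted for unionization). All three are consistent with the
# third worker receiving the value 451.
scenarios = [
    (451, [False, False]),  # r = 451, zero "yes" votes so far
    (450, [True, False]),   # r = 450, one "yes" vote so far
    (449, [True, True]),    # r = 449, two "yes" votes so far
]
for r, earlier_votes in scenarios:
    slip_seen_by_third_worker = r + sum(earlier_votes)
    assert slip_seen_by_third_worker == 451  # identical in every scenario
```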
The organizer sees the value \(n\) at the end of the voting procedure. While the organizer can learn how many workers voted for unionization, they cannot determine how any individual worker voted. This is because they only see an aggregate number at the very end of the protocol, and the pro-unionization votes could have been added into the tally at any point around the circle.
As system designers compute more complex functions of the data (i.e., more complicated than vote tallying), more complex techniques are required. If you are interested in learning more about the mathematics of MPC, there are many freely available courses and textbooks online, but their target audience is security researchers and MPC system designers.
Glossary
THIS IS A WORK IN PROGRESS
Throughout this FAQ, we use a number of terms to help explain MPC. These terms are not all standard in the technical literature, so below we provide a brief explainer of them. Additionally, we provide a glossary of terms commonly used in the technical MPC literature to help non-experts who have read multiple sources online.
We note that this is a living document, so this list may be incomplete. If you have any suggestions, just send an email to kaptchuk@bu.edu
- Computational Parties: The organizations or individuals recruited to jointly run the MPC protocol. They receive specially encoded data and perform computations on it without being able to see the underlying raw data.
- Data Analyst: In typical data analysis, the person or organization trusted with access to the compiled raw data in order to explore it and produce reports.
FAQ template from here.