In ACM Symposium on Theory of Computing (STOC), 2007.
Abstract:
We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions $f$ of the data with instance-specific additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the {\em smooth sensitivity} of $f$ on the database $x$ --- a measure of variability of $f$ in the neighborhood of the instance $x$. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a small amount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-specific noise in the context of data privacy.
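For reference, the two quantities at play can be written out explicitly. This is a minimal sketch of the definitions the abstract alludes to, for a real-valued $f$, with $LS_f$ denoting local sensitivity, $d(\cdot,\cdot)$ the Hamming distance between databases, and $\beta > 0$ a smoothing parameter (notation assumed here for illustration, not quoted verbatim from the paper):

\[
LS_f(y) = \max_{z \,:\, d(y,z)=1} \left| f(y) - f(z) \right|,
\qquad
S^{*}_{f,\beta}(x) = \max_{y} \left( LS_f(y) \cdot e^{-\beta\, d(x,y)} \right).
\]

The exponential decay in $d(x,y)$ is what keeps $S^{*}_{f,\beta}$ from changing abruptly between neighboring databases, so the noise magnitude itself does not leak information about the instance.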
Our framework raises many interesting algorithmic questions. In particular, to apply the framework one must compute or approximate the smooth sensitivity of $f$ on $x$. We show how to do this efficiently for several different functions, including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release $f(x)$ accurately on many databases $x$. This procedure is applicable even when no efficient algorithm for approximating the smooth sensitivity of $f$ is known or when $f$ is given as a black box. We illustrate the procedure by applying it to $k$-SED ($k$-means) clustering and learning mixtures of Gaussians.
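As a concrete illustration of the algorithmic task, here is a quadratic-time Python sketch of computing the smooth sensitivity of the median, assuming the entries lie in a known interval $[0, \Lambda]$ and using the lower median; the function and variable names are ours, not the paper's, and the exact formulation in the paper may differ in details:

```python
import math

def smooth_sensitivity_median(x, lam, beta):
    """Sketch: smooth sensitivity of the median of a database x whose
    entries are clipped to [0, lam], with smoothing parameter beta > 0.
    Convention assumed here: out-of-range positions in the sorted
    database read as 0 on the left and lam on the right."""
    n = len(x)
    xs = sorted(min(max(v, 0.0), lam) for v in x)
    m = (n - 1) // 2  # 0-based index of the lower median

    def entry(i):
        # Pad the sorted database at both ends of the range.
        if i < 0:
            return 0.0
        if i >= n:
            return lam
        return xs[i]

    best = 0.0
    for k in range(n + 1):
        # Largest local sensitivity within Hamming distance k of x:
        # the widest gap the median can be pushed across by changing
        # k entries of the database.
        ls_k = max(entry(m + t) - entry(m + t - k - 1)
                   for t in range(k + 2))
        best = max(best, math.exp(-beta * k) * ls_k)
    return best
```

Releasing the median plus noise scaled to this value, with a suitably heavy-tailed distribution and $\beta$ tied to the privacy parameter as analyzed in the paper, is the instance-specific perturbation described above.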
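The sampling-based procedure can likewise be sketched at a high level: split the database into disjoint random blocks, evaluate $f$ on each block, and privately aggregate the block answers. In the Python sketch below, `private_release`, `f`, and `num_blocks` are illustrative names, and the aggregation callback is a placeholder; the paper's actual aggregation step is more involved than any simple choice shown here.

```python
import random

def sample_and_aggregate(data, f, num_blocks, private_release):
    """High-level sketch of the sampling-based procedure. A single
    individual's record lands in exactly one block, so it can change
    at most one of the num_blocks answers; that is what keeps the
    final aggregation step low-sensitivity even when f is a black box."""
    shuffled = list(data)
    random.shuffle(shuffled)  # random disjoint partition into blocks
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    answers = [f(block) for block in blocks]  # f run on small samples
    # Placeholder for a differentially private aggregator of the block
    # answers; e.g., a median released with smooth-sensitivity noise is
    # one simple concrete choice.
    return private_release(answers)
```

Because only the aggregation step sees all the answers at once, $f$ itself can be an arbitrary black box, which is what makes the procedure applicable to tasks such as $k$-means clustering and learning mixtures of Gaussians.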