Wednesday 13/6/2018 at 2:30 p.m.
Prof. Stefano Favaro
Protection against disclosure is a legal and ethical obligation for statistical agencies releasing microdata files for public use. Given a cross-classification of sample records by categorical key variables, any decision about release is supported by measures of disclosure risk, the most common being the number τ of sample unique cells that are also population uniques. In this paper we depart from the dominant literature, which infers τ by modeling association among key variables, and we consider directly modeling the sample records. We develop a novel nonparametric Bayesian approach under the minimal assumption of a generalized Dirichlet prior for the random partition induced by the cross-classified sample records. This allows us to derive an explicit and simple expression for the posterior distribution of τ, as well as a large-sample Binomial approximation of it. Such closed-form results, combined with an estimator for the prior parameters designed to recognize the primary role of small cells, make inference on τ exact, easy to implement, computationally efficient and scalable to massive datasets. The proposed approach is tested on benchmark data from the U.S. 2000 census for the state of California, showing the same good performance as recent semiparametric Bayesian models for key variables.
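To make the definition of τ concrete, the following minimal sketch counts sample unique cells and checks which of them are also population uniques. It is only an illustration of the quantity being estimated, not the approach presented in the talk: it assumes the full population is observed, which is not the case in practice (hence the need for inference on τ). The function name `count_tau` and the toy records are hypothetical.

```python
from collections import Counter

def count_tau(sample, population):
    """Return (number of sample unique cells, tau).

    Each record is a tuple of categorical key variables; a cell is one
    distinct combination of values. tau counts the cells that contain
    exactly one record in the sample AND exactly one in the population.
    """
    sample_counts = Counter(sample)
    population_counts = Counter(population)
    # Cells observed exactly once in the sample
    sample_uniques = [cell for cell, n in sample_counts.items() if n == 1]
    # Among those, cells that are also unique in the population
    tau = sum(1 for cell in sample_uniques if population_counts[cell] == 1)
    return len(sample_uniques), tau

# Toy population (hypothetical): key variables are sex, age band, state
population = [
    ("F", "20-29", "CA"), ("F", "20-29", "CA"),
    ("M", "30-39", "CA"),
    ("M", "40-49", "NV"),
    ("F", "50-59", "CA"), ("F", "50-59", "CA"), ("F", "50-59", "CA"),
]
# Toy sample drawn from the population above
sample = [population[0], population[2], population[3], population[4], population[5]]

n_sample_uniques, tau = count_tau(sample, population)
print(n_sample_uniques, tau)
```

In this toy case three cells are unique in the sample, but only two of them are also unique in the population; the third sample unique shares its cell with another, unsampled record.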
Venue: Sala Pentagonale, 2nd floor, Via Bassini 12, 20133 Milano