We propose and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
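A loose illustration of the degradation idea is sketched below: records are perturbed at increasing rates, a summary statistic is recomputed, and records whose response to degradation departs from the bulk are flagged. The toy data, the degradation operator, and the z-score threshold are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(x, rate, rng):
    """Intentionally degrade a numeric vector by replacing a fraction
    `rate` of its entries with random values (one possible degradation)."""
    y = x.copy()
    mask = rng.random(x.size) < rate
    y[mask] = rng.normal(x.mean(), x.std(), mask.sum())
    return y

def degradation_response(x, rates, rng):
    """Measure how far a summary statistic moves from its original value
    as the degradation rate increases."""
    base = x.std()
    return np.array([abs(degrade(x, r, rng).std() - base) for r in rates])

# Toy "records": each row plays the role of one data vector.
records = rng.normal(0, 1, size=(50, 200))
records[7] *= 5.0                      # an artificial anomaly

rates = np.linspace(0.05, 0.5, 10)
responses = np.array([degradation_response(row, rates, rng).sum()
                      for row in records])

# Flag records whose degradation response is unusually large or small.
z = (responses - responses.mean()) / responses.std()
print("Flagged records:", np.where(np.abs(z) > 2.5)[0])
```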
International Conference on Digital Government Research, May 18, 2003
We describe a risk-utility framework for selecting among swapped data releases; NISS WebSwap, a Web service that performs data swapping; and the NISS WebSwap graphical user interface.
We articulate and investigate issues associated with performing statistical disclosure limitation (SDL) for data subject to edit rules. The central problem is that many SDL methods generate data records that violate the constraints. We propose and study two approaches. In the first, existing SDL methods are applied, and any constraint-violating values they produce are replaced by means of a constraint-preserving imputation procedure. In the second, the SDL methods are modified to prevent them from generating violations. We present a simulation study, based on data from the Colombian Annual Manufacturing Survey, that evaluates several SDL methods from the existing literature. The results suggest that (i) in practice, some SDL methods cannot be implemented with the second approach, and (ii) differences in risk-utility profiles across SDL methods dwarf differences across the two approaches. Among the SDL strategies, microaggregation followed by adding noise and partially synthetic ...
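As a concrete, much simplified instance of the first approach, the sketch below applies additive noise and then re-imputes any values that violate a nonnegativity edit rule by redrawing the noise until the rule is satisfied. The single edit rule, the noise scale, and the rejection-sampling repair are illustrative assumptions, not the constraint-preserving imputation procedure evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy establishment data subject to a simple edit rule: values must be
# nonnegative (real edit rules, e.g. balance constraints, are richer).
n = 1000
wages = rng.gamma(shape=1.5, scale=60.0, size=n)

# First approach: apply an SDL method (additive noise), then repair
# any values that violate the edit rule.
noise_sd = 40.0
noisy = wages + rng.normal(0, noise_sd, n)
violates = noisy < 0

# Constraint-preserving imputation: for violating records, redraw the
# noise until the released value satisfies the edit rule.
repaired = noisy.copy()
for i in np.where(violates)[0]:
    value = -1.0
    while value < 0:
        value = wages[i] + rng.normal(0, noise_sd)
    repaired[i] = value

print(f"{violates.mean():.1%} of noisy records violated the edit rule "
      "and were re-imputed")
```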
When releasing microdata to the public, data disseminators typically alter the original data to protect the confidentiality of database subjects' identities and sensitive attributes. However, such alteration negatively impacts the utility (quality) of the released data. In this paper, we present quantitative measures of data utility for masked microdata, with the aim of improving disseminators' evaluations of competing masking strategies. The measures, which are global in that they reflect similarities between the entire distributions of the original and released data, utilize empirical distribution estimation, cluster analysis, and propensity scores. We evaluate the measures using both simulated and genuine data. The results suggest that measures based on propensity score methods are the most promising for general use.
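One way to compute a propensity-score-based global utility measure of the kind described above is sketched below: stack the original and masked files, fit a model predicting membership in the masked file, and summarize how far the estimated propensities are from 1/2 (a pMSE-style statistic). The logistic-regression specification and the masking by additive noise are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Original microdata and a masked version (here, simple additive noise).
original = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=2000)
masked = original + rng.normal(0, 0.4, original.shape)

# Stack the two files and label which file each record came from.
X = np.vstack([original, masked])
T = np.concatenate([np.zeros(len(original)), np.ones(len(masked))])

# Fit a propensity model for "record is from the masked file".
model = LogisticRegression(max_iter=1000).fit(X, T)
p = model.predict_proba(X)[:, 1]

# pMSE-style utility: propensities near 0.5 mean the masked data are
# hard to distinguish from the original, i.e. higher utility.
pmse = np.mean((p - 0.5) ** 2)
print(f"propensity-score utility statistic (pMSE): {pmse:.4f}")
```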
Data swapping is a statistical disclosure limitation method used to protect the confidentiality of data by interchanging variable values between records. We propose a risk-utility framework for selecting an optimal swapped data release when considering several swap variables and multiple swap rates. Risk and utility values associated with each such swapped data file are traded off along a frontier of undominated potential releases, which contains the optimal release(s). Current Population Survey data are used to illustrate the framework for categorical data swapping.
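The frontier of undominated releases can be computed in a few lines: a candidate swapped file is dominated if some other candidate has both lower disclosure risk and higher utility. The (risk, utility) pairs below are hypothetical, made up solely to show the computation.

```python
# Hypothetical (risk, utility) pairs for candidate swapped releases,
# e.g. different swap variables and swap rates.
candidates = {
    "age@1%":   (0.80, 0.95),
    "age@5%":   (0.55, 0.85),
    "race@5%":  (0.60, 0.80),
    "age@10%":  (0.40, 0.70),
    "race@10%": (0.45, 0.65),
    "both@10%": (0.30, 0.50),
}

def undominated(cands):
    """Keep releases not dominated by another release with both
    lower (or equal) risk and higher (or equal) utility."""
    frontier = {}
    for name, (risk, util) in cands.items():
        dominated = any(r <= risk and u >= util and (r, u) != (risk, util)
                        for r, u in cands.values())
        if not dominated:
            frontier[name] = (risk, util)
    return frontier

print("Risk-utility frontier:", undominated(candidates))
```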
To protect confidentiality, statistical agencies typically alter data before releasing them to the public. Ideally, although rarely done, the agency releasing data also provides a way for secondary data analysts to assess the quality of inferences obtained with the released data. Quality measures can help secondary data analysts to disregard inaccurate conclusions resulting from the disclosure limitation procedures, as well as have confidence in accurate conclusions. We propose an interactive computer system that analysts can query for measures of data quality. We focus on potential disclosure risks of providing these quality measures.
When disseminating a data set to the public, agencies generally take three steps. First, after removing direct identifiers like names and addresses, the agency evaluates the disclosure risks inherent in releasing the data "as is." Almost always the agency determines that these risks are too large, so that some form of restricted access or SDL is needed. We focus on SDL techniques here, because of the importance to researchers and others of direct access to the data. Second, the agency applies an SDL technique to the data. Third, the agency evaluates the disclosure risks and assesses the analytical quality of the candidate data release(s). In these evaluations, the agency seeks to determine whether the risks are sufficiently low, and the usefulness is adequately high, to justify releasing a particular set of altered data (Reiter, 2012). Often, these steps are iterated multiple times, for example, a series of SDL techniques is applied to the data and subsequently evaluated for risk and utility. The agency stops when it determines that the risks are acceptable and the utility is adequate (Cox et al., 2011). To set the stage for our discussion of SDL frameworks and big data releases, we begin with a short overview of common SDL techniques, risk assessment, and utility assessment. We are not comprehensive here; additional information can be found in, for example, Federal Committee on Statistical Methodology (1994), Willenborg and
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality-threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
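The sketch below is a dense-array version of iterative proportional fitting that adjusts a table to a set of released margins; the paper's contribution is precisely the sparse data structures and generalized shuttle bounds that such a naive implementation lacks, so treat this only as a statement of the basic algorithm, with the toy 3-way table as an assumed example.

```python
import numpy as np

def ipf(table, margins, axes, tol=1e-8, max_iter=1000):
    """Iterative proportional fitting: scale `table` so that its sums
    over each axis set in `axes` match the corresponding target margins."""
    fitted = table.astype(float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for target, ax in zip(margins, axes):
            current = fitted.sum(axis=ax, keepdims=True)
            ratio = np.divide(target.reshape(current.shape), current,
                              out=np.ones_like(current), where=current > 0)
            fitted *= ratio
            max_change = max(max_change, np.abs(ratio - 1).max())
        if max_change < tol:
            break
    return fitted

# Toy 3-way table; fit a uniform start to the released AB and BC margins.
rng = np.random.default_rng(3)
full = rng.integers(1, 20, size=(2, 3, 4)).astype(float)
margin_ab = full.sum(axis=2)        # released AB margin
margin_bc = full.sum(axis=0)        # released BC margin

fitted = ipf(np.ones_like(full), [margin_ab, margin_bc], axes=[2, 0])
print(np.allclose(fitted.sum(axis=2), margin_ab),
      np.allclose(fitted.sum(axis=0), margin_bc))
```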
In industrial and government settings, there is often a need to perform statistical analyses that require data stored in multiple distributed databases. However, the barriers to literally integrating these data can be substantial, even insurmountable. In this article we show how tools from information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of these methods is that the owners share sufficient statistics computed on the local databases in a way that protects each owner's data from the other owners. Our focus is on horizontally partitioned data, in which data records rather than attributes are spread among the databases. We present protocols for securely performing regression, maximum likelihood estimation, and Bayesian analysis, as well as secure construction of contingency tables. We outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce incentives for owners to be dishonest.
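The secure summation idea underlying several of these protocols is simple to sketch: the first party masks its local statistic with a large random offset, each party adds its own value in turn, and the first party removes the offset at the end, so no party sees another's contribution. The modulus and the party ordering below are illustrative choices.

```python
import secrets

MODULUS = 2 ** 61 - 1   # large modulus; an illustrative choice

def secure_sum(local_values):
    """Secure summation: party 0 masks its value with a random offset,
    the masked running total is passed party to party, and party 0
    finally removes the offset.  No party sees another party's value."""
    offset = secrets.randbelow(MODULUS)
    running = (local_values[0] + offset) % MODULUS      # party 0 masks
    for v in local_values[1:]:                          # parties 1..k-1
        running = (running + v) % MODULUS               # each adds locally
    return (running - offset) % MODULUS                 # party 0 unmasks

# Example: owners of horizontally partitioned data share only the
# sums needed for a global statistic (e.g. a cell count).
local_counts = [120, 85, 210]    # one count per database owner
print(secure_sum(local_counts))  # 415, without revealing any single count
```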
We present the old-but-new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations in computer science, total quality management and statistics are reviewed. Two case studies based on an EDA approach to data quality are used to motivate a set of research challenges for statistics that span theory, methodology and software tools.

The organizing principle underlying our views about DQ is that DQ problems and actions are driven by decisions based on the data. In some contexts, for example, whether to undertake military action on the basis of intelligence reports or whether to cease marketing a drug because of reported side effects, both the nature and the import of the decisions are clear. In others, such as much "basic" scientific research, the decisions and consequences are more nebulous. This organizing principle does, however, have tangible implications. In particular, "good" or "improved" DQ is often not an end in itself, because it is the quality of the decisions, not the data, that ultimately matters. For instance, as discussed further in §6, data clean-up efforts, especially if they are costly, must be justified on the basis of better decisions. Like any principle, this one breaks when taken to extremes. Cleaning up a customer database to remove duplicates, for example, has economic benefits even in the absence of links to explicit decisions. And good scientific data are inherently valuable.

2.1 Overview

We begin with a definition that supports our position that DQ should always be embedded in a decision-theoretic context: Data quality is the capability of data to be used effectively, economically and rapidly to inform and evaluate decisions. Necessarily, DQ is multi-dimensional, going beyond record-level accuracy to include such factors as accessibility, relevance, timeliness, metadata, documentation, user capabilities and expectations, cost and context-specific domain knowledge. DQ concerns are problems of large-scale machine and human generation of data, the assembly of large data sets, data anomalies, and organizational influences on data characteristics such as accuracy, timeliness and cost. The impact of poor DQ and the potential benefit of good DQ have implications beyond the ambit of standard statistical analyses.

DQ has dramatic implications. Some people blame the U.S. government's failure to avert the terrorist attacks of September 11, 2001 on DQ problems that prevented the easy availability of prompt, accurate, and relevant information from key federal databases. In a different context, the Fatality Analysis and Reporting System discussed in §5 may not be of sufficient quality (vehicle make-model data were rife with errors) to support rapid identification of such problems as those associated with Firestone tires on Ford Explorers. At a more mundane level, nearly every major company loses significant income because of data errors: they send multiple mailings to a single person, mishandle claims, disaffect customers, suffer inventory shortfall, or simply spend too much on corrective data processing.

The impetus for improved DQ is especially strong in federal agencies. Managers there are struggling with declining survey response rates and consequent diminished quality. Simultaneously, Congress and the executive branch are using the Government Performance and Results Act (GPRA) to require clear linkage between regulation and measurable outcomes. This confluence of a decreasing ability to obtain accurate measurements and increasing management accountability for achieving data-determined goals has compelled federal managers to address DQ more directly than ever before. Some federal organizations are trying to formalize aspects of the DQ process. For example, the Bureau of Transportation Statistics (BTS) has tried to evaluate databases using a data quality report card (DQRC).
We present a method for performing statistically valid linear regressions on the union of distributed chemical databases that preserves confidentiality of those databases. The method employs secure multi-party computation to share local sufficient statistics necessary to compute least squares estimators of regression coefficients, error variances and other quantities of interest. We illustrate with an example containing four companies' rather different databases.
Journal of Computational and Graphical Statistics, 2005
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in such a way that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics or any other analysis can then be performed on the integrated data. Secure multi-party computation based on shared local statistics effects the computations necessary to compute least squares estimators of regression coefficients and error variances by means of analogous local computations that are combined additively using the secure summation protocol. We also provide two approaches to model diagnostics in this setting, one using shared residual statistics and the other using secure integration of synthetic residuals.
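For ordinary least squares, the shared-local-statistics approach reduces to securely summing each owner's X'X and X'y and then solving the normal equations; the sketch below shows that arithmetic directly, with the secure summation step replaced here by an ordinary sum. The data shapes, the number of owners and the noise level are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def local_stats(X, y):
    """Each owner computes its local sufficient statistics."""
    return X.T @ X, X.T @ y, len(y), y @ y

# Horizontally partitioned data: three owners, same three predictors.
beta_true = np.array([1.0, -2.0, 0.5])
owners = []
for n in (200, 350, 150):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta_true + rng.normal(0, 1.0, n)
    owners.append(local_stats(X, y))

# In the actual protocol these sums are formed by secure summation,
# so no owner sees another owner's statistics.
XtX = sum(s[0] for s in owners)
Xty = sum(s[1] for s in owners)
n_total = sum(s[2] for s in owners)
yty = sum(s[3] for s in owners)

beta_hat = np.linalg.solve(XtX, Xty)
sigma2_hat = (yty - Xty @ beta_hat) / (n_total - len(beta_hat))
print("beta_hat:", beta_hat.round(3), " sigma^2:", round(sigma2_hat, 3))
```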
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002
We describe two classes of software systems that release tabular summaries of an underlying database. Table servers respond to user queries for (marginal) sub-tables of the "full" table summarizing the entire database, and are characterized by dynamic assessment of disclosure risk in light of previously answered queries. Optimal tabular releases are static releases of sets of sub-tables that are characterized by maximizing the amount of information released, as given by a measure of data utility, subject to a constraint on disclosure risk. We discuss the underlying abstractions, primarily associated with the query space as well as released and unreleasable sub-tables and frontiers; computational algorithms and issues, especially scalability; and prototype software implementations.
When several data owners possess data on different records but the same variables, known as horizontally partitioned data, the owners can improve statistical inferences by sharing their data with each other. Often, however, the owners are unwilling or unable to share because the data are confidential or proprietary. Secure computation protocols enable the owners to compute parameter estimates for some statistical models, including linear regressions, without sharing individual records' data. A drawback to these techniques is that the model must be specified in advance of initiating the protocol, and the usual exploratory strategies for determining good-fitting models have limited usefulness since the individual records are not shared. In this paper, we present a protocol for secure adaptive regression splines that allows for flexible, semi-automatic regression modeling. This reduces the risk of model misspecification inherent in secure computation settings. We illustrate the protocol with air pollution data.
We construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap ...
In this paper we study the impact of statistical disclosure limitation in the setting of parameter estimation for a finite population. Using a simulation experiment with microdata from the 2010 American Community Survey, we demonstrate a framework for applying risk-utility paradigms to microdata for a finite population, which incorporates a utility measure based on estimators with survey weights and risk measures based on record linkage techniques with composite variables. The simulation study shows that special caution is needed for variance estimation in finite populations when the released data are masked by statistical disclosure limitation. We also compare various disclosure limitation methods, including a modified version of microaggregation that accommodates survey weights. The results confirm previous findings that a two-stage procedure, microaggregation followed by adding noise, is effective in terms of data utility and disclosure risk.
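A minimal sketch of that two-stage procedure, microaggregation followed by additive noise, is given below, without the survey weights that the paper's modified version accommodates. The group size, noise scale and toy income variable are illustrative assumptions.

```python
import numpy as np

def microaggregate(x, k=3):
    """Univariate microaggregation: sort the values, form groups of k
    consecutive records, and replace each value by its group mean."""
    order = np.argsort(x)
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), k):
        idx = order[start:start + k]
        out[idx] = x[idx].mean()
    return out

rng = np.random.default_rng(5)
income = rng.lognormal(mean=10, sigma=0.8, size=1000)

# Stage 1: microaggregation; Stage 2: additive noise proportional
# to the variable's standard deviation.
masked = microaggregate(income, k=3)
masked = masked + rng.normal(0, 0.1 * income.std(), size=income.size)

print("original mean:", round(income.mean()),
      " masked mean:", round(masked.mean()))
```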
Papers by Alan Karr