General remarks


 What are the Characteristics of Big Data and How do they Affect the Risk of Identification?

Below is a list of Big Data specificities in terms of privacy, identified by the Big Data Privacy Group. It is written from the perspective of an official agency releasing Big Data or statistics based on it.

o Velocity. Data get “younger and younger”. The time of day (perhaps to within seconds) associated with data could make it identifiable. Related to this, if a constant stream of new data are released in real time, data that were previously considered non-identifiable may quickly become identifiable. This may mean that assessments of identification risk need to be ongoing.

o Variety. Multiple data sources containing different types of data, available at different levels of aggregation, require multiple solutions for the protection of privacy. Given the variety of information, all people or businesses may be identifiable in terms of at least some variables.

o Size. Virtually unlimited. Applying current risk assessment techniques to such large data may be computationally difficult and time-consuming.

o Veracity (or accuracy). Reporting errors, poorly defined metadata, data that are masked for the purposes of statistical disclosure control, or data that are out of date will reduce the likelihood of identification. For example, data obtained directly from people, especially through social media, may be less accurate and so less likely to lead to identification than information obtained from an institution, such as a bank.

o Value. The value could be high if there is the potential for substantial monetary reward from identification (e.g., commercially sensitive data) or if the information is highly private or personal (e.g. health conditions, political opinions or personal behavior). The greater the value, the more likely attempts at disclosure will be made and the greater the disclosure risk.

o Availability. Very different entities can be data owners. The type of cooperation with them determines whether raw or already aggregated data are obtained, and therefore affects the type of confidentiality protection required.

o Aggregate. A significant part of the data will already be aggregated at the stage at which they are obtained from the owners.

o Awareness of society. The need to educate and reassure society about the use of Big Data will affect the public view on the level of confidentiality protection required.

o Geographical differences. Limited access to new technologies in some societies creates the need to ensure adequate privacy tools regardless of the technology available.

 Data access strategies for big data

Over the years, statistical organizations have established three different settings for providing external researchers with access to their data: dissemination of microdata to the public, onsite access in research data centers (RDC), and remote access (see also Deliverable 1 for a detailed discussion of the three access strategies). In the context of Big Data, microdata dissemination is no longer an option in most cases. One feature of Big Data is that its sheer size makes transferring the data cumbersome. It would not be possible to send the data to the researcher on CDs or provide a link for download, as is the current practice with microdata dissemination. Thus, onsite and remote access will be the only viable solutions.

Onsite access has the advantage that the agency has better control over who accesses the data and what is done with them. Researchers are usually not allowed to bring their own devices, such as laptops or cameras, into RDCs; all their activities are monitored; and no analysis result leaves the RDC before it has been carefully checked by RDC staff for potential confidentiality violations. However, this manual output checking is a labor-intensive task, and most RDCs are already working at their limits, resulting in increasing waiting times before researchers can obtain the cleared outputs. If the standard set of surveys and administrative data already offered at the RDC is enriched by new data sources, it is likely that offering manual output checking for all these databases will no longer be feasible. It is thus more relevant than ever that general strategies for ensuring the confidentiality of the generated outputs without manual intervention are developed. These strategies will likely consist of a combination of output checking to suppress risky outputs and output perturbation to allow the release of outputs that are considered safe in most situations but could be misused by an ill-intentioned user to learn private information about specific individuals in the database. If general strategies to automatically ensure confidentiality of the outputs can be developed, and a secure connection to the server hosting the data can be guaranteed, it will only be a question of convenience whether the data are accessed on site or remotely. However, whether this goal can ever be achieved is currently an open question and an area of active research.

 Approaches to privacy in the context of Big Data

In this section, current and future approaches to privacy in the Big Data context are presented.

Possible Use of Mobile Data

Potential information from Big Data about mobile phone users includes the GPS location, time of day, and duration of phone calls, and the GPS location and time of day of text messages. The GPS location could be the location of the mast that routes the communication or the location of the phone user at the time of communication. There are many important questions that mobile device data could help answer, such as "what are the daily and weekly changes in regional population counts", "are there any seasonal patterns in population counts", and "are there periods of un-seasonal movements and can these be attributed to specific events, such as severe drought or storms", to name a few. The benefit to society of answering such questions is clear. For example, it could help governments target transportation infrastructure spending where it is most needed. There is also the potential to link people who are selected in surveys conducted by statistical organisations to their mobile phone data (e.g. when selecting a person in a survey, ask the person for their mobile phone number). This would allow analysis of how personal characteristics (e.g. employment, education) are related to mobility.

From a privacy perspective, it is interesting to ask whether a person's GPS locations are sensitive and, if so, at what level of geographic detail (e.g. suburb, state) GPS locations can be aggregated so that they are no longer sensitive. Furthermore, how identifiable is GPS location? Clearly, a mobile phone's GPS location centred on a particular house would likely mean that the mobile phone belongs to a person who lives at that house. This would clearly be a disclosure of the person's GPS locations. Again, at what level of geographic detail would GPS locations need to be aggregated so that they are no longer identifiable? The answer to this question is related to the frequency with which GPS locations are taken: the more frequently GPS locations are taken, the more likely they are to lead to identification. If a mobile phone user travels to a particular sequence of suburbs on a routine basis, this could be potentially identifying.


The PARAT software

Protected dissemination of patient-level data is critical for healthcare organizations to increase the quality of care, improve patient safety and reduce costs. The PARAT software, which automates de-identification and masking of data for secondary use, has been developed and utilized in Canada. PARAT is based on peer-reviewed algorithms and technology, and it complies with the United States Health Insurance Portability and Accountability Act and other legal requirements for sharing data. The Ontario Cancer Data Linkage program, which aims at linking Ontario’s rich cancer data resources and providing the de-identified data directly to health services researchers, used PARAT to assess the risks of re-identification and to de-identify the data. Key government officials expressed their approval of using PARAT for this program. For more information about PARAT, see http://www.privacyanalytics.ca/software/

Hadoop (HDFS and MapReduce)

Hadoop is an open-source software project, developed by the Apache Software Foundation, targeted at supporting the execution of data-oriented applications on clusters of generic hardware. The core component of the Hadoop architecture is a distributed file system implementation, namely HDFS. HDFS realizes a file system abstraction on top of a cluster of machines. Its main feature is the ability to scale to a virtually unlimited storage capacity by simply adding new machines to the cluster at any time. The framework transparently handles data distribution among machines and fault-tolerance (through replication of files on multiple machines).

The basic idea behind the processing of Big Data is to distribute the processing load among different machines in a cluster, so that the complexity of the computation can be shared. The number of machines can be adapted to fit the complexity of the task, the input size and the expected processing time. Besides managing distributed storage, a technological platform such as Hadoop also provides the abstraction of distributed processing in a cluster.

The MapReduce paradigm is a programming model specifically designed for writing programs whose execution can be parallelized. A MapReduce program is composed of two functions: “map”, which specifies a criterion to partition the input into categories, and “reduce”, where the actual processing is performed on all the input records that belong to the same category. The distributed computation framework (e.g. Hadoop) takes care of splitting the input and assigning each part to a different node in the cluster, which in turn takes the role of “mapper” or “reducer”. Hence, the general processing task is split into separate sub-tasks and the result of the computation on each node is independent of all the others. The physical distribution of data through the cluster is handled transparently by the platform, and the programmer only needs to write the processing algorithm according to the MapReduce rules.

MapReduce is a general paradigm not tied to a specific programming language although the core implementation in Hadoop requires MapReduce programs to be written in Java.
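
To make the paradigm concrete, the following Python sketch simulates the map, shuffle and reduce phases locally on a toy dataset. It illustrates the programming model only (it does not use Hadoop or its Java API), and the record fields and function names are hypothetical.

    # Local simulation of the MapReduce programming model (illustration only).
    from collections import defaultdict

    def map_phase(record):
        # Emit (key, value) pairs: here, one count per call, keyed by suburb.
        suburb, duration = record
        yield suburb, 1

    def reduce_phase(key, values):
        # Aggregate all values that share the same key.
        return key, sum(values)

    records = [("Carlton", 120), ("Fitzroy", 45), ("Carlton", 300)]

    # "Shuffle" step: group mapper output by key, as the framework would do.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)

    print([reduce_phase(key, values) for key, values in grouped.items()])
    # [('Carlton', 2), ('Fitzroy', 1)]

In a real Hadoop job the mapper and reducer would run on different nodes, and the framework itself would perform the grouping and data movement between them.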

Pig and Hive, from the Sandbox

Pig is a high-level interface to MapReduce. Writing MapReduce programs in Java can be complex, and even common operations on data, like joins and aggregations, may require a significant amount of code. This requires trained IT developers, slowing down the analysis of data. Pig is based on a high-level language, PigLatin, oriented to data transformation and aggregation. Complex operations on data can be performed with short scripts that can easily be read and modified by business analysts as well.

Hive is a SQL interface to MapReduce. It allows data in HDFS to be structured in tables, as in a relational database, and queried using familiar SQL constructs such as selections, joins and filters. Hive and Pig have different purposes and complement each other in the Hadoop ecosystem: Hive is oriented to interactive querying of data, while Pig supports building complex transformation flows.

RHadoop

RHadoop is an R library that makes it possible to write MapReduce programs in R. Once installed and configured, it integrates with the Hadoop cluster, allowing users to read and write files from/to HDFS and to stream MapReduce jobs over the cluster. The advantage it provides is the possibility of using a language that is already familiar to statistical users, allowing them to work on Big Data with a graceful learning curve while exploiting established know-how.



Information Integration and Governance

Database Activity Monitoring

It is important to keep track of who has access to your databases and what they are doing at any given time. System event logging, monitoring and auditing are crucial in determining whether anything inappropriate is occurring within an environment. For example, events to monitor include:

  • Database logons and logoffs;
  • Database administrator actions;
  • Attempted accesses that are denied;
  • Attempts to elevate privileges;
  • Changes to user roles or rights;
  • Addition of new users, especially administrative users;
  • Changes to the database structure;
  • Access to particularly sensitive information;
  • Any query or database alerts or failures;
  • Any query containing comments; and
  • Any query containing multiple embedded queries.

In addition to the above, all of these logs should be stored on a central logging server, both to collate the data and to ensure that they do not remain only on the server where the event occurred, in case that server is compromised or modified by an attacker. Given the volume of activity in a Big Data environment, the number of events covering the above will be quite high, which means it will need to be determined whether the risk associated with not monitoring these events, or not keeping them stored for a sufficient period of time, is acceptable.
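
As a minimal illustration of centralised event logging, the following Python sketch forwards database audit events to a remote syslog collector using the standard library; the host name and event fields are hypothetical placeholders.

    # Forward database audit events to a central syslog server (sketch).
    import logging
    import logging.handlers

    audit_log = logging.getLogger("db_audit")
    audit_log.setLevel(logging.INFO)

    # Ship events off the database host so they survive a local compromise.
    handler = logging.handlers.SysLogHandler(address=("logserver.example.org", 514))
    audit_log.addHandler(handler)

    def log_event(event_type, user, detail):
        # event_type could be LOGON, LOGOFF, DENIED_ACCESS, PRIVILEGE_CHANGE, ...
        audit_log.info("%s user=%s detail=%s", event_type, user, detail)

    log_event("LOGON", "analyst42", "successful database logon")
    log_event("DENIED_ACCESS", "analyst42", "attempted SELECT on a restricted table")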

Identifying any potential threats before they happen, or escalate, helps to protect the data and the reputation of the organisation through reducing the chances of it being compromised.

Security Considerations

Privacy should be embedded into the design and architecture of IT systems and business practices. It is not something that should be included at the end as an add-on; instead, it should become an essential component of the core functionality being delivered. This helps to ensure that end-to-end security is embedded into the system before the first piece of information has been collected and extends throughout the entire lifecycle. One of the main references to help guide an organisation regarding best practices is the International Organisation for Standardisation (ISO) [1]. ISO is an independent, non-governmental organisation whose members are the national standards bodies of 163 countries; it is responsible for publishing standards covering almost every industry. The ISO has developed an "Information Security Management Systems" collection [2], which is made up of a number of standards to help an organisation manage the security of assets entrusted to it by third parties. Implementing good IT security practices will give confidence to users of Big Data looking to communicate and trade in secure environments.

Below are several key security best practices that should be considered as a baseline to help mitigate potential risks to IT systems and business practices:

Separation of Duties

This concept involves the development of good internal controls to help restrict the amount of power or influence held by any given user within an organisation. Separation of duties focuses on two primary objectives:

  1. Detection of control failures, such as information theft, security breaches and circumvention of security controls.
  2. Prevention of conflict of interest, wrongful acts, the appearance of conflict of interest, fraud, errors, and abuse.

For example, the person who is responsible for designing and implementing security should not be the same person who is responsible for testing security, conducting security audits, or monitoring and reporting on security. Responsibilities should be assigned within an organisation so that checks and balances can be established within the system to minimise the opportunity for unauthorised access and fraud.

Separation of Concerns

Similar to Separation of Duties, this concept focuses on the systems themselves within an environment and separating them into parts that overlap in functionality as little as possible. By using a more modular approach, especially in the development of a system, it can help with maintainability, extensibility and reusability.

Principle of Least Privilege

The basis of this principle is to provide only the minimum level of access rights a user, or role, requires in order to complete their job. For example, an external evaluator carrying out a documentation review would be provided with a limited user access account, allowing them to access only the documentation they require as part of their review. To implement this principle it is important to have an appropriate authorisation process in place to help identify exactly what any given user, or role, requires access to in order to carry out any given task.
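
A minimal sketch of such an authorisation check is given below; the role names and permissions are purely illustrative.

    # Least-privilege check: a role is granted only the actions explicitly listed for it.
    ROLE_PERMISSIONS = {
        "external_evaluator": {"read_documentation"},
        "database_admin": {"read_documentation", "modify_schema", "manage_users"},
    }

    def is_allowed(role, action):
        return action in ROLE_PERMISSIONS.get(role, set())

    print(is_allowed("external_evaluator", "read_documentation"))  # True
    print(is_allowed("external_evaluator", "modify_schema"))       # False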

Defence in Depth

The purpose of Defence in Depth is to ensure that there are multiple security mechanisms in place so that, in the event that one is compromised, others remain to mitigate the threat. For example, a bank uses security guards, security cameras, time-release locks, and vaults so that if one of these security mechanisms fails there are a number of others still in place. The level of implementation of this principle will depend on the risk associated with the system/solution being implemented. The higher the risk and the consequences of a compromise, the more security mechanisms should be implemented to ensure the integrity and confidentiality of the system/solution.

Security of Transportation

The use of internationally recognised transport mechanisms, such as Transport Layer Security (TLS), is recommended to ensure that all information is transferred securely between locations. This will also help to mitigate potential data leaks that could impact the reputation of an organisation.
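
As a minimal illustration, the following Python sketch retrieves a file over HTTPS (i.e. HTTP over TLS) with certificate verification enabled; the URL is a hypothetical placeholder.

    # Transfer data over TLS with server certificate verification (sketch).
    import ssl
    import urllib.request

    context = ssl.create_default_context()         # verifies certificate and host name

    url = "https://data.example.org/extract.csv"   # placeholder URL
    with urllib.request.urlopen(url, context=context) as response:
        payload = response.read()

    print(len(payload), "bytes transferred over an encrypted channel")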

Data Encryption

Storage privacy refers to the ability to store data without anyone being able to read (let alone manipulate) it, except the party having stored the data (called the data owner) and whoever the data owner authorises. A major challenge in implementing private storage is to prevent non-authorised parties from accessing the stored data. If the data owner stores data locally, then physical access control might help, but it is not sufficient if the computer equipment is connected to a network; a hacker might succeed in remotely accessing the stored data. If the data owner stores data in the cloud, then physical access control is not even feasible. Hence, technology-oriented countermeasures are needed. These include:

  • Full Disk Encryption (FDE)
  • File System-Level Encryption (FSE)
  • Steganographic Storage
  • Secure Remote Storage
  • Searchable Encryption
  • Homomorphic Encryption
  • Secure Multiparty Computation
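
As a minimal illustration of file-level encryption at rest, the following Python sketch uses the third-party cryptography package (an assumption; any vetted encryption library could be substituted). Only holders of the key, i.e. the data owner and whoever the owner authorises, can read the stored ciphertext.

    # Symmetric encryption of data at rest (requires the "cryptography" package).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # kept secret by the data owner
    cipher = Fernet(key)

    plaintext = b"respondent_id,income\n42,55000\n"
    token = cipher.encrypt(plaintext)  # only the ciphertext is stored

    assert cipher.decrypt(token) == plaintext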

 

Statistical Disclosure Limitation/Control (SDL)

The four main types of statistical disclosure are:

  • Identity Disclosure
  • Attribute Disclosure
  • Inferential Disclosure
  • Population/Model Disclosure

There are a number of controls that can be put in place to mitigate the likelihood of these disclosures occurring. 

Preserving Confidentiality

There are ethical, legal and pragmatic reasons to ensure, and to care about, the privacy and confidentiality of a person's data. It is not just a matter of keeping promises: failure to do so may result in people not wanting to provide accurate data, and it is also required under law.

Restricted Access

This relates closely to section 1. b) (i) regarding Separation of Duties. It is important to ensure that there are a number of conditions put in place that relate to:

    • Which location(s) can the data be accessed from?
    • What are the variables that can be accessed?
    • What is the purpose of the access?
    • What data can be published?
    • Who can access the data?

Restricted Data

The data that is provided will come in a number of formats so it is important to ensure that it is reviewed before being released. This will help to ensure that no confidential information is disclosed.

Combination of Both

The use, for example, of online access with an approach such as secure computation and distributed databases will help to protect the confidentiality of the data.

Providing access to useful statistical data, not just a few numbers

It is important to ensure that the released data, while not identifiable, fulfils a purpose and is still meaningful. As outlined by Slavkovic, this includes sufficient variables and statistics to allow for proper multivariate analyses, as well as the ability to assess goodness of fit of models.

Strike a balance between “Data Utility” and “Disclosure Risk”

Data Masking: Transform original data to disseminated data

Statistical disclosure limitation consists in the transformation of original (sensitive) data through the application of data manipulation algorithms, whose output is a new "masked" data set with a lower risk of disclosure.

Traditional Approaches:

Aggregation: Rounding, Top/Bottom Coding & Thresholding

This approach consists of the generalization of individual observations by means of rounding continuous variables, or making them discrete. It also includes re-coding categorical variables into more general categories.
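
A minimal sketch of rounding and top coding applied to a continuous income variable follows; the rounding base and top-code threshold are purely illustrative.

    # Rounding and top coding of a continuous variable (sketch).
    incomes = [18250, 52300, 49120, 310000, 87500]

    ROUND_TO = 1000      # round to the nearest thousand
    TOP_CODE = 150000    # report all larger values as the threshold itself

    def mask_income(value):
        rounded = round(value / ROUND_TO) * ROUND_TO
        return min(rounded, TOP_CODE)

    print([mask_income(v) for v in incomes])
    # [18000, 52000, 49000, 150000, 88000]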

Suppression/Obfuscation, e.g. Cell Suppression

Data suppression consists of removing from the original dataset those observations that are deemed to represent a greater risk of either direct or indirect disclosure.

Data Perturbations

Perturbative methods attempt to hide sensitive information by confounding it with an additional layer of random noise.
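
A minimal sketch of additive noise perturbation is shown below; the noise scale is illustrative and would in practice be chosen to balance analytical validity against disclosure risk.

    # Perturbation by additive random noise (sketch).
    import random

    incomes = [18250, 52300, 49120, 310000, 87500]
    NOISE_SD = 2500      # illustrative noise standard deviation

    random.seed(1)       # fixed seed only to make the example reproducible
    perturbed = [round(v + random.gauss(0, NOISE_SD)) for v in incomes]
    print(perturbed)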

Data Swapping/Shuffling

This approach consists of interchanging the values of one or more variables between pairs or among groups of records in a dataset.
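
A minimal sketch of data swapping, exchanging the values of one variable between randomly paired records, is given below; the records and the swapped variable are illustrative.

    # Swap one variable between randomly chosen pairs of records (sketch).
    import random

    records = [
        {"id": 1, "suburb": "Carlton",   "income": 52000},
        {"id": 2, "suburb": "Fitzroy",   "income": 87000},
        {"id": 3, "suburb": "Richmond",  "income": 310000},
        {"id": 4, "suburb": "Brunswick", "income": 49000},
    ]

    random.seed(2)
    order = list(range(len(records)))
    random.shuffle(order)

    # Pair the records in shuffled order and swap the "suburb" value within each pair.
    for a, b in zip(order[0::2], order[1::2]):
        records[a]["suburb"], records[b]["suburb"] = (
            records[b]["suburb"], records[a]["suburb"])

    for r in records:
        print(r)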

Modern Approaches: Sampling and Simulation Techniques

Synthetic Data

This approach refers to the use of statistical models to simulate data records that emulate the main distributional characteristics of the original microdata.
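
A minimal sketch of fully synthetic data generation is shown below: a simple parametric model is fitted to the original values and new records are sampled from it. The normal model is purely illustrative; practical synthesizers use far richer models.

    # Generate synthetic values from a model fitted to the original data (sketch).
    import random
    import statistics

    original = [18250, 52300, 49120, 87500, 61000, 45800]

    mu = statistics.mean(original)      # fit a simple parametric model
    sigma = statistics.stdev(original)

    random.seed(3)
    synthetic = [round(random.gauss(mu, sigma)) for _ in original]
    print(synthetic)                    # released instead of the original values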

Remote Access Servers

 

This is an increasingly popular way to provide users with access to detailed data for legitimate research purposes. An example of the process has been provided below:

  1. A statistical organization provides a description of the structure of the dataset to the user/researcher
  2. The user develops a software program specifying the analytic computations she is interested in
  3.  The statistical organization then executes the program supplied by the researcher on its internal computer system
  4.  Finally, the statistical organization communicates the output to the user after verifying compliance with existing data confidentiality policies

Secure Computation

The idea of secure computation is the use of a "global" database partitioned among multiple organizations. Because the data are proprietary and/or confidential, the organizations want to be able to perform statistical analysis on this database without combining the individual databases. There is no actual data integration across the "global" database; each organization involved is responsible for securing its own data, as well as for ensuring that identities or sensitive attributes are not revealed.

Partial Information Releases

Using a partial data release approach helps to ensure that, in the event of a compromise, the risk of full disclosure is avoided. Using tools designed for linear/integer programming it is possible to create these "splits" while ensuring the data remain useful and relevant. The approach should be considered only once the risk of disclosure of the data has been identified.

Other methods:

  • Location Data;
  • Argus; and
  • Special Unique Detection Algorithm (SUDA2).


Manage Potential Risk to Reputation (Public Image)

Enforce ethical practices in the supply chain

Statistical organizations not only need to effectively manage actual disclosure risks, but also perceived disclosure risks.  This is essential to secure the continued cooperation of data providers and public trust in general. Therefore, statistical disclosure limitation methods need to be supplemented with:

  • Legal and administrative instruments of accountability
  • Informed consent

Establish strong compliance controls

To help establish trust with providers it is important for an organisation to be able to demonstrate compliance, which is then confirmed through audits. While a holistic approach to data security and privacy will go a long way, it is a necessity to demonstrate and prove compliance to both internal and external auditors. There are numerous compliance controls available to help an organization, such as Data/Database Activity Monitoring, but in order to know what controls to implement it is important for an organization to identify and understand the major privacy/security compliance requirements they must adhere to, which include, but are not limited to:

  • Basel II;
  • Health Insurance Portability and Accountability Act (HIPAA);
  • Payment Card Industry Data Security Standard (PCI DSS); and
  • Sarbanes-Oxley Act (SOX).

Once identified, an organisation then needs to establish the relevant compliance controls to help meet those requirements. For example, by automating the capture of new compliance-related data from all sources, including regulatory bulletins, web page updates, and internal memos, as well as targeted distribution of data to staff members, a leading healthcare provider was able to greatly enhance its compliance position.

Develop a monitoring system to track reputational threats

It is important to always consider implementing a logging, auditing and alerting system as part of your environment to ensure that any potential risks are identified as soon as possible. By configuring the system accordingly you can be alerted when a potential incident might be occurring, allowing you to stop it before it becomes a threat to your organization's reputation.

Develop clear communication channels with internal and external stakeholders (Transparency)

  • Improve the reporting experience;
  • Explain what you’re doing and why;
  • Keep everyone in the loop;
  • “Market” the organization’s reporting and compliance programs;
  • Make sure leadership is listening; and
  • Inform the citizen better on the potential use of their data.

Create a crisis communication plan

A crisis communication plan is required because, in the event of an emergency, a business needs to be able to respond quickly to a number of different audiences with information specific to their interests and needs. The image of an organization can be greatly impacted, either positively or negatively, by the public perception of how they handle an incident.

Organisation of dialogue with the public (both sending information and receiving feedback comments)

Developing a communication channel with the public provides an organisation with the opportunity to improve the services that it offers, including security/privacy controls.

Inform the citizen about the use of his/her data and give feedback 

It is important to make providers feel comfortable that the data they provide to an organization will be securely stored and used responsibly. To help reassure providers that their data are safe, an organization should endeavour to provide as much information as possible about how the data will be used and stored. Helping a provider to better understand what their data are being used for, and the benefit of that use, encourages them to engage and to get others involved. Creating a feedback mechanism benefits both parties, as it allows the organization to liaise with a provider for further information, while also allowing providers to identify any strengths or weaknesses of a system.

Legislation

One of the things to consider is how Big Data privacy will be managed from a legislative perspective. In Europe, Directive 95/46/EC on the protection of individuals regulates the collection and processing of personal data. Directive 2002/58/EC, amended by Directive 2009/136/EC, deals with privacy and electronic communications (the ePrivacy Directive). Other legislative factors to consider include where your system will store the data, as there are different confidentiality rights depending on the location where the data are hosted. For example, any data hosted in the United States of America is subject to the Patriot Act, which permits the U.S. Government to access personal information on request.

Risk-based approach

Notice and consent regarding the use of Big Data is becoming increasingly impractical, impossible, or illusory, meaning that a risk-based approach to privacy can deliver appropriate protection. It is important, as the business, to assess the potential harms of the proposed data uses and identify appropriate controls to mitigate any risks. For example, a risk matrix can be used to identify the criteria to consider if any given event were to occur within an organisation, such as privileged user access credentials being compromised. Once the consequence of a given risk has been identified, it is important to identify the likelihood of that risk occurring, based on any controls already in place to mitigate that risk, or ones that can be implemented, with a Medium risk or lower generally considered acceptable.




References

[1] International Organisation for Standardisation (ISO), 2014. https://www.iso.org/

[2] ISO Information Security Management Systems Collection, 2014. https://www.iso.org/obp/ui/#iso:pub:PUB200004:en

De Filippi, D. (2014). Big Data, Big Responsibilities. Internet Policy Review. DOI: 10.14763/2014.1.227

Rubinstein, I. S. (2013). Big Data: The End of Privacy or a New Beginning?, International Data Privacy Law Advance Access.

Rubinstein, I. S. (2013). Big Data: A Pretty Good Privacy Solution. Big Data & Privacy: Making Ends Meet Digest, 106-108.

Big Data Working Group (2013). Expanded Top Ten Big Data Security and Privacy Challenges, Cloud Security Alliance.

Cavoukian, A. (2011). Privacy by Design: The 7 Foundational Principles, 1-3.

Slavkovic, A. B. (2007). Overview of Statistical Disclosure Limitation: Statistical Models for Data Privacy, Confidentiality and Disclosure Limitation.

Heyder, M. (2014). Tackling the Risks of Big Data, Privacy Perspectives.

IBM (2014). Data Profit vs. Data Waste.

IBM Redbooks (2014). Information Governance Principles and Practices for a Big Data Landscape, 47, 134, 142, 156.



Stock take: Overview of existing tools for risk management in view of privacy issues

Statistical organizations collect information from people and businesses, here called data providers, in order to support decision-making, research and discussion within governments and the community. This information is typically released to government and the community in the form of estimates or micro-data.

Data providers reasonably expect that disclosure of their sensitive information (e.g. income or health condition) is unlikely. Without this expectation of privacy, data providers may not provide information to the agency. This in turn would undermine the ability of the agency to fulfil its mission. Different people, cultures, and nations have a variety of expectations about what information is sensitive and what constitutes "unlikely". In some cases there will be an explicit legal answer to these questions, but in many cases judgement is required.

This chapter reviews current practices by statistical organizations that aim to manage two conflicting aims: to provide access to data for the benefit of society and to meet society's expectation that sensitive information about data providers will be kept private. This chapter will focus on these practices, and their advantages and disadvantages, in so far as they relate to traditional data sources (e.g. sample surveys and administrative data collected by governments). Section a) broadly discusses how risks to privacy depend on characteristics of the data, how the data are released and the potential motivations to breach a data provider's privacy.  Section b) discusses the different ways in which statistical organizations allow analysts or researchers to access micro-data. Section c) discusses managing the risk of disclosure associated with databases.   Section d) is a brief summary of the advantages and disadvantages of the different approaches to managing privacy. Section e) mentions some relevant software, some of which is freely available. 

Risks to privacy

 For simplicity, in this chapter it is assumed that all data are sensitive. The risk of disclosing sensitive information about a data provider is called disclosure risk. For micro-data releases, disclosure could occur by identification or attribute disclosure, even after removing name and address. Identification disclosure can occur when a record is matched to its corresponding record on an external administrative file that contains the data provider’s name and address. This can occur via a quasi-identifier (e.g. in the case of a person this matching key could be age, sex, suburb and occupation) available on both the micro-data and an external administrative file.  Attribute disclosure occurs when new information is disclosed about a person or business. Continuing with the above example, following identification of the person, all sensitive variables or attributes contained in the micro-data are disclosed.

For estimate releases, disclosure can occur as for micro-data releases, except that information about a data provider must first be derived from one or more estimates. For example, consider a frequency table with dimensions age, sex, suburb, occupation and income range, and suppose that a margin defined in terms of age, sex, suburb and occupation has a count of one. The values of the variables that define the margin could be used as a quasi-identifier for the purpose of identification disclosure, and the internal cell of the table would reveal the person’s income range, which is attribute disclosure.

The risk of disclosure is also influenced by the type of access.  Analysts can directly access micro-data, which means they can observe data values. Direct access could be supervised by official agency staff via onsite access or it could be unsupervised via a micro-data release. In the case of estimate releases, the disclosure risk is affected by whether it is the official agency who decides what estimates are released (e.g. publications) or whether the analyst has some flexibility to decide what estimates are released (e.g. via some form of remote access).   

Two broad areas of managing disclosure risk are:

  1. Reducing the risk that an attempt at disclosure will be made

  2. Releasing data in a manner that is not likely to enable disclosure

Risk 1 refers to the risk that a user is motivated to attempt disclosure or, alternatively, that a user is not sufficiently deterred from making an attempt at disclosure. This risk can be managed by masking the data, by selecting which users are granted access to the data, by training users, and by the way in which the data are accessed (e.g. access to micro-data via a remote server). Factors affecting whether a user would attempt disclosure are:

Motivation to attempt disclosure. The motivation could be high if there is the potential for substantial monetary reward from disclosure (e.g. commercially sensitive data) or if the information is highly private or personal (e.g. health conditions, political opinions or personal behavior).

Deterrents to attempting disclosure. These include: requiring users to sign legally enforceable undertakings not to attempt any disclosure, threatening penalties for breaching these undertakings, and auditing users to ensure they are using the data for the purpose for which access was granted.

Effort, skills and technology required to make a disclosure. With technological advancements, attempts at identification may require little effort or skill. Applying methods of statistical disclosure control will increase the effort and skills required to make an identification. Developing standard ‘profiles’ of users (e.g. private individual, telecommunications company, and government organization) may help identify users for whom the required skills and technology are currently available.

Risk 2 refers to the risk of disclosure assuming that an attempt at disclosure was made. The factors affecting this type of risk include:

• Amount of data accessible (or that can be viewed) by the user. Is the data available in the form of a unit record file, a set of predetermined aggregates, or in a set of user-defined aggregates (e.g. via a remote server or from user requests);

• Level of detail in the data: is the detail in the data sufficient for identification?

• Presence of remarkable units that are readily identifiable.

• Accuracy of the collected data: Reporting errors, poorly defined metadata, data that are masked for the purposes of statistical disclosure control or data that is out-of-date, all reduce the likelihood of identification.

• Coverage of the data: A person can only be identified if their information is available in the data. Therefore, the greater the chance that the person is in the data, the greater the risk of disclosure for the person.

• Presence of other information that can assist in identification, including:

      o publicly available information;

      o restricted access data holdings that a data user may have access to; and

      o personal knowledge that a user may have.

Data access strategies

For many years, national statistical organizations were the only entities collecting, storing, and analyzing large amounts of data. Until the early eighties of the last century, direct access for external researchers to the data collected by the organizations was unthinkable. No-one outside the organizations had the capability to analyze these data, and strict regulations prevented anybody not employed at the organizations from having access to the data on the premises of the organization. So the main concern regarding confidentiality was that published tables based on the underlying microdata might lead to identification and attribute disclosure for individuals included in the database. When modern desktop computers became available in the early eighties, researchers at universities and other research institutions suddenly had the capacity to analyze microdata on their own computers, and calls for access to data collected by government agencies were raised for the first time. In response to these calls, statistical organizations developed different strategies over the last decades to enable external researchers to analyze their data without violating confidentiality regulations. These strategies can be summarized under three broad headings: microdata dissemination, onsite analysis in research data centers, and remote access.

Onsite access

An alternative approach for providing data access to external researchers emerged at the end of the last century. Many statistical organizations established research data centers (RDC) on their premises. This form of data access is usually not available to the general public; only academic researchers may use this approach. Generally, external researchers who wish to analyze the data have to submit a research proposal in which they explain their research ideas, discuss which datasets they want to use, and justify why their research questions can only be answered using the specific dataset. Some organizations also require that the researchers determine in advance which variables they will be using. If they realize later that they need additional variables to answer their research questions, a new proposal has to be submitted. Once the proposal has been submitted, the organization checks whether data access can be granted. In many cases formal approval from other legal entities also needs to be obtained. For example, at the German Institute for Employment Research (IAB), each requested data access needs to be formally approved by the Ministry of Labour. Once the request has been approved, the researcher signs a contract in which she guarantees that she will not use the data for re-identification purposes. The researcher can then schedule a time slot during which she will visit the organization. She will be provided with computing facilities and access to the requested data. The computing facilities are not connected to the internet, and the data and all analysis code can only be stored in selected folders on the desktop machine. All printing and external data storage devices such as USB drives are disabled. The researcher is not allowed to bring any electronic devices such as laptops, cameras, or cell phones inside the RDC, and all her activities on the computer are monitored. She can manipulate the data in any way, but any output that is to be used outside the RDC, for example for inclusion in a research article, will be carefully scrutinized by RDC staff to ensure that no confidentiality guarantees are violated if the output is released.

Compared to microdata dissemination, the organization has much better control over what happens with the data. Only accredited researchers have access, the data never leave the RDC, the activities of the researchers are monitored, and any output taken out of the RDC is carefully checked before release. For this reason, more datasets are usually available on site, and often the data are only altered slightly or not altered at all. An obvious disadvantage of the approach is that it is costly for both the researcher and the data-providing organization. Researchers have to invest time and money to travel to the RDC and find accommodation for several nights. Often, unanticipated problems arise when working with the microdata, and thus the time slot that was requested might not suffice. However, a simple extension of the time slot is often not possible since the machines might be booked by other researchers. Thus, the researcher will have to wait until another slot becomes available to finish her research. On the other hand, the approach is cost-intensive for the statistical organization since staff are needed to run the facility and to check each output that is to be used outside the RDC.

A recent approach to mitigate some of the costs, at least for the researchers, is the RDC-in-RDC solution. The idea is that statistical organizations allow researchers to access their data in the safe environment of RDCs run by other agencies. For example, the IAB has set up several RDC-in-RDC solutions throughout Germany and also at several universities in the USA. With this approach the traveling distances for the researchers are greatly reduced, and researchers who might not otherwise be able to cover the traveling expenses can now access the data at an RDC that is closer to their own research institution.

Remote access

The latest development in data access research is remote access. With remote access, external researchers connect to a secure server and run their analyses on these machines. The microdata never leave the secure environment. A predecessor of remote access is remote execution. With remote execution, researchers are provided with dummy or test datasets. These datasets mimic the structure and content of the original data as closely as possible, but no meaningful inferences can be obtained from them. Since the analytical validity requirements are much lower than the requirements for microdata dissemination, it is much easier for statistical organizations to generate such datasets and disseminate them to the public. Based on these test data, researchers develop their analysis code and send the code to the RDC. Staff at the RDC run the analysis code on the original data. The output is checked manually for confidentiality violations and sent back to the researcher if no violations are detected. Remote execution has already been established at several organizations. However, the output still needs to be checked manually, which means an investment for the RDC. Furthermore, researchers have to wait until the output has been released before they can decide whether further adjustments are necessary. Since the test data will not reflect all subtleties of the original data, analysis code might have to be sent back and forth between the RDC and the researcher several times before the final output is obtained.

To avoid these hassles, a logical next step is to circumvent any manual intervention. However, besides technical problems such as ensuring a safe connection between the computer of the researcher and the secure server, several other issues complicate the establishment of remote access solutions. The main problem is that in many countries any output that is shown on the screen of the researcher would be considered a data dissemination. Furthermore, a full remote access solution requires that all outputs transmitted to the researcher are guaranteed to conform with data confidentiality regulations. In many cases it seems trivial to decide whether an output can be classified as safe or not. For example, maximum queries should never be answered since they provide information on individual records. However, it is not always easy to detect problematic queries, and misguided data users could use several strategies to hide their attacks on the data (see O’Keefe and Chipperfield (2013) for an excellent review). Furthermore, several studies (see for example Gomatam et al. (2005), Bleninger et al. (2011), or Drechsler et al. (2014)) have shown that seemingly harmless multivariate analyses such as linear regressions could be used to obtain exact information about individuals in the database. Section C (ii) below expands upon this discussion about remote access.

Microdata dissemination

Ever since the early eighties, statistical organizations have been developing strategies to disseminate microdata without violating confidentiality requirements. Since the organizations have no control over who will access the data and what the data are used for once they are disseminated, they need to ensure that the released data are sufficiently protected. In many countries legal regulations require that it is strictly impossible to identify any individual in the released data, even if the information that would be learned is not considered sensitive. Disseminating data that still contain valuable information is a challenging task under this requirement. Early protection methods such as swapping (see discussion below) mostly focused on ensuring confidentiality, paying little attention to the potentially negative impacts of the approaches on analytical validity, but over the years more and more sophisticated methods were developed (see the review below) and nowadays thousands of datasets are available around the world for public use, protected by a variety of methods to ensure confidentiality. A good example is the IPUMS International project (www.ipums.org) from the University of Minnesota, which offers access to public use microdata samples (PUMS) of Census data from 74 countries. Still, sensitive microdata such as business data or medical data are often not released. Balancing disclosure risk and analytical validity is especially difficult for these data since re-identification is often easy and, at the same time, the contained information is especially sensitive. Developing data dissemination strategies for these data types is an area of current research.

Database privacy technologies

Database privacy technologies are required when analysts are restricted from viewing all information contained in the database. The meaning of database privacy largely depends on the context in which the concept is used. In co-operative market analysis, it is understood as keeping private the databases owned by the various collaborating corporations. In healthcare, two requirements may be implicit: patients must keep their privacy, and medical records should not be transferred from a hospital to, say, an insurance company. In official statistics, it normally refers to the privacy of the respondents to which the database records correspond. In the context of interactively queryable databases and, in particular, Internet search engines, the most rapidly growing concern is the privacy of the queries submitted by users (especially after scandals like the August 2006 disclosure by the AOL search engine of 36 million queries made by 657,000 users). Thus, what makes the difference is whose privacy is being sought.

The last remark motivates splitting database privacy into the following three dimensions:

  1.  Owner privacy is about two or more autonomous entities being able to compute queries across their databases in such a way that only the results of the query are revealed.

  2. Respondent privacy is about preventing re-identification of the respondents (e.g. individuals like patients or organizations like enterprises) to which the records of a database correspond. Usually respondent privacy becomes an issue only when the database is to be made available by the data collector (hospital or national statistical office) to third parties, like researchers or the public at large. This type of privacy is relevant to the discussion on remote data access, mentioned in b) above.

  3. User privacy is about guaranteeing the privacy of queries to interactive databases, in order to prevent user profiling and re-identification.

The technologies to deal with the above three privacy dimensions have evolved in a fairly independent way within research communities with surprisingly little interaction.

Privacy-preserving data mining (data owner’s privacy)

Owner privacy is the primary, though not the only, goal of privacy-preserving data mining (PPDM). PPDM has become increasingly popular because it allows sharing sensitive data for analysis purposes. It consists of techniques for modifying the original data in such a way that the private data and/or knowledge (i.e. the rules) remain private even after the mining process. PPDM may also provide respondent privacy as a by-product.

PPDM based on random perturbation was introduced by Agrawal and Srikant (2000)  in the database community. This type of PPDM is largely based on statistical disclosure control (discussed below). Independently and simultaneously, PPDM based on secure multi-party computation was introduced by Lindell and Pinkas (2000) in the cryptographic community. We next discuss some specifics of PPDM for data hiding and for knowledge hiding.

PPDM for data hiding

The approach can be oriented to countering attribute disclosure (protecting values of confidential attributes) or identity disclosure (protecting against re-identification).

Random perturbation PPDM for data hiding uses the statistical disclosure control methods (see below). Cryptographic PPDM for data hiding uses secure multi-party computation for a wide range of clustering and other data mining algorithms in distributed environments, where the data are partitioned across multiple parties. Partitioning can be vertical (each party holds the projection of all records on a different subset of attributes), horizontal (each party holds a subset of the records, but each record contains all attributes) or mixed (each party holds a subset of the records projected on a different subset of attributes). For example, a secure scalar product protocol based on cryptographic primitives is applied in privacy preserving k-means clustering over a vertically distributed dataset in Vaidya and Clifton (2000).

Using secure multi-party computation protocols which are based on cryptography or sharing intermediate results often requires changing or adapting the data mining algorithms. Hence, each cryptographic PPDM protocol is designed for a specific data mining computation and in general it is not valid for other computations. In contrast, random perturbation PPDM is more flexible: a broad range of data mining computations can be performed on the same perturbed data, at the cost of some accuracy loss.

PPDM for knowledge hiding

Knowledge hiding or rule hiding (Verykios and Gkoulalas-Divanis, 2008) refers to the process of modifying the original database in such a way that certain sensitive (confidential) rules are no longer inferable, while preserving the data and the non-sensitive rules as much as possible. A classification rule is an expression r: X → C, where C is a class item (e.g. Credit=yes in a data set classifying customers between those who are granted credit and those who are not) and X is an itemset (a set of attribute values) containing no class item (e.g. Gender=female, City=Barcelona). The support of a rule r: X → C is the number of records that contain X, whereas the confidence of the rule is the proportion of records that contain C among those containing X. Rule hiding techniques change the data to decrease the confidence or support of sensitive rules to less than the minimum confidence or support threshold required for a rule to be inferred. Data perturbation, data reconstruction, data reduction and data blocking are some principles that have been proposed to implement rule hiding.
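
A minimal sketch computing the support and confidence of a rule over a toy dataset is given below; the records and the rule are illustrative. Rule hiding would perturb or block values until one of these quantities falls below the chosen threshold.

    # Support and confidence of the rule {Gender=female, City=Barcelona} -> Credit=yes (sketch).
    records = [
        {"Gender": "female", "City": "Barcelona", "Credit": "yes"},
        {"Gender": "female", "City": "Barcelona", "Credit": "no"},
        {"Gender": "male",   "City": "Barcelona", "Credit": "yes"},
        {"Gender": "female", "City": "Madrid",    "Credit": "yes"},
    ]

    X = {"Gender": "female", "City": "Barcelona"}   # antecedent itemset
    C = ("Credit", "yes")                           # class item

    matching_X = [r for r in records if all(r[k] == v for k, v in X.items())]
    support = len(matching_X)
    confidence = sum(r[C[0]] == C[1] for r in matching_X) / support

    print(support, confidence)   # support 2, confidence 0.5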

Private information retrieval (user’s privacy)

User privacy has found solutions mainly in the cryptographic community, where the notion of private information retrieval (PIR) was invented (Chor et al., 1995). In PIR, a user wants to retrieve an item from a database or search engine without the latter learning which item the user is retrieving. In the PIR literature the database is usually modeled as a vector. The user wishes to retrieve the value of the i-th component of the vector while keeping the index i hidden from the database. Thus, it is assumed that the user knows the physical address of the sought item, which might be too strong an assumption in many practical situations. Keyword PIR (Chor et al., 1995) is a more flexible form of PIR: the user can submit a query consisting of a keyword and no modification in the structure of the database is needed.
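
As a toy illustration of the idea, in the spirit of the two-server information-theoretic scheme of Chor et al., the following sketch retrieves record i by sending each of two non-colluding servers a query that, on its own, is just a uniformly random set of indices; the database contents are hypothetical.

    # Toy two-server PIR: each server sees only a random-looking index set (sketch).
    import random
    from functools import reduce

    database = [b"rec0", b"rec1", b"rec2", b"rec3"]   # both servers hold a copy
    n, i = len(database), 2                           # the user wants record i = 2

    # User: pick a random subset S1 and let S2 differ from S1 exactly at index i.
    S1 = {j for j in range(n) if random.random() < 0.5}
    S2 = S1 ^ {i}                                     # symmetric difference

    def server_answer(indices):
        # Each server XORs together the records at the requested indices.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                      (database[j] for j in indices), b"\x00" * len(database[0]))

    # User: XOR both answers; every record except record i cancels out.
    answer = bytes(x ^ y for x, y in zip(server_answer(S1), server_answer(S2)))
    print(answer)                                     # b'rec2'

Each query on its own is uniformly random, so neither server alone learns anything about i; the price is that essentially all records must be touched, which is the O(n) cost discussed below.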

The PIR protocols in the above cryptographic sense have two fundamental shortcomings which hinder their practical deployment:

  1. The database is assumed to contain n items and PIR protocols attempt to guarantee maximum privacy, that is, maximum server uncertainty on the index i of the record retrieved by the user. Thus, the computational complexity of such PIR protocols is O(n). Intuitively, all records in the database must be “touched”; otherwise, the server would be able to rule out some of the records when trying to discover i. For large databases, O(n) computational cost is unaffordable.
  2. It is assumed that the database server co-operates in the PIR protocol. However, it is the user who is interested in her own privacy, whereas the motivation for the database server is dubious. Actually, PIR is likely to be unattractive to most companies running queryable databases, as it limits their profiling ability. This probably explains why no real instances of PIR-enabled databases can be mentioned.

If one wishes to run PIR against a search engine, there is another shortcoming beyond the lack of server co-operation: the database cannot be modeled as a vector in which the user can be assumed to know the physical location of the keyword sought. Even keyword PIR does not really fit, as it still assumes a mapping between individual keywords and physical addresses (in fact, each keyword is used as an alias of a physical address). A search engine allowing only searches of individual keywords stored in this way would be much more limited than real engines like Google and Yahoo. In view of the above, several relaxations of PIR have been proposed. These fall into two categories:

Standalone relaxations. In Domingo-Ferrer (2009), a system named Goopir is proposed in which a user masks her target query by OR-ing its keywords with k−1 fake keyword sets of similar frequency and then submits the resulting masked query to a search engine or database (which does not need to be aware of Goopir, let alone co-operate with it). Goopir then locally extracts the subset of query results relevant to the target query. Goopir does not provide strict PIR, because the database knows that the target query is one of the k OR-ed keyword sets in the masked query. TrackMeNot (Howe and Nissenbaum, 2009) is another practical system based on a different principle: rather than submitting a single masked query for each actual query like Goopir, a browser extension installed on the user’s computer hides the user’s actual queries in a cloud of automatic “ghost” queries submitted to popular search engines at different time intervals. Again, this is not strict PIR, because the target query is one of the submitted ones.

Multi-party relaxations. The previous standalone approaches rely on fake queries, and this can be viewed as a weakness: it is not easy to generate fake queries that look real and, besides, overloading the search engines/databases with fake queries clearly degrades performance. Multi-party relaxations avoid fake queries by allowing a user to enlist the help of other entities, either for anonymous routing or for cross-submission:

–  An onion-routing system like Tor (Tor Project, 2014) is not intended to offer query profile privacy because it only provides anonymity at the transport level; however, if complemented with the Torbutton component, Tor can offer application-level security (e.g. by blocking cookies) and it can be used for anonymous query submission.

–  In proxy-based approaches, the user sends her query to a proxy, which centralizes and submits queries from several users to the database/search engine. While using the proxy prevents the database/search engine from profiling the user, the proxy itself can profile the user, and this is a weakness.


Another principle is peer-to-peer: a user can submit queries originated by peers, and conversely. In this way, peers use each other’s queries as noise. The advantage is that this noise consists of real queries by other peers, rather than fake queries as in standalone systems. The game-theoretic analysis in Domingo-Ferrer and González-Nicolás (2012) shows that cross-submission among peers is a rational behavior. Proposals exploiting this approach include Crowds (Reiter and Rubin, 1998) and, more recently, the peer-to-peer scheme of Domingo-Ferrer et al. (2009).

 Remote access statistical disclosure control

In the case of remote access to a database as discussed in b (ii), a simple model for a remote server is:

  1. An analyst submits a query (for example, a request for a frequency table), via the internet, to the remote server.

  2. The remote server processes the analyst's query on the linked micro-data.

  3. The outputs (e.g. counts) are modified or restricted in order to ensure the risk of disclosure is acceptably low.

  4. The remote server sends the modified output, via the internet, to the analyst.

The aim of statistical disclosure control (i.e. Step 3) is to ensure that the risk of disclosure from releasing output via the remote server is acceptably small. Seemingly harmless multivariate analyses such as linear regressions can be manipulated to reveal information at the individual level (see, for example, Bleninger et al. 2011 or Drechsler et al. 2014). Identity disclosure occurs if an attacker is able to infer that a particular person belongs to the database; membership of a database can itself be sensitive (e.g. if the database contains records about patients suffering from a particular medical condition). If an attacker knows a person is on the database, attribute disclosure occurs if the attacker is able to infer the characteristics of that person from the database.

As mentioned in step 3, there are two main SDC principles for queryable database protection:

  • Query perturbation: Perturbation (noise addition) can be applied to the microdata records on which queries are computed (input perturbation) or to the query result after computing it on the original data (output perturbation).

  • Query restriction: The database refuses to answer certain queries.

Query restriction is the right approach if the user requires deterministically correct answers and these answers have to be exact (i.e. a number). Exact answers may be very disclosive, so it may be necessary to refuse answering certain queries at some stage. A common criterion for deciding whether a query can be answered is query set size control: the answer to a query is refused if this query, together with the previously answered ones, isolates too small a set of records (see the sketch below). The main problems of query restriction are: (i) the computational burden of keeping track of previous queries; and (ii) collusion attacks can circumvent the query limit.
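A minimal sketch of query set size control is given below. It assumes queries arrive as boolean masks over the microdata rows and uses a simple pairwise differencing check against previously answered queries; a real system would need a more sophisticated audit of query overlaps, which is exactly the tracking burden mentioned above. The threshold and the toy data are illustrative.

    import numpy as np

    class RestrictedServer:
        # toy query-restriction server applying query set size control
        def __init__(self, threshold=5):
            self.threshold = threshold
            self.answered = []  # masks of previously answered queries

        def count(self, mask):
            if mask.sum() < self.threshold:
                return None  # refuse: the query set itself is too small
            for prev in self.answered:
                # refuse if differencing against an earlier answer would isolate a small set
                diff = np.logical_xor(mask, prev).sum()
                if 0 < diff < self.threshold:
                    return None
            self.answered.append(mask)
            return int(mask.sum())

    ages = np.array([25, 37, 41, 58, 63, 66, 71, 79])
    server = RestrictedServer(threshold=3)
    print(server.count(ages > 60))                   # 4 -> answered
    print(server.count((ages > 60) & (ages != 63)))  # None -> refused (would isolate one person)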

Privacy of estimates and microdata

Measuring the risks to privacy when a statistical agency releases estimates or microdata has received considerable attention in the literature (for a review see Hundepool et al., 2012).

Microdata

Microdata releases allow analysts to view the data values for individual records. Even though name and address information is removed from the microdata before release, a risk of disclosure remains. Disclosure risks for microdata releases include:

  • The risk of identifying a person by linking a record in the microdata to an administrative file that contains name and address, via a set of quasi-identifiers (e.g. age, sex, suburb, occupation)

  • Spontaneous recognition, i.e., the direct identification of an individual from a set of quasi-identifiers in the dataset without using any external information, such as an administrative file.

Let Nq be the number of people in the population with quasi-identifier value q. The above disclosure risks are often quantified in terms of 1/Nq: the smaller Nq, the higher the risk of identification.

A non-perturbative method to manage the disclosure risks is to reduce the detail of data items that could be used as quasi-identifiers. The less detail provided in the microdata, the higher the value of Nq and the lower the disclosure risk. For continuous variables, common strategies include discretizing (e.g. reporting only 5-year age groups instead of exact ages) and top coding, i.e. replacing values above a certain threshold with the threshold value (e.g. income values above 100,000 EUR might be reported as “100,000+”). For categorical variables, a common strategy is to collapse sparsely populated categories.
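The recoding strategies just described can be sketched with a few lines of pandas-style code; the column names, age bands, top-coding threshold and rarity cut-off below are illustrative assumptions, not a prescription.

    import pandas as pd

    def recode(df):
        # non-perturbative recoding: coarsen quasi-identifiers so that Nq increases
        out = df.copy()
        # discretize exact age into 5-year bands
        out["age_band"] = pd.cut(out["age"], bins=range(0, 105, 5), right=False).astype(str)
        # top-code income at 100,000 EUR
        out["income_tc"] = out["income"].clip(upper=100_000)
        # collapse sparsely populated occupation categories into "other"
        counts = out["occupation"].value_counts()
        rare = counts[counts < 10].index
        out.loc[out["occupation"].isin(rare), "occupation"] = "other"
        return out.drop(columns=["age", "income"])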

Common perturbation methods include swapping, adding random noise (or, for categorical data, PRAM) and micro-aggregation. Perturbative methods can be applied to potential quasi-identifiers in order to reduce the risk of identification; they can also be applied to sensitive variables to reduce the risk of attribute disclosure.
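As an illustration of one such method, the sketch below applies univariate micro-aggregation with minimum group size k to a continuous variable: values are sorted, partitioned into consecutive groups of at least k records, and each value is replaced by its group mean. This is a deliberate simplification (multivariate heuristics such as MDAV are used in practice), and the income values are invented.

    import numpy as np

    def microaggregate(values, k=3):
        # univariate micro-aggregation: sort, form groups of size >= k, replace by group mean
        values = np.asarray(values, dtype=float)
        order = np.argsort(values)
        out = np.empty_like(values)
        n_groups = max(len(values) // k, 1)          # last group absorbs any remainder
        bounds = [g * k for g in range(n_groups)] + [len(values)]
        for start, stop in zip(bounds[:-1], bounds[1:]):
            idx = order[start:stop]
            out[idx] = values[idx].mean()
        return out

    incomes = [21_000, 22_500, 23_000, 40_000, 41_000, 43_000, 250_000, 260_000]
    print(microaggregate(incomes, k=3))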

An alternative to traditional perturbation is to replace original data with synthetic data. Under the partially synthetic data approach only the sensitive records and/or records that could be used for re-identification are replaced with synthetic values. The synthesis can range from synthesizing some records for a single variable (for example, all incomes above a given threshold) to synthesizing all variables. When making inference with synthetic data, special combining formulas are required (Reiter, 2003; Reiter, 2005a). Since true values remain in the released data, careful disclosure risk evaluations are still necessary; methods to evaluate these risks have been suggested in Drechsler and Reiter (2008) and Reiter and Mitra (2009). Examples of this approach include: in the U.S., the Survey of Consumer Finances, which replaced monetary values at high risk of disclosure with multiple imputations (Kennickell, 1997); On the Map (Machanavajjhala et al., 2008), which illustrates commuting patterns (i.e., where people live and work) for the entire United States via maps available to the public on the Web (http://lehdmap.did.census.gov/); the Survey of Income and Program Participation (SIPP) (Abowd et al., 2006); the Census Bureau’s synthetic version of the Longitudinal Business Database (Kinney et al., 2011); and, in Germany, the synthetic version of the Institute for Employment Research’s establishment survey released in 2011 (Drechsler, 2012).
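A minimal sketch of the partially synthetic idea for top incomes is given below: incomes above a threshold are replaced by draws from a simple lognormal model fitted to the at-risk values. The model, the threshold and the single replicate are illustrative simplifications; real applications use richer imputation models and release several replicates that are combined with the inferential formulas cited above.

    import numpy as np

    def synthesize_top_incomes(income, threshold=100_000, rng=None):
        # replace at-risk incomes with draws from a lognormal fitted to those values
        # (assumes several at-risk records, so that the fit is meaningful)
        rng = np.random.default_rng(0) if rng is None else rng
        income = np.asarray(income, dtype=float)
        at_risk = income > threshold
        log_vals = np.log(income[at_risk])
        synthetic = income.copy()
        synthetic[at_risk] = np.exp(rng.normal(log_vals.mean(),
                                               log_vals.std(ddof=1),
                                               at_risk.sum()))
        return synthetic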

Tabular estimates

Below, the risks associated with releasing tabular estimates are considered. Tables may contain counts (e.g. counts of people) or sums of continuous data (e.g. personal income).

Tabular frequency counts

Counts are often presented in tables with multiple dimensions (e.g. the number of employed people by postcode, age and sex). The dimensions of the table (e.g. postcode, age and sex) could identify a single person as the only contributor to a cell and thereby reveal attributes of that person (e.g. employment status), in the same way as for a microdata release. Count estimates are also at risk from differencing attacks. A differencing attack takes the difference between two cell counts that are identical except that a small number of people are excluded from one of them; the difference reveals the count for the excluded people.
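The attack can be made concrete with a toy example (all data invented): two published counts for the same postcode differ only in that one excludes people of a particular single year of age, so their difference pinpoints one person and, applied to a sensitive count, reveals that person’s attribute.

    import numpy as np

    # toy microdata for one postcode (invented): ages and a sensitive condition flag
    ages = np.array([34, 35, 36, 61, 62, 63, 64])
    has_condition = np.array([0, 0, 1, 0, 0, 0, 0])

    # two seemingly harmless published counts
    count_all = int((ages >= 30).sum())                    # people aged 30+
    count_excl = int(((ages >= 30) & (ages != 36)).sum())  # people aged 30+, excluding age 36

    assert count_all - count_excl == 1                     # the difference isolates one person

    # the same difference applied to the sensitive count reveals that person's attribute
    cond_all = int(has_condition[ages >= 30].sum())
    cond_excl = int(has_condition[(ages >= 30) & (ages != 36)].sum())
    print("sensitive attribute of the isolated person:", cond_all - cond_excl)  # -> 1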

A non-perturbative protection measure for count estimates is to restrict tables whose dimensions could identify a single person. Another approach is to perturb the cell counts: even if the dimensions of a table identify a single person, the perturbation can be designed to protect that person’s attributes from being disclosed.
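One common family of count perturbation is random rounding to a base; the unbiased variant sketched below rounds each count up or down to a multiple of the base with probabilities chosen so that the expected value equals the original count. The base of 3 and the example counts are illustrative.

    import numpy as np

    def random_round(counts, base=3, rng=None):
        # unbiased random rounding: round up with probability r/base, where r = count mod base
        rng = np.random.default_rng() if rng is None else rng
        counts = np.asarray(counts)
        remainder = counts % base
        round_up = rng.random(counts.shape) < remainder / base
        return counts - remainder + base * round_up

    print(random_round(np.array([1, 2, 7, 40, 131])))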

Tabular magnitude estimates

An example of a magnitude estimate is total business income by postcode, age and sex. Identity disclosure, via a table’s dimensions, is possible in the same way as for frequency counts above. Attribute disclosure is often defined in terms of how closely (e.g. within what percentage difference) an attacker is able to infer a person’s attribute.

Evaluation of different privacy approaches

When an organization applies statistical disclosure control to microdata, or to estimates derived from them, there is an implicit trade-off between disclosure risk and utility. Utility is a measure of how well statistical output from the microdata meets the needs of the analyst. Utility is often measured by the difference between the statistical output before and after disclosure control methods have been applied, where only the latter is released to the analyst; it is clearly undesirable if statistical disclosure control seriously affects an analyst’s conclusions. Utility can also be measured in terms of whether an analyst can obtain the desired output at all. For example, there is no constraint on what output an analyst can produce if the microdata are available on CD-ROM, whereas only a restricted set of analytical outputs is available from a remote server.
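One simple utility measure in this spirit is the overlap of confidence intervals for the same estimate computed before and after protection; the function below is an illustrative implementation of that idea, and the interval endpoints in the example are invented.

    def ci_overlap(lo_orig, hi_orig, lo_prot, hi_prot):
        # average relative overlap of the two intervals:
        # 1 means identical intervals, values <= 0 mean the intervals are disjoint
        overlap = min(hi_orig, hi_prot) - max(lo_orig, lo_prot)
        return 0.5 * (overlap / (hi_orig - lo_orig) + overlap / (hi_prot - lo_prot))

    # e.g. a regression coefficient's 95% interval before and after perturbation (invented)
    print(ci_overlap(1.8, 2.4, 1.9, 2.6))   # about 0.77: most of the interval is preserved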


Disclosure risk depends upon the context, or “scenario” in the words of Hundepool et al. (2012), in which the attack is assumed to occur. A scenario describes:

  1. the information available to identify the target unit (i.e. matching key variables), which could be in the form of prior knowledge or of administrative data containing names and addresses;

  2. the method by which an attacker attempts to disclose sensitive information about the target.

Hundepool et al. (2012) give several examples of scenarios. While a wide range of attacks is possible, a pragmatic approach for an agency is to focus on the scenarios that are likely to carry the greatest disclosure risk. Below, the advantages and disadvantages of the different approaches to privacy are summarized. This will be helpful in the next chapter, where the suitability of the different approaches for Big Data is discussed.

Utility and disclosure risk for queryable databases

From a utility/risk perspective, some advantages of remote analysis are:

  • Although the statistical output is modified, it is based on the real microdata. This means complex relationships in the microdata can be essentially retained.

  • The degree to which a particular output is modified can depend upon the output itself. For example, estimates at a broad level may require less modification than estimates at a fine, or small area, level.

  • Since an analyst is prevented from viewing the attributes of individual records, an attack can only target a small number of records, perhaps only one, at a time. As a result, less protection may be needed than would otherwise be the case.

  • The impact of the modifications on the output can be broadly indicated to the analyst or incorporated into the output seamlessly.

  • The remote server can process multiple queries in real time.

  • Submitted queries can be logged and audited. If a disclosure attempt is detected and confirmed, the agency can revoke the analyst's access and/or take legal action.

There are some disadvantages of remote analysis.

  • Some statistical outputs may be aggregated (e.g. record-level residual plots may be replaced with parallel box plots), perturbed (e.g. random noise added to regression coefficients), or restricted altogether.

  • The analyst may be restricted to submit programming code from specific statistical packages (e.g. SAS), or may be restricted to using a menu-driven interface.

  • Remote analysis may take longer than if the microdata were available on the analyst's personal computer.

Utility and disclosure risk in microdata disclosure control

The advantages and disadvantages of microdata disclosure control are more or less the inverse of the advantages and disadvantages of the queryable databases. Advantages include:

  • All information is available at the microlevel and there are no restrictions on the types of analyses that can be applied.
  • The analyst can use any software for his or her research.
  • The analyst does not have to wait for research output to be checked for disclosure violations.
  • Organizations do not need to worry that a clever combination of different research queries might reveal more information than anticipated.

Disadvantages include:

  • If perturbative methods have been used to protect the data, there is no guarantee that the results based on the protected data are similar to the results that would have been obtained based on the original data.
  • All protection mechanisms must ensure that no disclosure occurs even though the full microdata are released. This usually requires a greater amount of perturbation than would be necessary if only a specific output needed to be protected.
  • It is often difficult for the user to get an idea of how much his or her analysis has been affected by the protection mechanisms applied prior to release.
  • If a risk of disclosure is detected after the data have been disseminated, it is nearly impossible to ensure that all released files are deleted.

Utility and disclosure risk for tabular data

The disclosure risks associated with tabular data are typically much lower than for microdata because the available quasi-identifiers have less detail in the tabular case. Statistical organizations typically have basic rules for the publication/release of particular tabular estimates (see Oganian and Domingo-Ferrer, 2003). These may include the “Rule of Two”, which means that no non-zero cell count can be based on fewer than two people. Often users are not satisfied with the limited set of tables that a statistical organization is able to publish; this naturally means the statistical organization will also have to allow other types of data access.
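A minimal sketch of enforcing such a rule on a frequency table is shown below; it assumes the table is a pandas crosstab and performs only primary suppression (a real system would also need secondary suppression or perturbation so that suppressed cells cannot be recovered from the margins). The column names in the usage comment are invented.

    import pandas as pd

    def suppress_small_cells(table, min_count=2):
        # primary suppression: blank out non-zero cells below min_count ("Rule of Two")
        masked = table.astype("object")
        masked[(table > 0) & (table < min_count)] = "suppressed"
        return masked

    # illustrative usage, assuming a microdata frame df with these columns:
    #   counts = pd.crosstab(df["suburb"], df["employment_status"])
    #   safe = suppress_small_cells(counts)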

Privacy software

Remote server

Table Builder is a remote server developed by the Australian Bureau of Statistics that allows users access to highly detailed micro-data in order to:

  • Construct user-defined tables (e.g. state, age, sex)
  • Calculate means, medians and quantiles for continuous variables (e.g. income) and their associated relative standard errors
  • Create custom ranges from continuous variables
  • Create graphs from customised tables
  • Save or export tables as CSV, Excel or SDMX files

DataAnalyser is a remote server developed by the Australian Bureau of Statistics (Chipperfield, 2014; Chipperfield and O'Keefe, 2014) that allows users access to highly detailed microdata in order to undertake analyses using a menu-driven user interface. DataAnalyser allows users to transform and manipulate variables and to produce basic exploratory data analysis, summary tables and regression analyses (e.g. linear (robust), logistic, probit and multinomial). Confidentialised outputs can be viewed on-screen or downloaded to the user's own computer. For more information see http://www.abs.gov.au/websitedbs/D3310114.nsf/home/About+DataAnalyser. The confidentiality method underlying DataAnalyser is described by Chipperfield and O'Keefe (2014).

Microdata dissemination 

The Special Uniques Detection Algorithm (SUDA) (Elliot, Manning and Ford, 2002) locates 'risky' records in files of categorical variables. First, SUDA finds the different sets of variables on which a record is unique. Second, SUDA ranks the 'risk' of each record in terms of the number and distribution of these sets. For example, SUDA will flag “special unique” records, which are unique on a small and unusual combination of attributes (e.g. a 16-year-old widow).
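The first SUDA step can be sketched as a search for minimal unique attribute sets. The brute-force search below is purely illustrative and does not scale the way the actual SUDA algorithm does; the toy records are invented.

    from itertools import combinations
    from collections import Counter

    def minimal_unique_sets(records, max_size=3):
        # for each record, find the minimal sets of attribute positions on which it is unique
        n_attrs = len(records[0])
        result = {i: [] for i in range(len(records))}
        for size in range(1, max_size + 1):
            for attrs in combinations(range(n_attrs), size):
                counts = Counter(tuple(rec[a] for a in attrs) for rec in records)
                for i, rec in enumerate(records):
                    if counts[tuple(rec[a] for a in attrs)] == 1:
                        # keep only minimal sets: skip if a smaller unique set was already found
                        if not any(set(s) <= set(attrs) for s in result[i]):
                            result[i].append(attrs)
        return result

    # toy file: (age group, marital status, sex); the "16-24 widowed" record is a special unique
    records = [("16-24", "single", "F"), ("16-24", "widowed", "F"),
               ("25-34", "married", "M"), ("25-34", "single", "F")]
    print(minimal_unique_sets(records))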

Statistical disclosure control software

τ-ARGUS is a software program designed to protect statistical tables: http://neon.vb.cbs.nl/casc/tau.htm

µ-ARGUS is a software program designed to create safe microdata files: http://neon.vb.cbs.nl/casc/mu.htm

sdcMicro is an R package for statistical disclosure control of microdata: http://cran.r-project.org/web/packages/sdcMicro/vignettes/sdc_guidelines.pdf

References

Abowd, J. M., Stinson, M., and Benedetto, G. (2006). Final report to the Social Security Administration on the SIPP/SSA/IRS public use file project. Technical report, Longitudinal Employer–Household Dynamics Program, U.S. Bureau of the Census, Washington, DC.

Agrawal, R. and Srikant, R. (2000). Privacy preserving data mining. In Proc. of the 2000 ACM SIGMOD Conference on Management of Data, pages 439–450. ACM.

Bleninger, P., Drechsler, J., and Ronning, G. (2011). Remote data access and the risk of disclosure from linear regression. SORT, Special Issue: Privacy in Statistical Databases, 7–24.

Castellà-Roca, J., Viejo, A., and Herrera-Joancomartí, J. (2009). Preserving user’s privacy in web search engines. Computer Communications, 32:1541–1551.

Chipperfield, J. O. (2014). Disclosure-Protected Inference with Linked Microdata Using a Remote Analysis Server. Journal of Official Statistics, 30, 123–146.

Chipperfield, J. O. and O'Keefe, C. M. (2014). Disclosure‐protected Inference Using Generalised Linear Models, International Statistical Review, DOI: 10.1111/insr.12054

Chor B., Goldreich O., Kushilevitz E., and Sudan M. (1995). Private information retrieval. In IEEE Symposium on Foundations of Computer Science - FOCS 1995, pages 41–50. IEEE.

Chor, B., Goldreich, O., Kushilevitz, E., and Sudan, M. (1998). Private information retrieval. Journal of the ACM, 45:965–981.

Domingo-Ferrer, J., Bras-Amorós, M., Wu, Q., and Manjón, J. (2009). User-private information retrieval based on a peer-to-peer community. Data & Knowledge Engineering, 68(11):1237–1252.

Domingo-Ferrer, J. and González-Nicolás, Ú. (2012). Rational behavior in peer-to-peer profile obfuscation for anonymous keyword search: the multi-hop scenario. Information Sciences, 200:123–134.

Domingo-Ferrer, J., Solanas, A., and Castellà-Roca, J. (2009). h(k)-private information retrieval from privacy-uncooperative queryable databases. Online Information Review, 33(4):720–744.

Domingo-Ferrer, J.  (2008). A critique of k-anonymity and some of its enhancements, in Proc. of ARES/PSAI 2008, IEEE CS, pp. 990-993.

Drechsler, J., Ronning, G., and Bleninger, P. (2014). Disclosure risk for factor scores. Journal of Official Statistics, 30, 107–122.

Drechsler, J. (2011). Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation. Lecture Notes in Statistics, vol. 201. New York: Springer.

Drechsler, J. (2012). New data dissemination approaches in old Europe – synthetic datasets for a German establishment survey. Journal of Applied Statistics 39, 243–265.

Drechsler, J., Dundler, A., Bender, S., Rässler, S., and Zwick, T. (2008a). A new approach for disclosure control in the IAB Establishment Panel – multiple imputation for a better data access. Advances in Statistical Analysis 92, 439–458.

Drechsler, J., Bender, S., and Rässler, S. (2008b). Comparing fully and partially synthetic data sets for statistical disclosure control in the German IAB Establishment Panel. Transactions on Data Privacy 1, 105–130.

Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In J. Domingo-Ferrer and Y. Saygin, eds., Privacy in Statistical Databases, 227–238. New York: Springer.

Drechsler, J. and Reiter, J. P. (2010). Sampling with synthesis: A new approach for releasing public use census microdata. Journal of the American Statistical Association 105, 1347–1357.

Dwork, C. (2006). Differential privacy, in ICALP 2006, LNCS 4052, pp. 1-12.

Elliot, M. J., Manning, A. M., and Ford, R. W. (2002). A computational algorithm for handling the special uniques problem. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 493–509.

Elliott, M. and Purdam, K. (2007). A case study of the impact of statistical disclosure control on data quality in the individual UK Samples of Anonymized Records. Environment and Planning A 39, 1101–1118.

Fellegi, I. P.  and Sunter, A. B. (1969).  A theory for record linkage, Journal of the American Statistical Assoc., 64(328):1183-1210.

Gomatam, S., Karr, A., Reiter, J., and Sanil, A. (2005). Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers. Statistical Science, 20, 163–177.

Howe D.C.  and Nissenbaum H. (2009). TrackMeNot: Resisting surveillance in web search. In Lessons from the Identity Trail: Anonymity, Privacy, and Identity in a Networked Society, pages 417–436. Oxford University Press.

Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Schulte Nordholt, E., Spicer, K., and de Wolf, P. P. (2012). Statistical Disclosure Control. Wiley.

Kennickell, A. B. (1997). Multiple imputation and disclosure protection: The case of the 1995 Survey of Consumer Finances. In W. Alvey and B. Jamerson, eds., Record Linkage Techniques, 1997, 248–267.Washington, DC: National Academy Press.

Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database. International Statistical Review 79, 363–384.

Lindell Y.  and Pinkas B.(2000). Privacy-preserving data mining. In Advances in Cryptology-CRYPTO 2000, volume 1880 of Lecture Notes in Computer Science, pages 36–54. Springer Berlin / Heidelberg.

Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official Statistics 9, 407–426.

Machanavajjhala, A., Kifer, D., Abowd, J. M., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering, 277–286.

O'Keefe, C.M. and Chipperfield, J.O. (2013). A Summary of Attack Methods and Confidentiality Protection Measures for Fully Automated Remote Analysis Systems, International Statistical Review, 81, 426–455

Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics 19, 1–16.

Reiter, M. K. and Rubin, A. D. (1998). Crowds: anonymity for web transactions. ACM Transactions on Information and System Security, 1(1):66–92.

Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology 29, 181–189.

Reiter, J. P. (2005a). Significance tests for multi-component estimands from multiply-imputed, synthetic microdata. Journal of Statistical Planning and Inference 131, 365–377.

Reiter, J. P. (2005b). Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A 168, 185–205.

Reiter, J. P. and Drechsler, J. (2010). Releasing multiply-imputed, synthetic data generated in two stages to protect confidentiality. Statistica Sinica 20, 405–421.

Reiter, J. P. and Mitra, R. (2009). Estimating risks of identification disclosure in partially synthetic data. Journal of Privacy and Confidentiality 1, 99–110.

Rubin, D. B. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics 9, 462–468.

Samarati, P. and Sweeney, L. (1998). Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International.

Tor Project, 2014. http://www.torproject.org.

Vaidya, J. and Clifton, C. (2003). Privacy-preserving k-means clustering over vertically partitioned data. In Proc. of the 9th International Conference on Knowledge Discovery and Data Mining (KDD’03), pages 206–215. ACM.

Verykios, V., Bertino, E., Fovino, I., Provenza, L., Saygin, Y., and Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1):50–57.

Verykios, V. and Gkoulalas-Divanis, A. (2008). A survey of association rule hiding methods for privacy. In Privacy-Preserving Data Mining: Models and Algorithms, pages 267–289. Springer.

Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods of masking for which analytic properties are not justified. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC.

Woo, M.-J., Reiter, J. P., Oganian, A., and Karr, A. F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1):111–124.


