Big data and de-identification: Taking a risk management approach
Big data offers huge benefits: it improves human life with new medical and health solutions and supports more efficient design and delivery of a wide range of services. There is no doubt that the release of large datasets can improve all our lives. However, these benefits often depend on large datasets being accessible to innovators, researchers and developers – datasets drawn from information collected and made available by government agencies (which hold vast troves of useful data), researchers and health service providers.
With the public availability of these large datasets comes risk – the risk that the personal data of individuals might be revealed, either directly (through a failure to remove identifiers) or indirectly, by re-identification of the data (for example, by matching it with another dataset).
Perhaps one of the best-known examples is the re-identification of a year’s worth of New York City taxi rides – to the chagrin of high-profile passengers. Several actors were called out for tipping poorly, while some politicians were caught out leaving certain locations at certain times. Another example comes from Washington State, which released hospital discharge data that was poorly de-identified: a researcher was able to match details reported in newspaper articles to records in the discharge data.
This risk to individuals makes it very important that datasets are properly de-identified before being released. But a more proactive, on-going assessment of the risks is needed.
To date, much of the focus has been on the legal meaning of de-identification, and whether, at a point in time (usually the release of the dataset), a dataset can be regarded as de-identified. However, de-identification may be better viewed as a risk mitigation strategy, and as such, subject to the need for an on-going assessment of the risk of re-identification.
In this post we look at de-identification through the lens of risk management, and what that might mean for the way we consider de-identified publicly available datasets.
De-identification and the Australian Privacy Act
In Australia, the standard thinking is that information that has undergone an appropriate and robust de-identification process is not personal information, and is therefore not subject to the Privacy Act 1988 (Cth) (Privacy Act). This is confirmed in the guidance provided by the OAIC on de-identification.
There is some inconsistency between different privacy regimes as to what is meant by ‘de-identification’ versus ‘anonymisation.’ In fact, the use of the term ‘de-identification’ is one of the issues considered as part of the recent Privacy Act Review Discussion Paper.
In the Discussion Paper, it is proposed that the Privacy Act be amended to provide that information must be anonymous (rather than de-identified) before it is no longer protected by the Act. This would make the Act consistent with the GDPR and other jurisdictions that use the term ‘anonymous’ rather than ‘de-identified’. (For more information on the Privacy 108 response to that Discussion Paper, see our previous blog post.) This would at least resolve the confusion between the terminology used in different regimes, but it still leaves open the question of what de-identification actually requires.
What is de-identification?
In Australia, guidance from the OAIC provides that information will be de-identified where there is no reasonable likelihood of re-identification occurring. The ‘no reasonable likelihood’ test is less onerous than a test requiring no possibility at all of re-identification occurring.
According to the OAIC Guidance, meeting this standard involves two steps (a brief illustrative sketch follows this list):
- Removal of direct identifiers: the first step is removing direct identifiers, such as name, address and Tax File Number.
- Steps taken to prevent re-identification: the second step is taking one or both of the following additional measures:
- the removal or alteration of other information that could potentially be used to re-identify an individual (such as quasi-identifiers), and/or
- the use of controls and safeguards in the data access environment to prevent re-identification (including access controls and data controls like perturbation and aggregation).
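To make these two steps more concrete, below is a minimal, hypothetical sketch in Python (using pandas). The column names and values are assumptions made purely for illustration, not a prescribed schema or an OAIC-endorsed method, and real de-identification would also need the environmental controls described above.

```python
# Minimal sketch of the two OAIC steps, using hypothetical column names.
import pandas as pd

records = pd.DataFrame({
    "name":      ["A. Citizen", "B. Resident"],
    "tfn":       ["123 456 789", "987 654 321"],
    "dob":       ["1984-03-07", "1990-11-23"],
    "postcode":  ["4000", "4001"],
    "diagnosis": ["asthma", "diabetes"],
})

# Step 1: remove direct identifiers (here, name and Tax File Number).
deidentified = records.drop(columns=["name", "tfn"])

# Step 2 (in part): alter quasi-identifiers so they are less identifying,
# e.g. reduce date of birth to year of birth and truncate the postcode.
deidentified["dob"] = pd.to_datetime(deidentified["dob"]).dt.year
deidentified["postcode"] = deidentified["postcode"].str[:2] + "xx"

print(deidentified)
```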
De-identification and re-identification: A risk-based approach
Given that Australia uses the ‘no reasonable likelihood of re-identification’ test, assessing the likelihood of re-identification becomes key. Data will only be considered de-identified if this risk of re-identification is low.
This means that de-identification is inextricably linked to the risk of re-identification.
Re-identification risks to be considered
De-identification – Data points
The first consideration in assessing re-identification risk is the data itself: the type of data, the number of data points, whether strong identifiers have been removed, and so on.
Individuals can be identified from surprisingly few data points. For example, one study found that 87% of the American population can be uniquely identified by their gender, ZIP code and date of birth alone.[1]
In 2019, another research study set out a method that could correctly re-identify 99.98% of individuals in supposedly anonymised datasets using just 15 demographic attributes. One of the authors of that study, Yves-Alexandre de Montjoye of Imperial College London, has elsewhere shown that 95% of people can be identified in easily acquired smartphone location data using just four location timestamps.
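The intuition behind these findings can be checked on any dataset by counting how many records are unique on a handful of quasi-identifiers. The sketch below is a rough, hypothetical illustration in Python (using pandas); the column names and toy data are assumptions, not drawn from the studies cited above.

```python
# Hypothetical sketch: how many records are unique on a few quasi-identifiers?
import pandas as pd

df = pd.DataFrame({
    "gender":     ["F", "M", "F", "M", "F"],
    "postcode":   ["4000", "4000", "4001", "4000", "4001"],
    "birth_year": [1984, 1990, 1984, 1990, 1975],
})

quasi_identifiers = ["gender", "postcode", "birth_year"]

# Size of each group of records sharing the same quasi-identifier values.
group_sizes = df.groupby(quasi_identifiers).size()

# Records in groups of size 1 are unique on these attributes alone, making
# them the easiest targets for linkage with another dataset.
unique_share = (group_sizes == 1).sum() / len(df)
print(f"Share of records unique on quasi-identifiers: {unique_share:.0%}")
```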
Data reduction or modification techniques
Examples of some of the techniques that can be used to reduce or modify the identifiability of the data within a dataset include the following (a few of these are illustrated in the short sketch after this list):
- Sampling — providing access to only a fraction of the total existing records or data, thereby creating uncertainty that any particular person is even included in the dataset.
- Choice of variables — removing quasi-identifiers (for example, significant dates, profession, income) that are unique to an individual
- Rounding — combining information or data that is likely to enable identification of an individual into categories.
- Perturbation — altering information that is likely to enable the identification of an individual in a small way, such that the aggregate information or data is not significantly affected — a ‘tolerable error’ — but the original values cannot be known with certainty.
- Swapping — swapping information that is likely to enable the identification of an individual for one person with the information for another person with similar characteristics to hide the uniqueness of some information.
- Manufacturing synthetic data — creating new values generated from original data so that the overall totals, values and patterns are preserved, but do not relate to any particular individual.
- Encryption or ‘hashing’ of identifiers.
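As a rough illustration, the hypothetical Python sketch below applies a few of these techniques (sampling, rounding into bands, perturbation, and hashing of identifiers) to a toy dataset. The column names, noise levels and banding choices are assumptions made for the example; in practice, treatments would be selected to suit the dataset and its release context.

```python
# Hypothetical sketch of a few data reduction / modification techniques.
import hashlib
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "member_id": ["M001", "M002", "M003", "M004"],
    "age":       [34, 41, 29, 67],
    "income":    [72000, 58500, 91000, 40250],
})

# Sampling: release only a fraction of records, so it is uncertain whether
# any particular person is even included in the released dataset.
sample = df.sample(frac=0.5, random_state=42)

# Rounding / banding: collapse exact ages into broad categories.
sample["age_band"] = pd.cut(sample["age"], bins=[0, 30, 50, 120],
                            labels=["<30", "30-49", "50+"])
sample = sample.drop(columns="age")

# Perturbation: add small random noise so exact incomes cannot be known.
sample["income"] = (sample["income"] + rng.normal(0, 500, len(sample))).round(-2)

# Hashing of identifiers: replace the raw ID with a salted one-way hash.
salt = "a-secret-salt"  # assumption: kept secure and never released
sample["member_id"] = sample["member_id"].map(
    lambda i: hashlib.sha256((salt + i).encode()).hexdigest()[:12])

print(sample)
```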
More information is available here.
De-identification – Release context
When assessing re-identification risk, consideration must be given not just to the data itself (and the steps that can be taken to reduce its identifiability) but also to how the data might be made available – the environment the data will be released into (the release context).
Consideration of the release context includes thinking about:
- the audience who will have access to the data,
- the purpose for which the data will be used, and
- the release environment.
Obviously, the context for a publicly available dataset is different to that for a dataset constructed to share with a service provider or a small group of researchers for a specific project (and not to be made publicly available).
The level of data treatment appropriate for authorised access in a controlled environment (e.g. where access is provided to a single service provider for a particular purpose) is unlikely to be the same as that required to manage open and unrestricted public access.
Other considerations relevant to managing re-identification risks include:
- how the dataset could be used to re-identify an individual or organisation, and
- whether information available elsewhere could be combined with the dataset to re-identify a person or organisation.
On-going consideration of de-identification and re-identification risk
The calculation of the risk of re-identification is not a ‘set and forget’ exercise.
Privacy risks are not static. They evolve in an environment where more information is continuously released, and new technologies emerge. Organisations that regularly review privacy risks and assess the effectiveness of risk treatments can better respond to changes in context and manage risks appropriately over time.
Because risk is context-based, effective management of re-identification risk must include regular review, and specifically a review whenever the context changes.
If one or more aspects of the context changes, a reassessment of the disclosure risks should be performed to ensure data subjects remain unlikely to be re-identified.
Documenting risk assessments and the reasons for selecting risk treatments (whether through privacy impact assessments or as part of privacy risk management) helps regular monitoring and review.
However, this is rarely done. In fact, consideration of re-identification risk is not usually included as part of a standard Privacy Impact Assessment.
De-identification risks and PIAs
It is not common practice to consider re-identification risk as part of a Privacy Impact Assessment. And this means that major risks are not being considered.
A useful example is provided by a dataset containing 1.8 billion historical records of public transport users’ activity, released in July 2018 by Public Transport Victoria (PTV) for use in a data science event. Following release, it was found that the dataset was vulnerable to attacks that could expose the identities of certain travellers, including police officers and members of parliament.
The Office of the Victorian Information Commissioner (OVIC) investigated and released a report in August 2019, which noted that, although both PTV and Victoria Police had conducted risk assessments, both organisations found there was either no or low risk in releasing the dataset. It appeared that PTV relied on technical arguments about the definition of personal information (i.e. whether the data had been anonymised) rather than conducting an evaluative assessment in the specific context.
Conclusion
De-identification is a privacy-enhancing tool. When done well, it can help your entity meet its obligations under the Privacy Act and build trust in your data governance practices.
A key part of de-identification is consideration of the risk of re-identification. The different ways of reducing the risk of re-identification include:
- Removing or not including all direct identifiers, including name, address and Tax File Number, and quasi-identifiers, such as DOB, phone number, and details about a person’s profession;
- Implementing controls appropriate to the release context;
- Using other data reduction or masking techniques such as releasing a sample of the data, instead of the entire dataset;
- Controlling access to the de-identified data, and requiring those who access the data to not re-identify individuals or organisations using publicly or privately held information.
However, because re-identification risk is context-dependent, it is constantly changing. This means that de-identification must be treated as a risk management exercise: assessing and reviewing the risk of re-identification on an on-going basis, and taking any additional steps required to ensure the risk level stays low. It is not an exact science, and nor is it ‘set and forget.’ It is an on-going process, requiring regular review as the context and particulars change.
How Privacy 108 can help
The OAIC recommends that entities seek specialist expertise for complex de-identification matters – for example when de-identifying rich or detailed datasets, where data may be shared publicly or with a wide audience, or where de-identification is carried out in the context of a multi-entity data sharing arrangement.
Privacy 108 can support you in developing de-identification policies and procedures and in auditing and testing the likelihood of re-identification. Contact us now to talk more about how we can assist.
See our previous blog post: De-identification of data: How, when & why | Privacy108
More resources:
- De-identification and the Privacy Act – Home (oaic.gov.au)
- The OAIC recommends that entities also refer to the De-Identification Decision-Making Framework, produced jointly by the OAIC and CSIRO’s Data61, which provides a comprehensive framework for approaching de-identification in accordance with the Privacy Act.
- Australian Bureau of Statistics: Understanding re-identification | Australian Bureau of Statistics (abs.gov.au)
- UK Anonymisation Network – Anonymisation Decision-Making Framework (ADF) (ukanon.net)
- Queensland Information Commissioner’s Office: https://www.oic.qld.gov.au/__data/assets/pdf_file/0016/43045/Privacy-and-public-data-managing-re-identification-risk.pdf
- Victoria: An Introduction to Data De-Identification by the Office of the Victorian Information Commissioner (OVIC).
- ICO Data Sharing: a code of practice: Data sharing: a code of practice | ICO
[1] Sweeney, L.; “Simple Demographics Often Identify People Uniquely,” Data Privacy Working Paper 3, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, 2000, https://dataprivacylab.org/projects/identifiability/paper1.pdf