As public interest in open data increases and the government comes under greater pressure to publish more datasets, threats to the privacy of citizens will increase, a report commissioned by the Cabinet Office has found.
Entitled "Transparent Government, Not Transparent Citizens" and written by Kieran O’Hara from the University of Southampton, the report makes a number of recommendations for maintaining privacy as the government’s public data programme, data.gov.uk, becomes increasingly "demand-driven".
O’Hara’s main thrust is that privacy concerns must be considered at every step in the process of publication of public data. "Privacy protection should … be embedded in any transparency programme, rather than bolted on as an afterthought," he writes.
The report suggests that the technological definition of ‘privacy’ must be included in government thinking on public data. Legal definitions of privacy have proved inadequate, it says, and O’Hara recommends that "technologically-trained experts should be brought into procedures for deciding whether or not to release particular datasets."
He also recommends that the Information Commissioner’s Office (ICO) obtain a greater technical awareness – although O’Hara stresses that the ICO is already making progress on this through the appointment of a Technology Policy Advisor and the creation of a Technology Reference Panel.
Other recommendations include creating a data asset register to allow the government to keep track of its datasets; setting up transparency panels to determine the privacy threat posed by data; and investigating the vulnerability of anonymised databases to "deanonymisation", whereby an individual’s identity can be worked out by matching information across multiple sources.
The report points to work on deanonymisation by two computer scientists, Narayanan and Shmatikov from the University of Texas, which found that individuals could be identified based on anonymous film reviews on rental site Netflix.
"Our conclusion is that very little auxiliary information is needed [to] de- anonymize an average subscriber record from the Netflix Prize dataset," they wrote in a paper quoted in O’Hara’s report. "With 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, 99% of records can be uniquely identified in the dataset."