A Framework For Privacy Diagnosis And Preservation In Data Publishing
Date
2010-04
Authors
Mirakabad, Mohammad Reza Zare
Publisher
Universiti Sains Malaysia
Abstract
Privacy preservation in data publishing aims to publish data while
protecting private information. Although removing the direct identifiers of individuals
seems to protect their anonymity at first glance, private information may be revealed
by joining the data to other external data. The k-anonymity and l-diversity principles
were introduced to address this privacy issue. Accordingly, privacy
preservation techniques, namely k-anonymization and l-diversification algorithms,
transform data (for example by generalization, suppression or fragmentation) to protect
the identity and the sensitive information of individuals, respectively.
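As a rough illustration of the two principles named above, the following sketch checks whether a table is k-anonymous (every combination of quasi-identifier values occurs at least k times) and whether it satisfies distinct l-diversity (every such group contains at least l different sensitive values). The function names, the column names and the sample rows are assumptions made for this example and are not taken from the thesis. Distinct l-diversity is only one of several l-diversity principles.

```python
from collections import Counter, defaultdict


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())


# Hypothetical, already generalized sample table (illustration only).
sample_rows = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]

print(is_k_anonymous(sample_rows, ["age", "zip"], 2))
print(is_distinct_l_diverse(sample_rows, ["age", "zip"], "disease", 2))
```

In this sample, each group has two rows, so the table is 2-anonymous, but the second group contains only one disease value, so it is not 2-diverse. This is the kind of situation in which identity is protected but sensitive information is not.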
Most of the recent efforts addressing this issue have focused on privacy preservation
techniques. However, little effort has been devoted to devising techniques,
tools and methodologies to assist data publishers, managers and analysts in
investigating and evaluating privacy risks. Hence, the idea of a privacy diagnosis
centre is proposed, which offers the necessary framework for diagnosing privacy risk,
specifically with respect to k-anonymity and l-diversity. It is shown that this problem is a
knowledge discovery problem that can be mapped to the framework proposed by
Mannila and Toivonen. The necessary monotonicity properties are introduced and
proved, and levelwise algorithms based on the Apriori algorithm are presented
and evaluated. Moreover, existing models and techniques for privacy preservation still have
some deficiencies and drawbacks. Specifically, clustering-based algorithms for
k-anonymization may result in high information loss. After the deficiencies
of both small and big clusters are shown, two-phase clustering k-anonymization is proposed.
It first allows clusters to become sufficiently big; in the next phase, big clusters are split
into the smallest possible clusters. Both phases result in lower information loss. In addition,
it is shown that extending k-anonymization algorithms to some l-diversity
principles is not straightforward: the extended algorithm may incur high information loss
or fail to terminate. Accordingly, bucket clustering l-diversification is proposed to guarantee
both termination and low information loss.
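The levelwise, Apriori-style diagnosis mentioned above can be sketched as follows. The sketch relies on one monotonicity fact: if a table is k-anonymous with respect to a set of attributes, it is also k-anonymous with respect to every subset of that set, because grouping by fewer attributes can only merge groups. The search therefore grows attribute sets one level at a time and only considers a candidate when all of its subsets one size smaller passed. This is a minimal illustration under those assumptions, not the algorithm of the thesis. The function names and sample data are hypothetical, and the table is assumed to be non-empty.

```python
from collections import Counter
from itertools import combinations


def min_group_size(rows, attrs):
    """Size of the smallest group when rows are grouped by attrs (rows non-empty)."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return min(counts.values())


def maximal_k_anonymous_sets(rows, attributes, k):
    """Levelwise search for the maximal attribute sets on which rows are k-anonymous."""
    satisfied = []
    # Level 1: single attributes that already satisfy k-anonymity.
    level = [frozenset([a]) for a in attributes if min_group_size(rows, [a]) >= k]
    while level:
        satisfied.extend(level)
        current = set(level)
        size = len(level[0])
        candidates = set()
        for x in level:
            for y in level:
                union = x | y
                if len(union) != size + 1:
                    continue
                # Apriori pruning: every subset of the current size must have passed.
                if all(frozenset(s) in current for s in combinations(union, size)):
                    candidates.add(union)
        # Only the surviving candidates are checked against the data.
        level = [c for c in candidates if min_group_size(rows, sorted(c)) >= k]
    # Keep only sets that are not strictly contained in another satisfying set.
    return {s for s in satisfied if not any(s < t for t in satisfied)}


# Hypothetical sample table (illustration only).
table = [
    {"age": "A", "zip": "Z1", "sex": "M"},
    {"age": "A", "zip": "Z1", "sex": "F"},
    {"age": "B", "zip": "Z2", "sex": "M"},
    {"age": "B", "zip": "Z2", "sex": "F"},
]

for attrs in maximal_k_anonymous_sets(table, ["age", "zip", "sex"], 2):
    print(sorted(attrs))
```

For this table and k = 2, the maximal sets are {age, zip} and {sex}. Any attribute set that is not contained in one of them, such as {age, sex}, would expose groups smaller than k if it were used as a quasi-identifier.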
The proposed algorithms are implemented and run on two sample datasets, namely
Adults and OCC, which have become de facto benchmarks for privacy preservation
algorithms. The effectiveness and efficiency of the proposed framework and algorithms
are demonstrated experimentally by analyzing the results.
Keywords
A framework for privacy diagnosis and preservation in data publishing