A Framework For Privacy Diagnosis And Preservation In Data Publishing

Date

2010-04

Authors

Mirakabad, Mohammad Reza Zare

Publisher

Universiti Sains Malaysia

Abstract

Privacy preservation in data publishing aims to publish data while protecting private information. Although removing the direct identifiers of individuals seems at first glance to protect their anonymity, private information may still be revealed by joining the data with external data. Privacy preservation addresses this issue through the k-anonymity and l-diversity principles. Accordingly, privacy preservation techniques, namely k-anonymization and l-diversification algorithms, transform data (for example by generalization, suppression or fragmentation) to protect the identity and the sensitive information of individuals, respectively. Most recent efforts have focused on these privacy preservation techniques. However, little effort has gone into devising techniques, tools and methodologies that assist data publishers, managers and analysts in investigating and evaluating privacy risks. Hence, the idea of a privacy diagnosis centre is proposed, offering the necessary framework for diagnosing privacy risk, specifically with respect to k-anonymity and l-diversity. It is shown that this is a knowledge discovery problem that can be mapped to the framework proposed by Mannila and Toivonen. By introducing and proving the necessary monotonicity properties, levelwise algorithms based on the Apriori algorithm are presented and evaluated. Moreover, the models and techniques proposed so far for privacy preservation still have deficiencies and drawbacks. Specifically, clustering-based algorithms for k-anonymization may result in high information loss. By showing the deficiencies of both small and big clusters, two-phase clustering k-anonymization is proposed. In the first phase, clusters are allowed to become sufficiently big; in the next phase, big clusters are split into the smallest possible clusters. Both phases result in lower information loss.
In addition, it is shown that extending k-anonymization algorithms to some l-diversity principles is not straightforward: the extension may result in high information loss or may fail to terminate. Accordingly, bucket clustering l-diversification is proposed to guarantee both termination and low information loss. The proposed algorithms are implemented and run on two sample datasets, namely Adults and OCC, which have become de facto benchmarks for privacy preservation algorithms. The effectiveness and efficiency of the proposed framework and algorithms are demonstrated experimentally by analyzing the results.
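
The diagnosis idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy table, the column names and all function names are hypothetical, and the l-diversity check uses the simplest "distinct values" variant. The levelwise search relies on one monotonicity property: adding attributes to a quasi-identifier can only split groups, so every superset of a set that violates k-anonymity also violates it and can be pruned. It uses plain subset pruning rather than full Apriori candidate generation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy table; the last column is the sensitive attribute.
COLUMNS = ("age", "zip", "sex", "disease")
ROWS = [
    ("30-39", "130**", "F", "flu"),
    ("30-39", "130**", "F", "cancer"),
    ("30-39", "130**", "M", "flu"),
    ("40-49", "148**", "M", "flu"),
    ("40-49", "148**", "M", "flu"),
]


def groups(rows, columns, quasi_identifiers):
    """Group rows by their values on the chosen quasi-identifier columns."""
    idx = [columns.index(name) for name in quasi_identifiers]
    out = defaultdict(list)
    for row in rows:
        out[tuple(row[i] for i in idx)].append(row)
    return out


def is_k_anonymous(rows, columns, quasi_identifiers, k):
    """True if every quasi-identifier group contains at least k rows."""
    grouped = groups(rows, columns, quasi_identifiers)
    return all(len(g) >= k for g in grouped.values())


def is_distinct_l_diverse(rows, columns, quasi_identifiers, sensitive, l):
    """True if every group holds at least l distinct sensitive values."""
    s = columns.index(sensitive)
    grouped = groups(rows, columns, quasi_identifiers)
    return all(len({row[s] for row in g}) >= l for g in grouped.values())


def minimal_violating_sets(rows, columns, candidates, k):
    """Levelwise search for the minimal attribute sets that break k-anonymity.

    Supersets of an already-found violating set are skipped, because
    monotonicity guarantees that they violate k-anonymity as well.
    """
    minimal = []
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if any(set(found) <= set(subset) for found in minimal):
                continue
            if not is_k_anonymous(rows, columns, subset, k):
                minimal.append(subset)
    return minimal


if __name__ == "__main__":
    qi = ("age", "zip")
    print(is_k_anonymous(ROWS, COLUMNS, qi, 2))
    print(is_distinct_l_diverse(ROWS, COLUMNS, qi, "disease", 2))
    print(minimal_violating_sets(ROWS, COLUMNS, ("age", "zip", "sex"), 2))
```

On this toy table, the pair (age, zip) is 2-anonymous but not 2-diverse, since one group contains only a single disease value. The levelwise search reports which attribute combinations an attacker would need in order to isolate fewer than k individuals.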

Keywords

A framework for privacy diagnosis, and preservation in data publishing

URI

http://hdl.handle.net/123456789/5656

Collections

Pusat Pengajian Sains Komputer - Tesis
