
Chapter 8: Privacy Preserving Data Mining - LMU …


Transcription of Chapter 8: Privacy Preserving Data Mining - LMU …

Knowledge Discovery in Databases I, SS 2016
Lecture: Prof. Dr. Thomas Seidl
Tutorials: Julian Busch, Evgeniy Faerman, Florian Richter, Klaus Schmid
Ludwig-Maximilians-Universität München, Institut für Informatik, Lehr- und Forschungseinheit für Datenbanksysteme

Chapter 8: Privacy Preserving Data Mining

Outline:
- Introduction: data privacy, privacy preserving data mining
- k-Anonymity privacy paradigm: k-anonymity, l-diversity, t-closeness
- Differential privacy: sensitivity, noise perturbation, composition

Data Abuse
A huge volume of data is collected from a variety of devices and platforms, such as smart phones, wearables, social networks, and medical systems. Such data captures human behaviors, routines, activities, and affiliations. While this overwhelming data collection provides an opportunity to perform data mining, data abuse is inevitable.

It compromises individuals' privacy or breaches the security of an institution.

Data Privacy: Attacks
[Figure: an attacker issues the query "How many people have hypertension?" against a database and observes the query outputs.]
An attacker queries a database for sensitive records. Vulnerable or strategic nodes of large networks are targeted to breach an individual's privacy or to spread viruses. An adversary can track sensitive locations and affiliations as well as private customer habits. These attacks pose a threat to privacy.

Data Privacy
These privacy concerns need to be mitigated, and they have prompted huge research interest in protecting data. But strong privacy protection means poor data utility, and good data utility means weak privacy protection. The challenge is to find a good trade-off between data utility and privacy. The objectives of privacy preserving data mining in databases and data mining are:
- to provide new plausible approaches to ensure data privacy when executing database and data mining operations;
- to maintain a good trade-off between data utility and privacy.
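To make the query attack above concrete, here is a minimal Python sketch of a differencing attack on count queries. The records and helper names are hypothetical, not from the slides; the point is that two individually harmless aggregate answers combine to reveal one person's sensitive value:

```python
# Differencing attack on count queries (hypothetical data): two
# harmless-looking aggregate answers expose one person's record.
records = [
    {"name": "Alice", "hypertension": True},
    {"name": "Theo",  "hypertension": False},
    {"name": "John",  "hypertension": False},
]

def count(db, predicate):
    """Answer a counting query over the database."""
    return sum(1 for row in db if predicate(row))

# Query 1: how many people have hypertension?
q1 = count(records, lambda r: r["hypertension"])
# Query 2: how many people other than Alice have hypertension?
q2 = count(records, lambda r: r["hypertension"] and r["name"] != "Alice")

# The difference exposes Alice's condition exactly.
print("Alice has hypertension:", q1 - q2 == 1)
```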

Privacy Breach: Linkage Attack
Different public records can be linked to a released table to breach privacy. In the example below, linking the hospital records with public records from a sport club over gender, age, and zip code reveals, as annotated on the slide, that Betty had plastic surgery and that Alice has breast cancer.

Hospital Records:
Name  | Gender | Age | Zip Code | Disease
Alice | F      | 29  | 52066    | Breast Cancer
Jane  | F      | 27  | 52064    | Breast Cancer
Jones | M      | 21  | 52076    | Lung Cancer

Public Records from Sport Club:
Name  | Gender | Age | Zip Code | Sports
Alice | F      | 29  | 52066    | Tennis
Theo  | M      | 41  | 52074    | Golf
John  | M      | 24  | 52062    | Soccer
Betty | F      | 37  | 52080    | Tennis
James | M      | 34  | 52066    | Soccer
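A minimal sketch of this linkage attack, assuming the hospital table is released with its Name column removed; joining on the quasi-identifying columns (gender, age, zip code) re-identifies Alice:

```python
# Linkage attack: join a "de-identified" hospital table with public
# sport-club records on the shared quasi-identifiers.
hospital = [  # released without names: (gender, age, zip, disease)
    ("F", 29, 52066, "Breast Cancer"),
    ("F", 27, 52064, "Breast Cancer"),
    ("M", 21, 52076, "Lung Cancer"),
]
sport_club = [  # publicly available: (name, gender, age, zip)
    ("Alice", "F", 29, 52066),
    ("Theo",  "M", 41, 52074),
    ("John",  "M", 24, 52062),
    ("Betty", "F", 37, 52080),
    ("James", "M", 34, 52066),
]

for name, g, a, z in sport_club:
    for hg, ha, hz, disease in hospital:
        if (g, a, z) == (hg, ha, hz):
            print(f"{name} has {disease}")  # -> Alice has Breast Cancer
```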

k-Anonymity
k-Anonymity is a privacy paradigm for protecting database records before data publication. It distinguishes three kinds of attributes: i) key attributes, ii) quasi-identifiers, iii) sensitive attributes.
- Key attribute: a uniquely identifying attribute, e.g., name, social security number, telephone number.
- Quasi-identifier: a group of attributes that can be combined with external data to uniquely re-identify an individual, e.g., date of birth, zip code, gender.
- Sensitive attribute: e.g., disease, salary, habit, location.

Example of partitioning a table into key, quasi-identifier, and sensitive attributes: in the hospital records below, Name is the key attribute; Gender, Age, and Zip Code form the quasi-identifier; Disease is the sensitive attribute. Hiding the key attributes alone does not guarantee privacy; the quasi-identifiers have to be altered to enforce privacy.

Hospital Records:
Name  | Gender | Age | Zip Code | Disease
Alice | F      | 29  | 52066    | Breast Cancer
Jane  | F      | 27  | 52064    | Breast Cancer
Jones | M      | 21  | 52076    | Lung Cancer
Frank | M      | 35  | 52072    | Heart Disease
Ben   | M      | 33  | 52078    | Fever
Betty | F      | 37  | 52080    | Nose Pains

k-Anonymity: Suppression and Generalization
k-Anonymity ensures privacy by suppression or generalization of the quasi-identifiers. (k-ANONYMITY): Given a set of quasi-identifiers in a database table, the table is said to be k-anonymous if each sequence of quasi-identifier values occurs at least k times, i.e., every record is indistinguishable from at least k-1 other records.
- Suppression: replace a part of or the entire attribute value by *. Suppress postal code: 52057 becomes 52***. Suppress gender: both Male and Female become *.
- Generalization: replace a value by a coarser category. Example (exam results): {Excellent}, {Very Good}, {Good, Average} generalize to Passed; {Sick}, {Poor}, {Very Poor} generalize to Failed; the root of the hierarchy is Not Available.
Both operations are sketched in code below.
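A minimal sketch of the two operations in Python; the helper names are illustrative, not from the slides:

```python
# Suppression: replace the trailing digits of a postal code by '*'.
def suppress_zip(zip_code: str, keep: int = 2) -> str:
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

# Generalization: replace an exact age by its decade, e.g. 29 -> '2*'.
def generalize_age(age: int) -> str:
    return f"{age // 10}*"

print(suppress_zip("52057"))  # 52***
print(generalize_age(29))     # 2*
```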

Generalization of Postal Code
Generalization can be achieved by (spatial) clustering. Example hierarchy: the range 52062-52080 generalizes the sub-ranges 52062-52068 and 52070-52080, which in turn generalize the individual codes 52062, 52064, 52066, 52068, and so on.

Example of k-Anonymity
Remove the key attributes, then suppress or generalize the quasi-identifiers. The released table below is 3-anonymous. Over-suppression leads to stronger privacy but poorer data utility: the public records from the sport club can no longer be linked to a unique row, but much detail is lost.

Released Hospital Records:
Gender | Age | Zip Code | Disease
*      | 2*  | 520*     | Breast Cancer
*      | 2*  | 520*     | Breast Cancer
*      | 2*  | 520*     | Lung Cancer
*      | 3*  | 520*     | Heart Disease
*      | 3*  | 520*     | Fever
*      | 3*  | 520*     | Nose Pains

Example of k-Anonymity with Better Utility
Generalize the postal code to [5206*, 5207*] and [5207*, 5208*] instead. k-Anonymity is still satisfied with better data utility, and an adversary cannot identify Alice or her disease from the released record. However, k-anonymity still has several shortcomings.

Released Hospital Records:
Gender | Age | Zip Code       | Disease
*      | 2*  | [5206*, 5207*] | Breast Cancer
*      | 2*  | [5206*, 5207*] | Breast Cancer
*      | 2*  | [5206*, 5207*] | Lung Cancer
*      | 3*  | [5207*, 5208*] | Heart Disease
*      | 3*  | [5207*, 5208*] | Fever
*      | 3*  | [5207*, 5208*] | Nose Pains

A mechanical check of the k-anonymity condition is sketched below.
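The k-anonymity condition can be checked by grouping the released rows by their quasi-identifier values and verifying that every group has at least k members. A short sketch with illustrative names:

```python
from collections import Counter

def is_k_anonymous(table, qi_columns, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(row[c] for c in qi_columns) for row in table)
    return all(size >= k for size in groups.values())

released = [
    {"gender": "*", "age": "2*", "zip": "520*", "disease": "Breast Cancer"},
    {"gender": "*", "age": "2*", "zip": "520*", "disease": "Breast Cancer"},
    {"gender": "*", "age": "2*", "zip": "520*", "disease": "Lung Cancer"},
    {"gender": "*", "age": "3*", "zip": "520*", "disease": "Heart Disease"},
    {"gender": "*", "age": "3*", "zip": "520*", "disease": "Fever"},
    {"gender": "*", "age": "3*", "zip": "520*", "disease": "Nose Pains"},
]
print(is_k_anonymous(released, ["gender", "age", "zip"], k=3))  # True
```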

Attacks on k-Anonymity
- Unsorted attack: different subsets of the records are released unsorted. This attack can be solved by randomizing the order of the records.
- Linkage attack: different versions of the released table can be linked to compromise the k-anonymity results. Linking the two releases below by row position reveals that Jones is in row three, so Jones has lung cancer!

Released Records 1:
Gender | Age | Zip Code       | Disease
*      | 2*  | [5206*, 5207*] | Breast Cancer
*      | 2*  | [5206*, 5207*] | Breast Cancer
*      | 2*  | [5206*, 5207*] | Lung Cancer
*      | 3*  | [5207*, 5208*] | Heart Disease
*      | 3*  | [5207*, 5208*] | Fever
*      | 3*  | [5207*, 5208*] | Nose Pains

Released Records 2:
Gender | Age | Zip Code | Disease
F      | 2*  | 520*     | Breast Cancer
F      | 2*  | 520*     | Breast Cancer
M      | 2*  | 520*     | Lung Cancer
M      | 3*  | 520*     | Heart Disease
M      | 3*  | 520*     | Fever
F      | 3*  | 520*     | Nose Pains

Attack on k-Anonymity: Background Knowledge
Background knowledge attack: a lack of diversity among the sensitive attribute values of an equivalence class (homogeneity) leaks information. In the released records below, all females in their twenties have breast cancer. No diversity, so Alice has breast cancer! Likewise, all 2*-aged males have lung cancer, so Jones has lung cancer! This led to the creation of a new privacy model called l-diversity.

Released Records:
Gender | Age | Zip Code | Disease
F      | 2*  | 520*     | Breast Cancer
F      | 2*  | 520*     | Breast Cancer
M      | 2*  | 520*     | Lung Cancer
M      | 2*  | 520*     | Lung Cancer
M      | 3*  | 520*     | Heart Disease
M      | 3*  | 520*     | Fever
F      | 3*  | 520*     | Nose Pains

l-Diversity
l-Diversity addresses the homogeneity and background knowledge attacks. It accomplishes this by requiring well-represented sensitive attribute values for each sequence of quasi-identifiers. (Distinct l-Diversity): an equivalence class satisfies distinct l-diversity if it contains at least l distinct values of the sensitive attribute. Micro-data example:

Anonymized1:
Quasi-Identifier | Sensitive Attribute
QI1              | Headache
QI3              | Cancer
QI2              | Headache
QI2              | Headache
QI4              | Cancer

Anonymized2:
Quasi-Identifier | Sensitive Attribute
QI1              | Headache
QI1              | Headache
QI1              | Headache
QI2              | Cancer
QI2              | Cancer

A distinct l-diversity check is sketched below.
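Analogously to the k-anonymity check above, distinct l-diversity can be verified by collecting the distinct sensitive values per equivalence class. Illustrative names; the failing example is the homogeneous release from the background knowledge attack:

```python
from collections import defaultdict

def is_distinct_l_diverse(table, qi_columns, sensitive, l):
    """True if every equivalence class has >= l distinct sensitive values."""
    classes = defaultdict(set)
    for row in table:
        classes[tuple(row[c] for c in qi_columns)].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

released = [
    {"gender": "F", "age": "2*", "zip": "520*", "disease": "Breast Cancer"},
    {"gender": "F", "age": "2*", "zip": "520*", "disease": "Breast Cancer"},
    {"gender": "M", "age": "2*", "zip": "520*", "disease": "Lung Cancer"},
    {"gender": "M", "age": "2*", "zip": "520*", "disease": "Lung Cancer"},
]
print(is_distinct_l_diverse(released, ["gender", "age", "zip"], "disease", 2))  # False
```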

Variants of l-Diversity
- Entropy l-Diversity: for each equivalence class, the entropy of the distribution of its sensitive values must be at least log(l).
- Probabilistic l-Diversity: the most frequent sensitive value of an equivalence class must have a frequency of at most 1/l.

Limitations of l-Diversity: it is not necessary at times, and it is difficult to achieve; for a large record size, many equivalence classes are needed to satisfy l-diversity. Moreover, l-diversity does not consider the distribution of the sensitive attributes.

t-Closeness
The l-diversity approach is insufficient to prevent sensitive attribute disclosure, which led to the proposal of another privacy definition called t-closeness. t-Closeness achieves privacy by keeping the distribution of each equivalence class's sensitive attribute close to its distribution in the whole database. For example, let P be the distribution of the sensitive attribute within an equivalence class and Q the distribution of that attribute in the whole database table. Given a threshold t, an equivalence class satisfies t-closeness if the distance between P and Q is less than or equal to t. A table satisfies t-closeness if all its equivalence classes satisfy t-closeness. Both the entropy l-diversity condition and the t-closeness condition are sketched below.
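Both conditions follow directly from their definitions. The slides do not fix a distance measure for t-closeness; the sketch below uses the total variation distance as a simple stand-in (the original t-closeness proposal uses the Earth Mover's Distance). Names and example values are illustrative:

```python
import math
from collections import Counter

def entropy_l_diverse(sensitive_values, l):
    """Entropy of the class's sensitive-value distribution must be >= log(l)."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

def t_close(class_values, table_values, t):
    """Distance between class and whole-table distributions must be <= t."""
    p, q = Counter(class_values), Counter(table_values)
    support = set(p) | set(q)
    dist = 0.5 * sum(abs(p[v] / len(class_values) - q[v] / len(table_values))
                     for v in support)
    return dist <= t

print(entropy_l_diverse(["Headache", "Headache", "Cancer"], l=2))  # False
print(t_close(["Cancer", "Cancer"],
              ["Headache"] * 6 + ["Cancer"] * 2, t=0.3))           # False
```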

Background Attack Assumptions
k-Anonymity, l-diversity, and t-closeness make assumptions about the adversary, and they at times fall short of their goal to prevent data disclosure. There is another privacy paradigm which does not rely on background knowledge: differential privacy.

Differential Privacy: Overview
Differential privacy achieves privacy through data perturbation, i.e., the addition of a small amount of noise to the true data, so that the true value can be masked from adversaries. It is used for the perturbation of the query results of count, sum, and mean functions, as well as other statistical queries.

Differential Privacy
[Figure: two databases D1 and D2 that differ by exactly one row (one row is removed) are passed through a randomization mechanism A; for the same queries, the ratio of the probabilities of the answers A(D1) and A(D2) is at most exp(ε).]
Core idea: the addition or removal of one record from a database does not reveal any information to an adversary. This means your presence or absence in the database does not reveal or leak any information from the database, which achieves a strong sense of privacy.

(ε-DIFFERENTIAL PRIVACY): A randomized mechanism A provides ε-differential privacy if for any two databases D1 and D2 that differ on at most one element, and all outputs S ⊆ Range(A):

    Pr[A(D1) ∈ S] ≤ exp(ε) · Pr[A(D2) ∈ S]

ε is the privacy parameter, also called the privacy budget or privacy level. The inequality is illustrated numerically below.
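A numerical sketch of the ε-DP inequality, anticipating the Laplace noise introduced on the following slides: a count query is answered on two neighboring databases whose true counts differ by 1, and the density ratio of the noisy outputs stays within exp(ε). The concrete values are illustrative:

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

epsilon = 0.5
sensitivity = 1.0            # a count changes by at most 1
b = sensitivity / epsilon    # Laplace scale for epsilon-DP
count_d1, count_d2 = 10, 11  # neighboring databases: counts differ by 1

for output in [8.0, 10.0, 10.5, 13.0]:
    ratio = laplace_pdf(output, count_d1, b) / laplace_pdf(output, count_d2, b)
    print(f"output={output}: density ratio {ratio:.3f} <= {math.exp(epsilon):.3f}")
```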

Sensitivity of a Function
Sensitivity is important for noise derivation. The sensitivity of a function f is defined as the maximum change that occurs if one record is added to or removed from a database D1 to form another database D2:

    Δf = max_{D1,D2} ||f(D1) - f(D2)||_1

For example, a counting query has sensitivity 1, since adding or removing a single record changes the count by at most 1. There are two types of sensitivity: i) global sensitivity, ii) local sensitivity.

Data Perturbation
Data perturbation in differential privacy is achieved by noise addition. Different kinds of noise are used: Laplace noise, Gaussian noise, and the exponential mechanism.

Laplace Noise
Laplace noise stems from the Laplace distribution, whose density is

    Lap(x | b) = (1 / (2b)) · exp(-|x| / b)

The added noise has a density proportional to exp(-ε·|x| / Δf), i.e., scale b = Δf / ε. The output of a query is ε-indistinguishable when noise calibrated to the sensitivity Δf and the privacy level ε is used for perturbation; a smaller ε means stronger noise and stronger privacy (see the sketch below).

Exponential Mechanism
The exponential mechanism extends the notion of differential privacy to incorporate functions whose outputs are not real values, for example the color of a car or the category of a car. It guarantees privacy by approximating the true value of the data using a quality function or utility function.
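A minimal sketch of the Laplace mechanism described above, with illustrative function names: a numeric query answer is perturbed with Laplace noise of scale Δf / ε, so smaller ε yields noisier, more private answers.

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Perturb a numeric query answer with Laplace(0, sensitivity/epsilon) noise."""
    b = sensitivity / epsilon
    # A Laplace(0, b) variate is the difference of two Exp(1) variates, scaled by b.
    noise = b * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_answer + noise

# Example: release a count ("how many people have hypertension?");
# a count query has global sensitivity 1.
print(laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5))
```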

