ABSTRACT

Cryptography for user accounts normally concentrates on encrypting passwords and overlooks the encryption of other user information. This paper discusses data desensitization for stored user information. To protect users' private data, sensitive data is first identified and then obscured using data obfuscation or the K-Anonymity algorithm.

Keywords:

1. INTRODUCTION

With the advent of big data, data has gained tremendous value in many areas.
At the same time, big data also brings the challenge of protecting private information: how can sensitive information be kept from leaking while big data is still shared efficiently? In recent years, numerous websites have announced that their users' information was leaked. Even though some of the affected websites applied algorithms or other methods to protect account passwords, most website developers ignore the user information that is stored in the server database in plain text.
A hacker who gains access to the database can leak users' information, because nothing protects it. Website data is threatened not only by illegal attacks but also by leaks during product development. Many organizations inadvertently leak information when they routinely copy sensitive or production data to a non-production environment:

1. Most companies copy production data to test and development environments so that system administrators can test upgrades or fix bugs. Developers need an environment that simulates the application in order to test new functionality and ensure that existing functionality is not broken, so they copy users' real data to test application functions.
2. Retailers share point-of-sale data with market researchers to analyze customers' shopping patterns, and some analysts are authorized to see users' personal information.
3. Pharmaceutical and medical organizations share patient data with investigators to assess diagnostic efficacy and drug efficacy. Some of the patients' personal information may leak during this processing.

If developers copy data into non-production environments that expose users' information, users' private data becomes a target for hackers and is easily stolen or leaked, resulting in irreparable damage.
Data security is an important part of information security. Current data protection techniques include symmetric and asymmetric encryption, data desensitization, homomorphic encryption, and access control. Each has its own characteristics and role in protecting data; this paper mainly discusses using data desensitization to protect users' private information. It also discusses encryption of the data.

Data desensitization is the transformation of sensitive information according to desensitization rules, so that sensitive and private data is reliably protected. For customer security data or commercially sensitive data, the real data can be modified and provided for testing without violating system rules. Personal information such as ID card numbers, cell phone numbers, bank card numbers, and addresses needs to be desensitized.
With data desensitization technology, even if the database is attacked, the sensitive information in users' data remains masked. The masked information keeps its original data format and properties, so that applications function properly during development [1].

2. PERSONAL PRIVACY DATA THREAT

The InfoWatch Analytical Center collects data from public reports of commercial organizations and government agencies disclosing data through malicious or negligent release. According to its report, many large enterprises and governments fail to keep their data secure. The report also notes that 93% of leaked data compromised personal privacy and payment details [2].

3. IDENTIFYING SENSITIVE DATA

3.1 Using regular expressions to analyze sensitive data

In a data desensitization system, the choice of algorithm is generally specified in advance: the system presets which processing algorithm is applied to credit card data and which to phone data, and users can also customize the configuration. In most databases, some data has a constant type, so the database can be configured to filter such data; for example, one column may hold users' phone numbers.
Because phone numbers have a fixed number of digits, a regular expression can be used to pick out that field in the database.

3.2 Other detection techniques to analyze sensitive data

In a big data environment, identifying, discovering, and processing sensitive data within unstructured data are urgent problems. To identify sensitive data automatically, feature learning and natural language processing techniques can be built on a known dataset. Statistical methods estimate the probability that the values in a data field resemble known sensitive data, and natural language processing methods are used to build a sensitive data recognition engine. The system takes the acquired sample data and identifies the sensitive data in it by data type and data content.
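The regular-expression filter of Section 3.1 and the statistical idea above can be combined in a small sketch: sample the values of each column, measure the fraction that match a pattern, and flag the column when that fraction reaches a threshold. The patterns (11-digit phone numbers, 16 to 19 digit card numbers), the threshold, and all function and column names below are assumptions made for illustration and are not taken from the system described in this paper.

```python
import re

# Hypothetical patterns; real formats vary by country and by system.
PATTERNS = {
    "phone": re.compile(r"\d{11}"),         # assumes 11-digit mobile numbers
    "bank_card": re.compile(r"\d{16,19}"),  # assumes 16 to 19 digit card numbers
}


def match_ratio(values, pattern):
    """Fraction of non-empty sampled values that fully match the pattern."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)


def detect_sensitive_columns(sample, threshold=0.8):
    """Map column name -> detected type when the match ratio reaches the threshold."""
    detected = {}
    for column, values in sample.items():
        for label, pattern in PATTERNS.items():
            if match_ratio(values, pattern) >= threshold:
                detected[column] = label
                break
    return detected


# Hypothetical sample rows, grouped by column.
sample = {
    "name": ["Alice", "Bob", "Carol"],
    "contact": ["13812345678", "13987654321", "n/a"],
    "card": ["6222021234567890", "6222029876543210", "6222020000111122"],
}
result = detect_sensitive_columns(sample, threshold=0.6)
print(result)  # {'contact': 'phone', 'card': 'bank_card'}
```

Using a ratio rather than requiring every value to match tolerates dirty entries such as "n/a", at the cost of having to choose a threshold.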
Sensitive data recognition is carried out by the sensitive data recognition engine, which uses rules together with named entity recognition and feature word extraction from natural language processing to identify sensitive data intelligently.

4. DATA DESENSITIZATION METHOD

4.1 Data obfuscation

After sensitive data has been recognized, data obfuscation modifies users' data in one of several ways.

Replacement of real data with random data: this step replaces real data with fictitious data, for example by building a large dictionary table, generating a random factor for every real record, and replacing the original data content with content from the dictionary table.
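A minimal sketch of dictionary replacement follows. It assumes the random factor is simply a random index into the dictionary table; the dictionary contents, the function name `replace_with_dictionary`, and the sample records are hypothetical.

```python
import random

# Hypothetical dictionary table of fictitious names.
FAKE_NAMES = ["Alex Reed", "Jamie Stone", "Morgan Lake", "Taylor Brooks", "Casey Hill"]


def replace_with_dictionary(records, field, dictionary, seed=None):
    """Return copies of the records with `field` replaced by dictionary entries.

    A random factor (here, a random index) is drawn for each record and
    selects the fictitious value that stands in for the real one.
    """
    rng = random.Random(seed)
    masked = []
    for record in records:
        factor = rng.randrange(len(dictionary))
        new_record = dict(record)  # copy so the real record is left untouched
        new_record[field] = dictionary[factor]
        masked.append(new_record)
    return masked


# Hypothetical user records.
users = [
    {"id": 1, "name": "Zhang Wei", "city": "Beijing"},
    {"id": 2, "name": "Li Na", "city": "Shanghai"},
]
masked_users = replace_with_dictionary(users, "name", FAKE_NAMES, seed=42)
print(masked_users)
```

A larger dictionary makes it less likely that two real records receive the same fictitious value.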
Data obtained by dictionary replacement closely resembles the real data.

Random reordering: the values of the sensitive column are redistributed randomly, which breaks the link between each original value and the other fields. This method does not affect the statistical characteristics of the original data; for example, the column total is the same as in the original data.

Averaging: for numerical data, first calculate the mean, then distribute the desensitized values randomly around the mean so that the sum of the data remains unchanged.
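Random reordering and averaging can be sketched as follows. The `spread` parameter, the uniform noise, and the re-centering step that keeps the sum unchanged are assumptions made for this illustration, as are the function names and the sample salaries.

```python
import random


def shuffle_column(values, seed=None):
    """Randomly reorder a column; the multiset of values (and its sum) is preserved."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def perturb_around_mean(values, spread=0.1, seed=None):
    """Replace each value with the mean plus zero-sum random noise.

    The noise is drawn uniformly within +/- spread * |mean| and then
    re-centered so that it sums to zero, which keeps the column total
    equal to the original total (up to floating-point rounding).
    """
    rng = random.Random(seed)
    n = len(values)
    if n == 0:
        return []
    mean = sum(values) / n
    limit = spread * abs(mean)
    noise = [rng.uniform(-limit, limit) for _ in range(n)]
    offset = sum(noise) / n
    return [mean + e - offset for e in noise]


# Hypothetical payroll column.
salaries = [5200.0, 6100.0, 4800.0, 7300.0]
shuffled = shuffle_column(salaries, seed=1)
perturbed = perturb_around_mean(salaries, spread=0.1, seed=1)
print(shuffled, sum(shuffled))
print(perturbed, sum(perturbed))
```

Reordering keeps every individual value but detaches it from its row, whereas averaging discards the individual values and keeps only the total.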
Averaging is typically used for cost tables, payrolls, and similar data.

4.2 K-Anonymity algorithm

The first step in desensitizing private data is to remove or desensitize all identifying columns so that attackers cannot identify users directly. However, an attacker may still be able to identify individuals through the combined attribute values of several semi-identifying columns (quasi-identifiers).
An attacker may obtain the semi-identifying attribute values of a particular individual through social engineering or from other openly available databases containing personal information, and match them against the data on the big data platform to obtain that individual's sensitive information. To avoid this, the semi-identifying columns usually need to be desensitized as well, for example by data generalization. Data generalization replaces the values of a semi-identifying column with values that are semantically consistent but more general.

Table 1. Example of patient data

ID      Age   Disease
13423   27    Heart Disease
13455   43    Cancer
12342   63    Heart Disease
13424   32    Flu

Table 2. Example of patient data after applying the K-Anonymity algorithm

ID      Age        Disease
13*     30<*       Heart Disease
13*     30<*<40    Cancer
12*     60<*       Heart Disease
13*     30<*       Flu

The data in Table 2 is a K-Anonymity dataset. As an indicator of the risk of privacy data disclosure, K-Anonymity can be used to measure the risk of personal identity disclosure. Theoretically, in any K-Anonymity dataset each record shares its semi-identifying values with at least k-1 other records, so the probability that an attacker re-identifies a specific individual is at most 1/k, which protects users' information from being leaked directly.
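The generalization step and the resulting value of k can be sketched as below. The records are hypothetical (loosely modeled on the columns of Table 1), and the generalization rules (keeping a two-character ID prefix and using ten-year age ranges) as well as the function names are assumptions made for illustration rather than the exact procedure behind Table 2.

```python
from collections import Counter


def generalize_id(record_id, keep=2):
    """Keep the first `keep` characters of the ID and mask the rest."""
    return str(record_id)[:keep] + "*"


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def k_of(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical patient records.
patients = [
    {"id": 13423, "age": 27, "disease": "Heart Disease"},
    {"id": 13455, "age": 24, "disease": "Cancer"},
    {"id": 12342, "age": 63, "disease": "Heart Disease"},
    {"id": 12377, "age": 68, "disease": "Flu"},
]

generalized = [
    {
        "id": generalize_id(p["id"]),
        "age": generalize_age(p["age"]),
        "disease": p["disease"],
    }
    for p in patients
]

k_before = k_of(patients, ["id", "age"])
k_after = k_of(generalized, ["id", "age"])
print(k_before, k_after)  # 1 2
```

In this sketch every generalized record shares its ID prefix and age range with one other record, so k is 2 and an attacker who knows a person's ID prefix and age range can narrow the match to no fewer than two records. Coarser generalization raises k but makes the data less useful for analysis.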