Cryptography is normally concerned with the encryption of passwords, while the
encryption of user information is overlooked. This paper discusses data
desensitization for stored user information. To protect users’ private data,
sensitive data is first identified and then obfuscated using data obfuscation
or the K-Anonymity algorithm.
With the advent of big data, data has tremendous value in many areas. At the
same time, big data also brings the challenge of protecting private
information: how to protect sensitive information from being leaked while
still sharing big data efficiently. In recent years, numerous websites have
disclosed that their users’ information was leaked. Although some of the
affected websites applied algorithms or other methods to protect account
passwords, most website developers ignore the users’ information that is stored
in the server database in plain text. A hacker only needs to gain access to the
database, and the unprotected users’ information will be leaked.
Website data is threatened not only by illegal attacks but also by leaks during
product development. Many organizations inadvertently leak information when
they routinely copy sensitive data or production data to a non-production
environment.
Companies copy production data to test and development environments to allow
system administrators to test upgrades or fix bugs. Developers need an
environment that simulates the application in order to test new functionality
and ensure that existing functionality is not broken, so they copy users’ real
data to test application functions.
Retailers share point-of-sale data with market researchers to analyze
customers’ shopping patterns, and some analysts have the authority to view
users’ personal information.
Pharmaceutical or medical organizations share patient data with investigators
to assess diagnostic efficacy and drug efficacy. Some of the patients’ personal
information may leak during this processing.
If developers copy data that exposes users’ information into non-production
environments, users’ private data becomes a target for hackers and can easily
be stolen or leaked, resulting in irreparable damage. Data security is also an
important part of information security.
At present, there are many data security protection techniques, including
symmetric and asymmetric encryption, data desensitization, homomorphic
encryption, and access control. Each has its own characteristics and role in
protecting data. This paper mainly discusses using data desensitization to
protect users’ private information. It will also discuss
the encryption data distended
Data desensitization is the transformation of sensitive information according
to desensitization rules, so as to achieve reliable protection of sensitive and
private data. For customer security data or commercially sensitive data, the
real data can be modified and provided for testing without violating the rules
of the system. Personal information such as ID card numbers, cell phone
numbers, bank card numbers, and addresses all requires data desensitization.
With data desensitization technology, even if the database is attacked, the
sensitive information in users’ data remains masked. The masked information
keeps its original data format and properties, which ensures that the
application functions properly during development [1].
PRIVACY DATA THREAT
The InfoWatch Analytical Center collects data from public reports of cases in
which commercial organizations and government agencies disclosed data through
malicious or negligent release. The report shows that many large enterprises
and government agencies cannot guarantee their data security. It also mentions
that 93% of leaked data compromised personal privacy and payment details [2]. So
3.1 Regular expression analysis
In a data desensitization system, the choice of algorithm is generally
preconfigured: the system predefines which processing algorithm is applied to
credit card data, which is applied to phone data, and so on. Users can also
customize the configuration.
In most databases, some data has a fixed type and format. The database can be
configured to filter such data; for example, one of the columns may hold users’
phone numbers. Because phone numbers all have the same number of digits, a
regular expression can be used to pick out that data field in the database.
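As a minimal sketch of this idea, the following Python example scans the
columns of a small in-memory table and reports those whose values all look
like phone numbers. The 11-digit format, the function names, and the sample
rows are assumptions made for illustration only; a real system would read the
rows from the database and use the pattern appropriate to its own data.

```python
import re

# Assumption: a phone number is exactly 11 digits. Adjust for other formats.
PHONE_RE = re.compile(r"\d{11}")

def is_phone_number(value):
    """Return True if the whole value matches the assumed phone format."""
    return PHONE_RE.fullmatch(str(value).strip()) is not None

def phone_columns(rows):
    """Given rows as a list of dicts, return the column names whose values
    all match the phone pattern."""
    if not rows:
        return []
    return [col for col in rows[0]
            if all(is_phone_number(row[col]) for row in rows)]

# Hypothetical sample rows standing in for a database table.
rows = [
    {"name": "Alice", "phone": "13812345678", "age": 34},
    {"name": "Bob", "phone": "13987654321", "age": 28},
]
print(phone_columns(rows))  # ['phone']
```

Requiring every value to match is deliberately strict; a column with missing or
badly formatted entries would need a more tolerant rule.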
3.2 Other detection techniques to analyze sensitive data
In the big data environment, the identification, discovery, and processing of
sensitive data in unstructured data are urgent problems to be solved.
To identify sensitive data automatically, feature learning and natural language
processing techniques can be built on a known dataset. Statistical methods are
used to estimate the probability that the values in a data field resemble known
sensitive data, and a sensitive data recognition engine is built with natural
language processing methods. The system inspects the acquired sample data and
identifies sensitive data by its data type and data content. Recognition is
carried out by the sensitive data recognition engine, which combines rules with
named entity recognition and feature word extraction from natural language
processing to identify sensitive data intelligently.
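The rule-based and statistical parts of such an engine can be sketched roughly
as follows. This is only an illustration under stated assumptions, not the
engine described in the paper: the patterns in RULES, the words in
FEATURE_WORDS, the 0.2 bonus for a matching column name, and the 0.8 threshold
are all hypothetical, and named entity recognition is not shown.

```python
import re

# Hypothetical content rules: one regular expression per sensitive data type.
RULES = {
    "phone": re.compile(r"\d{11}"),
    "id_card": re.compile(r"\d{17}[\dXx]"),
    "bank_card": re.compile(r"\d{16,19}"),
}

# Hypothetical feature words looked for in the column name.
FEATURE_WORDS = {
    "phone": ("phone", "mobile", "tel"),
    "id_card": ("id_card", "identity"),
    "bank_card": ("bank", "account"),
}

def classify_column(name, sample, threshold=0.8):
    """Return (label, score) for the best matching sensitive type, or
    (None, score) when no type reaches the threshold."""
    values = [str(v).strip() for v in sample if v is not None]
    best_label, best_score = None, 0.0
    for label, pattern in RULES.items():
        # Statistical part: share of sampled values that match the rule.
        matches = sum(1 for v in values if pattern.fullmatch(v))
        content_score = matches / len(values) if values else 0.0
        # Feature-word part: small bonus when the column name is suggestive.
        name_hit = any(word in name.lower() for word in FEATURE_WORDS[label])
        score = min(1.0, content_score + (0.2 if name_hit else 0.0))
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score

label, score = classify_column("mobile", ["13812345678", "13987654321", "n/a"])
print(label, round(score, 2))
```

Scoring a sample instead of the whole column keeps the scan cheap, at the cost
of possibly missing sensitive values that are rare in the field.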
4.1 Data obfuscation
After recognizing sensitive data from
Data obfuscation uses different methods to modify users’ data, which
will change users’ re
Replacing real data with random data: this step replaces real data with
fictitious data, for example by building a large dictionary table, generating
a random factor for every real value record, and replacing the original data
content with content from the dictionary table. The data obtained by this
method is very similar to real data.
Random ordering: the values of the sensitive data are redistributed randomly
across records, which obscures the connection between the original value and
other fields. This method does not affect the statistical characteristics of
the original data; for example, the total of the column is the same as in the
original data.
Using the average value: for numerical data, first calculate the mean value,
then make the desensitized values randomly distributed near the mean, so as to
keep the sum of the data unchanged. This is usually used for cost tables,
payrolls, and similar cases.
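The three obfuscation methods described above can be sketched in Python as
follows. The function names, the sample dictionary, and the spread parameter
are assumptions introduced for illustration. In the averaging method, the way
the noise is corrected so that the column sum stays unchanged is one possible
choice and not necessarily the procedure the paper has in mind.

```python
import random

def substitute(values, dictionary, rng):
    """Replace each real value with a random entry from a dictionary table."""
    return [rng.choice(dictionary) for _ in values]

def shuffle_column(values, rng):
    """Redistribute the values randomly across rows. The set of values, and
    therefore the column total, is unchanged."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def perturb_around_mean(values, spread, rng):
    """Draw values randomly near the mean, then shift them all by the same
    amount so that the column sum equals the original sum."""
    n = len(values)
    if n == 0:
        return []
    mean = sum(values) / n
    noisy = [mean + rng.uniform(-spread, spread) for _ in values]
    correction = (sum(values) - sum(noisy)) / n
    return [v + correction for v in noisy]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
salaries = [5200.0, 6100.0, 4800.0, 7500.0]
fake_names = ["Name A", "Name B", "Name C"]  # hypothetical dictionary table

print(substitute(["Zhang", "Li"], fake_names, rng))
print(shuffle_column(salaries, rng))
print(perturb_around_mean(salaries, 500.0, rng))
```

Shuffling keeps every original value and only hides which row it belongs to,
while averaging keeps the total but discards the spread of the original values.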
4.2 K-Anonymity algorithm
The first step in desensitizing private data is to remove or desensitize all
identifying columns so that attackers cannot identify users directly. However,
an attacker may still be able to identify individuals through the attribute
values of multiple semi-identity (quasi-identifier) columns. An attacker may
obtain the semi-identity attribute values of a particular individual through
social engineering or from other publicly available databases that contain
personal information, and match them against the data on the big data platform
to obtain that individual's personal information.
To avoid this situation, the semi-identity columns usually also need to be
desensitized, for example through data generalization. Data generalization
replaces the data in a semi-identity column with semantically consistent but
more general data.
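Data generalization can be illustrated with a short sketch. The column names
zip and age, the ten-year age ranges, and the two-character ZIP prefix are
assumptions chosen for illustration; the actual generalization rules depend on
the dataset.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range of `width` years."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=2):
    """Keep the first `keep` characters of the code and mask the rest."""
    return zip_code[:keep] + "*" * max(len(zip_code) - keep, 0)

def generalize_record(record):
    """Return a copy of the record with its semi-identity columns generalized.
    The sensitive value itself (here the disease) is left untouched."""
    generalized = dict(record)
    generalized["age"] = generalize_age(record["age"])
    generalized["zip"] = generalize_zip(record["zip"])
    return generalized

patient = {"zip": "13053", "age": 34, "disease": "Cancer"}  # hypothetical row
print(generalize_record(patient))
# {'zip': '13***', 'age': '30-39', 'disease': 'Cancer'}
```

The generalized values stay true for the individual (34 lies in 30-39) but are
shared with many other people, which is what makes matching harder.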
Table 1. Example of patient data
Table 2. Example of patient data after applying the K-Anonymity algorithm
30<* Heart Disease 13* 30<*<40 Cancer 12* 60<* Heart Disease 13* 30<* Flu
The data in Table 2 is a K-Anonymity dataset. As an indicator of the risk of
private data disclosure, K-Anonymity can be used to measure the risk of
personal identity disclosure.
Theoretically, for any K-Anonymity dataset, the probability that an attacker
identifies a specific individual is only 1/k (where k is the number of users
who share the same semi-identity values), which will protect users’ information directly
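The K-Anonymity property can be checked with a short sketch. Here k is taken
as the size of the smallest group of records that share the same semi-identity
values, so an attacker who knows those values can single out one individual
within a group with probability at most 1/k. The records and column names below
are hypothetical and are not the data from Table 2.

```python
from collections import Counter

def group_sizes(records, semi_identity_columns):
    """Count how many records share each combination of semi-identity values."""
    return Counter(tuple(r[c] for c in semi_identity_columns) for r in records)

def k_of(records, semi_identity_columns):
    """Return the largest k for which the records are k-anonymous, that is,
    the size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(records, semi_identity_columns)
    return min(sizes.values()) if sizes else 0

# Hypothetical generalized records.
records = [
    {"zip": "13***", "age": "30-39", "disease": "Cancer"},
    {"zip": "13***", "age": "30-39", "disease": "Flu"},
    {"zip": "12***", "age": "60-69", "disease": "Heart Disease"},
    {"zip": "12***", "age": "60-69", "disease": "Flu"},
]

k = k_of(records, ["zip", "age"])
print(k, 1 / k)  # 2 0.5
```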