What is Anonymization?
Anonymization is an operation that removes or replaces information (personal or otherwise) of various kinds and origins before it is transmitted to third parties (as with the Open Data movement, outsourcing, etc.).
In plain terms, the information is replaced by values that no longer carry any information and from which the original cannot be recovered.
Anonymization, the term most used in the EU, also goes by other names depending on motivation or context:
- De-identification or depersonalization (especially for personal data).
- Data masking, or simply masking.
- Other terms in common use include document anonymization, data cloaking, and data masquerading.
But why should we anonymize?
Reasons for anonymization
Several motivations lie behind an anonymization operation:
- Protection of personal data (the most common motivation, and the reason the CNIL was created).
- Protection of professional and ethical secrecy.
- Protection of financial and economic information.
- Protection of commercial and/or contractual information.
- Protection of strategic information (with highly competitive value).
We will not go into the applicable texts on the protection of privacy, which range from the Charter of Fundamental Rights of the European Union to other laws.
With personal data, two main families of data must be distinguished:
- Data with high identification value. Examples include unique numbers (INSEE, identity card, permit, registration, telephone), biometric data (DNA, fingerprints), and identity photographs, all of which make it possible to identify a person uniquely.
- Data with low identification value. In general, such data does not identify a person uniquely on its own. However, cross-checking it with other information of low identifying value may make identification possible.
- For example, a first name (or a surname) alone has only a low identification value, although it does carry information. Combined with a municipality, especially a small one, it may be enough to identify a person uniquely.
- The Data Protection Act clearly states that any data that makes it possible to identify a natural person, directly or indirectly, is considered personal data.
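The cross-checking risk described above can be illustrated with a minimal Python sketch. The records, field names, and the `unique_combinations` helper are hypothetical and exist only for this example: it counts how many value combinations occur exactly once, first for the first name alone, then for the first name together with the municipality.

```python
from collections import Counter

# Hypothetical records: neither field identifies anyone on its own.
records = [
    {"first_name": "Marie", "municipality": "Lyon"},
    {"first_name": "Marie", "municipality": "Lyon"},
    {"first_name": "Marie", "municipality": "Smallville"},
    {"first_name": "Paul", "municipality": "Lyon"},
    {"first_name": "Paul", "municipality": "Lyon"},
]


def unique_combinations(rows, fields):
    """Return the value combinations that occur exactly once in rows."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# The first name alone singles nobody out in this sample...
print(unique_combinations(records, ["first_name"]))
# ...but combined with a small municipality, one person becomes unique.
print(unique_combinations(records, ["first_name", "municipality"]))
```

Any combination returned by the second call points at a single record, which is why both fields would have to be treated as personal data in such a dataset.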
That said, there are two important questions:
- What should be anonymized?
- And how do we do it?
Here are some answers.
What should be anonymized?
To this first question, the answer is (apparently) simple: all information that makes it possible to identify a natural person directly or indirectly, that is, including by cross-checking or cross-referencing with other information from the same source or from other (even external) sources.
Classic examples: first name, last name, street, town, age, date of birth, telephone numbers, email address, etc.
On a specific IT project, it is therefore necessary to analyze carefully and designate, precisely and unambiguously, all the information that will have to be anonymized.
This task, which is the responsibility of the project data manager, is crucial because the slightest oversight can have unfortunate consequences: once the data has been handed over to a third party, it is too late.
How do we anonymize?
The answer to this second question may be somewhat disappointing: there is no miracle solution.
It is an IT project like any other. It can of course rely on software solutions, but these must be chosen with care and with their limits in mind.
Essential steps in an anonymization project:
- Start by identifying which data sources must be anonymized (database(s), incoming and/or outgoing flow(s), document model(s), document(s), etc.).
- For each source, choose whether the anonymization should be performed on the fly or in bulk. The first (on the fly) is complex; the second is the most common case at the database level.
- Then identify and precisely locate, for each source, all the data to be anonymized.
- For each type of data, define the anonymization technique to be used (see below) along with the business and technical constraints. There are many techniques; only the most common and relevant are covered in the following sections.
- Once this is done, if the technical solution has not yet been selected, the specifications from the previous points make it possible to choose wisely, that is, to choose a solution that can implement all the rules defined.
- Implement the anonymization process (development of a specific tool, or configuration and development using a software solution from a vendor).
- Carry out in-depth acceptance testing with data close to production. This is not the usual acceptance testing of a classic IT project: the slightest omission (data left non-anonymized) or poor-quality anonymization (a reversible process, for example) has irreversible consequences once in production.
- Indeed, in production, the data can be consulted, copied, and used in a de-anonymization attempt, even if the process is corrected afterward. The acceptance testing must therefore be carried out in a specific environment independent of production.
This development process can obviously be done in an iterative and agile way.
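The step of assigning one technique to each type of data can be sketched as a simple rule table in Python. This is an illustrative sketch only: the field names, the sample record, and the helpers `nullify`, `mask_tail`, `random_digits`, `anonymize`, and `unhandled_fields` are hypothetical, and a real project would derive its rules from the analysis steps above.

```python
import secrets


def nullify(value):
    """Deletion: replace the value with a fixed placeholder (None)."""
    return None


def mask_tail(value, visible=2):
    """Masking: keep the first `visible` characters and blacken the rest."""
    return value[:visible] + "*" * (len(value) - visible)


def random_digits(value):
    """Random replacement: digits drawn independently of the original value."""
    return "".join(secrets.choice("0123456789") for _ in value)


# Hypothetical rule table: one technique per field to anonymize.
RULES = {
    "last_name": nullify,
    "postcode": mask_tail,
    "phone": random_digits,
}


def anonymize(record, rules=RULES):
    """Apply the rule of each listed field; other fields pass through unchanged."""
    return {k: rules[k](v) if k in rules else v for k, v in record.items()}


def unhandled_fields(record, rules, sensitive):
    """Sensitive fields that have no rule: the omissions testing must catch."""
    return sorted(f for f in sensitive if f in record and f not in rules)


row = {
    "last_name": "Durand",
    "postcode": "69001",
    "phone": "0612345678",
    "email": "a@example.org",
    "order_total": 42,
}
print(anonymize(row))
print(unhandled_fields(row, RULES, {"last_name", "postcode", "phone", "email"}))
```

Keeping the rules in one table makes the omission check cheap: here the `email` field is flagged because it was declared sensitive but has no rule attached.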
Anonymization techniques
The basic anonymization techniques, which can be combined under certain conditions, must be well understood, along with the guarantees they do or do not provide.
Let’s start with the techniques that provide few (if any!) guarantees on anonymized data:
- Mixing of data, also called diffusion, permutation, misalignment, or shuffling: this exchanges the positions of the data, but the original data is still present in the source, just in other places.
- Dilution, also called concealment, scrambling, or obfuscation: this drowns the data among new data without real meaning, but again, the original data is still present.
- Aging (date aging) of the data: this replaces the data at a time T with an older version of the same data. This is only of interest for data with a short lifespan, which is not the case for personal data anyway.
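Why mixing offers so little can be seen in a few lines of Python. The column of surnames below is hypothetical, and `random.shuffle` stands in for whatever permutation a tool might apply.

```python
import random

# Hypothetical column of surnames taken from a source table.
surnames = ["Durand", "Martin", "Bernard", "Petit"]

# Mixing: exchange the positions of the values.
shuffled = surnames[:]
random.shuffle(shuffled)

# Every original value is still present, only at another position.
still_present = sorted(shuffled) == sorted(surnames)
print(shuffled, still_present)
```

Whatever order the shuffle produces, the set of values is unchanged, so anyone with access to the column still sees every real surname.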
Techniques that are generally safe
- Deletion, overwriting, or nullification: consists of deleting the information or replacing it with a fixed value.
- Random replacement or randomization: consists of replacing each value with a random value unrelated to the original data. This point is important: the original data must not be used to seed the random generator, because under certain conditions this could make it possible to recover the original value.
- Combination, concatenation, or composition: consists of combining several values to form others. If the combination is performed on values that are already anonymized, it is a safe operation; otherwise, it should be avoided.
- Masking, redaction, blackening, or truncation: this is typically the case where certain portions of a document are blacked out. With a digital document, you have to be extremely careful: the blacked-out areas can often still be read through layers, revision marks, etc.
Other Techniques:
- Hashing: consists of applying a standard cryptographic hash function (SHA-2 or SHA-3), which performs a complex mathematical transformation that is irreversible by construction.
- Encryption or scrambling: used in the same way as hashing, but with a standard symmetric encryption function (AES-256 minimum), which performs a complex mathematical transformation that cannot be reversed without knowledge of the key.
- Substitution, translation, or correspondence tables: consist of replacing each value with another, unrelated value of the same nature. These tables can also map indices (0, 1, 2, …) to business values. Care should be taken to define specialized tables: one for first names, one for surnames, one for dates, one for towns, etc.
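A substitution table for one type of data can be sketched as follows. The pool of replacement first names and the `build_substitution_table` helper are hypothetical. Replacements are drawn with Python's `secrets` module, which cannot be seeded from the data, in line with the warning about random generators given earlier.

```python
import secrets

# Hypothetical specialized table: replacement first names only.
FIRST_NAME_POOL = ["Alex", "Camille", "Dominique", "Claude", "Sacha"]


def build_substitution_table(values, pool):
    """Map each distinct original value to a replacement of the same nature.

    The replacement is drawn at random and does not depend on the
    original value, so it reveals nothing about it.
    """
    table = {}
    for value in values:
        if value not in table:
            table[value] = secrets.choice(pool)
    return table


originals = ["Marie", "Paul", "Marie", "Lucie"]
table = build_substitution_table(originals, FIRST_NAME_POOL)
anonymized = [table[v] for v in originals]
print(anonymized)
```

In this sketch the same original always maps to the same replacement, which keeps the data consistent across rows. The table itself links originals to replacements, so it would need to be destroyed or protected as strictly as the source data.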
It is understood that several elementary techniques can be combined to anonymize a type of data, and that certain combinations are of no interest.
Example: Performing a hash followed by masking would be useless.
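The hashing technique listed above can be sketched with Python's standard `hashlib` module. The keyed variant built on `hmac`, the helper names, and the sample phone number are assumptions added for illustration and are not part of the text. The keyed form is shown because an unkeyed hash of a value taken from a small space (such as ten-digit phone numbers) can be recomputed for every candidate and compared, which a secret key is one common way to prevent.

```python
import hashlib
import hmac
import secrets


def sha256_hex(value: str) -> str:
    """Plain SHA-256 (a SHA-2 function) of a text value, as a hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_sha256_hex(value: str, key: bytes) -> str:
    """HMAC-SHA-256: the same hash function, mixed with a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret key, generated once and stored away from the data.
key = secrets.token_bytes(32)

phone = "0612345678"  # hypothetical value to anonymize
print(sha256_hex(phone))
print(keyed_sha256_hex(phone, key))
```

Both functions return a 64-character hexadecimal digest that already hides the original value, which is consistent with the remark that masking applied on top of a hash adds nothing.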