The collection of personal data is both ubiquitous and unavoidable in today’s digitized global economy. Companies of all types collect information about consumers, ranging from the more typical names, addresses, phone numbers, and email addresses to more deeply personal and valuable data such as bank account and credit card numbers. Information of this kind, which can be tied to a specific individual, is known as Personally Identifiable Information (PII).

The scope of the problem

Data loss is also ubiquitous and unavoidable. Data breaches that expose PII appear in the news regularly. The table below lists the seven largest breaches of the past four years, as compiled by UpGuard, Inc.

Company                        | Date of Breach | Number of Records Exposed
CAM4                           | March 2020     | 10.88 billion
Yahoo                          | October 2017   | 3 billion
Aadhaar                        | March 2018     | 1.1 billion
First American Financial Corp. | May 2019       | 885 million
Verifications.io               | February 2019  | 763 million
LinkedIn                       | June 2021      | 700 million
Facebook                       | April 2019     | 533 million

Guidance toward data safety

According to the latest report by IBM and the Ponemon Institute, the average cost of a data breach in 2021 is US$4.24 million, a 10% rise from the $3.86 million average reported the previous year.

Customer PII was included in 44% of those breaches, at an average cost of $180 per customer PII record. The report analyzed 537 breaches across 17 countries and 17 industries.

But this doesn’t need to be the case. Technologies now exist that can mitigate or even eliminate the risk, and governing bodies are stepping in with guidance and regulations that help companies mitigate risk while storing, processing, and transferring PII. For example, the General Data Protection Regulation (GDPR) states that companies should evaluate the risks posed by their data processing activities and implement measures, such as encryption, to mitigate those risks, maintain security, and prevent processing that isn’t compliant with the GDPR.

The EU has identified “pseudonymization” as an effective way to comply with the GDPR requirement for secure storage of personal information. Pseudonymization consistently replaces one or more data fields with substitute characters that are neither recognizable nor related to the individual, so that the information can no longer be identified with a specific person.

The risk of not complying

Privacy rules are far-reaching. Under GDPR, having a single customer in the EU is enough to require compliance. The cost of GDPR violations can be as high as €20 million or 4% of a company’s global annual turnover, whichever is higher. Since the start of 2021, Amazon and Facebook together have been fined close to $1 billion for GDPR violations, as compiled by CNBC.

The challenge of data volume

In today’s business world, it is common for datasets to contain billions or even trillions of transaction records that include PII. According to projections by Statista, the volume of data created worldwide is estimated to reach 180 trillion GB (about 180 zettabytes) by 2025. At this scale, relying on humans to decide which fields to pseudonymize, and to what degree, is nearly impossible.

The concept of pseudonymization

Today’s technologies, such as Artificial Intelligence / Machine Learning (AI/ML), can be applied across unstructured datasets to provide an objective, quantifiable risk assessment of a dataset and to replace the PII it contains. Pseudonymization is a technique in which PII is modified so that it cannot be attributed to a specific person without the use of additional information. To accomplish this, personal identifiers are replaced with consistent and unique strings of characters: pseudonyms. This process is called “data tokenization.” Because the resulting string is always the same for any particular input, records belonging to the same person can still be correlated for analysis. For data to be truly pseudonymized, however, PII must be kept separate from users’ other data.
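The consistency property described above can be illustrated with a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonym function, which is one common choice and not necessarily what any particular tool uses. The key value, the `PSN_` prefix, the `pseudonymize` helper, and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key; the assumption is that it is stored apart
# from the pseudonymized dataset.
SECRET_KEY = b"example-key-kept-separately"

def pseudonymize(value, key=SECRET_KEY):
    """Return the same pseudonym every time for the same PII value."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return "PSN_" + digest[:16]

# Illustrative transaction records containing an email address as PII.
records = [
    {"email": "alice@example.com", "amount": 120},
    {"email": "bob@example.com", "amount": 75},
    {"email": "alice@example.com", "amount": 30},
]

# Replace the identifier but keep the analytical field.
pseudonymized = [
    {"user": pseudonymize(r["email"]), "amount": r["amount"]} for r in records
]
```

In this sketch the first and third records receive the same pseudonym, so spending per user can still be aggregated without exposing the email address.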

Unlike anonymization, pseudonymization is reversible, which makes it much more useful from a data analytics perspective. Reidentification is controlled: pseudonymized data is processed through a set of algorithms designed to ensure that only authorized personnel can recover the original user’s identity.
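One way to picture controlled reidentification is a mapping table that is held apart from the pseudonymized data and consulted only for authorized roles. The Python sketch below is an illustration under that assumption; the `PseudonymVault` class, the role names, and the token format are hypothetical, and a production system would add encryption, auditing, and persistent storage.

```python
import secrets

class PseudonymVault:
    """Hypothetical mapping table kept separate from the pseudonymized data."""

    def __init__(self, authorized_roles):
        self._authorized_roles = set(authorized_roles)
        self._forward = {}  # original value -> pseudonym
        self._reverse = {}  # pseudonym -> original value

    def pseudonymize(self, value):
        # Reuse the existing pseudonym so the mapping stays consistent.
        if value not in self._forward:
            token = "PSN_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token, role):
        # Only roles on the allow-list may recover the original value.
        if role not in self._authorized_roles:
            raise PermissionError(f"role {role!r} may not reidentify data")
        return self._reverse[token]

vault = PseudonymVault(authorized_roles={"privacy_officer"})
token = vault.pseudonymize("alice@example.com")
original = vault.reidentify(token, role="privacy_officer")
```

Calling `vault.reidentify(token, role="analyst")` raises `PermissionError` in this sketch, while the analyst can still work with the token itself.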

Many current tools on the market rely on list look-up techniques, which compare the data to be pseudonymized against past records. The limitation of this technique is that any new PII not present in the reference list may slip through unpseudonymized. Some tools are also limited in the file formats they can process accurately and are unable to identify PII in unstructured log files.
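The look-up limitation is easy to reproduce. The following Python sketch is a simplified stand-in for a list-based tool, not a model of any specific product; the `KNOWN_PII` list, the `mask_by_lookup` helper, and the log lines are invented for illustration.

```python
# Hypothetical reference list built from past records.
KNOWN_PII = {"alice@example.com", "bob@example.com"}

def mask_by_lookup(line, known_values=KNOWN_PII):
    """Mask only the values that already appear in the reference list."""
    # Longest values first, so a shorter entry cannot split a longer one.
    for value in sorted(known_values, key=len, reverse=True):
        line = line.replace(value, "<PII>")
    return line

# A value that is in the list is masked.
seen = mask_by_lookup("login ok user=alice@example.com ip=10.0.0.5")

# A new value that is not in the list passes through untouched.
unseen = mask_by_lookup("login ok user=carol@example.com ip=10.0.0.5")
```

Here `seen` has its email address masked, while `unseen` still contains `carol@example.com` because that address was never added to the list.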

How AI/ML can address pseudonymization

AI/ML, with its continuously evolving algorithms, can be leveraged to overcome these limitations and address the challenge of pseudonymizing large unstructured datasets. Orion’s Pseudonymization Tool is an AI/ML-based solution that can be trained to identify and process PII across unstructured log files before they are downloaded or shared. To train the ML model, data teams set up a reference lab, where they collect sample data together with a precise list of the PII values it may contain. Because every PII value in the sample is known in advance, the labelling can be 100% accurate. The quality of the sample and labelled PII data is crucial for training the ML model to perform optimally and identify PII accurately in real-world conditions. Once trained, ML-based solutions can significantly reduce the time required to find every place where PII occurs, improve the overall accuracy of pseudonymization, and safeguard organizations in the event of a data breach.
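As a rough illustration of how a known list of PII values can yield exact labels for training data, the Python sketch below marks every occurrence of each known value in a sample log line. It is an assumption about how such labelling could work, not a description of the Orion tool; the `label_line` helper, the label names, and the sample logs are hypothetical.

```python
def label_line(line, pii_values):
    """Return sorted (start, end, label) spans for known PII values in a line."""
    spans = []
    for value, label in pii_values.items():
        start = line.find(value)
        while start != -1:
            spans.append((start, start + len(value), label))
            start = line.find(value, start + len(value))
    return sorted(spans)

# Hypothetical reference-lab data: the PII values are known in advance.
lab_pii = {"alice@example.com": "EMAIL", "555-0100": "PHONE"}
sample_logs = [
    "2021-06-01 user=alice@example.com action=login",
    "2021-06-01 callback requested phone=555-0100 by alice@example.com",
]

# Each training example pairs a raw line with its exact PII spans.
training_set = [(line, label_line(line, lab_pii)) for line in sample_logs]
```

Span-labelled lines of this kind are a typical input format for training a sequence-labelling model, which can then generalize to PII values it has not seen before.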

What this means for your business

The risk of data breaches and the growing severity of regulatory scrutiny make the protection of personal data imperative for every global organization that holds billions of records containing individuals’ personal information. Pseudonymization has been identified as an adequate means of protecting that data. However, many current tools are limited in what they can do. AI/ML-based pseudonymization tools such as the Orion Pseudonymization Tool are an effective step toward mitigating the risks of a breach and staying in compliance with the law.

Learn more about the Orion Pseudonymization Tool or contact us for a consultation.
