Introduction

Anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that connect an ‘entity’ to stored data. If the anonymization is sound, attempts at de-anonymization should not reveal personally identifiable information (PII). Anonymization can target the pattern, the value, or both. In this work, we anonymize only the values, so as to preserve the predictive/detective power of the data. As with any other data, Telco data contains both categorical and numerical variables, and we have to handle both. There are various approaches to anonymization:

...

  • Names (Systems, Domain, Individuals, Organizations, Places, etc.)
  • Address (IP and MAC)
  • Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC
  • Location Data (Cell-ID, Count, etc.). GPS data on its own is not sensitive information; the context around it, such as names, is what makes it sensitive.


PII Type | Dataset (links)
Names (Systems, Domain, Individuals, Organizations, Places, etc.) | ServerLOG1, ServerLog2, LogHub, Campus
Address (IP and MAC) | Internet Traffic Dataset: EX1, EX2
Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC | Adult Dataset enhanced with Telco-Fields (Adult Dataset: generate random IMEI/IMSI* fields and add them to this dataset)
Location Data (GPS, Cell-ID, Count, etc.) | OpenCellID, GPS, IEEE-dataport (crawdad)
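The step of enhancing the Adult Dataset with Telco fields can be sketched as follows. This is a minimal illustration, not the team's actual script: the function names, the toy rows, and the test-network values MCC "001" / MNC "01" are assumptions made for the example. It relies only on two well-known format facts: an IMEI is 14 digits plus a Luhn check digit, and an IMSI is MCC + MNC + MSIN with at most 15 digits.

```python
import random


def luhn_check_digit(digits: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def random_imei(rng: random.Random) -> str:
    """Generate a synthetic 15-digit IMEI with a valid Luhn check digit."""
    body = "".join(str(rng.randrange(10)) for _ in range(14))
    return body + str(luhn_check_digit(body))


def random_imsi(rng: random.Random, mcc: str = "001", mnc: str = "01") -> str:
    """Generate a synthetic 15-digit IMSI: MCC + MNC + random MSIN.

    The default MCC/MNC are test-network values chosen for this sketch.
    """
    msin_len = 15 - len(mcc) - len(mnc)
    msin = "".join(str(rng.randrange(10)) for _ in range(msin_len))
    return mcc + mnc + msin


# Toy rows standing in for the Adult Dataset (hypothetical subset of columns).
rng = random.Random(42)
rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Self-emp-not-inc"},
]
for row in rows:
    row["imei"] = random_imei(rng)
    row["imsi"] = random_imsi(rng)
```

Seeding the generator keeps the synthetic fields reproducible across runs, which helps when comparing anonymization techniques on the same enhanced dataset.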

Phase-2

In this phase we would want to:

...

  • Classic (and its variations): K-Anonymity, L-Diversity, T-Closeness, Differential Privacy
  • Data Anonymization with Autoencoders
  • NLP approaches for data anonymization
  • Generative AI (GANs)
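As a small illustration of the first classic technique in the list above, the sketch below measures k-anonymity: the size of the smallest group of rows that share the same quasi-identifier values. The column names, the toy rows, and the decade-wide age generalization are assumptions made for this example only.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class over the quasi-identifiers."""
    if not rows:
        return 0
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical records: age and zip are treated as quasi-identifiers.
rows = [
    {"age": 23, "zip": "56001", "plan": "prepaid"},
    {"age": 27, "zip": "56001", "plan": "postpaid"},
    {"age": 31, "zip": "56002", "plan": "prepaid"},
    {"age": 38, "zip": "56002", "plan": "prepaid"},
]
print(k_anonymity(rows, ["age", "zip"]))  # 1: every row is unique

generalized = [{**row, "age": generalize_age(row["age"])} for row in rows]
print(k_anonymity(generalized, ["age", "zip"]))  # 2 after generalizing age
```

Generalizing a quasi-identifier trades precision for larger equivalence classes; the same measurement can be used to compare how much generalization each dataset needs.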


Anonymizing Names and Telco-Fields

We have found that the classic techniques work well for anonymizing both names and Telco fields (nouns and numbers) when the data is in a structured (columnar) format.

The techniques we have tried for these fields are available in this repo: https://github.com/sknrao/anonymization
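One common way to anonymize values in structured columns while keeping them usable as join keys or model features is keyed pseudonymization. The sketch below is an illustration under that assumption and is not necessarily what the repository implements; the function names and the key are hypothetical. It uses Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, length: int = 12) -> str:
    """Map a name to a stable pseudonym using HMAC-SHA256 (hex, truncated)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymize_digits(value: str, key: bytes) -> str:
    """Map a numeric identifier (e.g. an IMSI) to digits of the same length."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest, "big") % (10 ** len(value))
    return str(number).zfill(len(value))


# Hypothetical secret key; in practice it would be stored outside the dataset.
key = b"replace-with-a-secret-key"

record = {"name": "Alice Example", "imsi": "001010123456789"}
anonymized = {
    "name": pseudonymize(record["name"], key),
    "imsi": pseudonymize_digits(record["imsi"], key),
}
```

Because the mapping is deterministic for a given key, the same input always yields the same pseudonym, so value-level patterns such as repeated subscribers are preserved while the original values are hidden from anyone without the key.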

Anonymizing Packet Fields

Anonymizing packet fields is a well-researched area, with published work dating back to the early 2000s. The most recent approaches use condensation-based differential privacy.

References:

Currently, the team is working on:

(a) Implementing condensation-based differential privacy.

(b) Developing containers to test and evaluate the above techniques.
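For comparison, a much simpler baseline for packet address fields is truncation. The sketch below is not condensation-based differential privacy; it only zeroes the host part of an IP address and the device-specific half of a MAC address. The function names and the default prefix lengths are assumptions made for this example. It uses Python's standard `ipaddress` module.

```python
import ipaddress


def truncate_ip(address: str, keep_bits_v4: int = 24, keep_bits_v6: int = 48) -> str:
    """Zero the low-order bits of an IP address, keeping only the network prefix."""
    ip = ipaddress.ip_address(address)
    keep = keep_bits_v4 if ip.version == 4 else keep_bits_v6
    network = ipaddress.ip_network(f"{ip}/{keep}", strict=False)
    return str(network.network_address)


def mask_mac(mac: str) -> str:
    """Keep the vendor prefix (first three octets) of a MAC address and zero the rest."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(parts[:3] + ["00"] * 3).lower()


print(truncate_ip("192.168.17.42"))           # 192.168.17.0
print(truncate_ip("2001:db8:abcd:1234::1"))   # 2001:db8:abcd::
print(mask_mac("AA:BB:CC:DD:EE:FF"))          # aa:bb:cc:00:00:00
```

Truncation keeps coarse network structure but offers no formal privacy guarantee, which is one reason a baseline like this is useful mainly as a reference point when evaluating stronger techniques.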

Anonymizing Location Information (Cell-ID, Count, etc.)

We are currently working on this and exploring different techniques.

Anonymizing Log Data

The team is currently exploring the use of NLP for this. We will update this section once there is progress.

Phase-3


The team is currently building a tool that auto-detects PII in the data and picks the best technique to apply to it.
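The idea of detecting the PII type and then selecting a technique could be sketched as below. This is purely illustrative and does not describe the tool under development: the patterns, the 80% threshold, the type names, and the type-to-technique mapping are all assumptions made for the example.

```python
import re

# Deliberately loose patterns, for illustration only (e.g. IPv4 octets are not range-checked).
PATTERNS = {
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "mac": re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$"),
    "telco_id": re.compile(r"^\d{15}$"),  # 15 digits, e.g. an IMSI or IMEI
}

# Hypothetical mapping from detected type to an anonymization technique.
TECHNIQUES = {
    "ipv4": "truncation",
    "mac": "truncation",
    "telco_id": "keyed pseudonymization",
}


def detect_pii_type(values, threshold: float = 0.8):
    """Return the first PII type matched by at least `threshold` of the values, else None."""
    values = [str(v) for v in values]
    if not values:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return name
    return None


def choose_technique(values):
    """Pick a technique name for a column, or None if no PII type is detected."""
    return TECHNIQUES.get(detect_pii_type(values))


print(choose_technique(["10.0.0.1", "192.168.1.7", "172.16.0.9"]))  # truncation
print(choose_technique(["001010123456789", "001019876543210"]))     # keyed pseudonymization
print(choose_technique(["hello", "world"]))                         # None
```

Working at the column level with a match threshold tolerates a few malformed values, at the cost of missing PII that appears only occasionally in free text.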

Phase-4

The team is currently building a container-based architecture for a unified tool.