databene

 
  • Increase font size
  • Default font size
  • Decrease font size

Production Data Anonymization Tutorial

Production data extraction and anonymization is easy with benerator: You simply copy all data and overwrite attributes that are secret.

 

Preparing a 'Production' system

We do not want to touch a production database in the first try, do we? So let's prepare a dummy 'production' system as a save sandbox for out tutorial. Of course, the easiest way to do this is using benerator. There is just one thing you need to change: We need a real database instance here which will be populated in the first step and then data will be extracted, anonymized and stored in an embedded HSQL database. At least you do not need to configure the latter one. So adapt the database configuration below to fit your database and save it as file prepare.ben.xml:

<?xml version="1.0" encoding="iso-8859-1"?>
<setup>

    <comment>Populating a source database which will serve as 'production' database later</comment>
    <database id="tmp" url="jdbc:firebirdsql:localhost:shop" driver="org.firebirdsql.jdbc.FBDriver" user="sysdba" password="masterkey" />
   
    <execute target="tmp" onError="warn">
        DROP TABLE db_customer;
    </execute>

    <execute target="tmp">
        CREATE TABLE db_customer (
          id        int NOT NULL,
          firstName varchar(30) NOT NULL,
          surname   varchar(30) NOT NULL,
          rank      int,
          PRIMARY KEY  (id)
        );
    </execute>

    <execute target="tmp">
        INSERT INTO db_customer (id, firstName, surname, rank) values (1, 'Alice',   'Andersson', 23);
        INSERT INTO db_customer (id, firstName, surname, rank) values (2, 'Bob',     'Bergmann', 34);
        INSERT INTO db_customer (id, firstName, surname, rank) values (3, 'Charly',  'Chan', 45);
        INSERT INTO db_customer (id, firstName, surname, rank) values (4, 'Daniela', 'Develey', 56);
    </execute>

</setup>

You can run this file by typing benerator.sh prepare.ben.xml on *ix or Mac Systems (or benerator.bat prepare.xml under Windows).

Check the results with your favorite database client: You ought to have 4 Customers now, with names that should be easy to re-identify and need to be anonymized.

Extract and anonymize production data

Now we create a descriptor file for data extraction and anonymization. Please update the content below with the same production database settings (in <database id="prod" .../>) and save it as file anonymize.ben.xml. Don't worry about its length - the extraction and anonymizationpart makes up only five lines!

<?xml version="1.0" encoding="iso-8859-1"?>
<setup>

    <import domains="person"/>

    <!-- the production database, secure the access by setting 'readOnly' to 'true' -->
    <database id="prod" url="jdbc:firebirdsql:localhost:shop" driver="org.firebirdsql.jdbc.FBDriver" user="sysdba" password="masterkey" />

    <echo message="This is the content of the 'production' database:" />   
    <iterate source="prod" type="db_customer" consumer="ConsoleExporter" />

    <!-- the target database -->
    <database id="target" url="jdbc:hsqldb:hsql" driver="org.hsqldb.jdbcDriver" user="sa" password="" schema="public"/>

    <execute target="target" onError="warn">
        DROP TABLE db_customer;
    </execute>

    <execute target="target">
        CREATE TABLE db_customer (
          id        int NOT NULL,
          firstName varchar(30) NOT NULL,
          surname   varchar(30) NOT NULL,
          rank      int,
          PRIMARY KEY  (id)
        );
    </execute>
   
    <iterate type="db_customer" source="prod" consumer="target" >
        <variable  name="person" generator="PersonGenerator"/>
        <attribute name="firstName" script="person.givenName" />
        <attribute name="surname" script="person.familyName" />
    </iterate>

    <echo message="This is the anonymized content of the 'target' database:" />   
    <iterate source="target" type="db_customer" consumer="ConsoleExporter" />

</setup>

 

Have a look at the content: It first declares a database 'prod' which is then iterated in the first <iterate/> descriptor, printing out all entities as they are in the 'production' database! It lists:

DB_CUSTOMER[ID=1, FIRSTNAME=Alice, SURNAME=Andersson, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=Bob, SURNAME=Bergmann, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=Charly, SURNAME=Chan, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=Daniela, SURNAME=Develey, RANK=56]

Now the real task begins: A 'target' database is declared and the tables are created. The second <iterate/> descriptor does the actual work of anonymization: It looks similar to the descriptor that listed the table content, but this time, the consumer is ...the 'target' database! If you restrict the descriptor to the main element, it creates identical copies of the production data. But we have some sub elements: A <variable/> introduces a helper generator that creates person objects. The two following <attribute/> descriptors make benerator overwrite the values of the columns 'firstName' and 'surname' with data that is extracted from a person object generated by the variable.

A third <generate/> descriptor is used to finally list all entries of the 'target' database. It lists something like:

DB_CUSTOMER[ID=1, FIRSTNAME=Michael, SURNAME=Fuchs, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=Birgit, SURNAME=Schmitz, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=Andrea, SURNAME=Fuchs, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=Alexander, SURNAME=Schmidt, RANK=56]

Note that all information stayed 'as is' except the person names. You surely wonder about the strange names: They are German - the PersonGenerator automatically generates names that fit the country of the system in which it runs. If you run this in Japan, you would get something like this (I hope your browser can render this):

DB_CUSTOMER[ID=1, FIRSTNAME=武, SURNAME=青木, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=彩花, SURNAME=伊藤, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=康平, SURNAME=橋本, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=雄大, SURNAME=山口, RANK=56]

Of course, US names are supported too, as well as names of 25 other countries! You can set the country explicitly writing, e.g. US, by writing

<variable name="person" generator="PersonGenerator" dataset="US"/>