Production Data Anonymization Tutorial
Production data extraction and anonymization is easy with benerator: You simply copy all data and overwrite attributes that are secret.
Preparing a 'Production' system
We do not want to touch a production database in the first try, do we? So let's prepare a dummy 'production' system as a save sandbox for out tutorial. Of course, the easiest way to do this is using benerator. There is just one thing you need to change: We need a real database instance here which will be populated in the first step and then data will be extracted, anonymized and stored in an embedded HSQL database. At least you do not need to configure the latter one. So adapt the database configuration below to fit your database and save it as file prepare.ben.xml:
<?xml version="1.0" encoding="iso-8859-1"?> |
You can run this file by typing benerator.sh prepare.ben.xml on *ix or Mac Systems (or benerator.bat prepare.xml under Windows).
Check the results with your favorite database client: You ought to have 4 Customers now, with names that should be easy to re-identify and need to be anonymized.
Extract and anonymize production data
Now we create a descriptor file for data extraction and anonymization. Please update the content below with the same production database settings (in <database id="prod" .../>) and save it as file anonymize.ben.xml. Don't worry about its length - the extraction and anonymizationpart makes up only five lines!
<?xml version="1.0" encoding="iso-8859-1"?> |
Have a look at the content: It first declares a database 'prod' which is then iterated in the first <iterate/> descriptor, printing out all entities as they are in the 'production' database! It lists:
DB_CUSTOMER[ID=1, FIRSTNAME=Alice, SURNAME=Andersson, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=Bob, SURNAME=Bergmann, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=Charly, SURNAME=Chan, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=Daniela, SURNAME=Develey, RANK=56]
Now the real task begins: A 'target' database is declared and the tables are created. The second <iterate/> descriptor does the actual work of anonymization: It looks similar to the descriptor that listed the table content, but this time, the consumer is ...the 'target' database! If you restrict the descriptor to the main element, it creates identical copies of the production data. But we have some sub elements: A <variable/> introduces a helper generator that creates person objects. The two following <attribute/> descriptors make benerator overwrite the values of the columns 'firstName' and 'surname' with data that is extracted from a person object generated by the variable.
A third <generate/> descriptor is used to finally list all entries of the 'target' database. It lists something like:
DB_CUSTOMER[ID=1, FIRSTNAME=Michael, SURNAME=Fuchs, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=Birgit, SURNAME=Schmitz, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=Andrea, SURNAME=Fuchs, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=Alexander, SURNAME=Schmidt, RANK=56]
Note that all information stayed 'as is' except the person names. You surely wonder about the strange names: They are German - the PersonGenerator automatically generates names that fit the country of the system in which it runs. If you run this in Japan, you would get something like this (I hope your browser can render this):
DB_CUSTOMER[ID=1, FIRSTNAME=武, SURNAME=青木, RANK=23]
DB_CUSTOMER[ID=2, FIRSTNAME=彩花, SURNAME=伊藤, RANK=34]
DB_CUSTOMER[ID=3, FIRSTNAME=康平, SURNAME=橋本, RANK=45]
DB_CUSTOMER[ID=4, FIRSTNAME=雄大, SURNAME=山口, RANK=56]
Of course, US names are supported too, as well as names of 25 other countries! You can set the country explicitly writing, e.g. US, by writing
<variable name="person" generator="PersonGenerator" dataset="US"/> |


