How to Anonymize Data in a PostgreSQL Database

For my current project, I needed to anonymize data in a PostgreSQL database, i.e. I wanted to remove, replace or scramble sensitive information, depending on the kind of data involved. There are several ways to achieve this. I’m going to show you how it’s done by running some easy SQL statements in PostgreSQL.

Companies often have the same problem: developers need to work with application data that is as close as possible to real production data, so that they can spot problems during development that they might miss when using a tiny test data set. At the same time, they shouldn’t have access to sensitive data like customer’s email addresses or private messages sent between users. A common approach for this is to take a production database dump, possibly thin it out to make it more manageable and then anonymize all sensitive information, and let developers only work with this dump. For example, I wanted to replace email addresses, delete oAuth tokens and scramble private messages, and then wrap it all up in a little script. I have done this before for MySQL, but not for PostgreSQL, but it turned out to be fairly easy. Let’s see an example for each kind of task.

Delete the contents of a field in all rows

This is the first, and easiest, case: we simply need to delete some data in all rows of a column:

update users set oauth_token = NULL;
Done.

Replacing Email Addresses with User-Specific Test Addresses

The second case: we need to replace data. Let’s take email addresses for example. We have a two-fold problem here: developers should not see actual email addresses, and they should not be able to accidentally send out mails to real customers (would never happen to you, right?). So let’s replace their email addresses with a test address and add their user id to enable some analysis or filtering after receiving test mails.
update users set email = 'testmail+user' || id || '@yourcompany.com';

We used the PostgreSQL concatenation operator ‘||’ and simply replaced the email field for each row with something like testmail+user853@yourcompany.com. The current project uses Gmail accounts, where you can simply add a plus sign to any mail address. The plus sign is ignored (the mail still gets delivered to the specified mail account), but you can use it for filtering incoming messages if you want.

Scramble data

This is the third, and probably most interesting case: we want to replace or “scramble” data while keeping its structure intact. Take messages between users for example: you could simply replace each message with an MD5 hash. But then you wouldn’t be able to see layout problems while working on the inbox layout for example. So you want to keep the original structure, but make the text unreadable. Enter the PostgreSQL “translate” function: “Any character in string that matches a character in the from set is replaced by the corresponding character in the to set”. This is usually used to replace single words like “is” with “was”, but we can also use it for our case: we simply provide a replacement character for each character in the alphabet. You should make sure to re-use some characters to make it harder to re-translate the message.

Just a little caveat: “If from is longer than to, occurrences of the extra characters in from are removed.” My first idea was to replace all characters with “a”, but if you just specify that as the target character, the source words will all be cut short. That’s why we’ll include one replacement character per source character. The translate function is also case-sensitive. Let’s see how it’s done:

update messages set
message = translate(message, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ');

I just chose some random replacement characters, you could use different ones if you like. Now, the message “How is it going, John?” becomes “”Jak wk wk wawhw, Uaih?”. Good enough for me.

Put It All Together

Let’s put it all into a little SQL script and use that in a shell script, so that your DBA can create an anonymized dump with one command on his local machine or on a test server:

#anonymize_db.sql
update users set oauth_token = NULL;
update users set email = 'testmail+user' || id || '@yourcompany.com';
update messages set
message = translate(message, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ');

#anonymize_db.sh
dropdb -U youruser yourdb
createdb -U youruser yourdb
psql -U youruser yourdb < ./production_dump.sql
psql -U youruser yourdb < ./anonymize_db.sql
pg_dump -U youruser yourdb > anon_dump.sql

What we do is drop and recreate the db (don’t do this on your production server, kids), import a production dump that must have been produced before, apply the anonymization script and export this state into another SQL file for the developers. You could run this as a nightly cron job and check in the anonymized dump into a repository in your version control system.

There you have it: an anonymized database dump that is safe to use and still resembles your production data as closely as possible.

Do you know of other best practices for anonymizing data? Please share them here.

One thought on “How to Anonymize Data in a PostgreSQL Database

  1. Paul Hernandez

    Thanks for the post. It gave me a initial point to start to anonymized the customer data in our dev databases. It would be nice also to anonymized fields like Name, Address, City, ZIP-CODE, etc. with real names. My colleagues have found an interesting data generator: http://migano.de/testdaten.php I’ll let you know my final approach after the implementation.
    Viele Grüße,
    Paul Hernandez

Comments are closed.