Personally Identifiable Information (PII) is the sensitive information that can be used to uniquely identify the flesh-and-blood people who are our staff, partners, vendors and, especially, our customers.
Some examples include: name, birth date, address, government-issued ID numbers, email, credit card, bank account, user ID and password. As technology evolves, biometric data and even DNA sequences will make the list as well!
Occurrences of leaked PII make headlines quickly and can be very damaging to organizations.
To get a sense of this, just google “Sony PlayStation security breach” and you’ll be met with a flurry of customer dissatisfaction, lawsuits and real losses in revenue and market share. Fortunately for Sony, the majority of their online gamers are loyal (read: addicted) and flocked back after the network came back up a month later.
Will your customers be as forgiving should their PII be leaked? Let’s not find out …
It may be tempting to dive in and just start scrambling sensitive data, but as with any project, we need to do some planning first.
Make your Plan
Here are 5 steps to create your own PII privacy plan:
Step 1 – Identify PII
Take your customer’s perspective: what would they not want published on the Internet? Note that groups of seemingly harmless information can combine as PII (e.g., postal code, birth date, gender). If in doubt, include it.
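To see how harmless-looking fields can combine into PII, here is a minimal Python sketch. The sample rows, the field names and the `unique_fraction` helper are all made up for illustration; the idea is simply to count how many records become one of a kind once the fields are taken together.

```python
from collections import Counter

# Hypothetical survey rows: none of these fields is a name or an ID number.
records = [
    {"postal_code": "K1A 0B1", "birth_date": "1980-02-14", "gender": "F"},
    {"postal_code": "K1A 0B1", "birth_date": "1980-02-14", "gender": "M"},
    {"postal_code": "K1A 0B1", "birth_date": "1975-07-01", "gender": "F"},
    {"postal_code": "M5V 2T6", "birth_date": "1980-02-14", "gender": "F"},
    {"postal_code": "M5V 2T6", "birth_date": "1980-02-14", "gender": "F"},
]

def unique_fraction(rows, fields):
    """Return the share of rows whose combination of `fields` appears only once."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    unique_rows = sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)
    return unique_rows / len(rows)

# Postal code alone singles nobody out in this sample...
postal_only = unique_fraction(records, ["postal_code"])

# ...but the three fields together make most rows one of a kind.
combined = unique_fraction(records, ["postal_code", "birth_date", "gender"])
print(postal_only, combined)
```

If a large share of your real records is unique on some combination of fields, treat that combination as PII.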
Step 2 – Check with Authorities
List all related legislation, government guidelines, requirements for standards compliance and your own organization’s privacy policies. A few examples that may apply to you are: PCI-DSS, Identity Theft Protection Act, Social Security Number Protection Act of 2010, PIPEDA, HIPAA. Your customers and the government will not be very forgiving if you plead ignorance, so consult industry groups and security specialists, and read up!
Step 3 – What’s Your PII Lifecycle?
To know how to protect PII, first you need to consider its lifecycles in your organization. If you’re lucky enough to have documented your organization’s workflows, take a second pass through each with an eye on PII privacy. Here’s a high-level sample lifecycle:
No doubt, your organization’s PII lifecycles will differ. This is not an issue if you know what they are.
Step 4 – Target Technologies
You now need to match technologies and techniques to each element of PII in your PII lifecycle.
If you’re the type that spends evenings and weekends monitoring hackerwatch.org (my condolences), this may be a simple task. Otherwise, you may want to seek some expertise from a trusted technology provider.
Let’s consider an example. You’re a federal electric utility with a desire to survey your customers on their consumption habits and willingness to support green initiatives. Funny … this was supposed to be a fictitious example, but then I easily found a Scottish Hydro survey. Cool! The publicly available online form is capturing: name, email, age range, telephone, address, house ownership status, government benefits, house features and the appetite for green initiatives.
Now let’s go through the high-level PII lifecycle mentioned above and see what technologies can help us.
Standard SSL/TLS encryption (128-bit or stronger) should be used to protect the data stream between the user’s browser and the web server. That way, if the data stream is intercepted, it’s worthless to a hacker.
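The browser handles its side of this for you, but any script or service of your own that sends survey data to the web server needs the same care. Here is a minimal sketch using Python’s standard `ssl` module. The TLS 1.2 floor is my assumption, not a requirement from any particular standard, so adjust it to your own policy.

```python
import ssl

# Start from the library defaults, which verify the server certificate and hostname.
context = ssl.create_default_context()

# Assumption: refuse anything older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

# The context can then be passed to the client of your choice,
# e.g. http.client.HTTPSConnection(host, context=context).
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```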
SSL protection ends when the data hits your web application, so all the PII is in memory in clear text format. At this point, additional data could be combined before storage. Maybe the house features are combined to predict an overall energy efficiency rating. Maybe the address is cross-referenced to an existing utility account number. Maybe the data on home ownership, government benefits and appetite for going green are combined to create a marketing priority level (i.e., the homeowner has money and wants green energy). You can start to see how a simple survey becomes powerful information – in good hands or bad.
Storing the data is really the key (pun intended). You have a lot of options:
Clear text is perhaps the default of most developers. After all, you have a physically secure building, network security and security on the database objects themselves, right? Umm … wrong. You don’t want to rely on these layers alone. All it takes is one disgruntled insider and all your customer survey PII is exposed, and now your organization is backpedaling for years!
Data scramble is my own non-technical categorization of little custom algorithms that modify data before storage. For example, change the order of the characters, mix in some predefined static characters, or take the 3-digit ASCII code of each byte and perform a mathematical operation such as multiplying by 47. This approach will stop a snoop but not a motivated hacker.
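To show why a scramble only stops a snoop, here is a toy Python version of the idea: reverse the characters and multiply each character code by 47. The function names and the dash-separated output format are my own invention for illustration. Notice that undoing it needs no secret at all, just knowledge of the steps.

```python
def scramble(text):
    """Reverse the characters and store each character code multiplied by 47."""
    return "-".join(str(ord(ch) * 47) for ch in reversed(text))

def unscramble(blob):
    """Undo the scramble; no key is needed, only knowledge of the steps."""
    chars = [chr(int(part) // 47) for part in blob.split("-")]
    return "".join(reversed(chars))

scrambled = scramble("K1A 0B1")       # unreadable at a glance
recovered = unscramble(scrambled)     # trivially reversed by anyone who reads the code
print(scrambled, recovered)
```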
Hash functions like MD5 and SHA-2 scramble the data in more sophisticated ways and produce harder-to-decipher output. They’re relatively simple to implement and pretty common. Unfortunately, they’re so common that hackers have well-known attacks to break them. Going back to the most recent headlines, Sony indicates that their data was hashed. If a couple of customers’ records were leaked, the hash would have deterred most hackers, but in this case 77 million records were leaked, enough to make the computing effort of hash-cracking worthwhile for the nefarious. One other practical limitation of hash algorithms is that they are one-way. For example, to check whether a user supplied a valid password, an application must MD5-hash the supplied password and compare the result with the MD5-hashed value in the database. One cannot just run the hashed value through some algorithm to get back the clear text.
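The hash-and-compare check can be sketched in Python with the standard `hashlib` module. The unsalted MD5 function mirrors what the text describes. The salted PBKDF2 variant is an alternative I’m adding for contrast, not something from the Sony story, and the helper names and iteration count are assumptions.

```python
import hashlib
import hmac
import os

def md5_hex(password):
    """Unsalted MD5, as described in the text. Shown for illustration only."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def salted_hash(password, salt=None, iterations=100_000):
    """Assumed alternative: PBKDF2-HMAC-SHA256 with a random per-user salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify(password, salt, expected_digest, iterations=100_000):
    """Re-hash the supplied password and compare; the stored hash is never reversed."""
    _, digest = salted_hash(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)

# Two users with the same password get the same unsalted MD5 value...
same_md5 = md5_hex("hunter2") == md5_hex("hunter2")

# ...but different salted hashes, so cracking one does not reveal the other.
salt_a, hash_a = salted_hash("hunter2")
salt_b, hash_b = salted_hash("hunter2")
print(same_md5, hash_a == hash_b, verify("hunter2", salt_a, hash_a))
```

The salt and the slow key-derivation step are what raise the cost of cracking millions of leaked records at once.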
Encryption uses a more sophisticated algorithm involving large prime numbers, modular arithmetic and exponentiation. Public key cryptography (RSA) is the basis of SSL. With large enough numbers, it’s generally considered the gold standard. Also, unlike hash functions, which are one-way, public key encryption has two keys. The public key is used for encryption while the private key is used for decryption. The cool part is that knowing the public key doesn’t help you decipher anything. Of course, it’s critical that the private key remain private! However, even encryption has limitations. If the same public key is used and the scheme adds no random padding, two identical messages will produce the same ciphertext. So, you may not know what postal code a person has by looking at the encrypted data, but you’ll know that they share a postal code with 200 other people on file. If somehow you determine one, you have them all.
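The repeated-ciphertext limitation is easy to see with a toy “textbook” RSA in plain Python. The tiny primes, the sample names and the integer encoding of postal codes are all assumptions for illustration; this is not how you would encrypt real data. Real libraries typically add randomized padding (OAEP, for example), which avoids the problem.

```python
# Toy key with tiny primes, for illustration only; real keys are 2048 bits or more.
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (needs Python 3.8+)

def encrypt(message):
    """Textbook RSA: exponentiation with the public key, no padding."""
    return pow(message, e, n)

def decrypt(ciphertext):
    """Textbook RSA: exponentiation with the private key."""
    return pow(ciphertext, d, n)

# Assumed encoding: each postal code is mapped to a small integer below n.
postal_codes = {"alice": 1234, "bob": 1234, "carol": 2001}
ciphertexts = {name: encrypt(code) for name, code in postal_codes.items()}

# Alice and Bob share a postal code, and their ciphertexts give that away.
print(ciphertexts["alice"] == ciphertexts["bob"])    # True
print(ciphertexts["alice"] == ciphertexts["carol"])  # False
```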
Tokenization is a step beyond encryption. Tokenization aims to create a data vault in which all PII is kept. Outside the data vault, the operational database contains only tokens to the data in the vault. The data in the vault is encrypted such that it is virtually useless to a hacker. The tokens themselves are typically unique keys which, when appropriately passed to the token server, get you back the clear text data. You could just as easily store the encrypted data in the operational database, but then your database would need to accommodate very long strings for each encrypted PII field. This can pose problems, especially when retrofitting security into an existing system. If part of the data needs to be readable in the operational database for pragmatic purposes (e.g., the last 4 digits of a customer’s phone number for CSRs to validate a caller), that part can be left in clear format, either as part of the token or in a separate database field. As with any encryption, the decryption key(s) need to be stored in the most secure location possible, entrusted to a bare minimum of individuals.
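Here is a rough Python sketch of the token idea. The `TokenVault` class, its method names and the `authorized` flag are hypothetical stand-ins for a real token server, and the in-memory dictionary skips the encryption a real vault would apply at rest.

```python
import secrets

class TokenVault:
    """Toy in-memory stand-in for a token server (hypothetical, not a product API)."""

    def __init__(self):
        self._store = {}  # token -> clear text; a real vault would encrypt this at rest

    def tokenize(self, value, keep_last=0):
        """Store `value` and return a random token, optionally keeping trailing characters readable."""
        token = secrets.token_hex(8)
        if keep_last:
            token = f"{token}-{value[-keep_last:]}"
        self._store[token] = value
        return token

    def detokenize(self, token, authorized):
        """Return the clear text, but only to an authorized caller."""
        if not authorized:
            raise PermissionError("caller is not authorized to read PII")
        return self._store[token]

vault = TokenVault()
phone_token = vault.tokenize("555-867-5309", keep_last=4)

# The operational database only ever sees the token.
operational_row = {"customer_id": 42, "phone": phone_token}
print(operational_row["phone"][-4:])   # a CSR can still validate the last 4 digits
```

Marketing can query `operational_row` all day long; only an authorized application calling `detokenize` gets the full phone number back.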
The above options for PII storage go from simple to sophisticated. There’s no one right answer for all situations, so consider your needs and consult an expert.
Beyond “copying” of data for communications as outlined above, organizations often create copies of data for dashboards, reports, marketing, telemarketing, etc. In our example, marketing will no doubt want access to the survey database so that they can run queries, extract the data to an Excel pivot table, feed email campaigns, etc. Without proper encryption in the database, control over PII would quickly be lost, or left up to the honor system. Under the tokenization approach, users can be given more freedom to query the operational database, because it is only through authorized applications that PII is available in clear text format.
System backups and mirroring create copies of the database as well. Without proper encryption in the database, the backup media/disk needs to be secured and controlled as much as the original. With proper encryption techniques, there is no readable PII in the database to worry about.
Destroying PII is pretty simple, and no special technology is required. When you no longer need the PII for any important reason, destroy it by permanently deleting the associated records in the data vault. Note that the data tokens can remain in your operational database if desired, for historical reporting etc. Purging PII from data backups happens naturally. If your organization keeps only a 30-day cycle of nightly backups, the PII will age out of them over time.
Step 5 – Implement & Monitor
Your research is done, requirements defined, PII policies and contracts written and technologies chosen for each aspect of your PII lifecycles. Now it’s time to plan the project and implement the chosen solutions.
As with any project, you need to determine the ROI and ensure all interested parties are on board. Your executives need to understand the risks of a PII breach, and your staff needs to believe in the security measures to be implemented.
Each operating system (Windows, iSeries, Linux, etc.) and DBMS (SQL Server, DB2, Oracle, etc.) has its own features for hashing and encryption available to the development platform. The architecture for tokenization is a little more complex, with a token server in the mix. Depending on your comfort with the subject matter, you may want to engage a technology partner to help with planning and implementation. I welcome your questions.
Once your PII protection solution is developed, tested and implemented, you can rest – sort of. You now need to monitor activities to ensure that operations are not being unduly restricted, and that PII access patterns are as expected. Hopefully your implementation includes logging and alerts which red-flag unauthorized or unexpected accesses (or attempted accesses) to PII. For example, if you detect an unusual pattern of requests to your token server from a particular user, perhaps they are using their access via authorized applications to build their own little clear-text database.
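A red-flag check of this kind can be sketched in a few lines of Python. The log format, the user names and the threshold of 100 requests are all assumptions; a real implementation would read from your token server’s actual access log and use a baseline that fits your operations.

```python
from collections import Counter

def flag_heavy_users(access_log, threshold):
    """Return users whose number of detokenize requests exceeds `threshold`."""
    counts = Counter(
        entry["user"] for entry in access_log if entry["action"] == "detokenize"
    )
    return sorted(user for user, count in counts.items() if count > threshold)

# Hypothetical one-day log: two CSRs doing normal lookups, one user pulling a lot more.
access_log = (
    [{"user": "csr_anna", "action": "detokenize"}] * 12
    + [{"user": "csr_ben", "action": "detokenize"}] * 9
    + [{"user": "mkt_carl", "action": "detokenize"}] * 480
    + [{"user": "mkt_carl", "action": "query"}] * 30
)

flagged = flag_heavy_users(access_log, threshold=100)
print(flagged)   # ['mkt_carl']
```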
As with monitoring any security measure, no news is good news – your deterrent is working.
As mentioned, many governments and member groups have already created policies, certifications and laws around the handling of PII. With collections of PII becoming larger and more valuable, initiatives like PCI-DSS and HIPAA will no doubt become more prevalent in the future.
With so much hype around cloud-based solutions and data being held on third-party servers, standards and technologies for PII privacy will have to evolve quickly. Tokenization appears to be the emerging approach for large-volume, high-value PII. However, right now it is basically implemented at the application layer.
I predict that encryption and tokenization will descend to the layers of the data repository or DBMS. In the not too distant future, I can envision the architect or database designer simply clicking checkboxes to nominate fields of information as PII and selecting one of several available standard protection schemes. This would provide a very valuable level of abstraction and relieve the burden from the application developer.
Taking this one step further, PII protection services implemented at the repository or DBMS layer will no doubt become a value-add feature of more and more cloud offerings. It’s a pretty easy sell – if you want to avoid a crippling PII leak and you can’t make sense of the latest laws and standards, entrust it all to someone who can, and will maintain the necessary compliance on your behalf.
Of course, with all this data centralized in cloud offerings, the stakes will go up for the hackers as well! Hmm …