I developed a presentation on big data for a series of education sessions I am delivering for a financial institution trade association. As I was putting the presentation together, I realized that this was probably a good topic for the blog as a lot of you are running headlong toward big data either for log data analysis or just as the next “I need to be doing this” technology fad.
Most people do not realize that big data has been around for quite a while, relatively speaking. Google, Yahoo and similar Web service providers have been dealing with big data for years. But it was only recently that big data management frameworks such as Apache Hadoop, Google BigQuery and MongoDB became publicly available through open source, commercial solutions and software as a service (SaaS).
So we are all on the same page, let us define “big data.” The best definition I have seen for big data is from Gartner.
“Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Examples of big data include information such as Web search results, electronic messages (e.g., SMS, email and instant messages), social media postings, pictures, videos and even system log data. However, it can also include credit/debit card transactions, check images, receipts and other transactional information depending on the source of the information. As a result, big data can easily end up in-scope for PCI compliance.
The first problem with big data is that organizations are expected to know the data going into their big data repositories. The reason “know your data” is so important with big data is that it potentially comes from a wide variety of sources such as:
- Social media
- News feeds
- Images
- Streaming media (audio and video)
- Documents
- Messaging systems
- Audit logs
- Transaction logs
- Web sites
- System logs
With this diversity of information sources, it is anyone’s guess as to how much sensitive information could end up in an organization’s big data repositories. But worse yet, anyone in the big data field will tell you that you need to anticipate all potential sensitive data so that you can secure and protect it appropriately. Why is this? Because big data tools allow everything to be searchable: text, images, audio, video, anything. So if you do not want the information to be searchable, then you need to identify it and either encrypt it, truncate it or remove it so that it cannot be found. As those of you who are using data loss prevention (DLP) or other tools to find cardholder data (CHD) stored on your systems are well aware, finding CHD is not as easy as it would appear. As a result, finding it in big data could be the ultimate needle-in-a-haystack game.
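To illustrate why finding CHD is harder than it looks, here is a minimal Python sketch of the kind of check a discovery tool might run over text: a regular expression for 13 to 19 digit runs followed by a Luhn checksum. The names `CANDIDATE`, `luhn_valid` and `find_candidate_pans` are my own for illustration and are not taken from any particular DLP product. It also only covers plain text, not images, audio or video.

```python
import re
from typing import List

# Candidate card numbers: 13 to 19 digits, optionally separated by
# single spaces or hyphens, not embedded in a longer run of digits.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_candidate_pans(text: str) -> List[str]:
    """Return digit runs in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real account.
    print(find_candidate_pans("debug: card=4111 1111 1111 1111 status=ok"))
```

Even this simple check shows the problem. Roughly one in ten random digit strings of the right length passes the Luhn checksum, so order numbers, timestamps and tracking IDs produce false positives, while card numbers split across fields or stored in images are missed entirely.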
The next problem with big data is that the security tools for big data are very early in their development and, in some cases, are really bolt-on afterthoughts that use constantly running queries to find and protect the sensitive data. While a lot of vendors claim they can secure data at the “field” level, I have spoken to a number of clients going through big data implementations who tell me this is a pipe dream at the moment. As such, in all cases I am aware of, big data protection is currently accomplished by encrypting all of the data and very severely restricting access.
Which raises the question, how can big data be PCI compliant? Well, it can be PCI compliant as long as: (1) access to it is extremely limited (only a very few people, such as two or three, have access), or (2) you are able to truncate, remove or encrypt the CHD contained in your big data hive(s). Given that accurately locating CHD can be nearly impossible with current techniques, I just do not see option 2 as currently viable, so extremely limited access is your only workable option. Even then, I seriously doubt that big data is a good place for CHD to be stored, as severely limiting access is also not viable given why big data is being implemented. As a result, big data is probably not a good place for sensitive data, cardholder or otherwise, until the security tools catch up.
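For those wondering what truncation in option 2 looks like once a card number has actually been located, here is a minimal Python sketch. The function name `truncate_pan` and its parameters are hypothetical, and the default of keeping the first six and last four digits is an assumption based on the commonly cited maximum; check the current PCI DSS and card brand guidance for the formats that apply to you.

```python
def truncate_pan(pan: str, keep_first: int = 6, keep_last: int = 4,
                 filler: str = "X") -> str:
    """Replace the middle digits of a card number with a filler character.

    Separators such as spaces and hyphens are dropped before truncating.
    The six/four defaults are an assumption, not a statement of the standard.
    """
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) <= keep_first + keep_last:
        # Nothing would be hidden, so refuse rather than return the full number.
        raise ValueError("number too short to truncate")
    hidden = len(digits) - keep_first - keep_last
    return digits[:keep_first] + filler * hidden + digits[-keep_last:]


if __name__ == "__main__":
    # 4111 1111 1111 1111 is a widely used test number, not a real account.
    print(truncate_pan("4111 1111 1111 1111"))
```

The truncation itself is trivial. The hard part, as discussed above, is reliably finding every card number to feed into it.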
I think my readers will recognize why their log data would fit into the big data category. It definitely has high volume, as it typically comes from a large pool of devices that are generating potentially hundreds to thousands of entries per second, which also satisfies the high velocity requirement. And the high variety aspect is also satisfied, as log data from a Cisco or Juniper device is nothing like log data generated from a Windows or Linux server or any of a litany of other network devices. However, while the likelihood of CHD in log data should be nonexistent, CHD can still end up in log data due to debugging being performed. As such, I am not certain I would be comfortable with log data in a big data hive until the security tools are more mature.
The bottom line is that I just do not see big data being ready for PCI compliance at this point. I am sure that someday big data will be capable of being PCI compliant, but not at this time. So, to all of you who need to be PCI compliant and are running toward the light at the end of the big data tunnel: that light you see is not the end of the tunnel but an oncoming train about to run you over.