Big Data Analytics for Cyber Security
Bharath Krishnappa
Principal Software Engineer
EMC, India Center of Excellence
Bharath.Krishnappa@rsa.com
Table of Contents
Introduction
Security
Challenges
Privacy
Conclusion
References
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes or methodologies.
Introduction
The world is generating data at an unprecedented rate, heading into the zettabyte era [Cisco Blog]. Today, only a fraction of this data is used to aid business decisions. Big data analytics is set to
change this. Its goal is to bring more and more data into play. Many organizations and big
businesses have woken up to the potential of big data and machine learning. Almost all of them
are either exploring ways to put them to use or are already using them. Big data has been one
of the top trending topics in Information Technology for some time now and I believe it will
continue to be on top for a long time.
Cyber security is an ever evolving field. Every new technology will introduce a new set of threats
and vulnerabilities, making security a moving target. What makes it worse is the fact that
security is almost always an afterthought when it comes to new technologies. As a news release from Ernst & Young put it: “Global companies in a hurry to adopt new technologies and media, leaving threats to security as an afterthought”.
Considering the dynamic nature of the security domain, big data analytics can play a major role
in areas such as malware detection, intrusion detection, multi-factor authentication, etc. Most
organizations today tend to over-compensate with techniques such as multi-factor
authentication to protect themselves and their customers. Security almost always trades off usability, yet usability is almost as important as security, if not more so, for some verticals like ecommerce. For an ecommerce site, each extra step or extra second required to complete a
transaction will negatively impact revenue. Big data and machine learning techniques can be
employed to assess risk by collecting and analyzing various contributing factors such as IP
address, device type, device location, browser, MAC address, ISP, user history, etc. Only if the
risk is high will additional security measures be enforced. This way, usability will be impacted
only for a few transactions that are deemed risky. This article documents and discusses such
examples where big data analytics techniques can be used to tackle some of the difficult
security challenges, like Advanced Persistent Threats (APTs) and the big-ticket breaches plaguing both private and public sectors today.
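As a rough illustration of this idea, here is a minimal sketch of risk-based step-up authentication. The factor weights, threshold, device list, and IP reputation set are all illustrative assumptions, not part of any product described in this article.

```python
# A minimal sketch of risk-based step-up authentication. The factor
# names, weights, and thresholds below are illustrative assumptions,
# not a production risk model.

KNOWN_DEVICES = {("alice", "laptop-7f3a")}   # hypothetical user-device history
BLACKLISTED_IPS = {"203.0.113.5"}            # hypothetical IP reputation feed

def risk_score(user, device_id, ip, country, usual_countries):
    """Combine contributing factors into a single risk score in [0, 1]."""
    score = 0.0
    if (user, device_id) not in KNOWN_DEVICES:
        score += 0.4                          # unfamiliar device
    if ip in BLACKLISTED_IPS:
        score += 0.4                          # bad IP reputation
    if country not in usual_countries:
        score += 0.2                          # unusual access location
    return min(score, 1.0)

def authenticate(user, device_id, ip, country, usual_countries):
    # Step up to a second factor only for risky attempts, so most
    # users keep a friction-free single-factor login.
    if risk_score(user, device_id, ip, country, usual_countries) >= 0.5:
        return "require second factor"
    return "allow with password only"

print(authenticate("alice", "laptop-7f3a", "198.51.100.9", "IN", {"IN"}))
print(authenticate("alice", "tablet-new", "203.0.113.5", "NG", {"IN"}))
```

In practice the score would come from a model trained over far more factors, but the control flow, stepping up security only when risk is high, is the point.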
Most techniques used today for big data analytics, like machine learning, statistics, predictive analysis, and behavioral analysis, have been around for a long time. Traditionally, these techniques were used on structured data sets ranging from a few MBs to a few GBs. Today they can handle far bigger volumes, up to petabytes of both structured and unstructured data. Drivers for this rapid change are:
- Hadoop and the tools built around it that can handle all three Vs of big data: volume, velocity, and variety.
- NoSQL databases that can store massive amounts of data and retrieve them at breakneck speed.
- Decreasing cost of storage and compute.
- Cloud computing technologies that enable easy and elastic access to massive amounts of compute, storage, and network.
Security
Analytics is not new to the world of security. If you think about it, intrusion and fraud detection systems have been using analytics for a long time. But these traditional systems employ analytics in a limited way:
- They collect very limited data due to storage or relational database management system (RDBMS) restrictions.
- Data is deleted after a fixed retention period because of storage or RDBMS capacity or performance restrictions.
- The amount of processing involved in decision making is limited, since these systems are expected to be non-intrusive and add minimal overhead.
If you look at the limitations of these traditional systems and the reasons behind them, the restrictions seem artificial once you factor in big data technologies. Hadoop and NoSQL databases can remove most of these storage and processing constraints. Add real-time stream processing to the mix and we can build systems with nearly limitless capabilities.
In this section, we discuss some of the limitations of the traditional security solutions and how
big data tools can be utilized to alleviate them.
Intrusion detection
Intrusion detection systems (IDSs) monitor network, network node, or host traffic and flag intrusions. They use either statistical anomaly-based or signature-based techniques for detection.
Signature-based techniques monitor network packets and traffic patterns and compare them against a set of signatures created from known threats and exploits. Big data tools may not add a lot of value to this technique, except that they can improve pattern matching speeds and increase the capacity of signature databases.
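For concreteness, here is a toy sketch of the signature-matching idea. The signatures and payload are made up; real IDSs use compiled rule sets and multi-pattern matching algorithms rather than naive substring scans.

```python
# A toy illustration of signature-based matching: scan a packet payload
# against known byte signatures. Both signatures below are invented
# placeholders for the purpose of this sketch.

SIGNATURES = {
    "test-string marker": b"X5O!P%@AP",
    "fake exploit marker": b"\x90\x90\x90\x90\xcc",
}

def match_signatures(payload: bytes):
    """Return the names of all signatures found in the payload."""
    return [name for name, sig in SIGNATURES.items() if sig in payload]

hits = match_signatures(b"GET / HTTP/1.1\r\n" + b"\x90\x90\x90\x90\xcc")
print(hits)  # ['fake exploit marker']
```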
On the other hand, the statistical anomaly-based technique compares network traffic with a baseline and flags major deviations from that baseline as intrusions. In my opinion, this is the better of the two techniques for the simple reason that it is adaptive, and here big data tools can play a huge role. They can facilitate the collection and extended retention of more data per network packet, along with traffic pattern and monitoring information. The anomaly-based technique is only as good as the baseline it compares against; to be effective, the baseline has to be dynamic and contextual. The baseline can differ with context, for example:
- In organizations operational mostly during the day, network traffic is lower at night.
- Traffic to the Automated Clearing House (ACH) handling module increases during the time of day when ACH transactions are processed.
- There should be no daytime network traffic from the laptop of a user who works the night shift.
- Traffic to a bank can surge on the last day of the month, when salaries are credited.
Most of the information required to build the contextual baseline, like employee shift timings, the ACH processing window, etc., is already available in different systems, but it is not leveraged yet. Big data techniques like real-time stream processing can pull data from such disparate sources and build contextual baselines at great speed. Add some supervised or semi-supervised learning to the mix and the utility of the IDS improves significantly. It can reduce the false alarm rates that plague IDSs today and, even when there are false alarms, it can learn specifics from the analyst and use them to fine-tune the baseline. It can also better combat IDS evasion techniques like low-bandwidth attacks, fragmentation, etc.
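A minimal sketch of such a contextual baseline follows, assuming hourly traffic-volume buckets and a simple z-score test; a real system would baseline many more dimensions than volume alone.

```python
# A minimal sketch of statistical anomaly detection against a contextual
# baseline: traffic volume is baselined per hour of day, and a large
# z-score is flagged. The threshold and sample data are illustrative.

import statistics
from collections import defaultdict

class HourlyBaseline:
    def __init__(self, threshold=3.0):
        self.history = defaultdict(list)   # hour of day -> observed volumes
        self.threshold = threshold

    def observe(self, hour, volume):
        self.history[hour].append(volume)

    def is_anomalous(self, hour, volume):
        samples = self.history[hour]
        if len(samples) < 10:              # not enough context yet
            return False
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples) or 1e-9   # avoid divide-by-zero
        return abs(volume - mean) / stdev > self.threshold

baseline = HourlyBaseline()
for day in range(30):                      # 30 days of quiet 2 a.m. traffic
    baseline.observe(2, 100 + day % 5)
print(baseline.is_anomalous(2, 104))       # False: within the usual range
print(baseline.is_anomalous(2, 5000))      # True: night-time surge
```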
Remote banking fraud detection
Fraud detection for remote banking can assess the risk of each access or transaction from factors such as:
- User-to-device mapping – maintain a list of devices each user employs to access their account.
- Device profile – whether the device is a public access computer, has no anti-virus protection, etc.
- Source IP address – is the IP blacklisted or is it a proxy IP?
- Location of access – e.g., if the access is from a high-fraud-risk country like Nigeria.
- User behavior – time of day, last access location.
- Payee profile – is the payee of a transfer known to have previously benefited from fraud?
- User's risk appetite – credit rating, casino statistics, traffic violations, law violation history. The thought behind listing these as factors is that a risky transaction from a person who is accustomed to taking risks may not really be a fraudulent transaction.
I was letting my imagination run amok; some of these facts cannot be collected in a secure and trustworthy fashion, and others are out of bounds for banks due to laws and regulations. However, suppose there were a framework for organizations to share data about users and devices. Imagine how much the following could contribute to the accuracy of fraud detection:
- a user's travel itinerary from travel sites like Agoda and Expedia
- the types of sites a user frequents, which Facebook and Google track
- the devices on which anti-virus products are installed, from the anti-virus vendors
Solutions available in this space today already use some of these factors. But when it comes to collecting more data, they are limited by the capacity constraints of traditional technologies. Even data retention intervals are short due to capacity constraints. In these solutions, profile building is handled by offline scheduled tasks to limit the overhead of fraud detection, but this increases the time to value of the data. An analyst's turnaround time is typically on the higher side because mining for the required data is a slow and tedious process. NoSQL databases can alleviate the capacity constraints to a large extent and ease the pain of data mining for fraud analysis. Tools like Storm that offer distributed real-time computation can be used to build profiles quickly and on the fly, reducing the time to value of data. Big data visual analytics aids can help fraud analysts derive new actionable insights and push them into the system.
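The sketch below illustrates the on-the-fly profile-building idea in plain Python; in a Storm topology the same per-user state would be kept in bolts distributed across a cluster. The event fields and the simple risk rule are illustrative assumptions.

```python
# A sketch of building user profiles on the fly, in the spirit of what
# a distributed stream-processing topology would compute. The single-
# process implementation and event schema are simplifying assumptions.

from collections import defaultdict

profiles = defaultdict(lambda: {"devices": set(), "countries": set(), "tx": 0})

def on_event(event):
    """Update the user's profile and score the event against it, in one pass."""
    p = profiles[event["user"]]
    risky = (p["tx"] > 0 and                      # skip the first-ever event
             (event["device"] not in p["devices"] or
              event["country"] not in p["countries"]))
    p["devices"].add(event["device"])             # profile grows as events stream in
    p["countries"].add(event["country"])
    p["tx"] += 1
    return "review" if risky else "ok"

stream = [
    {"user": "bob", "device": "d1", "country": "IN"},
    {"user": "bob", "device": "d1", "country": "IN"},
    {"user": "bob", "device": "d9", "country": "NG"},
]
for ev in stream:
    print(on_event(ev))   # ok, ok, review
```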
The traditional techniques are also weak at identifying sophisticated attacks like APTs and steganography.
APT – CSA's “Big Data Analytics for Security Intelligence” paper defines it as follows: “An Advanced
Persistent Threat (APT) is a targeted attack against a high-value asset or a physical system. In
contrast to mass-spreading malware, such as worms, viruses, and Trojans, APT attackers
operate in “low-and-slow” mode. “Low mode” maintains a low profile in the networks and “slow
mode” allows for long execution time. APT attackers often leverage stolen user credentials or
zero-day exploits to avoid triggering alerts. As such, this type of attack can take place over an
extended period of time while the victim organization remains oblivious to the intrusion.”
An intelligence-based approach to monitoring, aided by big data technologies, can address all these limitations of traditional systems. To start with, freed from capacity constraints, monitoring tools can gather all network packets, logs, etc. instead of focusing only on critical and problem areas. They can engage in deeper and more complex packet inspection and log analysis by leveraging scalable, parallel big data processing. Visual analytics can provide comprehensive network visibility to the network security administrator, highlight areas that deviate from the usual pattern, and facilitate quick drill-down and roll-up that aid faster identification of threats. Additionally, such a system could spot stealthy techniques like APT by identifying many minor deviations or intrusions from the same user or device, weaving them together, and flagging them as a whole.
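As a sketch of that last idea, the snippet below accumulates individually minor anomalies per user or device over a long window and flags the entity once the combined score crosses a threshold. The window length, severities, and threshold are illustrative assumptions.

```python
# A minimal sketch of "low-and-slow" APT spotting: individually minor
# anomalies are accumulated per entity over a long window, and the
# entity is flagged when the combined score crosses a threshold.

from collections import defaultdict, deque

WINDOW_DAYS = 90        # illustrative long retention window
THRESHOLD = 5.0         # illustrative combined-severity threshold

events = defaultdict(deque)   # entity -> deque of (day, severity)

def record_anomaly(entity, day, severity):
    q = events[entity]
    q.append((day, severity))
    while q and q[0][0] < day - WINDOW_DAYS:   # expire events outside the window
        q.popleft()
    total = sum(sev for _, sev in q)
    return total >= THRESHOLD                  # flag the entity as a whole

# One minor blip per week never trips a per-day alert, but woven
# together over 90 days the same host is flagged.
for day in range(0, 91, 7):
    flagged = record_anomaly("host-42", day, 0.5)
print(flagged)   # True by the end of the window
```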
In a blog post, Bruce Schneier states, “Security is a combination of protection, detection, and response.” In the same post he stresses the importance of incident response and how speed is of the essence when responding to an incident. The ability to see all alerts in a centralized security management console, with drill-down capabilities to quickly wade through the specifics, would help in this regard, as would the ability to quickly reconstruct and view the activities of a suspect user or device.
Consider some of the security controls we put up with every day:
- The multiple times we are expected to enter passwords throughout the day.
- Applications that expect us to log in with more than one credential (multi-factor or step-up authentication) to augment security.
- The long queues in airports, malls, and other public places for security clearance.
- The procedures call centers employ to verify your identity before they actually resolve or address your concerns or complaints. Not only are these procedures frustrating for us, because we need to spend a lot of time on the phone even for minor clarifications, they are a significant cost for the call centers too.
These controls are in place because of the inefficiencies and limitations of current technologies in determining risk accurately. The multiple-login problem can be addressed by calculating risk based on some of the factors listed in the “Remote banking fraud detection” section and stepping up security controls based on that risk. Couple this technique with single sign-on (SSO) technologies and we would encounter the often boring and at times frustrating login screens far less often.
At airports, big data tools and technologies can be leveraged to build a person's risk profile in real time and make it available to security officers. Risk profiles can be built by pulling already available data from various sources, and could also factor in vitals like blood pressure and heart rate, and any variation in them as the person approaches the turnstile or the security officer.
Challenges
Although big data technologies hold the key to many of our problems and to future innovations, there are plenty of challenges that have to be addressed quickly. The biggest challenge and concern is that big data technologies can erode privacy. The greatest strength of big data technologies is their ability to locate a needle in a haystack; the same capability can just as easily be used to single out individuals and expose details they would rather keep private.
Privacy
In the interest of keeping the arguments balanced, I will first touch upon the virtues of big data and the role it will play in the evolution of the human race. I will then discuss the privacy erosion that can be caused by big data tools and by the data harvested for big data analytics.
Data is raw and insights are derived by analyzing patterns and structures in the data. The
insights of today become data points of tomorrow from which new insights can be gained. For
example, Newton's laws of motion were insights when he first formulated them, but today they are data points. These iterations of data to insights, and insights becoming data points for the next iteration, are drivers of technological progress. Newton expressed a similar thought very well when he said, “If I have seen further it is by standing on the shoulders of giants.” Big data
technologies are set to accelerate these iterations that drive technology advancements. Even
though most traditional analytical approaches were good at mining data and finding answers to
specific questions, big data technologies can lead to unexpected insights that were not even
being sought in the first place. It is this groundbreaking capability of big data technology that
helped Walmart figure out that there is usually a lot of demand for Strawberry Pop Tarts before
a hurricane. If not for these techniques, what were the chances of discovering this insight? It is this remarkable capability of big data technologies that sets them apart and can drive radical, disruptive innovations even in fields like automobile engineering, which hasn't seen radical innovation in decades. Taken further, in the field of medicine, it could surface many non-obvious biomarkers with high sensitivity and specificity.
Clearly, an aspect of big data technologies that many of us fear is that it can erode personal
privacy. Is this fear unfounded?
Almost all new-age Internet companies consume a lot of user data. Websites track our activities on their sites, but they do not stop there; they track us even when we are on other sites. Such companies discover, and already know, a lot about us. Sometimes they know things about us of which even we are not aware. Such data gives them a lot of power over us. Everyone has skeletons in their closet, and such data can be used to know which closets hold which skeletons. If the people at the helm of such organizations decide to misuse it, the data could even bring heads of state to their knees. This is a lot of power. And, as the saying goes, “Absolute power corrupts absolutely.”
If you have heard about the Sony hack, there is one big takeaway for cyber criminals: governments and heads of state cannot be blackmailed, because most countries have strict non-negotiation policies with terrorists, criminals, and blackmailers. But private companies can give in to blackmail very easily because, in the private sector, decisions are typically made based on cost analysis.
This is why I feel we should not let the private sector self-regulate in this respect. Governments should play an active role in framing new policies and regulations on how data should be safeguarded. To weather rapid changes in technology, policies and regulations should not be stated in terms of tailored solutions for different technology tracks and sub-tracks; instead, they should be stated in terms of intended outcomes [PCAST].
Governments should also set their own houses in order. It is well known that governments track a lot of our activity, and they themselves do not have comprehensive checks and controls to prevent misuse. There are instances where government employees have misused this data to spy on their friends. Governments and government agencies also go against the basic privacy principle of using data only for the purposes for which it was collected, by forcing firms to share data.
Widespread tracking and surveillance by governments and private organizations, without respect for geographical and jurisdictional boundaries, is giving rise to self-censorship all over the world. Governments would be wise to start rebuilding their credibility by bringing in regulations to safeguard the surveillance data they collect, and strict laws that criminalize the misuse of such data.
Even data protection solutions like backup and recovery are not very comprehensive and lack options when it comes to big data tools. These gaps should be plugged quickly if big data technologies are to be used for security purposes; if not, they will become a serious and obvious handicap.
Conclusion
One of Melvin Kranzberg's six laws of technology states, “Technology is neither good nor bad; nor is it
neutral”. This is true even for big data technologies. Big data technology is already here and
most of us are aware of its big value proposition. I believe most of the debate today is about whether we should allow or stop the collection of certain types of data that could invade privacy. However, I think at this juncture it makes more sense to debate regulations, techniques, and solutions that address this challenge without reducing the efficacy of big data technologies. If big data technologies are to be leveraged effectively for security solutions, there should be more focus on building effective and granular security controls for the tools, and on solutions to ensure the trustworthiness of the input data.
References
[Cisco Blog] Thomas Barnett, Jr., “The Dawn of the Zettabyte Era [INFOGRAPHIC]”, June 23, 2011. Retrieved from http://blogs.cisco.com/news/the-dawn-of-the-zettabyte-era-infographic on 14-Jan-2015.
[Ngrain article] Ngrain, “3 reasons why “visualization” is the biggest “V” for big data”. Retrieved from http://blogs.cisco.com/news/the-dawn-of-the-zettabyte-era-infographic on 14-Jan-2015.
[Bryant, Katz, & Lazowska, 2008] R. Bryant, R. Katz & E. Lazowska, “Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society”, December 2008. Retrieved from http://www.cra.org/ccc/files/docs/init/Big_Data.pdf on 3-Jan-2015.