Big Data - Concepts, Applications, Challenges and Future Scope


ISSN (Online) 2278-1021

IJARCCE ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering


Vol. 5, Issue 2, February 2016

Big Data – Concepts, Applications, Challenges and Future Scope
Samiddha Mukherjee1, Ravi Shaw2
Information Technology, Institute of Engineering & Management, Kolkata, India 1, 2

Abstract: The term ‘Big Data’ has been coined to refer to the gargantuan bulk of data that cannot be dealt with by
traditional data-handling techniques. Big Data is still a novel concept, and in the following literature we intend to
elaborate it in a palpable fashion. The paper commences with the concept of the subject itself, along with its properties
and the two general approaches to dealing with it. The study then goes on to elucidate the applications of Big Data
across diverse aspects of the economy and everyday life. The utilization of Big Data Analytics after integrating it with
digital capabilities to secure business growth, and its visualization to make it comprehensible to technically untrained
business analysts, is discussed in depth. Aside from this, the incorporation of Big Data to improve population health,
for the betterment of finance, the telecom industry and the food industry, and for fraud detection and sentiment
analysis has been delineated. The challenges hindering the growth of Big Data Analytics are accounted for in depth.
This topic has been segregated into two arenas: the practical challenges faced and the theoretical challenges. The
hurdles of securing the data and democratizing it are elaborated, amongst several others such as the inability to find
sound data professionals in the required numbers and software able to process data at high velocity. Throughout the
article, the authors intend to decipher these notions in an intelligible manner, embodying several use-cases and
illustrations in the text.

Keywords: Big Data, 3 V’s, Sentiment Analysis, Data Visualization, Integration, Data Democratization, Encryption.

I. CONCEPTS
“Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few.” [1] Such a colossal amount of data, produced continuously, is what has been coined Big Data. Big Data decodes previously untouched data to derive new insight that gets integrated into business operations. However, as the amount of data increases exponentially, the current techniques are becoming obsolete. Dealing with Big Data requires comprehensive coding skills, domain knowledge and statistics.

Despite being Herculean in nature, Big Data applications are almost ubiquitous, from marketing to scientific research to customer interests and so on. We can witness Big Data in action almost everywhere today: from Facebook, which handles over 40 billion photos from its user base, to CERN’s Large Hadron Collider (LHC), which generates 15 PB a year, to Walmart, which handles more than 1 billion customer transactions in an hour. Over a year ago, the World Bank organized the first WBG Big Data Innovation Challenge, which brought forward several unique ideas applying Big Data, such as big data to predict poverty, for climate-smart agriculture, and for user-focused identification of road infrastructure condition and safety [2].

Big Data can be simply defined by explaining the 3 V’s (volume, velocity and variety), which are the driving dimensions of Big Data quantification. Gartner analyst Doug Laney [3] introduced the famous 3 V’s concept in his 2001 META Group publication, ‘3D Data Management: Controlling Data Volume, Variety and Velocity’.

Image-1: Schematic representation of the 3 V’s [4] of Big Data

a. Volume: This essentially concerns the large quantities of data that are generated continuously. Initially, storing such data was problematic because of high storage costs. With decreasing storage costs, this problem has been kept somewhat at bay for now. However, this is only a temporary solution and better technology needs to be developed. Smartphones, e-commerce and social networking websites are examples where massive amounts of data are being generated. This data can be easily distinguished into structured data, unstructured data and semi-structured data.

b. Velocity: In what now seems like prehistoric times, data was processed in batches. However this technique is

Copyright to IJARCCE DOI 10.17148/IJARCCE.2016.5215 66


only feasible when the incoming data rate is slower than the batch processing rate and the delay is not much of a hindrance. At present, the speed at which such colossal amounts of data are being generated is unbelievably high. Take Facebook [5] for example: it generates 2.7 billion like actions per day and 300 million photos, amongst others, roughly amounting to 2.5 billion pieces of content each day, while Google now processes over 1.2 trillion searches per year worldwide [6].

c. Variety: From documents to databases to Excel tables to pictures, videos and audio in hundreds of formats, data is now losing structure. Structure can no longer be imposed like before for the analysis of data. Data generated can be of any type: structured, semi-structured or unstructured. The conventional form of data is structured data, for example text. Unstructured data can be generated from social networking sites, sensors and satellites.

Implementing Big Data is a mammoth task given the large volume, velocity and variety. “Big Data” is a term encompassing the use of techniques to capture, process, analyze and visualize potentially large datasets in a reasonable timeframe not accessible to standard IT technologies. By extension, the platform, tools and software used for this purpose are collectively called “Big Data technologies”. [7] Currently, the most commonly implemented technology is Hadoop. Hadoop is the culmination of several other technologies such as the Hadoop Distributed File System (HDFS), Pig, Hive and HBase. However, even Hadoop and other existing techniques will be highly incapable of dealing with the complexities of Big Data in the near future. The following are a few cases where standard processing approaches to problems will fail due to Big Data:
• Large Synoptic Survey Telescope (LSST): “Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.” [8]
• There is a corollary to Parkinson’s Law that states: “Data expands to fill the space available for storage.” [9] This is no longer true since the data being generated will soon exceed all available storage space. [10][8]
• 72 hours of video are uploaded to YouTube every minute. [11]

There are at present two general approaches to big data:
a. Divide and Conquer using Hadoop: The huge data set is broken into smaller parts and processed in a parallel fashion using many servers.
b. Brute Force using technology on the likes of SAP HANA: One very powerful server with massive storage is used to compress the data set into a single unit.

II. APPLICATIONS
Big Data is slowly becoming ubiquitous. Every arena of business, health or general living standards can now implement big data analytics. Put simply, Big Data is a field which can be used in any zone whatsoever, given that this large quantity of data can be harnessed to one’s advantage. The major applications of Big Data are listed below.

• The Third Eye - Data Visualization
Organizations worldwide are slowly but steadily recognizing the importance of big data analytics. From predicting customer purchasing behavior patterns, to influencing customers to make purchases, to detecting fraud and misuse, which until very recently used to be an incomprehensible task for most companies, big data analytics is a one-stop solution. Business experts should have the opportunity to question and interpret data according to their business requirements, irrespective of the complexity and volume of the data. In order to achieve this, data scientists need to efficiently visualize and present this data in a comprehensible manner. Giants like Google, Facebook, Twitter, eBay and Walmart have adopted data visualization to ease the complexity of handling data. Data visualization has shown immensely positive outcomes in such business organizations. By implementing data analytics and data visualization, enterprises can finally begin to tap into the immense potential that Big Data possesses and ensure greater return on investment and business stability.

• Integration - An exigency of the 21st century
Integrating digital capabilities into the decision-making of an organization is transforming enterprises. By transforming their processes, such companies are developing the agility, flexibility and precision that enable new growth. Gartner described the confluence of mobile devices, social networks, cloud services and big data analytics as the nexus of forces. Using social and mobile technologies to alter the way people connect and interact with organizations, and incorporating big data analytics in this process, is proving to be a boon for the organizations implementing it. Using this concept, enterprises are finding ways to leverage data better, either to increase revenues or to cut costs, even if most of the effort is still focused on customer-centric outcomes. Although such customer-centric objectives may still be the primary concern of most companies, there is a gradual shift towards integrating big data technologies into background operations and internal processes.

Image-2: Analysis as generated by the IBM Institute for Business Value 2014 Analytics Study

• Big Data in Healthcare:
Healthcare is one of those arenas in which Big Data ought
to have the maximum social impact. Right from the diagnosis of potential health hazards in an individual to complex medical research, big data is present in all aspects of it [12]. Devices such as the Fitbit [13], Jawbone [14] and the Samsung Gear Fit [15] allow the user to track and upload data. Soon enough such data will be compiled and made available to doctors, which will aid them in diagnosis. Several partnerships like the Pittsburgh Health Data Alliance have been established. The Pittsburgh Health Data Alliance [16] is a collaboration of Carnegie Mellon University, the University of Pittsburgh and UPMC. On their website, they state [16], “The health care field generates an enormous amount of data every day. There is a need, and opportunity, to mine this data and provide it to the medical researchers and practitioners who can put it to work in real life, to benefit real people… The solutions we develop will be focused on preventing the onset of disease, improving diagnosis and enhancing quality of care… Further, there is the potential to lower health care costs, one of the greatest challenges facing our nation. And the Alliance will also drive economic growth in Pittsburgh, attracting hundreds of companies and entrepreneurs, and generating thousands of jobs, from around the world…” The patient’s diagnosis will be analyzed and compared with the symptoms of others to discover patterns and ensure better treatment. IBM [17] has taken large-scale initiatives to implement big data in healthcare systems, be it in its collaboration with healthcare giant Fletcher Allen or with the Premier healthcare alliance, to change the way unstructured but useful clinical data is made available to more medical practitioners so as to improve population health. Big Data can also be used in major clinical trials, such as those seeking cures for various forms of cancer, and in developing tailor-made medicines [12] for individual patients according to their genetic makeup. To summarize, Sundar Ram of Oracle stated [18], “Big Data solutions can help the industry acquire organize & analyze this data to optimize resource allocation, plug inefficiencies, reduce cost of treatment, improve access to healthcare & advance medicinal research.”

• Big Data and the World of Finance:
Big Data can be a very useful tool in analyzing incredibly complex stock market moves and can aid in making global financial decisions. For example, intelligent and extensive analysis of the big data available on Google Trends can aid in forecasting the stock market. Though this is not a fool-proof method, it is definitely an advancement in the field. A research study [19] by the Warwick Business School drew on records from Google, Wikipedia and Amazon Mechanical Turk for the period 2004-2012 and analyzed the link between Internet searches on politics or business and stock market moves. In the paper, the author states, “We draw on data from Google and Wikipedia, as well as Amazon Mechanical Turk. Our results are in line with the intriguing possibility that changes in online information-gathering behavior relating to both politics and business were historically linked to subsequent stock market moves… Our results provide evidence that for complex events such as large financial market moves, valuable information may be contained in search engine data for keywords with less-obvious semantic connections to the event in question. Overall, we find that increases in searches for information about political issues and business tended to be followed by stock market falls.”

Big Data is also being implemented in a field called ‘Quantitative Investing’ [20], where data scientists with negligible financial training are trying to incorporate computing power into predicting securities prices by drawing ideas from sources like newswires, earnings reports, weather bulletins, Facebook and Twitter.

Image-3: Wall Street Journal [20] summarizes the above concept.

One very interesting avenue of using Big Data in finance is sentiment extraction [21] from news articles. Market sentiment refers to the irrational belief of investors about cash-flow returns [22]. Heston and Sinha’s application of a machine learning algorithm [23] provides the probability of an article being ‘positive’, ‘negative’ or ‘neutral’, alongside two other popular methods, one of which uses the Harvard IV Dictionary.

In general, big data is set to revolutionize the landscape of finance and the economy. Several financial institutions are adopting big data policies in order to gain a competitive edge. Complex algorithms are being developed to execute trades using all the structured and unstructured data gained from these sources. The methods adopted so far have not been completely adept; however, extensive research ensures a growing dependence of stock markets, financial organizations and economies on big data analytics.

• Big Data in Fraud Detection:
Forensic Data Analytics or FDA has been an intriguing
area of interest in the past decade. However, very few companies are actually using FDA to mine big data. The reasons [24] for this unfortunate situation vary from a deficit of expertise and awareness and the difficulty of developing the right tools to mine big data, to a lack of appropriate technology and an inability to handle such humongous quantities of data. Ernst & Young undertook the Global Forensic Data Analytics Survey [25] in 2014 and found that, “Our survey finds that 42% of companies with revenues between US$100 million to US$1 billion are reviewing less than 10,000 records. And 71% companies with more than US$1 billion in sales report examining just one million records or fewer… Companies know there are high risk numbers in book entries, such as round thousands or duplicates, but they’re only just starting to analyze descriptions for those book entries. Looking at both the numbers and words can mean the difference between uncovering fraud, and falling victim to it.” The combination of appropriate data and big data analytics can help combat fraudulent activities.

Though several companies are mining big data for this purpose, there are still limitations [26] in their approach. They are either keeping the data siloed, limiting the analysis that can be performed, or only taking into consideration the structured data, thus obtaining only a subset of the information. A more holistic approach to the implementation of big data analytics is required. Companies such as Pactera [27] are developing solutions which will process massive amounts of structured and unstructured data and develop varied models and algorithms to find patterns of fraud and anomalies and predict customer behavior.

A 10-step approach has been suggested by Infosys [28] to implement analytics for fraud detection:
1. Perform a SWOT analysis of existing fraud detection paradigms.
2. Assign a dedicated fraud management team.
3. Develop or purchase appropriate data analytics software.
4. Integrate siloed data and clear inefficiencies in the processes.
5. Establish rules relevant to business obligations.
6. Determine thresholds for detection of errors or discrepancies.
7. Implement predictive analysis to determine potential discrepancies and frauds.
8. Use Social Network Analysis or SNA to determine fraudulent activities.
9. Develop an integrated case management system.
10. Continue with extensive research to integrate existing systems of fraud detection with newly developed techniques.

• Big Data and Sentiment Analysis:
Sentiment Analysis is by far the most extensively used application of big data. Presently, zillions of conversations are occurring on social media, which when harnessed to one’s advantage can aid any company in determining new patterns, protecting its brand image and segmenting its consumer base to improve product marketing and the overall customer experience. Several giants are presently developing tools for efficient sentiment analysis.

IBM has developed IBM Social Media Analytics [29], which is a powerful SaaS solution. It captures structured and unstructured data from social networking sites to develop a comprehensive understanding of attitudes, opinions and trends. It then applies tools of predictive analysis to determine customer behavior and improve customer experience. This can aid a company in creating personalized campaigns and promotions to increase its consumer base. IBM has presented its framework as follows:

Image-4: IBM’s Social Media Analytics [29] framework

Similarly, SAP has developed a SAP HANA based application known as Social Contact Intelligence [30], which monitors and develops insights from social media in real time and determines the primary influencers, thus identifying new opportunities and improving overall customer satisfaction.

• Big Data and the Food Industry:
The impact of Big Data on the food industry [31] is increasing exponentially. Be it for tracking the quality of products, presenting recommendations to the customer or developing marketing strategies for better customer experience, the presence of Big Data analytics in the food industry is slowly becoming ubiquitous.

IBM collaborated with The Cheesecake Factory to analyze structured data such as a restaurant’s location and unstructured data such as flavours to increase customer satisfaction. A news article [32] stated, “N2N has teamed up with IBM to provide The Cheesecake Factory with a technology that can communicate critical supply chain data instantly, so thousands of food items won’t need to be recalled and tested. Nardone said they have initiated a conversation with the Centers for Disease Control and Prevention, as it may be easier to track the culprit if a food-related scandal occurs.”

Similarly, apps such as Food Genius [33] apply big data to predict specific recommendations for customers. The company accumulates menu-level data parsed with ingredients, preparation methods, spices etc. and then analyzes it against individual customer preferences to determine trends and help food giants devise marketing strategies. Companies such as Starbucks, Domino’s and Subway take advantage of big data analytics [31] to track individual customer preferences and present customers with personalized offers so as to increase their customer base and improve customer satisfaction.
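
Returning to the fraud detection discussion earlier in this section, the two high-risk patterns named in the Ernst & Young quote (round thousands and duplicates in book entries) can be illustrated with a minimal Python sketch. The ledger layout, the entry identifiers and the function name `flag_entries` are assumptions made purely for illustration; they are not taken from the survey or from any particular analytics product, and a real forensic system would work at far larger scale and with many more rules.

```python
from collections import Counter
from decimal import Decimal


def flag_entries(entries, round_unit=Decimal("1000")):
    """Return {entry_id: [reasons]} for book entries that look high-risk.

    entries: iterable of (entry_id, amount, description) tuples (hypothetical layout).
    """
    entries = list(entries)
    # Count each (amount, normalized description) pair so repeats can be spotted.
    seen = Counter((amount, desc.strip().lower()) for _, amount, desc in entries)
    flags = {}
    for entry_id, amount, desc in entries:
        reasons = []
        # A non-zero amount that is an exact multiple of 1000 is a "round thousand".
        if amount != 0 and amount % round_unit == 0:
            reasons.append("round thousand")
        # The same amount with the same description appearing more than once.
        if seen[(amount, desc.strip().lower())] > 1:
            reasons.append("duplicate")
        if reasons:
            flags[entry_id] = reasons
    return flags


# Made-up sample ledger, for illustration only.
ledger = [
    ("J-001", Decimal("5000.00"), "Consulting fee"),
    ("J-002", Decimal("1249.37"), "Office supplies"),
    ("J-003", Decimal("1249.37"), "office supplies "),
    ("J-004", Decimal("812.10"), "Travel"),
]
flagged = flag_entries(ledger)
print(flagged)
```

Descriptions are normalized (trimmed and lower-cased) before comparison, which is a small nod to the survey's point that the words attached to an entry matter as well as the numbers.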

• Big Data for the Telecom Industry
In order to improve customer service and satisfaction, concepts of Big Data and Machine Learning are being progressively implemented. Call detail records, web and customer service logs, emails, social media, and geospatial and weather data are a few examples of the data accessible to telecom operators. Handling such massive amounts of data can be a daunting task. Developing deep insights with the aid of Machine Learning running on Apache Hadoop can help operators to economically take advantage of the ever-increasing datasets so as to enhance their quality of service and customer experience, to increase the customer base with ad targeting and promotions, and to reduce operational costs. The benefits of using such technologies are immense. Predictive maintenance ensures that operational disruptions are predicted, prevented and recovered from. Real-time processed data can be used to dynamically allocate bandwidth to reduce congestion and outages.

III. OBSTACLES IN BIG DATA IMPLEMENTATION

In the 1990s Big Data became a hyped topic of interest in the world of distributed systems [34], driven by the rapidly increasing impact of the World Wide Web and the exponential growth of its content. None of the then available resources were sufficient or cost-efficient [35] to handle this task. At the turn of the millennium, in response to this issue, Google created the Google File System (GFS) [36], which provided clients with OS-level byte stream operations [35] on data spanning several machines in clusters using rather inexpensive hardware. Later, Google developed the MapReduce paradigm [34], which resembled the partitioned parallelism used in shared-nothing parallel query processing. Following this trend [37], multi-national giants like Yahoo and Facebook developed their own software. Yahoo! developed Pig, while Hive was developed by Facebook [38], Jaql by IBM [39] and Dryad and Scope by Microsoft [40].

Practical Challenges facing Big Data
Despite the extensive hype around Big Data in the industry today, very few companies have actually been able to implement the concept of Big Data. A survey published in 2013 by SAS (‘2013 Big Data Survey Research Brief’) analyzed the reasons why most industries are still delaying or refusing to pursue a big data strategy. It states [41], “A little more than one-fifth of the respondents are still trying to learn more about big data, while others are still trying to understand the benefits of big data. Even though the industry has written countless articles, blogs and white papers about big data, there is still a significant contingent of data management professionals trying to understand the basics.”

The obstacles that limit the implementation of big data by any industry are aplenty. The ‘Big Data Talent Gap’ [42], which distinctly exists even though a lot of research has gone into this field in the past decade, is a massive issue. The following visual aid further explains the situation.

Image-5: What is the primary reason your organization is not considering or exploring the use of external big data to help make business decisions? [41]

There are several big data experts; however, most of their expertise is limited to the implementation of one paradigm (usually one using the applications in Hadoop) rather than big data management skills. Most of these data scientists continue to remain oblivious to the practical zones of data handling. A report from 2012 stated the following: “Gartner analysts predicted that by 2015, 4.4 million IT jobs globally will be created to support big data with 1.9 million of those jobs in the United States. … However, while the jobs will be created, there is no assurance that there will be employees to fill those positions.” [43]

At present, a diverse variety of tools is available in the market to implement operational and analytical processing of big data. Most of these are lumped together into a category called NoSQL. A survey held in 2014 [44] summarizes the data management options available.

Image-6: Current adoption of relational database technologies with projected two year growth [44].

Such varied options have created a sense of confusion amongst industry data experts, making it difficult for them to zero in on one particular strategy. Choosing an appropriate Big Data platform is a very complex task given the immeasurable amount of data that needs to be accessed, transmitted and delivered from numerous sources and then accumulated in data sets. Finally, synchronization of such vast quantities of data [42] coming from numerous sources with its originating systems is one massive job, as rampant inconsistencies and
asynchrony in the big data environment can have a disastrous effect.

One of the crucial practical challenges faced by Big Data is its cost. Even though Big Data analytics has been implemented for about a decade now, the cost of storing such humongous amounts of data still remains a matter of serious concern. It is not only the quantity of data, but also the complex processing techniques which make its applications so expensive. An article by Forbes [45] states, “A Petabyte Hadoop cluster will require between 125 and 250 nodes which costs ~$1 million. The cost of a supported Hadoop distribution will have similar annual costs (~$4,000 per node), which is a small fraction of an enterprise data warehouse ($10-$100s of millions).” This tumultuous situation requires that new technologies and algorithms be developed to minimize the financial challenges that face big data analytics today, so that an increasing number of enterprises can implement big data analytics in their regular operations.

Data Democratization: The present business scenario has brought forward several small and medium sized organizations who are trying to harness Big Data. However, not all data can always be accessed. As said by Paul Kent, the vice-president of Big Data with SAS, “So if you’re not Google or LinkedIn or Facebook, and you don’t have thousands of engineers to work with Big Data, it can be difficult to find business answers in the information”. In an IDG Research study, it was discovered that amongst all the organizations who claim to be effective at Big Data analysis, only about 58% have already implemented or are in the process of implementing a data visualization solution, while another 40% have concrete plans to implement one. Tammi Kay George, the manager of R&D Program & Project Management at SAS, concisely summarizes the whole concept: “A crucial element in minimizing the amount of time needed to understand data, visualization tools are imperative [to] realizing the value from a Big Data initiative. When incorporated with approachable analytics capabilities from the onset, organizations are empowered with focus and the ability to reduce the time required to know where opportunities, issues, and risks reside in voluminous data.”

Encryption - Securing Big Data: With such massive amounts of data being generated, ensuring that the data is not put at risk is essential. Such data left unsecured may put organizations or the general human race at risk. Sans the correct security solutions and encryption techniques, Big Data can imply big problems. The characteristics which make Big Data valuable to the market also make it valuable to various anti-social elements like cyber criminals. Encryption techniques are available aplenty. However, they mostly tackle one specific aspect, and this is what makes securing Big Data challenging. To make it easier to understand, consider a transparent encryption technique provided by a certain database vendor. It might be applicable to a particular database but may not be suitable for implementation in a big data platform. There are a few organizations that offer encryption technology implementable on big data. However, most of the time they can only ensure the security of specific big data nodes and do not protect the original data that is fed into the big data platform. With such incompatible approaches to securing Big Data, the IT industry has to make do with fragmented key and policy management, which increases administrative effort and makes it almost impossible to apply policies consistently. Though several large organizations are taking their own initiatives to protect the data that they generate, mass awareness of the implications of unsecured data needs to be raised, and smaller organizations need to step up to ensure that the world is a safe place for the data to reside.

Theoretical Challenges facing Big Data
One of the key sets of challenges [46] faced in today’s tight market is the need to find and analyze the required data at the highest speed possible. However, with an exponentially growing amount of data, speed becomes a major issue, as analyzing such sheer volumes of data in detail to find the required output becomes more and more tedious. It is not only the quantity of data but also discovering the data appropriate to the project which is a Herculean task. Elimination of out-of-context data is an essential objective. Even if in-context data is retrieved at a high speed, the quality of the data may be compromised if it is not accurate or timely. As a result, appropriate results of the project may not be published.

Another zone of challenges involves those relating to the vulnerability and security of Big Data. Breaches of privacy, especially with data relating to individuals and organizations, have been a topic of serious concern. One solution has been to anonymize data by removing identifiers which could be used to pinpoint particular individuals and thus compromise their privacy. However, this has been largely unsuccessful as it is possible to de-anonymize the data. One very popular example of this came out in 2007, when Arvind Narayanan and Vitaly Shmatikov of the University of Texas, Austin re-identified individuals in an anonymized Netflix dataset of movie ratings, built for a data-mining competition, by matching them with people who had given IMDb ratings under their own names. They stated [47], “Our third contribution is a practical analysis of the Netflix Prize dataset, containing anonymized movie ratings of 500,000 Netflix subscribers (section 5). Netflix, the world’s largest online DVD rental service, published this dataset to support the Netflix Prize data mining contest. We demonstrate that an adversary who knows a little bit about some subscriber can easily identify her record if it is present in the dataset, or, at the very least, identify a small set of records which include the subscriber’s record. The adversary’s background knowledge need not be precise, e.g., the dates may only be known to the adversary with a 14-day error, the ratings may be known only approximately, and some of the ratings and dates may even be completely wrong. Because our algorithm is robust, if it uniquely identifies a record in the published dataset, with high probability this
identification is not a false positive.” The confidentiality of data, that is, the assurance that, regardless of whether the anonymity of data is maintained, the data is not visible to anyone beyond the trusted and allowed zone, is also an important aspect. Protecting data so that confidential data is not made available to anyone who is unauthorized is a very complex task, and no concrete solutions have yet been developed in this field.
Organizations dealing with big data need to take this issue in their stride and make sure that data storage locations are heavily protected so that the data is not misused. They could do so by using unique database tables, having dedicated database servers, encrypting the data, having multiple security levels, having separate authentication and authorization modules, and ensuring secure system operations, data transmission and data flow control.

Three key areas of security threats [48] have been identified in the implementation of Big Data using software such as Hadoop: breach of privacy by unauthorized release of data, manipulation of data in the database, and denial of information. In particular, in Hadoop the following areas of threat [49] have been recognized.
• Unauthorized access of an HDFS client via RPC or via HTTP protocols.
• Manipulation of data in a file at a DataNode through the pipeline-streaming data-transfer protocol.
• Adding, deleting, or changing the priority of a job in a queue.
• Unauthorized access to the intermediate data of a Map job via its task tracker’s HTTP shuffle protocol.
• An executing task may use the host operating system interfaces to access other tasks or local data, which includes intermediate Map output or the local storage of the DataNode that runs on the same physical node.
• Masquerading as a Hadoop service component.
• Submitting a workflow to Oozie as another user.

Real-time security [50] or compliance monitoring is also a challenge faced by Big Data analysts. Due to the copious amounts of data involved, the number of alarms triggered by the security devices is so large that several of these alarms tend to be overlooked, as humans cannot cope with the sheer amount [51].
The above challenges faced by Big Data need to be addressed, and solutions to these problems need to be determined, so that industries can start implementing big data analytics in their business strategies.

IV. FUTURE SCOPE AND DEVELOPMENT

Today, Big Data is influencing the IT industry like few technologies have done before. The massive data generated from sensor-enabled machines, mobile devices, cloud computing, social media, and satellites helps different organizations improve their decision making and take their business to another level.
“Big data absolutely has the potential to change the way governments, organizations, and academic institutions conduct business and make discoveries, and it’s likely to change how everyone lives their day-to-day lives,” said Susan Hauser, corporate vice president of Microsoft.
Data is the biggest thing to hit the industry since the personal computer. As mentioned earlier in this paper, data is generated every day at such a rapid rate that traditional databases and other data storage systems will gradually fail at storing, retrieving, and finding relationships among data. Big data technologies have addressed the problems related to this new big data revolution through the use of commodity hardware and distribution. Companies such as Google, Yahoo!, General Electric, Cornerstone, Microsoft, Kaggle, Facebook, and Amazon are investing heavily in Big Data research and projects. IDC estimated the value of the Big Data market to be “about $6.8 billion in 2012 growing almost 40 percent every year to $17 billion by 2015.” By 2017, Wikibon’s Jeff Kelly predicts the Big Data market will top $50 billion. [52]
“Demand is so hot for solutions that all companies are exploring big data strategies. The problem is that the companies lack internal expertise and best practices. The side effect is that there is a services and consulting boom in big data. It’s a perfect storm of product and services,” says Wikibon’s Jeff Kelly.
Recently it was announced that the Indian Prime Minister’s office is using Big Data analytics to understand Indian citizens’ sentiments and ideas through the crowdsourcing platform www.mygov.in and social media, to get a picture of common people’s thoughts and opinions on government actions. [53]
Google is launching the Google Cloud Platform, which enables developers to build a range of products, from simple websites to complex applications. It enables users to launch virtual machines, store huge amounts of data online, and do plenty of other things [54]. Basically, it will be a one-stop platform for cloud-based applications, online gaming, mobile applications, etc. [55]. All of these require huge amounts of data processing, in which Big Data plays an immense role.

The predictions from the IDC FutureScape for Big Data and Analytics are:
1. Visual data discovery tools will grow 2.5 times faster than the rest of the Business Intelligence (BI) market. By 2018, investing in this enabler of end-user self-service will become a requirement for all enterprises.
2. Over the next five years, spending on cloud-based Big Data and analytics (BDA) solutions will grow three times faster than spending for on-premise solutions. Hybrid on/off premise deployments will become a requirement.
3. Shortage of skilled staff will persist. In the U.S. alone there will be 181,000 deep analytics roles in 2018 and five times that many positions requiring related skills in data management and interpretation.
4. By 2017, unified data platform architecture will become the foundation of BDA strategy. The unification will occur across information management, analysis, and search technology.
5. Growth in applications incorporating advanced and predictive analytics, including machine learning, will accelerate in 2015. These apps will grow 65% faster than apps without predictive functionality.
6. 70% of large organizations already purchase external data and 100% will do so by 2019. In parallel, more organizations will begin to monetize their data by selling it or providing value-added content.
7. Adoption of technology to continuously analyze streams of events will accelerate in 2015 as it is applied to Internet of Things (IoT) analytics, which is expected to grow at a five-year compound annual growth rate (CAGR) of 30%.
8. Decision management platforms will expand at a CAGR of 60% through 2019 in response to the need for greater consistency in decision making and for retention of decision-making process knowledge.
9. Rich media (video, audio, image) analytics will at least triple in 2015 and emerge as the key driver for BDA technology investment.
10. By 2018, half of all consumers will interact with services based on cognitive computing on a regular basis. [56]

Big data isn’t new, but it has now reached critical mass as people digitize their lives. “People are walking sensors,” said Nicholas Skytland, project manager at NASA within the Human Adaptation and Countermeasures Division of the Space Life Sciences Directorate [57].
Taking an average of the figures suggested by leading big data market analysts and research firms, it can be concluded that approximately 15 percent of all IT organizations will move to cloud-based service platforms, and that between 2015 and 2021 this service market is expected to grow by about 35 percent.

V. CONCLUSION
This literature survey discusses Big Data from its infancy to its current state. It elaborates on the concepts of big data, followed by its applications and the challenges it faces. Finally, we have discussed the future opportunities that could be harnessed in this field. Big Data is an evolving field, where much of the research is yet to be done. At present, Big Data is largely handled by software such as Hadoop. However, the proliferating amounts of data are making Hadoop insufficient. To harness the potential of Big Data completely in the future, extensive research needs to be carried out and revolutionary technologies need to be developed. To summarise, Peter Sondergaard, Senior Vice President of Gartner Research, famously stated, “Information is the oil of the 21st century and analytics is the combustion engine.”

REFERENCES

[1] Apache Hive. Available at http://hive.apache.org
[2] http://blogs.worldbank.org/voices/meet-winners-and-finalists-first-wbg-big-data-innovation-challenge
[3] http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/
[4] http://www.exist.com/wp-content/uploads/2014/10/3Vsbigdata.png
[5] http://www.internetlivestats.com/twitter-statistics/
[6] http://www.internetlivestats.com/google-search-statistics/
[7] Arthur G. Erdman, Daniel F. Keefe, and Randall Schiestl, “Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation”; IEEE Transactions on Biomedical Engineering, Vol. 60, No. 3, March 2013
[8] http://lsst.org/lsst/google
[9] http://en.wikipedia.org/wiki/Parkinson's_law
[10] http://www.economist.com/node/15557443
[11] http://www.youtube.com/t/press_statistics/?hl=en
[12] http://www.forbes.com/sites/bernardmarr/2015/04/21/how-big-data-is-changing-healthcare/
[13] http://www.forbes.com/sites/bryanpearson/2015/04/10/exercise-in-service-fitbit-omni-channel-begs-for-omni-prescience/
[14] http://www.engadget.com/2015/04/10/jawbone-up3-shipping-april-20th/
[15] http://www.samsung.com/uk/consumer/mobile-devices/wearables/gear/SM-R3500ZKABTU
[16] http://healthdataalliance.com/
[17] http://www.ibm.com/software/data/bigdata/industry-healthcare.html
[18] http://www.firstpost.com/business/big-data-booster-shot-healthcare-industry-needs-2160271.html
[19] Chester Curme, Tobias Preis, Eugene Stanley, Helen Susannah Moat, “Quantifying the semantics of search behavior before stock market moves”; CrossMark, December 2013
[20] http://www.wsj.com/articles/how-computers-trawl-a-sea-of-data-for-stock-picks-1427941801
[21] Nitish Sinha, “Using Big Data in Finance: Example of sentiment-extraction from news articles”; FEDS notes, March 2014
[22] Baker, Malcolm and Jeffrey Wurgler, 2007. “Investor Sentiment in the Stock Market”, Journal of Economic Perspectives, vol. 21(2), pages 129-152.
[23] Heston, Steven L. and Sinha, Nitish Ranjan, 2013. “News versus Sentiment: Comparing Textual Processing Approaches for Predicting Stock Returns”, Robert H. Smith School Research Paper. Available at SSRN: http://ssrn.com/abstract=2311310 or http://dx.doi.org/10.2139/ssrn.2311310
[24] http://www.ey.com/GL/en/Services/Assurance/Fraud-Investigation---Dispute-Services/EY-Global-Forensic-Data-Analytics-Survey-2014
[25] http://www.ey.com/CA/en/Newsroom/News-releases/2014-Global-forensic-data-analytics-survey
[26] http://www.ikanow.com/how-can-i-use-big-data-analytics-for-fraud-detection/
[27] http://www.pactera.com/resources/blog/how-big-data-is-revolutionizing-fraud-detection-in-financial-services/
[28] Ruchi Verma, Sathyan Ramakrishna Mani, “Using Analytics for Insurance Fraud Detection”; FINsights, Infosys, Issue 10
[29] http://www-01.ibm.com/software/analytics/solutions/customer-analytics/social-media-analytics/
[30] http://www.news-sap.com/sentiment-analysis-with-big-data/
[31] https://datafloq.com/read/big-datas-impact-food-industry/96
[32] http://venturebeat.com/2013/03/01/ibm-brings-big-data-tech-to-food-to-prevent-the-next-horse-meat-scandal/
[33] http://www.forbes.com/sites/daniellegould/2012/09/24/food-industry-understand-trends-big-data-tools/
[34] Puneet Singh Duggal, Sanchita Paul, “Big Data Analysis: Challenges and Solutions”, International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV
[35] Marcin Jedyk, Making Big Data, Small: Using distributed systems for processing, analysing and managing large huge data sets, Software Professional's Network, Cheshire Data systems Ltd.
[36] S. Ghemawat, H. Gobioff, and S. Leung, “The Google File System,” in ACM Symposium on Operating Systems Principles, Lake George, NY, Oct 2003, pp. 29-43.
[37] Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Volume 53, Issue 1, January 2010, pp 72-77.
[38] PIG Tutorial, Yahoo Inc., http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html
[39] IBM - What is Jaql, www.ibm.com/software/data/infosphere/hadoop/jaql/
[40] Dryad - Microsoft Research, http://research.microsoft.com/en-us/projects/dryad/
[41] “2013 Big Data Survey Research Brief”, SAS, The power to know, 2013
[42] David Loshin, “Addressing Five Emerging Challenges Of Big Data”; Progress Software, 2014
[43] Eric Lundquist, “Gartner: 2013 Tech Spending To Hit $3.7 Trillion”, October 23, 2012
[44] “2014 Data Connectivity Outlook paper”; Progress Software, January 2014.
[45] http://www.forbes.com/sites/ciocentral/2012/04/16/the-big-cost-of-big-data/
[46] SAS, The power to know, “Five Big Data Challenges And how to overcome them with visual analytics”
[47] Arvind Narayanan, Vitaly Shmatikov, “Robust De-anonymization of Large Sparse Datasets”; The University of Texas at Austin, 2007
[48] Victor L. Voydock and Stephen T. Kent. Security mechanisms in high-level network protocols. ACM Comput. Surv. 15(2):135-171, 1983.
[49] Devaraj Das, Owen O’Malley, Sanjay Radia and Kan Zhang, “Adding security to Apache Hadoop”, Hortonworks Technical Report 1
[50] Disha H. Parekh, Dr. R. Sridaran, “An Analysis of Security Challenges in Cloud Computing”, in International Journal of Advanced Computer Science and Applications, Vol. 4, No. 1, 2013
[51] Rashmi N, Uma K M, Jayalakshmi K, Vinodkumar K P, “Big Data Security Challenges: Dealing with too many issues”; International Journal of Recent Development in Engineering and Technology, Volume 3, Issue 2, August 2014
[52] http://www.forbes.com/sites/siliconangle/2012/02/29/big-data-is-creating-the-future-its-a-50-billion-market/
[53] http://dataconomy.com/indian-government-using-big-data-to-revolutionise-democracy/
[54] https://en.wikipedia.org/wiki/Google_Cloud_Platform
[55] https://cloud.google.com/
[56] http://www.idc.com/getdoc.jsp?containerId=prUS25329114
[57] http://www.zdnet.com/article/30-big-data-project-takeaways/

Copyright to IJARCCE DOI 10.17148/IJARCCE.2016.5215 74
