Big Data (Unit 1)
Digital Data
In the computing world, digital data is a collection of facts that is transmitted and stored in an electronic format and processed by software systems.
Digital data is data that represents other forms of data using specific machine language
systems that can be interpreted by various technologies.
Digital Data is generated by various devices, like desktops, laptops, tablets, mobile
phones, and electronic sensors.
Digital data is stored as strings of binary values (0s and 1s) on a storage medium that’s
either internal or external to the devices generating or accessing the information.
For Example - Whenever you send an email, read a social media post, or take pictures
with your digital camera, you are working with digital data.
Structured Data
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format.
This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search-engine algorithms.
For example, the employee table in a company database is structured: the employee details, job positions, salaries, etc., are present in an organized manner.
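As a small illustration, structured data such as the employee table above can be queried directly with SQL. The sketch below uses Python's built-in sqlite3 module; the table, columns, and values are hypothetical.

# Minimal sketch: storing and querying structured data (a hypothetical
# employee table) with SQL. Uses only Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")   # throw-away in-memory database
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, position TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Analyst", 55000.0), (2, "Ravi", "Manager", 80000.0)],
)

# Because the data sits in a fixed row/column format, a simple query retrieves it.
for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 60000"):
    print(row)                       # -> ('Ravi', 80000.0)
conn.close()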
Unstructured Data
Unstructured data refers to data that lacks any specific form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data.
About 80-90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
For example, to play a video file it is essential that the correct codec (coder-decoder) is available. Unstructured data cannot be directly processed or queried using SQL. Email is an example of unstructured data.
Semi-structured Data
This is data that does not conform to a strict data model but has some structure.
Semi-structured data does not reside in a relational database, yet it has some organizational properties (such as tags or markers) that make it easier to analyze. XML and JSON documents are common examples.
• Data analysis, data analytics, and Big Data originate from the longstanding domain of database management. They rely heavily on the storage, extraction, and optimization techniques that are common for data stored in Relational Database Management Systems (RDBMS).
• Database management and data warehousing are considered the core components of Big Data Phase 1. They provide the foundation of modern data analysis as we know it today, using well-known techniques such as database queries, online analytical processing (OLAP), and standard reporting tools.
• In the early 2000s, the Internet and the Web began to offer unique data collections and data analysis opportunities. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon, and eBay started to analyse customer behaviour by analysing click rates, IP-specific location data, and search logs. This opened a whole new world of possibilities.
• Although web-based unstructured content is still the main focus for many organizations in data analysis, data analytics, and big data, new possibilities to retrieve valuable information are now emerging from mobile devices.
• Mobile devices not only give the possibility to analyze behavioral data (such as clicks
and search queries), but also give the possibility to store and analyze location-based data
(GPS-data). With the advancement of these mobile devices, it is possible to track
movement, analyze physical behavior and even health-related data (number of steps you
take per day). This data provides a whole new range of opportunities, from
transportation, to city design and health care.
The primary benefit of a big data platform is that it reduces the complexity of multiple vendors and solutions into one cohesive solution. Big data platforms are also delivered through the cloud, where the provider offers all-inclusive big data solutions and services. A big data platform should offer:
• The ability to accommodate new applications and tools as business needs evolve.
• Tools for searching through massive data sets.
1. The digitization of society: - Big Data is largely consumer driven and consumer
oriented. Most of the data in the world is generated by consumers, who are nowadays
‘always-on’. Most people now spend 4-6 hours per day consuming and generating data
through a variety of devices and (social) applications. With every click, swipe or
message, new data is created in a database somewhere around the world. Because
everyone now has a smartphone in their pocket, the data creation sums to
incomprehensible amounts. Some studies estimate that 60% of data was generated within
the last two years, which is a good indication of the rate with which society has digitized.
Besides the plummeting of storage costs, a second key contributing factor to the affordability of Big Data has been the development of open source Big Data software frameworks. The most popular software framework (nowadays considered the standard for Big Data) is Apache Hadoop, for distributed storage and processing. Because these software frameworks are freely available as open source, it has become increasingly inexpensive to start Big Data projects in organizations.
4. Increased knowledge about data science :- The demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science. As a result, knowledge about and education in data science have become greatly professionalized, and more information becomes available every day. While statistics and data analysis previously remained mostly an academic field, they are quickly becoming a popular subject among students and the working population.
5. Social media applications: - Everyone understands the impact that social media has on daily life. However, in the study of Big Data, social media plays a role of paramount importance, not only because of the sheer volume of data that is produced every day through platforms such as Twitter, Facebook, LinkedIn, and Instagram, but also because social media provides nearly real-time data about human behaviour. Social media data provides insights into the behaviours, preferences, and opinions of 'the public' on a scale that has never been known before. Due to this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data. Social media data can be used, for example, to identify customer preferences for product development and to target new customers.
6. The upcoming internet of things (IoT) :- The Internet of things (IoT) is the network
of physical devices, vehicles, home appliances and other items embedded with
electronics, software, sensors, actuators, and network connectivity which enables these
objects to connect and exchange data. It is increasingly gaining popularity as consumer
goods providers start including ‘smart’ sensors in household appliances. Whereas the
average household in 2010 had around 10 devices that connected to the internet, this
number is expected to rise to 50 per household by 2020. Examples of these devices
include thermostats, smoke detectors, televisions, audio systems and even smart
refrigerators.
Data Sources :- Data is sourced from multiple inputs in a variety of formats, both structured and unstructured. Data sources, open and third-party, play a significant role in the architecture. Typical sources include relational databases, data warehouses, cloud-based data warehouses, SaaS applications, real-time data from company servers and sensors such as IoT devices, third-party data providers, and static files such as Windows logs. The data managed can be processed in both batch and real-time modes.
Data Storage :- Data is stored in distributed file stores that can hold large files in a variety of formats. Large numbers of files in different formats can also be stored in a data lake. This storage holds the data that is managed for batch operations. Common options include HDFS and blob containers on Microsoft Azure, AWS, and GCP.
Batch Processing :- Each chunk of data is split into different categories using long-running jobs, which filter, aggregate, and prepare the data for analysis. These jobs typically read from the data sources, process the data, and deliver the output to new files. Multiple approaches to batch processing are employed, including Hive jobs, U-SQL jobs, Sqoop or Pig jobs, and custom MapReduce jobs written in Java, Scala, or other languages such as Python.
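The following minimal sketch illustrates the map, shuffle, and reduce steps that such batch jobs follow, written in plain Python rather than as an actual Hadoop job; the input lines are illustrative.

# Minimal, self-contained sketch of the map -> shuffle -> reduce pattern
# followed by batch word-count jobs. Plain Python, not a real Hadoop job.
from collections import defaultdict

lines = ["big data batch processing", "batch jobs filter and aggregate data"]

# Map phase: emit (key, value) pairs for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # e.g. {'big': 1, 'data': 2, 'batch': 2, ...}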
1. Volume:
The name ‘Big Data’ itself is related to a size which is enormous.
Volume is a huge amount of data.
To determine the value of data, size of data plays a very crucial role. If the volume of
data is very large then it is actually considered as a ‘Big Data’. This means whether a
particular data can actually be considered as a Big Data or not, is dependent upon the
volume of data.
Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.
Example: In the year 2016, the estimated global mobile traffic was 6.2 Exabytes (6.2 billion GB) per month. It was also estimated that by the year 2020 there would be almost 40,000 Exabytes of data.
2. Velocity:
Velocity refers to the high speed at which data is generated, collected, and processed, for example the continuous streams of posts, clicks, and sensor readings that arrive every second.
3. Variety:
It refers to the nature of data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources that are both inside and outside
of an enterprise. It can be structured, semi-structured and unstructured.
4. Veracity:
It refers to inconsistencies and uncertainty in the data: the data that is available can sometimes be messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk can create confusion, whereas a smaller amount of data may convey only half or incomplete information.
5. Value:
After taking the other four V's into account, there is one more V, which stands for Value.
Bulk data that has no value is of no good to the company unless it is turned into something useful.
o Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information.
Big data has found many applications in various fields today. The major fields where
big data is being used are as follows.
Government :- Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. More recently, big data analysis played a major role in the victory of the BJP and its allies in the 2014 Indian General Election. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.
Social Media Analytics :- The advent of social media has led to an outburst of big data. Various solutions have been built to analyse social media activity; for example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotion, and campaign placements accordingly. Before big data can be utilized, some pre-processing needs to be done on it in order to derive intelligent and valuable results. Thus, to understand the consumer mindset, applying intelligent decisions derived from big data is necessary.
Technology :- The technological applications of big data include companies that deal with huge amounts of data every day and put them to use for business decisions as well. For example, eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising, all inside eBay's 90 PB data warehouse. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
Fraud Detection :- For businesses whose operations involve any type of claims or
transaction processing, fraud detection is one of the most compelling Big Data
application examples. Historically, fraud detection on the fly has proven an elusive goal.
In most cases, fraud is discovered long after the fact, at which point the damage has been
done and all that’s left is to minimize the harm and adjust policies to prevent it from
happening again. Big Data platforms that can analyze claims and transactions in real
time, identifying large-scale patterns across many transactions or detecting anomalous
behavior from an individual user, can change the fraud detection game.
1). Easy Result Formats :- Results are an imperative part of a big data analytics model, as they support the decision-making process that shapes future strategy and goals. Analysts prefer to get results in real time so that they can make better and more appropriate decisions based on the analysis results. The tools must be able to produce results in such a way that they provide insights for the data analysis and decision-making platform. The platform should be able to provide real-time streams that help in making instant and quick decisions.
2). Raw data Processing :- Here, data processing means collecting and organizing data in a meaningful manner. Data modeling takes complex data sets and displays them in a visual form such as a diagram or chart. The data should be interpretable and digestible so that it can be used in making decisions. Big data analytics tools must be able to import data from various data sources such as Microsoft Access, text files, Microsoft Excel, and other files. Tools must be able to collect data from multiple data sources and in multiple formats; this reduces the need for data conversion and improves overall process speed. Export capabilities and the ability to visualize data sets and handle formats such as PDF, Excel, or Word also help in collecting and transferring data (a small import/export sketch follows the list below). The features listed below are essential for data processing tools:
Data Mining
Data Modeling
File Exporting
Data File Sources
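A minimal sketch of this import/export capability is shown below, assuming the pandas library is available; the file and column names are hypothetical.

# Minimal sketch: importing data from different sources and exporting results.
# File and column names are hypothetical; requires pandas (and openpyxl for Excel).
import pandas as pd

sales = pd.read_csv("sales.csv")           # text/CSV source
targets = pd.read_excel("targets.xlsx")    # Microsoft Excel source

# Combine the two sources on a shared column and summarise.
combined = sales.merge(targets, on="region")
summary = combined.groupby("region")[["revenue", "target"]].sum()

# Export the result so it can be shared or visualised elsewhere.
summary.to_excel("summary.xlsx")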
3). Prediction apps or Identity Management :- Identity management is also a required and essential feature for any data analytics tool. The tool should be able to access any system and all related information, whether related to computer hardware, software, or any other individual computer. The identity management system is also concerned with managing all issues related to identity, data protection, and access, so that it can support system and network passwords and protocols. It should be clear whether a user can access the system and at which level access permission is granted. Identity management applications and systems ensure that only authenticated users can access system information, and the tool or system must be able to organize a security plan that includes fraud analytics and real-time security.
4). Reporting Feature :- Businesses stay on top with the help of reporting features. Data should be fetched from time to time and represented in a well-organized manner. This way, decision-makers can take timely decisions and handle critical situations, especially in a rapidly moving environment. Data tools use dashboards to present KPIs and metrics. Reports must be customizable and oriented toward the target data set. The expected capabilities of reporting tools are real-time reporting, dashboard management, and location-based insights.
5). Security Features :- For any successful business, it is essential to keep its data safe. The tools used for big data analytics should offer safety and security for the data. For this, there should be an SSO (single sign-on) feature, so that the user does not need to sign in multiple times during the same session; with a single login, user activities and accounts can be monitored. Moreover, data encryption is also an imperative feature that should be provided by big data analytics tools: it changes data from a readable form into an unreadable one using algorithms and keys. Sometimes automatic encryption is also offered by web browsers, and data analytics tools offer comprehensive encryption capabilities as well. Single sign-on and data encryption are two of the most used and popular security features.
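As an illustration of data encryption, the sketch below uses the third-party Python 'cryptography' package (assumed to be installed); the message is illustrative and, in practice, the key must be stored securely.

# Minimal sketch of symmetric data encryption using the 'cryptography' package
# (assumed to be installed). In practice the key must be kept secret and safe.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # secret key
cipher = Fernet(key)

token = cipher.encrypt(b"customer record: name, phone, email")
print(token)                         # unreadable ciphertext

plain = cipher.decrypt(token)        # only possible with the same key
print(plain.decode())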
6). Fraud management :- A variety of fraud detection functionalities are involved in fraud analytics. When it comes to fraud, businesses often focus on how they will deal with fraud after it happens rather than on preventing it. Fraud detection can be performed by data analytics tools: the tools should be able to run repeated tests on the data at any time to ensure that nothing is amiss. In this way, threats can be identified quickly and efficiently with effective fraud analytics and identity management capabilities.
7). Technologies Support :- Your data analytics tool must support the latest tools and technologies, especially those that are important for your organization. One of the most important is A/B testing (also known as bucket or split testing), in which two versions of a webpage are compared to determine which performs better. Both versions are compared on the basis of how users interact with the page, and the better one is chosen (a small sketch of such a comparison follows the module list below). Moreover, as far as technical support is concerned, your tool must be able to integrate with Hadoop, a set of open-source programs that serves as the backbone of data-analytics activities. Hadoop mainly involves the following four modules, with which integration is expected:
MapReduce: reads data from the file system and processes it in parallel so that the results can be interpreted and visualized.
Hadoop Common: the collection of Java libraries and utilities required by the other modules, for example to read data stored in the user's file system.
YARN: manages the cluster's system resources so that data can be stored and analysis can be performed.
Hadoop Distributed File System (HDFS): allows data to be stored across machines in an easily accessible format.
If the results of a tool are integrated with these Hadoop modules, the user can easily send the results back to the user's system. In this way flexibility, interoperability, and two-way communication can be ensured between organizations.
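As promised above, here is a minimal sketch of an A/B (split) test comparing the conversion rates of two webpage versions with a two-proportion z-test; the visitor and conversion counts are illustrative.

# Minimal sketch of A/B (split) testing: comparing conversion rates of two
# webpage versions with a two-proportion z-test. Counts are illustrative.
from math import sqrt, erfc

conversions_a, visitors_a = 120, 2400    # version A
conversions_b, visitors_b = 150, 2350    # version B

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))         # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the difference between versions is real.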
8). Version Control :- Most data analytics work involves adjusting the parameters of analytics models, but this may cause problems when models are pushed into production. A version control feature in big data analytics tools improves the ability to track changes and makes it possible to roll back to previous versions whenever needed.
9). Scalability :- Data will not stay the same over time; it will grow as your organization grows. With big data tools, it is easy to scale up as soon as new data is collected for the company, and the data can be analyzed as expected. The meaningful insights derived from the new data can also be integrated with the existing data successfully.
10). Quick Integrations :- With integration capabilities, it is easy to share data and results with developers and data scientists. Big data tools should support quick integration with cloud apps, data warehouses, other databases, etc.
1. Informed Consent
To consent means that you give uncoerced permission for something to happen to
you. Informed consent is the most careful, respectful and ethical form of consent. It requires the
data collector to make a significant effort to give participants a reasonable and accurate
understanding of how their data will be used. In the past, informed consent for data collection
was typically taken for participation in a single study. Big data makes this form of consent
impossible as the entire purpose of big data studies, mining and analytics is to reveal patterns
and trends between data points that were previously inconceivable. In this way, consent cannot
possibly be ‘informed’ as neither the data collector nor the study participant can reasonably
know or understand what will be garnered from the data or how it will be used.
2. Privacy
The ethics of privacy involve many different concepts such as liberty, autonomy, security, and in
a more modern sense, data protection and data exposure. You can understand the concept of big
data privacy by breaking it down into three categories:
The scale and velocity of big data pose a serious concern, as many traditional privacy processes cannot protect sensitive data, which has led to an exponential increase in cybercrime and data leaks. In one such breach, a hacker was able to access and scrape a database which stored:
Names
Phone numbers
Email addresses
Profile descriptions
Follower and engagement data
Locations
LinkedIn profile links
Connected social media account login names
A further concern is the growing analytical power of big data, i.e. how this can impact privacy
when personal information from various digital platforms can be mined to create a full picture of
a person without their explicit consent. For example, if someone applies for a job, information
can be gained about them via their digital data footprint to identify political leanings, sexual
orientation, social life, etc. All of this data could be used as a reason to reject an employment
application even though the information was not offered up for judgement by the applicant.
3. Ownership
When we talk about ownership in big data terms, we steer away from the traditional or legal
understanding of the word as the exclusive right to use, possess, and dispose of property. Rather,
in this context, ownership refers to the redistribution of data, the modification of data, and the
ability to benefit from data innovations.
The right to control data - edit, manage, share and delete data
The right to benefit from data - profit from the use or sale of data
Contrary to common belief, those who generate data, for example, Facebook users, do not
automatically own the data. Some even argue that the data we provide to use ‘free’ online
platforms is in fact a payment for that platform. But big data is big money in today’s world.
Many internet users feel that the current balance is tilted against them when it comes to
ownership of data and the transparency of companies who use and profit from the data we share.
Algorithms are designed by humans, the data sets they study are selected and prepared by
humans, and humans have bias. So far, there is significant evidence to suggest that human prejudices are infecting technology and algorithms and negatively impacting the lives and freedoms of humans, particularly those who belong to minorities in our societies.
Algorithm biases have become such an ingrained part of everyday life that they have also been
documented as impacting our personal psyches and thought processes. The phenomenon occurs
when we perceive our reality to be a reflection of what we see online. However, what we view is
often a tailored reality created by algorithms and personalised using our previous viewing habits.
The algorithm shows us content that we are most likely to enjoy or agree with and discards the
rest.
The big data divide refers to the current state of data access: the understanding and mining capabilities of big data are concentrated in the hands of a few major corporations. These divides create 'haves' and 'have nots' in big data and exclude those who lack the necessary financial, educational, and technological resources to access and analyse big datasets.
The data divide creates further problems when we consider algorithm biases that place
individuals in categories based on a culmination of data that individuals themselves cannot
access. For example, profiling software can mark a person as a high-risk potential for
committing criminal activity, causing them to be legally stop-and-searched by authorities or
even denied housing in certain areas. The big data divide means that the ‘data poor’ cannot
understand the data or methods used to make these decisions about them and their lives.
Big data affects the compliance process directly because you will be expected to account for its flow inside your organization. Regulatory bodies are keen to examine every stage of data handling, including the collection, processing, and storage of data. The primary reason for this comprehensive evaluation is to make sure that the data is safe from cyberattacks. In order to obtain compliance status, you will need to build security measures to secure the data. During the analysis, you are expected to show how each of the risk-mitigation techniques works and its level of effectiveness. This thorough report on the data protection programs will make the organization's certification easier. In this way, big data assists the creation of a comprehensive risk assessment framework.
Data protection is a set of strategies and processes you can use to secure the privacy,
availability, and integrity of your data. It is sometimes also called data security. A data
protection strategy is vital for any organization that collects, handles, or stores sensitive
data. A successful strategy can help prevent data loss, theft, or corruption and can help
minimize damage caused in the event of a breach or disaster.
Data protection principles help protect data and make it available under any
circumstances. It covers operational data backup and business continuity/disaster
recovery (BCDR) and involves implementing aspects of data management and data
availability.
Data availability—ensuring users can access and use the data required to perform
business even when this data is lost or damaged.
Data lifecycle management—involves automating the transmission of critical data to
offline and online storage.
Information lifecycle management—involves the valuation, cataloguing, and
protection of information assets from various sources, including facility outages and disruptions,
application and user errors, machine failure, and malware and virus attacks.
Data privacy is a guideline for how data should be collected or handled, based on its
sensitivity and importance. Data privacy is typically applied to personal health
information (PHI) and personally identifiable information (PII). This includes financial
information, medical records, social security or ID numbers, names, birthdates, and
contact information. Data privacy concerns apply to all sensitive information that
organizations handle, including that of customers, shareholders, and employees. Often,
this information plays a vital role in business operations, development, and finances.
Data privacy helps ensure that sensitive data is only accessible to approved parties. It
prevents criminals from being able to maliciously use data and helps ensure that
organizations meet regulatory requirements.
Big data ethics also known as simply data ethics refers to systemizing, defending, and
recommending concepts of right and wrong conduct in relation to data, in particular
personal data. Since the dawn of the Internet the sheer quantity and quality of data has
dramatically increased and is continuing to do so exponentially. Big data describes this
large amount of data that is so voluminous and complex that traditional data processing
application software is inadequate to deal with them. Recent innovations in medical
research and healthcare, such as high-throughput genome sequencing, high-resolution
imaging, electronic medical patient records and a plethora of internet-connected health
devices have triggered a data deluge that will reach the exabyte range in the near future.
Data ethics is of increasing relevance as the quantity of data increases because of the
scale of the impact.
Big data ethics serves as a branch of ethics that evaluates data practices: the collection, generation, analysis, and distribution of data. As the world expands its digital footprint, the data collected has the potential to impact people and thus society. Much of this data consists of personally identifiable information (PII):
Full name
Birthdate
Street address
Phone number
Social security number
Credit card information
Bank account information
Passport number
When an organization fails to act ethically, it is no secret that this damages the company's brand and reputation. Similarly, after the many data scandals that occurred in the past couple of years, people lost trust in companies that manipulated customer data. However, these scandals do not consist only of data manipulation and sale: housing data and keeping it safe from harm is also part of big data ethics. Some of the biggest data breaches ever to occur had lasting effects on brand trustworthiness. Therefore, adopting a concrete big data ethics framework is essential for the success of any large organization. Companies must act as protectors of information as long as they choose to collect it.
The bigger the data central to the business, the higher its risk of violating customer
privacy and individual rights. In 2022, the responsibility to actively manage data privacy
and security falls on roles within the large organization.
Privacy
When users submit their information, it’s with the expectation that companies will keep it
to themselves. Two common scenarios exist when this information is no longer private:
o A data breach
o A sale of information to a third party
With the growth of data collection, users expect talented IT professionals to be able to protect their data. If a data breach occurs, the company has failed to meet privacy expectations. Furthermore, in the 21st century, consumers expect large companies to have the means to protect data if they choose to collect it.
Lack of transparency
Many users are unaware that their information is being collected, and companies often go to great lengths to make the collection inconspicuous. Websites add cookie opt-ins to pop-ups so that users accept quickly just to see the page. After getting a user to submit information, some companies do not disclose how they use that person's data. Long legal documents are formatted in a way that no user is realistically expected to read through. It is only after scandals or media reporting that people discover that a company's data collection methods are unsatisfactory.
Lack of governance
Before big data, the method of collecting information was simple: people either gave you a physical copy of their information or they did not, and companies stored the physical files under lock and key. Although someone could always potentially steal an identity, criminals had difficulty doing so to masses of people at one time. Now, in a new age of information abundance, users unknowingly submit heaps of information, and the possibilities of using that information with AI and algorithms to someone's advantage are endless.
Discrimination
Algorithms make assumptions about users and, depending on the assumption, can begin to discriminate. For example, court systems have started using algorithms to evaluate the criminal risk of defendants and have used this data while sentencing. If the training data over-represents a certain gender, nationality, or race, then the results carry bias against groups outside of those represented.
Big Data Ethics Framework and Other Ways to Resolve Ethical Issues
Big data privacy involves properly managing big data to minimize risk and protect sensitive
data. Because big data comprises large and complex data sets, many traditional privacy
processes cannot handle the scale and velocity required. To safeguard big data and ensure it
can be used for analytics, you need to create a framework for privacy protection that can
handle the volume, velocity, variety, and value of big data as it is moved between
environments, processed, analyzed, and shared.
In an era of multi-cloud computing, data owners must keep up with both the pace of data growth and the proliferation of regulations that govern it, especially regulations protecting the privacy of sensitive data and personally identifiable information (PII). With more data spread across more locations, the business risk of a privacy breach has never been higher, and with it come consequences ranging from high fines to loss of market share. Big data privacy is also a matter of customer trust. The more data you collect about users, the easier it gets to "connect the dots": to understand their current behavior, draw inferences about their future behavior, and eventually develop deep and detailed profiles of their lives and preferences.
The more data you collect, the more important it is to be transparent with your customers
about what you're doing with their data, how you're storing it, and what steps you're taking
to comply with regulations that govern privacy and data protection. The volume and
velocity of data from existing sources, such as legacy applications and e-commerce, is
expanding fast. You also have new (and growing) varieties of data types and sources, such
as social networks and IoT device streams.
Prediction 1: Data privacy mandates will become more common. As organizations store more types of sensitive data in larger amounts over longer periods of time, they will be under increasing pressure to be transparent about what data they collect, how they analyze and use it, and why they need to retain it. The European Union's General Data Protection Regulation (GDPR) is a high-profile example, and more government agencies and regulatory organizations are following suit. To respond to these growing demands, companies need reliable, scalable big data privacy tools that encourage and help people to access, review, correct, anonymize, and even purge some or all of their personal and sensitive information.
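As one simple illustration of such tooling, the sketch below pseudonymises personal identifiers by replacing them with salted hashes before analysis. This is only one basic technique, not a complete anonymisation solution; the field names and salt are hypothetical.

# Minimal sketch of pseudonymising personal identifiers before analysis by
# replacing them with salted hashes. One simple technique, not a complete
# anonymisation solution; field names and salt are illustrative.
import hashlib

SALT = b"rotate-this-secret-salt"

def pseudonymise(value: str) -> str:
    """Return a stable, non-reversible token for a personal identifier."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

record = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 7}
safe_record = {
    "name": pseudonymise(record["name"]),
    "email": pseudonymise(record["email"]),
    "purchases": record["purchases"],      # non-identifying fields kept as-is
}
print(safe_record)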
Prediction 2: New big data analytic tools will enable organizations to perform deeper
analysis of legacy data, discover uses for which the data wasn't originally intended, and
combine it with new data sources. Big data analytics tools and solutions can now dig into
data sources that were previously unavailable, and identify new relationships hidden in
legacy data. That’s a great advantage when it comes to getting a complete view of your
enterprise data especially for customer 360 and analytics initiatives. But it also raises
questions about the accuracy of aging data and the ability to track down entities for
consent to use their information in new ways.
Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer preferences
that can help organizations make informed business decisions. Big Data analytics deals
with collection, storage, processing and analysis of this massive scale data. Specialized
tools and frameworks are required for big data analysis when:
(1) the volume of data involved is so large that it is difficult to store, process and
analyse data on a single machine,
(2) the velocity of data is very high and the data needs to be analysed in real-time,
(3) there is a variety of data involved, which can be structured, unstructured or
semi-structured, and is collected from multiple data sources,
(4) various types of analytics need to be performed to extract value from the data
such as descriptive, diagnostic, predictive and prescriptive analytics.
Big data analytics involves several steps, starting from data cleansing, data munging (or wrangling), and data processing through to visualization. The big data analytics life-cycle starts with the collection of data from multiple data sources. Specialized tools and frameworks are required to ingest the data from different sources into the big data analytics backend. The data is stored in specialized storage solutions (such as distributed file systems and non-relational databases) which are designed to scale.
Based on the analysis requirements (batch or real-time) and the type of analysis to be performed (descriptive, diagnostic, predictive, or prescriptive), specialized frameworks are used. Big data analytics is enabled by several technologies such as cloud computing, distributed and parallel processing frameworks, non-relational databases, and in-memory computing. Some examples of big data are listed as follows:
Data generated by social networks including text, images, audio and video data
Click-stream data generated by web applications such as e-Commerce to analyse user
behaviour
Machine sensor data collected from sensors embedded in industrial and energy systems
for monitoring their health and detecting failures
Healthcare data collected in electronic health record (EHR) systems
Logs generated by web applications
Stock markets data
Transactional data generated by banking and financial applications
➨Big data analysis derives innovative solutions. It helps in understanding and targeting customers and in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health through the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast amounts of information via surveys and obtain answers to queries.
➨New data is added every second.
➨A single platform can carry virtually unlimited information.
The challenges in Big Data are the real implementation hurdles. These require immediate attention and need to be handled, because if they are not handled, the technology may fail, which can also lead to unpleasant results.
IDA (Intelligent Data Analysis), in general, includes three stages: (1) preparation of data; (2) data mining; (3) data validation and explanation. The preparation of data involves selecting the required data from the relevant data source and incorporating it into a data set that can be used for data mining. The main goal of intelligent data analysis is to obtain knowledge. Data analysis is a process combining the extraction of data from a data set, analysis, classification of data, organization, reasoning, and so on. It is challenging to choose suitable methods to
resolve the complexity of the process. Regarding the term visualization, we have moved
Nature of Data
What is Data?
Properties of data
• 1. Consistency
The element of consistency removes room for contradictory data. Rules will have to be
set around consistency metrics, which include range, variance, and standard deviation.
• 2. Accuracy
It is a necessity for DQ data to remain error-free and precise, which means it should be
free of erroneous information, redundancy, and typing errors. Error ratio and deviation
are two examples of accuracy metrics.
• 3. Completeness
The data should be complete, with no missing values. For cloud data quality tools, all data entries should be complete, with no room for lapses. The completeness metric is defined as the percentage of complete data records (a small sketch computing this metric appears after this list).
• 4. Auditability
The ability to trace data and analyse changes over time adds to the auditability dimension of Data Quality. Examples of auditability metrics are the percentage of gaps in data sets, of modified data, and of untraceable or disconnected data.
• 5. Validity
Quality data in terms of validity indicates that all data is aligned with the existing
formatting rules. An example of a validity metric is the percentage of data records in the
required format.
• 6. Uniqueness
Each entity or event should be recorded only once; duplicate records reduce data quality. An example of a uniqueness metric is the percentage of duplicate records in a data set.
• 7. Timeliness
For data to retain its quality, it should be recorded promptly so that changes are captured. Tracking data weekly rather than annually helps maintain timeliness. An example of a timeliness metric is time variance.
• 8. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
• 9. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data is clearly
identified and readily available from manual, automated, or other systems and records.
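As referenced under Completeness above, a minimal sketch of computing the completeness metric is shown below, assuming the pandas library; the sample records are illustrative.

# Minimal sketch of the completeness metric: the percentage of records with no
# missing values. Requires pandas; the sample data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", None, "Meena"],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "age":   [34, 29, 41, None],
})

complete_rows = df.notna().all(axis=1).sum()   # rows with every field present
completeness = 100 * complete_rows / len(df)
print(f"Completeness: {completeness:.1f}%")    # -> 25.0% for this sample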
Types of data
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative value; think of them simply as 'labels'. Nominal data has no order, so if you change the order of its values, the meaning remains the same.
Thus, nominal data are observed but not measured, are unordered and non-equidistant, and have no meaningful zero. The only numerical operations you can perform on nominal data are to state that one observation is (or is not) equal to another (equality or inequality) and to group observations accordingly. You cannot perform arithmetic operations on them, as those are reserved for numerical data. With nominal data, you can calculate frequencies, proportions, percentages, and central points.
Examples of Nominal data:
English
German
French
Punjabi
American
Indian
Japanese
German
Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters: the categories can be ordered, for example 1st, 2nd, and so on. However, the relative distances between adjacent categories are not necessarily equal.
Examples of Ordinal data:
Opinion
o Agree
o Mostly agree
o Neutral
o Mostly disagree
o Disagree
Time of day
o Morning
o Noon
o Night
Numerical Data
This data type quantifies things by taking on numerical values that make it countable in nature; the price of a smartphone, the discount offered, or a count of items are examples.
For example, the height of a person can vary from x cm to y cm and can be further broken down into fractional values.
Interval Data
Interval data are measured and ordered with equidistant items but have no meaningful zero.
The central point of an interval scale is that the word 'interval' signifies 'space in between', which is the significant thing to recall: interval scales tell us not only about the order but also about the value between each item.
Even though interval data can appear fundamentally the same as ratio data, the difference lies in their defined zero points. If the zero point of the scale has been chosen arbitrarily, then the data cannot be ratio data and must be interval data. Examples of interval data: temperature in Celsius or Fahrenheit, and calendar dates.
Ratio Data
Ratio data are measured and ordered with equidistant items and have a meaningful zero; unlike interval data, they can never be negative. An outstanding example of ratio data is the measurement of height: it can be measured in centimetres, inches, metres, or feet, and it is not possible to have a negative height. Ratio data tells us about the order of variables and the differences among them, and it has an absolute zero. This permits a wide range of calculations and inferences to be performed and drawn. Examples of ratio data: height, weight, and duration.
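A minimal sketch contrasting these data types in code is shown below, assuming the pandas library; the values are illustrative.

# Minimal sketch contrasting nominal, ordinal, and ratio data using pandas.
import pandas as pd

# Nominal: labels with no order; you can only count and compare for equality.
languages = pd.Series(["English", "German", "French", "German"], dtype="category")
print(languages.value_counts())

# Ordinal: categories with an explicit, meaningful order.
opinions = pd.Series(
    pd.Categorical(
        ["Agree", "Neutral", "Mostly agree", "Disagree"],
        categories=["Disagree", "Mostly disagree", "Neutral",
                    "Mostly agree", "Agree"],
        ordered=True,
    )
)
print(opinions.min(), "->", opinions.max())   # order comparisons are valid

# Ratio data (e.g. heights) has a meaningful zero and supports full arithmetic.
heights_cm = pd.Series([162.0, 175.5, 181.2])
print(heights_cm.mean())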
In general, data is any set of characters that is gathered and translated for some purpose, usually analysis. If data is not put into context, it is of no use to a human or a computer. There are multiple types of data, such as the categorical and numerical types described above.
In a computer's storage, digital data is a sequence of bits (binary digits) that have the value
one or zero. Data is processed by the CPU, which uses logical operations to produce new
data (output) from source data (input).
Analytic processes
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
– Deployment: In this phase, we deploy the results of the analysis.
– Business Understanding: Whenever any requirement occurs, we first need to determine the business objective.
– Data Exploration: Data collected from the various sources is described in terms of its application and the need for the project in this phase.
– Data Preparation: Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
– Data Modeling and Evaluation: Test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
Analytic tools
Thus, BDA (Big Data Analytics) tools are used throughout the development of BDA applications.
Analysis vs Reporting
Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.
• Data reporting: Gathering data into one place and presenting it in visual representations
• A firm may be focused on the general area of analytics (strategy, implementation, reporting,
etc.)
– but not necessarily on the specific aspect of analysis.
• It’s almost like some organizations run out of gas after the initial set-up-related activities and
don’t make it to the analysis stage
Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is actually going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
Reporting and analysis can go hand-in-hand:
Reporting provides little or no context about what is happening in the data, and context is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises the question: What is happening?
Analysis transforms data into insights: Why is it happening? What can you do about it?
Thus, analysis and reporting go together; each is needed and used in its own context.
The characteristics of the data analysis depend on different aspects such as volume, velocity, and
variety.
1. Programmatic
Because of the scale of the data, there might be a need to write a program for data analysis, using code to manipulate the data or to do any kind of exploration.
2. Data-driven
Many data scientists rely on a hypothesis-driven approach to data analysis. However, one can also let the data itself drive the analysis, which can be a significant advantage when there is a large amount of data. For example, machine learning approaches can be used in place of hypothesis-driven analysis (a minimal sketch follows).
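The minimal sketch below illustrates this data-driven approach by letting a model learn a pattern from the data, assuming scikit-learn and NumPy are available; the tiny data set is illustrative.

# Minimal sketch of a data-driven approach: letting a model learn a pattern
# from the data instead of testing a pre-stated hypothesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [pages_viewed, minutes_on_site]; label: 1 = made a purchase.
X = np.array([[1, 2.0], [2, 3.5], [8, 12.0], [9, 15.0], [3, 4.0], [7, 11.0]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[6, 9.0]]))        # predicted class for a new visitor
print(model.predict_proba([[6, 9.0]]))  # and the associated probabilities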
3. Attributes usage
Proper and accurate analysis of data can make use of many attributes. In the past, analysts dealt with hundreds of attributes or characteristics of the data source. With Big Data, there are now thousands of attributes and millions of observations.
4. Iterative
Since the whole data set is broken into samples and the samples are then analyzed, data analytics can be iterative in nature. Better compute power enables iteration of the models until data analysts are satisfied. This has led to the development of new applications designed for addressing analysis requirements and time frames.
a) Apache Hadoop
• Apache Hadoop provides infrastructures and platforms for other specific Big Data applications.
b) Apache Flink
• Apache Flink is
– an open source platform,
– a streaming dataflow engine that provides communication, fault tolerance, and data distribution for computation over data streams.
– Flink is a top-level Apache project and a scalable data analytics framework that is fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.
c) Kinesis
– Kinesis is an out-of-the-box streaming data tool.
– Kinesis comprises shards, which Kafka calls partitions.
– For organizations that want to take advantage of real-time or near real-time access to large stores of data, Amazon Kinesis is a great choice.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data which is followed by loading the
aggregate data into a data warehouse.
– Data is put into Kinesis streams.
– This ensures durability and elasticity.
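A minimal sketch of putting records into a Kinesis stream is shown below, using the boto3 AWS SDK for Python (assumed to be installed and configured with credentials); the stream name, region, and payload are illustrative.

# Minimal sketch of writing records into an Amazon Kinesis data stream with
# boto3. The stream name, region, and payload below are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-42", "action": "click", "page": "/products/123"}

response = kinesis.put_record(
    StreamName="clickstream-demo",           # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),  # payload bytes
    PartitionKey=event["user_id"],           # routes the record to a shard
)
print(response["SequenceNumber"])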