Introduction To Big Data-0


Introduction to Big Data

Big Data Definition


Definition: Big data is a collection of massive and
complex data sets. The term covers huge quantities of
data, the data management capabilities needed to handle
them, social media analytics, and real-time data.
• Big data analytics is the process of examining large
amounts of heterogeneous digital data. Big data is about
data volume: large data sets measured in terms of
terabytes or petabytes.
• This phenomenon is called big data.
• Walmart handles 1 million customer transactions
every hour.
• Facebook handles 40 billion photos from its user
base.

• Big data is so huge and complex that it is impossible for
traditional systems and traditional warehousing tools to process
and work on it.
• Big data is generated by both machines and humans.
With the growth of technologies and services, large volumes of data are
produced from different sources, and the data can be structured,
semi-structured, or unstructured. Such data can neither be worked on with
traditional SQL-like queries nor stored in a conventional
RDBMS.
• As a result, a wide variety of scalable database tools and techniques
have evolved.
• Hadoop, an open-source distributed data processing system, is
one of the most prominent and well-known solutions.
• The need for big data technology comes from companies
like Google, Facebook, eBay, and LinkedIn.
• Like many new information technologies, big
data can bring about dramatic cost reductions,
substantial improvements in the time required to
perform computational tasks, or new product and
service offerings.
• Big data requires different approaches:
techniques, tools, and architecture.
• The aim is to solve new problems, and to solve old problems
in a better way.
Taxonomy of Big Data Technology
Taxonomy of Big Data
• SEMANTICS OF BIG DATA
• Data semantics refers to the “meaning and
meaningful use of data”, i.e. the effective use of a
data object for representing a concept or object in the
real world. Such a general notion interconnects a large
variety of applications.
• A historic achievement of the database community
was representing data via suitable schemata.
Unfortunately, big data deals with evolving,
heterogeneous data that makes it difficult, or even
impossible, to identify a data schema prior to data
processing.
Taxonomy of Big Data
• SEMANTICS OF BIG DATA
• Managing a large volume of heterogeneous and
distributed data requires definition and continuous
updating of metadata describing different aspects of
semantics and data quality, such as data
documentation, provenance, trust, accuracy, and other
properties.

Taxonomy of Big Data
• COMPUTE INFRASTRUCTURE
• MapReduce
– MapReduce is a widely used parallel programming
paradigm, even though it is built on only two kinds
of task: Map and Reduce.
– It can express many problems in distributed
and parallel computing and in large-scale data-intensive
computing, although iterative and graph workloads
often fit other models better.
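The Map and Reduce tasks can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for illustration, and a real framework would run the phases in parallel across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Run map over every record, then group by key, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Illustrative input only.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread over many workers without coordination beyond the shuffle.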
Taxonomy of Big Data
• COMPUTE INFRASTRUCTURE
• BSP: Bulk Synchronous Parallel (a model for
designing parallel algorithms).
– Giraph is an open-source system for
graph processing on big data. It runs its graph
computation as a Hadoop MapReduce job and, in general,
follows a master/workers model.
– Pregel is a system that provides a graph processing
API built on BSP with a vertex-centric
programming model. Its programs are inspired by
Valiant's Bulk Synchronous Parallel model.
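The vertex-centric BSP idea can be sketched with maximum-value propagation: in each superstep a vertex reads its messages, updates its value, and sends messages that become visible only after the barrier. This is a simplified single-process sketch, not the Pregel or Giraph API; `propagate_max` and the three-vertex graph are assumptions for illustration.

```python
def propagate_max(graph, values):
    """Spread the largest vertex value through the graph, superstep by superstep.

    graph:  vertex -> list of neighbours (list undirected edges both ways)
    values: vertex -> initial number
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v in graph:
        for neighbour in graph[v]:
            inbox[neighbour].append(values[v])

    # Keep running supersteps while any message is in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for neighbour in graph[v]:
                    outbox[neighbour].append(values[v])
            # Otherwise the vertex stays silent ("votes to halt").
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
    return values


# Illustrative graph: a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = propagate_max(graph, {"a": 3, "b": 6, "c": 2})
print(result)
```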
Taxonomy of Big Data
• COMPUTE INFRASTRUCTURE
• BSML
– Bulk Synchronous Parallel ML (BSML)
is a library for parallel programming with the
functional language Objective Caml. It is based on
an extension of the λ-calculus with parallel operations
on parallel vectors. A parallel vector is a parallel data
structure.
Taxonomy of Big Data
• STORAGE
• The storage system architecture is broadly categorized
into three categories, namely,
– Direct-Attached Storage (DAS),
– Network Attached Storage (NAS)
– Storage Area Network (SAN).
Taxonomy of Big Data
• STORAGE
– Direct-Attached Storage (DAS)
– DAS is digital storage that is attached
directly to the computer accessing it.
– Examples range from USB drives to drives attached over an internal
bus, i.e., every server has its own storage space
directly attached to it, with no network
access involved.
Taxonomy of Big Data
• STORAGE
– Network Attached Storage (NAS)
– The storage is
attached through an Ethernet switch to scale the storage
system. NAS uses the TCP/IP protocol to access the
storage.
– The application server is detached from the file system
and data storage. The advantage of detaching the
application server from the file system and storage system is
incremental scalability.
– It is comparatively easy to design a disaster recovery system
using NAS. Performance is the major issue with
NAS.
Taxonomy of Big Data
• STORAGE
• Storage Area Network (SAN)
• The storage devices are attached over Fibre Channel and
networked together.
• Storage is connected through a Fibre Channel switch, which
speeds up access to the storage devices.
• Performance is the major advantage and scalability is the
major issue with SAN.
• SAN outperforms NAS and DAS in performance, but
NAS outperforms SAN and DAS in scalability.
Taxonomy of Big Data
• STORAGE IMPLEMENTATION
• File System for Big Data
• Object Storage for Big Data
• Block Storage System
• Cloud Storage
Taxonomy of Big Data
• BIG DATA MANAGEMENT
• Data Acquisition:-
– Data acquisition is the process of collecting, filtering and
removing any noise from data before they can be stored in
any data warehouse or any storage system. It adopts
adaptive and time efficient algorithms for processing of
high value data. For data acquisition, frameworks are
available that are based on predefined protocols.
Taxonomy of Big Data
• BIG DATA MANAGEMENT
• Data Pre-processing:-
– Data preprocessing is the set of techniques applied before
data processing techniques are used. It removes data
redundancies and inconsistencies, and makes the data suitable for
data processing algorithms.
– Some data preprocessing approaches are dimensionality
reduction and instance reduction. The dimensionality of data
refers to the number of attributes (features) of the data, whereas
instance reduction lowers the number of records. Dimensionality
reduction includes feature selection and space
transformations.
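The two approaches can be contrasted in a few lines. This is a minimal sketch under stated assumptions: feature selection is done here with a simple variance threshold and instance reduction by dropping duplicate rows, which are only two of many possible techniques, and the function names and sample rows are made up for illustration.

```python
from statistics import pvariance


def select_features(rows, min_variance=0.0):
    """Dimensionality reduction by feature selection.

    Keeps only the columns whose variance exceeds min_variance, so a
    column that is the same in every row is dropped.
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    return [[row[i] for i in keep] for row in rows], keep


def drop_duplicates(rows):
    """Instance reduction: keep the first copy of each distinct row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative data: the middle column is constant, one row is repeated.
rows = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [1.0, 5.0, 0.0], [3.0, 5.0, 1.0]]
reduced, kept = select_features(drop_duplicates(rows))
print(kept, reduced)
```

Feature selection shrinks the table sideways (fewer columns), while instance reduction shrinks it downwards (fewer rows).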
Taxonomy of Big Data
• DATAMINING and MACHINE LEARNING

• Data Mining: Data mining is a technique to extract
important and vital information and knowledge from a
huge set or library of data. It derives insight by carefully
extracting, reviewing, and processing the data to
find patterns and correlations that can be important
for the business.
Taxonomy of Big Data
• DATAMINING and MACHINE LEARNING
– A machine learning algorithm (MLA) is an approach or
tool that helps in big data analytics (BDA) for applications.
It is suitable for analyzing the large amount of data
generated by an application so that the data can be used
effectively and efficiently.
– Machine learning algorithms are used to find
meaningful data and information for industrial
applications. Machine learning is one of the services under big data
analytics (BDA). BDA is suitable
for risk management, identifying causes of failure,
identifying customers based on their procurement
detail records, detecting fraud, etc.
Three Characteristics of Big Data: the 3 Vs
• Volume: data quantity
• Velocity: data speed
• Variety: data types

The three Vs describe the data to be analyzed.
Analytics is the process of deriving value from that
data. Taken together, there is the potential for amazing
insight.
Security and Privacy
• The Security and Privacy challenges for big
data may be organized into four aspects of the
big data ecosystem as depicted in Figure

• Infrastructure Security
• Data Privacy
• Data Management
• Integrity and Reactive Security
Security and Privacy
• Securing the infrastructure of big data systems involves securing
distributed computations and data stores. Securing the data itself
is of paramount importance, so we have to ensure that information
dissemination is privacy-preserving and that sensitive data is
protected through the use of cryptography and granular access
control.
• Managing enormous volumes of data necessitates scalable and
distributed solutions for not only securing data stores but also
enabling efficient audits and investigations of data provenance.
• Finally, the streaming data that is coming in from diverse
endpoints has to be checked for integrity and can be used to
perform real-time analytics for security incidents.
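One common way to check the integrity of records arriving from an endpoint is a keyed hash (HMAC). The slides do not name a mechanism, so this is only an illustrative sketch: the shared key, the record format, and the helper names `sign` and `is_intact` are assumptions.

```python
import hashlib
import hmac

# Hypothetical shared secret between endpoint and collector; a real
# deployment would load this from a key store, never from source code.
SECRET_KEY = b"demo-key-not-for-production"


def sign(record: bytes) -> str:
    """Endpoint side: compute an HMAC-SHA256 tag for one record."""
    return hmac.new(SECRET_KEY, record, hashlib.sha256).hexdigest()


def is_intact(record: bytes, tag: str) -> bool:
    """Collector side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(record), tag)


# Illustrative record only.
record = b'{"sensor": "s1", "reading": 21.5}'
tag = sign(record)
print(is_intact(record, tag))         # unmodified record
print(is_intact(record + b" ", tag))  # record altered in transit
```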
Security issues that arise for the various forms of data
• Streaming Data – There are two complementary security
problems for streaming data depending on whether the
data is public or not.
• For public data, confidentiality may not be an issue, but
the filtering criteria applied by individual clients, such as
governments, may be classified.
• For private data, confidentiality may be a concern, while
at the same time a suitably modified version of the data
may be disclosed to achieve specific utilities, such as
predictive analytics.
Case Study: Diabetes Prevention
• What if we could predict the occurrence of diabetes and take
appropriate measures beforehand to prevent it?

• In this case study, let us predict the occurrence of diabetes
making use of the entire lifecycle of data science.
• Let’s go through the various steps.
Step 1: First, we will collect the data based on the medical history of
the patient as discussed in Phase 1.
You can refer to the sample data below.
• Attributes:
• npreg     –   Number of times pregnant
• glucose   –   Plasma glucose concentration
• bp          –   Blood pressure
• skin        –   Triceps skinfold thickness
• bmi        –   Body mass index
• ped        –   Diabetes pedigree function
• age        –   Age
• income   –   Income
Step 2:

• Now, once we have the data, we need to clean and prepare the
data for data analysis.
• This data has a lot of inconsistencies like missing values, blank
columns, abrupt values and incorrect data format which need to
be cleaned.
• Here, we have organized the data into a single table under
different attributes – making it look more structured.
Step 2 (continued):

• This data has a lot of inconsistencies:
• In the column npreg, “one” is written in words, whereas it should be in
numeric form, like 1.
• In column bp, one of the values is 6600, which is impossible (at least for
humans), as bp cannot reach such a huge value.
• As you can see, the income column is blank and also makes no sense in
predicting diabetes. Therefore, it is redundant and should
be removed from the table.
• So, we will clean and preprocess this data by removing the outliers,
filling in the null values, and normalizing the data types. If you remember,
this is our second phase, which is data preprocessing.
• Finally, we get the clean data as shown below, which can be used for
analysis.
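The cleaning steps described in Step 2 can be sketched as follows. The rows are made up for illustration and are not the case study's data; the plausible bp range of 0 to 300 and the choice to fill gaps with the column mean are assumptions, as are the helper names.

```python
# Small illustrative lookup for numbers written as words.
WORD_TO_NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3}


def to_number(value):
    """Normalize one raw cell to a float; blank cells become None."""
    text = "" if value is None else str(value).strip().lower()
    if text == "":
        return None
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    return float(text)


def clean(rows, bp_range=(0.0, 300.0)):
    """Drop the income column, fix types, remove outliers, fill gaps."""
    cleaned = []
    for raw in rows:
        # Remove the blank, irrelevant column and normalize data types.
        row = {k: to_number(v) for k, v in raw.items() if k != "income"}
        # Treat an implausible blood pressure as a missing value
        # (the accepted range is an assumption for this sketch).
        if row["bp"] is not None and not (bp_range[0] <= row["bp"] <= bp_range[1]):
            row["bp"] = None
        cleaned.append(row)
    # Fill each missing value with the mean of the known values in its column.
    for column in cleaned[0]:
        known = [r[column] for r in cleaned if r[column] is not None]
        if not known:
            continue
        fill = sum(known) / len(known)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill
    return cleaned


# Made-up rows showing the three kinds of problem.
raw_rows = [
    {"npreg": "one", "glucose": "85", "bp": "66", "income": ""},
    {"npreg": "6", "glucose": "148", "bp": "6600", "income": ""},
    {"npreg": "8", "glucose": "", "bp": "64", "income": ""},
]
cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```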
Step 3: Now let’s do some analysis as discussed earlier in Phase 3.

• First, we will load the data into the analytical sandbox and apply
various statistical functions on it. For example, R has functions
like describe (provided by add-on packages such as Hmisc), which gives us the number of missing values and
unique values. We can also use the summary function, which will
give us statistical information like mean, median, range, min and
max values.
• Then, we use visualization techniques like histograms, line
graphs, and box plots to get a fair idea of the distribution of the data.
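The same kind of summary can be sketched in Python using only the standard library. The slides use R, so this is an analogue rather than the case study's code; `summarize` and the glucose values are made up for illustration.

```python
import statistics


def summarize(values):
    """Report missing and unique counts plus basic statistics for one column."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
    }


# Made-up column with one missing value.
glucose = [85, 148, None, 89, 85]
summary = summarize(glucose)
print(summary)
```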
Step 4:
• Now, based on insights derived from the previous step, the best
fit for this kind of problem is the decision tree.
• Since we already have the major attributes for analysis,
like npreg, bmi, etc., we will use a supervised learning
technique to build a model here.
• We have chosen a decision tree in particular because it takes all
attributes into consideration in one go.
• In our case, we have a linear relationship
between npreg and age, whereas the relationship
between npreg and ped is nonlinear.
• Decision tree models are also very robust, as we can use
different combinations of attributes to make various trees and
then finally implement the one with the maximum efficiency.
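How a decision tree considers every attribute can be seen in its basic building block, the search for the best split. The sketch below finds a single split (a one-level tree) using Gini impurity, which is one common criterion and an assumption here; the four patients and their labels are made up, and a full tree would repeat this search recursively on each side.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Try every attribute and threshold; return (impurity, attribute, threshold)
    for the split with the lowest weighted Gini impurity."""
    best = None
    for attr in rows[0]:
        for threshold in sorted({r[attr] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[attr] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[attr] > threshold]
            if not left or not right:
                continue  # a split must put records on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best


# Made-up training data: 1 = diabetic, 0 = not diabetic.
patients = [
    {"glucose": 85, "bmi": 26.6},
    {"glucose": 148, "bmi": 33.6},
    {"glucose": 89, "bmi": 28.1},
    {"glucose": 183, "bmi": 23.3},
]
outcomes = [0, 1, 0, 1]
score, attribute, threshold = best_split(patients, outcomes)
print(attribute, threshold, score)
```

The labels in `outcomes` are what make this supervised learning: the split is chosen by how well it separates known outcomes.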
Varying data structures

• “Big data” is high-volume, high-velocity, and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making.
What is Big Data:
• It refers to a massive amount of data that keeps on
growing exponentially with time.
• It is so voluminous that it cannot be processed or
analyzed using conventional data processing techniques.
• It includes data mining, data storage, data analysis, data
sharing, and data visualization.
• The term is an all-comprehensive one including data,
data frameworks, along with the tools and techniques
used to process and analyze the data.
Types of Big Data
• Structured (retail, financial, bioinformatics, geodata)
• Semi-structured (web logs, email, documents)
• Unstructured (images, video, sensor data, web pages)
Structured Data
• Structured data is exemplified by data contained in relational
databases and spreadsheets. Structured data conforms to a
database model, which is largely characterized by the various fields that data
belongs to (name, address, age and so forth), and the data
type for each field (numeric, currency, alphabetic, name, date,
address).
• The model also has a notion of restrictions or constraints on
each field (for example, integers in a certain range), and
constraints between elements in the various fields that are
used to enforce a notion of consistency (no duplicates, cannot
be scheduled in two different places at the same time, etc.)
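Typed fields and constraints of this kind can be shown with a small relational table. This sketch uses Python's built-in sqlite3 module with an in-memory database; the `person` table, its columns, the age range, and the sample names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        name TEXT NOT NULL UNIQUE,                       -- no duplicates
        age  INTEGER CHECK (age BETWEEN 0 AND 150)       -- integers in a range
    )
    """
)
conn.execute("INSERT INTO person VALUES (?, ?)", ("Asha", 34))


def try_insert(name, age):
    """Return True if the row satisfies the table's constraints."""
    try:
        conn.execute("INSERT INTO person VALUES (?, ?)", (name, age))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("Ravi", 41))   # valid row
print(try_insert("Asha", 50))   # rejected: duplicate name
print(try_insert("Mina", 200))  # rejected: age outside the allowed range
```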
Unstructured Data
• Unstructured Data (or unstructured information) refers to
information that either does not have a pre-defined data
model or is not organized in a predefined manner.
• Unstructured information is typically text-heavy, but may
also contain data such as dates, numbers, and facts.
• Other examples include the “raw” (untagged) data
representing photos and graphic images, videos, streaming,
web pages, PDF files, PowerPoint presentations, emails,
blog entries, wikis, and word processing documents.
Semi-Structured Data

• Semi-structured data lies in between structured and
unstructured data. It is a type of structured data, but
lacks a strict structure imposed by an underlying data
model.
• With semi-structured data, tags or other types of
markers are used to identify certain elements within
the data, but the data doesn’t have a rigid structure
from which complete semantic meaning can be easily
extracted without much further processing.
Semi-Structured Data
• For example, word processing software now can include
metadata showing the author's name and the date created,
while the bulk of the document contains unstructured text.
(Sophisticated learning algorithms would have to mine the text
to understand what the text was about, because no model
exists that classifies the text into neat categories).
• Emails have the sender, recipient, date, time and other fixed
fields added to the unstructured data of the email message
content and any attachments.
• Photos or other graphics can be tagged with keywords such as
the creator, date, location and other content-specific keywords
(such as names of people in the photos), making it possible to
organize and locate graphics. XML and other markup
languages are often used to manage semi-structured data.
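The email case can be made concrete with Python's standard email parser: the header fields are easy to extract by name, while the body is free text that needs further processing to interpret. The message below is made up for illustration.

```python
from email import message_from_string

# Made-up message: fixed header fields followed by an unstructured body.
raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 01 Jan 2024 09:00:00 +0000
Subject: Quarterly report

Hi Bob, the numbers look good. Let's talk tomorrow.
"""

msg = message_from_string(raw)

# Structured part: fields identified by their tags.
fields = {name: msg[name] for name in ("From", "To", "Date", "Subject")}

# Unstructured part: plain text with no schema.
body = msg.get_payload()

print(fields)
print(body)
```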
Yet another way to characterize the domains is to
look at the types of industries that generate and
need to extract information from the data
• Financial services
• Retail
• Network security
• Large-scale science
• Social networking
• Internet of Things/sensor networks
• Visual media
List of the top 10 industries using big data
applications:
1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale trade
9. Transportation
10. Energy and Utilities
list of the top 10 industries using big data
applications:
1. Banking and Securities
– Industry-specific Big Data Challenges
– A study of 16 projects in 10 top investment and retail banks
shows that the challenges in this industry include: securities
fraud early warning, tick analytics, card fraud detection,
archival of audit trails, enterprise credit risk reporting, trade
visibility, customer data transformation, social analytics for
trading, IT operations analytics, and IT policy compliance
analytics, among others.
list of the top 10 industries using big data
applications:
– Applications of Big Data in the Banking and Securities Industry
• The Securities and Exchange Commission (SEC) is using Big Data to monitor
financial market activity. It is currently using network analytics and
natural language processors to catch illegal trading activity in the financial
markets.
• Retail traders, big banks, hedge funds, and other so-called ‘big boys’ in
the financial markets use Big Data for trade analytics used in
high-frequency trading, pre-trade decision-support analytics, sentiment
measurement, predictive analytics, etc.
• This industry also relies heavily on Big Data for risk analytics, including
anti-money laundering, demand enterprise risk management, "Know Your
Customer," and fraud mitigation.
• Big Data providers specific to this industry include 1010data,
Panopticon Software, Streambase Systems, Nice Actimize, and Quartet
FS.
list of the top 10 industries using big data
applications:
2. Communications, Media and Entertainment
• Industry-specific Big Data Challenges
• Since consumers expect rich media on-demand in different
formats and a variety of devices, some Big Data challenges in the
communications, media, and entertainment industry include:
• Collecting, analyzing, and utilizing consumer insights
• Leveraging mobile and social media content
• Understanding patterns of real-time, media content usage
list of the top 10 industries using big data
applications:
• Applications of Big Data in the Communications, Media and
Entertainment Industry
• Organizations in this industry simultaneously analyze customer
data along with behavioral data to create detailed customer
profiles that can be used to:
• Create content for different target audiences
• Recommend content on demand
• Measure content performance
list of the top 10 industries using big data
applications:
• Applications of Big Data in the Communications, Media and
Entertainment Industry
• A case in point is the Wimbledon Championships, which
leverage Big Data to deliver detailed sentiment
analysis on the tennis matches to TV, mobile, and web users in
real time.
• Spotify, an on-demand music service, uses Hadoop Big Data
analytics to collect data from its millions of users worldwide and
then uses the analyzed data to give informed music
recommendations to individual users.
• Amazon Prime, which is driven to provide a great customer
experience by offering video, music, and Kindle books in a
one-stop shop, also heavily utilizes Big Data.
list of the top 10 industries using big data
applications:
3. Healthcare Providers
• The healthcare sector has access to huge amounts of data but has
been plagued by failures in utilizing the data to curb the cost of
rising healthcare and by inefficient systems that stifle faster and
better healthcare benefits across the board.
• This is mainly because electronic data is unavailable, inadequate,
or unusable. Additionally, the healthcare databases that hold
health-related information have made it difficult to link data that
can show patterns useful in the medical field.
list of the top 10 industries using big data
applications:
• Healthcare Providers

Source: Big Data in the Healthcare Sector Revolutionizing the Management of Laborious Task
list of the top 10 industries using big data
applications:
• Healthcare Providers
• Applications of Big Data in the Healthcare Sector

• Some hospitals, like Beth Israel, are using data collected from a cell phone app,
from millions of patients, to allow doctors to use evidence-based medicine as
opposed to administering several medical/lab tests to all patients who go to the
hospital.
• A battery of tests can be efficient, but it can also be expensive and is usually
ineffective.
• Free public health data and Google Maps have been used by the University of
Florida to create visual data that allows for faster identification and efficient
analysis of healthcare information, used in tracking the spread of chronic
disease.
• Obamacare has also utilized Big Data in a variety of ways.
list of the top 10 industries using big data
applications:
4. Education
• Industry-specific Big Data Challenges
• From a technical point of view, a significant challenge in the
education industry is to incorporate Big Data from different sources
and vendors, which were not designed to work with one another, and
to utilize it on platforms that were not designed for
such varied data.
• From a practical point of view, staff and institutions have to learn
new data management and analysis tools.
• Politically, issues of privacy and personal data protection associated
with Big Data used for educational purposes are a challenge.
list of the top 10 industries using big data
applications:
4. Education
Applications of Big Data in Education
• Big data is used quite significantly in higher education. For example,
the University of Tasmania, an Australian university with over
26,000 students, has deployed a Learning and Management System that
tracks, among other things, when a student logs onto the system, how
much time is spent on different pages in the system, as well as the
overall progress of a student over time.
• In a different use case in education, Big Data is also
used to measure teachers’ effectiveness to ensure a pleasant
experience for both students and teachers. Teachers’ performance can
be fine-tuned and measured against student numbers, subject matter,
student demographics, student aspirations, behavioral classification,
and several other variables.
list of the top 10 industries using big data
applications:
5. Manufacturing and Natural Resources
• Increasing demand for natural resources, including oil,
agricultural products, minerals, gas, metals, and so on, has led to
an increase in the volume, complexity, and velocity of data that is
a challenge to handle.
• Similarly, large volumes of data from the manufacturing industry
are untapped. The underutilization of this information prevents
improvements in product quality, energy efficiency, reliability, and
profit margins.
list of the top 10 industries using big data applications:

• Applications of Big Data in Manufacturing and Natural
Resources
• In the natural resources industry, Big Data allows for
predictive modeling to support decision making, which has
been used for ingesting and integrating large amounts of
geospatial data, graphical data, text, and temporal
data.
• Areas of interest where this has been used include seismic
interpretation and reservoir characterization.
list of the top 10 industries using big data applications:

6. Government
• Industry-specific Big Data Challenges
• In governments, the most significant challenges are the integration
and interoperability of Big Data across different government
departments and affiliated organizations.
• Applications of Big Data in Government
• In public services, Big Data has an extensive range of
applications, including energy exploration, financial market
analysis, fraud detection, health-related research, and
environmental protection.
list of the top 10 industries using big data applications:

7. Insurance
• Industry-specific Big Data Challenges
• Lack of personalized services, lack of personalized pricing, and
the lack of targeted services to new segments and specific market
segments are some of the main challenges.
• In a survey conducted by Marketforce, challenges identified by
professionals in the insurance industry include underutilization of
data gathered by loss adjusters and a hunger for better insight.
list of the top 10 industries using big data applications:

• Applications of Big Data in the Insurance Industry
• Big data has been used in the industry to provide customer
insights for transparent and simpler products, by analyzing and
predicting customer behavior through data derived from social
media, GPS-enabled devices, and CCTV footage. Big Data
also allows insurance companies to retain customers
better.
• When it comes to claims management, predictive analytics from
Big Data has been used to offer faster service since massive
amounts of data can be analyzed mainly in the underwriting stage.
Fraud detection has also been enhanced.
list of the top 10 industries using big data applications:

8. Retail and Wholesale trade


• Industry-specific Big Data Challenges
• From traditional brick-and-mortar retailers and wholesalers to
current-day e-commerce traders, the industry has gathered a lot of
data over time. This data, derived from customer loyalty cards,
POS scanners, RFID, etc., is not being used enough to improve
customer experiences on the whole. Any changes and
improvements made have been quite slow.
Big Data Challenges
Big data applications:

• Big data applications can help companies make better business
decisions by analyzing large volumes of data and discovering
hidden patterns. These data sets might come from social media, data
captured by sensors, website logs, customer feedback, etc.
• Organizations are spending huge amounts on big data applications
to discover hidden patterns, unknown associations, market trends,
consumer preferences, and other valuable business information.
• The following are domains where big data can be applied:
– health care,
– media and entertainment,
– IoT,
– manufacturing, and
– government.
Big data applications:

• The following are domains where big data can be applied:


• Health care
• There is a significant improvement in the healthcare domain by
personalized medicine and prescriptive analytics due to the role of
big data systems. Researchers analyze the data to determine the
best treatment for a particular disease, side effects of the drugs,
forecasting the health risks, etc. Mobile applications on health and
wearable devices are causing available data to grow at an
exponential rate. It is possible to predict a disease outbreak by
mapping healthcare data and geographical data.
Big data applications:

• The following are domains where big data can be applied:


• Media and entertainment
• The media and entertainment industries are creating, advertising,
and distributing their content using new business models. This is
due to customer requirements to view digital content from any
location and at any time. The introduction of online TV shows,
Netflix channels, etc. is proving that new customers are not only
interested in watching TV but are interested in accessing data
from any location. The media houses are targeting audiences by
predicting what they would like to see, how to target the ads,
content monetization, etc. Big data systems are thus increasing the
revenues of such media houses by analyzing viewer patterns.
Big data applications:

• The following are domains where big data can be applied:


• Internet of Things
• IoT devices generate continuous data and send it to a server on
a daily basis. This data is mined to map how the
devices are interconnected. This mapping can be put to good use
by government agencies and also a range of companies to increase
their competence. IoT is finding applications in smart irrigation
systems, traffic management, crowd management, etc.
Big data applications:

• The following are domains where big data can be applied:


• Manufacturing
• Predictive manufacturing can help to increase efficiency by
producing more goods while minimizing the downtime of machines.
This involves a massive quantity of data for such industries.
Sophisticated forecasting tools follow an organized process to
extract valuable information from this data. The following are
some of the major advantages of employing big data
applications in manufacturing industries:
– high product quality,
– tracking faults,
– supply planning,
– predicting the output,
– increasing energy efficiency,
– testing and simulation of new manufacturing processes, and
– large-scale customization of manufacturing.
Big data applications:

• The following are domains where big data can be applied:


• Government
• By adopting big data systems, the government can attain
efficiencies in terms of cost, output, and novelty. Since the same
data set is used in many applications, many departments can work
in association with each other. Government plays an important
role in innovation by acting in all these domains.
Big data applications:

• Big data applications can be applied in each and every field.


• Some of the major areas where big data finds applications include:
• agriculture,
• aviation,
• cyber security and intelligence,
• crime prediction and prevention,
• e-commerce,
• fake news detection,
• fraud detection,
• pharmaceutical drug evaluation,
• scientific research,
• weather forecasting, and
• tax compliance.
Big data applications:

• Big data applications can be applied in each and every field.


• Recommendation: By tracking customers' spending habits and
shopping behavior, big retail stores provide recommendations to
their customers. E-commerce sites like Amazon, Walmart, and Flipkart
make product recommendations. They track what products a
customer searches for and, based on that data, recommend that
type of product to the customer.
• As an example, suppose a customer searched for bed covers on
Amazon. Amazon now has data indicating that the customer may be interested in
buying a bed cover.
• The next time that customer visits a Google page,
advertisements for various bed covers will be shown. Thus,
advertisements for the right product can be
sent to the right customer.
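The search-based recommendation idea can be sketched in a few lines: count which product categories a customer has searched for, then suggest products from those categories. This is a toy illustration, not how Amazon or any other site actually works; the catalog, the `recommend` function, and the search history are all hypothetical.

```python
from collections import Counter

# Hypothetical catalog: product name -> category.
CATALOG = {
    "floral bed cover": "bed cover",
    "striped bed cover": "bed cover",
    "desk lamp": "lighting",
    "floor lamp": "lighting",
}


def recommend(search_history, catalog=CATALOG, limit=3):
    """Rank products by how often the customer searched their category."""
    interest = Counter(search_history)
    scored = [
        (interest[category], product)
        for product, category in catalog.items()
        if interest[category] > 0
    ]
    # Most-searched categories first; ties broken alphabetically by product.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [product for _, product in scored[:limit]]


# Made-up search history for one customer.
suggestions = recommend(["bed cover", "bed cover", "lighting"])
print(suggestions)
```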
Big data applications:

• Big data applications can be applied in each and every field.


• YouTube also shows recommended videos based on the types of video
the user has previously liked and watched. Based on the content of the video the
user is watching, relevant advertisements are shown while the video
is running. As an example, suppose someone is watching a tutorial
video on big data; an advertisement for some other big data
course will then be shown during that video.
