Unit-1 Introduction To Big Data



Prepared By:
Aayushi Chaudhari,
Assistant Professor, CE, CSPIT,

Wednesday, April 12, 2023| U & P U. Patel Department of Computer Engineering 1

• What is Big data?
• What is Big Data Analytics
• Big Data Characteristics
• Types of Big Data
• Big data domains
• Use cases of big data
• Traditional vs Big Data Approach
• Big data analytics technologies
• Infrastructure for Big Data
• Use of Data Analytics
• Big Data Challenges
Introduction to Big Data


What is Big Data?
• A massive amount of data that cannot be stored, processed, and analyzed
using traditional tools is known as big data.
• Big data typically involves volumes on the order of terabytes or petabytes.
• It includes collections of data sets so large and complex that they become
difficult to process using traditional data processing tools and applications.
• Complex data does not mean just a few tables or columns; it contains
various types of data, such as structured, unstructured, and semi-structured.
• E.g., Facebook data (500+ TB/day)
• Data arrives at very high velocity.
What is Big Data?
• An enormous amount of data will be generated by smart devices within the
next 5 years.
• Around 50 billion smart devices will be in operation in that era.
• Even today, 2.5 quintillion bytes of data are generated every day.
• There will be around 6.1 billion smartphone users, which is 4 to
6 times more than now.
• As of now, about 35% of companies are using big data tools.


What is big data analytics?
It is the process of extracting meaningful information from big data, such as hidden patterns,
unknown correlations, market trends, and customer preferences.
It is useful for lowering the risk in banking systems.

It is used for product development and innovations.

It helps in quicker and better decision making in organizations.

It helps improve the customer experience.


Lifecycle of big data analytics
Stage 1: The motive behind the analysis is finalized (how and what).
Stage 2: Identify the sources from which the data will be gathered.
Stage 3: Remove unwanted and corrupt data using a filtering process.
Stage 4: Make the data compatible with the analytics tool by extracting
and transforming it into a compatible form.
Stage 5: Validate and clean the data.
Stage 6: Integrate data that share the same fields.
Stage 7: Use analytical and statistical tools to get meaningful
information from the data.
Stage 8: Visualize the results of Stage 7 graphically.
Stage 9: Organizations make decisions based on the results.
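
The middle stages of this lifecycle can be sketched as a small pipeline. This is only an illustration: the in-memory records and the function names (`filter_corrupt`, `transform`, `validate`, `integrate`, `analyze`) are hypothetical and do not come from the slides. A real pipeline would read from the sources identified in Stage 2 and use a big data framework rather than plain lists.

```python
from statistics import mean

# Hypothetical raw records gathered from two sources (Stage 2).
RAW = [
    {"source": "web", "user": "a", "amount": "10.5"},
    {"source": "web", "user": "b", "amount": None},     # corrupt: missing value
    {"source": "store", "user": "a", "amount": "4.5"},
    {"source": "store", "user": "c", "amount": "-3"},   # invalid: negative amount
]


def filter_corrupt(records):
    """Stage 3: drop records that have missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Stage 4: convert fields to the types the analysis expects."""
    return [{**r, "amount": float(r["amount"])} for r in records]


def validate(records):
    """Stage 5: keep only records that pass a simple validity rule."""
    return [r for r in records if r["amount"] >= 0]


def integrate(records):
    """Stage 6: merge records that share the same key field (user)."""
    merged = {}
    for r in records:
        merged.setdefault(r["user"], []).append(r["amount"])
    return merged


def analyze(merged):
    """Stage 7: compute a summary statistic per user."""
    return {user: mean(amounts) for user, amounts in merged.items()}


def run_pipeline(records):
    """Chain Stages 3 to 7 in order."""
    return analyze(integrate(validate(transform(filter_corrupt(records)))))


if __name__ == "__main__":
    # Stage 8 would chart this result; here it is simply printed.
    print(run_pipeline(RAW))
```

Keeping each stage as a separate function mirrors the lifecycle: any stage can be swapped or tested on its own without touching the others.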
Big Data Characteristics
Volume: the size and amounts of big data that companies
manage and analyze.
Value: the most important “V” from the perspective of the
business, the value of big data usually comes from insight
discovery and pattern recognition.
Variety: the diversity and range of different data types, including
unstructured data, semi-structured data and raw data.
Velocity: the speed at which companies receive, store and
manage data.
Veracity: the “truth” or accuracy of data and information assets,
which often determines executive-level confidence.


Types of big data analytics
There are four main types of big data analytics: diagnostic, descriptive, prescriptive, and
predictive analytics.


Descriptive Analytics - What happened?
• Descriptive analytics is a useful technique for uncovering patterns
within a certain segment of customers.
• It simplifies the data and summarizes past data into a readable form.
• Descriptive analytics provides insight into what has occurred in the past and
highlights trends worth digging into in more detail.
• This helps in creating reports on a company’s revenue, profits, sales, and so on.
• Examples of descriptive analytics include:
• Data Queries
• Reports
• Descriptive Statistics
• Data dashboard
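
As a small illustration of the "descriptive statistics" item, the sketch below summarizes past data into a readable form. The sales figures and the `describe` helper are hypothetical, chosen only for the example.

```python
from statistics import mean, median, pstdev

# Hypothetical monthly sales figures (not from the slides).
monthly_sales = [120, 135, 128, 150, 170, 165]


def describe(values):
    """Summarize past data with basic descriptive statistics."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std_dev": pstdev(values),  # population standard deviation
    }


if __name__ == "__main__":
    for name, value in describe(monthly_sales).items():
        print(f"{name}: {value}")
```

A report or dashboard is essentially this kind of summary, computed over much larger data and presented visually.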


Diagnostic Analytics - Why did this happen?
• Diagnostic analytics, as the name suggests, gives a diagnosis of a problem.
• It gives detailed, in-depth insight into the root cause of a problem.
• Data scientists turn to this type of analytics when they want the reason behind a
particular event.
• Techniques like drill-down, data mining, data recovery, churn reason
analysis, and customer health score analysis are all examples of diagnostic
analytics.


Predictive Analytics - What might happen in the future?
• Predictive analytics, as the name suggests, is concerned with
predicting future incidents. These can be market trends, consumer
trends, and many other market-related events.
• This type of analytics uses historical and present data to predict future
events. This is the most commonly used form of analytics among businesses.
• Example: It uses all past payment data and user behavior data to predict fraudulent
activity.
• One predictive analytics tool is regression analysis, which can determine the relationship
between two variables (simple linear regression) or three or more variables (multiple
regression).
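
To make the regression idea concrete, here is a minimal sketch of a two-variable linear regression fitted by ordinary least squares. The advertising and sales numbers and the function names are hypothetical; a real project would more likely use a statistics library.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


# Hypothetical historical data: advertising spend vs. sales.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

if __name__ == "__main__":
    slope, intercept = fit_simple_linear_regression(ad_spend, sales)
    print("slope:", slope, "intercept:", intercept)
    print("predicted sales at spend 6:", predict(6, slope, intercept))
```

The fitted line is learned from historical data and then applied to an unseen input, which is the basic pattern behind predictive analytics.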


Prescriptive Analytics - What should we do next?
• Prescriptive analytics is the process of using data to determine an optimal course of action.
• This type of analysis goes beyond explanations and predictions to recommend the best
course of action moving forward.
• Prescriptive analytics is a valuable tool for data-driven decision-making.
• Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions
that benefit from the predictions and showing the decision maker the implications of each
decision option.
• Prescriptive analytics anticipates not only what will happen and when it will happen but also
why it will happen.
• Prescriptive analytics can suggest decision options on how to take advantage of a future
opportunity or mitigate a future risk, and illustrate the implications of each decision
option.
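
One simple way to picture "recommending a course of action" is to score each decision option against a prediction and pick the best one. The sketch below does this with expected payoff. The demand probabilities, payoff table, and function names are all hypothetical, and real prescriptive systems typically use optimization or simulation rather than a lookup table.

```python
# Hypothetical predicted probabilities of next month's demand level.
demand_forecast = {"low": 0.2, "medium": 0.5, "high": 0.3}

# Hypothetical profit for each (action, demand level) pair.
payoff = {
    "stock_small": {"low": 50, "medium": 60, "high": 60},
    "stock_medium": {"low": 20, "medium": 100, "high": 110},
    "stock_large": {"low": -40, "medium": 80, "high": 200},
}


def expected_payoffs(forecast, payoff_table):
    """Show the implication of each option: its expected payoff."""
    return {
        action: sum(forecast[level] * value for level, value in outcomes.items())
        for action, outcomes in payoff_table.items()
    }


def recommend(forecast, payoff_table):
    """Return the option with the highest expected payoff, plus all scores."""
    scores = expected_payoffs(forecast, payoff_table)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = recommend(demand_forecast, payoff)
    for action, score in scores.items():
        print(f"{action}: expected payoff {score:.1f}")
    print("recommended action:", best)
```

Returning every score alongside the recommendation reflects the point that the decision maker should see the implication of each option, not just the winner.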
Big data domains - Healthcare


Major benefits of using big data applications in the manufacturing industry are:
• Product quality and defects tracking
• Supply planning
• Manufacturing process defect tracking
• Output forecasting
• Increasing energy efficiency
• Testing and simulation of new manufacturing processes
• Support for mass-customization of manufacturing


Media & Entertainment
Big data applications benefit the media and entertainment industry by:
• Predicting what the audience wants
• Scheduling optimization
• Increasing acquisition and retention
• Ad targeting
• Content monetization and new product development


Use Cases of Big data analytics
• Log analytics
Big data log analytics applications are now widely used for various business goals, from IT
system security and network performance, to market trends and e-commerce personalization.
• E-commerce personalization
A powerful search and big data analytics platform allows e-commerce companies to (1) clean
and enrich product data for a better search experience on both desktops and mobile devices;
and (2) use predictive analytics and machine learning to predict user preferences through log
data, then personalize products in a most-likely-to-buy order that maximizes conversion. 


Use Cases of Big data analytics
• Recommendation engines
Big data, with its scalability and power to process massive amounts of both structured (e.g.,
video titles users search for, music genres they prefer) and unstructured data (e.g., user
viewing/listening patterns), enables companies to analyze billions of clicks and viewing
records from you and other users like you to produce the best recommendations.
• Automated candidate placement in recruiting
A big data recruitment platform can mine from internal databases and provide a 360-degree
view of a candidate, such as education, experience, skill sets, job titles, certifications,
geography, and anything else recruiters can think of, then compare them to the company’s past
hiring experience, salaries, previously successful candidates, etc. to identify the “best match.” 

Use Cases of Big data analytics


Big data Analytical Tools


Big Data Analytics Tools
Top 10 big data tools –
• Apache Hadoop
• Apache Spark
• Flink
• Apache Storm
• Apache Cassandra
• MongoDB
• Kafka
• Tableau
• RapidMiner
• R Programming
Apache Hadoop
• Apache Hadoop is one of the most widely used tools in the big data industry.
• Hadoop is an open-source framework from Apache that runs on commodity hardware. It is
used to store, process, and analyze big data.
• Hadoop is written in Java. It enables parallel processing of data by working
on multiple machines simultaneously, using a clustered architecture.
• Hadoop MapReduce does not support real-time processing; it supports only batch processing.
• Hadoop MapReduce does not perform in-memory computation; intermediate results are written to disk.
It consists of 3 parts:
• Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.
• Map-Reduce – It is the data processing layer of Hadoop.
• YARN – It is the resource management layer of Hadoop.
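
The Map-Reduce processing model can be illustrated with the classic word count. The sketch below is plain Python running on one machine, not the Hadoop API; the function names are hypothetical and only mimic the map, shuffle, and reduce phases that Hadoop distributes across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over all input splits."""
    mapped = []
    for doc in documents:  # on a cluster, each split would be mapped in parallel
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data tools", "big data analytics"]))
```

Because each map call depends only on its own split and each reduce call only on its own key, the work can be spread across many machines, which is what the processing layer relies on.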
Apache Spark
Spark supports both real-time and batch processing. It is a general-purpose
cluster computing system.
It also supports in-memory computation, which can make it up to 100 times faster
than Hadoop MapReduce for some workloads. This is made possible by reducing the number of read/write
operations to disk.
It provides flexibility and versatility, as it works with different data stores
such as HDFS, OpenStack, and Apache Cassandra.
It offers high-level APIs in Java, Python, Scala, and R.
It also provides more than 80 high-level operators for efficient query execution.


Apache Storm
• Apache Storm is an open-source, distributed, real-time, fault-tolerant big data
processing system.
• It efficiently processes unbounded streams of data.
• Unbounded streams are data that is ever-growing: it has a beginning
but no defined end.
• Apache Storm can be used with any programming language, and it also supports
JSON-based protocols.
• The processing speed of Storm is very high.
• It is easily scalable and fault-tolerant.
• It is easy to use, and it guarantees that every unit of data is processed.
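
The notion of an unbounded stream can be illustrated with a Python generator that never terminates. This is not the Storm API; `event_stream` and `running_counts` are hypothetical names, and the sketch only shows the idea of processing events one at a time as they arrive instead of waiting for a complete data set.

```python
import itertools
from collections import Counter


def event_stream():
    """A hypothetical unbounded source: it has a beginning but never ends."""
    for i in itertools.count():
        yield "click" if i % 3 else "purchase"


def running_counts(stream):
    """Process events one at a time, yielding updated counts after each."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield dict(counts)


if __name__ == "__main__":
    # Take only the first 6 results; the stream itself would go on forever.
    for snapshot in itertools.islice(running_counts(event_stream()), 6):
        print(snapshot)
```

The consumer decides how much of the stream to observe; the source and the processing step never assume the data has an end.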


Apache Cassandra
• Apache Cassandra is a distributed database that provides high availability
and scalability without compromising performance efficiency.
• It is one of the best big data tools that can accommodate all types of data sets
namely structured, semi-structured, and unstructured.
• It is the perfect platform for mission-critical data with no single point of
failure and provides fault tolerance on both commodity hardware and
cloud infrastructure.
• Cassandra works quite efficiently under heavy loads. It provides atomic,
isolated, and durable writes with tunable consistency rather than full ACID
(Atomicity, Consistency, Isolation, and Durability) transactions.
MongoDB
• MongoDB is an open-source, cross-platform NoSQL database used as a data analytics
tool. It is well suited to businesses that need fast-moving, real-time
data for making decisions.
• MongoDB suits those who want data-driven solutions. It is user-friendly, as it
offers easy installation and maintenance. MongoDB is reliable as well as cost-effective.
• It is written in C, C++, and JavaScript. It is one of the most popular databases for big
data, as it facilitates the management of unstructured data or data that changes
frequently.
• MongoDB uses dynamic schemas, so you can prepare data quickly. This helps
reduce overall cost. It runs on the MEAN software stack, .NET applications, and
the Java platform. It is also flexible in cloud infrastructure.
Apache Flink
• Apache Flink is an open-source distributed processing framework
and data analytics tool for bounded and unbounded data streams.
• It is written in Java and Scala. It provides highly accurate results
even for late-arriving data.
• Flink is stateful and fault-tolerant, i.e., it can recover
from faults easily. It provides high performance at a large
scale, running on thousands of nodes.
• It provides a low-latency, high-throughput streaming engine and
supports event time and state management.
Apache Kafka
• Apache Kafka is an open-source platform that was created by LinkedIn in
2011.
• Apache Kafka is a distributed event processing and streaming platform that
provides high throughput. It is efficient enough to handle
trillions of events a day. It is highly scalable and
also provides strong fault tolerance.
• The streaming process includes publishing and subscribing to streams of
records, similar to a messaging system, storing these records durably, and
then processing them. These records are stored in groups called topics.
• Apache Kafka offers high-speed streaming and is designed for zero downtime.
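
The publish/subscribe flow with topics can be sketched with a toy in-memory broker. This is not the Kafka client API: `ToyBroker` and its methods are hypothetical, and the sketch leaves out partitions, replication, and durable storage. It only shows that records are appended to a named topic and that each consumer reads from its own position in that topic.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only log."""

    def __init__(self):
        self._topics = defaultdict(list)
        # (consumer, topic) -> index of the next record that consumer will read
        self._offsets = defaultdict(int)

    def publish(self, topic, record):
        """Append a record to a topic and return its offset."""
        self._topics[topic].append(record)
        return len(self._topics[topic]) - 1

    def poll(self, consumer, topic):
        """Return records this consumer has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]


if __name__ == "__main__":
    broker = ToyBroker()
    broker.publish("orders", {"id": 1})
    broker.publish("orders", {"id": 2})
    print(broker.poll("billing", "orders"))    # both records
    print(broker.poll("analytics", "orders"))  # both records again, independently
    print(broker.poll("billing", "orders"))    # nothing new
```

Because records stay in the log after being read, several subscribers can consume the same topic independently, which is the main difference from a queue that deletes messages on delivery.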


Tableau
Tableau is one of the best data visualization and software solution tools in the Business
Intelligence industry. It’s a tool that unleashes the power of your data.
It turns your raw data into valuable insights and enhances the decision-making process.
Tableau offers a rapid data analysis process, which results in visualizations in the form of
interactive dashboards and worksheets.
It works in synchronization with other big data tools such as Hadoop.
Tableau’s data blending capabilities are among the best in the market. It provides efficient
real-time analysis.
Tableau is not bound only to the technology industry but is a crucial part of some other
industries as well. This software doesn’t require any technical or programming skills to
operate.
RapidMiner
• RapidMiner is a cross-platform tool that provides a robust environment for
data science, machine learning, and data analytics procedures. It is an
integrated platform for the complete data science lifecycle, from
data preparation to machine learning to predictive model deployment.
• RapidMiner is an open-source tool written in Java.
• RapidMiner offers high efficiency even when integrated with APIs and
cloud services. It provides robust data science tools and algorithms.


R Programming
R is an open-source programming language and is one of the most comprehensive
statistical analysis languages.
It is a multi-paradigm programming language that offers a dynamic development
environment. As it is an open-source project, thousands of people have contributed
to the development of R.
R is written in C and Fortran. It is one of the most widely used statistical analysis tools as it
provides a vast package ecosystem.
It facilitates the efficient performance of different statistical operations and helps in
generating the results of data analysis in graphical as well as text format.


Thank You.
