Big Data PPT Unit 1
Introduction
Unit: 1
2. Understand the dimensions of tools required to manage and analyze big data,
such as Hadoop and MapReduce.
4. Identify the skills that will help them solve complex real-world
problems using YARN, MongoDB, Scala, and Spark.
5. Identify the importance of Hadoop ecosystem framework tools such as
Pig, Hive, and HBase.
11/27/2024 5
THE CONCEPT LEARNING TASK
Course Outcome
Prerequisites:
• Linux operating system.
• Java.
• MySQL.
Recap:
• Discussion about Big Data Environments.
Objective:
In this topic we learn how big data came into existence and
what industry need there was for Big Data. This shows the
importance of the technology's innovation as an open-source framework.
Recap:
Revision of database systems.
History of Big Data innovation
Batch processing:
• Batch processing is a technique in which an Operating System collects the programs
and data together in a batch before processing starts. An operating system does the
following activities related to batch processing.
• The OS defines a job, which has a predefined sequence of commands, programs, and
data as a single unit.
• The OS keeps a number of jobs in memory and executes them without any manual
intervention.
• Jobs are processed in the order of submission, i.e., first come first served fashion.
• When a job completes its execution, its memory is released and the output for the job
gets copied into an output spool for later printing or processing.
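The batch-processing behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not real OS code: the `BatchProcessor` class, job names, and sample jobs are invented for the example.

```python
from collections import deque

# A minimal sketch of batch processing: jobs (a program plus its data,
# as a single unit) are collected into a batch first, then executed in
# first-come, first-served order without manual intervention.
class BatchProcessor:
    def __init__(self):
        self.jobs = deque()          # jobs kept "in memory" until the run

    def submit(self, name, program, data):
        self.jobs.append((name, program, data))

    def run(self):
        spool = []                   # output spool for later printing/processing
        while self.jobs:
            name, program, data = self.jobs.popleft()  # FIFO order
            spool.append((name, program(data)))        # job's slot is released
        return spool

batch = BatchProcessor()
batch.submit("job1", sum, [1, 2, 3])
batch.submit("job2", max, [4, 9, 2])
output_spool = batch.run()           # [("job1", 6), ("job2", 9)]
```

Note how the output spool mirrors the submission order: job1's result comes first because jobs are served first come, first served.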
Objective:
This topic introduces big data as an open-source framework with
its ecosystems, and also how processing of huge amounts of data
takes place in cloud infrastructure.
Recap:
Revision of cloud interfaces and high performance computing.
Introduction to Big Data platform
Objective:
This topic depicts the basic drivers behind the innovation of the big data
framework, and deals with the business needs for Big Data and the need for
the framework in the current scenario.
Recap:
Revision of Google file systems.
Drivers for Big Data
Objective:
This topic deals with the different types of big data
requirements on digital platforms and how they are occupying space
in our day-to-day environment.
Recap:
Revision of cloud infrastructure basics.
Types of digital data
Objective:
This unit focuses on different types of big data components and data
issues which are to be managed with the technology innovations.
Recap:
Revision of data generation ethics and mechanism.
Types of Big Data
Objective:
This unit basically deals with the big data ecosystem and
frameworks. It also focuses on how we can manage big data with
open-source frameworks.
Recap:
Revision of data generation process.
Big Data technology components
Objective:
This unit's objective is to specify the application areas of Big Data
and to list the importance of the Big Data environment against industry
standards.
Recap:
Revision of need of Big Data in Industry.
Big Data importance and applications
1. Data Management
2. Data Mining
3. Hadoop
4. In-Memory Analytics
5. Predictive Analytics
6. Text Mining
Why is big data analytics important?
1. Reduced cost
2. Quick decision making
3. New products and features
Objective:
This unit deals with the challenges where Big Data faces obstacles to
implementation, and also shows how these challenges can be met
with effective solutions.
Recap:
Revision of architecture for implementation of Big Data.
Challenges of conventional systems
Objective:
This unit focuses on the tools used to manage and maintain the
performance of Big Data in industry. It also covers the programming
languages involved and how we can manage large data pools in cluster
computing with the help of these tools.
Recap:
Revision of interface of Big Data.
Analytic processes and tools & Modern data analytic tools
• Datawrapper: It is an online data visualization tool for making interactive charts.
You upload your data file in CSV, PDF, or Excel format, or paste it directly into the
field. Datawrapper then generates a visualization in the form of a bar chart, line chart,
map, etc. It can be embedded into any other website as well. It is easy to use and
produces visually effective charts.
• Content Grabber: Content Grabber is a data extraction tool, suitable for people
with advanced programming skills. It is web crawling software. Businesses can
use it to extract content and save it in a structured format. It offers editing and
debugging facilities, among many others, for later analysis.
• Tableau Public: Tableau is another popular big data tool. It is simple and very intuitive to
use. It communicates the insights of the data through data visualization. Through Tableau,
an analyst can check a hypothesis and explore the data before starting to work on it
extensively.
• Tableau Public is free software that connects to any data source, be it a corporate data
warehouse, Microsoft Excel, or web-based data, and creates data visualizations, maps,
dashboards, etc., with real-time updates presented on the web. These can also be shared
through social media or with the client, and the file can be downloaded in
different formats. To see the power of Tableau, you need a very good
data source. Tableau's Big Data capabilities make it important, and one can analyze
and visualize data better than with any other data visualization software in the market.
November 27, 2024 SOVERS SINGH BISHT (KCS-061) Unit 1
Analytic processes and tools & Modern data analytic tools
• Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP. Also, Python
has very good machine learning libraries, viz. Scikit-learn, Theano, TensorFlow, and
Keras. Another important feature of Python is that it can work with data from almost any
platform, such as a SQL server, a MongoDB database, or JSON. Python can also handle
text data very well.
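A tiny sketch of the last two claims, using only Python's standard library: parsing JSON records and tidying a free-text field. The record structure and user names here are invented for illustration.

```python
import json

# Parse a JSON string into Python objects, then normalize the text field.
raw = '[{"user": "alice", "comment": "Great product!"}, {"user": "bob", "comment": "too SLOW"}]'
records = json.loads(raw)                 # JSON -> list of dicts

def clean(text):
    # Lowercase and keep only letters, digits, and spaces.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

comments = {r["user"]: clean(r["comment"]) for r in records}
# comments == {"alice": "great product", "bob": "too slow"}
```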
• SAS is a programming environment and language for data manipulation and a leader in
analytics; its development began in 1966 and continued through the 1980s and 1990s.
SAS is easily accessible and manageable and can analyze data from any source. SAS introduced a
large set of products in 2011 for customer intelligence, along with numerous SAS modules for web,
social media, and marketing analytics that are widely used for profiling customers and prospects.
It can also predict their behaviors and manage and optimize communications.
• Apache Spark: The University of California, Berkeley's AMPLab developed Apache Spark in 2009.
Apache Spark is a fast, large-scale data processing engine that executes applications in Hadoop
clusters 100 times faster in memory and 10 times faster on disk. Spark is built for data science, and
its concepts make data science effortless. Spark is also popular for data pipelines and machine
learning model development. Spark also includes a library, MLlib, that provides a progressive
set of machine learning algorithms for repetitive data science techniques like classification,
regression, collaborative filtering, clustering, etc.
• MS Excel: Excel is a basic, popular, and widely used analytical tool in almost all industries.
Whether you are an expert in SAS, R, or Tableau, you will still need to use Excel. Excel becomes
important when there is a requirement for analytics on the client's internal data. It handles
complex tasks, summarizing the data with pivot tables that help in filtering the
data as per client requirements. Excel has an advanced business analytics option that adds
modelling capabilities with prebuilt options like automatic relationship detection,
creation of DAX measures, and time grouping.
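To make the pivot-table idea concrete, here is a rough Python sketch of what a pivot table does: group rows by one field and aggregate another. The sales rows and field names are invented sample data, not anything from Excel itself.

```python
from collections import defaultdict

# Invented sample rows, like a small worksheet.
rows = [
    {"region": "North", "product": "A", "sales": 100},
    {"region": "North", "product": "B", "sales": 150},
    {"region": "South", "product": "A", "sales": 200},
]

def pivot(rows, group_by, value):
    """Sum `value` per distinct value of `group_by`, pivot-table style."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += row[value]   # aggregate per group key
    return dict(totals)

by_region = pivot(rows, "region", "sales")    # {"North": 250, "South": 200}
```

Changing `group_by` to `"product"` filters the same data a different way, which is exactly the kind of slicing a pivot table offers interactively.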
Analysis vs Reporting
Reporting helps companies monitor their online business and be alerted when data
falls outside of expected ranges. The goal of analysis is to answer questions by
interpreting the data at a deeper level and providing actionable recommendations.
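The reporting side above can be sketched as a simple monitoring check that pushes an alert when a metric falls outside its expected range. The metric names and ranges below are invented for illustration.

```python
# Expected operating ranges for each monitored metric (invented values).
EXPECTED_RANGES = {"daily_visits": (1000, 50000), "error_rate": (0.0, 0.05)}

def check_metrics(metrics):
    """Return an alert message for every metric outside its expected range."""
    alerts = []
    for name, value in metrics.items():
        low, high = EXPECTED_RANGES[name]
        if not (low <= value <= high):       # outside expected range -> push alert
            alerts.append(f"ALERT: {name}={value} outside [{low}, {high}]")
    return alerts

alerts = check_metrics({"daily_visits": 800, "error_rate": 0.01})
# alerts == ["ALERT: daily_visits=800 outside [1000, 50000]"]
```

This captures reporting's push approach: the system emits canned alerts on its own. Analysis, by contrast, would be an analyst pulling the data to ask *why* visits dropped.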
Objective:
This unit deals with why analysis and reporting came into
existence, and how, with Big Data, we can implement these
mechanisms to enhance the capabilities for processing big data.
Recap:
Revision of the Big Data framework.
Analysis vs Reporting
1. Purpose: Reporting helps companies monitor their data, a need that predates the
digital technology boom. Various organizations have depended on the information it
brings to their business, as reporting extracts it and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channels
of data, provide comparisons, and make information easier to understand (think of
a dashboard, charts, and graphs, which are reporting tools, not analysis reports),
analysis interprets this information and provides recommendations on actions.
2. Outputs: Reporting has a push approach: it pushes information to users, and outputs
come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach: a data analyst draws information to probe further and
to answer business questions. Outputs can take the form of ad hoc responses
and analysis presentations.
3. Value: The path to value illustrates how data converts into value through reporting
and analysis together; neither achieves this without the other.
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
Volume
Veracity
Variety
Value
Velocity
Objective:
This unit deals with the privacy issues of data processed
analytically within the Big Data framework, and how ethics can be
maintained to secure our data. The cloud is not inherently secure,
so we follow certain ethics to maintain privacy.
Recap:
Revision of Big Data Analytics.
Big Data privacy and ethics
• Customers should have a transparent view of how their data is being used or sold,
and the ability to manage the flow of their private information across massive,
third-party analytical systems.
• Big Data should not interfere with human will: Big data analytics can moderate
and even determine who we are before we make up our own minds. Companies
need to begin to think about the kind of predictions and inferences that should be
allowed and the ones that should not.
• Big data should not institutionalize unfair biases like racism or sexism. Machine
learning algorithms can absorb unconscious biases in a population and amplify
them via training samples.
Objective:
This unit focuses on how the data analytics process can be monitored
and maintained within the Big Data framework. It makes us
understand the importance of fundamental issues and techniques
in Big Data analytics.
Recap:
Revision of Data Analytics.
Big Data Analytics
• Data analysts, data scientists, predictive modelers, statisticians and other analytics
professionals collect, process, clean and analyze growing volumes of structured
transaction data as well as other forms of data not used by conventional BI and
analytics programs.
• Here is an overview of the four steps of the data preparation process:
• Data professionals collect data from a variety of different sources. Often, it is a mix
of semi-structured and unstructured data. While each organization will use different
data streams, some common sources include:
• internet clickstream data.
• web server logs.
• cloud applications.
• mobile applications.
• social media content.
• text from customer emails and survey responses.
• mobile phone records and
• machine data captured by sensors connected to the internet of things (IoT).
Objective:
This unit focuses on the nature and types of data, and on the
management and processing of data in huge amounts. Data needs to be
processed for analytics, so here we discuss the mechanism of processing.
Recap:
Revision of types of digital data.
Nature of data
• Data is processed. After data is collected and stored in a data warehouse or data lake, data
professionals must organize, configure and partition the data properly for analytical
queries. Thorough data processing makes for higher performance from analytical queries.
• Data is cleansed for quality. Data professionals scrub the data using scripting tools or
enterprise software. They look for any errors or inconsistencies, such as duplications or
formatting mistakes, and organize and tidy up the data.
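The cleansing step above can be sketched in plain Python: scan records for formatting mistakes and duplications, then tidy them up. The sample records, field names, and normalization rules below are invented for illustration, not from any particular enterprise tool.

```python
# Invented raw records with a formatting mistake (stray case/whitespace)
# and a duplication of the same email address.
raw_records = [
    {"email": " Alice@Example.com ", "country": "india"},
    {"email": "alice@example.com",   "country": "India"},
    {"email": "bob@example.com",     "country": "germany"},
]

def cleanse(records):
    """Fix formatting, drop duplicate emails, and tidy the country field."""
    seen, cleaned = set(), []
    for rec in records:
        email = rec["email"].strip().lower()      # fix formatting mistakes
        if email in seen:                         # drop duplications
            continue
        seen.add(email)
        cleaned.append({"email": email, "country": rec["country"].strip().title()})
    return cleaned

clean_records = cleanse(raw_records)
# clean_records keeps one row per email, with consistent formatting
```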
• The collected, processed and cleaned data is analyzed with analytics software. This
includes tools for:
• data mining, which sifts through data sets in search of patterns and relationships
• predictive analytics, which builds models to forecast customer behavior and other future
developments
• machine learning, which taps algorithms to analyze large data sets
• deep learning, which is a more advanced offshoot of machine learning
• text mining and statistical analysis software
• artificial intelligence (AI)
• mainstream business intelligence software
• data visualization tools
Objective:
This unit focuses on compliance auditing and the protection of
Big Data from third-party interfaces. It focuses on how data can be
managed and secured with policies over the cloud.
Recap:
Revision of Big data Architecture and open source frameworks over
cloud.
Compliance auditing and protection
• The sheer size of Big Data brings with it a major security challenge. Proper
security entails more than keeping the bad guys out; it also means backing up
data and protecting data from corruption.
• Data access: data could be protected by eliminating all access to it, but that is
not pragmatic, so we opt to control access instead.
• Data availability: controlling where the data is stored and how it is
distributed; more control puts you in a better position to protect the data.
• Performance: encryption and other measures can improve security but they
carry a processing burden that can severely affect the system performance!
• Liability: accessible data carries liability with it, such as the sensitivity of
the data, the legal requirements connected to data privacy issues, and IP
concerns.
• Adequate security becomes a strategic balancing act among the above
concerns. With planning, logic, and observation, security becomes
manageable, effectively protecting data while allowing access to
authorized users and systems.
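A toy sketch of "control access rather than eliminate it": gate dataset reads by role. The roles, policy table, and dataset names here are invented; a real system would enforce this in the storage layer, not application code.

```python
# Invented role-based policy: which roles may read which datasets.
ACCESS_POLICY = {"analyst": {"sales"}, "admin": {"sales", "customer_pii"}}

def read_dataset(role, dataset, store):
    """Return the dataset only if the role is authorized to read it."""
    if dataset not in ACCESS_POLICY.get(role, set()):   # authorization check
        raise PermissionError(f"{role} may not read {dataset}")
    return store[dataset]

store = {"sales": [100, 200], "customer_pii": ["alice@example.com"]}
ok = read_dataset("analyst", "sales", store)            # allowed
try:
    read_dataset("analyst", "customer_pii", store)      # blocked
    denied = False
except PermissionError:
    denied = True
```

The point of the sketch is the trade-off named above: authorized users still get their data, while sensitive datasets stay behind an explicit policy.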
Objective:
This unit focuses on the importance of big data in industry, depicts
practical issues where it can be utilized, its need in the current
scenario, and how many applications use it on a daily basis.
Recap:
Revision of Big Data applications in Industry.
Big Data importance and applications
https://www.youtube.com/watch?v=rvJgArru8dI
https://www.youtube.com/watch?v=jmDV93UOngo
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=zez2Tv-bcXY
https://www.youtube.com/watch?v=iANBytZ26MI
This unit provides us with the fundamental domain of Big Data and its
latest trends in industry.
In this unit we also gain knowledge of the different types of data,
and, very importantly, the 5 V's of Big Data; we also go through
the concept of reporting vs analysis as used in industry.
This unit also imparts knowledge of analytics tools like Tableau,
SAS, R, etc.
Thank You