Big - Data Unit-1
Introduction
• Data exists everywhere.
• The amount of digital data is rising rapidly, doubling every few
years and changing our lives.
• The quantity of data generated every second is very large.
• Real-time analysis of data streams is needed to manage such huge
volumes. Proper analysis extracts the essential information, which
can be used to predict network traffic, intrusion-related activity,
weather, and much more.
• As data grows rapidly, it contains specific trends and patterns, but
it is difficult to know where to look or how to find them.
• Years ago, organizations used only structured data, which an RDBMS
handles easily: it provides the tools to store, manage, process, and
report on this data. Today the nature of data has changed: huge
amounts are generated in various formats and at a very fast rate.
• This data is not simple structured data, so it is almost impossible
to store, manage, process, and report on it with traditional
relational databases.
• Big data is the solution to such problems of data storage and
manipulation.
Concept of Big data
• Big data refers to the tools, processes, and procedures that allow an
organization to create, manipulate, and manage very large data sets
and storage facilities.
• It refers to a huge volume of data that cannot be processed
effectively with traditional applications and analysis techniques.
• It is not possible to store and aggregate the raw data in the memory
of a single computer for processing.
• So it requires efficient tools for data management and analysis.
• Analyzing big data can guide better decisions and strategic business
moves.
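The point that raw data may not fit in the memory of a single computer can be illustrated with chunked processing. The sketch below is a minimal illustration, not a big data framework: the function names `read_in_chunks` and `running_mean` and the chunk sizes are assumptions chosen for this example, and a generator stands in for a data source too large to load at once.

```python
from typing import Iterable, Iterator


def read_in_chunks(records: Iterable[int], chunk_size: int) -> Iterator[list]:
    """Yield successive lists of at most chunk_size records."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def running_mean(records: Iterable[int], chunk_size: int = 1000) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0
    count = 0
    for chunk in read_in_chunks(records, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0


# The generator produces values lazily, so the full data set is never stored.
mean = running_mean((n for n in range(1, 1_000_001)), chunk_size=10_000)
print(mean)
```

Only the running total and count survive between chunks, which is the same idea that distributed tools apply across many machines instead of many chunks.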
Definition of Big data
• Big data refers to extremely large and complex datasets that cannot
be easily managed, processed, or analyzed using traditional data
processing tools.
• Big data analytics involves using advanced tools and techniques to
uncover patterns, correlations, and insights (understanding) from
these large datasets to inform decision-making and strategic planning.
Characteristics of Big Data
• The characteristics of big data are described by the 5 V's:
• Volume
• Veracity
• Variety
• Value
• Velocity
Volume
• The name Big Data itself refers to enormous size. Big Data means
vast volumes of data generated daily from many sources, such as
business processes, machines, social media platforms, networks,
human interactions, and many more.
• Volume is the quantity of data generated, stored, and operated on
within the system.
• For example, Facebook can generate approximately a billion messages,
record about 4.5 billion clicks of the "Like" button, and receive more
than 350 million new posts each day. Big data technologies are built
to handle such large amounts of data.
Variety
• Big Data can be structured, semi-structured, or unstructured, and is
collected from many different sources. In the past, data was collected
only from databases and spreadsheets, but today it arrives in an array
of forms, such as PDFs, emails, audio, social media posts, photos,
and videos.
The data is categorized as below:
• Structured data: the data follows a defined schema with all the
required columns and is in tabular form. Structured data is stored in
a relational database management system.
• Semi-structured data: the schema is not strictly defined, e.g.,
JSON, XML, CSV, TSV, and email. Unlike the structured data that OLTP
(Online Transaction Processing) systems work with, it is not stored
in relations, i.e., tables.
• Unstructured data: all unstructured files, such as log files, audio
files, and image files, fall into this category. Some organizations
have a great deal of such data available but do not know how to
derive value from it because the data is raw.
Veracity
• Veracity means how reliable the data is. There are many ways to
filter or translate the data. Veracity also covers the ability to
handle and manage data efficiently. Big Data is also essential in
business development.
• Veracity is the assurance of the quality or trustworthiness of the
data. It refers to inconsistencies and uncertainty in data.
• For example, Facebook posts with hashtags.
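The inconsistency and uncertainty described above can be illustrated with a small quality check. This is only a sketch: the post records, the field names, and the rule that a like count must be non-negative are assumptions made up for illustration, not real Facebook data.

```python
# Hypothetical post records with typical quality problems.
posts = [
    {"id": 1, "hashtag": "#BigData", "likes": 120},
    {"id": 2, "hashtag": "#bigdata ", "likes": None},  # inconsistent spelling, missing value
    {"id": 3, "hashtag": "#Hadoop", "likes": -4},      # implausible value
    {"id": 1, "hashtag": "#BigData", "likes": 120},    # duplicate of id 1
]


def normalize_hashtag(tag):
    """Remove stray whitespace and case differences so equal tags compare equal."""
    return tag.strip().lower()


def quality_report(rows):
    """Count duplicate, missing, and implausible records."""
    seen_ids = set()
    report = {"duplicates": 0, "missing": 0, "implausible": 0}
    for row in rows:
        if row["id"] in seen_ids:
            report["duplicates"] += 1
            continue
        seen_ids.add(row["id"])
        if row["likes"] is None:
            report["missing"] += 1
        elif row["likes"] < 0:
            report["implausible"] += 1
    return report


report = quality_report(posts)
unique_tags = {normalize_hashtag(post["hashtag"]) for post in posts}
print(report)
print(sorted(unique_tags))
```

Without normalization the three spellings of the same hashtag would be counted as different topics, which is one way low veracity distorts an analysis.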
Velocity
• Velocity plays an important role compared to the other
characteristics. Velocity is the speed at which data is created in
real time. It covers the speed of incoming data sets, the rate of
change, and bursts of activity.
• A primary aim of Big Data is to provide data rapidly on demand.
• Big data velocity deals with the speed at which data flows from
sources such as application logs, business processes, networks,
social media sites, sensors, and mobile devices.
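Velocity can be made concrete by measuring how many events arrive within a recent time window. The sketch below assumes a hypothetical `RateMeter` class and hand-written timestamps in seconds; a real system would read timestamps from the sources listed above.

```python
from collections import deque


class RateMeter:
    """Count the events seen in the last `window` seconds."""

    def __init__(self, window=1.0):
        self.window = window
        self.timestamps = deque()  # event times, oldest first

    def record(self, timestamp):
        """Register one event and return how many fall inside the window."""
        self.timestamps.append(timestamp)
        # Drop events that are at least `window` seconds older than this one.
        while self.timestamps[0] <= timestamp - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)


# Simulated arrival times: a burst of activity followed by a quiet period.
timestamps = [0.0, 0.2, 0.4, 0.9, 1.1, 1.15, 2.5]
meter = RateMeter(window=1.0)
counts = [meter.record(t) for t in timestamps]
print(counts)
```

The count rises during the burst and falls back once the older events leave the window, which is how rate of change and activity bursts show up in a stream.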
Value
• Value refers to the ability to turn data into value. Big data must
have value.
• It is the potential insights and benefits that can be derived from
analyzing the data.
• Data with no value is of no use to an organization unless it is
turned into something useful.
Structured Data
• It is the type of data most familiar from our everyday lives, for example a birthday or an address.
• A certain schema (structure) binds it, so all the data has the same set of properties. Structured
data is also called relational data. It is split into multiple tables to enhance the integrity
(veracity) of the data by creating a single record to depict (represent) an entity. Relationships
are enforced by the application of table constraints.
• The business value of structured data lies in how well an organization can utilize its existing
systems and processes for analysis purposes.
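The idea of a schema split across tables, with relationships enforced by constraints, can be sketched with Python's built-in `sqlite3` module. The table names, columns, and sample rows below are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# One record per person: every row has the same set of properties.
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        birthday  TEXT NOT NULL
    )
""")
# Addresses live in their own table and must point at an existing person.
conn.execute("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL REFERENCES person(person_id),
        city       TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO person VALUES (1, 'Asha', '1998-04-12')")
conn.execute("INSERT INTO address VALUES (10, 1, 'Pune')")

# An address for a person who does not exist violates the constraint.
try:
    conn.execute("INSERT INTO address VALUES (11, 99, 'Delhi')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

address_count = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
print(rejected, address_count)
```

The database itself refuses the inconsistent row, so integrity does not depend on every application checking the relationship by hand.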
Sources of Structured data
Semi-Structured Data
• The data is not in relational format and is not neatly organized into rows and columns like a
spreadsheet. However, features such as key-value pairs help distinguish the different
entities from each other.
• Since semi-structured data doesn’t need a structured query language, it is commonly called NoSQL data.
• A data serialization language is used to exchange semi-structured data across systems that may even have
varied underlying (basic) infrastructure.
• Semi-structured content is often used to store metadata about a business process but it can also include files
containing machine instructions for computer programs.
• This type of information typically comes from external sources such as social media platforms or other web-
based data feeds.
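Key-value pairs and serialization can be illustrated with JSON, one common serialization format. The records below are made up for this sketch; note that the two records do not share the same set of keys, which a relational table would not allow without empty columns.

```python
import json

# Hypothetical social media records: each carries its own keys.
records = [
    {"user": "asha", "post": "Learning big data", "tags": ["bigdata", "hadoop"]},
    {"user": "ravi", "post": "Hello", "location": {"city": "Pune"}},
]

# Serialize to text so another system can receive it, then parse it back.
payload = json.dumps(records)
received = json.loads(payload)

# The keys describe each entity, so the structure travels with the data.
all_keys = sorted({key for record in received for key in record})
print(all_keys)
print(received == records)
```

Because the payload is plain text, the sending and receiving systems can run on different infrastructure as long as both can read the format.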
Sources of Semi-Structured data
Unstructured Data
• Unstructured data is the kind of data that doesn’t adhere to (follow) any definite
schema or set of rules. Its arrangement is unplanned and haphazard (disorganized).
• Photos, videos, text documents, and log files can be generally considered
unstructured data. Even though the metadata accompanying an image or a video
may be semi-structured, the actual data being dealt with is unstructured.
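Because unstructured data has no schema, any structure has to be extracted by the program that reads it. The sketch below assumes a hypothetical log format (date, time, level, message) and uses a regular expression to pull fields out of raw text; lines that do not fit the assumed pattern are kept aside rather than discarded.

```python
import re

# Made-up log text; the second line does not follow the assumed format.
raw_log = """\
2024-01-15 10:02:11 ERROR disk quota exceeded for user asha
service restarted without a timestamp
2024-01-15 10:05:42 INFO backup finished in 42 seconds
"""

# Assumed layout: date, time, level, then free-form message text.
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")

parsed = []
unparsed = []
for line in raw_log.splitlines():
    match = LINE_PATTERN.match(line)
    if match:
        date, time, level, message = match.groups()
        parsed.append({"date": date, "time": time, "level": level, "message": message})
    else:
        unparsed.append(line)

print(len(parsed), len(unparsed))
print(parsed[0]["level"])
```

The pattern is imposed by the reader, not guaranteed by the data, which is why a line can fail to match and why deriving value from raw unstructured data takes extra work.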