Quote: "Data Is Widely Available. What Is Scarce Is The Ability To Extract Wisdom From It."



“Data is widely available. What is scarce is the ability to extract wisdom from it.”
Hal Varian, Google’s Chief Economist, 2010
Learning Objectives and Learning Outcomes

Learning Objectives
Introduction to big data
1. Definition of big data.
2. Challenges of big data.
3. Why big data?
4. Traditional Business Intelligence versus big data.

Learning Outcomes
a) To understand the significance of big data.
b) To understand the other characteristics of data that are not
definitional characteristics of big data.
c) To understand the challenges of big data and how to deal with
the same.
d) To understand what is new today.

◻ Definition of Big Data
❖ Volume
❖ Velocity
❖ Variety
◻ Challenges of Big Data
◻ Other Characteristics of Data Which are Not Definitional Traits of Big Data
◻ Why Big Data?
◻ Traditional Business Intelligence (BI) versus Big Data
◻ Composition: The composition of data deals with the
structure of data, that is, the sources of data, the granularity,
the types, and the nature of data as to whether it is static or
real-time streaming.
◻ Condition: The condition of data deals with the state of data,
that is, “Can one use this data as is for analysis?” or “Does it
require cleansing for further enhancement and enrichment?”
◻ Context: The context of data deals with “Where has this data
been generated?” “Why was this data generated?” “How
sensitive is this data?” “What are the events associated with
this data?” and so on.
◻ The 1970s and before was the era of mainframes. The
data was essentially primitive and structured.
◻ Relational databases evolved in the 1980s and 1990s.
This was the era of data-intensive applications.
◻ From the 2000s onward, the World Wide Web (WWW) and
the Internet of Things (IoT) have led to an onslaught of
structured, unstructured, and multimedia data.
◻ 1970s and before (Data Generation and Storage): Primitive and
structured data. Mainframes: basic data storage.
◻ 1980s and 1990s (Data Utilization): Complex and relational
data. Relational databases: data-intensive applications.
◻ 2000s and beyond (Data Driven): Complex and unstructured
data. Structured data, unstructured data, multimedia data.

Why Big Data?
◻ The more data we have for analysis, the greater
will be the analytical accuracy and also the greater
would be the confidence in our decisions based on
these analytical findings.
More data → More accurate analysis → Greater
confidence in decision making → Greater
operational efficiencies, cost reduction, time
reduction, new product development, optimized
offerings, etc.
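
One way to see why more data tends to give more accurate analysis is the standard error of a sample mean, which shrinks with the square root of the sample size. The sketch below is only an illustration; the function name `standard_error`, the spread of 10.0, and the sample sizes are assumptions and do not come from the slides.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of a sample mean: std_dev / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# Hypothetical example: the same spread (std_dev = 10.0) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(10.0, n))
```

Each 100-fold increase in the amount of data cuts the standard error by a factor of 10. This only holds when the extra data is relevant and of good quality.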
Figure: More data → More accurate analysis → More confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Definition of Big Data
Big data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective, innovative forms
of information processing for enhanced insight and decision
making.
Source: Gartner IT Glossary
◻ High-volume, high-velocity, high-variety
◻ Cost-effective, innovative forms of information processing
◻ Enhanced insight and decision making
Other Definitions of Big Data
◻ “Big data is high-volume, high-velocity, and high-variety
information assets” talks about voluminous data
(humongous data) that may have great variety (a good
mix of structured, semi-structured, and unstructured data)
and will require a good speed/pace for storage,
preparation, processing, and analysis.
◻ “Cost effective, innovative forms of information
processing” talks about embracing new techniques and
technologies to capture (ingest), store, process, persist,
integrate, and visualize the high-volume, high-velocity,
and high-variety data.
◻ “Enhanced insight and decision making” talks about
deriving deeper, richer, and meaningful insights and
then using these insights to make faster and better
decisions to gain business value and thus a
competitive edge.

Data → Information → Actionable intelligence → Better decisions → Enhanced business value
Volume - A Mountain of Data

1 Kilobyte (KB) = 1,000 bytes

1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Data Sizes in form of
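
The unit table above can be turned into a small conversion helper. This is a minimal sketch using the decimal (powers of 1,000) units listed on the slide; the names `UNITS` and `human_readable` are assumptions made for illustration.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1,000,
        # or at the largest unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


print(human_readable(1_000_000))           # 1.00 MB
print(human_readable(2_500_000_000_000))   # 2.50 TB
```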
Where Does This Data get Generated?

◻ There are a multitude of sources for big data. An
XLS, a DOC, a PDF, etc. are unstructured data; a
video on YouTube, a chat conversation on Internet
Messenger, and a customer feedback form on an online
retail website are unstructured data; CCTV
coverage and a weather forecast report are unstructured
data too.
1. Typical Internal Data
◻ Data present within an organization’s firewall. It is
as follows:
Data storage: File systems, SQL (RDBMSs - Oracle,
MS SQL Server, DB2, MySQL, PostgreSQL, etc.),
NoSQL (MongoDB, Cassandra, etc.), and so on.
Archives: Archives of scanned documents, paper
archives, customer correspondence records, patients’
health records, students’ admission records,
students’ assessment records, and so on.
2. External Data
◻ Data residing outside an organization’s firewall. It
is as follows:
Public Web: Wikipedia, weather, regulatory,
compliance, census, etc.
3. Both (Internal+External) Data
◻ Sensor data: Car sensors, smart electric meters, office
buildings, air conditioning units, refrigerators, and so on.
◻ Machine log data: Event logs, application logs, business
process logs, audit logs, clickstream data, etc.
◻ Social media: Twitter, blogs, Facebook, LinkedIn,
YouTube, Instagram, etc.
◻ Business apps: ERP, CRM, HR, Google Docs, and so on.
◻ Media: Audio, Video, Image, Podcast, etc.
◻ Docs: Comma separated value (CSV), Word
Documents, PDF, XLS, PPT, and so on.
Where Is All This Data Stored?
Sources of Big Data
Velocity
◻ We have moved from the days of batch
processing (remember payroll applications) to
real-time processing (for example, when you buy a
product, the website shows related products).
Batch → Periodic → Near real time
→ Real-time processing
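
The difference between batch and real-time processing can be sketched in a few lines. This is an illustrative toy, not a production pattern; the function names `batch_total` and `running_totals` and the order amounts are assumptions.

```python
from typing import Iterable, Iterator


def batch_total(orders: list[float]) -> float:
    """Batch style: wait until all orders are collected, then process them once."""
    return sum(orders)


def running_totals(orders: Iterable[float]) -> Iterator[float]:
    """Real-time style: update the result as each order arrives."""
    total = 0.0
    for amount in orders:
        total += amount
        yield total


orders = [20.0, 5.0, 12.5]  # hypothetical order amounts
print(batch_total(orders))           # one answer, after the fact
print(list(running_totals(orders)))  # an updated answer after every event
```

The batch version gives one answer once all the data is in; the streaming version gives a usable answer after every event, which is what a "related products" suggestion needs.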
Variety
◻ Structured data: From traditional transaction
processing systems and RDBMS, etc.
◻ Semi-structured data: For example: Hyper
Text Markup Language (HTML), eXtensible
Markup Language (XML).
◻ Unstructured data: For example: unstructured
text documents, audio, video, email, photos,
PDFs, social media, etc.
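
The three kinds of data differ in how much a program can rely on their shape. The sketch below handles one small, made-up sample of each with the Python standard library; all sample values are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as exported from a relational table.
structured = "id,name,amount\n1,Asha,250\n"
row = next(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but no rigid schema.
semi_structured = "<order id='1'><customer>Asha</customer><amount>250</amount></order>"
order = ET.fromstring(semi_structured)

# Unstructured: free text; without further analysis we can only do
# something crude, such as counting words.
unstructured = "Asha called to say the delivery was late but the product is great."
word_count = len(unstructured.split())

print(row["amount"], order.findtext("amount"), word_count)
```

The structured and semi-structured samples let us ask for `amount` directly; the unstructured text needs extra processing before it answers the same question.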
Other Characteristics of Data

❑ Characteristics of data which are not definitional traits of big data
❑ Veracity and Validity
❑ Volatility
❑ Variability
Veracity and Validity
◻ Veracity refers to biases, noise, and abnormality in
data. The key question here is: “Is all the data that
is being stored, mined, and analyzed meaningful
and pertinent to the problem under consideration?”
◻ Validity refers to the accuracy and correctness of
the data. Any data that is picked up for analysis
needs to be accurate. This is true of all data, not
just big data.
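
A validity check can be as simple as rejecting records that break basic correctness rules before they reach analysis. The rules below (age between 0 and 120, an "@" in the email) and the name `is_valid` are hypothetical examples, not rules from the slides.

```python
def is_valid(record: dict) -> bool:
    """Return True if a record passes basic correctness checks (hypothetical rules)."""
    age = record.get("age")
    email = record.get("email") or ""
    return isinstance(age, int) and 0 <= age <= 120 and "@" in email


records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},  # impossible age
    {"age": 51, "email": "not-an-email"},   # malformed email
]
clean = [r for r in records if is_valid(r)]
print(len(clean), "of", len(records), "records are usable")
```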
Volatility
◻ Volatility of data deals with two questions: how long
is the data valid, and how long should it be stored? Some
data is required for long-term decisions
and remains valid for longer periods of time.
However, there are also pieces of data that
become obsolete minutes after their generation.
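
Volatility can be modeled as a retention period per kind of data. The sketch below is a minimal illustration; the kinds `stock_tick` and `sales_record`, their retention periods, and the name `is_still_valid` are all assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per kind of data.
RETENTION = {
    "stock_tick": timedelta(minutes=5),
    "sales_record": timedelta(days=365 * 7),
}


def is_still_valid(kind: str, created_at: datetime, now: datetime) -> bool:
    """True while the item is no older than the retention period for its kind."""
    return now - created_at <= RETENTION[kind]


now = datetime(2024, 1, 1, 12, 0)
print(is_still_valid("stock_tick", now - timedelta(minutes=30), now))  # False
print(is_still_valid("sales_record", now - timedelta(days=400), now))  # True
```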
Variability
◻ Data flows can be highly inconsistent, with periodic
peaks. For example: An online retailer announces
a "big sale day" for a particular week. The
retailer is likely to experience an upsurge in
customer traffic to the website during this week. In
the same way, the retailer might experience a slump in
business immediately after the festival
season. This reemphasizes the point that one might
witness spikes in data at some points in time, while at
other times the data flow can go flat.
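
Spikes of this kind can be spotted by comparing each reading with a baseline. The sketch below flags hours whose traffic exceeds twice the overall mean; the threshold, the name `find_spikes`, and the visit counts are assumptions chosen for illustration.

```python
from statistics import mean


def find_spikes(hourly_visits: list[int], factor: float = 2.0) -> list[int]:
    """Return the indexes whose value exceeds `factor` times the overall mean."""
    baseline = mean(hourly_visits)
    return [i for i, v in enumerate(hourly_visits) if v > factor * baseline]


visits = [100, 120, 110, 900, 950, 105, 90]  # hypothetical traffic with a sale-day peak
print(find_spikes(visits))  # [3, 4]
```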
Challenges with Big Data
Following are a few challenges with big data:
1. Usefulness: Data today is growing at an
exponential rate. Most of the data that we have
today has been generated in the last 2 to 3 years.
This high tide of data will continue to rise
incessantly (persistently). The key questions here
are: “Will all this data be useful for analysis?”,
“Do we work with all this data or a subset of it?”,
“How will we separate the knowledge from the
noise?”, etc.
2. Cloud computing and virtualization: Cloud
computing is the answer to managing infrastructure
for big data as far as cost-efficiency, elasticity, and
easy upgrading/downgrading are concerned. At the
same time, it further complicates the decision of
whether to host big data solutions outside the enterprise.
3. Retention: The other challenge is to decide on the
period of retention of big data. Just how long
should one retain this data? A tricky question
indeed, as some data is useful for making long-term
decisions, whereas in a few cases the data may
become irrelevant and obsolete just a few
hours after having been generated.
4. Scarcity of data scientists: There is a dearth
(shortage) of skilled professionals who possess the
high level of proficiency in data science that is
vital in implementing big data solutions.
5. Data visualization: Then, of course, there are
other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and
visualization of big data. Data visualization is
becoming popular as a separate discipline.
6. Storage capacity: Big data refers to datasets
whose size is typically beyond the storage capacity
of traditional database software tools. There is no
explicit definition of how big a dataset should be
for it to be considered “big data.” Here we have to
deal with data that is just too big, moves way too
fast, and does not fit the structures of typical
database systems.
Figure: Challenges with big data, including storage, curation, and privacy violations.
