BDA - Unit 1
Topics Covered:
Big data is not a single technology but a combination of old and new technologies that helps
companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of
disparate data, at the right speed, and within the right time frame to allow real-time analysis and
reaction. As we note earlier in this chapter, big data is typically broken down by three characteristics:
Although it’s convenient to simplify big data into the three Vs, it can be misleading and overly
simplistic. For example, you may be managing a relatively small amount of very disparate, complex
data or you may be processing a huge volume of very simple data. That simple data may be all
structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data
in predicting business value? Do the results of a big data analysis actually make sense?
It is critical that you don’t underestimate the task at hand. Data must be able to be verified based on
both accuracy and context. An innovative business may want to be able to analyze massive amounts of
data in real time to quickly assess the value of that customer and the potential to provide additional
offers to that customer. It is necessary to identify the right amount and types of data that can be
analyzed to impact business outcomes. Big data incorporates all data, including structured data and
unstructured data from e-mail, social media, text streams, and more. This kind of data management
requires that companies leverage both their structured and unstructured data.
• 1. Volume: giga (10^9) > tera (10^12) > peta (10^15) > exa (10^18) > zetta (10^21) > yotta (10^24) bytes (a small sketch after these bullets illustrates this scale)
• 2. Velocity: batch processing > periodic > real-time processing (Mbps)
• 3. Variety: structured + semi-structured + unstructured data
• 4. Veracity: all the data may not be accurate or trustworthy
• 5. Validity: all the data may not be relevant or valid for the problem
• 6. Volatility: the data may not remain valid for long periods
• 7. Variability: the rate of data flow may not be constant
• Velocity means that data is generated extremely fast and often continuously processed,
like live streaming social media data.
• Volume simply means large amounts that cannot be processed fast enough by one’s existing
computing system, like gigabytes and terabytes of data.
• Variety means different types of data, such as a large dataset in an Excel sheet, text, videos from
CCTV cameras, energy data, internet data, email, Facebook, etc.
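To make the byte scales listed above concrete, here is a minimal Python sketch; the helper name is invented for illustration and it uses decimal powers of 1000, matching the giga/tera/peta progression in the bullets.

```python
# Illustrative sketch: express a raw byte count in the SI units listed above.
UNITS = ["bytes", "kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def human_scale(num_bytes: float) -> str:
    """Return the byte count expressed in the largest suitable SI unit (powers of 1000)."""
    value, unit = float(num_bytes), UNITS[0]
    for name in UNITS[1:]:
        if value < 1000:
            break
        value /= 1000.0
        unit = name
    return f"{value:.2f} {unit}bytes" if unit != "bytes" else f"{value:.0f} bytes"

print(human_scale(3_500_000_000_000))   # ~3.50 terabytes
print(human_scale(1.2e21))              # ~1.20 zettabytes
```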
In the 2000s and beyond, the explosion of the Internet demanded faster processing speeds, and
"unstructured" data (art, photographs, music, etc.) became much more commonplace.
Unstructured data is both non-relational and schema-less, and Relational Database Management
Systems simply were not designed to handle this kind of data.
• NoSQL databases are primarily known as non-relational or distributed databases.
• SQL databases are table-based databases which represent data (schema) in the form of rows
and columns, whereas NoSQL databases are collections of
documents
key-value pairs
graph databases or
wide-column stores
which do not have such standard schema definitions
to adhere to but have a dynamic schema for
unstructured data (a short sketch of this dynamic schema follows).
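As a rough illustration of that dynamic schema, the following Python sketch treats a "collection" as a list of dictionaries; the collection and field names are invented for the example.

```python
# Illustrative only: a "collection" is just a list of dicts (documents).
# Unlike a relational table, each document may carry its own set of fields.
customers = []

customers.append({"_id": 1, "name": "Asha", "email": "asha@example.com"})
customers.append({"_id": 2, "name": "Ravi", "phone": "98xxxxxx", "tags": ["premium", "mobile"]})

# No ALTER TABLE needed: a new field can appear on one document only.
customers[0]["last_login"] = "2024-01-15"

for doc in customers:
    print(sorted(doc.keys()))   # each document exposes a different schema
```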
NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and
the need for faster speed and the processing of unstructured data.
NoSQL databases are preferable in certain use cases to relational databases because of their
speed and flexibility.
The NoSQL model is non-relational and uses a “distributed” database system.
This non-relational system is fast, uses an ad-hoc method of organizing data, and processes
high volumes of different kinds of data.
“Not only” does it handle structured and unstructured data, it can also process unstructured Big
Data, very quickly.
NoSQL is not inherently faster than SQL, nor is SQL faster than NoSQL; they are simply different
technologies suited to different workloads. No database system (whether SQL/relational or
distributed/NoSQL) is "magic"; in effect, all of them work with files.
The widespread use of NoSQL can be connected to the services offered by Twitter, LinkedIn,
Facebook, and Google.
Solution
NoSQL databases are designed with a distribution architecture that includes
redundant backup storage of both data and functions.
It does this by using multiple nodes (database servers).
If one, or more, of the nodes goes down, the other nodes can continue with
normal operations and suffer no data loss.
When used correctly, NoSQL databases can provide high performance at an
extremely large scale, and never shut down.
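The redundancy described above can be sketched as a toy Python model; the node names, replication factor and API are invented and stand in for a real NoSQL engine.

```python
# Toy model of redundant storage: every write is copied to several nodes,
# so a read still succeeds if some nodes go down.
class TinyCluster:
    def __init__(self, node_names, replication=2):
        self.nodes = {name: {} for name in node_names}   # node -> key/value store
        self.replication = replication

    def put(self, key, value):
        # write the value to the first `replication` healthy nodes
        for name in list(self.nodes)[: self.replication]:
            self.nodes[name][key] = value

    def fail(self, name):
        del self.nodes[name]          # simulate a node crash

    def get(self, key):
        for store in self.nodes.values():
            if key in store:
                return store[key]     # any surviving replica can answer
        raise KeyError(key)

cluster = TinyCluster(["node-a", "node-b", "node-c"], replication=2)
cluster.put("user:42", {"name": "Asha"})
cluster.fail("node-a")                # one node goes down
print(cluster.get("user:42"))         # data still served from node-b
```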
Types of NoSQL databases-
There are 4 basic types of NoSQL databases:
Key-Value Store – It has a Big Hash Table of keys & values {Example- Riak, Amazon S3
(Dynamo)}
Document-based Store- It stores documents made up of tagged elements. {Example- CouchDB}
Column-based Store- Each storage block contains data from only one column. {Example-
HBase, Cassandra}
Graph-based Store- A network database that uses edges and nodes to represent and store data.
{Example- Neo4J}
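To make these data models concrete, here is a hedged Python sketch of how the same fact ("Asha follows Ravi") might be laid out in key-value, column-family and graph form (the document form was sketched earlier); all keys and names are illustrative.

```python
# Key-value store: one opaque value per key (Riak / Dynamo style).
kv_store = {"follows:asha": ["ravi"]}

# Column-family store: row key -> column family -> columns (HBase / Cassandra style).
column_store = {
    "asha": {"profile": {"name": "Asha"}, "edges": {"follows": "ravi"}},
}

# Graph store: explicit nodes and edges (Neo4j style).
graph = {
    "nodes": {"asha": {"name": "Asha"}, "ravi": {"name": "Ravi"}},
    "edges": [("asha", "follows", "ravi")],
}

# The graph layout makes relationship queries trivial:
who_asha_follows = [dst for src, rel, dst in graph["edges"] if src == "asha" and rel == "follows"]
print(who_asha_follows)   # ['ravi']
```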
Advantages of NoSQL databases
Higher scalability
A distributed computing system
Lower costs
A flexible schema
Can process unstructured and semi-structured data
Has no complex relationship
Disadvantages of NoSQL databases
It is resource intensive, demanding high RAM and CPU allocations.
It can also be difficult to find tech support if your open source NoSQL system
goes down
Not all problems require distributed computing. If a big time constraint doesn’t exist, complex
processing can be done remotely via a specialized service. When companies needed to do complex data
analysis, IT would move data to an external service or entity where lots of spare resources were
available for processing.
It wasn’t that companies wanted to wait to get the results they needed; it just wasn’t economically
feasible to buy enough computing resources to handle these emerging requirements. In many
situations, organizations would capture only selections of data rather than try to capture all the data
because of costs. Analysts wanted all the data but had to settle for snapshots, hoping to capture the
right data at the right time.
If you have ever used a wireless phone, you have experienced latency firsthand. It is the delay in the
transmissions between you and your caller. At times, latency has little impact on the business; in other situations, such as real-time analysis, low latency is essential.
Comparison of structured, semi-structured and unstructured data:
Structured data: can be easily used by a computer. Ex: data stored in databases.
Semi-structured data: cannot be easily used by a computer directly. Ex: emails, XML, HTML.
Unstructured data: cannot be easily used by a computer directly. Ex: memos, chat, PowerPoint files, images, videos, letters, research papers, body of an email.
Data Structures are the programmatic way of storing data so that data can be used efficiently.
Almost every enterprise application uses various types of data structures in one way or another.
Data Structure is a systematic way to organize data in order to use it efficiently.
The following terms are the foundation terms of a data structure.
Interface (function)
Each data structure has an interface.
Interface represents the set of operations that a data structure supports.
An interface only provides
o the list of supported operations,
o the type of parameters they can accept, and
o the return type of these operations.
Implementation
Implementation provides the internal representation of a data structure.
Implementation also provides the definition of the algorithms used in the
operations of the data structure.
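A small Python sketch of the interface/implementation split described above, using a stack purely as an example (not taken from the original notes): the abstract class lists the operations and their parameters/return types, and the concrete class supplies the internal representation and algorithms.

```python
from abc import ABC, abstractmethod

class StackInterface(ABC):
    """Interface: lists the operations, their parameters and return types only."""

    @abstractmethod
    def push(self, item) -> None: ...

    @abstractmethod
    def pop(self):
        """Remove and return the most recently pushed item."""

    @abstractmethod
    def is_empty(self) -> bool: ...

class ListStack(StackInterface):
    """Implementation: internal representation (a Python list) plus the algorithms."""

    def __init__(self):
        self._items = []

    def push(self, item) -> None:
        self._items.append(item)      # O(1) append at the end of the list

    def pop(self):
        return self._items.pop()      # O(1) removal from the end

    def is_empty(self) -> bool:
        return not self._items

s = ListStack()
s.push(10); s.push(20)
print(s.pop(), s.is_empty())          # 20 False
```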
Data search
If the application has to search an item, it has to look through 1 million (10^6)
items every time, slowing down the search.
As data grows, the search will become slower.
Processor speed
Processor speed, although very high, is limited if the data grows to
billions of records.
Multiple requests
As thousands of users can search data simultaneously on a web server, even a
fast server can fail while searching the data.
To solve the above-mentioned problems, data structures come to the rescue.
Data can be organized in a data structure in such a way that not all items need to be
searched, and the required data can be found almost instantly (see the sketch below).
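As a rough illustration, the sketch below stores the same one million items first as a plain list (item-by-item scan) and then as a dictionary keyed by id (hash lookup); the item layout is invented.

```python
import time

# One million items stored as a plain list: a search must scan item by item.
items = [{"id": i, "name": f"item-{i}"} for i in range(1_000_000)]

def linear_search(target_id):
    for item in items:                 # worst case: examines every item
        if item["id"] == target_id:
            return item

# The same data organized as a dictionary keyed by id: one hash lookup.
index = {item["id"]: item for item in items}

t0 = time.perf_counter(); linear_search(999_999); t1 = time.perf_counter()
t2 = time.perf_counter(); index[999_999];        t3 = time.perf_counter()
print(f"scan: {t1 - t0:.4f}s  indexed: {t3 - t2:.6f}s")   # indexed lookup is near-instant
```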
Basic Terminology
Data
Data are values or sets of values.
Data Item
Data item refers to a single unit of values.
Group Items
Data items that are divided into sub-items are called Group Items.
Elementary Items
Data items that cannot be divided are called Elementary Items.
Attribute and Entity
An entity is that which contains certain attributes or properties, which may be
assigned values.
Entity Set
Entities of similar attributes form an entity set.
Field
Field is a single elementary unit of information representing an attribute of an
entity.
Record
Record is a collection of field values of a given entity.
File
File is a collection of records of the entities in a given entity set.
Update
Algorithm to update an existing item in a data structure.
Delete
Algorithm to delete an existing item from a data structure.
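A minimal Python sketch mapping these terms onto simple structures; the entity (a student) and its attributes are invented examples.

```python
# Entity: a student, described by attributes (fields) that carry values.
# Record: the collection of field values for one entity.
record = {"roll_no": 101, "name": "Asha", "branch": "CSE"}   # fields -> values

# Entity set / file: a collection of records of similar entities.
student_file = [
    record,
    {"roll_no": 102, "name": "Ravi", "branch": "ECE"},
]

# Update: change an existing item; Delete: remove one.
student_file[0]["branch"] = "IT"                                   # update a field value
student_file = [r for r in student_file if r["roll_no"] != 102]    # delete a record
print(student_file)
```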
• The idea, therefore, is to take unstructured information, process it according to requirements and
then store it into a data structure as structured data. This is where the necessity of developing a
new framework for structuring Big Data comes in. Hadoop is one such platform that facilitates the
distribution and storage of unstructured data.
These 4 elements of big data reflect the tasks involved in using Big Data for business intelligence.
1. Data collection: deals with how to collect such big data (with its characteristic Vs) from multiple,
geographically separated sources.
2. Data storage: where and how to store and retrieve such data, which cannot be accommodated on one
server/memory.
3. Data analysis: how to process such data when it is not stored in one place (BDA).
4. Data visualization/output.
VARIETY
Data can be sourced from emails, audio players, video recorders, watches, personal
devices, computers, health monitoring systems, satellites, etc.
Each device that is recording data is recording and encoding it in a different format
and pattern.
Additionally, the data generated from these devices also can vary by granularity,
timing, pattern and schema.
Much of the data generated is based on object structures that vary depending on an
event, individual, transaction or location.
Data collection from varied sources and forms means that traditional relational databases
and structures cannot be used to interpret and store this information.
NoSQL technologies are the solution to move us forward because of the flexible
approach they bring to storing and reading data without imposing strict relational
bindings.
NoSQL systems such as Document Stores and Column Stores already provide a good
replacement for OLTP/relational database technologies, as well as read/write speeds that
are much faster.
Velocity
The velocity of data streaming is extremely fast paced.
Every millisecond, systems all around the world are generating data based on events and
interactions.
Devices like heart monitors, televisions, RFID scanners and traffic monitors generate data at
the millisecond. Servers, weather devices, and social networks generate data at the second.
As technology advances, it would not be surprising to see devices that generate data even at
the nanosecond.
The reward that this data velocity provides is information in real time that can be harnessed to
make near real time decisions or actions.
Most of the traditional insights we have are based on aggregations of actuals over days and
months.
Having data at the grain of seconds or milliseconds will provide more detailed and vivid
information.
With the speed at which data is generated, it demands equally fast, if not quicker, tools and
technology to be able to extract, process and analyze the data.
This limitation has led to the emergence of Big Data architectures and technologies: NoSQL,
distributed and service-oriented systems.
NoSQL systems replace traditional OLTP/relational database technologies because they place
less importance on ACID (Atomicity, Consistency, Isolation, Durability) principles and are
able to read/write records at much faster speeds.
Distributed and Load Balancing systems have now become a standard in all organizations to
split and distribute the load of extracting, processing and analyzing data across a series of
servers.
This allows for large amounts of data to be processed at high speed, which eliminates
bottlenecks (a minimal load-splitting sketch appears at the end of this Velocity discussion).
Enterprise Service Bus (ESB) systems replace traditional integration frameworks written in
custom code.
These distributed and easily scalable systems allow for serialization across large workloads
and applications to process large amounts of data to a variety of different applications and
systems.
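A minimal sketch of the load-splitting idea using only the Python standard library; the event stream, chunking scheme and worker count are invented and stand in for a real distributed system.

```python
from multiprocessing import Pool
from collections import Counter

def count_partition(events):
    """Each worker processes only its own slice of the event stream."""
    return Counter(e["type"] for e in events)

if __name__ == "__main__":
    # A stand-in for a high-velocity event stream.
    events = [{"type": "click" if i % 3 else "purchase"} for i in range(1_000_000)]

    # Split the load into chunks and fan them out to worker processes.
    chunks = [events[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_partition, chunks)

    # Merge the partial counts into one result.
    total = sum(partials, Counter())
    print(total.most_common(2))
```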
Volume
If we take all the data generated in the world between the beginning of time and 2008,
the same amount of data will soon be generated every minute.
Billions of touch points generate petabytes and zettabytes of data.
On social media and telecommunication sites alone, billions of messages, clicks and uploads
take place every day.
We now have information for every interaction, perspective and alternative. Having this diverse
data allows us to more effectively analyze, predict, test and ultimately prescribe to our
customers.
Large collections of data coupled with the challenges of Variety (different formats)
and Velocity (near real time generation) pose significant managing costs to
organizations.
Despite the pace of Moore's Law, the challenge to store large data sets can no longer be met
with traditional databases or data stores.
This is where the strengths of distributed storage systems like SAN (Storage Area Network), as
well as NoSQL data stores, come in: they are able to effectively divide, compress and store large
amounts of data with improved read/write performance.
Veracity
In this context, a fourth V, Veracity, is often referenced.
Veracity concerns the data quality risks and accuracy as data is generated at such a high
and distributed frequency.
In solving the challenge of the 3 Vs, organizations put little emphasis or work into cleaning up
the data and filtering it down to what is necessary, and as a result the credibility and reliability of
the data have suffered.
Differences between traditional and big data handling for business intelligence
• Data collection: in traditional practice the data is collected from one enterprise, whereas big
data is collected from different sources across the internet.
• Data storage: traditional data can be accommodated in one server's storage, whereas big data
cannot be and has to be distributed across different storages.
Also, big data is required to be scaled up horizontally by adding more servers and storage space,
not on the same server, whereas traditional data is scaled up vertically.
• Data analysis: since big data is distributed, it also has to be processed in parallel, both offline
and in real time, while traditional data could be analyzed offline.
• Also, in the traditional approach the data is structured and the data is moved to the processing
functions, whereas with big data it is difficult to move such large volumes of data, so the processing
functions must be moved to the data instead (a small sketch of this idea follows this list).
• data visualization/output: to steer the business to excellence by understanding customers,
vendors and suppliers’ requirements and preferences
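A small Python sketch of "move the processing functions to the data": each simulated node computes a local aggregate over the partition it already holds, and only the small summaries travel to be merged; the node contents are invented.

```python
# Simulated data nodes: each holds its own partition of sales records.
node_partitions = {
    "node-1": [{"region": "south", "amount": 120}, {"region": "north", "amount": 80}],
    "node-2": [{"region": "south", "amount": 200}, {"region": "east", "amount": 50}],
}

def local_aggregate(records):
    """Runs where the data lives; ships back only a small per-region summary."""
    summary = {}
    for rec in records:
        summary[rec["region"]] = summary.get(rec["region"], 0) + rec["amount"]
    return summary

# Only the summaries (not the raw records) are moved and merged centrally.
merged = {}
for node, records in node_partitions.items():
    for region, amount in local_aggregate(records).items():
        merged[region] = merged.get(region, 0) + amount

print(merged)   # {'south': 320, 'north': 80, 'east': 50}
```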
Key hardware and software breakthroughs revolutionized the data management industry. First,
innovation and demand increased the power and decreased the price of hardware. New software
emerged that understood how to take advantage of this hardware by automating processes like load
balancing and optimization across a huge cluster of nodes.
THE CHANGING ECONOMICS OF COMPUTING AND BIG DATA
Fast-forward and a lot has changed. Over the last several years, the cost to purchase computing and
storage resources has decreased dramatically. Aided by virtualization, commodity servers that could
be clustered and blades that could be networked in a rack changed the economics of computing. This
change coincided with innovation in software automation solutions that dramatically improved the
manageability of these systems.
The capability to leverage distributed computing and parallel processing techniques dramatically
transformed the landscape and dramatically reduced latency. There are special cases, such as High
Frequency Trading (HFT), in which low latency can only be achieved by physically locating servers in
a single location.
Existing analytics tools and techniques will be very helpful in making sense of big data. However,
there is a catch. The algorithms that are part of these
tools have to be able to work with large amounts of potentially real-time and disparate data. The
infrastructure that we cover earlier in the chapter will
need to be in place to support this. And, vendors providing analytics tools will also need to ensure that
their algorithms work across distributed implementations. Because of these complexities, we also
expect a new class of tools to help make sense of big data. We list three classes of tools in this layer of
our reference architecture. They can be used independently or collectively by decision makers to help
steer the business. The three classes of tools are as follows:
✓ Reporting and dashboards: These tools provide a “user-friendly” representation of the information
from various sources. Although a mainstay in the traditional data world, this area is still evolving for
big data. Some of the tools that are being used are traditional ones that can now access the new kinds
of databases collectively called NoSQL (Not Only SQL). We explore NoSQL databases in Chapter 7.
✓ Visualization: These tools are the next step in the evolution of reporting. The output tends to be
highly interactive and dynamic in nature.
Another important distinction between reports and visualized output is animation. Business users can
watch the changes in the data utilizing a variety of different visualization techniques, including mind
maps, heat maps, infographics, and connection diagrams. Often, reporting and visualization occur at
the end of the business activity. Although the data may be imported into another tool for further
computation or examination, this is the final step.
✓ Analytics and advanced analytics: These tools reach into the data warehouse and process the data
for human consumption. Advanced analytics should explicate trends or events that are transformative,
unique, or revolutionary to existing business practice. Predictive analytics and sentiment analytics are
good examples of this science.
What is BDA?
1. working with data sets whose volume, variety and velocity exceed the present storage and
computing capabilities.
2. to steer the business to excellence by understanding customers, vendors and suppliers’ requirements
and preferences
3. for quicker and better decision making
4. better collaboration between IT, Business users and data scientists
5. writing the code for distributed processing for achieving the above tasks
What isn’t BDA?
Data Analytics
• Data Analytics (DA) is the science of examining raw data with the purpose of drawing
conclusions about that information.
• The data that is captured by any data collection agent, tool or software is in its raw form, i.e.,
unformatted, unstructured or unclean, with noise/errors, redundancy or inconsistency.
• Hence, analytics covers a spectrum of activities from data collection to visualization.
• data analytics is generally divided into three broad categories:
• (i) Exploratory Data Analysis (EDA)
• (ii) Confirmatory Data Analysis (CDA)
• (iii) Qualitative Data Analysis (QDA)
Traditional Analytics
Classification of analytics
• Classification I
• 1.basic analytics
• 2.operationalized analytics
• 3.advanced analytics
• 4.monetized analytics
• Classification 2
• 1.analytics 1.0
• 2. analytics 2.0
• 3. analytics 3.0
• Classification 3
• (i)Exploratory Data Analysis (EDA)
• (ii)Confirmatory Data Analysis (CDA)
• (iii)Qualitative Data Analysis (QDA)
1. Basic analytics: slicing and dicing of historical data to generate reporting and basic visualization, etc.
2. Operationalized analytics: where the analysis is woven into the business processes of an enterprise.
3. Advanced analytics: using predictive and prescriptive modeling to forecast the future.
4. Monetized analytics: used to derive direct revenue.
CAP Theorem
• Only 2 of the 3 guarantees, Consistency (C), Availability (A), or Partition tolerance (P), can hold at once.
• CA: traditional RDBMS, MySQL, etc.
• CP: HBase, MongoDB ..
• AP: Riak, Cassandra ..
1. Reactive - business intelligence: this approach analyzes historical, static data sets and generates
reports. It enables a business to take better decisions by providing the right information to the right
person at the right time in the right format.
2. Reactive - BDA: this approach also analyzes only static data, but here the data is huge.
3. Proactive - analytics: this approach applies traditional data mining, predictive modeling, text mining
and statistical analysis to big data; therefore it has limitations on storage and processing capacity.
4. Proactive - BDA: this approach filters relevant data from big data and analyzes it using high-performance
analytics to solve complex problems using more data.
Terminologies
• In-memory analytics: technology to query data held in RAM rather than stored on disk
• In-database processing
• Symmetric multiprocessor system (SMP)
• Massively parallel processing (MPP): coordinated processing of a program by multiple processors,
each working on a different part of the program and using its own OS and memory
• Distributed and parallel computing
Big data use cases
BASE
• It is used in distributed computing
• Why? To achieve high availability
• How achieved?
• BASE is a data system design philosophy that prefers availability over consistency of operations.
• BASE was developed as an alternative for
- producing more scalable and affordable data architectures,
- providing more options to expanding enterprises/ IT clients
- and simply acquiring more hardware to expand data operations
• BASE is an acronym for Basically Available, Soft state, Eventual consistency
• Basically Available: The system is guaranteed to be available for querying by all users.
• Soft State: The values stored in the system may change because of the eventual consistency
model, as described in the next bullet.
• Eventually Consistent: As data is added to the system, the system’s state is gradually replicated
across all nodes. For the short period before the blocks are replicated, the state of the file
system isn’t consistent.
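A toy Python sketch of the BASE behaviour described above: a write is accepted by one replica immediately (basically available), other replicas lag behind (soft state), and after replication catches up every node agrees (eventually consistent). Node names and the API are invented.

```python
# Toy eventual-consistency model: writes go to one replica and
# propagate to the others later, so reads can briefly be stale.
class EventuallyConsistentStore:
    def __init__(self, replicas):
        self.replicas = {name: {} for name in replicas}
        self.pending = []                       # replication backlog

    def write(self, key, value, via):
        self.replicas[via][key] = value         # basically available: accept the write
        self.pending.append((key, value))       # soft state: other replicas lag behind

    def read(self, key, via):
        return self.replicas[via].get(key)      # may return a stale (or missing) value

    def replicate(self):
        for key, value in self.pending:         # eventual consistency: sync all replicas
            for store in self.replicas.values():
                store[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(["node-a", "node-b"])
store.write("cart:7", ["book"], via="node-a")
print(store.read("cart:7", via="node-b"))   # None -> stale read before replication
store.replicate()
print(store.read("cart:7", via="node-b"))   # ['book'] -> all replicas now agree
```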
Analytics tools
• MS EXCEL
• SAS
• IBM SPSS Modeler
• Statistica
• Salford systems
• WPS
Main open source analytics tools
• R analytics
• Weka
• Apache Hadoop
• Apache Spark
• Apache Storm
• Apache Cassandra
• MongoDB
• R Programming Environment
• Neo4j
• Apache SAMOA
Extra tools:
• 1. R tool
• 2. Weka
• 3. Pandas
• 4. Tanagra
• 5. Gephi
• 6. MOA (Massive Online Analysis)
• 7. Orange
• 8. RapidMiner
• 9. Root packages
• 10. Encog
• 11. NodeXL
• 12. Waffles