
BDA - UNIT-I

Topics Covered:

1.1 What is Big Data?
1.2 History of data management
1.3 Structuring big data
1.4 Elements of Big Data
1.5 Capitalizing big data; distributed and parallel computing as applicable to Big Data
1.6 What is Big Data Analytics?
1.7 Classifications of analytics
1.8 Challenges facing Big Data
1.9 Data Science and Data Scientists
1.10 HBASE
1.11 Open source tools used in DA

What is Big Data?

Big data is not a single technology but a combination of old and new technologies that helps
companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of
disparate data, at the right speed, and within the right time frame to allow real-time analysis and
reaction. Big data is typically broken down by three characteristics:

✓ Volume: How much data


✓ Velocity: How fast that data is processed
✓ Variety: The various types of data

Although it’s convenient to simplify big data into the three Vs, it can be misleading and overly
simplistic. For example, you may be managing a relatively small amount of very disparate, complex
data or you may be processing a huge volume of very simple data. That simple data may be all
structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data
in predicting business value? Do the results of a big data analysis actually make sense?
It is critical that you don’t underestimate the task at hand. Data must be able to be verified based on
both accuracy and context. An innovative business may want to be able to analyze massive amounts of
data in real time to quickly assess the value of a customer and the potential to provide additional
offers to that customer. It is necessary to identify the right amount and types of data that can be
analyzed to impact business outcomes. Big data incorporates all data, including structured data and
unstructured data from e-mail, social media, text streams, and more. This kind of data management
requires that companies leverage both their structured and unstructured data.

• 1. Volume: giga (10^9) > tera (10^12) > peta (10^15) > exa (10^18) > zetta (10^21) > yotta (10^24) bytes
• 2. Velocity: batch processing > periodic > real-time processing (Mbps)
• 3. Variety: structured + semi-structured + unstructured data
• 4. Veracity: all the data may not be accurate
• 5. Validity: all the data may not be relevant/valid for the problem
• 6. Volatility: the data may not remain valid for long periods
• 7. Variability: the rate of data flow may not be constant

• Velocity means that data is generated extremely fast and often continuously processed,
like live-streaming social media data.
• Volume simply means amounts so large that they cannot be processed fast enough by one’s existing
computing system, like gigabytes and terabytes of data.
• Variety means different types of data, like a large dataset in an Excel sheet, text, videos from
CCTV cameras, energy data, internet traffic, email, Facebook posts, etc.

History of data management


• Before the 1970s: only primitive, structured data was stored; management was storage-intensive and
mainframes were used.
• 1980s and 1990s: structured relational databases evolved; management of storage- and data-intensive
applications was required.
• 2000s and beyond: the web and IoT caused an evolution of unstructured, multimedia data.
 A database is a collection of information.
 A Database Management System (DBMS) can access the data and pull out specific information.
 In 1890: Herman Hollerith is given credit for adapting punch cards to act as memory.
 In 1960: Charles W. Bachman designed the Integrated Data Store (IDS), the “first” DBMS.
IBM created a database system of their own, known as IMS.
 In 1971: CODASYL, the committee behind the Common Business Oriented Language (COBOL),
published its Database Task Group standard for database management.
 In 1974: IBM began to develop SQL, which was more advanced.
 In the 1980s-90s: RDBMS products like Oracle, MS SQL Server, DB2, MySQL and Teradata became very
popular, leading to the development of enterprise resource planning (ERP) and CRM systems.
 RDBMSs were efficient at storing and processing structured data.

 In the 2000s and beyond: due to the explosion of the internet, processing speeds were required to be faster,
and “unstructured” data (art, photographs, music, etc.) became much more commonplace.
Unstructured data is both non-relational and schema-less, and Relational Database Management
Systems simply were not designed to handle this kind of data.
• NoSQL databases are primarily known as non-relational or distributed databases.
• SQL databases are table-based databases which represent data (schema) in the form of rows
and columns, whereas NoSQL databases are collections of
 documents
 key-value pairs
 graph databases or
 wide-column stores,
which do not have such standard schema definitions
to adhere to but instead have a dynamic schema for
unstructured data.
 NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and
the need for faster speed and the processing of unstructured data.
 NoSQL databases are preferable in certain use cases to relational databases because of their
speed and flexibility.
 The NoSQL model is non-relational and uses a “distributed” database system.
 This non-relational system is fast, uses an ad-hoc method of organizing data, and processes
high- volumes of different kinds of data.
 “Not only” does it handle structured and unstructured data, it can also process unstructured Big
Data, very quickly.
 NoSQL is not faster than SQL, nor is SQL faster than NoSQL. They are different
technologies suited to different work. No database system (whether we are discussing SQL /
relational or distributed / NoSQL) is "magic". In effect, all of them work with files.
 The widespread use of NoSQL can be connected to the services offered by Twitter, LinkedIn,
Facebook, and Google.
 Solution
 NoSQL databases are designed with a distribution architecture that includes
redundant backup storage of both data and functions.
 It does this by using multiple nodes (database servers).
 If one, or more, of the nodes goes down, the other nodes can continue with
normal operations and suffer no data loss.
 When used correctly, NoSQL databases can provide high performance at an
extremely large scale, and never shut down.
 Types of NoSQL databases
 There are 4 basic types of NoSQL databases:
 Key-Value Store – It has a big hash table of keys and values. {Example: Riak, Amazon S3
(Dynamo)}
 Document-based Store – It stores documents made up of tagged elements. {Example: CouchDB}
 Column-based Store – Each storage block contains data from only one column. {Example:
HBase, Cassandra}
 Graph-based – A network database that uses edges and nodes to represent and store data.
{Example: Neo4j}
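The four models can be pictured with ordinary in-memory structures. The following is a minimal Python sketch (toy, hypothetical data; no real NoSQL product is used) showing how the same customer information might be laid out in a key-value store, a document store, a column-family store and a graph store.

    # Toy illustration of the four NoSQL data models using plain Python structures.
    # Names and values are hypothetical.

    # 1. Key-value store: one big hash table of keys and opaque values.
    kv_store = {"customer:101": '{"name": "Asha", "city": "Hyderabad"}'}

    # 2. Document store: each record is a self-describing document (here a dict).
    doc_store = {"customers": [{"_id": 101, "name": "Asha", "orders": [5001, 5002]}]}

    # 3. Column-family (wide-column) store: data is grouped by column, not by row.
    column_store = {"name": {101: "Asha"}, "city": {101: "Hyderabad"}}

    # 4. Graph store: nodes and edges represent entities and relationships.
    graph_store = {"nodes": {101: "customer", 5001: "order"},
                   "edges": [(101, "PLACED", 5001)]}

    print(kv_store["customer:101"])
    print(column_store["city"][101])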

Advantages of NoSQL databases over SQL and RDBMS

 Higher scalability
 A distributed computing system
 Lower costs
 A flexible schema
 Can process unstructured and semi-structured data
 Has no complex relationships
Disadvantages of NoSQL databases
 They are resource intensive, demanding high RAM and CPU allocations.
 It can also be difficult to find tech support if your open source NoSQL system
goes down.

WHY DISTRIBUTED COMPUTING IS NEEDED FOR BIG DATA

Not all problems require distributed computing. If a tight time constraint doesn’t exist, complex
processing can be done remotely via a specialized service. When companies needed to do complex data
analysis, IT would move data to an external service or entity where lots of spare resources were
available for processing.

It wasn’t that companies wanted to wait to get the results they needed; it just wasn’t economically
feasible to buy enough computing resources to handle these emerging requirements. In many
situations, organizations would capture only selections of data rather than try to capture all the data
because of costs. Analysts wanted all the data but had to settle for snapshots, hoping to capture the
right data at the right time.

THE PROBLEM WITH LATENCY FOR BIG DATA


One of the perennial problems with managing data — especially large quantities of data — has been
the impact of latency. Latency is the delay within a system based on delays in execution of a task.
Latency is an issue in every aspect of computing, including communications, data management,
system performance, and more.

If you have ever used a wireless phone, you have experienced latency firsthand. It is the delay in the
transmissions between you and your caller. At times, latency has little impact on the task at hand; for
big data applications that depend on real-time analysis and reaction, however, minimizing latency is critical.

Big data structuring

Comparison of structured, semi-structured and unstructured data

1. Data model
Structured: conforms to a data model; relationships exist between the data. An RDBMS conforms to the
relational model, wherein data is stored in rows and columns.
Semi-structured: does not conform to a data model, but some structure exists; it uses tags to segregate
semantic elements.
Unstructured: does not conform to a data model.

2. Machine readability
Structured: can be easily used by a computer.
Semi-structured: cannot be easily used by a computer.
Unstructured: cannot be easily used by a computer.

3. Examples
Structured: data stored in databases.
Semi-structured: emails, XML, HTML.
Unstructured: memos, chat, PowerPoint presentations, images, videos, letters, research papers, body of an email.

4. Sources and characteristics
Structured: online processing systems; ease of I/O, security, indexing/searching, scalability and
transaction processing.
Semi-structured: XML, JSON; inconsistent structure; label/value pairs; schema information blended with
data values.
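A minimal sketch (hypothetical records, plain Python) of what the three forms look like in practice: structured data fits fixed rows and columns, semi-structured data carries its own tags/labels, and unstructured data is free-form.

    # Hypothetical examples of the three forms of data.

    # Structured: fixed schema, every record has the same columns (like an RDBMS row).
    structured_rows = [
        {"id": 1, "name": "Asha", "salary": 52000},
        {"id": 2, "name": "Ravi", "salary": 48000},
    ]

    # Semi-structured: no fixed schema, but tags/labels segregate the semantic elements
    # (as in XML or JSON); fields can differ from record to record.
    semi_structured = [
        {"id": 1, "name": "Asha", "email": "asha@example.com"},
        {"id": 2, "name": "Ravi", "phones": ["+91-9000000000"], "dept": "Sales"},
    ]

    # Unstructured: free text (or images, video, audio) with no data model at all.
    unstructured = "Hi team, attaching the quarterly memo and the CCTV clips from Friday."

    print(structured_rows[0]["salary"])       # easy for a computer to use
    print(semi_structured[1].get("email"))    # may or may not be present
    print(len(unstructured.split()))          # needs processing before it is useful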

 Data structures are the programmatic way of storing data so that data can be used efficiently.
 Almost every enterprise application uses various types of data structures in one way or another.
 A data structure is a systematic way to organize data in order to use it efficiently.
 The following terms are the foundation terms of a data structure.
 Interface (function)
 Each data structure has an interface.
 Interface represents the set of operations that a data structure supports.
 An interface only provides
o the list of supported operations,
o type of parameters they can accept
o return type of these operations.
 Implementation
 Implementation provides the internal representation of a data structure.
 Implementation also provides the definition of the algorithms used in the
operations of the data structure.
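A minimal sketch of the interface/implementation split, using a hypothetical stack: the interface only names the supported operations, their parameters and return types, while the implementation supplies the internal representation and the algorithms.

    from abc import ABC, abstractmethod

    # Interface: only lists the supported operations, their parameters and return types.
    class Stack(ABC):
        @abstractmethod
        def push(self, item) -> None: ...
        @abstractmethod
        def pop(self): ...

    # Implementation: internal representation (a Python list) plus the actual algorithms.
    class ListStack(Stack):
        def __init__(self):
            self._items = []           # internal representation hidden from the user
        def push(self, item) -> None:
            self._items.append(item)   # append at the end of the list
        def pop(self):
            return self._items.pop()   # remove from the end of the list

    s = ListStack()
    s.push(10); s.push(20)
    print(s.pop())   # 20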

Structuring big data


• As applied to big data, the idea, therefore, is to get unstructured information, process it
according to requirements and then store it in a suitable data structure as structured data.
• This is where the necessity of developing a new framework for structuring big data comes in.
• Hadoop is one such platform; it facilitates the distribution and storage of unstructured data.

Need for Data Structure


 As applications are getting complex and data rich, there are three common problems that
applications face nowadays.
 Data Search (see the sketch after this list)
 Consider an inventory of 1 million (10^6) items of a store.
 If the application has to search for an item, it has to search among 1 million (10^6)
items every time, slowing down the search.
 As data grows, search will become slower.
 Processor speed
 Processor speed, although very high, falls short if the data grows to
billions of records.
 Multiple requests
 As thousands of users can search data simultaneously on a web server, even a
fast server fails while searching the data.
 To solve the above-mentioned problems, data structures come to the rescue.
 Data can be organized in a data structure in such a way that not all items need to be
searched, and the required data can be found almost instantly.
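A minimal sketch of the search problem above, with hypothetical item IDs: a linear scan must look at up to a million records per query, whereas organizing the same data in a hash-based index (a Python dict) makes lookup nearly instant.

    # Hypothetical inventory of 1 million (10^6) items.
    items = [{"id": i, "name": f"item-{i}"} for i in range(1_000_000)]

    # Without a suitable data structure: linear search scans every record.
    def linear_search(items, item_id):
        for item in items:                 # up to 10^6 comparisons per query
            if item["id"] == item_id:
                return item
        return None

    # With a data structure (hash index): one dictionary lookup per query.
    index = {item["id"]: item for item in items}   # built once

    print(linear_search(items, 999_999)["name"])   # slow path
    print(index[999_999]["name"])                  # near-instant path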

Characteristics of a Data Structure


 Correctness
 Data structure implementation should implement its interface correctly.
 Time Complexity
 Running time, or the execution time of the operations of a data structure, must be as small as
possible. It is denoted as a function f(n), where n is the size of the input (the number of data items).
 Space Complexity
 Memory usage of a data structure operation should be as little as possible.

Basic Terminology
 Data
 Data are values or sets of values.
 Data Item
 A data item refers to a single unit of values.
 Group Items
 Data items that are divided into sub-items are called Group Items.
 Elementary Items
 Data items that cannot be divided are called Elementary Items.
 Attribute and Entity
 An entity is that which contains certain attributes or properties, which may be
assigned values.
 Entity Set
 Entities of similar attributes form an entity set.
 Field
 Field is a single elementary unit of information representing an attribute of an
entity.
 Record
 Record is a collection of field values of a given entity.
 File
 File is a collection of records of the entities in a given entity set.
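A minimal sketch (hypothetical employee data) relating these terms: a field holds one attribute value of an entity, a record is the collection of field values for one entity, and a file is the collection of records for the entity set.

    # Entity: an employee, described by attributes (fields) that are assigned values.
    record = {"emp_id": 101, "name": "Asha", "dept": "Sales"}   # record = field values of one entity

    # Entity set / file: a collection of records of similar entities.
    employee_file = [
        {"emp_id": 101, "name": "Asha", "dept": "Sales"},
        {"emp_id": 102, "name": "Ravi", "dept": "HR"},
    ]

    print(record["name"])                 # the 'name' field of one record
    print(len(employee_file), "records")  # the file holds the whole entity set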

Data Structures - Algorithms Basics

 An algorithm is a step-by-step procedure, which defines a set of instructions to be executed in a
certain order to get the desired output.
 Algorithms are generally created independently of underlying languages, i.e. an algorithm can be
implemented in more than one programming language.
 From the data structure point of view, the following are some important categories of algorithms
(a short sketch follows this list):
 Search
 Algorithm to search for an item in a data structure.
 Sort
 Algorithm to sort items in a certain order.
 Insert
 Algorithm to insert an item into a data structure.
 Update
 Algorithm to update an existing item in a data structure.
 Delete
 Algorithm to delete an existing item from a data structure.
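A minimal sketch of the five categories on a plain Python list (toy data): sort, search, insert, update and delete.

    # Toy data structure: a list of item IDs.
    data = [42, 7, 19, 3]

    # Sort: arrange items in a certain order.
    data.sort()                       # [3, 7, 19, 42]

    # Search: find an item (linear search here; binary search would suit a sorted list).
    found = 19 in data

    # Insert: add a new item to the structure.
    data.append(25)

    # Update: change an existing item.
    data[data.index(25)] = 26

    # Delete: remove an existing item.
    data.remove(3)

    print(data, "| 19 found:", found)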


Technologies used in BD environments


1. In-memory analytics: the relevant data is preprocessed and stored in RAM (rather than on disk) so
that it can be queried quickly.
2. In-database processing
3. Symmetric multiprocessor system (SMP)
4. Massively Parallel Processing (MPP): processing of applications by segmenting the programs and
allocating the segments to a number of processors in parallel. Each processor may have its own
OS and dedicated memory. Segments of a program communicate using a messaging interface (see the sketch below).
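The MPP idea can be sketched on a single machine with Python's multiprocessing module: the program is segmented, each segment runs in its own worker process with its own memory, and the segments communicate only through a messaging interface (here a queue). This is only an illustration of the principle, not a real MPP platform.

    from multiprocessing import Process, Queue

    def worker(segment, queue):
        # Each worker owns its segment and its own memory; it sends results back
        # through the messaging interface (the queue) rather than shared memory.
        queue.put(sum(segment))

    if __name__ == "__main__":
        data = list(range(1_000))
        segments = [data[i::4] for i in range(4)]   # split the work into 4 segments

        queue = Queue()
        procs = [Process(target=worker, args=(seg, queue)) for seg in segments]
        for p in procs:
            p.start()
        partials = [queue.get() for _ in procs]     # one partial result per worker
        for p in procs:
            p.join()

        print("total =", sum(partials))             # 499500, same as sum(data)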

Elements of big data

These four elements of big data reflect the tasks involved in using big data for business intelligence.
1. Data collection: deals with how to collect such big data (with the characteristic Vs) from multiple,
geographically separated sources.
2. Data storage: where and how to store and retrieve such data, which cannot be accommodated on one
server/memory.
3. Data analysis: how to process such data when it is not stored in one place (BDA).
4. Data visualization/output.

 VARIETY
 Data can be sourced from emails, audio players, video recorders, watches, personal
devices, computers, health monitoring systems, satellites..etc.
 Each device that is recording data is recording and encoding it in a different format
and pattern.
 Additionally, the data generated from these devices also can vary by granularity,
timing, pattern and schema.
 Much of the data generated is based on object structures that vary depending on an
event, individual, transaction or location.
 Data collection from varied sources and in varied forms means that traditional relational databases
and structures cannot be used to interpret and store this information.
 NoSQL technologies are the solution to move us forward because of the flexible
approach they bring to storing and reading data without imposing strict relational
bindings.
 NoSQL systems such as document stores and column stores already provide a good
replacement for OLTP/relational database technologies, as well as read/write speeds that
are much faster.

Velocity
 The velocity of data streaming is extremely fast paced.
 Every millisecond, systems all around the world are generating data based on events and
interactions.
 Devices like heart monitors, televisions, RFID scanners and traffic monitors generate data at
the millisecond. Servers, weather devices, and social networks generate data at the second.
 As technology advances, it would not be surprising to see devices that generate data even at
the nanosecond.
 The reward that this data velocity provides is information in real time that can be harnessed to
make near real-time decisions or actions.
 Most of the traditional insights we have are based on aggregations of actuals over days and
months.
Having data at the grain of seconds or milliseconds provides more detailed and vivid
information.
 The speed at which data is generated demands equally quick, if not quicker, tools and
technology to be able to extract, process and analyze the data.
 This limitation has led to the emergence of Big Data architectures and technologies: NoSQL,
distributed and service-oriented systems.
 NoSQL systems replace traditional OLTP/relational database technologies because they place
less importance on ACID (Atomicity, Consistency, Isolation, Durability) principles and are
able to read/write records at much faster speeds.
 Distributed and Load Balancing systems have now become a standard in all organizations to
split and distribute the load of extracting, processing and analyzing data across a series of
servers.
This allows for large amounts of data to be processed at high speed, which eliminates
bottlenecks.
 Enterprise Service Bus (ESB) systems replace traditional integration frameworks written in
custom code.
 These distributed and easily scalable systems allow for serialization across large workloads
and applications to process large amounts of data to a variety of different applications and
systems.

Volume
 If we take all the data generated in the world between the beginning of time and 2008,
the same amount of data will soon be generated every minute.
 Billions of touch points generate petabytes and zettabytes of data.
 On social media and telecommunication sites alone, billions of messages, clicks and uploads
take place everyday.
 We now have information for every interaction, perspective and alternative. Having this diverse
data allows us to more effectively analyze, predict, test and ultimately prescribe to our
customers.
 Large collections of data coupled with the challenges of Variety (different formats)
and Velocity (near real time generation) pose significant managing costs to
organizations.
 Despite the pace of Moore's Law, the challenge to store large data sets can no longer be met
with traditional databases or data stores.
 Organizations instead rely on the strengths of distributed storage systems like SAN (Storage Area Network)
as well as NoSQL data stores, which are able to effectively divide, compress and store large amounts of
data with improved read/write performance.

Veracity
 In context, a fourth V, Veracity is often referenced.
 Veracity concerns the data quality risks and accuracy as data is generated at such a high
and distributed frequency.
 In solving the challenge of the 3 Vs, organizations put little emphasis or work into cleaning up
the data and filtering out what is unnecessary, and as a result the credibility and reliability of data
have suffered.

Differences between traditional and big data handling for business intelligence
• Data collection: in traditional practice the data is collected from one enterprise, whereas big
data is collected from different sources across the internet.
• Data storage: traditional data can be accommodated on one server's storage, whereas big data
cannot be and has to be distributed across different storage systems.
Also, big data is required to be scaled up horizontally by adding more servers and storage space,
whereas traditional data is scaled up vertically on the same server.
• Data analysis: since big data is distributed it has to be processed in parallel, both offline
and in real time, while traditional data could be analyzed offline.
• Also, in the traditional approach the data is structured and the data is moved to the processing functions,
whereas with big data it is difficult to move large volumes of data, so the processing functions must be
moved to the data instead.
• Data visualization/output: to steer the business to excellence by understanding customers',
vendors' and suppliers' requirements and preferences.

Parallel and distributed systems for Big data.


Key hardware and software breakthroughs revolutionized the data management industry. First,
innovation and demand increased the power and decreased the price of hardware. New software
emerged that understood how to take advantage of this hardware by automating processes like load
balancing and optimization across a huge cluster of nodes.

THE CHANGING ECONOMICS OF COMPUTING AND BIG DATA

Fast-forward to today and a lot has changed. Over the last several years, the cost to purchase computing and
storage resources has decreased dramatically. Aided by virtualization, commodity servers that could
be clustered and blades that could be networked in a rack changed the economics of computing. This
change coincided with innovation in software automation solutions that dramatically improved the
manageability of these systems.

The capability to leverage distributed computing and parallel processing techniques dramatically
transformed the landscape and dramatically reduced latency. There are special cases, such as High
Frequency Trading (HFT), in which low latency can only be achieved by physically locating servers in
a single location.

BIG DATA ANALYTICS


Existing analytics tools and techniques will be very helpful in making sense of big data. However,
there is a catch. The algorithms that are part of these
tools have to be able to work with large amounts of potentially real-time and disparate data. The
supporting infrastructure will
need to be in place to support this. And, vendors providing analytics tools will also need to ensure that
their algorithms work across distributed implementations. Because of these complexities, we also
expect a new class of tools to help make sense of big data. We list three classes of tools in this layer of
the reference architecture. They can be used independently or collectively by decision makers to help
steer the business. The three classes of tools are as follows:

✓ Reporting and dashboards: These tools provide a “user-friendly” representation of the information
from various sources. Although a mainstay in the traditional data world, this area is still evolving for
big data. Some of the tools that are being used are traditional ones that can now access the new kinds
of databases collectively called NoSQL (Not Only SQL).
✓ Visualization: These tools are the next step in the evolution of reporting. The output tends to be
highly interactive and dynamic in nature.
Another important distinction between reports and visualized output is animation. Business users can
watch the changes in the data utilizing a variety of different visualization techniques, including mind
maps, heat maps, infographics, and connection diagrams. Often, reporting and visualization occur at
the end of the business activity. Although the data may be imported into another tool for further
computation or examination, this is the final step.
✓ Analytics and advanced analytics: These tools reach into the data warehouse and process the data
for human consumption. Advanced analytics should explicate trends or events that are transformative,
unique, or revolutionary to existing business practice. Predictive analytics and sentiment analytics are
good examples of this science.

What is BDA?
1. Working with data sets whose volume, variety and velocity exceed the present storage and
computing capabilities.
2. Steering the business to excellence by understanding customers', vendors' and suppliers' requirements
and preferences.
3. Quicker and better decision making.
4. Better collaboration between IT, business users and data scientists.
5. Writing the code for distributed processing to achieve the above tasks.
What isn’t BDA?

Data Analytics
• Data Analytics (DA) is the science of examining raw data with the purpose of drawing
conclusions from that information.
• The data that is captured by any data collection agent, tool or software is in its raw form, i.e.,
unformatted, unstructured, unclean (with noise/errors), redundant or inconsistent.
• Hence, analytics covers a spectrum of activities starting from data collection through to visualization.
• Data analytics is generally divided into three broad categories (a short sketch of the first follows this list):
• (i) Exploratory Data Analysis (EDA)
• (ii) Confirmatory Data Analysis (CDA)
• (iii) Qualitative Data Analysis (QDA)
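A minimal sketch of exploratory data analysis (EDA, category (i) above) using pandas on a small hypothetical data set; a real EDA would begin the same way, with summary statistics and checks for missing or inconsistent values in the raw data.

    import pandas as pd

    # Hypothetical raw data as it might arrive from a collection tool:
    # unclean, with a missing value and an inconsistent label.
    raw = pd.DataFrame({
        "age":    [25, 31, None, 45, 31],
        "city":   ["Hyderabad", "hyderabad", "Chennai", "Chennai", "Hyderabad"],
        "income": [30000, 42000, 39000, 51000, 42000],
    })

    raw["city"] = raw["city"].str.title()          # basic cleaning of inconsistent labels
    print(raw.describe())                          # summary statistics (exploration)
    print(raw.isna().sum())                        # how much data is missing?
    print(raw.groupby("city")["income"].mean())    # a first look at patterns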

Difference in analysis of data

Traditional Analytics

• It is structured and repeatable in nature


• A structure is built first to store the data
• Business users determine the questions to be answered, and IT experts build systems to
answer those questions

Big Data Analytics

• Iterative and exploratory in nature


• The data itself is the structure
• The IT team and data experts deliver the data on a flexible platform for exploration and querying
by the business users

Classification of analytics
• Classification I
• 1.basic analytics
• 2.operationalized analytics
• 3.advanced analytics
• 4.monetized analytics
• Classification 2
• 1.analytics 1.0
• 2. analytics 2.0
• 3. analytics 3.0
• Classification 3
• (i)Exploratory Data Analysis (EDA)
• (ii)Confirmatory Data Analysis (CDA)
• (iii) Qualitative Data Analysis (QDA)
1. Basic analytics: slicing and dicing of historical data to generate reporting, basic visualization, etc.
2. Operationalized analytics: the analysis is woven into the business processes of an enterprise.
3. Advanced analytics: using predictive and prescriptive modeling to forecast the future.
4. Monetized analytics: used to derive direct revenue.

• Past:
• Reports/dashboards: what happened?
• Data mining: why did it happen?
• Present:
• Real-time analytics: what is happening?
• Real-time data mining: why is it happening?
• Future:
• Predictive analytics: what is likely to happen? (see the sketch below)
• Prescriptive analytics: how to leverage it to one's own advantage?
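A minimal sketch of the predictive step ("what is likely to happen") using a simple linear trend fitted with numpy to hypothetical monthly sales figures; real predictive analytics would use richer models, but the idea is the same: learn from historical data and extrapolate.

    import numpy as np

    # Hypothetical historical data: monthly sales for the last 6 months.
    months = np.array([1, 2, 3, 4, 5, 6])
    sales  = np.array([100, 108, 115, 123, 131, 140])

    # Descriptive: what happened?
    print("average monthly sales:", sales.mean())

    # Predictive: what is likely to happen next month?
    slope, intercept = np.polyfit(months, sales, deg=1)   # fit a straight-line trend
    print("forecast for month 7:", slope * 7 + intercept)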

Challenges facing big data

• 1. Scale: storage must handle elastically scaling data, whether scaled vertically or horizontally.
• 2. Security: NoSQL platforms have poor security mechanisms.
• 3. Schema: dynamic.
• 4. Continuous availability: data must be available 24/7 without downtime, yet both NoSQL and
RDBMS platforms have some downtime built in.
• 5. Consistency: should we always get the latest updated data? Should we opt for consistency or
eventual consistency?
• 6. Partition tolerance: if the network is partitioned, the system should still be able to handle hardware
and software problems.
• 7. Data quality: how do we maintain accuracy and timeliness? Is there metadata in place?

CAP Theorem
• Only 2 of the 3 guarantees (Consistency, Availability, Partition tolerance) can be provided at once.
• CA: traditional RDBMSs, MySQL, etc.
• CP: HBase, MongoDB, etc.
• AP: Riak, Cassandra, etc.
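A toy sketch (not a real database) of the trade-off the theorem describes: when a partition separates the replicas, the system must either refuse the write (choosing consistency over availability, the CP style) or accept it and let the replicas diverge temporarily (choosing availability, the AP style).

    # Toy illustration of the CAP trade-off during a network partition.
    class Replica:
        def __init__(self):
            self.value = None

    def write(replicas, value, partitioned, prefer_consistency):
        """Try to write to all replicas; 'partitioned' marks replicas we cannot reach."""
        reachable = [r for r in replicas if r not in partitioned]
        if partitioned and prefer_consistency:
            # CP choice: refuse the write (lose availability) rather than let replicas diverge.
            return False
        for r in reachable:  # AP choice: accept the write; unreachable replicas are now stale.
            r.value = value
        return True

    nodes = [Replica(), Replica()]
    ok = write(nodes, "v1", partitioned=[nodes[1]], prefer_consistency=True)
    print("write accepted?", ok)            # False: consistency preferred over availability
    ok = write(nodes, "v1", partitioned=[nodes[1]], prefer_consistency=False)
    print([n.value for n in nodes])         # ['v1', None]: available but temporarily inconsistent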

Why BDA is important ?

Because BDA has various approaches that lead to:

1. Reactive - business intelligence: this approach analyzes historical, static data sets and
generates reports. It enables the business to take better decisions by providing the right information
to the right person at the right time in the right format.
2. Reactive - BDA: this approach also analyzes static data, but here the data is huge.
3. Proactive - analytics: this approach applies traditional data mining, predictive modeling, text mining
and statistical analysis to big data; it therefore has limitations on storage and processing capacity.
4. Proactive - BDA: this approach filters relevant data from big data and analyzes it using high
performance analytics to solve complex problems using more data.

What to do with these data?


• Analyzing big data allows analysts, researchers, and business users to make better and faster
decisions using data that was previously inaccessible or unusable.
• Using advanced analytics techniques such as text analytics, machine learning, predictive
analytics, data mining, statistics, and natural language processing, businesses can analyze
previously untapped data sources independently of or together with their existing enterprise data to
gain new insights, resulting in significantly better and faster decisions.
• Aggregation and statistics: information is gathered based on specific variables such as age,
profession, or income and expressed in a summary form for statistical analysis (a short sketch follows this list).
• Data aggregation is common in data warehouses and OLAP operations.
• Indexing, searching, and querying: indexing based on keys is suitable for keyword-based
search and pattern-matching applications (e.g. pattern matching over XML/RDF).
• Knowledge discovery: knowledge discovery by applying various data mining and statistical
modeling techniques to such data has become strategically important.
• – Data Mining
• – Statistical Modeling
• Companies now use an increasing array of tools to develop a 360-degree view of the customer:
• social media listening tools to gather what customers are saying on sites like Facebook and
Twitter,
• predictive analytics tools to determine what customers may research or purchase next,
• customer relationship management suites and marketing automation software.
• Companies can get a complete view of customers by aggregating data from the various touch
points that a customer may use.
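A minimal sketch of aggregation and statistics with pandas, on hypothetical customer data grouped by the kind of variables mentioned above (profession, age, income).

    import pandas as pd

    # Hypothetical customer data gathered from several touch points.
    customers = pd.DataFrame({
        "profession": ["engineer", "teacher", "engineer", "doctor", "teacher"],
        "age":        [29, 41, 35, 52, 38],
        "income":     [60000, 35000, 72000, 90000, 37000],
    })

    # Aggregation: express the data in summary form for statistical analysis,
    # the kind of roll-up common in data warehouses and OLAP operations.
    summary = customers.groupby("profession").agg(
        customers=("income", "size"),
        avg_income=("income", "mean"),
        avg_age=("age", "mean"),
    )
    print(summary)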

Terminologies
• In-memory analytics: technology to query data held in RAM rather than stored on disk
• In-database processing
• Symmetric multiprocessor system (SMP)
• Massively parallel processing (MPP): coordinated processing of a program by multiple processors,
each working on different parts of the program and using its own OS and memory
• Distributed and parallel computing

Data science and data scientist


• Data science is the science of extracting knowledge from data. It is the science of recognizing
hidden patterns among the data using techniques drawn from statistics, mathematics, IT, ML,
data engineering, probability models, statistical learning, pattern recognition, etc.
• It is multidisciplinary.
• It explores massive data sets for weather prediction, oil drilling, seismic activity, etc.

Big data use cases

BASE
• It is used in distributed computing
• Why? To achieve high availability
• How achieved?
• BASE is a data system design philosophy that prefers availability over consistency of operations.
• BASE was developed as an alternative for
- producing more scalable and affordable data architectures,
- providing more options to expanding enterprises/IT clients
- and simply acquiring more hardware to expand data operations.
• BASE is an acronym for Basically Available, Soft state, Eventual consistency.
• Basically Available: The system is guaranteed to be available for querying by all users.
• Soft State: The values stored in the system may change because of the eventual consistency
model, as described in the next bullet.
• Eventually Consistent: As data is added to the system, the system’s state is gradually replicated
across all nodes. For the short period before the blocks are replicated, the state of the file
system isn’t consistent.
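A toy sketch of eventual consistency (no real data store involved): a write is accepted by one node and replicated to the others later, so for a short period a read from another node can return the old, "soft" state.

    # Toy illustration of BASE-style eventual consistency across three nodes.
    nodes = [{"x": 0}, {"x": 0}, {"x": 0}]
    pending = []                       # replication work queued for later

    def write(primary, key, value):
        nodes[primary][key] = value    # basically available: the write is accepted immediately
        for i in range(len(nodes)):    # replication to the other nodes happens asynchronously
            if i != primary:
                pending.append((i, key, value))

    def replicate():
        while pending:                 # eventually, every node receives the update
            i, key, value = pending.pop()
            nodes[i][key] = value

    write(0, "x", 42)
    print([n["x"] for n in nodes])     # [42, 0, 0]  -> soft, not yet consistent state
    replicate()
    print([n["x"] for n in nodes])     # [42, 42, 42] -> eventually consistent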

Analytics tools
• MS EXCEL
• SAS
• IBM SPSS Modeler
• Statistica
• Salford systems
• WPS

Main open source analytics tools
• R analytics
• Weka

Other open source analytics tools

• Apache Hadoop
• Apache Spark
• Apache Storm
• Apache Cassandra
• MongoDB
• R Programming Environment
• Neo4j
• Apache SAMOA

Extra tools:

• 1. R tool
• 2. Weka
• 3. Pandas
• 4. Tanagra
• 5. Gephi
• 6. MOA (Massive Online Analysis)
• 7. Orange
• 8. RapidMiner
• 9. Root packages
• 10. Encog
• 11. NodeXL
• 12. Waffles

Businesses and Big Data Analytics


Big Data analytics tools and techniques are rising in demand due to the use of Big Data in businesses.
Organizations can find new opportunities and gain new insights to run their business efficiently. These
tools help in providing meaningful information for making better business decisions.
Companies can improve their strategies by keeping the focus on the customer. Big data analytics
efficiently helps operations to become more effective. This helps in improving the profits of the
company.
Big data analytics tools like Hadoop help in reducing the cost of storage. This further increases the
efficiency of the business. With the latest analytics tools, analysis of data becomes easier and quicker.
This, in turn, leads to faster decision-making, saving time and energy.

Real-time Benefits of Big Data Analytics


There has been enormous growth in the field of Big Data analytics owing to the benefits of the
technology. This has led to the use of big data in multiple industries ranging from
 Banking
 Healthcare
 Energy
 Technology
 Consumer
 Manufacturing
There are many other industries which use big data analytics. Banking is seen as the field making the
maximum use of Big Data Analytics.
