Big data analytics notes
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup scripts,
Configuration files.
2. Hadoop Implementation of file management tasks, such as Adding files and directories,
retrieving files and Deleting files
3. Implementation of Matrix Multiplication with Hadoop MapReduce
4. Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases.
Software Requirements: Cassandra, Hadoop, Java, Pig, Hive and HBase.
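A minimal sketch of experiment 2 (HDFS file management) using Hadoop's Java FileSystem API is shown below. The NameNode URI (hdfs://localhost:9000) and all file paths are assumptions for a local single-node setup, not part of the syllabus above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Adding, retrieving and deleting files and directories in HDFS.
    public class HdfsFileOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode URI
            FileSystem fs = FileSystem.get(conf);

            fs.mkdirs(new Path("/user/demo"));                      // add a directory
            fs.copyFromLocalFile(new Path("input.txt"),
                                 new Path("/user/demo/input.txt")); // add a file
            fs.copyToLocalFile(new Path("/user/demo/input.txt"),
                               new Path("retrieved.txt"));          // retrieve a file
            fs.delete(new Path("/user/demo/input.txt"), false);     // delete (non-recursive)
            fs.close();
        }
    }

The same operations are available from the shell as hdfs dfs -mkdir, -put, -get and -rm.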
Big data is data that exceeds the processing capacity of conventional database systems. The data
is too big, moves too fast, or does not fit the structures of traditional database architectures. In
other words, Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or traditional
data processing applications. To gain value from this data, you must choose an alternative way to
process it. Big Data is the next generation of data warehousing and business analytics and is
poised to deliver top line revenues cost efficiently for enterprises. Big data is a popular term used
to describe the exponential growth and availability of data, both structured and unstructured.
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world
today has been created in the last two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
Definition
❖ Big data can be defined as very large volumes of data available at various sources, in
varying degrees of complexity, generated at different speeds, which cannot be processed
using traditional technologies, processing methods and algorithms.
❖ Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process the data within a tolerable elapsed
time.
❖ Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision-making.
◻ Big data is often boiled down to a few varieties including social data, machine data,
and transactional data.
◻ Machine data consists of information generated from industrial equipment, real-time data
from sensors that track parts and monitor machinery (often also called the Internet of
Things), and even web logs that track user behavior online.
◻ Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants
like US pizza chain Domino's, which serves over 1 million customers per day, are
generating petabytes of transactional big data.
◻ The thing to note is that big data can resemble traditional structured data or unstructured,
high frequency information.
Big (and small) Data analytics is the process of examining data—typically of a variety of
sources, types, volumes and / or complexities—to uncover hidden patterns, unknown
correlations, and other useful information.
The intent is to find business insights that were not previously possible or were missed, so that
better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide
1. Deeper insights. Rather than looking at segments, classifications, regions, groups, or
other summary levels, you'll have insights into all the individuals, all the products, all the
parts, all the events, all the transactions, etc.
2. Frictionless actions. Increased reliability and accuracy that will allow the deeper and
broader insights to be automated into systematic actions.
Data science vs. big data:
1. Data science is a field of scientific analysis of data in order to solve analytically complex
problems, together with the significant and necessary activity of cleansing and preparing data.
Big data is storing and processing large volumes of structured and unstructured data that is not
possible with traditional applications.
2. Data science is used in biotech, energy, gaming and insurance. Big data is used in retail,
education, healthcare and social media.
4. Re-develop your products: Big data can also help you understand how others perceive
your products so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product/services, if any.
6. Better operational efficiency.
Big Data Challenges:
Collecting, storing and processing big data comes with its own set of challenges:
1. Big data is growing exponentially and existing data management solutions have to be
constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and
work with big data and big data tools.
1.2 Convergence of Key Trends:
◻ The essence of computer applications is to store things in the real world into computer
systems in the form of data, i.e., it is a process of producing data. Some data are the
records related to culture and society and others are the descriptions of phenomena of the
universe and life. The large scale of data is rapidly generated and stored in computer
systems, which is called data explosion.
◻ Data is generated automatically by mobile devices and computers; think Facebook posts,
search queries, directions, GPS locations and image capture.
◻ Sensors also generate volumes of data, including medical data and commerce location-
based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all
this data is expensive. Analysis gets more important and more expensive every year.
◻ The big data explosion driven by the current data boom shows how critical it is for us to
be able to extract meaning from all of this data.
1. Volume: Volumes of data are larger than conventional relational database infrastructure
can cope with. It consists of terabytes or petabytes of data.
2. Velocity:
➢ The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data. It is
being created in or near real-time.
➢ Data is increasingly accelerating the velocity at which it is created and at which it is
integrated. We have moved from batch to a real-time business.
➢ Initially, companies analyzed data using a batch process. One takes a chunk of data,
submits a job to the server and waits for delivery of the result.
➢ That scheme works when the incoming data rate is slower than the batch-processing rate
and when the result is useful despite the delay.
➢ With the new sources of data such as social and mobile applications, the batch process
breaks down. The data is now streaming into the server in real time, in a continuous
fashion and the result is only useful if the delay is very short.
➢ Data comes at you at a record or a byte level, not always in bulk. And the demands of the
business have increased as well – from an answer next week to an answer in a minute.
➢ In addition, the world is becoming more instrumented and interconnected. The volume
of data streaming off those instruments is exponentially larger than it was even 2 years
ago.
3. Variety:
➢ It refers to heterogeneous sources and the nature of data, both structured and
unstructured.
➢ Variety presents an equally difficult challenge. The growth in data sources has fuelled the
growth in data types. In fact, 80% of the world’s data is unstructured.
➢ Yet most traditional methods apply analytics only to structured information.
➢ Data has moved from Excel tables and databases to formats that have lost that structure,
with hundreds of new formats appearing.
➢ Pure text, photo, audio, video, web, GPS data, sensor data, relational databases,
documents, SMS, PDF, Flash, etc. One no longer has control over the input data format.
➢ Structure can no longer be imposed like in the past in order to keep control over the
analysis. As new applications are introduced new data formats come to life.
The variety of data sources continues to increase. It includes
● Internet data (e.g., click stream, social media, social networking links)
● Primary research (e.g., surveys, experiments, observations)
● Secondary research (e.g., competitive and marketplace data, industry reports, consumer
data, business data)
● Location data (e.g., mobile device data, geospatial data)
● Image data (e.g., video, satellite images, surveillance)
● Supply chain data (e.g., EDI, vendor catalogs and pricing, quality information)
● Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
4. Value
➢ It represents the business value to be derived from big data. The ultimate objective of
any big data project should be to generate some sort of value for the company doing all
the analysis. Otherwise, you're just performing some technological task for technology's
sake.
➢ For real-time spatial big data, decisions can be enhanced through visualization of
dynamic change in such spatial phenomena as climate, traffic, social-media-based
attitudes and massive inventory locations.
➢ Exploration of data trends can include spatial proximity relationships. Once spatial big
data is structured, formal spatial analytics can be applied, such as spatial autocorrelation,
overlays, buffering, spatial cluster techniques and location quotients.
5. Veracity
➢ Big data must be fed with relevant and true data. We will not be able to perform useful
analytics if much of the incoming data comes from false sources or has errors.
➢ Veracity refers to the trustworthiness or messiness of data: the higher the trustworthiness
of the data, the lower the messiness, and vice versa.
➢ It relates to the assurance of the data's quality, integrity, credibility and accuracy. We must
evaluate the data for accuracy, before using it for business insights because it is obtained
from multiple sources.
Structured data
★ Structured data is arranged in rows and columns format. It helps applications to retrieve
and process data easily. DBMS is used for storing structured data.
★ With a structured document, certain information always appears in the same location on
the page.
★ Structured data generally resides in a relational database, and as a result, it is sometimes
called "relational data." This type of data can be easily mapped into pre-designed fields.
★ For example, a database designer may set up fields for phone numbers, zip codes and
credit card numbers that accept a certain number of digits. Structured data has been or
can be placed in fields like these.
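As a toy illustration of such pre-designed fields, the sketch below creates a table whose columns accept a fixed number of characters. The in-memory H2 database and the table and column names are stand-ins chosen for the example, not anything prescribed by the text.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Structured data: every record fits pre-designed, fixed-format fields.
    public class StructuredDataDemo {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
            Statement st = conn.createStatement();
            st.execute("CREATE TABLE customer (" +
                       "phone_number CHAR(10), " +  // field sized for 10 digits
                       "zip_code CHAR(5), " +       // fixed-length postal code
                       "card_number CHAR(16))");    // fixed-length card field
            st.execute("INSERT INTO customer VALUES " +
                       "('9876543210', '50001', '4111111111111111')");
            conn.close();
        }
    }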
Unstructured data
★ Unstructured data, by contrast, has no pre-defined model or fixed fields, which makes it
much harder to collect, store and analyze.
★ Many organizations believe that their unstructured data stores include information that
could help them make better business decisions.
★ Unfortunately, it's often very difficult to analyze unstructured data. To help with the
problem, organizations have turned to a number of different software solutions designed
to search unstructured data and extract important information.
★ The primary benefit of these tools is the ability to glean actionable information that can
help a business succeed in a competitive environment.
★ Because the volume of unstructured data is growing so rapidly, many enterprises also turn
to technological solutions to help them better manage and store their unstructured data.
★ These can include hardware or software solutions that enable them to make the most
efficient use of their available storage space.
Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:
● Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
● Business intelligence software: Also known as BI, this is a broad category of analytics,
data mining, dashboards and reporting tools that help companies make sense of their
structured and unstructured data for the purpose of making better business decisions.
● Data integration tools: These tools combine data from disparate sources so that they can
be viewed or analyzed from a single application. They sometimes include the capability
to unify structured and unstructured data.
● Document management systems: Also called "enterprise content management
systems," a DMS can track, store and share unstructured data that is saved in the form of
document files.
● Information management solutions: This type of software tracks structured and
unstructured enterprise data throughout its lifecycle.
● Search and indexing tools: These tools retrieve information from unstructured data files
such as documents, Web pages and photos.
Big data plays an important role in digital marketing. Each day the information shared
digitally increases significantly. With the help of big data, marketers can analyze every action of
the consumer. It provides better marketing insights and helps marketers make more accurate
and advanced marketing strategies. Its benefits include:
a) Consumer insights
b) Personalized targeting
c) Increasing sales
d) Campaign result analysis
e) Budget optimization
★ Data constantly informs marketing teams of customer behaviors and industry trends and
is used to optimize future efforts, create innovative campaigns and build lasting
relationships with customers.
★ Big data regarding customers provides marketers details about user demographics,
locations and interests, which can be used to personalize the product experience and
increase customer loyalty over time.
★ Big data solutions can help organize data and pinpoint which marketing campaigns,
strategies or social channels are getting the most traction. This lets marketers allocate
marketing resources and reduce costs for projects that are not yielding as much revenue
or meeting desired audience goals.
★ Personalized targeting : Nowadays, personalization is the key strategy for every
marketer. Engaging the customers at the right moment with the right message is the
biggest issue for marketers. Big data helps marketers to create targeted and personalized
campaigns.
★ Personalized marketing is creating and delivering messages to individuals or groups of
the audience through data analysis, with the help of consumer data such as geolocation,
browsing history, clickstream behavior and purchasing history. It is also known as
one-to-one marketing.
★ Consumer insights: In this day and age, marketing has become the ability of a company
to interpret the data and change its strategies accordingly. Big data allows for real-time
consumer insights which is crucial to understanding the habits of your customers. By
interacting with your consumers through social media you will know exactly what they
want and expect from your product or service, which will be key to distinguishing your
campaign from your competitors.
★ Help increase sales: Big data will help with demand predictions for a product or service.
Information gathered on user behavior will allow marketers to answer what types of
product their users are buying, how often they conduct purchases or search for a product
or service and lastly, what payment methods they prefer using.
★ Analyse campaign results: Big data allows marketers to measure their campaign
performance. This is the most important part of digital marketing. Marketers will use
reports to measure any negative changes to marketing KPIs. If they have not achieved the
desired results it will be a signal that the strategy would need to be changed in order to
maximize revenue and make your marketing efforts more scalable in future.
★ Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
★ Web analytics is not just a tool for measuring web traffic but can be used as a tool for
business and market research, and to assess and improve the effectiveness of a web site.
★ The following are some of the web analytics metrics: Hit, Page view, Visit / Session,
First Visit / First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page
Time Viewed / Page Visibility Time / Page View Duration, Session Duration / Visit
Duration, Average Page View Duration, Click path, etc. (Two of these are computed in
the sketch at the end of this section.)
★ Most people in the online publishing industry know how complex and onerous it could be
to build an infrastructure to access and manage all the Internet data within their own IT
department. Back in the day, IT departments would opt for a four-year project and
millions of dollars to go that route. However, today this sector has built up an ecosystem
of companies that spread the burden and allow others to benefit.
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and
lifetime value
• It tells you how customers and prospective customers engage with your different
marketing campaigns and how that drives subsequent behavior
Deriving value from web analytics data often involves very bespoke analytics. Web analytics
tools are good at delivering the standard reports that are common across different business
types.
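To make two of the metrics listed above concrete, here is a small sketch that computes bounce rate and average session duration from session records. The Session type and the sample numbers are made up; the formulas follow the usual definitions (a bounce is a single-page session).

    import java.util.List;

    // Computing bounce rate and average session duration from session records.
    public class WebMetrics {
        record Session(int pageViews, long durationSeconds) {}

        public static void main(String[] args) {
            List<Session> sessions = List.of(
                    new Session(1, 10),   // a bounce: only one page viewed
                    new Session(5, 320),
                    new Session(3, 95));

            long bounces = sessions.stream()
                    .filter(s -> s.pageViews() == 1).count();
            double bounceRate = 100.0 * bounces / sessions.size();
            double avgDuration = sessions.stream()
                    .mapToLong(Session::durationSeconds).average().orElse(0);

            System.out.printf("Bounce rate: %.1f%%, average session: %.0fs%n",
                              bounceRate, avgDuration);
        }
    }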
★ The healthcare industry is now awash in data: from biological data such as gene
expression, single-nucleotide polymorphisms (SNPs), proteomics and metabolomics to,
more recently, next-generation gene sequence data.
★ This exponential growth in data is further fueled by the digitization of patient-level data:
stored in Electronic Health Records (EHRs) and Health Information Exchanges (HIEs),
enhanced with data from imaging and test results, medical and prescription claims, and
personal health devices.
★ The U.S. healthcare system is increasingly challenged by issues of cost and access to
quality care. Payers, producers, and providers are each attempting to realize improved
treatment outcomes and effective benefits for patients within a disconnected health care
framework.
★ Historically, these healthcare ecosystem stakeholders tend to work at cross purposes with
other members of the health care value chain. High levels of variability and ambiguity
across these individual approaches increase costs, reduce overall effectiveness, and
impede the performance of the healthcare system as a whole.
★ Recent approaches to health care reform attempt to improve access to health care by
increasing government subsidies and reducing the ranks of the uninsured.
★ One outcome of the recently passed Affordable Care Act is a revitalized focus on cost
containment and the creation of quantitative proofs of economic benefit by payers,
producers, and providers.
★ This “the enemy of my enemy is my friend” mentality has created an urgent motivation
for payers, producers, and, to a lesser extent, providers, to create a new health care
information value chain derived from a common healthcare analytics approach.
★ The health care system is facing severe economic, effectiveness, and quality challenges.
These external factors are forcing a transformation of the pharmaceutical business model.
★ Health care challenges are forcing the pharmaceutical business model to undergo rapid
change. Our industry is moving from a traditional model built on regulatory approval and
settling of claims, to one of medical evidence and proving economic effectiveness
through improved analytics derived insights.
★ The success of this new business model will be dependent on having access to data
created across the entire healthcare ecosystem.
★ The Hadoop framework consists of a storage layer known as the Hadoop Distributed File
System (HDFS) and a processing framework called the MapReduce programming model.
★ Hadoop splits large amounts of data into chunks, distributes them within the network
cluster and processes them in its MapReduce Framework.
★ Hadoop can also be installed on cloud servers to better manage the compute and storage
resources required for big data. Leading cloud vendors such as Amazon Web Services
(AWS) and Microsoft Azure offer solutions.
★ Cloudera supports Hadoop workloads both on-premises and in the cloud, including
options for one or more public cloud environments from multiple vendors.
★ Hadoop provides a distributed file system and a framework for the analysis and
transformation of very large data sets using the MapReduce paradigm.
★ A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by
simply adding commodity servers.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously. Software clients input
data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then
processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware failures of
individual machines or racks of machines are common and should be automatically handled in
software by the framework.
Challenges of Hadoop:
MapReduce complexity: As a file-intensive system, MapReduce can be a difficult tool to utilize
for complex jobs, such as interactive analytical tasks.
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the
data.
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is an improved
resource management layer, sometimes described as MapReduce 2.0, used for scheduling
processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various
machines or clusters. It also allows the data to be stored in an accessible format.
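Experiment 4's word count is the canonical illustration of the MapReduce model named above. The sketch below follows the standard Hadoop tutorial version: the mapper emits a (word, 1) pair per token and the reducer sums the counts per word; input and output HDFS paths are supplied as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);  // one (word, 1) pair per token
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }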
1.8.1 Hadoop Ecosystem
● The Hadoop ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.
● The Hadoop ecosystem refers to the various components of the Apache Hadoop software
library, as well as to the accessories and tools provided by the Apache Software
Foundation for these types of software projects and to the ways that they work together.
● Hadoop is a Java - based framework that is extremely popular for handling and analysing
large sets of data. The idea of a Hadoop ecosystem involves the use of different parts of
the core Hadoop set such as MapReduce, a framework for handling vast amounts of data
and the Hadoop Distributed File System (HDFS), a sophisticated file handling system.
There is also YARN, a Hadoop resource manager.
● In addition to these core elements of Hadoop, Apache has also delivered other kinds of
accessories or complementary tools for developers.
● Some of the most well known tools of the Hadoop ecosystem include HDFS, Hive, Pig,
YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, etc.
● Hadoop Distributed File System (HDFS), is one of the largest Apache projects and
primary storage system of Hadoop. It employs a NameNode and DataNode architecture.
● It is a distributed file system able to store large files running over the cluster of
commodity hardware.
● YARN stands for Yet Another Resource Negotiator. It is one of the core components in
open source Apache Hadoop suitable for resource management. It is responsible for
managing workloads, monitoring and security controls implementation.
● Hive is an ETL and data warehousing tool used to query or analyze large datasets stored
within the Hadoop ecosystem. Hive has three main functions: data summarization, query
and analysis of unstructured and semi-structured data in Hadoop. (A JDBC query sketch
follows this list.)
● Apache Pig is a high level scripting language used to execute queries for larger datasets
that are used within Hadoop.
● Apache Spark is a fast, in - memory data processing engine suitable for use in a wide
range of circumstances. Spark can be deployed in several ways, it features Java, Python,
Scala and R programming languages and supports SQL, streaming data, machine learning
and graph processing, which can be used together in an application.
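As referenced in the Hive entry above, Hive queries can also be issued from Java through HiveServer2's JDBC interface (relevant to experiment 5). The host and port, the empty credentials and the sales table below are assumptions for a local practice setup.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Querying Hive from Java over JDBC (HiveServer2 assumed on localhost:10000).
    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement st = conn.createStatement();
            // A typical summarization query of the kind Hive is used for:
            ResultSet rs = st.executeQuery(
                    "SELECT region, SUM(amount) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
            conn.close();
        }
    }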
1. Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
2. Cost effective: Hadoop is open source and uses commodity hardware to store data so it is
really cost effective as compared to traditional relational database management systems.
3. Resilient to failure: HDFS has the property with which it can replicate data over the network.
4. The unique storage method of Hadoop is based on a distributed file system that effectively
maps data wherever the cluster is located.
★ Open source software is like any other software (closed/proprietary software). This
software is differentiated by its use and licenses.
★ Open source software guarantees the right to access and modify the source code and to
use, reuse and redistribute the software, all with no royalty or other costs.
★ Standard Software is sold and supported commercially. However, Open Source software
can be sold and/or supported commercially, too. Open source is a disruptive technology.
★ Open source licenses must permit non-exclusive commercial exploitation of the licensed
work, must make available the work's source code and must permit the creation of
derivative works from the work itself.
★ Netscape released its browser source code under the Netscape Public License and
subsequently under the Mozilla Public License.
★ Proprietary software is computer software which is the legal property of one party. The
terms of use for other parties are defined by contracts or licensing agreements. These
terms may include various privileges to share, alter, disassemble and use the software and
its code.
★ Closed source is a term for software whose license does not allow for the release or
distribution of the software's source code. Generally, it means only the binaries of a
computer program are distributed and the license provides no access to the program's
source code.
★ The source code of such programs is usually regarded as a trade secret of the company.
Access to source code by third parties commonly requires the party to sign a
non-disclosure agreement.
★ The demands of consumers as well as enterprises are ever increasing with the increase in
the information technology usage. Information technology solutions are required to
satisfy their different needs. It is a fact that a single solution provider cannot produce all
the needed solutions. Open source, freeware and free software are now available for
anyone and for any use.
★ In the 1970s and early 1980s, software organizations started using technical measures
to prevent computer users from being able to study and modify software. The copyright
law was extended to computer programs in 1980. The free software movement was
conceived in 1983 by Richard Stallman to satisfy the need for and to give the benefit of
"software freedom" to computer users.
★ Richard Stallman announced the idea of the GNU operating system in September 1983.
The GNU Manifesto, written by Richard Stallman, was published in March 1985. It
launched the free software movement, which aims to promote the universal freedom to
distribute and modify computer software without restriction. In February 1986, the first
formal definition of free software was published.
★ The term "free software" is associated with FSFs definition, and the term "open source
software" is associated with OSI's definition. FSFs and OSI's definitions are worded quite
differently but the set of software that they cover is almost identical.
★ One of the primary goals of this foundation was the development of a free and open
computer operating system and application software that can be used and shared among
different users with complete freedom.
★ Open source differs from the operation of traditional copyright licensing by permitting
both open distribution and open modification.
★ Before the term open source became widely adopted, developers and producers used a
variety of phrases to describe the concept. The term open source gained popularity with
the rise of the Internet, which provided access to diverse production models,
communication paths and last but not least, interactive communities.
★ The NIST defines cloud computing as: "Cloud computing is a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of configurable
computing resources that can be rapidly provisioned and released with minimal
management effort or service provider interaction.
★ This cloud model is composed of five essential characteristics, three service models and
four deployment models."
★ Cloud provider is responsible for the physical infrastructure and the cloud consumer is
responsible for application configuration, personalization and data.
★ Broad network access refers to resources hosted in a cloud network that are available for
access from a wide range of devices. Rapid elasticity is used to describe the capability to
provide scalable cloud computing services.
★ In measured services, NIST talks about measured service as a setup where cloud systems
may control a user or tenant's use of resources by a metering capability
somewhere in the system.
On-demand self-service refers to the service provided by cloud computing vendors that enables
the provision of cloud resources on demand whenever they are required.
The Cloud Cube Model has four dimensions to differentiate cloud formations :
a) External/Internal
b) Proprietary/Open
c) De-perimeterized / Perimeterized
d) Outsourced/Insourced.
External / Internal: The physical location of data is defined by the external/internal dimension.
It defines the organization's boundary.
Example: Information inside a datacenter using a private cloud deployment would be considered
internal and data that resided on Amazon EC2 would be considered external.
Proprietary / Open: This dimension measures not only ownership of the technology but also
its interoperability, use of data, ease of data transfer and the degree of vendor application
lock-in.
Proprietary means that the organization providing the service is keeping the means of provision
under their ownership. Clouds that are open are using technology that is not proprietary, meaning
that there are likely to be more suppliers.
De-perimeterized / Perimeterized: This security dimension measures whether the operations
are inside or outside the security boundary, firewall, etc.
Encryption and key management will be the technology means for providing data confidentiality
and integrity in a de-perimeterized model.
Outsourced / Insourced: This dimension defines whether the customer or the service provider
provides the service.
Outsourced means the service is provided by a third party. It refers to letting contractors or
service providers handle all requests; most cloud business models fall into this category.
Insourced means the services are provided by your own staff under organizational control,
i.e., in-house development of clouds.
★ Cloud computing is often described as a stack, as a response to the broad range of
services built on top of one another under the "cloud". A cloud computing stack is a
cloud architecture built in layers of one or more cloud-managed services (SaaS, PaaS,
IaaS, etc.).
★ Cloud computing stacks are used for all sorts of applications and systems. They are
especially good in microservices and scalable applications, as each tier is dynamically
scaling and replaceable.
★ The cloud computing stack makes up a threefold system comprising its lower-level
elements, which function as formalized cloud computing delivery models: IaaS, PaaS and SaaS.
★ With a cloud model, you pay on a subscription basis with no upfront capital expense. You
don’t incur the typical 30 percent maintenance fees—and all the updates on the platform
are automatically available.
★ The ability to build massively scalable platforms—platforms where you have the option
to keep adding new products and services for zero additional cost—is giving rise to
business models that weren’t possible before. Mehta calls it “the next industrial
revolution, where the raw material is data and data factories replace manufacturing
factories.” He pointed out a few guiding principles that his firm stands by:
1. Stop saying “cloud.” It’s not about the fact that it is virtual, but the true value lies in
delivering software, data, and/or analytics in an “as a service” model. Whether that is in a private
hosted model or a publicly shared one does not matter. The delivery, pricing, and consumption
model matters.
2. Acknowledge the business issues. There is no point in making light of matters around
information privacy, security, access, and delivery. These issues are real, more often than not
heavily regulated by multiple government agencies, and unless dealt with in a solution, will kill
any platform sell.
3. Fix some core technical gaps. Everything from the ability to run analytics at scale in a
virtual environment to ensuring information processing and analytics authenticity are issues that
need solutions and have to be fixed.
1.11 Mobile Business Intelligence
➔ Analytics on mobile devices is what some refer to as putting BI in your pocket. Mobile
drives straight to the heart of simplicity and ease of use that has been a major barrier to
BI adoption since day one.
➔ Mobile devices are a great leveling field where making complicated actions easy is the
name of the game. For example, a young child can use an iPad but not a laptop.
➔ As a result, this will drive broad-based adoption as much for the ease of use as for the
mobility these devices offer. This will have an immense impact on the business
intelligence sector.
➔ Mobile BI or mobile analytics is the rising software technology that allows the users to
access information and analytics on their phones and tablets instead of desktop-based BI
systems.
➔ Mobile analytics involves measuring and analyzing data generated by mobile platforms
and properties, such as mobile sites and mobile applications.
➔ Analytics is the practice of measuring and analyzing data of users in order to create an
understanding of user behavior as well as website or app performance. If this practice is
done on mobile apps and app users, it is called "mobile analytics".
➔ Mobile analytics is the practice of collecting user behavior data, determining intent from
those metrics and taking action to drive retention, engagement and conversion.
➔ Mobile analytics is similar to web analytics in that it identifies the unique customer and
records their usage.
➔ With mobile analytics data, you can improve your cross-channel marketing initiatives,
optimize the mobile experience for your customers and grow mobile user engagement
and retention.
➔ Analytics usually comes in the form of a software that integrates into a company's
existing websites and apps to capture, store and analyze the data.
➔ It is always very important for businesses to measure their critical KPIs (Key
Performance Indicators), as the old rule is always valid: "If you can't measure it, you
can't improve it".
➔ To be more specific, if a business finds out that 75% of its users exit at the shipment
screen of its sales funnel, there is probably something wrong with that screen in terms of
its design, user interface (UI) or user experience (UX), or there is a technical problem
preventing users from completing the process.
➔ SDKs differ by platform so a different SDK is required for each platform such as iOS,
Android, Windows Phone etc. On top of that, additional code is required for custom event
tracking.
➔ With the help of this code, analytics tools track and count each user, app launch, tap,
event, app crash or any additional information that the user has, such as device, operating
system, version, IP address (and probable location).
➔ Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique
users since mobile analytics SDKs can generate a persistent and unique identifier for each
device.
➔ The tracking technology varies between websites, which use either JavaScript or cookies,
and apps, which use a software development kit (SDK).
➔ Each time a website or app visitor takes an action, the application fires off data which is
recorded in the mobile analytics platform.
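A bare-bones stand-in for the SDK behavior described above might look as follows: one persistent, unique identifier per device or install (instead of a cookie), attached to every tracked event. All class, file and event names here are hypothetical; a real SDK would batch events and send them to an analytics backend.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.UUID;

    // Core of a mobile analytics client: a persistent per-device identifier
    // plus event records stamped with that identifier and a timestamp.
    public class AnalyticsClient {
        private final String deviceId;

        public AnalyticsClient(Path idFile) throws IOException {
            if (Files.exists(idFile)) {
                deviceId = Files.readString(idFile);       // reuse stored identifier
            } else {
                deviceId = UUID.randomUUID().toString();   // first launch: generate one
                Files.writeString(idFile, deviceId);
            }
        }

        public void track(String event) {
            // A real SDK would queue and upload these records.
            System.out.println(deviceId + " " + System.currentTimeMillis() + " " + event);
        }

        public static void main(String[] args) throws IOException {
            AnalyticsClient analytics = new AnalyticsClient(Path.of("device.id"));
            analytics.track("app_launch");
            analytics.track("checkout_shipment_screen_exit");
        }
    }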
★ Crowdsourcing is all about collecting data, services, ideas or content from users; the data
is then stored on a server so that it can be provided to users whenever necessary.
★ Most users nowadays use Truecaller to find unknown numbers and Google Maps to find
places and the traffic in a region. All these services are based on crowdsourcing.
★ Crowdsourced data is a form of secondary data. Secondary data refers to data that is
collected by any party other than the researcher. Secondary data provides important
context for any investigation into a policy intervention.
★ When crowdsourcing data, researchers collect plentiful, valuable and dispersed data at
a cost typically lower than that of traditional data collection methods.
★ Consider the trade-offs between sample size and sampling issues before deciding to
crowdsource data. Ensuring data quality means making sure the platform with which you
are collecting crowdsourced data is well-tested.
★ Crowdsourcing experiments are normally set up by asking a set of users to perform a task
for a very small remuneration on each unit of the task. Amazon Mechanical Turk
(AMT) is a popular platform that has a large set of registered remote workers who are
hired to perform tasks such as data labeling.
★ In data labeling tasks, the crowd workers are randomly assigned a single item in the
dataset. A data object may receive multiple labels from different workers and these have
to be aggregated to get the overall true label (a minimal majority-vote sketch appears at
the end of this section).
★ Crowdsourcing allows for many contributors to be recruited in a short period of time,
thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing
platforms usually employ their own tools to optimize the annotation process, making it
easier to conduct time-intensive labeling tasks.
★ Companies also use crowdsourcing to promote their product or service, and drive sales.
For instance, Lego conducted a campaign where customers had the chance to develop
their own toy designs and submit them.
★ To become the winner, the creator had to receive the biggest amount of people's votes.
The best design was moved to the production process. Moreover, the winner got a
privilege that amounted to a 1 % royalty on the net revenue.
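Returning to the data labeling tasks mentioned above, the simplest aggregation rule for multiple crowd labels is a majority vote per item, sketched below with made-up labels from three workers. Real platforms use more elaborate schemes that weight worker reliability.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Aggregating crowdsourced labels for one item by majority vote.
    public class MajorityVote {
        public static void main(String[] args) {
            List<String> workerLabels = List.of("cat", "cat", "dog");

            Map<String, Long> counts = workerLabels.stream()
                    .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
            String consensus = counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();

            System.out.println("Aggregated label: " + consensus); // prints "cat"
        }
    }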
Types of Crowdsourcing:
There are four main types of crowdsourcing.
1. Wisdom of the crowd: It is a collective opinion of different individuals gathered in a group.
This type is used for decision-making since it allows one to find the best solution for problems.
2. Crowd creation : This type involves a company asking its customers to help with new
products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting: It is a type of crowdsourcing where customers are allowed to choose a winner.
They can vote to decide which of the options is the best for them. This type can be applied to
different situations. Consumers can choose one of the options provided by experts or products
created by consumers.
4. Crowdfunding: It is when people collect money and ask for investments for charities,
projects and startups without planning to return the money to the owners. People do it
voluntarily. Often, companies gather money to help individuals and families suffering from
natural disasters, poverty, social problems, etc.
Firewalls:
● A firewall is a device designed to control the flow of traffic into and out-of a
network. In general, firewalls are installed to prevent attacks. Firewall can be a
software program or a hardware device.
● Firewalls are software programs or hardware devices that filter the traffic that
flows into a user PC or user network through an internet connection.
● They sift through the data flow and block that which they deem harmful to the
user network or computer system.
● Firewalls filter based on IP, UDP and TCP information. Firewall is placed on the
link between a network router and Internet or between a user and router.
● For large organizations with many small networks, the firewall is placed on every
connection attached to the Internet.
● Firewall based security depends on the firewall being the only connectivity to the
site from outside; there should be no way to bypass the firewall via other
gateways or wireless connections.
Functions of firewall:
1. Access control: Firewall filters incoming as well as outgoing packets.
2. Address/Port Translation: Using network address translation, internal machines,
though not visible on the Internet, can establish a connection with external machines on
the Internet. NATing is often done by firewall.
3. Logging: Security architecture ensures that each incoming or outgoing packet
encounters at least one firewall. The firewall can log all anomalous packets.
Firewalls can protect the computer and user personal information from :
1. Hackers who try to break through your system security.
2. Firewall prevents malware and other Internet hacker attacks from reaching your
computer in the first place.
Firewall Characteristics
1. All traffic from inside to outside and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.
● Policy is typically general and set at a high level within the organization. Policies
that contain details generally become too much of a "living document".
User can create or disable firewall filter rules based on following conditions :
1. IP addresses: System admin can block a certain range of IP addresses.
2. Domain names: Admin can allow only certain specific domain names to access the
systems, or allow access only to some specific types of domain names or domain name
extensions.
3. Protocol: A firewall can decide which systems are allowed to use or have access to
common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports: Blocking or disabling ports of servers that are connected to the internet will
help maintain the kind of data flow you want and also close down possible entry
points for hackers or malignant software.
5. Keywords: Firewalls also can sift through the data flow for a match of the keywords
or phrases to block out offensive or unwanted data from flowing in.
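A toy version of the IP- and port-based conditions above could look like this. The addresses, ports and Packet type are hypothetical, and a real firewall operates on raw packet headers rather than Java objects.

    import java.util.Set;

    // Toy packet filter: decisions come from header fields only (source IP,
    // destination port), never from the packet's payload.
    public class PacketFilter {
        record Packet(String srcIp, int dstPort) {}

        private static final Set<String> BLOCKED_IPS = Set.of("203.0.113.7");
        private static final Set<Integer> ALLOWED_PORTS = Set.of(22, 80, 443);

        static boolean accept(Packet p) {
            if (BLOCKED_IPS.contains(p.srcIp())) return false;  // rule 1: IP address
            return ALLOWED_PORTS.contains(p.dstPort());         // rule 4: ports
        }

        public static void main(String[] args) {
            System.out.println(accept(new Packet("198.51.100.2", 443))); // true
            System.out.println(accept(new Packet("203.0.113.7", 80)));   // false
        }
    }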
● When your computer makes a connection with another computer on the network,
several things are exchanged including the source and destination ports.
● In a standard firewall configuration, most inbound ports are blocked. This would
normally cause a problem with return traffic since the source port is randomly
assigned.
● A state is a dynamic rule created by the firewall containing the source-destination
port combination, allowing the desired return traffic to pass the firewall.
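The "state" described above can be pictured as a table of flows, sketched here with hypothetical types: an outbound connection creates a dynamic rule, and inbound traffic is allowed only if it matches a recorded flow with the roles reversed.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of a stateful firewall's dynamic rules (state table).
    public class StatefulFirewall {
        record Flow(String clientIp, int clientPort, String serverIp, int serverPort) {}

        private final Set<Flow> states = new HashSet<>();

        void onOutbound(Flow f) { states.add(f); }  // outbound traffic creates a state

        boolean allowInbound(String srcIp, int srcPort, String dstIp, int dstPort) {
            // Return traffic matches a recorded state with source and destination swapped.
            return states.contains(new Flow(dstIp, dstPort, srcIp, srcPort));
        }

        public static void main(String[] args) {
            StatefulFirewall fw = new StatefulFirewall();
            fw.onOutbound(new Flow("10.0.0.5", 52311, "198.51.100.9", 443));
            System.out.println(fw.allowInbound("198.51.100.9", 443, "10.0.0.5", 52311)); // true
            System.out.println(fw.allowInbound("203.0.113.7", 443, "10.0.0.5", 52311));  // false
        }
    }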
➔ Packet filter firewall controls access to packets on the basis of packet source and
destination address or specific transport protocol type.
➔ It is done at the OSI data link, network and transport layers. Packet filter firewall
works on the network layer of the OSI model.
➢ Packet filters do not see inside a packet; they block or accept packets solely on the
basis of the IP addresses and ports. All incoming SMTP and FTP packets are
parsed to check whether they should be dropped or forwarded.
➔ But outgoing SMTP and FTP packets have already been screened by the gateway
and do not have to be checked by the packet filtering router. Packet filter firewall
only checks the header information.
Application level gateway is also called a bastion host. It operates at the application
level. Multiple application gateways can run on the same host but each gateway is a
separate server with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: This firewall does not simply allow or disallow packets but also
determines whether the connection between both ends is valid according to configurable rules,
then opens a session and permits traffic only from the allowed source and possibly only for a
limited period of time.
It typically performs basic packet filter operations and then adds verification of proper
handshaking of TCP and the legitimacy of the session information used in establishing
the connection.
The decision to accept or reject a packet is based upon examining the packet's IP header
and TCP header.
Circuit level gateway cannot examine the data content of the packets it relays between a
trusted network and an untrusted network.