Big data analytics notes
1. Downloading and installing Hadoop; Understanding different Hadoop modes. Startup scripts,
Configuration files.
2. Hadoop Implementation of file management tasks, such as Adding files and directories,
retrieving files and Deleting files
3. Implementation of Matrix Multiplication with Hadoop MapReduce
4. Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases.
Software Requirements: Cassandra, Hadoop, Java, Pig, Hive and HBase.
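A minimal sketch of experiment 2 (HDFS file management) using Hadoop's Java FileSystem API is shown below. The NameNode URI (hdfs://localhost:9000) and all file paths are assumptions for a local single-node setup, not part of the syllabus above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Adding, retrieving and deleting files and directories in HDFS.
    public class HdfsFileOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode URI
            FileSystem fs = FileSystem.get(conf);

            fs.mkdirs(new Path("/user/demo"));                      // add a directory
            fs.copyFromLocalFile(new Path("input.txt"),
                                 new Path("/user/demo/input.txt")); // add a file
            fs.copyToLocalFile(new Path("/user/demo/input.txt"),
                               new Path("retrieved.txt"));          // retrieve a file
            fs.delete(new Path("/user/demo/input.txt"), false);     // delete (non-recursive)
            fs.close();
        }
    }

The same operations are available from the shell as hdfs dfs -mkdir, -put, -get and -rm.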
Big data is data that exceeds the processing capacity of conventional database systems. The data
is too big, moves too fast, or does not fit the structures of traditional database architectures. In
other words, Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using on-hand data management tools or traditional
data processing applications. To gain value from this data, you must choose an alternative way to
process it. Big Data is the next generation of data warehousing and business analytics and is
poised to deliver top line revenues cost efficiently for enterprises. Big data is a popular term used
to describe the exponential growth and availability of data, both structured and unstructured.
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world
today has been created in the last two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
Definition
❖ Big data can be defined as very large volumes of data available at various sources, in
varying degrees of complexity, generated at different speeds, which cannot be processed
using traditional technologies, processing methods and algorithms.
❖ Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process the data within a tolerable elapsed
time.
❖ Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision-making.
◻ Big data is often boiled down to a few varieties including social data, machine data,
and transactional data.
◻ Machine data consists of information generated from industrial equipment, real-time data
from sensors that track parts and monitor machinery (often also called the Internet of
Things), and even web logs that track user behavior online.
◻ Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants
like US pizza chain Domino's, which serves over 1 million customers per day, are
generating petabytes of transactional big data.
◻ The thing to note is that big data can resemble traditional structured data or unstructured,
high frequency information.
Big (and small) Data analytics is the process of examining data—typically of a variety of
sources, types, volumes and / or complexities—to uncover hidden patterns, unknown
correlations, and other useful information.
The intent is to find business insights that were not previously possible or were missed, so that
better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide
1. Deeper insights. Rather than looking at segments, classifications, regions, groups, or
other summary levels, you'll have insights into all the individuals, all the products, all the
parts, all the events, all the transactions, etc.
2. Frictionless actions. Increased reliability and accuracy that will allow the deeper and
broader insights to be automated into systematic actions.
Data science vs. big data:
1. Data science is a field of scientific analysis of data in order to solve analytically complex
problems, together with the significant and necessary activity of cleansing and preparing data.
Big data is storing and processing large volumes of structured and unstructured data that is not
possible with traditional applications.
2. Data science is used in biotech, energy, gaming and insurance. Big data is used in retail,
education, healthcare and social media.
4. Re-develop your products: Big data can also help you understand how others perceive
your products so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product/services, if any.
6. Better operational efficiency.
Big Data Challenges:
Collecting, storing and processing big data comes with its own set of challenges:
1. Big data is growing exponentially and existing data management solutions have to be
constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and
work with big data and big data tools.
1.2 Convergence of Key Trends:
◻ The essence of computer applications is to store things in the real world into computer
systems in the form of data, i.e., it is a process of producing data. Some data are the
records related to culture and society and others are the descriptions of phenomena of the
universe and life. The large scale of data is rapidly generated and stored in computer
systems, which is called data explosion.
◻ Data is generated automatically by mobile devices and computers; think Facebook posts,
search queries, directions, GPS locations and image capture.
◻ Sensors also generate volumes of data, including medical data and commerce location-
based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all
this data is expensive. Analysis gets more important and more expensive every year.
◻ The big data explosion driven by the current data boom shows how critical it is for us to
be able to extract meaning from all of this data.
1. Volume: Volumes of data are larger than conventional relational database infrastructure
can cope with. It consists of terabytes or petabytes of data.
2. Velocity:
➢ The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data. It is
being created in or near real-time.
➢ Data is increasingly accelerating the velocity at which it is created and at which it is
integrated. We have moved from batch to a real-time business.
➢ Initially, companies analyzed data using a batch process. One takes a chunk of data,
submits a job to the server and waits for delivery of the result.
➢ That scheme works when the incoming data rate is slower than the batch-processing rate
and when the result is useful despite the delay.
➢ With the new sources of data such as social and mobile applications, the batch process
breaks down. The data is now streaming into the server in real time, in a continuous
fashion and the result is only useful if the delay is very short.
➢ Data comes at you at a record or a byte level, not always in bulk. And the demands of the
business have increased as well – from an answer next week to an answer in a minute.
➢ In addition, the world is becoming more instrumented and interconnected. The volume
of data streaming off those instruments is exponentially larger than it was even 2 years
ago.
3. Variety:
➢ It refers to heterogeneous sources and the nature of data, both structured and
unstructured.
➢ Variety presents an equally difficult challenge. The growth in data sources has fuelled the
growth in data types. In fact, 80% of the world’s data is unstructured.
➢ Yet most traditional methods apply analytics only to structured information.
➢ Data has moved from Excel tables and databases to formats that have lost that structure,
with hundreds of new formats appearing.
➢ Pure text, photo, audio, video, web, GPS data, sensor data, relational databases,
documents, SMS, PDF, Flash, etc. One no longer has control over the input data format.
➢ Structure can no longer be imposed like in the past in order to keep control over the
analysis. As new applications are introduced new data formats come to life.
The variety of data sources continues to increase. It includes
● Internet data (e.g., click stream, social media, social networking links)
● Primary research (e.g., surveys, experiments, observations)
● Secondary research (e.g., competitive and marketplace data, industry reports, consumer
data, business data)
● Location data (e.g., mobile device data, geospatial data)
● Image data (e.g., video, satellite images, surveillance)
● Supply chain data (e.g., EDI, vendor catalogs and pricing, quality information)
● Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
4. Value
➢ It represents the business value to be derived from big data. The ultimate objective of
any big data project should be to generate some sort of value for the company doing all
the analysis. Otherwise, you're just performing some technological task for technology's
sake.
➢ For real-time spatial big data, decisions can be enhanced through visualization of
dynamic change in such spatial phenomena as climate, traffic, social-media-based
attitudes and massive inventory locations.
➢ Exploration of data trends can include spatial proximity relationships. Once spatial big
data is structured, formal spatial analytics can be applied, such as spatial autocorrelation,
overlays, buffering, spatial cluster techniques and location quotients.
5. Veracity
➢ Big data must be fed with relevant and true data. We will not be able to perform useful
analytics if much of the incoming data comes from false sources or has errors.
➢ Veracity refers to the trustworthiness or messiness of data: the higher the trustworthiness
of the data, the lower the messiness, and vice versa.
➢ It relates to the assurance of the data's quality, integrity, credibility and accuracy. We must
evaluate the data for accuracy, before using it for business insights because it is obtained
from multiple sources.
Structured data
★ Structured data is arranged in rows and columns format. It helps applications to retrieve
and process data easily. DBMS is used for storing structured data.
★ With a structured document, certain information always appears in the same location on
the page.
★ Structured data generally resides in a relational database, and as a result, it is sometimes
called "relational data." This type of data can be easily mapped into pre-designed fields.
★ For example, a database designer may set up fields for phone numbers, zip codes and
credit card numbers that accept a certain number of digits. Structured data has been or
can be placed in fields like these.
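As a toy illustration of such pre-designed fields, the sketch below creates a table whose columns accept a fixed number of characters. The in-memory H2 database and the table and column names are stand-ins chosen for the example, not anything prescribed by the text.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Structured data: every record fits pre-designed, fixed-format fields.
    public class StructuredDataDemo {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
            Statement st = conn.createStatement();
            st.execute("CREATE TABLE customer (" +
                       "phone_number CHAR(10), " +  // field sized for 10 digits
                       "zip_code CHAR(5), " +       // fixed-length postal code
                       "card_number CHAR(16))");    // fixed-length card field
            st.execute("INSERT INTO customer VALUES " +
                       "('9876543210', '50001', '4111111111111111')");
            conn.close();
        }
    }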
Unstructured data
★ Unstructured data, by contrast, has no pre-defined model or fixed fields, which makes it
much harder to collect, store and analyze.
★ Many organizations believe that their unstructured data stores include information that
could help them make better business decisions.
★ Unfortunately, it's often very difficult to analyze unstructured data. To help with the
problem, organizations have turned to a number of different software solutions designed
to search unstructured data and extract important information.
★ The primary benefit of these tools is the ability to glean actionable information that can
help a business succeed in a competitive environment.
★ Because the volume of unstructured data is growing so rapidly, many enterprises also turn
to technological solutions to help them better manage and store their unstructured data.
★ These can include hardware or software solutions that enable them to make the most
efficient use of their available storage space.
Organizations use a variety of different software tools to help them organize and manage
unstructured data. These can include the following:
● Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
● Business intelligence software: Also known as BI, this is a broad category of analytics,
data mining, dashboards and reporting tools that help companies make sense of their
structured and unstructured data for the purpose of making better business decisions.
● Data integration tools: These tools combine data from disparate sources so that they can
be viewed or analyzed from a single application. They sometimes include the capability
to unify structured and unstructured data.
● Document management systems: Also called "enterprise content management
systems," a DMS can track, store and share unstructured data that is saved in the form of
document files.
● Information management solutions: This type of software tracks structured and
unstructured enterprise data throughout its lifecycle.
● Search and indexing tools: These tools retrieve information from unstructured data files
such as documents, Web pages and photos.
Big data plays an important role in digital marketing. Each day the information shared
digitally increases significantly. With the help of big data, marketers can analyze every action of
the consumer. It provides better marketing insights and helps marketers make more accurate
and advanced marketing strategies. Its benefits include:
a) Consumer insights
b) Personalized targeting
c) Increasing sales
d) Campaign result analysis
e) Budget optimization
★ Data constantly informs marketing teams of customer behaviors and industry trends and
is used to optimize future efforts, create innovative campaigns and build lasting
relationships with customers.
★ Big data regarding customers provides marketers details about user demographics,
locations and interests, which can be used to personalize the product experience and
increase customer loyalty over time.
★ Big data solutions can help organize data and pinpoint which marketing campaigns,
strategies or social channels are getting the most traction. This lets marketers allocate
marketing resources and reduce costs for projects that are not yielding as much revenue
or meeting desired audience goals.
★ Personalized targeting : Nowadays, personalization is the key strategy for every
marketer. Engaging the customers at the right moment with the right message is the
biggest issue for marketers. Big data helps marketers to create targeted and personalized
campaigns.
★ Personalized marketing is creating and delivering messages to individuals or groups of
the audience through data analysis, with the help of consumer data such as geolocation,
browsing history, clickstream behavior and purchasing history. It is also known as
one-to-one marketing.
★ Consumer insights: In this day and age, marketing has become the ability of a company
to interpret the data and change its strategies accordingly. Big data allows for real-time
consumer insights which is crucial to understanding the habits of your customers. By
interacting with your consumers through social media you will know exactly what they
want and expect from your product or service, which will be key to distinguishing your
campaign from your competitors.
★ Help increase sales: Big data will help with demand predictions for a product or service.
Information gathered on user behavior will allow marketers to answer what types of
product their users are buying, how often they conduct purchases or search for a product
or service and lastly, what payment methods they prefer using.
★ Analyse campaign results: Big data allows marketers to measure their campaign
performance. This is the most important part of digital marketing. Marketers will use
reports to measure any negative changes to marketing KPIs. If they have not achieved the
desired results it will be a signal that the strategy would need to be changed in order to
maximize revenue and make your marketing efforts more scalable in future.
★ Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
★ Web analytics is not just a tool for measuring web traffic but can be used as a tool for
business and market research, and to assess and improve the effectiveness of a web site.
★ The following are some of the web analytics metrics: Hit, Page view, Visit / Session,
First Visit / First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page
Time Viewed / Page Visibility Time / Page View Duration, Session Duration / Visit
Duration, Average Page View Duration, Click path, etc. (Two of these are computed in
the sketch at the end of this section.)
★ Most people in the online publishing industry know how complex and onerous it could be
to build an infrastructure to access and manage all the Internet data within their own IT
department. Back in the day, IT departments would opt for a four-year project and
millions of dollars to go that route. However, today this sector has built up an ecosystem
of companies that spread the burden and allow others to benefit.
• It tells you how your customers actually behave (in lots of detail), and how that varies
• Between different customers
• For the same customers over time. (Seasonality, progress in customer journey)
• How behaviour drives value
• It tells you how customers engage with you via your website / webapp
• How that varies by different versions of your product
• How improvements to your product drive increased customer satisfaction and
lifetime value
• It tells you how customers and prospective customers engage with your different
marketing campaigns and how that drives subsequent behavior
Deriving value from web analytics data often involves very bespoke analytics. Web analytics
tools are good at delivering the standard reports that are common across different business
types.
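To make two of the metrics listed above concrete, here is a small sketch that computes bounce rate and average session duration from session records. The Session type and the sample numbers are made up; the formulas follow the usual definitions (a bounce is a single-page session).

    import java.util.List;

    // Computing bounce rate and average session duration from session records.
    public class WebMetrics {
        record Session(int pageViews, long durationSeconds) {}

        public static void main(String[] args) {
            List<Session> sessions = List.of(
                    new Session(1, 10),   // a bounce: only one page viewed
                    new Session(5, 320),
                    new Session(3, 95));

            long bounces = sessions.stream()
                    .filter(s -> s.pageViews() == 1).count();
            double bounceRate = 100.0 * bounces / sessions.size();
            double avgDuration = sessions.stream()
                    .mapToLong(Session::durationSeconds).average().orElse(0);

            System.out.printf("Bounce rate: %.1f%%, average session: %.0fs%n",
                              bounceRate, avgDuration);
        }
    }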
★ The healthcare industry is now awash in data: from biological data such as gene
expression, single-nucleotide polymorphisms (SNPs), proteomics and metabolomics to,
more recently, next-generation gene sequence data.
★ This exponential growth in data is further fueled by the digitization of patient-level data:
stored in Electronic Health Records (EHRs) and Health Information Exchanges (HIEs),
enhanced with data from imaging and test results, medical and prescription claims, and
personal health devices.
★ The U.S. healthcare system is increasingly challenged by issues of cost and access to
quality care. Payers, producers, and providers are each attempting to realize improved
treatment outcomes and effective benefits for patients within a disconnected health care
framework.
★ Historically, these healthcare ecosystem stakeholders tend to work at cross purposes with
other members of the health care value chain. High levels of variability and ambiguity
across these individual approaches increase costs, reduce overall effectiveness, and
impede the performance of the healthcare system as a whole.
★ Recent approaches to health care reform attempt to improve access to health care by
increasing government subsidies and reducing the ranks of the uninsured.
★ One outcome of the recently passed Affordable Care Act is a revitalized focus on cost
containment and the creation of quantitative proofs of economic benefit by payers,
producers, and providers.
★ This “the enemy of my enemy is my friend” mentality has created an urgent motivation
for payers, producers, and, to a lesser extent, providers, to create a new health care
information value chain derived from a common healthcare analytics approach.
★ The health care system is facing severe economic, effectiveness, and quality challenges.
These external factors are forcing a transformation of the pharmaceutical business model.
★ Health care challenges are forcing the pharmaceutical business model to undergo rapid
change. Our industry is moving from a traditional model built on regulatory approval and
settling of claims, to one of medical evidence and proving economic effectiveness
through improved analytics derived insights.
★ The success of this new business model will be dependent on having access to data
created across the entire healthcare ecosystem.
★ The Hadoop framework consists of a storage layer known as the Hadoop Distributed File
System (HDFS) and a processing framework called the MapReduce programming model.
★ Hadoop splits large amounts of data into chunks, distributes them within the network
cluster and processes them in its MapReduce Framework.
★ Hadoop can also be installed on cloud servers to better manage the compute and storage
resources required for big data. Leading cloud vendors such as Amazon Web Services
(AWS) and Microsoft Azure offer solutions.
★ Cloudera supports Hadoop workloads both on-premises and in the cloud, including
options for one or more public cloud environments from multiple vendors.
★ Hadoop provides a distributed file system and a framework for the analysis and
transformation of very large data sets using the MapReduce paradigm.
★ A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by
simply adding commodity servers.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware.
Processing is performed in parallel on multiple servers simultaneously. Software clients input
data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then
processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware failures of
individual machines or racks of machines are common and should be automatically handled in
software by the framework.
Challenges of Hadoop:
MapReduce complexity: As a file-intensive system, MapReduce can be a difficult tool to utilize
for complex jobs, such as interactive analytical tasks.
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the
data.
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is an improved
resource management layer, sometimes described as MapReduce 2.0, used for scheduling
processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various
machines or clusters. It also allows the data to be stored in an accessible format.
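Experiment 4's word count is the canonical illustration of the MapReduce model named above. The sketch below follows the standard Hadoop tutorial version: the mapper emits a (word, 1) pair per token and the reducer sums the counts per word; input and output HDFS paths are supplied as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);  // one (word, 1) pair per token
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                context.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }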
1.8.1 Hadoop Ecosystem
● The Hadoop ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.
● The Hadoop ecosystem refers to the various components of the Apache Hadoop software
library, as well as to the accessories and tools provided by the Apache Software
Foundation for these types of software projects and to the ways that they work together.
● Hadoop is a Java - based framework that is extremely popular for handling and analysing
large sets of data. The idea of a Hadoop ecosystem involves the use of different parts of
the core Hadoop set such as MapReduce, a framework for handling vast amounts of data
and the Hadoop Distributed File System (HDFS), a sophisticated file handling system.
There is also YARN, a Hadoop resource manager.
● In addition to these core elements of Hadoop, Apache has also delivered other kinds of
accessories or complementary tools for developers.
● Some of the most well known tools of the Hadoop ecosystem include HDFS, Hive, Pig,
YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, etc.
● Hadoop Distributed File System (HDFS), is one of the largest Apache projects and
primary storage system of Hadoop. It employs a NameNode and DataNode architecture.
● It is a distributed file system able to store large files running over the cluster of
commodity hardware.
● YARN stands for Yet Another Resource Negotiator. It is one of the core components in
open source Apache Hadoop suitable for resource management. It is responsible for
managing workloads, monitoring and security controls implementation.
● Hive is an ETL and data warehousing tool used to query or analyze large datasets stored
within the Hadoop ecosystem. Hive has three main functions: data summarization, query
and analysis of unstructured and semi-structured data in Hadoop. (A JDBC query sketch
follows this list.)
● Apache Pig is a high level scripting language used to execute queries for larger datasets
that are used within Hadoop.
● Apache Spark is a fast, in - memory data processing engine suitable for use in a wide
range of circumstances. Spark can be deployed in several ways, it features Java, Python,
Scala and R programming languages and supports SQL, streaming data, machine learning
and graph processing, which can be used together in an application.
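As referenced in the Hive entry above, Hive queries can also be issued from Java through HiveServer2's JDBC interface (relevant to experiment 5). The host and port, the empty credentials and the sales table below are assumptions for a local practice setup.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Querying Hive from Java over JDBC (HiveServer2 assumed on localhost:10000).
    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement st = conn.createStatement();
            // A typical summarization query of the kind Hive is used for:
            ResultSet rs = st.executeQuery(
                    "SELECT region, SUM(amount) FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
            conn.close();
        }
    }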
1. Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
2. Cost effective: Hadoop is open source and uses commodity hardware to store data so it is
really cost effective as compared to traditional relational database management systems.
3. Resilient to failure: HDFS has the property with which it can replicate data over the network.
4. The unique storage method of Hadoop is based on a distributed file system that effectively
maps data wherever the cluster is located.
★ Open source software is like any other software (closed/proprietary software). This
software is differentiated by its use and licenses.
★ Open source software guarantees the right to access and modify the source code and to
use, reuse and redistribute the software, all with no royalty or other costs.
★ Standard Software is sold and supported commercially. However, Open Source software
can be sold and/or supported commercially, too. Open source is a disruptive technology.
★ Open source licenses must permit non-exclusive commercial exploitation of the licensed
work, must make available the work's source code and must permit the creation of
derivative works from the work itself.
★ Netscape released its browser source code under the Netscape Public License and
subsequently under the Mozilla Public License.
★ Proprietary software is computer software which is the legal property of one party. The
terms of use for other parties are defined by contracts or licensing agreements. These
terms may include various privileges to share, alter, disassemble and use the software and
its code.
★ Closed source is a term for software whose license does not allow for the release or
distribution of the software's source code. Generally, it means only the binaries of a
computer program are distributed and the license provides no access to the program's
source code.
★ The source code of such programs is usually regarded as a trade secret of the company.
Access to source code by third parties commonly requires the party to sign a
non-disclosure agreement.
★ The demands of consumers as well as enterprises are ever increasing with the increase in
the information technology usage. Information technology solutions are required to
satisfy their different needs. It is a fact that a single solution provider cannot produce all
the needed solutions. Open source, freeware and free software are now available for
anyone and for any use.
★ In the 1970s and early 1980s, software organizations started using technical measures
to prevent computer users from being able to study and modify software. The copyright
law was extended to computer programs in 1980. The free software movement was
conceived in 1983 by Richard Stallman to satisfy the need for and to give the benefit of
"software freedom" to computer users.
★ Richard Stallman announced the idea of the GNU operating system in September 1983.
The GNU Manifesto, written by Richard Stallman, was published in March 1985. It
launched the free software movement, which aims to promote the universal freedom to
distribute and modify computer software without restriction. In February 1986, the first
formal definition of free software was published.
★ The term "free software" is associated with FSFs definition, and the term "open source
software" is associated with OSI's definition. FSFs and OSI's definitions are worded quite
differently but the set of software that they cover is almost identical.
★ One of the primary goals of this foundation was the development of a free and open
computer operating system and application software that can be used and shared among
different users with complete freedom.
★ Open source differs from the operation of traditional copyright licensing by permitting
both open distribution and open modification.
★ Before the term open source became widely adopted, developers and producers used a
variety of phrases to describe the concept. The term open source gained popularity with
the rise of the Internet, which provided access to diverse production models,
communication paths and last but not least, interactive communities.
★ The NIST defines cloud computing as: "Cloud computing is a model for enabling
ubiquitous, convenient, on-demand network access to a shared pool of configurable
computing resources that can be rapidly provisioned and released with minimal
management effort or service provider interaction.
★ This cloud model is composed of five essential characteristics, three service models and
four deployment models."
★ Cloud provider is responsible for the physical infrastructure and the cloud consumer is
responsible for application configuration, personalization and data.
★ Broad network access refers to resources hosted in a cloud network that are available for
access from a wide range of devices. Rapid elasticity is used to describe the capability to
provide scalable cloud computing services.
★ In measured services, NIST talks about measured service as a setup where cloud systems
may control a user or tenant's use of resources by a metering capability
somewhere in the system.
On-demand self-service refers to the service provided by cloud computing vendors that enables
the provision of cloud resources on demand whenever they are required.
The Cloud Cube Model has four dimensions to differentiate cloud formations :
a) External/Internal
b) Proprietary/Open
c) De-perimeterized / Perimeterized
d) Outsourced/Insourced.
External / Internal: The physical location of data is defined by the external/internal dimension.
It defines the organization's boundary.
Example: Information inside a datacenter using a private cloud deployment would be considered
internal and data that resided on Amazon EC2 would be considered external.
Proprietary / Open: This dimension measures not only ownership of the technology but also
its interoperability, use of data, ease of data transfer and the degree of vendor application
lock-in.
Proprietary means that the organization providing the service is keeping the means of provision
under their ownership. Clouds that are open are using technology that is not proprietary, meaning
that there are likely to be more suppliers.
De-perimeterized / Perimeterized: This security dimension measures whether the operations
are inside or outside the security boundary, firewall, etc.
Encryption and key management will be the technology means for providing data confidentiality
and integrity in a de-perimeterized model.
Outsourced / Insourced: This dimension defines whether the customer or the service provider
provides the service.
Outsourced means the service is provided by a third party. It refers to letting contractors or
service providers handle all requests; most cloud business models fall into this category.
Insourced means the services are provided by your own staff under organizational control,
i.e., in-house development of clouds.
★ Cloud computing is often described as a stack, as a response to the broad range of
services built on top of one another under the "cloud". A cloud computing stack is a
cloud architecture built in layers of one or more cloud-managed services (SaaS, PaaS,
IaaS, etc.).
★ Cloud computing stacks are used for all sorts of applications and systems. They are
especially good in microservices and scalable applications, as each tier is dynamically
scaling and replaceable.
★ The cloud computing stack makes up a threefold system comprising its lower-level
elements, which function as formalized cloud computing delivery models: IaaS, PaaS and SaaS.
★ With a cloud model, you pay on a subscription basis with no upfront capital expense. You
don’t incur the typical 30 percent maintenance fees—and all the updates on the platform
are automatically available.
★ The ability to build massively scalable platforms—platforms where you have the option
to keep adding new products and services for zero additional cost—is giving rise to
business models that weren’t possible before. Mehta calls it “the next industrial
revolution, where the raw material is data and data factories replace manufacturing
factories.” He pointed out a few guiding principles that his firm stands by:
1. Stop saying “cloud.” It’s not about the fact that it is virtual, but the true value lies in
delivering software, data, and/or analytics in an “as a service” model. Whether that is in a private
hosted model or a publicly shared one does not matter. The delivery, pricing, and consumption
model matters.
2. Acknowledge the business issues. There is no point in making light of matters around
information privacy, security, access, and delivery. These issues are real, more often than not
heavily regulated by multiple government agencies, and unless dealt with in a solution, will kill
any platform sell.
3. Fix some core technical gaps. Everything from the ability to run analytics at scale in a
virtual environment to ensuring information processing and analytics authenticity are issues that
need solutions and have to be fixed.
1.11 Mobile Business Intelligence
➔ Analytics on mobile devices is what some refer to as putting BI in your pocket. Mobile
drives straight to the heart of simplicity and ease of use that has been a major barrier to
BI adoption since day one.
➔ Mobile devices are a great leveling field where making complicated actions easy is the
name of the game. For example, a young child can use an iPad but not a laptop.
➔ As a result, this will drive broad-based adoption as much for the ease of use as for the
mobility these devices offer. This will have an immense impact on the business
intelligence sector.
➔ Mobile BI or mobile analytics is the rising software technology that allows the users to
access information and analytics on their phones and tablets instead of desktop-based BI
systems.
➔ Mobile analytics involves measuring and analyzing data generated by mobile platforms
and properties, such as mobile sites and mobile applications.
➔ Analytics is the practice of measuring and analyzing data of users in order to create an
understanding of user behavior as well as website or app performance. If this practice is
done on mobile apps and app users, it is called "mobile analytics".
➔ Mobile analytics is the practice of collecting user behavior data, determining intent from
those metrics and taking action to drive retention, engagement and conversion.
➔ Mobile analytics is similar to web analytics in that it identifies the unique customer and
records their usage.
➔ With mobile analytics data, you can improve your cross-channel marketing initiatives,
optimize the mobile experience for your customers and grow mobile user engagement
and retention.
➔ Analytics usually comes in the form of a software that integrates into a company's
existing websites and apps to capture, store and analyze the data.
➔ It is always very important for businesses to measure their critical KPIs (Key
Performance Indicators), as the old rule is always valid: "If you can't measure it, you
can't improve it".
➔ To be more specific, if a business finds out that 75% of its users exit at the shipment
screen of its sales funnel, there is probably something wrong with that screen in terms of
its design, user interface (UI) or user experience (UX), or there is a technical problem
preventing users from completing the process.
➔ SDKs differ by platform so a different SDK is required for each platform such as iOS,
Android, Windows Phone etc. On top of that, additional code is required for custom event
tracking.
➔ With the help of this code, analytics tools track and count each user, app launch, tap,
event, app crash or any additional information that the user has, such as device, operating
system, version, IP address (and probable location).
➔ Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique
users since mobile analytics SDKs can generate a persistent and unique identifier for each
device.
➔ The tracking technology varies between websites, which use either JavaScript or cookies,
and apps, which use a software development kit (SDK).
➔ Each time a website or app visitor takes an action, the application fires off data which is
recorded in the mobile analytics platform.
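A bare-bones stand-in for the SDK behavior described above might look as follows: one persistent, unique identifier per device or install (instead of a cookie), attached to every tracked event. All class, file and event names here are hypothetical; a real SDK would batch events and send them to an analytics backend.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.UUID;

    // Core of a mobile analytics client: a persistent per-device identifier
    // plus event records stamped with that identifier and a timestamp.
    public class AnalyticsClient {
        private final String deviceId;

        public AnalyticsClient(Path idFile) throws IOException {
            if (Files.exists(idFile)) {
                deviceId = Files.readString(idFile);       // reuse stored identifier
            } else {
                deviceId = UUID.randomUUID().toString();   // first launch: generate one
                Files.writeString(idFile, deviceId);
            }
        }

        public void track(String event) {
            // A real SDK would queue and upload these records.
            System.out.println(deviceId + " " + System.currentTimeMillis() + " " + event);
        }

        public static void main(String[] args) throws IOException {
            AnalyticsClient analytics = new AnalyticsClient(Path.of("device.id"));
            analytics.track("app_launch");
            analytics.track("checkout_shipment_screen_exit");
        }
    }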
★ Crowdsourcing is all about collecting data, services, ideas or content from users; the data
is then stored on a server so that it can be provided to users whenever necessary.
★ Most users nowadays use Truecaller to find unknown numbers and Google Maps to find
places and the traffic in a region. All these services are based on crowdsourcing.
★ Crowdsourced data is a form of secondary data. Secondary data refers to data that is
collected by any party other than the researcher. Secondary data provides important
context for any investigation into a policy intervention.
★ When crowdsourcing data, researchers collect plentiful, valuable and dispersed data at
a cost typically lower than that of traditional data collection methods.
★ Consider the trade-offs between sample size and sampling issues before deciding to
crowdsource data. Ensuring data quality means making sure the platform with which you
are collecting crowdsourced data is well-tested.
★ Crowdsourcing experiments are normally set up by asking a set of users to perform a task
for a very small remuneration on each unit of the task. Amazon Mechanical Turk
(AMT) is a popular platform that has a large set of registered remote workers who are
hired to perform tasks such as data labeling.
★ In data labeling tasks, the crowd workers are randomly assigned a single item in the
dataset. A data object may receive multiple labels from different workers and these have
to be aggregated to get the overall true label (a minimal majority-vote sketch appears at
the end of this section).
★ Crowdsourcing allows for many contributors to be recruited in a short period of time,
thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing
platforms usually employ their own tools to optimize the annotation process, making it
easier to conduct time-intensive labeling tasks.
★ Companies also use crowdsourcing to promote their product or service, and drive sales.
For instance, Lego conducted a campaign where customers had the chance to develop
their own toy designs and submit them.
★ To become the winner, the creator had to receive the biggest amount of people's votes.
The best design was moved to the production process. Moreover, the winner got a
privilege that amounted to a 1 % royalty on the net revenue.
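Returning to the data labeling tasks mentioned above, the simplest aggregation rule for multiple crowd labels is a majority vote per item, sketched below with made-up labels from three workers. Real platforms use more elaborate schemes that weight worker reliability.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Aggregating crowdsourced labels for one item by majority vote.
    public class MajorityVote {
        public static void main(String[] args) {
            List<String> workerLabels = List.of("cat", "cat", "dog");

            Map<String, Long> counts = workerLabels.stream()
                    .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
            String consensus = counts.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .get().getKey();

            System.out.println("Aggregated label: " + consensus); // prints "cat"
        }
    }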
Types of Crowdsourcing:
There are four main types of crowdsourcing.
1. Wisdom of the crowd: It is a collective opinion of different individuals gathered in a group.
This type is used for decision-making since it allows one to find the best solution for problems.
2. Crowd creation : This type involves a company asking its customers to help with new
products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting: It is a type of crowdsourcing where customers are allowed to choose a winner.
They can vote to decide which of the options is the best for them. This type can be applied to
different situations. Consumers can choose one of the options provided by experts or products
created by consumers.
4. Crowdfunding: It is when people collect money and ask for investments for charities,
projects and startups without planning to return the money to the owners. People do it
voluntarily. Often, companies gather money to help individuals and families suffering from
natural disasters, poverty, social problems, etc.
Firewalls:
● A firewall is a device designed to control the flow of traffic into and out-of a
network. In general, firewalls are installed to prevent attacks. Firewall can be a
software program or a hardware device.
● Firewalls are software programs or hardware devices that filter the traffic that
flows into a user PC or user network through an internet connection.
● They sift through the data flow and block that which they deem harmful to the
user network or computer system.
● Firewalls filter based on IP, UDP and TCP information. Firewall is placed on the
link between a network router and Internet or between a user and router.
● For large organizations with many small networks, the firewall is placed on every
connection attached to the Internet.
● Firewall based security depends on the firewall being the only connectivity to the
site from outside; there should be no way to bypass the firewall via other
gateways or wireless connections.
Functions of firewall:
1. Access control: Firewall filters incoming as well as outgoing packets.
2. Address/Port Translation: Using network address translation, internal machines,
though not visible on the Internet, can establish a connection with external machines on
the Internet. NATing is often done by firewall.
3. Logging: Security architecture ensures that each incoming or outgoing packet
encounters at least one firewall. The firewall can log all anomalous packets.
Firewalls can protect the computer and user personal information from :
1. Hackers who try to break through your system security.
2. Firewall prevents malware and other Internet hacker attacks from reaching your
computer in the first place.
Firewall Characteristics
1. All traffic from inside to outside and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.
● Policy is typically general and set at a high level within the organization. Policies
that contain details generally become too much of a "living document".
User can create or disable firewall filter rules based on following conditions :
1. IP addresses: System admin can block a certain range of IP addresses.
2. Domain names: Admin can allow only certain specific domain names to access the
systems, or allow access only to some specific types of domain names or domain name
extensions.
3. Protocol: A firewall can decide which systems are allowed to use or have access to
common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports: Blocking or disabling ports of servers that are connected to the internet will
help maintain the kind of data flow you want and also close down possible entry
points for hackers or malignant software.
5. Keywords: Firewalls also can sift through the data flow for a match of the keywords
or phrases to block out offensive or unwanted data from flowing in.
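A toy version of the IP- and port-based conditions above could look like this. The addresses, ports and Packet type are hypothetical, and a real firewall operates on raw packet headers rather than Java objects.

    import java.util.Set;

    // Toy packet filter: decisions come from header fields only (source IP,
    // destination port), never from the packet's payload.
    public class PacketFilter {
        record Packet(String srcIp, int dstPort) {}

        private static final Set<String> BLOCKED_IPS = Set.of("203.0.113.7");
        private static final Set<Integer> ALLOWED_PORTS = Set.of(22, 80, 443);

        static boolean accept(Packet p) {
            if (BLOCKED_IPS.contains(p.srcIp())) return false;  // rule 1: IP address
            return ALLOWED_PORTS.contains(p.dstPort());         // rule 4: ports
        }

        public static void main(String[] args) {
            System.out.println(accept(new Packet("198.51.100.2", 443))); // true
            System.out.println(accept(new Packet("203.0.113.7", 80)));   // false
        }
    }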
● When your computer makes a connection with another computer on the network,
several things are exchanged including the source and destination ports.
● In a standard firewall configuration, most inbound ports are blocked. This would
normally cause a problem with return traffic since the source port is randomly
assigned.
● A state is a dynamic rule created by the firewall containing the source-destination
port combination, allowing the desired return traffic to pass the firewall.
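The "state" described above can be pictured as a table of flows, sketched here with hypothetical types: an outbound connection creates a dynamic rule, and inbound traffic is allowed only if it matches a recorded flow with the roles reversed.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of a stateful firewall's dynamic rules (state table).
    public class StatefulFirewall {
        record Flow(String clientIp, int clientPort, String serverIp, int serverPort) {}

        private final Set<Flow> states = new HashSet<>();

        void onOutbound(Flow f) { states.add(f); }  // outbound traffic creates a state

        boolean allowInbound(String srcIp, int srcPort, String dstIp, int dstPort) {
            // Return traffic matches a recorded state with source and destination swapped.
            return states.contains(new Flow(dstIp, dstPort, srcIp, srcPort));
        }

        public static void main(String[] args) {
            StatefulFirewall fw = new StatefulFirewall();
            fw.onOutbound(new Flow("10.0.0.5", 52311, "198.51.100.9", 443));
            System.out.println(fw.allowInbound("198.51.100.9", 443, "10.0.0.5", 52311)); // true
            System.out.println(fw.allowInbound("203.0.113.7", 443, "10.0.0.5", 52311));  // false
        }
    }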
➔ Packet filter firewall controls access to packets on the basis of packet source and
destination address or specific transport protocol type.
➔ It is done at the OSI data link, network and transport layers. Packet filter firewall
works on the network layer of the OSI model.
➢ Packet filters do not see inside a packet; they block or accept packets solely on the
basis of the IP addresses and ports. All incoming SMTP and FTP packets are
parsed to check whether they should be dropped or forwarded.
➔ But outgoing SMTP and FTP packets have already been screened by the gateway
and do not have to be checked by the packet filtering router. Packet filter firewall
only checks the header information.
Application level gateway is also called a bastion host. It operates at the application
level. Multiple application gateways can run on the same host but each gateway is a
separate server with its own processes.
These firewalls, also known as application proxies, provide the most secure type of data
connection because they can examine every layer of the communication, including the
application data.
Circuit level gateway: This firewall does not simply allow or disallow packets but also
determines whether the connection between both ends is valid according to configurable rules,
then opens a session and permits traffic only from the allowed source and possibly only for a
limited period of time.
It typically performs basic packet filter operations and then adds verification of proper
handshaking of TCP and the legitimacy of the session information used in establishing
the connection.
The decision to accept or reject a packet is based upon examining the packet's IP header
and TCP header.
Circuit level gateway cannot examine the data content of the packets it relays between a
trusted network and an untrusted network.