MODULE-2
Data Analytics
Introduction to Analytics
• Analytics is a journey that involves a combination of potential skills, advanced
technologies, applications, and processes used by firms to gain business insights from data
and statistics.
• Data Analytics refers to the techniques used to analyze data to enhance productivity and
business gain.
• Data is extracted from various sources and is cleaned and categorized to analyze various
behavioral patterns.
• The techniques and the tools used vary according to the organization or individual.
• Data Analytics has a key role in improving your business as it is used to gather hidden
insights, generate reports, perform market analysis, and improve business requirements.
• Data analytics is the process of inspecting and transforming data to extract meaningful
insights for decision making.
• Data analytics is a scientific process of converting data into useful information for
decision makers.
• Generate Reports – Reports are generated from the data and passed on to the
respective teams and individuals to take further action for business growth.
• Perform Market Analysis – Market Analysis can be performed to understand the
strengths and weaknesses of competitors.
Applications of Analytics
There are several applications of data analytics, and businesses actively use them to stay
competitive.
• Banking
• Policing/Security
• Healthcare
Banking
• This is known as one of the earliest applications of data science, which emerged from
the discipline of finance.
• Many organizations had bad experiences with debt and wanted a way out.
• Since they already had data collected when their customers applied for loans, they
applied data science, which eventually rescued them from the losses they had incurred.
• This led banks to divide and conquer the data in their customers’ profiles, recent
expenditure, and other significant information available to them.
• This made it easy for them to analyze and infer whether a customer was likely to
default.
Policing/Security
• Several cities all over the world have employed predictive analysis in predicting areas
that would likely witness a surge in crime with the use of geographical data and historical
data.
• This seems to have worked in major cities such as Chicago, London, and Los Angeles.
• Although it is not possible to make an arrest for every crime committed, the availability
of data has made it possible to station police officers in such areas at certain times of
the day, which has led to a drop in the crime rate.
• This shows that this kind of data analytics application can give us safer cities
without police putting their lives at risk.
Healthcare
• Healthcare analytics is the process of analyzing current and historical industry data to
predict trends, improve outreach, and even better manage the spread of diseases.
• The field covers a broad range of businesses and offers insights on both the macro and
micro level.
• It can reveal paths to improvement in patient care quality, clinical data, diagnosis, and
business management.
• When combined with business intelligence suites and data visualization tools, healthcare
analytics help managers operate better by providing real-time information that can
support decisions and deliver actionable insights.
Descriptive Analytics
• Data aggregation and data mining are two techniques used in descriptive analytics to
discover historical data.
• Data is first gathered and sorted through data aggregation to make the datasets more
manageable for analysts.
• Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning.
• Identified patterns are analyzed to discover the specific ways that learners interacted with
the learning content and within the learning environment.
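The two descriptive steps above can be sketched in a few lines of Python. The learner-activity records below are purely illustrative (not from the text): first the raw records are aggregated by module, then the aggregated data is summarized to expose a pattern.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical learner-activity records (illustrative data only)
records = [
    {"learner": "A", "module": "intro", "minutes": 30},
    {"learner": "B", "module": "intro", "minutes": 45},
    {"learner": "A", "module": "quiz",  "minutes": 10},
    {"learner": "B", "module": "quiz",  "minutes": 20},
]

# Step 1: data aggregation -- group the raw records by module
by_module = defaultdict(list)
for r in records:
    by_module[r["module"]].append(r["minutes"])

# Step 2: mine the aggregated data -- summarize to expose a pattern
summary = {m: mean(v) for m, v in by_module.items()}
print(summary)  # {'intro': 37.5, 'quiz': 15}
```

A real descriptive-analytics pipeline would compute many such summaries (counts, averages, distributions) over far larger datasets, but the aggregate-then-summarize shape is the same.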
Predictive Analytics
• It tells us what will probably happen in the future as a result of something that has
already happened.
• It takes historical data and feeds it into a machine learning model that considers key
trends and patterns.
• The model is then applied to current data to predict what will happen next.
• Predictive analytics uses statistical models and forecasting techniques to understand the
future and answer: “What could happen?”
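A minimal sketch of the idea: fit a model to historical data, then apply it to predict the next value. Here the "model" is an ordinary least-squares line fitted to four hypothetical monthly sales figures; real predictive analytics would use richer machine learning models and far more data.

```python
# Hypothetical historical observations (e.g. monthly sales)
history = [100, 110, 120, 130]
n = len(history)
xs = range(n)

# Ordinary least squares for the trend line y = slope * x + intercept
x_mean = sum(xs) / n
y_mean = sum(history) / n
slope_num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
slope_den = sum((x - x_mean) ** 2 for x in xs)
slope = slope_num / slope_den
intercept = y_mean - slope * x_mean

# Apply the fitted model to predict the next period: "what could happen?"
forecast = slope * n + intercept
print(forecast)  # 140.0
```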
Prescriptive Analytics
• Prescriptive analytics uses optimization and simulation algorithms to advise on possible
outcomes and answer: “What should we do?”
• A self-driving car is a good example: the vehicle makes millions of calculations on every
trip that help it decide when and where to turn, whether to slow down or speed up, and
when to change lanes, the same decisions a human driver makes behind the wheel.
• Data science and analytics are used by manufacturing companies as well as real estate
firms to develop their business and solve various issues with the help of historical
databases.
• Tools are the software applications that can be used for analytics, such as SAS or R,
while techniques are the procedures to be followed to reach a solution.
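Prescriptive analytics recommends an action rather than just a forecast. The toy example below, with made-up numbers, searches over possible production decisions at a plant with limited machine hours and recommends the product mix that maximizes profit; industrial versions use dedicated optimization solvers instead of brute force.

```python
# Hypothetical manufacturing scenario: two products, limited machine hours
HOURS_AVAILABLE = 100
HOURS_PER_A, HOURS_PER_B = 2, 3   # hours to make one unit of each product
PROFIT_A, PROFIT_B = 30, 50       # profit per unit of each product

best = (0, 0, 0)                  # (profit, units_of_A, units_of_B)
for a in range(HOURS_AVAILABLE // HOURS_PER_A + 1):
    remaining = HOURS_AVAILABLE - a * HOURS_PER_A
    b = remaining // HOURS_PER_B  # fill leftover hours with product B
    profit = a * PROFIT_A + b * PROFIT_B
    best = max(best, (profit, a, b))

# The recommendation -- "what should we do?" -- is the best mix found
print(best)  # (1660, 2, 32)
```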
– 1. Data Preparation
– 3. Segmentation
– 4. Forecasting
– 5. Descriptive Modeling
– 6. Predictive Modeling
– 7. Optimization
R Programming
• R is a statistical language created by statisticians.
• R compiles and runs on various platforms such as UNIX, Windows and Mac OS.
• R is helpful at every step of the data analysis process from gathering and cleaning data to
analyzing it and reporting the conclusions.
• R also integrates very well with many Big Data platforms which have contributed to its
success.
• R has a vast number of packages and built-in functions.
• Popular libraries like ggplot2, which produces visually appealing graphs, set R apart
from other programming languages.
Python
• Python is a high-level, interpreted, general-purpose dynamic programming language
that focuses on code readability.
• You need fewer lines of code to perform the same task compared to other major
languages like C/C++ and Java.
• Due to the simplicity of Python, developers can focus on solving the problem.
• Python provides a large standard library which includes areas like internet protocols,
string operations, web services tools and operating system interfaces.
• You can find almost all the functions needed for your task.
• Python has built-in list and dictionary data structures which can be used to construct fast
runtime data structures.
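A quick sketch of those built-in structures in a data-preparation context (the records are illustrative, not from the text): a list holds an ordered, growable sequence, while a dict maps keys to values with fast average-case lookup.

```python
# list: ordered, indexable, grows at runtime
prices = [19.99, 5.49, 3.25]
prices.append(7.00)

# dict: key -> value mapping with fast average-case insert and lookup
employee = {"Name": "Asha", "Job_title": "Analyst"}
employee["Department_id"] = 42

print(prices[0], employee["Job_title"])  # 19.99 Analyst
```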
Tableau Public
• Tableau Public is free software that connects to any data source, be it a corporate data
warehouse, Microsoft Excel, or web-based data, and creates data visualizations, maps,
dashboards, etc., with real-time updates presented on the web.
• It is most suitable for quick and easy representation of big data, which helps in
resolving big data issues.
• Tableau is a powerful and fastest growing data visualization tool used in the Business
Intelligence Industry.
• It helps in simplifying raw data into an easily understandable format.
• Data analysis is very fast with Tableau and the visualizations created are in the form of
dashboards and worksheets.
• The visualizations created using Tableau can be understood by professionals at any level
in an organization.
• The great thing about Tableau software is that it doesn't require any technical or any kind
of programming skills to operate.
The University of California, Berkeley’s AMP Lab developed Apache Spark in 2009. Apache
Spark is a fast, large-scale data processing engine that executes applications in Hadoop clusters
up to 100 times faster in memory and 10 times faster on disk.
Types of Data Base Models
1. Hierarchical Model
2. Relational Model
3. Network Model
4. Object-Oriented Model
5. Entity-Relationship Model
1. Hierarchical Model
• As the name indicates, this model makes use of hierarchy to structure the data in a tree-
like format.
• The hierarchy starts from the root, which holds the root data, and then expands in the
form of a tree by adding child nodes to parent nodes.
• This model easily represents some of the real-world relationships like food recipes,
sitemap of a website etc.
Disadvantages:
• Since a child node cannot have more than one parent, complex relationships in which a
child node needs two parent nodes cannot be represented using this model.
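The hierarchical model can be sketched as a nested structure in Python. The sitemap below (an example the text itself mentions) is hypothetical; note that each node hangs under exactly one parent, which is precisely the limitation discussed above.

```python
# A website sitemap as a tree: every node has exactly one parent
sitemap = {
    "home": {
        "about": {},
        "products": {
            "laptops": {},
            "phones": {},
        },
    }
}

def walk(node, depth=0):
    """Depth-first traversal from the root, printing the hierarchy."""
    for name, children in node.items():
        print("  " * depth + name)
        walk(children, depth + 1)

walk(sitemap)
```

Retrieving a record means walking down from the root; a child that needed two parents (say, a page linked from both "about" and "products") could not be stored this way without duplication.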
2. Relational Model
• Relational Model is the most widely used model.
Advantages:
• Simple: This model is more simple as compared to the network and hierarchical model.
• Scalable: This model can be easily scaled, as we can add as many rows and columns as
we want.
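The relational model can be sketched with Python's built-in sqlite3 module: data lives in rows and columns, and tables are linked through key columns. The tables and values below are hypothetical.

```python
import sqlite3

# An in-memory relational database with two linked tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)")
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO emp VALUES (10, 'Asha', 1)")

# A join follows the key relationship between the two tables
row = conn.execute(
    "SELECT emp.name, dept.name FROM emp JOIN dept ON emp.dept_id = dept.id"
).fetchone()
print(row)  # ('Asha', 'Sales')
```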
3. Network Model
• Unlike the hierarchical model, this model makes it easier to convey complex
relationships, as each record can be linked with multiple parent records.
• Its distinguishing feature is that the schema, viewed as a graph in which object types are
nodes and relationship types are arcs, is not restricted to being a hierarchy or lattice.
Advantages:
• The data can be accessed faster as compared to the hierarchical model. This is because
the data is more related in the network model and there can be more than one path to
reach a particular node. So the data can be accessed in many ways.
Disadvantages:
• As more and more relationships need to be handled, the system might become complex,
so a user must have detailed knowledge of the model to work with it.
4. Object-Oriented Model
• This type of model is also called the post-relational database model.
• The real-world problems are more closely represented through the object-oriented data
model.
• In this model, two or more objects are connected through links.
• All the data and relationships of each object are contained as a single unit.
• The attributes, like the Name and Job_title of an employee, and the methods to be
performed by that object are stored together as a single object.
• Two objects are connected through a common attribute, i.e., the Department_id, and the
communication between them is done with the help of this common id.
5. Entity-relationship Model
• Entity-Relationship Model or simply ER Model is a high-level data model diagram.
• In this model, we represent the real-world problem in the pictorial form to make it easy
for the stakeholders to understand.
• Effective Communication Tool: This model is used widely by the database designers
for communicating their ideas.
• Easy Conversion to any Model: This model maps well to the relational model and can
be easily converted to a relational model by turning the ER diagram into tables. It can
also be converted to any other model, like the network model, hierarchical model, etc.
Types of Data
• Numerical (Quantitative)
• Categorical (Qualitative)
1. Quantitative data (Numerical data)
• It deals with numbers and things you can measure objectively: dimensions such as height,
width, and length; temperature and humidity; prices; area and volume. Numerical data
is information that is measurable, and it is, of course, data represented as numbers and not
words or text.
Continuous Data:
• Continuous Data represents measurements and therefore their values can’t be counted but
they can be measured.
• Continuous numbers are numbers that don’t have a logical end to them.
• An example would be the height of a person, which you can describe by using intervals
on the real number line.
Discrete Data:
• We speak of discrete data if its values are distinct and separate. In other words: We speak
of discrete data if the data can only take on certain values.
• Discrete numbers are the opposite; they have a logical end to them.
• Some examples include variables for days in the month, or number of bugs logged.
2. Categorical Data(Qualitative)
• Categorical data represents characteristics.
• This is any data that isn’t a number, which can mean a string of text or date.
• These variables can be broken down into nominal and ordinal values.
Nominal Data:
• Nominal values represent discrete units and are used to label variables that have no
quantitative value.
Ordinal Data:
• Ordinal data is nearly the same as nominal data, except that its ordering matters.
Binary Data:
• In addition to ordinal and nominal values, there is a special type of categorical data called
binary. Binary data types have only two values: yes or no.
• This can be represented in different ways such as “True” and “False” or 1 and 0.
• Examples of binary variables can include whether a person has stopped their subscription
service or not, or if a person bought a car or not.
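The taxonomy above can be made concrete by tagging each variable in a dataset with its type. The variables and tags below are illustrative examples of each category, not from the text.

```python
# Each hypothetical variable tagged as (kind, subtype) per the taxonomy above
variable_types = {
    "height_cm":   ("numerical", "continuous"),   # measured, not counted
    "bugs_logged": ("numerical", "discrete"),     # counted, logical end
    "blood_group": ("categorical", "nominal"),    # labels with no order
    "shirt_size":  ("categorical", "ordinal"),    # ordering matters: S < M < L
    "subscribed":  ("categorical", "binary"),     # only two values: yes/no
}

for name, (kind, subtype) in variable_types.items():
    print(f"{name}: {kind}/{subtype}")
```

Knowing the type of each variable matters in practice because it determines which summaries and models apply, e.g. a mean makes sense for height_cm but not for blood_group.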
Missing Imputations
• Imputation is the process of replacing missing data with substituted values.
1. MCAR
• Data which is Missing Completely At Random has nothing systematic about which
observations are missing values. There is no relationship between missingness and either
observed or unobserved covariates.
2. MAR
• Missing At Random is weaker than MCAR. The missingness is still random, but due
entirely to observed variables. For example, those from a lower socioeconomic status
may be less willing to provide salary information (but we know their SES status). The
key is that the missingness is not due to the values which are not observed. MCAR
implies MAR but not vice-versa.
3. MNAR
• If the data are Missing Not At Random, then the missingness depends on the values of
the missing data. Censored data falls into this category. For example, individuals who are
heavier are less likely to report their weight. Another example, the device measuring
some response can only measure values above .5. Anything below that is missing.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values with
the same constant, such as a label like “Unknown” or -∞. If missing values are replaced by, say,
“Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common, that of “Unknown.” Hence, although this method
is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Consider the average value of that
particular attribute and use it to replace the missing values in that attribute column.
5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
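Methods 4 and 5 from the list above can be sketched in plain Python. The (credit_risk, income) records below are hypothetical, with None marking a missing value; the sketch computes both the global attribute mean and the class-wise mean, then fills each gap with the mean of its own credit-risk category.

```python
from statistics import mean

# Hypothetical (credit_risk, income) records; None marks a missing value
rows = [
    ("low",  40000), ("low",  60000), ("low",  None),
    ("high", 20000), ("high", None),
]

# Method 4: global attribute mean over all observed incomes
observed = [inc for _, inc in rows if inc is not None]
global_mean = mean(observed)                       # 40000

# Method 5: mean per class (credit-risk category)
by_class = {}
for risk, inc in rows:
    if inc is not None:
        by_class.setdefault(risk, []).append(inc)
class_means = {risk: mean(v) for risk, v in by_class.items()}

# Fill each missing income with the mean of its own class
filled = [(r, inc if inc is not None else class_means[r]) for r, inc in rows]
print(class_means)  # {'low': 50000, 'high': 20000}
```

Note how the class-wise fill gives the missing "low" customer 50000 rather than the global 40000, which is exactly why method 5 is usually preferred when a class label is available.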
• Data Modeling is the process of analyzing data objects and their relationships to other
objects.
• It is used to analyze the data requirements of the business processes.
• The data models are created for the data to be stored in a database.
• The Data Model's main focus is on what data is needed and how we have to organize data
rather than what operations we have to perform.
• Data Modeling helps create a robust design with a data model that can show an
organization's entire data on the same platform.
• The database can be designed at the logical, physical, and conceptual levels with the help
of the data model.
• Redundant data and missing data can be identified with the help of data models.
• Building the data model is quite time-consuming, but it makes maintenance cheaper and
faster.
Need for Business Modeling
Companies that embrace big data analytics and transform their business models in parallel
will create new opportunities for revenue streams, customers, products, and services. This
requires a big data strategy and vision that identifies and capitalizes on new opportunities.
• Business analytics involves the collating, sorting, processing, and studying of business-
related data using statistical models and iterative methodologies.
• The goal of BA is to narrow down which datasets are useful and which can increase
revenue, productivity, and efficiency.
• Business analytics (BA) is the combination of skills, technologies, and practices used to
examine an organization's data and performance as a way to gain insights and make data-
driven decisions in the future using statistical analysis.
1. Credit Cards
• Credit and debit cards are an everyday part of consumer spending, and they are an ideal
way of gathering information about a purchaser’s spending habits, financial situation,
behavior trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
• Excellent customer relations are critical for any company that wants to retain customer
loyalty and stay in business for the long haul. CRM systems analyze important
performance indicators such as demographics, buying patterns, socio-economic
information, and lifestyle.
3. Finance
• The financial world is a volatile place, and business analytics helps to extract insights that
help organizations maneuver their way through tricky terrain. Corporations turn to
business analysts to optimize budgeting, banking, financial planning, forecasting, and
portfolio management.
4. Human Resources
• Business analysts help the process by poring through data that characterizes high-
performing candidates, such as educational background, attrition rate, the average length
of employment, etc. By working with this information, business analysts help HR by
forecasting the best fits between the company and candidates.
5. Manufacturing
• Business analysts work with data to help stakeholders understand the things that affect
operations and the bottom line. Identifying things like equipment downtime, inventory
levels, and maintenance costs helps companies streamline inventory management, risks,
and supply-chain management to create maximum efficiency.
6. Marketing
• Business analysts help answer key marketing questions by measuring marketing and
advertising metrics, identifying consumer behavior and the target audience, and
analyzing market trends.