Handbook Introduction of Data Science AY 23-24
● Data Science is nothing short of magic, and a data scientist is a magician who performs tricks with the data in their hat. Just as magic is composed of different elements, data science is an interdisciplinary field. You can consider data science to be an amalgamation of different fields such as Data Manipulation, Data Visualization, Statistical Analysis, and Machine Learning. Each of these sub-domains is equally important when it comes to data science.
● Data science is a deep study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data that is processed using the scientific method, different technologies, and algorithms.
● It is a multidisciplinary field that uses tools and techniques to manipulate the data so
that you can find something new and meaningful.
Introduction of Data Science
● Data science uses the most powerful hardware, programming systems, and the most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
● In short, we can say that data science is all about:
● Asking the correct questions and analyzing the raw data.
● Modeling the data using various complex and efficient algorithms.
● Visualizing the data to get a better perspective.
● Understanding the data to make better decisions and find the final result.
Introduction of Data Science
● Let us suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will be the best route to reach the location faster, on which route there will be no traffic jam, and which will be cost-effective. All these decision factors act as input data, and we will get an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.
Need for Data Science:
Introduction of Data Science
● Now, handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require some complex, powerful, and efficient algorithms and technology, and that technology is Data Science. Following are some of the main reasons for using data science technology:
● With the help of data science technology, we can convert the massive amount of raw
and unstructured data into meaningful insights.
● Data science technology is adopted by various companies, whether they are big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, are using data science algorithms to provide a better customer experience.
Data Science Components:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: In data science, domain expertise is what binds data science together. Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science which involves acquiring, storing, retrieving, and transforming the data. Data engineering also includes adding metadata (data about data) to the data.
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at the first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we
need to perform the following tasks:
● Data cleaning
● Data reduction
● Data integration
● Data transformation
Data Science Lifecycle
● After performing all the above tasks, we can easily use this data for our further
processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relations between input variables. We will apply exploratory data analysis (EDA) using various statistical formulas and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
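The specific tools were presented on the original slides; as a minimal, hedged illustration of what EDA during model planning can look like, here is a short Python sketch (pandas and seaborn are assumed choices, and "sales.csv" is a hypothetical file, not one named in the handbook):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset: any CSV with a few numeric columns would do.
df = pd.read_csv("sales.csv")

numeric = df.select_dtypes("number")   # keep only numeric columns
print(numeric.describe())              # summary statistics per column
print(numeric.corr())                  # pairwise correlations between input variables

sns.pairplot(numeric)                  # visualize pairwise relationships
plt.show()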
4. Model-building: In this phase, the process of model building starts. We will create datasets for training and testing purposes. We will apply different techniques, such as association, classification, and clustering, to build the model.
Following are some common Model building tools:
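The tools themselves were listed on the original slide; independent of any particular tool, the core of the model-building step can be sketched in a few lines of Python. scikit-learn, the iris sample dataset, and a decision tree are assumed example choices, not the handbook's prescribed ones:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # small sample dataset bundled with scikit-learn

# Create datasets for training and testing purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()           # one possible classification technique
model.fit(X_train, y_train)                # build the model on the training data

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))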
5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides you with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal that we set in the initial phase. We will communicate the findings and the final result to the business team.
Data Science has a lot of real-world applications. Let’s have a look at some of those:
Applications of Data Science
Comparison of Data Science with Data Analytics
● Predictive analytics is an area of statistics that deals with extracting information from
data and using it to predict trends and behavior patterns.
● The enhancement of predictive web analytics calculates statistical probabilities of
future events online. Predictive analytics statistical techniques include data modeling,
machine learning, AI, deep learning algorithms and data mining.
● Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether it be in the past, present, or future. For example, identifying suspects after a crime has been committed, or detecting credit card fraud as it occurs.
Comparison of Data Science with Data Analytics
● Descriptive analytics or statistics do exactly what the name implies: they “describe”, or summarize, raw data and make it something that is interpretable by humans. They are analytics that describe the past. The past refers to any point in time at which an event has occurred, whether it is one minute ago or one year ago. Descriptive analytics are useful because they allow us to learn from past behaviors and understand how they might influence future outcomes.
Descriptive Analytics: Insight into the past
● The vast majority of the statistics we use fall into this category. (Think basic
arithmetic like sums, averages, percent changes.) Usually, the underlying data is a
count, or aggregate of a filtered column of data to which basic math is applied. For all
practical purposes, there are an infinite number of these statistics.
● Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer, and year-over-year change in sales. Common examples of descriptive analytics are reports that provide historical insights regarding the company’s production, financials, operations, sales, inventory, and customers.
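A small, assumed example of these descriptive measures in Python (pandas; the sales figures are made up purely for illustration):

import pandas as pd

# Toy data, purely illustrative
sales = pd.DataFrame({
    "year":     [2022, 2022, 2023, 2023],
    "customer": ["A", "B", "A", "B"],
    "amount":   [120.0, 80.0, 150.0, 110.0],
})

total_by_year = sales.groupby("year")["amount"].sum()   # aggregate (sum) per year
avg_per_purchase = sales["amount"].mean()               # average dollars spent
yoy_change = total_by_year.pct_change() * 100           # year-over-year % change (first year is NaN)

print(total_by_year)
print("Average per purchase:", avg_per_purchase)
print(yoy_change)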
Descriptive Analytics: Insight into the past
● Use Descriptive Analytics when you need to understand at an aggregate level what is
going on in your company, and when you want to summarize and describe different
aspects of your business.
Predictive Analytics: Understanding the future
● Predictive analytics has its roots in the ability to “predict” what might happen. These
analytics are about understanding the future. Predictive analytics provides companies
with actionable insights based on data.
● Predictive analytics provides estimates about the likelihood of a future outcome. It is
important to remember that no statistical algorithm can “predict” the future with
100% certainty. Companies use these statistics to forecast what might happen in the
future. This is because the foundation of predictive analytics is based on probabilities.
Predictive Analytics: Understanding the future
● These statistics try to take the data that you have, and fill in the missing data with best
guesses. They combine historical data found in ERP, CRM, HR and POS systems to
identify patterns in the data and apply statistical models and algorithms to capture
relationships between various data sets. Companies use predictive statistics and
analytics any time they want to look into the future. Predictive analytics can be used
throughout the organization, from forecasting customer behavior and purchasing
patterns to identifying trends in sales activities. They also help forecast demand for
inputs from the supply chain, operations and inventory.
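As a hedged sketch of the idea of "filling in missing data with best guesses", a simple linear-regression forecast in Python (scikit-learn is an assumed library choice, and the monthly demand numbers are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Historical monthly demand (made-up numbers)
months = np.arange(1, 13).reshape(-1, 1)
demand = np.array([100, 104, 110, 113, 120, 124, 130, 133, 140, 143, 150, 155])

model = LinearRegression().fit(months, demand)   # capture the trend in past data

# Forecast the next three months; real forecasts express likelihoods, not certainty
future = np.arange(13, 16).reshape(-1, 1)
print(model.predict(future))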
Predictive Analytics: Understanding the future
● One common application most people are familiar with is the use of predictive
analytics to produce a credit score. These scores are used by financial services to
determine the probability of customers making future credit payments on time.
Typical business uses include understanding how sales might close at the end of the
year, predicting what items customers will purchase together, or forecasting inventory
levels based upon a myriad of variables.
● Use Predictive Analytics any time you need to know something about the future, or
fill in the information that you do not have
Prescriptive Analytics: Advise on possible outcomes
● Prescriptive analytics are relatively complex to administer, and most companies are
not yet using them in their daily course of business. When implemented correctly,
they can have a large impact on how businesses make decisions, and on the
company’s bottom line. Larger companies are successfully using prescriptive
analytics to optimize production, scheduling and inventory in the supply chain to
make sure they are delivering the right products at the right time and optimizing the
customer experience.
● Use Prescriptive Analytics any time you need to provide users with advice on what
action to take
Big data analytics
● In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally we model the data in a way that explains a response. The objective of this approach is to predict the response behavior or understand how the input variables relate to a response. In statistical experimental designs, by contrast, an experiment is developed and data is retrieved as a result. This allows us to generate data in a way that can be used by a statistical model, where certain assumptions hold, such as independence, normality, and randomization.
Prescriptive Analytics: Advise on possible outcomes
● The relatively new field of prescriptive analytics allows users to “prescribe” a number
of different possible actions and guide them towards a solution. In a nutshell, these
analytics are all about providing advice.
● Prescriptive analytics attempts to quantify the effect of future decisions in order to
advise on possible outcomes before the decisions are actually made.
● At their best, prescriptive analytics predicts not only what will happen, but also why it
will happen, providing recommendations regarding actions that will take advantage of
the predictions.
Big data analytics
● In big data analytics, we are presented with the data. We cannot design an experiment
that fulfills our favorite statistical model. In large-scale applications of analytics, a
large amount of work (normally 80% of the effort) is needed just for cleaning the
data, so it can be used by a machine learning model.
Big data analytics
● One of the most important tasks in big data analytics is statistical modeling, meaning
supervised and unsupervised classification or regression problems.
● Once the data is cleaned, preprocessed, and available for modeling, care should be taken to evaluate different models with reasonable loss metrics; then, once the model is implemented, further evaluation and results should be reported.
● A common pitfall in predictive modeling is to just implement the model and never
measure its performance.
Big data analytics
● As mentioned in the big data life cycle, the data products that result from developing a big data product are, in most cases, some of the following −
● Machine learning implementation − This could be a classification algorithm, a
regression model or a segmentation model.
● Recommender system − The objective is to develop a system that recommends
choices based on user behavior. Netflix is the characteristic example of this data
product, where based on the ratings of users, other movies are recommended.
● Dashboard − Businesses normally need tools to visualize aggregated data. A dashboard is a graphical mechanism to make this data accessible.
● Ad-Hoc analysis − Normally business areas have questions, hypotheses or myths that
can be answered doing ad-hoc analysis with data.
Introduction to Web and Social Media Analytics
● Web analytics is the measurement of data, the collection of information, analysis, and
reporting of Internet data for the purposes of optimizing and understanding Web
usage.
● Here, web usage refers to the respective business website’s usage data.
Social Media Analytics
● Now let us understand social media analytics. Social media analytics is the practice of gathering data from social media websites or networks such as Facebook, Twitter, Google Plus, etc., and analyzing those metrics to gain insights for making business decisions.
● Most often, people confuse web analytics and social media analytics, so let us draw a clear demarcation between the two. Web analytics uses the data collected directly from a particular business website, while social media analytics uses the data collected from social media networks.
Web Analytics
The four important key metrics that can be analyzed through web analytics are:
1. Total Traffic
2. Traffic Source
3. Bounce Rate
4. Conversion rate
Web Analytics
● Total Traffic:- Total traffic to your website gives insights about where you are getting more traffic from, which helps you to understand your target market. In addition, you can analyze during which hours of the day and days of the week you are getting more visits to your website. Based on this information you can run a campaign to optimize for more conversions.
● Traffic Source:- Traffic source is about how you are getting most visitors to your website. Is it through social media, search engines, or referral sites? Based on that information you can strategize your marketing campaign, write a blog, or focus on a particular social media network. For example, if most of your visitors are coming from social media networks, use that information to brand your business more on Facebook, Twitter, or other social media platforms to boost your website traffic.
Web Analytics
● Bounce Rate:- Bounce rate is the percentage of visitors to a particular website who leave the site after viewing only one page, without navigating to other pages on the website.
● This could be high for any number of reasons, such as:
● Irrelevant content
● Inappropriate designs
● Confusing navigation
● Frequent pop-ups
● Unnecessary ads
● Or, Annoying sounds
● However, this metric helps you to improve your web design overall.
Web Analytics
● Conversion Rate:- A conversion rate is the percentage of visitors who have taken some action on your website or completed the desired goal; it could be purchasing a product, signing up for newsletters, etc.
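Both bounce rate and conversion rate are simple percentages; a tiny Python sketch with made-up traffic numbers shows the arithmetic:

# Made-up monthly figures for illustration
total_visitors = 12_500
single_page_visits = 4_875      # visitors who left after viewing one page
goal_completions = 375          # purchases, newsletter sign-ups, etc.

bounce_rate = single_page_visits / total_visitors * 100
conversion_rate = goal_completions / total_visitors * 100

print(f"Bounce rate: {bounce_rate:.1f}%")          # 39.0%
print(f"Conversion rate: {conversion_rate:.1f}%")  # 3.0%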
● There are many tools available in the market to track and analyze web data; however, Google Analytics is one of the vital tools for any website, as it offers the chance to obtain and analyze in-depth statistics about who your customers are, what they are interested in, and how they interact with your website or online store.
● When you take advantage of the information which Google Analytics offers, you can make data-informed decisions about the best ways to improve your business.
Machine Learning Algorithms
● Machine learning is used to predict, categorize, classify, find polarity, etc., from the given datasets and is concerned with minimizing the error.
● It uses training data for artificial intelligence.
● There are many algorithms, such as the SVM Algorithm in Python, the Bayes algorithm, logistic regression, etc., which use training data to match with input data and then provide a conclusion with maximum accuracy.
Machine Learning Algorithms
● Machine learning is broadly categorized into supervised, unsupervised, and reinforcement learning.
● The critical element of data science is machine learning algorithms, which are sets of rules or processes for solving a certain problem.
Machine Learning Algorithms
● Some of the important data science algorithms include regression, classification and
clustering techniques, decision trees and random forests, machine learning techniques
like supervised, unsupervised and reinforcement learning. In addition to these, there
are many algorithms that organizations develop to serve their unique needs.
Machine Learning Algorithms
● Supervised learning
● It is used for structured datasets. It analyzes the training data and generates a function that will be used for other datasets.
● Unsupervised learning
● It is used for raw datasets. Its main task is to convert raw data into structured data. In today’s world, there is a huge amount of raw data in every field. Even computers generate log files, which are in the form of raw data. Therefore, it is a very important part of machine learning.
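For the unsupervised case described above, a minimal clustering sketch in Python (scikit-learn's KMeans is an assumed example; the points are made up) shows how structure is found in unlabeled data:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled (raw) data: two loose groups of 2-D points, made up for illustration
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignments discovered without any labels
print(kmeans.cluster_centers_)  # structure extracted from the raw data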
Challenges in Data-Driven Decision Making and Future
● Discrimination
● Algorithmic discrimination can come from various sources. First, the data used to
train algorithms may have biases that lead to discriminatory decisions. Second,
discrimination may arise from the use of a particular algorithm.
● Categorization, for example, can be considered a form of direct discrimination
because it uses algorithms to determine different treatments of various classes. Third,
algorithms can result in discrimination as a result of misuse of certain models in
different contexts. Fourth, biased data can be used both as evidence for the training of
algorithms and as evidence of their effectiveness.
Challenges in Data-Driven Decision Making and Future
● Lack of transparency
● Transparency refers to the capacity to understand a computational model and
therefore contribute to the attribution of responsibility for consequences derived
from its use.
● A model is transparent if a person can easily observe it and understand it. It would
therefore be desirable for models to have low computational complexity.
● Violation of privacy
● Reports and studies have focused on the misuse of users' personal data and on data
aggregation by entities such as data brokers, which have direct implications for
people's privacy
Challenges in Data-Driven Decision Making and Future
● Digital literacy
● It is extremely important that we devote resources to digital and computer literacy
programs for all citizens, from children to the elderly. If we do not, it will be very
difficult (if not impossible) as a society to make decisions about technologies that
we do not understand. The book Los nativos digitales no existen ("Digital natives do
not exist") emphasizes the need to teach children and adolescents about computer
thinking and proper use of technology.
Challenges in Data-Driven Decision Making and Future
● Fuzzy responsibility
● As more and more decisions that affect millions of people are made automatically by
algorithms, we must be clear about who is responsible for the consequences of these
decisions. Transparency is often considered a fundamental factor in the clarity of
attribution of responsibility. However, transparency and audits are not enough to
guarantee clear responsibility.
Challenges in Data-Driven Decision Making and Future
● Lack of diversity
● Given the variety of cases in which algorithms can be applied for decision-making, it
is important to reflect on the frequent lack of diversity in the teams that generate such
algorithms. So far, data-based algorithms and artificial intelligence techniques for
decision-making have been developed by homogeneous groups of IT professionals. In
the future, we should make sure that teams are diverse in terms of areas of knowledge
as well as demographic factors (particularly gender, given that women account for
less than 20% of IT professionals at many technology companies).
Types & Scales of Data in Descriptive Statistics
● Descriptive statistics help you to understand the data, but before we understand what data is, we should know the different data types used in descriptive statistical analysis. The following sections give an overview of them.
Types of data
● A data set is a grouping of information that is related to each other. A data set can be either qualitative or quantitative. A qualitative data set consists of words that can be observed, not measured. A quantitative data set consists of numbers that can be directly measured. Months in a year would be an example of qualitative data, while the weight of persons would be an example of quantitative data.
Types of data
● Now, let’s suppose you go to KFC to eat some burgers with your friends. You place the order at the coupon counter and, after receiving the food from the food counter, everyone eats what you ordered on their behalf. If someone asks the others about the taste, the ratings of the taste will vary from one person to another, but if asked how many burgers were ordered, everyone will come to a definite count, and it will be the same for all. Here, the taste ratings represent Categorical Data and the number of burgers is Numerical Data.
Types of Categorical Data:
● Nominal Data: When there is no natural order between categories, the data is of nominal type.
● Example: Color of an eye, Gender (Male & Female), Blood type, Political party, Zip code, Type of living accommodation (House, Apartment, Trailer, Other), Religious preference (Hindu, Buddhist, Muslim, Jewish, Christian, Other), etc.
Types of Categorical Data:
● Ordinal Data: When there is a natural order between categories, the data is of ordinal type. However, the difference between the values in the order does not matter.
Example: Exam grades, Socio-economic status (poor, middle class, rich), Education level (kindergarten, primary, secondary, higher secondary, graduation), Satisfaction rating (extremely dislike, dislike, neutral, like, extremely like), Time of day (dawn, morning, noon, afternoon, evening, night), Level of agreement (yes, maybe, no), the Likert scale (strongly disagree, disagree, neutral, agree, strongly agree), etc.
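A short, assumed pandas example of how nominal and ordinal categories differ in practice: the ordinal column carries an order, the nominal one does not.

import pandas as pd

blood_type = pd.Categorical(["A", "O", "B", "AB"])           # nominal: no natural order
satisfaction = pd.Categorical(
    ["like", "neutral", "extremely like", "dislike"],
    categories=["extremely dislike", "dislike", "neutral", "like", "extremely like"],
    ordered=True,                                            # ordinal: order matters
)

print(blood_type.ordered)       # False
print(satisfaction.ordered)     # True
print(satisfaction.min())       # "dislike" -- comparisons only make sense for ordinal data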
Types of Numerical Data
● Discrete Data: The data is said to be discrete if the measurements are integers. It represents a count or an item that can be counted.
● Example: Number of people in a family, the number of kids in class, the number of
cricket players in a team, the number of cricket playing nations in the world.
● Discrete data is a special kind of data because each value is separate and different.
Types of Numerical Data
● Continuous Data: The data is said to be continuous if the measurements can take any value, usually within some range. It is a scale of measurement that can consist of numbers other than whole numbers, like decimals and fractions.
● Example: height, weight, length, temperature
● Continuous data usually require a tool, like a ruler, measuring tape, scale, or
thermometer, to produce the values in a continuous data set.
Scales of Measurement
● Scales of Measurement:
● Data can be classified as being on one of four scales: nominal, ordinal, interval or
ratio. Each level of measurement has some important properties that are useful to
know.
Scales of Measurement
● Nominal Scale: The nominal data type defined above can be placed into this category. Nominal values don’t have a numeric value, and so they can neither be added, subtracted, divided, nor multiplied. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.
Scales of Measurement
● Ordinal Scale: Ordinal datatype defined above can be placed into this category. The
ordinal scale contains things that you can place in order. For example, hottest to
coldest, lightest to heaviest, richest to poorest. So, if you can rank data by 1st, 2nd,
3rd place (and so on), then you have data that is on an ordinal scale.
Scales of Measurement
● Interval Scale: An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as a difference of 10 degrees between 150 and 160. Compare that to an Olympic running race (which is ordinal), where the time difference between the winner and the runner-up might be 0.01 seconds and between the second-last and last 0.5 seconds. If you have meaningful divisions, you have something on the interval scale.
Scales of Measurement
● Ratio Scale: The ratio scale has all the properties of the interval scale, with one major difference: zero is meaningful. When the scale is equal to 0.0, there is none of that quantity. For example, a height of zero is meaningful (it means you don’t exist). Temperature in Kelvin (0.0 K) really does mean “no heat”. Compare that to a temperature of zero degrees Celsius, which, while it exists, doesn’t mean anything in particular (although admittedly, on the Celsius scale it is the freezing point of water).
Populations and Samples
Population: it is the set of things we are observing (humans, events, animals, etc.). It has some parameters such as the mean, median, mode, and standard deviation, among others.
Sample: it is a random subset of the population. Usually you use samples when the population is big enough to make analysis of the whole set difficult. In a sample you don’t have parameters, you have statistics.
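A quick numerical illustration (NumPy assumed, numbers simulated): the population mean is a parameter, while the mean of a random sample is a statistic that only estimates it.

import numpy as np

rng = np.random.default_rng(seed=42)

population = rng.normal(loc=170, scale=10, size=100_000)   # e.g. simulated heights
sample = rng.choice(population, size=200, replace=False)   # a random subset

print("Population mean (parameter):", population.mean())
print("Sample mean (statistic):    ", sample.mean())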
Percentile, Decile and Quartile
● Quartiles, Deciles, and Percentiles: From the definition of the median, it is the middle point of the frequency distribution curve; it divides the area under the curve into two parts of equal area, one to the left and one to the right. In the same way, the area under the curve may be divided into four equal parts, and the dividing points are called quartiles; dividing the area into ten equal parts gives the deciles; and dividing it into one hundred equal parts gives the percentiles.
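In code these are all just percentiles at different cut points; a brief NumPy sketch with simulated data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=12, size=1_000)        # simulated scores

quartiles = np.percentile(data, [25, 50, 75])           # divide the distribution into 4 equal parts
deciles = np.percentile(data, np.arange(10, 100, 10))   # 10 equal parts
p95 = np.percentile(data, 95)                           # a single percentile

print("Quartiles:", quartiles)
print("Deciles:  ", deciles)
print("95th percentile:", p95)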
Skewness
● Skewness
● It is the degree of distortion from the symmetrical bell curve or the normal
distribution. It measures the lack of symmetry in data distribution.
● It differentiates extreme values in one versus the other tail. A symmetrical distribution
will have a skewness of 0.
● There are two types of Skewness: Positive and Negative
Skewness
● Positive Skewness means that the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.
● Negative Skewness means that the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.
● If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
● If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and
1(positively skewed), the data are moderately skewed.
● If the skewness is less than -1(negatively skewed) or greater than 1(positively
skewed), the data are highly skewed.
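A minimal check of these thresholds in Python (SciPy is an assumed tool; the data is simulated so that one sample is right-skewed and the other symmetric):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=2.0, size=5_000)    # long right tail
symmetric = rng.normal(size=5_000)

print("Skewness (right-skewed):", skew(right_skewed))    # clearly > 1: highly skewed
print("Skewness (symmetric):   ", skew(symmetric))       # close to 0: fairly symmetrical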
Skewness
● Example
● Let us take a very common example: house prices. Suppose we have house values ranging from $100k to $1,000,000, with the average being $500,000.
● If the peak of the distribution lies to the left of the average value, the distribution has a positive skewness. It would mean that many houses were being sold for less than the average value, i.e. $500k. This could be for many reasons, but we are not going to interpret those reasons here.
● If the peak of the distributed data lies to the right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.
Kurtosis
● Kurtosis
● Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is
used to describe the extreme values in one versus the other tail. It is actually the
measure of outliers present in the distribution.
● High kurtosis in a data set is an indicator that the data has heavy tails or outliers. If there is high kurtosis, then we need to investigate why we have so many outliers. It could indicate many things, maybe wrong data entry or something else. Investigate!
● Low kurtosis in a data set is an indicator that data has light tails or lack of outliers. If
we get low kurtosis(too good to be true), then also we need to investigate and trim the
dataset of unwanted results.
Kurtosis
● Mesokurtic: This distribution has kurtosis statistic similar to that of the normal
distribution. It means that the extreme values of the distribution are similar to that of a
normal distribution characteristic. This definition is used so that the standard normal
distribution has a kurtosis of three.
● Leptokurtic (Kurtosis > 3): The distribution is longer and the tails are fatter. The peak is higher and sharper than in the Mesokurtic case, which means that the data are heavy-tailed or have a profusion of outliers.
● Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of
the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness”
of a leptokurtic distribution.
Kurtosis
● Platykurtic (Kurtosis < 3): The distribution is shorter and the tails are thinner than those of the normal distribution. The peak is lower and broader than in the Mesokurtic case, which means that the data are light-tailed or lack outliers.
● The reason for this is that extreme values are less frequent than in the normal distribution.
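The same comparison in code (SciPy assumed). Note that scipy.stats.kurtosis reports excess kurtosis by default (normal = 0); pass fisher=False to compare against the value 3 used above.

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
normal_data = rng.normal(size=10_000)             # mesokurtic
heavy_tails = rng.standard_t(df=5, size=10_000)   # leptokurtic: more outliers
light_tails = rng.uniform(size=10_000)            # platykurtic: almost no outliers

for name, data in [("normal", normal_data), ("t(5)", heavy_tails), ("uniform", light_tails)]:
    print(name, kurtosis(data, fisher=False))     # roughly 3, > 3, < 3 respectively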
Data Science
Module-2
Normal probability plots are used to see how closely the elements of a dataset
follow the normal distribution. The assumption of normality is common in many
disciplines. For example, it's often assumed in finance and economics that the
returns to stocks are normally distributed. The assumption of normality is very
convenient, and many statistical tests are based on this assumption.
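A minimal normal probability (Q-Q) plot in Python, using scipy.stats.probplot as an assumed tool; points lying close to the straight reference line indicate that the data follow the normal distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
returns = rng.normal(loc=0.0, scale=0.02, size=500)   # simulated daily stock returns

stats.probplot(returns, dist="norm", plot=plt)        # normal probability plot
plt.title("Normal probability plot of simulated returns")
plt.show()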
Exploratory Data Analysis (EDA)
Predictive analytics is a category of data analytics aimed at making predictions
about future outcomes based on historical data and analytics techniques such as
statistical modeling and machine learning. The science of predictive analytics can
generate future insights with a significant degree of precision. With the help of
sophisticated predictive analytics tools and models, any organization can now use
past and current data to reliably forecast trends and behaviors milliseconds, days,
or years into the future.
Exploratory Data Analysis (EDA)
With predictive analytics, organizations can find and exploit patterns contained
within data in order to detect risks and opportunities. Models can be designed, for
instance, to discover relationships between various behavior factors. Such models
enable the assessment of either the promise or risk presented by a particular set
of conditions, guiding informed decision-making across various categories of
supply chain and procurement events.
Exploratory Data Analysis (EDA)
Benefits of predictive analytics
Predictive analytics makes looking into the future more accurate and reliable than
previous tools. As such it can help adopters find ways to save and earn money.
Retailers often use predictive models to forecast inventory requirements, manage
shipping schedules, and configure store layouts to maximize sales. Airlines
frequently use predictive analytics to set ticket prices reflecting past travel trends.
Hotels, restaurants, and other hospitality industry players can use the technology
to forecast the number of guests on any given night in order to maximize
occupancy and revenue
Exploratory Data Analysis (EDA)
Predictive analytics is used in a wide variety of ways by companies worldwide.
Adopters from diverse industries such as banking, healthcare, commerce, hospitality, pharmaceuticals, automotive, aerospace, and manufacturing benefit from the technology.
Here are a few examples of how businesses are using predictive analytics:
Customer Service
Insurance firms evaluate policy applicants to assess the chance of having to pay
out for a future claim based on the existing risk pool of comparable policyholders,
as well as previous occurrences that resulted in payments. Actuaries frequently
utilize models that compare attributes to data about previous policyholders and
claims.
Higher Education
Predictive analytics applications in higher education include enrollment
management, fundraising, recruiting, and retention. Predictive analytics offers a
significant advantage in each of these areas by offering intelligent insights that
would otherwise be neglected.
Exploratory Data Analysis (EDA)
A prediction algorithm can rate each student and tell administrators ways to serve
students during the duration of their enrollment using data from a student's high
school years. Models can give crucial information to fundraisers regarding the
optimal times and strategies for reaching out to prospective and current donors.
Supply Chain
Exploratory Data Analysis (EDA)
Forecasting is an important concern in manufacturing because it guarantees that
resources in a supply chain are used optimally. Inventory management and the
shop floor, for example, are critical spokes of the supply chain wheel that require
accurate forecasts to function.
Predictive modeling is frequently used to clean and improve the data utilized for
such estimates. Modeling guarantees that additional data, including data from
customer-facing activities, may be consumed by the system, resulting in a more
accurate prediction.
Data Science
Module-3
● Feature generation is the process of creating new features from one or multiple
existing features, potentially for use in statistical analysis.
● This process adds new information that becomes accessible during model construction and therefore, hopefully, results in a more accurate model.
● In machine learning and pattern recognition, a feature is an individual measurable
property or characteristic of a phenomenon being observed. Collecting and processing
data can be an expensive and time-consuming process
● Therefore, choosing informative, discriminating and independent features is a crucial
step for effective algorithms in pattern recognition, classification, and regression
Feature generation
● In the age of information in which we live, dataset sizes are becoming increasingly larger in both the number of instances and the number of features. It has become quite common to see datasets with tens of thousands of features or more. Furthermore, algorithm developers often use a process called feature generation.
● Feature generation is the process of creating new features from one or multiple existing features, for potential use in statistical analysis. Usually, this process adds new information to the model and makes it more accurate. Feature generation can improve model accuracy when there is a feature interaction.
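A short, assumed example of generating a new feature that encodes an interaction between two existing features (pandas; the housing-style numbers are made up):

import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({
    "length_m": [10.0, 12.5, 8.0],
    "width_m":  [6.0, 7.0, 5.5],
    "price":    [180_000, 260_000, 120_000],
})

# New feature created from two existing ones: the interaction (area)
df["area_m2"] = df["length_m"] * df["width_m"]

# A prediction model can now use the interaction directly
print(df[["length_m", "width_m", "area_m2", "price"]])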
Feature generation
● By adding new features that encapsulate the feature interaction, new information becomes more accessible to the prediction model (PM).
● Feature extraction is a quite complex concept concerning the translation of raw data
into the inputs that a particular Machine Learning algorithm requires.
● Features must represent the information of the data in a format that will best fit the
needs of the algorithm that is going to be used to solve the problem.
● Feature extraction fills this requirement: it builds valuable information from raw data – the features – by reformatting, combining, and transforming primary features into new ones, until it yields a new set of data that can be consumed by the Machine Learning models to achieve their goals.
Feature Selection
● Feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input variables for the model by using only relevant data, in order to reduce overfitting in the model.
Feature Selection
Filter Methods
These methods are generally used during the pre-processing step. They select features from the dataset irrespective of any machine learning algorithm. In terms of computation, they are very fast and inexpensive, and they are very good at removing duplicated, correlated, and redundant features, but these methods do not remove multicollinearity.
Feature Selection
Each feature is evaluated individually, which can sometimes help when features are in isolation (do not have a dependency on other features), but it will lag when a combination of features can lead to an increase in the overall performance of the model.
Feature Selection
1. Filter Method: In this method, features are dropped based on their relation to the output, or how they are correlated with the output. We use correlation to check whether the features are positively or negatively correlated with the output labels and drop features accordingly. Eg: Information Gain, Chi-Square Test, Fisher’s Score, etc.
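A hedged sketch of a filter method with scikit-learn (an assumed library choice): SelectKBest scores each feature independently, here with the chi-square test, before any model is trained. The dataset and the value of k are illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # non-negative features, suitable for chi2

selector = SelectKBest(score_func=chi2, k=2) # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_reduced.shape)     # (150, 2)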
Feature Selection
2. Wrapper Method: We split our data into subsets and train a model using them. Based on the output of the model, we add and subtract features and train the model again. It forms the subsets using a greedy approach and evaluates the accuracy of all the possible combinations of features. Eg: Forward Selection, Backward Elimination, etc.
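A wrapper-style sketch using scikit-learn's recursive feature elimination (RFE), which repeatedly trains a model and drops the weakest feature; the estimator and the number of features to keep are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

estimator = LogisticRegression(max_iter=5000)   # model used to judge feature subsets
rfe = RFE(estimator, n_features_to_select=5)    # train, drop the weakest feature, repeat
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)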
Feature Selection
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper methods to create the best subset.
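Intrinsic (embedded) selection happens inside model training itself; a common, assumed illustration is a random forest, whose built-in feature importances can be used to keep only the strongest features.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest, threshold="median")   # keep features above the median importance
X_reduced = selector.fit_transform(X, y)

print("Original features:", X.shape[1])
print("Features kept by the embedded method:", X_reduced.shape[1])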
Data Science
Module-4
Through data visualization, insights and patterns in data can be easily interpreted and
communicated to a wider audience, making it a critical component of machine learning.
Data visualization helps machine learning analysts to better understand and analyze complex data
sets by presenting them in an easily understandable format. Data visualization is an essential step
in data preparation and analysis as it helps to identify outliers, trends, and patterns in the data that
may be missed by other forms of analysis.
Data Visualization
Advantages
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and
squares from circles. Our culture is visual, including everything from art and
advertisements to TV and movies. Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see
trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with
a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend,
you know how much more effective a visualization can be.
Some other advantages of data visualization include:
● Easily share information.
● Interactively explore opportunities.
● Visualize patterns and relationships.
Data Visualization
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.
Some other disadvantages include:
● Biased or inaccurate information.
● Correlation doesn’t always mean causation.
● Core messages can get lost in translation.
Principles of Data Visualization
● Be skeptical. Ask yourself questions about what data is not represented and what insights
might therefore be misinterpreted or missing.
Principles of Data Visualization
Data visualization is very critical to market research, where both numerical and categorical data can be visualized; this helps increase the impact of insights and also helps reduce the risk of analysis paralysis. Data visualization is categorized into the following categories:
Example of Data visualization
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to represent the relationship between them. The scatter() method in the matplotlib library is used to draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [5, 7, 8, 7, 2, 17], "y": [99, 86, 87, 88, 100, 86]})  # sample data
plt.scatter(df["x"], df["y"])  # draw the scatter plot
plt.show()
Tools of Data Visualization
1. Tableau
Tableau is a data visualization tool that can be used by data analysts, scientists, statisticians, etc. to visualize the
data and get a clear opinion based on the data analysis. Tableau is very famous as it can take in data and
produce the required data visualization output in a very short time
2. Looker
Looker is a data visualization tool that can go in-depth into the data and analyze it to obtain useful insights. It
provides real-time dashboards of the data for more in-depth analysis so that businesses can make instant
decisions based on the data visualizations obtained.
Tools of Data Visualization
3. Zoho Analytics
Zoho Analytics is a Business Intelligence and Data Analytics software that can help you create
wonderful-looking data visualizations based on your data in a few minutes.
4. Sisense
Sisense is a business intelligence-based data visualization system and it provides various tools that allow data
analysts to simplify complex data and obtain insights for their organization and outsiders
5. IBM Cognos Analytics
IBM Cognos Analytics is an Artificial Intelligence-based business intelligence platform that supports data analytics among other things. You can visualize as well as analyze your data and share actionable insights with anyone in your organization.
Tools of Data Visualization
6. Qlik Sense
Qlik Sense is a data visualization platform that helps companies to become data-driven enterprises by providing
an associative data analytics engine, sophisticated Artificial Intelligence system, and scalable multi-cloud
architecture that allows you to deploy any combination of SaaS, on-premises, or a private cloud.
7. Domo
Domo is a business intelligence model that contains multiple data visualization tools that provide a consolidated
platform where you can perform data analysis and then create interactive data visualizations that allow other
people to easily understand your data conclusions.
8. Microsoft Power BI
Microsoft Power BI is a Data Visualization platform focused on creating a data-driven business intelligence
culture in all companies today. To fulfill this, it offers self-service analytics tools that can be used to analyze,
aggregate, and share data in a meaningful fashion.
Data Science
Module-5
Discovering drugs
The major contribution of data science in the pharmaceutical industry is to provide the
groundwork for the synthesis of drugs using Artificial Intelligence. Mutation profiling and
the metadata of the patients are used to develop compounds that address the statistical
correlation between the attributes.
Virtual assistance
Nowadays, chatbots and AI platforms are designed by data scientists to help people get a
better idea of their health by putting in certain health information about themselves and
getting an accurate diagnosis. Furthermore, these platforms also assist consumers with
health insurance policies and better lifestyle guides.
Application of Data Science
Wearables
The present-day phenomenon of the Internet of Things (IoT), which ensures maximum connectivity, is a blessing for data science. When this technology is applied to the medical field, it can help monitor patient health. Nowadays, physical fitness monitors and smartwatches are used by people to track and manage their health. Furthermore, these wearable sensor devices can be tracked by a doctor if given access, and in chronic cases, the doctor can remotely provide solutions to the patient.
Tracking Patient Health
Did you know that the human body generates 2TB of data daily? Data scientists for
public health have developed wearable devices that allow doctors to collect most of this
data like heart rate, sleep patterns, blood glucose, stress levels, and even brain activity.
With the help of data science tools and machine learning algorithms, doctors can detect
and track common conditions, like cardiac or respiratory diseases.
Application of Data Science
Data Science tech can also detect the slightest changes in the patient's health indicators and
predict possible disorders. Various wearable and home devices as a part of an IoT network
use real-time analytics to predict if a patient will face any problem based on their present
condition.
Diagnostics
An integral part of medical services, diagnosis can be made easier and quicker by data
science applications in healthcare. Not only does the patient’s data analysis facilitate early
detection of health issues, but medical heatmaps pertaining to demographic patterns of
ailments can also be prepared.
Predictive Analytics in Healthcare
A predictive analytical model utilizes historical data, finds patterns from the data, and
generates accurate predictions. The data could entail anything from a patient's blood
pressure and body temperature to sugar level.
Application of Data Science
Predictive models in Data Science correlate and associate every data point to symptoms,
habits, and diseases. This enables the identification of a disease's stage, the extent of
damage, and an appropriate treatment measure. Predictive analytics in healthcare also
helps:
The study and evaluation of ethical problems connected to data have given rise to a new field of study in ethics, known as ethics for data science. Data may be collected, recorded, produced, processed, shared, and used, among other things. It also encompasses different data and technologies, such as programming, hacking, professional codes, and algorithms.
The importance of ethics in data science has been felt because there has to be a clear set of
rules governing what businesses can and cannot do with the personal information they
acquire from customers.
Principles of Data Ethics
● The word privacy does not imply confidentiality since private information may be
required for audits depending on the needs of the legal process. However, this
sensitive information is obtained from a person with their permission. Additionally, it
is stated that the information must not be made public so that other people or
businesses might use it to determine the user's identity.
● Private information that has been disclosed should never be made public. In order to
protect the privacy and comply with regulations, they must also impose limitations on
how the data may be shared.
Privacy, Security, and Confidentiality of Data
A further ethical obligation associated with data management is privacy security and ethics
in data science. Customers may not want their Personally Identifiable Information (PII)
made public even if they allow your organization authority to gather, keep, and analyze it.
PII refers to any data associated with a specific person's identity, and PII includes, for
example:
● Bank account number
● Birthdate
● Credit card information
● Full name
● Passport number
● Phone number
● Social Security card
● Street address
Future of Data Science
Rise of Automation:
With an increase in the complexity of operations, there is always a drive to simplify processes.
Scarcity or Abundance of Data Scientists:
Today, thousands of individuals learn Data Science related skills through college degrees or the numerous resources that can be found online, and this could result in newer aspirants getting a feeling of saturation in this domain.