Handbook Introduction of Data Science AY 23-24
● Data Science is nothing short of magic, and a data scientist is a magician who performs tricks with the data in their hat. Just as magic is composed of different elements, data science is an interdisciplinary field. You can consider data science to be an amalgamation of different fields such as Data Manipulation, Data Visualization, Statistical Analysis, and Machine Learning. Each of these sub-domains is equally important when it comes to data science.
● Data science is a deep study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data that is processed using the scientific method, different technologies, and algorithms.
● It is a multidisciplinary field that uses tools and techniques to manipulate the data so
that you can find something new and meaningful.
Introduction of Data Science
● Data science uses the most powerful hardware, programming systems, and the most efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
● In short, we can say that data science is all about:
● Asking the correct questions and analyzing the raw data.
● Modeling the data using various complex and efficient algorithms.
● Visualizing the data to get a better perspective.
● Understanding the data to make better decisions and find the final result.
Introduction of Data Science
● Let us suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will be the best route to reach the location faster, on which route there will be no traffic jam, and which will be cost-effective. All these decision factors act as input data, and we will get an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.
Need for Data Science:
Introduction of Data Science
● Now, handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require some complex, powerful, and efficient algorithms and technology, and that technology is Data Science. Following are some of the main reasons for using data science technology:
● With the help of data science technology, we can convert the massive amount of raw
and unstructured data into meaningful insights.
● Data science technology is adopted by various companies, whether they are big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, are using data science algorithms to provide a better customer experience.
Data Science Components:
1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: In data science, domain expertise is what binds data science together. Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science which involves acquiring, storing, retrieving, and transforming the data. Data engineering also includes adding metadata (data about data) to the data.
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, and project budget. In this phase, we need to determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at the first hypothesis level.
2. Data preparation: Data preparation is also known as Data Munging. In this phase, we
need to perform the following tasks:
● Data cleaning
● Data reduction
● Data integration
● Data transformation
Data Science Lifecycle
● After performing all the above tasks, we can easily use this data for our further
processes.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relations between input variables. We will apply exploratory data analysis (EDA) using various statistical formulas and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
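The specific tools were presented on the original slides; as a minimal, hedged illustration of what EDA during model planning can look like, here is a short Python sketch (pandas and seaborn are assumed choices, and "sales.csv" is a hypothetical file, not one named in the handbook):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset: any CSV with a few numeric columns would do.
df = pd.read_csv("sales.csv")

numeric = df.select_dtypes("number")   # keep only numeric columns
print(numeric.describe())              # summary statistics per column
print(numeric.corr())                  # pairwise correlations between input variables

sns.pairplot(numeric)                  # visualize pairwise relationships
plt.show()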
4. Model-building: In this phase, the process of model building starts. We will create datasets for training and testing purposes. We will apply different techniques, such as association, classification, and clustering, to build the model.
Following are some common Model building tools:
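The tools themselves were listed on the original slide; independent of any particular tool, the core of the model-building step can be sketched in a few lines of Python. scikit-learn, the iris sample dataset, and a decision tree are assumed example choices, not the handbook's prescribed ones:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # small sample dataset bundled with scikit-learn

# Create datasets for training and testing purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()           # one possible classification technique
model.fit(X_train, y_train)                # build the model on the training data

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))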
5. Operationalize: In this phase, we will deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides you with a clear overview of the complete project performance and other components on a small scale before the full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal that we set in the initial phase. We will communicate the findings and the final result to the business team.
Data Science has a lot of real-world applications. Let’s have a look at some of those:
Applications of Data Science
Comparison of Data Science with Data Analytics
● Predictive analytics is an area of statistics that deals with extracting information from
data and using it to predict trends and behavior patterns.
● The enhancement of predictive web analytics calculates statistical probabilities of
future events online. Predictive analytics statistical techniques include data modeling,
machine learning, AI, deep learning algorithms and data mining.
● Often the unknown event of interest is in the future, but predictive analytics can be applied to any type of unknown, whether it be in the past, present, or future. For example, identifying suspects after a crime has been committed, or detecting credit card fraud as it occurs.
Comparison of Data Science with Data Analytics
● Descriptive analytics or statistics do exactly what the name implies: they “describe”, or summarize, raw data and make it something that is interpretable by humans. They are analytics that describe the past. The past refers to any point in time at which an event has occurred, whether it is one minute ago or one year ago. Descriptive analytics are useful because they allow us to learn from past behaviors and understand how they might influence future outcomes.
Descriptive Analytics: Insight into the past
● The vast majority of the statistics we use fall into this category. (Think basic
arithmetic like sums, averages, percent changes.) Usually, the underlying data is a
count, or aggregate of a filtered column of data to which basic math is applied. For all
practical purposes, there are an infinite number of these statistics.
● Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer, and year-over-year change in sales. Common examples of descriptive analytics are reports that provide historical insights regarding the company’s production, financials, operations, sales, inventory, and customers.
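A small, assumed example of these descriptive measures in Python (pandas; the sales figures are made up purely for illustration):

import pandas as pd

# Toy data, purely illustrative
sales = pd.DataFrame({
    "year":     [2022, 2022, 2023, 2023],
    "customer": ["A", "B", "A", "B"],
    "amount":   [120.0, 80.0, 150.0, 110.0],
})

total_by_year = sales.groupby("year")["amount"].sum()   # aggregate (sum) per year
avg_per_purchase = sales["amount"].mean()               # average dollars spent
yoy_change = total_by_year.pct_change() * 100           # year-over-year % change (first year is NaN)

print(total_by_year)
print("Average per purchase:", avg_per_purchase)
print(yoy_change)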
Descriptive Analytics: Insight into the past
● Use Descriptive Analytics when you need to understand at an aggregate level what is
going on in your company, and when you want to summarize and describe different
aspects of your business.
Predictive Analytics: Understanding the future
● Predictive analytics has its roots in the ability to “predict” what might happen. These
analytics are about understanding the future. Predictive analytics provides companies
with actionable insights based on data.
● Predictive analytics provides estimates about the likelihood of a future outcome. It is
important to remember that no statistical algorithm can “predict” the future with
100% certainty. Companies use these statistics to forecast what might happen in the
future. This is because the foundation of predictive analytics is based on probabilities.
Predictive Analytics: Understanding the future
● These statistics try to take the data that you have, and fill in the missing data with best
guesses. They combine historical data found in ERP, CRM, HR and POS systems to
identify patterns in the data and apply statistical models and algorithms to capture
relationships between various data sets. Companies use predictive statistics and
analytics any time they want to look into the future. Predictive analytics can be used
throughout the organization, from forecasting customer behavior and purchasing
patterns to identifying trends in sales activities. They also help forecast demand for
inputs from the supply chain, operations and inventory.
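As a hedged sketch of the idea of "filling in missing data with best guesses", a simple linear-regression forecast in Python (scikit-learn is an assumed library choice, and the monthly demand numbers are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Historical monthly demand (made-up numbers)
months = np.arange(1, 13).reshape(-1, 1)
demand = np.array([100, 104, 110, 113, 120, 124, 130, 133, 140, 143, 150, 155])

model = LinearRegression().fit(months, demand)   # capture the trend in past data

# Forecast the next three months; real forecasts express likelihoods, not certainty
future = np.arange(13, 16).reshape(-1, 1)
print(model.predict(future))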
Predictive Analytics: Understanding the future
● One common application most people are familiar with is the use of predictive
analytics to produce a credit score. These scores are used by financial services to
determine the probability of customers making future credit payments on time.
Typical business uses include understanding how sales might close at the end of the
year, predicting what items customers will purchase together, or forecasting inventory
levels based upon a myriad of variables.
● Use Predictive Analytics any time you need to know something about the future, or
fill in the information that you do not have
Prescriptive Analytics: Advise on possible outcomes
● Prescriptive analytics are relatively complex to administer, and most companies are
not yet using them in their daily course of business. When implemented correctly,
they can have a large impact on how businesses make decisions, and on the
company’s bottom line. Larger companies are successfully using prescriptive
analytics to optimize production, scheduling and inventory in the supply chain to
make sure they are delivering the right products at the right time and optimizing the
customer experience.
● Use Prescriptive Analytics any time you need to provide users with advice on what
action to take
Big data analytics
● In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally we model the data in a way that explains a response. The objective of this approach is to predict the response behavior or understand how the input variables relate to a response. In statistical experimental designs, by contrast, an experiment is developed and data is retrieved as a result. This allows us to generate data in a way that can be used by a statistical model, where certain assumptions hold, such as independence, normality, and randomization.
Prescriptive Analytics: Advise on possible outcomes
● The relatively new field of prescriptive analytics allows users to “prescribe” a number
of different possible actions and guide them towards a solution. In a nutshell, these
analytics are all about providing advice.
● Prescriptive analytics attempts to quantify the effect of future decisions in order to
advise on possible outcomes before the decisions are actually made.
● At their best, prescriptive analytics predicts not only what will happen, but also why it
will happen, providing recommendations regarding actions that will take advantage of
the predictions.
Big data analytics
● In big data analytics, we are presented with the data. We cannot design an experiment
that fulfills our favorite statistical model. In large-scale applications of analytics, a
large amount of work (normally 80% of the effort) is needed just for cleaning the
data, so it can be used by a machine learning model.
Big data analytics
● One of the most important tasks in big data analytics is statistical modeling, meaning
supervised and unsupervised classification or regression problems.
● Once the data is cleaned, preprocessed, and available for modeling, care should be taken to evaluate different models with reasonable loss metrics; then, once the model is implemented, further evaluation and results should be reported.
● A common pitfall in predictive modeling is to just implement the model and never
measure its performance.
Big data analytics
● As mentioned in the big data life cycle, the data products that result from developing a big data product are, in most cases, some of the following −
● Machine learning implementation − This could be a classification algorithm, a
regression model or a segmentation model.
● Recommender system − The objective is to develop a system that recommends
choices based on user behavior. Netflix is the characteristic example of this data
product, where based on the ratings of users, other movies are recommended.
● Dashboard − Businesses normally need tools to visualize aggregated data. A dashboard is a graphical mechanism to make this data accessible.
● Ad-Hoc analysis − Normally business areas have questions, hypotheses or myths that
can be answered doing ad-hoc analysis with data.
Introduction to Web and Social Media Analytics
● Web analytics is the measurement of data, the collection of information, analysis, and
reporting of Internet data for the purposes of optimizing and understanding Web
usage.
● Here, web usage refers to the respective business website’s usage data.
Social Media Analytics
● Now let us understand social media analytics. Social media analytics is the practice of gathering data from social media websites or networks such as Facebook, Twitter, Google Plus, etc., and analyzing those metrics to gain insights for making business decisions.
● Most often, people confuse web analytics and social media analytics, so let us draw a clear demarcation between the two. Web analytics uses the data collected directly from a particular business website, while social media analytics uses the data collected from social media networks.
Web Analytics
The four important key metrics that can be analyzed through web analytics are:
1. Total Traffic
2. Traffic Source
3. Bounce Rate
4. Conversion rate
Web Analytics
● Total Traffic:- Total traffic to your website gives insights about where you are getting more traffic from, which helps you to understand your target market. In addition, you can analyze during which hours of the day and days of the week you are getting more visits to your website. Based on this information you can run a campaign to optimize for more conversions.
● Traffic Source:- Traffic source is about how you are getting most visitors to your website. Is it through social media, search engines, or referral sites? Based on that information you can strategize your marketing campaign, write a blog, or focus on a particular social media network. For example, if most of your visitors are coming from social media networks, use that information to brand your business more on Facebook, Twitter, or other social media platforms to boost your website traffic.
Web Analytics
● Bounce Rate:- Bounce rate is the percentage of visitors to a particular website who leave the site after viewing only one page, without navigating to other pages on the website.
● This could be high for any number of reasons, such as:
● Irrelevant content
● Inappropriate designs
● Confusing navigation
● Frequent pop-ups
● Unnecessary ads
● Or, Annoying sounds
● However, this metric helps you to improve your web design overall.
Web Analytics
● Conversion Rate:- A conversion rate is the percentage of visitors who have taken some action on your website or completed the desired goal; it could be purchasing a product, signing up for newsletters, etc.
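Both bounce rate and conversion rate are simple percentages; a tiny Python sketch with made-up traffic numbers shows the arithmetic:

# Made-up monthly figures for illustration
total_visitors = 12_500
single_page_visits = 4_875      # visitors who left after viewing one page
goal_completions = 375          # purchases, newsletter sign-ups, etc.

bounce_rate = single_page_visits / total_visitors * 100
conversion_rate = goal_completions / total_visitors * 100

print(f"Bounce rate: {bounce_rate:.1f}%")          # 39.0%
print(f"Conversion rate: {conversion_rate:.1f}%")  # 3.0%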
● There are many tools available in the market to track and analyze web data; however, Google Analytics is one of the vital tools for any website, as it offers the chance to obtain and analyze in-depth statistics about who your customers are, what they are interested in, and how they interact with your website or online store.
● When you take advantage of the information which Google Analytics offers, you can make data-informed decisions about the best ways to improve your business.
Machine Learning Algorithms
● Machine learning is used to predict, categorize, classify, find polarity, etc., from the given datasets and is concerned with minimizing the error.
● It uses training data for artificial intelligence.
● There are many algorithms, such as the SVM Algorithm in Python, the Bayes algorithm, logistic regression, etc., which use training data to match with input data and then provide a conclusion with maximum accuracy.
Machine Learning Algorithms
● Machine learning is broadly categorized into supervised, unsupervised, and reinforcement learning.
● The critical element of data science is machine learning algorithms, which are sets of rules or processes for solving a certain problem.
Machine Learning Algorithms
● Some of the important data science algorithms include regression, classification and
clustering techniques, decision trees and random forests, machine learning techniques
like supervised, unsupervised and reinforcement learning. In addition to these, there
are many algorithms that organizations develop to serve their unique needs.
Machine Learning Algorithms
● Supervised learning
● It is used for structured datasets. It analyzes the training data and generates a function that will be used for other datasets.
● Unsupervised learning
● It is used for raw datasets. Its main task is to convert raw data into structured data. In today’s world, there is a huge amount of raw data in every field. Even computers generate log files, which are in the form of raw data. Therefore, it is a very important part of machine learning.
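For the unsupervised case described above, a minimal clustering sketch in Python (scikit-learn's KMeans is an assumed example; the points are made up) shows how structure is found in unlabeled data:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled (raw) data: two loose groups of 2-D points, made up for illustration
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignments discovered without any labels
print(kmeans.cluster_centers_)  # structure extracted from the raw data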
Challenges in Data-Driven Decision Making and Future
● Discrimination
● Algorithmic discrimination can come from various sources. First, the data used to
train algorithms may have biases that lead to discriminatory decisions. Second,
discrimination may arise from the use of a particular algorithm.
● Categorization, for example, can be considered a form of direct discrimination
because it uses algorithms to determine different treatments of various classes. Third,
algorithms can result in discrimination as a result of misuse of certain models in
different contexts. Fourth, biased data can be used both as evidence for the training of
algorithms and as evidence of their effectiveness.
Challenges in Data-Driven Decision Making and Future
● Lack of transparency
● Transparency refers to the capacity to understand a computational model and
therefore contribute to the attribution of responsibility for consequences derived
from its use.
● A model is transparent if a person can easily observe it and understand it. It would
therefore be desirable for models to have low computational complexity.
● Violation of privacy
● Reports and studies have focused on the misuse of users' personal data and on data
aggregation by entities such as data brokers, which have direct implications for
people's privacy
Challenges in Data-Driven Decision Making and Future
● Digital literacy
● It is extremely important that we devote resources to digital and computer literacy
programs for all citizens, from children to the elderly. If we do not, it will be very
difficult (if not impossible) as a society to make decisions about technologies that
we do not understand. The book Los nativos digitales no existen ("Digital natives do
not exist") emphasizes the need to teach children and adolescents about computer
thinking and proper use of technology.
Challenges in Data-Driven Decision Making and Future
● Fuzzy responsibility
● As more and more decisions that affect millions of people are made automatically by
algorithms, we must be clear about who is responsible for the consequences of these
decisions. Transparency is often considered a fundamental factor in the clarity of
attribution of responsibility. However, transparency and audits are not enough to
guarantee clear responsibility.
Challenges in Data-Driven Decision Making and Future
● Lack of diversity
● Given the variety of cases in which algorithms can be applied for decision-making, it
is important to reflect on the frequent lack of diversity in the teams that generate such
algorithms. So far, data-based algorithms and artificial intelligence techniques for
decision-making have been developed by homogeneous groups of IT professionals. In
the future, we should make sure that teams are diverse in terms of areas of knowledge
as well as demographic factors (particularly gender, given that women account for
less than 20% of IT professionals at many technology companies).
Types & Scales of Data in Descriptive Statistics
● Descriptive statistics help you to understand the data, but before we understand what data is, we should know the different data types used in descriptive statistical analysis. The following sections give an overview of them.
Types of data
● A data set is a grouping of information that is related to each other. A data set can be either qualitative or quantitative. A qualitative data set consists of words that can be observed, not measured. A quantitative data set consists of numbers that can be directly measured. Months in a year would be an example of qualitative data, while the weight of persons would be an example of quantitative data.
Types of data
● Now, let’s suppose you go to KFC to eat some burgers with your friends. You place the order at the coupon counter and, after receiving the food from the food counter, everyone eats what you ordered on their behalf. If someone asks the others about the taste, the ratings of the taste will vary from one person to another, but if asked how many burgers were ordered, everyone will come to a definite count, and it will be the same for all. Here, the taste ratings represent Categorical Data and the number of burgers is Numerical Data.
Types of Categorical Data:
● Nominal Data: When there is no natural order between categories, the data is of nominal type.
● Example: Color of an eye, Gender (Male & Female), Blood type, Political party, Zip code, Type of living accommodation (House, Apartment, Trailer, Other), Religious preference (Hindu, Buddhist, Muslim, Jewish, Christian, Other), etc.
Types of Categorical Data:
● Ordinal Data: When there is a natural order between categories, the data is of ordinal type. However, the difference between the values in the order does not matter.
Example: Exam grades, Socio-economic status (poor, middle class, rich), Education level (kindergarten, primary, secondary, higher secondary, graduation), Satisfaction rating (extremely dislike, dislike, neutral, like, extremely like), Time of day (dawn, morning, noon, afternoon, evening, night), Level of agreement (yes, maybe, no), the Likert scale (strongly disagree, disagree, neutral, agree, strongly agree), etc.
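A short, assumed pandas example of how nominal and ordinal categories differ in practice: the ordinal column carries an order, the nominal one does not.

import pandas as pd

blood_type = pd.Categorical(["A", "O", "B", "AB"])           # nominal: no natural order
satisfaction = pd.Categorical(
    ["like", "neutral", "extremely like", "dislike"],
    categories=["extremely dislike", "dislike", "neutral", "like", "extremely like"],
    ordered=True,                                            # ordinal: order matters
)

print(blood_type.ordered)       # False
print(satisfaction.ordered)     # True
print(satisfaction.min())       # "dislike" -- comparisons only make sense for ordinal data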
Types of Numerical Data
● Discrete Data: The data is said to be discrete if the measurements are integers. It represents a count or an item that can be counted.
● Example: Number of people in a family, the number of kids in class, the number of
cricket players in a team, the number of cricket playing nations in the world.
● Discrete data is a special kind of data because each value is separate and different.
Types of Numerical Data
● Continuous Data: The data is said to be continuous if the measurements can take any value, usually within some range. It is a scale of measurement that can consist of numbers other than whole numbers, like decimals and fractions.
● Example: height, weight, length, temperature
● Continuous data usually require a tool, like a ruler, measuring tape, scale, or
thermometer, to produce the values in a continuous data set.
Scales of Measurement
● Scales of Measurement:
● Data can be classified as being on one of four scales: nominal, ordinal, interval or
ratio. Each level of measurement has some important properties that are useful to
know.
Scales of Measurement
● Nominal Scale: The nominal data type defined above can be placed into this category. Nominal values don’t have a numeric value, and so they can neither be added, subtracted, divided, nor multiplied. They also have no order; if they appear to have an order, then you probably have ordinal variables instead.
Scales of Measurement
● Ordinal Scale: Ordinal datatype defined above can be placed into this category. The
ordinal scale contains things that you can place in order. For example, hottest to
coldest, lightest to heaviest, richest to poorest. So, if you can rank data by 1st, 2nd,
3rd place (and so on), then you have data that is on an ordinal scale.
Scales of Measurement
● Interval Scale: An interval scale has ordered numbers with meaningful divisions. Temperature is on the interval scale: a difference of 10 degrees between 90 and 100 means the same as a difference of 10 degrees between 150 and 160. Compare that to an Olympic running race (which is ordinal), where the time difference between the winner and the runner-up might be 0.01 seconds and between the second-last and last 0.5 seconds. If you have meaningful divisions, you have something on the interval scale.
Scales of Measurement
● Ratio Scale: The ratio scale has all the properties of the interval scale, with one major difference: zero is meaningful. When the scale is equal to 0.0, there is none of that quantity. For example, a height of zero is meaningful (it means you don’t exist). Temperature in Kelvin (0.0 K) really does mean “no heat”. Compare that to a temperature of zero degrees Celsius, which, while it exists, doesn’t mean anything in particular (although admittedly, on the Celsius scale it is the freezing point of water).
Populations and Samples
Population: it is the set of things we are observing (humans, events, animals, etc.). It has some parameters such as the mean, median, mode, and standard deviation, among others.
Sample: it is a random subset of the population. Usually you use samples when the population is big enough to make analysis of the whole set difficult. In a sample you don’t have parameters, you have statistics.
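A quick numerical illustration (NumPy assumed, numbers simulated): the population mean is a parameter, while the mean of a random sample is a statistic that only estimates it.

import numpy as np

rng = np.random.default_rng(seed=42)

population = rng.normal(loc=170, scale=10, size=100_000)   # e.g. simulated heights
sample = rng.choice(population, size=200, replace=False)   # a random subset

print("Population mean (parameter):", population.mean())
print("Sample mean (statistic):    ", sample.mean())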
Percentile, Decile and Quartile
● Quartiles, Deciles, and Percentiles: From the definition of the median, it is the middle point of the frequency distribution curve; it divides the area under the curve into two parts of equal area, one to the left and one to the right. In the same way, the area under the curve may be divided into four equal parts, and the dividing points are called quartiles; dividing the area into ten equal parts gives the deciles; and dividing it into one hundred equal parts gives the percentiles.
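In code these are all just percentiles at different cut points; a brief NumPy sketch with simulated data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=12, size=1_000)        # simulated scores

quartiles = np.percentile(data, [25, 50, 75])           # divide the distribution into 4 equal parts
deciles = np.percentile(data, np.arange(10, 100, 10))   # 10 equal parts
p95 = np.percentile(data, 95)                           # a single percentile

print("Quartiles:", quartiles)
print("Deciles:  ", deciles)
print("95th percentile:", p95)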
Skewness
● Skewness
● It is the degree of distortion from the symmetrical bell curve or the normal
distribution. It measures the lack of symmetry in data distribution.
● It differentiates extreme values in one versus the other tail. A symmetrical distribution
will have a skewness of 0.
● There are two types of Skewness: Positive and Negative
Skewness
● Positive Skewness means that the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.
● Negative Skewness means that the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.
● If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
● If the skewness is between -1 and -0.5(negatively skewed) or between 0.5 and
1(positively skewed), the data are moderately skewed.
● If the skewness is less than -1(negatively skewed) or greater than 1(positively
skewed), the data are highly skewed.
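A minimal check of these thresholds in Python (SciPy is an assumed tool; the data is simulated so that one sample is right-skewed and the other symmetric):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=2.0, size=5_000)    # long right tail
symmetric = rng.normal(size=5_000)

print("Skewness (right-skewed):", skew(right_skewed))    # clearly > 1: highly skewed
print("Skewness (symmetric):   ", skew(symmetric))       # close to 0: fairly symmetrical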
Skewness
● Example
● Let us take a very common example: house prices. Suppose we have house values ranging from $100k to $1,000,000, with the average being $500,000.
● If the peak of the distribution lies to the left of the average value, the distribution has a positive skewness. It would mean that many houses were being sold for less than the average value, i.e. $500k. This could be for many reasons, but we are not going to interpret those reasons here.
● If the peak of the distributed data lies to the right of the average value, that would mean a negative skew. This would mean that the houses were being sold for more than the average value.
Kurtosis
● Kurtosis
● Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is
used to describe the extreme values in one versus the other tail. It is actually the
measure of outliers present in the distribution.
● High kurtosis in a data set is an indicator that the data has heavy tails or outliers. If there is high kurtosis, then we need to investigate why we have so many outliers. It could indicate many things, maybe wrong data entry or something else. Investigate!
● Low kurtosis in a data set is an indicator that data has light tails or lack of outliers. If
we get low kurtosis(too good to be true), then also we need to investigate and trim the
dataset of unwanted results.
Kurtosis
● Mesokurtic: This distribution has kurtosis statistic similar to that of the normal
distribution. It means that the extreme values of the distribution are similar to that of a
normal distribution characteristic. This definition is used so that the standard normal
distribution has a kurtosis of three.
● Leptokurtic (Kurtosis > 3): The distribution is longer and the tails are fatter. The peak is higher and sharper than in the Mesokurtic case, which means that the data are heavy-tailed or have a profusion of outliers.
● Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of
the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness”
of a leptokurtic distribution.
Kurtosis
● Platykurtic (Kurtosis < 3): The distribution is shorter and the tails are thinner than those of the normal distribution. The peak is lower and broader than in the Mesokurtic case, which means that the data are light-tailed or lack outliers.
● The reason for this is that extreme values are less frequent than in the normal distribution.
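The same comparison in code (SciPy assumed). Note that scipy.stats.kurtosis reports excess kurtosis by default (normal = 0); pass fisher=False to compare against the value 3 used above.

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
normal_data = rng.normal(size=10_000)             # mesokurtic
heavy_tails = rng.standard_t(df=5, size=10_000)   # leptokurtic: more outliers
light_tails = rng.uniform(size=10_000)            # platykurtic: almost no outliers

for name, data in [("normal", normal_data), ("t(5)", heavy_tails), ("uniform", light_tails)]:
    print(name, kurtosis(data, fisher=False))     # roughly 3, > 3, < 3 respectively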
Data Science
Module-2
Normal probability plots are used to see how closely the elements of a dataset
follow the normal distribution. The assumption of normality is common in many
disciplines. For example, it's often assumed in finance and economics that the
returns to stocks are normally distributed. The assumption of normality is very
convenient, and many statistical tests are based on this assumption.
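A minimal normal probability (Q-Q) plot in Python, using scipy.stats.probplot as an assumed tool; points lying close to the straight reference line indicate that the data follow the normal distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
returns = rng.normal(loc=0.0, scale=0.02, size=500)   # simulated daily stock returns

stats.probplot(returns, dist="norm", plot=plt)        # normal probability plot
plt.title("Normal probability plot of simulated returns")
plt.show()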
Exploratory Data Analysis (EDA)
Predictive analytics is a category of data analytics aimed at making predictions
about future outcomes based on historical data and analytics techniques such as
statistical modeling and machine learning. The science of predictive analytics can
generate future insights with a significant degree of precision. With the help of
sophisticated predictive analytics tools and models, any organization can now use
past and current data to reliably forecast trends and behaviors milliseconds, days,
or years into the future.
Exploratory Data Analysis (EDA)
With predictive analytics, organizations can find and exploit patterns contained
within data in order to detect risks and opportunities. Models can be designed, for
instance, to discover relationships between various behavior factors. Such models
enable the assessment of either the promise or risk presented by a particular set
of conditions, guiding informed decision-making across various categories of
supply chain and procurement events.
Exploratory Data Analysis (EDA)
Benefits of predictive analytics
Predictive analytics makes looking into the future more accurate and reliable than
previous tools. As such it can help adopters find ways to save and earn money.
Retailers often use predictive models to forecast inventory requirements, manage
shipping schedules, and configure store layouts to maximize sales. Airlines
frequently use predictive analytics to set ticket prices reflecting past travel trends.
Hotels, restaurants, and other hospitality industry players can use the technology
to forecast the number of guests on any given night in order to maximize
occupancy and revenue
Exploratory Data Analysis (EDA)
Predictive analytics is used in a wide variety of ways by companies worldwide.
Adopters from diverse industries such as banking, healthcare, commerce, hospitality, pharmaceuticals, automotive, aerospace, and manufacturing benefit from the technology.
Here are a few examples of how businesses are using predictive analytics:
Customer Service
Insurance firms evaluate policy applicants to assess the chance of having to pay
out for a future claim based on the existing risk pool of comparable policyholders,
as well as previous occurrences that resulted in payments. Actuaries frequently
utilize models that compare attributes to data about previous policyholders and
claims.
Higher Education
Predictive analytics applications in higher education include enrollment
management, fundraising, recruiting, and retention. Predictive analytics offers a
significant advantage in each of these areas by offering intelligent insights that
would otherwise be neglected.
Exploratory Data Analysis (EDA)
A prediction algorithm can rate each student and tell administrators ways to serve
students during the duration of their enrollment using data from a student's high
school years. Models can give crucial information to fundraisers regarding the
optimal times and strategies for reaching out to prospective and current donors.
Supply Chain
Exploratory Data Analysis (EDA)
Forecasting is an important concern in manufacturing because it guarantees that
resources in a supply chain are used optimally. Inventory management and the
shop floor, for example, are critical spokes of the supply chain wheel that require
accurate forecasts to function.
Predictive modeling is frequently used to clean and improve the data utilized for
such estimates. Modeling guarantees that additional data, including data from
customer-facing activities, may be consumed by the system, resulting in a more
accurate prediction.
Data Science
Module-3
● Feature generation is the process of creating new features from one or multiple
existing features, potentially for use in statistical analysis.
● This process adds new information that becomes accessible during model construction and therefore, hopefully, results in a more accurate model.
● In machine learning and pattern recognition, a feature is an individual measurable
property or characteristic of a phenomenon being observed. Collecting and processing
data can be an expensive and time-consuming process
● Therefore, choosing informative, discriminating and independent features is a crucial
step for effective algorithms in pattern recognition, classification, and regression
Feature generation
● In the age of information in which we live, dataset sizes are becoming increasingly larger in both the number of instances and the number of features. It has become quite common to see datasets with tens of thousands of features or more. Furthermore, algorithm developers often use a process called feature generation.
● Feature generation is the process of creating new features from one or multiple existing features, for potential use in statistical analysis. Usually, this process adds new information to the model and makes it more accurate. Feature generation can improve model accuracy when there is a feature interaction.
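A short, assumed example of generating a new feature that encodes an interaction between two existing features (pandas; the housing-style numbers are made up):

import pandas as pd

# Toy data, purely illustrative
df = pd.DataFrame({
    "length_m": [10.0, 12.5, 8.0],
    "width_m":  [6.0, 7.0, 5.5],
    "price":    [180_000, 260_000, 120_000],
})

# New feature created from two existing ones: the interaction (area)
df["area_m2"] = df["length_m"] * df["width_m"]

# A prediction model can now use the interaction directly
print(df[["length_m", "width_m", "area_m2", "price"]])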
Feature generation
● By adding new features that encapsulate the feature interaction, new information becomes more accessible to the prediction model (PM).
● Feature extraction is a quite complex concept concerning the translation of raw data
into the inputs that a particular Machine Learning algorithm requires.
● Features must represent the information of the data in a format that will best fit the
needs of the algorithm that is going to be used to solve the problem.
● Feature extraction fills this requirement: it builds valuable information from raw data – the features – by reformatting, combining, and transforming primary features into new ones, until it yields a new set of data that can be consumed by the Machine Learning models to achieve their goals.
Feature Selection
● Feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input variables for the model by using only relevant data, in order to reduce overfitting in the model.
Feature Selection
Filter Methods
These methods are generally used during the pre-processing step. They select features from the dataset irrespective of any machine learning algorithm. In terms of computation, they are very fast and inexpensive, and they are very good at removing duplicated, correlated, and redundant features, but these methods do not remove multicollinearity.
Feature Selection
Each feature is evaluated individually, which can sometimes help when features are in isolation (do not have a dependency on other features), but it will lag when a combination of features can lead to an increase in the overall performance of the model.
Feature Selection
1. Filter Method: In this method, features are dropped based on their relation to the output, or how they are correlated with the output. We use correlation to check whether the features are positively or negatively correlated with the output labels and drop features accordingly. Eg: Information Gain, Chi-Square Test, Fisher’s Score, etc.
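A hedged sketch of a filter method with scikit-learn (an assumed library choice): SelectKBest scores each feature independently, here with the chi-square test, before any model is trained. The dataset and the value of k are illustrative.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # non-negative features, suitable for chi2

selector = SelectKBest(score_func=chi2, k=2) # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_reduced.shape)     # (150, 2)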
Feature Selection
2. Wrapper Method: We split our data into subsets and train a model using them. Based on the output of the model, we add and subtract features and train the model again. It forms the subsets using a greedy approach and evaluates the accuracy of all the possible combinations of features. Eg: Forward Selection, Backward Elimination, etc.
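A wrapper-style sketch using scikit-learn's recursive feature elimination (RFE), which repeatedly trains a model and drops the weakest feature; the estimator and the number of features to keep are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

estimator = LogisticRegression(max_iter=5000)   # model used to judge feature subsets
rfe = RFE(estimator, n_features_to_select=5)    # train, drop the weakest feature, repeat
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)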
Feature Selection
3. Intrinsic Method: This method combines the qualities of both the Filter and Wrapper methods to create the best subset.
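Intrinsic (embedded) selection happens inside model training itself; a common, assumed illustration is a random forest, whose built-in feature importances can be used to keep only the strongest features.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest, threshold="median")   # keep features above the median importance
X_reduced = selector.fit_transform(X, y)

print("Original features:", X.shape[1])
print("Features kept by the embedded method:", X_reduced.shape[1])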
Data Science
Module-4
Through data visualization, insights and patterns in data can be easily interpreted and
communicated to a wider audience, making it a critical component of machine learning.
Data visualization helps machine learning analysts to better understand and analyze complex data
sets by presenting them in an easily understandable format. Data visualization is an essential step
in data preparation and analysis as it helps to identify outliers, trends, and patterns in the data that
may be missed by other forms of analysis.
Data Visualization
Advantages
Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and
squares from circles. Our culture is visual, including everything from art and
advertisements to TV and movies. Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see
trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with
a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend,
you know how much more effective a visualization can be.
Some other advantages of data visualization include:
● Easily share information.
● Interactively explore opportunities.
● Visualize patterns and relationships.
Data Visualization
Disadvantages
While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an
inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s
biased or confusing.
Some other disadvantages include:
● Biased or inaccurate information.
● Correlation doesn’t always mean causation.
● Core messages can get lost in translation.
Principles of Data Visualization
● Be skeptical. Ask yourself questions about what data is not represented and what insights
might therefore be misinterpreted or missing.
Principles of Data Visualization
Data visualization is very critical to market research, where both numerical and categorical data can be visualized; this helps increase the impact of insights and also helps reduce the risk of analysis paralysis. Data visualization is categorized into the following categories:
Example of Data visualization
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to represent the relationship between them. The scatter() method in the matplotlib library is used to draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [5, 7, 8, 7, 2, 17], "y": [99, 86, 87, 88, 100, 86]})  # sample data
plt.scatter(df["x"], df["y"])  # draw the scatter plot
plt.show()
Tools of Data Visualization
1. Tableau
Tableau is a data visualization tool that can be used by data analysts, scientists, statisticians, etc. to visualize the
data and get a clear opinion based on the data analysis. Tableau is very famous as it can take in data and
produce the required data visualization output in a very short time
2. Looker
Looker is a data visualization tool that can go in-depth into the data and analyze it to obtain useful insights. It
provides real-time dashboards of the data for more in-depth analysis so that businesses can make instant
decisions based on the data visualizations obtained.
Tools of Data Visualization
3. Zoho Analytics
Zoho Analytics is a Business Intelligence and Data Analytics software that can help you create
wonderful-looking data visualizations based on your data in a few minutes.
4. Sisense
Sisense is a business intelligence-based data visualization system and it provides various tools that allow data
analysts to simplify complex data and obtain insights for their organization and outsiders
5. IBM Cognos Analytics
IBM Cognos Analytics is an Artificial Intelligence-based business intelligence platform that supports data analytics among other things. You can visualize as well as analyze your data and share actionable insights with anyone in your organization.
Tools of Data Visualization
6. Qlik Sense
Qlik Sense is a data visualization platform that helps companies to become data-driven enterprises by providing
an associative data analytics engine, sophisticated Artificial Intelligence system, and scalable multi-cloud
architecture that allows you to deploy any combination of SaaS, on-premises, or a private cloud.
7. Domo
Domo is a business intelligence model that contains multiple data visualization tools that provide a consolidated
platform where you can perform data analysis and then create interactive data visualizations that allow other
people to easily understand your data conclusions.
8. Microsoft Power BI
Microsoft Power BI is a Data Visualization platform focused on creating a data-driven business intelligence
culture in all companies today. To fulfill this, it offers self-service analytics tools that can be used to analyze,
aggregate, and share data in a meaningful fashion.
Data Science
Module-5
Discovering drugs
The major contribution of data science in the pharmaceutical industry is to provide the
groundwork for the synthesis of drugs using Artificial Intelligence. Mutation profiling and
the metadata of the patients are used to develop compounds that address the statistical
correlation between the attributes.
Virtual assistance
Nowadays, chatbots and AI platforms are designed by data scientists to help people get a
better idea of their health by putting in certain health information about themselves and
getting an accurate diagnosis. Furthermore, these platforms also assist consumers with
health insurance policies and better lifestyle guides.
Application of Data Science
Wearables
The present-day phenomenon of the Internet of Things (IoT), which ensures maximum connectivity, is a blessing for data science. When this technology is applied to the medical field, it can help monitor patient health. Nowadays, physical fitness monitors and smartwatches are used by people to track and manage their health. Furthermore, these wearable sensor devices can be tracked by a doctor if given access, and in chronic cases, the doctor can remotely provide solutions to the patient.
Tracking Patient Health
Did you know that the human body generates 2TB of data daily? Data scientists for
public health have developed wearable devices that allow doctors to collect most of this
data like heart rate, sleep patterns, blood glucose, stress levels, and even brain activity.
With the help of data science tools and machine learning algorithms, doctors can detect
and track common conditions, like cardiac or respiratory diseases.
Application of Data Science
Data Science tech can also detect the slightest changes in the patient's health indicators and
predict possible disorders. Various wearable and home devices as a part of an IoT network
use real-time analytics to predict if a patient will face any problem based on their present
condition.
Diagnostics
An integral part of medical services, diagnosis can be made easier and quicker by data
science applications in healthcare. Not only does the patient’s data analysis facilitate early
detection of health issues, but medical heatmaps pertaining to demographic patterns of
ailments can also be prepared.
Predictive Analytics in Healthcare
A predictive analytical model utilizes historical data, finds patterns from the data, and
generates accurate predictions. The data could entail anything from a patient's blood
pressure and body temperature to sugar level.
Application of Data Science
Predictive models in Data Science correlate and associate every data point to symptoms,
habits, and diseases. This enables the identification of a disease's stage, the extent of
damage, and an appropriate treatment measure. Predictive analytics in healthcare also
helps:
The study and evaluation of ethical problems connected to data have given rise to a new field of study in ethics, known as ethics for data science. Data may be collected, recorded, produced, processed, shared, and used, among other things. It also encompasses different data and technologies, such as programming, hacking, professional codes, and algorithms.
The importance of ethics in data science has been felt because there has to be a clear set of
rules governing what businesses can and cannot do with the personal information they
acquire from customers.
Principles of Data Ethics
● The word privacy does not imply confidentiality since private information may be
required for audits depending on the needs of the legal process. However, this
sensitive information is obtained from a person with their permission. Additionally, it
is stated that the information must not be made public so that other people or
businesses might use it to determine the user's identity.
● Private information that has been disclosed should never be made public. In order to
protect the privacy and comply with regulations, they must also impose limitations on
how the data may be shared.
Privacy, Security, and Confidentiality of Data
A further ethical obligation associated with data management is privacy security and ethics
in data science. Customers may not want their Personally Identifiable Information (PII)
made public even if they allow your organization authority to gather, keep, and analyze it.
PII refers to any data associated with a specific person's identity, and PII includes, for
example:
● Bank account number
● Birthdate
● Credit card information
● Full name
● Passport number
● Phone number
● Social Security card
● Street address
Future of Data Science
Rise of Automation:
With an increase in the complexity of operations, there is always a drive to simplify processes.
Scarcity or Abundance of Data Scientists:
Today, thousands of individuals learn Data Science related skills through college degrees or the numerous resources that can be found online, and this could result in newer aspirants getting a feeling of saturation in this domain.