
Introduction to Data Science

By Dr. Deepika Chaudhary


Introduction to Data Science
Data science is an interdisciplinary field that
uses scientific
methods, processes, algorithms and systems
to extract knowledge and insights from
noisy, structured and unstructured
data,[1][2] and apply knowledge and actionable
insights from data across a broad range of
application domains. Data science is related
to data mining, machine learning and big data.
Data science is a "concept to
unify statistics, data analysis, informatics, and
their related methods" in order to
"understand and analyze actual phenomena"
with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.
Who is a Data Scientist?
• A data scientist is someone who creates programming code and combines it with statistical knowledge to create insights from data.
• Data science is an interdisciplinary field focused on extracting knowledge
from data sets, which are typically large (see big data), and applying the
knowledge and actionable insights from data to solve problems in a wide range of
application domains.[8]
• The field encompasses preparing data for analysis, formulating data science
problems, analyzing data, developing data-driven solutions, and presenting
findings to inform high-level decisions in a broad range of application domains.
• As such, it incorporates skills from computer science, statistics, information
science, mathematics, data visualization, information visualization, data
sonification, data integration, graphic design, complex
systems, communication and business.[9][10]
• Statistician Nathan Yau, drawing on Ben Fry, also links data science to human–
computer interaction: users should be able to intuitively control
and explore data.[11][12] In 2015, the American Statistical
Association identified database management, statistics and machine learning,
and distributed and parallel systems as the three emerging foundational
professional communities.[13]
Introduction to Data
• The quantities, characters, or symbols on
which operations are performed
by a computer, which may be stored and
transmitted in the form of electrical signals
and recorded on magnetic, optical, or
mechanical recording media.
What is Big Data?
• Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently.
Few Examples
• The New York Stock Exchange is an example
of Big Data that generates about one
terabyte of new trade data per day.
• Social Media
• Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
• Following are the types of Big Data:
• Structured
• Unstructured
• Semi-structured
Structured

• Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. However, we now foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Unstructured

• Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value from it, since this data is in its raw, unstructured form.
Semi-structured
• Semi-structured data can contain both forms of data. It appears structured in form, but it is not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file, as the sketch below shows.
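• To illustrate, here is a minimal Python sketch (the <employee> records are hypothetical) that parses a small XML fragment: the tags describe the structure of each record, but records need not follow a fixed schema the way rows of a relational table do.

import xml.etree.ElementTree as ET

xml_data = """
<employees>
    <employee><name>Asha</name><dept>Sales</dept></employee>
    <employee><name>Ravi</name></employee>  <!-- missing <dept>: no fixed schema -->
</employees>
"""

root = ET.fromstring(xml_data)
for emp in root.findall("employee"):
    name = emp.findtext("name")
    dept = emp.findtext("dept", default="unknown")
    print(name, dept)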
Characteristics Of Big Data

• Big data can be described by the following characteristics:
• Volume
• Variety
• Velocity
• Variability
Volume
• The name Big Data itself relates to an enormous size. The size of data plays a crucial role in determining its value, and whether particular data can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that needs to be considered when dealing with Big Data solutions.
Variety
• Variety refers to heterogeneous sources and the
nature of data, both structured and
unstructured. During earlier days, spreadsheets
and databases were the only sources of data
considered by most of the applications.
Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. are
also being considered in the analysis applications.
This variety of unstructured data poses certain
issues for storage, mining and analyzing data.
Velocity
• The term ‘velocity’ refers to the speed of
generation of data. How fast the data is
generated and processed to meet the demands,
determines real potential in the data.
• Big Data Velocity deals with the speed at which
data flows in from sources like business
processes, application logs, networks, and social
media sites, sensors, Mobile devices, etc. The
flow of data is massive and continuous.
Variability
• This refers to the inconsistency which can be
shown by the data at times, thus hampering
the process of being able to handle and
manage the data effectively.
Advantages Of Big Data Processing

• Businesses can utilize outside intelligence when making decisions.
• Improved customer service
• Early identification of risk to the
product/services, if any
• Better operational efficiency
Analysis
You have a huge data set containing data of various types. Instead of tackling the entire data set and running the risk of becoming overwhelmed, you separate it into easier-to-digest chunks, study them individually, and examine how they relate to the other parts.
And that's analysis in a nutshell.
One important thing to remember, however, is that you perform analyses on things that have already happened in the past, such as using an analysis to explain how a story ended the way it did, or why there was a decrease in sales last summer. All this means that we do analyses to explain how and/or why something happened.
ANALYTICS
• Instead of explaining past events, it explores
potential future ones. Analytics is essentially
the application of logical and computational
reasoning to the component parts obtained
in an analysis. And in doing this, you are
looking for patterns and exploring what you
could do with them in the future.
Types of Analytics
• Qualitative
• Quantitative
Difference
• Qualitative = intuition + analysis
• Quantitative = applying numbers and formulas to the facts you have gathered from the analysis.
What is the difference between Analysis and Analytics?
Quantitative Analytics
• This is applying formulas and algorithms to
numbers you have gathered from your
analysis.
Types of Analytics
• Descriptive Analytics, which tells you what
happened in the past
• Diagnostic Analytics, which helps you understand
why something happened in the past
• Predictive Analytics, which predicts what’s most
likely to happen in the future
• Prescriptive Analytics, which recommends
actions you can take to affect those likely
outcomes
What Is Descriptive Analytics?

• Descriptive analytics is typically the starting point in business intelligence. It uses data aggregation and data mining to collect and organize historical data, producing visualizations such as line graphs, bar charts, and pie charts. Descriptive analytics presents a clear picture of what has happened in the past, much as statistical modeling does, and it stops there — it doesn't make interpretations or advise on future actions.
What does descriptive analytics
show?
• Descriptive analytics is helpful to identify
answers to simple questions about what
occurred in the past. When you’re doing this
type of analytics, you’ll typically start by
identifying KPIs as benchmarks for
performance in a given business area (sales,
finance, operations, etc.). Next, you’ll
determine what data sets will inform the
analysis and where to source them from, then
collect and prepare them.
Examples of descriptive analytics

• Descriptive analytics can benefit decision-makers from every department in a company, from finance to operations. Here are a few examples (a short pandas sketch follows this list):
• The sales team can learn which customer segments
generated the highest dollar amount in sales last year.
• The marketing team can uncover which social media
platforms delivered the best return on advertising
investment last quarter.
• The finance team can track month-over-month and
year-over-year revenue growth or decline.
• Operations can track demand for SKUs across
geographic locations throughout the past year.
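• As a concrete illustration of the first example above, here is a minimal pandas sketch (the column names and figures are hypothetical) that aggregates last year's sales by customer segment:

import pandas as pd

# Hypothetical transaction data; column names are illustrative only.
sales = pd.DataFrame({
    "segment": ["Enterprise", "SMB", "Enterprise", "Consumer", "SMB"],
    "amount":  [120000, 18000, 95000, 2500, 22000],
})

# Descriptive analytics: summarize what happened, nothing more.
by_segment = sales.groupby("segment")["amount"].sum().sort_values(ascending=False)
print(by_segment)           # dollar amount generated by each segment
print(by_segment.idxmax())  # segment with the highest total sales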
What Is Diagnostic Analytics?

• Once you know what happened, you’ll want to know why it happened. That’s where diagnostic analytics comes in. Understanding why a trend is developing or why a problem occurred will make your business intelligence actionable. It prevents your team from making inaccurate guesses, particularly related to confusing correlation and causality.
What does diagnostic analytics show?

• Typically, there is more than one contributing factor to any given trend or event. Diagnostic analytics can reveal the full spectrum of causes, ensuring you see the complete picture. You can also see which factors are most impactful and zero in on them. For diagnostic analytics, you’ll use some of the same techniques as descriptive analytics, but you’ll dive deeper with drill-down and correlations. You may also need to bring in outside datasets to more fully inform your analysis. Sigma makes this easy, especially when connected with Snowflake’s powerful capabilities.
Examples of diagnostic analytics

• The sales team can identify shared characteristics and behaviors of profitable customer segments that may explain why they’re spending more.
• The marketing team can look at unique characteristics of
high-performing social media ads compared to more
poorly-performing ones to identify the reasons for
performance differences.
• The finance team can compare the timing of key initiatives
to month-over-month and year-over-year revenue growth
or decline to help determine correlations.
• Operations can look at regional weather patterns to see if
they’re contributing to demand for particular SKUs across
geographic locations.
What Is Predictive Analytics?

• When you know what happened in the past and understand why it happened, you can then begin to predict what is likely to occur in the future based on that information. Predictive analytics takes the investigation a step further, using statistics, computational modeling, and machine learning to determine the probability of various outcomes.
What does predictive analytics show?

• One of the most valuable forms of predictive analytics is what-if analysis, which involves changing various values to see how those changes will affect the outcome. When business teams are able to conduct rapid, iterative analysis to evaluate options, they’re empowered to make better decisions faster. Sigma was designed with this capability.
Examples of predictive analytics

Predictive analytics is especially powerful for teams because it allows decision-makers to be more confident about the future. Here are a few examples:
• The sales team can learn the revenue potential of a
particular customer segment.
• The marketing team can predict how much revenue
they’re likely to generate with an upcoming campaign.
• The finance team can create more accurate projections
for the next fiscal year.
• The operations team can better predict demand for
various products in different regions at specific points
in the upcoming year.
What is Prescriptive Analytics?

• Prescriptive analytics is where the action is. This type of analytics tells teams what they need to do based on the predictions made. It’s the most complex type, which is why less than 3% of companies are using it in their business.
• While using AI in prescriptive analytics is
currently making headlines, the fact is that this
technology has a long way to go in its ability to
generate relevant, actionable insights.
What does prescriptive analytics
show?
• Prescriptive analytics anticipates what, when,
and why an event or trend might happen. It
tells you what actions have the highest
potential for the best outcome. It allows
teams to fix problems, improve performance,
and jump on valuable opportunities.
Examples of prescriptive analytics

• While the amount of data necessary for prescriptive analytics means that it won’t make sense for daily use, prescriptive analytics has a wide variety of applications. For example:
• How the sales team can improve the sales process for
each target vertical.
• Helping the marketing team determine what product
to promote next quarter.
• Ways the finance team can optimize risk management.
• Help the operations team determine how to optimize
warehousing.
Cognitive Analysis
• It refers to the process of analyzing and interpreting complex data such as text, images or speech to extract meaningful insights, patterns and relationships. (NLP + Contextual Understanding + Pattern Recognition + Inference & Reasoning)
Introduction to the buzz words and
how they are related.
Difference between the Buzz Words
• Business Intelligence
• Data Mining
• Data Analytics
• Data Science
• Machine Learning
Introduction to Machine Learning
• Machine learning (ML) is the study of
computer algorithms that can improve
automatically through experience and by the
use of data.[1] It is seen as a part of artificial
intelligence. Machine learning algorithms
build a model based on sample data, known
as training data, in order to make predictions
or decisions without being explicitly
programmed to do so
• Machine learning programs can perform tasks
without being explicitly programmed to do so.
It involves computers learning from data
provided so that they carry out certain tasks.
• For simple tasks assigned to computers, it is
possible to program algorithms telling the
machine how to execute all steps required to
solve the problem at hand; on the computer's
part, no learning is needed.
• For more advanced tasks, it can be challenging
for a human to manually create the needed
algorithms. In practice, it can turn out to be
more effective to help the machine develop its
own algorithm, rather than having human
programmers specify every needed step.
History of Machine Learning
• The term machine learning was coined in 1959
by Arthur Samuel, an American IBMer and
pioneer in the field of computer
gaming and artificial intelligence.[11][12] Also the
synonym self-teaching computers was used in this
time period.[13][14] A representative book on machine learning research during the 1960s was Nilsson's book Learning Machines, dealing
mostly with machine learning for pattern
classification.[15] Interest related to pattern
recognition continued into the 1970s, as
described by Duda and Hart in 1973.
Definition
• Tom M. Mitchell provided a widely quoted,
more formal definition of the algorithms
studied in the machine learning field: "A
computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P if its
performance at tasks in T, as measured by P,
improves with experience E."
AI and ML (Relationship)
• As a scientific endeavor, machine learning grew out of the quest for artificial intelligence. In the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what was then termed "neural networks"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics.[23] Probabilistic reasoning was also employed, especially in automated medical diagnosis.
• Machine learning (ML), reorganized as a
separate field, started to flourish in the 1990s.
The field changed its goal from achieving
artificial intelligence to tackling solvable
problems of a practical nature. It shifted focus
away from the symbolic approaches it had
inherited from AI, and toward methods and
models borrowed from statistics
and probability theory.[25]
• The difference between ML and AI is
frequently misunderstood. ML learns and
predicts based on passive observations,
whereas AI implies an agent interacting with
the environment to learn and take actions that
maximize its chance of successfully achieving
its goals.
Difference between ML & AI
[Figure captions: ML as a subset of AI; part of machine learning as a subfield of AI]
Machine Learning and Data Mining
• Machine learning and data mining often
employ the same methods and overlap
significantly, but while machine learning
focuses on prediction, based
on known properties learned from the training
data, data mining focuses on the discovery of
(previously) unknown properties in the data
(this is the analysis step of knowledge
discovery in databases).
Machine Learning and Statistics
• Machine learning and statistics are closely related
fields in terms of methods, but distinct in their
principal goal: statistics draws
population inferences from a sample, while
machine learning finds generalizable predictive
patterns.[32] According to Michael I. Jordan, the
ideas of machine learning, from methodological
principles to theoretical tools, have had a long
pre-history in statistics.[33] He also suggested the
term data science as a placeholder to call the
overall field.
Introduction to Data Science
Programming Languages
• R was created by Ross Ihaka and Robert Gentleman — two
statisticians from the University of Auckland in New Zealand. It was
initially released in 1995 and they launched a stable beta version in
2000. It’s an interpreted language (you don’t need to run it through
a compiler before running the code) and has an extremely powerful
suite of tools for statistical modeling and graphing.
• For programming nerds, R is an implementation of S — a statistical
programming language developed in the 1970s at Bell Labs— and it
was inspired by Scheme — a variant of Lisp. It’s also extensible,
making it easy to call R objects from many other programming
languages.
• R is free and has become increasingly popular at the expense of
traditional commercial statistical packages like SAS and SPSS. Most
users write and edit their R code using RStudio, an Integrated
Development Environment (IDE) for coding in R.
Introduction to Python
• Python has also been around for a while. It was initially released in
1991 by Guido van Rossum as a general purpose programming
language. Like R, it’s also an interpreted language, and has a
comprehensive standard library which allows for easy programming
of many common tasks without having to install additional libraries.
Python has some of the most robust coding libraries there are.
They’re also available for free.
• For data science, there are a number of extremely powerful Python
libraries. There’s NumPy (efficient numerical computations), Pandas
(a wide range of tools for data cleaning and analysis), and
StatsModels (common statistical methods). You also have
TensorFlow, Keras and PyTorch (all libraries for building artificial
neural networks – deep learning systems).
• These days, many data scientists using Python
write and edit their code using Jupyter
Notebooks. Jupyter Notebooks allow for the
easy creation of documents that are a mix of
prose, code, data, and visualizations, making it
easy to document your process and for other
data scientists to review and replicate your
work.
Picking languages for data science
• Historically there has been a fairly even split in
the data science and data analysis community.
R vs Python for data science boils down to a
scientist’s background. Typically data scientists
with a stronger academic or mathematical
data science background preferred R, whereas
data scientists who had more of a
programming background tended to prefer
Python.
The strengths of Python

• Python is a general purpose programming language. It’s great for statistical analysis, but Python code will be the more flexible, capable choice if you want to build a website for sharing your results or a web service to integrate easily with your production systems.
• In the September 2019 Tiobe index of the
most popular programming languages, Python
is the third most popular programming
language (and has grown by over 2% in the
last year) in all of computer science and
software development, whereas R has
dropped over the last year from 18th to 19th
place.
Categorical Attributes
• Categorical attributes refer to data that belongs to a specific category or class, i.e., the classes and labels that a particular type of model should predict.
Numerical Attributes
• A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values.
• For numerical attributes there is a slight change in the formula for the expected value.
• We take every element in the sample space, multiply it by its probability, and then add all those products up to get the expected value: E(X) = Σ x · P(x). A small sketch follows below.
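• As a minimal illustration (a fair six-sided die is assumed), the expected value sums each outcome weighted by its probability:

# Expected value of a fair six-sided die: E(X) = sum of x * P(x).
outcomes = [1, 2, 3, 4, 5, 6]
prob = 1 / 6                       # each outcome is equally likely

expected_value = sum(x * prob for x in outcomes)
print(expected_value)              # 3.5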
What is a frequency distribution table?
• The frequency distribution table is a table
matching each distinct outcome in the sample
space to its associated frequency
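• A quick way to build such a table in Python is with pandas value_counts(); the sample data here is hypothetical:

import pandas as pd

# Hypothetical sample: outcomes of 10 die rolls.
rolls = pd.Series([3, 6, 3, 2, 5, 3, 6, 1, 2, 6])

# Frequency distribution table: each distinct outcome and its frequency.
freq_table = rolls.value_counts().sort_index()
print(freq_table)

# Relative frequencies (proportions) are often shown alongside.
print(rolls.value_counts(normalize=True).sort_index())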
Introduction to Statistics
• Population: the collection of all items of interest, represented by N. Measures that describe the population are called parameters.
• Sample: a subset of the population, represented by n. Measures that describe the sample are called statistics.
Sample
• Two defined characteristics of sample
• A) Randomness – A random sample is collected
when each member of the sample is chosen from
population strictly by chance.
• B) Representativeness – A representative sample is a subset of the population that accurately reflects the members of the entire population.
• While choosing the sample it must be taken care
that the sample is both random and
representative.
Characteristics of Data
• Types of Data
• Categorical Data
• - Car Brands
• - Answers to yes and no question
• Numerical Data
• Discrete Data – data that can take only particular, separate values, such as 0, 1 or 2.
• Continuous Data – data that can take any value in a range. For example, if I define my random variable to be the amount of sugar in an orange, it can take any value such as 1.4 g, 1.45 g, 1.456 g, 1.4568 g and so on. Other examples are height, area, distance and time.
Measurement Levels
• Levels of measurement, also called scales of
measurement, tell you how
precisely variables are recorded. In scientific
research, a variable is anything that can take
on different values across your data set (e.g.,
height or test scores).
• There are 4 levels of measurement:
• Nominal: the data can only be categorized
• Ordinal: the data can be categorized and
ranked
• Interval: the data can be categorized, ranked,
and evenly spaced
• Ratio: the data can be categorized, ranked,
evenly spaced, and has a natural zero.
• Depending on the level of measurement of
the variable, what you can do to analyze your
data may be limited. There is a hierarchy in
the complexity and precision of the level of
measurement, from low (nominal) to high
(ratio).
Nominal level
• Examples of nominal scales: you can categorize your data by labelling them in mutually exclusive groups, but there is no order between the categories.
• City of birth
• Gender
• Ethnicity
• Car brands
• Marital status
Ordinal level
• You can categorize and rank your data in an order,
but you cannot say anything about the intervals
between the rankings.
• Although you can rank the top 5 Olympic
medallists, this scale does not tell you how close
or far apart they are in number of wins.
• Top 5 Olympic medallists
• Language ability (e.g., beginner, intermediate,
fluent)
• Likert-type questions (e.g., very dissatisfied to
very satisfied)
Interval level
• You can categorize, rank, and infer equal
intervals between neighboring data points, but there is
no true zero point.
• The difference between any two adjacent
temperatures is the same: one degree. But zero
degrees is defined differently depending on the scale –
it doesn’t mean an absolute absence of temperature.
• The same is true for test scores and personality
inventories. A zero on a test is arbitrary; it does not
mean that the test-taker has an absolute lack of the
trait being measured.
Interval Level
• Arbitrary zero point – interval scales have an arbitrary zero point, which can make it difficult to compare and analyze data.
• Limited mathematical operations
• No meaningful ratios – differences between values are meaningful, but because the zero point is arbitrary, ratios of values are not.
Example
• Test scores (e.g., IQ or exams)
• Personality inventories
• Temperature in Fahrenheit or Celsius
Ratio level
• You can categorize, rank, and infer equal
intervals between neighboring data points, and
there is a true zero point.
• A true zero means there is an absence of the
variable of interest. In ratio scales, zero does
mean an absolute lack of the variable.
• For example, in the Kelvin temperature scale,
there are no negative degrees of temperature –
zero means an absolute lack of thermal energy.
• True Zero Point
• Equal Intervals
• Order and magnitude
• Mathematical Operations
Example
• Height
• Age
• Weight
• Temperature in Kelvin
When to use Both Scales
• Ratio
– Physical measurements
– Quantitative data
• Interval
– Social sciences
– Attitudinal sciences
Why are levels of measurement
important?
• The level at which you measure a variable
determines how you can analyze your data.
• The different levels limit which descriptive
statistics you can use to get an overall summary
of your data, and which type of inferential
statistics you can perform on your data to
support or refute your hypothesis.
• In many cases, your variables can be measured at
different levels, so you have to choose the level of
measurement you will use before data collection
begins.
• Example of a variable at 2 levels of measurement: you can measure the variable of income at an ordinal or ratio level.
• Ordinal level: You create brackets of income
ranges: $0–$19,999, $20,000–$39,999, and
$40,000–$59,999. You ask participants to
select the bracket that represents their annual
income. The brackets are coded with numbers
from 1–3.
• Ratio level: You collect data on the exact
annual incomes of your participants.
• At a ratio level, you can see that the difference
between A and B’s incomes is far greater than
the difference between B and C’s incomes.
• At an ordinal level, however, you only know
the income bracket for each participant, not
their exact income. Since you cannot say
exactly how much each income differs from
the others in your data set, you can only order
the income levels and group the participants.
Descriptive Statistics
• Descriptive statistics help you get an idea of the
“middle” and “spread” of your data through
measures of central tendency and variability.
• When measuring the central tendency or
variability of your data set, your level of
measurement decides which methods you can
use based on the mathematical operations that
are appropriate for each level.
• The methods you can apply are cumulative; at
higher levels, you can apply all mathematical
operations and measures used at lower levels.
Variability
• Variability describes how far apart data points lie from
each other and from the center of a distribution. Along
with measures of central tendency, measures of variability
give you descriptive statistics that summarize your data.
• Variability is also referred to as spread, scatter or
dispersion. It is most commonly measured with the
following:
• Range: the difference between the highest and lowest
values
• Interquartile range: the range of the middle half of a
distribution
• Standard deviation: average distance from the mean
• Variance: average of squared distances from the mean
Measures of central tendency
• help you find the middle, or the average, of a data set.
The 3 most common measures of central tendency are
the mode, median, and mean.
• Mode: the most frequent value.
• Median: the middle number in an ordered data set.
• Mean: the sum of all values divided by the total
number of values.
• In addition to central tendency, the variability and
distribution of your data set is important to understand
when performing descriptive statistics.
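• A minimal sketch using Python's standard statistics module (the data values are hypothetical):

import statistics

scores = [4, 7, 7, 8, 10, 12, 15]    # hypothetical, already ordered

print(statistics.mode(scores))       # 7  – most frequent value
print(statistics.median(scores))     # 8  – middle value of the ordered data
print(statistics.mean(scores))       # 9  – sum of values / number of values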
Distributions and central tendency

• A data set is a distribution of n number of scores or values.
• Normal distribution
• In a normal distribution, data is symmetrically distributed with no skew. Most values cluster around a central region, with values tapering off as they go further away from the center. The mean, mode and median are exactly the same in a normal distribution.
Skewed distributions

• In skewed distributions, more values fall on one side of the center than the other, and the mean, median and mode all differ from each other. One side has a more spread out and longer tail with fewer scores at one end than the other. The direction of this tail tells you the side of the skew.
• In a positively skewed distribution, there’s a cluster of lower scores and a spread out tail on the right. In a negatively skewed distribution, there’s a cluster of higher scores and a spread out tail on the left.
Positively Skewed Distribution
In a positively skewed distribution, mean > median > mode.
Negatively Skewed
In a negatively skewed distribution, mean < median < mode.
Mode

• The mode is the most frequently occurring value in the data set. It’s possible to have no mode, one mode, or more than one mode.
• To find the mode, sort your data set numerically or categorically and select the response that occurs most frequently.
• The mode is easily seen in a bar graph because it is the value with the highest bar.
When to use the mode

• The mode is most applicable to data from a nominal level of measurement. Nominal data is classified into mutually exclusive categories, so the mode tells you the most popular category.
• For continuous variables or ratio levels of measurement, the mode may not be a helpful measure of central tendency. That’s because there are many more possible values than there are in a nominal or ordinal level of measurement. It’s unlikely for a value to repeat in a ratio level of measurement.
Median

• The median of a data set is the value that’s exactly in the middle when it is ordered from low to high.
Mean
• The arithmetic mean of a dataset (which is
different from the geometric mean) is the sum
of all values divided by the total number of
values. It’s the most commonly used measure
of central tendency because all values are
used in the calculation.
Outlier effect on the mean

• Outliers can significantly increase or decrease the mean when they are included in the calculation. Since all values are used to calculate the mean, it can be affected by extreme outliers. An outlier is a value that differs significantly from the others in a data set.
Population versus sample mean

• A data set contains values from a sample or a population. A population is the entire group that you are interested in researching, while a sample is only a subset of that population.
• While data from a sample can help you make estimates about a population, only full population data can give you the complete picture.
• In statistics, the notation of a sample mean and a population mean and their formulas are different. But the procedures for calculating the population and sample means are the same.
When should you use the mean,
median or mode?
• The 3 main measures of central tendency are best used in
combination with each other because they have
complementary strengths and limitations. But sometimes
only 1 or 2 of them are applicable to your data set,
depending on the level of measurement of the variable.
• The mode can be used for any level of measurement, but
it’s most meaningful for nominal and ordinal levels.
• The median can only be used on data that can be ordered –
that is, from ordinal, interval and ratio levels of
measurement.
• The mean can only be used on interval and ratio levels of
measurement because it requires equal spacing between
adjacent values or scores in the scale.
Examples
• For normally distributed data, all three
measures of central tendency will give you the
same answer so they can all be used.
• In skewed distributions, the median is the best
measure because it is unaffected by extreme
outliers or non-symmetric distributions of
scores. The mean and mode can vary in
skewed distributions.
Which descriptive statistics can I
apply on my data?
Variability
• Describes how far apart data points lie from
each other and from the center of a
distribution. Along with measures of central
tendency, measures of variability give
you descriptive statistics that summarize your
data.
• Variability is also referred to as spread, scatter or
dispersion. It is most commonly measured with
the following:
• Range: the difference between the highest and
lowest values
• Interquartile range: the range of the middle half
of a distribution
• Standard deviation: average distance from the
mean
• Variance: average of squared distances from the
mean
• While the central tendency, or average, tells you
where most of your points lie, variability
summarizes how far apart they are. This is
important because the amount of variability
determines how well you can generalize results
from the sample to your population.
• Low variability is ideal because it means that you
can better predict information about
the population based on sample data. High
variability means that the values are less
consistent, so it’s harder to make predictions.
Why does variability matter?

• Data sets can have the same central tendency but different levels of variability, or vice versa. If you know only the central tendency or the variability, you can’t say anything about the other aspect. Both of them together give you a complete picture of your data.
Range

• The range tells you the spread of your data from the lowest to the highest value in the distribution. It’s the easiest measure of variability to calculate.
• To find the range, simply subtract the lowest value from the highest value in the data set.
• How useful is the range?
• The range generally gives you a good indicator of
variability when you have a distribution without
extreme values. When paired with measures of
central tendency, the range can tell you about the
span of the distribution.
• But the range can be misleading when you have
outliers in your data set. One extreme value in
the data will give you a completely different
range.
Interquartile range

• The interquartile range gives you the spread of the middle of your distribution.
• For any distribution that’s ordered from low to high, the interquartile range contains half of the values. While the first quartile (Q1) contains the first 25% of values, the fourth quartile (Q4) contains the last 25% of values.
• the interquartile range tells you the spread of
the middle half of your distribution.
• Quartiles segment any distribution that’s
ordered from low to high into four equal
parts. The interquartile range (IQR) contains
the second and third quartiles, or the middle
half of your data set.
Steps for the exclusive method

• To see how the exclusive method works by hand, we’ll use two examples: one with an even number of data points, and one with an odd number (a worked Python sketch follows the bullet points below).
• The interquartile range has a breakdown point of 25%, due to which it is often preferred over the total range.
• The IQR is used to build box plots, simple graphical representations of a probability distribution.
• The IQR can also be used to identify the outliers in the given data set.
• Together with the median, the IQR describes the spread of the middle half of the data.
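• A minimal sketch of the exclusive method (the two small data sets are hypothetical): the overall median is excluded from both halves, then Q1 and Q3 are the medians of the lower and upper halves.

import statistics

def iqr_exclusive(data):
    """Interquartile range by the exclusive method: the overall median
    is left out of both halves before taking their medians."""
    x = sorted(data)
    n = len(x)
    half = n // 2
    lower = x[:half]                              # values below the median position
    upper = x[half + 1:] if n % 2 else x[half:]   # drop the median itself when n is odd
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q3 - q1

print(iqr_exclusive([1, 3, 5, 7, 9, 11]))       # even n: halves [1,3,5] and [7,9,11] -> 9 - 3 = 6
print(iqr_exclusive([2, 4, 6, 8, 10, 12, 14]))  # odd n: median 8 excluded -> 12 - 4 = 8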
Decision Making
• The data set with the higher value of interquartile range (IQR) has more variability.
• The data set with the lower value of interquartile range (IQR) is preferable.
• Suppose we have two data sets with interquartile ranges IR1 and IR2. If IR1 > IR2, then the data in IR1 is said to have more variability than the data in IR2, and the data in IR2 is preferable.
• Just like the range, the interquartile range
uses only 2 values in its calculation. But the
IQR is less affected by outliers: the 2 values
come from the middle half of the data set, so
they are unlikely to be extreme scores.
• The IQR gives a consistent measure of
variability for skewed as well as normal
distributions.
Five-number summary

• Every distribution can be organized using a five-number summary:
• Lowest value
• Q1: 25th percentile
• Q2: the median
• Q3: 75th percentile
• Highest value (Q4)
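• A minimal numpy sketch (hypothetical data) that computes the five-number summary:

import numpy as np

data = np.array([12, 15, 17, 20, 22, 25, 28, 31, 35, 40])   # hypothetical values

five_num = {
    "min": data.min(),
    "Q1": np.percentile(data, 25),
    "median": np.median(data),
    "Q3": np.percentile(data, 75),
    "max": data.max(),
}
print(five_num)
print("IQR:", five_num["Q3"] - five_num["Q1"])   # spread of the middle 50% of the data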
Visualize the interquartile range in
boxplots
• In a boxplot, the width of the box shows you the
interquartile range. A smaller width means you have
less dispersion, while a larger width means you have
more dispersion. Boxplots are especially useful for
showing the central tendency and dispersion of skewed
distributions.
• The placement of the box tells you the direction of the
skew. A box that’s much closer to the right side means
you have a negatively skewed distribution, and a box
closer to the left side tells you that you have a
positively skewed distribution.
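• A short matplotlib sketch (randomly generated data, assumed for illustration) that draws a boxplot, where the width of the box shows the IQR:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=200)   # a positively skewed sample

plt.boxplot(skewed, vert=False)                 # box spans Q1..Q3, the line marks the median
plt.xlabel("value")
plt.title("Boxplot: box width = interquartile range")
plt.show()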
Standard deviation

• The standard deviation is the average amount of variability in your dataset.
• It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.
• There are six steps for finding the standard deviation by hand:
• List each score and find their mean.
• Subtract the mean from each score to get the deviation from the
mean.
• Square each of these deviations.
• Add up all of the squared deviations.
• Divide the sum of the squared deviations by n – 1 (for a sample)
or N (for a population).
• Find the square root of the number you found.
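• The steps above translate directly into a small Python sketch (the sample data is hypothetical); statistics.stdev is used only as a cross-check:

import math
import statistics

scores = [46, 69, 32, 60, 52, 41]        # hypothetical sample

mean = sum(scores) / len(scores)                     # step 1: mean
deviations = [x - mean for x in scores]              # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]               # step 3: square them
total = sum(squared)                                 # step 4: sum of squared deviations
variance = total / (len(scores) - 1)                 # step 5: divide by n - 1 (sample)
std_dev = math.sqrt(variance)                        # step 6: square root

print(round(std_dev, 2))
print(round(statistics.stdev(scores), 2))            # built-in sample standard deviation agrees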
Variance
• The variance is the average of squared deviations from the
mean. A deviation from the mean is how far a score lies
from the mean.
• Variance is the square of the standard deviation. This
means that the units of variance are much larger than
those of a typical value of a data set.
• While it’s harder to interpret the variance number
intuitively, it’s important to calculate variance for
comparing different data sets in statistical tests
like ANOVAs.
• Variance reflects the degree of spread in the data set. The
more spread the data, the larger the variance is in relation
to the mean.
• Variance example
• To get variance, square the standard
deviation.
• s ≈ 95.5
• s² ≈ 9129.14 (squaring the unrounded standard deviation, ≈ 95.55)
• The variance of your data is approximately 9129.14.
Descriptive Statistics (Summary)
• Descriptive statistics is concerned with describing
and summarizing the features of a dataset. It
involves methods such as calculating measures of
central tendency (mean, median, mode),
measures of dispersion (variance, standard
deviation), and visualizing data through graphs
and charts (histograms, box plots). Descriptive
statistics are used to understand the basic
characteristics of the data, such as its
distribution, variability, and central tendency.
Inferential Statistics
• Descriptive statistics allow you to describe a
data set, while inferential statistics allow you
to make inferences based on a data set.
• It relies on probability theory and probability distributions to make inferences and predictions about the future.
Inferential Statistics
• Inferential statistics, on the other hand, involves using
sample data to make inferences or draw conclusions about
a larger population.
• It allows researchers to generalize their findings from the
sample to the population and to make predictions or
hypotheses about the population based on the sample
data.
• Inferential statistics includes techniques such as hypothesis
testing, confidence intervals, and regression analysis.
• These techniques help researchers assess the reliability of
their findings and determine whether they are likely to
apply to the broader population.
Types of Inferential Statistics

• Hypothesis Testing
• Regression Analysis
• Confidence Intervals
Hypothesis Testing
• Hypothesis testing involves making decisions
about a population parameter based on
sample data. It typically involves formulating a
null hypothesis (H0) and an alternative
hypothesis (Ha), collecting sample data, and
using statistical tests to determine whether
there is enough evidence to reject the null
hypothesis in favor of the alternative
hypothesis.
Regression Analysis
• Regression analysis is used to examine the
relationship between one or more
independent variables and a dependent
variable. It helps in predicting the value of the
dependent variable based on the values of the
independent variables.
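• As a minimal illustration (hypothetical advertising-spend data; numpy's least-squares polyfit stands in for a full regression package), fitting a line lets you predict the dependent variable from an independent variable:

import numpy as np

# Hypothetical data: advertising spend (independent) vs. sales (dependent).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

slope, intercept = np.polyfit(spend, sales, deg=1)   # ordinary least-squares line
print(f"sales ≈ {slope:.2f} * spend + {intercept:.2f}")

# Predict the dependent variable for a new value of the independent variable.
print(slope * 6.0 + intercept)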
Confidence Intervals
• Confidence intervals provide a range of values
within which the true population parameter is
likely to fall with a certain level of confidence.
For example, a 95% confidence interval for the
population mean indicates that we are 95%
confident that the true population mean falls
within the interval.
Hypothesis Testing
• Hypothesis testing is an important part of
statistics. It's like a detective game where we
have two guesses about something in a group:
one saying there's no difference, and the
other saying there is. We collect data from a
smaller group and use statistics to see if we
can prove one guess is more likely. It helps us
decide if our ideas about the whole group are
true or not.
Hypothesis Testing
• The null hypothesis and the alternative
hypothesis are two competing statements in
statistical hypothesis testing. They are used to
make inferences about a population based on
sample data.
Null Hypothesis (H0)
• Definition: The null hypothesis states that there is no
effect, no difference, or no relationship between
variables. It represents the status quo or the
assumption that nothing has changed.
• Purpose: It is the hypothesis that researchers aim to
test or reject.
• Example: Suppose we are testing whether a new drug is effective in reducing blood pressure. The null hypothesis would be H0: the mean blood pressure after using the drug is equal to the mean blood pressure before using the drug, indicating no effect.
Alternative Hypothesis (Ha)
• Definition: The alternative hypothesis states that
there is an effect, a difference, or a relationship
between variables. It is the claim that researchers
want to support.
• Purpose: It is accepted if there is sufficient
evidence to reject the null hypothesis.
• Example: Continuing the same example, the alternative hypothesis could be Ha: the mean blood pressure after using the drug is different from the mean blood pressure before, indicating the drug has an effect.
Key Differences
What is a P-value?
• The p-value is a statistical measure that quantifies the
probability of observing a result at least as extreme as the
one obtained, assuming the null hypothesis (H0​) is true.
• Low p-value: Suggests that the observed data is unlikely
under H0 and therefore provides evidence to reject H0.
• High p-value: Indicates the observed data is consistent with
H0 , and there is insufficient evidence to reject it.
• Key Points:
• Range: 0 ≤ p ≤ 1
• Interpretation:
– p ≤ α: Reject the null hypothesis (statistically significant).
– p > α: Fail to reject the null hypothesis (not statistically significant).
What is Alpha (α)?
• Alpha (α) is the significance level: the threshold you compare the p-value against, i.e., the probability of rejecting the null hypothesis when it is actually true. A common choice is α = 0.05.

import numpy as np
from scipy.stats import binom_test  # in newer SciPy versions use scipy.stats.binomtest

# Coin-fairness test: 60 heads observed in 100 flips of a supposedly fair coin.
n_flips = 100
observed_heads = 60
p_fair = 0.5

p_value = binom_test(observed_heads, n=n_flips, p=p_fair, alternative='two-sided')

alpha = 0.05
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis: The coin is not fair.")
else:
    print("Fail to reject the null hypothesis: The coin is fair.")
What is Distribution?
• Distribution = Probability Distribution
• A distribution is a function that shows the possible values for a variable and how often they occur.
• Fair Coin Problem –
Types of Distribution
Discrete Uniform Distribution: All
Outcomes are Equally Likely
• In statistics, uniform distribution refers to a statistical
distribution in which all outcomes are equally likely.
Consider rolling a six-sided die. You have an equal
probability of obtaining all six numbers on your next
roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6,
equaling a probability of 1/6, hence an example of a
discrete uniform distribution.
• As a result, the uniform distribution graph contains
bars of equal height representing each outcome. In our
example, the height is a probability of 1/6 (0.166667).
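• A minimal simulation sketch of the die example: each face should appear with probability about 1/6.

import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=60_000)        # simulate rolls of a fair six-sided die

faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(face, round(count / len(rolls), 3))  # each proportion is close to 1/6 ≈ 0.167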
Drawback
• Uniform distribution is represented by the function
U(a, b), where a and b represent the starting and
ending values, respectively. Similar to a discrete
uniform distribution, there is a continuous uniform
distribution for continuous variables.
• The drawback of this distribution is that it often provides us with no relevant information. Using our example of rolling a die, we get an expected value of 3.5, which gives us no useful intuition, since there is no such thing as half a number on a die. Since all values are equally likely, the distribution gives us no real predictive power.
Bernoulli Distribution: Single-trial
with Two Possible Outcomes
• The Bernoulli distribution is one of the easiest distributions to
understand. It can be used as a starting point to derive more
complex distributions. Any event with a single trial and only two
outcomes follows a Bernoulli distribution. Flipping a coin or
choosing between True and False in a quiz are examples of a
Bernoulli distribution.
• They have a single trial and only two outcomes. Let’s assume you flip a coin once; this is a single trial. The only two outcomes are either heads or tails. This is an example of a Bernoulli distribution.
• Usually, when following a Bernoulli distribution, we have the
probability of one of the outcomes (p). From (p), we can deduce the
probability of the other outcome by subtracting it from the total
probability (1), represented as (1-p).
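• A short sketch with scipy.stats.bernoulli (p = 0.3 is an assumed success probability):

from scipy.stats import bernoulli

p = 0.3                                   # assumed probability of "success" (e.g. heads)

print(bernoulli.pmf(1, p))                # P(X = 1) = p      -> 0.3
print(bernoulli.pmf(0, p))                # P(X = 0) = 1 - p  -> 0.7

samples = bernoulli.rvs(p, size=10, random_state=0)   # ten simulated single trials
print(samples)                                         # array of 0s and 1s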
Poisson Distribution: The Probability
that an Event May or May not Occur
• Poisson distribution deals with the frequency with which an
event occurs within a specific interval. Instead of the
probability of an event, Poisson distribution requires
knowing how often it happens in a particular period or
distance. For example, a cricket chirps two times in 7
seconds on average. We can use the Poisson distribution to
determine the likelihood of it chirping five times in 15
seconds.
• A Poisson process is represented with the notation Po(λ), where λ represents the expected number of events that can take place in a period. The expected value and variance of a Poisson process are both λ. X represents the discrete random variable. A Poisson distribution can be modeled using the following formula:
P(X = k) = (λ^k · e^(−λ)) / k!
• The main characteristics which describe the
Poisson Processes are:
• The events are independent of each other.
• An event can occur any number of times
(within the defined period).
• Two events can’t take place simultaneously.
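• A minimal sketch for the cricket example: chirps occur at 2 per 7 seconds, so the expected count in 15 seconds is λ = 2 × 15 / 7 ≈ 4.29, and scipy.stats.poisson gives the probability of exactly five chirps.

from scipy.stats import poisson

lam = 2 * 15 / 7                      # expected chirps in a 15-second window (≈ 4.29)

print(poisson.pmf(5, lam))            # probability of exactly 5 chirps
print(poisson.cdf(5, lam))            # probability of 5 chirps or fewer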

Numpy Choice Function
• With the help of the choice() method, we can draw random samples from a one-dimensional array and return them as a numpy array.
• syntax: numpy.random.choice(a, size=None, replace=True, p=None)
• Parameters:
• 1) a – a 1-D array to sample from (if an int is given, samples are drawn from np.arange(a)).
• 2) size – output shape of the random samples.
• 3) replace – whether the sample is drawn with or without replacement.
• 4) p – the probabilities associated with each entry in a.
• Output: returns a numpy array of random samples.
# import numpy library
import numpy as np

# create a list
num_list = [10, 20, 30, 40, 50]

# uniformly select any element from the list
number = np.random.choice(num_list)

print(number)
# import numpy library
import numpy as np

# create a list
num_list = [10, 20, 30, 40, 50]

# choose the element at index 3 (value 40)
# with 100% probability; every other
# element's probability is set to 0
# using the p parameter of the
# choice() method, so the index-3 element
# is selected every time in a sample of size 3.
number_list = np.random.choice(num_list, 3, p = [0, 0, 0, 1, 0])

print(number_list)
# import numpy library
import numpy as np

# create a list
num_list = [10, 20, 30, 40, 50]

number_list = np.random.choice(num_list, 3, p = [0, 0, 0.5, 0.5, 0])

print(number_list)
import numpy as np
import matplotlib.pyplot as plt

# Using choice() method: 5000 draws from np.arange(13)
gfg = np.random.choice(13, 5000)

count, bins, ignored = plt.hist(gfg, 25, density=True)
plt.show()
# import numpy
import numpy as np

gfg = np.arange(16).reshape((4, 4))

# Using shuffle() method (shuffles the rows in place)
np.random.shuffle(gfg)

print(gfg)
Sampling error in inferential statistics

• Since the size of a sample is always smaller than the size of the population, some of the population isn’t captured by sample data. This creates sampling error, which is the difference between the true population values (called parameters) and the measured sample values (called statistics).
• Sampling error arises any time you use a sample, even
if your sample is random and unbiased. For this reason,
there is always some uncertainty in inferential
statistics. However, using probability sampling
methods reduces this uncertainty.
Estimating population parameters
from sample statistics
• The characteristics of samples and
populations are described by numbers
called statistics and parameters:
• A statistic is a measure that describes the
sample (e.g., sample mean).
• A parameter is a measure that describes the
whole population (e.g., population mean).
• Sampling error is the difference between a
parameter and a corresponding statistic. Since
in most cases you don’t know the real
population parameter, you can use inferential
statistics to estimate these parameters in a
way that takes sampling error into account.
• There are two important types of estimates you can
make about the population: point estimates and
interval estimates.
• A point estimate is a single value estimate of a
parameter. For instance, a sample mean is a point
estimate of a population mean.
• An interval estimate gives you a range of values where
the parameter is expected to lie. A confidence
interval is the most common type of interval estimate.
• Both types of estimates are important for gathering a
clear idea of where a parameter is likely to lie.
Confidence intervals

• A confidence interval uses the variability around a statistic to come up with an interval estimate for a parameter. Confidence intervals are useful for estimating parameters because they take sampling error into account.
• While a point estimate gives you a precise value
for the parameter you are interested in, a
confidence interval tells you the uncertainty of
the point estimate. They are best used in
combination with each other.
• Each confidence interval is associated with a
confidence level. A confidence level tells you the
probability (in percentage) of the interval
containing the parameter estimate if you repeat
the study again.
• A 95% confidence interval means that if you
repeat your study with a new sample in exactly
the same way 100 times, you can expect your
estimate to lie within the specified range of
values 95 times.
• Although you can say that your estimate will lie
within the interval a certain percentage of the
time, you cannot say for sure that the actual
population parameter will. That’s because you
can’t know the true value of the population
parameter without collecting data from the full
population.
• However, with random sampling and a suitable
sample size, you can reasonably expect your
confidence interval to contain the parameter a
certain percentage of the time.
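• A minimal sketch (with a hypothetical sample) that computes a point estimate and a 95% confidence interval for a population mean using the t-distribution from scipy.stats:

import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])   # hypothetical measurements

mean = sample.mean()                  # point estimate of the population mean
sem = stats.sem(sample)               # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print(f"point estimate: {mean:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")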
Descriptive Statistics
• A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer.
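• A minimal sketch of these DataFrame methods (the DataFrame itself is hypothetical):

import pandas as pd

df = pd.DataFrame({"maths": [78, 85, 62, 90], "science": [71, 88, 95, 60]})

print(df.sum())               # aggregation: one value per column
print(df.mean())              # aggregation: column means
print(df.cumsum())            # same-sized object: running totals down each column
print(df.sum(axis=1))         # axis can be given by integer ...
print(df.sum(axis="columns")) # ... or by name
print(df.describe())          # count, mean, std, min, quartiles, max in one call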
Functions & Description
Statistical Functions
• Statistical methods help in the understanding
and analyzing the behavior of data. We will
now learn a few statistical functions, which we
can apply on Pandas objects.
• A) Percent Change
• B) Covariance
• C) Correlation
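• A short sketch of the three functions listed above, applied to a hypothetical price table:

import pandas as pd

prices = pd.DataFrame({"stock_a": [100, 102, 101, 105],
                       "stock_b": [50, 51, 53, 52]})

print(prices.pct_change())    # A) percent change from the previous row
print(prices.cov())           # B) covariance matrix of the columns
print(prices.corr())          # C) correlation matrix of the columns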
Reading a CSV File

• There are various ways to read a CSV file, using either the csv module or the pandas library.
• csv Module: The CSV module is one of the
modules in Python which provides classes for
reading and writing tabular information in CSV
file format.
• pandas Library: The pandas library is one of the
open-source Python libraries that provide high-
performance, convenient data structures and
data analysis tools and techniques for Python
programming.
How to read a csv file?
Using pandas.read_csv() method:
• It is very easy and simple to read a CSV file
using pandas library functions. Here
read_csv() method of pandas library is used to
read data from CSV files.
import pandas

# reading the CSV file
csvFile = pandas.read_csv('california_housing_test.csv')

# displaying the contents of the CSV file
print(csvFile)
Specific Columns
# reading the CSV file
csvFile = pandas.read_csv('california_housing_test.csv', usecols=["population", "households"])

# displaying the contents of the CSV file
print(csvFile)
How to Skip Rows
# reading the CSV file
csvFile = pandas.read_csv('california_housing_test.csv', usecols=["population", "households"], skiprows=[1, 2])

# displaying the contents of the CSV file
print(csvFile)
How to store Data Frames to CSV
# importing pandas as pd
import pandas as pd

# lists of name, degree, score
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]

# dictionary of lists
data = {'name': nme, 'degree': deg, 'score': scr}

df = pd.DataFrame(data)

# saving the dataframe
df.to_csv('file1.csv')
Playing with CSV
• df.head()
• df.tail()
• df.info()
Locate Missing Data

• isnull()
• isna()
• fillna()
• dropna()
Remove Rows & Return a new Data
Frame
• import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())
• import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())
Replace Empty Values

• import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)


Replace only for specified columns
• import pandas as pd

df = pd.read_csv('data.csv')

df["Calories"].fillna(130, inplace = True)


Removing Duplicates
• print(df.duplicated())

• df.drop_duplicates(inplace = True)
Applications of Data Science in
Various Industries

Data Science is a multidisciplinary field that extracts actionable insights from structured and unstructured data. Its applications span diverse industries, driving innovation, efficiency, and decision-making.
1.1 Healthcare
• Predictive Analytics: Forecasting
disease outbreaks and treatment
outcomes.
• Personalized Medicine:
Recommending treatments tailored to an
individual’s genetic makeup.
• Medical Imaging: Analyzing X-rays,
MRIs, and CT scans using machine
learning (e.g., detecting tumors).
• Drug Discovery: Accelerating the discovery of new drugs by analyzing large biological and chemical datasets.
1.2 Finance
• Fraud Detection: Identifying suspicious
patterns in transactions using anomaly
detection algorithms.
• Risk Management: Assessing credit
risk, market risk, and operational risk.
• Algorithmic Trading: Developing
trading bots that execute trades based
on data-driven strategies.
• Customer Analytics: Segmentation and
retention strategies based on spending
habits.
1.3 Retail and E-commerce
• Customer Personalization:
Recommendation engines suggesting
products (e.g., Amazon).
• Inventory Management: Predicting
product demand and optimizing stock
levels.
• Sentiment Analysis: Understanding
customer feedback from reviews and
social media.
• Price Optimization: Adjusting prices dynamically based on market trends and customer demand.
1.4 Transportation and Logistics
• Route Optimization: Minimizing
delivery times and fuel costs (e.g., GPS-
enabled logistics).
• Predictive Maintenance: Forecasting
vehicle failures and scheduling
maintenance.
• Traffic Management: Analyzing traffic
data to reduce congestion.
• Autonomous Vehicles: Using data to
train self-driving cars.
1.5 Manufacturing
• Quality Control: Detecting defects in
products using computer vision.
• Supply Chain Optimization: Streamlining
procurement, production, and distribution.
• Predictive Maintenance: Reducing
downtime by predicting equipment failures.
1.6 Education
• Adaptive Learning: Personalized learning
paths based on student performance.
• Dropout Prediction: Identifying students
at risk of dropping out.
1.7 Entertainment and Media
• Content Recommendation: Suggesting
movies, music, or shows (e.g., Netflix,
Spotify).
• Audience Analysis: Understanding
viewing patterns and preferences.
• Marketing Campaigns: Data-driven
strategies to target specific audience
segments.
1.8 Energy Sector
• Energy Forecasting: Predicting energy consumption and optimizing energy generation and distribution.
1.9 Government and Public Services
• Crime Prediction: Using data to predict
and prevent criminal activities.
• Smart Cities: Traffic control, waste
management, and energy conservation
using IoT data.
• Policy Making: Data-driven decision-
making for economic and social policies.
Data Science Lifecycle
• The Data Science lifecycle is a
structured process for converting raw
data into meaningful insights and
decisions. It involves several key stages,
each of which plays a critical role in
achieving the desired outcomes.
• The Data Science lifecycle is a systematic approach that transforms raw data into actionable insights. By mastering each stage (data collection, preparation, analysis, visualization, and decision-making), data scientists can deliver reliable insights that drive decisions.
Data Science Lifecycle
– Data Collection
– Data Preparation
– Data Analysis
– Data Visualization
– Decision Making
1. Data Collection
• The first step involves gathering relevant
data for analysis. Data can come from
various sources and formats.
Definition
• Collecting raw data from internal and
external sources to address a specific
problem or question.
Sources of Data
• Primary Data: Data gathered directly
through experiments, surveys, interviews,
or sensors.
• Secondary Data: Data obtained from external sources like government reports, published research, or public datasets.
1. Data Collection
Tools for Data Collection
• Web Scraping: Using Python libraries like BeautifulSoup and Scrapy (see the sketch after this slide).
• IoT Devices: Sensors that generate
real-time data.
Challenges in Data Collection
• Data Privacy and Security: Adhering to
regulations like GDPR, HIPAA.
• Incomplete or Biased Data: Missing
values, unrepresentative samples.
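As an illustration of the web-scraping tool mentioned above, a minimal sketch with requests and BeautifulSoup; the URL is a placeholder and the assumption that the page contains an HTML table is purely for illustration:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://example.com/data-table'   # placeholder URL, not a real data source

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, 'html.parser')

# assume the page contains an HTML <table>; collect the text of each row's cells
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows)
print(df.head())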
2. Data Preparation
Once data is collected, it must be cleaned and
organized to ensure accuracy and consistency.
Definition
• Transforming raw data into a structured and
clean format suitable for analysis.
Steps in Data Preparation
1. Data Cleaning:
   • Removing duplicates.
   • Filling in missing values (mean, median, or using predictive models).
   • Addressing outliers using statistical methods.
2. Data Transformation:
   • Scaling: Normalization (0-1 scaling) or standardization (z-scores).
   • Encoding: Converting categorical data to numerical values (e.g., one-hot encoding), as sketched below.
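A short sketch of these transformations in pandas, using an invented DataFrame (column names and values are made up for the example):

import pandas as pd

df = pd.DataFrame({'age': [18, 25, 40, 60],
                   'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune']})

# normalization: rescale a numeric column to the 0-1 range
df['age_norm'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())

# standardization: convert the column to z-scores (mean 0, standard deviation 1)
df['age_std'] = (df['age'] - df['age'].mean()) / df['age'].std()

# one-hot encoding: convert the categorical column into 0/1 indicator columns
df = pd.get_dummies(df, columns=['city'])
print(df)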
2. Data Preparation
Tools for Data Preparation
• Python libraries: Pandas, NumPy.
• SQL for data querying.
• Data cleaning tools: OpenRefine.
Importance of Data Preparation
• Data preparation ensures data quality,
which directly impacts the accuracy of
subsequent analyses and models.
3. Data Analysis
This stage involves uncovering patterns,
trends, and relationships in the prepared
data.
Definition
• Using statistical and computational
techniques to interpret data and derive
insights.
Types of Analysis
1. Descriptive Analysis: Summarizes the data (e.g., mean, median, mode, standard deviation).
2. Exploratory Data Analysis (EDA): Visualizing data to detect patterns and relationships.
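For example, a quick descriptive summary plus a simple exploratory plot, using made-up scores:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'score': [56, 67, 70, 72, 75, 78, 80, 85, 90, 95]})

print(df['score'].describe())    # descriptive statistics: count, mean, std, quartiles

df['score'].hist()               # exploratory plot of the distribution
plt.xlabel('score')
plt.ylabel('frequency')
plt.show()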
3. Data Analysis
Techniques and Tools
• Statistical Tools: R, MATLAB, SPSS.
• Machine Learning: Scikit-learn,
TensorFlow, PyTorch.
• EDA Tools: Python libraries like
Matplotlib, Seaborn; R packages like
ggplot2.
Key Considerations
• Ensuring the analysis is aligned with the
problem statement.
4. Data Visualization
Presenting data in a visual format to
communicate findings effectively to
stakeholders.
Definition
• The graphical representation of data to
make trends and insights more
interpretable.
Common Visualization Types
1. Bar Charts and Column Charts: Compare categorical data.
2. Line Charts: Track trends over time.
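A minimal Matplotlib sketch of the two chart types listed above, using made-up monthly sales figures:

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 150, 130, 170]          # made-up values

plt.bar(months, sales)                # bar chart: compare categories
plt.title('Sales by month')
plt.show()

plt.plot(months, sales, marker='o')   # line chart: track a trend over time
plt.title('Sales trend over time')
plt.show()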
4. Data Visualization
Visualization Best Practices
• Keep visuals simple and uncluttered.
• Use appropriate chart types for the data.
• Label axes, legends, and provide
context.
Tools for Visualization
• Python Libraries: Matplotlib, Seaborn,
Plotly.
• Business Intelligence Tools: Tableau,
Power BI.
5. Decision Making
The final stage of the Data Science lifecycle
is using the insights derived from data
analysis and visualization to make informed
decisions.
Definition
• Implementing actionable strategies based
on data-driven insights.
Steps in Decision Making
1. Define the Objective: Clearly outline the decision to be made.
2. Evaluate Options: Analyze alternative strategies using data insights.
3. Make Predictions: Use predictive models to estimate the likely outcome of each option (see the sketch below).
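As a small illustration of step 3, a scikit-learn sketch; the spend/sales numbers are invented and stand in for real historical business data:

import numpy as np
from sklearn.linear_model import LinearRegression

# made-up historical data: advertising spend vs. resulting sales
spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 65, 85, 105])

model = LinearRegression().fit(spend, sales)

# predict the outcome of a candidate decision (spending 60 units on advertising)
print(model.predict([[60]]))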
5. Decision Making
Applications in Real Life
• Business Strategy: Optimizing
marketing campaigns, pricing strategies.
• Healthcare: Recommending treatments
based on patient data.
• Public Policy: Formulating policies for
urban planning, education, and
healthcare.
Key Considerations
• Ensure decisions are ethical and fair.
Data Analysis and Data
Analytics
• While "Data Analytics" and "Data
Analysis" are often used
interchangeably, they represent distinct
concepts in practice.
Data Analysis
Definition:
• Data Analysis is the process of inspecting,
cleansing, transforming, and modeling data
to uncover useful information, suggest
conclusions, and support decision-making.
Objectives:
• Understand patterns and trends in the data.
• Extract meaningful insights.
• Support evidence-based decision-making.
Characteristics:
• Focused on interpreting past or current data.
Data Analysis
Key Techniques:
1. Descriptive Analysis: Summarizes historical data (e.g.,
averages, totals).
2. Inferential Analysis: Uses samples to infer conclusions about
the larger population.
3. Statistical Analysis: Includes hypothesis testing, regression
analysis, etc.
4. Exploratory Data Analysis (EDA): Uses visualizations to
identify trends and anomalies.
Tools:
• Excel
• Python (Pandas, NumPy)
• R
• SQL

Applications:
• Examining sales data to identify top-performing products.
Data Analytics
Definition:
• Data Analytics is a broader field that includes
Data Analysis and focuses on applying
techniques and algorithms to predict future
trends, make recommendations, and
automate decision-making.
Objectives:
• Gain actionable insights for strategic
decisions.
• Predict future events or behaviors.
• Optimize processes and systems.
Characteristics:
• Forward-looking: focused on predicting outcomes and prescribing actions.
Data Analytics
Key Techniques:
1. Predictive Analytics: Uses machine learning and statistical models to forecast future outcomes.
2. Prescriptive Analytics: Recommends actions to optimize
results.
3. Diagnostic Analytics: Identifies the causes of past outcomes.
4. Real-Time Analytics: Analyzes data as it is generated.
Tools:
• Tableau
• Power BI
• Python (Scikit-learn, TensorFlow)
• Big Data Platforms (Hadoop, Spark)
Applications:
• Building recommendation systems (e.g., Netflix, Amazon).
• Fraud detection in financial transactions.
• Demand forecasting in supply chain management.
Differences
Aspect      | Data Analysis                             | Data Analytics
Scope       | Narrow: focuses on past and present data. | Broad: encompasses analysis and predictions.
Focus       | Insight generation.                       | Actionable strategies and automation.
Approach    | Manual and statistical.                   | Algorithmic and technology-driven.
Outcome     | Answers "What happened?"                  | Answers "What will happen?" and "What should we do?"
Complexity  | Comparatively simpler.                    | Often requires advanced tools and techniques.
Key Points
• Data Analysis is a subset of Data Analytics.
• While Data Analysis focuses on
understanding historical data, Data Analytics
extends its application to prediction,
decision-making, and process optimization.
• Both are critical components in the data-
driven decision-making process, each
complementing the other in generating
insights and driving strategies.
Data Science Programming
Languages
The programming languages commonly used in
Data Science are essential tools for data
manipulation, statistical analysis, data
visualization, and machine learning.
➢Python
➢R
➢SQL
➢Julia
➢SAS
➢MATLAB
Python
• Python is arguably the most popular language
in the Data Science ecosystem due to its
simplicity, readability, and extensive libraries.
• It offers a wide range of libraries and
frameworks that make tasks like data analysis,
machine learning, and deep learning much
easier.
• Popular Libraries:
▪ Pandas: Data manipulation and analysis.
▪ NumPy: Numerical computing, arrays, and
matrices.
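A tiny example of the two libraries working together (the values are arbitrary):

import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])           # NumPy: fast numerical arrays
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])  # Pandas: labelled, tabular data

print(arr.mean())       # overall mean computed by NumPy
print(df.describe())    # per-column summary computed by Pandas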
R
• R is a language specifically designed for
statistical analysis and visualization.
• It's particularly popular among statisticians and
researchers due to its extensive collection of
statistical packages.
• It also offers various visualization libraries that
are highly regarded in the Data Science
community.
• Popular Libraries:
▪ ggplot2: Data visualization and plotting.
▪ dplyr: Data manipulation.
▪ tidyr: Data tidying and reshaping.
SQL
• SQL is not a general-purpose programming
language like Python or R, but it is
indispensable for data science. It is the
standard language for managing and
manipulating relational databases.
• SQL allows data scientists to query, retrieve,
and filter data stored in databases, which is a
core part of the data cleaning and exploration
phase.
• Key Features:
– Querying data (SELECT statements).
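Although SQL itself is not Python, here is a minimal sketch of running SQL from Python with the built-in sqlite3 module; the table and column names are invented for the example:

import sqlite3

conn = sqlite3.connect(':memory:')    # throwaway in-memory database for the demo
cur = conn.cursor()

cur.execute('CREATE TABLE sales (product TEXT, amount REAL)')
cur.executemany('INSERT INTO sales VALUES (?, ?)',
                [('pen', 10.0), ('book', 55.5), ('pen', 12.5)])

# querying data with a SELECT statement
cur.execute('SELECT product, SUM(amount) FROM sales GROUP BY product')
print(cur.fetchall())

conn.close()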
Julia
• Julia is a high-performance, high-level
programming language designed for numerical
and scientific computing. It’s gaining popularity
in the Data Science field due to its ability to
handle complex mathematical computations at
speeds comparable to languages like C.
• Key Features:
– High performance with easy-to-use syntax.
– Designed for parallel computing and distributed
computing.
– Libraries for machine learning, statistics, and data
analysis.
SAS
• SAS is a software suite used for advanced
analytics, business intelligence, and data
management. It is often used in business and
enterprise environments, particularly for large-
scale data analysis.
• Key Features:
▪ Rich statistical analysis capabilities.
▪ Data manipulation and transformation.
▪ Extensive support for business analytics.
• While not as popular for open-source projects as Python or R, SAS is still heavily used in enterprise settings such as banking, healthcare, and government.
MATLAB
• MATLAB is widely used in academic research
and industries that require heavy numerical
computation. It provides a vast array of built-in
functions and toolboxes for mathematical
modeling, simulations, and data analysis.
• Key Features:
▪ Advanced mathematical modeling and simulations.
▪ Visualization tools for data analysis.
▪ Toolboxes for specific domains like signal
processing, statistics, and machine learning.
• MATLAB is favored for numerical computations, simulations, and engineering-oriented analysis.
SCALA
• Scala is a programming language that runs on
the Java Virtual Machine (JVM) and is used in
big data processing, particularly in the Hadoop
ecosystem. It’s often combined with Apache
Spark for distributed data processing.
• Key Features:
▪ Functional and object-oriented programming.
▪ Integration with Apache Spark for big data
processing.
▪ High-performance data processing.
• Scala is popular in big data analytics, especially when used with Apache Spark.
Java
• Java is a general-purpose programming
language that is also used in Data Science,
particularly for building scalable data systems
and for big data applications. It is commonly
used in the backend of data science systems
and in environments where performance is
critical.
• Key Features:
▪ High-performance and scalability.
▪ Strong ecosystem for big data tools (Hadoop,
Apache Kafka, Apache Spark).
▪ Support for parallel and distributed computing.
• Each programming language in Data Science has its strengths and
is suited for different tasks.
• Python and R are the most widely used for data analysis and
machine learning.
• SQL is essential for managing and querying data stored in
databases.
• Julia and MATLAB are emerging tools for numerical computing.
• SAS is a powerful tool for specialized statistical analysis.
• For handling large-scale data, Scala and Java are favored due to
their ability to integrate with big data platforms like Apache Spark.
• Ultimately, the choice of programming language depends on the task at hand, the team's expertise, and the surrounding data ecosystem.