Data Science Notes
TCS 733
Compiled by Dr. Vijay Singh
Associate Professor,
Department of Computer Science and Engineering
Graphic Era Deemed to be University, Dehradun
+91-9760322316
Vijaysingh.cse@geu.ac.in
Syllabus
UNIT-I
A data scientist is a person who extracts insights from data by using
various processes, methods, systems, and algorithms.
A data scientist requires a range of skills to analyze, interpret, and
visualize data in order to make informed decisions.
• R-intro.pdf (r-project.org)
Operators in R
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Miscellaneous operators
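The operator categories listed above can be illustrated with a short
sketch. The variable names and values are chosen only for illustration.

```r
# Arithmetic operators
quotient  <- 7 %/% 2   # integer division
remainder <- 7 %% 2    # remainder (modulo)
power     <- 2 ^ 3     # exponentiation

# Assignment operators
x <- 10        # leftward assignment (the most common form)
15 -> y        # rightward assignment
z = x + y      # '=' also assigns at the top level

# Comparison operators return logical values
is_smaller <- x < y
is_ten     <- x == 10

# Logical operators
both   <- (x > 5) & (y > 20)   # element-wise AND
either <- (x > 5) | (y > 20)   # element-wise OR
negated <- !(x > 5)            # NOT

# Miscellaneous operators
v <- 1:5                          # ':' builds an integer sequence
has_three <- 3 %in% v             # '%in%' tests membership
m <- matrix(1:4, nrow = 2)        # filled column by column
mm <- m %*% m                     # '%*%' is matrix multiplication
```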
Conditional expression in R
A conditional expression or a conditional statement is a programming
construct where a decision is made to execute some code based on a
Boolean (true or false) condition. A more commonly used term for
conditional expression in programming is an 'if-else' condition. In plain
English, this is stated as 'if this test is true then do this operation;
otherwise do this different operation'.
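A minimal sketch of the 'if-else' construct described above. The function
name classify_number and the sample scores are hypothetical; if/else and
ifelse() are standard base R.

```r
# Scalar decision: exactly one branch runs, depending on the condition
classify_number <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

label <- classify_number(5)

# Vectorised form: ifelse() applies the test to every element
scores <- c(35, 72, 50)
result <- ifelse(scores >= 50, "pass", "fail")
```

Use if/else when the condition is a single TRUE or FALSE, and ifelse()
when the same test has to be applied to a whole vector.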
Loops in R
An R script is a file with the extension “.R” that
contains a program (a set of commands).
Rscript is an R interpreter front end that executes the
R commands present in a script file.
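As a sketch, the commands below could be saved in a script file and run
from a terminal with Rscript. The file name hello.R and the argument are
hypothetical; commandArgs() is the base R function that returns the
command-line arguments.

```r
# Contents of a hypothetical script file, hello.R.
# Run from a terminal with:  Rscript hello.R Asha

# Arguments supplied after the script name, if any
args <- commandArgs(trailingOnly = TRUE)

# Fall back to a default when no argument is given
name <- if (length(args) > 0) args[1] else "world"

greeting <- paste0("Hello, ", name, "!")
cat(greeting, "\n", sep = "")
```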
Functions in R
UNIT-II (Data Preprocessing)
A random variable, usually written X, is a variable
whose possible values are numerical outcomes of a
random phenomenon. There are two types of random
variables, discrete and continuous.
Discrete Random Variables
A discrete random variable is one which may take on only a countable number of
distinct values such as 0, 1, 2, 3, 4, .... Discrete random variables are usually (but not
necessarily) counts. If a random variable can take only a finite number of distinct
values, then it must be discrete. Examples of discrete random variables include the
number of children in a family, the Friday night attendance at a cinema, the number
of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.
Consider a discrete random variable X with P(X = 1) = 0.1, P(X = 2) = 0.3, and P(X = 3) = 0.4.
The probability that X is equal to 2 or 3 is the sum of the two probabilities:
P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, because 1 is the smallest
value X takes, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9,
by the complement rule. This distribution may also be described by
the probability histogram shown in the figure:
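The calculation above can be checked with a short sketch. The text gives
P(X = 1), P(X = 2) and P(X = 3); the fourth outcome, X = 4 with
probability 0.2, is an assumption made here only so that the probabilities
sum to 1.

```r
# Probabilities for X = 1, 2, 3 come from the text.
# ASSUMPTION: a single remaining outcome X = 4 carries the leftover 0.2.
x <- 1:4
p <- c(0.1, 0.3, 0.4, 0.2)

# A valid probability distribution sums to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

# Addition rule for disjoint outcomes: P(X = 2 or X = 3)
p_2_or_3 <- sum(p[x %in% c(2, 3)])

# Complement rule: P(X > 1) = 1 - P(X = 1)
p_gt_1 <- 1 - p[x == 1]

# Probability histogram (drawn only in an interactive session)
if (interactive()) {
  barplot(p, names.arg = x, xlab = "x", ylab = "P(X = x)")
}
```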
Continuous Random Variables
A continuous random variable is one which takes an infinite number of
possible values. Continuous random variables are usually
measurements. Examples include height, weight, the amount of sugar
in an orange, the time required to run a mile.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics
Glossary v1.1)
A continuous random variable is not defined at specific values. Instead,
it is defined over an interval of values, and is represented by the area
under a curve (in advanced mathematics, this is known as an integral).
The probability of observing any single value is equal to 0, since the
number of values which may be assumed by the random variable is
infinite.
Suppose a random variable X may take all values over an interval of real
numbers. Then the probability that X is in the set of outcomes A, P(A),
is defined to be the area above A and under a curve. The curve, which
represents a function p(x), must satisfy the following:
• The curve has no negative values (p(x) ≥ 0 for all x).
• The total area under the curve is equal to 1.
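A sketch of "probability as area under a curve", using the standard normal
density dnorm() as an example of p(x). The choice of the normal curve and
of the interval from -1 to 1 is an assumption for illustration only.

```r
# Example density p(x): the standard normal curve (an illustrative choice)
grid <- seq(-5, 5, by = 0.1)
non_negative <- all(dnorm(grid) >= 0)   # the curve is never negative

# Total area under the curve, by numerical integration
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# P(-1 < X < 1) as the area under the curve over that interval
p_interval <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function
p_interval_cdf <- pnorm(1) - pnorm(-1)
```

Because a probability is an area over an interval, an interval of zero
width has zero area, which is why the probability of any single value is 0.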
Missing Values
In statistics, missing data, or missing values, occur when no data value
is stored for the variable in an observation. Missing data are a common
occurrence and can have a significant effect on the conclusions that can
be drawn from the data.
• Missing values are quite common in many real-life datasets. They can
bias the results of machine learning models and/or reduce the accuracy
of the model.
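In R, a missing value is stored as NA. The sketch below shows how missing
values can be located and how they affect a simple calculation. The data
frame and its values are made up for illustration.

```r
# Hypothetical data with missing values (NA)
df <- data.frame(
  id   = 1:5,
  age  = c(23, NA, 31, NA, 45),
  city = c("Dehradun", "Delhi", NA, "Mumbai", "Pune")
)

# Which entries of a column are missing?
age_missing <- is.na(df$age)

# Number of missing values in each column
na_per_column <- colSums(is.na(df))

# Number of rows with no missing value at all
complete_rows <- sum(complete.cases(df))

# A single NA makes the plain mean NA ...
mean_with_na <- mean(df$age)

# ... unless missing values are explicitly dropped
mean_age <- mean(df$age, na.rm = TRUE)
```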
Why Is Data Missing From The Dataset?
• There can be multiple reasons why certain values are missing from the data.
• The reason the data is missing affects how the missing data should be
handled.
• So it’s necessary to understand why the data could be missing.
Missing Completely At Random (MCAR)
In the case of MCAR, the probability that a value is missing is unrelated
to any other data, observed or unobserved. The data could be missing due
to human error, some system/equipment failure, loss of sample, or some
unsatisfactory technicalities while recording the values.
For example, suppose a library has some overdue books, and some values
for overdue books are missing from the computer system. The reason might
be a human error, such as the librarian forgetting to type in the
values. So, the missing values of overdue books are not related to any
other variable/data in the system.
MCAR should not be assumed by default, as it is a rare case. The advantage
of such data is that the statistical analysis remains unbiased.
Missing At Random (MAR)
• Missing at random (MAR) means that the reason for missing values can be
explained by variables on which you have complete information as there is
some relationship between the missing data and other values/data.
• In this case, the data is not missing for all the observations. It is missing
only within sub-samples of the data and there is some pattern in the
missing values.
• For example, if you check the survey data, you may find that all the people
have answered their ‘Gender’ but ‘Age’ values are mostly missing for
people who have answered their ‘Gender’ as ‘female’ (for example, because
these respondents were less willing to reveal their age).
• So, the probability of data being missing depends only on the observed
data.
• In this case, the variables ‘Gender’ and ‘Age’ are related and the
reason for missing values of the ‘Age’ variable can be explained by the
‘Gender’ variable, but you cannot predict the missing value itself.
• Suppose a poll is taken for overdue books of a library. Gender and the
number of overdue books are asked in the poll. Assume that most women
answer the poll and men are less likely to answer. So the reason the
data is missing can be explained by another factor, namely gender.
• In this case, the statistical analysis might result in bias.
Missing Not At Random (MNAR)
• Missing values depend on the unobserved data.
• If there is some structure/pattern in missing data and other observed
data cannot explain it, then it is Missing Not At Random (MNAR).
• If the missing data does not fall under the MCAR or MAR then it can
be categorized as MNAR.
• It can happen due to the reluctance of people in providing the
required information. A specific group of people may not answer
some questions in a survey.
• For example, suppose the name and the number of overdue books
are asked in the poll for a library. Most people with no overdue books
are likely to answer the poll, while people with more overdue books
are less likely to answer.
• So in this case, whether the number of overdue books is missing
depends on that number itself, which is exactly the value that goes
unobserved.
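The three mechanisms (MCAR, MAR, MNAR) can be compared with a small
simulation of the library poll. Everything here is an assumption made for
illustration: the sample size, the Poisson model for overdue books, and
the missingness probabilities.

```r
set.seed(42)
n <- 5000

# Simulated poll: gender and number of overdue books (hypothetical model)
gender  <- sample(c("female", "male"), n, replace = TRUE)
overdue <- rpois(n, lambda = 2)

# MCAR: every value has the same 20% chance of being missing
mcar <- overdue
mcar[runif(n) < 0.2] <- NA

# MAR: missingness depends only on the observed variable 'gender'
p_mar <- ifelse(gender == "male", 0.4, 0.1)
mar <- overdue
mar[runif(n) < p_mar] <- NA

# MNAR: missingness depends on the unobserved value itself
# (people with many overdue books tend not to answer)
p_mnar <- ifelse(overdue >= 3, 0.5, 0.05)
mnar <- overdue
mnar[runif(n) < p_mnar] <- NA

# Share of missing answers within each gender under MAR
miss_by_gender <- tapply(is.na(mar), gender, mean)

# Mean number of overdue books: full data versus what was observed
means <- c(
  true = mean(overdue),
  mcar = mean(mcar, na.rm = TRUE),
  mar  = mean(mar,  na.rm = TRUE),
  mnar = mean(mnar, na.rm = TRUE)
)
```

Under MNAR the observed mean falls below the true mean, because large
values are the ones most often missing. In this simulation gender and
overdue books are generated independently, so the MAR mean stays close to
the true mean; bias would appear if the two variables were related.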
How to deal with missing values using R
R has various packages to deal with the missing data.
List of R Packages
• mice (Multivariate Imputation by Chained Equations)
• Amelia
• missForest
• Hmisc
• mi
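A minimal imputation sketch on the built-in airquality dataset, which
contains NA values in the Ozone and Solar.R columns. The first part uses
only base R (simple mean and median imputation, not one of the packages
listed above). The second part calls mice::mice() and mice::complete()
and runs only if the mice package is installed; the settings m = 5 and
seed = 1 are arbitrary choices.

```r
# Work on a copy so the original dataset stays unchanged
aq <- airquality

# Base R: replace missing Ozone values with the column mean
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
aq$Ozone[is.na(aq$Ozone)] <- ozone_mean

# Base R: replace missing Solar.R values with the column median
solar_median <- median(aq$Solar.R, na.rm = TRUE)
aq$Solar.R[is.na(aq$Solar.R)] <- solar_median

remaining_na <- sum(is.na(aq))

# Optional: multiple imputation with the mice package, if available
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(airquality, m = 5, printFlag = FALSE, seed = 1)
  aq_mice <- mice::complete(imp, 1)   # first of the 5 completed datasets
}
```

Mean or median imputation is easy to apply but shrinks the variability of
the column, which is one reason the dedicated packages are often preferred.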
Decoding the job description
The data analyst role is one of many job titles that contain the word “analyst.” To name a few
others that sound similar but may not be the same role:
• Business analyst — analyzes data to help businesses improve processes, products, or services
• Data analytics consultant — analyzes the systems and models for using data
• Data engineer — prepares and integrates data from different sources for analytical use
• Data scientist — uses expert skills in technology and social science to find trends through data
analysis
• Data specialist — organizes or converts data for use in databases or software systems
• Operations analyst — analyzes data to assess the performance of business operations and
workflows
The six data analysis phases
There are six data analysis phases that will help you make seamless decisions: ask,
prepare, process, analyze, share, and act. Keep in mind, these are different from
the data life cycle, which describes the changes data goes through over its lifetime.
Let’s walk through the steps to see how they can help you solve problems you
might face on the job.
Step 1: Ask
It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:
• Define the problem you’re trying to solve
• Make sure you fully understand the stakeholder’s expectations
• Focus on the actual problem and avoid any distractions
• Collaborate with stakeholders and keep an open line of communication
• Take a step back and see the whole situation in context
Step 3: Process
Clean data is the best data, and you will need to clean up your data to get rid of any possible errors,
inaccuracies, or inconsistencies. This might mean:
• Using spreadsheet functions to find incorrectly entered data
• Using SQL functions to check for extra spaces
• Removing repeated entries
• Checking as much as possible for bias in the data
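The cleaning steps above can also be carried out in R. The sketch below
trims extra spaces, removes repeated entries, and flags values outside a
plausible range. The data and the 0 to 100 score range are hypothetical.

```r
# Hypothetical raw data with stray spaces, repeats, and a bad value
raw <- data.frame(
  name  = c(" Asha", "Ravi ", "Asha", "Meera", "Ravi"),
  score = c(82, 91, 82, -5, 91),
  stringsAsFactors = FALSE
)

# Remove leading and trailing spaces
raw$name <- trimws(raw$name)

# Drop rows that repeat an earlier row exactly
clean <- raw[!duplicated(raw), ]

# Flag incorrectly entered data: scores outside the assumed 0-100 range
suspicious <- clean[clean$score < 0 | clean$score > 100, ]
```

Trimming has to happen before the duplicate check; otherwise " Asha" and
"Asha" would be treated as different entries.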
Step 4: Analyze
You will want to think analytically about your data. At this stage, you might sort and format your
data to make it easier to:
• Perform calculations
• Combine data from multiple sources
• Create tables with your results
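A sketch of these tasks in base R: sorting, combining two sources,
performing a calculation, and creating a summary table. The store names,
regions, and amounts are made up for illustration.

```r
# Two hypothetical data sources
sales <- data.frame(
  store  = c("A", "B", "A", "C", "B"),
  amount = c(120, 80, 60, 200, 40)
)
stores <- data.frame(
  store  = c("A", "B", "C"),
  region = c("North", "South", "North")
)

# Sort: largest sale first
sorted <- sales[order(-sales$amount), ]

# Combine data from multiple sources on the shared 'store' column
combined <- merge(sales, stores, by = "store")

# Perform a calculation: each sale as a share of total sales
combined$share <- combined$amount / sum(combined$amount)

# Create a table with the results: total amount per region
by_region <- aggregate(amount ~ region, data = combined, FUN = sum)
```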
Types of bias to check for in the data:
• Observer bias
• Interpretation bias
• Confirmation bias