FDS - fINAL QB
UNIT I - INTRODUCTION
1. What is Data Science?
Data science is the area of study that involves methods for analyzing massive amounts of
data and extracting knowledge from the data that is gathered.
Statistics helps data scientists get a better idea of customers’ expectations. Using
statistical methods, data scientists can gain knowledge about consumer interest, behavior,
engagement, retention, etc. Statistics also helps in building powerful data models to validate certain
inferences and predictions.
4. List the various fields where data science is used.
Data science is used almost everywhere, in both commercial and non-commercial settings.
Some of the fields where data science is used are:
Healthcare industry
Retailers
Financial sectors
Transportation
Government sectors
Universities
Data cleaning
Data integration
Data transformation
Data that is generated continuously by thousands of data sources, which typically send in
the data records simultaneously and in small sizes. Examples are the “What’s trending” on
Twitter, live sporting or music events, and the stock market.
• Pareto charts are useful to find the defects to prioritize in order to observe the greatest
overall improvement.
Recommender systems use algorithms that optimize the analysis of the data to build
the recommendations.
Structured data
Unstructured data
Natural language
Machine-generated
Graph-based
Streaming
These systems generate recommendations based on what they know about the user’s
tastes from their activities on the platform.
2. Retrieving data
3. Data Preparation
4. Data Exploration
5. Data Mining
13. Mention the inputs covered in the project charter.
• An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the
other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
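The minimum/maximum check described above can be sketched with pandas. This is a minimal illustration; the `ages` values are invented for the example and do not come from the source.

```python
import pandas as pd

# Hypothetical sample: one value (250) is clearly out of line with the rest.
ages = pd.Series([23, 31, 27, 45, 38, 250, 29], name="age")

# A small table of the minimum and maximum is often enough to spot it.
summary = ages.agg(["min", "max"])
print(summary)
```

A maximum far from the bulk of the values is a prompt to inspect that observation, not proof that it is an error.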
• The first operation is joining: enriching an observation from one table with information
from another table.
• The second operation is appending or stacking: adding the observation of one table to
those of another table.
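The two operations above might look like this in pandas. It is a sketch under stated assumptions: the tables `clients`, `regions`, and `new_clients` and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
regions = pd.DataFrame({"client_id": [1, 2, 3], "region": ["North", "South", "East"]})
new_clients = pd.DataFrame({"client_id": [4], "name": ["Dee"]})

# Joining: enrich each client row with a column from another table.
joined = clients.merge(regions, on="client_id", how="left")

# Appending (stacking): add the rows of one table to those of another.
stacked = pd.concat([clients, new_clients], ignore_index=True)

print(joined)
print(stacked)
```

Joining adds columns and needs a shared key; appending adds rows and works best when both tables have the same columns.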
Bar charts and histograms can be used to compare the sizes of different groups. A bar
chart is made up of bars plotted on a chart. A histogram is a graph that represents a frequency
distribution; the heights of the bars represent observed frequencies.
21. How does data cleaning play a vital role in the analysis?
Presenting your results to the stakeholders and industrializing your analysis process for
repetitive reuse and integration with other tools.
• Data preparation is the process of cleaning and transforming raw data prior to processing
and analysis.
• It is an important step prior to processing and often involves reformatting data, making
corrections to data and the combining of data sets to enrich data.
• Box plots are a standardized way of displaying the distribution of data based on a five
number summary (“minimum”, first quartile(Q1), median, third quartile (Q3) and
“maximum”).
• First quartile: the middle number between the smallest number and the median.
• Third quartile: the middle number between the median and highest value of the dataset.
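As a rough illustration, the five-number summary behind a box plot can be computed with NumPy. The data values are hypothetical. Note that `np.percentile` interpolates linearly by default, so other quartile conventions may give slightly different values on other datasets.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical values

# Minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)
```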
• With brushing and linking you combine and link different graphs and tables (or views) so
changes in one graph are automatically transferred to the other graphs.
UNIT II - DESCRIBING DATA
1. Discuss the differences between the frequency table and the frequency distribution table?
A frequency table is a tabular method in which each data value is listed with
its corresponding frequency, whereas a frequency distribution is generally the graphical
representation of the frequency table.
distributions?
distribution?
The real limits are located at the midpoint of the gap between adjacent tabled boundaries;
that is, one-half of one unit of measurement below the lower tabled boundary and one-half of one
unit of measurement above the upper tabled boundary.
Relative frequency distributions show the frequency of each class as a part or fraction of
the total frequency for the entire distribution.
Cumulative frequency distributions show the total number of observations in each class
and in all lower-ranked classes.
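A short sketch of both ideas, assuming a hypothetical set of class frequencies ordered from the lowest class to the highest:

```python
import numpy as np

# Hypothetical frequencies for four classes, ordered from lowest to highest.
freq = np.array([2, 5, 8, 5])

relative = freq / freq.sum()   # each class as a fraction of the total
cumulative = np.cumsum(freq)   # this class plus all lower-ranked classes

print(relative)
print(cumulative)
```

The relative frequencies sum to 1, and the last cumulative frequency equals the total number of observations.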
The percentile rank of a score indicates the percentage of scores in the entire distribution
with similar or smaller values than that score. Thus a weight has a percentile rank of 80 if equal
or lighter weights constitute 80 percent of the entire distribution.
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
The intersection of the two axes defines the origin at which both numerical scales equal
0.
Numerical scales always increase from left to right along the horizontal axis and from
bottom to top along the vertical axis.
The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.
A device for sorting quantitative data on the basis of leading and trailing digits.
A distribution that includes a few extreme observations in the positive direction (to the right of
the majority of observations).
A distribution that includes a few extreme observations in the negative direction (to the left of the
majority of observations).
The mode reflects the value of the most frequently occurring score.
Distributions can have more than one mode (or no mode at all). Distributions with two
obvious peaks, even though they are not exactly the same height, are referred to as bimodal.
Distributions with more than two peaks are referred to as multimodal.
16. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65,
63.
Answer: 63
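The answer can be checked with Python's standard library. `statistics.multimode` (Python 3.8+) returns every value tied for most frequent, so it also handles bimodal and multimodal data.

```python
from statistics import multimode

ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

modes = multimode(ages)  # list of all most frequently occurring values
print(modes)
```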
The median reflects the middle value when observations are ordered from least to most.
2. Find the middle position by adding one to the total number of scores and dividing by 2.
3. If the middle position is a whole number, as in the left-hand panel below, use this
number to count into the set of ordered scores.
4. The value of the median equals the value of the score located at the middle position.
5. If the middle position is not a whole number, as in the right-hand panel below, use the
two nearest whole numbers to count into the set of ordered scores.
6. The value of the median equals the value midway between those of the two
middlemost scores; to find the midway value, add the two given values and divide by 2.
19. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.
median = 63
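The counting procedure for the median can be sketched in plain Python. This is an illustrative implementation of the steps, not a library routine.

```python
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

ordered = sorted(ages)             # order scores from least to most
position = (len(ordered) + 1) / 2  # middle position: (n + 1) / 2

if position.is_integer():
    # Whole number: count that many scores into the ordered set.
    median = ordered[int(position) - 1]
else:
    # Otherwise take the value midway between the two middlemost scores.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    median = (lower + upper) / 2

print(median)
```

With 11 scores the middle position is 6, and the sixth ordered score is 63.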
The median can be used whenever it is possible to order qualitative data from least to
most because the level of measurement is ordinal.
The range is the difference between the largest and smallest scores.
Degrees of freedom (df) refers to the number of values that are free to vary, given one or
more mathematical restrictions, in a sample being used to estimate a population characteristic.
2. To determine how far to penetrate the set of ordered scores, begin at either end,
then add 1 to the total number of scores and divide by 4. If necessary, round the
result to the nearest whole number.
3. Beginning with the largest score, count the requisite number of steps
(calculated in step 2) into the ordered scores to find the location of the third
quartile.
4. The third quartile equals the value of the score at this location.
5. Beginning with the smallest score, again count the requisite number of steps
into the ordered scores to find the location of the first quartile.
6. The first quartile equals the value of the score at this location.
7. The IQR equals the third quartile minus the first quartile.
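The IQR procedure can be sketched on the retirement ages used in the earlier questions. Rounding a fractional step count half up to the nearest whole number is an assumption made for this sketch.

```python
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
ordered = sorted(ages)

# (n + 1) / 4 steps in from either end; rounding half up is an assumption.
steps = int((len(ordered) + 1) / 4 + 0.5)

q3 = ordered[-steps]     # count in from the largest score
q1 = ordered[steps - 1]  # count in from the smallest score
iqr = q3 - q1

print(q1, q3, iqr)
```

Here (11 + 1) / 4 = 3, so the quartiles sit three scores in from each end.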
UNIT III - DESCRIBING RELATIONSHIPS
Obtained from a mathematical equation, the normal curve is a theoretical curve defined
for a continuous variable.
Because the normal curve is symmetrical, its lower half is the mirror image of its upper
half.
Being bell shaped, the normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak (without actually
touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely
far).
The values of the mean, median (or 50th percentile), and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.
3. Define z scores
Any unit-free scores expressed relative to a known mean and a known standard deviation
are referred to as standard scores. Although z scores qualify as standard scores because they
are unit-free and expressed relative to a known mean of 0 and a known standard deviation of 1,
other scores also qualify as standard scores.
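As a minimal sketch, raw scores can be converted to z scores with the usual formula z = (X - mean) / standard deviation. The raw scores below are hypothetical, and the population standard deviation is assumed.

```python
import numpy as np

scores = np.array([50.0, 60.0, 70.0, 80.0, 90.0])  # hypothetical raw scores

mean = scores.mean()
sd = scores.std()  # population standard deviation (ddof=0), an assumption here

z = (scores - mean) / sd  # unit-free scores with mean 0 and standard deviation 1
print(z)
```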
Particularly when reporting test results to a wide audience, z scores can be changed to
transformed standard scores, other types of unit-free standard scores that lack negative signs and
decimal points. These transformations change neither the shape of the original distribution nor
the relative standing of any test score within the distribution.
7. Define Scatterplots.
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
With a little training, you can use any dot cluster as a preview of a fully measured relationship.
1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the linear
relationship.
The equation that minimizes the total of all squared prediction errors for known Y scores
in the original correlation analysis.
11. Assume that an r of .30 describes the relationship between educational level (highest grade
completed) and estimated number of hours spent reading each week. More specifically:
X̄ = 13, Ȳ = 8
SSx = 25, SSy = 50
r = .30
(a) Determine the least squares equation for predicting weekly reading time from educational
level.
Answer:
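One way to work this out is with the standard least squares formulas: slope b = r × sqrt(SSy / SSx) and intercept a = (mean of Y) − b × (mean of X). The sketch below assumes that the 13 and 8 given in the problem are the means of X and Y.

```python
import math

# Values from the problem; 13 and 8 are assumed to be the means of X and Y.
mean_x, mean_y = 13, 8
ss_x, ss_y = 25, 50
r = 0.30

b = r * math.sqrt(ss_y / ss_x)  # slope
a = mean_y - b * mean_x         # intercept

print(f"Y' = {b:.2f}X + {a:.2f}")
```

At full precision, b is about 0.42 and a is about 2.48. Rounding b to .42 before computing the intercept gives a = 8 − (.42)(13) = 2.54, so a hand-worked answer may differ slightly in the second decimal.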
The square of the correlation coefficient, r², always indicates the proportion of total
variability in one variable that is predictable from its relationship with the other variable.
A least squares equation that contains more than one predictor or X variable.
1. Define NumPy.
NumPy stands for Numerical Python and is one of the most useful scientific libraries in
Python programming.
It provides support for large multidimensional array objects and various tools to work
with them. NumPy arrays are called ndarray or N-dimensional arrays and they store
elements of the same type and size.
are:
Attributes of arrays
Indexing of arrays
Slicing of arrays
Reshaping of arrays
Joining and splitting of arrays
A NumPy array has the attributes ndim (the number of dimensions), shape (the size of each
dimension), and size (the total size of the array).
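A quick illustration of these three attributes on a hypothetical 3×4 array:

```python
import numpy as np

x = np.arange(12).reshape((3, 4))  # hypothetical 3x4 array of the integers 0..11

print(x.ndim)   # number of dimensions
print(x.shape)  # size of each dimension
print(x.size)   # total number of elements
```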
Subarrays are accessed with slice notation, marked by the colon (:) character. The
NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use
this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
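A few slices of a one-dimensional array show how the defaults apply. The variable names are illustrative only.

```python
import numpy as np

x = np.arange(10)     # the integers 0 through 9

first_five = x[:5]    # start defaults to 0
from_five = x[5:]     # stop defaults to the size of the dimension
every_other = x[::2]  # start and stop default; step is 2
middle = x[4:7]       # elements at indices 4, 5, and 6

print(first_five, from_five, every_other, middle)
```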
Another useful type of operation is reshaping of arrays. Reshaping changes the shape of an
array, that is, its number of rows and columns, without changing its data. The most flexible way
of doing this is with the reshape() method. For example, if you want to put the numbers 1
through 9 in a 3×3 grid, you can do the following:
import numpy as np
grid = np.arange(1, 10).reshape((3, 3))  # numbers 1 through 9 in a 3×3 grid
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
7. What is Broadcasting?
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction,
multiplication, etc.) on arrays of different sizes.
Broadcasting in NumPy follows a strict set of rules to determine the interaction between
the two arrays:
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer
dimensions is padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape
equal to 1 in that dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
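The three rules can be seen in a small sketch. The arrays are hypothetical and were chosen so that the first addition succeeds and the second fails.

```python
import numpy as np

a = np.arange(3)                # shape (3,)
b = np.arange(3).reshape(3, 1)  # shape (3, 1)

# Rule 1 pads a's shape to (1, 3); Rule 2 stretches both arrays to (3, 3).
total = a + b
print(total.shape)

# Rule 3: shapes (3, 2) and (3,) disagree in the last dimension (2 vs 3),
# and neither size is 1, so NumPy raises a ValueError.
m = np.ones((3, 2))
try:
    m + a
    raised = False
except ValueError:
    raised = True
print(raised)
```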
NumPy also provides the np.recarray class, which is almost identical to the structured
arrays just described, but with one additional feature: fields can be accessed as attributes rather
than as dictionary keys.
data_rec = data.view(np.recarray)
data_rec.age
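For context, here is a self-contained sketch of the same idea. The structured array `data`, with its `name` and `age` fields and values, is a hypothetical stand-in for the array the snippet above assumes.

```python
import numpy as np

# Hypothetical structured array with a name field and an age field.
data = np.zeros(3, dtype=[("name", "U10"), ("age", "i4")])
data["name"] = ["Alice", "Bob", "Cathy"]
data["age"] = [25, 45, 37]

print(data["age"])  # structured array: field accessed by key

data_rec = data.view(np.recarray)
print(data_rec.age)  # record array: the same field accessed as an attribute
```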
The two main objects of Pandas are the Series object and the DataFrame object.
The essential difference between NumPy and Pandas is the presence of the index: while
the NumPy array has an implicitly defined integer index used to access the values, the Pandas
Series has an explicitly defined index associated with the values.
'area': area})
states
The Series object provides a mapping from a collection of keys to a collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])  # explicit index labels
data
Output:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of options.
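Two of those options, `ignore_index` and `join`, are shown in this brief sketch. The DataFrames `df1` and `df2` are hypothetical and deliberately share only one column.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Default join="outer": keep every column, filling gaps with NaN.
outer = pd.concat([df1, df2], ignore_index=True)

# join="inner": keep only the columns the inputs have in common.
inner = pd.concat([df1, df2], ignore_index=True, join="inner")

print(outer)
print(inner)
```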
mpl
A scatter plot is similar to a line plot. Instead of points being joined by line segments,
the points are represented individually with a dot, circle, or other shape.
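import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)  # seeded generator so the plot is reproducible
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)  # one color value per point
plt.scatter(x, y, c=colors, cmap='viridis')
plt.colorbar()  # show the color scale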
A basic errorbar can be created with a single Matplotlib function call (Figure 4-27):
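%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
x = np.linspace(0, 10, 50)
dy = 0.8  # size of the error on each y value
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k')  # black dots with vertical error bars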
8. What is seaborn?
Seaborn is a data visualization library built on top of matplotlib and closely integrated
with pandas data structures in Python. Visualization is the central part of Seaborn which helps in
exploration and understanding of data.
functionalities:
Python StatsModels allows users to explore data, perform statistical tests, and estimate
statistical models. It is intended to complement SciPy's stats module. It is part of the Python
scientific stack that deals with data science, statistics, and data analysis.
Bokeh is a Python library for creating interactive visualizations for modern web
browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards
with streaming datasets. With Bokeh, you can create JavaScript-powered visualizations without
writing any JavaScript yourself.
Bokeh can be used as both a high-level and a low-level interface; thus, it can create many
of the sophisticated plots that Matplotlib creates, but with fewer lines of code and higher resolution.
Bokeh also makes it easy to link between plots.
Density mapping is simply a way to show where points or lines may be concentrated in a given
area.
Plotly can also be used to style interactive graphs in Jupyter notebooks. It provides figure
converters, which convert matplotlib, ggplot2, and IGOR Pro graphs into interactive, online
graphs.
Data binning is a type of data preprocessing, a mechanism which also includes dealing
with missing values, formatting, normalization, and standardization. Binning can be applied to
convert numeric values to categorical or to sample numeric values.
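Converting numeric values to categories can be sketched with `pd.cut`. The ages, bin edges, and labels below are illustrative choices, not a standard.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])  # hypothetical numeric values

# Bins are right-inclusive by default: (0, 18], (18, 65], (65, 120].
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])
print(groups.tolist())
```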
Matplotlib can be used in multiple ways in Python, including Python scripts, the Python
and IPython shells, and Jupyter notebooks. This is why it is often used to create
visualizations not just by data scientists but also by researchers who need graphs of
publication quality.
Matplotlib supports all the popular charts (plots, histograms, power spectra, bar charts, error
charts, scatterplots, etc.) right out of the box. There are also extensions that you can use to create
advanced visualizations like 3-Dimensional plots, etc.
To get started with seaborn, install it from the terminal with either
pip install seaborn or conda install seaborn. Then include import seaborn as sns at the top
of your Python file.
Plotly allows users to import, copy and paste, or stream data to be analyzed and
visualized. For analysis and styling graphs, Plotly offers a Python sandbox (NumPy supported),
datagrid, and GUI. Python scripts can be saved, shared, and collaboratively edited in Plotly.