FDS - Final QB



2 Marks – Questions and Answers

UNIT I - INTRODUCTION
1. What is Data Science?

Data science is the field of study that applies methods to analyze massive amounts of
data and extract knowledge from all the data that is gathered.

2. Explain the benefits of using statistics in data science.

Statistics helps data scientists get a better idea of customers' expectations. Using
statistical methods, data scientists can gain knowledge of consumer interest, behavior,
engagement, retention, and so on. Statistics also helps build powerful data models to validate
inferences and predictions.

3. What are the needs of data science?

The basic needs of data science are:

 Better Decision Making
 Predictive Analysis
 Pattern Discovery

4. List the various fields where data science is used.

Data science is used almost everywhere, in both commercial and non-commercial settings.
Some of the fields where data science is used are:

 Healthcare industry

 Retailers

 Financial sectors

 Transportation

 Government sectors

 Universities

5. What are the three sub-phases of data preparation?

The three sub-phases of data preparation are:

 Data cleaning
 Data Integration.
 Data Transformation.

6. What is data cleaning?

Data cleaning is the process of removing missing values, false values, and inconsistencies
across data sources.

7. Define streaming data.

Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously and in small sizes. Examples are the "What's
trending" feed on Twitter, live sporting or music events, and the stock market.

8. What is a Pareto chart?

• It is a graph that indicates the frequency of defects as well as their cumulative impact.

• Pareto charts are useful for finding which defects to prioritize in order to achieve the
greatest overall improvement.

• It is a combination of a bar graph and a line graph.

9. What are recommender systems?

Recommender systems are a subclass of information filtering systems, used to predict
how a user would rate or score particular objects (movies, music, merchandise, etc.).
Recommender systems filter large volumes of information based on the data provided by a user
and other factors.

Recommender systems utilize algorithms that optimize the analysis of the data to build
the recommendations.

10. What are various forms of data used in data science?

• The main categories of data are:

 Structured data

 Unstructured data

 Natural language

 Machine-generated


 Graph-based

 Audio, video and images

 Streaming

11. Explain how a recommender system works.

A recommender system is a system that many consumer-facing, content-driven online
platforms employ to generate recommendations for users from a library of available content.

These systems generate recommendations based on what they know about the users'
tastes from their activities on the platform.

12. List out the steps involved in data science process.

1. Setting the research goal

2. Retrieving data

3. Data Preparation

4. Data Exploration

5. Data Mining

6. Presentation and automation

13. Mention the inputs covered in the project charter.

 A clear research goal


 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline

14. List the various open data site providers.

Open data site                Description

Data.gov                      The home of the US Government's open data
https://open-data.europa.eu/  The home of the European Commission's open data
Freebase.org                  An open database that retrieves its information from sites like
                              Wikipedia, MusicBrainz, and the SEC archive
Data.worldbank.org            Open data initiative from the World Bank
Aiddata.org                   Open data for international development
Open.fda.gov                  Open data from the US Food and Drug Administration

15. Define the techniques to handle missing data.

 Omit the values


 Set value to null
 Impute a static value such as 0 or the mean
 Impute a value from an estimated or theoretical distribution.
 Modeling the value (nondependent)
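For example, several of these techniques can be applied with pandas (a minimal sketch; the Series values are invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

omitted = s.dropna()              # omit the missing values entirely
zero_filled = s.fillna(0)         # impute a static value such as 0
mean_filled = s.fillna(s.mean())  # impute the mean of the observed values
```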

16. What is an outlier?

• An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the
other observations.

• The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.

17. What are the different types of recommender systems?

The two types of recommender systems are:

Collaborative filtering – Collaborative filtering is a method of making automatic predictions by
using the recommendations of other people.

Content-based filtering – It is based on the description of an item and a user's choices. As the
name suggests, it uses content (keywords) to describe the items, and the user profile is built to
state the type of item this user likes.

18. What are the steps involved in model building?

The model building consists of the following steps such as

a. Selection of a modeling technique and variables to enter in the model


b. Execution of the model
c. Diagnosis and model comparison.

19. What are the various operations involved in combining data?

There are two operations for combining information from different data sets:

• The first operation is joining: enriching an observation from one table with information
from another table.

• The second operation is appending or stacking: adding the observation of one table to
those of another table.
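These two operations can be illustrated with pandas (a small sketch with made-up tables; merge() performs the join and concat() performs the append):

```python
import pandas as pd

customers = pd.DataFrame({'id': [1, 2], 'name': ['Ann', 'Bob']})
orders = pd.DataFrame({'id': [1, 2], 'total': [250, 310]})

# Joining: enrich each observation in one table with columns from another
joined = customers.merge(orders, on='id')

# Appending (stacking): add the observations of one table to those of another
more = pd.DataFrame({'id': [3], 'name': ['Cara']})
stacked = pd.concat([customers, more], ignore_index=True)
```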

20. What is the difference between a bar graph and a histogram?

Bar charts and histograms can both be used to compare the sizes of different groups. A bar
chart is made up of bars plotted on a chart. A histogram is a graph that represents a frequency
distribution; the heights of the bars represent observed frequencies.

21. How does data cleaning play a vital role in the analysis?

Data cleaning helps in analysis because:

 Cleaning data from multiple sources helps to transform it into a format that data analysts
or data scientists can work with.
 Data cleaning helps to increase the accuracy of the model in machine learning.
 It is a cumbersome process because as the number of data sources increases, the time
taken to clean the data increases exponentially due to the number of sources and the
volume of data generated by these sources.

22. What is Presentation and automation?

 Presenting your results to the stakeholders and industrializing your analysis process for
repetitive reuse and integration with other tools.

23. Differentiate between Data Science and Machine Learning.

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems
to extract knowledge from structured and unstructured data.
Machine Learning is the scientific study of algorithms and statistical methods.

Data Science helps you create insights from data while dealing with real-world complexities.
Machine Learning helps you predict outcomes for new data from historical data with the help
of mathematical models.

Data Science is a complete process.
Machine Learning is a single step in the entire data science process.

Data Science is not a subset of Artificial Intelligence (AI).
Machine Learning is a subset of AI.

In data science, high RAM and SSDs are used to overcome I/O bottleneck problems.
In machine learning, GPUs are used for intensive vector operations.


24. Define Data Preparation.

• Data preparation is the process of cleaning and transforming raw data prior to processing
and analysis.

• It is an important step prior to processing and often involves reformatting data, making
corrections to data and the combining of data sets to enrich data.

25. Define box plots.

• Box plots are a standardized way of displaying the distribution of data based on a
five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and
"maximum").

• Median : Middle value of a data set

• First quartile: the middle number between the smallest number and the median.

• Third quartile: the middle number between the median and highest value of the dataset.
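The five-number summary can be computed with NumPy, for example (sample data invented for illustration):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# minimum, Q1, median, Q3, maximum: the five-number summary behind a box plot
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
```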

26. What is brushing and linking?

• With brushing and linking you combine and link different graphs and tables (or views) so
changes in one graph are automatically transferred to the other graphs.


UNIT -2

DESCRIBING DATA

1. Discuss the differences between the frequency table and the frequency distribution table.

A frequency table is a tabular method where each part of the data is assigned its
corresponding frequency, whereas a frequency distribution is generally the graphical
representation of a frequency table.

2. What are the various types of frequency distributions?

Different types of frequency distributions are as follows:

1. Grouped frequency distribution.

2. Ungrouped frequency distribution.

3. Cumulative frequency distribution.

4. Relative frequency distribution.

5. Relative cumulative frequency distribution, etc.

3. What are some characteristics of the frequency distribution?

Some major characteristics of the frequency distribution are given as follows:


1. Measures of central tendency and location i.e. mean, median, and mode.
2. Measures of dispersion i.e. range, variance, and the standard deviation.

3. The extent of the symmetry or asymmetry i.e. skewness.

4. The flatness or the peakedness, i.e., kurtosis.

4. What is the importance of frequency distribution?

Frequency distributions are of great value in statistics. A well-formed frequency
distribution creates the possibility of a detailed analysis of the structure of the population, so the
groups into which the population breaks down are determinable.

5. What is frequency distribution?

A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency (f) of occurrence in each class.

6. What are the essential guidelines for frequency distributions?

 Each observation should be included in one, and only one, class.
 List all classes, even those with zero frequencies.
 All classes should have equal intervals.

6. What are real limits?

The real limits are located at the midpoint of the gap between adjacent tabled boundaries;
that is, one-half of one unit of measurement below the lower tabled boundary and one-half of one
unit of measurement above the upper tabled boundary.

7. Define relative frequency distributions.

Relative frequency distributions show the frequency of each class as a part or fraction of
the total frequency for the entire distribution.

8. How to convert frequency distribution into relative frequency distribution.

To convert a frequency distribution into a relative frequency distribution, divide the


frequency for each class by the total frequency for the entire distribution.
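For example, with pandas (invented grades for illustration):

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])

freq = grades.value_counts()   # frequency distribution
rel_freq = freq / freq.sum()   # divide each class frequency by the total frequency
```

The relative frequencies always sum to 1 across the entire distribution.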

9. Define Cumulative frequency distribution.

Cumulative frequency distributions show the total number of observations in each class
and in all lower-ranked classes.

10. What are percentile ranks?

The percentile rank of a score indicates the percentage of scores in the entire distribution
with similar or smaller values than that score. Thus a weight has a percentile rank of 80 if equal
or lighter weights constitute 80 percent of the entire distribution.
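For example, in Python (invented weights; the percentile rank counts scores less than or equal to the given score):

```python
import numpy as np

weights = np.array([120, 135, 140, 150, 160, 175, 180, 190, 200, 210])
score = 190

# percentage of scores with values equal to or smaller than this score
percentile_rank = 100 * np.mean(weights <= score)
```

Here 8 of the 10 weights are at or below 190, so the percentile rank is 80.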

11.List some of the features of histogram.

 Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
 Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
 The intersection of the two axes defines the origin at which both numerical scales equal
0.
 Numerical scales always increase from left to right along the horizontal axis and from
bottom to top along the vertical axis.
 The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.


12. Define stem and leaf display.

A device for sorting quantitative data on the basis of leading and trailing digits.

13. Difference between positively skewed and negatively skewed distribution.

Positively Skewed Distribution

A distribution that includes a few extreme observations in the positive direction (to the right of
the majority of observations).

Negatively Skewed Distribution

A distribution that includes a few extreme observations in the negative direction (to the left of the
majority of observations).

14. Define mode.

The mode reflects the value of the most frequently occurring score.

15. Define multimodal distributions.

Distributions can have more than one mode (or no mode at all). Distributions with two
obvious peaks, even though they are not exactly the same height, are referred to as bimodal.
Distributions with more than two peaks are referred to as multimodal.

16. Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65,
63.

Answer: 63

17. Define median.

The median reflects the middle value when observations are ordered from least to most.

18. List out the steps to find the median values.

1.Order scores from least to most.

2. Find the middle position by adding one to the total number of scores and dividing by 2.

3.If the middle position is a whole number, as in the left-hand panel below, use this
number to count into the set of ordered scores.

4. The value of the median equals the value of the score located at the middle position.

5. If the middle position is not a whole number, as in the right-hand panel below, use the
two nearest whole numbers to count into the set of ordered scores.


6. The value of the median equals the value midway between those of the two
middlemost scores; to find the midway value, add the two given values and divide by 2.
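The steps above can be sketched as a small Python function (an illustrative implementation, not from the text):

```python
def median(scores):
    ordered = sorted(scores)          # step 1: order from least to most
    middle = (len(ordered) + 1) / 2   # step 2: middle position
    if middle == int(middle):         # whole number: take the score at that position
        return ordered[int(middle) - 1]
    lo = int(middle)                  # otherwise average the two middlemost scores
    return (ordered[lo - 1] + ordered[lo]) / 2
```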

19. Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.

median = 63

20. When do we use the median value?

The median can be used whenever it is possible to order qualitative data from least to
most because the level of measurement is ordinal.

21. What is the range?

The range is the difference between the largest and smallest scores.

22. What is degrees of freedom (df)?

Degrees of freedom (df) refers to the number of values that are free to vary, given one or
more mathematical restrictions, in a sample being used to estimate a population characteristic.

23. How do you calculate the IQR?

1. Order scores from least to most.

2. To determine how far to penetrate the set of ordered scores, begin at either end, then add 1
to the total number of scores and divide by 4. If necessary, round the result to the nearest
whole number.

3. Beginning with the largest score, count the requisite number of steps (calculated in step 2)
into the ordered scores to find the location of the third quartile.

4. The third quartile equals the value of the score at this location.

5. Beginning with the smallest score, again count the requisite number of steps into the
ordered scores to find the location of the first quartile.

6. The first quartile equals the value of the score at this location.

7. The IQR equals the third quartile minus the first quartile.
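For example, NumPy's percentile function gives the quartiles directly (note that its interpolation method may differ slightly from the counting procedure above; the scores are invented for illustration):

```python
import numpy as np

scores = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])

q1, q3 = np.percentile(scores, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # IQR = Q3 - Q1
```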

UNIT – III

DESCRIBING RELATIONSHIPS

1. Define Normal curve.

A theoretical curve noted for its symmetrical bell-shaped form.

2. List some of the properties of the normal curve.

 Obtained from a mathematical equation, the normal curve is a theoretical curve defined
for a continuous variable
 Because the normal curve is symmetrical, its lower half is the mirror image of its upper
half.
 Being bell shaped, the normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak (without actually
touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely
far).
 The values of the mean, median (or 50th percentile), and mode, located at a point midway
along the horizontal spread, are the same for the normal curve.

3. Define z scores

A z score is a unit-free, standardized score that, regardless of the original units of


measurement, indicates how many standard deviations a score is above or below the mean of its
distribution.
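For example, z scores can be computed in NumPy as (X − mean) / standard deviation (sample scores invented for illustration):

```python
import numpy as np

scores = np.array([50, 60, 70, 80, 90])

# each z score tells how many standard deviations a score lies from the mean
z = (scores - scores.mean()) / scores.std()
```

The resulting z distribution always has a mean of 0 and a standard deviation of 1.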

4. How to find the proportion for one score.

 Sketch a normal curve and shade in the target area.


 Plan your solution according to the normal table.
 Convert X to z.
 Find the target area.

5. What is a standard score?

Any unit-free scores expressed relative to a known mean and a known standard deviation
are referred to as standard scores. Although z scores qualify as standard scores because they
are unit-free and expressed relative to a known mean of 0 and a known standard deviation of 1,
other scores also qualify as standard scores.


6. What is a transformed standard score?

Particularly when reporting test results to a wide audience, z scores can be changed to
transformed standard scores, other types of unit-free standard scores that lack negative signs and
decimal points. These transformations change neither the shape of the original distribution nor
the relative standing of any test score within the distribution.

7. Define Scatterplots.

A scatterplot is a graph containing a cluster of dots that represents all pairs of scores.
With a little training, you can use any dot cluster as a preview of a fully measured relationship.

8.Define correlation coefficient.

A correlation coefficient is a number between –1 and 1 that describes the relationship


between pairs of variables.

9. Specify the properties of the correlation coefficient.

Two properties are:

1. The sign of r indicates the type of linear relationship, whether positive or negative.

2. The numerical value of r, without regard to sign, indicates the strength of the linear
relationship.
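Both properties can be seen with np.corrcoef (invented data showing perfect positive and perfect negative linear relationships):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 6, 8, 10])   # perfect positive linear relationship
y_neg = np.array([10, 8, 6, 4, 2])   # perfect negative linear relationship

r_pos = np.corrcoef(x, y_pos)[0, 1]  # sign is positive, magnitude is 1
r_neg = np.corrcoef(x, y_neg)[0, 1]  # sign is negative, magnitude is 1
```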

10. Define least square regression equation.

The equation that minimizes the total of all squared prediction errors for known Y scores
in the original correlation analysis.

11. Assume that an r of .30 describes the relationship between educational level (highest grade
completed) and estimated number of hours spent reading each week. More specifically:

educational level (X)      weekly reading time (Y)

X̄ = 13                     Ȳ = 8
SSx = 25                   SSy = 50
r = .30

(a) Determine the least squares equation for predicting weekly reading time from educational
level.

Answer:

b = √(50/25)(.30) = .42; a = 8 – (.42)(13) = 2.54
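The arithmetic can be checked with a short Python snippet (rounding b to two decimals first, as the answer does):

```python
import math

r, SSx, SSy = 0.30, 25, 50
x_mean, y_mean = 13, 8

b = round(r * math.sqrt(SSy / SSx), 2)  # slope: b = r * sqrt(SSy / SSx)
a = round(y_mean - b * x_mean, 2)       # intercept: a = Ybar - b * Xbar
# least squares equation: Y' = 0.42X + 2.54
```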

12. What is the standard error of estimate?

A rough measure of the average amount of predictive error.

13. What is the squared correlation coefficient (r²)?

The square of the correlation coefficient, r², always indicates the proportion of total
variability in one variable that is predictable from its relationship with the other variable.

14. What is a multiple regression equation?

A least squares equation that contains more than one predictor or X variable.

UNIT – IV - PYTHON LIBRARIES FOR DATA WRANGLING

1. Define NumPy.

 NumPy stands for Numerical Python and is one of the most useful scientific libraries in
Python programming.
 It provides support for large multidimensional array objects and various tools to work
with them. NumPy arrays are called ndarray or N-dimensional arrays and they store
elements of the same type and size.

2. Define a few categories of array manipulations.

A few categories of basic array manipulations are:

 Attributes of arrays
 Indexing of arrays
 Slicing of arrays
 Reshaping of arrays
 Joining and splitting of arrays

3. Mention some NumPy array attributes.

NumPy array has attributes ndim (the number of dimensions), shape (the size of each
dimension), and size (the total size of the array)

4. Write a command to create a two-dimensional array.

import numpy as np

np.random.seed(0)  # seed for reproducibility
x2 = np.random.randint(10, size=(3, 4))  # two-dimensional array

5. Write the syntax for accessing subarrays.

Subarrays are accessed with the slice notation, marked by the colon (:) character. The
NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use
this:

x[start:stop:step]

If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
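For example:

```python
import numpy as np

x = np.arange(10)       # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

first_five = x[:5]      # start defaults to 0
every_other = x[::2]    # step of 2
reversed_x = x[::-1]    # a negative step reverses the array
```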

6. What is reshaping of arrays?

Another useful type of operation is reshaping of arrays, that is, changing the number of
rows and columns of an array without changing its data. The most flexible way of doing this is
with the reshape() method. For example, if you want to put the numbers 1 through 9 in a 3×3
grid, you can do the following:

grid = np.arange(1, 10).reshape((3, 3))

print(grid)

[[1 2 3]

[4 5 6]

[7 8 9]]

7. What is Broadcasting.

Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction,
multiplication, etc.) on arrays of different sizes.

8. What are the rules involved in Broadcasting?

Broadcasting in NumPy follows a strict set of rules to determine the interaction between
the two arrays:


• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer
dimensions is padded with ones on its leading (left) side.

• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape
equal to 1 in that dimension is stretched to match the other shape.

• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
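For example (the shapes in the comments follow the rules above):

```python
import numpy as np

a = np.arange(3)                 # shape (3,)
M = np.ones((3, 3))              # shape (3, 3)
# rule 1 pads a to shape (1, 3); rule 2 stretches it to (3, 3)
summed = M + a

b = np.arange(3).reshape(3, 1)   # shape (3, 1)
# both arrays are stretched to (3, 3)
grid = a + b
```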

9. Define Records arrays.

NumPy also provides the np.recarray class, which is almost identical to the structured
arrays just described, but with one additional feature: fields can be accessed as attributes rather
than as dictionary keys.

data_rec = data.view(np.recarray)

data_rec.age

10. What are the two objects in Pandas?

The two objects of Pandas are the Series object and the DataFrame object.

11. Difference between Numpy and Pandas.

The essential difference between NumPy and Pandas is the presence of the index: while
the NumPy array has an implicitly defined integer index used to access the values, the Pandas
Series has an explicitly defined index associated with the values.

12. Write a command to construct a DataFrame object in Pandas.

states = pd.DataFrame({'population': population,

'area': area})

states

Output: area population

California 423967 38332521

Florida 170312 19552860

Illinois 149995 12882135

New York 141297 19651127

Texas 695662 26448193


13. What is data selection in a Series?

The Series object provides a mapping from a collection of keys to a collection of values:

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],

index=['a', 'b', 'c', 'd'])

data

Output: a 0.25

b 0.50

c 0.75

d 1.00

dtype: float64

14. What are the two methods of combining datasets in Pandas?

The two methods of combining datasets in Pandas are concat and append.

15. Write the syntax for the concatenation operation in Pandas.

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of options:

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,

keys=None, levels=None, names=None, verify_integrity=False, copy=True)
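A minimal usage example (invented Series):

```python
import pandas as pd

ser1 = pd.Series(['A', 'B'], index=[1, 2])
ser2 = pd.Series(['C', 'D'], index=[3, 4])

combined = pd.concat([ser1, ser2])  # stacks ser2's rows after ser1's
```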

UNIT –V - DATA VISUALIZATION

1. What is data visualization?

The graphical representation of a dataset is known as data visualization.

2. Write a command to import Matplotlib.

import matplotlib as mpl
import matplotlib.pyplot as plt


3. What are the two interfaces of Matplotlib?

A notable feature of Matplotlib is its dual interfaces: a convenient MATLAB-style
state-based interface, and a more powerful object-oriented interface.

4. Define scatter plots.

A scatter plot is similar to a line plot. Instead of points being joined by line segments,
the points are represented individually with a dot, circle, or other shape.

5. Write a program to create scatter plots with plt.scatter.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar()  # show color scale

6. Why is plt.plot more efficient than plt.scatter?

 plt.plot can be noticeably more efficient than plt.scatter.


 The reason is that plt.scatter has the capability to render a different size and/or color for
each point, so the renderer must do the extra work of constructing each point
individually. In plt.plot, on the other hand, the points are always essentially clones of
each other, so the work of determining the appearance of the points is done only once for
the entire set of data.
 For large datasets, the difference between these two can lead to vastly different
performance, and for this reason, plt.plot should be preferred over plt.scatter for large
datasets.

7. How do you create a basic errorbar?

A basic errorbar can be created with a single Matplotlib function call (Figure 4-27):

%matplotlib inline


import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')

import numpy as np

x = np.linspace(0, 10, 50)

dy = 0.8

y = np.sin(x) + dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k');

8. What is seaborn?

Seaborn is a data visualization library built on top of matplotlib and closely integrated
with pandas data structures in Python. Visualization is the central part of Seaborn which helps in
exploration and understanding of data.

9. Mention the functionalities of Seaborn.

Seaborn offers the following functionalities:

1. Dataset oriented API to determine the relationship between variables.

2. Automatic estimation and plotting of linear regression plots.

3. It supports high-level abstractions for multi-plot grids.

4. Visualizing univariate and bivariate distribution.

10. What is the use of Statsmodels in Python?

Python StatsModels allows users to explore data, perform statistical tests, and estimate
statistical models. It is intended to complement SciPy's stats module. It is part of the Python
scientific stack that deals with data science, statistics, and data analysis.

11. What is Python Bokeh used for?

Bokeh is a Python library for creating interactive visualizations for modern web
browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards
with streaming datasets. With Bokeh, you can create JavaScript-powered visualizations without
writing any JavaScript yourself.


12. What is a Bokeh chart?

Bokeh is a data visualization library in Python that provides high-performance interactive
charts and plots. Bokeh output can be obtained in various mediums like notebook, HTML, and
server.

13. Is Bokeh better than Matplotlib?

Bokeh can be both used as a high-level or low-level interface; thus, it can create many
sophisticated plots that Matplotlib creates but with fewer lines of code and higher resolution.
Bokeh also makes it really easy to link between plots.

14. Mention the uses of Density Maps

Density mapping is simply a way to show where points or lines may be concentrated in a given
area.

Using Density Maps with Python

Here, we will be using a worldwide dataset of earthquakes and their magnitudes.

15. Where is Plotly used?

Plotly can also be used to style interactive graphs in Jupyter notebooks. Its figure
converters convert matplotlib, ggplot2, and IGOR Pro graphs into interactive, online graphs.

16. What is binning?

Data binning is a type of data preprocessing, a mechanism that also includes dealing
with missing values, formatting, normalization, and standardization. Binning can be applied to
convert numeric values to categorical values or to sample numeric values.
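For example, pandas' cut() converts numeric values to categories by binning (bin edges and labels invented for illustration):

```python
import pandas as pd

ages = [3, 17, 25, 40, 67]

# convert numeric values to categorical by assigning each age to a bin
groups = pd.cut(ages, bins=[0, 18, 35, 65, 100],
                labels=['child', 'young adult', 'adult', 'senior'])
```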

17. What are the features of Matplotlib?

Matplotlib can be used in multiple ways in Python, including Python scripts, the Python
and iPython shells, Jupyter Notebooks and what not! This is why it’s often used to create
visualizations not just by Data Scientists but also by researchers to create graphs that are of
publication quality.

Matplotlib supports all the popular charts (plots, histograms, power spectra, bar charts, error
charts, scatterplots, etc.) right out of the box. There are also extensions that you can use to create
advanced visualizations like 3-dimensional plots.


18. How do you import Seaborn (sns) in Python?

To get started with Seaborn, you need to install it in the terminal with either
pip install seaborn or conda install seaborn. Then simply include import seaborn as sns at the top
of your Python file.

19. What is a contour plot?

A contour plot is a graphical technique which portrays a 3-dimensional surface in two
dimensions. Such a plot contains contour lines, which are constant z slices. To draw the contour
line for a certain z value, we connect all the (x, y) pairs, which produce the value z.

20. What is the use of Plotly in Python?

Plotly allows users to import, copy and paste, or stream data to be analyzed and
visualized. For analysis and styling graphs, Plotly offers a Python sandbox (NumPy supported),
datagrid, and GUI. Python scripts can be saved, shared, and collaboratively edited in Plotly.

21. Define density plot in Python.

A 2D histogram contour plot, also known as a density contour plot, is a 2-dimensional
generalization of a histogram which resembles a contour plot but is computed by grouping a set
of points specified by their x and y coordinates into bins, and applying an aggregation function
such as count or sum (if z is provided) to compute the value used to compute contours.
