MLM FDS


CS3352- Foundations of Data Science

2021 Regulations

UNIT I–INTRODUCTION

PART-A

1. What is data science?

Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze
large amounts of data.

2. What is the role of data science in business, medical research, healthcare,
education, social media, technology and financial institutions?

Academic institutions use data science to monitor student performance and improve
their marketing to prospective students.

3. Write the main types/categories of data?

Nominal, Ordinal, Discrete, Continuous

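4. What is NLP? Is natural language structured data?

Natural language processing (NLP) combines computational linguistics, machine
learning, and deep learning models to process human language. Natural language
itself is unstructured data: it does not follow a fixed schema of rows and columns.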

5. What is machine generated data with an example?

Machine-generated data is information that is created automatically by a computer,
process, application, or other machine without human intervention. It can be found
across all sectors of computing and business, and it is often produced as a by-product
of users' activity without their being aware of it. Examples of machine-generated data
include web server logs, sensor readings, and API call logs.

6. What is graph-based or network data?

Graph or network data is, in short, data that focuses on the relationship or adjacency
of objects.

7. List the steps involved in data science processing?

1 – Problem Identification and Business Understanding.
2 – Data Collection and Exploration.
3 – Data Preparation and Cleaning.
4 – Data Modeling and Analysis.
5 – Model Evaluation and Interpretation of Results.
6 – Deployment and Communication of Findings.

8. What are outliers?


An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.
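One common way to make "abnormal distance" concrete is the 1.5 × IQR rule. The sketch below is only an illustration under that assumption: the sample values are hypothetical, and the 1.5 multiplier is a convention rather than something stated in these notes.

```python
import numpy as np

# Hypothetical sample; 95 is deliberately far from the other values.
values = np.array([12, 14, 15, 15, 16, 18, 19, 95])

# Quartiles and interquartile range (NumPy's default linear interpolation).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)  # [95]
```

Other rules (for example, z-scores) may flag different points, so the choice of rule should be stated whenever outliers are reported.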
9. What are the different ways of combining data?
Append Rows.
Append Columns.
Conditional Merge.
10. What is big data?
Big data refers to extremely large and diverse collections of structured, unstructured, and
semi-structured data that continue to grow exponentially over time.
PART-B
1. Explain about roles and stages in data science project
In the majority of cases, a Data Science project will have to go through five key
stages: defining a problem, data processing, modelling, evaluation and deployment.
The seven steps below break these stages down further (data processing covers
collection and preparation; modelling covers analysis and model building) and explain
why each is important to a successful Data Science project.
Step 1: Problem Identification and Planning
The first step in the data science project life cycle is to identify the problem that needs
to be solved. This involves understanding the business requirements and the goals
of the project. Once the problem has been identified, the data science team will plan
the project by determining the data sources, the data collection process, and the
analytical methods that will be used.
Step 2: Data Collection
The second step in the data science project life cycle is data collection. This involves
collecting the data that will be used in the analysis. The data science team must
ensure that the data is accurate, complete, and relevant to the problem being solved.
Step 3: Data Preparation
The third step in the data science project life cycle is data preparation. This involves
cleaning and transforming the data to make it suitable for analysis. The data science
team will remove any duplicates, missing values, or irrelevant data from the
dataset. They will also transform the data into a format that is suitable for analysis.
Step 4: Data Analysis
The fourth step in the data science project life cycle is data analysis. This involves
applying analytical methods to the data to extract insights and patterns. The data
science team may use techniques such as regression analysis, clustering, or
machine learning algorithms to analyze the data.
Step 5: Model Building
The fifth step in the data science project life cycle is model building. This involves
building a predictive model that can be used to make predictions based on the data
analysis. The data science team will use the insights and patterns from the data
analysis to build a model that can predict future outcomes.
Step 6: Model Evaluation
The sixth step in the data science project life cycle is model evaluation. This involves
evaluating the performance of the predictive model to ensure that it is accurate and
reliable. The data science team will test the model using a validation dataset to
determine its accuracy and performance.
Step 7: Model Deployment
The final step in the data science project life cycle is model deployment. This involves
deploying the predictive model into production so that it can be used to make
predictions in real-world scenarios. The deployment process involves integrating
the model into the existing business processes and systems to ensure that it can be
used effectively.

2. Illustrate Data cleaning process


Step 1: Remove duplicate or irrelevant observations. Remove unwanted
observations from your dataset, including duplicate observations or irrelevant
observations. ...
Step 2: Fix structural errors. ...
Step 3: Filter unwanted outliers. ...
Step 4: Handle missing data. ...
Step 5: Validate and QA.
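The five steps above can be sketched with pandas as follows. This is a minimal illustration, not a prescribed procedure: the table, the column names (city, temp_c), and the plausible temperature range of -10 to 60 are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, inconsistent labels,
# an impossible temperature, and a missing value.
raw = pd.DataFrame({
    "city": ["Chennai", "chennai ", "Madurai", "Madurai", "Salem", "Salem", "Erode"],
    "temp_c": [31.0, 30.0, 33.0, 33.0, np.nan, 250.0, 29.0],
})

# Step 1: remove duplicate observations.
clean = raw.drop_duplicates().copy()

# Step 2: fix structural errors (stray spaces, inconsistent capitalisation).
clean["city"] = clean["city"].str.strip().str.title()

# Step 3: filter unwanted outliers, keeping missing values for step 4.
# The range -10..60 is an assumed plausibility check for this example.
plausible = clean["temp_c"].isna() | clean["temp_c"].between(-10, 60)
clean = clean[plausible].copy()

# Step 4: handle missing data (here: fill with the column median).
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].median())

# Step 5: validate and QA.
assert clean["temp_c"].notna().all()
assert not clean.duplicated().any()
print(clean)
```

Whether to drop or fill missing values in step 4 depends on the dataset; the median fill here is just one option.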

3. Demonstrate data Exploration process


In data science, there are two primary methods for extracting data from disparate sources:
data exploration and data mining.
Data exploration is a broad process that is performed by business users and an increasing
number of citizen data scientists with no formal training in data science or analytics, but
whose jobs depend on understanding data trends and patterns. Visualization tools help this
wide-ranging group to better export and examine a variety of metrics and data sets.
Data mining is a specific process, usually undertaken by data professionals. Data analysts
create association rules and parameters to sort through extremely large data sets and
identify patterns and future trends.
Typically, data exploration is performed first to assess the relationships between variables.
Then the data mining begins. Through this process, data models are created to gather
additional insight from the data.
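A first exploration pass often amounts to summary statistics plus a look at how the variables relate. The sketch below assumes pandas and uses a small made-up table (hours, marks) purely for illustration.

```python
import pandas as pd

# Hypothetical data: hours studied and marks obtained.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "marks": [52, 55, 61, 64, 70],
})

# Summary statistics for every numeric column (count, mean, std, quartiles...).
summary = df.describe()

# Pairwise correlations, to assess relationships between variables.
correlations = df.corr()

print(summary)
print(correlations)
```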
4. Describe the facets of data.
Data science is focused on making sense of complex datasets and in building predictive
models from those data. As such, it encompasses a wide array of different
activities, from the upstream processes of data acquisition, cleaning and integration
to downstream processes of data analysis, modeling and prediction. There are many
facets of data science, including:

Identifying the structure of data
Accessing and importing data
Cleaning, filtering, reorganizing, augmenting, and aggregating data
Visualizing data
Data analysis, statistics, and modeling
Machine Learning
Assembling data processing pipelines to link these steps
Leveraging high-end computational resources for large-scale problems
5. Describe data mining
Data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets. Given the
evolution of data warehousing technology and the growth of big data, adoption of data
mining techniques has rapidly accelerated over the last couple of decades, assisting
companies by transforming their raw data into useful knowledge. However, despite the
fact that technology continuously evolves to handle data at a large scale, leaders
still face challenges with scalability and automation.
Data mining has improved organizational decision-making through insightful data
analyses. The data mining techniques that underpin these analyses can be divided by
purpose into two main groups: they can either describe the target dataset or predict
outcomes through the use of machine learning algorithms. These methods are used to
organize and filter data, surfacing the most interesting information, from fraud
detection to user behaviors, bottlenecks and even security breaches.

UNIT II-DESCRIBING DATA


PART-A
1. What is frequency distribution?
A frequency distribution is a representation, either in a graphical or tabular format, that
displays the number of observations within a given interval.
2. What are the types and uses of frequency distribution?
In statistics, there are four types of frequency distributions, which are discussed
below:

Ungrouped frequency distributions: Instead of groups of data values, it presents the
frequency of each individual data value.
Grouped frequency distributions: The data is organised into groups called class
intervals. The grouped frequency table shows the frequency of data belonging to each
class interval.
Relative frequency distributions: It indicates what proportion (or percentage) of the
total number of observations belongs to each category.
Cumulative frequency distributions: A cumulative frequency is the sum of a class's
frequency and all frequencies below it. We add each frequency to the next, then add
that sum to the next, and so on until the last value is reached. The last cumulative
frequency equals the sum of all frequencies.
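The four kinds of frequency distribution can be computed side by side with pandas. This is a sketch on made-up scores; the class intervals (0, 5] and (5, 10] are an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical scores out of 10.
scores = pd.Series([2, 3, 3, 5, 7, 7, 7, 8, 9, 10])

# Ungrouped: frequency of each individual value.
ungrouped = scores.value_counts().sort_index()

# Grouped: frequency within non-overlapping class intervals (0, 5] and (5, 10].
grouped = pd.cut(scores, bins=[0, 5, 10]).value_counts().sort_index()

# Relative: proportion of all observations in each class.
relative = grouped / grouped.sum()

# Cumulative: running total of class frequencies.
cumulative = grouped.cumsum()

print(ungrouped, grouped, relative, cumulative, sep="\n\n")
```

The last cumulative value equals the total number of observations (10 here), which is a quick check that no value fell outside the class intervals.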
3. What is grouped frequency distribution?
A grouped frequency table (grouped frequency distribution) is a way of organising a
large set of data into more manageable groups. The groups that we organise the
numerical data into are called class intervals. They can have the same or different class
widths and must not overlap.
4. What is the ungrouped frequency distribution?
The ungrouped frequency distribution is a type of frequency distribution that displays
the frequency of each individual data value instead of groups of data values. In this
type of frequency distribution, we can directly see how often different values occurred
in the table.
5. What is cumulative frequency distribution?
Cumulative frequency distribution is a form of frequency distribution that represents
the sum of a class and all classes below it.
6. What is relative frequency distribution?
A relative frequency distribution shows the proportion of the total number of
observations associated with each value or class of values and is related to a
probability distribution, which is extensively used in statistics. (From: Statistical
Methods)
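7. Define percentile ranks.
The percentile rank of a score is the percentage of scores in the distribution that
fall at or below that score. Percentile ranks are commonly used to clarify the
interpretation of scores on standardized tests.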
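As a small worked example, a percentile rank can be computed as the percentage of scores at or below a given score. The helper name percentile_rank and the scores are hypothetical, and some textbooks count only scores strictly below, so treat this as one convention among several.

```python
import numpy as np

# Hypothetical test scores.
scores = np.array([55, 60, 65, 70, 70, 80, 85, 90, 95, 100])


def percentile_rank(all_scores, score):
    """Percentage of scores that are less than or equal to `score`."""
    at_or_below = np.count_nonzero(all_scores <= score)
    return 100.0 * at_or_below / len(all_scores)


print(percentile_rank(scores, 80))  # 60.0
```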
8. What is a histogram?
A histogram is a graphical representation of discrete or continuous data. The area of a bar in a
histogram is proportional to the frequency of the class it represents.
9. What is frequency polygon?
A frequency polygon is a line graph of class frequency plotted against class
midpoint. It is closely related to a histogram and is used to compare sets of
data or to display a cumulative frequency distribution.
10. What is interquartile range (IQR)?
Interquartile range is defined as the difference between the upper and lower quartile
values in a set of data. It is commonly referred to as IQR and is used as a measure of
spread and variability in a data set.

PART-B

1. Descriptive statistics and inferential statistics


Descriptive Statistics describes the characteristics of a data set. It is a simple technique to
describe, show and summarize data in a meaningful way. You simply choose a group you’re
interested in, record data about the group, and then use summary statistics and graphs to
describe the group properties. There is no uncertainty involved because you’re just describing
the people or items that you actually measure.
Descriptive statistics involves taking a potentially sizeable number of data points in the
sample data and reducing them to certain meaningful summary values and graphs. The process
allows you to obtain insights and visualize the data rather than simply poring over sets of
raw numbers. With descriptive statistics, you can describe both an entire population and an
individual sample.
What is Inferential Statistics?
Inferential statistics involves drawing conclusions about populations by examining samples. It
allows us to make inferences about the entire set, including specific examples within it, based
on information obtained from a subset of examples. These inferences rely on the principles of
evidence and utilize sample statistics as a basis for drawing broader conclusions.
The accuracy of inferential statistics depends largely on the accuracy of sample data and
how it represents the larger population. This can be effectively done by obtaining a random
sample. Results that are based on non-random samples are usually discarded. Random
sampling, though not always straightforward, is extremely important for carrying out
inferential techniques.

2. Data Qualitative data


Qualitative data is defined as data that approximates and characterizes. Qualitative data
can be observed and recorded.

This data type is non-numerical. This type of data is collected through methods of
observations, one-to-one interviews, conducting focus groups, and similar methods.
Qualitative data in statistics is also known as categorical data – data that can be
arranged categorically based on the attributes and properties of a thing or a
phenomenon.

Importance of qualitative data


Qualitative data is important in determining the particular frequency of traits or
characteristics. It allows the statistician or the researchers to form parameters
through which larger data sets can be observed.

It provides a means by which observers can quantify the world around them. Qualitative data is
about the emotions or perceptions of people and what they feel. Qualitative analysis
is key to getting useful insights from textual data, figuring out its rich context, and
finding subtle patterns and themes.

In qualitative data, these perceptions and emotions are documented. It helps market
researchers understand their consumers’ language and solve the research problem
effectively and efficiently.

Advantages of qualitative data


Some advantages of qualitative data are given below:

It helps in-depth analysis


The data collected provide the qualitative researchers with a detailed analysis, like a
thematic analysis of subject matters. While collecting it, the researchers tend to
probe the participants and can gather ample information by asking the right kind of
questions. The data collected is used to conclude a series of questions and answers.

Understand what customers think


The data helps market researchers understand their customers’ mindsets. Using
qualitative data gives businesses an insight into why a customer purchased a
product. Understanding customer language helps market research infer the data
collected more systematically.

Rich data
Collected data can also be used to conduct future research. Since the questions
asked to collect qualitative data are open-ended questions, respondents are free to
express their opinions, leading to more information.
3. Ranked data Quantitative data
A ranked variable is one that has an ordinal value (i.e. 1st, 2nd, 3rd, etc.). While the
exact value of the variable may not be known, its place relative to the other
variables is. Ranked data is data that has been compared to the other pieces of data
and given a "place" relative to these other pieces of data. For example, to rank the
numbers 7.6, 2.4, 1.5, and 5.9 from least to greatest, 1.5 is first, 2.4 is second, 5.9 is
third, and 7.6 is fourth. The numbers within this data set (7.6, 2.4, 1.5, 5.9) are
ranked data, and the ordinal numbers used to rank them (1st, 2nd, 3rd, 4th) are
ranked variables.
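The ranking in the example above can be reproduced with pandas. This is a minimal sketch; using Series.rank with method="min" is one of several tie-handling choices and makes no difference here because the four values are distinct.

```python
import pandas as pd

# The four numbers from the example above.
data = pd.Series([7.6, 2.4, 1.5, 5.9])

# Rank from least to greatest: the smallest value gets rank 1.
ranks = data.rank(method="min").astype(int)

print(ranks.tolist())  # [4, 2, 1, 3]
```

So 7.6 is fourth, 2.4 is second, 1.5 is first, and 5.9 is third, matching the ordering given in the text.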
Ranked data has many uses, including in:
Sports: Most sports clubs (such as the NFL, FIFA, and bicycle racing) rank their teams
or athletes to determine who goes to the final match.
Politics: There are many world-wide rankings, including education, environment, and
technology.
Search Engines: Search engines give results based on what they think is important and
relevant to a search.
Ranked data is important when you need to know how each piece of data compares to the
others in a set. It is also important for certain statistical measures, such as Spearman's
Rank Correlation Coefficient.
4. Experiment Dependent variable
A dependent variable is what changes as a result of the independent variable
manipulation in experiments. It's what you're interested in measuring, and it
“depends” on your independent variable. In statistics, dependent variables are also
called: Response variables (they respond to a change in another variable)
5. Confounding variable Observational study
In an observational study, confounding occurs when a risk factor for the outcome
also affects the exposure of interest, either directly or indirectly. The resultant bias
can strengthen, weaken, or completely reverse the true exposure-outcome
association.

UNIT III-DESCRIBING RELATIONSHIPS


PART-A

1. What is correlation and its types?

Correlation is a key statistical measure that describes the degree of association
between two variables. There are three basic types of correlation:
positive correlation: the two variables change in the same direction.
negative correlation: the two variables change in opposite directions.
no correlation: there is no association between the changes in the two variables.

2. Define Scatterplots?

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables.

3. What is a correlation coefficient?

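The correlation coefficient measures the relationship between two variables. The
correlation coefficient can never be less than -1 or higher than 1. A value of 1 means
there is a perfect positive linear relationship between the variables, -1 means a
perfect negative linear relationship, and 0 means no linear relationship.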
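The Pearson correlation coefficient can be computed with NumPy's corrcoef. The data below is made up for illustration; the two extra series are built so that the coefficient comes out at the +1 and -1 extremes.

```python
import numpy as np

# Hypothetical data: hours studied and marks obtained.
hours = np.array([1, 2, 3, 4, 5])
marks = np.array([52, 55, 61, 64, 70])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r = np.corrcoef(hours, marks)[0, 1]

# Exactly linear relationships give the extreme values of r.
r_perfect_positive = np.corrcoef(hours, 2 * hours + 1)[0, 1]
r_perfect_negative = np.corrcoef(hours, -3 * hours + 10)[0, 1]

print(round(r, 3), round(r_perfect_positive, 3), round(r_perfect_negative, 3))
```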
4. Define Regression.
A regression is a statistical technique that relates a dependent variable to one or
more independent (explanatory) variables.
5. Write the types of regression analysis.
Linear regression and logistic regression are two types of regression analysis
techniques used in machine learning. Linear regression predicts a continuous
outcome, while logistic regression predicts a categorical (typically binary) outcome.

6. List the types of nonlinear relationship

Quadratic Relationships.
Cubic Relationships.
Exponential Relationships.
Logarithmic Relationships.

7. Compare correlation and regression

Correlation is a statistical measure that determines the association or co-relationship
between two variables. Regression describes how to numerically relate an
independent variable to the dependent variable.

8. What is the interpretation of r²?

R-Squared and Adjusted R-Squared describe how well the linear regression model
fits the data points. The value of R-Squared is always between 0 and 1 (0% to 100%).
A high R-Squared value means that many data points are close to the linear
regression function line.
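R-Squared can be computed directly from its definition, 1 - SS_res / SS_tot, after fitting a straight line. The sketch below uses NumPy's polyfit on made-up data; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: hours studied (x) and marks obtained (y).
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 55, 61, 64, 70], dtype=float)

# Least-squares straight line: y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# R-Squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 2), round(r_squared, 4))
```

For a simple linear regression with one predictor, this value equals the square of the correlation coefficient between x and y.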

9. What is the need for correlation?

Correlation measures the relationship between two variables. A function predicts a
value by converting an input (x) to an output (f(x)); in doing so it relies on the
relationship between the two variables, which is what correlation quantifies.

10. What is decision tree?

A decision tree is a non-parametric supervised learning algorithm, which is utilized
for both classification and regression tasks.

PART-B
1. What is Python? Explain about Python libraries.
Python is one of the most popular programming languages used across various tech
disciplines, especially in data science and machine learning. Python offers an easy-to-
code, object-oriented, high-level language with a broad collection of libraries for a
multitude of use cases.

1. TensorFlow
The first in the list of python libraries for data science is TensorFlow. TensorFlow is a
library for high-performance numerical computations with around 35,000 comments
and a vibrant community of around 1,500 contributors. It’s used across various
scientific fields. TensorFlow is basically a framework for defining and running
computations that involve tensors, which are partially defined computational objects
that eventually produce a value.

Features:
Better computational graph visualizations
Reduces error by 50 to 60 percent in neural machine learning
Parallel computing to execute complex models
Seamless library management backed by Google
Quicker updates and frequent new releases to provide you with the latest features
TensorFlow is particularly useful for the following applications:
Speech and image recognition
Text-based applications
Time-series analysis
Video detection
2. SciPy
SciPy (Scientific Python) is another free and open-source Python library for data
science that is extensively used for high-level computations. SciPy has around 19,000
comments on GitHub and an active community of about 600 contributors. It’s
extensively used for scientific and technical computations, because it extends NumPy
and provides many user-friendly and efficient routines for scientific calculations.

Features:
Collection of algorithms and functions built on the NumPy extension of Python
High-level commands for data manipulation and visualization
Multidimensional image processing with the SciPy ndimage submodule
Includes built-in functions for solving differential equations
Applications:
Multidimensional image operations
Solving differential equations and the Fourier transform
Optimization algorithms
Linear algebra
3. NumPy
NumPy (Numerical Python) is the fundamental package for numerical computation in
Python; it contains a powerful N-dimensional array object. It has around 18,000
comments on GitHub and an active community of 700 contributors. It’s a general-
purpose array-processing package that provides high-performance multidimensional
objects called arrays and tools for working with them. NumPy also addresses the
slowness problem partly by providing these multidimensional arrays as well as
providing functions and operators that operate efficiently on these arrays.

Features:
Provides fast, precompiled functions for numerical routines
Array-oriented computing for better efficiency
Supports an object-oriented approach
Compact and faster computations with vectorization
Applications:
Extensively used in data analysis
Creates powerful N-dimensional array
Forms the base of other libraries, such as SciPy and scikit-learn
Replacement of MATLAB when used with SciPy and matplotlib
4. Pandas
Next in the list of Python libraries is Pandas. Pandas (Python data analysis) is a must
in the data science life cycle. It is the most popular and widely used Python library for
data science, along with NumPy and matplotlib. With around 17,00 comments on
GitHub and an active community of 1,200 contributors, it is heavily used for data
analysis and cleaning. Pandas provides fast, flexible data structures, such as
DataFrames, which are designed to work with structured data very easily and intuitively.

Features:
Eloquent syntax and rich functionalities that give you the freedom to deal with
missing data
Enables you to create your own function and run it across a series of data
High-level abstraction
Contains high-level data structures and manipulation tools
Applications:
General data wrangling and data cleaning
ETL (extract, transform, load) jobs for data transformation and data storage, as it has
excellent support for loading CSV files into its data frame format
Used in a variety of academic and commercial areas, including statistics, finance and
neuroscience
Time-series-specific functionality, such as date range generation, moving window,
linear regression and date shifting.
2. What is data analysis? Why is Python used for data analysis?

Data analysts are responsible for interpreting data and analyzing the results utilizing
statistical techniques and providing ongoing reports. They develop and implement data
analyses, data collection systems, and other strategies that optimize statistical efficiency and
quality. They are also responsible for acquiring data from primary or secondary data sources
and maintaining databases.

Besides, they identify, analyze, and interpret trends or patterns in complex data sets. Data
analysts review computer reports, printouts, and performance indicators to locate and
correct code problems. By doing this, they can filter and clean data.

Data analysts conduct full lifecycle analyses to include requirements, activities, and design,
as well as developing analysis and reporting capabilities. They also monitor performance
and quality control plans to identify improvements.

Finally, they use the results of the above responsibilities and duties to better work with
management to prioritize business and information needs.

One needs only to briefly glance over this list of data-heavy tasks to see that having a tool that can
handle mass quantities of data easily and quickly is an absolute must. Considering the proliferation
of Big Data (and it’s still on the increase), it is important to be able to handle massive amounts of
information, clean it up, and process it for use. Python fits the bill since its simplicity and ease of
performing repetitive tasks means less time needs to be devoted to trying to figure out how the tool
works.
Data Analysis Vs. Data Science
Before wading in too deep on why Python is so essential to data analysis, it’s important first
to establish the relationship between data analysis and data science, since the latter also
tends to benefit greatly from the programming language. In other words, many of the
reasons Python is useful for data science also end up being reasons why it’s suitable for data
analysis.
The two fields have significant overlap, and yet are also quite distinctive, each in its own
right. The main difference between a data analyst and a data scientist is that the former
curates meaningful insights from known data, while the latter deals more with the
hypotheticals, the what-ifs. Data analysts handle the day-to-day, using data to answer
questions presented to them, while data scientists try to predict the future and frame those
predictions in new questions. Or to put it another way, data analysts focus on the here and
now, while data scientists extrapolate what might be.

There are often situations where the lines get blurred between the two specialties, and that’s
why the advantages that Python bestows on data science can potentially be the same ones
enjoyed by data analysis. For instance, both professions require knowledge of software
engineering, competent communication skills, basic math knowledge, and an understanding
of algorithms. Furthermore, both professions require knowledge of programming languages
such as R, SQL, and, of course, Python.
3. Discuss in detail about Pandas in Python with a suitable example.
Pandas is a powerful and versatile library that simplifies tasks of data manipulation in
Python. Pandas is built on top of the NumPy library and is particularly well-suited for
working with tabular data, such as spreadsheets or SQL tables. Its versatility and ease
of use make it an essential tool for data analysts, scientists, and engineers working with
structured data in Python.

What can you do using Pandas?

Pandas is generally used for data science, but have you wondered why? This is because
Pandas is used in conjunction with other libraries that are used for data science. It is
built on top of the NumPy library, which means that a lot of NumPy's structures
are used or replicated in Pandas. The data produced by Pandas is often used as input
for plotting functions of Matplotlib, statistical analysis in SciPy, and machine learning
algorithms in Scikit-learn. Here is a list of things that we can do using Pandas.

Data set cleaning, merging, and joining.
Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.
Columns can be inserted and deleted from DataFrame and higher dimensional objects.
Powerful group by functionality for performing split-apply-combine operations on data
sets.
Data visualization.
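Several of the capabilities listed above can be seen in one short pandas session. The table below (name, dept, marks) is hypothetical, and filling the missing mark with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical student records with one missing mark.
students = pd.DataFrame({
    "name": ["Asha", "Bala", "Chitra", "Dinesh", "Elango"],
    "dept": ["CSE", "CSE", "ECE", "ECE", "CSE"],
    "marks": [82.0, np.nan, 74.0, 90.0, 66.0],
})

# Handling missing data: fill the NaN with the mean of the known marks.
students["marks"] = students["marks"].fillna(students["marks"].mean())

# Inserting a column derived from existing data.
students["passed"] = students["marks"] >= 50

# Deleting a column.
students = students.drop(columns=["name"])

# Group by: split by department, apply mean, combine into one result.
dept_mean = students.groupby("dept")["marks"].mean()

print(students)
print(dept_mean)
```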
4. Discuss about combining and merging data sets in Python.
The Series and DataFrame objects in pandas are powerful tools for exploring and
analyzing data. Part of their power comes from a multifaceted approach to combining
separate datasets. With pandas, you can merge, join, and concatenate your datasets,
allowing you to unify and better understand your data as you analyze it.
merge() for combining data on common columns or indices
.join() for combining data on a key column or an index
concat() for combining DataFrames across rows or columns
pandas merge(): Combining Data on Common Columns or Indices
The first technique that you’ll learn is merge(). You can use merge() anytime you want
functionality similar to a database’s join operations. It’s the most flexible of the three
operations that you’ll learn.

When you want to combine data objects based on one or more keys, similar to what you’d
do in a relational database, merge() is the tool you need. More specifically, merge() is most
useful when you want to combine rows that share data.

You can achieve both many-to-one and many-to-many joins with merge(). In a many-to-
one join, one of your datasets will have many rows in the merge column that repeat the
same values. For example, the values could be 1, 1, 3, 5, and 5. At the same time, the
merge column in the other dataset won’t have repeated values. Take 1, 3, and 5 as an
example.

As you might have guessed, in a many-to-many join, both of your merge columns will
have repeated values. These merges are more complex and result in the Cartesian product
of the joined rows.
This means that, after the merge, you’ll have every combination of rows that share the
same value in the key column. You’ll see this in action in the examples below.

What makes merge() so flexible is the sheer number of options for defining the behavior of
your merge. While the list can seem daunting, with practice you’ll be able to expertly
merge datasets of all kinds.

When you use merge(), you'll provide two required arguments:

The left DataFrame
The right DataFrame
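The three techniques can be sketched on small made-up tables. The key values 1, 1, 3, 5, 5 and 1, 3, 5 mirror the many-to-one example described above; the table and column names (orders, customers, cust_id and so on) are hypothetical.

```python
import pandas as pd

# Many-to-one: cust_id repeats in orders but is unique in customers.
orders = pd.DataFrame({"cust_id": [1, 1, 3, 5, 5],
                       "amount": [250, 120, 90, 300, 45]})
customers = pd.DataFrame({"cust_id": [1, 3, 5],
                          "city": ["Chennai", "Salem", "Erode"]})
many_to_one = pd.merge(orders, customers, on="cust_id")

# Many-to-many: the key repeats on both sides, so matching rows are
# combined in every possible pairing (2 x 2 for key 1, 1 x 1 for key 2).
left = pd.DataFrame({"key": [1, 1, 2], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 1, 2], "right_val": ["x", "y", "z"]})
many_to_many = pd.merge(left, right, on="key")

# .join(): match a key column of the caller against the other frame's index.
joined = orders.join(customers.set_index("cust_id"), on="cust_id")

# concat(): stack DataFrames across rows.
stacked = pd.concat([orders, orders.head(2)], ignore_index=True)

print(len(many_to_one), len(many_to_many), len(joined), len(stacked))  # 5 5 5 7
```

merge() defaults to an inner join, while .join() defaults to a left join; both behaviours can be changed with the how parameter.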
5. Describe about plotting and visualization concepts in Python.
Data visualization is a field in data analysis that deals with visual representation
of data. It graphically plots data and is an effective way to communicate
inferences from data.
Using data visualization, we can get a visual summary of our data. With
pictures, maps and graphs, the human mind has an easier time processing and
understanding any given data. Data visualization plays a significant role in the
representation of both small and large data sets, but it is especially useful when
we have large data sets, in which it is impossible to see all of our data, let alone
process and understand it manually.
Data Visualization in Python
Python offers several plotting libraries, namely Matplotlib, Seaborn and many
other such data visualization packages with different features for creating
informative, customized, and appealing plots to present data in the most simple
and effective way.
Matplotlib and Seaborn
Matplotlib and Seaborn are python libraries that are used for data visualization.
They have inbuilt modules for plotting different graphs. While Matplotlib is
used to embed graphs into applications, Seaborn is primarily used for statistical
graphs.
But when should we use either of the two? Let’s understand this with the help of
a comparative analysis. The table below provides comparison between Python’s
two well-known visualization packages Matplotlib and Seaborn.
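As a small illustration of plotting with Matplotlib, the sketch below draws a line chart from made-up monthly figures. The data and labels are assumptions, and the Agg backend is selected only so that the script runs without a display; in a notebook it can be omitted.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without opening a window
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales (hypothetical data)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

# Render to an in-memory PNG; fig.savefig("sales.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Seaborn builds on Matplotlib, so a Seaborn plot returns Matplotlib axes that can be labelled and saved in the same way.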

UNIT IV – PYTHON LIBRARIES FOR DATA WRANGLING


PART – A
1. What is NumPy?
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra,
Fourier transforms, and matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open
source project and you can use it freely.
NumPy stands for Numerical Python.

Why Use NumPy?


In Python we have lists that serve the purpose of arrays, but they
are slow to process.
NumPy aims to provide an array object that is up to 50x faster
than traditional Python lists.
The array object in NumPy is called ndarray. It provides a lot of
supporting functions that make working with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are very
important.
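The difference between a list and an ndarray can be sketched as follows. This is a minimal illustration with made-up sample values; it shows the element-wise style that NumPy allows, not a speed measurement.

```python
import numpy as np

# Hypothetical sample data: the same numbers as a list and as an ndarray.
values_list = [1, 2, 3, 4]
values_arr = np.array(values_list)

# With a list, element-wise work needs an explicit loop or comprehension.
doubled_list = [v * 2 for v in values_list]

# With an ndarray, the operation is applied to every element at once.
doubled_arr = values_arr * 2

print(type(values_arr).__name__)
print(doubled_list)
print(doubled_arr)
```

Both results hold the same numbers; the ndarray version avoids the Python-level loop.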

2. Write a Python library function to create an array.


array(data_type, value_list) from Python's built-in array module is used to create an array
with the data type and value list specified in its arguments.
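A short sketch of this call, assuming the standard-library array module is meant (the variable name int_arr and the values are illustrative only):

```python
from array import array

# 'i' is the type code for signed integers; the list supplies the initial values.
int_arr = array('i', [1, 2, 3])

# Arrays grow like lists, but every element must match the type code.
int_arr.append(4)

print(int_arr.typecode)
print(int_arr.tolist())
```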

3. What is a DataFrame?


A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and
columns, much like a spreadsheet.

4. How can a pandas DataFrame be constructed?


There are several ways to create a pandas DataFrame. In most cases, you'll use the
DataFrame constructor and provide the data, labels, and other information.
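As a minimal sketch, the constructor can take a dictionary of columns plus optional row labels. The column names, values, and index labels below are invented for illustration.

```python
import pandas as pd

# Data: a dictionary that maps column names to lists of values.
data = {"name": ["Asha", "Ravi", "Meena"], "marks": [82, 74, 91]}

# Labels: the index argument supplies the row labels.
df = pd.DataFrame(data, index=["s1", "s2", "s3"])

print(df)
print(df.shape)
```

A list of dictionaries or a NumPy array can be passed to the same constructor in a similar way.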

5. What are indexes?


Indexes are used to quickly locate data without having to search every row in a database
table every time said table is accessed.

6. How can missing data be handled in Python?


1. Deleting the column with missing data.
2. Deleting the row with missing data.
3. Filling the missing values (imputation).
4. Other imputation methods.
5. Filling with a regression model.
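The first three options in the list above can be sketched with pandas. The DataFrame contents are hypothetical, and dropping only all-empty columns is one possible choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in every column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Chennai", "Madurai", None, "Salem"],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# 1. Delete columns in which every value is missing.
without_empty_cols = df.dropna(axis=1, how="all")

# 2. Delete rows that still contain a missing value.
complete_rows = without_empty_cols.dropna(axis=0)

# 3. Imputation: fill the missing age with the mean of the known ages.
imputed = without_empty_cols.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

print(complete_rows)
print(imputed)
```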

7. How can operations be performed on null values in pandas?
Use the notnull() method, which returns a DataFrame of Boolean values that are False for
NaN values and True otherwise; isnull() returns the opposite mask. Methods such as dropna()
and fillna() then remove or fill the null values.
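A small sketch of checking for null values, using an invented two-column DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

# notnull() marks present values as True and missing values as False.
mask = df.notnull()

# isnull() is the opposite mask; summing it counts missing values per column.
missing_per_column = df.isnull().sum()

print(mask)
print(missing_per_column)
```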

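8. Define hierarchical indexing.


Hierarchical indexing is a method of creating structured group relationships in data.
In pandas it is implemented with a MultiIndex, which lets a Series or DataFrame carry
two or more index levels on one axis.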
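One way to picture this in pandas is a Series with a two-level index. The department names, years, and counts below are made up for the sketch.

```python
import pandas as pd

# Two index levels: department (outer) and year (inner).
index = pd.MultiIndex.from_tuples(
    [("CSE", 2022), ("CSE", 2023), ("ECE", 2022), ("ECE", 2023)],
    names=["dept", "year"],
)
students = pd.Series([120, 130, 90, 95], index=index)

# Selecting on the outer level returns every row in that group.
print(students.loc["CSE"])

# Selecting with a full tuple returns a single value.
print(students.loc[("ECE", 2023)])
```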

9. What is a pivot table?


A pivot table is a statistics tool that summarizes and reorganizes selected columns
and rows of data in a spreadsheet or database table to obtain a desired report.
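In pandas, a summary of this kind can be sketched with pivot_table. The sales figures here are hypothetical.

```python
import pandas as pd

# Hypothetical long-format sales records.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "product": ["Pen", "Book", "Pen", "Book"],
    "amount": [100, 250, 80, 300],
})

# Rows become regions, columns become products, cells hold summed amounts.
pivot = sales.pivot_table(index="region", columns="product",
                          values="amount", aggfunc="sum")

print(pivot)
```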

10. Write python code to create 1D,2D and 3D numpy arrays.


1D array creation :
import numpy as np
one_dimensional_list = [1,2,4]
one_dimensional_arr = np.array(one_dimensional_list)
print("1D array is : ",one_dimensional_arr)

2D array creation :
import numpy as np
two_dimensional_list=[[1,2,3],[4,5,6]]
two_dimensional_arr = np.array(two_dimensional_list)
print("2D array is : ",two_dimensional_arr)

3D array creation :
import numpy as np
three_dimensional_list=[[[1,2,3],[4,5,6],[7,8,9]]]
three_dimensional_arr = np.array(three_dimensional_list)
print("3D array is : ",three_dimensional_arr)

PART-B
11. Explain Big Data: characteristics and applications.
• It refers to a massive amount of data that keeps on growing exponentially with time.
• It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
• It includes data mining, data storage, data analysis, data sharing, and data visualization.
• The term is an all-comprehensive one including data, data frameworks, along with the tools
and techniques used to process and analyze the data.
Types of Big Data
Now that we know what big data is, let's look at the types of big data:
Structured
Structured data is one of the types of big data. By structured data, we
mean data that can be processed, stored, and retrieved in a fixed format.
It refers to highly organized information that can be readily and
seamlessly stored and accessed from a database by simple search engine
algorithms. For instance, the employee table in a company database will
be structured, as the employee details, their job positions, their salaries,
etc., will be present in an organized manner.
Unstructured
Unstructured data refers to data that lacks any specific form or
structure whatsoever. This makes it very difficult and time-consuming to
process and analyze unstructured data. Email is an example of
unstructured data. Structured and unstructured are two important types of
big data.

12. Explain the Spark program flow and the Spark ecosystem.


Apache Spark consists of Spark Core Engine, Spark SQL, Spark Streaming, MLlib,
GraphX and SparkR. You can use Spark Core Engine along with any of the other five
components mentioned above. It is not necessary to use all the Spark components
together. Depending on the use case and application, any one or more of these can be
used along with Spark Core.
Spark Core
Spark SQL
Spark Streaming
MLlib (machine learning library)
GraphX
SparkR
13. Explain the use of Spark SQL commands.
One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read
data from an existing Hive installation. For more on how to configure this feature,
please refer to the Hive Tables section of the Spark SQL documentation. When running
SQL from within another programming language, the results will be returned as a
Dataset/DataFrame.
14. Define machine learning. Explain machine learning with Spark.
Machine learning (ML) is a field of study in artificial intelligence concerned with the
development and study of statistical algorithms that can learn from data and generalize
to unseen data, and thus perform tasks without explicit instructions. In Spark, machine
learning is provided by MLlib, the machine learning library that runs on Spark Core.
15. Explain Structured Streaming with an example.
Structured Streaming is a scalable and fault-tolerant stream processing engine
built on the Spark SQL engine. You can express your streaming computation the
same way you would express a batch computation on static data.

UNIT V – DATA VISUALIZATION

PART-A
1. What is the purpose of Matplotlib?
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python.
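2. Write the dual interface of Matplotlib.
An explicit "Axes" interface that uses methods on a Figure or Axes
object to create other Artists, and build a visualization step by step. This
has also been called an "object-oriented" interface.
An implicit "pyplot" interface that keeps track of the last Figure and Axes
created, and adds Artists to the object it thinks the user wants.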
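The two interfaces can be compared side by side in a short sketch. The data values and titles are illustrative, and the Agg backend is chosen here only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# Implicit "pyplot" interface: functions act on the current Figure and Axes.
plt.figure()
plt.plot(x, y)
plt.title("pyplot interface")
implicit_ax = plt.gca()

# Explicit "Axes" interface: methods are called on the objects themselves.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Axes interface")
```

Both halves draw the same line; the explicit form is usually easier to follow once a figure has more than one Axes.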
3. How to draw a simple line plot using matplotlib?
Install the library: pip install matplotlib
Import the library: import matplotlib. ...
Load the dataset into a pandas DataFrame: df = pd. ...
Extract the date and close price columns: dates = df['Date'] and closing_price = df['Close']
Create a line plot: plt. ...
Plot in red colour: plt. ...
Increase the line width: plt. ...
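A complete version of these steps might look like the sketch below. Because the original dataset is not given, a small DataFrame with hypothetical Date and Close values stands in for it, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the dataset that would normally be loaded from a file.
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=5, freq="D"),
    "Close": [101.0, 102.5, 101.8, 103.2, 104.0],
})

# Extract the date and close price columns.
dates = df["Date"]
closing_price = df["Close"]

# Create the line plot in red with a thicker line.
fig, ax = plt.subplots()
(line,) = ax.plot(dates, closing_price, color="red", linewidth=2)
ax.set_xlabel("Date")
ax.set_ylabel("Close")
```

In an interactive session, plt.show() would display the figure.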
4. Define contour plots.
A contour plot is a graphical technique for representing a 3-dimensional
surface by plotting constant z slices, called contours, on a 2-dimensional
format.
5. What functions can be used to draw contour plots?
In Matplotlib, plt.contour() draws contour lines and plt.contourf() draws filled
contours. The contour plot is used to depict the change in Z values as compared to
X and Y values. If the data (or function) do not form a regular grid, you
typically need to perform a 2-D interpolation to form a regular grid.
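A minimal sketch of a contour plot on a regular grid, using Matplotlib's contour and contourf. The function z = x² + y² and the grid sizes are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

# Build a regular grid of X and Y values with meshgrid.
x = np.linspace(-2, 2, 50)
y = np.linspace(-2, 2, 40)
X, Y = np.meshgrid(x, y)

# Z is the height of the surface at each grid point.
Z = X**2 + Y**2

fig, (ax1, ax2) = plt.subplots(1, 2)
lines = ax1.contour(X, Y, Z, levels=5)     # contour lines only
filled = ax2.contourf(X, Y, Z, levels=5)   # filled contours
fig.colorbar(filled, ax=ax2)
```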
6. What is the purpose of a histogram?
The histogram is a popular graphing tool. It is used to summarize
discrete or continuous data that are measured on an interval scale.
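As a sketch, a histogram of randomly generated sample data can be drawn with Matplotlib's hist. The mean, spread, sample size, and bin count are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical continuous data: 1000 draws from a normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1000)

# hist() groups the values into bins and returns the count in each bin.
fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(data, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("frequency")
```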

6. What is a density plot?


A density plot is a representation of the distribution of a numeric
variable.
7. Mention the significance of subplots.
Subplots are groups of smaller axes that exist together within a single
figure. They make it possible to compare different views of the data side
by side, and they can share axes so that the panels stay on a common scale.
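In Matplotlib, a subplot is one of several axes placed in the same figure. The sketch below draws a 2 x 2 grid; the four curves are arbitrary examples and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure, four axes arranged in a 2 x 2 grid that share the x-axis.
fig, axes = plt.subplots(2, 2, sharex=True)
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("sin")
axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("cos")
axes[1, 0].plot(x, np.sin(2 * x))
axes[1, 0].set_title("sin 2x")
axes[1, 1].plot(x, np.cos(2 * x))
axes[1, 1].set_title("cos 2x")
fig.tight_layout()
```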
8. How can you set different colors for a line plot?
Loop: for column, color in zip(df.columns, colors): ax.plot(df[column], color=color)
Adapt the color cycle: with plt.rc_context({'axes.prop_cycle': plt.cycler(color=['red', 'blue'])}):
fig, ax = plt.subplots(); ax.plot(df)
The axes must be created inside the context so that they pick up the new color cycle.
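Both approaches can be written out in full as follows. The DataFrame and the colour list are hypothetical, and the second figure is created inside the rc_context block because an Axes reads the colour cycle when it is created.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two columns, so two lines.
df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
colors = ["red", "blue"]

# Approach 1: loop over the columns and pass a colour to each plot call.
fig, ax = plt.subplots()
for column, color in zip(df.columns, colors):
    ax.plot(df[column], color=color, label=column)
ax.legend()

# Approach 2: change the colour cycle temporarily, then plot all columns at once.
with plt.rc_context({"axes.prop_cycle": plt.cycler(color=["red", "blue"])}):
    fig2, ax2 = plt.subplots()
    ax2.plot(df)
```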
9. List the applications of lineplot.
They can be used to represent continuous data, which is data that can take
on any value within a range.
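10. Write the syntax of the scatter() method.
Syntax: matplotlib.pyplot.scatter(x_axis_value, y_axis_value, s=None)
x_axis_value - an array containing x-axis data for the scatter plot.
y_axis_value - an array with y-axis data.
s - the size of the marker (can be a scalar or an array of the same size as the
x-axis or y-axis data).
Further optional arguments include c (marker colour) and marker (marker style).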
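A short sketch of a scatter() call with the parameters described above. The data values and marker sizes are made up, and the Agg backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: four points with a different marker size for each.
x_axis_value = np.array([1, 2, 3, 4])
y_axis_value = np.array([10, 20, 15, 30])
sizes = np.array([20, 40, 60, 80])

fig, ax = plt.subplots()
points = ax.scatter(x_axis_value, y_axis_value, s=sizes)
```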

PART-B

1. How do you connect to your data? What are generated values?


The first step to getting started with Tableau Desktop is to connect to the data you want to
explore. There are several types of data you can connect to and several ways to connect to
your data.

When you open Tableau, you are taken to the home page where you can easily select from
previous workbooks, sample workbooks, and saved data sources. You can also connect to
new data sources by selecting Connect to Data. Figure 2.1 displays the screen.

The option In a File is for connecting to locally stored data or file-based data. Tableau
Personal edition can only access Excel, Access, and text files (txt, csv). You can also import
from data sources stored in other workbooks.
Connecting to Desktop Sources

If you click on one of the desktop source options under the In a File list, you will get a
directory window to select the desired file. Once you have chosen your file, you will be
taken to the Connection Options window. There are small differences in the connection
dialog depending on the data source you are connecting to but the menus are self-
explanatory. Figure 2.2 shows the connection window with the Superstore sample
spreadsheet being the file that is being accessed.
2. Explain joining tables with Tableau.
Tableau Joins are a way to extract data from multiple tables in the database. They enable
us to get data from different tables provided that the tables have certain fields in common.
The common field is typically the primary key in one table that acts as a foreign key in another.
Various types of Joins include Inner Join, Left Join, Right Join, and Full Outer Join.
Tableau allows us to perform joins in a very easy manner. It offers a guided approach to
join the two tables, providing a couple of important options. Using this functionality we can
get data from different tables for analysis.
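Tableau performs joins through its interface, but the four join types behave the same way as in pandas, so they can be illustrated with merge(). This is only an analogy in Python, not Tableau functionality; the tables and the customer_id key are hypothetical.

```python
import pandas as pd

# customer_id is the primary key in customers and a foreign key in orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meena"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 4]})

# Inner join: only keys present in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")
# Left join: every customer, with orders where they exist.
left = customers.merge(orders, on="customer_id", how="left")
# Right join: every order, with customer details where they exist.
right = customers.merge(orders, on="customer_id", how="right")
# Full outer join: every key from either table.
full_outer = customers.merge(orders, on="customer_id", how="outer")

print(len(inner), len(left), len(right), len(full_outer))
```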
3. Explain sets, groups and hierarchies.
A Tableau Group is a set of multiple members combined in a single dimension to create a
higher category of the dimension. Tableau allows the grouping of single-dimensional
members and automatically creates a new dimension, adding the word group at the end of the
name. The original dimension of the members is left unchanged.
Sort data
Data present in the visualization and worksheet can be sorted based on the requirement. It can
be sorted by data source order, in ascending or descending order, or by any measured
value.

The procedure for sorting is given as follows.

Step 1) Go to a Worksheet and drag a dimension and measure as given in the image.
Step 2)
Right click on Category.
Select ‘Sort’ option.
Sort Order:

Ascending: It sorts the order of selected dimension in ascending order.


Descending: It sorts the order of selected dimension in descending order.
4. How do you use the calculation dialog box?

Use the Formula Builder to create calculated fields and measures and configure their
summary functions to generate the data that you want.

Procedure

1. Click Create > Ad Hoc View


2. Under Select Data, select a domain and click Choose Data.
3. Select a domain or domains, and double-click, drag, or use the direction buttons to move it
under Selected Fields.
4. Click OK after you select the specific domain or domains you want.
5. Hover over the drop-down icon, and click Create Calculated Field or Create Calculated
Measure next to either the Fields pane or the Measures pane of the Data Source
Selection window.
6. Work with the Formula Builder and Summary Calculation tabs.
• Creating calculated field and measure formulas in the Formula Builder tab.

Use the following fields in the Formula Builder tab to create the formula for
your calculated field and measures:
Formula field box

a. Edit the formula for calculating your fields and measures by typing directly into
the Formula field box.
b. Consider the following when you write your formula:
• Formulas must use the following syntax:
  • Labels for fields and measures must be delineated in double quotation
    marks ("). For example, "Customer ID", "Date ordered".
  • Surround text with single quotation marks ('). For example, '--'.
  • Use single quotation marks (') for levels. For
    example, 'ColumnGroup' and 'Total'. For more information about
    levels, see Aggregate functions.
• The following words are reserved and cannot be used as field names unless
  they are contained as part of a phrase such as Not Available:
  • And
  • In
  • Not
  • Or
• Add fields, measures, and functions to your formula by double-clicking
  them.
• Click the buttons below the Formula field to add operators.

5. What are the table calculation functions?

Number Functions.
String Functions.
Date Functions.
Type Conversion.
Logical Functions.
Aggregate Functions.
Pass-Through Functions (RAWSQL)
User Functions.
7. How do you use maps to improve insights?
8. How do you provide self-service ad hoc analysis with parameters?
9. How do you edit views in Tableau Server?
