Week 3 Q&A (1)

Week 3

Question 1:
What are the different types of plots that can be created using the seaborn library, and what
is the significance of each plot type in data visualisation? What features does the library
offer?
Answer:
The types of plots that can be created using the seaborn library include the scatter plot,
histogram, bar plot, and box-and-whisker plot.
Each plot type serves a different purpose:
- Scatter plots show the relationship between two variables.
- Histograms display the distribution of a single variable.
- Bar plots compare quantities across categories.
- Box-and-whisker plots summarize the distribution of a dataset and highlight outliers.
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

Question 2:
What is a .csv file, and how is it different from a .txt file?
Answer:
.csv file:
o Stands for Comma-Separated Values.
o Stores tabular data, where each row is a record, and columns are separated by
commas.
o Commonly used for data exchange between applications and can be opened in
tools like Microsoft Excel or Notepad.
.txt file:
o A plain text format storing unstructured or structured data.
o May use delimiters (e.g., tab, space, comma) to separate values but is not
standardized like .csv.
o Requires additional parsing for structured data analysis.
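
The difference can be sketched with Python's standard library alone. The two text snippets below are hypothetical file contents held in strings, so nothing needs to exist on disk; the " has " separator in the second snippet is an assumption about how that particular text happens to be laid out.

```python
import csv
import io

# Hypothetical .csv content: a header row, then comma-separated records
csv_text = "Class,Students\nClass A,30\nClass B,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
# Each row is already split into named fields by the csv module
print(rows[0]["Class"], rows[0]["Students"])

# Hypothetical .txt content: free-form lines with no agreed delimiter
txt_text = "Class A has 30 students\nClass B has 25 students\n"
parsed = []
for line in txt_text.splitlines():
    # Custom parsing is needed because the layout is specific to this text
    name, rest = line.split(" has ")
    parsed.append((name, int(rest.split()[0])))
print(parsed)
```

The csv module handles the splitting for the first snippet, while the second needs parsing logic written for its specific layout.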

Question 3:
(A) Define a bar plot and explain when it is used.
(B) Using the following data, write Python code to generate a bar plot for this data.
Class      No. of Students
Class A    30
Class B    25
Class C    35
Class D    20

Answer:
A bar plot is a graphical representation that uses rectangular bars to display data. The
lengths of the bars are proportional to the values or frequencies they represent. It is
commonly used for categorical data.
When to use a bar plot: to compare quantities across categories, such as the number of
students in each class.

Python Code-
import matplotlib.pyplot as plt

# Categories (x-axis) and their values (bar heights)
classes = ['Class A', 'Class B', 'Class C', 'Class D']
students = [30, 25, 35, 20]

# Draw one bar per class
plt.bar(classes, students, color='skyblue', edgecolor='black')
plt.title('Number of Students in Each Class', fontsize=14)
plt.xlabel('Classes', fontsize=12)
plt.ylabel('Number of Students', fontsize=12)
plt.show()

Question 4:
How can missing values be identified and handled in a Pandas DataFrame? Discuss various
approaches for dealing with numerical and categorical missing values.
Answer:
Missing values in a dataset can significantly impact analysis and results. In Pandas, missing
values are represented as NaN (Not a Number).
To identify these values, functions like isnull() or isna() can be used, which return a DataFrame
of Boolean values indicating True for missing entries.
The total count of missing values in each column can be found using DataFrame.isnull().sum().
Once identified, there are multiple ways to handle missing values. One straightforward
approach is dropping rows or columns containing missing values using the dropna() method.
This is particularly useful when the proportion of missing data is small.
For numerical variables, missing values can be imputed with the mean for symmetric
distributions or the median for skewed distributions.
For categorical variables, missing values can be imputed with the mode, which represents the
most frequently occurring value.
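
The approaches above can be sketched on a small DataFrame. The column names and values are invented for illustration, and fillna() is used here as one common way to perform the imputation; the answer itself does not name a specific method.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "Price": [13500, 13750, np.nan, 14950, 13950],
    "Age": [23, np.nan, 24, 26, 30],
    "FuelType": ["Diesel", "Petrol", None, "Petrol", "Petrol"],
})

# Identify: count the missing entries in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: impute instead of dropping
filled = df.copy()
# Numerical column, assumed roughly symmetric here: fill with the mean
filled["Price"] = filled["Price"].fillna(filled["Price"].mean())
# Numerical column, assumed skewed here: fill with the median
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: fill with the mode (most frequent value)
filled["FuelType"] = filled["FuelType"].fillna(filled["FuelType"].mode()[0])
print(filled)
```

Dropping keeps only complete rows, while imputation keeps every row at the cost of introducing estimated values.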

Question 5.
Describe how scatter plots are used in data visualization.
Answer:
Scatter plots show the relationship between two variables using
points on a graph.
The x-axis represents one variable, and the y-axis represents the other.
... age increases.

Question 6:
Describe the role of loops in Python programming and differentiate between for and while
loops.
Answer:
Loops in Python, such as "for" and "while" loops, are essential for executing commands
repeatedly based on certain conditions, making the code more concise, efficient, and
easier to maintain.
Loops are crucial for implementing many algorithms, such as searching through data,
sorting items, or filtering values. They allow you to process each piece of data step by
step.
The two most common types of loops in Python are for loops and while loops.
A "for" loop iterates over a sequence, executing the specified commands for each item.
A while loop in Python repeatedly executes a block of code as long as the given
condition is True and is used when the number of iterations is not known in advance.
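
A minimal sketch of the contrast, using made-up numbers: the for loop visits each item of a known sequence, while the while loop keeps running until its condition becomes false, so the number of iterations is only known once it finishes.

```python
# for loop: iterate over a sequence whose length is known up front
students = [30, 25, 35, 20]
total = 0
for count in students:
    total += count
print(total)

# while loop: repeat until a condition stops holding
value = 100.0
halvings = 0
while value > 1:
    value /= 2       # halve the value on every pass
    halvings += 1    # count how many passes were needed
print(halvings)
```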

Question 7:
Explain the concepts of joint probability, marginal probability, and conditional probability.
How does pandas compute these probabilities using the crosstab() function?
Answer:
Joint Probability:
It refers to the likelihood of two events occurring together. In pandas, it is calculated
using the pandas.crosstab() function, which creates a cross-tabulation showing the frequency
of combined occurrences of two or more factors. Passing normalize=True divides each count by
the grand total, turning the frequencies into joint probabilities.
E.g. the probability of a car being both Diesel-fueled and having a Manual gearbox:
pd.crosstab(index=carsdata2['Automatic'], columns=carsdata2['FuelType'],
            normalize=True, dropna=True)
Marginal Probability:
This is the probability of a single event occurring without considering other events. It
can be computed by summing the rows or columns in the cross-tabulation generated by
pandas.crosstab(); margins=True adds these sums as an "All" row and column.
E.g. the probability of a car having a Manual gearbox, whatever its fuel type (CNG, Petrol,
or Diesel):
pd.crosstab(index=carsdata2['Automatic'], columns=carsdata2['FuelType'],
            margins=True, normalize=True, dropna=True)
Conditional Probability:
This represents the probability of an event (A) occurring given that another event (B)
has already occurred. In pandas, conditional probabilities are calculated by normalizing
the rows (normalize='index') or columns (normalize='columns') of the cross-tabulation so
that each sums to 1, allowing for analysis of dependencies between events.
E.g. given the type of gearbox, the probability of each fuel type:
pd.crosstab(index=carsdata2['Automatic'], columns=carsdata2['FuelType'],
            margins=True, normalize='index', dropna=True)
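
To make the three quantities concrete, here is a small worked sketch. The eight-row DataFrame is invented for illustration and merely stands in for the course dataset; it assumes an Automatic column coded 0 (manual) or 1 (automatic) and a FuelType column.

```python
import pandas as pd

# Hypothetical stand-in for the cars data: 0 = manual gearbox, 1 = automatic
cars = pd.DataFrame({
    "Automatic": [0, 0, 0, 1, 0, 1, 0, 0],
    "FuelType": ["Petrol", "Diesel", "Petrol", "Petrol",
                 "CNG", "Diesel", "Petrol", "Diesel"],
})

# Joint probability: every cell count divided by the grand total (8 rows)
joint = pd.crosstab(index=cars["Automatic"], columns=cars["FuelType"],
                    normalize=True)

# Marginal probability: the "All" row and column hold the row/column sums
marginal = pd.crosstab(index=cars["Automatic"], columns=cars["FuelType"],
                       margins=True, normalize=True)

# Conditional probability: each row is rescaled to sum to 1, giving
# P(FuelType | Automatic)
conditional = pd.crosstab(index=cars["Automatic"], columns=cars["FuelType"],
                          normalize="index")

print(joint.loc[0, "Petrol"])        # P(manual and Petrol) = 3/8
print(marginal.loc[0, "All"])        # P(manual) = 6/8
print(conditional.loc[0, "Petrol"])  # P(Petrol | manual) = 3/6
```

The conditional value equals the joint value divided by the marginal one (0.375 / 0.75 = 0.5), which matches the definition P(A | B) = P(A and B) / P(B).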

Question 8:
Explain the difference between a Series and a DataFrame in Pandas, along with their use cases.
Answer:
In Pandas, Series and DataFrame are the two primary data structures that provide efficient
ways to manage and manipulate data. Each serves different purposes and has distinct
characteristics.
A Series is a one-dimensional labeled array capable of holding any data type, such as integers,
floats, or strings. It is essentially a single column of data, with an associated index that allows
for efficient data access and manipulation. A Series can be created using the pd.Series()
function and is particularly useful for storing time-series data or any indexed list of values.
For example:
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
In contrast, a DataFrame is a two-dimensional, tabular data structure with labeled rows and
columns. It can be thought of as a collection of Series objects that share the same index.
DataFrames are more versatile than Series because they allow for multiple columns, each
potentially containing different data types. This makes them ideal for structured datasets, such
as those commonly used in spreadsheets or databases.
For example:
df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]})
The key difference between the two is dimensionality: a Series is one-dimensional, while a
DataFrame is two-dimensional. While a Series is suited for simple, linear data, a DataFrame is
preferred for more complex datasets requiring multiple attributes and relationships. Both
structures provide a robust foundation for data analysis and manipulation tasks in Python.

Question 9:
Explain the importance of using scatter plots in data visualization, particularly in
understanding the relationship between numerical variables.
Answer:
Scatter plots are one of the most commonly used tools in data visualization for analyzing
relationships between two numerical variables. They help provide a visual representation of
how one variable influences or correlates with another.
1. Role of Scatter Plots in Identifying Trends, Patterns, and Outliers:
Scatter plots allow us to:
- Identify trends: They show whether a relationship exists between two variables and the
direction of the relationship.
- Understand patterns: Patterns in data can indicate linear or nonlinear relationships.
- Spot outliers: Scatter plots help in identifying outliers, that is, data points that deviate
significantly from the overall trend.
2. Importance of Axis Selection:
When creating a scatter plot, the choice of what to place on the x-axis and y-axis
matters:
- The x-axis (horizontal) typically represents the independent variable, the
factor believed to influence the other.
- The y-axis (vertical) represents the dependent variable, which is being analyzed for
variation based on the independent variable.

Question 10:
Explain the differences between shallow copy and deep copy in Python. Provide an example
for each.
Answer:
Shallow Copy: A shallow copy creates a new top-level object but fills it with references to
the same nested objects as the original. Changes made to a nested mutable object (such as an
inner list) through the copy therefore show up in the original, while replacing a top-level
element of the copy does not affect the original. Shallow copying is faster because it does
not recursively copy all objects, just the top-level container.
Example:
import copy
original_list = [1, [2, 3], 4]
shallow_copy = copy.copy(original_list)
shallow_copy[1][0] = 99  # modifies the inner list shared by both objects
print(original_list)  # [1, [99, 3], 4]
print(shallow_copy)   # [1, [99, 3], 4]
Deep Copy: A deep copy creates a completely new object, independent of the original.
Changes made to the copied object do not affect the original. Deep copying is slower because
it traverses and copies every nested object and requires more memory since it duplicates the
entire data structure.
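Example:
import copy
original_list = [1, [2, 3], 4]
deep_copy = copy.deepcopy(original_list)
deep_copy[1][0] = 42  # modifies only the copy's own inner list
print(original_list)  # [1, [2, 3], 4]
print(deep_copy)      # [1, [42, 3], 4]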

Question 11:
What is data visualization, and what are its key types? Explain its importance in data analysis.
Answer:
Data visualization is the process of representing data graphically to make it easier to
understand and interpret. It plays a crucial role in analyzing data by simplifying complex
information, uncovering patterns, and supporting decision-making.
Key Types of Data Visualizations:-
Charts:- Used for comparisons, trends, and distributions. Examples: Bar chart, Line chart,
Pie chart.
Graphs:- Useful for identifying correlations, distributions, and patterns. Examples:
Scatter plot, Histogram.
Maps:- Ideal for spatial data and location-based insights. Examples: Heatmaps,
Geographical maps.
Importance in Data Analysis:-
▪ Simplifies large datasets for easier interpretation.
▪ Identifies trends, correlations, and outliers.
▪ Makes presentations more engaging and impactful.
▪ Facilitates data-driven decision-making.

Question 12:
What are the steps to define a function in Python? Illustrate with an example.
Answer:
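Steps to define a function in Python:
- Use the def keyword followed by the function name.
- List any parameters inside parentheses, followed by a colon.
- Write the function body as an indented block.
- Use a return statement to output a value if needed.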

Example:
def add_numbers(a, b):
    # Return the sum of the two arguments
    return a + b

result = add_numbers(5, 7)
print(result)
Output:
12

Question 13:
What are the advantages of using the Seaborn library for data visualization?
Answer:
1. Provides a high-level interface for creating visually appealing statistical graphics.
2. Includes built-in themes and color palettes to enhance plot aesthetics.
3. Simplifies complex visualizations like box plots, violin plots, and pairwise plots.
4. Supports integration with Pandas, making it easier to plot data stored in DataFrames

