Data Visualization

1. Explain the principles of effective data visualization.

Answer:

Clarity: Eliminate chartjunk; every element should serve a purpose.

Accuracy: Scale axes appropriately; don’t distort data.

Efficiency: Convey the maximum information with minimal ink (Tufte’s “data-ink ratio”).

Consistency: Use consistent color palettes, fonts, and markers.

Perceptual Considerations: Leverage preattentive attributes (color hue, position, length) to
highlight key patterns.

Accessibility: Ensure charts are legible (font size, contrast) and colorblind-friendly.

2. Compare and contrast bar charts, histograms, and boxplots. When would you use each?

Answer:

Bar Chart: Categorical data → compare discrete categories.

Histogram: Continuous data → show distribution via bins.

Boxplot: Continuous data → summarize distribution (median, quartiles, outliers).


Use bar charts for counts by group, histograms when you care about distribution shape, and
boxplots to compare distributions across categories.

3. Given a time series of daily temperatures for one year, propose a visualization dashboard
layout.

Answer:

1. Line Chart of daily mean temperature (primary view).

2. Heatmap Calendar: Dates on axes, color intensity = temperature, to spot seasonal patterns.

3. Boxplots by Month to compare monthly spreads.


4. Interactive Slider (if web-based) to zoom into specific weeks.

5. Summary Statistics Panel showing annual mean, variance, and extreme days.

4. What is a “Small Multiple”? Give an example.

Answer:
A Small Multiple is an array of similar charts using the same scale and axes, each representing
a different subset.
Example: A grid of line charts showing monthly sales trends for each of 12 products, all with
identical axes so you can compare seasonal patterns at a glance.
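
The grid described in the example can be sketched with Matplotlib as follows. The sales figures are synthetic and the 3 x 4 layout is an assumption made for illustration; the key detail is `sharex`/`sharey`, which gives every panel identical axes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
months = np.arange(1, 13)
# Hypothetical data: 12 products x 12 months of synthetic sales
sales = 100 + 20 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 5, size=(12, 12))

# sharex/sharey force identical axes, which makes the panels directly comparable
fig, axes = plt.subplots(3, 4, sharex=True, sharey=True, figsize=(10, 6))
for i, ax in enumerate(axes.flat):
    ax.plot(months, sales[i])
    ax.set_title(f"Product {i + 1}", fontsize=9)
for ax in axes[-1]:
    ax.set_xlabel("Month")
for ax in axes[:, 0]:
    ax.set_ylabel("Sales")
fig.tight_layout()
# Call plt.show() to display the grid interactively
```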

5. Describe how you would visualize a high-dimensional dataset (e.g., 50 features).

Answer:

Dimensionality Reduction: Apply PCA or t-SNE to project data into 2D/3D, then scatter-plot.

Parallel Coordinates Plot: Lines represent observations across axes for each feature.

Heatmap of Feature Correlations: Identify collinearities.

Glyphs/Star Plots for small subsets: Each observation drawn as a star of 50 spokes (requires
careful design).
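
As a rough illustration of the dimensionality-reduction option, the sketch below performs PCA by hand with NumPy's SVD and scatter-plots the first two components. The 200 x 50 random matrix is a stand-in for a real dataset; in practice a library implementation such as scikit-learn's PCA would usually be used instead.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))    # hypothetical data: 200 observations, 50 features

# PCA by hand: center the columns, then take the SVD of the centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                 # coordinates on the first two principal components
explained = (S ** 2) / (S ** 2).sum()  # share of total variance per component

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=12, alpha=0.7)
ax.set_xlabel(f"PC1 ({explained[0]:.1%} of variance)")
ax.set_ylabel(f"PC2 ({explained[1]:.1%} of variance)")
ax.set_title("50 features projected onto 2 principal components")
```

The variance shares in the axis labels are worth keeping: if the first two components explain little, the 2D picture may hide most of the structure.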

6. Write a matplotlib snippet to plot a dual-axis chart: sales vs. profit over months.

import matplotlib.pyplot as plt

months = ['Jan','Feb','Mar','Apr','May']
sales = [120, 150, 170, 160, 180]
profit = [30, 45, 55, 50, 60]

fig, ax1 = plt.subplots()


ax1.plot(months, sales, 'b-o', label='Sales')
ax1.set_xlabel('Month')
ax1.set_ylabel('Sales (k$)', color='b')
ax1.tick_params(axis='y', labelcolor='b')

ax2 = ax1.twinx()
ax2.bar(months, profit, alpha=0.3, color='g', label='Profit')
ax2.set_ylabel('Profit (k$)', color='g')
ax2.tick_params(axis='y', labelcolor='g')

ax1.set_title('Monthly Sales and Profit')  # set the title first so tight_layout leaves room for it
fig.tight_layout()
plt.show()

7. What are the pros and cons of using color to encode quantitative vs. categorical data?

Answer:

Quantitative → Sequential or diverging palettes:

Pros: Show gradations.

Cons: Hard to perceive exact values, risk of color-blind issues.

Categorical → Qualitative palettes:

Pros: Distinct hues for categories.

Cons: Limited distinct colors before confusion; ordering meaningless.

8. Critique the following visualization choice: pie chart for showing year-over-year revenue
change across 12 months.

Answer:

Problems:

Pie charts encode part-to-whole; months aren’t parts of a whole.

Hard to compare small slices; human perception poor at angle judgments.

Better Alternative: Line chart for trend, or bar chart to compare monthly revenue.

9. How would you visualize geospatial numeric data (e.g., temperature readings at weather
stations)?

Answer:

Choropleth Map: Regions colored by value if data aggregated by region.


Scatter Map: Colored/scaled points at station locations.

Heatmap Overlay: Kernel density smoothing across grid.

Interactive Sliders to animate changes over time.

10. Explain the concept of “brushing and linking” in interactive visualization.

Answer:

Brushing: Selecting a subset of data points in one view (e.g., lasso-select on a scatter plot).

Linking: Automatically highlighting those same data points in other coordinated views (e.g., bar
chart, map), enabling multi-view exploration.

11. Given a dataset with outliers, which visualization helps you spot them easily?

Answer:

Boxplot: Explicitly marks points beyond the whiskers as outliers.

Scatter Plot: With axes zoomed properly, outliers stand out visually.

Violin Plot: Shows distribution shape; outliers appear as long, thin tails (individual points are visible only if overlaid, e.g., with a strip plot).
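
The boxplot's notion of an outlier can be reproduced numerically. The sketch below assumes the conventional 1.5 x IQR rule (also the default whisker setting in Matplotlib's `boxplot`); the function name `iqr_outliers` and the sample readings are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

data = [10, 12, 11, 13, 12, 11, 10, 95]   # hypothetical readings; 95 is the odd one out
flagged = iqr_outliers(data)
print(flagged)   # [95.]
```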

12. Describe how you’d encode three variables (e.g., x, y, and category) in one 2D scatter plot.

Answer:

X-axis & Y-axis for two variables.

Color to encode category (categorical variable).

Marker size (optional) for a fourth variable (e.g., magnitude).

Shape for an additional categorical distinction, if needed.
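
A minimal sketch of this encoding, using synthetic data and hypothetical category labels "A", "B", and "C". Color and marker shape both encode the category here, which is redundant on purpose: the groups stay distinguishable in grayscale.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 90
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)
category = rng.choice(["A", "B", "C"], size=n)   # hypothetical labels

fig, ax = plt.subplots()
# One scatter call per category gives each group its own color and legend entry
for label, marker in zip(["A", "B", "C"], ["o", "s", "^"]):
    mask = category == label
    ax.scatter(x[mask], y[mask], marker=marker, label=label, alpha=0.8)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(title="Category")
```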

13. What is a heatmap, and when is it most appropriate?

Answer:
A heatmap displays matrix-like data as a grid of colored tiles, where color intensity encodes
magnitude. Best for visualizing correlation matrices, confusion matrices, or any 2D binned
frequency distribution.

14. How can you assess whether your chosen visualization is well understood by users?

Answer:

User Testing: Ask domain experts or peers to interpret the chart.

Task-based Evaluation: Give users questions (e.g., “Which month had highest revenue?”) and
measure accuracy/time.

Eye-Tracking Studies: See where attention focuses.

Surveys & Feedback: Collect subjective usability and clarity ratings.

15. Propose a visualization to compare distributions of a numeric variable across multiple
groups.

Answer:

Side-by-Side Boxplots: One box per group.

Violin Plots: Show kernel density per group.

Ridgeline Plot: Overlapping density curves for each group.

Jittered Strip Plots + Violin/Box Overlay: Show all data points plus summary distribution.

1. What are the main differences between exploratory and explanatory data visualizations?

Answer:

Exploratory: Used during analysis; helps discover patterns or anomalies. Often interactive and
complex.

Explanatory: Final-stage visualization for communicating specific insights to others. Clean,
focused, and annotated.
Example: Exploratory = scatter matrix; Explanatory = line chart with annotated peak.

2. How can you visualize uncertainty in data?

Answer:

Error Bars: Represent standard deviation/confidence intervals.


Shaded Confidence Bands: Around lines in time series.

Violin Plots: Show distribution shape.

Bootstrapped Plots: Multiple samples for visual variability.

Gradient Transparency or Dots: To reflect density or likelihood.
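
The shaded-band option might look like the sketch below. The 20 synthetic runs and the 1.96 multiplier (an approximate 95% interval under a normal assumption) are illustrative choices, not a recommendation for any particular dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
t = np.arange(30)
# Hypothetical repeated measurements: 20 noisy runs of the same trend
runs = 0.5 * t + rng.normal(scale=2.0, size=(20, t.size))

mean = runs.mean(axis=0)
sem = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])   # standard error of the mean
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem       # approx. 95% interval

fig, ax = plt.subplots()
ax.plot(t, mean, label="Mean")
ax.fill_between(t, lower, upper, alpha=0.3, label="Approx. 95% CI")
ax.set_xlabel("Time step")
ax.set_ylabel("Value")
ax.legend()
```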

3. Explain how to use the “Faceting” technique in plotting libraries.

Answer:
Faceting creates a grid of plots for subgroups of data.
In Seaborn:

import seaborn as sns


sns.relplot(data=df, x='time', y='value', col='region', kind='line')

Useful for comparing trends across multiple categories.

4. Compare logarithmic vs. linear scale visualizations. When is log scale appropriate?

Answer:

Linear: Equal spacing = equal data increments.

Logarithmic: Equal spacing = equal ratios (e.g., 10x).

Use log scale when data spans several orders of magnitude (e.g., income, population,
frequencies).
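
To see the difference, the sketch below plots the same synthetic series on both scales. The values are constructed so that each step multiplies the previous one by a constant factor, which shows up as a straight line on the log axis.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = 10.0 ** (x / 2)             # hypothetical values spanning several orders of magnitude

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_lin.plot(x, y, marker="o")
ax_lin.set_title("Linear scale")

ax_log.plot(x, y, marker="o")
ax_log.set_yscale("log")        # equal vertical distances now mean equal ratios
ax_log.set_title("Log scale")
```

On the linear panel the early points are squashed against the axis; on the log panel every point is readable.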

5. What are the disadvantages of 3D plots in data visualization?

Answer:

Perception Issues: Depth and overlap make it hard to interpret.

Occlusion: Objects may hide others.

Navigation Complexity: Rotating/zooming adds cognitive load.

Best practice: Use 2D + interactivity or faceting over static 3D charts.

6. Create a violin plot in Python using Seaborn. What does it convey?


import seaborn as sns
sns.violinplot(x='group', y='score', data=df)

Answer:
A violin plot combines boxplot summary with a kernel density estimate, showing data distribution
shape (skewness, multimodality) along with median and quartiles.

7. Explain how Sankey diagrams are used and give one application.

Answer:
Sankey diagrams show flows (quantity movement) between stages/categories.

Width of flows = quantity

Use case: Energy flow, website user journey, financial transfers

8. How can interactivity improve data visualization in dashboards?

Answer:

Enables filtering, zooming, tooltips, and brushing.

Helps focus on relevant data (e.g., time range).

Makes exploratory analysis intuitive.

Enhances user engagement and control.

9. List techniques for reducing clutter in dense plots.

Answer:

Jittering: Slight random displacement (esp. for scatter plots).

Transparency (alpha): Helps view overlaps.

Hexbin/Binning: Aggregate dense data regions.

Sampling: Plot representative subset.

Subplots (faceting) or dimension reduction (PCA, t-SNE).
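
Two of these techniques, transparency and hexagonal binning, are compared side by side in the sketch below. The 20,000-point cloud is synthetic, and the `alpha` and `gridsize` values are arbitrary starting points that would need tuning on real data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=20_000)                  # hypothetical dense point cloud
y = x + rng.normal(scale=0.8, size=x.size)

fig, (ax_a, ax_h) = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
ax_a.scatter(x, y, s=4, alpha=0.05)          # overlaps build up into darker regions
ax_a.set_title("Transparency (alpha=0.05)")

hb = ax_h.hexbin(x, y, gridsize=40, cmap="viridis")   # counts per hexagonal bin
fig.colorbar(hb, ax=ax_h, label="Points per bin")
ax_h.set_title("Hexbin aggregation")
```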

10. What is a correlation heatmap and how do you interpret it?


Answer:

Displays pairwise correlation coefficients (Pearson/Spearman) between variables using a
color-coded matrix.

Color intensity/direction shows strength and sign of correlation.

Useful for feature selection in modeling.
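
A sketch of such a heatmap built with NumPy and Matplotlib only. The four synthetic features are constructed so that the expected pattern is known in advance: `b` follows `a`, `c` moves against `a`, and `d` is unrelated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Hypothetical features: b tracks a, c moves against a, d is unrelated
data = np.column_stack([a,
                        a + rng.normal(scale=0.3, size=500),
                        -a + rng.normal(scale=0.3, size=500),
                        rng.normal(size=500)])
names = ["a", "b", "c", "d"]

corr = np.corrcoef(data, rowvar=False)   # Pearson correlations; columns are variables

fig, ax = plt.subplots()
# A diverging colormap centered on 0 keeps both sign and strength readable
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(list(range(len(names))))
ax.set_xticklabels(names)
ax.set_yticks(list(range(len(names))))
ax.set_yticklabels(names)
for i in range(len(names)):
    for j in range(len(names)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="Pearson r")
```

Fixing `vmin=-1` and `vmax=1` matters: without it the color scale stretches to the observed range and weak correlations can look strong.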

11. Differentiate between lollipop chart and bar chart.

Answer:

Bar Chart: Solid filled bars → standard.

Lollipop: Thin line + dot → cleaner, emphasizes endpoint.

Lollipop reduces ink usage and makes rankings more readable when values are close.
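
Matplotlib has no dedicated lollipop function, so one common approach (sketched here with made-up scores) is to draw thin stems with `hlines` and put a marker at each endpoint.

```python
import matplotlib.pyplot as plt

# Hypothetical scores with close values
labels = ["A", "B", "C", "D", "E"]
values = [82, 79, 85, 80, 84]

# Sort ascending so the largest value ends up at the top of the chart
pairs = sorted(zip(values, labels))
sorted_values = [v for v, _ in pairs]
sorted_labels = [name for _, name in pairs]
positions = list(range(len(pairs)))

fig, ax = plt.subplots()
ax.hlines(y=positions, xmin=0, xmax=sorted_values, linewidth=1)  # thin stems
ax.plot(sorted_values, positions, "o")                           # dots at the endpoints
ax.set_yticks(positions)
ax.set_yticklabels(sorted_labels)
ax.set_xlabel("Score")
```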

12. How does data granularity affect visualization choice?

Answer:

High granularity (e.g., second-level timestamps): Leads to clutter → Use aggregation.

Low granularity (monthly or yearly): Can use standard bar/line plots.


Choose visualizations that aggregate or highlight trends instead of noise.
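
As a small illustration of aggregating high-granularity data before plotting, the sketch below averages one hour of synthetic per-second readings into per-minute means. The reshape trick assumes evenly spaced samples with no gaps; real timestamps would more likely be resampled with a time-series library.

```python
import numpy as np

rng = np.random.default_rng(5)
per_second = 20 + rng.normal(scale=2.0, size=3600)   # hypothetical: one hour of 1 Hz readings

# Reshape into (60 minutes, 60 seconds) and average each row
per_minute = per_second.reshape(60, 60).mean(axis=1)

print(per_second.size, "points reduced to", per_minute.size)
```

Plotting 60 smoothed points instead of 3,600 noisy ones makes the trend visible while keeping the overall mean unchanged.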

13. Discuss ethical pitfalls in data visualization.

Answer:

Misleading Axes: Truncating y-axis to exaggerate effects.

Cherry-picking Data: Excluding inconvenient values.

Inappropriate Aggregation: Can hide outliers or misrepresent trends.

Color Bias: Using red/green without considering colorblind users.

Ethical visualizations maintain integrity, transparency, and fairness.

14. Propose a method to visualize real-time sensor data streams.


Answer:

Live Line Chart (rolling window): Plot latest N values.

Use a dashboard framework that supports live updates (e.g., Plotly Dash with periodic callbacks, or a Bokeh server, which pushes updates over WebSockets).

Add threshold lines, alerts, and pause/resume controls.

Consider aggregating over intervals (e.g., 5-second average) for stability.
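
The rolling-window idea is independent of the plotting library. Below is a minimal sketch using only the standard library; the class name `RollingWindow` and the sample readings are hypothetical, and a real dashboard would redraw its line chart from `values` on each update.

```python
from collections import deque

class RollingWindow:
    """Keep only the latest `size` readings; older ones fall off automatically."""

    def __init__(self, size):
        self.values = deque(maxlen=size)

    def push(self, reading):
        self.values.append(reading)

    def mean(self):
        # Average of the current window, or None if nothing has arrived yet
        return sum(self.values) / len(self.values) if self.values else None

window = RollingWindow(size=3)             # hypothetical window of N = 3 readings
for reading in [20.0, 21.0, 25.0, 24.0]:   # stand-in for a live sensor feed
    window.push(reading)

print(list(window.values))   # [21.0, 25.0, 24.0]
```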

15. Visualize and interpret a confusion matrix for a classification model.

Answer:

Use heatmap: Rows = Actual, Columns = Predicted

Diagonal = correct predictions

Off-diagonal = misclassifications

import seaborn as sns


from sklearn.metrics import confusion_matrix
sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, fmt='d', cmap='Blues')

Interpret patterns (e.g., class imbalance, systematic errors).

1. Explain the concept of “Data-ink ratio” in data visualization.

Answer:
The data-ink ratio, introduced by Edward Tufte, is the proportion of a graphic's ink that directly
represents the data, as opposed to non-data or redundant elements (such as heavy gridlines, borders, and decoration).

High Data-Ink Ratio: Most of the ink is used for displaying data.

Low Data-Ink Ratio: Non-data elements clutter the visualization.


The goal is to maximize data representation and minimize unnecessary embellishments.

2. What are the key differences between a line chart and a scatter plot in visualizing trends?

Answer:

Line Chart: Connects data points with lines, so it suits ordered data, most often values over
time. Best for showing trends and comparing series across periods.

Scatter Plot: Plots individual, unconnected data points; ideal for showing the relationship
between two variables. It can reveal correlations, clusters, and outliers.

Difference: A line chart implies that the points form a meaningful sequence (connecting them
asserts continuity between neighbors), while a scatter plot makes no assumption about order and
is used to explore how two variables relate.

3. How can you visualize the relationship between categorical and continuous variables?

Answer:

Boxplot: Displays the distribution of continuous data for each category.

Violin Plot: Combines boxplot and kernel density estimation to visualize distributions per
category.

Bar Chart with Error Bars: Displays means and uncertainty intervals for each category.

Swarm Plot/Strip Plot: Shows individual data points along with categories.

4. Describe the concept of "glyph-based visualization" and give an example.

Answer:
Glyph-based visualization uses graphical elements (glyphs) to represent multivariate data in a
compact and visually rich manner. Each glyph encodes multiple attributes in a single visual
element.
Example: In a star plot, each variable is represented as a spoke radiating from the center, with
the length of each spoke corresponding to the value of the variable. This allows a quick view of
the multivariate profile of each data point.

5. What is a “Gantt Chart” and when would you use it in data visualization?

Answer:
A Gantt Chart is a type of bar chart that visualizes a project schedule, showing tasks or events
along a time axis.

Use case: It’s used for project management to show start and end dates, task dependencies,
and the progress of multiple tasks simultaneously.

Example: A project with tasks like "Design," "Development," and "Testing" scheduled over
months.
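
The example schedule could be drawn with horizontal bars as sketched below. The start weeks and durations are invented for illustration, and dependencies or progress markers are left out to keep the sketch short.

```python
import matplotlib.pyplot as plt

# Hypothetical schedule: (task, start week, duration in weeks)
tasks = [("Design", 0, 4), ("Development", 3, 8), ("Testing", 9, 4)]

fig, ax = plt.subplots(figsize=(7, 2.5))
for row, (name, start, duration) in enumerate(tasks):
    # `left` shifts each bar to its start; `width` is the task's duration
    ax.barh(y=row, width=duration, left=start, height=0.5)
ax.set_yticks(list(range(len(tasks))))
ax.set_yticklabels([name for name, _, _ in tasks])
ax.invert_yaxis()            # first task on top, as in a typical Gantt chart
ax.set_xlabel("Week")
fig.tight_layout()
```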

6. Create a 3D scatter plot in Python using matplotlib and explain its usage.

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 9]
z = [1, 4, 9, 16, 25]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.show()

Answer:
A 3D scatter plot is used for visualizing data in three dimensions. It is particularly useful for
identifying patterns or relationships between three variables simultaneously. In this plot, each
point is placed at coordinates defined by x, y, and z.

7. What is the purpose of “facet grids” in data visualization? Provide an example.

Answer:
Facet Grids (or facets) split data into subsets and plot each subset in a separate chart, sharing
the same axis scales.

Purpose: To compare different subsets of data (e.g., categories, groups) without cluttering a
single chart.

Example: A facet grid of histograms showing the distribution of salary by department.

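import matplotlib.pyplot as plt  # needed for plt.hist below
import seaborn as sns

# One histogram panel per department, colored by gender
sns.FacetGrid(data=df, col="department", hue="gender").map(plt.hist, "salary")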

8. When would you use a stacked bar chart, and what are its limitations?

Answer:

Use case: A stacked bar chart is used when you want to visualize part-to-whole relationships,
with each bar showing the contribution of categories to a total across different groups.

Limitations:
Difficult to compare segments across bars, because only the bottom segment shares a common baseline.

Can get cluttered when too many categories are involved.

Misleading if the total is not clearly understood.
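
A minimal stacked bar sketch in Matplotlib, using invented quarterly revenue figures. Each series is drawn on top of a running total passed as `bottom`, which is also why only the first series sits on a common baseline.

```python
import numpy as np
import matplotlib.pyplot as plt

groups = ["Q1", "Q2", "Q3", "Q4"]
# Hypothetical revenue by product line
parts = {"Hardware": np.array([30, 35, 32, 40]),
         "Software": np.array([20, 25, 30, 28]),
         "Services": np.array([10, 12, 15, 18])}

fig, ax = plt.subplots()
bottom = np.zeros(len(groups))
for name, values in parts.items():
    ax.bar(groups, values, bottom=bottom, label=name)  # stack on the running total
    bottom += values
ax.set_ylabel("Revenue")
ax.legend()
```

After the loop, `bottom` holds the bar totals, which can be printed above each bar to address the last limitation.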

9. How would you visualize hierarchical data, and what are the common visualization
techniques?

Answer:

Treemap: Represents hierarchical data as a set of nested rectangles, where the size and color
represent value.

Dendrogram/Tree Diagram: Shows the hierarchical relationships between entities.

Sunburst Chart: Circular variant of treemaps showing hierarchical data with radial layout.

Icicle Plot: Stacks rectangular bands level by level, with each child placed beneath (or beside) its parent; effectively a sunburst chart drawn in Cartesian rather than radial coordinates.

10. What is the purpose of a heatmap in visualizing correlation matrices?

Answer:
A heatmap is used to visualize the correlation matrix of a dataset, where each cell’s color
intensity represents the degree of correlation between two variables.

Purpose: Helps identify strong correlations, weak correlations, and potential multicollinearity
among variables.

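Interpretation: The colors depend on the chosen colormap. With a diverging red-yellow-blue
palette, for example, saturated red and blue cells at the two extremes indicate strong correlations
of opposite sign, while yellow cells near the midpoint suggest weak or no correlation.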

11. Explain the role of annotations in a data visualization. Provide an example.

Answer:
Annotations are textual elements added to visualizations to provide context, highlight key points,
or explain patterns.

Example: In a line chart, you may annotate significant peaks, such as marking a sales spike
with a note "Product launch leads to peak in sales."

plt.plot(x, y)
plt.annotate('Launch event', xy=(5, 10), xytext=(6, 12),
             arrowprops=dict(color='blue', arrowstyle='->'))  # '->' is an unfilled arrow, so set color rather than facecolor

12. How does dimensionality reduction impact the choice of visualization?

Answer:
Dimensionality reduction methods like PCA or t-SNE reduce high-dimensional data to lower
dimensions (typically 2D or 3D) to facilitate visualization.

Impact: Makes complex data interpretable by projecting high-dimensional points into a 2D/3D
space.

Best use case: For visualizing clusters, relationships, or patterns in datasets with many features
(e.g., genes, images).

13. Describe how to visualize categorical data distributions using a count plot.

Answer:
A count plot is a bar plot used to show the frequency of categories in a dataset.

Usage: Ideal for displaying the distribution of categorical data, e.g., the number of occurrences
of different classes in a target variable.

import seaborn as sns


sns.countplot(x='category', data=df)

14. What are the advantages and disadvantages of using line charts over bar charts?

Answer:

Advantages of Line Charts:

Best for showing trends over time (time series).

Ideal for showing continuous data.

Can effectively handle multiple lines (series).

Disadvantages of Line Charts:

Less effective for categorical data compared to bar charts.


Harder to compare exact values between different categories.

15. Explain the concept of interaction in modern data visualizations.

Answer:
Interaction in data visualization allows users to actively engage with the data and customize
their view. Examples include:

Zooming and Panning: Allows users to explore different sections of the data.

Hover/Tooltips: Display additional information on mouseover.

Filtering: Users can narrow down data based on specific criteria.

Purpose: Enhances exploratory analysis, making visualizations dynamic and adaptable.
