C49 DWM Expt4


LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.04

A.1 Aim:
Implementation of Data Discretization (any one) & Visualization (any one)

A.2 Prerequisite:
Familiarity with the programming languages

A.3 Outcome:
After successful completion of this experiment, students will be able to:
⮚ Discretize and visualize data

A.4 Theory:
1. Discretization

Discretization by Binning
Binning is a top-down splitting technique based on a specified number of
bins. These methods are also used as discretization methods for data
reduction and concept hierarchy generation. For example, attribute values
can be discretized by applying equal-width or equal-frequency binning,
and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively. These
techniques can be applied recursively to the resulting partitions to
generate concept hierarchies. Binning does not use class information and
is therefore an unsupervised discretization technique. It is sensitive to the
user-specified number of bins, as well as the presence of outliers.
Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing. Figure
3.2 illustrates some binning techniques. In this example, the data for price
are first sorted and then partitioned into equal-frequency bins of size 3
(i.e., each bin contains three values). In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. For example, the
mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary
value. In general, the larger the width, the greater the effect of the
smoothing. Alternatively, bins may be equal width, where the interval
range of values in each bin is constant.
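
The three smoothing variants described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, only the first three prices (4, 8, 15) appear in the text, and the remaining six values are assumed for illustration. Resolving ties toward the lower boundary is also an assumption.

```python
from statistics import mean, median

def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum.

    Bins are already sorted, so b[0] is the minimum and b[-1] the maximum.
    Ties go to the lower boundary (an assumption of this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Assumed price data; only 4, 8, 15 (Bin 1) come from the text above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)

for name, result in [("bins", bins), ("means", by_means),
                     ("medians", by_medians), ("boundaries", by_boundaries)]:
    print(name, result)
```

With this data, Bin 1 becomes 9, 9, 9 under smoothing by bin means, which agrees with the worked example in the text.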

Discretization by Histogram Analysis


Like binning, histogram analysis is an unsupervised discretization
technique because it does not use class information. Histograms were
introduced in Section 2.2.3. A histogram partitions the values of an
attribute, A, into disjoint ranges called buckets or bins. Various partitioning
rules can be used to define histograms. In an equal-width histogram, for
example, the values are partitioned into equal-size partitions or ranges
(e.g., earlier in Figure 3.8 for price, where each bucket has a width of
$10). With an equal-frequency histogram, the values are partitioned so
that, ideally, each partition contains the same number of data tuples. The
histogram analysis algorithm can be applied recursively to each partition
in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a prespecified number of concept levels has
been reached. A minimum interval size can also be used per level to
control the recursive procedure. This specifies the minimum width of a
partition, or the minimum number of values for each partition at each
level.
A histogram for an attribute, A, partitions the data distribution of A into
disjoint subsets, referred to as buckets or bins. If each bucket represents
only a single attribute–value/frequency pair, the buckets are called
singleton buckets. Often, buckets instead represent continuous ranges for
the given attribute.
There are several partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket
range is uniform.
Equal-frequency (or equal-depth): In an equal-frequency histogram, the
buckets are created so that, roughly, the frequency of each bucket is
constant (i.e., each bucket contains roughly the same number of
contiguous data samples).
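
The two partitioning rules can be compared with a short sketch. The helper names and the sample values below are hypothetical and chosen only to show how a few large values affect each rule; the sketch also assumes the data contains at least two distinct values.

```python
def equal_width_buckets(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    buckets = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bucket k, so clamp it to k - 1.
        index = min(int((v - lo) / width), k - 1)
        buckets[index].append(v)
    return buckets

def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (roughly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical sample with a few large values at the upper end.
sample = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

width_buckets = equal_width_buckets(sample, 3)
frequency_buckets = equal_frequency_buckets(sample, 3)

print([len(b) for b in width_buckets])      # bucket sizes under equal width
print([len(b) for b in frequency_buckets])  # bucket sizes under equal frequency
```

In this sample the equal-width rule puts most values in the first bucket because the two largest values stretch the range, while the equal-frequency rule keeps the bucket sizes balanced.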
Example Histograms. The following data are a list of AllElectronics
prices for commonly sold items (rounded to the nearest dollar). The
numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14,
14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20,
20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30. Figure (a)
shows a histogram for the data using singleton buckets. To further reduce
the data, it is common to have each bucket denote a continuous value
range for the given attribute. In Figure (b), each bucket represents a
different $10 range for price.

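[Figure: (a) histogram for price using singleton buckets; (b) equal-width histogram for price, where each bucket covers a $10 range]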
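
The bucket counts behind the two histograms can be reproduced from the price list given above. This is a minimal sketch: the helper name `bucket_range` is hypothetical, and the $10 buckets are assumed to be 1-10, 11-20, and 21-30.

```python
from collections import Counter

# Sorted AllElectronics prices, copied from the example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14,
          14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

# (a) Singleton buckets: one bucket per distinct price.
singleton = Counter(prices)

def bucket_range(price, width=10):
    """Return the (low, high) range of the equal-width bucket holding price.

    Buckets are assumed to start at 1: 1-10, 11-20, 21-30.
    """
    low = ((price - 1) // width) * width + 1
    return (low, low + width - 1)

# (b) Equal-width buckets of $10.
equal_width = Counter(bucket_range(p) for p in prices)

for price, count in sorted(singleton.items()):
    print(f"${price}: {count}")
for (low, high), count in sorted(equal_width.items()):
    print(f"${low}-{high}: {count}")
```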
Histograms are highly effective at approximating both sparse and dense
data, as well as highly skewed and uniform data. The histograms
described before for single attributes can be extended for multiple
attributes. Multidimensional histograms can capture dependencies
between attributes. These histograms have been found effective in
approximating data with up to five attributes. More studies are needed
regarding the effectiveness of multidimensional histograms for high
dimensionalities. Singleton buckets are useful for storing high-frequency
outliers.

2. Visualization
■ Why data visualization?
  ■ Gain insight into an information space by mapping data onto
    graphical primitives
  ■ Provide a qualitative overview of large data sets
  ■ Search for patterns, trends, structure, irregularities, and
    relationships among data
  ■ Help find interesting regions and suitable parameters for further
    quantitative analysis
  ■ Provide a visual proof of computer representations derived
■ Categorization of visualization methods:
  ■ Pixel-oriented visualization techniques
  ■ Geometric projection visualization techniques
  ■ Icon-based visualization techniques
  ■ Hierarchical visualization techniques
  ■ Visualizing complex data and relations

Geometric projection visualization techniques


■ Visualization of geometric transformations and projections of the data
■ Methods
  ■ Direct visualization
  ■ Scatterplot and scatterplot matrices
  ■ Landscapes
  ■ Projection pursuit technique: Help users find meaningful
    projections of multidimensional data
  ■ Prosection views
  ■ Hyperslice
  ■ Parallel coordinates
Scatterplot and scatterplot matrices
■ Provides a first look at bivariate data to see clusters of points,
outliers, etc.
■ Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
■ Matrix of scatterplots (x-y diagrams) of the k-dimensional data
[total of (k^2 - k)/2 distinct scatterplots]
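
A scatterplot matrix needs one panel per pair of attributes. Counting each unordered pair once gives k(k - 1)/2 distinct panels for k attributes, which the sketch below enumerates. The attribute names are hypothetical.

```python
from itertools import combinations

def scatterplot_pairs(attributes):
    """List every unordered pair of attributes (one scatterplot each)."""
    return list(combinations(attributes, 2))

# Hypothetical attribute names for a 4-dimensional data set.
attributes = ["age", "income", "height", "weight"]

pairs = scatterplot_pairs(attributes)
k = len(attributes)

print(len(pairs), "distinct scatterplots for k =", k)
for x, y in pairs:
    print(f"{x} vs {y}")
```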

Histogram:
■ Histogram: Graph display of tabulated frequencies, shown as bars
■ It shows what proportion of cases fall into each of several categories
■ Differs from a bar chart in that it is the area of the bar that
denotes the value, not the height as in bar charts, a crucial
distinction when the categories are not of uniform width
■ The categories are usually specified as non-overlapping
intervals of some variable. The categories (bars) must be
adjacent
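
The area rule can be checked numerically. In the sketch below, each bar height is the count divided by (total count × bin width), so that the bar area equals the proportion of cases in the bin. The function name, bin edges, and counts are hypothetical.

```python
def bar_heights(edges, counts):
    """Return bar heights such that height * width = proportion of cases."""
    total = sum(counts)
    heights = []
    for left, right, count in zip(edges, edges[1:], counts):
        width = right - left
        heights.append(count / (total * width))
    return heights

# Hypothetical bins: the last bin is twice as wide as the first two.
edges = [0, 10, 20, 40]
counts = [5, 5, 10]

heights = bar_heights(edges, counts)
areas = [h * (right - left)
         for h, left, right in zip(heights, edges, edges[1:])]

print("heights:", heights)
print("areas:  ", areas)
```

Here the last bin holds twice as many cases as each of the others but is also twice as wide, so all three bars have the same height while the areas still reflect the proportions.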

[Figure: example histogram; x-axis ticks from 10000 to 90000, y-axis ticks from 0 to 40]

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments
within two hours of the practical. The soft copy must be uploaded
on Blackboard or emailed to the concerned lab in-charge faculty
at the end of the practical in case there is no Blackboard access
available.)

B.1 Software Code written by student:


Example dataset:
Bowler Total Wickets
Anil Kumble 956
Harbhajan Singh 711
Kapil Dev 687
Zaheer Khan 610
Ravichandran Ashwin 715
Javagal Srinath 551
Venkatesh Prasad 292
Ishant Sharma 456
Mohammed Shami 434
Bhuvneshwar Kumar 286
Ravindra Jadeja 610
Ajit Agarkar 349
Ashish Nehra 235
Kuldeep Yadav 283
Yuzvendra Chahal 266
S. Sreesanth 169
R. P. Singh 124
Umesh Yadav 288
Maninder Singh 88
Vinoo Mankad 162
• Histogram:
Input code:
import matplotlib.pyplot as plt

# Sample data
bowlers = [
'Anil Kumble', 'Harbhajan Singh', 'Kapil Dev', 'Zaheer Khan',
'Ravichandran Ashwin', 'Javagal Srinath', 'Venkatesh Prasad',
'Ishant Sharma', 'Mohammed Shami', 'Bhuvneshwar Kumar',
'Ravindra Jadeja', 'Ajit Agarkar', 'Ashish Nehra', 'Kuldeep Yadav',
'Yuzvendra Chahal', 'S. Sreesanth', 'R. P. Singh', 'Umesh Yadav',
'Maninder Singh', 'Vinoo Mankad'
]
total_wickets = [
956, 711, 687, 610, 715,
551, 292, 456, 434, 286,
610, 349, 235, 283, 266,
169, 124, 288, 88, 162
]

# Create histogram with 10 equal-width bins
plt.figure(figsize=(12, 8))
plt.hist(total_wickets, bins=10, edgecolor='black', color='skyblue')

# Adding labels and title
plt.xlabel('Total Wickets')
plt.ylabel('Number of Bowlers')
plt.title('Histogram of Total Wickets by Bowlers')

# Add a horizontal grid and display the histogram
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
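
The histogram code groups the wickets into equal-width bins only for plotting. To make the discretization step explicit, the same data can be mapped to labelled categories. This is a sketch under stated assumptions: three equal-width intervals are used, and the labels "low", "medium", and "high" and the function name `discretize` are hypothetical.

```python
from collections import Counter

total_wickets = [
    956, 711, 687, 610, 715,
    551, 292, 456, 434, 286,
    610, 349, 235, 283, 266,
    169, 124, 288, 88, 162
]

# Hypothetical category labels, one per equal-width interval.
labels = ["low", "medium", "high"]

def discretize(value, lo, hi, labels):
    """Map a value to the label of its equal-width interval in [lo, hi]."""
    width = (hi - lo) / len(labels)
    # Clamp so that the maximum value falls in the last interval.
    index = min(int((value - lo) / width), len(labels) - 1)
    return labels[index]

lo, hi = min(total_wickets), max(total_wickets)
categories = [discretize(w, lo, hi, labels) for w in total_wickets]
category_counts = Counter(categories)

for label in labels:
    print(label, category_counts[label])
```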

• Scatter-plot:
Input code:
import matplotlib.pyplot as plt

# Sample data
bowlers = [
'Anil Kumble', 'Harbhajan Singh', 'Kapil Dev', 'Zaheer Khan',
'Ravichandran Ashwin', 'Javagal Srinath', 'Venkatesh Prasad',
'Ishant Sharma', 'Mohammed Shami', 'Bhuvneshwar Kumar',
'Ravindra Jadeja', 'Ajit Agarkar', 'Ashish Nehra', 'Kuldeep Yadav',
'Yuzvendra Chahal', 'S. Sreesanth', 'R. P. Singh', 'Umesh Yadav',
'Maninder Singh', 'Vinoo Mankad'
]
total_wickets = [
956, 711, 687, 610, 715,
551, 292, 456, 434, 286,
610, 349, 235, 283, 266,
169, 124, 288, 88, 162
]

# Create scatter plot (one point per bowler)
plt.figure(figsize=(12, 8))
plt.scatter(bowlers, total_wickets, color='skyblue',
edgecolor='black')

# Adding labels and title
plt.xlabel('Bowlers')
plt.ylabel('Total Wickets')
plt.title('Scatter Plot of Total Wickets by Bowlers')

# Rotate x-axis labels for better readability
plt.xticks(rotation=90)

# Add a grid and display the scatter plot
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

B.2 Input and Output:


1. Histogram

Output:
2. Scatter-plot

Output:
B.3 Observations and learning:
Histogram:
• Distribution of Total Wickets: Shows how the bowlers' wicket
totals are distributed across different ranges. Certain ranges
contain more bowlers than others.
• Frequency Insight: Helps in understanding the
concentration of bowlers within each wicket range.
Scatter Plot:
• Total Wickets by Bowlers: Visualizes the performance of
bowlers, indicating which bowlers have high or low total wickets.
• Performance Variation: Highlights the variation in total
wickets among different bowlers, showing clear high performers
and those with fewer wickets.

B.4 Conclusion:
• Data Discretization: Converting continuous data into
discrete categories simplifies analysis and interpretation.
• Histogram Visualization: Provides a clear view of the
distribution and frequency of categorized data.
• Scatter Plot Visualization: Demonstrates the performance
variation among bowlers, making it easier to compare and assess
individual contributions.
Using both histogram and scatter plot visualizations provides a
comprehensive understanding of different types of data, helping in
effective analysis and decision-making.

B.5 Question of Curiosity


(To be answered by student based on the practical performed and learning/observations)

Q1: Explain data dispersion characteristics.


Data dispersion, also known as data variability or spread, refers to the
extent to which data points in a dataset are spread out or dispersed from
the central value, such as the mean or median. It provides insights into
how much individual data points deviate from the central tendency and
helps us understand the diversity or variability within the dataset.
Several measures are used to quantify data dispersion, each providing a
different perspective on the spread of data points.
Some of the common measures of data dispersion include:

1. Range: The range is the simplest measure of dispersion and is
calculated as the difference between the maximum and minimum values
in the dataset. While easy to calculate, the range can be greatly affected
by outliers and may not provide a comprehensive view of dispersion.

2. Variance: Variance measures the average squared deviation of each
data point from the mean of the dataset. It provides an overall measure
of how much the data points deviate from the mean. A higher variance
indicates greater dispersion.

3. Standard Deviation: The standard deviation is the square root of the
variance. It indicates how far data points typically lie from the mean.
It is widely used because it is in the same units as the data, making it
easier to interpret.

4. Mean Absolute Deviation (MAD): MAD measures the average
absolute deviation of each data point from the mean. It provides a
measure of dispersion that is less sensitive to outliers compared to the
variance and standard deviation.
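
The four measures above can be computed with Python's standard `statistics` module. This is a minimal sketch: the sample values are hypothetical, the population forms of variance and standard deviation are assumed (dividing by n), and `mean_absolute_deviation` is a helper written for this example.

```python
from statistics import mean, pvariance, pstdev

def mean_absolute_deviation(values):
    """Average absolute deviation of each value from the mean."""
    center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)

# Hypothetical sample with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(sample) - min(sample)   # maximum minus minimum
variance = pvariance(sample)             # average squared deviation
std_dev = pstdev(sample)                 # square root of the variance
mad = mean_absolute_deviation(sample)

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("mean absolute deviation:", mad)
```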

Q2: Explain different graphic displays of basic statistical
descriptions in brief.

Graphical displays of basic statistical descriptions are visual
representations that help summarize and communicate key
characteristics of a dataset. Here are some common graphic displays:

• Histogram: A histogram uses adjacent bars to represent the
frequency distribution of continuous or discrete data. It displays the
frequency of data within predefined intervals (bins), providing
insights into the data's distribution.

• Box Plot (Box-and-Whisker Plot): A box plot displays the
distribution of data using quartiles. It shows the median, quartiles,
and potential outliers, giving a quick overview of central tendency,
spread, and symmetry of the data.

• Scatter Plot: A scatter plot displays individual data points as dots
on a two-dimensional plane. It is used to visualize the relationship
between two variables and identify patterns, trends, or correlations.

• Bar Chart: A bar chart represents categorical data using
rectangular bars. It shows the frequency, count, or proportion of each
category, making it suitable for comparing different categories.

• Line Chart: A line chart displays data points connected by lines. It
is commonly used to visualize trends and changes in data over time
or across ordered categories.

• Pie Chart: A pie chart divides a dataset into slices, representing
the proportion of each category relative to the whole. It is suitable for
displaying parts-to-whole relationships.

• Scatterplot Matrix: A scatterplot matrix (or pair plot) displays
scatter plots between all pairs of variables in a dataset. It helps
visualize relationships and correlations between multiple variables
simultaneously.
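
The values a box plot draws (minimum, quartiles, median, maximum, and potential outliers) can be computed directly. The sketch below uses `statistics.quantiles` with the "inclusive" method and flags outliers with the common 1.5 × IQR convention. Both choices, the function names, and the sample values are assumptions made for illustration.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the values."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def iqr_outliers(values):
    """Values more than 1.5 * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical sample with one unusually large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]

summary = five_number_summary(sample)
outliers = iqr_outliers(sample)

print("min, Q1, median, Q3, max:", summary)
print("potential outliers:", outliers)
```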

Q3: What are the icon-based visualization techniques? Explain in short.

Icon-based visualization techniques use icons or symbolic
representations to convey information visually. These techniques can
help simplify complex data and make patterns more apparent. Here
are some common icon-based visualization methods:
1. Pictograms
• Description: Represent data using images or icons where
each icon corresponds to a unit of measurement.
• Example: A pictogram showing 10 icons of people to represent
10 individuals out of a total population.
2. Heat Maps
• Description: Use color-coded icons or symbols to represent the
intensity or frequency of data values.
• Example: A heat map with color-coded squares to show
temperature variations across different regions.
3. Icon Arrays
• Description: Display data using a grid of icons where each
icon represents a specific quantity or percentage.
• Example: An array of 100 icons where 40 icons are shaded to
represent 40% of the total.
4. Bubble Charts
• Description: Use circles (bubbles) of varying sizes to represent
data values, where the size of the bubble corresponds to the
magnitude of the data.
• Example: A bubble chart with different-sized circles
representing sales figures across various regions.
5. Symbol Maps
• Description: Use symbols or icons placed on geographic
maps to represent data values related to specific locations.
• Example: A map with icons representing the number of hospitals in
different areas.
6. Tree Maps
• Description: Display hierarchical data using nested
rectangles or icons, where the size of each rectangle represents
a portion of the whole.
• Example: A tree map showing the distribution of budget across
different departments using nested icons.
