C49 DWM Expt4
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.04
A.1 Aim:
Implementation of Data Discretization (any one) & Visualization (any one)
A.2 Prerequisite:
Familiarity with a programming language
A.3 Outcome:
After successful completion of this experiment,
students will be able to:
⮚ Discretize and visualize data
A.4 Theory:
1. Discretization
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of
bins. These methods are also used as discretization methods for data
reduction and concept hierarchy generation. For example, attribute values
can be discretized by applying equal-width or equal-frequency binning,
and then replacing each bin value by the bin mean or median, as in
smoothing by bin means or smoothing by bin medians, respectively. These
techniques can be applied recursively to the resulting partitions to
generate concept hierarchies. Binning does not use class information and
is therefore an unsupervised discretization technique. It is sensitive to the
user-specified number of bins, as well as the presence of outliers.
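The two partitioning schemes named above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names and the price values are assumptions made for the example, and only the standard library is used.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals that all span the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything belongs in the first bin.
        return [sorted(values)] + [[] for _ in range(n_bins - 1)]
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Illustrative price values (an assumption for this sketch).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)
freq_bins = equal_frequency_bins(prices, 3)
print("Equal-width bins:    ", width_bins)
print("Equal-frequency bins:", freq_bins)
```

With equal-width bins the number of values per bin varies, while equal-frequency bins hold the same number of values but cover intervals of different widths.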
Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing. Figure
3.2 illustrates some binning techniques. In this example, the data for price
are first sorted and then partitioned into equal-frequency bins of size 3
(i.e., each bin contains three values). In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin. For example, the
mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary
value. In general, the larger the width, the greater the effect of the
smoothing. Alternatively, bins may be equal width, where the interval
range of values in each bin is constant.
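The three smoothing methods described above can be sketched as follows. The first bin (4, 8, 15) comes from the text; the other two bins and all function names are assumptions made for illustration.

```python
from statistics import mean, median


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-frequency bins of size 3; bins 2 and 3 are assumed values.
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print("Means:     ", by_means)
print("Medians:   ", by_medians)
print("Boundaries:", by_boundaries)
```

For Bin 1 the mean is 9, so smoothing by bin means turns 4, 8, 15 into 9, 9, 9, in agreement with the worked example in the text.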
Histograms are highly effective at approximating both sparse and dense
data, as well as highly skewed and uniform data. The histograms
described before for single attributes can be extended for multiple
attributes. Multidimensional histograms can capture dependencies
between attributes. These histograms have been found effective in
approximating data with up to five attributes. More studies are needed
regarding the effectiveness of multidimensional histograms for high
dimensionalities. Singleton buckets are useful for storing high-frequency
outliers.
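A two-attribute histogram can be sketched as a grid of equal-width buckets with one count per cell. The records below are hypothetical (age, income in thousands) pairs invented for this example, and the function names are likewise assumptions.

```python
from collections import Counter


def bucket_index(value, low, width, n_buckets):
    """Map a value to an equal-width bucket index, clamping the top edge."""
    return min(int((value - low) / width), n_buckets - 1)


def histogram_2d(points, x_range, y_range, n_buckets):
    """Count how many (x, y) points fall into each cell of the bucket grid."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    x_width = (x_hi - x_lo) / n_buckets
    y_width = (y_hi - y_lo) / n_buckets
    counts = Counter()
    for x, y in points:
        cell = (bucket_index(x, x_lo, x_width, n_buckets),
                bucket_index(y, y_lo, y_width, n_buckets))
        counts[cell] += 1
    return counts


# Hypothetical (age, income in thousands) records, invented for illustration.
records = [(22, 18), (25, 22), (31, 40), (35, 45),
           (38, 52), (52, 80), (58, 85), (61, 90)]
counts = histogram_2d(records, x_range=(20, 65), y_range=(0, 100), n_buckets=3)
for cell in sorted(counts):
    print(cell, counts[cell])
```

In this made-up data most records fall in cells near the diagonal of the grid, which is the kind of dependency between attributes that two separate single-attribute histograms would not show.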
2. Visualization
■ Why data visualization?
■ Gain insight into an information space by
mapping data onto graphical primitives
■ Provide qualitative overview of large data sets
■ Search for patterns, trends, structure, irregularities,
relationships among data
■ Help find interesting regions and
suitable parameters for further quantitative analysis
■ Provide a visual proof of computer representations derived
■ Categorization of visualization methods:
■ Pixel-oriented visualization techniques
■ Geometric projection visualization techniques
■ Icon-based visualization techniques
■ Hierarchical visualization techniques
■ Visualizing complex data and relations
Histogram:
■ Histogram: Graph display of tabulated frequencies, shown as bars
■ It shows what proportion of cases fall into each of several categories
■ Differs from a bar chart in that it is the area of the bar that
denotes the value, not the height as in bar charts, a crucial
distinction when the categories are not of uniform width
■ The categories are usually specified as non-overlapping
intervals of some variable. The categories (bars) must be
adjacent
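The area rule for intervals of unequal width can be sketched as follows: the bar height is the frequency density (count divided by interval width), so that height times width gives back the count. The values and interval edges are hypothetical, chosen only for illustration.

```python
def frequency_density(values, edges):
    """Return (count, density) per interval, where density = count / width."""
    result = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        # Intervals are half-open [lo, hi), except the last, which includes hi.
        count = sum(1 for v in values
                    if lo <= v < hi or (i == last and v == hi))
        result.append((count, count / (hi - lo)))
    return result


# Hypothetical values and non-uniform interval edges.
values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]
densities = frequency_density(values, edges)
for (lo, hi), (count, density) in zip(zip(edges, edges[1:]), densities):
    print(f"[{lo}, {hi}): count={count}, bar height={density}")
```

The last interval holds more values than the middle one, yet its bar is lower because it is twice as wide; plotting raw counts as heights would overstate it.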
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
• Histogram:
Input code:
import matplotlib.pyplot as plt

# Sample data
bowlers = [
    'Anil Kumble', 'Harbhajan Singh', 'Kapil Dev', 'Zaheer Khan',
    'Ravichandran Ashwin', 'Javagal Srinath', 'Venkatesh Prasad',
    'Ishant Sharma', 'Mohammed Shami', 'Bhuvneshwar Kumar',
    'Ravindra Jadeja', 'Ajit Agarkar', 'Ashish Nehra', 'Kuldeep Yadav',
    'Yuzvendra Chahal', 'S. Sreesanth', 'R. P. Singh', 'Umesh Yadav',
    'Maninder Singh', 'Vinoo Mankad'
]
total_wickets = [
    956, 711, 687, 610, 715,
    551, 292, 456, 434, 286,
    610, 349, 235, 283, 266,
    169, 124, 288, 88, 162
]
# Create histogram: group the wicket totals into 10 equal-width bins
plt.figure(figsize=(12, 8))
plt.hist(total_wickets, bins=10, edgecolor='black', color='skyblue')
# Axis labels and title (assumed wording)
plt.xlabel('Total wickets')
plt.ylabel('Number of bowlers')
plt.title('Distribution of total wickets')
plt.show()
• Scatter-plot:
Input code:
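import matplotlib.pyplot as plt

# Sample data
bowlers = [
    'Anil Kumble', 'Harbhajan Singh', 'Kapil Dev', 'Zaheer Khan',
    'Ravichandran Ashwin', 'Javagal Srinath', 'Venkatesh Prasad',
    'Ishant Sharma', 'Mohammed Shami', 'Bhuvneshwar Kumar',
    'Ravindra Jadeja', 'Ajit Agarkar', 'Ashish Nehra', 'Kuldeep Yadav',
    'Yuzvendra Chahal', 'S. Sreesanth', 'R. P. Singh', 'Umesh Yadav',
    'Maninder Singh', 'Vinoo Mankad'
]
total_wickets = [
    956, 711, 687, 610, 715,
    551, 292, 456, 434, 286,
    610, 349, 235, 283, 266,
    169, 124, 288, 88, 162
]
# Create scatter plot: one point per bowler
# (assumed layout: bowler on the x-axis, total wickets on the y-axis)
plt.figure(figsize=(12, 8))
plt.scatter(bowlers, total_wickets, color='steelblue')
plt.xticks(rotation=90)  # rotate the names so they do not overlap
plt.xlabel('Bowler')
plt.ylabel('Total wickets')
plt.title('Total wickets by bowler')
plt.tight_layout()
plt.show()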
Output:
B.3 Observations and learning:
Histogram:
• Distribution of Total Wickets: Shows how the bowlers' total
wickets are distributed across different ranges. Certain wicket
ranges contain more bowlers than others.
• Frequency Insight: Helps in understanding the
concentration of bowlers within each wicket range.
Scatter Plot:
• Total Wickets by Bowlers: Visualizes the performance of
bowlers, indicating which bowlers have high or low total wickets.
• Performance Variation: Highlights the variation in total
wickets among different bowlers, showing clear high performers
and those with fewer wickets.
B.4 Conclusion:
• Data Discretization: Converting continuous data into
discrete categories simplifies analysis and interpretation.
• Histogram Visualization: Provides a clear view of the
distribution and frequency of categorized data.
• Scatter Plot Visualization: Demonstrates the performance
variation among bowlers, making it easier to compare and assess
individual contributions.
Using both histogram and scatter plot visualizations provides a
comprehensive understanding of different types of data, helping in
effective analysis and decision-making.
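The discretization step named in the conclusion can be sketched on the same wicket data as follows. The cut points (300 and 600) and the labels are hypothetical, chosen only for illustration; they are not part of the experiment as recorded.

```python
import bisect
from collections import Counter

# Total wickets copied from the Part B sample data.
total_wickets = [
    956, 711, 687, 610, 715,
    551, 292, 456, 434, 286,
    610, 349, 235, 283, 266,
    169, 124, 288, 88, 162
]

# Hypothetical cut points and labels (assumptions for this sketch).
cut_points = [300, 600]
labels = ['low', 'medium', 'high']


def discretize(value):
    """Map a wicket total to a label: <300 low, 300-599 medium, >=600 high."""
    # bisect_right counts how many cut points are <= value.
    return labels[bisect.bisect_right(cut_points, value)]


categories = [discretize(w) for w in total_wickets]
category_counts = Counter(categories)
for label in labels:
    print(label, category_counts[label])
```

Each continuous wicket total is replaced by one of three category labels, and the per-category counts are the same kind of summary that a three-bar histogram over these intervals would display.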