Data
Introduction
Data visualization is part of data exploration, which is a critical step in the AI cycle. You will
use this technique to gain understanding of and insight into the data you have gathered, and
to determine whether the data is ready for further processing or whether you need to collect
more data or clean it.
You will also use this technique to present your results.
In this notebook, we will explore Python packages that are crucial for data science. Packages
such as pandas, NumPy and Matplotlib are used throughout the process.
Context
We will be working with Jaipur weather data obtained from Kaggle, a platform for data
enthusiasts to gather, share knowledge and compete for many prizes!
The data has been cleaned and simplified, so that we can focus on data visualization instead
of data cleaning. Our data is stored in the file named JaipurFinalCleanData.csv. This file
contains weather information of Jaipur and is saved at the same location as the notebook.
What do you do next?
Let's import pandas and start by reading the CSV file.
# pandas is conventionally imported under the alias pd
import pandas as pd
# Read the CSV file into a variable that we will call dataframe
dataframe = pd.read_csv("JaipurFinalCleanData.csv")
Task 1: Display the first 10 rows of data by modifying the function above
print(dataframe.head(10))
Now that you have listed the first few rows of the data, what do you notice?
• What headers are there? If you are not sure, look them up online!
• Do the values recorded make sense to you?
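To answer the questions above without scrolling through rows, you can ask pandas for the headers, the data types and a numeric summary directly. This is a minimal sketch on a small made-up frame (the name `sample` and its values are assumptions for illustration, not the Jaipur file):

```python
import pandas as pd

# Small stand-in frame: column names mirror the notebook, values are made up.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05", "2016-05-06"],
    "mean_temperature": [34, 31, 28],
    "mean_pressure": [1000.5, 1001.2, 999.8],
})

column_names = list(sample.columns)  # the headers
column_types = sample.dtypes         # one data type per column
summary = sample.describe()          # count, mean, std, min, quartiles, max (numeric columns only)

print(column_names)
print(column_types)
print(summary)
```

The same three calls should work unchanged on `dataframe` once the CSV file has been read.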
dataframe.dtypes
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
max_dew_pt_2 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
What do you notice? What are the parameters involved? What are the outputs generated?
Looks like there are 17 columns in this dataset, and we don't need all of them for the
purposes of this activity. One way to handle this is to drop the columns that we
don't need. pandas provides an easy way to drop columns: the ".drop" method.
dataframe = dataframe.drop(["max_dew_pt_2"], axis=1)
Let's check that the column has been dropped. Try printing the data with head() or dtypes.
dataframe.dtypes
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
min_dew_pt_2 int64
max_pressure_1 int64
max_pressure_2 int64
min_pressure_1 int64
min_pressure_2 int64
rainfall float64
dtype: object
date object
mean_temperature int64
max_temperature int64
min_temperature int64
Mean_dew_pt int64
mean_pressure float64
max_humidity int64
min_humidity int64
max_dew_pt_1 int64
min_dew_pt_1 int64
max_pressure_1 int64
min_pressure_1 int64
rainfall float64
dtype: object
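The shorter listing above no longer shows min_dew_pt_2, max_pressure_2 and min_pressure_2. One way to get there, sketched here on a made-up frame (`sample`, `unwanted` and `trimmed` are illustrative names, and this is not necessarily the call used in the original notebook), is to pass a list of names to `drop`:

```python
import pandas as pd

# Stand-in frame with a few of the notebook's column names; values are made up.
sample = pd.DataFrame({
    "date": ["2016-05-04", "2016-05-05"],
    "mean_temperature": [34, 31],
    "min_dew_pt_2": [10, 11],
    "max_pressure_2": [1005, 1006],
    "min_pressure_2": [998, 999],
})

# Passing a list to columns= drops several columns at once and returns a new frame.
unwanted = ["min_dew_pt_2", "max_pressure_2", "min_pressure_2"]
trimmed = sample.drop(columns=unwanted)

print(trimmed.dtypes)
```

`drop(columns=...)` is equivalent to `drop(..., axis=1)`. The original frame is left untouched unless you assign the result back to it.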
What do you notice from the numbers? Look at the date. Can you see how the function helps
us sort data based on the date?
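Sorting by date can be sketched as follows. The frame below is made up, and parsing the strings with `pd.to_datetime` first is a choice made here (an assumption about how you might want to store the column) so that the order is chronological by construction:

```python
import pandas as pd

# Made-up rows, deliberately out of order.
sample = pd.DataFrame({
    "date": ["2017-01-12", "2016-05-04", "2016-12-31"],
    "mean_temperature": [12, 34, 15],
})

# Convert the text into real dates, then sort from earliest to latest.
sample["date"] = pd.to_datetime(sample["date"])
by_date = sample.sort_values(by="date")

print(by_date)
```

Notice that the row labels on the left travel with their rows, which is why the index can look shuffled after a sort.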
Task 6: Sort the values in ascending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=True)
print(jaipur_weather.head(5))
date mean_temperature max_temperature min_temperature \
252 2017-01-11 10 18 3
253 2017-01-12 12 19 4
254 2017-01-13 12 20 4
255 2017-01-14 12 20 5
258 2017-01-17 12 20 5
Look at the max and min temperature! See the range of temperature that one can
experience within a day.
Task 7: Sort the values in descending order of mean temperature and print the first 5 rows
jaipur_weather = dataframe.sort_values(by='mean_temperature', ascending=False)
print(jaipur_weather.head(5))
Importing matplotlib
Matplotlib is a Python 2D plotting library that we can use to produce high-quality data
visualizations. It is highly usable (as you will soon find out): you can create simple and
complex graphs with just a few lines of code!
Now let's load matplotlib to start plotting some graphs.
import matplotlib.pyplot as plt
import numpy as np
Scatter plot
Scatter plots use a collection of points on a graph to display values from two variables. This
allows us to see if there is any relationship or correlation between the two variables.
Let's see how mean temperature changes over the years!
x = dataframe.date
y = dataframe.mean_temperature
plt.scatter(x,y)
plt.show()
Do you see that the x axis is filled with a thick line, and that there's no tick label available?
This makes us unable to analyze the data.
Let's try to modify this scatter plot so that we can see the ticks!
Task 8: Change x ticks interval so that you can see the dates clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 180))
plt.show()
What interval did you use so that you can see all the dates? Do you notice that we now
have only a few ticks?
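Another way to get readable ticks, assuming the date column is stored as text (as its object dtype suggests), is to convert it to real dates before plotting. Matplotlib then chooses a sensible tick spacing by itself. The data below is made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up daily series standing in for the Jaipur data.
dates = pd.date_range("2016-05-01", periods=120, freq="D")
temps = 30 + 5 * np.sin(np.arange(120) / 20)
sample = pd.DataFrame({"date": dates.strftime("%Y-%m-%d"), "mean_temperature": temps})

# Text dates are plotted as categories (one tick per row); real dates get automatic date ticks.
x_dates = pd.to_datetime(sample["date"])

plt.figure()
plt.scatter(x_dates, sample["mean_temperature"])
plt.gcf().autofmt_xdate()                 # tilt the date labels so they do not overlap
tick_count = len(plt.gca().get_xticks())  # far fewer ticks than rows
```

In a notebook, end the cell with plt.show() to display the figure.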
Let's try to rotate our ticks. See the example on Stack Overflow!
Note: Stack Overflow is a site where technical people gather and share their knowledge.
You can search the site for any query and see whether others have already solved it!
Task 9: Rotate the x tick labels so that we can show more ticks clearly
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=90)
plt.show()
Looks good!
plt.show()
Task 11: Change the title size to be bigger than the x and y labels!
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)
plt.show()
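One possible way to approach Task 11 is to pass a larger fontsize to the title than to the axis labels. This is only a sketch: the x and y values are made up, and the sizes 14 and 20 are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the notebook's x and y.
x = np.arange(30)
y = 25 + 10 * np.sin(x / 5)

plt.figure()
plt.scatter(x, y)
xlabel = plt.xlabel("Date", fontsize=14)
ylabel = plt.ylabel("Mean Temperature", fontsize=14)
title = plt.title("Mean Temperature at Jaipur", fontsize=20)  # larger than the axis labels
```

Each of these calls returns the text object it created, which is handy if you want to inspect or restyle it later. In a notebook, end the cell with plt.show().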
Good! Now, we can also change the size of the plot.
# Change the default figure size
plt.figure(figsize=(10,10))
plt.scatter(x,y)
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)
plt.show()
Looking good! Now, let's customize our graphs with the shapes and colours that we like.
See here for examples
plt.scatter(x,y, marker='*')
plt.xticks(np.arange(0, 731, 60))
plt.xticks (rotation=30)
# Add x and y labels, title and set a font size
plt.xlabel ("Date", fontsize = 24)
plt.ylabel ("Mean Temperature", fontsize = 24)
plt.title('Mean Temperature at Jaipur', fontsize = 30)
plt.show()
Look at your working directory and check if the new image file has been created!
Task 13: Change the data points to your favourite colour.
Change the font colour and size too!
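A sketch for Task 13, assuming made-up data and arbitrarily chosen colour names: the color argument sets the colour of the points, and the same keyword on the label and title functions sets the font colour.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the notebook's x and y.
x = np.arange(30)
y = 25 + 10 * np.sin(x / 5)

plt.figure()
points = plt.scatter(x, y, color="tomato", marker="*")   # colour of the data points
plt.xlabel("Date", fontsize=14, color="navy")            # font colour and size
plt.ylabel("Mean Temperature", fontsize=14, color="navy")
title = plt.title("Mean Temperature at Jaipur", fontsize=20, color="darkgreen")
```

Matplotlib accepts named colours such as "tomato" as well as hex strings such as "#ff6347". In a notebook, end the cell with plt.show().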
If you are wondering how to get the nice bubble-style scatter plot shown in the slides, here
is some sample code. Try running it!
import numpy as np
Task 14: Try changing the alpha value and see what happens?
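The original sample code is not reproduced here, so the following is a stand-in sketch of a bubble-style scatter plot using random values. The variable names, the seed and the size formula are all assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the picture is repeatable
n_points = 50
x = rng.random(n_points)
y = rng.random(n_points)
colors = rng.random(n_points)              # mapped to colours through the default colormap
sizes = (30 * rng.random(n_points)) ** 2   # marker areas, in points squared

plt.figure()
# alpha controls transparency: 0 is fully transparent, 1 is fully opaque.
bubbles = plt.scatter(x, y, s=sizes, c=colors, alpha=0.5)
```

With a lower alpha, overlapping bubbles stay visible through each other. With alpha=1, the bubbles drawn last hide the ones beneath them. In a notebook, end the cell with plt.show().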
Saving plot
You can use plt.savefig("figurename.png") to save the figure. Try it!
plt.savefig("jaipur_scatter_plot.png")
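A point worth knowing when saving: outside a notebook, calling plt.show() first can leave you with an empty figure, so it is safer to save before showing. The sketch below uses made-up points and writes to the system's temporary folder (an assumption made here so that the example does not clutter your working directory):

```python
import os
import tempfile

import matplotlib.pyplot as plt

plt.figure()
plt.scatter([1, 2, 3], [30, 32, 31])       # made-up points
plt.title("Mean Temperature at Jaipur")

# Save first, show afterwards.
output_path = os.path.join(tempfile.gettempdir(), "jaipur_scatter_plot.png")
plt.savefig(output_path, dpi=150, bbox_inches="tight")  # tight: trim extra white margins
file_size = os.path.getsize(output_path)
```

The file extension decides the format, so a name ending in ".pdf" would produce a PDF instead of a PNG.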
Line Plots
Besides showing a relationship with a scatter plot, time data like the above can also be
represented with a line plot. Let's see how this is done!
y = dataframe.mean_temperature
plt.plot(y)
plt.ylabel("Mean Temperature")
plt.xlabel("Time")
# Month labels intended for the ticks (defined here but not applied to the plot)
Y_tick = ['May16','Jul16','Sept16','Nov16','Jan17','Mar17',
          'May17','Jul17','Sept17','Nov17','Jan18','Mar18']
plt.show()
Task 15: Change the labels and add a title so that the graph is clearer and easier to show
to others
plt.legend()
plt.show()
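Here is one way Task 15 might look, sketched on a made-up series. The tick positions and month labels are illustrative assumptions. The key points are that plt.xticks takes positions and labels of equal length, and that plt.legend() only shows lines that were given a label:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up series: 240 "days" of temperature.
days = np.arange(240)
mean_temperature = 28 + 8 * np.sin(days / 40)

tick_positions = np.arange(0, 240, 60)               # one tick every 60 rows
tick_labels = ["May16", "Jul16", "Sept16", "Nov16"]  # one label per tick position

plt.figure()
plt.plot(days, mean_temperature, label="Mean temperature")
plt.xticks(tick_positions, tick_labels)
plt.xlabel("Month")
plt.ylabel("Mean temperature")
plt.title("Mean temperature in Jaipur")
legend = plt.legend()   # picks up the label= given to plt.plot
```

If the number of labels does not match the number of positions, matplotlib raises an error, so count both lists before plotting. In a notebook, end the cell with plt.show().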
Task 16: Draw at least 3 line graphs in one plot!
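# Change the default figure size
plt.figure(figsize=(20,10))
x = dataframe.date
y_1 = dataframe.max_temperature
y_2 = dataframe.min_temperature
y_3 = dataframe.mean_temperature
z = y_1-y_2  # daily temperature range
# One possible completion: draw each series with a label so that plt.legend() has entries
plt.plot(x, y_1, label='Max temperature')
plt.plot(x, y_2, label='Min temperature')
plt.plot(x, y_3, label='Mean temperature')
plt.plot(x, z, label='Daily range (max - min)')
plt.xticks(np.arange(0, 731, 60))
plt.legend()
plt.show()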
Histograms
The histogram is useful for looking at the density of values. For example, you might want to
know how many days are hotter than 35C so that you can see what types of plants would
survive better in your climate zone. The histogram allows us to see the probability
distribution of our variables.
Let's look at how histograms are plotted.
y = dataframe.mean_temperature
plt.hist(y,bins=15)
plt.show()
plt.hist(y,bins=10)
plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')
plt.show()
What is the mode of this dataset? Which temperature range is represented the most, and
which the least?
Task 17: What do you think are bins? Try changing the number of bins to 20. What do you
notice?
y = dataframe.mean_temperature
plt.hist(y,bins=20)
plt.ylabel("No.of days")
plt.xlabel("Temperature")
plt.title('Probability distribution of temperature over 2 years (2016 - 2018) in Jaipur')
plt.show()
What does the histogram tell you about the temperature in Jaipur over the last two years?
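To see what bins are in numbers rather than bars, you can compute the counts directly. plt.hist relies on the same binning as np.histogram, so the sketch below (on a handful of made-up temperatures) shows what changes when you go from 10 bins to 20, and also counts the days hotter than 35:

```python
import numpy as np

# Made-up mean temperatures for twelve days.
temps = np.array([12, 18, 22, 25, 28, 30, 31, 33, 34, 36, 36, 38])

days_above_35 = int((temps > 35).sum())     # count of days hotter than 35

# Each call returns the number of values per bin and the bin edges.
counts_10, edges_10 = np.histogram(temps, bins=10)
counts_20, edges_20 = np.histogram(temps, bins=20)

busiest = int(np.argmax(counts_10))         # index of the fullest bin
busiest_range = (edges_10[busiest], edges_10[busiest + 1])

print(days_above_35)
print(counts_10, edges_10)
print(counts_20, edges_20)
print(busiest_range)
```

More bins means narrower temperature ranges per bar: the total number of days stays the same, but it is spread over more, smaller groups.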
Bar Charts
A bar chart looks like a histogram, but they are not the same! See the difference between
bar charts and histograms here
Now, head over to the matplotlib library and look at the example for bar charts. Here's the
link!
import matplotlib.pyplot as plt
import numpy as np
usage = [10,8,6,4,2,1]
plt.show()
Because the Jaipur weather data does not contain categories, we will not use it to
make a bar chart. However, do remember how to create your bar chart!
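As a reminder of the pattern, here is a minimal bar chart sketch. The usage values come from the snippet above, while the category names "A" to "F" are hypothetical placeholders, since the original categories are not shown:

```python
import matplotlib.pyplot as plt
import numpy as np

categories = ["A", "B", "C", "D", "E", "F"]   # hypothetical category names
usage = [10, 8, 6, 4, 2, 1]                   # values from the snippet above
positions = np.arange(len(categories))        # one bar position per category

plt.figure()
bars = plt.bar(positions, usage, align="center", alpha=0.5)
plt.xticks(positions, categories)             # label each bar with its category
plt.ylabel("Usage")
plt.title("Usage by category")
```

Unlike a histogram, each bar here stands for a separate category, so the order of the bars is up to you. In a notebook, end the cell with plt.show().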
Boxplots
A boxplot is used to show the distribution of our dataset.
We will explore boxplots using a sample tutorial obtained from the matplotlib website.
plt.boxplot(y)
plt.show()
What does the boxplot tell you about the temperature of Jaipur over the past two years?
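To help read a boxplot, it can be useful to compute the numbers it draws. The sketch below uses made-up temperatures: the box spans the first to third quartile, the line inside it is the median, and with matplotlib's default settings points further than 1.5 times the box height from the box are drawn as separate outlier markers.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up temperatures with one unusually low and one unusually high day.
temps = np.array([10, 20, 22, 24, 25, 26, 28, 30, 45])

q1, median, q3 = np.percentile(temps, [25, 50, 75])
iqr = q3 - q1                                        # height of the box

plt.figure()
box = plt.boxplot(temps)                             # returns the drawn lines in a dictionary
median_on_plot = box["medians"][0].get_ydata()[0]    # where the median line sits
outliers = box["fliers"][0].get_ydata()              # points drawn beyond the whiskers

print(q1, median, q3, iqr)
print(outliers)
```

In a notebook, end the cell with plt.show() to display the figure.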
Subplots
Often, you will want to plot more than one graph side by side. You can use subplots to do
that!
Here's how you can make them!
x = dataframe.date
y = dataframe.mean_temperature
plt.show()
plt.show()
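A minimal sketch of side-by-side plots, assuming made-up data and an arbitrary choice of one scatter plot next to one histogram. plt.subplots(1, 2) creates one row with two plotting areas, and each area has its own plotting methods:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up series standing in for the mean temperature.
days = np.arange(100)
temps = 28 + 8 * np.sin(days / 15)

# One row, two columns of plots inside a single figure.
fig, (left, right) = plt.subplots(1, 2, figsize=(12, 4))

left.scatter(days, temps)
left.set_title("Scatter plot")

right.hist(temps, bins=10)
right.set_title("Histogram")

fig.tight_layout()   # keep titles and labels from overlapping
```

Changing the arguments to plt.subplots(2, 1) would stack the two plots vertically instead. In a notebook, end the cell with plt.show().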
Great! You have now gained the ability to visualize data using matplotlib.