WorkingWithData - Ipynb - Colaboratory
WorkingWithData - Ipynb - Colaboratory
ipynb - Colaboratory
Objectives
Read and process .csv files
Use loops and conditionals to extract portions of the data
Plot the data using matplotlib
Data
In this worksheet, we will be exploring energy consumption data over three years for two buildings
on the University of Calgary campus. Data includes the following columns:
IMPORTANT: the data is contained in two files, bulding1.csv and building2.csv . Be sure that this
python notebook file and the two .csv files are in the same folder.
Questions
We would like to answer to following questions:
1. Are there weekly, season and yearly trends in the data? Do they make sense?
2. How did the campus shutdown in March 2020 affect energy use?
3. Compare the energy use of the two buildings. How are they different?
Task 1: Hypothesis
Before diving into the data, it is important to first hypothesize what you think the answers will be. In
the space below, write your hypothesis for the first two questions above.
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 1/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
1. There will probably be weekly, yearly and seasonal trends in the data considering every year is
sort of the the same pattern. For example, the seasons always follow the same order, so that
could show some patterns in the data.
2. The campus shutdwon in March most likely resulted in enerygy consumption decreasing due
to the lockdown that was initiated in this time period.
Using csv
Step 1: Load the necessary libraries: Python has a built-in csv library that is able to read and
interpret csv files. As a reminder, csv stands for comma separated values.
# load the csv library
import csv
# 'r' is read only
# 'w' is overwrite
# 'a' is append
with open('building1.csv', 'r') as file:
reader = csv.reader(file)
# let's read the first 5 lines
counter = 0
for row in reader:
print(row)
counter = counter = counter + 1
if counter > 5:
break
['Date', 'Heating (kWh)', 'Cooling (kWh)', 'Electricity (kWh)', 'Nat Gas (kWh)', 'Domest
['10/1/2018', '16244.46', '4920.66', '13240.1', '1319.45', '6.39']
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 2/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
Step 2: We want to import each column as a separate variable. Let's do this for the first 5 rows
again.
Note: csv reads all values as strings, not as floats (numbers). We need to use float() to convert
to a number, but as this may fail, it is useful to use a try: except: block.
time = [] # empty array called time
heating = [] # empty array called heating
with open('building1.csv', 'r') as file:
reader = csv.reader(file)
# let's read the first 5 lines
counter = 0
for row in reader:
try:
time.append(row[0])
heating.append(float(row[1]))
except:
print('Row value conversion error')
counter = counter = counter + 1
if counter > 5:
break
print(time)
print(heating)
This seems clunky, but there are other ways of importing that csv file.
Using numpy
Alternative 1: Use numpy . numpy is a numerical library with a lot of built in functionality. One
function is to generate numpy arrays from text:
import numpy
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 3/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
data = numpy.genfromtxt('building1.csv', delimiter=",", encoding=None)
data
nan, nan],
1.319450e+03, 6.390000e+00],
1.413890e+03, 4.996000e+01],
...,
5.587430e+03, 1.278000e+01],
5.607750e+03, 1.466000e+01],
5.584340e+03, 3.450000e+00]])
As we can see, all the numbers are imported, but column names and dates are not by default.
To get the data value in the 3rd row and 4th column, we use:
data[2,3]
13108.56
data[2,:]
The problem here is that non-numerical numbers are imported as nan , including the time column.
Using pandas
Alternative 2: Use pandas . pandas is a data importing and management library, among other
things. It makes importing csv files very easy:
import pandas
data = pandas.read_csv('building1.csv', sep=',', header=0)
data.head()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 4/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
To print the first 5 rows of data, we can use the head() command. tail() prints the last 5 rows of
data.
data.head()
2018-
0 16244.46 4920.66 13240.10 1319.45 6.39
10-01
2018-
1 17772.24 4882.56 13108.56 1413.89 49.96
10-02
2018-
2 16852.79 5163.91 13307.63 1288.89 101.98
10 03
# import the pyplot function from matplotlib and then give it an alias, 'plt'
import matplotlib.pyplot as plt
plt.plot(data['Date'], data['Heating (kWh)'])
plt.xlabel('Date')
plt.ylabel('Heating (kWh)')
plt.show()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 5/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
To plot multiple things at once, we can create subplots. Let's plot all the data and overlay the
different years on top of each other. We can do that by selecting the data from a specific year, and
then looping through the years.
# helper variables to specify the years of interest and the columns.
years = [2018, 2019, 2020, 2021]
cols = ['Heating (kWh)', 'Cooling (kWh)', 'Electricity (kWh)', 'Nat Gas (kWh)', 'Domestic Wat
# Create a new figure with the number of subplots equal to the number of cols
# figsize specifies the figure size in [width, height] format
fig, ax = plt.subplots(nrows=len(cols), ncols=1, figsize=[20,14])
# Loop through the years to plot
for year in years:
# Select the data from that year
dat = data[data['Date'].dt.year == year]
# Loop through the columns and plot the data in the appropriate subplot
for idx in range(len(cols)):
ax[idx].plot(dat['Date'].dt.dayofyear, dat[cols[idx]])
# Set the same xlabel for all subplots
plt.setp(ax, xlabel = 'Day of year')
# Loop through the subplots and set the appropriate ylabel
for idx in range(len(cols)):
plt.setp(ax[idx], ylabel = cols[idx])
# Add a legend to each subplot
ax[idx].legend(years)
# Show the plot
plt.show()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 6/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
Looking at the electricity data, there appears to be a weekly pattern. And for heating use, there
appears to be a seasonal pattern. Let's use the groupby() function to look at this.
# Create a new figure
fig = plt.figure(figsize=[14,4])
# Create a new subplot. '121' means 1 row, 2 columns, 1st subplot
axes1 = plt.subplot('121')
# Bin the data by day of the week
binned_data_by_day_of_week = data.groupby(data['Date'].dt.dayofweek)
# Plot the min, max and mean of the Electricity data
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 7/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
binned_data_by_day_of_week ['Electricity (kWh)'].max().plot(ax=axes1)
binned_data_by_day_of_week ['Electricity (kWh)'].mean().plot(ax=axes1)
binned_data_by_day_of_week ['Electricity (kWh)'].min().plot(ax=axes1)
# Create labels
plt.xlabel('Day of Week')
# dayofweek by default starts with Monday as index 0. xticks() allows
# for custom tick labels
plt.xticks(range(7),['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
plt.ylabel('Electricity (kWh)')
# Add a legend
plt.legend(['Max', 'Mean', 'Min'])
# Create a new subplot. '121' means 1 row, 2 columns, 2nd subplot
axes2 = plt.subplot('122')
# Bin the data by month
binned_data_by_month = data.groupby(data['Date'].dt.month)
# Plot the average monthly heating and cooling use
binned_data_by_month ['Heating (kWh)'].mean().plot(ax=axes2)
binned_data_by_month ['Cooling (kWh)'].mean().plot(ax=axes2)
# Add labels and a legend
plt.xlabel('Month')
plt.legend(['Mean Heating (kWh)', 'Mean Cooling (kWh)'])
plt.ylabel('Daily kWh')
# Show the plot
plt.show()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 8/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
-------------------------------------------------------------------------
--
<ipython-input-17-c8bdb06af0ab> in <module>()
----> 8 binned_data_by_day_of_week =
data.groupby(data['Date'].dt.dayofweek)
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/accessors.py
in __new__(cls, data)
479
Assignment
Modify the code above, or add additional code below to answer the questions posed above. Once
you have completed your analysis for Building 1, compare your analysis with Building 2.
Show your instructor or TA your progress before the end of the seminar, and submit a pdf of this
workbook to the D2L dropbox before the start of the seminar next week.
# Code Here
import pandas
data_1 = pandas.read_csv('building1.csv', sep=',', header=0)
print(data_1)
data_1['Date'] = pandas.to_datetime(data_1['Date'])
Date Heating (kWh) ... Nat Gas (kWh) Domestic Water (m3)
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 9/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
# helper variables to specify the years of interest and the columns.
years = [2018, 2019, 2020, 2021]
cols = ['Heating (kWh)', 'Cooling (kWh)', 'Electricity (kWh)', 'Nat Gas (kWh)', 'Domestic Wat
# Create a new figure with the number of subplots equal to the number of cols
# figsize specifies the figure size in [width, height] format
fig, ax = plt.subplots(nrows=len(cols), ncols=1, figsize=[20,14])
# Loop through the years to plot
for year in years:
# Select the data from that year
dat1 = data_1[data_1['Date'].dt.year == year]
# Loop through the columns and plot the data in the appropriate subplot
for idx in range(len(cols)):
ax[idx].plot(dat1['Date'].dt.dayofyear, dat1[cols[idx]])
# Set the same xlabel for all subplots
plt.setp(ax, xlabel = 'Day of year')
# Loop through the subplots and set the appropriate ylabel
for idx in range(len(cols)):
plt.setp(ax[idx], ylabel = cols[idx])
# Add a legend to each subplot
ax[idx].legend(years)
# Show the plot
plt.show()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 10/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
import pandas
data2 = pandas.read_csv('building2.csv', sep=',', header=0)
print(data2)
#data2['Data'] = pandas.to_datetime(data2['Date'])
# import the pyplot function from matplotlib and then give it an alias, 'plt'
import matplotlib.pyplot as plt
plt.plot(data2['Date'], data2['Heating (kWh)'])
plt.xlabel('Date')
plt.ylabel('Heating (kWh)')
plt.show()
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 11/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
Date Heating (kWh) ... Nat Gas (kWh) Domestic Water (m3)
1. Are there weekly, season and yearly trends in the data? Do they make sense?
2. How did the campus shutdown in March 2020 affect energy use?
3. Compare the energy use of the two buildings. How are they different?
Building 1:
1. Yes, there are a couple trends for building 1. For example, it seems that the consumpation
sates are higher during the weekdays than on the weekends. This makes sense becuase no
one has classes on the weekneds, so this is to be expected. If you look at the trends, the
heating is higher than the cooling in the winter, which also makes sense. The yearly trends are
less obvious, but we can sort of see that the yearly trends show more consumption during
Sept - April than in May - August. This makes sense because most kids have classes during
the fall and winter term.
2. The energy usage declined immensly due to the lockdown, which makes sense since
lockdown means no more kids at school which translates to no need for high heat or other
electric components.
Building 2:
1. The consumption rates are higher during the winter for building 2 which makes sense since
more heat is required to keep warm during the cold winters. Building 2 had the same weekly
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 12/13
1/29/22, 2:58 PM WorkingWithData.ipynb - Colaboratory
trend with the weekdays having higher levels of consumption and the weekneds having less
since most kids have their classes on the weekdays. The yearly trends are not super noticable
in this building, but there are some there.
2. Comparing the two buildings, there are not much differences when it comes to their trends
which makes sense because they are located in the same place so they both experience the
same weather and situations. But I did notice building 2 had higher consumption rate than
building 1. This could be because builiding 1 is smaller than building 2.
https://colab.research.google.com/drive/1jSwMg3NjVWB07JEIjT1pzpw8N7eeXCsa#scrollTo=lhm4cE1BtUTD&printMode=true 13/13