Date string manipulations with Python
As a data professional, you can expect
to work with date time objects and date strings. In this video,
we'll continue coding in Python and practice converting,
manipulating, and grouping data. By the end of this video, we'll create
a widely used data visualization, a bar graph that tells
a story with your data. Working with date strings will often
require breaking them down into smaller pieces. Breaking date strings into days, months,
and years allows you to group and order the other data in different ways so
that you can analyze it. Manipulating date and time strings
is a foundational skill in EDA. In this video,
you will learn to convert date strings in the NOAA lightning strike
dataset into datetime objects. We will discuss how to combine
these data objects into different groups by segments of
time such as quarters and weeks. Let's open a Python notebook,
and I'll show you what I mean. Let's begin by importing
Python libraries and packages. To start, import Matplotlib and
Pandas which you've used before. To review,
Pyplot is a very helpful package for creating visualizations like bar,
line, and pie charts. Pandas is one of the more popular packages
in data science because it's specific focus is a series of functions and
commands that help you work with datasets. The last package, Seaborn,
may be new to you. Seaborn is a visualization
library that is easier to use and produces nicer looking charts. For this video, we'll use the NOAA
lightning strike data for the years 2016, 2017, and 2018 to group lightning
strikes by different timeframes. This will help us understand total
lightning strikes by week and quarter. As I mentioned at the beginning of this
video, when manipulating date strings, the best thing to do is to break
down the date information like day, month, and year into parts. This allows us to group the data into
any time series grouping we want. Luckily, there's an easy way to do that
which is to create a datetime object. Now, as you'll recall
from a previous video, this NOAA dataset has three
columns giving us the date, number of lightning strikes, and
latitude and longitude of the strike. For us to manipulate the date column and
its data, we'll first need to convert
it into a datetime data type. We do that by simply
coding df ('date') and making that equal pd.to_datetime with df ('date') input in parentheses. Doing this
conversion
gives us the quickest and most direct path to manipulating
the date string in the date column which currently is in the format of
a four digit year followed by a dash, then the two digit month, a dash,
and lastly the two digit day. Okay, this is the exciting part. Because our dates are converted
into Panda's datetime objects, we can create any sort of
date grouping we want. Let's say for example, we want to group
the lightning strike data by both week and quarter. All we would need to do is
create some new columns. You'll see here we're creating four new
columns, week, month, quarter, and year. With the first line of code,
we're creating a column week by taking
the data in the date column and using the function strftime. This function from
the daytime package formats are datetime data into new strings. In this case, we want the year followed
by
a dash, and the number of weeks out of 52. If we want that string, we need to code it as %Y-W%V. The
percent sign is the command which
tells the datetime format to use the year data in the string. The W implies this is a week,
and the V stands for value, as in a sequential
number running from 1 to 52. The final string output for
the column data will be in this format, 2016-W27. The next line of code gives
us the new month column. The argument is then written as a %Y-%m. This will output the four
digit year followed by a dash, then the two digit month. Essentially, we're removing the last two
digit date from the original date string. Next, we will create a column for
quarters. In this case, a quarter is three months. Many corporations divide their
financial year into quarters. So knowing how to divide data into
quarter years is a very useful skill. In this case,
it only takes one line of code. We'll call the new column quarter,
and we'll use our date column with two underscore
period to create the quarter column. The datetime package has a pre-made code
for dividing datetime into quarters. In the two underscore period argument
field, we only need to place the letter q. After that, we can use the function
strftime to complete the string. For the argument, we put %Y-Q%q. The first Q is placed into the string
to
indicate we are talking about quarters. The percent sign followed by
the lower case q indicates the Pandas that we want the date
formatted into quarters. Our final column will be
the easiest to code of them all. The year column is created by taking
our original date column data and creating a string that includes
only the argument percent sign Y. This creates a column of data
with it with only the year in it. Now that we have formatted some strings,
let's quickly review our work by using the head function we learned
in the previous video. When we run this code,
our four new columns are there, week, month, quarter, and year. They are all formatted
just as we discussed. We can use these new strings
to learn more about the data. For example, let's say we want to group
the number of lightning strikes by weeks. An organization whose employees primarily
work outdoors might be interested in knowing the week to week likelihood
of dealing with lightning strikes. In order to do that,
we'll want to plot a chart. We've reviewed a couple of
charts coded in Python by now. Next, let's code a chart with
a lightning strike data. For plotting the number of lightning
strikes per week, let's use a bar chart. Our graph would be a bit confusing
using all three years of data. So, let's just use the 2018 data and limit our chart to 52 weeks
rather than 156 weeks. We can do this by creating a column
that groups the data by year and then orders it by week. We will then learn more about
the structuring function in another video. For now,
let's focus on plotting this bar chart. We'll use the plt.bar function to plot. Within our argument field, we
select
the x-axis which is our week column, then the y-axis or height,
which we input as a number of strikes. Next, we'll fill in some of
the details of our chart. Using plt.plot,
we will place arguments in the x-label, y-label, and title functions. The arguments are week number,
number of lightning strikes, and number of lightning strikes per week
(2018), respectively. This renders a graph, but
the x-axis labels are all joined together. So, we have a chart, but
the x-axis is difficult to read. So let's fix that. We can do that with
the plt.xticks function. For the rotation, we can put 45, and for
the fontsize, let's scale it down to 8. After we use plt.show,
the x-axis labels are much cleaner. Given our bar chart illustrating
lightning strikes per month in 2018, you could conclude that a group
planning outdoor activities for weeks 32 to 34 might want
a backup plan to move indoors. Of course, this is a broad
generalization to make on behalf of every North American
location in the dataset. But for our purposes and in general, it is a good understanding
of our dataset to have. For our last visualization,
let's plot lightning strikes by quarter. For our visualization,
it will be far easier to work with numbers in millions
such as 25.2 million rather than 25, 154, 365, for example. Let's create a column that divides
the total number of strikes by one million. We do this by typing df_by_quarter, and entering the
relevant column
in the arguments field. In this case, we want number of strikes. Next, we add on .div to
get our division function. Lastly for the argument field,
we enter 1000000. When we run this cell, we have a column that provides the number
of lightning strikes in millions. Next, we'll group the number
of strikes by quarter using the groupby and
reset_index functions. This code divides the number of strikes
into quarters for all three years. Each number is rounded
to the first decimal. The letter m represents one million. As you'll soon discover, this calculation
will help with the visualization. You'll learn more about these
functions in another video. We will plot our chart using
the same format as before. We use the plt.bar with
our x being from our df_by_quarter dataframe,
with quarter in the argument field. For the height, we put the number_of_strikes
column in the argument field. It would be helpful if each
quarter had the total lightning strike count at the top of each bar. To do that, we need to define our own
function, which we will call, addlabels. Let's type addlabels,
then input our two column axes, quarter, and number of strikes
separated by columns and brackets. At the end,
we use the format we created earlier, number_of_strikes formatted to label
the number_of_strikes_by_quarter. To finish the bar chart, we label
the x and y-axis and add the title. Before we show the data visualization, there are a few small things we
want to
add just to make it more friendly to read. Let's set our length and
height to 15 by 5. Next, let's make the bar
labels cleaner by defining those numbers and centering the text. Our bar chart now gives us the number
of
strikes by quarter from 2016 to 2018. To make the information easier to digest,
let's do one more visualization. Here is the code for
plotting a bar chart that groups the total number of strikes
year over year by quarter. Review the code carefully and
consider what each function an argument does in order to create this
final polished bar chart. Each year has assigned its own color to
highlight the differences in quarters. And now we have our chart. Coming up, you'll learn more about
the
different methods for structuring data. I'll see you there.