
180A Python Examples


Dylan Neff

January 12, 2021

Part I
Introduction
The goal of this document is not to teach Python fundamentals but, rather, to teach by examples
relevant to analysis in this experimental nuclear physics course. The examples will roughly follow
the flow of the analyses for the course, gradually adding tools and building upon those introduced
in previous examples. The code will be written in functions, heavily commented, and will attempt
to use only the core scientific Python packages (numpy, matplotlib, scipy, etc.). Screenshots of
output will be taken from the Spyder IDE, but the code presented is general and should run equally
well in any user environment. If any of this material is out of date or incorrect, please email the
issue to Dylan Neff. As a disclaimer, the author is by no means a Python expert, so the suggestions
and coding style should be considered with this in mind.


Physics Department, UCLA, Los Angeles, CA 90024, USA

Contents

I   Introduction

II  Python Basics
    1 Hello World
      1.1 Example Code
      1.2 Use of Functions
      1.3 Python Control Flow
    2 Reading Files
      2.1 Example Code
      2.2 File Paths
      2.3 Discussion of Code
    3 Plotting
      3.1 Example Code
      3.2 Discussion of Code
    4 Fitting
      4.1 Finding Optimal Parameters
      4.2 Example Code
      4.3 Discussion of Code

III Nuclear Physics Specifics
    1 Histograms
      1.1 Example Code
      1.2 Statistics
      1.3 Simple Plotting
      1.4 Fitting and Plotting with Errors
    2 Earthquake Example
      2.1 Analysis Overview
      2.2 Analysis Code
      2.3 Python Dictionaries
      2.4 Python lambda
      2.5 Filtering Data
      2.6 Discrete Values
    3 HEP Histogram Formatting

IV  Full Scripts
    1 Hello World
    2 Read File
    3 Plotting
    4 Fitting
    5 Histograms
    6 Earthquakes
Part II
Python Basics
One major strength of Python is its popularity. This, along with its high-level syntax, makes
learning Python substantially easier than many other languages. There is a plethora of material
available online for learning Python and, more importantly, nearly any question about the language
has likely already been asked and answered somewhere online. For this reason, this document will
provide only cursory introductory examples and will instead focus on the specific tools needed for
nuclear physics. It is left to the reader to fill in any gaps via Google and the documentation pages
for Python packages. Some good general Python tutorials can be found at Codecademy and learnpython.

1 Hello World
As is tradition in programming introductions, the first example will print 'hello world' to the
terminal:

1.1 Example Code


def main():
    print('hello world')  # Print 'hello world' string to the terminal


if __name__ == '__main__':
    main()  # If this file is run as a script directly, this statement will execute and main will be run

Upon running this script, the terminal should display 'hello world' as shown in Figure 1.

Figure 1: Terminal output from ‘hello world’ print statement.

1.2 Use of Functions


Instead of a one-line solution, the print statement is made within a function. Functions, defined
with the def keyword and trailing parentheses containing optional input parameters, are vital to good
code. They allow the analyzer to partition their analysis into small, manageable pieces. Instead of
writing a daunting wall of code in a linear fashion from start to end, the analysis is split up into
logically independent pieces which are then strung together. This allows the writer to separate
their thoughts and tackle one problem at a time. With well thought out function names, this also
allows someone reading the code to much more quickly understand the big picture and dive into
the functions of particular interest. These ideas are related to a design style sometimes called
top-down programming.

While the use of functions is by no means mandatory, it is heavily encouraged in order to develop
good coding practices. Object-oriented programming is a somewhat related paradigm which will be
avoided in this document but can be applied in this course.

1.3 Python Control Flow


Python is an interpreted language, which means that when the code is run, each line is read and
executed one at a time. This has the advantage of easy debugging, since you know exactly where in
your code a failure occurred, at the expense of slower speed compared to compiled languages.
This flow is important when combined with the use of functions: if a function is defined on
line 20 but called at the top level on line 10, Python will raise an error on line 10 because the
function has not yet been defined.
One solution to this problem is given in the hello world code above. The top-level logic can be
written into a main function, usually at the top of the script. At the very bottom of the script,
the main function is executed if the special __name__ variable equals '__main__' (which is separate
from the arbitrarily but customarily named main function). Without going into details, this will be
true whenever the script is executed directly (while it will be false if this file is imported into
another Python script). While this is not the only way to write code, it is the style that will be
used repeatedly in this document.
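To make this execution order concrete, here is a minimal sketch (the helper function and its message are hypothetical, just for illustration):

```python
# Calling a function at the top level before it is defined fails, because
# Python executes the file top to bottom:
#     helper()  # NameError: name 'helper' is not defined

def main():
    # Referencing helper inside main is fine even though helper is defined
    # further down: main's body only runs when main() is called, which
    # happens after the whole file has been read.
    print(helper())

def helper():
    return 'hello from helper'

if __name__ == '__main__':
    main()  # Prints 'hello from helper'
```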

2 Reading Files
Reading data from files stored on the local machine will be crucial for this course. In this example,
a set of US coronavirus data is stored in a comma-separated values (csv) file with three columns:
the date, the number of cases on that date, and the number of deaths on that date. This file can be
opened and viewed in any text editor or spreadsheet software. A screenshot of the first lines is
shown in Figure 2.

Figure 2: First lines of US coronavirus case data.

This example shows how to open, read, and print the data in the file to the terminal. This
is split into two functions, read_cases and print_cases which are called in turn from the main
function. The defined path is passed to read_cases which returns the date and cases data. These
are then passed to print_cases which prints the data to the terminal.

2.1 Example Code
def main():
    """
    Read US coronavirus case data from file and print to terminal.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    print_cases(dates, cases)
    print('donzo')


def read_cases(path):
    """
    Read coronavirus case data from file at the input path and return lists of dates and case numbers
    :param path: Path to coronavirus case data
    :return: List of dates and list of cases
    """
    dates = []  # List of dates, left as strings
    cases = []  # List of cases, converted to integers

    with open(path, 'r') as file:  # Open the contents at given path and name this opened object file
        lines = file.readlines()  # Read the opened file into a list with each entry a string of the line
        for line in lines[1:]:  # Iterate over each line, skipping the first line since it's a header
            line_data = line.split(',')  # Split each line string into a list of items separated by commas
            dates.append(line_data[0])  # Append zeroth element of line_data to list of dates
            cases.append(int(line_data[1]))  # Append first element of line_data to list of cases as an integer

    return dates, cases  # Return the two lists


def print_cases(dates, cases):
    """
    Print the coronavirus date and case data passed as lists
    :param dates: List of dates
    :param cases: List of cases
    :return:
    """
    print(dates)  # Print the list of dates
    print(cases)  # Print the list of cases

    for date, case in zip(dates, cases):  # Iterate through dates and cases lists together
        print(f'date: {date}, cases: {case}')  # Print each date/case pair on a new terminal line

2.2 File Paths
A key aspect of reading and writing files on your computer is knowing where those files are located.
Computer data is stored in a filesystem structure, in which folders or directories are used to organize
and retrieve data. A file path is the location of a file within this file system and uniquely specifies
the file from all others on the computer. As demonstrated in the example code, the path must
be specified in order for Python to open the correct file. The path of a file can be found by
right-clicking on it and selecting 'Properties' in Windows/Linux or 'Get Info' on a Mac.
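As a supplementary sketch (not part of the course examples), the standard-library pathlib module can build paths in a portable way; the folder and file names below are just for illustration:

```python
from pathlib import Path

# Join directory and file names with '/'; pathlib handles the separator
# differences between Windows and Mac/Linux automatically
data_file = Path('Example_Code') / 'Ex2_Read_File' / 'us_12_14_20.csv'
print(data_file.name)    # The file name itself
print(data_file.suffix)  # The file extension
```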

2.3 Discussion of Code


The read_cases function first opens the file at path in read (‘r’) mode. It then reads the contents
into a list of lines, each of which is a string. The lines are then iterated over and split by the comma
into a list of three strings. The first of these strings corresponds to the date value on the line and is
appended to the list of dates. The second of the strings corresponds to the number of cases which
is converted from a string to an integer and appended to the list of cases. Once all lines are read,
these two lists are returned.
The print_cases function takes as arguments the dates and cases lists obtained from the
read_cases function and immediately prints these two lists out directly. Next the two lists are
zipped and iterated over together, printing matching pairs of date/case on each line. The output
of this script is shown in Figure 3 with the end of the dates list at the top, the cases list in the
middle, and the beginning of the zipped dates/cases output at the bottom.
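The core string operations used above can be sketched in isolation, using a made-up line in the same format as the data file:

```python
line = '2020-01-21,1,0\n'     # Hypothetical csv line: date, cases, deaths
line_data = line.split(',')   # Split on commas into a list of strings
date = line_data[0]           # '2020-01-21'
cases = int(line_data[1])     # Convert the string '1' to the integer 1

# zip pairs up entries from two lists so they can be iterated together
for d, c in zip(['2020-01-21', '2020-01-22'], [1, 1]):
    print(f'date: {d}, cases: {c}')
```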

Figure 3: Portion of terminal output from reading files example.

3 Plotting
This example builds on the previous, using the read_cases function verbatim. The goal of this
section is to demonstrate Python plotting with the matplotlib module. Once the dates and cases
are read, they are passed to three independent functions: plot_cases_simple, plot_cases and
plot_case_rate. The plot_cases_simple function simply plots the data with the cumulative
number of cases on the y-axis and the case index on the x-axis. The plot_cases function does
the same but instead uses the date values for the x-axis and adds a title. The plot_case_rate
function plots the number of daily new cases vs the date along with uncertainty bands to showcase
the feature.

3.1 Example Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from datetime import datetime as dt


def main():
    """
    Read US coronavirus case data from file and plot.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    plot_cases_simple(cases)
    plot_cases(dates, cases)
    plot_case_rate(dates, cases)
    print('donzo')


def plot_cases_simple(cases):
    """
    Plot cumulative US coronavirus cases vs the case index using matplotlib.
    :param cases: List of coronavirus case numbers for each day
    :return:
    """
    case_indices = range(len(cases))  # Produce the integers {0, 1, ..., number of cases - 1}
    plt.plot(case_indices, cases)  # Plot case indices on x-axis and number of cases on y-axis
    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def plot_cases(dates, cases):
    """
    Plot cumulative US coronavirus cases vs the date using matplotlib.
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Total Covid Cases in US')  # Set title of plot

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to date objects

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    plt.plot(dates, cases)  # Plot dates on x-axis and cases on y-axis. This can be called alone if no formatting needed
    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def plot_case_rate(dates, cases):
    """
    Plot number of daily new coronavirus cases in US vs the date. Also represent estimated errors with filled bands.
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    rates = np.asarray([cases[i + 1] - cases[i] for i in range(len(cases) - 1)])  # Calculate rate, daily change in cases
    rates_err = np.sqrt(rates)  # Estimate uncertainties on rates based on counting statistics, not robust

    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Daily New Covid Cases in US')  # Set plot title

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to date objects

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    plt.plot(dates[1:], rates, color='green')  # Plot dates (skipping first) on x-axis, rates on y, set color to green
    plt.fill_between(dates[1:], rates - rates_err, rates + rates_err, facecolor='green', alpha=0.3,
                     interpolate=True)  # Plot uncertainty bands about data, alpha is transparency

    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called

3.2 Discussion of Code
The plot_cases_simple function simply takes the case data and plots them on the y-axis with
the case list index on the x-axis. To generate this list of case indices, the number of cases in the list
is first calculated with the built-in len function. The (also built-in) range function then generates
a list of integers starting from zero up to the number of cases minus 1. The case_indices and
cases are then passed to matplotlib for plotting.
The two most important and ubiquitous lines used in matplotlib plotting are plt.plot and
plt.show. The plt.plot function is versatile but here takes two arguments: a list of x-axis values
followed by a list of y-axis values. Both lists must have the same number of elements. This
produces, by default, a plot in which each pair of x and y points is connected by a line. The
plot function can take many additional options, and perusing the documentation page is highly
encouraged. The final line calls the plt.show function, which is required to actually display the
figures that have been created and will, depending upon the Python environment or IDE, create a
pop-up matplotlib window with the plot and possibly hold execution of the code until this window
is closed. This pop-up figure window, shown in Figure 4, is interactive and displays the line plot of
cases vs case index just created. The window can be resized and offers some simple options for
formatting the plot.

Spyder IDE Plotting

If using the Spyder IDE, matplotlib figures will likely be produced in the 'Plots' pane or inline in
the terminal by default. To instead have an interactive matplotlib window pop up (suggested), from
the top tool bar click 'Tools' → 'Preferences'. From the category list select 'IPython console' and
then select the 'Graphics' tab. Here the default behavior of various plotting aspects can be
changed. In order to produce plots in their own interactive windows, select 'Automatic' from the
'Backend' drop-down menu under the 'Graphics backend' section. Once selected, click 'Apply' and
then 'OK' to close. You will likely need to manually restart Spyder for these changes to properly
take effect.

Figure 4: Simple plot of cumulative US corona cases vs case index in an interactive matplotlib
window.

The plot_cases function again uses plt.plot and plt.show as in plot_cases_simple but now
plots the dates on the x-axis and adds a title. The plt.subplots function is first called to create
the figure (fig) and axis (ax) objects. These objects are created automatically when plt.plot is
called if none have previously been created. Calling subplots explicitly is useful when the figure
or axis objects are needed for direct manipulation or formatting. The plot title is set by calling
the axis object's set_title function.

Datetime Package

The built-in Python datetime package will not be necessary for this course but is extremely
useful when working with dates and will be explained in the next paragraph for completeness.

To plot dates properly, a bit of work must be done. Each date string is first converted to a Python
date object using the datetime class from the standard-library datetime module (imported and
succinctly renamed 'dt'). This is accomplished by a list comprehension which iterates over the
original strings, converts each to a date object, then stores the results in a list (this list is
stored as 'dates', overwriting the original list of strings, which are no longer needed in this
function). A matplotlib formatter is created with the DateFormatter class, telling the plot to
display only the month and day separated by a '/'. This formatter object is passed to the axis
object's xaxis with set_major_formatter, and the date labels are rotated by 30° with the
set_tick_params function. Once this formatting is done, plt.plot and plt.show are called as
before and the figure is displayed as in Figure 5.
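The conversion step can be sketched on its own, with a couple of hypothetical date strings in the file's format:

```python
from datetime import datetime as dt

date_strings = ['2020-01-21', '2020-01-22']  # Strings in the file's '%Y-%m-%d' format
# List comprehension: parse each string into a datetime, then keep just the date part
dates = [dt.strptime(s, '%Y-%m-%d').date() for s in date_strings]
print(dates[0].year, dates[0].month, dates[0].day)  # Components as integers
```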

Figure 5: Plot of cumulative US corona cases vs date with title.

The plot_case_rate function re-uses many of the tools presented in plot_cases and expands on the
capabilities of matplotlib plotting. Instead of analyzing the cumulative number of covid cases, a
somewhat more interesting data set is the number of new cases reported per day. This is obtained
from the cumulative cases by subtracting the total cases on day i from the total cases on day i+1.
These N−1 rate values are calculated from the N case values by iterating through all but the last
case value, subtracting the ith case value from the (i+1)th, and storing the differences collected
via list comprehension as a numpy array. In addition, the uncertainty on these rates can be very
roughly estimated under the assumption of counting statistics as the square root of the rate values.
The benefit of numpy arrays is that they are easy to manipulate algebraically, as demonstrated by
np.sqrt(rates), which takes the square root of each value of the array individually and stores the
new array as rates_err.
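The element-wise behavior can be sketched with a tiny made-up array (numpy also provides np.diff, which computes the same day-to-day differences in one call):

```python
import numpy as np

cases = np.asarray([10, 13, 20, 32])  # Hypothetical cumulative counts
rates = np.diff(cases)       # Day-to-day differences: [3, 7, 12]
rates_err = np.sqrt(rates)   # Square root applied to each element individually
band_top = rates + rates_err # Addition is also element-wise
```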

Top Down Design

It is prudent to note here that the conversion from cases to rates is a step logically indepen-
dent from the plotting procedure. The conversion can be seen as a manipulation or analysis
of the original case data and would likely be better partitioned into its own function. This
would especially hold true if the manipulation took any more than a few lines of code to
accomplish.

After obtaining a numpy array of the rates and one for the associated error estimates, the figure
and axis are created and formatted in a manner similar to the plot_cases function. Next, a line
plot is again created, though this time the first date is skipped since the change in case numbers
is being plotted, and the color of the line is set to green. To showcase a feature of matplotlib,
the fill_between function is used with the rate error estimates to fill a transparent (dictated by
the 'alpha' parameter) band about the rate data. The resulting plot is shown in Figure 6 and a
zoomed version in Figure 7 to show the error bands.

Figure 6: Plot of daily change of US coronavirus cases vs date.

Figure 7: Zoomed plot of daily change of cases to show filled uncertainty bands.

4 Fitting
Finding the best fit of a given function to data is an extremely important tool in analysis. Given
a set of data, a function with an arbitrary set of parameters, and a metric with which to compare
the function output to the data, a "best fit" of the function to the data is obtained by finding
the parameter values that optimize this goodness-of-fit metric. The goodness-of-fit metric is more
generally called a loss function; in this document, a simple quadratic loss function will be
utilized via least squares regression.

4.1 Finding Optimal Parameters


As an example, consider the case of a simple linear regression on one-dimensional data. Given a
data set (x_i, y_i) where i ∈ {1, ..., N}, a linear best fit attempts to find the slope and intercept
parameters such that a loss function is minimized. Commonly used is the quadratic loss function, as
in least squares or chi-square regression. This quadratic loss is calculated for each data point by
taking the difference between it and the corresponding estimate from the linear function and squaring
this difference (possibly dividing by some weight). This is repeated for each data point and the
squared differences are summed over the entire data set. The goal is then to minimize this sum of
squares by choosing the optimal slope and intercept parameters for the line.
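The quadratic loss just described can be written out directly; this is a minimal sketch with made-up data (the function and variable names are just for illustration):

```python
def quadratic_loss(slope, intercept, x_vals, y_vals):
    # Sum of squared differences between each data point and the line's prediction
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(x_vals, y_vals))

x_vals = [0, 1, 2, 3]
y_vals = [1, 3, 5, 7]  # Exactly y = 2x + 1

print(quadratic_loss(2, 1, x_vals, y_vals))  # Perfect fit: loss is 0
print(quadratic_loss(2, 0, x_vals, y_vals))  # Wrong intercept: each point misses by 1, loss is 4
```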
In the case of simple linear regression these optimal parameters can be solved for algebraically
(true even for weighted linear regression). For more arbitrary fitting functions and/or higher
dimensional data, however, a closed form solution for the optimal parameters may no longer exist.
In these scenarios, a numerical solution is implemented. The goal remains to minimize the loss
function by finding optimal parameter values. Given a set of initial parameters, the loss function
is calculated on the data set. From there, the parameter values are varied and the loss is
re-calculated. Loosely speaking, if the loss decreases (suggesting that the data and fit function
are in closer agreement than before), the parameters are being pushed in the right direction, and
the minimizer can take another step in the same direction for the next iteration.
This process of varying the function parameters and calculating the new loss is repeated until the
loss reaches a minimum. Uncertainties on these optimal parameters are estimated by varying about
this solution, checking the loss at slightly sub-optimal values, and constructing a covariance matrix
from the loss values in this region of parameter space. The exact details of this process (in which
direction and how much to vary each parameter in each iteration, when to declare victory/give
up, how to avoid local minima, etc) define the minimization algorithm, of which there are many
available. In scipy’s curve fit function, the ’method’ argument allows the user to choose between a
selection of different minimization algorithms.
While the simple linear regression has a unique solution that can be solved for analytically,
minimizing the loss for more complicated functions or higher dimensional data can be challenging.
Manually choosing initial parameters close to the expected solution is a powerful way to avoid
undesirable results or outright failures of the minimizer. Setting bounds on the parameters can
also be very useful but should be done sparingly as uncertainty estimation near a boundary is
difficult and can result in poor estimations. Finally, all data in this course should have associated
estimated errors. As such, all fitting should be weighted by these errors, as demonstrated in
Part III Section 1.
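As a sketch of such a weighted fit, curve_fit accepts the point-by-point errors through its sigma argument; the data and error values below are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, m, b):
    return m * x + b

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # Made-up measurements near y = 2x + 1
y_err = np.full(len(y), 0.2)             # Assumed uncertainty on each point

# sigma weights each point by its error; absolute_sigma=True treats the errors
# as absolute (not relative) uncertainties when estimating the parameter errors
popt, pcov = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))  # Estimated errors on slope and intercept
print(popt, perr)
```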

4.2 Example Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from datetime import datetime as dt
from scipy.optimize import curve_fit


def main():
    """
    Read US coronavirus case data from file and fit with exponential.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    fit_cases(dates, cases)
    print('donzo')


def fit_cases(dates, cases):
    """
    Fit the US cumulative coronavirus cases vs date data to an exponential function and plot the result
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Total Covid Cases in US with Exponential Fit')  # Set title of plot
    ax.set_xlabel('Days since Jan 21 2020')  # Set x-axis title

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to date objects

    # Convert to days since first day for fitting; int64 guards against overflowing 32 bit integers
    days = np.asarray([(date - dates[0]).days for date in dates], dtype=np.int64)
    cases = np.asarray(cases, dtype=np.int64)  # Convert cases from Python list to numpy array

    pinit = [1, 0.04]  # Set initial exponential parameters for fitting

    popt, pcov = curve_fit(exp, days, cases, p0=pinit)  # Fit data to exponential
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {np.sqrt(np.diag(pcov))}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix

    plt.plot(dates, cases, color='blue', label='data')  # Plot dates on x-axis and cases on y-axis
    plt.plot(dates, exp(days, *popt), 'r--', label='fit')  # Plot best fit curve in red dashes

    ax.legend()  # Display the legend on plot
    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def exp(x, a, lam):
    """
    Exponential function
    :param x: Data x values
    :param a: Amplitude
    :param lam: Decay/growth parameter
    :return: Value of exponential
    """
    return a * np.exp(lam * x)

4.3 Discussion of Code
The main function of this script again reuses the read_cases function to pull the dates and cases
lists from the text file at the given path. The data are then passed to the fit_cases function
where they are fit to a simple exponential function.
The fit_cases function repeats much of the matplotlib setup and formatting described in
Section 3. The fitting is accomplished with the curve_fit function from the scipy package. This
function takes the fitting function as its first parameter followed by the x data (independent variable)
and then the y data (dependent variable). The fitting function must take the x data as its first
parameter followed by an arbitrary number of parameters and return a value to compare to the data.
In the example code, an exp function is defined following this format. It takes parameters a which
is the amplitude of the exponential and lam which determines the curvature of the exponential.
It then calculates and returns the value of the exponential given these two parameters and the x
value. The x and y data passed to curve_fit are the same lists passed to the plot function. The
date values are stored as date objects, which can be used by matplotlib but are not directly
compatible with scipy's curve_fit. To work with curve_fit, the dates are converted to the integer
number of days since the first date. In addition, these are stored explicitly as 64 bit integers
using the dtype argument rather than the platform-dependent default, which can be 32 bit integers
with a maximum of 2,147,483,647; using int64 guards against overflow for large values. The cases
are also converted to a numpy array of 64 bit integers.
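The integer limits can be checked directly with numpy's iinfo, and the dtype can be specified explicitly when creating an array (the case counts below are made up):

```python
import numpy as np

print(np.iinfo(np.int32).max)  # Largest value a 32 bit integer can hold: 2147483647
print(np.iinfo(np.int64).max)  # Largest value a 64 bit integer can hold

# Specifying the dtype explicitly when creating an array from a Python list:
cases = np.asarray([16000000, 16200000], dtype=np.int64)  # Hypothetical cumulative counts
```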
The curve_fit function varies the input parameters and compares the goodness of fit for each
variation, searching for an optimal set of parameter values. As such, it is often important, especially
with many parameters (many dimensional space to explore), to set specific initial parameters from
which curve_fit will start its search. In addition, bounds may be set on these parameters though
this should only be done if necessary as it may affect error estimation. In this example, the initial
parameters are very roughly set with an amplitude of 1 (expected number of cases at day 0) and
the exponential constant of 0.04 (positive curvature but slower growth). In general it is often useful
to plot the data and the fit function with guesses for best parameters before curve_fitting in
order to find good initial parameter values that give decent agreement between the data and the
function.
The return values from curve_fit are the optimal parameters found and the covariance matrix.
The estimated errors on the optimal parameters are the square root of the diagonal elements of
the covariance matrix. The example code prints the optimal values from the fit along with the
estimated parameter errors and the full covariance matrix, shown in Figure 8. Without going into
detail, the covariance matrix is symmetric and the off diagonal elements give the covariance between
the parameters.

Figure 8: Optimal parameters as determined by the fit on the first line, followed by the estimated
errors on these parameters. Full covariance matrix printed at the end.

The cases are then plotted as usual with an additional label argument which sets the plot’s
name in the figure legend. Next, the best fit curve is plotted on top of the raw cases data by calling
the exp function with days as the x-data and the optimal parameters which are unpacked and
passed with *popt. Finally ax.legend() is called to construct the figure legend and plt.show()
is called to show the figure as in Figure 9.

Figure 9: Plot of cumulative US coronavirus cases vs date with best fit exponential function
superimposed and a legend drawn. Notice the x-axis title is cut off if the plot is saved directly.
Tweaking the plot in the interactive matplotlib window can quickly remedy this issue.

Part III
Nuclear Physics Specifics
While the previous part outlined a rather general type of analysis and showcased some of the
capabilities of Python, nuclear physics relies on some tools more heavily than the general Python
user community. Many nuclear physics experiments and analyses, at their core, tend to involve
counting the number of particles meeting some criteria. Counting data such as this are most often
visualized and analyzed as histograms which will be the focus of the first section. The next section
will give a full example analysis similar to what would be expected for an experiment in this course.
The general Python aspects of these examples will be discussed less heavily, with the focus instead
on the specifics important to nuclear physics.

1 Histograms
Histograms are used more heavily in nuclear physics than in most other fields. For this reason, it is
prudent to show how to use and modify Python’s more popular tools in order to accurately set up,
plot, and fit histograms.
The following example takes height and weight data found online and histograms the height
data. Python’s matplotlib package has a built-in histogram method which works well for most use
cases. The major feature missing from plt.hist is the ability to draw error bars on the histogram;
as of the time of writing, it offers no direct way to add them. Instead, matplotlib bar charts
will be used: the binned counts will be produced by numpy’s histogram function
and the plotting will be done with matplotlib’s bar method.
The heights and weights are first read from the file with the read_heights_weights function
and returned as numpy arrays. The heights are then passed to the get_statistics function to
calculate the mean and standard deviation of the data set. The height data is then plotted naively
with the auto_plot_hist function which uses automatic binning and the bin_plot_hist which
takes an optional binning argument. The heights are then passed to the fit_gaus function which
more carefully plots the data and fits a Gaussian distribution.

1.1 Example Code


def main():
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex5_Plot_Hist/height_weight.txt'
    heights, weights = read_heights_weights(path)
    get_statistics(heights)
    auto_plot_hist(heights)
    bin_plot_hist(heights, 15)
    fit_gaus(heights, 15)
    print('donzo')


def read_heights_weights(path):
    """
    Read file at path and return height and weight data as numpy arrays.
    :param path: Path to data file.
    :return heights: height data as a numpy float array
    :return weights: weight data as a numpy float array
    """
    heights = []  # Make a list to store height data
    weights = []  # Make a list to store weight data
    with open(path, 'r') as file:  # Open path in read ('r') mode, reference it as file inside with scope
        lines = file.readlines()  # Read each line from file, lines is then a list of strings, one for each line in file
        for line in lines[1:]:  # Iterate over each line in lines, [1:] -> Skip first line to get rid of file headers
            line = line.strip()  # Remove any white space or newline at beginning or end of line (optional)
            line = line.split()  # Split line on argument. With no argument, default is space or tab
            heights.append(float(line[1]))  # Convert element in second column (line[1]) to float, add to heights list
            weights.append(float(line[2]))  # Same as above but with element in third column of the line

    return np.asarray(heights), np.asarray(weights)  # Convert lists to numpy arrays to allow for easier math/plotting

def get_statistics(data):
    """
    Get mean and standard deviation of data and print to terminal.
    :param data: Sequence of data to calculate mean/sd of.
    :return:
    """
    mean = np.mean(data)  # Calculate mean of data
    sd = np.std(data)  # Calculate standard deviation of data
    print(f'Data mean: {mean}, standard deviation {sd}')  # Print mean and standard deviation with f-string


def auto_plot_hist(data):
    """
    Plot histogram of data with matplotlib hist method with the default 10 bins
    :param data: Sequence of data to be histogrammed and plotted.
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.hist(data, align='mid')  # Plot histogrammed data with bins centered in the middle of the bin range
    plt.show()


def bin_plot_hist(data, bins=10):
    """
    Plot histogram of data with matplotlib hist method with input optional bins parameter passed for binning
    :param data: Sequence of data to be histogrammed and plotted
    :param bins: Optional parameter with default 10. Passed to matplotlib hist as histogram binning
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.hist(data, bins, align='mid')  # Plot histogrammed data with input binning with bins centered middle of range
    plt.show()

def fit_gaus(data, bins=10):
    """
    Histogram data and plot manually as bar chart with uncertainties. Fit gaussian to histogram and superimpose fit.
    :param data: Data to be histogrammed and fit
    :param bins: Optional parameter for binning
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation

    counts, bin_edges = np.histogram(data, bins)  # Histogram via numpy with binning passed, return counts and bin edges
    bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])  # Calculate N bin centers from N+1 bin edges
    bin_widths = bin_edges[1:] - bin_edges[:-1]  # Calculate N bin widths from N+1 bin edges
    counting_err = [np.sqrt(count) if count > 0 else 1 for count in counts]  # Calculate counting uncertainties
    ax.bar(bin_centers, counts, bin_widths, yerr=counting_err, zorder=1, label='height data')  # Plot histogram via bar

    pinit = [20, 68, 3]  # Set initial conditions for gaussian fit
    popt, pcov = curve_fit(gaus, bin_centers, counts, p0=pinit, sigma=counting_err, absolute_sigma=True)  # Fit to gaus
    perr = np.sqrt(np.diag(pcov))  # Calculate uncertainties on optimal parameters via sqrt of diagonals of cov matrix
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {perr}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix
    gaus_range = np.linspace(bin_edges[0], bin_edges[-1], 300)  # Set range of points to plot for fit gaussian
    ax.plot(gaus_range, gaus(gaus_range, *popt), 'r--', zorder=3, label='gaus fit')  # Plot best fit gaus over hist
    ax.fill_between(gaus_range, gaus(gaus_range, *popt), 0, color='red', alpha=0.2, zorder=2)  # Fill area under gaus
    ax.legend()  # Draw legend on plot.
    ax.annotate(f'a: {popt[0]:.1f}±{perr[0]:.1f}\nx0: {popt[1]:.1f}±{perr[1]:.1f}\nsigma: {popt[2]:.1f}±{perr[2]:.1f}',
                xy=(0.05, 0.95), xycoords='axes fraction', bbox=dict(boxstyle='round', facecolor='tan', alpha=0.3),
                verticalalignment='top')  # Display box on plot with optimal fit parameters
    plt.show()


def gaus(x, a, x0, sigma):
    """
    Calculate and return gaussian value at point x
    :param x: Point at which to compute gaussian
    :param a: Amplitude of gaussian
    :param x0: Center of gaussian (equal to mean)
    :param sigma: Width of gaussian (equal to standard deviation)
    :return: Value of gaussian at point x
    """
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

1.2 Statistics
In the get_statistics function the mean and the standard deviation are calculated for the raw
(unhistogrammed) heights data. This is accomplished using numpy’s mean and std functions which
calculate these statistics for a sequence of data. The resulting statistics are then printed to the
terminal via an f-string.
It is important to note that when a data set is histogrammed, all information within each bin
is lost. For instance, if you have a bin from 1 to 4, you will be unable to distinguish a 2 from a 3
after histogramming and information is thus lost. For this reason, it is important to think carefully
before constructing a histogram and to pay attention to binning once one is constructed.
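This loss of information can be demonstrated directly with numpy (a minimal sketch; the data
values here are invented for illustration):

```python
import numpy as np

# Two different data sets whose values all fall in the single bin [1, 4]
data_a = [1.2, 2.0, 3.5]
data_b = [1.8, 2.9, 3.9]
bins = [1, 4]  # one bin spanning 1 to 4

counts_a, _ = np.histogram(data_a, bins)
counts_b, _ = np.histogram(data_b, bins)
print(counts_a, counts_b)  # both [3] -- after histogramming the data sets are indistinguishable
```

Despite containing different values, both data sets produce identical histograms, so the
distinction between them cannot be recovered from the binned counts alone.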

1.3 Simple Plotting


The histogram can be plotted without error bars using matplotlib’s hist method, as shown in
Figures 10 and 11. The auto_plot_hist function sends the data to the hist method directly
which, by default, produces a histogram with 10 bins (Figure 10). The bin_plot_hist function
does the same but accepts an optional argument for the number of bins. As shown in the function
definition, this optional parameter defaults to 10 if only one argument is passed, and this
bins variable is then sent to matplotlib to determine the number of bins. In the example shown, a
15 bin histogram is created (Figure 11). It can be very useful to change the binning of histograms
that are constructed. If, for instance, changing the binning of a histogram slightly (15 instead of
10 bins, say) significantly influences the results of the analysis, the results may not be very robust
and it may be appropriate to assign systematic errors to cover this dependence on binning.

Figure 10: Histogram of height data made with the matplotlib hist method.

Figure 11: Histogram of height data made with the matplotlib hist method with a binning value
of 15 passed.

1.4 Fitting and Plotting with Errors
Fitting histograms is where analysis in Python requires a bit of extra work, though this work
ensures the user understands what the fit is doing. The scipy curve_fit function used in Section 4
will again be used for fitting histograms. As before, the function takes the fitting function along
with x and y data lists of the same size. For histograms, the y-data corresponds to the number of
counts in each bin and the x-data usually corresponds to the center value of each bin (which point
in the bin is used is somewhat arbitrary, but the center is a natural choice). The bin centers
therefore need to be calculated from the bin edges, which fully define the binning. This can be
done in Python in one line utilizing convenient list slicing: adding adjacent edges and dividing by
two to get an average. Once the histogram counts and bin centers have been obtained, fitting
works just like for a scatter plot as in Section 4.
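The slicing trick described above can be sketched in isolation (the edges below are hypothetical):

```python
import numpy as np

bin_edges = np.array([0., 1., 2., 3., 4.])  # N+1 edges define N bins

# Average each pair of adjacent edges to get the N bin centers
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])

# Subtract each edge from the subsequent edge to get the N bin widths
bin_widths = bin_edges[1:] - bin_edges[:-1]

print(bin_centers)  # [0.5 1.5 2.5 3.5]
print(bin_widths)   # [1. 1. 1. 1.]
```

The slice bin_edges[1:] drops the first edge and bin_edges[:-1] drops the last, so
element-wise arithmetic between the two pairs each edge with its neighbor.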
In typical counting experiments (where histograms are most useful) the uncertainty on each bin
can be estimated as the square root of the counts in that bin (with the exception of zero counts), as
will be discussed at some point in lecture. In order to plot these uncertainties on the histogram with
matplotlib, a bar chart will be used and slightly modified. In the example code the bin_centers
and counts are passed to the matplotlib bar method followed by the bin_widths. These bin widths
must also be calculated “manually” by subtracting each bin edge from the subsequent edge. Next
the calculated list of uncertainties, counting_err, is passed as the optional yerr parameter to
show the bin error bars. In addition, the optional zorder parameter determines how plot objects
are layered relative to each other: objects with higher zorder are drawn on top of (overlapping)
those with lower zorder.
When data has associated uncertainties it is usually necessary to weight fits by these uncertain-
ties. A list of weights is passed to curve_fit via the optional sigma parameter. These weights
can be arbitrary but, in the case that they are proper uncertainties on the y-data points, the
optional parameter absolute_sigma can be passed as True, which allows the covariance matrix to
be properly calculated taking these errors into account.
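As a minimal sketch of a weighted fit (the data and uncertainties below are invented for
illustration, and the linear model is just a stand-in for whatever function is being fit):

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, m, b):
    """Simple linear model used only for this demonstration."""
    return m * x + b


x = np.array([0., 1., 2., 3., 4.])
y = np.array([0.1, 2.1, 3.9, 6.2, 7.9])  # roughly y = 2x, with scatter
y_err = np.array([0.1, 0.1, 0.2, 0.2, 0.3])  # hypothetical absolute uncertainties

# Weight the fit by the uncertainties; absolute_sigma=True treats them as true
# errors on y so the returned covariance matrix is scaled correctly
popt, pcov = curve_fit(line, x, y, sigma=y_err, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print(popt, perr)
```

Points with smaller sigma pull harder on the fit, so the first two (low-uncertainty) points
dominate the result here.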
In this example the height data is fit with a Gaussian function. This best fit curve is superim-
posed on the histogram along with a highly transparent (low alpha) red fill under the fit curve for
aesthetics (plotted on top of the histogram with a high zorder). One difference from previous examples
is that the x-points used to plot the Gaussian are manually set via numpy’s linspace function
which, in this example, generates a list of 300 equally spaced numbers from the first bin edge to the
last. This allows a higher resolution (less jagged) curve to be drawn compared to
what would result from using the bin edges or bin centers.
The final new feature introduced is the matplotlib annotate method, a more feature-rich
variant of the base text method. This allows the user to draw arbitrary text/annotations
on the plot area. In this case, the optimal fit parameters along with their associated
uncertainties are written in a round, tan, transparent text box. The final plot is shown in Figure
12.

Figure 12: Histogram of height data with best fit Gaussian superimposed.

2 Earthquake Example
In this final, more comprehensive example analysis, Southern California earthquake data from
Caltech is used to demonstrate many of the features already introduced along with a few extras.
This example will have fewer comments and the description will focus on the overall analysis
along with specifics of the features not introduced in previous examples. While the text will be
pedagogical, the analysis itself will be similar to what is expected for experiments in this course.
The data is a rather large (2.7MB) file that lists individual earthquakes on each line with
earthquake characteristics separated by spaces (Figure 13). These earthquakes took place in 2020
between latitudes 32.0 and 37.0 and longitudes -122.0 and -114.0, roughly visualized in Figure 14.
Many questions can be, and undoubtedly are, asked of this data by geologists and others. This example
will ignore much of that detail (which will help introduce filters) and focus on a simple question
within the purview of the author’s very limited knowledge of earthquakes.

Figure 13: First few lines of the Caltech earthquake data.

Figure 14: Map of the approximate area (shaded) of recorded earthquakes in the Caltech data set.

2.1 Analysis Overview


The goal of this analysis will be to histogram and model the number of earthquakes that occur
per day in Southern California with magnitude between 3 and 9. The algorithm is outlined in
the main function, first pulling the data via read_quakes, then histogramming the data manu-
ally in hist_daily_quakes, and finally fitting the data to a Poisson distribution and plotting in
plot_quake_days.
In order to build a general function that can easily be reused in any subsequent analyses of
the same dataset, the read_quakes function reads out all characteristics of each earthquake (each
column of every line) into a list of Python dictionaries which is the object that will store all of
the earthquake data. From this list, earthquakes with magnitude less than 3 and greater than or
equal to 9 are filtered out and the number of earthquakes that meet these criteria are counted for
each day independently. The result is a list of the number of earthquakes meeting these criteria
that occurred each day which is produced and returned by the hist_daily_quakes function. This

list is then sent to the plot_quake_days function where it is histogrammed with numpy, fit to a
Poisson distribution, and plotted with matplotlib.
The resulting plot of the filtered daily earthquake counts along with superimposed best fit
Poisson is shown in Figure 15. From this plot it is apparent that there are outlying data points,
specifically, the day with 48 recorded earthquakes as shown with a y-axis zoom in Figure 16. In
a more thorough analysis it would likely be beneficial to investigate this outlier, which would be
easy to do with the quakes list of dictionaries (find the date in which there were 48 earthquakes
meeting the criteria and then look into the quakes characteristics or cross reference this date with
events online). This analysis is more interested in modeling the data as Poisson, so the plot is
simply zoomed on the x-axis to produce Figure 17. From these plots it is apparent that, while
the daily number of earthquakes isn’t a perfect Poisson process, a Poisson fit does a decent
job of describing the data. From here, this model could be used to estimate the probability of N
earthquakes occurring in a single day in Southern California. Other analyses in this same vein can
rather easily be carried out using modifications of this script.
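For instance, once a rate parameter has been extracted from the fit, the probability of seeing N
earthquakes in a single day can be estimated with scipy (a sketch; the lam value below is a
made-up placeholder, not the actual fit result):

```python
from scipy.stats import poisson

lam = 1.5  # hypothetical best fit daily rate; substitute the fitted value

# Probability of exactly n earthquakes in a day via the probability mass function
for n in range(5):
    print(f'P({n} quakes in a day) = {poisson.pmf(n, lam):.3f}')

# Probability of 5 or more earthquakes in a day via the survival function P(X > 4)
print(f'P(5 or more quakes in a day) = {poisson.sf(4, lam):.5f}')
```

The pmf gives P(X = n) and the survival function gives P(X > n), which together cover most
questions one would ask of the fitted model.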

Figure 15: Plot of number of days (y-axis) with number of earthquakes (x-axis) of magnitude
greater than or equal to 3 and less than 9.

Figure 16: Zooming in on the y-axis shows the days with a large number of earthquakes.
While a Poisson distribution is expected to have sizable probability for rare tail events, the number
of days with more than 5 earthquakes seems to deviate from the Poisson expectation, especially the
outlier with 48 earthquakes. This likely indicates a failure of the Poisson model and may indicate
some temporal correlations between earthquakes.

Figure 17: Zooming in on the x-axis shows that, while a Poisson model of the data certainly isn’t
perfect, it does capture much of the behavior and may, therefore, still be useful.

2.2 Analysis Code
def main():
    """
    Plot histogram of daily number of SoCal earthquakes of magnitude between 3 and 9. Fit to Poisson distribution.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex6_Discrete_Hist/' \
           'caltech_earthquakes_2020.catalog.txt'
    quakes = read_quakes(path)
    quake_days = hist_daily_quakes(quakes, mag_3_filter)
    plot_quake_days(quake_days)
    print('donzo')


def read_quakes(path):
    """
    Read caltech earthquake data at path. Return list of dictionaries with info about each quake.
    :param path: Path to caltech 2020 earthquake data file
    :return: List of earthquake dictionaries, one per quake
    """
    quakes = []
    with open(path, 'r') as file:
        lines = file.readlines()
        for line in lines:
            new_quake_dict = {}
            line = line.strip().split()
            if len(line) != 13:
                continue
            try:
                new_quake_dict.update({'date_time': dt.strptime(line[0] + ' ' + line[1], '%Y/%m/%d %H:%M:%S.%f')})
            except ValueError:
                continue
            new_quake_dict.update({'et': line[2]})
            new_quake_dict.update({'gt': line[3]})
            new_quake_dict.update({'mag': float(line[4])})
            new_quake_dict.update({'m': line[5]})
            new_quake_dict.update({'lat': float(line[6])})
            new_quake_dict.update({'lon': float(line[7])})
            new_quake_dict.update({'depth': float(line[8])})
            new_quake_dict.update({'q': line[9]})
            new_quake_dict.update({'evid': int(line[10])})
            new_quake_dict.update({'nph': int(line[11])})
            new_quake_dict.update({'ngrm': int(line[12])})
            quakes.append(new_quake_dict)

    return quakes

def hist_daily_quakes(quakes, criteria=lambda quake: True):
    """
    Given a list of earthquake dictionaries and a filter function, generate a list of the number of earthquakes that
    occurred each day passing the filter criteria.
    :param quakes: List of quake dictionaries. Each must include a 'date_time' key; the criteria function may
        require other keys, such as 'mag'.
    :param criteria: Function to determine which earthquakes to count per day. Defaults to counting all quakes each day.
    :return: List of number of quakes passing the given criteria each day.
    """
    quake_days = {}
    for quake in quakes:
        if quake['date_time'].date() not in quake_days:
            quake_days.update({quake['date_time'].date(): 0})

    quakes = filter(criteria, quakes)

    for quake in quakes:
        quake_days[quake['date_time'].date()] += 1

    return list(quake_days.values())

def plot_quake_days(quake_days):
    """
    Histogram the daily quake counts, fit with a Poisson distribution, and plot.
    :param quake_days: List of the number of quakes each day
    :return:
    """
    counts, bin_edges = np.histogram(quake_days, np.arange(-0.5, 50.5, 1))
    bin_centers = (bin_edges[1:] + bin_edges[:-1]) / 2
    counting_err = [np.sqrt(count) if count > 0 else 1 for count in counts]
    plt.bar(bin_centers, counts, 1.0, yerr=counting_err, label='quake data')

    pinit = [100, 1]  # Very rough initial guess
    popt, pcov = curve_fit(poisson, bin_centers, counts, p0=pinit)
    perr = np.sqrt(np.diag(pcov))
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {perr}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix
    poisson_range = np.linspace(0, bin_edges[-1], 300)
    plt.plot(poisson_range, poisson(poisson_range, *popt), 'r--', label='poisson fit')
    plt.title('Number of Daily Earthquakes with 3 <= mag < 9')
    plt.ylabel('Number of Days')
    plt.xlabel('Number of Earthquakes in Day')
    plt.annotate(f'a: {popt[0]:.0f}±{perr[0]:.0f}\nlam: {popt[1]:.3f}±{perr[1]:.3f}',
                 xy=(0.5, 0.95), xycoords='axes fraction', bbox=dict(boxstyle='round', facecolor='tan', alpha=0.3),
                 verticalalignment='top', horizontalalignment='center')
    plt.legend()
    plt.show()


def mag_3_filter(quake):
    """
    Given a quake dictionary, return True if the magnitude is between 3 and 9
    :param quake: Quake dictionary. Must include 'mag' keyword with corresponding magnitude value of the earthquake.
    :return: True if quake magnitude greater than or equal to three and less than 9
    """
    if 3 <= quake['mag'] < 9:
        return True
    else:
        return False


def poisson(k, a, lam):
    """
    Scaled Poisson distribution evaluated at k. Note that factorial here must handle arrays and
    non-integer k (e.g. scipy.special's factorial, which is computed via the gamma function).
    """
    return a * lam ** k / factorial(k) * np.exp(-lam)

2.3 Python Dictionaries


Python dictionaries allow arbitrary data types to be paired. Items are entered into dictionaries as
key: value pairs. When retrieving data from a dictionary, the key is used as an index to retrieve
the corresponding value. Dictionaries can be very useful for keeping track of data in a reader-friendly
way, as demonstrated here. Dictionaries can be slower than lists, so their use should be avoided in
low-level loops if speed is required.
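A brief illustration of dictionary usage in the style of this analysis (the earthquake values below
are invented):

```python
# Build a dictionary describing a single, made-up earthquake
quake = {'mag': 3.4, 'lat': 34.05, 'lon': -118.24, 'depth': 7.1}

# Retrieve a value using its key as the index
print(quake['mag'])  # 3.4

# Add a new key: value pair and update an existing one
quake['q'] = 'A'
quake.update({'depth': 7.3})

# Iterate over the key/value pairs
for key, value in quake.items():
    print(f'{key}: {value}')
```

Both quake['q'] = 'A' and quake.update({...}) insert or overwrite entries; the example code
uses update, but either form works.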

2.4 Python lambda


The lambda expression in Python is a slightly higher level concept which is in no way necessary for
this course, though it is very briefly introduced here in case of interest. More can be found online, but
the lambda expression as used in this example simply serves as a quick, nameless function definition.
Instead of using the def keyword and defining a function as usual, the lambda expression can
generate the function on the spot. An expression starts with the lambda keyword, followed by
the input parameters separated by commas, then a colon, and finally the returned
expression.
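For example, the following two definitions behave identically (the quake dictionary used to call
them is made up):

```python
# Conventional function definition
def mag_filter_def(quake):
    return 3 <= quake['mag'] < 9


# Equivalent nameless function via a lambda expression, bound to a name
# here only so it can be called for comparison
mag_filter_lam = lambda quake: 3 <= quake['mag'] < 9

test_quake = {'mag': 4.2}  # made-up quake dictionary
print(mag_filter_def(test_quake), mag_filter_lam(test_quake))  # True True
```

In practice a lambda is usually written inline at the point of use (as the criteria argument in
this example's code) rather than assigned to a name.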

2.5 Filtering Data


A very important aspect of data analysis is the ability to filter and cleanse data. This can be (and
often is) done “manually” by iterating over a data sequence, checking each item with an if statement
and, if it passes some criteria, saving it to a new, filtered sequence. This process is almost essential
in data analysis, and Python has a built-in filter function to generalize and condense it.

The filter function takes a criteria function as its first argument and a data sequence as its
second. The criteria function must take an element of the data sequence as an input parameter and
return True if this element should be kept or False if the element should be filtered out. The
filter function then returns a filtered sequence of data which can then be manipulated. Lambda
functions are often used as the filter criteria if the expression is simple enough.
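A minimal sketch of the manual approach next to the built-in filter function (the magnitudes
below are invented):

```python
mags = [2.1, 3.4, 5.0, 9.5, 4.2]  # made-up magnitudes

# Manual filtering with a loop and an if statement
kept_manual = []
for mag in mags:
    if 3 <= mag < 9:
        kept_manual.append(mag)

# Same result using filter with a lambda as the criteria function
kept_filter = list(filter(lambda mag: 3 <= mag < 9, mags))

print(kept_manual)  # [3.4, 5.0, 4.2]
print(kept_filter)  # [3.4, 5.0, 4.2]
```

Note that filter returns a lazy iterator, so list() is needed to materialize the filtered
sequence for printing or indexing.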

2.6 Discrete Values


This earthquake data set was specifically selected in order to demonstrate an analysis of discrete
counting data. In this case, the data was the number of earthquakes per day meeting certain
criteria. The values that this data can take are discrete: 0, 1, or 2 earthquakes are valid data
points while 1.5 earthquakes does not make sense in this context. This discrete data is typical for
counting observations which are at the core of many nuclear experiments. As such, it is prudent to
point out a subtlety.
It is important to pay special attention to the binning of this type of data since it is discrete
instead of continuous. The bins should be set with care such that the distribution of the data
within each bin is centered. This is easiest to explain with bins of width 1. In this case, the
bins should span from {(−0.5, 0.5), (0.5, 1.5), (1.5, 2.5), ...}. This ensures that each bin has a whole
number at its center which will ensure that the discrete data is centered in the bin. This is
especially important when fitting using the bin centers as the x data since getting this binning
wrong would result in a shift of the fitting results which can be very important for, say, the mean
value of a Gaussian distribution. If the data needs to be re-binned (a bin width of 2, for instance
{(−0.5, 1.5), (1.5, 3.5), ...}) the same advice holds true and combining bins from the bin width of 1
case ensures the data remains centered. While re-binning may be necessary in certain cases, the
author strongly suggests using a bin width of 1 for discrete data unless re-binning is truly needed.
On re-binning, the ability to distinguish between different values within the same bin, and thus
information, is lost.
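The half-integer bin edges described above can be generated with numpy's arange, as the
earthquake example does (a sketch):

```python
import numpy as np

# Width-1 bins centered on whole numbers: edges at -0.5, 0.5, 1.5, ...
bin_edges = np.arange(-0.5, 10.5, 1)
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])
print(bin_centers)  # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

# Re-binning to width 2 by keeping every other edge combines adjacent
# width-1 bins, so the discrete data stays centered within each new bin
rebin_edges = bin_edges[::2]
print(rebin_edges)  # [-0.5  1.5  3.5  5.5  7.5  9.5]
```

Passing an explicit edge array like this to np.histogram (as in plot_quake_days) gives full
control over the binning rather than relying on automatic bin placement.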

3 HEP Histogram Formatting


Traditional (read: old) high energy and nuclear physics analysis has been done in C++ within the
ROOT framework. ROOT has many benefits, among them processing speed and a full framework
tailored to physics. It is, however, maintained by only a small community which sometimes cannot
keep all of its facets up to date and competitive. As computing has exploded over the
past few decades, the open source community has made strides in state-of-the-art environments and
through pure manpower has begun to define industry standards. For this reason, this lab will be
suggesting analysis in Python as the skills learned with this language will be much more universally
useful than somewhat esoteric ROOT knowledge.
There is, however, much work being done to bring the useful features of ROOT into Python.
ROOT itself can be run in Python though this still requires a rather extensive investment into
understanding the ROOT framework. Other efforts are being made to on-board specific ROOT
features in a Pythonic style. A more comprehensive project trying to accomplish this goal is coffea.
Coffea utilizes the more specific and standalone mplhep in order to modify matplotlib plotting for
formatting more reminiscent of typical HEP plots (histograms, specifically). These tools are still
in the process of being built and tested and, in the author’s opinion, they are not yet suited to be
handed to beginners. Those interested are more than welcome to explore these projects!

Part IV
Full Scripts
The full scripts used in this document’s examples can be found below.

1 Hello World
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on December 14 2:06 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/hello_world

@author: Dylan Neff, Dylan
"""


def main():
    print('hello world')  # Print 'hello world' string to the terminal


if __name__ == '__main__':
    main()  # If this file is run as a script directly, this statement will execute and main will be run

2 Read File
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on December 14 3:21 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/read_file

@author: Dylan Neff, Dylan
"""


def main():
    """
    Read US corona virus case data from file and print to terminal.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    print_cases(dates, cases)
    print('donzo')


def read_cases(path):
    """
    Read corona virus case data from file at the input path and return lists of dates and case numbers
    :param path: Path to corona virus case data
    :return: List of dates and list of cases
    """
    dates = []  # List of dates, left as strings
    cases = []  # List of cases, converted to integers

    with open(path, 'r') as file:  # Open the contents at given path and name this opened object file
        lines = file.readlines()  # Read the opened file into a list with each entry a string of the line
        for line in lines[1:]:  # Iterate over each line, skipping the first line since it's a header
            line_data = line.split(',')  # Split each line string into a list of items separated by commas
            dates.append(line_data[0])  # Append zeroth element of line_data to list of dates
            cases.append(int(line_data[1]))  # Append first element of line_data to list of cases as an integer

    return dates, cases  # Return the two lists


def print_cases(dates, cases):
    """
    Print the corona virus date and case data passed as lists
    :param dates: List of dates
    :param cases: List of cases
    :return:
    """
    print(dates)  # Print the list of dates
    print(cases)  # Print the list of cases

    for date, case in zip(dates, cases):  # Iterate through dates and cases list together
        print(f'date: {date}, cases: {case}')  # Print each date/case pair on a new terminal line


if __name__ == '__main__':
    main()

3 Plotting
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on December 14 4:37 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/plotting

@author: Dylan Neff, Dylan
"""


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from datetime import datetime as dt


def main():
    """
    Read US corona virus case data from file and plot.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    plot_cases_simple(cases)
    plot_cases(dates, cases)
    plot_case_rate(dates, cases)
    print('donzo')


def read_cases(path):
    """
    Read corona virus case data from file at the input path and return lists of dates and case numbers
    :param path: Path to corona virus case data
    :return: List of dates and list of cases
    """
    dates = []  # List of dates, left as strings
    cases = []  # List of cases, converted to integers

    with open(path, 'r') as file:  # Open the contents at given path and name this opened object file
        lines = file.readlines()  # Read the opened file into a list with each entry a string of the line
        for line in lines[1:]:  # Iterate over each line, skipping the first line since it's a header
            line_data = line.split(',')  # Split each line string into a list of items separated by commas
            dates.append(line_data[0])  # Append zeroth element of line_data to list of dates
            cases.append(int(line_data[1]))  # Append first element of line_data to list of cases as an integer

    return dates, cases  # Return the two lists


def plot_cases_simple(cases):
    """
    Plot cumulative US corona virus cases vs the case index using matplotlib.
    :param cases: List of coronavirus case numbers for each day
    :return:
    """
    case_indices = range(len(cases))  # Produce a list of integers from {0, 1, ..., number of cases}
    plt.plot(case_indices, cases)  # Plot case indices on x-axis and number of cases on y-axis.
    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def plot_cases(dates, cases):
    """
    Plot cumulative US corona virus cases vs the date using matplotlib.
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Total Covid Cases in US')  # Set title of plot

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to datetime objects

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    plt.plot(dates, cases)  # Plot dates on x-axis and cases on y-axis. This can be called alone if no formatting needed
    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def plot_case_rate(dates, cases):
    """
    Plot number of daily new coronavirus cases in US vs the date. Also represent estimated errors with filled bands.
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    rates = np.asarray([cases[i+1] - cases[i] for i in range(len(cases) - 1)])  # Calculate rate, daily change in cases
    rates_err = np.sqrt(rates)  # Estimate uncertainties on rates based on counting statistics, not robust

    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Daily New Covid Cases in US')  # Set plot title

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to datetime objects

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    plt.plot(dates[1:], rates, color='green')  # Plot dates (skipping first) on x-axis rates on y, set color to green
    plt.fill_between(dates[1:], rates - rates_err, rates + rates_err, facecolor='green', alpha=0.3,
                     interpolate=True)  # Plot uncertainty bands about data, alpha is transparency

    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called.


if __name__ == '__main__':
    main()

33
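The daily-rate calculation in `plot_case_rate` can also be written with numpy's `diff`, which computes successive differences directly. A minimal sketch with made-up case counts (the numbers and dates below are purely illustrative):

```python
import numpy as np
from datetime import datetime as dt

cases = np.asarray([100, 150, 225, 340])  # Hypothetical cumulative counts for four consecutive days
rates = np.diff(cases)  # Same result as [cases[i+1] - cases[i] for i in range(len(cases) - 1)]
rates_err = np.sqrt(rates)  # Rough counting-statistics uncertainty, as in plot_case_rate

# Date strings parse the same way the plotting functions do
dates = [dt.strptime(d, '%Y-%m-%d').date() for d in ['2020-12-13', '2020-12-14']]
print(rates)  # [ 50  75 115]
```

Note that `rates` has one fewer entry than `cases`, which is why the plotting code pairs it with `dates[1:]`.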
4 Fitting
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on December 16 2:27 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/fitting

@author: Dylan Neff, Dyn04
"""


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from datetime import datetime as dt
from scipy.optimize import curve_fit


def main():
    """
    Read US coronavirus case data from file and fit with exponential.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex2_Read_File/us_12_14_20.csv'
    dates, cases = read_cases(path)
    fit_cases(dates, cases)
    print('donzo')


def read_cases(path):
    """
    Read coronavirus case data from file at the input path and return lists of dates and case numbers
    :param path: Path to coronavirus case data
    :return: List of dates and list of cases
    """
    dates = []  # List of dates, left as strings
    cases = []  # List of cases, converted to integers

    with open(path, 'r') as file:  # Open the contents at given path and name this opened object file
        lines = file.readlines()  # Read the opened file into a list with each entry a string of the line
        for line in lines[1:]:  # Iterate over each line, skipping the first line since it's a header
            line_data = line.split(',')  # Split each line string into a list of items separated by commas
            dates.append(line_data[0])  # Append zeroth element of line_data to list of dates
            cases.append(int(line_data[1]))  # Append first element of line_data to list of cases as an integer

    return dates, cases  # Return the two lists


def fit_cases(dates, cases):
    """
    Fit the US cumulative coronavirus cases vs date data to an exponential function and plot the result
    :param dates: List of dates for coronavirus cases
    :param cases: List of coronavirus case numbers for each corresponding date in dates
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.set_title('Total Covid Cases in US with Exponential Fit')  # Set title of plot
    ax.set_xlabel('Days since Jan 21 2020')  # Set x-axis title

    formatter = DateFormatter('%m/%d')  # Create date formatting for plot
    ax.xaxis.set_major_formatter(formatter)  # Set date formatting on plot
    ax.xaxis.set_tick_params(rotation=30)  # Set rotation of date labels to 30 degrees

    dates = [dt.strptime(date, '%Y-%m-%d').date() for date in dates]  # Convert date strings to datetime objects

    # Convert to days since first day for fitting, int64 for large integers, default int32 max is 2147483647
    days = np.asarray([(date - dates[0]).days for date in dates], dtype=np.int64)
    cases = np.asarray(cases, dtype=np.int64)  # Convert cases from Python list to numpy array

    pinit = [1, 0.04]  # Set initial exponential parameters for fitting
    popt, pcov = curve_fit(exp, days, cases, p0=pinit)  # Fit data to exponential
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {np.sqrt(np.diag(pcov))}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix

    plt.plot(dates, cases, color='blue', label='data')  # Plot dates on x-axis and cases on y-axis
    plt.plot(dates, exp(days, *popt), 'r--', label='fit')  # Plot best fit curve in red dashes

    ax.legend()  # Display the legend on plot

    plt.show()  # Once plot is set up and ready, call this to display the plot. Not displayed if this is not called


def exp(x, a, lam):
    """
    Exponential function
    :param x: Data x values
    :param a: Amplitude
    :param lam: Decay/growth parameter
    :return: Value of exponential
    """
    return a * np.exp(lam * x)


if __name__ == '__main__':
    main()
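Before trusting `curve_fit` results on real data, it can help to run the same workflow on synthetic data generated from known parameters; the fit should then recover those parameters almost exactly. A small self-check using the same `exp` model and initial guess as the fitting example (the true parameter values here are made up):

```python
import numpy as np
from scipy.optimize import curve_fit


def exp(x, a, lam):
    return a * np.exp(lam * x)  # Same exponential model as in the fitting example


x = np.arange(0, 50, dtype=float)
y = exp(x, 2.0, 0.1)  # Noiseless synthetic data with known a=2.0, lam=0.1

popt, pcov = curve_fit(exp, x, y, p0=[1, 0.04])  # Same style of initial guess as fit_cases
perr = np.sqrt(np.diag(pcov))  # Parameter uncertainties from the diagonal of the covariance matrix
print(popt)  # Should recover approximately [2.0, 0.1]
```

Since the synthetic data is noiseless and exactly exponential, the recovered parameters should match the inputs to many decimal places; on real, noisy data the covariance-based errors become meaningful.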
5 Histograms
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on January 11 1:44 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/read_txt_example

@author: Dylan Neff, Dylan
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit


def main():
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex5_Plot_Hist/height_weight.txt'
    heights, weights = read_heights_weights(path)
    get_statistics(heights)
    auto_plot_hist(heights)
    bin_plot_hist(heights, 15)
    fit_gaus(heights, 15)
    print('donzo')


def read_heights_weights(path):
    """
    Read file at path and return height and weight data as numpy arrays.
    :param path: Path to data file.
    :return heights: height data as a numpy float array
    :return weights: weight data as a numpy float array
    """
    heights = []  # Make a list to store height data
    weights = []  # Make a list to store weight data
    with open(path, 'r') as file:  # Open path in read ('r') mode, reference it as file inside with scope
        lines = file.readlines()  # Read each line from file, lines is then a list of strings, one for each line in file
        for line in lines[1:]:  # Iterate over each line in lines, [1:] -> Skip first line to get rid of file headers
            line = line.strip()  # Remove any whitespace or newline at beginning or end of line (optional)
            line = line.split()  # Split line on argument. With no argument, default is space or tab
            heights.append(float(line[1]))  # Convert element in second column (line[1]) to float, add to heights list
            weights.append(float(line[2]))  # Same as above but with element in third column of the line

    return np.asarray(heights), np.asarray(weights)  # Convert lists to numpy arrays to allow for easier math/plotting


def get_statistics(data):
    """
    Get mean and standard deviation of data and print to terminal.
    :param data: Sequence of data to calculate mean/sd of.
    :return:
    """
    mean = np.mean(data)  # Calculate mean of data
    sd = np.std(data)  # Calculate standard deviation of data
    print(f'Data mean: {mean}, standard deviation {sd}')  # Print mean and standard deviation with f-string


def auto_plot_hist(data):
    """
    Plot histogram of data with matplotlib hist method with the default 10 bins
    :param data: Sequence of data to be histogrammed and plotted.
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.hist(data, align='mid')  # Plot histogrammed data with bins centered in the middle of the bin range
    plt.show()


def bin_plot_hist(data, bins=10):
    """
    Plot histogram of data with matplotlib hist method with input optional bins parameter passed for binning
    :param data: Sequence of data to be histogrammed and plotted
    :param bins: Optional parameter with default 10. Passed to matplotlib hist as histogram binning
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation
    ax.hist(data, bins, align='mid')  # Plot histogrammed data with input binning with bins centered middle of range
    plt.show()


def fit_gaus(data, bins=10):
    """
    Histogram data and plot manually as bar chart with uncertainties. Fit gaussian to histogram and superimpose fit.
    :param data: Data to be histogrammed and fit
    :param bins: Optional parameter for binning
    :return:
    """
    fig, ax = plt.subplots()  # Set matplotlib subplots, only needed when axis (ax) object needed for manipulation

    counts, bin_edges = np.histogram(data, bins)  # Histogram via numpy with binning passed, return counts and bin edges
    bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])  # Calculate N bin centers from N+1 bin edges
    bin_widths = bin_edges[1:] - bin_edges[:-1]  # Calculate N positive bin widths from N+1 bin edges
    counting_err = [np.sqrt(count) if count > 0 else 1 for count in counts]  # Calculate counting uncertainties
    ax.bar(bin_centers, counts, bin_widths, yerr=counting_err, zorder=1, label='height data')  # Plot histogram via bar

    pinit = [20, 68, 3]  # Set initial conditions for gaussian fit
    popt, pcov = curve_fit(gaus, bin_centers, counts, p0=pinit, sigma=counting_err, absolute_sigma=True)  # Fit to gaus
    perr = np.sqrt(np.diag(pcov))  # Calculate uncertainties on optimal parameters via sqrt of diagonals of cov matrix
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {perr}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix
    gaus_range = np.linspace(bin_edges[0], bin_edges[-1], 300)  # Set range of points to plot for fit gaussian
    ax.plot(gaus_range, gaus(gaus_range, *popt), 'r--', zorder=3, label='gaus fit')  # Plot best fit gaus over hist
    ax.fill_between(gaus_range, gaus(gaus_range, *popt), 0, color='red', alpha=0.2, zorder=2)  # Fill area under gaus
    ax.legend()  # Draw legend on plot.
    ax.annotate(f'a: {popt[0]:.1f}±{perr[0]:.1f}\nx0: {popt[1]:.1f}±{perr[1]:.1f}\nsigma: {popt[2]:.1f}±{perr[2]:.1f}',
                xy=(0.05, 0.95), xycoords='axes fraction', bbox=dict(boxstyle='round', facecolor='tan', alpha=0.3),
                verticalalignment='top')  # Display box on plot with optimal fit parameters
    plt.show()


def gaus(x, a, x0, sigma):
    """
    Calculate and return gaussian value at point x
    :param x: Point at which to compute gaussian
    :param a: Amplitude of gaussian
    :param x0: Center of gaussian (equal to mean)
    :param sigma: Width of gaussian (equal to standard deviation)
    :return: Value of gaussian at point x
    """
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))


if __name__ == '__main__':
    main()
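The edge-to-center arithmetic used in `fit_gaus` is easy to verify on a small, made-up dataset; note that the widths come out positive when the left edges (`bin_edges[:-1]`) are subtracted from the right edges (`bin_edges[1:]`):

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # Small made-up dataset
counts, bin_edges = np.histogram(data, bins=4, range=(1, 5))  # 4 bins -> 5 edges
bin_centers = 0.5 * (bin_edges[1:] + bin_edges[:-1])  # N centers from N+1 edges
bin_widths = bin_edges[1:] - bin_edges[:-1]  # N positive widths: right edge minus left edge

print(counts)  # [1 2 3 3]
print(bin_centers)  # [1.5 2.5 3.5 4.5]
```

All bins are half-open intervals [left, right) except the last, which numpy makes closed so the maximum value (5 here) is counted.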
6 Earthquakes
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on December 22 9:37 PM 2020
Created in PyCharm
Created as UCLA_Nuclear_Lab/hist_earthquakes

@author: Dylan Neff, Dyn04
"""


import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
from scipy.optimize import curve_fit
from scipy.special import factorial


def main():
    """
    Plot histogram of daily number of SoCal earthquakes of magnitude between 3 and 9. Fit to Poisson distribution.
    :return:
    """
    path = 'C:/Users/Dylan/PycharmProjects/UCLA_Nuclear_Lab/Example_Code/Ex6_Discrete_Hist/' \
           'caltech_earthquakes_2020.catalog.txt'
    quakes = read_quakes(path)
    quake_days = hist_daily_quakes(quakes, mag_3_filter)
    plot_quake_days(quake_days)
    print('donzo')


def read_quakes(path):
    """
    Read Caltech earthquake data at path. Return list of dictionaries with info about each quake.
    :param path: Path to Caltech 2020 earthquake data file
    :return: List of dictionaries, one per earthquake
    """
    quakes = []
    with open(path, 'r') as file:
        lines = file.readlines()
        for line in lines:
            new_quake_dict = {}
            line = line.strip().split()
            if len(line) != 13:
                continue
            try:
                new_quake_dict.update({'date_time': dt.strptime(line[0] + ' ' + line[1], '%Y/%m/%d %H:%M:%S.%f')})
            except ValueError:
                continue
            new_quake_dict.update({'et': line[2]})
            new_quake_dict.update({'gt': line[3]})
            new_quake_dict.update({'mag': float(line[4])})
            new_quake_dict.update({'m': line[5]})
            new_quake_dict.update({'lat': float(line[6])})
            new_quake_dict.update({'lon': float(line[7])})
            new_quake_dict.update({'depth': float(line[8])})
            new_quake_dict.update({'q': line[9]})
            new_quake_dict.update({'evid': int(line[10])})
            new_quake_dict.update({'nph': int(line[11])})
            new_quake_dict.update({'ngrm': int(line[12])})
            quakes.append(new_quake_dict)

    return quakes


def hist_daily_quakes(quakes, criteria=lambda quake: True):
    """
    Given a list of earthquake dictionaries and a filter function, generate a list of the number of earthquakes that
    occurred each day passing the filter criteria.
    :param quakes: List of quake dictionaries. Each must include a 'date_time' keyword plus any keys used by criteria.
    :param criteria: Function to determine which earthquakes to count per day. Defaults to counting all quakes each day.
    :return: List of number of quakes passing the given criteria each day.
    """
    quake_days = {}
    for quake in quakes:
        if quake['date_time'].date() not in quake_days:
            quake_days.update({quake['date_time'].date(): 0})

    quakes = filter(criteria, quakes)
    for quake in quakes:
        quake_days[quake['date_time'].date()] += 1

    return list(quake_days.values())


def plot_quake_days(quake_days):
    """
    Histogram the number of quakes per day, fit this histogram to a Poisson distribution, and plot both.
    :param quake_days: List of number of quakes passing criteria each day.
    :return:
    """
    counts, bin_edges = np.histogram(quake_days, np.arange(-0.5, 50.5, 1))
    bin_centers = (bin_edges[1:] + bin_edges[:-1]) / 2
    counting_err = [np.sqrt(count) if count > 0 else 1 for count in counts]
    plt.bar(bin_centers, counts, 1.0, yerr=counting_err, label='quake data')

    pinit = [100, 1]  # Very rough initial guess
    popt, pcov = curve_fit(poisson, bin_centers, counts, p0=pinit)
    perr = np.sqrt(np.diag(pcov))
    print(f'Optimal parameters: {popt}')  # Print list of best fit parameters
    print(f'Error on parameters: {perr}')  # Print estimated parameter errors on fit
    print(f'Covariance matrix:\n{pcov}')  # Print fit covariance matrix
    poisson_range = np.linspace(0, bin_edges[-1], 300)
    plt.plot(poisson_range, poisson(poisson_range, *popt), 'r--', label='poisson fit')
    plt.title('Number of Daily Earthquakes with 3 <= mag < 9')
    plt.ylabel('Number of Days')
    plt.xlabel('Number of Earthquakes in Day')
    plt.annotate(f'a: {popt[0]:.0f}±{perr[0]:.0f}\nlam: {popt[1]:.3f}±{perr[1]:.3f}',
                 xy=(0.5, 0.95), xycoords='axes fraction', bbox=dict(boxstyle='round', facecolor='tan', alpha=0.3),
                 verticalalignment='top', horizontalalignment='center')
    plt.legend()
    plt.show()


def mag_3_filter(quake):
    """
    Given a quake dictionary, return True if the magnitude is between 3 and 9
    :param quake: Quake dictionary. Must include 'mag' keyword with corresponding magnitude value of the earthquake.
    :return: True if quake magnitude greater than or equal to three and less than 9
    """
    if 3 <= quake['mag'] < 9:
        return True
    else:
        return False


def poisson(k, a, lam):
    """
    Poisson distribution with mean lam, scaled by amplitude a. Using scipy's factorial (gamma based) allows
    evaluation at non-integer k, which is needed to draw a smooth fit curve.
    :param k: Number of occurrences
    :param a: Amplitude
    :param lam: Mean of the distribution
    :return: Value of scaled Poisson distribution at k
    """
    return a * lam ** k / factorial(k) * np.exp(-lam)


if __name__ == '__main__':
    main()
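As a sanity check on the `poisson` function: with amplitude a = 1 it is a normalized probability mass function, so summing it over a wide range of integer k should give approximately 1, and the most probable k should sit near lam. A quick sketch (lam = 4.5 is an arbitrary choice):

```python
import numpy as np
from scipy.special import factorial


def poisson(k, a, lam):
    return a * lam ** k / factorial(k) * np.exp(-lam)  # Same Poisson form as in the earthquake example


k = np.arange(0, 50)
probs = poisson(k, 1.0, 4.5)  # a=1 gives the normalized distribution
print(probs.sum())  # ≈ 1.0, the tail beyond k=49 is negligible
print(k[np.argmax(probs)])  # Mode at floor(lam) = 4
```

In the earthquake fit, a is left free because the histogram counts days rather than probabilities, so the fitted a absorbs the total number of days.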
