Python – working with data –
text formats
ASCII or text file formats
Advantages of working with text formats:
• They are usually human-readable.
• They tend to be simple structures.
• It is relatively easy to write code to interpret
them.
Disadvantages include:
• Inefficient storage for big data volumes.
• Most people invent their own format so
there is a lack of standardisation.
Using python to read text formats
As we have seen Python has a great toolkit for
reading files and working with strings.
In this example we use a file that we found on
the web, and then adapt some code to read it
into a useful, re-usable form.
Our example file
We found a suitable data set on the web:
http://www.metoffice.gov.uk/climate/uk/summaries/datasets#Yearorder
Met Office monthly weather statistics for
the UK since 1910.
Header
Lines numbers
(for reference
only)
Data (first 9 columns)
Data (last 8 columns)
Look! A missing value!
Let's write some code to read it
We'll need:
• To read the header and data separately
• To think about the data structure (so it is easy to
retrieve the data in a useful manner).
Let's put into practice what we have learnt:
• Use NumPy to store the arrays
• But we'll need to test for missing values and use
Masked Array (numpy.ma)
Example code (and data)
Please refer to the example code:
example_code/test_read_rainfall.py
And data file:
example_data/uk_rainfall.txt
Reading the header
UK Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal
and urban effects where relationships are
found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May,
Summer=June-Aug, Autumn=Sept-Nov. (Winter:
Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where
values are equal, rankings are based in
order of year descending.
Data are provisional from December 2014 & Winter
2015. Last updated 07/04/2015
Reading the header
UK Rainfall (mm) Line 1 is important
Areal series, starting from information.
1910
Allowances have been made for topographic, coastal
and urban effects where relationships are
found to exist. Other lines are useful
information.
Seasons: Winter=Dec-Feb, Spring=Mar-May,
Summer=June-Aug, Autumn=Sept-Nov. (Winter:
Let's capture the metadata in:
Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where
values are equal, rankings are based in
- location: UK
order of year descending.
- variable:
Data are provisional from December Rainfall
2014 & Winter
- units: mm
2015. Last updated 07/04/2015
Reading the header
def readHeader(fname):
# Open the file and read the relevant lines
f = open(fname)
head = f.readlines()[:6]
f.close()
# Get important stuff
location, variable, units = head[0].split()
units = units.replace("(", "").replace(")", "")
# Put others lines in comments
comments = head[1:6]
return (location, variable, units, comments)
Test the reader
>>> (location, variable, units, comments) = \
readHeader("example_data/uk_rainfall.txt")
>>> print location, variable, units
UK Rainfall mm
>>> print comments[1]
Allowances have been made for topographic, coastal
and urban effects where relationships are found to
exist.
Write a function to handle missing
data properly
import numpy.ma as MA
def checkValue(value):
# Check if value should be a float
# or flagged as missing
if value == "---":
value = MA.masked
else:
value = float(value)
return value
Reading the data (part 1)
import numpy.ma as MA
def readData(fname):
# Open file and read column names and data block
f = open(fname)
# Ignore header
for i in range(7):
f.readline()
col_names = f.readline().split()
data_block = f.readlines()
f.close()
# Create a data dictionary, containing
# a list of values for each variable
data = {}
Data (first 9 columns)
Reading the data (part 2)
# Add an entry to the dictionary for each column
for col_name in col_names:
data[col_name] = MA.zeros(len(data_block), 'f',
fill_value = -999.999)
Reading the data (part 3)
# Loop through each value: append to each column
for (line_count, line) in enumerate(data_block):
items = line.split()
for (col_count, col_name) in enumerate(col_names):
value = items[col_count]
data[col_name][line_count] = checkValue(value)
return data
Testing the code
>>> data = readData("example_data/uk_rainfall.txt")
>>> print data["Year"]
[ 1910. 1911. 1912. ...
>>> print data["JAN"]
[ 111.40000153 59.20000076 111.69999695 ...
>>> winter = data["WIN"]
>>> print MA.is_masked(winter[0])
True
>>> print MA.is_masked(winter[1])
False
Look! A missing value!
What about CSV or tab-delimited?
The above example will work exactly the same with
a tab-delimited file (because the string split
method splits on white space) .
If the file used commas (CSV) to separate columns
then you could use:
line.split(",")
Or try the Python "csv" module
There is a python "csv" module that is able to read text files
with various delimiters. E.g.:
>>> import csv
>>> r = csv.reader(open("example_data/weather.csv"))
>>> for row in r:
... print row
['Date', 'Time', 'Temp', 'Rainfall']
['2014-01-01', '00:00', '2.34', '4.45']
['2014-01-01', '12:00', '6.70', '8.34']
['2014-01-02', '00:00', '-1.34', '10.25']
See: https://docs.python.org/2/library/csv.html