Day_10 Python External Files
Day_10 Python External Files
Day-10 Python File Handling When you open a file in a programme such as Word or Excel you have to select the file you want from a 'File ->
Open' dialogue of some kind. In other words you have to point the programme at the specific file you want which
is stored at some specific location on your computer. This is also true when you open a file in Python. The
difference is that when you open a file in python you have to specify the file location in words (as a filepath)
Text Files rather then selecting from a dialogue. So the first thing we have to do is assign the file location to a variable. This
Data is often stored in formats that are not easy to read unless you have some specialised software. Excel textual representation of the file location is called the filepath.
spreadsheets are one example; how do you read an Excel spreadsheet if you don't have Excel? This reliance on
Once we have the filepath we can use the python function open() to 'get a handle' on the file. This file handle
particular software is not very useful for programmatic analysis of data. Instead if we're going to use a
should also be assigned to a variable. We can then operate on that variable. Let's see an example.
programming language to analyse or process our data we would prefer some "easy to deal with" format. Text
files or flatfiles (http://en.wikipedia.org/wiki/Text_file) provide just the sort of format we need (there are other
In [1]: file_loc = 'data/elderlyHeightWeight.csv'
choices). You can think of these as a single sheet from a spreadsheet with the data arranged in rows (separated
f_hand = open(file_loc, 'r')
by our old friend the \n character) and columns (separated by some other character). print(f_hand)
data_field1\tdata_field2\tdata_field3\n data_field1\tdata_field2\tdata_field3\n The results of the print statement might surprise you. Rather than printing the contents of the file what we get
is a representation of the location in the computer memory of where the file is i.e. at memory location 0x7... etc in
Each field is separated by a tab stop \t and the end of a line is indicated by the \n combination. the example above.
In a .csv file you would see: Whilst we can also open Excel or Word files in python this requires the use of special software libraries. We'll see
some of those later in the course. Mostly when we are analysing data in files we open simple text files. Both
data_field1,data_field2,data_field3\n data_field1,data_field2,data_field3\n Excel and Word can save files out as simple text files.
The first few lines of the file we will work with are shown below.
Note:
In [2]: !head -n 4 data/elderlyHeightWeight.csv
It's important to note that there is NO STANDARD for either the layout OR naming of text files. In the exercises
that follow the text file we will be working with has the extension .csv which should mean comma separated Gender Age body wt ht
but in fact the data is separated by tabs - because that's what I'm in the habit of doing. Please ignore my bad F 77 63.8 155.5
F 80 56.4 160.5
habits!
F 76 55.2 159.5
In [4]: file_loc = 'data/elderlyHeightWeight.csv'
While the file handle does not contain the data from the file (it only points at it) it is easy to construct a for loop f_hand = open(file_loc, 'r')
to cycle through the lines of the file and carry out some computation. For example we can easily count the f_data = f_hand.read()
number of lines in the file. print(f_data)
If you know that your file is likely to be small you can read the whole file into memory with the read() method In [6]: file_loc = 'data/elderlyHeightWeight.csv'
(remember the dot notation!). f_hand = open(file_loc, 'r')
f_data = f_hand.read()
This reads the entire contents of the file, newlines and all, into one large string. f_data[:22]
print(len(f_data))
print( f_data[:10])
r'{}'.format(f_data[:22]) # note the tab stop in the output
286
Gender Age
After reading in the current line readline() then moves on to the next line. So calling readline() again
uses the next line in the file. One other thing to bear in mind is that readline() leaves whitespace and in
particular the \n character at the end of the line. You can see that above (there's a blank line between the Alternative Implementations (just for fun)
printed lines) in the following example.
In [8]: line = f_hand.readline() # note next line has been read no. 1
line # compare to print above
In [10]: f_hand = open(file_loc, 'r') # read in in file again to get header line
line = f_hand.readline().strip() # read the line then strip the whitespace at
the end of the line
line # no \n!
One final thing to note is that whenever we finish with a file we should close it. Leaving files 'open' after data has Gender Age body wt ht
M 79 75.5 171
been read from them can lead to increasing amounts of memory being used and also corruption of the file.
M 75 83.9 178.5
Closing files is accomplished by using the close() method on the file handle. Also illustrated is a simple filter M 79 75.7 167
to print out only the male data using the string method startswith() - which returns a boolean value M 84 72.5 171.5
depending on whether the line begins with the given argument (M in this case) or not. M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5
It's that simple! Now python makes available to us all the useful code in the csv library. The csv library,
In [26]: max_male = max(info['M'], key=lambda e: float(e[2]) + float(e[3]))
unsurprisingly, contains python functions and methods to make dealing with csv (and other) text files easier. Let's
print(max_male)
first see how to open a text file using the csv library and printing out the first few lines.
('M', '78', '95.6', '168')
To read data from a csv file, we use the reader() function. The reader() function takes each line of the file
and makes a reader object containing lists made up of each row in the input data. Objects in programming are
containers for both data and methods that act on that data (a bit esoteric so don't worry if you don't quite get
Sol. #2 : Using a list comprehension that). One method the reader object supports is the .next() method. We can use this to access each row at
a time. Notably once we have processed the line it's gone from the reader object.
In [27]: ## Creating Partial Lists using **List Comprehension**
males = [l.strip().split('\t') for l in lines[1:]
if l.startswith('M')]
females = [l.strip().split('\t') for l in lines[1:]
if l.startswith('F')] Note:
From here on, we are going to keep using the with/as statement to handle I/O operations, namely Context
In [28]: males
Manager objects.
Out[28]: [['M', '79', '75.5', '171'],
['M', '75', '83.9', '178.5'], For more information, see this notebook (09 Exceptions.ipynb#ctx).
['M', '79', '75.7', '167'],
['M', '84', '72.5', '171.5'],
['M', '76', '56.2', '167'],
['M', '80', '73.4', '168.5'],
['M', '75', '67.7', '174.5'],
['M', '75', '93', '168'],
['M', '78', '95.6', '168'],
['M', '80', '75.6', '183.5']]
Note that if the file exists then opening it with the 'w' argument removes any data that was in the file and
We can see that the reader() function has processed each line into a single list element based on the field overwrites it with what you put in. This may not be what you wanted to do. We'll cover how you append data to a
delimiter we supplied. Importantly also note that all the values are now of type str in each list (everything is in file without overwriting the contents shortly.
quotes). This is important if you want to do calculations on these values. Once we have an open file we can write data to it with the write() method applied to the file handle.
Using the csv module makes it easy to select whole columns by selecting the data we want from the reader .
Let's open a file and write some data to it.
We'll use the .next() method to find the column order and then iterate over the rows with a for loop to pull
out height and weight.
In [40]: with open('data/test.txt', 'w') as f_out:
for i in range(10):
In [37]: with open('data/elderlyHeightWeight.csv', 'r') as csvfile: line = 'Line ' + str(i) + '\n'
reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter f_out.write(line)
weight = ['Weight'] # list to hold data, put in header In the above code we first opened (created) the file test.txt and then ran through a range of numbers (from 0
height = ['Height'] to 9) using a for loop. At each iteration of the loop we concatenated (joined) the word 'Line' to the string
representation of the number (note the use of str ) and a newline character. Finally we wrote each of the
for row in reader:
resulting strings to our new file. In the last line we closed the file.
weight.append(row[2])
height.append(row[3])
print (weight)
print (height) Putting it together!
['Gender', 'Age', 'body wt', 'ht'] Write a script that uses the csv module to open a file after getting a filepath from the user. Use the script to
['Weight', '63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75. open the elderlyHeightWeight.csv file. Write out a new file containing only male data. Remember to close
5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
all the files once your done. In addition include a try\except clause to handle the situation where the
['Height', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '1
71', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5'] requested file doesn't exist.
Hint: csv.reader objects are lists. Recall how you .join() lists elements into a string.
In [15]: # import csv - done above
We'll use a slightly different approach here from that demonstrated above (previous height & weight example). Now we have the data separated out it's a trivial effort to calculate the height in meters (from the given height in
Instead of iterating over the rows we'll use iterator variables in our for loop. cm). In the code below we use the range() function to get the positions of the actual heights (i.e. we skip the
column header), we convert those heights from str to float and we calculate the height in meters and
append this to a new list.
In [44]: height_m = []
height_m.append('ht_m') # a new header The first element in each of our data lists is the column header. The zip() function captures these first
elements into a tuple - ('Gender', 'Age', 'body wt', 'ht', 'ht_m') - and this, in turn, becomes the first
# use range(1,len(height)) so we don't get the header again element of a new list, data_out . The zip() function then captures all the second elements from each data
for ht in height[1:]:
list and these become part of a tuple which is the second element of data_out . In this way each data list is
height_m.append(float(ht)/100) # note the conversion to a float here
'zipped up' with the other lists.
print (height_m)
To output the rows we simply iterate over the data_out list and send each element to our output file as a row
using the .writerow() method.
['ht_m', 1.555, 1.605, 1.595, 1.51, 1.655, 1.67, 1.54, 1.53, 1.71, 1.785, 1.6
7, 1.715, 1.67, 1.685, 1.745, 1.68, 1.68, 1.835]
Putting it together 1
Now we have all the data we need to write the new file. First we'll capture each line of our new file to a list (the
zip() function) and then write each line to the new file. The csv library extends the .write() method with Open the elderlyHeightWeight.csv using the functions in the csv module and extract each column to a
a writer object. One method of writer objects is .writerow() the use of which is demonstrated below. separate list. Use the height and weight data to calculate the BMI for each subject. Use zip() to create a list of
data to write out and write all the phenotype data including BMI back to a new file.
In [45]: with open('data/new_data.csv', 'w') as newdata_file:
Hint - if you use the csv.reader() remember the issues with the str type in lists.
writer = csv.writer(newdata_file, delimiter='\t') # define a writer object
<class 'zip'>
Write a script that will read this data and count the number of NA values in height and /or weight and count the
number of males and females.
Calculate the BMI for each individual, add this to the original file and write out a new file indluding BMI data.
Finally calculate the mean BMI for males and females and write these out as well (to 2 decimal places).
Hint: In this exercise you should use the techniques you have learned to loop over the lines of a file and extract
each variable into its' own list. You can then calculate the BMI values easily. However you won't be able to
calculate a BMI for individuals with 'NA' in either weight or height columns. How can you use the continue
keyword when you loop over your data to avoid collecting values for these individuals?