
Day_10 Python External Files

Uploaded by

hatim
Copyright
© All Rights Reserved

Day-10 Python File Handling

Text Files

Data is often stored in formats that are not easy to read unless you have some specialised software. Excel spreadsheets are one example; how do you read an Excel spreadsheet if you don't have Excel? This reliance on particular software is not very useful for programmatic analysis of data. Instead, if we're going to use a programming language to analyse or process our data we would prefer some "easy to deal with" format. Text files or flatfiles (http://en.wikipedia.org/wiki/Text_file) provide just the sort of format we need (there are other choices). You can think of these as a single sheet from a spreadsheet with the data arranged in rows (separated by our old friend the \n character) and columns (separated by some other character).

There are two common characters used to separate the data in text files into columns: the tab stop character ( \t ) and the humble comma ( , ). In both cases fields in our input file are separated by a single tab stop or a single comma.

This gives rise to two commonly used file extensions (the bit after the dot in file names e.g. in myfile.docx the 'docx' is the file extension). These file extensions are .tsv for 'tab separated values' and .csv for 'comma separated values'. Other separators in text files of data may be little used characters such as '|', ':' or spaces. As a quick aside, separating fields in data files by spaces is tricky because an individual value might also contain spaces and therefore be inappropriately divided over more than one column.

As an illustration, if you were to print a couple of lines from a .tsv file out it would look something like this:

data_field1\tdata_field2\tdata_field3\n
data_field1\tdata_field2\tdata_field3\n

Each field is separated by a tab stop \t and the end of a line is indicated by the \n character.

In a .csv file you would see:

data_field1,data_field2,data_field3\n
data_field1,data_field2,data_field3\n

Note:
It's important to note that there is NO STANDARD that everyone follows for either the layout OR naming of text files. In the exercises that follow the text file we will be working with has the extension .csv, which should mean comma separated, but in fact the data is separated by tabs - because that's what I'm in the habit of doing. Please ignore my bad habits!

Opening (and Reading) Files

When you open a file in a programme such as Word or Excel you have to select the file you want from a 'File -> Open' dialogue of some kind. In other words you have to point the programme at the specific file you want, which is stored at some specific location on your computer. This is also true when you open a file in Python. The difference is that when you open a file in python you have to specify the file location in words rather than selecting it from a dialogue. This textual representation of the file location is called the filepath. So the first thing we have to do is assign the file location to a variable.

Once we have the filepath we can use the python function open() to 'get a handle' on the file. This file handle should also be assigned to a variable. We can then operate on that variable. Let's see an example.

In [1]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        print(f_hand)

<_io.TextIOWrapper name='data/elderlyHeightWeight.csv' mode='r' encoding='UTF-8'>

In the example above we have opened a small file containing height and weight information on a group of elderly men and women from a body composition study.

In the first line we assign the file path to the variable file_loc . We then use that variable to open a file handle ( f_hand ) using the open() function. The file location is the first argument to the open() function and the second argument 'r' indicates that we want to read from the file i.e. we want access to the data in the file but we don't want to change the file itself. Finally we use the print() function to print the file handle.

The results of the print() call might surprise you. Rather than printing the contents of the file, what we get is a description of the file handle object itself: its type ( _io.TextIOWrapper ), the name of the file it points to, the mode it was opened in and the text encoding.

Whilst we can also open Excel or Word files in python this requires the use of special software libraries. We'll see some of those later in the course. Mostly when we are analysing data in files we open simple text files. Both Excel and Word can save files out as simple text files.

The first few lines of the file we will work with are shown below.

In [2]: !head -n 4 data/elderlyHeightWeight.csv

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
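Because tabs and spaces look alike on screen, it can help to look at the raw form of a line before deciding how to split it. The following is only a minimal sketch: it writes the four lines shown above to a scratch file (the temporary path and the helper name peek are assumptions made for illustration, not part of the course material) and then shows the repr() of the first lines, which makes the \t and \n characters visible.

```python
import os
import tempfile

# Hypothetical stand-in for data/elderlyHeightWeight.csv: the four lines
# shown by head above, written with tab separators.
sample = (
    "Gender\tAge\tbody wt\tht\n"
    "F\t77\t63.8\t155.5\n"
    "F\t80\t56.4\t160.5\n"
    "F\t76\t55.2\t159.5\n"
)

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.csv')
f_out = open(sample_path, 'w')
f_out.write(sample)
f_out.close()


def peek(path, n=2):
    # Return the repr() of the first n lines so that tab and newline
    # characters show up as escape sequences.
    shown = []
    f_hand = open(path, 'r')
    for i, line in enumerate(f_hand):
        if i == n:
            break
        shown.append(repr(line))
    f_hand.close()
    return shown


for text in peek(sample_path):
    print(text)
```

Seeing \t between the fields confirms that this .csv file is really tab separated, as the note above warns.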
While the file handle does not contain the data from the file (it only points at it), it is easy to construct a for loop to cycle through the lines of the file and carry out some computation. For example we can easily count the number of lines in the file.

In [3]: count = 0                # initialise
        for line in f_hand:      # iterate
            count = count + 1

        print('There are %d lines in the file.' % count)

There are 19 lines in the file.

Why doesn't open() open the file directly?

It might seem stupid that after using open() a file handle is created but the file contents aren't directly available. The reason the open() function does not read the whole file immediately but just points to it is to do with file size. If you do not know in advance the size of the files you are dealing with (often the case - how often do you check the size of files you open on your computer?) automatically reading very large files could:

Take a long time

Crash the whole computer system - essentially you'd run out of memory

For some biological applications the data files (which may well be text files e.g. in RNA Seq data - see comment here (http://seqanswers.com/forums/showthread.php?t=10787)) can be very large. So it's safer to point to the file rather than automatically read it. This is also true of other data or informatics applications.

In the for loop above python splits the file into lines based on the newline character ( \n - the split is implicit), increments the count variable by 1 for every line and then discards that line. So our code only ever holds one line from the file at any given time.

If you know that your file is likely to be small you can read the whole file into memory with the read() method (remember the dot notation!).

This reads the entire contents of the file, newlines and all, into one large string.

In [4]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        f_data = f_hand.read()
        print(f_data)

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
F 77 58.5 151
F 82 64 165.5
F 78 51.6 167
F 85 54.6 154
F 83 71 153
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

In [5]: f_hand.close()
        f_hand = open(file_loc)
        print(f_hand.readline(), end = '')
        print(f_hand.readline(), end = '')
        f_hand.close()
        f_hand = open(file_loc)
        lines = f_hand.readlines()
        print(type(lines))

Gender Age body wt ht
F 77 63.8 155.5
<class 'list'>

In [6]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        f_data = f_hand.read()
        f_data[:22]
        print(len(f_data))
        print( f_data[:10])
        r'{}'.format(f_data[:22]) # note the tab stop in the output

286
Gender Age

Out[6]: 'Gender\tAge\tbody wt\tht\n'


In the above example we first create the file handle and then read the entire contents of the file into one string. We check the length of that string (286 characters including whitespace characters like \n ) and we print the first 10 characters (refer back to the material on slicing if you're unsure how the [:10] slice works). The print() function interprets the tab stop properly but if we just ask for the first 22 characters to be returned (i.e. we do not use print ) we can see the tab stop and the \n . Compare this to the illustration of a .tsv file shown above.

Using print() and subsetting is fine but not convenient. You might want to print the whole of the first line. The readline() method will read one line at a time and you can use this to e.g. just display the header line (if you know or suspect that your file has a header line).

In [7]: f_hand = open(file_loc, 'r')
        line = f_hand.readline() # reads first line
        print(line)
        # next line
        line = f_hand.readline()
        print(line)

Gender Age body wt ht

F 77 63.8 155.5

After reading in the current line readline() then moves on to the next line. So calling readline() again returns the next line in the file. One other thing to bear in mind is that readline() leaves whitespace, and in particular the \n character, at the end of the line. You can see that above (there's a blank line between the printed lines) and in the following example.

In [8]: line = f_hand.readline() # note next line has been read
        line # compare to print above

Out[8]: 'F\t80\t56.4\t160.5\n'

If you're using python to join lines together in some new format that might not be what you want. There is a method, strip() (see here (https://docs.python.org/3/library/string.html)), that removes whitespace from both ends of a string and can be used to remove this potentially extraneous \n character. Note also that methods can be chained together so you can use readline() and strip() sequentially using the following syntax.

In [10]: f_hand = open(file_loc, 'r') # open the file again to get the header line
         line = f_hand.readline().strip() # read the line then strip the whitespace at the end of the line
         line # no \n!

Out[10]: 'Gender\tAge\tbody wt\tht'

Use of strip() has removed the \n from our selected line. There are also lstrip() and rstrip() methods that strip whitespace from only the left or right sides of a string respectively.

Just to confuse you further there's also a readlines() method that reads all the lines in the file into a list . Again the lines are separated on the invisible \n character. This can be handy because you can assign the list to a variable and then loop through the list to print file lines or simply extract the lines you want using slice notation.

In [11]: f_hand = open(file_loc, 'r')
         lines = f_hand.readlines()
         print(lines[0:2]) # check we have a list
         print(len(lines))

['Gender\tAge\tbody wt\tht\n', 'F\t77\t63.8\t155.5\n']
19

In [12]: for i in range(4):
             print (lines[i].strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Alternative Implementations (just for fun)

no. 1

In [13]: for line in lines[:4]:
             print(line.strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

no. 2


In [14]: for i, line in enumerate(lines):
             if i == 4:
                 break
             print(line.strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Just like readline() , the readlines() method leaves the trailing \n at the end of each line but you can use strip() to remove it if you have to, as we did above. We had to use the strip() method on the individual line rather than on the list of lines because strip() is a string method and does not operate on lists. Try moving the strip() to the end of lines = f_hand.readlines() and see what kind of error you get.

Finally (and perhaps most usefully) there is the string method splitlines() . Applied to the string returned by read() it gives the same list of lines as readlines() but drops the trailing \n automatically.

In [16]: f_hand = open(file_loc, 'r')
         lines = f_hand.read().splitlines() # read file, then split into a list of lines, drops trailing \n
         for i in range(4):
             print (lines[i])

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Notice we didn't have to use strip() .

One final thing to note is that whenever we finish with a file we should close it. Leaving files 'open' after we have finished with them ties up system resources and, for files opened for writing, can leave data unwritten or the file corrupted. Closing files is accomplished by using the close() method on the file handle. The next example also illustrates a simple filter to print out only the male data using the string method startswith() - which returns a boolean value depending on whether the line begins with the given argument (M in this case) or not.

In [17]: file_loc = 'data/elderlyHeightWeight.csv' # relative path
         f_hand = open(file_loc, 'r')
         lines = f_hand.read().splitlines() # lines to a list
         print (lines[0]) # header

         for line in lines: # loop to filter
             if line.startswith('M'):
                 print (line)

         f_hand.close()

Gender Age body wt ht
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

filter alternative

In [18]: def filter_function(line):
             return line.startswith('M')

In [19]: f_hand = open(file_loc)
         lines = f_hand.readlines()
         male_gender = filter(filter_function, lines)
         print(lines[0].strip())
         for ml in male_gender:
             print(ml.strip())
         f_hand.close()

Gender Age body wt ht
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5
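One detail of the filter() version is easy to miss: filter() does not build a list. It returns an iterator that hands over matching lines one at a time as the for loop asks for them, and it can only be walked through once. The sketch below illustrates this with a short in-memory list of lines in the same layout as the data file (the three data rows are taken from the file shown earlier; nothing is read from disk).

```python
# A few lines in the same layout as the data file (tabs between fields).
lines = [
    'Gender\tAge\tbody wt\tht\n',
    'F\t77\t63.8\t155.5\n',
    'M\t79\t75.5\t171\n',
    'M\t75\t83.9\t178.5\n',
]


def filter_function(line):
    return line.startswith('M')


male_gender = filter(filter_function, lines)
print(type(male_gender))          # a filter object, not a list

first_pass = list(male_gender)    # walking through it consumes the iterator
second_pass = list(male_gender)   # nothing is left the second time

print(len(first_pass), len(second_pass))
```

If the filtered lines are needed more than once, wrapping the call in list() once and reusing that list is one simple option.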

Using lambda expressions

In [29]: f_hand = open(file_loc)
         male_gender = filter(lambda l: l.startswith('M'), f_hand)
         for ml in male_gender:
             print(ml.strip())
         f_hand.close()

M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

Exercises

Show the content of the file using a Shell command

Tip 1: The shell command to be used could be cat

Tip 2: Remember the ! (exclamation mark)

In [22]: !cat data/elderlyHeightWeight.csv

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
F 77 58.5 151
F 82 64 165.5
F 78 51.6 167
F 85 54.6 154
F 83 71 153
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

Ex. no 2

Print all the lines in the file where the Age value is in the range [70, 80)

In [23]: f_hand = open(file_loc)
         for i, line in enumerate(f_hand):
             if i == 0:
                 continue
             line = line.strip()
             _, age, *_ = line.split('\t')
             if 70 <= int(age) < 80:
                 print(line)

F 77 63.8 155.5
F 76 55.2 159.5
F 77 58.5 151
F 78 51.6 167
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 76 56.2 167
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168

Ex. no 3

For each gender, print the line in the file corresponding to the entry with the (relative) maximum value of body weight ( body wt ) plus height ( ht ).

Sol #1 : Using a Dictionary

In [24]: info = {} # Dictionary holding per-sex lines info
         f_hand = open(file_loc)
         lines = f_hand.read().splitlines()
         for l in lines[1:]:
             l = l.strip()
             key = l[0]
             info.setdefault(key, [])
             info[key].append(tuple(l.split('\t')))
In [25]: from pprint import pprint # pprint is for **pretty printing** structures
         pprint(info)

{'F': [('F', '77', '63.8', '155.5'),
       ('F', '80', '56.4', '160.5'),
       ('F', '76', '55.2', '159.5'),
       ('F', '77', '58.5', '151'),
       ('F', '82', '64', '165.5'),
       ('F', '78', '51.6', '167'),
       ('F', '85', '54.6', '154'),
       ('F', '83', '71', '153')],
 'M': [('M', '79', '75.5', '171'),
       ('M', '75', '83.9', '178.5'),
       ('M', '79', '75.7', '167'),
       ('M', '84', '72.5', '171.5'),
       ('M', '76', '56.2', '167'),
       ('M', '80', '73.4', '168.5'),
       ('M', '75', '67.7', '174.5'),
       ('M', '75', '93', '168'),
       ('M', '78', '95.6', '168'),
       ('M', '80', '75.6', '183.5')]}

In [26]: max_male = max(info['M'], key=lambda e: float(e[2]) + float(e[3]))
         print(max_male)

('M', '78', '95.6', '168')

Sol. #2 : Using a list comprehension

In [27]: ## Creating Partial Lists using **List Comprehension**
         males = [l.strip().split('\t') for l in lines[1:]
                  if l.startswith('M')]
         females = [l.strip().split('\t') for l in lines[1:]
                    if l.startswith('F')]

In [28]: males

Out[28]: [['M', '79', '75.5', '171'],
          ['M', '75', '83.9', '178.5'],
          ['M', '79', '75.7', '167'],
          ['M', '84', '72.5', '171.5'],
          ['M', '76', '56.2', '167'],
          ['M', '80', '73.4', '168.5'],
          ['M', '75', '67.7', '174.5'],
          ['M', '75', '93', '168'],
          ['M', '78', '95.6', '168'],
          ['M', '80', '75.6', '183.5']]

In [30]: max_male = max(males, key=lambda e: float(e[2]) + float(e[3]))
         print(max_male)

['M', '78', '95.6', '168']

The csv module

Getting the data from a file and doing something with it is all well and good. However once we've done our analysis we usually want to save the results to another file. We can do this using base python but it's easier if we use a python library, in this case the csv (http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/) library. We'll learn more about libraries in the next unit but for now just consider libraries as extra python code that you can get access to if you need it. In fact that's exactly what many libraries are. So the question arises 'how do we get access to a library?'. We have to tell python we want to use the library up front. To do this we use the import statement.

In [31]: import csv

It's that simple! Now python makes available to us all the useful code in the csv library. The csv library, unsurprisingly, contains python functions and methods to make dealing with csv (and other) text files easier. Let's first see how to open a text file using the csv library and print out the first few lines.

To read data from a csv file, we use the reader() function. The reader() function takes the file handle and makes a reader object that gives us each row of the input data as a list. Objects in programming are containers for both data and methods that act on that data (a bit esoteric so don't worry if you don't quite get that). A reader object can be passed to python's built-in next() function, which returns the next row. We can use this to access one row at a time. Notably once we have processed a row it's gone from the reader object.

Note:
From here on, we are going to keep using the with/as statement to handle I/O operations, namely Context Manager objects.

For more information, see this notebook (09 Exceptions.ipynb#ctx).
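As a rough illustration of what the with/as statement does for file handling, the sketch below checks the closed attribute of a file handle inside and after a with block. The scratch file path is an assumption made so that the example does not touch the course data.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')

with open(path, 'w') as f_out:
    f_out.write('Gender\tAge\n')
    inside = f_out.closed      # False: the file is still open inside the block

after = f_out.closed           # True: it was closed when the block ended

# The file is closed even if an error interrupts the block.
try:
    with open(path, 'r') as f_hand:
        raise ValueError('something went wrong while reading')
except ValueError:
    pass
closed_after_error = f_hand.closed

print(inside, after, closed_after_error)
```

This prints False True True: there is no need to call close() yourself, and the file is not left open if the code inside the block fails.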


In [36]: # import csv - already done
         with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
             reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter
             header = next(reader)
             print (header)
             print () # blank line

             for i in range(4):
                 print (next(reader)) # print the first 4 lines after the header

['Gender', 'Age', 'body wt', 'ht']

['F', '77', '63.8', '155.5']
['F', '80', '56.4', '160.5']
['F', '76', '55.2', '159.5']
['F', '77', '58.5', '151']

We can see that the reader() function has processed each line into a list, with one element per field, based on the field delimiter we supplied. Importantly also note that all the values are now of type str in each list (everything is in quotes). This is important if you want to do calculations on these values.

Using the csv module makes it easy to select whole columns by selecting the data we want from the reader . We'll use next() to read the header row and find the column order, and then iterate over the remaining rows with a for loop to pull out height and weight.

In [37]: with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
             reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter

             # use next() on the reader object to identify the headers
             headers = next(reader)
             print(headers)
             # we now know weight index is 2, height index is 3

             weight = ['Weight'] # list to hold data, put in header
             height = ['Height']

             for row in reader:
                 weight.append(row[2])
                 height.append(row[3])

             print (weight)
             print (height)

['Gender', 'Age', 'body wt', 'ht']
['Weight', '63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
['Height', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

The for loop above iterates over the remaining rows of the input file. From each row we simply capture the two values we want and add these to lists. We could then further process the data in these two lists.

Writing files

In order to open a file for writing we use the 'w' parameter in our open() statement. Rather obviously 'w' stands for write. If the file doesn't exist a new file is created with the given name and extension.

Note that if the file exists then opening it with the 'w' argument removes any data that was in the file and overwrites it with what you put in. This may not be what you wanted to do. We'll cover how you append data to a file without overwriting the contents shortly.

Once we have an open file we can write data to it with the write() method applied to the file handle.

Let's open a file and write some data to it.

In [40]: with open('data/test.txt', 'w') as f_out:
             for i in range(10):
                 line = 'Line ' + str(i) + '\n'
                 f_out.write(line)

If you run the above code a new file called test.txt should appear in your data directory (notice we opened the writeable file in the data directory). That file should have 10 lines in it, each with the word 'Line' and a number from 0-9.

In the above code we first opened (created) the file test.txt and then ran through a range of numbers (from 0 to 9) using a for loop. At each iteration of the loop we concatenated (joined) the word 'Line' to the string representation of the number (note the use of str ) and a newline character. Finally we wrote each of the resulting strings to our new file. We did not have to close the file ourselves: the with statement closes it when the block ends.

Putting it together!

Write a script that uses the csv module to open a file after getting a filepath from the user. Use the script to open the elderlyHeightWeight.csv file. Write out a new file containing only male data. Remember to close all the files once you're done. In addition include a try/except clause to handle the situation where the requested file doesn't exist.

Hint: the rows produced by a csv.reader object are lists. Recall how you .join() list elements into a string.
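The two points made above about rows (each one is a list, and every value in it is a str ) can be sketched on a small in-memory sample. The use of io.StringIO in place of a real file is an assumption made to keep the example self-contained; the rows are taken from the data file shown earlier.

```python
import csv
import io

# In-memory stand-in for the data file (rows as shown earlier in the file).
sample = io.StringIO(
    "Gender\tAge\tbody wt\tht\n"
    "F\t77\t63.8\t155.5\n"
    "M\t79\t75.5\t171\n"
)

reader = csv.reader(sample, delimiter='\t')
header = next(reader)      # the header row, as a list of column names
rows = list(reader)        # every remaining row is a list of str values

# Rebuild a tab-separated line from one row, ready to pass to write().
line_out = '\t'.join(rows[1]) + '\n'

# Values have to be converted before doing arithmetic with them.
total_weight = sum(float(row[2]) for row in rows)

print(repr(line_out))
print(total_weight)
```

Joining with '\t' gives back a line in the same layout as the input, which is what write() needs; float() is what turns '63.8' into a number you can calculate with.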
Adding data to an existing file

As noted above, if you open an existing file and write data to it all that pre-existing data gets overwritten. That's not usually what you want to do. In fact in general you probably never want to write to any file that has raw data you are going to analyse in it - because you might lose or screw up your original data. Sometimes however you might want to add new measurements (perhaps taken over time) to an existing file. For these cases there's the 'a' argument to the open() function. The a stands for append. Let's take the file containing only the male data we wrote in the last exercise, open it in append mode and write the female data to that file.

In [7]: import csv

        # assumes your file was called male_data.tsv
        try:
            with open('data/male_data.tsv', 'a') as new_file, open('data/elderlyHeightWeight.csv', 'r') as f_hand:
                reader = csv.reader(f_hand, delimiter='\t') # define the field delimiter
                female_data = [line for line in reader if line[0] == 'F'] # keep only the female rows
                for line in female_data:
                    new_file.write('\t'.join(line)+'\n')
        except FileNotFoundError:
            print('The file does not exist.')

Processing and writing file data

Let's do something a bit more useful than just copying data around from one file to another. Often when we have demographic data like this one of the things we want to do is create new variables from that data. The elderlyHeightWeight.csv file contains... eh, well... height and weight data from a sample of elderly study participants. One obvious new variable we could create from this is BMI. However we'll save that for the exercise!

Instead we'll demonstrate the process by converting the height from cm to m - a simple division by 100. We can write this data to a new column. The strategy we'll use is to read each field of the data into a separate list. We will process the appropriate list and then use the writer() (https://docs.python.org/2/library/csv.html) function of the csv module to write our new file including processed height data.

We'll use a slightly different approach here from that demonstrated above (previous height & weight example). Instead of picking values out of each row by position, we'll read each row as a dictionary keyed by column header (using csv.DictReader ) and loop over its key/value pairs.

In [15]: # import csv - done above
         from collections import defaultdict

         with open('data/elderlyHeightWeight.csv', 'r') as f_hand:
             csv_info = dict()
             reader = csv.DictReader(f_hand, delimiter='\t') # define the field delimiter
             for entry in reader:
                 for key, value in entry.items():
                     if key not in csv_info:
                         csv_info[key] = [] # initialise as an Empty list
                     csv_info[key].append(value)

         for key, value in csv_info.items():
             print('{}: \n\t {}'.format(key, value))

Gender:
	 ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M']
Age:
	 ['77', '80', '76', '77', '82', '78', '85', '83', '79', '75', '79', '84', '76', '80', '75', '75', '78', '80']
body wt:
	 ['63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
ht:
	 ['155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

In this code snippet we built a dictionary, csv_info , holding one list per column of our data, keyed by the column header. We iterated over the rows of the data and appended each value to the list for its column. The examples that follow assume four lists named gender , age , weight and height , one per column, each with the column header as its first entry (for example height = ['ht'] + csv_info['ht'] ).

If you examine these lists you'll see that the first entry is the column header (which is handy for tracking data) and the other entries are the actual data for that column in the original file.

In [43]: print (height)

['ht', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

Now we have the data separated out it's a trivial effort to calculate the height in meters (from the given height in cm). In the code below we use the slice height[1:] to get the actual heights (i.e. we skip the column header), we convert those heights from str to float , and we calculate the height in meters and append this to a new list.
In [44]: height_m = []
         height_m.append('ht_m') # a new header

         # slice from index 1 so we don't get the header again
         for ht in height[1:]:
             height_m.append(float(ht)/100) # note the conversion to a float here

         print (height_m)

['ht_m', 1.555, 1.605, 1.595, 1.51, 1.655, 1.67, 1.54, 1.53, 1.71, 1.785, 1.67, 1.715, 1.67, 1.685, 1.745, 1.68, 1.68, 1.835]

Now we have all the data we need to write the new file. First we'll build each line of our new file (with the zip() function) and then write each line to the new file. The csv library provides a writer object that wraps the file handle. One method of writer objects is .writerow() , the use of which is demonstrated below.

In [45]: # newline='' stops the csv module's line endings being translated again on some platforms
         with open('data/new_data.csv', 'w', newline='') as newdata_file:
             writer = csv.writer(newdata_file, delimiter='\t') # define a writer object

             # iterate over data and write to file
             # use zip to create the tuples for writing
             for row in zip(gender, age, weight, height, height_m):
                 writer.writerow(row)

Remember that the zip() (https://docs.python.org/3/library/functions.html#zip) function will create an iterator (i.e. a zip object ) made up of tuples . In the example above the use of zip() creates a sequence, the first element of which is all the first elements of our data lists, the second element is all the second elements etc. It's easier to see this than explain it.

In [47]: zip_sequence = zip(gender, age, weight, height, height_m)
         print(type(zip_sequence))

<class 'zip'>

In [48]: print (gender[:4])
         print (age[:4])
         print (weight[:4])
         print (height[:4])
         print (height_m[:4])
         print () # just a blank line
         print (list(zip_sequence)[:4])

['Gender', 'F', 'F', 'F']
['Age', '77', '80', '76']
['body wt', '63.8', '56.4', '55.2']
['ht', '155.5', '160.5', '159.5']
['ht_m', 1.555, 1.605, 1.595]

[('Gender', 'Age', 'body wt', 'ht', 'ht_m'), ('F', '77', '63.8', '155.5', 1.555), ('F', '80', '56.4', '160.5', 1.605), ('F', '76', '55.2', '159.5', 1.595)]

The first element in each of our data lists is the column header. The zip() function captures these first elements into a tuple - ('Gender', 'Age', 'body wt', 'ht', 'ht_m') - and this, in turn, becomes the first element of the zipped sequence (think of it as a list called data_out ; the writing code iterates over the zip object directly rather than storing it under that name). The zip() function then captures all the second elements from each data list and these become part of a tuple which is the second element of the sequence. In this way each data list is 'zipped up' with the other lists.

To output the rows we simply iterate over the zipped sequence and send each element to our output file as a row using the .writerow() method.

Putting it together 1

Open the elderlyHeightWeight.csv using the functions in the csv module and extract each column to a separate list. Use the height and weight data to calculate the BMI for each subject. Use zip() to create a list of data to write out and write all the phenotype data including BMI back to a new file.

Hint - if you use the csv.reader() remember the issues with the str type in lists.

Putting it together 2

Read the file you just created back in and select only those trial participants who are obese. Print the sex, age and BMI of these people. Obese means a BMI of 30 or more.
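The exercises above rely on the standard definition of BMI: weight in kilograms divided by the square of height in metres. A minimal sketch of that calculation follows; the function names bmi and is_obese are made up for illustration, and the inputs are str values as they would arrive from csv.reader() .

```python
def bmi(weight_kg, height_cm):
    # Values read with the csv module are strings, so convert them first.
    height_m = float(height_cm) / 100
    return float(weight_kg) / height_m ** 2


def is_obese(bmi_value):
    # Threshold as stated in the exercise: a BMI of 30 or more.
    return bmi_value >= 30


# Two rows from the data file: ('F', '77', '63.8', '155.5') and ('M', '75', '93', '168')
first = bmi('63.8', '155.5')
second = bmi('93', '168')

print(round(first, 2), is_obese(first))
print(round(second, 2), is_obese(second))
```

Rounding is only applied when printing here, so the unrounded values stay available for any later calculation.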
Homework
The nhanes.tsv file in the data directory contains data on 4581 Americans aged from 20 to 70 from the
2011-2012 NHANES (http://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?Component=Demographics&CycleBeginYear=2011) survey. The data included are

individual number (unique ID for each individual in NHANES)


age (years)
sex (1 = M, 2 = F)
weight (kg)
height (cm).

Write a script that will read this data and count the number of NA values in height and/or weight and count the
number of males and females.

Calculate the BMI for each individual, add this to the original data and write out a new file including BMI data.

Finally calculate the mean BMI for males and females and write these out as well (to 2 decimal places).

Hint: In this exercise you should use the techniques you have learned to loop over the lines of a file and extract
each variable into its own list. You can then calculate the BMI values easily. However you won't be able to
calculate a BMI for individuals with 'NA' in either weight or height columns. How can you use the continue
keyword when you loop over your data to avoid collecting values for these individuals?
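As a reminder of how continue behaves in a loop like this, here is a small sketch on made-up rows. The rows and the column order (ID, age, sex, weight, height) are hypothetical and only follow the order in which the variables are listed above; they are not taken from nhanes.tsv .

```python
# Hypothetical rows: ID, age, sex, weight (kg), height (cm). Not real NHANES data.
rows = [
    ['1', '34', '1', '80.0', '175.0'],
    ['2', '51', '2', 'NA', '160.0'],
    ['3', '29', '2', '62.5', 'NA'],
    ['4', '45', '1', '90.0', '180.0'],
]

weights = []
heights = []
n_missing = 0

for row in rows:
    weight, height = row[3], row[4]
    if weight == 'NA' or height == 'NA':
        n_missing += 1
        continue           # jump straight to the next row
    # Only reached for rows with both measurements present.
    weights.append(float(weight))
    heights.append(float(height))

print(n_missing)
print(weights)
print(heights)
```

Everything after continue in the loop body is skipped for that row, so incomplete rows never reach the append() calls.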
