Day_10 Python External Files

Uploaded by

hatim

Day-10 Python File Handling

Text Files

Data is often stored in formats that are not easy to read unless you have some specialised software. Excel spreadsheets are one example; how do you read an Excel spreadsheet if you don't have Excel? This reliance on particular software is not very useful for programmatic analysis of data. Instead, if we're going to use a programming language to analyse or process our data, we would prefer some "easy to deal with" format. Text files or flatfiles (http://en.wikipedia.org/wiki/Text_file) provide just the sort of format we need (there are other choices). You can think of these as a single sheet from a spreadsheet with the data arranged in rows (separated by our old friend the \n character) and columns (separated by some other character).

There are two common characters used to separate the data in text files into columns: the \t , or tab stop, character and the humble comma (,). In both cases fields in our input file are separated by a single tab stop or a single comma.

This gives rise to two commonly used file extensions (the bit after the dot in file names e.g. in myfile.docx the 'docx' is the file extension). These file extensions are .tsv for 'tab separated file' and .csv for 'comma separated file'. Other separators in text files of data may be little used characters such as '|', ':' or spaces. As a quick aside, separating fields in data files by spaces is tricky because an individual value might also contain spaces and therefore be inappropriately divided over more than one column.

As an illustration, if you were to print a couple of lines from a .tsv file out it would look something like this:

data_field1\tdata_field2\tdata_field3\n
data_field1\tdata_field2\tdata_field3\n

Each field is separated by a tab stop \t and the end of a line is indicated by the \n character.

In a .csv file you would see:

data_field1,data_field2,data_field3\n
data_field1,data_field2,data_field3\n

Note:
It's important to note that there is NO STANDARD for either the layout OR naming of text files. In the exercises that follow the text file we will be working with has the extension .csv which should mean comma separated but in fact the data is separated by tabs - because that's what I'm in the habit of doing. Please ignore my bad habits!

Opening (and Reading) Files

When you open a file in a programme such as Word or Excel you have to select the file you want from a 'File -> Open' dialogue of some kind. In other words you have to point the programme at the specific file you want, which is stored at some specific location on your computer. This is also true when you open a file in Python. The difference is that when you open a file in python you have to specify the file location in words (as a filepath) rather than selecting from a dialogue. So the first thing we have to do is assign the file location to a variable. This textual representation of the file location is called the filepath.

Once we have the filepath we can use the python function open() to 'get a handle' on the file. This file handle should also be assigned to a variable. We can then operate on that variable. Let's see an example.

In [1]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        print(f_hand)

<_io.TextIOWrapper name='data/elderlyHeightWeight.csv' mode='r' encoding='UTF-8'>

In the example above we have opened a small file containing height and weight information on a group of elderly men and women from a body composition study.

In the first line we assign the file path to the variable file_loc . We then use that variable to open a file handle ( f_hand ) using the open() function. The file location is the first argument to the open() function and the second argument 'r' indicates that we want to read from the file i.e. we want access to the data in the file but we don't want to change the file itself. Finally we use a print statement to print the file handle.

The results of the print statement might surprise you. Rather than printing the contents of the file, what we get is a short description of the file handle itself: the kind of object it is ( _io.TextIOWrapper ), the name of the file, the mode it was opened in and the text encoding.

Whilst we can also open Excel or Word files in python this requires the use of special software libraries. We'll see some of those later in the course. Mostly when we are analysing data in files we open simple text files. Both Excel and Word can save files out as simple text files.

The first few lines of the file we will work with are shown below.

In [2]: !head -n 4 data/elderlyHeightWeight.csv

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
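
Since the extension does not guarantee the separator, it can help to look at a raw line before deciding how to split it. The following is only a minimal sketch: header_line is a made-up string shaped like the header of the course file, and counting candidate separators is one simple heuristic among several.

```python
# Hypothetical header line, written the way it appears in a tab-separated file.
header_line = 'Gender\tAge\tbody wt\tht\n'

# Count the two candidate separators in the raw line.
tab_count = header_line.count('\t')
comma_count = header_line.count(',')

# Pick whichever separator actually occurs more often.
delimiter = '\t' if tab_count > comma_count else ','

# Remove the trailing newline, then split the line into fields.
fields = header_line.rstrip('\n').split(delimiter)

print(repr(delimiter))
print(fields)
```

Note that 'body wt' contains a space, which is exactly why splitting this file on spaces would give the wrong number of columns.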
While the file handle does not contain the data from the file (it only points at it) it is easy to construct a for loop to cycle through the lines of the file and carry out some computation. For example we can easily count the number of lines in the file.

In [3]: count = 0                # initialise
        for line in f_hand:      # iterate
            count = count + 1

        print('There are %d lines in the file.' % count)

There are 19 lines in the file.

Why doesn't open() open the file directly?

It might seem stupid that after using open() a file handle is created but the file contents aren't directly available. The reason the open() function does not read the whole file immediately but just points to it is to do with file size. If you do not know in advance the size of the files you are dealing with (often the case - how often do you check the size of files you open on your computer?) automatically reading very large files could:

Take a long time

Crash the whole computer system - essentially you'd run out of memory

For some biological applications the data files (which may well be text files e.g. in RNA Seq data - see comment here (http://seqanswers.com/forums/showthread.php?t=10787)) can be very large. So it's safer to point to the file rather than automatically read it. This is also true of other data or informatics applications.

In the for loop above python splits the file into lines based on the newline character ( \n - the split is implicit), increments the count variable by 1 for every line and then discards that line. So there is only ever one line from the file in the computer memory at any given time.

If you know that your file is likely to be small you can read the whole file into memory with the read() method (remember the dot notation!).

This reads the entire contents of the file, newlines and all, into one large string.

In [4]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        f_data = f_hand.read()
        print(f_data)

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
F 77 58.5 151
F 82 64 165.5
F 78 51.6 167
F 85 54.6 154
F 83 71 153
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

In [5]: f_hand.close()
        f_hand = open(file_loc)
        print(f_hand.readline(), end='')
        print(f_hand.readline(), end='')
        f_hand.close()
        f_hand = open(file_loc)
        lines = f_hand.readlines()
        print(type(lines))

Gender Age body wt ht
F 77 63.8 155.5
<class 'list'>

In [6]: file_loc = 'data/elderlyHeightWeight.csv'
        f_hand = open(file_loc, 'r')
        f_data = f_hand.read()
        f_data[:22]
        print(len(f_data))
        print(f_data[:10])
        r'{}'.format(f_data[:22]) # note the tab stop in the output

286
Gender Age

Out[6]: 'Gender\tAge\tbody wt\tht\n'

In the above example we first create the file handle and then read the entire contents of the file into one string. We check the length of that string (286 characters including whitespace characters like \n ) and we print the first 10 characters (refer back to the material on slicing if you're unsure how the [:10] slice works). The print statement interprets the tab stop properly but if we just ask for the first 22 characters to be returned (i.e. we do not use print ) we can see the tab stop and the \n . Compare this to the illustration of a .tsv file shown above.

Using the print statement and subsetting is fine but not convenient. You might want to print the whole of the first line. The readline() method will read one line at a time and you can use this to e.g. just display the header line (if you know or suspect that your file has a header line).

In [7]: f_hand = open(file_loc, 'r')
        line = f_hand.readline() # reads first line
        print(line)
        # next line
        line = f_hand.readline()
        print(line)

Gender Age body wt ht

F 77 63.8 155.5

After reading in the current line readline() then moves on to the next line. So calling readline() again uses the next line in the file. One other thing to bear in mind is that readline() leaves whitespace, and in particular the \n character, at the end of the line. You can see that above (there's a blank line between the printed lines) and in the following example.

In [8]: line = f_hand.readline() # note next line has been read
        line # compare to print above

Out[8]: 'F\t80\t56.4\t160.5\n'

If you're using python to join lines together in some new format that might not be what you want. There is a method, strip() (see here (https://docs.python.org/3/library/string.html)), that removes whitespace from both ends of a string and can be used to remove this potentially extraneous \n character. Note also that methods can be chained together so you can use readline() and strip() sequentially using the following syntax.

In [10]: f_hand = open(file_loc, 'r') # read in the file again to get header line
         line = f_hand.readline().strip() # read the line then strip the surrounding whitespace
         line # no \n!

Out[10]: 'Gender\tAge\tbody wt\tht'

Use of strip() has removed the \n from our selected line. There are also lstrip() and rstrip() methods that strip whitespace from only the left or right sides of a string respectively.

Just to confuse you further there's also a readlines() method that reads all the lines in the file into a list . Again the lines are separated on the invisible \n character. This can be handy because you can assign the list to a variable and then loop through the list to print file lines or simply extract the lines you want using slice notation.

In [11]: f_hand = open(file_loc, 'r')
         lines = f_hand.readlines()
         print(lines[0:2]) # check we have a list
         print(len(lines))

['Gender\tAge\tbody wt\tht\n', 'F\t77\t63.8\t155.5\n']
19

In [12]: for i in range(4):
             print(lines[i].strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Alternative Implementations (just for fun)

no. 1

In [13]: for line in lines[:4]:
             print(line.strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

no. 2

In [14]: for i, line in enumerate(lines):
             if i == 4:
                 break
             print(line.strip())

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Just like readline() the readlines() method leaves the trailing \n at the end of the line but you can use strip() to remove it if you have to, as we did above. We had to use the strip() method on the individual line rather than on the list of lines, as strip() does not operate on lists. Try moving the strip() to the end of lines = f_hand.readlines() and see what kind of error you get.

Finally (and perhaps most usefully) there is the string method splitlines() which, applied to the result of read() , gives the same list as readlines() but drops the trailing \n automatically.

In [16]: f_hand = open(file_loc, 'r')
         lines = f_hand.read().splitlines() # read file, then split lines to a list, drops trailing \n
         for i in range(4):
             print(lines[i])

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5

Notice we didn't have to use strip() .

One final thing to note is that whenever we finish with a file we should close it. Leaving files 'open' after we are done with them ties up memory and other system resources, and for files being written to it risks data not reaching the disk.

Closing files is accomplished by using the close() method on the file handle. Also illustrated is a simple filter to print out only the male data using the string method startswith() - which returns a boolean value depending on whether the line begins with the given argument (M in this case) or not.

In [17]: file_loc = 'data/elderlyHeightWeight.csv' # relative path
         f_hand = open(file_loc, 'r')
         lines = f_hand.read().splitlines() # lines to a list
         print(lines[0]) # header

         for line in lines: # loop to filter
             if line.startswith('M'):
                 print(line)

         f_hand.close()

Gender Age body wt ht
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

filter alternative

In [18]: def filter_function(line):
             return line.startswith('M')

In [19]: f_hand = open(file_loc)
         lines = f_hand.readlines()
         male_gender = filter(filter_function, lines)
         print(lines[0].strip())
         for ml in male_gender:
             print(ml.strip())
         f_hand.close()

Gender Age body wt ht
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5
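
One detail of filter() that is easy to miss: in Python 3 it returns a lazy iterator rather than a list, so the matching lines can only be looped over once. The sketch below illustrates this with a few made-up lines shaped like the rows of the course file; the names starts_with_m, first_pass and second_pass are purely illustrative.

```python
# Hypothetical lines, shaped like the rows of the course file.
lines = ['Gender\tAge\n', 'F\t77\n', 'M\t79\n', 'M\t75\n']

def starts_with_m(line):
    """Return True for rows whose first character is 'M'."""
    return line.startswith('M')

# filter() builds an iterator; no line has been tested yet.
male_rows = filter(starts_with_m, lines)

# The first loop over the iterator yields the matching rows...
first_pass = [row.strip() for row in male_rows]
# ...and uses it up, so a second loop finds nothing left.
second_pass = [row.strip() for row in male_rows]

print(first_pass)
print(second_pass)
```

If the filtered rows are needed more than once, wrapping the call as list(filter(...)) keeps them around.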

Using lambda expressions

In [29]: f_hand = open(file_loc)
         male_gender = filter(lambda l: l.startswith('M'), f_hand)
         for ml in male_gender:
             print(ml.strip())

         f_hand.close()

M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

Exercises

Show the content of the file using a Shell command

Tip 1: The shell command to be used could be cat

Tip 2: Remember the ! (exclamation mark)

In [22]: !cat data/elderlyHeightWeight.csv

Gender Age body wt ht
F 77 63.8 155.5
F 80 56.4 160.5
F 76 55.2 159.5
F 77 58.5 151
F 82 64 165.5
F 78 51.6 167
F 85 54.6 154
F 83 71 153
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 84 72.5 171.5
M 76 56.2 167
M 80 73.4 168.5
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168
M 80 75.6 183.5

Ex. no 2

Print all the lines in the file where the Age value is in the range [70, 80)

In [23]: f_hand = open(file_loc)
         for i, line in enumerate(f_hand):
             if i == 0:                      # skip the header line
                 continue
             line = line.strip()
             _, age, *_ = line.split('\t')   # keep only the second field
             if 70 <= int(age) < 80:
                 print(line)
         f_hand.close()

F 77 63.8 155.5
F 76 55.2 159.5
F 77 58.5 151
F 78 51.6 167
M 79 75.5 171
M 75 83.9 178.5
M 79 75.7 167
M 76 56.2 167
M 75 67.7 174.5
M 75 93 168
M 78 95.6 168

Ex. no 3

For each gender, print the line of the file with the maximum value of body weight ( body wt ) plus height ( ht ), giving two lines in total.

Sol #1 : Using a Dictionary

In [24]: info = {} # Dictionary holding per-sex lines info
         f_hand = open(file_loc)
         lines = f_hand.read().splitlines()
         f_hand.close()
         for l in lines[1:]:
             l = l.strip()
             key = l[0]
             info.setdefault(key, [])
             info[key].append(tuple(l.split('\t')))
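
The grouping in Sol #1 relies on the dictionary method setdefault() , which returns the value already stored under a key or, if the key is missing, stores the given default and returns that. As a small illustration (rows and groups are hypothetical names, and the data is made up), the two steps can also be chained on one line:

```python
# Hypothetical rows, already split into fields.
rows = [('F', '77'), ('M', '79'), ('F', '80')]

groups = {}
for row in rows:
    key = row[0]
    # setdefault returns the list already stored under key,
    # or stores a new empty list first and returns that.
    groups.setdefault(key, []).append(row)

print(groups)
```

Either form gives one list of rows per key, in the order the keys were first seen.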
In [25]: from pprint import pprint # pprint is for **pretty printing** structures
         pprint(info)

{'F': [('F', '77', '63.8', '155.5'),
       ('F', '80', '56.4', '160.5'),
       ('F', '76', '55.2', '159.5'),
       ('F', '77', '58.5', '151'),
       ('F', '82', '64', '165.5'),
       ('F', '78', '51.6', '167'),
       ('F', '85', '54.6', '154'),
       ('F', '83', '71', '153')],
 'M': [('M', '79', '75.5', '171'),
       ('M', '75', '83.9', '178.5'),
       ('M', '79', '75.7', '167'),
       ('M', '84', '72.5', '171.5'),
       ('M', '76', '56.2', '167'),
       ('M', '80', '73.4', '168.5'),
       ('M', '75', '67.7', '174.5'),
       ('M', '75', '93', '168'),
       ('M', '78', '95.6', '168'),
       ('M', '80', '75.6', '183.5')]}

In [26]: max_male = max(info['M'], key=lambda e: float(e[2]) + float(e[3]))
         print(max_male)

('M', '78', '95.6', '168')

Sol. #2 : Using a list comprehension

In [27]: ## Creating Partial Lists using **List Comprehension**
         males = [l.strip().split('\t') for l in lines[1:]
                  if l.startswith('M')]
         females = [l.strip().split('\t') for l in lines[1:]
                    if l.startswith('F')]

In [28]: males

Out[28]: [['M', '79', '75.5', '171'],
          ['M', '75', '83.9', '178.5'],
          ['M', '79', '75.7', '167'],
          ['M', '84', '72.5', '171.5'],
          ['M', '76', '56.2', '167'],
          ['M', '80', '73.4', '168.5'],
          ['M', '75', '67.7', '174.5'],
          ['M', '75', '93', '168'],
          ['M', '78', '95.6', '168'],
          ['M', '80', '75.6', '183.5']]

In [30]: max_male = max(males, key=lambda e: float(e[2]) + float(e[3]))
         print(max_male)

['M', '78', '95.6', '168']

The csv module

Getting the data from a file and doing something with it is all well and good. However once we've done our analysis we usually want to save the results to another file. We can do this using base python but it's easier if we use a python library, in this case the csv (http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/) library. We'll learn more about libraries in the next unit but for now just consider libraries as extra python code that you can get access to if you need it. In fact that's exactly what many libraries are. So the question arises 'how do we get access to a library?'. We have to tell python we want to use the library up front. To do this we use the import statement.

In [31]: import csv

It's that simple! Now python makes available to us all the useful code in the csv library. The csv library, unsurprisingly, contains python functions and methods to make dealing with csv (and other) text files easier. Let's first see how to open a text file using the csv library and print out the first few lines.

To read data from a csv file, we use the reader() function. The reader() function takes the file handle and makes a reader object that hands back each row of the input data as a list. Objects in programming are containers for both data and methods that act on that data (a bit esoteric so don't worry if you don't quite get that). A reader object can be advanced one row at a time with the built-in next() function. Notably, once we have processed a line it's gone from the reader object.

Note:
From here on, we are going to keep using the with/as statement to handle I/O operations. This statement relies on Context Manager objects.

For more information, see this notebook (09 Exceptions.ipynb#ctx).
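
As a quick illustration of what the with/as statement does for file handles, the sketch below writes and reads a throwaway temporary file (the path comes from the standard tempfile module, not from the course data) and checks the handle's closed attribute inside and outside the block.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on the course data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

with open(path, 'w') as f_out:
    f_out.write('Line 0\n')
    inside = f_out.closed      # False: the file is still open inside the block

outside = f_out.closed         # True: the file was closed when the block ended

with open(path, 'r') as f_in:
    first_line = f_in.readline().strip()

print(inside, outside, first_line)
os.remove(path)                # tidy up the temporary file
```

The file is closed on leaving the block even if an error is raised inside it, which is why no explicit close() call is needed.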

In [36]: # import csv - already done
         with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
             reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter
             header = next(reader)
             print(header)
             print() # blank line

             for i in range(4):
                 print(next(reader)) # print the first 4 lines after the header

['Gender', 'Age', 'body wt', 'ht']

['F', '77', '63.8', '155.5']
['F', '80', '56.4', '160.5']
['F', '76', '55.2', '159.5']
['F', '77', '58.5', '151']

We can see that the reader() function has processed each line into a list of fields based on the field delimiter we supplied. Importantly, also note that all the values in each list are now of type str (everything is in quotes). This is important if you want to do calculations on these values.

Using the csv module makes it easy to select whole columns by selecting the data we want from the reader . We'll use next() to find the column order and then iterate over the rows with a for loop to pull out height and weight.

In [37]: with open('data/elderlyHeightWeight.csv', 'r') as csvfile:
             reader = csv.reader(csvfile, delimiter='\t') # define the field delimiter

             # use next() on the reader object to id the headers
             headers = next(reader)
             print(headers)

             # we now know weight index is 2, height index is 3
             weight = ['Weight'] # list to hold data, put in header
             height = ['Height']

             for row in reader:
                 weight.append(row[2])
                 height.append(row[3])

         print(weight)
         print(height)

['Gender', 'Age', 'body wt', 'ht']
['Weight', '63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
['Height', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

Each pass of the for loop above handles one row of the input file. From each row we simply capture the two values we want and add these to lists. We could then further process the data in these two lists.

Writing files

In order to open a file for writing we use the 'w' argument in our open() call. Rather obviously 'w' stands for write. If the file doesn't exist a new file is created with the given name and extension.

Note that if the file exists then opening it with the 'w' argument removes any data that was in the file and overwrites it with what you put in. This may not be what you wanted to do. We'll cover how you append data to a file without overwriting the contents shortly.

Once we have an open file we can write data to it with the write() method applied to the file handle.

Let's open a file and write some data to it.

In [40]: with open('data/test.txt', 'w') as f_out:
             for i in range(10):
                 line = 'Line ' + str(i) + '\n'
                 f_out.write(line)

If you run the above code a new file called test.txt should appear in your data directory (notice we opened the writeable file in the data/ directory). That file should have 10 lines in it, each with the word 'Line' and a number from 0-9.

In the above code we first opened (created) the file test.txt and then ran through a range of numbers (from 0 to 9) using a for loop. At each iteration of the loop we concatenated (joined) the word 'Line' to the string representation of the number (note the use of str ) and a newline character. Finally we wrote each of the resulting strings to our new file. Because we used a with block, the file was closed automatically when the block ended.

Putting it together!

Write a script that uses the csv module to open a file after getting a filepath from the user. Use the script to open the elderlyHeightWeight.csv file. Write out a new file containing only male data. Remember to close all the files once you're done. In addition include a try/except clause to handle the situation where the requested file doesn't exist.

Hint: each row produced by a csv.reader object is a list. Recall how you .join() list elements into a string.

Adding data to an existing file

As noted above, if you open an existing file and write data to it all that pre-existing data gets overwritten. That's not usually what you want to do. In fact in general you probably never want to write to any file that has raw data you are going to analyse in it - because you might lose or screw up your original data. Sometimes however you might want to add new measurements (perhaps taken over time) to an existing file. For these cases there's the 'a' argument to the open() function. The a stands for append. Let's take the file containing only the male data we wrote in the last exercise, open it in append mode and write the female data to that file.

In [7]: import csv

        # assumes your file was called male_data.tsv
        try:
            with open('data/male_data.tsv', 'a') as new_file, open('data/elderlyHeightWeight.csv', 'r') as f_hand:
                reader = csv.reader(f_hand, delimiter='\t') # define the field delimiter
                female_data = [line for line in reader if line[0] == 'F'] # keep only the female rows
                for line in female_data:
                    new_file.write('\t'.join(line) + '\n')
        except FileNotFoundError:
            print('The file does not exist.')

In [15]: # import csv - done above
         from collections import defaultdict

         with open('data/elderlyHeightWeight.csv', 'r') as f_hand:
             csv_info = dict()
             reader = csv.DictReader(f_hand, delimiter='\t') # define the field delimiter
             for entry in reader:
                 for key, value in entry.items():
                     if key not in csv_info:
                         csv_info[key] = [] # initialise as an Empty list
                     csv_info[key].append(value)

         for key, value in csv_info.items():
             print('{}: \n\t {}'.format(key, value))

Gender:
	 ['F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M']
Age:
	 ['77', '80', '76', '77', '82', '78', '85', '83', '79', '75', '79', '84', '76', '80', '75', '75', '78', '80']
body wt:
	 ['63.8', '56.4', '55.2', '58.5', '64', '51.6', '54.6', '71', '75.5', '83.9', '75.7', '72.5', '56.2', '73.4', '67.7', '93', '95.6', '75.6']
ht:
	 ['155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']
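
The DictReader cell above imports defaultdict but then creates each empty list by hand. For comparison, here is a minimal sketch of the same column collection with defaultdict(list) doing that step. It reads made-up tab-separated text from an in-memory io.StringIO object rather than the course file, and the names text and columns are illustrative only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated data held in memory instead of a file on disk.
text = 'Gender\tAge\nF\t77\nM\t79\n'

# A missing key is created on first use with an empty list as its value.
columns = defaultdict(list)

reader = csv.DictReader(io.StringIO(text), delimiter='\t')
for entry in reader:                  # each entry maps column header -> value
    for key, value in entry.items():
        columns[key].append(value)

print(dict(columns))
```

Unlike csv.reader , DictReader consumes the header line itself, so the lists hold data values only.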

Processing and writing file data

Let's do something a bit more useful than just copying data around from one file to another. Often when we have demographic data like this one of the things we want to do is create new variables from that data. The elderlyHeightWeight.csv file contains... eh, well... height and weight data from a sample of elderly study participants. One obvious new variable we could create from this is BMI. However we'll save that for the exercise!

Instead we'll demonstrate the process by converting the height from cm to m - a simple division by 100. We can write this data to a new column. The strategy we'll use is to read each field of the data into a separate list. We will process the appropriate list and then use the writer() (https://docs.python.org/2/library/csv.html) function of the csv module to write our new file including processed height data.

We'll use a slightly different approach here from that demonstrated above (previous height & weight example). Instead of iterating over the rows we'll use iterator variables in our for loop.

In this code snippet we first initialised four lists - one to hold each column of our data. We then iterated over the columns of the data and assigned each value to its relevant list variable.

If you examine these lists you'll see that the first entry is the column header (which is handy for tracking data) and the other entries are the actual data for that column in the original file.

In [43]: print(height)

['ht', '155.5', '160.5', '159.5', '151', '165.5', '167', '154', '153', '171', '178.5', '167', '171.5', '167', '168.5', '174.5', '168', '168', '183.5']

Now we have the data separated out it's a trivial effort to calculate the height in meters (from the given height in cm). In the code below we slice the list ( height[1:] ) to get the actual heights (i.e. we skip the column header), we convert those heights from str to float and we calculate the height in meters and append this to a new list.
In [44]: height_m = []
         height_m.append('ht_m') # a new header

         # slice from index 1 so we don't get the header again
         for ht in height[1:]:
             height_m.append(float(ht)/100) # note the conversion to a float here

         print(height_m)

['ht_m', 1.555, 1.605, 1.595, 1.51, 1.655, 1.67, 1.54, 1.53, 1.71, 1.785, 1.67, 1.715, 1.67, 1.685, 1.745, 1.68, 1.68, 1.835]

Now we have all the data we need to write the new file. First we'll capture each line of our new file to a list (the zip() function) and then write each line to the new file. The csv library extends the .write() method with a writer object. One method of writer objects is .writerow() , the use of which is demonstrated below.

In [45]: # newline='' is what the csv documentation recommends for files passed to csv.writer
         with open('data/new_data.csv', 'w', newline='') as newdata_file:
             writer = csv.writer(newdata_file, delimiter='\t') # define a writer object

             # iterate over data and write to file
             # use zip to create tuples for writing
             for row in zip(gender, age, weight, height, height_m):
                 writer.writerow(row)

Remember that the zip() (https://docs.python.org/3/library/functions.html#zip) function will create an iterator (i.e. zip object ) made up of tuples . In the example above the use of zip() creates a sequence the first element of which is all the first elements of our data lists, the second element is all the second elements etc. It's easier to see this than explain it.

In [47]: zip_sequence = zip(gender, age, weight, height, height_m)
         print(type(zip_sequence))

In [48]: print(gender[:4])
         print(age[:4])
         print(weight[:4])
         print(height[:4])
         print(height_m[:4])
         print() # just a blank line
         print(list(zip_sequence)[:4])

['Gender', 'F', 'F', 'F']
['Age', '77', '80', '76']
['body wt', '63.8', '56.4', '55.2']
['ht', '155.5', '160.5', '159.5']
['ht_m', 1.555, 1.605, 1.595]

[('Gender', 'Age', 'body wt', 'ht', 'ht_m'), ('F', '77', '63.8', '155.5', 1.555), ('F', '80', '56.4', '160.5', 1.605), ('F', '76', '55.2', '159.5', 1.595)]

The first element in each of our data lists is the column header. The zip() function captures these first elements into a tuple - ('Gender', 'Age', 'body wt', 'ht', 'ht_m') - and this, in turn, becomes the first element of a new list, data_out . The zip() function then captures all the second elements from each data list and these become part of a tuple which is the second element of data_out . In this way each data list is 'zipped up' with the other lists.

To output the rows we simply iterate over the data_out list and send each element to our output file as a row using the .writerow() method.

Putting it together 1

Open the elderlyHeightWeight.csv using the functions in the csv module and extract each column to a separate list. Use the height and weight data to calculate the BMI for each subject. Use zip() to create a list of data to write out and write all the phenotype data including BMI back to a new file.

Hint - if you use the csv.reader() remember the issues with the str type in lists.

Putting it together 2

Read the file you just created back in and select only those trial participants who are obese. Print the sex, age and BMI of these people. Obese means a BMI of 30 or more.
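
The cells in the 'Processing and writing file data' section use four lists named gender , age , weight and height , but the cell that builds them is not reproduced in these notes. The sketch below is one possible reconstruction of the whole pipeline, not the original cell: it assumes a loop that unpacks each row into four iterator variables, reads made-up data from memory, and writes to an io.StringIO object instead of a file on disk.

```python
import csv
import io

# Hypothetical tab-separated input, shaped like the course file.
text = 'Gender\tAge\tbody wt\tht\nF\t77\t63.8\t155.5\nM\t79\t75.5\t171\n'

gender, age, weight, height = [], [], [], []
reader = csv.reader(io.StringIO(text), delimiter='\t')
for g, a, w, h in reader:         # unpack the four fields of each row
    gender.append(g)              # the header row is kept as the first entry
    age.append(a)
    weight.append(w)
    height.append(h)

# New column: header first, then each height converted from cm to m.
height_m = ['ht_m'] + [float(ht) / 100 for ht in height[1:]]

# Zip the columns back into rows and write them out, tab separated.
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
for row in zip(gender, age, weight, height, height_m):
    writer.writerow(row)

print(out.getvalue())
```

Unpacking with g, a, w, h only works when every row has exactly four fields; a row with a different number of fields would raise a ValueError.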
Homework
The nhanes.tsv file in the data directory contains data on 4581 Americans aged from 20 to 70 from the
2011-2012 NHANES (http://wwwn.cdc.gov/Nchs/Nhanes/Search/DataPage.aspx?
Component=Demographics&CycleBeginYear=2011) survey. The data included are

individual number (unique ID for each individual in NHANES)

age (years)
sex (1 = M, 2 = F)
weight (kg)
height (cm).

Write a script that will read this data and count the number of NA values in height and /or weight and count the
number of males and females.

Calculate the BMI for each individual, add this to the original data and write out a new file including BMI data.

Finally calculate the mean BMI for males and females and write these out as well (to 2 decimal places).

Hint: In this exercise you should use the techniques you have learned to loop over the lines of a file and extract
each variable into its own list. You can then calculate the BMI values easily. However you won't be able to
calculate a BMI for individuals with 'NA' in either weight or height columns. How can you use the continue
keyword when you loop over your data to avoid collecting values for these individuals?

