Ascii Unicode: Chapter - 4 CSV Files 1. What Is A CSV File?
Ascii Unicode: Chapter - 4 CSV Files 1. What Is A CSV File?
Ascii Unicode: Chapter - 4 CSV Files 1. What Is A CSV File?
CSV Files
A CSV file (Comma Separated Values file) is a type of plain text file that uses specific
structuring to arrange tabular data. Because it’s a plain text file, it can contain only actual
text data—in other words, printable ASCII or Unicode characters.
The structure of a CSV file is given away by its name. Normally, CSV files use a comma to
separate each specific data value. Here’s what that structure looks like:
In general, the separator character is called a delimiter, and the comma is not the only
one used. Other popular delimiters include the tab (\t), colon (:) and semi-colon (;)
characters. Properly parsing a CSV file requires us to know which delimiter is being used.
CSV files are normally created by programs that handle large amounts of data. They are
a convenient way to export data from spreadsheets and databases as well as import or use
it in other programs. For example, you might export the results of a data mining program to
a CSV file and then import that into a spreadsheet to analyze the data, generate graphs for a
presentation, or prepare a report for publication.
CSV files are very easy to work with programmatically. Any language that supports text
file input and string manipulation (like Python) can work with CSV files directly.
The csv library provides functionality to both read from and write to CSV files. Designed
to work out of the box with Excel-generated CSV files, it is easily adapted to work with a
variety of CSV formats. The csv library contains objects and other code to read, write, and
process data from and to CSV files.
Reading from a CSV file is done using the reader object. The CSV file is opened as a
text file with Python’s built-in open() function, which returns a file object. This is then
passed to the reader, which does the heavy lifting.
Here’s the employee_birthday.txt file:
name,department,birthday month
John Smith,Accounting,November
Erica Meyers,IT,March
Here’s code to read it:
import csv
Rather than deal with a list of individual String elements, you can read CSV data
directly into a dictionary (technically, an Ordered Dictionary) as well.
name,department,birthday month
John Smith,Accounting,November
Erica Meyers,IT,March
Here’s the code to read it in as a dictionary this time:
import csv
delimiter specifies the character used to separate each field. The default is the
comma (',').
quotechar specifies the character used to surround fields that contain the delimiter
character. The default is a double quote (' " ').
escapechar specifies the character used to escape the delimiter character, in case
quotes aren’t used. The default is no escape character.
These parameters deserve some more explanation. Suppose you’re working with the
following employee_addresses.txt file:
name,address,date joined
john smith,1132 Anywhere Lane Hoboken NJ, 07030,Jan 4
erica meyers,1234 Smith Lane Hoboken NJ, 07030,March 2
This CSV file contains three fields: name, address, and date joined, which are delimited by
commas. The problem is that the data for the address field also contains a comma to signify
the zip code.
You can also write to a CSV file using a writer object and the .write_row() method:
import csv
John Smith,Accounting,November
Erica Meyers,IT,March
Remove ads
Since you can read our data into a dictionary, it’s only fair that you should be able to write
it out from a dictionary as well:
import csv
writer.writeheader()
writer.writerow({'emp_name': 'John Smith', 'dept': 'Accounting', 'birth_month':
'November'})
writer.writerow({'emp_name': 'Erica Meyers', 'dept': 'IT', 'birth_month': 'March'})
Unlike DictReader, the fieldnames parameter is required when writing a dictionary. This
makes sense, when you think about it: without a list of fieldnames, the DictWriter can’t
know which keys to use to retrieve values from your dictionaries. It also uses the keys
in fieldnames to write out the first row as column names.
emp_name,dept,birth_month
John Smith,Accounting,November
Erica Meyers,IT,March
Of course, the Python CSV library isn’t the only game in town. Reading CSV files is
possible in pandas as well. It is highly recommended if you have a lot of data to analyze.
To show some of the power of pandas CSV capabilities, I’ve created a slightly more
complicated file to read, called hrdata.csv. It contains data on company employees:
import pandas
df = pandas.read_csv('hrdata.csv')
print(df)
That’s it: three lines of code, and only one of them is doing the actual
work. pandas.read_csv() opens, analyzes, and reads the CSV file provided, and stores the
data in a DataFrame. Printing the DataFrame results in the following output:
First, pandas recognized that the first line of the CSV contained column names, and
used them automatically. I call this Goodness.
However, pandas is also using zero-based integer indices in the DataFrame. That’s
because we didn’t tell it what our index should be.
Further, if you look at the data types of our columns , you’ll see pandas has properly
converted the Salary and Sick Days remaining columns to numbers, but the Hire
Date column is still a String. This is easily confirmed in interactive mode:
>>>
>>> print(type(df['Hire Date'][0]))
<class 'str'>
Let’s tackle these issues one at a time. To use a different column as the DataFrame index,
add the index_col optional parameter:
import pandas
df = pandas.read_csv('hrdata.csv', index_col='Name')
print(df)
Now the Name field is our DataFrame index:
import pandas
df = pandas.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'])
print(df)
Notice the difference in the output:
>>>
>>> print(type(df['Hire Date'][0]))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
If your CSV files doesn’t have column names in the first line, you can use the names optional
parameter to provide a list of column names. You can also use this if you want to override
the column names provided in the first line. In this case, you must also
tell pandas.read_csv() to ignore existing column names using the header=0 optional
parameter:
import pandas
df = pandas.read_csv('hrdata.csv',
index_col='Employee',
parse_dates=['Hired'],
header=0,
names=['Employee', 'Hired','Salary', 'Sick Days'])
print(df)
Notice that, since the column names changed, the columns specified in
the index_col and parse_dates optional parameters must also be changed. This now results
in the following output:
Of course, if you can’t get your data out of pandas again, it doesn’t do you much good.
Writing a DataFrame to a CSV file is just as easy as reading one in. Let’s write the data with
the new column names to a new CSV file:
import pandas
df = pandas.read_csv('hrdata.csv',
index_col='Employee',
parse_dates=['Hired'],
header=0,
names=['Employee', 'Hired', 'Salary', 'Sick Days'])
df.to_csv('hrdata_modified.csv')
The only difference between this code and the reading code above is that the print(df) call
was replaced with df.to_csv(), providing the file name. The new CSV file looks like this:
Employee,Hired,Salary,Sick Days
Graham Chapman,2014-03-15,50000.0,10
John Cleese,2015-06-01,65000.0,8
Eric Idle,2014-05-12,45000.0,10
Terry Jones,2013-11-01,70000.0,3
Terry Gilliam,2014-08-12,48000.0,7
Michael Palin,2013-05-23,66000.0,8