Python - Working With Data - Text Formats

This document discusses reading text file formats into Python for data analysis. It provides an example of reading a .txt file of monthly UK rainfall data from 1910-present. Functions are defined to read the header metadata, check for and handle missing values, and read the data into a dictionary with columns as keys and masked arrays as values. The techniques demonstrated work similarly for comma-separated CSV files using Python's csv module.

Uploaded by

sunil jha

Python – working with data – text formats
ASCII or text file formats
Advantages of working with text formats:
• They are usually human-readable.
• They tend to be simple structures.
• It is relatively easy to write code to interpret them.
Disadvantages include:
• Inefficient storage for big data volumes.
• Most people invent their own format, so there is a lack of standardisation.
Using Python to read text formats
As we have seen, Python has a great toolkit for reading files and working with strings.

In this example we use a file that we found on the web, and then adapt some code to read it into a useful, re-usable form.
Our example file
We found a suitable data set on the web:
http://www.metoffice.gov.uk/climate/uk/summaries/datasets#Yearorder

Met Office monthly weather statistics for the UK since 1910.
[Figure: the example file, annotated to show the header, the line numbers (for reference only), the first 9 and last 8 data columns, and a missing value.]
Let's write some code to read it
We'll need:

• To read the header and data separately

• To think about the data structure (so it is easy to retrieve the data in a useful manner).

Let's put into practice what we have learnt:

• Use NumPy to store the arrays

• But we'll need to test for missing values and use a masked array (numpy.ma)
Example code (and data)
Please refer to the example code:

example_code/test_read_rainfall.py

And data file:

example_data/uk_rainfall.txt
Reading the header
UK Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from December 2014 & Winter 2015. Last updated 07/04/2015
Line 1 is important information. The other lines are useful information.

Let's capture the metadata in:
- location: UK
- variable: Rainfall
- units: mm
Reading the header
def readHeader(fname):
    # Open the file and read the six header lines
    with open(fname) as f:
        head = f.readlines()[:6]

    # Line 1 holds the location, variable and units, e.g. "UK Rainfall (mm)"
    location, variable, units = head[0].split()
    units = units.replace("(", "").replace(")", "")

    # Keep the other header lines as comments
    comments = head[1:6]
    return (location, variable, units, comments)
Test the reader
>>> (location, variable, units, comments) = \
...     readHeader("example_data/uk_rainfall.txt")

>>> print(location, variable, units)
UK Rainfall mm

>>> print(comments[1])
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
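Unpacking head[0].split() into three names works for "UK Rainfall (mm)" but would fail if the location had more than one word. The following is a minimal sketch of a more defensive parse, not part of the original example code. The helper name parse_title_line, the pattern, and the second test string are assumptions made for illustration.

```python
import re

# Hypothetical pattern: "<location> <variable> (<units>)".
# It assumes the variable is a single word and the units are in parentheses.
TITLE_PATTERN = re.compile(
    r"^(?P<location>.+)\s+(?P<variable>\S+)\s+\((?P<units>[^)]+)\)$"
)

def parse_title_line(line):
    """Split a title line into (location, variable, units)."""
    match = TITLE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("Unrecognised title line: %r" % line)
    return (match.group("location"),
            match.group("variable"),
            match.group("units"))

print(parse_title_line("UK Rainfall (mm)\n"))
# Made-up title with a two-word location
print(parse_title_line("Northern Ireland Rainfall (mm)\n"))
```

Raising a ValueError on an unexpected title makes a format change visible immediately, rather than producing wrong metadata.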
Write a function to handle missing data properly
import numpy.ma as MA

def checkValue(value):
    # Check if value should be a float
    # or flagged as missing
    if value == "---":
        value = MA.masked
    else:
        value = float(value)
    return value
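To see what assigning MA.masked does, here is a small self-contained sketch. The function check_value is a stand-in for checkValue above (with an assumed optional missing argument), and the three text values are made up.

```python
import numpy.ma as MA

def check_value(value, missing="---"):
    """Return a float, or MA.masked when the text equals the missing marker."""
    if value.strip() == missing:
        return MA.masked
    return float(value)

# Made-up text values with one missing entry
raw = ["111.4", "---", "59.2"]

column = MA.zeros(len(raw), "f8")
for i, text in enumerate(raw):
    column[i] = check_value(text)

print(column)         # the missing entry is displayed as --
print(column.mask)    # True marks the missing entry
print(column.mean())  # computed from the unmasked values only
```

Because the missing entry is masked rather than stored as a sentinel number, statistics such as the mean ignore it automatically.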
Reading the data (part 1)
import numpy.ma as MA

def readData(fname):
    # Open file and read column names and data block
    with open(fname) as f:
        # Ignore the 7 lines before the column names
        for i in range(7):
            f.readline()

        col_names = f.readline().split()
        data_block = f.readlines()

    # Create a data dictionary, containing
    # a masked array of values for each variable
    data = {}
Reading the data (part 2)
    # Add an entry to the dictionary for each column:
    # a single-precision ('f') masked array of zeros
    for col_name in col_names:
        data[col_name] = MA.zeros(len(data_block), 'f',
                                  fill_value=-999.999)
Reading the data (part 3)
    # Loop through each value: store it in the right column
    for (line_count, line) in enumerate(data_block):
        items = line.split()

        for (col_count, col_name) in enumerate(col_names):
            value = items[col_count]
            data[col_name][line_count] = checkValue(value)

    return data
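The three parts above can be tried without the real data file. The sketch below is a compact variant of the same idea, not the original example code. It reads from an in-memory string laid out like the file described here (7 lines to skip, a line of column names, then data). The name read_columns, the "Testland" title and all the numbers are made up for illustration.

```python
import io
import numpy.ma as MA

# Synthetic stand-in for the data file; every value here is invented.
SAMPLE = """\
Testland Rainfall (mm)
comment line 2
comment line 3
comment line 4
comment line 5
comment line 6

Year JAN FEB WIN
1910 111.4 126.5 ---
1911 59.2 99.9 201.3
"""

def read_columns(lines, n_header=7, missing="---"):
    """Return {column name: masked array} from an iterable of text lines."""
    lines = list(lines)
    col_names = lines[n_header].split()
    # Skip blank lines so a trailing empty line does not cause an IndexError
    rows = [line.split() for line in lines[n_header + 1:] if line.strip()]
    data = {name: MA.zeros(len(rows), "f8") for name in col_names}
    for i, items in enumerate(rows):
        for name, text in zip(col_names, items):
            data[name][i] = MA.masked if text == missing else float(text)
    return data

data = read_columns(io.StringIO(SAMPLE))
print(data["Year"])
print(data["WIN"].mask)  # True where the value was "---"
```

This variant uses double precision ("f8") and skips blank lines. Both are design choices of the sketch, not features of the original code.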
Testing the code
>>> data = readData("example_data/uk_rainfall.txt")
>>> print(data["Year"])
[ 1910. 1911. 1912. ...

>>> print(data["JAN"])
[ 111.40000153 59.20000076 111.69999695 ...

>>> winter = data["WIN"]

>>> print(MA.is_masked(winter[0]))
True
>>> print(MA.is_masked(winter[1]))
False

The JAN values are stored in single precision ('f'), which is why 111.4 is displayed as 111.40000153.
winter[0] is masked. Look! A missing value!
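Once a column is a masked array, the missing entries stay out of the way in later calculations. A short sketch follows, with made-up values standing in for a column such as data["WIN"].

```python
import numpy as np
import numpy.ma as MA

# Made-up values; the first entry is flagged as missing
winter = MA.masked_array([0.0, 201.3, 310.5], mask=[True, False, False])

print(winter.count())         # number of unmasked values
print(winter.mean())          # mean over the unmasked values only
print(winter.filled(np.nan))  # plain ndarray with NaN where data are missing
print(winter.compressed())    # plain ndarray with the missing entries dropped
```

filled() is useful when passing data to code that does not understand masks. compressed() is useful when only the valid values matter.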
What about CSV or tab-delimited?
The above example will work exactly the same with a tab-delimited file, provided no fields are empty (because the string split method splits on white space).

If the file used commas (CSV) to separate columns then you could use:

line.split(",")
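The difference between the two forms of split shows up when a field is empty. The lines below are made-up examples, and this is only a sketch of the behaviour.

```python
tab_line = "1910\t111.4\t126.5\n"
csv_line = "1910,111.4,126.5\n"
gap_line = "1910,,126.5\n"  # made-up row with an empty middle field

# split() with no argument splits on any run of white space
print(tab_line.split())

# split(",") keeps the trailing newline, so strip the line first
print(csv_line.strip().split(","))

# An empty CSV field is preserved as an empty string
print(gap_line.strip().split(","))

# With split(), an empty tab-delimited field silently disappears
print("1910\t\t126.5\n".split())
```

Stripping the line before split(",") matters for a check such as value == "---", which would not match "---\n".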
Or try the Python "csv" module
There is a Python "csv" module that is able to read text files with various delimiters. E.g.:

>>> import csv

>>> with open("example_data/weather.csv", newline="") as f:
...     for row in csv.reader(f):
...         print(row)
...
['Date', 'Time', 'Temp', 'Rainfall']
['2014-01-01', '00:00', '2.34', '4.45']
['2014-01-01', '12:00', '6.70', '8.34']
['2014-01-02', '00:00', '-1.34', '10.25']

See: https://docs.python.org/2/library/csv.html
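The csv module returns every field as a string, so numeric columns still need converting. Here is a minimal sketch using csv.DictReader, which looks fields up by column name. It reads from an in-memory copy of the rows shown above, an assumption made so the example runs without the weather.csv file.

```python
import csv
import io

# In-memory stand-in for example_data/weather.csv
SAMPLE = """\
Date,Time,Temp,Rainfall
2014-01-01,00:00,2.34,4.45
2014-01-01,12:00,6.70,8.34
2014-01-02,00:00,-1.34,10.25
"""

temps = []
rainfall = []
# DictReader takes the column names from the first row
for row in csv.DictReader(io.StringIO(SAMPLE)):
    temps.append(float(row["Temp"]))
    rainfall.append(float(row["Rainfall"]))

print(temps)
print(sum(rainfall))  # total rainfall over the three rows
```

Looking fields up by name keeps the code working if the column order in the file changes.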
