1/23/2025
BIGDATA
Big Data Techniques
and Technologies
Bobby Reyes
Text and File Processing
Strings
◼ string: A sequence of text characters in a program.
◼ Strings start and end with quotation mark " or apostrophe ' characters.
◼ Examples:
"hello"
"This is a string"
"This, too, is a string. It can be very long!"
◼ A string may not span multiple lines, and a string delimited by " may not contain an unescaped " character.
"This is not
a legal String."
"This is not a "legal" String either."
◼ A string can represent special characters by writing an escape sequence that starts with a backslash.
◼ \t tab character
◼ \n new line character
◼ \" quotation mark character
◼ \\ backslash character
◼ Example: "Hello\tthere\nHow are you?"
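As a small illustration of the escape sequences listed above, the sketch below builds strings containing a tab, a newline, a quotation mark, and a backslash. The variable names are illustrative only.

```python
# Each escape sequence counts as a single character inside the string.
greeting = "Hello\tthere\nHow are you?"
quoted = "She said \"hi\" and typed a backslash: \\"

print(greeting)              # the tab and newline are rendered, not shown literally
print(quoted)
print(len("\t"), len("\\"))  # each escape sequence is one character long
```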
Indexes
◼ Characters in a string are numbered with indexes starting at 0:
◼ Example:
name = "P. Diddy"
index      0   1   2   3   4   5   6   7
character  P   .  ' '  D   i   d   d   y
◼ Accessing an individual character of a string:
variableName [ index ]
◼ Example:
print(name, "starts with", name[0])
Output:
P. Diddy starts with P
String Properties
◼ len(string) - number of characters in a string
(including spaces)
◼ str.lower(string) - lowercase version of a string
◼ str.upper(string) - uppercase version of a string
◼ Example:
name = "Martin Douglas Stepp"
length = len(name)
big_name = str.upper(name)
print(big_name, "has", length, "characters")
Output:
MARTIN DOUGLAS STEPP has 20 characters
input
◼ input : Reads a string of text from user input.
◼ Example:
name = input("Howdy, pardner. What's yer name? ")
print(name, "... what a silly name!")
Output:
Howdy, pardner. What's yer name? Sixto Dimaculangan
Sixto Dimaculangan ... what a silly name!
Text Processing
◼ text processing: Examining, editing, formatting text.
◼ often uses loops that examine the characters of a string one by one
◼ A for loop can examine each character in a string in sequence.
◼ Example:
for c in "booyah":
    print(c)
Output:
b
o
o
y
a
h
Strings and Numbers
◼ ord(text) - converts a single-character string into its numeric code.
◼ Example: ord("a") is 97, ord("b") is 98, ...
◼ Characters map to numbers using standardized mappings such as
ASCII and Unicode.
◼ chr(number) - converts a number into a string.
◼ Example: chr(99) is "c"
◼ Exercise: Write a program that performs a rotation cipher.
◼ e.g. "Attack" (converted to lowercase) when rotated by 1 becomes "buubdl"
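One possible approach to the rotation exercise, sketched with ord() and chr(). The function name rotate, the choice to lowercase the input first, and the handling of non-letters are all assumptions made for this illustration, not part of the exercise statement.

```python
def rotate(text, shift):
    """Rotate each letter of text by shift positions, wrapping from z back to a."""
    result = ""
    for c in text.lower():          # lowercase first, to match the "buubdl" example
        if "a" <= c <= "z":
            offset = (ord(c) - ord("a") + shift) % 26
            result += chr(ord("a") + offset)
        else:
            result += c             # assumption: leave non-letters unchanged
    return result

print(rotate("Attack", 1))
```

Rotating by a negative shift undoes the rotation, so rotate(rotate(s, 5), -5) gives back the lowercase version of s.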
The File Object
◼ Many programs handle data, which often comes from files.
◼ File handling in Python can easily be done with file objects,
which the built-in open() function returns.
◼ The file object provides all of the basic functions necessary in
order to manipulate files.
◼ Exercise: Open Notepad or Notepad++. Write some text and save
the file to a location and with a name you’ll remember, say
'Practice_File.txt'.
The open() function
◼ Before you can work with a file, you first have to open it using
Python’s built-in open() function.
◼ The open() function takes two arguments: the name of the file
that you wish to use and the mode in which to open it.
The result of open() is a file object that is used to work on
this file.
fh = open('Practice_File.txt', 'r')
◼ By default, the open() function opens a file in ‘read mode’; this is
what the 'r' above signifies.
◼ There are a number of different file opening modes. The most
common are: 'r'= read, 'w'=write, 'r+'=both reading and
writing, 'a'=appending.
◼ Exercise: Use the open() function to read the file in.
The close() function
◼ Likewise, once you’re done working with a file, you can close it
with the close() function.
◼ Using this function will free up any system resources that are
being used up by having the file open.
fh.close()
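A common alternative to calling close() yourself is the with statement, which closes the file automatically when its block ends. The sketch below shows both styles. It creates its own small file in a temporary directory so that it runs on its own; the file name demo_close.txt is a placeholder.

```python
import os
import tempfile

# Assumption: a throwaway file stands in for Practice_File.txt.
path = os.path.join(tempfile.mkdtemp(), "demo_close.txt")

fh = open(path, "w")
fh.write("The first line of text\n")
fh.close()                    # explicit close
print(fh.closed)              # the closed attribute reports whether close() has run

with open(path, "r") as fh2:  # the file is closed when this block ends
    first = fh2.readline()
print(fh2.closed)
```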
Reading in a file and printing to screen example
◼ Using what you have now learned about for loops, it is possible to
open a file for reading and then print each line in the file to the
screen using a for loop.
◼ Use a for loop and the variable name that you assigned the open
file to in order to print each of the lines in your file to the screen.
◼ Example:
fh = open('Practice_File.txt', 'r')
for line in fh:
    print(line, end='')   # each line already ends with '\n'
fh.close()
Output:
The first line of text
The second line of text
The third line of text
…
The read() function
◼ However, you don’t need a loop to access file contents.
Python has built-in file reading methods:
◼ The read() method takes an optional argument, which is the
number of characters to read (bytes, if the file is opened in binary mode).
If you omit it, read() reads the whole file and returns it as a string.
1. <fileobject>.read() - returns the entire contents of the file as a single string
fh = open('Practice_File.txt', 'r')
print(fh.read())
Output:
The first line of text
The second line of text
The third line of text
The fourth line of text
The fifth line of text
<fileobject>.read(6) - reads the first n=6 characters
fh = open('Practice_File.txt', 'r')
print(fh.read(6))
Output:
The fi
The readline() and readlines() functions
◼ Other built-in file reading commands:
2. <fileobject>.readline() - returns one line at a time
fh = open('Practice_File.txt', 'r')
print(fh.readline())
Output:
The first line of text
3. <fileobject>.readlines() - returns a list of the remaining lines
print(fh.readlines())   # same fh as above, so the first line has already been read
Output:
['The second line of text\n', 'The third line of text\n', 'The fourth line of text\n', 'The fifth line of text\n']
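Each read call continues from where the previous one stopped. The sketch below illustrates this on a five-line sample file that it creates in a temporary directory (an assumption, so that the code runs without an existing Practice_File.txt).

```python
import os
import tempfile

# Assumption: a five-line sample file like the one used on these slides.
path = os.path.join(tempfile.mkdtemp(), "Practice_File.txt")
fh = open(path, "w")
fh.writelines(["The first line of text\n", "The second line of text\n",
               "The third line of text\n", "The fourth line of text\n",
               "The fifth line of text\n"])
fh.close()

fh = open(path, "r")
start = fh.read(6)          # the first six characters
rest = fh.readline()        # the remainder of the first line, including '\n'
remaining = fh.readlines()  # every line not yet read, as a list
at_end = fh.read()          # nothing is left, so this is the empty string
fh.close()

print(repr(start), repr(rest), len(remaining), repr(at_end))
```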
The write() function
◼ Likewise, there are two built-in methods for writing to a file:
1. <file>.write() - Writes a specified sequence of characters to a file
fh = open('Practice_File_W.txt', 'w')
fh.write('I am adding this string')
2. <file>.writelines() - Writes a list of strings to a file:
testList = ['First line\n', 'Second line\n']
fh = open('Practice_File_W.txt', 'w')
fh.writelines(testList)
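Neither write() nor writelines() adds a newline for you, so any '\n' must be part of the strings themselves. A minimal sketch, writing to a temporary location (an assumption, to avoid touching real files) and reading the result back:

```python
import os
import tempfile

# Assumption: a placeholder location for the output file.
path = os.path.join(tempfile.mkdtemp(), "Practice_File_W.txt")

fh = open(path, "w")
fh.write("I am adding this string")               # no newline is appended
fh.write(" and this one\n")                       # so this continues on the same line
fh.writelines(["First line\n", "Second line\n"])  # items are written exactly as given
fh.close()

fh = open(path, "r")
contents = fh.read()
fh.close()
print(contents)
```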
Example Line-by-line Processing
◼ Reading a file line by line and writing each line to an output file:
fh1 = open('Practice_File.txt', 'r')
fh2 = open('Write_File.txt', 'w')
count = 0
for line in fh1.readlines():
    fh2.write(line)
    count += 1
fh2.write('The file contains ' + str(count) + ' lines.')
fh1.close()
fh2.close()
◼ Exercise: Write a program to process a file of DNA text, such as:
ATGCAATTGCTCGATTAG
◼ Count the percent of C+G present in the DNA.
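One possible approach to the counting step, sketched on a string that is assumed to have already been read from the file and stripped of its newline. The function name gc_percent is hypothetical, and the sketch assumes the string is not empty.

```python
def gc_percent(dna):
    """Return the percentage of characters in dna that are C or G."""
    count = 0
    for base in dna.upper():    # examine the characters one by one
        if base == "C" or base == "G":
            count += 1
    return 100 * count / len(dna)   # assumption: dna is not empty

print(round(gc_percent("ATGCAATTGCTCGATTAG"), 1))
```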
Data Conversion and Parsing
◼ A file, specifically a text file, consists of strings. However,
especially in engineering and science, we work with numbers.
Thus, we need to convert (cast) each input string to int or float.
◼ Another challenge is having multiple numbers in one string (line)
separated by special characters or simply spaces, as in '10.0 5.0 5.0'.
➢ We can use the .split(delimiter) method of a string, which
returns a list of the substrings found between occurrences of the delimiter.
instr = '10.0 5.0 5.0'
outlst = [ float(substr) for substr in instr.split(' ')]
print(outlst)
[10.0, 5.0, 5.0]
▪ Other useful methods on working with strings:
▪ delimiter.join(list) – joins the strings in a list, placing the delimiter between them
▪ .rstrip('\n') – removes occurrences of '\n' at the end of a string
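The three methods are often combined when parsing a line. The sketch below uses a hypothetical comma-separated input line; note that join is called on the delimiter string and takes the list as its argument.

```python
line = "10.0,5.0,5.0\n"                 # hypothetical comma-separated input line

fields = line.rstrip("\n").split(",")   # list of strings, newline removed first
values = [float(f) for f in fields]     # convert each field to a real value

# join is called on the delimiter and takes the strings to combine
rejoined = " ".join(str(v) for v in values)
print(fields)
print(values)
print(rejoined)
```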
Termination of Input
◼ Two ways to stop reading input:
1. By reading a definite number of items.
2. By the end of the file.
➢ EOF indicator – at end of file, functions like read() and
readline() return an empty string ''.
fp = open("pointlist.txt")            # open file for reading
pointlist = []                        # start with empty list
nextline = fp.readline()              # first line holds the number of lines that follow; skip it
nextline = fp.readline()              # read the following line, which has two real values
                                      # denoting the x and y values of a point
while nextline != '':                 # until end of file
    nextline = nextline.rstrip('\n')  # remove occurrences of '\n' at the end
    (x, y) = nextline.split(' ')      # get x and y (note that they are still strings)
    x = float(x)                      # convert them into real values
    y = float(y)
    pointlist.append((x, y))          # add tuple at the end
    nextline = fp.readline()          # read the next line
fp.close()
print(pointlist)
[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
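The first way to stop, reading a definite number of items, can be sketched with the count stored on the first line of the file. The file contents below are an assumption, chosen to match the output shown above, and the file is written to a temporary directory so that the sketch runs on its own.

```python
import os
import tempfile

# Assumption: pointlist.txt starts with a count line, then one "x y" pair per line.
path = os.path.join(tempfile.mkdtemp(), "pointlist.txt")
fp = open(path, "w")
fp.write("4\n0.0 0.0\n10.0 0.0\n10.0 10.0\n0.0 10.0\n")
fp.close()

fp = open(path)
n = int(fp.readline())             # number of points that follow
pointlist = []
for _ in range(n):                 # read exactly n lines, then stop
    # split() with no argument splits on any whitespace,
    # so the trailing '\n' is discarded as well
    x, y = fp.readline().split()
    pointlist.append((float(x), float(y)))
fp.close()
print(pointlist)
```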