6.Text Processing and Pattern Searching

TEXT PROCESSING
AND PATTERN
SEARCHING
CHAPTER 6
Introduction
•Apart from numeric computations, computers also are well
equipped to perform processing of textual information.
•Computational strategies and algorithms differ significantly
from existing strategies and methods for numerical data.
•Textual processing can be broadly covered by
◦ Manipulation and movement of characters
◦ Searching for patterns/words.
6.1 Text Line Length Adjustment
•Problem: Given a set of lines of text of arbitrary length,

reformat the text so that no lines of more than n characters
(e.g. 40) are printed. In each output line the maximum
number of words that occupy less than or n characters,
should be printed and no word should extend across two
lines. Paragraphs should also remain indented.
•The input lines can contain less than n, n itself or more than
n characters.
•Read 40 characters from the input and place them in an

array.
•When we encounter EOL in the input we need to insert a space to
guarantee word separation.
•Since no word should spill over two lines, we need to know where
the last full word ended.
◦ Count back from position 41 until we encounter a blank.
◦ As and when a blank is encountered update a variable with the current
position.
•What to do with the partial word?
•When to terminate the algorithm?
•Line centered approach:
•Word-oriented approach:
•How to handle punctuation marks and keep the indentation
of paragraphs in the original text?
•Words are generally separated by one or more spaces and/or
a new line character(s).
•Any punctuation which generally follows on directly at the
end of words need not be distinguished from the words.
•Read the input character-by-character.
•If the current character is neither a space nor an end-of-line
it can simply be added to the current word array.
•As each word is built, a character count will need to be
made.
•Use two values wordcnt and linecnt to keep track of the
length of the word and length of the line.
•If the line length limit not exceeded
◦ The current word can be written out directly and prepare to accept
the next word as input.
•Else
•How to distinguish between multiple blanks embedded in
the text and a new paragraph.
◦ Detect paragraph by making a condition (using a flag): EOL
followed by a space.
•The algorithm is divided in parts to organize the input and

output lines in an efficient manner
•Variables needed
◦ Limit – limit on the number of characters in each line
◦ Linecnt - count of characters on current line
◦ Wordcnt – count of characters in current word
◦ Chr - character read
◦ Space - the space character
◦ EOL – flag
◦ Word – the current word
•The cost of reformatting is linearly depended on the number
of characters in the original text.
•Focusing on the output of words rather than the output lines
of text leads to a clearer and simpler algorithm.
•Applications: text processing, report writing and file
management
6.2: Left and Right Justification of text
•Problem: Design and implement a procedure that will left

and right justify text in a way that avoids splitting words and
leaves paragraphs indented. An attempt should also be made
to distribute the additional blanks as evenly as possible in
the justified line.
•Books and computer prepared documents are almost always left and right
justified.
•What this means is that all lines have a fixed length with the first word on
the line always starting in a fixed position (except when there is a new
paragraph) and the last word also finishing in another fixed position.
•How to achieve the fixed line length?
◦ By inserting additional spaces between words
•What should be the relation between the fixed line length and the line
length of the unformatted text.
◦ The length of the unformatted line should be less than or equal to the fixed line
length.
•Reformatting is done in two steps
◦ Rearrange the text so that the line lengths do not exceed the fixed output line
length
◦ Left and right justify the reformatted text lines
•Assume that the input line has only one space separating
words and there are no added spaces at the end of the line.
•Assume that any punctuation in the text either directly
precedes or directly follows words with no intervening
spaces.
•Four possible situations may prevail
•1) the line is already of the correct length (no processing required)
•2) the number of spaces needed to expand the current line to the
required length is equal to the number of spaces already present in
the line.
•3) the number of spaces to be added is greater than the existing
number of spaces in the line
•4) the number of spaces to be added is lesser than the existing
number of spaces in the line.
•Evenly distribute 10 extra spaces to a line which originally had 7 spaces.
•The 3 extra spaces shifts the situation from (3) to (4)

• , so add the remaining spaces after every 2nd space
•Consider the case of adding 4 spaces in the 7 spaces, then which is closer to 2
than 1.
•Add spaces at 2,4,6,8 but this will fail.
•Observations:
•1) The chosen starting position can influence whether or not
the required set of spaces can be added.
•2) no guarantee that for a given increment it will be possible
to add required number of spaces.
•Need a template method to distribute extra spaces evenly.
◦ nspaces- number of spaces available in the line before modification
◦ extraspaces- the number of spaces to be added.
◦ round- a function to assign the closest integer value to delta.
•Which starting position will give the most balanced
distribution?
◦ why?
◦ Since it is the “average” of the possible range.
◦ However if delta is 1, next will be zero
•To avoid it,
• 1) The chosen starting position can influence whether or not the
required set of spaces can be added.
•To solve the second problem we will need to redistribute the extra
spaces, we may require two loops.
•Loop 1: compute current template increment and starting position
according to extra space count.
•Loop 2: distribute extra spaces guided by template and reduce the
extra space count accordingly.
•How to apply this template method for the case (3)?
◦ By choosing an increment of 1 and starting position as 1 we can
add a space to every existing space.
◦ We may need to add more than one space to each existing space.
(example 9 spaces over 4 existing spaces)
◦ Create a variable to indicate the number of spaces to be added to
each position with the current template pass.
•Determine the number of positions in the line where the space
can be added.
•Check for leading spaces at the start of a new paragraph.
•As a precaution check for spaces at the end of the line.
•After the above steps, create a variable that is incremented
every time a space is encountered.
•Problem: Multiple passes over the line to attain the distribution
of the spaces.
•Create an array of length the number of spaces and fill it
with 1s.
•Adding 10 extra spaces for 7 existing positions
•Adding 3 extra spaces for 7 existing positions

•Step through the line character-by-character.
•If a space is encountered, refer to the space array to decide
how many extra spaces to add at the current location.
•If it is not a space write out the characters in the word or
else write the number of spaces by referring to the space
array.
•Array index is increased by 1 every time a space is
encountered.
6.3: Keyword searching in text
•Problem: count the number of times a particular word occurs in a given
text.
•Example: does the word SENTENCE appear in
“THIS IS A SENTENCE OF A TEXT”
•What is a word and what is text?
•A match is found when two words match character by character for the
length of the word sought.
•Generally words are preceded and followed by non-alphabetic
characters.
•Pattern match can be done by storing the word sought in an
array and then increment the position in the array each time
a partial match is found.
•Every time the pattern sought has been completely matched,
check if the text is substring of a word.
◦ After the pattern match condition is met, check if the next character
is alphabetic. Ex: SENT in SENTENCE
•What about the case of SENT in PRESENT?
•What to do when a mismatch happens?
◦ Save the last text character that had a mismatch before a pattern
match started as pre
◦ Reset the pointer for the word array to the beginning of the array.
•The very first word in the text there is not preceded by a
non-alphabetic character, in this case the pre variable is
assigned a non-alphabetic character.
•What to do after a word match has happened?
◦ The character that follows the matched word assumes the role of
the character that precedes the next word.
◦ The pointer for the word array also needs to be set to the beginning.
◦ Update the match count.
•1. Find the word length of the search-word
•2. Set match-count to 0, set preceding character to non-
alphabetic, set pointer of array to 0/1.
•3.
•While not EOF
◦ While not EOL
◦ Read next character
◦ If current text character matches ith character in word:
◦ extend partial match by 1
◦ If a word-pattern match: read next character post
◦ If pre and post are non-alphabetic, then update count
◦ Reinitialize pointer to word array
◦ Save following character post as preceding character.
◦ Else
◦ Save current text character as preceding character for match
◦ Reset word array pointer to first position.
◦ Read past EOL

•The dominant instruction for this algorithm will be the
comparison of the current text character with the current
word chacter.
•Applications: limited text searching.

6.Text Processing and Pattern Searching

Uploaded by

Copyright:

Available Formats

6.Text Processing and Pattern Searching

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

6.Text Processing and Pattern Searching

Uploaded by

Copyright:

Available Formats

TEXT PROCESSING

•Problem: Given a set of lines of text of arbitrary length,

•Read 40 characters from the input and place them in an

•The algorithm is divided in parts to organize the input and

•Problem: Design and implement a procedure that will left

•The 3 extra spaces shifts the situation from (3) to (4)

•Adding 3 extra spaces for 7 existing positions

◦ Read past EOL

You might also like