Assignment3 ConcordanceGenerator PDF
Assignment3 ConcordanceGenerator PDF
Assignment3 ConcordanceGenerator PDF
Data Structures
Assignment 3: Concordance Generator
Introduction Testing Parsing with File-Oriented Input
Before creating your full Concordance Generator application, You need to prove that you can load and parse individual
you should build components that you know you’ll need. words correctly. First, simplify your testing by creating a small
1
Build and test these components first; then assemble them in text-oriented file with several paragraphs . Remember, your
the full Concordance Generator application. parsing needs to isolate all individual words: eliminating
Each of these steps is fairly short, and the final body of code punctuation; respecting different word boundaries, etc.
is not huge . . . but you do need to understand what you’re Therefore, your small text file must include typical patterns.
building . . . and that suggests that you need to plan before This suggests creating a test plan to list those cases.
coding . . . and test subsections of code before assembling the Once you’ve proven your design with a small text file, you can
final solution. move to larger ones.
Step 1: Word Parsing During Build Initially, you could merely display the parsing results to the
screen when dealing with a small sample file. Later, the
The following code fragment will produce a workable parsing
results of the parsing are used to build the concordance, so
tool. If you choose to use this code, you’ll have to surround it
nothing will be displayed during this section of execution in
with suitable code that builds the actual concordance data
your final version.
structures. The key line of code uses a regular expression
when invoking the split() method for my paragraph String Step 2: Sketch your Planned Runtime
object. I have crafted a test String object that embeds a Reference-to-Object Layout
number of distracting non-word characters, such as: 2
This runtime reference-to-object layout need not be a
• multiple spaces polished final drawing. Do it in pencil. Your runtime reference-
• parentheses to-object layout will include the data structure that holds the
• commas and multiple periods original collection of paragraphs.
• quotation marks The second data structure will contain a “ map of unique
• a contraction: “It’s” 3
words” . This will become the index back into the collection
Note: The String initialization in this document has been split of original paragraphs. You have a choice here. There are
across multiple lines, but only because I wanted to fit the several workable approaches to building a map of unique
source code in this page width. In my original program, I left it words, including HashMap and TreeMap. Will you choose
as one long continuous character sequence. wisely? Each item in this data structure holds a unique word
as the key value (a word such as “happy”). Since this word
public class SplitTest { (“happy”) can have many occurrences, the map entry for
“happy” must also have a list of references back to the
public static void main(String[ ] args) {
original paragraphs. The Map classes are called associative
String sParagraph = "This is a paragraph (with several sentences " +
". . . and lots of punctuation). It's clear that parsing " + because they associate a key (such as “happy”) with a value
"can be complex! You might need to shout, \"Help!\""; (such as an ArrayList or LinkedList of references to String
System.out.println(sParagraph); objects, which hold references to your full original
paragraphs).
String[ ] asCollectionOfWords = sParagraph.split("[^a-zA-Z_0-9']+"); So your drawing should show how the two data structures
for (String s : asCollectionOfWords) relate: the original list of paragraphs and the layered map-
System.out.println(s);
} // end main()
1
Paragraphs are delimited by a newline character.
} // end class SplitTest 2
I used to call this a “memory map,” but strictly speaking,
The split() method will create individual word-oriented String Java, doesn’t allow direct access to memory, so that phrase
objects, and capture a reference to each in an array of “memory map” isn’t precisely accurate. Therefore, I
references to String objects. The reference to that array of substituted the phrase “runtime reference-to-object layout”.
references is captured in asCollectionOfWords. To confirm I’m hoping that this will make Svillen Ranev much happier ☺.
your understanding of split(), capture my sample code; build 3
Notice that I wrote “ map of unique words” not a
a Java project and set a breakpoint to inspect memory after “ collection of unique words”. There are two major families
the call to split() has finished its work. If you want credit for of data structures which are organized under two
completing this step, you’ll need to show me a sketch of fundamental interfaces: Collection and Map. Up to now
memory after the split() method has done its work. you’ve probably only used data structures that implement
the Collection interface. For this part, you need to use a
concrete class that implements the Map interface.
based index that refers back into the original list of • At a suitable breakpoint, check the list of original
paragraphs. paragraphs. Are the paragraphs organized as you had
expected in your runtime reference-to-object layout.
Confirming Correctness of Your Runtime Reference-to-
Object Layout • In the build process, step through a line at a time to
prove:
To confirm suitability of your choice of data structures, you
o Individual words are parsed correctly.
can show it to me.
o First occurrences of new words result in the addition
Step 3: Implement the Build Process of a new entry in the lookup data structure. Each
With your runtime reference-to-object layout complete, and new entry has a key and a value component. The
proven parsing routines, you are ready to build the two lookup work is the key; the reference to the
related data structures. Here’s an outline of your algorithm. associated paragraph becomes the first entry in a
I’ll assume that you have already loaded the paragraphs into Collection of paragraph references (which is the
a collection. associated value).
o Subsequent occurrences of a word locate the target
• Create a new lookup map based on the blend Map and in the Map data structure and add another reference
Collection classes shown in the runtime reference-to- to the associated paragraph.
object layout you chose in Step 2. The key component of
• When the build is complete, you can use the debugger to
the Map is the lookup word. The value component of the
find components in the concordance index and compare
Map is the List of references to the associated
with your small text file.
paragraphs.
• Run it with one of the large text files (such as: Charles
• In a loop, iterate through all paragraphs that are in your
Dicken’s Oliver Twist, Charles Darwin’s Origin of the
Collection of paragraphs. 4
Species, Victor Hugo’s Les Miserables ). With big files you
o Get a reference to the current paragraph.
will see the Big-O effect of your wise (or not-so-wise)
o Invoke split() for that paragraph and capture the
choices.
resulting array of references to isolated words.
o In an inner loop, iterate through all words in the Ignore User-Interface and Display Issues
paragraph (using your proven parsing tools). Your final application will have to have some kind of user-
Lookup the isolated word in your Map. interface with display features. Don’t worry about that while
If it’s not found then, you must create a new testing the design now.
lookup entry for it, and then store the current
Prove that you can build the memory-resident indexing data
paragraph reference as the first instance in your
structure first.
list of paragraphs for this word.
Else if the word is found, the lookup entry must This first part doesn’t require a great deal of coding. In
already exist, and you merely add this paragraph addition, I’ve given you workable parsing routines.
reference to your list of paragraphs for this Your real challenge is understanding the relationship of the
word. two major data structures: the Collection of original
That’s the essence of the build algorithm. Of course, there are paragraphs; the Map of words that manage references back
a number of detailed issues that you must deal with. to the original paragraphs. Make sure you understand the
runtime reference-to-object layout that reflects the design
One detailed issue flows from your choice of Map and
you propose to implement.
Collection classes (Step 2 decision). Different collections have
variations in their constructors and in some of their methods,
so you’ll probably need to spend some time looking at the
Java SE7 API documentation.
Another detailed issue is how to handle multiple occurrences
of a word in a paragraph. For example, the word “the” might
occur a dozen times in a single paragraph. Your parsing
routine will discover all dozen instances of the word “the”.
However, you really want to store only one reference to this
paragraph – not a dozen references to this same paragraph.
There are several workable approaches to solving this
problem. I’ll leave it for you to choose the solution to this.
Testing Index Building
In my opinion, this testing is best done with a small sample
text file and through the use of the debugger. Set breakpoints
at strategic locations.
4
This is the biggest of the text files: 3,224,332 bytes in size.
Step 4: Putting the Final Project Together Note some of the special elements in this regular expression:
Accessing Indexing Information • \\b This identifies a standard word boundary. Of
This part should be easy. Imagine that the user is now using course, the double \\ is used to embed a
the concordance index. She types in “happy” to find all the single \ in the String object. The replaceAll()
places where happy is located in the original document. method will really receive \b which ensures that we
Clearly, “happy” is your key. Since you are using a Map, you search for a word that's delimited with word
5 boundaries (to avoid finding "the" in "their"
can use that key to retrieve the associated value . What’s the
associated value? The associated value is the list of "another" "them" etc.)
references to the original paragraphs. • ?i: This means perform the search of the word
Thus, retrieving the information about all occurrences of the inside the parentheses ignoring case.
word “happy” can be implemented with a single method call • ( ) This means group all elements inside the
to your Map object. The Big-O associated with that single parentheses, and the grouping has two purposes.
6 First, the case-insensitive setting of ?i: is applied to
method call will depend on your choice of Map.
everything inside the parentheses. Second, the item
Issues in Displaying Search Results found by the search becomes available for use; in
Your user has looked up a word, and you have the list of essence it establishes an accessible variable, which
paragraphs where that word is located. But how are you you’ll later see as $0
going to display the results? The following outlines an
approach that I used with my GUI-based implementation. You Building the Displayable Search Results
are free use the code (with appropriate credit) or you can In preparation for display, I create a StringBuilder object. I
implement your own. can’t use a String object, since String objects are immutable.
Imagine that you’re dealing with a single paragraph where The StringBuilder object will allow me to append again and
the target word resides. There may be more than one again with little performance penalty. The <i> and </i> are
occurrence of this word in the current paragraph, yet you will html tags for italics. The <br> elements generate a line break
want to highlight all occurrences. Here’s a simple method in html.
using regular expressions. My initial StringBuilder object creation is shown at the top of
Given a search word, such as “happy”, I want to replace the next page (placed on a new page to avoid splitting the
“happy” with “<b>happy</b>” essentially surrounding the long line of Java code).
target word with hypertext tagging to turn it bold (because
the Java control in which I display the text can correctly
display html tagged text). If I was displaying a pure text
version, I might highlight “happy” by replacing it with
“**happy**”. How you highlight the word is your choice, and
it’s easy to adjust. There’s one more thing; I want to search
case-insensitive, but replace retaining case-sensitivity. So a
search for “happy” (an all lower-case search word) would
result in the following matches and associated replacements:
• happy <b>happy</b>
• Happy <b>Happy</b>
• HAPPY <b>HAPPY</b>
5
And remember, a Map is an associative container in that it
associates a key with a corresponding value. Of course, you
don’t have to use a Map, to build the concordance. There are
many ways to provide well-functioning, high-performance
indexing.
6
Remember, Map is an interface. You cannot create an
object of type Map. But you can create one of several
implementing subclasses, and they all share the same
method interfaces (though with different Big-O
characteristics).
StringBuilder sbResults = (new StringBuilder("<i>Search</i>: <b>")).append(sSearchWord).append("</b><br><br>");
Optional Bonuses
There are many possible enhancements. None are required,
but each represents a significant improvement over the basic
implementation.
• Implement a GUI interface. I’m quite happy to help with
this once you get started.
• Implement serialization. Remember that a concordance
is an index. Once the index has been built, it will never
have to change (since the original text of a novel like
Oliver Twist is never going to change). On subsequent
runs of the program it would be better to permit the
reload of the pre-built concordance rather than always
having to build it from the original source text file. In just
7
a single line of code , you can output a multi-megabyte
data structure with all the rich network of reference
relationships. Later, you can re-read this multi-megabyte
data structure and restore its rich network of reference
relationships; again, with a single line of code.
• Use your imagination and creativity . . . it will be
rewarded.
7
Of course, you do have to open the output serialization file
first before you use this ”magical” line of code.
The Problem Statement is weighted more heavily
Submission Requirements since you’ll be justifying your implementation
The following need to be part of your submission. choices with reference to Big-O issues.
Note: You also need a chart for the Map choices. This
chart only shows a selection of Collection choices.