Assignment3 ConcordanceGenerator PDF

CST8130:
Data Structures
Assignment 3: Concordance Generator
Introduction Testing Parsing with File-Oriented Input
Before creating your full Concordance Generator application, You need to prove that you can load and parse individual
you should build components that you know you’ll need. words correctly. First, simplify your testing by creating a small
1
Build and test these components first; then assemble them in text-oriented file with several paragraphs . Remember, your
the full Concordance Generator application. parsing needs to isolate all individual words: eliminating
Each of these steps is fairly short, and the final body of code punctuation; respecting different word boundaries, etc.
is not huge . . . but you do need to understand what you’re Therefore, your small text file must include typical patterns.
building . . . and that suggests that you need to plan before This suggests creating a test plan to list those cases.
coding . . . and test subsections of code before assembling the Once you’ve proven your design with a small text file, you can
final solution. move to larger ones.
Step 1: Word Parsing During Build Initially, you could merely display the parsing results to the
screen when dealing with a small sample file. Later, the
The following code fragment will produce a workable parsing
results of the parsing are used to build the concordance, so
tool. If you choose to use this code, you’ll have to surround it
nothing will be displayed during this section of execution in
with suitable code that builds the actual concordance data
your final version.
structures. The key line of code uses a regular expression
when invoking the split() method for my paragraph String Step 2: Sketch your Planned Runtime
object. I have crafted a test String object that embeds a Reference-to-Object Layout
number of distracting non-word characters, such as: 2
This runtime reference-to-object layout need not be a
• multiple spaces polished final drawing. Do it in pencil. Your runtime reference-
• parentheses to-object layout will include the data structure that holds the
• commas and multiple periods original collection of paragraphs.
• quotation marks The second data structure will contain a “ map of unique
• a contraction: “It’s” 3
words” . This will become the index back into the collection
Note: The String initialization in this document has been split of original paragraphs. You have a choice here. There are
across multiple lines, but only because I wanted to fit the several workable approaches to building a map of unique
source code in this page width. In my original program, I left it words, including HashMap and TreeMap. Will you choose
as one long continuous character sequence. wisely? Each item in this data structure holds a unique word
as the key value (a word such as “happy”). Since this word
public class SplitTest { (“happy”) can have many occurrences, the map entry for
“happy” must also have a list of references back to the
public static void main(String[ ] args) {
original paragraphs. The Map classes are called associative
String sParagraph = "This is a paragraph (with several sentences " +
". . . and lots of punctuation). It's clear that parsing " + because they associate a key (such as “happy”) with a value
"can be complex! You might need to shout, \"Help!\""; (such as an ArrayList or LinkedList of references to String
System.out.println(sParagraph); objects, which hold references to your full original
paragraphs).
String[ ] asCollectionOfWords = sParagraph.split("[^a-zA-Z_0-9']+"); So your drawing should show how the two data structures
for (String s : asCollectionOfWords) relate: the original list of paragraphs and the layered map-
System.out.println(s);
} // end main()
1
Paragraphs are delimited by a newline character.
} // end class SplitTest 2
I used to call this a “memory map,” but strictly speaking,
The split() method will create individual word-oriented String Java, doesn’t allow direct access to memory, so that phrase
objects, and capture a reference to each in an array of “memory map” isn’t precisely accurate. Therefore, I
references to String objects. The reference to that array of substituted the phrase “runtime reference-to-object layout”.
references is captured in asCollectionOfWords. To confirm I’m hoping that this will make Svillen Ranev much happier ☺.
your understanding of split(), capture my sample code; build 3
Notice that I wrote “ map of unique words” not a
a Java project and set a breakpoint to inspect memory after “ collection of unique words”. There are two major families
the call to split() has finished its work. If you want credit for of data structures which are organized under two
completing this step, you’ll need to show me a sketch of fundamental interfaces: Collection and Map. Up to now
memory after the split() method has done its work. you’ve probably only used data structures that implement
the Collection interface. For this part, you need to use a
concrete class that implements the Map interface.
based index that refers back into the original list of • At a suitable breakpoint, check the list of original
paragraphs. paragraphs. Are the paragraphs organized as you had
expected in your runtime reference-to-object layout.
Confirming Correctness of Your Runtime Reference-to-
Object Layout • In the build process, step through a line at a time to
prove:
To confirm suitability of your choice of data structures, you
o Individual words are parsed correctly.
can show it to me.
o First occurrences of new words result in the addition
Step 3: Implement the Build Process of a new entry in the lookup data structure. Each
With your runtime reference-to-object layout complete, and new entry has a key and a value component. The
proven parsing routines, you are ready to build the two lookup work is the key; the reference to the
related data structures. Here’s an outline of your algorithm. associated paragraph becomes the first entry in a
I’ll assume that you have already loaded the paragraphs into Collection of paragraph references (which is the
a collection. associated value).
o Subsequent occurrences of a word locate the target
• Create a new lookup map based on the blend Map and in the Map data structure and add another reference
Collection classes shown in the runtime reference-to- to the associated paragraph.
object layout you chose in Step 2. The key component of
• When the build is complete, you can use the debugger to
the Map is the lookup word. The value component of the
find components in the concordance index and compare
Map is the List of references to the associated
with your small text file.
paragraphs.
• Run it with one of the large text files (such as: Charles
• In a loop, iterate through all paragraphs that are in your
Dicken’s Oliver Twist, Charles Darwin’s Origin of the
Collection of paragraphs. 4
Species, Victor Hugo’s Les Miserables ). With big files you
o Get a reference to the current paragraph.
will see the Big-O effect of your wise (or not-so-wise)
o Invoke split() for that paragraph and capture the
choices.
resulting array of references to isolated words.
o In an inner loop, iterate through all words in the Ignore User-Interface and Display Issues
paragraph (using your proven parsing tools). Your final application will have to have some kind of user-
Lookup the isolated word in your Map. interface with display features. Don’t worry about that while
If it’s not found then, you must create a new testing the design now.
lookup entry for it, and then store the current
Prove that you can build the memory-resident indexing data
paragraph reference as the first instance in your
structure first.
list of paragraphs for this word.
Else if the word is found, the lookup entry must This first part doesn’t require a great deal of coding. In
already exist, and you merely add this paragraph addition, I’ve given you workable parsing routines.
reference to your list of paragraphs for this Your real challenge is understanding the relationship of the
word. two major data structures: the Collection of original
That’s the essence of the build algorithm. Of course, there are paragraphs; the Map of words that manage references back
a number of detailed issues that you must deal with. to the original paragraphs. Make sure you understand the
runtime reference-to-object layout that reflects the design
One detailed issue flows from your choice of Map and
you propose to implement.
Collection classes (Step 2 decision). Different collections have
variations in their constructors and in some of their methods,
so you’ll probably need to spend some time looking at the
Java SE7 API documentation.
Another detailed issue is how to handle multiple occurrences
of a word in a paragraph. For example, the word “the” might
occur a dozen times in a single paragraph. Your parsing
routine will discover all dozen instances of the word “the”.
However, you really want to store only one reference to this
paragraph – not a dozen references to this same paragraph.
There are several workable approaches to solving this
problem. I’ll leave it for you to choose the solution to this.
Testing Index Building
In my opinion, this testing is best done with a small sample
text file and through the use of the debugger. Set breakpoints
at strategic locations.
4
This is the biggest of the text files: 3,224,332 bytes in size.
Step 4: Putting the Final Project Together Note some of the special elements in this regular expression:
Accessing Indexing Information • \\b This identifies a standard word boundary. Of
This part should be easy. Imagine that the user is now using course, the double \\ is used to embed a
the concordance index. She types in “happy” to find all the single \ in the String object. The replaceAll()
places where happy is located in the original document. method will really receive \b which ensures that we
Clearly, “happy” is your key. Since you are using a Map, you search for a word that's delimited with word
5 boundaries (to avoid finding "the" in "their"
can use that key to retrieve the associated value . What’s the
associated value? The associated value is the list of "another" "them" etc.)
references to the original paragraphs. • ?i: This means perform the search of the word
Thus, retrieving the information about all occurrences of the inside the parentheses ignoring case.
word “happy” can be implemented with a single method call • ( ) This means group all elements inside the
to your Map object. The Big-O associated with that single parentheses, and the grouping has two purposes.
6 First, the case-insensitive setting of ?i: is applied to
method call will depend on your choice of Map.
everything inside the parentheses. Second, the item
Issues in Displaying Search Results found by the search becomes available for use; in
Your user has looked up a word, and you have the list of essence it establishes an accessible variable, which
paragraphs where that word is located. But how are you you’ll later see as $0
going to display the results? The following outlines an
approach that I used with my GUI-based implementation. You Building the Displayable Search Results
are free use the code (with appropriate credit) or you can In preparation for display, I create a StringBuilder object. I
implement your own. can’t use a String object, since String objects are immutable.
Imagine that you’re dealing with a single paragraph where The StringBuilder object will allow me to append again and
the target word resides. There may be more than one again with little performance penalty. The and are
occurrence of this word in the current paragraph, yet you will html tags for italics. The elements generate a line break
want to highlight all occurrences. Here’s a simple method in html.
using regular expressions. My initial StringBuilder object creation is shown at the top of
Given a search word, such as “happy”, I want to replace the next page (placed on a new page to avoid splitting the
“happy” with “happy” essentially surrounding the long line of Java code).
target word with hypertext tagging to turn it bold (because
the Java control in which I display the text can correctly
display html tagged text). If I was displaying a pure text
version, I might highlight “happy” by replacing it with
“**happy**”. How you highlight the word is your choice, and
it’s easy to adjust. There’s one more thing; I want to search
case-insensitive, but replace retaining case-sensitivity. So a
search for “happy” (an all lower-case search word) would
result in the following matches and associated replacements:
• happy happy
• Happy Happy
• HAPPY HAPPY
The following regular expression isolates a word:

String sRegExSearch = "\\b(?i:" + sSearchWord + ")\\b";
5
And remember, a Map is an associative container in that it
associates a key with a corresponding value. Of course, you
don’t have to use a Map, to build the concordance. There are
many ways to provide well-functioning, high-performance
indexing.
6
Remember, Map is an interface. You cannot create an
object of type Map. But you can create one of several
implementing subclasses, and they all share the same
method interfaces (though with different Big-O
characteristics).
StringBuilder sbResults = (new StringBuilder("Search: ")).append(sSearchWord).append(" ");
Finally, I want to replace all occurrences of the target word

with that same word, but surrounded with the html
tags and to turn it bold. The use of $0 retains
the original word with its mix of upper and lower case. If
there were 20 instances of the word the in the paragraph,
all would be replaced with this single method call – each
retaining its original pattern of upper and lower case.
String sModified = sParagraph.replaceAll(sRegExSearch, "$0");

sbSearchResults.append(String.format("Instance %d: %s ", nFirst+1, sModified));
The User Interface

• You need to allow the user to specify the name of the file
from which the concordance will be created.
• Once the concordance has been built, you need to allow
the user to search for a word and display the results,
highlighting the target search word in the displayed
paragraph.
• In cases where there will be many matches (for example,
when searching for the word “the”), implement a
mechanism to show a limited number of paragraphs at a
time.
• Allow the user to search again and again, on a single run
of the program.
Optional Bonuses
There are many possible enhancements. None are required,
but each represents a significant improvement over the basic
implementation.
• Implement a GUI interface. I’m quite happy to help with
this once you get started.
• Implement serialization. Remember that a concordance
is an index. Once the index has been built, it will never
have to change (since the original text of a novel like
Oliver Twist is never going to change). On subsequent
runs of the program it would be better to permit the
reload of the pre-built concordance rather than always
having to build it from the original source text file. In just
7
a single line of code , you can output a multi-megabyte
data structure with all the rich network of reference
relationships. Later, you can re-read this multi-megabyte
data structure and restore its rich network of reference
relationships; again, with a single line of code.
• Use your imagination and creativity . . . it will be
rewarded.
Due Date: Monday March 11th 8:00 a.m.
7
Of course, you do have to open the output serialization file
first before you use this ”magical” line of code.
The Problem Statement is weighted more heavily
Submission Requirements since you’ll be justifying your implementation
The following need to be part of your submission. choices with reference to Big-O issues.
• Title Page: It must identify you, the course, section

number, course professor, assignment name and CST8130: Data Structures
number, date of submission.
Lab Assignment 3: Concordance Generator
• Runtime reference-to-object layouts: I’ve moved this Your lab assignment has been evaluated using the following
requirement higher in the list to emphasize how criteria:
important it is. Your runtime reference-to-object layout
will communicate the kind of data structures you are □ Problem Statement including justification of choices
using to implement the concordance. This runtime using Wise Choices chart (2 marks)
reference-to-object layout is likely to be very similar to
the runtime reference-to-object layout you supplied for □ Test Plan (1 mark)
phase 1, but it should also reflect the improvements that
I may have suggested. The runtime reference-to-object □ Runtime reference-to-object layout (2 marks)
layout does not need to show detailed level reference ID
numbers, but it should show directed arrows that □ Source Code: Indentation / Naming / Commenting
represent the reference relationships. If you use a (1 mark)
HashMap for your primary lookup index, your diagram
should show samples of collisions and how they are □ Successful Execution (3 marks)
handled. In addition, the diagram should show the key
and value components of the Map, with arrows linking □ Format / Organization (1 mark)
the related objects. If you use a TreeMap, the diagram
should show enough entries (with their key) to clarify the
order in which each sample key is sequenced. If you use Subtotal /10
an ArrayList for your diagram should show how the
□ Penalties (up to 3 marks deducted).
sequence is structured.
• Source and Executable code: I will evaluate.
• Problem Statement: In your own words, write a
paragraph or two that describes what you are attempting
□ GUI Bonuses (up to 5 marks added).
to achieve with this assignment. You must justify the
choices that you make for your implementation with
references to Big-O issues.
• Chart of “Wise Choices”: This chart will capture the □ Other Bonuses (up to 3 marks added).
essence of the work you are doing in Data Structures. In
a chart form, I expect you to identify and defend the
choices you made for your Collection and Map classes. I
have entered samples in the chart below for the original Total /10
paragraphs to illustrate the level of detail in your
documentation. Of course, you also need to chart your I’ll demonstrate JavaFX GUI tools. You
choice for the indexing (using a Map class). are not required to use them, though,
if you do, you deserve a bonus.
Collection Process Method Big-O Frequency Side-Effect Issues

or Map of Use
TreeSet Building collection n log2 n Once to With TreeSet<String> to store the paragraphs, the paragraphs are sorted
of paragraphs, for load the alphabetically, which alters the original sequence of paragraphs in the novel.
n paragraphs novel. With TreeSet<Paragraph>, where you invent a class Paragraph, you can establish
an alternate mechanism to order the objects and maintain the original sequence.
ArrayList Same as above. n Same Maintains the original sequence by default.
LinkedList Same as above. n Same Maintains the original sequence by default.
LinkedHastSet Same as above. n Same The Hash component of LinkedHashSet resequences the original order based on
the hashCode(). The Linked component of LinkedHashSet maintains a parallel
linked list based on the insertion order.
All four might be possible, but which is best? You could create a reference-to Collection
to hold the collection of paragraphs, and then experiment with different concrete
implementing classes (such as, TreeSet, ArrayList, LinkedList and LinkedHashSet).
Note: You also need a chart for the Map choices. This
chart only shows a selection of Collection choices.

Assignment3 ConcordanceGenerator PDF

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Assignment3 ConcordanceGenerator PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Assignment3 ConcordanceGenerator PDF

Uploaded by

Copyright:

Available Formats

CST8130:

The following regular expression isolates a word:

Finally, I want to replace all occurrences of the target word

String sModified = sParagraph.replaceAll(sRegExSearch, "<b>$0</b>");

The User Interface

Due Date: Monday March 11th 8:00 a.m.

• Title Page: It must identify you, the course, section

Collection Process Method Big-O Frequency Side-Effect Issues

You might also like