CS 3308 - Programming Assignment Unit 2
In this assignment, I worked with the index.py script within an integrated development environment (IDE).
The objective was to ensure the script ran without errors and to analyze the outputs generated when
processing the corpus file (cacm). After making the necessary edits to the script, primarily updating the
file path for the corpus directory, I ran and debugged it until I had confirmed it was free of errors.
Once the code was verified, I ran the script multiple times to check that the results were consistent.
After verifying that the output remained stable and accurate, I examined the documents.dat and index.dat
files created at the end of the script's execution. These files are essential components of the inverted
index: they store the mappings between terms and the documents that contain them.
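The course's index.py is not reproduced here, but the kind of term-to-document mapping these files hold can be sketched with a minimal inverted index. Everything below (the `build_index` function, the sample documents) is an illustrative assumption, not the actual script:

```python
import re
from collections import defaultdict

def build_index(docs):
    """Build a minimal inverted index: term -> {doc_id: term_frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Lowercase and keep alphanumeric runs only, as a simple tokenizer.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

# Two toy documents standing in for entries in the cacm corpus.
docs = {1: "information retrieval systems", 2: "inverted index for retrieval"}
index = build_index(docs)
print(sorted(index["retrieval"]))  # doc IDs containing "retrieval": [1, 2]
```

A real indexer would also persist this structure to disk (as documents.dat and index.dat do) rather than keep it only in memory.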
Observation:
While running the script, I noticed that the output varied across runs until the last three executions,
where the results became consistent. This inconsistency prompted me to trace the sequence of events leading
to the final, uniform output. The discrepancy disappeared after several successful runs, which produced the
documents.dat and index.dat files.
Here are the outputs I encountered during the first execution:
Conclusion
The indexer successfully served its primary purpose of generating the required output files,
documents.dat and index.dat, which map terms to documents. However, I am not entirely certain
that the index is perfect. My concern stems from possible inaccuracies in the tokenization
process, which could have affected the final index. Specifically, the metrics suggested
imperfect tokenization, which could be caused by several factors: the format of the documents,
inconsistencies in word spacing, or special characters that the tokenizer did not handle properly.
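To illustrate how special characters and spacing can skew token counts, here is a small, hypothetical comparison between a naive whitespace split and a regex-based normalizer. This is a sketch of the kind of issue described above, not the tokenizer actually used by index.py:

```python
import re

def naive_tokens(text):
    # Whitespace split leaves punctuation attached to tokens.
    return text.split()

def normalized_tokens(text):
    # Lowercase and keep only alphanumeric runs, stripping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "The CACM corpus -- e.g., titles/abstracts -- has punctuation."
print(naive_tokens(sample))       # '--', 'e.g.,', 'titles/abstracts' survive
print(normalized_tokens(sample))  # punctuation stripped, terms lowercased
```

With the naive approach, "punctuation." and "punctuation" would index as different terms, which is exactly the sort of inconsistency that could make the index metrics look imperfect.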