
Programming Assignment Unit 2: Information Retrieval

University of the People

CS 3308: Information Retrieval

Sharina Babb, Instructor

February 12, 2025


Introduction:

In this assignment, I worked with the index.py script within an integrated development environment (IDE).
The objective was to get the script running without errors and to analyze the output it generated when
processing the corpus file (cacm). After making the necessary edits to the script, primarily updating the file
path for the corpus directory, I ran and debugged it until no errors remained. Once the code was working, I ran
the script several more times to check that the results were consistent.
After confirming that the output was stable, I examined the documents.dat and index.dat files that were
created at the end of the script's execution. These files are essential components of the inverted index, and
they contain the term and document mappings.
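
To make the relationship between these two files concrete, the sketch below shows, under my own simplifying assumptions, how an indexer of this kind can walk a corpus directory, assign document IDs, tokenize each file, and write documents.dat and index.dat. The file names match the assignment, but the tokenization rule, the record layout, and the helper names are hypothetical and are not taken from the course's index.py.

```python
import os
import re
from collections import defaultdict

def build_index(corpus_dir, docs_path="documents.dat", index_path="index.dat"):
    """Toy inverted indexer: an assumed layout, not the course's index.py."""
    postings = defaultdict(lambda: defaultdict(int))  # term -> {doc_id: frequency}
    doc_count = token_count = 0

    with open(docs_path, "w") as docs_out:
        for doc_id, name in enumerate(sorted(os.listdir(corpus_dir)), start=1):
            with open(os.path.join(corpus_dir, name), errors="ignore") as f:
                text = f.read()
            # Assumed tokenization rule: lowercase alphanumeric runs only.
            tokens = re.findall(r"[a-z0-9]+", text.lower())
            token_count += len(tokens)
            doc_count += 1
            docs_out.write(f"{doc_id}\t{name}\t{len(tokens)}\n")  # document mapping
            for tok in tokens:
                postings[tok][doc_id] += 1

    with open(index_path, "w") as index_out:
        for term in sorted(postings):  # term -> document mapping
            docs = postings[term]
            entries = " ".join(f"{d}:{tf}" for d, tf in sorted(docs.items()))
            index_out.write(f"{term}\t{len(docs)}\t{entries}\n")

    print(f"Documents {doc_count}  Tokens {token_count}  Terms {len(postings)}")

# build_index("cacm")  # the path would need to point at the local corpus directory
```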

Observation:

While running the script, I noticed that the output varied from run to run until the last three executions,
where the results became consistent. This inconsistency was intriguing and prompted me to examine the
sequence of events leading to the final, uniform output. The discrepancy disappeared after several successful
runs, each of which generated the documents.dat and index.dat files.
Here is the output I encountered during the first execution:

Actual Output from the First Execution:

- Processing Start Time: 20:10
- Number of Documents: 570
- Number of Tokens: 26,543
- Number of Terms: 9,606
- Processing End Time: 20:10

Image of the actual output:


Output reported by the last three executions:

- Processing Start Time: 20:16
- Number of Documents: 572
- Number of Tokens: 26,545
- Number of Terms: 9,608
- Processing End Time: 20:16

Below is the detailed image of the outputs:
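
The gap between the token and term counts in both runs is simply the difference between total occurrences and distinct vocabulary entries. The short sketch below illustrates that distinction on a toy string; the tokenization rule is my own assumption, not the one used by index.py.

```python
import re
from collections import Counter

text = "Retrieval, indexing, and retrieval again: an indexing example."

# Assumed rule: lowercase alphanumeric runs, mirroring the earlier sketch.
tokens = re.findall(r"[a-z0-9]+", text.lower())
terms = Counter(tokens)            # one entry per distinct term

print("Tokens:", len(tokens))      # every occurrence counts -> 8
print("Terms:", len(terms))        # distinct vocabulary size -> 6
```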

Conclusion

The indexer served its primary purpose of generating the required output files, documents.dat and
index.dat, which map documents and terms. However, I am not entirely certain that the index is perfect.
My concern stems from possible inaccuracies in the tokenization process, which would carry through to
the final index. Specifically, the metrics suggested potentially imperfect tokenization, which could be
caused by several factors, such as the format of the documents, inconsistencies in word spacing, or
special characters that the tokenizer did not handle properly.
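
To illustrate why special characters matter, the fragment below contrasts a naive whitespace split with a tokenizer that strips punctuation and normalizes case: the naive splitter treats "Index," and "index" as different terms and inflates the term count. Both rules are illustrative assumptions rather than the actual behavior of the course's index.py.

```python
import re

sample = "Index, INDEX... (index) -- e.g., CACM's abstracts."

naive = sample.split()                              # punctuation stays attached
cleaned = re.findall(r"[a-z0-9]+", sample.lower())  # strip punctuation, lowercase

print(sorted(set(naive)))    # 'Index,', 'INDEX...', '(index)' count as separate terms
print(sorted(set(cleaned)))  # 'index' collapses to a single term
```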
