CS 3308 Programming Assignment 2
Exploring a Python-Powered Text Indexer with SQLite
Introduction
In the field of information retrieval, efficiently processing and indexing large amounts of
unstructured text is crucial for developing high-performance search systems. The Unit 2
assignment involves creating a text indexer using Python and SQLite. This tool processes a set of
documents, breaks down their content into tokens, and structures the data into dictionaries stored
in a relational database. Through this project, students gain hands-on experience in constructing
the foundational components of a basic search engine.
Core Functionality
The Python indexer starts by scanning the documents in a directory named *cacm*, where each
file represents a separate document. Using a regular expression, the text is tokenized, that is,
split into individual words on non-word characters, and each unique token is treated as a term.
Following Manning, Raghavan, and Schütze (2009), the program creates a *Term* object for every
identified term, recording a unique term ID, the term's total number of occurrences in the corpus
(term frequency), and the number of documents that contain it (document frequency). Each
document's path and its assigned ID are likewise recorded in a *DocumentDictionary*. This
structured data is saved across three SQLite database tables (a code sketch follows the list):
*DocumentDictionary*: links file paths to unique document IDs.
*TermDictionary*: associates each term with a unique term ID.
*Posting* (for extensions): designed to store advanced indexing details, such as TF-IDF scores.
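As a minimal sketch only, the indexing pass described above might look as follows. The table
and column names (DocName, DocId, Term, TermId, TfIdf) are assumptions chosen to mirror the
description, not the assignment's exact schema.

```python
import os
import re
import sqlite3
from collections import Counter

# Minimal sketch of the indexing pass described above (not the assignment's
# exact code); table and column names are assumptions that mirror the text.

def build_index(corpus_dir="cacm", db_path="indexer_part2.db"):
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS DocumentDictionary (DocName TEXT, DocId INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS TermDictionary (Term TEXT, TermId INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS Posting (TermId INTEGER, DocId INTEGER, TfIdf REAL)")

    term_ids = {}              # term -> unique term ID, in order of first appearance
    corpus_freq = Counter()    # term -> total occurrences (term frequency)
    doc_freq = Counter()       # term -> number of documents containing it

    for doc_id, name in enumerate(sorted(os.listdir(corpus_dir)), start=1):
        path = os.path.join(corpus_dir, name)
        with open(path, errors="ignore") as f:
            tokens = [t.lower() for t in re.split(r"\W+", f.read()) if t]
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))   # each term counted once per document
        for term in tokens:
            term_ids.setdefault(term, len(term_ids) + 1)
        cur.execute("INSERT INTO DocumentDictionary VALUES (?, ?)", (path, doc_id))

    cur.executemany("INSERT INTO TermDictionary VALUES (?, ?)", term_ids.items())
    con.commit()
    con.close()
    print(f"unique terms: {len(term_ids)}, total tokens: {sum(corpus_freq.values())}")
```

For the Posting extension, a common TF-IDF weighting is tf(t, d) × log(N / df(t)), where N is the
total number of documents and df(t) is the document frequency collected above.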
Assignment Components and Clarifications
The submission requires four elements: the Python indexer code, documents.dat and index.dat
files, performance metrics, and a reflective summary. However, the provided code does not
explicitly generate documents.dat and index.dat files. Instead, all necessary data is stored in an
SQLite database named indexer_part2.db.
Since the SQLite database already organizes all the structured data, including document paths,
term IDs, frequencies, and document-term relationships, the .dat files are redundant: the
database serves the same purpose while offering superior querying capabilities. If the files were
required, the code could be adjusted to export the document ID mappings and term-posting lists
into plain text, but this functionality is not currently implemented; a possible export is
sketched below.
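As a hedged illustration only, assuming the table and column names from the earlier sketch, such
an export might look like this:

```python
import sqlite3

# Hypothetical export of documents.dat and index.dat from the SQLite database.
# Assumes the DocumentDictionary/TermDictionary tables sketched earlier.

con = sqlite3.connect("indexer_part2.db")

with open("documents.dat", "w") as f:
    for path, doc_id in con.execute("SELECT DocName, DocId FROM DocumentDictionary"):
        f.write(f"{doc_id}\t{path}\n")

with open("index.dat", "w") as f:
    for term, term_id in con.execute("SELECT Term, TermId FROM TermDictionary"):
        f.write(f"{term_id}\t{term}\n")

con.close()
```

Each output line pairs an ID with its path or term, which is the flat-file role that
documents.dat and index.dat would otherwise play.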
Output Explanation
Once the indexer completes its execution, it displays several key results: the start and end
times of the run, the contents of the TermDictionary, which lists each unique term and its
assigned ID, and the DocumentDictionary, which associates each file name with a unique document
ID. It also prints summary statistics: the number of documents processed (570), the number of
unique terms identified (4279), and the total number of tokens extracted (37470).
Observations (for Submission)
Content of Data: The CACM dataset consists of academic articles containing technical terms
related to computer science. Splitting on the \W+ regex discards punctuation and other special
characters; note that digits count as word characters in Python's regex, so purely numeric
tokens are retained unless filtered separately.
Running Time: Processing the entire corpus takes approximately 8 minutes, though this varies
with my system's performance.
Efficiency: The in-memory dictionary works well for small to medium corpora. For larger
datasets, storing results in SQLite improves scalability and data persistence.
Issues: If the cacm directory is not properly located or extracted, the program raises an error,
so it is important to specify the correct file path; a defensive check is sketched below.
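As an illustrative sketch rather than the assignment's actual code, a simple guard makes that
failure explicit before indexing begins:

```python
import os
import sys

corpus_dir = "cacm"  # path to the extracted CACM collection

# Fail early with a clear message instead of a traceback deep in the indexer.
if not os.path.isdir(corpus_dir):
    sys.exit(f"Corpus directory '{corpus_dir}' not found; "
             "extract the archive or correct the path before running.")
```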
References
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval
(Online ed.). Cambridge University Press. http://nlp.stanford.edu/IR-book/information-retrieval-book.htm