CS 3308 Programming Assignment 2
Exploring a Python-Powered Text Indexer with SQLite
Introduction
In the field of information retrieval, efficiently processing and indexing large amounts of
unstructured text is crucial for developing high-performance search systems. The Unit 2
assignment involves creating a text indexer using Python and SQLite. This tool processes a set of
documents, breaks down their content into tokens, and structures the data into dictionaries stored
in a relational database. Through this project, students gain hands-on experience in constructing
the foundational components of a basic search engine.
Core Functionality
The Python indexer starts by scanning the documents in a directory named *cacm*, where each
file represents a separate document. Using a regular expression, the text is tokenized, that is,
split into individual words on non-word characters, and each unique token is treated as a term.
Following Manning, Raghavan, and Schütze (2009), the program creates a *Term* object for every
identified term, recording a unique term ID, the term's total number of occurrences in the corpus
(term frequency), and the number of documents that contain it (document frequency). Each
document's path and its assigned ID are likewise recorded in a *DocumentDictionary*. This
structured data is saved across three SQLite database tables (a code sketch follows the list):
*DocumentDictionary*: links file paths to unique document IDs.
*TermDictionary*: associates each term with a unique term ID.
*Posting* (for extensions): designed to store advanced indexing details, such as TF-IDF scores.
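As a minimal sketch only, the indexing pass described above might look as follows. The table
and column names (DocName, DocId, Term, TermId, TfIdf) are assumptions chosen to mirror the
description, not the assignment's exact schema.

```python
import os
import re
import sqlite3
from collections import Counter

# Minimal sketch of the indexing pass described above (not the assignment's
# exact code); table and column names are assumptions that mirror the text.

def build_index(corpus_dir="cacm", db_path="indexer_part2.db"):
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS DocumentDictionary (DocName TEXT, DocId INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS TermDictionary (Term TEXT, TermId INTEGER)")
    cur.execute("CREATE TABLE IF NOT EXISTS Posting (TermId INTEGER, DocId INTEGER, TfIdf REAL)")

    term_ids = {}              # term -> unique term ID, in order of first appearance
    corpus_freq = Counter()    # term -> total occurrences (term frequency)
    doc_freq = Counter()       # term -> number of documents containing it

    for doc_id, name in enumerate(sorted(os.listdir(corpus_dir)), start=1):
        path = os.path.join(corpus_dir, name)
        with open(path, errors="ignore") as f:
            tokens = [t.lower() for t in re.split(r"\W+", f.read()) if t]
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))   # each term counted once per document
        for term in tokens:
            term_ids.setdefault(term, len(term_ids) + 1)
        cur.execute("INSERT INTO DocumentDictionary VALUES (?, ?)", (path, doc_id))

    cur.executemany("INSERT INTO TermDictionary VALUES (?, ?)", term_ids.items())
    con.commit()
    con.close()
    print(f"unique terms: {len(term_ids)}, total tokens: {sum(corpus_freq.values())}")
```

For the Posting extension, a common TF-IDF weighting is tf(t, d) × log(N / df(t)), where N is the
total number of documents and df(t) is the document frequency collected above.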
Assignment Components and Clarifications
The submission requires four elements: the Python indexer code, documents.dat and index.dat
files, performance metrics, and a reflective summary. However, the provided code does not
explicitly generate documents.dat and index.dat files. Instead, all necessary data is stored in an
SQLite database named indexer_part2.db.
Since the SQLite database already organizes all the structured data, including document paths,
term IDs, frequencies, and document-term relationships, the .dat files are redundant: the
database serves the same purpose while offering superior querying capabilities. If the files were
required, the code could be adjusted to export the document ID mappings and term-posting lists
into plain text, but this functionality is not currently implemented; a possible export is
sketched below.
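As a hedged illustration only, assuming the table and column names from the earlier sketch, such
an export might look like this:

```python
import sqlite3

# Hypothetical export of documents.dat and index.dat from the SQLite database.
# Assumes the DocumentDictionary/TermDictionary tables sketched earlier.

con = sqlite3.connect("indexer_part2.db")

with open("documents.dat", "w") as f:
    for path, doc_id in con.execute("SELECT DocName, DocId FROM DocumentDictionary"):
        f.write(f"{doc_id}\t{path}\n")

with open("index.dat", "w") as f:
    for term, term_id in con.execute("SELECT Term, TermId FROM TermDictionary"):
        f.write(f"{term_id}\t{term}\n")

con.close()
```

Each output line pairs an ID with its path or term, which is the flat-file role that
documents.dat and index.dat would otherwise play.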
Output Explanation
Once the indexer completes its execution, it displays several key results: the start and end
times of the run, the contents of the TermDictionary, which lists each unique term and its
assigned ID, and the DocumentDictionary, which associates each file name with a unique document
ID. It also prints summary statistics: the number of documents processed (570), the number of
unique terms identified (4279), and the total number of tokens extracted (37470).
Observations (for Submission)
Content of Data: The CACM dataset consists of academic articles containing technical terms
related to computer science. Splitting on the \W+ regex discards punctuation and other special
characters; note that digits count as word characters in Python's regex, so purely numeric
tokens are retained unless filtered separately.
Running Time: Processing the entire corpus takes approximately 8 minutes, though this varies
with my system's performance.
Efficiency: The in-memory dictionary works well for small to medium corpora. For larger
datasets, storing results in SQLite improves scalability and data persistence.
Issues: If the cacm directory is not properly located or extracted, the program raises an error,
so it is important to specify the correct file path; a defensive check is sketched below.
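As an illustrative sketch rather than the assignment's actual code, a simple guard makes that
failure explicit before indexing begins:

```python
import os
import sys

corpus_dir = "cacm"  # path to the extracted CACM collection

# Fail early with a clear message instead of a traceback deep in the indexer.
if not os.path.isdir(corpus_dir):
    sys.exit(f"Corpus directory '{corpus_dir}' not found; "
             "extract the archive or correct the path before running.")
```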
References
Manning, C. D., Raghavan, P., & Schütze, H. (2009). Introduction to information retrieval
(Online ed.). Cambridge University Press. http://nlp.stanford.edu/IR-book/information-retrieval-book.htm