A Python script that automatically extracts errata information from curriculum websites and exports it to CSV format.
- Secure Authentication: handles login with VPN compatibility
- Web Scraping: extracts data from accordion-structured errata pages
- CSV Export: outputs structured data with configurable columns
- Data Validation: ensures data quality and consistency
- Comprehensive Logging: detailed logs for troubleshooting
- Automated Processing: end-to-end extraction with minimal setup
The script generates a CSV file with the following columns:
| Column | Description | Example |
|---|---|---|
| Date_Extracted | When the data was extracted | 2025-08-21 14:30:15 |
| Unit | Curriculum unit identifier | Unit 1, Unit 2, etc. |
| Resource | Type of resource | Teacher Guide, Teacher Edition Glossary, Student Edition |
| Location | Specific location within resource | Lesson 3 Section A, Chapter 2 |
| Instructional_Moment | Type of instructional activity | Warm-up, Activity, Lesson Synthesis |
| Page_Numbers | Relevant page numbers | 12-15, 7, 20, 25 |
| Improvement_Description | Description of the change/correction | Updated calculation formula |
| Improvement_Type | Type of improvement | Correction, Update, Enhancement |
| Date_Updated | When the improvement was made | 2025-08-15 |
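The export step can be sketched with the standard library's `csv.DictWriter`, using the column order from the table above. This is a minimal illustration, not the project's actual `src/csv_writer.py`; the function name `write_errata_csv` is hypothetical.

```python
import csv
from datetime import datetime

# Column order matching the table above (configurable via config.yaml).
CSV_COLUMNS = [
    "Date_Extracted", "Unit", "Resource", "Location",
    "Instructional_Moment", "Page_Numbers",
    "Improvement_Description", "Improvement_Type", "Date_Updated",
]

def write_errata_csv(path, rows):
    """Write extracted errata records to CSV in the configured column order.

    Missing fields are left blank (DictWriter's default restval); any extra
    keys in a record are ignored rather than raising an error.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            # Stamp the extraction time if the scraper did not set it.
            row.setdefault("Date_Extracted",
                           datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            writer.writerow(row)
```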
- Clone or download the project:

  ```
  cd "c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator"
  ```

- Install Python dependencies:

  ```
  pip install -r requirements.txt
  ```
- Set up configuration:
  - Copy `config/credentials.env.template` to `config/.env`
  - Edit `config/.env` with your actual website credentials
  - Update `config/config.yaml` with your website's specific URLs and selectors

The `.env` file should contain:

```
USERNAME=your_username_here
PASSWORD=your_password_here
```
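Loading the credentials can be done with a small standard-library helper like the sketch below (the project may instead use a package such as python-dotenv; `load_env` is an illustrative name, not necessarily the project's actual code):

```python
import os

def load_env(path="config/.env"):
    """Minimal .env loader: reads KEY=value lines into os.environ.

    Blank lines and '#' comments are skipped. setdefault means values
    already present in the real environment take precedence.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```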
Update the following sections for your specific curriculum website:

```yaml
website:
  base_url: "https://your-curriculum-site.com"
  login_url: "/login"
  errata_pages:
    - "/errata/unit1"
    - "/errata/unit2"

selectors:
  errata_container: ".errata-list"
  unit_field: ".unit-name"
  resource_field: ".resource-type"
  # ... other selectors
```
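Once the YAML above is loaded into a dict (e.g. with PyYAML), the relative `errata_pages` paths resolve against `base_url` with the standard library's `urljoin`. A small sketch (the helper name `build_errata_urls` is illustrative):

```python
from urllib.parse import urljoin

def build_errata_urls(config):
    """Resolve the relative errata page paths against the site's base_url."""
    site = config["website"]
    return [urljoin(site["base_url"], page) for page in site["errata_pages"]]
```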
Full extraction (recommended for the first run):

```
python main.py
```

Test authentication only:

```
python main.py --test-auth
```

Incremental update (add only new errata):

```
python main.py --incremental
```

Use requests instead of Selenium (faster, but may miss dynamic content):

```
python main.py --use-requests
```

Validate the setup:

```
python main.py --validate-setup
```

Use a custom configuration file:

```
python main.py --config path/to/custom-config.yaml
```

Combine options:

```
python main.py --incremental --use-requests
```
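The command-line interface above maps naturally onto `argparse`. The sketch below shows one way the flags could be declared; the real `main.py` may define them differently:

```python
import argparse

def parse_args(argv=None):
    """Declare the CLI flags shown in the usage examples (illustrative)."""
    parser = argparse.ArgumentParser(description="Errata-Locator")
    parser.add_argument("--test-auth", action="store_true",
                        help="test authentication only, then exit")
    parser.add_argument("--incremental", action="store_true",
                        help="add only new errata to the existing CSV")
    parser.add_argument("--use-requests", action="store_true",
                        help="use requests instead of Selenium")
    parser.add_argument("--validate-setup", action="store_true",
                        help="validate configuration and exit")
    parser.add_argument("--config", default="config/config.yaml",
                        help="path to a custom configuration file")
    return parser.parse_args(argv)
```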
```
Errata-Locator/
├── main.py                        # Entry point script
├── requirements.txt               # Python dependencies
├── README.md                      # This file
├── src/
│   ├── auth.py                    # Authentication handling
│   ├── scraper.py                 # Main scraping coordination
│   ├── parser.py                  # HTML parsing and data extraction
│   └── csv_writer.py              # CSV output and data processing
├── config/
│   ├── config.yaml                # Main configuration
│   ├── credentials.env.template   # Template for credentials
│   └── .env                       # Your actual credentials (create this)
├── output/
│   ├── errata_changes.csv         # Generated CSV file
│   └── backups/                   # Automatic backups
└── logs/
    └── errata_locator.log         # Application logs
```
- Import errors when running:
  - Make sure all dependencies are installed: `pip install -r requirements.txt`
  - Check that you're running from the correct directory

- Authentication failures:
  - Verify the credentials in `config/.env`
  - Test authentication only: `python main.py --test-auth`
  - Check that the login URL and selectors are correct

- No data extracted:
  - Verify the errata page URLs in `config.yaml`
  - Check the CSS selectors for data extraction
  - Try using Selenium instead of requests: remove the `--use-requests` flag

- Chrome driver issues (when using Selenium):
  - The script automatically downloads the Chrome driver
  - Make sure the Chrome browser is installed
  - Check firewall/antivirus settings
Enable debug logging by editing `config/config.yaml`:

```yaml
logging:
  level: "DEBUG"
```

Check the log files in the `logs/` directory for detailed error information.
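Wiring that configuration value into Python's standard `logging` module could look like the sketch below (illustrative; the project's actual logging setup may differ):

```python
import logging

def setup_logging(level="INFO", log_file="logs/errata_locator.log"):
    """Send log records to both the console and the log file.

    `level` is the string from config.yaml (e.g. "DEBUG"); unknown values
    fall back to INFO. force=True resets any handlers configured earlier.
    """
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[
            logging.StreamHandler(),
            logging.FileHandler(log_file, encoding="utf-8"),
        ],
        force=True,
    )
```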
- Add new selectors to `config/config.yaml`:

  ```yaml
  selectors:
    new_field: ".new-field-selector"
  ```

- Add the field to the CSV columns:

  ```yaml
  output:
    csv_columns:
      - "Date_Extracted"
      - "New_Field"
      # ... other columns
  ```

- Update the parsing logic in `src/parser.py` if needed.
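When a new field is added, every row should still emit exactly the configured columns so the CSV stays aligned even where the selector matches nothing. A small normalization helper sketches the idea (`normalize_row` is a hypothetical name, not necessarily present in `src/parser.py`):

```python
def normalize_row(raw, columns, default=""):
    """Return a row dict containing exactly the configured CSV columns.

    Any field missing from the parsed record (e.g. a newly added selector
    that matched nothing on an older page) is filled with a default value.
    Extra keys in `raw` are dropped.
    """
    return {col: raw.get(col, default) for col in columns}
```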
If your curriculum website has a different structure:

- Update the `selectors` section in `config/config.yaml`
- Modify the login process in `src/auth.py` if needed
- Adjust the parsing logic in `src/parser.py` for your specific HTML structure
To run the script automatically:

- Open Task Scheduler
- Create a Basic Task
- Set a trigger (daily, weekly, etc.)
- Action: Start a program
- Program: `python`
- Arguments: `main.py --incremental`
- Start in: `c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator`

Alternatively, register the task from PowerShell:

```powershell
$trigger = New-ScheduledTaskTrigger -Daily -At "09:00AM"
$action = New-ScheduledTaskAction -Execute "python" -Argument "main.py --incremental" -WorkingDirectory "c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator"
Register-ScheduledTask -TaskName "ErrataLocator" -Trigger $trigger -Action $action
```
- Never commit the `.env` file to version control
- Use environment variables for sensitive credentials
- The script respects rate limiting to avoid overwhelming servers
- Consider using application-specific passwords if available
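The rate-limiting behavior mentioned above can be implemented with a simple throttle that enforces a minimum delay between requests. A stdlib sketch (the class name and interval are illustrative; the actual delay would come from configuration):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to honor the minimum interval, then
        record the current time as the start of the next interval."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

The scraper would call `throttle.wait()` before each page fetch, so bursts of requests are spread out regardless of how fast individual pages parse.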
For issues or questions:

- Check the logs in `logs/errata_locator.log`
- Run with `--validate-setup` to check the configuration
- Test authentication separately with `--test-auth`
- Enable debug logging for more detailed information