A Python script that automatically extracts errata information from curriculum websites and exports it to CSV format.
- Secure Authentication: handles login with VPN compatibility
- Web Scraping: extracts data from accordion-structured errata pages
- CSV Export: outputs structured data with configurable columns
- Data Validation: ensures data quality and consistency
- Comprehensive Logging: detailed logs for troubleshooting
- Automated Processing: end-to-end extraction with minimal setup
The script generates a CSV file with the following columns:
| Column | Description | Example |
|---|---|---|
| Date_Extracted | When the data was extracted | 2025-08-21 14:30:15 |
| Unit | Curriculum unit identifier | Unit 1, Unit 2, etc. |
| Resource | Type of resource | Teacher Guide, Teacher Edition Glossary, Student Edition |
| Location | Specific location within resource | Lesson 3 Section A, Chapter 2 |
| Instructional_Moment | Type of instructional activity | Warm-up, Activity, Lesson Synthesis |
| Page_Numbers | Relevant page numbers | 12-15, 7, 20, 25 |
| Improvement_Description | Description of the change/correction | Updated calculation formula |
| Improvement_Type | Type of improvement | Correction, Update, Enhancement |
| Date_Updated | When the improvement was made | 2025-08-15 |
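The export step can be sketched with the standard library's `csv.DictWriter`, using the column order from the table above. This is a minimal illustration, not the project's actual `src/csv_writer.py`; the function name `write_errata_csv` is hypothetical.

```python
import csv
from datetime import datetime

# Column order matching the table above (configurable via config.yaml).
CSV_COLUMNS = [
    "Date_Extracted", "Unit", "Resource", "Location",
    "Instructional_Moment", "Page_Numbers",
    "Improvement_Description", "Improvement_Type", "Date_Updated",
]

def write_errata_csv(path, rows):
    """Write extracted errata records to CSV in the configured column order.

    Missing fields are left blank (DictWriter's default restval); any extra
    keys in a record are ignored rather than raising an error.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            # Stamp the extraction time if the scraper did not set it.
            row.setdefault("Date_Extracted",
                           datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
            writer.writerow(row)
```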
- Clone or download the project:

  ```
  cd "c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator"
  ```

- Install Python dependencies:

  ```
  pip install -r requirements.txt
  ```
- Set up configuration:
  - Copy `config/credentials.env.template` to `config/.env`
  - Edit `config/.env` with your actual website credentials
  - Update `config/config.yaml` with your website's specific URLs and selectors

The `.env` file should contain:

```
USERNAME=your_username_here
PASSWORD=your_password_here
```
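Loading the credentials can be done with a small standard-library helper like the sketch below (the project may instead use a package such as python-dotenv; `load_env` is an illustrative name, not necessarily the project's actual code):

```python
import os

def load_env(path="config/.env"):
    """Minimal .env loader: reads KEY=value lines into os.environ.

    Blank lines and '#' comments are skipped. setdefault means values
    already present in the real environment take precedence.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```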
Update the following sections for your specific curriculum website:

```yaml
website:
  base_url: "https://your-curriculum-site.com"
  login_url: "/login"
  errata_pages:
    - "/errata/unit1"
    - "/errata/unit2"

selectors:
  errata_container: ".errata-list"
  unit_field: ".unit-name"
  resource_field: ".resource-type"
  # ... other selectors
```
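Once the YAML above is loaded into a dict (e.g. with PyYAML), the relative `errata_pages` paths resolve against `base_url` with the standard library's `urljoin`. A small sketch (the helper name `build_errata_urls` is illustrative):

```python
from urllib.parse import urljoin

def build_errata_urls(config):
    """Resolve the relative errata page paths against the site's base_url."""
    site = config["website"]
    return [urljoin(site["base_url"], page) for page in site["errata_pages"]]
```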
Full extraction (recommended for the first run):

```
python main.py
```

Test authentication only:

```
python main.py --test-auth
```

Incremental update (add only new errata):

```
python main.py --incremental
```

Use requests instead of Selenium (faster, but may miss dynamic content):

```
python main.py --use-requests
```

Validate the setup:

```
python main.py --validate-setup
```

Use a custom configuration file:

```
python main.py --config path/to/custom-config.yaml
```

Combine options:

```
python main.py --incremental --use-requests
```
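The command-line interface above maps naturally onto `argparse`. The sketch below shows one way the flags could be declared; the real `main.py` may define them differently:

```python
import argparse

def parse_args(argv=None):
    """Declare the CLI flags shown in the usage examples (illustrative)."""
    parser = argparse.ArgumentParser(description="Errata-Locator")
    parser.add_argument("--test-auth", action="store_true",
                        help="test authentication only, then exit")
    parser.add_argument("--incremental", action="store_true",
                        help="add only new errata to the existing CSV")
    parser.add_argument("--use-requests", action="store_true",
                        help="use requests instead of Selenium")
    parser.add_argument("--validate-setup", action="store_true",
                        help="validate configuration and exit")
    parser.add_argument("--config", default="config/config.yaml",
                        help="path to a custom configuration file")
    return parser.parse_args(argv)
```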
```
Errata-Locator/
├── main.py                        # Entry point script
├── requirements.txt               # Python dependencies
├── README.md                      # This file
├── src/
│   ├── auth.py                    # Authentication handling
│   ├── scraper.py                 # Main scraping coordination
│   ├── parser.py                  # HTML parsing and data extraction
│   └── csv_writer.py              # CSV output and data processing
├── config/
│   ├── config.yaml                # Main configuration
│   ├── credentials.env.template   # Template for credentials
│   └── .env                       # Your actual credentials (create this)
├── output/
│   ├── errata_changes.csv         # Generated CSV file
│   └── backups/                   # Automatic backups
└── logs/
    └── errata_locator.log         # Application logs
```
- Import errors when running:
  - Make sure all dependencies are installed: `pip install -r requirements.txt`
  - Check that you're running from the correct directory

- Authentication failures:
  - Verify the credentials in `config/.env`
  - Test authentication only: `python main.py --test-auth`
  - Check that the login URL and selectors are correct

- No data extracted:
  - Verify the errata page URLs in `config.yaml`
  - Check the CSS selectors for data extraction
  - Try using Selenium instead of requests: remove the `--use-requests` flag

- Chrome driver issues (when using Selenium):
  - The script automatically downloads the Chrome driver
  - Make sure the Chrome browser is installed
  - Check firewall/antivirus settings
Enable debug logging by editing `config/config.yaml`:

```yaml
logging:
  level: "DEBUG"
```

Check the log files in the `logs/` directory for detailed error information.
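Wiring that configuration value into Python's standard `logging` module could look like the sketch below (illustrative; the project's actual logging setup may differ):

```python
import logging

def setup_logging(level="INFO", log_file="logs/errata_locator.log"):
    """Send log records to both the console and the log file.

    `level` is the string from config.yaml (e.g. "DEBUG"); unknown values
    fall back to INFO. force=True resets any handlers configured earlier.
    """
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        handlers=[
            logging.StreamHandler(),
            logging.FileHandler(log_file, encoding="utf-8"),
        ],
        force=True,
    )
```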
- Add new selectors to `config/config.yaml`:

  ```yaml
  selectors:
    new_field: ".new-field-selector"
  ```

- Add the field to the CSV columns:

  ```yaml
  output:
    csv_columns:
      - "Date_Extracted"
      - "New_Field"
      # ... other columns
  ```

- Update the parsing logic in `src/parser.py` if needed.
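When a new field is added, every row should still emit exactly the configured columns so the CSV stays aligned even where the selector matches nothing. A small normalization helper sketches the idea (`normalize_row` is a hypothetical name, not necessarily present in `src/parser.py`):

```python
def normalize_row(raw, columns, default=""):
    """Return a row dict containing exactly the configured CSV columns.

    Any field missing from the parsed record (e.g. a newly added selector
    that matched nothing on an older page) is filled with a default value.
    Extra keys in `raw` are dropped.
    """
    return {col: raw.get(col, default) for col in columns}
```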
If your curriculum website has a different structure:

- Update the `selectors` section in `config/config.yaml`
- Modify the login process in `src/auth.py` if needed
- Adjust the parsing logic in `src/parser.py` for your specific HTML structure
To run the script automatically:

- Open Task Scheduler
- Create a Basic Task
- Set a trigger (daily, weekly, etc.)
- Action: Start a program
- Program: `python`
- Arguments: `main.py --incremental`
- Start in: `c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator`

Alternatively, register the task from PowerShell:

```powershell
$trigger = New-ScheduledTaskTrigger -Daily -At "09:00AM"
$action = New-ScheduledTaskAction -Execute "python" -Argument "main.py --incremental" -WorkingDirectory "c:\Users\RyandelaGarza\.vscode\AIA Builds\Errata-Locator"
Register-ScheduledTask -TaskName "ErrataLocator" -Trigger $trigger -Action $action
```
- Never commit the `.env` file to version control
- Use environment variables for sensitive credentials
- The script respects rate limiting to avoid overwhelming servers
- Consider using application-specific passwords if available
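The rate-limiting behavior mentioned above can be implemented with a simple throttle that enforces a minimum delay between requests. A stdlib sketch (the class name and interval are illustrative; the actual delay would come from configuration):

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to honor the minimum interval, then
        record the current time as the start of the next interval."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

The scraper would call `throttle.wait()` before each page fetch, so bursts of requests are spread out regardless of how fast individual pages parse.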
For issues or questions:

- Check the logs in `logs/errata_locator.log`
- Run with `--validate-setup` to check the configuration
- Test authentication separately with `--test-auth`
- Enable debug logging for more detailed information