AI-powered document extractor for names, emails, and organizations. License: MIT

PMTheTechGuy/document-entity-extractor

AI Data Extraction Tool

🚀 Upload documents → Extract Names, Emails, and Organizations → Download structured Excel results instantly.
Built with FastAPI, Pandas, and optional GPT-enhanced extraction.
Deployed live on Render.


✨ Features

  • ✅ Upload PDF, DOCX, and TXT documents
  • ✅ Extract Names, Emails, and Organizations
  • ✅ Multi-file uploads supported (combines results into one Excel)
  • ✅ Clean and organized Excel file download (.xlsx)
  • ✅ Supports both local entity extraction and GPT-enhanced extraction
  • ✅ Automatic fallback if the custom model is missing
  • ✅ Deployed online via Render
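
Of the three entity types, emails are the one a plain regular expression handles reasonably well. The sketch below is not taken from this repository: `extract_emails` and `combine_results` are hypothetical helper names, and the pattern is deliberately simple. It only illustrates how results from several uploaded files might be merged into one list of rows before export.

```python
import re

# Deliberately simple pattern (an assumption, not the project's actual regex);
# it will miss some valid addresses and accept a few invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in order of first appearance."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


def combine_results(documents: dict[str, str]) -> list[dict[str, str]]:
    """Build one row per (file, email) pair across all uploaded documents."""
    rows = []
    for filename, text in documents.items():
        for email in extract_emails(text):
            rows.append({"file": filename, "email": email})
    return rows


if __name__ == "__main__":
    docs = {
        "a.txt": "Contact jane@example.com or JANE@example.com.",
        "b.txt": "Sales: sales@example.org",
    }
    for row in combine_results(docs):
        print(row)
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame(rows)` when an Excel export is needed.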

📸 Screenshots

Upload Page

[Screenshot: Upload]

Extraction Results Page

[Screenshot: Results]


🚀 Live Demo

🟢 Visit the Live App Here


⚙️ Technologies Used

  • Python 3.11
  • FastAPI
  • Uvicorn
  • Pandas
  • spaCy
  • OpenAI API (optional GPT-enhancement)
  • openpyxl (for Excel export)

🛠 Local Development Setup

Clone the repository:

git clone https://github.com/PMTheTechGuy/document-entity-extractor.git
cd document-entity-extractor

Install dependencies:

pip install -r requirements.txt

Set up your environment variables:

Create a .env file based on .env.example.

cp .env.example .env

Start the server locally:

uvicorn api.main:app --reload

If the application fails to load at http://localhost:8000, stop the server with Ctrl + C and restart it on port 8001:

uvicorn api.main:app --reload --port 8001
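
One common cause of a failed load is that another process already holds port 8000. The following is a minimal sketch for checking this before choosing a port; `port_in_use` is a hypothetical helper and is not part of this project.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (8000, 8001):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"port {candidate}: {state}")
```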

🧠 OpenAI Key Setup (Optional for GPT Extraction)

This app supports two extraction modes:

  • 🧠 GPT-enhanced extraction (more accurate, slower, uses OpenAI API)

  • ⚡ Local spaCy model extraction (faster, free, no external API calls)

The app falls back to spaCy if no OpenAI key is provided or if USE_GPT_EXTRACTION is set to False.

Setting Up OpenAI GPT Extraction (Optional)

1. In your .env file, add your OpenAI API Key:

OPENAI_API_KEY=your-real-openai-api-key-here

2. Save the .env file.

3. Restart the FastAPI server:

uvicorn api.main:app --reload
  • ✅ If a key is provided, the app will automatically use GPT for extractions.
  • ✅ If no key is provided or an API error occurs, the app will fall back to using spaCy.
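
The fallback behaviour described above can be sketched as a small wrapper. This is an illustration under assumptions, not the repository's code: `extract_with_fallback` and the two extractor callables are hypothetical names, and the real project may catch narrower exception types.

```python
from typing import Callable, Optional

Entities = dict[str, list[str]]


def extract_with_fallback(
    text: str,
    gpt_extract: Callable[[str], Entities],
    local_extract: Callable[[str], Entities],
    api_key: Optional[str],
) -> tuple[Entities, str]:
    """Try the GPT extractor when a key exists; otherwise, or on error, use the local one.

    Returns the entities and the name of the mode that produced them.
    """
    if api_key:
        try:
            return gpt_extract(text), "gpt"
        except Exception:
            # Any API failure (network, quota, bad key) falls through to spaCy.
            pass
    return local_extract(text), "spacy"


if __name__ == "__main__":
    def failing_gpt(text: str) -> Entities:
        raise RuntimeError("simulated API error")

    def local(text: str) -> Entities:
        return {"names": [], "emails": [], "organizations": []}

    # The simulated API error triggers the fallback, so the mode is "spacy".
    print(extract_with_fallback("hello", failing_gpt, local, api_key="sk-test")[1])
```

Passing the extractors in as callables keeps the sketch runnable without spaCy or an OpenAI key installed.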

⚙️ Controlling GPT Extraction Mode

In your .env file, you can control whether the app uses GPT or local spaCy extraction:

USE_GPT_EXTRACTION=True
  • ✅ True → Use GPT extraction (requires valid OpenAI API key)

  • ✅ False → Force local spaCy extraction, even if API key is present

Restart the server after changing the .env settings.

uvicorn api.main:app --reload

The app reads this setting automatically when it starts.
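
How the two settings might combine can be sketched as follows. The function name `use_gpt`, the accepted truthy spellings, and the default of True when the flag is absent are all assumptions made for illustration; the project's own parsing may differ.

```python
import os
from typing import Mapping


def use_gpt(env: Mapping[str, str] = os.environ) -> bool:
    """GPT mode needs both an API key and a truthy USE_GPT_EXTRACTION flag."""
    # Assumption: a missing flag counts as True, so a key alone enables GPT.
    flag = env.get("USE_GPT_EXTRACTION", "True").strip().lower()
    has_key = bool(env.get("OPENAI_API_KEY", "").strip())
    return has_key and flag in {"1", "true", "yes"}


if __name__ == "__main__":
    print("GPT extraction" if use_gpt() else "local spaCy extraction")
```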


🌍 Deployment

This app is deployed on Render.

You can deploy your version in one click.


📦 Folder Structure

api/             # FastAPI backend
├── templates/   # HTML templates (upload form, results page)
├── static/      # Static files
utils/           # Helper modules (export, logging, etc.)
extractor/       # File reading and entity extraction
gpt_integration/ # GPT-enhanced extraction
output/          # Exported Excel files
logs/            # Application logs

📦 Features

  • Multi-file Upload: Upload one or more .pdf, .docx, or .txt files for processing.
  • Entity Extraction: Automatically identifies and extracts:
    • People (names)
    • Emails
    • Organizations
  • Results Summary: Displays a summary of total files processed, and the number of names, emails, and organizations found.
  • CSV & Excel Export: Download extracted data in .csv or .xlsx format.
  • Auto Cleanup: Temporary files older than one hour are deleted automatically.
  • Error Handling: In-app messages for invalid uploads, unsupported file types, and extraction failures.
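
The one-hour cleanup rule can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `remove_stale_files` is a hypothetical name, and which folder holds the temporary files is left to the caller.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(
    folder: Path, max_age_seconds: float = 3600.0, now: Optional[float] = None
) -> List[Path]:
    """Delete regular files in `folder` last modified more than max_age_seconds ago."""
    cutoff = (time.time() if now is None else now) - max_age_seconds
    removed = []
    for path in folder.iterdir():
        # Subdirectories are skipped; only plain files are considered.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A function like this could be called at the start of each upload request or from a scheduled background task.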

🚧 Coming Soon

  • Daily upload limits per user or IP (via database tracking)
  • Admin dashboard to review processed data
  • File size limit configuration in .env

🙌 Acknowledgements


📫 Contact

Crafted with dedication by PM The Tech Guy.

Please don't hesitate to reach out or share your ideas!


📝 License

This project is licensed under the MIT License.