🚀 Upload documents → Extract Names, Emails, and Organizations → Download structured Excel results instantly.
Built with FastAPI, Pandas, and optional GPT-enhanced extraction.
Deployed live on Render.
- ✅ Upload PDF, DOCX, and TXT documents
- ✅ Extract Names, Emails, and Organizations
- ✅ Multi-file uploads supported (combines results into one Excel)
- ✅ Clean and organized Excel file download (.xlsx)
- ✅ Supports both local entity extraction and GPT-enhanced extraction
- ✅ Automatic fallback if the custom model is missing
- ✅ Deployed online via Render
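As a rough illustration of how results from several uploads might be combined into one sheet, here is a minimal sketch. It covers emails only, and the regex, the function names (`extract_emails`, `combine_results`, `export_to_excel`), and the column names are all assumptions for illustration, not the project's actual code.

```python
import re
from typing import Dict, List

# Deliberately simple email pattern (an assumption; real-world addresses
# can be more varied than this matches).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> List[str]:
    """Return unique email addresses in order of first appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


def combine_results(texts_by_file: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one flat list of rows covering every uploaded file."""
    rows = []
    for filename, text in texts_by_file.items():
        for email in extract_emails(text):
            rows.append({"file": filename, "email": email})
    return rows


def export_to_excel(rows: List[Dict[str, str]], path: str) -> None:
    """Write the combined rows to a single .xlsx file."""
    # Imported here so the helpers above work without pandas installed.
    # Writing .xlsx through pandas also requires openpyxl.
    import pandas as pd

    pd.DataFrame(rows, columns=["file", "email"]).to_excel(path, index=False)
```

Keeping row-building separate from the Excel export makes the combining step easy to check without touching the filesystem.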
- Python 3.11
- FastAPI
- Uvicorn
- Pandas
- spaCy
- OpenAI API (optional GPT-enhancement)
- openpyxl (for Excel export)
Clone the repository:
git clone https://github.com/PMTheTechGuy/document-entity-extractor.git
cd document-entity-extractor
Install dependencies:
pip install -r requirements.txt
Set up your environment variables:
Create a .env file based on .env.example.
cp .env.example .env
Start the server locally:
uvicorn api.main:app --reload
If the application fails to load at http://localhost:8000, stop the server with Ctrl + C and restart it on port 8001:
uvicorn api.main:app --reload --port 8001
This app supports two extraction modes:
- 🧠 GPT-enhanced extraction (more accurate, slower, uses OpenAI API)
- ⚡ Local spaCy model extraction (faster, free, no external API calls)
The app falls back to spaCy if no OpenAI key is provided or if USE_GPT_EXTRACTION is set to False.
1. In your .env file, add your OpenAI API key:
OPENAI_API_KEY=your-real-openai-api-key-here
2. Save the .env file.
3. Restart the FastAPI server:
uvicorn api.main:app --reload
- ✅ If a key is provided and USE_GPT_EXTRACTION is not set to False, the app automatically uses GPT for extraction.
- ✅ If no key is provided or an API error occurs, the app falls back to spaCy.
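The fallback behaviour described above could be sketched roughly as follows. The function `extract_with_fallback` and the two extractor callables are hypothetical stand-ins. The project's real extractor functions and their return types are not shown in this README.

```python
import logging
from typing import Callable, Dict, List

Entities = Dict[str, List[str]]
Extractor = Callable[[str], Entities]

logger = logging.getLogger(__name__)


def extract_with_fallback(
    text: str,
    gpt_extractor: Extractor,
    local_extractor: Extractor,
    gpt_enabled: bool,
) -> Entities:
    """Try the GPT extractor first; use the local one if it is disabled or fails."""
    if gpt_enabled:
        try:
            return gpt_extractor(text)
        except Exception as exc:  # e.g. a network or API error
            logger.warning("GPT extraction failed, falling back to local: %s", exc)
    return local_extractor(text)
```

Passing the extractors in as arguments keeps the fallback logic independent of any particular OpenAI or spaCy call.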
In your .env file, you can control whether the app uses GPT or local spaCy extraction:
USE_GPT_EXTRACTION=True
- ✅ True → Use GPT extraction (requires valid OpenAI API key)
- ✅ False → Force local spaCy extraction, even if API key is present
Restart the server after changing the .env settings:
uvicorn api.main:app --reload
The app will detect this automatically at runtime.
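One way the two settings could interact is sketched below. The helper name `use_gpt`, the set of accepted truthy strings, and the default of True when USE_GPT_EXTRACTION is unset are assumptions. Only the two variable names come from this README.

```python
import os
from typing import Mapping, Optional

# Strings treated as "on" (an assumption; the app may parse this differently).
TRUTHY = {"1", "true", "yes", "on"}


def use_gpt(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if an API key is present and GPT extraction is enabled."""
    env = os.environ if env is None else env
    has_key = bool(env.get("OPENAI_API_KEY", "").strip())
    # Assumed default: GPT is enabled unless explicitly switched off.
    flag = env.get("USE_GPT_EXTRACTION", "True").strip().lower()
    return has_key and flag in TRUTHY
```

For example, `use_gpt({"OPENAI_API_KEY": "sk-test", "USE_GPT_EXTRACTION": "False"})` returns `False`, matching the "force local spaCy" case.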
This app is deployed on Render.
You can deploy your version in one click.
api/ # FastAPI backend
├── templates/ # HTML templates (upload form, results page)
├── static/ # Static files
utils/ # Helper modules (export, logging, etc.)
extractor/ # File reading and entity extraction
gpt_integration/ # GPT-enhanced extraction
output/ # Exported Excel files
logs/ # Application logs
- Multi-file Upload: Upload one or more .pdf, .docx, or .txt files for processing.
- Entity Extraction: Automatically identifies and extracts:
  - People (names)
  - Emails
  - Organizations
- Results Summary: Displays a summary of total files processed, and the number of names, emails, and organizations found.
- CSV & Excel Export: Download extracted data in .csv or .xlsx format.
- Auto Cleanup: Temporary files older than one hour are deleted automatically.
- Error Handling: User interface for handling invalid uploads, unsupported file types, and extraction failures.
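The Auto Cleanup behaviour listed above could look something like the following sketch. The function name `remove_stale_files` and the use of file modification time as the age measure are assumptions. How and when the app actually triggers cleanup is not described here.

```python
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 60 * 60  # one hour


def remove_stale_files(
    directory: Path,
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete regular files in `directory` last modified more than `max_age` seconds ago."""
    now = time.time() if now is None else now
    removed = []
    for path in directory.iterdir():
        # Skip subdirectories; only plain files are considered temporary output.
        if path.is_file() and now - path.stat().st_mtime > max_age:
            path.unlink()
            removed.append(path)
    return removed
```

A call such as `remove_stale_files(Path("output"))` could be run on a schedule or before each upload. The `now` parameter exists mainly to make the function easy to test.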
- Daily upload limits per user or IP (via database tracking)
- Admin dashboard to review processed data
- File size limit configuration in .env
Crafted with dedication by PM The Tech Guy.
Please don't hesitate to reach out or share your ideas!
This project is licensed under the MIT License.