Generate Database Schemas from Codebases with AI

Ever stared at a new codebase wondering what data it stores? This project builds an AI agent that analyzes GitHub repositories and produces a complete database schema describing every table and column.

This project showcases Pocket Flow, a 100-line LLM framework. The agent crawls a GitHub repository and analyzes the code to identify every database table it uses. It then generates a complete schema describing each table, every column, and the meaning of any constrained values such as enums.
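
To make the target output concrete, here is a minimal sketch of how such a schema could be modeled in Python. The class and field names (`Column`, `Table`, `allowed_values`) and the `orders` example are illustrative assumptions, not the project's actual output format.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Column:
    name: str
    type: str
    description: str
    # Maps each constrained value (e.g. an enum member) to its meaning.
    allowed_values: dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    description: str
    columns: list[Column] = field(default_factory=list)


# Hypothetical table, invented purely for illustration.
orders = Table(
    name="orders",
    description="One row per customer order.",
    columns=[
        Column("id", "integer", "Primary key."),
        Column(
            "status",
            "string",
            "Current state of the order.",
            allowed_values={
                "pending": "Awaiting payment.",
                "shipped": "Handed to the carrier.",
            },
        ),
    ],
)

# asdict() converts nested dataclasses into plain dicts and lists,
# which any YAML or JSON serializer can then write out.
schema = asdict(orders)
```

Keeping the value meanings next to the column they constrain is one way to make the result readable without opening the code.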

Check out the YouTube Development Tutorial and the Substack Post Tutorial for more!

🔸 🎉 Reached Hacker News Front Page (April 2025) with >900 up‑votes: Discussion »

🚀 Getting Started

Clone this repository:

git clone https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge

Install dependencies:
```
pip install -r requirements.txt
```
Set up the LLM in utils/call_llm.py by providing credentials. By default, you can use an AI Studio key with this client for Gemini 2.5 Pro:
```
import os
from google import genai

client = genai.Client(
  api_key=os.getenv("GEMINI_API_KEY", "your-api_key"),  # falls back to a placeholder if unset
)
```
You can also use your own models. We highly recommend the latest models with thinking capabilities (e.g., Claude 3.7 with thinking, o1). Verify that the setup works by running:
```
python utils/call_llm.py
```
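
If you swap in your own model, it can help to keep the provider-specific call behind a single function. The following is a minimal sketch under stated assumptions, not the contents of utils/call_llm.py: the `backend` parameter, the `use_cache` flag, and the `llm_cache.json` file name are hypothetical and only illustrate how a pluggable model and optional response caching might fit together.

```python
import json
import os
from typing import Callable

CACHE_FILE = "llm_cache.json"  # hypothetical cache location


def _load_cache(path: str) -> dict:
    """Return the cached prompt -> response mapping, or {} if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def call_llm(
    prompt: str,
    backend: Callable[[str], str],
    use_cache: bool = True,
    cache_path: str = CACHE_FILE,
) -> str:
    """Send `prompt` to `backend`, reusing a cached answer when allowed."""
    cache = _load_cache(cache_path) if use_cache else {}
    if prompt in cache:
        return cache[prompt]

    response = backend(prompt)  # the only provider-specific step

    if use_cache:
        cache[prompt] = response
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return response


if __name__ == "__main__":
    # Stand-in backend so the sketch runs without credentials.
    def echo(p: str) -> str:
        return f"echo: {p}"

    print(call_llm("Hello, how are you?", echo, use_cache=False))
```

With this shape, switching providers means passing a different `backend` callable, and the rest of the pipeline stays unchanged.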
Generate a complete database schema by running the main script:
```
# Analyze a GitHub repository
python main.py --repo https://github.com/username/repo --include "*.py" "*.js" --exclude "tests/*" --max-size 50000

# Or, analyze a local directory
python main.py --dir /path/to/your/codebase --include "*.py" --exclude "*test*"
```
- --repo or --dir - Specify either a GitHub repo URL or a local directory path (required, mutually exclusive)
- -n, --name - Project name (optional, derived from URL/directory if omitted)
- -t, --token - GitHub token (or set GITHUB_TOKEN environment variable)
- -o, --output - Output directory (default: ./output)
- -i, --include - Files to include (e.g., "*.py" "*.js")
- -e, --exclude - Files to exclude (e.g., "tests/*" "docs/*")
- -s, --max-size - Maximum file size in bytes (default: 100KB)
- --no-cache - Disable LLM response caching (default: caching enabled)
The application will crawl the repository, analyze the codebase, and write a YAML file describing every table and column to the output directory (default: ./output).
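
The options above could be wired up roughly as follows. This is a sketch under stated assumptions rather than the code in main.py: `build_parser` and `should_include` are hypothetical helper names, the 100KB default is assumed to mean 100,000 bytes, and glob matching is assumed to use Python's `fnmatch`.

```python
import argparse
import fnmatch


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags; --repo and --dir are mutually exclusive."""
    parser = argparse.ArgumentParser(
        description="Generate a database schema from a codebase."
    )
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--repo", help="GitHub repository URL")
    source.add_argument("--dir", help="Local directory path")
    parser.add_argument("-n", "--name", help="Project name")
    parser.add_argument("-t", "--token", help="GitHub token")
    parser.add_argument("-o", "--output", default="output", help="Output directory")
    parser.add_argument("-i", "--include", nargs="+", help="Glob patterns to include")
    parser.add_argument("-e", "--exclude", nargs="+", help="Glob patterns to exclude")
    parser.add_argument(
        "-s", "--max-size", type=int, default=100_000,  # assumed meaning of "100KB"
        help="Maximum file size in bytes",
    )
    parser.add_argument(
        "--no-cache", action="store_true", help="Disable LLM response caching"
    )
    return parser


def should_include(path, size, include, exclude, max_size):
    """Decide whether one file passes the size and pattern filters."""
    if size > max_size:
        return False
    if include and not any(fnmatch.fnmatch(path, p) for p in include):
        return False
    if exclude and any(fnmatch.fnmatch(path, p) for p in exclude):
        return False
    return True


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `fnmatch` lets `*` match across `/`, so a pattern like `*.py` matches files in any subdirectory in this sketch.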

🐳 Running with Docker

To run this project in a Docker container, you'll need to pass your API keys as environment variables.

Build the Docker image
```
docker build -t pocketflow-app .
```
Run the container

You'll need to provide your GEMINI_API_KEY for the LLM to function. If you're analyzing private GitHub repositories or want to avoid rate limits, also provide your GITHUB_TOKEN.

Mount a local directory to /app/output inside the container to access the generated schema on your host machine.

Example for analyzing a public GitHub repository:
```
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "$(pwd)/output_schema":/app/output \
  pocketflow-app --repo https://github.com/username/repo
```
Example for analyzing a local directory:
```
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "/path/to/your/local_codebase":/app/code_to_analyze \
  -v "$(pwd)/output_schema":/app/output \
  pocketflow-app --dir /app/code_to_analyze
```
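
Values passed with `-e` reach the process inside the container as ordinary environment variables. The sketch below shows one way a script might read them, treating GEMINI_API_KEY as required and GITHUB_TOKEN as optional, with a token given on the command line taking precedence. The function name `resolve_credentials` and its parameters are hypothetical and not taken from this project's code.

```python
import os


def resolve_credentials(cli_token=None, environ=None):
    """Return (api_key, github_token) from the CLI value and the environment.

    `environ` defaults to os.environ; passing a dict makes the function
    easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ

    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; pass it with `docker run -e`."
        )

    # A token given on the command line wins over the environment variable.
    github_token = cli_token or env.get("GITHUB_TOKEN")
    return api_key, github_token
```

Failing early when the key is missing avoids a less obvious error later, at the first LLM call.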

💡 Development Tutorial

I built this using Agentic Coding, the fastest development paradigm, where humans simply design and agents code.
The secret weapon is Pocket Flow, a 100-line LLM framework that lets agents (e.g., Cursor AI) build for you.
Check out the step-by-step YouTube development tutorial.

