Ever stared at a new codebase wondering what data it stores? This project builds an AI agent that analyzes GitHub repositories and produces a complete database schema describing every table and column.
This project showcases Pocket Flow, a 100-line LLM framework. The agent crawls a GitHub repository, analyzes the code to identify every database table it uses, and then generates a schema describing each table, every column, and the meaning of any constrained values such as enums.
-
Check out the YouTube Development Tutorial for more!
-
Check out the Substack Post Tutorial for more!
🎉 Reached the Hacker News front page (April 2025) with >900 upvotes.
-
Clone this repository:
git clone https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge
-
Install dependencies:
pip install -r requirements.txt
-
Set up the LLM in utils/call_llm.py by providing credentials. By default, you can use the AI Studio key with this client for Gemini Pro 2.5:
client = genai.Client(
    api_key=os.getenv("GEMINI_API_KEY", "your-api_key"),
)
You can also use your own models. We highly recommend the latest models with thinking capabilities (Claude 3.7 with thinking, o1). To verify that the setup is correct, run:
python utils/call_llm.py
-
Generate a complete database schema by running the main script:
# Analyze a GitHub repository
python main.py --repo https://github.com/username/repo --include "*.py" "*.js" --exclude "tests/*" --max-size 50000

# Or, analyze a local directory
python main.py --dir /path/to/your/codebase --include "*.py" --exclude "*test*"
--repo or --dir - Specify either a GitHub repo URL or a local directory path (required, mutually exclusive)
-n, --name - Project name (optional, derived from URL/directory if omitted)
-t, --token - GitHub token (or set GITHUB_TOKEN environment variable)
-o, --output - Output directory (default: ./output)
-i, --include - Files to include (e.g., "*.py" "*.js")
-e, --exclude - Files to exclude (e.g., "tests/*" "docs/*")
-s, --max-size - Maximum file size in bytes (default: 100KB)
--no-cache - Disable LLM response caching (default: caching enabled)
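To make the interaction of --include, --exclude, and --max-size concrete, here is a minimal sketch of how such a file filter could work. It is not the project's actual crawler: the helper name `should_include`, the use of `fnmatch`, the rule that an exclude match wins over an include match, and reading "100KB" as 100,000 bytes are all assumptions made for illustration.

```python
from fnmatch import fnmatch

# Assumption: the documented "100KB" default is read here as 100,000 bytes.
DEFAULT_MAX_SIZE = 100_000


def should_include(path, size, include=None, exclude=None, max_size=DEFAULT_MAX_SIZE):
    """Hypothetical helper: decide whether a file should be analyzed.

    Assumed precedence: the size limit is checked first, then exclude
    patterns, then include patterns. With no include patterns, any file
    that was not excluded passes.
    """
    if size > max_size:
        return False
    if exclude and any(fnmatch(path, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatch(path, pattern) for pattern in include)
    return True


# Made-up file listing: relative path -> size in bytes.
files = {
    "app/models.py": 12_000,
    "tests/test_models.py": 3_000,
    "static/bundle.js": 250_000,
    "README.md": 1_000,
}

selected = [
    path
    for path, size in files.items()
    if should_include(
        path,
        size,
        include=["*.py", "*.js"],
        exclude=["tests/*"],
        max_size=50_000,
    )
]
print(selected)
```

In this sketch only app/models.py survives: the test file matches an exclude pattern, the JavaScript bundle exceeds the size limit, and README.md matches no include pattern.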
The application will crawl the repository, analyze the codebase, and write a YAML file describing every table and column in the specified directory (default: ./output).
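As a rough picture of what "every table and column" could look like as data, the sketch below models a schema with tables, columns, and the meaning of constrained values. The class names, field names, and the `orders` example table are hypothetical and are not taken from the tool's real output format. The tool itself writes YAML; this sketch prints JSON only to stay within the standard library.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Column:
    """Hypothetical column record; field names are illustrative only."""
    name: str
    type: str
    description: str = ""
    # For enum-like columns: maps each stored value to its meaning.
    allowed_values: dict = field(default_factory=dict)


@dataclass
class Table:
    """Hypothetical table record holding its columns."""
    name: str
    description: str = ""
    columns: list = field(default_factory=list)


# Made-up example table, not extracted from any real repository.
orders = Table(
    name="orders",
    description="Customer orders",
    columns=[
        Column("id", "integer", "Primary key"),
        Column(
            "status",
            "string",
            "Order lifecycle state",
            allowed_values={"P": "pending", "S": "shipped", "C": "cancelled"},
        ),
    ],
)

# asdict() converts nested dataclasses into plain dicts and lists.
schema = {"tables": [asdict(orders)]}
print(json.dumps(schema, indent=2))
```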
🐳 Running with Docker
To run this project in a Docker container, you'll need to pass your API keys as environment variables.
-
Build the Docker image
docker build -t pocketflow-app .
-
Run the container
You'll need to provide your GEMINI_API_KEY for the LLM to function. If you're analyzing private GitHub repositories or want to avoid rate limits, also provide your GITHUB_TOKEN.
Mount a local directory to /app/output inside the container to access the generated schema on your host machine.
Example for analyzing a public GitHub repository:
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "$(pwd)/output_schema":/app/output \
  pocketflow-app --repo https://github.com/username/repo
Example for analyzing a local directory:
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "/path/to/your/local_codebase":/app/code_to_analyze \
  -v "$(pwd)/output_schema":/app/output \
  pocketflow-app --dir /app/code_to_analyze
-
I built this using Agentic Coding, the fastest development paradigm, where humans simply design and agents code.
-
The secret weapon is Pocket Flow, a 100-line LLM framework that lets agents (e.g., Cursor AI) build for you.
-
Check out the Step-by-step YouTube development tutorial: