
Generate Database Schemas from Codebases with AI

License: MIT

Ever stared at a new codebase wondering what data it stores? This project builds an AI agent that crawls GitHub repositories, analyzes the code to identify every database table it uses, and generates a complete schema describing each table, every column, and the meaning of any constrained values such as enums.

The project is built on Pocket Flow, a 100-line LLM framework.
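To make the goal concrete, here is a hypothetical sketch of the kind of schema the agent could emit. The field names and layout below are illustrative assumptions, not the tool's documented output format:

```python
# Hypothetical example of the kind of schema the agent produces.
# Field names here are illustrative assumptions, not the tool's exact format.
schema = {
    "users": {
        "description": "Registered accounts",
        "columns": {
            "id": {"type": "integer", "meaning": "primary key"},
            "email": {"type": "text", "meaning": "login address, unique"},
            "status": {
                "type": "text",
                "meaning": "account state",
                # Constrained values: the agent explains each enum member.
                "enum": {"active": "can log in", "banned": "blocked by admin"},
            },
        },
    }
}

# Render a human-readable summary of the schema.
for table, info in schema.items():
    print(f"table: {table} ({info['description']})")
    for col, meta in info["columns"].items():
        print(f"  {col}: {meta['type']} - {meta['meaning']}")
```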

  🔸 🎉 Reached Hacker News Front Page (April 2025) with >900 up‑votes: Discussion »

🚀 Getting Started

  1. Clone this repository

    git clone https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up the LLM in utils/call_llm.py by providing credentials. By default, you can use an AI Studio key with this client for Gemini 2.5 Pro:

    client = genai.Client(
      api_key=os.getenv("GEMINI_API_KEY", "your-api_key"),
    )

    You can use your own models. We highly recommend the latest models with thinking capabilities (e.g., Claude 3.7 Sonnet with thinking, OpenAI o1). You can verify that everything is correctly set up by running:

    python utils/call_llm.py
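    The project also caches LLM responses (see the --no-cache flag below). A minimal sketch of that idea, with the model call injected as a plain callable so the caching logic is visible on its own — the real utils/call_llm.py wraps the Gemini client and its signature may differ:

```python
import hashlib

# In-memory cache keyed by a hash of the prompt. The real project may
# persist this to disk instead; this is only an illustrative sketch.
_cache: dict[str, str] = {}

def call_llm(prompt: str, model_fn, use_cache: bool = True) -> str:
    """Call the model, reusing a cached response for identical prompts.

    model_fn is any callable mapping a prompt string to a response string;
    in the real project this role is played by the Gemini client.
    """
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if use_cache and key in _cache:
        return _cache[key]
    response = model_fn(prompt)
    if use_cache:
        _cache[key] = response
    return response

# Demo with a stub model: the second identical call is served from cache.
calls = []
def fake_model(p):
    calls.append(p)
    return f"echo: {p}"

print(call_llm("hello", fake_model))  # echo: hello
print(call_llm("hello", fake_model))  # echo: hello (cache hit)
print(len(calls))  # 1
```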
  4. Generate a complete database schema by running the main script:

    # Analyze a GitHub repository
    python main.py --repo https://github.com/username/repo --include "*.py" "*.js" --exclude "tests/*" --max-size 50000
    
    # Or, analyze a local directory
    python main.py --dir /path/to/your/codebase --include "*.py" --exclude "*test*"
    • --repo or --dir - Specify either a GitHub repo URL or a local directory path (required, mutually exclusive)
    • -n, --name - Project name (optional, derived from URL/directory if omitted)
    • -t, --token - GitHub token (or set GITHUB_TOKEN environment variable)
    • -o, --output - Output directory (default: ./output)
    • -i, --include - Files to include (e.g., "*.py" "*.js")
    • -e, --exclude - Files to exclude (e.g., "tests/*" "docs/*")
    • -s, --max-size - Maximum file size in bytes (default: 100KB)
    • --no-cache - Disable LLM response caching (default: caching enabled)

    The application will crawl the repository, analyze the codebase, and write a YAML file describing every table and column to the specified output directory (default: ./output).
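The --include and --exclude options take shell-style glob patterns. A rough sketch of how such filtering can work, using Python's fnmatch — illustrative only, since the project's actual matching rules may differ:

```python
from fnmatch import fnmatch

def want_file(path: str, include: list[str], exclude: list[str]) -> bool:
    """Keep a file if it matches any include pattern and no exclude pattern."""
    if include and not any(fnmatch(path, pat) for pat in include):
        return False
    return not any(fnmatch(path, pat) for pat in exclude)

files = ["app/models.py", "static/app.js", "tests/test_models.py"]
kept = [f for f in files if want_file(f, ["*.py", "*.js"], ["tests/*"])]
print(kept)  # ['app/models.py', 'static/app.js']
```

Note that fnmatch's `*` also matches path separators, so `"*.py"` matches files in any subdirectory.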

🐳 Running with Docker

To run this project in a Docker container, you'll need to pass your API keys as environment variables.

  1. Build the Docker image

    docker build -t pocketflow-app .
  2. Run the container

    You'll need to provide your GEMINI_API_KEY for the LLM to function. If you're analyzing private GitHub repositories or want to avoid rate limits, also provide your GITHUB_TOKEN.

    Mount a local directory to /app/output inside the container to access the generated schema on your host machine.

    Example for analyzing a public GitHub repository:

    docker run -it --rm \
      -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
      -v "$(pwd)/output_schema":/app/output \
      pocketflow-app --repo https://github.com/username/repo

    Example for analyzing a local directory:

    docker run -it --rm \
      -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
      -v "/path/to/your/local_codebase":/app/code_to_analyze \
      -v "$(pwd)/output_schema":/app/output \
      pocketflow-app --dir /app/code_to_analyze

💡 Development Tutorial

  • I built this project using Agentic Coding, the fastest development paradigm, where humans simply design and agents code.

  • The secret weapon is Pocket Flow, a 100-line LLM framework that lets agents (e.g., Cursor AI) build for you.

  • Check out the step-by-step YouTube development tutorial.


