diff --git a/README.md b/README.md index ecb4875..29f0a72 100644 --- a/README.md +++ b/README.md @@ -33,12 +33,24 @@ source /bin/activate /bin/pip install langchain-google-cloud-sql-mysql ``` -## Usage +## Document Loader Usage + +Use a [document loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/) to load data as LangChain `Document`s. ```python -from langchain_google_cloud_sql_mysql import CloudSQLVectorstore, CloudSQLLoader, CloudSQLChatMessageHistory +from langchain_google_cloud_sql_mysql import MySQLEngine, MySQLLoader + + +engine = MySQLEngine.from_instance("project-id", "region", "my-instance", "my-database") +loader = MySQLLoader( + engine, + table_name="my-table-name" +) +docs = loader.lazy_load() ``` +See the full [Document Loader][loader] tutorial. + ## Contributing Contributions to this library are always welcome and highly encouraged. @@ -61,4 +73,5 @@ This is not an officially supported Google product. [billing]: https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project [api]: https://console.cloud.google.com/flows/enableapi?apiid=sqladmin.googleapis.com [auth]: https://googleapis.dev/python/google-api-core/latest/auth.html -[venv]: https://virtualenv.pypa.io/en/latest/ \ No newline at end of file +[venv]: https://virtualenv.pypa.io/en/latest/ +[loader]: ./docs/document_loader.ipynb \ No newline at end of file diff --git a/docs/document_loader.ipynb b/docs/document_loader.ipynb index 5f48d1d..a37fd5f 100644 --- a/docs/document_loader.ipynb +++ b/docs/document_loader.ipynb @@ -4,18 +4,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Google DATABASE\n", + "# Google Cloud SQL for MySQL\n", "\n", - "[Google DATABASE](https://cloud.google.com/DATABASE).\n", + "> [Cloud SQL](https://cloud.google.com/sql) is a fully managed relational database service that offers high performance, seamless integration, and impressive scalability. 
It offers [MySQL](https://cloud.google.com/sql/mysql), [PostgreSQL](https://cloud.google.com/sql/postgres), and [SQL Server](https://cloud.google.com/sql/sqlserver) database engines. Extend your database application to build AI-powered experiences leveraging Cloud SQL's LangChain integrations.\n", "\n", - "Load documents from `DATABASE`." + "This notebook goes over how to use [Cloud SQL for MySQL](https://cloud.google.com/sql/mysql) to [save, load and delete langchain documents](https://python.langchain.com/docs/modules/data_connection/document_loaders/) with `MySQLLoader` and `MySQLDocumentSaver`.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/langchain-google-cloud-sql-mysql-python/blob/main/docs/document_loader.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Pre-reqs" + "## Before You Begin\n", + "\n", + "To run this notebook, you will need to do the following:\n", + "* [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)\n", + "* [Create a Cloud SQL for MySQL instance](https://cloud.google.com/sql/docs/mysql/create-instance)\n", + "* [Create a Cloud SQL database](https://cloud.google.com/sql/docs/mysql/create-manage-databases)\n", + "* [Add an IAM database user to the database](https://cloud.google.com/sql/docs/mysql/add-manage-iam-users#creating-a-database-user) (Optional)\n", + "\n", + "After confirming access to the database in the runtime environment of this notebook, fill in the following values and run the cell before running the example scripts." 
] }, { @@ -26,18 +36,122 @@ }, "outputs": [], "source": [ - "%pip install PACKAGE_NAME" + "# @markdown Please fill in both the Google Cloud region and the name of your Cloud SQL instance.\n", + "REGION = \"us-central1\" # @param {type:\"string\"}\n", + "INSTANCE = \"test-instance\" # @param {type:\"string\"}\n", + "\n", + "# @markdown Please specify a database and a table for demo purposes.\n", + "DATABASE = \"test\" # @param {type:\"string\"}\n", + "TABLE_NAME = \"test-default\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🦜🔗 Library Installation\n", + "\n", + "The integration lives in its own `langchain-google-cloud-sql-mysql` package, so we need to install it." ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ - "from PACKAGE import LOADER" + "%pip install --upgrade --quiet langchain-google-cloud-sql-mysql" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Colab only**: Uncomment the following cell to restart the kernel or use the button to restart the kernel. For Vertex AI Workbench you can restart the terminal using the button on top." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# # Automatically restart kernel after installs so that your environment can access the new packages\n", + "# import IPython\n", + "\n", + "# app = IPython.Application.instance()\n", + "# app.kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### ☁ Set Your Google Cloud Project\n", + "Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.\n", + "\n", + "If you don't know your project ID, try the following:\n", + "\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.\n", + "\n", + "PROJECT_ID = \"my-project-id\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "!gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🔐 Authentication\n", + "\n", + "Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.\n", + "\n", + "- If you are using Colab to run this notebook, use the cell below and continue.\n", + "- If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.colab import auth\n", + "\n", + "auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### API Enablement\n", + "The `langchain-google-cloud-sql-mysql` package requires that you [enable the Cloud SQL Admin API](https://console.cloud.google.com/flows/enableapi?apiid=sqladmin.googleapis.com) in your Google Cloud Project." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# enable Cloud SQL Admin API\n", + "!gcloud services enable sqladmin.googleapis.com" ] }, { @@ -51,25 +165,52 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Load from table" + "### MySQLEngine Connection Pool\n", + "\n", + "Before saving or loading documents from a MySQL table, we first need to configure a connection pool to the Cloud SQL database. The `MySQLEngine` configures a connection pool to your Cloud SQL database, enabling successful connections from your application and following industry best practices.\n", + "\n", + "To create a `MySQLEngine` using `MySQLEngine.from_instance()` you need to provide only 4 things:\n", + "\n", + "1. `project_id` : Project ID of the Google Cloud Project where the Cloud SQL instance is located.\n", + "2. `region` : Region where the Cloud SQL instance is located.\n", + "3. `instance` : The name of the Cloud SQL instance.\n", + "4. `database` : The name of the database to connect to on the Cloud SQL instance.\n", + "\n", + "By default, [IAM database authentication](https://cloud.google.com/sql/docs/mysql/iam-authentication#iam-db-auth) will be used as the method of database authentication. 
This library uses the IAM principal belonging to the [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/application-default-credentials) sourced from the environment.\n", + "\n", + "For more information on IAM database authentication, please see:\n", + "* [Configure an instance for IAM database authentication](https://cloud.google.com/sql/docs/mysql/create-edit-iam-instances)\n", + "* [Manage users with IAM database authentication](https://cloud.google.com/sql/docs/mysql/add-manage-iam-users)\n", + "\n", + "Optionally, [built-in database authentication](https://cloud.google.com/sql/docs/mysql/built-in-authentication) using a username and password to access the Cloud SQL database can also be used. Just provide the optional `user` and `password` arguments to `MySQLEngine.from_instance()`:\n", + "* `user` : Database user to use for built-in database authentication and login\n", + "* `password` : Database password to use for built-in database authentication and login." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "loader = LOADER()\n", + "from langchain_google_cloud_sql_mysql import MySQLEngine\n", "\n", - "data = loader.load()" + "engine = MySQLEngine.from_instance(\n", + " project_id=PROJECT_ID, region=REGION, instance=INSTANCE, database=DATABASE\n", + ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Load from query" + "### Initialize a table\n", + "\n", + "Initialize a table with the default schema via `MySQLEngine.init_document_table()`. Table Columns:\n", + "- page_content (type: text)\n", + "- langchain_metadata (type: JSON)\n", + "\n", + "The `overwrite_existing=True` flag means the newly initialized table will replace any existing table of the same name." 
] }, { @@ -78,41 +219,91 @@ "metadata": {}, "outputs": [], "source": [ - "loader = LOADER()\n", + "engine.init_document_table(TABLE_NAME, overwrite_existing=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Save documents\n", "\n", - "data = loader.load()" + "Save langchain documents with `MySQLDocumentSaver.add_documents()`. To initialize the `MySQLDocumentSaver` class you need to provide 2 things:\n", + "1. `engine` - An instance of a `MySQLEngine` engine.\n", + "2. `table_name` - The name of the table within the Cloud SQL database to store langchain documents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from langchain_google_cloud_sql_mysql import MySQLDocumentSaver\n", + "from langchain_core.documents import Document\n", + "\n", + "test_docs = [\n", + " Document(\n", + " page_content=\"Apple Granny Smith 150 0.99 1\",\n", + " metadata={\"fruit_id\": 1},\n", + " ),\n", + " Document(\n", + " page_content=\"Banana Cavendish 200 0.59 0\",\n", + " metadata={\"fruit_id\": 2},\n", + " ),\n", + " Document(\n", + " page_content=\"Orange Navel 80 1.29 1\",\n", + " metadata={\"fruit_id\": 3},\n", + " ),\n", + "]\n", + "saver = MySQLDocumentSaver(engine=engine, table_name=TABLE_NAME)\n", + "saver.add_documents(test_docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Customize Document Page Content & Metadata" + "Load langchain documents with `MySQLLoader.load()` or `MySQLLoader.lazy_load()`. `lazy_load` returns a generator that only queries the database during iteration. To initialize the `MySQLLoader` class you need to provide:\n", + "1. `engine` - An instance of a `MySQLEngine` engine.\n", + "2. `table_name` - The name of the table within the Cloud SQL database that stores the langchain documents." 
] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "loader = LOADER()\n", + "from langchain_google_cloud_sql_mysql import MySQLLoader\n", "\n", - "data = loader.load()" + "loader = MySQLLoader(engine=engine, table_name=TABLE_NAME)\n", + "docs = loader.lazy_load()\n", + "for doc in docs:\n", + " print(\"Loaded document:\", doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Customize Page Content Format" + "### Load documents via query" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Save Documents to table" + "Other than loading documents from a table, we can also choose to load documents from a view generated from a SQL query. For example:" ] }, { @@ -121,15 +312,41 @@ "metadata": {}, "outputs": [], "source": [ - "saver = SAVER()\n", - "saver.add_documents(docs)" + "from langchain_google_cloud_sql_mysql import MySQLLoader\n", + "\n", + "loader = MySQLLoader(\n", + " engine=engine,\n", + " query=f\"select * from `{TABLE_NAME}` where JSON_EXTRACT(langchain_metadata, '$.fruit_id') = 1;\",\n", + ")\n", + "onedoc = loader.load()\n", + "onedoc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Customize Connection & Authentication" + "The view generated from a SQL query can have a different schema than the default table. In such cases, the behavior of `MySQLLoader` is the same as when loading from a table with a non-default schema. Please refer to the section [Load documents with customized document page content & metadata](#Load-documents-with-customized-document-page-content-&-metadata)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete documents" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Delete a list of langchain documents from a MySQL table with `MySQLDocumentSaver.delete()`.\n", + "\n", + "For a table with the default schema (page_content, langchain_metadata), the deletion criterion is:\n", + "\n", + "A `row` should be deleted if there exists a `document` in the list, such that\n", + "- `document.page_content` equals `row[page_content]`\n", + "- `document.metadata` equals `row[langchain_metadata]`" ] }, { @@ -138,14 +355,255 @@ "metadata": {}, "outputs": [], "source": [ - "from google.cloud.DATABASE import Client\n", + "from langchain_google_cloud_sql_mysql import MySQLLoader\n", + "\n", + "loader = MySQLLoader(engine=engine, table_name=TABLE_NAME)\n", + "docs = loader.load()\n", + "print(\"Documents before delete:\", docs)\n", + "saver.delete(onedoc)\n", + "print(\"Documents after delete:\", loader.load())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced Usage" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load documents with customized document page content & metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we prepare an example table with a non-default schema, and populate it with some arbitrary data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sqlalchemy\n", + "\n", + "with engine.connect() as conn:\n", + " conn.execute(sqlalchemy.text(f\"DROP TABLE IF EXISTS `{TABLE_NAME}`\"))\n", + " conn.commit()\n", + " conn.execute(\n", + " sqlalchemy.text(\n", + " f\"\"\"\n", + " CREATE TABLE IF NOT EXISTS `{TABLE_NAME}`(\n", + " fruit_id INT AUTO_INCREMENT PRIMARY KEY,\n", + " fruit_name VARCHAR(100) NOT NULL,\n", + " variety VARCHAR(50), \n", + " quantity_in_stock INT NOT NULL,\n", + " price_per_unit DECIMAL(6,2) NOT NULL,\n", + " organic TINYINT(1) NOT NULL\n", + " )\n", + " \"\"\"\n", + " )\n", + " )\n", + " conn.execute(\n", + " sqlalchemy.text(\n", + " f\"\"\"\n", + " INSERT INTO `{TABLE_NAME}` (fruit_name, variety, quantity_in_stock, price_per_unit, organic)\n", + " VALUES\n", + " ('Apple', 'Granny Smith', 150, 0.99, 1),\n", + " ('Banana', 'Cavendish', 200, 0.59, 0),\n", + " ('Orange', 'Navel', 80, 1.29, 1);\n", + " \"\"\"\n", + " )\n", + " )\n", + " conn.commit()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we load langchain documents from this example table with the default parameters of `MySQLLoader`, the `page_content` of each loaded document will be the first column of the table, and `metadata` will consist of key-value pairs of all the other columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = MySQLLoader(\n", + " engine=engine,\n", + " table_name=TABLE_NAME,\n", + ")\n", + "loader.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can specify the content and metadata we want to load by setting the `content_columns` and `metadata_columns` when initializing the `MySQLLoader`.\n", + "1. `content_columns`: The columns to write into the `page_content` of the document.\n", + "2. 
`metadata_columns`: The columns to write into the `metadata` of the document.\n", + "\n", + "In this example, the values of the columns in `content_columns` are joined into a space-separated string that becomes the `page_content` of each loaded document, and the `metadata` of each loaded document contains only the key-value pairs of the columns specified in `metadata_columns`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = MySQLLoader(\n", + " engine=engine,\n", + " table_name=TABLE_NAME,\n", + " content_columns=[\n", + " \"variety\",\n", + " \"quantity_in_stock\",\n", + " \"price_per_unit\",\n", + " \"organic\",\n", + " ],\n", + " metadata_columns=[\"fruit_id\", \"fruit_name\"],\n", + ")\n", + "loader.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Save document with customized page content & metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To save langchain documents into a table with customized metadata fields, we first need to create such a table via `MySQLEngine.init_document_table()`, specifying the list of `metadata_columns` we want it to have. In this example, the created table will have the following columns:\n", + "- description (type: text): for storing fruit description.\n", + "- fruit_name (type: text): for storing fruit name.\n", + "- organic (type: tinyint(1)): to tell if the fruit is organic.\n", + "- other_metadata (type: JSON): for storing other metadata information of the fruit.\n", "\n", - "creds = \"\"\n", - "client = Client(creds=creds)\n", - "loader = LOADER(\n", - " client=client,\n", + "We can use the following parameters with `MySQLEngine.init_document_table()` to create the table:\n", + "1. `table_name`: The name of the table within the Cloud SQL database to store langchain documents.\n", + "2. `metadata_columns`: A list of `sqlalchemy.Column` indicating the list of metadata columns we need.\n", + "3. 
`content_column`: The name of column to store `page_content` of langchain document. Default: `page_content`.\n", + "4. `metadata_json_column`: The name of JSON column to store extra `metadata` of langchain document. Default: `langchain_metadata`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "engine.init_document_table(\n", + " TABLE_NAME,\n", + " metadata_columns=[\n", + " sqlalchemy.Column(\n", + " \"fruit_name\",\n", + " sqlalchemy.UnicodeText,\n", + " primary_key=False,\n", + " nullable=True,\n", + " ),\n", + " sqlalchemy.Column(\n", + " \"organic\",\n", + " sqlalchemy.Boolean,\n", + " primary_key=False,\n", + " nullable=True,\n", + " ),\n", + " ],\n", + " content_column=\"description\",\n", + " metadata_json_column=\"other_metadata\",\n", + " overwrite_existing=True,\n", ")" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Save documents with `MySQLDocumentSaver.add_documents()`. As you can see in this example, \n", + "- `document.page_content` will be saved into `description` column.\n", + "- `document.metadata.fruit_name` will be saved into `fruit_name` column.\n", + "- `document.metadata.organic` will be saved into `organic` column.\n", + "- `document.metadata.fruit_id` will be saved into `other_metadata` column in JSON format." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_docs = [\n", + " Document(\n", + " page_content=\"Granny Smith 150 0.99\",\n", + " metadata={\"fruit_id\": 1, \"fruit_name\": \"Apple\", \"organic\": 1},\n", + " ),\n", + "]\n", + "saver = MySQLDocumentSaver(\n", + " engine=engine,\n", + " table_name=TABLE_NAME,\n", + " content_column=\"description\",\n", + " metadata_json_column=\"other_metadata\",\n", + ")\n", + "saver.add_documents(test_docs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with engine.connect() as conn:\n", + " result = conn.execute(sqlalchemy.text(f\"select * from `{TABLE_NAME}`;\"))\n", + " print(result.keys())\n", + " print(result.fetchall())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Delete documents with customized page content & metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also delete documents from table with customized metadata columns via `MySQLDocumentSaver.delete()`. 
The deletion criterion is:\n", + "\n", + "A `row` should be deleted if there exists a `document` in the list, such that\n", + "- `document.page_content` equals `row[page_content]`\n", + "- For every metadata field `k` in `document.metadata`\n", + " - `document.metadata[k]` equals `row[k]` or `document.metadata[k]` equals `row[langchain_metadata][k]`\n", + "- No metadata field is present in `row` that is absent from `document.metadata`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "loader = MySQLLoader(engine=engine, table_name=TABLE_NAME)\n", + "docs = loader.load()\n", + "print(\"Documents before delete:\", docs)\n", + "saver.delete(docs)\n", + "print(\"Documents after delete:\", loader.load())" + ] } ], "metadata": { @@ -164,9 +622,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +}
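
Note on the deletion criteria for tables with customized metadata columns: the matching rule described in the notebook can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation. `effective_metadata` and `should_delete` are hypothetical helpers, rows and documents are modeled as plain dicts, and treating `NULL` (`None`) column values as absent metadata is an assumption.

```python
from typing import Any, Dict, List


def effective_metadata(
    row: Dict[str, Any], content_column: str, metadata_json_column: str
) -> Dict[str, Any]:
    """Merge the JSON metadata column and the dedicated metadata columns of a row.

    Hypothetical helper: None-valued columns are treated as absent (assumption),
    and a dedicated column wins over a JSON key of the same name (assumption).
    """
    metadata = dict(row.get(metadata_json_column) or {})
    for column, value in row.items():
        if column in (content_column, metadata_json_column):
            continue
        if value is not None:
            metadata[column] = value
    return metadata


def should_delete(
    row: Dict[str, Any],
    documents: List[Dict[str, Any]],
    content_column: str = "page_content",
    metadata_json_column: str = "langchain_metadata",
) -> bool:
    """Return True if some document matches the row's content and all of its metadata."""
    row_metadata = effective_metadata(row, content_column, metadata_json_column)
    return any(
        doc["page_content"] == row.get(content_column)
        and doc["metadata"] == row_metadata  # no missing and no extra fields
        for doc in documents
    )


# A row shaped like the customized table from the notebook.
row = {
    "description": "Granny Smith 150 0.99",
    "fruit_name": "Apple",
    "organic": 1,
    "other_metadata": {"fruit_id": 1},
}
full_doc = {
    "page_content": "Granny Smith 150 0.99",
    "metadata": {"fruit_id": 1, "fruit_name": "Apple", "organic": 1},
}
partial_doc = {
    "page_content": "Granny Smith 150 0.99",
    "metadata": {"fruit_name": "Apple"},
}

print(should_delete(row, [full_doc], "description", "other_metadata"))
print(should_delete(row, [partial_doc], "description", "other_metadata"))
```

Under these assumptions, the second call does not match because the row carries metadata fields (`organic`, `fruit_id`) that the document lacks, which corresponds to the "no extra metadata field" condition.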