
Commit 5b69235

Lev Kokotov authored and gitbook-bot committed
GITBOOK-64: change request with no subject merged in GitBook
1 parent d59367e commit 5b69235

File tree

5 files changed (+263 additions, -34 deletions)

pgml-docs/docs/guides/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@
 * [Data Storage & Retrieval](data-storage-and-retrieval/README.md)
 * [Tabular data](data-storage-and-retrieval/tabular-data.md)
 * [Vectors](data-storage-and-retrieval/vectors.md)
+* [Partitioning](data-storage-and-retrieval/partitioning.md)
 * [Deploying PostgresML](deploying-postgresml/README.md)
 * [PostgresML Cloud](deploying-postgresml/postgresml-cloud/README.md)
 * [Plans](deploying-postgresml/postgresml-cloud/plans/README.md)
pgml-docs/docs/guides/data-storage-and-retrieval/partitioning.md

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Partitioning
+

pgml-docs/docs/guides/data-storage-and-retrieval/tabular-data.md

Lines changed: 60 additions & 34 deletions
@@ -1,16 +1,14 @@
 # Tabular data
 
-Tabular data is data stored in tables. While that's a bit of a recursive definition, tabular data is any kind for format that defines rows and columns, and is the most common type of data storage mechanism. Examples of tabular data include things like spreadsheets, database tables, CSV files, and Pandas dataframes.
+Tabular data is data stored in tables. A table is a format that defines rows and columns, and is the most common type of data organization mechanism. Examples of tabular data are spreadsheets, database tables, CSV files, and Pandas dataframes.
 
-Storing and accessing tabular data is a subject of decades of studies, and is the core purpose of many database systems. PostgreSQL has been leading the charge on optimal tabular storage for a while and remains today one of the most popular and effective ways to store, organize and retrieve this kind of data.
+Storing and accessing tabular data in an efficient manner has been the subject of decades of research, and is the core purpose of most database systems. PostgreSQL has been leading the charge on optimal tabular storage for a long time, and remains one of the most popular and effective ways to store, organize and retrieve tabular data today.
 
 ### Creating tables
 
 Postgres makes it really easy to create and use tables. If you're looking to use PostgresML for a supervised learning project, creating a table will be very similar to a Pandas dataframe, except it will be durable and easily accessible for as long as the database exists.
 
-For the rest of this guide, we'll take the [USA House Prices](https://www.kaggle.com/code/fatmakursun/supervised-unsupervised-learning-examples/) dataset from Kaggle, store it in Postgres and query it for basic statistics. The dataset has seven (7) columns and 5,000 rows:
-
-
+For the rest of this guide, we'll take the [USA House Prices](https://www.kaggle.com/code/fatmakursun/supervised-unsupervised-learning-examples/) dataset from Kaggle, store it in a Postgres table and run some basic queries. The dataset has seven (7) columns and 5,000 rows:
 
 | Column | Data type | Postgres data type |
 | ---------------------------- | --------- | ------------------ |
@@ -22,7 +20,7 @@ For the rest of this guide, we'll take the [USA House Prices](https://www.kaggle
 | Price | Float | REAL |
 | Address | String | VARCHAR |
 
-Once we know the column names and data types, the Postgres table definition almost writes itself:
+Once we know the column names and data types, the Postgres table definition is pretty straightforward:
 
 ```plsql
 CREATE TABLE usa_house_prices (
@@ -36,7 +34,7 @@ CREATE TABLE usa_house_prices (
 );
 ```
 
-The column names are double quoted because they contain special characters like `.` and space, which can be interpreted to be part of the SQL syntax. Generally speaking, it's good practice to double quote all entity names when using them in a PostgreSQL query, although most of the time it's not needed.
+The column names are double quoted because they contain special characters like `.` and space, which could otherwise be interpreted as part of the SQL syntax. Generally speaking, it's good practice to double quote all entity names when using them in a query, although most of the time it's not needed.
 
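As a quick illustration of why the quoting matters (a minimal sketch against the table above): Postgres folds unquoted identifiers to lowercase, so an unquoted reference to a capitalized column won't resolve:

```sql
-- Unquoted identifiers are case-folded: this would look for a column
-- named "price", which doesn't exist, since the table defines "Price".
-- SELECT Price FROM usa_house_prices;

-- Double quoting preserves the exact column name:
SELECT "Price" FROM usa_house_prices LIMIT 5;
```

With quoting out of the way, let's get back to the table we just created.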
 If you run this using `psql`, you'll get something like this:
 
@@ -56,9 +54,9 @@ postgresml=#
 
 ### Ingesting data
 
-Right now the table is empty and that's a bit boring. Let's import the USA House Prices dataset into it using one of the easiest and fastest way to do so in Postgres: using `COPY`.
+Right now the table is empty, which is a bit boring. Let's import the USA House Prices dataset using one of the fastest ways to do so in Postgres: with `COPY`.
 
-If you're like me and prefer to use the terminal, you can open up `psql` and ingest the dataset like this:
+If you're like me and prefer to use the terminal, you can open up `psql` and ingest the data like this:
 
 ```
 postgresml=# \copy usa_house_prices FROM 'USA_Housing.csv' CSV HEADER;
@@ -67,13 +65,13 @@ COPY 5000
 
 As expected, Postgres copied all 5,000 rows into the `usa_house_prices` table. `COPY` accepts CSV, text, and Postgres binary formats, but CSV is definitely the most common.
 
-You may have noticed that we used the `\copy` command in the terminal, not `COPY`. The `COPY` command actually comes in two forms: `\copy` which is a `psql` command that performs a local system to remote database server copy, and `COPY` which is more commonly used in applications. If you're writing your own application to ingest data into Postgres, you'll be using `COPY`.
+You may have noticed that we used the `\copy` command in the terminal, not `COPY`. The `COPY` command actually comes in two forms: `\copy`, which is a `psql` command that copies data from local files to a remote database, and `COPY`, which is more commonly used in applications to send data from other sources, like standard input, files, other databases and streams.
 
-### Querying data
+If you're writing your own application to ingest large amounts of data into Postgres, you should use `COPY` for maximum throughput.
 
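For example, the server-side form can read a file directly, as in this minimal sketch (the path is hypothetical, and the file has to live on the database host):

```sql
-- Server-side COPY: the file must be readable by the Postgres server
-- process, which requires superuser or the pg_read_server_files role.
COPY usa_house_prices
FROM '/var/lib/postgresql/USA_Housing.csv'
CSV HEADER;
```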
-Querying data stored in tables is what this is all about. After all, just storing data isn't particularly interesting or useful. Postgres has one of the most comprehensive and powerful querying languages of all data storage systems we've worked with so, for our example, we won't have any trouble calculating some statistics to understand our data better.
+### Querying data
 
-Let's compute some basic statistics on the "Avg. Area Income" column using SQL:
+Querying data stored in tables is what makes PostgresML so powerful. Postgres has one of the most comprehensive querying languages of all the databases we've worked with, so for our example, we won't have any trouble calculating some statistics:
 
 ```sql
 SELECT
@@ -87,19 +85,17 @@ SELECT
 FROM usa_house_prices;
 ```
 
-which produces exactly what we want:
-
 ```
  count |        avg        |    max    |   min    | percentile_75  |      stddev
 -------+-------------------+-----------+----------+----------------+-------------------
  5000 | 68583.10897773437 | 107701.75 | 17796.63 | 75783.33984375 | 10657.99120344229
 ```
 
-The SQL language is very expressive and allows to select, filter and aggregate any number of columns from any number of tables with a single query.
+The SQL language is expressive and allows you to select, filter and aggregate any number of columns with a single query.
 
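To make that concrete, here's a small filtered aggregate against the same table (a sketch; the threshold is just the rounded average income computed above):

```sql
-- Average price of houses in areas with above-average income.
SELECT
    COUNT(*) AS houses,
    AVG("Price") AS avg_price
FROM usa_house_prices
WHERE "Avg. Area Income" > 68583;
```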
 ### Adding more data
 
-Because databases store data in perpetuity, adding more data to Postgres can take several forms. The simplest and most commonly used way to add data is to just insert it into a table that we already have. Using the USA House Prices example, we can add a new row into the table with just one query:
+Because databases store data permanently, adding more data to Postgres can be done in many ways. The simplest and most commonly used way is to just insert it into a table you already have. Using the same example dataset, we can add a new row with just one query:
 
 ```sql
 INSERT INTO usa_house_prices (
@@ -121,40 +117,70 @@ INSERT INTO usa_house_prices (
 );
 ```
 
-Another way to add more data to a table is to run `COPY` again with a different CSV as the source. Many ETL pipelines from places like Snowflake or Redshift split their output into multiple CSVs, which can be individually imported into Postgres using multiple `COPY` statements.
+If you have more CSV files you'd like to ingest, you can run `COPY` for each one. Many ETL pipelines from Snowflake or Redshift chunk their output into multiple CSVs, which can be individually imported into Postgres using `COPY`.
+
+{% tabs %}
+{% tab title="Python" %}
+```python
+import psycopg
+from glob import glob
+
+with psycopg.connect("postgres:///postgresml") as conn:
+    cur = conn.cursor()
+
+    # Stream every CSV in the current directory into a single COPY.
+    with cur.copy("COPY usa_house_prices FROM STDIN CSV") as copy:
+        for csv_file in glob("*.csv"):
+            with open(csv_file) as f:
+                next(f)  # Skip header
+                for line in f:
+                    copy.write(line)
+```
+{% endtab %}
+
+{% tab title="Bash" %}
+```bash
+#!/bin/bash
 
-Adding rows is pretty simple, but now that our dataset is changing, we should explore some tools to help us protect it against bad values.
+for f in *.csv; do
+    psql postgres:///postgresml \
+        -c "\copy usa_house_prices FROM '$f' CSV HEADER"
+done
+```
+{% endtab %}
+{% endtabs %}
+
+Now that our dataset is changing, we should explore some tools to protect it against bad values.
 
 ### Data integrity
 
-Databases store very important data and they were built with many safety features to protect that data from common errors. In machine learning, one of the most common errors is data duplication, i.e. having the same row appear in the a table twice. Postgres can easily protect us against this with unique indexes.
+Databases store important data, so they were built with many safety features to protect it from common errors. In machine learning, one of the most common errors is data duplication, i.e. having the same row appear in a table twice. Postgres can protect us against this with unique indexes.
 
-Looking at the USA House Price dataset, we can find its natural key pretty easily. Since most columns are aggregates, the only column that seems unique is the "Address". After all, there should never be more than one house at a single address, not for sale anyway.
+Looking at the USA House Prices dataset, we can find its natural key pretty easily. Since most columns are aggregates, the only column that seems like it should contain unique values is the "Address". There should never be more than one house for sale at a single address.
 
-To ensure that our dataset reflects this, let's add a unique index to our table. To do so, we can use this SQL query:
+To ensure that our table reflects this, let's add a unique index:
 
 ```sql
 CREATE UNIQUE INDEX ON usa_house_prices USING btree("Address");
 ```
 
-Postgres scans the whole table, ensures there are no duplicates in the "Address" column and creates an index on that column using the B-Tree algorithm.
+When creating a unique index, Postgres scans the whole table, ensures there are no duplicates in the "Address" column, and writes the column into an index using the B-Tree algorithm.
 
-If we now attempt to insert the same row again, we'll get an error:
+If we attempt to insert the same row again, we'll get an error:
 
 ```
 ERROR: duplicate key value violates unique constraint "usa_house_prices_Address_idx"
 DETAIL: Key ("Address")=(1 Infinite Loop, Cupertino, California) already exists.
 ```
 
-Postgres supports many more indexing algorithms, namely GiST, BRIN, GIN, and Hash. Many extensions, for example `pgvector`, implement their own index types like HNSW and IVFFlat, to efficiently search and retrieve specialized values. We explore those in our guide about [Vectors](vectors.md).
+Postgres supports many more indexing algorithms, e.g. GiST, BRIN, GIN, and Hash. Many extensions, e.g. `pgvector`, implement their own index types like HNSW and IVFFlat, which help efficiently retrieve specialized values. We explore those in our guide about [Vectors](vectors.md).
 
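For a flavor of what those look like, here are two illustrative sketches (indexes this guide doesn't actually need):

```sql
-- A hash index supports equality lookups only, but can be
-- faster than a B-Tree for that one operation.
CREATE INDEX ON usa_house_prices USING hash("Address");

-- A BRIN index is tiny and works well on huge tables where values
-- correlate with physical row order, e.g. append-only timestamps.
CREATE INDEX ON usa_house_prices USING brin("Price");
```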
 ### Accelerating recall
 
-Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan to return a result becomes quite large and queries become slow. To help with that, tables should have indexes that order and organize commonly accessed columns. Scanning a B-Tree index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table search.
+Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. To help with that, tables should have indexes that order and organize commonly read columns. Searching a B-Tree index can be done in _O(log n)_ time: finding one row among five million takes roughly log2(5,000,000) ≈ 23 comparisons, which is orders of magnitude faster than the _O(n)_ full table scan.
 
 #### Querying an index
 
-Postgres automatically uses indexes when possible in order to accelerate recall. Using our example above, we can query data using the "Address" column and we can do so very quickly by using the unique index we created.
+Postgres automatically uses indexes when it's possible and optimal to do so. From our example, we can filter the dataset by the "Address" column, and we can do so very quickly because of the index we created:
 
 ```sql
 SELECT
@@ -173,11 +199,9 @@ which produces
 (1 row)
 ```
 
-which is exactly what we expected. Since we have a unique index on the table, we should only be getting one row back with that address.
-
-To ensure that Postgres is using an index when querying a table, we can ask it to produce the query execution plan that it's going to use before executing that query. A query plan is a list of steps that Postgres will take in order to get the query result we requested.
+Since we have a unique index on the table, we will only see one row with that address.
 
-To get the query plan for any query, prepend the keyword `EXPLAIN` to any query you're planning on running:
+To double-check that Postgres is using an index, we can look at the query execution plan. A query plan is the list of steps that Postgres will take to get the query result we requested. To see it, prepend the keyword `EXPLAIN` to the query you're running:
 
 ```
 postgresml=# EXPLAIN (FORMAT JSON) SELECT
@@ -208,10 +232,12 @@ WHERE "Address" = '1 Infinite Loop, Cupertino, California';
 ]
 ```
 
-The query plan indicates that it will be running an "Index Scan" using the index `usa_house_prices_Address_index` which is exactly what we want.
+The plan indicates it will use an "Index Scan" on the index `usa_house_prices_Address_idx`, which is what we're expecting.
+
+Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
 
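By contrast, `EXPLAIN ANALYZE` does execute the query and reports measured row counts and timings, so prefer plain `EXPLAIN` when you only want to see the plan:

```sql
-- Unlike EXPLAIN, EXPLAIN ANALYZE runs the query for real
-- and shows actual timings alongside the plan.
EXPLAIN ANALYZE
SELECT "Price"
FROM usa_house_prices
WHERE "Address" = '1 Infinite Loop, Cupertino, California';
```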
-The ability to create indexes on datasets of any size and to then efficiently query that data is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query datasets that would never be able to fit into memory and can do so quicker and more efficiently than most database systems currently used across the industry.
+The ability to create indexes on datasets of any size, and to then efficiently query that data with those indexes, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.
 
 #### Maintaining an index
 
-Indexes are automatically updated when new data is added and old data is removed. Postgres automatically ensures that indexes are efficiently organized and are ACID compliant. When using Postgres tables, the system guarantees that the data will always be consistent, no matter how many concurrent changes are made to the tables.
+Postgres indexes require no special maintenance. They are automatically updated when data is added or removed. Postgres also ensures that indexes are efficiently organized and ACID compliant. The database guarantees that the data is always consistent, no matter how many concurrent changes are made.
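If you're curious how much space an index takes as the data grows, you can ask Postgres directly (a small sketch; the index name assumes the default naming seen in the error message above):

```sql
-- Report the on-disk size of the unique index on "Address".
SELECT pg_size_pretty(pg_relation_size('"usa_house_prices_Address_idx"'));
```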
