
Commit c10bec2

Lev Kokotov authored and gitbook-bot committed
GITBOOK-65: change request with no subject merged in GitBook
1 parent 5b69235 commit c10bec2

File tree

5 files changed: +141 -30 lines changed


pgml-docs/docs/guides/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@
 * [Tabular data](data-storage-and-retrieval/tabular-data.md)
 * [Vectors](data-storage-and-retrieval/vectors.md)
 * [Partitioning](data-storage-and-retrieval/partitioning.md)
+* [Documents](data-storage-and-retrieval/documents.md)
 * [Deploying PostgresML](deploying-postgresml/README.md)
 * [PostgresML Cloud](deploying-postgresml/postgresml-cloud/README.md)
 * [Plans](deploying-postgresml/postgresml-cloud/plans/README.md)
pgml-docs/docs/guides/data-storage-and-retrieval/documents.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Documents
+
+WIP: [https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)
pgml-docs/docs/guides/data-storage-and-retrieval/partitioning.md

Lines changed: 107 additions & 0 deletions
@@ -1,2 +1,109 @@
# Partitioning

Partitioning is the act of splitting Postgres tables into multiple smaller tables, which allows us to query each smaller table independently. This is useful and sometimes necessary when tables get so large that querying a single table becomes too slow. Partitioning requires detailed knowledge of the dataset and uses that knowledge to help Postgres execute faster queries.

### Partitioning schemes

Postgres supports three (3) kinds of partitioning schemes: by range, by list, and by hash. Each scheme is appropriate for different use cases, and choosing the right scheme is important to get the best performance out of your data.
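
Partitioning by list, for example, splits rows by the discrete values of a column. A minimal sketch, with hypothetical table, column, and partition names:

```sql
-- Hypothetical example: rows are routed to a child table
-- by the exact value of the "region" column.
CREATE TABLE measurements (
    region TEXT,
    reading REAL
) PARTITION BY LIST (region);

CREATE TABLE measurements_east PARTITION OF measurements
FOR VALUES IN ('east', 'northeast');

CREATE TABLE measurements_west PARTITION OF measurements
FOR VALUES IN ('west', 'southwest');
```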
### Partition by range

Partition by range operates on numerical values. Dates, numbers and vectors can be used as range partition keys because their range of values can be split into non-overlapping parts.

For example, if we have a table with a date column (`TIMESTAMPTZ`, a date and time with timezone information), we can create three (3) partitions with the following bounds:

* partition 1 will contain all dates prior to January 1, 2000,
* partition 2 will contain all dates between January 1, 2000 and December 31, 2020,
* partition 3 will contain all dates on or after January 1, 2021.

While these ranges are not even, we chose them because of some knowledge we have about our dataset. In our hypothetical example, we know that these date ranges will split our dataset into three (3) roughly evenly sized tables.
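
A minimal sketch of those three partitions, using a hypothetical `events` table (the table, column, and partition names are assumed for illustration; `MINVALUE` and `MAXVALUE` are Postgres' unbounded range endpoints):

```sql
CREATE TABLE events (
    created_at TIMESTAMPTZ
) PARTITION BY RANGE (created_at);

-- All dates prior to January 1, 2000.
CREATE TABLE events_before_2000 PARTITION OF events
FOR VALUES FROM (MINVALUE) TO ('2000-01-01');

-- January 1, 2000 through December 31, 2020.
CREATE TABLE events_2000_2020 PARTITION OF events
FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

-- January 1, 2021 onward.
CREATE TABLE events_2021_onward PARTITION OF events
FOR VALUES FROM ('2021-01-01') TO (MAXVALUE);
```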
#### Building partitions

Let's build some real partitions with a dataset from Kaggle: [Hourly Energy Consumption](https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption).

You can create a partition by range in Postgres with just a few queries. Partitioning requires two types of tables: the parent table, which defines the partitioning scheme, and the child tables, which define the ranges and store the actual data.

Let's start with the parent table:

```sql
CREATE TABLE energy_consumption (
    "Datetime" TIMESTAMPTZ,
    "AEP_MW" REAL
) PARTITION BY RANGE("Datetime");
```
Now, let's add a couple child tables:

```sql
CREATE TABLE energy_consumption_2004_2011
PARTITION OF energy_consumption
FOR VALUES FROM ('2004-01-01') TO ('2011-12-31');

CREATE TABLE energy_consumption_2012_2018
PARTITION OF energy_consumption
FOR VALUES FROM ('2011-12-31') TO ('2018-12-31');
```

Postgres partition bounds are defined as `[start, end)`, which means the start of the range is included and the end of the range is excluded.
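
We can verify this with a quick, hypothetical check: a row placed exactly on the shared bound `'2011-12-31'` should be routed to the second child table, since that value is excluded from the first range and included in the second. The inserted row below is made up for illustration; `tableoid::regclass` is a standard Postgres way to see which child table stores a row:

```sql
-- A made-up row exactly on the boundary between the two partitions.
INSERT INTO energy_consumption VALUES ('2011-12-31 00:00:00', 15000);

-- tableoid::regclass reveals the child table that physically stores each row.
SELECT tableoid::regclass AS partition, "Datetime", "AEP_MW"
FROM energy_consumption
WHERE "Datetime" = '2011-12-31 00:00:00';
```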
Let's ingest the dataset into our partitioned table and see what we get:

```
postgresml=# \copy energy_consumption FROM 'AEP_hourly.csv' CSV HEADER;
COPY 121273
```

We have a grand total of 121,273 rows. If we partitioned the dataset correctly, the two child tables should have roughly the same number of rows:

```
postgresml=# SELECT count(*) FROM energy_consumption_2004_2011;
 count
-------
 63511

postgresml=# SELECT count(*) FROM energy_consumption_2012_2018;
 count
-------
 57762
```

Nicely done. The two counts are pretty close to each other, which gives us a roughly even distribution of data in our partitioning scheme.
Postgres allows us to query each partition individually, which is nice if we know what the range specification is. While this works in our example, in a living dataset we would continue to add partitions to include more values. If we wanted to store dates for the years 2019 through 2023, for example, we would need to create at least one more child table, as sketched below.
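A minimal sketch of that extra child table, with the name and bounds assumed to continue the existing scheme:

```sql
-- Covers 2019 through 2023; bounds follow the [start, end) convention above.
CREATE TABLE energy_consumption_2019_2023
PARTITION OF energy_consumption
FOR VALUES FROM ('2018-12-31') TO ('2023-12-31');
```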
To make this user-friendly, Postgres allows us to query the parent table instead. As long as we specify the partition key, we are guaranteed to get the most efficient query plan possible:

```sql
SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';
```

```
        avg
--------------------
 15175.689170820118
```

If we look at the query plan, we'll see that Postgres only queries the first child table we created:

```
postgresml=# EXPLAIN SELECT
    avg("AEP_MW")
FROM energy_consumption
WHERE "Datetime" BETWEEN '2004-01-01' AND '2005-01-01';

                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=10000001302.18..10000001302.19 rows=1 width=8)
   ->  Seq Scan on energy_consumption_2004_2011 energy_consumption (...)
         Filter: [...]
```

which reduces the number of rows Postgres has to scan by half. By adding more, smaller partitions, we can significantly reduce the amount of data Postgres needs to scan to execute a query. That being said, scanning multiple tables can be more expensive than scanning just one, so adding too many partitions quickly erodes the benefits if queries need to scan more than a few child tables.
### Partition by hash

Partitioning by hash, unlike by range, can be applied to any data type, including text. A hash function is applied to the partition key to produce a reasonably unique number, and that number is then divided by the number of partitions; the remainder determines the right child table for the row.
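
A minimal sketch of a hash-partitioned table (the table and column names are hypothetical; `MODULUS` and `REMAINDER` are the actual Postgres bound syntax):

```sql
-- Rows are routed by hash(user_id) modulo 4: the remainder picks the child table.
CREATE TABLE user_events (
    user_id TEXT,
    event TEXT
) PARTITION BY HASH (user_id);

CREATE TABLE user_events_0 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_events_1 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_events_2 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_events_3 PARTITION OF user_events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```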

pgml-docs/docs/guides/data-storage-and-retrieval/tabular-data.md

Lines changed: 7 additions & 9 deletions
@@ -176,11 +176,11 @@ Postgres supports many more indexing algorithms, e.g. GiST, BRIN, GIN, and Hash.
 
 ### Accelerating recall
 
-Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. To help with that, tables should have indexes that order and organize commonly read columns. Searching a B-Tree index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table search.
+Once the dataset gets large enough, and we're talking millions of rows, it's no longer practical to query the table directly. The amount of data Postgres has to scan becomes large and queries become slow. At that point, tables should have indexes that order and organize commonly read columns. Searching an index can be done in _O(log n)_ time, which is orders of magnitude faster than the _O(n)_ full table scan.
 
 #### Querying an index
 
-Postgres automatically uses indexes when its possible and optimal to do so. From our example, we can filter the dataset by the "Address" column, and we can do so very quickly because of the index we created:
+Postgres automatically uses indexes when it's possible and optimal to do so. From our example, if we filter the dataset by the "Address" column, Postgres will use the index we created and return a result quickly:
 
 ```sql
 SELECT
@@ -199,9 +199,9 @@ which produces
 (1 row)
 ```
 
-Since we have a unique index on the table, we will only see one row with that address.
+Since we have a unique index on the table, we expect to see only one row with that address.
 
-To double check that Postgres is using an index, we can check the query execution plan. A query plan is a list of steps that Postgres will take to get the query result we requested. To see the query plan, prepend the keyword `EXPLAIN` to the query you're running:
+To double check that Postgres is using an index, we can take a look at the query execution plan. A query plan is a list of steps that Postgres will take to get the result of the query. To see the query plan, prepend the keyword `EXPLAIN` to the query you'd like to run:
 
 ```
 postgresml=# EXPLAIN (FORMAT JSON) SELECT
@@ -232,12 +232,10 @@ WHERE "Address" = '1 Infinite Loop, Cupertino, California';
 ]
 ```
 
-The plan indicates it will use an "Index Scan" on the index `usa_house_prices_Address_index` which is what we're expecting.
+The plan indicates that it will use an "Index Scan" on `usa_house_prices_Address_index`, which is what we're expecting. Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
 
-Using `EXPLAIN` doesn't actually run the query, so it's safe to use on production systems.
-
-The ability to create indexes on datasets of any size, and to then efficiently query that data with those indexes is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.
+The ability to create indexes on datasets of any size, and to efficiently query that data using indexes, is what separates Postgres from most ad-hoc tools like Pandas and Arrow. Postgres can store and query data that would never fit in memory, and it can do that quicker and more efficiently than most other database systems used in the industry.
 
 #### Maintaining an index
 
-Postgres indexes require no special maintenance. They are automatically updated when data is added or removed. Postgres also ensures that indexes are efficiently organized and are ACID compliant. The database guarantees that the data is always consistent, no matter how many concurrent changes are made.
+Postgres indexes require no special maintenance. They are automatically updated when data is added and removed. Postgres also ensures that indexes are efficiently organized and are ACID compliant: the database guarantees that the data is always consistent, no matter how many concurrent changes are made.
