
Commit 008a107

Lev Kokotov authored and gitbook-bot committed
GITBOOK-62: change request with no subject merged in GitBook
1 parent 0b9a526 commit 008a107

File tree: 3 files changed, +178 -0 lines changed

pgml-docs/docs/guides/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -58,6 +58,8 @@
 * [Pooler](deploying-postgresml/self-hosting/pooler.md)
 * [Building from source](deploying-postgresml/self-hosting/building-from-source.md)
 * [Replication](deploying-postgresml/self-hosting/replication.md)
+* [Backups](deploying-postgresml/self-hosting/backups.md)
+* [Running on EC2](deploying-postgresml/self-hosting/running-on-ec2.md)
 * [PgCat](pgcat.md)
 * [Benchmarks](benchmarks/README.md)
 * [PostgresML is 8-40x faster than Python HTTP microservices](benchmarks/postgresml-is-8-40x-faster-than-python-http-microservices.md)

pgml-docs/docs/guides/deploying-postgresml/self-hosting/backups.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# Backups

Regular backups are necessary for almost any kind of PostgreSQL deployment. Accidents happen even in development, and instead of losing data, you can restore from a backup and get back to a working state.

PostgresML backups work the same way as regular PostgreSQL database backups. PostgresML stores its data in regular Postgres tables, which are backed up together with the rest of your tables and schemas.

### Architecture

Postgres backups are composed of two (2) components: the Write-Ahead Log (WAL) archive and copies of the data files. The WAL archive stores every single write made to the database. The data file copies contain point-in-time snapshots of your databases, going back as far as the retention period of the backup repository.

Using the WAL and the data file copies together, Postgres can be restored to any point-in-time version of the database. This is a very powerful tool used for development and disaster recovery.

### Configure the archive

If you have followed the [Replication](replication.md) guide, you should have a working WAL archive. If not, take a look at it to get your archive configured, and come back to this guide once you have a working WAL archive.

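For reference, a minimal archiving setup, as covered in the Replication guide, might look like the sketch below. The bucket name, region, and data directory path are placeholders; adjust them to your own setup:

```
# /etc/pgbackrest.conf — example values only
[global]
repo1-type=s3
repo1-path=/backups
repo1-s3-bucket=my-backup-bucket
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-east-1

[main]
pg1-path=/var/lib/postgresql/14/main
```

together with `archive_mode = on` and `archive_command = 'pgbackrest --stanza=main archive-push %p'` in `postgresql.conf`.
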
### Take your first backup

Since we are already using pgBackRest for archiving WAL, we can continue to use it to take backups. pgBackRest can easily take full and incremental backups of pretty large database clusters. We've used it previously in production to back up terabytes of Postgres data on a weekly basis.

To take a backup using pgBackRest, you can simply run this command:

```bash
pgbackrest backup --stanza=main
```

Once the command completes, you'll have a full backup of your database cluster safely stored in your S3 bucket. If you'd like to see exactly what pgBackRest does to take a backup, you can add this option to the command above:

```
--log-level-console=debug
```

pgBackRest will log every single step it takes to produce a working backup.

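Since pgBackRest supports incremental and differential backups in addition to full ones, you can mix them to save time and storage on large clusters. A quick sketch using pgBackRest's standard `--type` option:

```bash
# Full backup (pgBackRest takes a full backup automatically if none exists yet)
pgbackrest backup --stanza=main --type=full

# Incremental backup: stores only the changes since the last backup
pgbackrest backup --stanza=main --type=incr
```
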
### Restoring from backup

When disaster strikes, or you would just like to travel back in time, you can restore your database from your latest backup with just a couple of commands.

#### Stop the PostgreSQL server

Restoring from backup will completely overwrite your existing database files. Therefore, don't do this unless you actually need to restore from backup.

To do so, first stop the PostgreSQL database server, if it's running:

```
sudo service postgresql stop
```

#### Restore the latest backup

Now that PostgreSQL is no longer running, you can restore the latest backup using pgBackRest:

```
pgbackrest restore --stanza=main --delta
```

The `--delta` option makes pgBackRest check every single file in the Postgres data directory and, if it's different, overwrite it with the one saved in the backup repository. This is a quick way to restore a backup when most of the database files have not been corrupted or modified.

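Because the WAL archive enables point-in-time recovery, you can also restore to a specific moment instead of the latest backup. A minimal sketch using pgBackRest's time-based recovery target, where the timestamp is a placeholder:

```bash
pgbackrest restore --stanza=main --delta \
  --type=time --target="2023-10-01 12:00:00+00"
```
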
#### Start the PostgreSQL server

Once the restore is complete, your PostgreSQL server is ready to start again. You can do so with:

```
sudo service postgresql start
```

This will start PostgreSQL and make it check its local data files for consistency. That check completes pretty quickly, after which Postgres will start downloading and re-applying Write-Ahead Log files from the archive. When that operation completes, your PostgreSQL database will start and you'll be able to connect and use it again.

Depending on how much data has been written to the archive since the last backup, the restore operation could take a bit of time. To minimize the time it takes for Postgres to start again, you can take more frequent backups, e.g. every 6 hours or every 2 hours. While costing more in storage and compute, this ensures that your database recovers from a disaster much quicker than it would have with just a daily backup.

### Managing backups

Backups can take up a lot of space over time, and some of them may no longer be needed. You can view which backups and WAL files are stored in your S3 bucket with:

```
pgbackrest info
```

#### Retention policy

For most production deployments, you don't need to, and shouldn't, retain more than a few backups. We would usually recommend keeping two (2) weeks of backups and WAL files, which should be enough time to notice that some data may be missing and needs to be restored.

If you run full backups once a day (which should be plenty), you can set your pgBackRest backup retention policy to 14 days by adding a couple of settings to your `/etc/pgbackrest.conf` file:

```
[global]
repo1-retention-full=14
repo1-retention-archive=14
```

This configuration ensures that you keep at least 14 backups and 14 backups' worth of WAL files. Because Postgres allows point-in-time recovery, you'll be able to restore your database to any version (up to millisecond precision) going back two (2) weeks.

#### Automating backups

Backups can be automated by running `pgbackrest backup --stanza=main` from a cron job. You can edit your cron with `crontab -e` and add a daily midnight run, ensuring that you have fresh backups every day. Make sure you're editing the crontab of the `postgres` user, since no other user will be allowed to back up Postgres or read the pgBackRest configuration file. A sketch of that entry is shown below.

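The daily midnight entry in the `postgres` user's crontab could look like this:

```
# m h dom mon dow command
0 0 * * * pgbackrest backup --stanza=main
```
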
### PostgresML considerations

Since PostgresML stores most of its data in regular Postgres tables, a PostgreSQL backup is a valid PostgresML backup. The only thing stored outside of Postgres is the Hugging Face LLM cache, which is stored directly on disk in `/var/lib/postgresql/.cache`. In case of a disaster, the cache will be lost, but that's fine: since it's only a cache, the next time a PostgresML `pgml.embed()` or `pgml.transform()` function is used, PostgresML will automatically repopulate all the necessary files in the cache from Hugging Face and resume normal operations.

#### Hugging Face cold starts

To avoid cold starts, it's reasonable to back up the entire contents of the cache to a separate S3 location. When restoring from backup, you can just use `aws s3 sync` to download everything that should be in the cache folder back onto the machine. Make sure to do so before you start PostgreSQL, in order to avoid a race condition with the Hugging Face library.

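A minimal sketch of that restore step, where the bucket name and prefix are placeholders:

```bash
# Run as the postgres user, before starting PostgreSQL,
# so the cache is warm on the first query
aws s3 sync s3://my-backup-bucket/hf-cache /var/lib/postgresql/.cache
```
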

pgml-docs/docs/guides/deploying-postgresml/self-hosting/running-on-ec2.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# Running on EC2

AWS EC2 has been around for quite a while and requires no introduction. Running PostgresML on EC2 is very similar to any other cloud provider or on-prem deployment, but EC2 does provide a few additional features that allow PostgresML to scale into terabytes and beyond quite easily.

### Operating system

We're big fans of Ubuntu and use it in our Cloud. AWS provides its own Ubuntu images (called AMIs, or Amazon Machine Images) which work very well and come with all the standard tools needed to run a PostgreSQL server.

### Storage

The choice of storage is critical to scalable and performant AI database operations. PostgresML deals with large datasets and even larger models, so performant and durable storage is important.

EC2 provides two kinds of storage that can be used for running databases: EBS (Elastic Block Store) and ephemeral NVMe drives. NVMe storage is typically faster than EBS and provides much lower latency, but it lacks some of the durability guarantees one may want from a database deployment. We've run databases on both, but currently prefer to use EBS, because it allows us to take instant backups of our databases and to scale the storage of a database cluster independently from compute.

#### Choosing storage type

EBS has many different kinds of volumes, such as `gp2`, `gp3`, `io1`, `io2`, etc. The type of volume to use really depends on the cost/benefit analysis for the deployment in question. For example, if money is no object, running on `io2` would provide pretty great performance and durability guarantees. That being said, most deployments would be quite happy running on `gp3`.

#### Choosing the filesystem

The choice of the filesystem is a bit like getting married: you should really know what you're getting yourself into, and more often than not, your choice will stay with you for years to come. We've benchmarked and used many different filesystems in production, including ext4, ZFS, btrfs and NTFS. Our current filesystem of choice is ZFS, because of its high durability, consistency and reasonable performance guarantees.

### Backups

If you choose to use EBS for your database storage, special consideration should be given to backups. If you decide to use pgBackRest to back up your database, you needn't read any further; however, if you'd like to use EBS snapshots, there is a quick tip that could save you from problematic outcomes down the line.

EBS snapshots are a point-in-time copy of an EBS volume. That means that if you take a snapshot of an EBS volume and restore it, whatever you had on that volume at the time of the snapshot will be exactly the way you left it. However, if you take a snapshot while writing to the volume, that write may only be partially saved in the snapshot. This is because EBS snapshots are controlled by the EBS server, and the filesystem is not aware of its internal operations or that a snapshot is being taken at all. This is very similar to how snapshots work on hardware RAID volume managers.

If you don't pause writes to your filesystem before you take an EBS snapshot, you run the risk of losing some of your data or, in the worst case, corrupting your filesystem. That means, if you're using a filesystem like ext4, consider running `fsfreeze(8)` before taking an EBS snapshot.

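A minimal sketch of that sequence, assuming the database lives on a dedicated mount at `/var/lib/postgresql` and where the volume ID is a placeholder:

```bash
# Suspend writes so the snapshot is filesystem-consistent
sudo fsfreeze --freeze /var/lib/postgresql

# Take the EBS snapshot (volume ID is a placeholder)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "postgres data snapshot"

# Resume writes as soon as the snapshot has been initiated
sudo fsfreeze --unfreeze /var/lib/postgresql
```
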
If you're like us and prefer ZFS, you don't need to do anything. ZFS is a copy-on-write filesystem that guarantees all writes made to it are atomic. Even if the EBS snapshot cuts it off mid-write, the filesystem will always be in a consistent state, although you may lose that last write that never fully made it into the snapshot.

#### Taking an EBS backup

You can use EBS snapshots for creating replicas and for disaster recovery. An EBS snapshot works much like `pg_basebackup`, except it's instantaneous. To ensure that your backup is easily restorable, make sure to first create the `/var/lib/postgresql/14/main/standby.signal` file and only then take the snapshot.

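A hedged sketch of that order of operations; the paths assume PostgreSQL 14 on Ubuntu, and the volume ID is a placeholder:

```bash
# Create standby.signal first, so a restored instance starts as a replica
sudo -u postgres touch /var/lib/postgresql/14/main/standby.signal

# Only then take the snapshot
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "postgres replica image"
```
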
This ensures that when you restore from that backup, Postgres does not automatically promote itself and start accepting writes. If that happens, you won't be able to use it as a replica without getting into `pg_rewind`.

Alternatively, you can disable the `postgresql` service by default, ensuring that Postgres does not start automatically on system boot.

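On a systemd-based distribution like Ubuntu, that could be as simple as:

```bash
sudo systemctl disable postgresql
```
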
#### pgBackRest

If you're using pgBackRest for backups and archiving, you can take advantage of EC2 IAM integration. Instead of saving AWS IAM keys and secrets in `/etc/pgbackrest.conf`, you can configure it to fetch temporary credentials from the EC2 API:

```
[global]
repo1-s3-key-type=auto
```

Make sure that your EC2 IAM role has sufficient permissions to access your WAL archive S3 bucket.

### Performance

A typical single-volume storage configuration is fine for low traffic databases. However, if you need additional performance, you have a few options. One option is to simply allocate more IOPS to your volume. That works, but it may be a bit costly at scale. Another option is to combine multiple EBS volumes into either a RAID0, for maximum throughput, or a RAIDZ1, for good throughput and reasonable durability guarantees.

ZFS supports both RAID0 and RAIDZ1 configurations. If you have, say, 4 volumes, you can set up a RAID0 pool with just a couple of commands:

```
zpool create tank /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
zfs create -o mountpoint=/var/lib/postgresql tank/pgdata
```

or a RAIDZ1 pool with 5 volumes:

```
zpool create tank raidz /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
```

RAIDZ1 protects against single volume failure, allowing you to replace an EBS volume without taking your database offline or restoring from backup. Considering EBS durability guarantees and the additional redundancy provided by RAIDZ1, this is a reasonable configuration for systems that require good durability and performance guarantees.

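If a volume does fail, ZFS can swap it out online. A minimal sketch, where the failed and replacement device names are placeholders:

```bash
# Check pool health to identify the failed device
zpool status tank

# Replace the failed device with a newly attached EBS volume
zpool replace tank /dev/nvme2n1 /dev/nvme6n1
```
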
A RAID configuration with 4 volumes allows up to 4x the read throughput which, in EBS terms, can produce up to 600MBps without having to pay for additional IOPS.

