Posts about Go

New JSONPath Feature: SelectLocated

1 Jan 2025

Happy New Year! 🎉🥳🍾🥂

The JSONPath RFC includes a section on defining normalized paths, which use a subset of JSONPath syntax to define paths to the location of a node in a JSON value. I hadn’t thought much about it, but noticed that the serde JSONPath Sandbox provides a “Located” switch that adds them to query results. For the sake of complementarity, I added the same feature to the Go JSONPath Playground.

🛝 See it in action with this example, where instead of the default output:

[
  8.95,
  12.99,
  8.99,
  22.99,
  399
]

The located result is:

[
  {
    "node": 8.95,
    "path": "$['store']['book'][0]['price']"
  },
  {
    "node": 12.99,
    "path": "$['store']['book'][1]['price']"
  },
  {
    "node": 8.99,
    "path": "$['store']['book'][2]['price']"
  },
  {
    "node": 22.99,
    "path": "$['store']['book'][3]['price']"
  },
  {
    "node": 399,
    "path": "$['store']['bicycle']['price']"
  }
]

v0.3.0 of the github.com/theory/jsonpath Go package enables this feature via its new SelectLocated method, which returns a LocatedNodeList that shows off a few of the benefits of pairing JSONPath query results with paths that uniquely identify their locations in a JSON value, including sorting and deduplication. It also takes advantage of Go v1.23 iterators, providing methods to range over all the results, just the node values, and just the paths. As a result, v0.3.0 now requires Go 1.23.
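To make the pairing of nodes and normalized paths concrete, here is a minimal sketch that uses only the standard library. It does not use the jsonpath package or its API: the located type and the quote, find, and pathsFor functions are hypothetical names invented for this illustration, and the escaping is simplified to backslashes and single quotes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// located pairs a selected value with its normalized path. It is a
// hypothetical stand-in for illustration, not the package's own type.
type located struct {
	Node any
	Path string
}

// quote escapes a member name for a normalized path. Simplified: it
// handles only backslashes and single quotes, not control characters.
func quote(name string) string {
	return "'" + strings.NewReplacer(`\`, `\\`, `'`, `\'`).Replace(name) + "'"
}

// find walks v depth-first and collects every object member called name,
// recording the normalized path that leads to it.
func find(v any, name, path string) []located {
	var out []located
	switch val := v.(type) {
	case map[string]any:
		keys := make([]string, 0, len(val))
		for k := range val {
			keys = append(keys, k)
		}
		sort.Strings(keys) // Go maps are unordered; sort for stable results.
		for _, k := range keys {
			p := path + "[" + quote(k) + "]"
			if k == name {
				out = append(out, located{Node: val[k], Path: p})
			}
			out = append(out, find(val[k], name, p)...)
		}
	case []any:
		for i, item := range val {
			out = append(out, find(item, name, path+"["+strconv.Itoa(i)+"]")...)
		}
	}
	return out
}

// pathsFor decodes src and returns the sorted normalized paths of all
// members called name.
func pathsFor(src, name string) ([]string, error) {
	var v any
	if err := json.Unmarshal([]byte(src), &v); err != nil {
		return nil, err
	}
	var paths []string
	for _, l := range find(v, name, "$") {
		paths = append(paths, l.Path)
	}
	sort.Strings(paths)
	return paths, nil
}

func main() {
	src := `{"store":{"book":[{"price":8.95},{"price":12.99}],"bicycle":{"price":399}}}`
	paths, err := pathsFor(src, "price")
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Because each path identifies exactly one location, sorting the paths gives a stable order and would place any duplicate selections next to each other, which is what makes sorting and deduplication straightforward.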

The serde_json_path Rust crate inspired the use of LocatedNodeList rather than a simple slice of LocatedNode structs, but I truly embraced it once I noticed the focus on “nodelists” in the RFC’s overview, which provides this definition:

A JSONPath expression is a string that, when applied to a JSON value (the query argument), selects zero or more nodes of the argument and outputs these nodes as a nodelist.

It regularly refers to nodelists thereafter, and it seemed useful to have an object to which more features can be added in the future. github.com/theory/jsonpath v0.3.0 therefore also changes the result value of Select from []any to the new NodeList type, an alias for []any. For now it adds a single method, All, which again relies on Go v1.23 iterators to iterate over selected nodes.

While the data type has changed, usage otherwise has not. One can iterate directly over values just as before:

for _, val := range path.Select(jsonInput) {
    fmt.Printf("%v\n", val)
}

But All removes the need to alias-away the index value with _:

for val := range path.Select(jsonInput).All() {
    fmt.Printf("%v\n", val)
}

I don’t expect any further incompatible changes to the main jsonpath module, but adding these return values now allows new features to be added to the selected node lists in the future.

May you find it useful!

SQL/JSON Path Playground Update

31 Dec 2024

Based on the recently-released Go JSONPath and JSONTree playgrounds, I’ve updated the design of the SQL/JSON Playground. It now comes populated with sample JSON borrowed from RFC 9535, as well as a selection of queries that randomly populate the query field on each reload. I believe this makes the playground nicer to start using, not to mention more pleasing to the eye.

The playground has also been updated to use the recently-released sqljson/path v0.2 package, which replicates a few changes included in the PostgreSQL 17 release. Notably, the .string() function no longer uses a time zone or variable format for dates and times.

Curious to see it in action? Check it out!

JSONTree Module and Playground

22 Dec 2024

As a follow-up to the JSONPath module and playground I released last month, I’m happy to announce a companion project, called JSONTree. I’ve implemented it in the github.com/theory/jsontree Go package, and built a Wasm-powered browser playground for it.

JSONTree?

While an RFC 9535 JSONPath query selects and returns an array of values from the end of a path expression, a JSONTree compiles multiple JSONPath queries into a single query that selects values from multiple path expressions. It returns results not as an array, but as a subset of the query input, preserving the paths for each selected value.

In other words, it compiles multiple paths into a single tree of selection paths, and preserves the tree structure of the input. Hence JSONTree.

Example

Consider this JSON:

{
  "store": {
    "book": [
      {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      {
        "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      {
        "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 399
    }
  }
}

This JSONPath query:

$..price

Selects these values (playground):

[8.95, 12.99, 8.99, 22.99, 399]

While this JSONPath query:

$..author

Selects (playground):

[
  "Nigel Rees",
  "Evelyn Waugh",
  "Herman Melville",
  "J. R. R. Tolkien"
]

JSONTree compiles these two JSONPaths into a single query that merges the author and price selectors into a single segment, which stringifies to a tree-style format (playground):

$
└── ..["author","price"]

This JSONTree returns the appropriate subset of the original JSON object (playground):

{
  "store": {
    "book": [
      {
        "author": "Nigel Rees",
        "price": 8.95
      },
      {
        "author": "Evelyn Waugh",
        "price": 12.99
      },
      {
        "author": "Herman Melville",
        "price": 8.99
      },
      {
        "author": "J. R. R. Tolkien",
        "price": 22.99
      }
    ],
    "bicycle": {
      "price": 399
    }
  }
}

Note that the original data structure remains, but only for the subset of the structure selected by the JSONPath queries.
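The idea of returning a structure-preserving subset can be sketched with the standard library alone. The following is only an illustration of the concept for a query shaped like the one above (named members at any depth); it is not how the jsontree package is implemented, and prune and pruneJSON are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prune returns the parts of v that hold members named in keep, leaving
// the surrounding objects and arrays in place. The boolean reports
// whether anything was kept.
func prune(v any, keep map[string]bool) (any, bool) {
	switch val := v.(type) {
	case map[string]any:
		out := map[string]any{}
		for k, child := range val {
			if keep[k] {
				out[k] = child // Selected: keep the whole value.
			} else if sub, ok := prune(child, keep); ok {
				out[k] = sub // Not selected, but a descendant was.
			}
		}
		return out, len(out) > 0
	case []any:
		out := []any{}
		for _, child := range val {
			if sub, ok := prune(child, keep); ok {
				out = append(out, sub)
			}
		}
		return out, len(out) > 0
	}
	return nil, false
}

// pruneJSON decodes src, keeps only the named members, and re-encodes.
func pruneJSON(src string, names ...string) (string, error) {
	var v any
	if err := json.Unmarshal([]byte(src), &v); err != nil {
		return "", err
	}
	keep := map[string]bool{}
	for _, n := range names {
		keep[n] = true
	}
	sub, _ := prune(v, keep)
	out, err := json.Marshal(sub)
	return string(out), err
}

func main() {
	src := `{"store":{"book":[{"author":"Nigel Rees","title":"Sayings of the Century","price":8.95}],"bicycle":{"color":"red","price":399}}}`
	out, err := pruneJSON(src, "author", "price")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Branches with no selected descendants drop out entirely, while every selected value stays reachable by the same object keys it had in the input.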

Use Cases

A couple of use cases drove the conception and design of JSONTree.

Permissions

Consider an application in which ACLs define permissions for groups of users to access specific branches or fields of JSON documents. When delivering a document, the app would:

Fetch the groups the user belongs to
Convert the permissions from each into JSONPath queries
Compile the JSONPath queries into a JSONTree query
Select and return the permitted subset of the document to the user

Selective Indexing

Consider a searchable document storage system. For large or complex documents, it may be infeasible or unnecessary to index the entire document for full-text search. To index a subset of the fields or branches, one would:

Define JSONPaths for the fields or branches to index
Compile the JSONPath queries into a JSONTree query
Select and submit only the specified subset of each document to the indexing system

Go Example

Use the github.com/theory/jsontree Go package together with github.com/theory/jsonpath to compile and execute JSONTree queries:

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/theory/jsonpath"
	"github.com/theory/jsontree"
)

func main() {
	// JSON as unmarshaled by encoding/json.
	value := map[string]any{
		"name":  "Barrack Obama",
		"years": "2009-2017",
		"emails": []any{
			"potus@example.com",
			"barrack@example.net",
		},
	}

	// Compile multiple JSONPaths into a JSONTree.
	tree := jsontree.New(
		jsonpath.MustParse("$.name"),
		jsonpath.MustParse("$.emails[1]"),
	)

	// Select from the input value and encode the result as JSON.
	js, err := json.Marshal(tree.Select(value))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", js)
}

And the output:

{"emails":["barrack@example.net"],"name":"Barrack Obama"}

Note that the index position of the selected email was not preserved. Replace New with NewFixedModeTree to create a “fixed mode” JSONTree that preserves index positions by filling gaps with nulls. Its output of the above example would be:

{"emails":[null,"barrack@example.net"],"name":"Barrack Obama"}
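The difference between the two modes can be illustrated for a single array with the standard library. This is a sketch of the behavior described above, not the jsontree implementation; pick and pickJSON are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pick selects the given indexes from items. In fixed mode, unselected
// positions before the last selected index are filled with nil so that
// selected items keep their original positions.
func pick(items []any, fixed bool, indexes ...int) []any {
	want := map[int]bool{}
	last := -1
	for _, i := range indexes {
		if i >= 0 && i < len(items) {
			want[i] = true
			if i > last {
				last = i
			}
		}
	}
	out := []any{}
	for i := 0; i <= last; i++ {
		switch {
		case want[i]:
			out = append(out, items[i])
		case fixed:
			out = append(out, nil) // Placeholder that encodes as null.
		}
	}
	return out
}

// pickJSON renders the result of pick as JSON for easy comparison.
func pickJSON(items []any, fixed bool, indexes ...int) string {
	out, err := json.Marshal(pick(items, fixed, indexes...))
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	emails := []any{"potus@example.com", "barrack@example.net"}
	fmt.Println(pickJSON(emails, false, 1)) // ["barrack@example.net"]
	fmt.Println(pickJSON(emails, true, 1))  // [null,"barrack@example.net"]
}
```

Fixed mode trades a more compact result for one in which array indexes in the output still line up with the input.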

Status

The public interface of the jsontree module is quite minimal and stable. But I suspect there may remain some flaws in the merging of JSONPath selectors. Please report bugs via GitHub issues and I’ll get them fixed up ASAP.

Otherwise, please share and enjoy!

Introducing RFC 9535 Go JSONPath and Playground

4 Nov 2024

I’ve written and released an RFC 9535 JSONPath Go package, github.com/theory/jsonpath. Why? For a personal project, I needed a simpler JSONPath engine to complement the Go SQL/JSON Path package, and quickly found myself implementing most of the RFC. So I decided to do the whole thing.

Yes, yet another JSONPath package in Go. I really appreciate the idea of a standard — plus its support for features not included in the original design from 2007, such as object slices! But I could find no reference to the RFC on pkg.go.dev. Today the search shows one!

Example

Usage is straightforward; here’s a quick example (Go playground):

package main

import (
	"fmt"
	"log"

	"github.com/theory/jsonpath"
)

func main() {
	// Parse a jsonpath query.
	p, err := jsonpath.Parse(`$["name", "slogan"]`)
	if err != nil {
		log.Fatal(err)
	}

	// Select values from unmarshaled JSON input.
	json := map[string]any{
		"name":   "Kamala Harris",
		"title":  "Vice President of the United States",
		"home":   "California",
		"slogan": "We are not going back!",
	}
	items := p.Select(json)

	// Show the result.
	fmt.Printf("%#v\n", items)
}

And the output:

[]interface {}{"Kamala Harris", "We are not going back!"}

🛝 Playground

No need to write code to try it out, though. I’ve also written a playground webapp to encourage experimentation and exploration of the syntax and behavior of the package. The implementation follows the precedents set by the Go SQL/JSON Playground and design of the Rust JSONPath Playground. Moreover, thanks to TinyGo, the Wasm file comes in at a mere 254K!

The webapp loads sample JSON from the RFC, and randomly rotates through a few example JSONPath queries. Fill in your own and tap the “Permalink” button to share links. The Playground is a stateless JavaScript/Wasm web application: data persists only in permalink URLs.¹

🛝 Try this example Playground permalink right now!

Status

The root jsonpath package is stable and ready for use. Other packages remain in flux, as I refactor and rejigger things in the coming weeks as part of the aforementioned personal project. But for actual JSONPath execution and querying, it should continue to work as-is for the foreseeable future.

I hope you find it useful.

And whatever data GitHub Pages collect 😔. ↩︎

Introducing Go SQL/JSON Path and Playground

8 Jul 2024

For a personal project, I needed to parse and execute PostgreSQL-compatible jsonpath expressions.¹ So I’ve spent just about every spare evening and weekend the last several months porting Postgres jsonpath to Go, and it’s finally ready to ship.

Introducing Go SQL/JSON, featuring the path package. This project provides full support for all of the PostgreSQL 17 jsonpath features² in the Go programming language. An example:

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/theory/sqljson/path"
	"github.com/theory/sqljson/path/exec"
)

func main() {
	// Parse some JSON.
	var value any
	err := json.Unmarshal([]byte(`{"a":[1,2,3,4,5]}`), &value)
	if err != nil {
		log.Fatal(err)
	}

	// Parse a path expression and execute it on the JSON.
	p := path.MustParse("$.a[*] ? (@ >= $min && @ <= $max)")
	res, err := p.Query(
		context.Background(),
		value,
		exec.WithVars(exec.Vars{"min": float64(2), "max": float64(4)}),
	)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%v\n", res)
	// Output: [2 3 4]
}

I think the API is decent, but may implement better patterns as I discover them. Overall I’m quite satisfied with how it turned out, and just how well its implementation and performance compare to the original.

🛝 Playground

But why stop there? One of the nice things about this project is that Go supports compiling applications into WebAssembly (a.k.a. Wasm) via Go WebAssembly. Borrowing from the Goldmark project, I created and published the sqljson/path playground and populated the docs with links for all of its examples.

Now anyone can experiment with SQL/JSON path expressions, and share links to demonstrate patterns and techniques. The Playground is a stateless JavaScript/Wasm web application: data persists only in permalink URLs.³

🛝 Try this example Playground permalink right now!⁴

The Path Ahead

I’ve enjoyed learning how to implement a lexer, a goyacc parser, an AST, and an execution engine. The Playground was a bonus bit of fun!

I’m stoked to build cool stuff on this package, but don’t know whether anyone else will find it useful. If you do, or just enjoy messing about on the Playground, let me know!

“Whatever for,” you ask? Well, aside from wanting to see if I could do it, this post describes a POC. Now I’m working to create the real thing — done right and entirely from scratch. ↩︎
Well, nearly full. The only missing feature is the datetime(template) method. See also the comprehensive compatibility notes. ↩︎
And whatever data GitHub Pages collect 😔. ↩︎
JSON borrowed from MDM. ↩︎

PGXN Language Poll Result

10 May 2024

Back on March 28, I asked the Postgres community whether new services for PGXN v2 should be written in Go, Rust, or “some of each”. I went so far as to create a poll, which ran through April 12. A month later you might reasonably be wondering what became of it. Has David been refusing to face reality and accept the results?

The answer is “no”. Or at least I don’t think so. Who among us really knows ourselves? The poll has displayed the results since it closed, but I suspect few have looked. So here they are:

Candidate	Votes	% All Votes
🦀 Rust	102	60.4%
🐿️ Go	53	31.4%
🐿️ + 🦀 Some of each	13	7.7%

🦀 Rust is the clear winner.

I don’t know whether some Rust brigade descended upon the poll, but the truth is that the outcome was blindingly apparent within a day of posting the poll. So much so that I decided to get ahead of things and try writing a pgrx extension. I released jsonschema on PGXN on April 30. Turned out to be kind of fun, and the pgrx developers kindly answered all my questions and even made a new release to simplify integration testing, now included in the pgrx-build-test utility in the pgxn-tools Docker image.

But I digress. As a result of this poll and chatting with various holders of stakes at work and haunting the #extensions Slack channel, I plan to use Rust for all new PGXN projects — unless there is an overwhelmingly compelling reason to use something else for a specific use case.

Want to help? Rustaceans welcome! Check out the project plan or join us in the #extensions channel on the Postgres Slack.

PGXN v2: Go or Rust?

28 Mar 2024

TL;DR: I’d like Postgres community input on a decision: Should we build PGXN v2 services and tools in Go or Rust? Context for the question and some weighing of options constitutes the rest of this post, but to skip to the end, 🗳️ Vote your choice! Poll closes April 12 at the end of the day (midnight) New York time.

The PGXN v2 project now under way requires developing or updating several services and tools, including:

A root registry for source distribution
A package registry for binary distribution
A command line client for developing, building, and distributing extension packages
An interactions service for notifications and stats aggregation

And more. Historically, the PGXN tools were written in Perl, which was an ideal choice for me back in 2011, and quite familiar to many members of the core team, but also deeply foreign to most everyone else. Furthermore, its dynamic nature and need for dozens of dependencies in most apps makes installation and packaging a challenge, to say the least.¹

In the last ten years I’ve become quite proficient in Go. I appreciate its near system-level performance, memory safety, robust standard library, concurrency design, and short learning curve — especially for web services. But perhaps most eye-opening to me, as a long-time user of dynamic languages, is that, like C, Go compiles an application into a single static binary. Not only that, but Go provides cross compilation natively. This makes distribution incredibly simple.

Distribution Digression

Compare, for example, the Dockerfile for Sqitch, the database change management system I wrote and maintain in Perl. It’s…a lot. Sure there are a ton of system dependencies, but what’s invisible in this file is the weeks of work that went into Module::Build::Sqitch, which performs a bunch of tricks to build the Sqitch “app” as a single directory with all of its Perl dependencies. Don’t get me wrong, the work was worth it for Sqitch, and powers the Homebrew formula, as well. But even there, I’ve not been able to get Sqitch into the Homebrew core because every single dependency requires a checksum, and I’ve not had the time (or energy) to figure out how to generate them.

Contrast with this Dockerfile for a Go service compiled into a binary named thinko:

FROM gcr.io/distroless/base-debian12:latest

# TARGETOS and TARGETARCH: https://docs.docker.com/build/guide/multi-platform/
ARG TARGETOS
ARG TARGETARCH

COPY "_build/${TARGETOS}-${TARGETARCH}/thinko" /thinko/bin/
USER nonroot:nonroot
ENTRYPOINT [ "/thinko/bin/thinko" ]

That’s the whole thing. There are no dependencies at all, aside from a few included in the distroless image. And where does that binary come from? This is the relevant portion of the project Makefile:

.PHONY: all # Build all binaries
all: local linux darwin windows freebsd

linux: thinko-linux
darwin: thinko-darwin
windows: thinko-windows
freebsd: thinko-freebsd

thinko-linux: _build/linux-amd64/thinko _build/linux-arm64/thinko
thinko-darwin: _build/darwin-amd64/thinko _build/darwin-arm64/thinko
thinko-windows: _build/windows-amd64/thinko _build/windows-arm64/thinko
thinko-freebsd: _build/freebsd-amd64/thinko _build/freebsd-arm64/thinko

# Build Thinko for specific platform
_build/%/thinko: cmd/thinko
	GOOS=$(word 1,$(subst -, ,$*)) GOARCH=$(word 2,$(subst -, ,$*)) $(GO) build -o $@ ./$<

This configuration allows me to build thinko for every OS and architecture at once:

$ make thinko
go build -o _build/local/thinko ./cmd/thinko
GOOS=linux GOARCH=amd64 go build -o _build/linux-amd64/thinko ./cmd/thinko
GOOS=linux GOARCH=arm64 go build -o _build/linux-arm64/thinko ./cmd/thinko
GOOS=darwin GOARCH=amd64 go build -o _build/darwin-amd64/thinko ./cmd/thinko
GOOS=darwin GOARCH=arm64 go build -o _build/darwin-arm64/thinko ./cmd/thinko
GOOS=windows GOARCH=amd64 go build -o _build/windows-amd64/thinko ./cmd/thinko
GOOS=windows GOARCH=arm64 go build -o _build/windows-arm64/thinko ./cmd/thinko
GOOS=freebsd GOARCH=amd64 go build -o _build/freebsd-amd64/thinko ./cmd/thinko
GOOS=freebsd GOARCH=arm64 go build -o _build/freebsd-arm64/thinko ./cmd/thinko

The second and third commands build thinko for Linux on amd64 and arm64, right where the Dockerfile expects them. Building the image then is easy; a separate make target runs the equivalent of:

$ docker buildx build --platform linux/arm64 -f dist/Dockerfile .
$ docker buildx build --platform linux/amd64 -f dist/Dockerfile .

The --platform flag sets the TARGETOS and TARGETARCH arguments in the Dockerfile, and because the directories into which the binaries were compiled use these same terms, the binary compiled for the right OS and architecture can be copied right in.
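The naming convention that ties the platform to the build directory is simple enough to sketch in Go. This is an illustration of the mapping only; binaryPath is a hypothetical helper, not part of the project, and it assumes two-part platforms such as linux/arm64 (variants like linux/arm/v7 are rejected).

```go
package main

import (
	"fmt"
	"strings"
)

// binaryPath maps a platform such as "linux/arm64" to the location of
// the compiled binary, following the _build/<os>-<arch>/<name> layout.
func binaryPath(platform, name string) (string, error) {
	goos, goarch, ok := strings.Cut(platform, "/")
	if !ok || goos == "" || goarch == "" || strings.Contains(goarch, "/") {
		return "", fmt.Errorf("unsupported platform %q", platform)
	}
	return "_build/" + goos + "-" + goarch + "/" + name, nil
}

func main() {
	for _, platform := range []string{"linux/amd64", "linux/arm64"} {
		path, err := binaryPath(platform, "thinko")
		if err != nil {
			panic(err)
		}
		fmt.Println(platform, "->", path)
	}
}
```

Keeping the OS and architecture names identical on both sides is what lets a single COPY line serve every platform.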

And that’s it, it’s ready to ship! No mucking with dependencies, tweaking system issues, removing unneeded stuff from the image. It’s just the bare minimum.

This pattern works not just for Docker images, of course. See, for example, how Hugo, the Go blog generator, releases tarballs for a bunch of OSes and architectures, each containing nothing more than a README.md, LICENSE.md, and the hugo binary itself. This pattern allows both the Hugo Homebrew formula and its Dockerfile to be incredibly simple.

Back to PGXN

I very much want these advantages for the next generation of PGXN tools. Not only the services, but also the command-line client, which would become very easy to distribute to a wide variety of platforms with minimal effort.

But there are other variables to weigh in the choice of language for the PGXN servers and tools, including:

Familiarity to other developers: Ideally someone can quickly contribute to a project because they’re familiar with the language, or there’s a short learning curve.
Safety from common issues and vulnerabilities such as buffer overflows and dangling pointers.
Tooling for robust and integrated development, including dependency management, testing, distribution, and of course cross-compilation.

Decisions, Decisions

In my experience, there are two languages that fulfill these requirements very well:

🐿️ Go²
🦀 Rust

Which should we use? Some relevant notes:

I expect to do the bulk of the initial development on PGXN v2, as the only person currently dedicated full time to the project, and I’m most familiar with Go; indeed, I enjoy writing web services and CLIs in Go! I’d therefore be able to ship Go tools more quickly.
But I’ve played around with Rust a number of times over the years, and very much would like to learn more. Its syntax and long feature list steepen the learning curve, but given my background in Perl — another language with unique syntax and context-sensitive features — I’m certain I could become incredibly proficient in Rust after a few months.
My employer, Tembo, is a Rust shop, and we’ll likely borrow heavily from the trunk project, especially for the CLI and binary registry. It would also be easier for my coworkers to contribute.
pgrx, the tooling to build Postgres extensions in Rust, has taken the community by storm, rapidly building familiarity with the language among extension developers. Perhaps some of those developers would also be willing to turn their expertise to PGXN Rust contributions. It’s likely some features could be borrowed, too.
Sadly, the plgo project appears to have stalled, so has not built up the same community momentum.

This leaves me torn! But it’s time to start coding, so it’s also time to make some decisions. Should PGXN v2 services and tools be:

🐿️ Written in Go
🦀 Written in Rust
🐿️ + 🦀 Some of each (e.g., Go for web services and Rust for CLIs)

What do you think? If you were to contribute to PGXN, what language would you like to work in? Do you think one language or the other would be more compatible with community direction or core development?³

Got an opinion? 🗳️ Vote! Poll closes April 12 at the end of the day (midnight) New York time.

And if those choices aren’t enough for you, please come yell at me on Mastodon, or via the #extensions channel on the Postgres Slack. Thanks!

Ever wonder why PGXN isn’t hosted by community servers? It’s because I screwed up the installation trying to balance all the dependencies without wiping out Perl modules the systems depend on. 🤦🏻‍♂️ ↩︎
Pity there’s no gopher emoji yet. ↩︎
I can imagine a future where an extension CLI was included in core. ↩︎

Extension Registry Namespacing RFC

19 Mar 2024

A few weeks ago I brainstormed about decentralized Postgres extension publishing, inspired in part by an examination of Go decentralized publishing. It was…a lot. I’ve been deeply pondering the future of PGXN and the broader extension ecosystem, and want to start to nail down some decisions. To that end, I’d like to propose an update to extension namespacing.

Status Quo

There are currently three ways in which an extension is considered unique:

Only one extension can have a given name within a single Postgres cluster. Names are defined by the name of the control file. It is therefore not possible to have two extensions with the same name in the same Postgres cluster.
PGXN follows this pattern: Only one extension can have a given name in the PGXN registry. The first person to release an extension then “owns” its name, and no one else can release an extension with the same name.¹ I think dbdev follows the same pattern.
Other registries like trunk and pgxman define an extension by the distribution name, at least for the purposes of selecting a binary to install. Thus when you trunk install postgis, you get all of the extensions included, as you’d expect, while trunk install address_standardizer wouldn’t work at all. In the few places that trunk supports installation by extension name, it prompts the user to use the appropriate package name if there’s a conflict.

A Modest Proposal

I’d like to propose the following changes to the PGXN Meta Spec to start to move away from extension uniqueness in the broader extension ecosystem and more toward package name.

Add a new field, call it module_path, project_path, or project_uri, that, if present, uniquely identifies an extension project and all of its parts. It should be a Go-style module path (or URI) that identifies the project repository path where a META.json file lives.
Retain the provides object where keys identify extensions, but those keys will no longer be globally unique to the registry. In other words, the combination of module_path and extension name uniquely identifies an extension, including an empty module_path.

How it Works

Some examples. Let’s say there is an existing extension named pair, included in the distribution named pg_pair:

{
  "name": "pg_pair",
  "version": "1.2.3",
  "provides": {
    "pair": {
      "file": "pair.sql",
      "version": "1.2.0"
    }
  }
}

The extension name pair is unique, and pgxn install pair will download the pg_pair v1.2.3 bundle and compile and install pair v1.2.0.

Now someone else comes along and wants to make their own pair with this metadata:

{
  "name": "my_pair",
  "version": "0.2.3",
  "provides": {
    "pair": {
      "file": "pair.sql",
      "version": "0.2.3"
    }
  }
}

Just like today, this upload would be rejected, because there is already a registered pair extension. Under my proposal, they can disambiguate by providing a module_path:

{
  "name": "my_pair",
  "module_path": "github/example/pair",
  "version": "0.2.3",
  "provides": {
    "pair": {
      "file": "pair.sql",
      "version": "0.2.3"
    }
  }
}

This upload would be allowed. With these two releases, someone attempting to install pair would see something like this:

$ pgxn install pair
ERROR: Duplicate extension name “pair”. Install one of these instead:
       * pgxn.org/dist/pair
       * github/example/pair

Note the module path pgxn.org/dist/pair in the first option. This is the default module path for distributions without a module path.² But now the user can select the proper one to install:

$ pgxn install pgxn.org/dist/pair
INFO: latest version: pgxn.org/dist/pair@1.2.3
INFO: building extension
INFO: installing extension
INFO: done!

Furthermore, the PGXN client will prevent the user from later installing a conflicting extension. The failure would look something like:

$ pgxn install github/example/pair
INFO: latest version: github/example/pair@0.2.3
ERROR: Cannot install extension “pair” from github/example/pair:
ERROR: A conflicting extension named “pair” is already installed
ERROR: from pgxn.org/dist/pair
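The registration, lookup, and install rules walked through above can be sketched as follows. This is only a model of the proposal, not PGXN code: the registry and cluster types and their methods are hypothetical, and the default prefix pgxn.org/dist/ followed by the extension name is an assumption taken from the example output.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultPrefix is the module path prefix assumed for releases that do
// not declare a module_path (an assumption based on the example above).
const defaultPrefix = "pgxn.org/dist/"

// registry indexes extensions by name; each name may be provided by
// several module paths. Hypothetical type, for illustration only.
type registry struct {
	providers map[string][]string
}

// register records that modulePath provides ext. An empty modulePath
// falls back to the default, so a second unqualified release of the
// same extension name is rejected.
func (r *registry) register(modulePath, ext string) error {
	if modulePath == "" {
		modulePath = defaultPrefix + ext
	}
	for _, p := range r.providers[ext] {
		if p == modulePath {
			return fmt.Errorf("extension %q already registered by %s", ext, p)
		}
	}
	if r.providers == nil {
		r.providers = map[string][]string{}
	}
	r.providers[ext] = append(r.providers[ext], modulePath)
	return nil
}

// resolve turns a bare extension name into the single module path that
// provides it, or fails when the name is unknown or ambiguous.
func (r *registry) resolve(ext string) (string, error) {
	switch paths := r.providers[ext]; len(paths) {
	case 0:
		return "", fmt.Errorf("unknown extension %q", ext)
	case 1:
		return paths[0], nil
	default:
		return "", fmt.Errorf("duplicate extension name %q; install one of: %s",
			ext, strings.Join(paths, ", "))
	}
}

// cluster tracks which module path each installed extension came from.
type cluster map[string]string

// install refuses to install ext when another module path already
// provides an installed extension of the same name.
func (c cluster) install(modulePath, ext string) error {
	if from, ok := c[ext]; ok && from != modulePath {
		return fmt.Errorf("a conflicting extension named %q is already installed from %s", ext, from)
	}
	c[ext] = modulePath
	return nil
}

func main() {
	r := &registry{}
	fmt.Println(r.register("", "pair"))                    // <nil>
	fmt.Println(r.register("github/example/pair", "pair")) // <nil>
	_, err := r.resolve("pair")
	fmt.Println(err)

	c := cluster{}
	fmt.Println(c.install("pgxn.org/dist/pair", "pair")) // <nil>
	fmt.Println(c.install("github/example/pair", "pair"))
}
```

The registry allows the same extension name under different module paths, while the cluster check preserves the Postgres rule that only one extension of a given name can be installed.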

Features with Benefits

I see a number of benefits to this change:

Compatibility with the v1 metadata spec, so that no data migration or distribution indexing is required.
It loosens up extension namespacing (or name registration, if you prefer) while adding additional metadata to help users evaluate the quality of an extension. For example, does it come from a well-known developer? You can see it right in the module path.
It creates a pattern to eventually allow auto-indexing of extensions. For example, if you run pgxn install github.com/example/pew, and PGXN doesn’t have it, it can look for a META.json file in that repository and, if it exists, and there’s a semver release tag, it could try to index it and let the user install it. There are ownership issues to be worked out, but it has possibilities.
It preserves the Postgres core concept of extension identity while putting in place a well-established pattern (established by Go modules and the widespread use of URIs in general) that the Postgres core could eventually adopt to allow more flexible extension namespacing.

Request for Comments

What do you think? Good idea? Terrible idea? Please hit me with your thoughts on Mastodon, or via the #extensions channel on the Postgres Slack. I’d like to get this decision (and a few others, stay tuned!) nailed down soon and start development, so don’t hesitate! I need your help to prevent me from making a huge mistake.

Unless the owner would like to share ownership with someone else, in which case they can email me to request that another user be granted “co-ownership”. They can also request to transfer ownership to another user, after which the original owner will no longer be able to release the extension. ↩︎
Or, if the META.json file has a repository resource with a URL, PGXN could index it as the implied module path. Or, failing that, maybe it should fall back on the distribution name instead of a pgxn.org path, and prompt with pg_pair/pair. ↩︎

Contemplating Decentralized Extension Publishing

1 Feb 2024

TL;DR

As I think through the future of the Postgres extension ecosystem as a key part of the new job, I wanted to understand how Go decentralized publishing works. In this post I work it out, and think through how we might do something similar for Postgres extension publishing. It covers the Go architecture, namespacing challenges, and PGXS abuse; then experiments with URL-based namespacing and ponders reorganizing installed extension files; and closes with a high-level design for making it work now and in the future.

It is, admittedly, a lot, mainly written for my own edification and for the information of my fellow extension-releasing travelers.

I find it fascinating and learned a ton. Maybe you will too! But feel free to skip this post if you’re less interested in the details of the journey and want to wait for more decisive posts once I’ve reached the destination.

Introduction

Most language registries require developers to take some step to make releases. Many automate the process in CI/CD pipelines, but it requires some amount of effort on the developer’s part:

Register for an account
Learn how to format things to publish a release
Remember to publish again for every new version
Create a pipeline to automate publishing (e.g., a GitHub workflow)

Decentralized Publishing

Go decentralized publishing has revised this pattern: it does not require user registration or authentication to publish a module to pkg.go.dev. Rather, Go developers simply tag the source repository, and the first time someone refers to the tag in Go tools, the Go module index will include it.

For example, publishing v1.2.1 of a module in the github.com/golang/example repository takes just three commands:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
go list -m github.com/golang/example@v1.2.1

After a few minutes, the module will show up in the index and then on pkg.go.dev. Anyone can run go get -u github.com/golang/example to get the latest version. Go developers rest easy in the knowledge that they’re getting the exact module they need thanks to the global checksum database, which Go uses “in many situations to detect misbehavior by proxies or origin servers”.

This design requires go get to understand multiple source code management systems: it supports Git, Subversion, Mercurial, Bazaar, and Fossil.¹ It also needs the go.mod metadata file to live in the project defining the package.

But that’s really it. From the developer’s perspective it could not be easier to publish a module, because it’s a natural extension of the module development tooling and workflow of committing, tagging, and fetching code.

Decentralized Extension Publishing

Could we publish Postgres extensions in such a decentralized pattern? It might look something like this:

The developer places a metadata file in the proper location (control file, META.json, Cargo.toml, whatever — standard TBD)
To publish a release, the developer tags the repository and calls some sort of indexing service hook (perhaps from a tag-triggered release workflow)
The indexing service validates the extension and adds it to the index

Note that there is no registration required. It simply trusts the source code repository. It also avoids name collision: github.com/bob/hash is distinct from github.com/carol/hash.

This design does raise challenges for clients, whether they’re compiling extensions on a production system or building binary packages for distribution: they have to support various version control systems to pull the code (though starting with Git is a decent 90% solution).

Namespacing

Then there are name conflicts. Perhaps github.com/bob/hash and github.com/carol/hash both create an extension named hash. By the current control file format, the script directory and module path can use any name, but in all likelihood they use these defaults:

directory = 'extension'
module_pathname = '$libdir/hash'

Meaning .sql files will be installed in the Postgres share/extension subdirectory — along with all the other installed extensions — and library files will be installed in the library directory along with all other libraries. Something like this:

pgsql
├── lib
│   └── hash.so
└── share
    ├── extension
    │   ├── hash.control
    │   └── hash--1.0.0.sql
    └── doc
        └── hash.md

If both projects include, say, hash.control, hash--1.0.0.sql, and hash.so, the files from one will stomp all over the files of the other.
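To make the collision concrete, here is a minimal Go sketch that computes where each project's files would land and reports the overlaps. The Package type and the installPaths and conflicts functions are names I made up for illustration; no existing tool works this way as far as I know.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// Package describes the files a hypothetical extension project installs.
// The type and its fields are illustrative, not part of any real tool.
type Package struct {
	Repo  string   // e.g. "github.com/bob/hash"
	Lib   []string // files destined for pgsql/lib
	Share []string // files destined for pgsql/share/extension
}

// installPaths returns the destination paths for a package's files,
// relative to the Postgres installation root.
func installPaths(p Package) []string {
	var out []string
	for _, f := range p.Lib {
		out = append(out, path.Join("lib", f))
	}
	for _, f := range p.Share {
		out = append(out, path.Join("share/extension", f))
	}
	return out
}

// conflicts lists, in sorted order, every destination path claimed by
// both packages.
func conflicts(a, b Package) []string {
	seen := map[string]bool{}
	for _, f := range installPaths(a) {
		seen[f] = true
	}
	var clash []string
	for _, f := range installPaths(b) {
		if seen[f] {
			clash = append(clash, f)
		}
	}
	sort.Strings(clash)
	return clash
}

func main() {
	share := []string{"hash.control", "hash--1.0.0.sql"}
	bob := Package{Repo: "github.com/bob/hash", Lib: []string{"hash.so"}, Share: share}
	carol := Package{Repo: "github.com/carol/hash", Lib: []string{"hash.so"}, Share: share}
	for _, f := range conflicts(bob, carol) {
		fmt.Println("conflict:", f)
	}
}
```

Because the destination depends only on the file name, and not on the repository it came from, every file in this example collides.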

Installer Abuse

Go avoids this issue by using the domain and path from each package’s repository in its directory structure. For example, here’s a list of modules from google.golang.org repositories:

$ ls -1 ~/go/pkg/mod/google.golang.org
api@v0.134.0
api@v0.152.0
appengine@v1.6.7
genproto
genproto@v0.0.0-20230731193218-e0aa005b6bdf
grpc@v1.57.0
grpc@v1.59.0
protobuf@v1.30.0
protobuf@v1.31.0
protobuf@v1.32.0

The ~/go/pkg/mod directory has subdirectories for each VCS host name, and each of those has subdirectories for package paths. For the github.com/bob/hash example, the files would all live in ~/go/pkg/mod/github.com/bob/hash.

Could a Postgres extension build tool follow a similar distributed pattern by renaming the control file and installation files and directories to something specific for each, say github.com+bob+hash and github.com+carol+hash? That is, using the repository host name and path, but replacing the slashes in the path with some other character that wouldn’t create subdirectories — because PostgreSQL won’t find control files in subdirectories. The control file entries for github.com/carol/hash would look like this:

directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'
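A build tool could derive those entries mechanically. Here is a rough Go sketch of the renaming, assuming "+" as the replacement character; the flatName and controlEntries functions are hypothetical helpers, not part of any existing tool.

```go
package main

import (
	"fmt"
	"strings"
)

// flatName converts a repository path such as "github.com/carol/hash"
// into a single name without slashes by replacing each "/" with "+".
func flatName(repo string) string {
	return strings.ReplaceAll(repo, "/", "+")
}

// controlEntries renders the directory and module_pathname control file
// lines for a repository path.
func controlEntries(repo string) string {
	name := flatName(repo)
	return fmt.Sprintf("directory = '%s'\nmodule_pathname = '$libdir/%s'\n", name, name)
}

func main() {
	fmt.Print(controlEntries("github.com/carol/hash"))
}
```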

Since PostgreSQL expects the control file to have the same name as the extension, and for SQL scripts to start with that name, the files would have to be named like so:

hash
├── Makefile
├── github.com+carol+hash.control
└── sql
    └── github.com+carol+hash--1.0.0.sql

And the Makefile contents:

EXTENSION  = github.com+carol+hash
MODULEDIR  = $(EXTENSION)
DATA       = sql/$(EXTENSION)--1.0.0.sql
PG_CONFIG ?= pg_config

PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

In other words, the extension name is the full repository host name and path, and the Makefile MODULEDIR variable tells PGXS to put all the SQL and documentation files into a directory named github.com+carol+hash, preventing them from conflicting with any other extension.

Finally, the github.com+carol+hash.control file (so named because it must have the same name as the extension) contains:

default_version = '1.0.0'
relocatable = true
directory = 'github.com+carol+hash'
module_pathname = '$libdir/github.com+carol+hash'

Note the directory parameter, which must match MODULEDIR from the Makefile, so that CREATE EXTENSION can find the SQL files. Meanwhile, module_pathname ensures that the library file has a unique name — the same as the long extension name — again to avoid conflicts with other projects.

That unsightly naming extends to SQL, where using the URL format could get to be a mouthful:

CREATE EXTENSION "github.com+carol+hash";

Which is do-able, but some new SQL syntax might be useful, perhaps something like:

CREATE EXTENSION hash FROM "github.com+carol+hash";

Or, if we’re gonna really go for it, use slashes after all!

CREATE EXTENSION hash FROM "github.com/carol/hash";

Want to use both extensions but they have conflicting objects (e.g., both create a “hash” data type)? Put them into separate schemas (assuming relocatable = true in the control file):

CREATE EXTENSION hash FROM "github.com/carol/hash" WITH SCHEMA carol;
CREATE EXTENSION hash FROM "github.com/bob/hash" WITH SCHEMA bob;
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash
);

Of course it would be nice if PostgreSQL added support for something like Oracle packages, but using schemas in the meantime may be sufficient.

Clearly we’re getting into changes to the PostgreSQL core, so put that aside and we can just use long names for creating, modifying, and dropping extensions, but not necessarily otherwise:

CREATE EXTENSION "github.com+carol+hash" WITH SCHEMA carol;
CREATE EXTENSION "github.com+bob+hash" WITH SCHEMA bob;
CREATE EXTENSION "gitlab.com+barack+kicker_type";
CREATE TABLE try (
    h1 carol.hash,
    h2 bob.hash,
    kt kicker
);

Namespacing Experiment

To confirm that this approach might work, I committed 24134fd and pushed it in the namespace-experiment branch of the semver extension. This commit changes the extension name from semver to github.com+theory+pg-semver, and follows the above steps to ensure that its files are installed with that name.

Abusing the Postgres extension installation infrastructure like this does work, but suffers from a number of drawbacks, including:

The extension name is super long, as before, but now so too are the names of the files in the repository (as opposed to the installer renaming them on install). The shared library file has to have the long name, so therefore does the .c source file. The SQL files must all start with github.com+theory+pg-semver, although I skipped that bit in this commit; instead the Makefile generates just one from sql/semver.sql.
Any previous installation of the semver type would remain unchanged, with no upgrade path. Changing an extension’s name isn’t a great idea.

I could probably script renaming and modifying file contents like this and make it part of the build process, but it starts to get complicated. We could also modify installers to make the changes, but there are a bunch of moving parts they would have to compensate for, and given how dynamic this can be (e.g., the semver Makefile reads the extension name from META.json), we would rapidly enter the territory of edge case whac-a-mole. I suspect it’s simply too error-prone.

Proposal: Update Postgres Extension Packaging

Perhaps the Go directory pattern could inspire a similar model in Postgres, eliminating the namespace issue by teaching the Postgres extension infrastructure to include all but one of the files for an extension in a single directory. In other words, rather than files distributed like so for semver:

pgsql
├── lib
│   └── semver.so
└── share
    ├── extension
    │   ├── semver.control
    │   ├── semver--0.32.1.sql
    │   └── semver--0.32.0--0.32.1.sql
    └── doc
        └── semver.md

Make it more like this:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── lib
                    │   └── semver.so
                    ├── sql
                    │   ├── semver--0.32.1.sql
                    │   └── semver--0.32.0--0.32.1.sql
                    └── doc
                        └── semver.md

Or perhaps:

pgsql
└── share
    └── extension
        └── github.com
            └── theory
                └── pg-semver
                    ├── extension.control
                    ├── semver.so
                    ├── semver--0.32.1.sql
                    ├── semver--0.32.0--0.32.1.sql
                    └── semver.md

The idea is to copy the files exactly as they’re stored in or compiled in the repository. Meanwhile, the new semver.name file — the only relevant file stored outside the extension module directory — simply points to that path:

github.com/theory/pg-semver

Then for CREATE EXTENSION semver, Postgres reads semver.name and knows where to find all the files to load the extension.
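The lookup itself would be simple. Here is a minimal Go sketch of the idea, in which a map stands in for the .name files on disk; the nameFiles type, the resolve function, and the /usr/share/pgsql path are all assumptions for illustration, since nothing like this exists in Postgres today.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// nameFiles stands in for the hypothetical *.name files in share/extension,
// keyed by file name, with the package path as the file contents.
type nameFiles map[string]string

// resolve returns the directory that would hold all of an extension's
// files, found by reading the extension's .name file.
func resolve(files nameFiles, shareDir, ext string) (string, error) {
	contents, ok := files[ext+".name"]
	if !ok {
		return "", fmt.Errorf("extension %q is not installed", ext)
	}
	pkg := strings.TrimSpace(contents)
	if pkg == "" {
		return "", fmt.Errorf("%s.name is empty", ext)
	}
	return path.Join(shareDir, "extension", pkg), nil
}

func main() {
	files := nameFiles{"semver.name": "github.com/theory/pg-semver\n"}
	dir, err := resolve(files, "/usr/share/pgsql", "semver")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(dir)
}
```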

This configuration would require updates to the control file, now named extension.control, to record the full package name and appropriate locations. Add:

name = 'semver'
package = 'github.com/theory/pg-semver'

This pattern could also allow aliasing. Say we try to install a different semver extension from github.com/example/semver. This is in its extension.control file:

name = 'semver'
package = 'github.com/example/semver'

The installer detects that semver.name already exists for a different package and raises an error. The user could then give it a different name by running something like:

make install ALIAS_EXTENSION_NAME=semver2

This would add semver2.name right next to semver.name, containing github.com/example/semver, the path where all of its files are installed. This would allow CREATE EXTENSION semver2 to load it without issue (assuming no object conflicts, hopefully resolved by relocate-ability).
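Here is a small Go sketch of how an installer might handle that check, with a map playing the role of the .name files. The registry type and its install method are hypothetical, and the alias argument mirrors the made-up ALIAS_EXTENSION_NAME variable.

```go
package main

import "fmt"

// registry maps installed extension names to package paths, standing in
// for the hypothetical *.name files.
type registry map[string]string

// install records pkg under name, or under alias when one is given.
// It refuses to overwrite a name already claimed by a different package.
func (r registry) install(name, pkg, alias string) (string, error) {
	if alias != "" {
		name = alias
	}
	if existing, ok := r[name]; ok && existing != pkg {
		return "", fmt.Errorf("extension %q already installed from %s", name, existing)
	}
	r[name] = pkg
	return name, nil
}

func main() {
	r := registry{}
	r.install("semver", "github.com/theory/pg-semver", "")

	// A second package with the same extension name is rejected.
	if _, err := r.install("semver", "github.com/example/semver", ""); err != nil {
		fmt.Println("error:", err)
	}

	// With an alias, both can coexist.
	name, _ := r.install("semver", "github.com/example/semver", "semver2")
	fmt.Println("installed as", name)
}
```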

I realize a lot of extensions with libraries could wreak some havoc on the library resolver having to search so many library directories, but perhaps there’s some way around that as well? Curious what techniques experienced C developers might have adopted.

Back to Decentralized Publishing

An updated installed extension file structure would be nice, and is surely worth a discussion, but even if it shipped in Postgres 20, we need an updated extension ecosystem today, to work well with all supported versions of Postgres. So let’s return to the idea of decentralized publishing without such changes.

I can think of two pieces that’d be required to get Go-style decentralized extension publishing to work with the current infrastructure.

Module Uniqueness

The first is to specify a new metadata field to be unique for the entire index, and which would contain the repository path. Call it module, after Go (a single Git repository can have multiple modules). In PGXN Meta Spec-style JSON it’d look something like this:

{
    "module": "github.com/theory/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type"
      }
    }
}

Switch from the PGXN-style uniqueness on the distribution name (usually the name of the extension) and let the module be globally unique. This would allow another party to release an extension with the same name. Even a fork where only the module is changed:

{
    "module": "github.com/example/pg-semver",
    "version": "0.32.1",
    "provides": {
      "semver": {
         "abstract": "A semantic version data type"
      }
    }
}

Both would be indexed and appear under the module name, and both would be find-able by the provided extension name, semver.

Where that name must still be unique is in a given install. In other words, while github.com/theory/pg-semver and github.com/example/pg-semver both exist in the index, the semver extension can be installed from only one of them in a given Postgres system, where the extension name semver defines its uniqueness.
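A toy Go sketch of such an index might look like the following. The Release and Index types are my own invention for illustration, and the sketch deliberately simplifies: it keeps only the most recently added release per module and assumes a module's provided extensions never change between releases.

```go
package main

import (
	"fmt"
	"sort"
)

// Release holds the metadata fields used in the examples above.
// The type is illustrative only.
type Release struct {
	Module   string
	Version  string
	Provides []string
}

// Index keeps releases unique by module path and searchable by the
// extension names they provide.
type Index struct {
	byModule    map[string]Release
	byExtension map[string][]string // extension name to module paths
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{
		byModule:    map[string]Release{},
		byExtension: map[string][]string{},
	}
}

// Add indexes a release under its module path. Extension names are
// recorded only the first time a module is seen.
func (ix *Index) Add(r Release) {
	if _, seen := ix.byModule[r.Module]; !seen {
		for _, ext := range r.Provides {
			ix.byExtension[ext] = append(ix.byExtension[ext], r.Module)
		}
	}
	ix.byModule[r.Module] = r
}

// Find returns the sorted module paths that provide the named extension.
func (ix *Index) Find(ext string) []string {
	mods := append([]string(nil), ix.byExtension[ext]...)
	sort.Strings(mods)
	return mods
}

func main() {
	ix := NewIndex()
	ix.Add(Release{Module: "github.com/theory/pg-semver", Version: "0.32.1", Provides: []string{"semver"}})
	ix.Add(Release{Module: "github.com/example/pg-semver", Version: "0.32.1", Provides: []string{"semver"}})
	for _, mod := range ix.Find("semver") {
		fmt.Println(mod)
	}
}
```

Both modules come back for semver, so a client would need to ask the user which one to install.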

This pattern would allow for much more duplication of ideas while preserving the existing per-cluster namespacing. It also allows for a future Postgres release that supports something like the flexible per-cluster packaging as described above.²

Extension Toolchain App

The second piece is an extension management application that understands all this stuff and makes it possible. It would empower both extension development workflows — including testing, metadata management, and releasing — and extension user workflows — finding, downloading, building, and installing.

Stealing from Go, imagine a developer making a release with something like this:

git tag v1.2.1 -sm 'Tag v1.2.1'
git push --tags
pgmod list -m github.com/theory/pg-semver@v1.2.1

The creatively named pgmod tells the registry to index the new version directly from its Git repository. Thereafter anyone can find it and install it with:

pgmod get github.com/theory/pg-semver@v1.2.1 — installs the specified version
pgmod get github.com/theory/pg-semver — installs the latest version
pgmod get semver — installs the latest version or shows a list of matching modules to select from
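To distinguish those three forms, the imagined pgmod would need to parse its argument. Here is one way that might look in Go, assuming that anything containing a slash is a module path and anything else is a bare extension name; the Request type and parseRequest function are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is the parsed form of an argument to the hypothetical
// `pgmod get` command.
type Request struct {
	Module  string // full module path, if one was given
	Name    string // bare extension name, if no module path was given
	Version string // empty means "latest"
}

// parseRequest splits an argument of the form MODULE@VERSION, MODULE,
// or NAME into its parts.
func parseRequest(arg string) (Request, error) {
	target, version, found := strings.Cut(arg, "@")
	if target == "" {
		return Request{}, fmt.Errorf("missing module or extension name in %q", arg)
	}
	if found && version == "" {
		return Request{}, fmt.Errorf("missing version after @ in %q", arg)
	}
	if strings.Contains(target, "/") {
		return Request{Module: target, Version: version}, nil
	}
	return Request{Name: target, Version: version}, nil
}

func main() {
	for _, arg := range []string{
		"github.com/theory/pg-semver@v1.2.1",
		"github.com/theory/pg-semver",
		"semver",
	} {
		req, err := parseRequest(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", req)
	}
}
```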

Any of these would fail if the cluster already has an extension named semver with a different module name. But with something like the updated extension installation locations in a future version of Postgres, that limitation could be loosened.

Challenges

Every new idea comes with challenges, and this little thought experiment is no exception. Some that immediately occur to me:

Not every extension can be installed directly from its repository. Perhaps the metadata could include a download link for a tarball with the results of any pre-release execution?
Adoption of a new CLI could be tricky. It would be useful to include the functionality in existing tools people already use, like pgrx.
Updating the uniqueness constraint in existing systems like PGXN might be a challenge. Most record the repository info in the resources META.json object, so it would be do-able to adapt into a new META format, either on PGXN itself or in a new registry, should we choose to build one.
Getting everyone to standardize on versioning tags might take some effort. Go had the benefit of controlling its entire toolchain, while Postgres extension versioning and release management has been all over the place. However, PGXN long ago standardized on semantic versioning and those who have released extensions on PGXN have had few issues (one can still use other version formats in the control file, for better or worse).
Some PGXN distributions have shipped different versions of extensions in a single release, or the same version as in other releases. The release version of the overall package (repository, really) would have to become canonical.

I’m sure there are more, I just thought of these offhand. What have you thought of? Post ’em if you got ’em in the #extensions channel on the Postgres Slack, or give me a holler on Mastodon or via email.

¹ Or does it? Yes, it does. Although the Go CLI downloads most public modules from a module proxy server like proxy.golang.org, it still must know how to download modules from a version control system when a proxy is not available.
² Assuming, of course, that if and when the Postgres core adopts more bundled packaging that they’d use the same naming convention as we have in the broader ecosystem. Not a perfectly safe assumption, but given the Go precedent and wide adoption of host/path-based projects, it seems sound.