Introducing DuckDB

by Ian Eyre Mar 26, 2025 intermediate databases data-science python

The DuckDB database provides a seamless way to handle large datasets in Python with Online Analytical Processing (OLAP) optimization. You can create databases, verify data imports, and perform efficient data queries using both SQL and DuckDB’s Python API.

By the end of this tutorial, you’ll understand that:

  • You can create a DuckDB database by reading data from files like Parquet, CSV, or JSON and saving it to a table.
  • You query a DuckDB database using standard SQL syntax within Python by executing queries through a DuckDB connection object.
  • You can also use DuckDB’s Python API, which uses method chaining for an object-oriented approach to database queries.
  • Concurrent access in DuckDB allows multiple reads but restricts concurrent writes to ensure data integrity.
  • DuckDB integrates with pandas and Polars by converting query results into DataFrames using the .df() or .pl() methods.

The tutorial will equip you with the practical knowledge necessary to get started with DuckDB, including its Online Analytical Processing (OLAP) features, which enable fast access to data through query optimization and buffering.

Ideally, you should already have a basic understanding of SQL, particularly how its SELECT keyword can be used to read data from a relational database. However, the SQL language is very user-friendly, and the examples used here are self-explanatory.

Now, it’s time for you to start learning why there’s a growing buzz surrounding DuckDB.


Getting Started With DuckDB

To use DuckDB, you first need to install it. Fortunately, DuckDB is self-contained, meaning it won’t interfere with your existing Python environment.

You use python -m pip install duckdb to install it from the command prompt. If you’re working in a Jupyter Notebook, the command becomes !python -m pip install duckdb. The supporting downloadable code for this tutorial is also presented in a Jupyter Notebook.

Once the installation is complete, you can quickly test your installation with a query:

Python
>>> import duckdb

>>> duckdb.sql("SELECT 'whistling_duck' AS waterfowl, 'whistle' AS call")
┌────────────────┬─────────┐
│   waterfowl    │  call   │
│    varchar     │ varchar │
├────────────────┼─────────┤
│ whistling_duck │ whistle │
└────────────────┴─────────┘

To test that everything works, you first import the duckdb library before running a test SQL query. In SQL, a query is a command you use to interact with the data in your database. You commonly use queries to view, add, update, and delete your data.

In this example, you write a SQL SELECT statement to view some data defined by the query. By passing it to the sql() function, you run the query and produce the result shown.

Your query creates a table with two columns named waterfowl and call. These contain the data "whistling_duck" and "whistle", respectively. The data types of both columns are varchar, which is the data type DuckDB uses to store variable-length character strings. Running your query using duckdb.sql() uses the default in-memory database. This means that the data are temporary and will disappear when you end your Python session.

If you see the output shown above, your installation is working perfectly.

Now that you know how to set things up, it’s time to dive into some of the features that make DuckDB easy to use. In the next section, you’ll create a database table using data imported from an existing file. You’ll also learn how to check that the data has been imported correctly.

Creating a Database From a Data Source

While it’s possible to create database tables using SQL, it’s more common to read data from an external file, perhaps one containing data you’ve extracted from another system, and allow DuckDB to create and populate the table.

DuckDB supports reading from and writing to a range of common file types such as Parquet, CSV, and JSON. In this example, you’ll use data stored in the presidents.parquet Parquet file included in your downloadable materials to create a table.

The presidents.parquet file contains the following six fields:

Heading      Meaning                               Data Type
sequence     Order of presidency                   int64
last_name    President’s last name                 varchar
first_name   President’s first name                varchar
term_start   Start of presidency term              date
term_end     End of presidency term                date
party_id     Number representing political party   int64

When you import data, it gets placed into a DuckDBPyRelation object. In DuckDB, a relation stores a query definition but not its data. To see the data a relation represents, you either view it interactively or run an SQL query against it to retrieve specific data.

This create-then-run workflow means that you can develop relations containing multiple parts iteratively without having to wait for the results of earlier database reads to be completed first. This is called lazy evaluation. Although this sounds like a disparaging term, it ensures the query you run will be highly optimized, allowing it to process your data faster.

Also, when you finally run a query against a relation, only the first 10,000 rows are read. This reduces the stress on your computer’s resources. A full data read is only done when necessary. Usually, this occurs when you want to store everything in a table within your database, or write it out to an external file or DataFrame for analysis.

In the code below, you create a relation from presidents.parquet and save it permanently to a presidents table within a new database:

Python
 1>>> import duckdb
 2
 3>>> conn = duckdb.connect(database="presidents.db")
 4>>> presidents_relation = conn.read_parquet("presidents.parquet")
 5>>> conn.sql(
 6...     """
 7...     SELECT sequence, last_name, first_name
 8...     FROM presidents_relation
 9...     WHERE sequence <= 2
10...     """
11... )
12┌──────────┬────────────┬────────────┐
13│ sequence │ last_name  │ first_name │
14│  int64   │  varchar   │  varchar   │
15├──────────┼────────────┼────────────┤
16│        1 │ Washington │ George     │
17│        2 │ Adams      │ John       │
18└──────────┴────────────┴────────────┘
19
20>>> presidents_relation.to_table("presidents")
21>>> conn.close()

Line 3 is where you create a permanent database stored in a file named presidents.db. To do this, you first create a database connection, or DuckDBPyConnection object, using DuckDB’s connect() function. You pass in the name of the file you wish to store everything in as its database parameter. The connection then serves as a reference to your new database and, as you’ll see later, allows you to run queries against it.

Line 4 is where you create a relation. In this case, your presidents_relation relation object defines the data you want to read from the presidents.parquet file you downloaded earlier. Remember, at this point, the relation contains only a query definition, not any data. That’s why this line of code runs without delay.

Line 5 is where you materialize the relation—in other words, you make its first 10,000 rows available. You do this by treating presidents_relation like a database table and using SQL to query it.

In this example, you’ve used SELECT to view only the sequence, last_name, and first_name columns from presidents_relation by using the relation’s name in the FROM clause. The WHERE clause ensures only the first two records are returned. To obtain the data, you pass the query to the connection’s .sql() method.

You could have also materialized the relation by accessing presidents_relation directly, but this would have returned all columns. Using a query gives you more control over what is read from the file and what is displayed on the screen.

Lines 12 through 18 show the requested data.

Line 20 creates a new presidents table in the database and stores the data from the materialized relation in it. To do this, you pass the name of the table you wish to create to the relation object’s .to_table() method. Unlike the relation used to create it, this table will now be a permanent part of your database.

Line 21 closes the database connection. This frees up any resources used by your computer to maintain the connection and allows them to be used for other tasks. It’s important to close any open database once you’ve finished with it—otherwise, you may restrict future access to it. You’ll soon see how you can use a context manager to effectively handle database connections.

At this point, your database contains a single presidents table containing the entire content of your presidents.parquet file.

Now that your database has been safely stored on your hard drive, you can reuse it as you please. You can quickly verify this:

Python
>>> with duckdb.connect(database="presidents.db") as conn:
...     conn.sql(
...         """
...         SELECT last_name, first_name
...         FROM presidents
...         WHERE last_name = 'Adams'
...         """
...     )
...
┌───────────┬─────────────┐
│ last_name │ first_name  │
│  varchar  │   varchar   │
├───────────┼─────────────┤
│ Adams     │ John        │
│ Adams     │ John Quincy │
└───────────┴─────────────┘

The results show that the data has been stored permanently in the table. To verify this, you create a fresh connection to the existing database and use it to run a similar query to your previous one, only this time you query the table directly.

To keep the code tidy, you use with to create a context manager. This opens and manages the connection for you and ensures the connection only remains open while the indented code beneath it runs. Using a context manager in this way means you don’t need to remember to close the connection manually after the query has finished.
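As an illustrative sketch, assuming the duckdb package is installed, you can confirm that with really does close the connection for you. This example uses an in-memory database so that it doesn’t touch your presidents.db file:

```python
import duckdb

# The context manager opens the connection and closes it on exit.
with duckdb.connect() as conn:
    row = conn.sql("SELECT 42 AS answer").fetchone()

print(row)  # (42,)

# Any attempt to use the connection afterwards raises an error.
try:
    conn.sql("SELECT 1")
except duckdb.Error:
    print("Connection is closed.")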

Although the table has been stored in the database, the relation has not:

Python
>>> with duckdb.connect(database="presidents.db") as conn:
...     conn.sql("SELECT * FROM presidents_relation")
...
Traceback (most recent call last):
# ...
duckdb.duckdb.InvalidInputException: Invalid Input Error:
⮑ Python Object "presidents_relation" of type "DuckDBPyRelation"
⮑ not suitable for replacement scan.
The object was created by another Connection and can therefore
⮑ not be used by this Connection.

Relations, unlike tables, are only available to the connection they were created in while it’s still open. Once a connection has been closed, its relations no longer exist. They’re not designed to be permanent. The error message above highlights this behavior, making it clear that relations can’t be accessed from a new connection.

Correcting Your Data Import Errors

In the previous example, when you read the presidents.parquet file, the connection’s .read_parquet() method correctly identified the data types for each column of presidential data. This is because the Parquet format stores metadata about each column, including its data type. Unfortunately, not all files store this metadata, meaning their data should be checked and corrected if necessary after importing and before analysis.

Take a look at the output from the following code. This time, you’re reading a file named presidents.csv:

Python
>>> import duckdb

>>> with duckdb.connect(database="presidents.db") as conn:
...     presidents_relation = conn.read_csv("presidents.csv")
...     presidents_relation.limit(2)
...
┌──────────┬────────────┬────────────┬───────────────┬──────────────┬──────────┐
│ sequence │ last_name  │ first_name │  term_start   │  term_end    │ party_id │
│  int64   │  varchar   │  varchar   │    varchar    │   varchar    │  int64   │
├──────────┼────────────┼────────────┼───────────────┼──────────────┼──────────┤
│        1 │ Washington │ George     │ April 30 1789 │ March 4 1797 │       10 │
│        2 │ Adams      │ John       │ March 4 1797  │ March 4 1801 │       20 │
└──────────┴────────────┴────────────┴───────────────┴──────────────┴──────────┘

Each relation has a dtypes property that allows you to inspect the data types returned by the relation’s query. When you look at the output above, you’ll see that the term_start and term_end columns have been read as varchar types—in other words, as strings. This may cause issues if you need to perform date calculations. Fortunately, there’s no harm done yet, because you haven’t written your data anywhere permanently.

To fix this, you can use the date_format parameter to specify the correct date formats:

Python
>>> with duckdb.connect(database="presidents.db") as conn:
...     presidents_relation = conn.read_csv(
...         "presidents.csv", date_format="%B %d %Y"
...     )
...     presidents_relation.dtypes
...
[BIGINT, VARCHAR, VARCHAR, DATE, DATE, BIGINT]

By using the format specifier "%B %d %Y", you’ve told read_csv() to interpret any varchar columns that match this pattern as containing a full month name, followed by the day of the month and a four-digit year.

This time, you’ve used the relation’s dtypes property to display the data types of the six columns. Looking at the output, you’ll see that the dates have now been correctly identified as DATE types. Both last_name and first_name remain as varchar types because they don’t match the "%B %d %Y" pattern.

You may also have noticed that the data types of the sequence and party_id columns appear to have changed to BIGINT. Although confusing, nothing has actually changed. It’s just that the dtypes property uses BIGINT, while relation displays use int64 to mean the same thing.
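DuckDB’s date_format specifiers broadly follow the same strptime() conventions as Python’s standard library, so you can sanity-check a pattern with datetime before handing it to read_csv():

```python
from datetime import datetime

# "%B %d %Y": full month name, day of month, four-digit year.
parsed = datetime.strptime("April 30 1789", "%B %d %Y").date()
print(parsed)  # 1789-04-30
```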

Next, you’ll take a look at two important ways of querying your database.

Querying the Database

If you already have experience of extracting information from a relational database, chances are you’ve used a version of the Structured Query Language (SQL) to do so. The good news is that you can directly apply your existing knowledge of SQL to DuckDB. You’ll also learn how to write queries using the DuckDB relational API. In fact, you’re about to learn how to write the same query using both techniques.

Querying Tables

Using SQL to write queries is a very well-established way of working with database data. SQL was developed during the 1970s at IBM and is used by all the major relational database management systems worldwide. What’s more, other than specific vendor enhancements, including some by DuckDB, the core statements haven’t really changed that much over the years.

Suppose you want to see a list of the names of US presidents who belonged to the Whig party, along with their party affiliation. You also want the results sorted in descending order by the presidents’ last names. The first and last names are stored in the presidents table that already exists in your presidents.db file. However, the political parties are stored inside a parties.json file, so they’ll have to be read in.

You could complete this task using the default in-memory database you saw earlier, but instead, you’ll reuse your existing presidents.db database since some of the data is already present. The first step is to read in the parties.json file. This contains a party_name field listing the name of each political party, as well as a party_id field, which contains an integer code for each party.

To read in the details of each party, you use this code:

Python
>>> with duckdb.connect(database="presidents.db") as conn:
...     conn.read_json("parties.json").to_table("parties")
...

As before, you set up a connection to the existing database using a context manager. You then use the connection’s .read_json() method to read the contents of parties.json into a relation. To create and populate a new table with the JSON data, you once again use .to_table().

When you run this code, you won’t see any output, but a new table will be added to your presidents.db database file. With this new table in place, your next task is to join it to the existing presidents table, then extract the information you want to see:

Python
>>> with duckdb.connect("presidents.db") as conn:
...     conn.sql(
...         """
...         SELECT first_name, last_name, party_name
...         FROM presidents
...         JOIN parties
...         ON presidents.party_id = parties.party_id
...         WHERE party_name = 'Whig'
...         ORDER BY last_name DESC
...         """
...     )
┌───────────────┬───────────┬────────────┐
│  first_name   │ last_name │ party_name │
│    varchar    │  varchar  │  varchar   │
├───────────────┼───────────┼────────────┤
│ John          │ Tyler     │ Whig       │
│ Zachary       │ Taylor    │ Whig       │
│ William Henry │ Harrison  │ Whig       │
│ Millard       │ Fillmore  │ Whig       │
└───────────────┴───────────┴────────────┘

To obtain the information you need, you construct the query shown above. Using the FROM, JOIN, and ON SQL keywords, you perform an inner join of both the presidents and parties tables using their party_id columns as the link. This will match only those records that have an identical party_id value in both tables. Had there been any mismatched records, they would have been ignored.

From the joined table, you only want to see the presidents’ first_name, last_name, and the party_name they represent. You access the three columns using the SELECT SQL keyword.

To filter the output so you only see members of the Whig party, you use the WHERE SQL keyword.

Finally, to display the records in descending order, you use the ORDER BY and DESC SQL keywords on the last_name field.

Once you run this query using the connection’s .sql() method, the result is returned. At this point, the result is in a relation, so it won’t be permanent. If needed, you could write the results to a third table or export them to an external file for further analysis.

Querying Relations

Storing data in database tables can duplicate information you already have in files, wasting space on your computer. It’s perfectly possible to repeat the previous query using only your computer’s memory. To do this, you use relations instead of tables, along with the default in-memory database:

Python
>>> import duckdb

>>> presidents = duckdb.read_parquet("presidents.parquet")
>>> parties = duckdb.read_json("parties.json")

>>> duckdb.sql(
...     """
...     SELECT first_name, last_name, party_name
...     FROM presidents
...     JOIN parties
...     ON presidents.party_id = parties.party_id
...     WHERE party_name = 'Whig'
...     ORDER BY last_name DESC
...     """
... )
┌───────────────┬───────────┬────────────┐
│  first_name   │ last_name │ party_name │
│    varchar    │  varchar  │  varchar   │
├───────────────┼───────────┼────────────┤
│ John          │ Tyler     │ Whig       │
│ Zachary       │ Taylor    │ Whig       │
│ William Henry │ Harrison  │ Whig       │
│ Millard       │ Fillmore  │ Whig       │
└───────────────┴───────────┴────────────┘

This time, you create two relations from the presidents.parquet and parties.json files. Although you can name these relations using any valid Python variable name, by naming them presidents and parties, you can use the exact same SQL query as you did previously.

When you run the code above, you’ll see the same output as before. The only difference is that the data is now read from the files and processed in memory, whereas previously it was also stored in a database, duplicating the files you’d already saved locally. If your files were stored remotely, meaning you’d need to read them across a network, using database tables may improve performance.

Although using SQL is probably your go-to way of querying a relational database, DuckDB also contains other APIs, including the Python Relational API that allows you to query relations. The relational API provides a more object-oriented syntax. However, the code is transformed into optimized SQL behind the scenes before it’s run against your database.

The relational API is a set of methods belonging to the DuckDBPyRelation object, which, as you saw, represents a relation. Each method returns another DuckDBPyRelation object, meaning the different API methods can be chained to one another. This eliminates the need to create multiple intermediate relations.

The code below once more produces the same output as previously, but this time uses the relational API:

Python
 1>>> presidents = duckdb.read_parquet("presidents.parquet").set_alias(
 2...     "presidents"
 3... )
 4>>> parties = duckdb.read_json("parties.json").set_alias("parties")
 5
 6>>> (
 7...     presidents
 8...     .join(parties, "presidents.party_id = parties.party_id")
 9...     .select("first_name", "last_name", "party_name")
10...     .filter("party_name = 'Whig'")
11...     .order("last_name DESC")
12... )

The above code produces the same output as the previous two examples, only this time there’s no SQL in sight. Notice also that you’re using the default in-memory database by using duckdb.read_parquet() and duckdb.read_json() when reading the files. Previously, you used conn.read_json() to read the file across a connection to a permanent database.

There’s one small difference in creating the relations here. Although you assign them to the same variables as before, you must also give them aliases using the .set_alias() method. Here, each variable and its alias share the same name. You’ll see why a little later.

Lines 1 through 4 are where you define the relations, their aliases, and their data sources.

Lines 6 through 12 contain the relational API version of your previous query. The relational API uses four methods to replicate the earlier SQL keywords. You must put .join() first to make the fields from both tables available to the rest of the query, but the order of .select(), .filter(), and .order() doesn’t matter. Feel free to test this out for yourself. As with SQL, your query is optimized for you before it runs.

When using the relational API, relations are referenced using their variable names, while any fields referenced within the relations are referenced using their aliases.

For example, in lines 7 and 8, the code snippet presidents.join(parties ... refers to the presidents and parties relations. However, the ... "presidents.party_id = parties.party_id") snippet references the field names through the presidents and parties relation’s aliases.

Although the variables used to reference the relations can be anything you like, it simplifies things if you keep them the same as their aliases. In this next version of the code, you’ve changed the way you name things without affecting the result:

Python
>>> leaders = duckdb.read_parquet("presidents.parquet").set_alias(
...     "usa_presidents"
... )
>>> faction = duckdb.read_json("parties.json").set_alias("political_parties")

>>> (
...     leaders
...     .join(faction, "usa_presidents.party_id = political_parties.party_id")
...     .select("first_name", "last_name", "party_name")
...     .filter("party_name = 'Whig'")
...     .order("last_name DESC")
... )

This time, you changed the variables referencing the relations and their aliases. While the output is once more the same, the .join() code is not as clear as before.

Next, you’ll learn how to use concurrency correctly when working with your database.

Using Concurrency

DuckDB is primarily intended to work with bulk data. As such, it’s not optimized for running multiple small transactions by multiple applications, or even by multiple threads within a single application. Consequently, DuckDB’s concurrency support has limitations on concurrent processing, as you’ll see in this section.

Performing Concurrent Reads

Multiple applications, or threads within a single application, will each be able to read data successfully from the same database concurrently. This includes concurrent reads of the same data. The code below performs three concurrent reads of the same row from the presidents table by different threads:

Python
 1>>> from concurrent.futures import ThreadPoolExecutor
 2>>> import duckdb
 3
 4>>> def read_data(thread_id):
 5...     print(f"Thread {thread_id} starting its read.")
 6...     with duckdb.connect("presidents.db") as conn:
 7...         conn.sql(
 8...             """
 9...             SELECT first_name, last_name
10...             FROM presidents
11...             WHERE sequence = 1
12...             """
13...         ).show()
14...     print(f"Thread {thread_id} ending its read.")
15...
16
17>>> def concurrent_read():
18...     with ThreadPoolExecutor(max_workers=3) as executor:
19...         executor.map(read_data, range(3))
20...

Line 1 imports the ThreadPoolExecutor from concurrent.futures. You need a ThreadPoolExecutor object to create and manage multiple threads.

Lines 4 through 14 contain your read_data() function that accesses the data in the database. The function connects to your existing presidents.db and reads the name details from the first row of the presidents table. The function takes a parameter named thread_id, which is a number that identifies the thread calling it. You use this in the print() functions in lines 5 and 14 to monitor the progress of the database reads.

Lines 17 through 19 are where you define a concurrent_read() function to create and manage your multiple threads. Using a context manager, you set up a ThreadPoolExecutor object, named executor, that can manage a thread pool of up to three different concurrent threads.

Line 19 uses the ThreadPoolExecutor object’s .map() method to create three threads. Your read_data() function is called three times, with each call running in a separate thread. Each time read_data() is called, it gets passed a separate integer produced by range(3). These integers are used within read_data() to identify the thread it runs within.
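If .map() is new to you, here’s a database-free sketch of how it distributes calls across the pool. Note that .map() always yields results in the order of its inputs, even if the threads finish in a different order:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

# .map() submits one call per input value across the worker threads.
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(square, range(3)))

print(results)  # [0, 1, 4]
```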

To generate the concurrent reads, all you need to do is call your concurrent_read() function:

Python
>>> concurrent_read()
Thread 0 starting its read.
Thread 1 starting its read.
Thread 2 starting its read.
┌────────────┬────────────┐
│ first_name │ last_name  │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ George     │ Washington │
└────────────┴────────────┘

┌────────────┬────────────┐
│ first_name │ last_name  │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ George     │ Washington │
└────────────┴────────────┘

Thread 1 ending its read.
Thread 0 ending its read.
┌────────────┬────────────┐
│ first_name │ last_name  │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ George     │ Washington │
└────────────┴────────────┘

Thread 2 ending its read.

As you can see, all three concurrent reads succeeded, each returning the same data. All three threads also completed successfully, albeit in a different order from the one in which they started. Your finishing order might differ from the one shown above. That doesn’t matter; the important point is that they all finished. You’ll see next that this won’t always be the case.

Performing Concurrent Writes

Now suppose you attempt to write to the same data within your database. To see what happens, take a look at the code shown below:

Python
 1>>> from concurrent.futures import ThreadPoolExecutor
 2>>> import duckdb
 3
 4>>> def update_data(thread_id):
 5...     new_name = f"George ({thread_id})"
 6...     with duckdb.connect("presidents.db") as conn:
 7...         print(f"Thread {thread_id} starting its update.")
 8...         conn.sql(
 9...             f"""
10...             UPDATE presidents
11...             SET first_name = '{new_name}'
12...             WHERE sequence = 1
13...             """
14...         )
15...         print(f"Thread {thread_id} ending its update.")
16...
17
18>>> def concurrent_update():
19...     with ThreadPoolExecutor(max_workers=3) as executor:
20...         executor.map(update_data, range(3))
21...

Lines 4 through 15 contain your new update_data() function. This runs a SQL UPDATE query to update the first record in the presidents table. The query updates the first_name field to include the identifier of the thread that accessed it. As before, there are two print() function calls to help you monitor progress.

Lines 18 through 20 show a new concurrent_update() function. This works in the same way as the concurrent_read() function you wrote earlier, only this time it calls update_data().

As before, you need to call concurrent_update() to see its output:

Python
>>> concurrent_update()
Thread 0 starting its update.
Thread 1 starting its update.
Thread 2 starting its update.
Thread 2 ending its update.

This time, only one thread completed successfully. The other two were stopped internally by a DuckDB transaction conflict error. They weren’t allowed to update data that had already been changed by another concurrent transaction. Any changes these threads might already have made elsewhere were rolled back.

To better understand the effect of this on your database, you can now run a SELECT query on it:

Python
>>> with duckdb.connect("presidents.db") as conn:
...     conn.sql(
...         """
...         SELECT last_name, first_name
...         FROM presidents
...         WHERE sequence = 1
...         """
...     )
...
┌────────────┬────────────┐
│ last_name  │ first_name │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ Washington │ George (2) │
└────────────┴────────────┘

Notice how the first_name field now contains the name George (2). This happened because the update from the thread with a thread_id of 2 was the first to be applied, so it was the only one allowed to complete. The updates attempted by the remaining threads, 0 and 1, were both rejected.

When you run this code, any of the threads could win the race. The actual order of concurrent access depends on how your operating system schedules them.

The only time multiple concurrent writes are allowed to complete successfully is if they’re updating different rows.

Next, you’ll see how to work with the database data using Python code.

Integrating DuckDB Within Your Python Environment

So far, you’ve learned how to create and query databases. Once you’ve selected the data you need, you might need to further analyze or chart it using a Python data analysis tool. In this section, you’ll be introduced to two techniques for moving data from the database into your Python environment for further processing.

Creating Python Functions for DuckDB to Use

If you’re looking to perform a calculation on data, you should first check to see if there’s a built-in function available in either SQL or its Python client API. If you do find something suitable, you should use it. If not, you can write your own custom function in Python and register it with DuckDB.

Suppose you want to extract the names of each president in the form initial. surname. You also want to calculate the term of each presidency in days. To reformat the names, you could write a function. To calculate the term of office, you calculate the difference between the term_start and term_end dates using base SQL functionality.

The code below shows one way of writing such a custom function:

Python
>>> import duckdb

>>> def short_name(first_name: str, last_name: str) -> str:
...     return f"{first_name[0]}. {last_name}"
...

Your short_name() function takes a president’s first and last names as its parameters. It then returns a string in the required format, using an f-string to construct it. The f-string’s two replacement fields contain the first letter of the president’s first name and their full last name, separated by a period and a space.

To make sure DuckDB can understand the data types required for both your function’s first_name and last_name parameters, as well as its return value, you supply these as type hints.

You specify each parameter’s type hint with a colon (:) followed by its Python data type, while the return value’s type hint is defined using an arrow (->), a hyphen followed by a greater-than symbol, and its Python data type. In this example, all types are strings (str), but this doesn’t have to be the case.

Once you’ve written your function, you can test it in the usual way to make sure it works:

Python
>>> short_name("Abraham", "Lincoln")
'A. Lincoln'

As you can see, everything looks fine. Now that your function correctly does what you want, you need to register it with DuckDB:

Python
>>> duckdb.create_function("short_name", short_name)
<duckdb.duckdb.DuckDBPyConnection object at 0x0000021E01787870>

To make DuckDB aware that your function exists, you pass two parameters into .create_function(). The first is a string that defines the name DuckDB will use to access the function, while the second is the Python function object itself, which is why you don’t enclose it in quotes. While the two names can differ, it’s best to keep them the same for readability.

Once you’ve successfully registered your function, you can run it as part of an SQL query:

Python
>>> presidents = duckdb.read_parquet("presidents.parquet")

>>> duckdb.sql(
...     """
...     SELECT short_name(first_name, last_name) AS name,
...     (term_end - term_start) AS days_in_office
...     FROM presidents
...     """
... ).limit(3)
┌───────────────┬────────────────┐
│     name      │ days_in_office │
│    varchar    │     int64      │
├───────────────┼────────────────┤
│ G. Washington │           2865 │
│ J. Adams      │           1460 │
│ T. Jefferson  │           2922 │
└───────────────┴────────────────┘

You can use your short_name() function within SQL by calling it and passing it the field names containing the parameters you wish it to use. In this case, you pass both first_name and last_name in the same order they’re defined in the original function. To make the output easier to understand, you use the AS clause to create a more descriptive column alias value.

To calculate the term of each presidency, you use SQL’s built-in ability to perform arithmetic on columns. In this case, you subtract term_start from term_end to determine the number of days each president was in office. Again, you create a column alias for clarity.

The output is restricted to the details of the first three presidents. This was done by passing 3 to the .limit() method of the relation object that was produced. If you were to omit this, you’d see all rows.

Using Polars and pandas With DuckDB

Sometimes, you may want to work with your database data in pandas or Polars DataFrames. You can write a database query to produce a relation containing the data you want, and then convert the relation to either pandas or Polars using its .df() or .pl() methods, respectively. You can then analyze your data using the features of these libraries.

Before continuing, please make sure that pandas, Polars, and PyArrow are installed in your Python environment. pandas is needed to work with pandas DataFrames, while you’ll need both Polars and PyArrow to work with Polars. You can use python -m pip install pandas polars pyarrow to install all three. If you’re working in a Jupyter Notebook, make sure you prefix the command with an exclamation point (!).

Suppose you want to read the names of the presidents from the presidents table of your presidents.db file into a pandas DataFrame. You first read what you want into a relation, and then convert it to a DataFrame:

Python
>>> import duckdb

>>> with duckdb.connect("presidents.db") as conn:
...     pandas_presidents = conn.sql(
...         """
...         SELECT last_name, first_name
...         FROM presidents
...         WHERE sequence BETWEEN 2 AND 5
...         """
...     ).df()
...
>>> pandas_presidents
   last_name first_name
0      Adams       John
1  Jefferson     Thomas
2    Madison      James
3     Monroe      James

This time, the result is a pandas DataFrame. The .df() method did the conversion for you. Notice that although you selected only the last_name and first_name columns, the output also includes a DataFrame index column. Finally, you didn’t need to import pandas. Even though pandas must be installed for .df() to work, you only need to import it if you want to call functions in the pandas namespace explicitly.

You can write data to a Polars DataFrame using a similar technique. Suppose you want to join the data from presidents.parquet and parties.json together as before, only this time, write the result out to a Polars DataFrame:

Python
>>> presidents = duckdb.read_parquet("presidents.parquet").set_alias(
...     "presidents"
... )
>>> parties = duckdb.read_json("parties.json").set_alias("parties")

>>> (
...     presidents.join(parties, "presidents.party_id = parties.party_id")
...     .select("first_name", "last_name", "party_name")
...     .order("last_name DESC")
... ).pl().head(3)
shape: (3, 3)
┌────────────┬────────────┬──────────────┐
│ first_name ┆ last_name  ┆ party_name   │
│ ---        ┆ ---        ┆ ---          │
│ str        ┆ str        ┆ str          │
╞════════════╪════════════╪══════════════╡
│ Woodrow    ┆ Wilson     ┆ Democratic   │
│ George     ┆ Washington ┆ Unaffiliated │
│ Martin     ┆ Van Buren  ┆ Democratic   │
└────────────┴────────────┴──────────────┘

The results here are identical to those you saw earlier, only this time they’re in a Polars DataFrame. As before, all the rows are written to the DataFrame, but the DataFrame’s .head(3) method limits the output to three rows for neatness. Once again, only duckdb needs to be imported: you only need to import Polars if you want to explicitly call functions in the Polars namespace.

At this point, you’ve had a taste of what the DuckDB relational database has to offer. However, there’s still a lot more to learn and explore.

Next Steps


Don’t forget that DuckDB is a library that’s still growing in capability. Keep a close eye on the official documentation, particularly the Python API section, to learn more about how you can use it in your Python code and to stay up to date with its latest features.

Conclusion

You’ve had an overview of DuckDB, and now have the knowledge and skills necessary to get started using it in your code.

In this tutorial, you’ve learned how to:

  • Create DuckDB databases from data files
  • Verify your data files have been read correctly
  • Query the database using Structured Query Language (SQL)
  • Query the database using DuckDB’s Python Application Programming Interface (API)
  • Access the database concurrently
  • Use DuckDB data in a Python environment

With all of these features, DuckDB is undoubtedly a valuable addition to Python’s data analysis capabilities. It’s a popular, fully featured, OLAP-optimized database that can query data efficiently regardless of the size of the dataset, and its Python API allows you to seamlessly integrate it into your projects.

Frequently Asked Questions

Now that you have some experience with DuckDB in Python, you can use the questions and answers below to check your understanding and recap what you’ve learned.

These FAQs are related to the most important concepts you’ve covered in this tutorial.

What is DuckDB?

DuckDB is a database management system optimized for online analytical processing (OLAP) that you can use to store, access, and manipulate large datasets efficiently within Python.

How do you install DuckDB in Python?

You can install DuckDB in Python by running the command python -m pip install duckdb in your command prompt or terminal.

How do you import data into DuckDB?

You import data into DuckDB by using methods like read_csv(), read_parquet(), or read_json() to create relations from external files, which you can then query or save as tables in your database.

Can you convert DuckDB query results into a pandas DataFrame?

Yes, you can convert DuckDB query results into a pandas DataFrame using the .df() method on a relation, allowing you to further analyze or manipulate the data with pandas.

Why use DuckDB for data analysis?

DuckDB offers fast data access through query optimization, supports large datasets beyond your computer’s memory, and seamlessly integrates with Python, making it a powerful tool for data analysis.



About Ian Eyre

Ian is an avid Pythonista and Real Python contributor who loves to learn and teach others.



