New Directions for Apache Arrow
Wes McKinney
@wesmckinn
September 10, 2021
New York R Conference
Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of developers of open source data projects
● Provides a shared foundation for data analytics
● Enables unification of database and data science technology stacks
● Thriving user and developer community
● Adopted by numerous projects and products in the data ecosystem
2018: Ursa Labs
Founded with a not-for-profit mission
● Build cross-language, open libraries for data analytics
● Grow the Apache Arrow ecosystem
● Employ a team of full-time developers
Supported by sponsors and partners
2020: Ursa Computing
Founded to support enterprise applications of Arrow
● Empower teams to accelerate data workflows
● Work with enterprises to enhance data platforms
● Enable organizations to get more out of their data
A venture-backed startup
2021: Voltron Data
Joining forces for an Arrow-native future
● Ursa joined forces with GPU-accelerated computing pioneers
● Together we are creating a unified foundation for the future of
analytical computing
○ Optimized for diverse hardware
○ Compatible across languages
○ Fast and efficient
○ Based on Apache Arrow
● Ursa Labs is now Voltron Labs
Apache Arrow
● Specifies a columnar format for how data is stored in memory
● Provides implementations or bindings in numerous languages
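The columnar idea can be illustrated with a plain-Python sketch (illustrative only, not the Arrow library or its actual memory layout): each column's values are stored contiguously, so an operation over one column touches only that column's data.

```python
# Conceptual sketch of columnar storage. Pure Python for illustration;
# real Arrow stores columns in contiguous typed buffers, not lists.

rows = [
    {"passenger_count": 1, "tip_amount": 2.5},
    {"passenger_count": 2, "tip_amount": 0.0},
    {"passenger_count": 1, "tip_amount": 4.0},
]

# Columnar layout: one contiguous sequence per column.
columns = {
    "passenger_count": [r["passenger_count"] for r in rows],
    "tip_amount": [r["tip_amount"] for r in rows],
}

# An aggregate over one column reads only that column's values,
# never the rest of each row.
total_tips = sum(columns["tip_amount"])
print(total_tips)  # 6.5
```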
Arrow Flight
High-performance data transport protocol
● Provides a framework for sending and receiving Arrow data natively
● Built using gRPC, Protocol Buffers, and the Arrow columnar format
● Designed to move large-scale data with excellent speed and efficiency
● Enables seamless interoperability across networks
Arrow Flight SQL is a next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
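The transport idea can be sketched in miniature (conceptual only: real Arrow Flight is built on gRPC and Protocol Buffers, not this invented framing): record batches travel as length-prefixed payloads over a byte stream, and the receiver reconstructs the stream of batches.

```python
# Conceptual sketch only. Arrow Flight itself uses gRPC and Protocol
# Buffers; length-prefixed framing over a byte stream stands in here
# for the idea of streaming record batches between processes.
import io
import struct

def write_batches(stream, batches):
    # Each "batch" is an opaque bytes payload, prefixed with its length.
    for payload in batches:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)

def read_batches(stream):
    # Read length-prefixed payloads until the stream is exhausted.
    batches = []
    while True:
        header = stream.read(4)
        if not header:
            break
        (n,) = struct.unpack("<I", header)
        batches.append(stream.read(n))
    return batches

buf = io.BytesIO()
write_batches(buf, [b"batch-0", b"batch-1"])
buf.seek(0)
print(read_batches(buf))  # [b'batch-0', b'batch-1']
```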
Arrow R Package
Exposes an interface to the Arrow C++ library
● Low-level access to the Arrow C++ API
● Higher-level access through a dplyr backend
Install the latest release from CRAN:
install.packages("arrow")
Install the latest nightly development build:
install.packages("arrow", repos =
c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))
Major Releases of Arrow
Arrow R Package Milestones
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct))
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 4.762 0.806 1.612
Parquet file: ~10 million rows, ~250 MB. Read into an R data frame. (Dev version of the arrow package.)
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 3.446 1.012 0.605
Parquet file: ~10 million rows, ~250 MB. Read into an Arrow Table; collect() returns the result as an R data frame. (Dev version of the arrow package.)
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 4 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 5 16.8 5806
#> 2 1 13.5 143087
#> 3 2 12.6 34418
#> 4 3 11.9 8922
125 Parquet files: ~2 billion rows, ~40 GB.
system.time(...)
#> user system elapsed
#> 3.319 0.247 1.111
(Dev version of the arrow package.)
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
to_duckdb() %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
● Creates a virtual DuckDB table backed by an Arrow data object
○ No data is loaded until collect() is called
○ Returns a dbplyr object for use in dplyr pipelines
(Dev version of the arrow package.)
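The deferred-execution behavior described above can be sketched with a toy lazy pipeline (pure Python; not DuckDB, dbplyr, or Arrow code): each step only records a function, and no work happens until collect() is called.

```python
# Toy illustration of deferred execution: steps are recorded, not run,
# until collect() is called. Invented for illustration only.
class LazyPipeline:
    def __init__(self, data):
        self._data = data
        self._steps = []

    def step(self, fn):
        self._steps.append(fn)   # record only; no work happens here
        return self

    def collect(self):
        result = self._data
        for fn in self._steps:   # all work happens at collect() time
            result = fn(result)
        return result

p = (LazyPipeline(range(10))
     .step(lambda xs: [x for x in xs if x % 2 == 0])
     .step(lambda xs: [x * 10 for x in xs]))
print(p.collect())  # [0, 20, 40, 60, 80]
```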
Coming Soon
Upcoming Arrow releases will bring additional
query execution capabilities to the Arrow C++ engine and R package
● Joins
● Window functions
● More scalar and aggregate functions
● Performance and efficiency improvements
Ibis
● The Arrow C++ engine currently lacks a high-level Python API
● Ibis can fill this gap
(taxi
    .filter(taxi.total_amount > 100)
    .projection(['tip_amount', 'total_amount', 'passenger_count'])
    .mutate(tip_pct = taxi.tip_amount / taxi.total_amount * 100)
    .group_by('passenger_count')
    .aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean())
    .filter(lambda x: x.n > 500)
    .sort_by(ibis.desc('avg_tip'))
    .execute())
Engines and Interfaces
There are multiple efforts underway to develop Arrow-native query engines
● Arrow C++ engine
● Arrow DataFusion
● DuckDB
● …
Users want fluent interfaces to these engines from their preferred languages
● Python
● R
● JavaScript
● …
Users also want to run SQL queries on these engines
Past
Present
Future
Compute Intermediate Representation (IR)
The Arrow community has launched a collaboration to establish a Compute IR
● A standard serialized representation of compute expressions
● A common layer connecting APIs (front ends) and engines (back ends)
○ Is produced by APIs
○ Is consumed by engines
● Follow this initiative at substrait.io
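The IR idea can be sketched in miniature (a hypothetical format invented here, not the Substrait specification): a front end serializes a query plan into a neutral representation, and any back end that understands that representation can execute it.

```python
# Miniature stand-in for a compute IR: a front end emits a serialized
# plan; a back end interprets it. The JSON format here is invented for
# illustration and is not Substrait.
import json

# "Front end": serialize a filter-then-sum plan into a neutral format.
plan = json.dumps({
    "input": [1, 5, 10, 50, 100],
    "ops": [
        {"op": "filter_gt", "value": 5},
        {"op": "sum"},
    ],
})

# "Back end": any engine that understands the IR can execute the plan,
# independent of which API produced it.
def execute(serialized_plan):
    p = json.loads(serialized_plan)
    data = p["input"]
    for op in p["ops"]:
        if op["op"] == "filter_gt":
            data = [x for x in data if x > op["value"]]
        elif op["op"] == "sum":
            data = sum(data)
    return data

print(execute(plan))  # 160
```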
Thank you
Wes McKinney
@wesmckinn
arrow.apache.org
voltrondata.com
We’re hiring!
voltrondata.com/careers