- Apache Arrow is an open-source project that provides a shared data format and libraries for high-performance data analytics across multiple languages. It aims to unify database and data science technology stacks.
- In 2021, Ursa joined forces with GPU-accelerated computing pioneers to form Voltron Data, continuing development of Apache Arrow and related components such as Arrow Flight and the Arrow R package.
- Upcoming Arrow releases will bring additional query execution capabilities, such as joins and window functions, improving the performance and efficiency of analytics workflows in R.
1. New Directions for Apache Arrow
Wes McKinney
@wesmckinn
September 10, 2021
New York R Conference
2. Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of developers of open source data projects
● Provides a shared foundation for data analytics
● Enables unification of database and data science technology stacks
● Thriving user and developer community
● Adopted by numerous projects and products in the data ecosystem
3. 2018: Ursa Labs
Founded with a not-for-profit mission
● Build cross-language, open libraries for data analytics
● Grow the Apache Arrow ecosystem
● Employ a team of full-time developers
Supported by sponsors and partners
4. 2020: Ursa Computing
Founded to support enterprise applications of Arrow
● Empower teams to accelerate data workflows
● Work with enterprises to enhance data platforms
● Enable organizations to get more out of their data
A venture-backed startup
5. 2021: Voltron Data
Joining forces for an Arrow-native future
● Ursa joined forces with GPU-accelerated computing pioneers
● Together we are creating a unified foundation for the future of analytical computing
○ Optimized for diverse hardware
○ Compatible across languages
○ Fast and efficient
○ Based on Apache Arrow
● Ursa Labs is now Voltron Labs
6. Apache Arrow
● Specifies a columnar format for how data is stored in memory
● Provides implementations or bindings in numerous languages
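To make the columnar idea concrete, here is a toy sketch in plain Python (not Arrow's actual memory layout): the same records stored row-wise and column-wise. In the columnar layout, a scan over one field reads a single contiguous array, which is what makes vectorized analytics fast.

```python
# Toy sketch of row-oriented vs. columnar storage.
# Plain Python lists stand in for Arrow's contiguous buffers.
rows = [
    {"passenger_count": 1, "tip_amount": 2.5},
    {"passenger_count": 2, "tip_amount": 4.0},
    {"passenger_count": 1, "tip_amount": 1.0},
]

# Columnar layout: one array per field instead of one record per row.
columns = {
    "passenger_count": [r["passenger_count"] for r in rows],
    "tip_amount": [r["tip_amount"] for r in rows],
}

# Aggregating one field now touches only that field's values.
total_tips = sum(columns["tip_amount"])
print(total_tips)  # 7.5
```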
7. Arrow Flight
High-performance data transport protocol
● Provides a framework for sending and receiving Arrow data natively
● Built using gRPC, Protocol Buffers, and the Arrow columnar format
● Designed to move large-scale data with excellent speed and efficiency
● Enables seamless interoperability across networks
Arrow Flight SQL is a next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
8. Arrow R Package
Exposes an interface to the Arrow C++ library
● Low-level access to the Arrow C++ API
● Higher-level access through a dplyr backend
Install the latest release from CRAN:
install.packages("arrow")
Install the latest nightly development build:
install.packages("arrow",
                 repos = c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))
16. read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct))
#> # A tibble: 3 x 3
#>   passenger_count avg_tip_pct     n
#>             <int>       <dbl> <int>
#> 1               1        13.6 11714
#> 2               2        11.9  2892
#> 3               3        11.1   709
system.time(...)
#>  user  system elapsed
#> 4.762   0.806   1.612
Parquet file: ~10 million rows, ~250 MB
Read into an R data frame
DEV VERSION
17. read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 3 x 3
#>   passenger_count avg_tip_pct     n
#>             <int>       <dbl> <int>
#> 1               1        13.6 11714
#> 2               2        11.9  2892
#> 3               3        11.1   709
system.time(...)
#>  user  system elapsed
#> 3.446   1.012   0.605
Parquet file: ~10 million rows, ~250 MB
Read into an Arrow Table
Return the result as an R data frame
DEV VERSION
18. open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 4 x 3
#>   passenger_count avg_tip_pct      n
#>             <int>       <dbl>  <int>
#> 1               5        16.8   5806
#> 2               1        13.5 143087
#> 3               2        12.6  34418
#> 4               3        11.9   8922
125 Parquet files: ~2 billion rows, ~40 GB
system.time(...)
#>  user  system elapsed
#> 3.319   0.247   1.111
DEV VERSION
19. open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
to_duckdb() %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
● Creates a virtual DuckDB table backed by an Arrow data object
○ No data is loaded until collect() is called
○ Returns a dbplyr object for use in dplyr pipelines
DEV VERSION
20. Coming Soon
Upcoming Arrow releases will bring additional
query execution capabilities to the Arrow C++ engine and R package
● Joins
● Window functions
● More scalar and aggregate functions
● Performance and efficiency improvements
21. Ibis
● The Arrow C++ engine currently lacks a high-level Python API
● Ibis can fill this gap
(taxi
    .filter(taxi.total_amount > 100)
    .projection(['tip_amount', 'total_amount', 'passenger_count'])
    .mutate(tip_pct=taxi.tip_amount / taxi.total_amount * 100)
    .group_by('passenger_count')
    .aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean())
    .filter(lambda x: x.n > 500)
    .sort_by(ibis.desc('avg_tip'))
    .execute())
22. Engines and Interfaces
There are multiple efforts underway to develop Arrow-native query engines
● Arrow C++ engine
● Arrow DataFusion
● DuckDB
● …
Users want fluent interfaces to these engines from their preferred languages
● Python
● R
● JavaScript
● …
Users also want to run SQL queries on these engines
26. Compute Intermediate Representation (IR)
The Arrow community has launched a collaboration to establish a Compute IR
● A standard serialized representation of compute expressions
● A common layer connecting APIs (front ends) and engines (back ends)
○ Is produced by APIs
○ Is consumed by engines
● Follow this initiative at substrait.io
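The front-end/back-end split can be sketched as follows. This is a toy illustration of the idea only, NOT Substrait's actual format (Substrait serializes plans as Protocol Buffers): a front end emits a serialized plan, and an engine consumes it without knowing which API produced it.

```python
import json

# A front end (e.g. a dplyr or Ibis expression) produces a plan...
plan = {
    "op": "filter",
    "predicate": {"fn": "gt", "args": [{"field": "total_amount"}, {"literal": 100}]},
    "input": {"op": "read", "table": "nyc-taxi"},
}
wire = json.dumps(plan)  # the serialized form that travels between API and engine

# ...and a back end (the engine) consumes it, regardless of the front end.
received = json.loads(wire)
print(received["op"], received["input"]["table"])  # filter nyc-taxi
```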