New Directions for Apache Arrow
Wes McKinney
@wesmckinn
September 10, 2021
New York R Conference
Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of developers of open source data projects
● Provides a shared foundation for data analytics
● Enables unification of database and data science technology stacks
● Thriving user and developer community
● Adopted by numerous projects and products in the data ecosystem
2018: Ursa Labs
Founded with a not-for-profit mission
● Build cross-language, open libraries for data analytics
● Grow the Apache Arrow ecosystem
● Employ a team of full-time developers
Supported by sponsors and partners
2020: Ursa Computing
Founded to support enterprise applications of Arrow
● Empower teams to accelerate data workflows
● Work with enterprises to enhance data platforms
● Enable organizations to get more out of their data
A venture-backed startup
2021: Voltron Data
Joining forces for an Arrow-native future
● Ursa joined forces with GPU-accelerated computing pioneers
● Together we are creating a unified foundation for the future of
analytical computing
○ Optimized for diverse hardware
○ Compatible across languages
○ Fast and efficient
○ Based on Apache Arrow
● Ursa Labs is now Voltron Labs
Apache Arrow
● Specifies a columnar format for how data is stored in memory
● Provides implementations or bindings in numerous languages
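The columnar idea can be illustrated with a plain-Python sketch (illustrative only, not the Arrow library or its actual memory layout): each column's values are stored contiguously, so an operation over one column touches only that column's data.

```python
# Conceptual sketch of columnar storage. Pure Python for illustration;
# real Arrow stores columns in contiguous typed buffers, not lists.

rows = [
    {"passenger_count": 1, "tip_amount": 2.5},
    {"passenger_count": 2, "tip_amount": 0.0},
    {"passenger_count": 1, "tip_amount": 4.0},
]

# Columnar layout: one contiguous sequence per column.
columns = {
    "passenger_count": [r["passenger_count"] for r in rows],
    "tip_amount": [r["tip_amount"] for r in rows],
}

# An aggregate over one column reads only that column's values,
# never the rest of each row.
total_tips = sum(columns["tip_amount"])
print(total_tips)  # 6.5
```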
Arrow Flight
High-performance data transport protocol
● Provides a framework for sending and receiving Arrow data natively
● Built using gRPC, Protocol Buffers, and the Arrow columnar format
● Designed to move large-scale data with excellent speed and efficiency
● Enables seamless interoperability across networks
Arrow Flight SQL is a next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
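The transport idea can be sketched in miniature (conceptual only: real Arrow Flight is built on gRPC and Protocol Buffers, not this invented framing): record batches travel as length-prefixed payloads over a byte stream, and the receiver reconstructs the stream of batches.

```python
# Conceptual sketch only. Arrow Flight itself uses gRPC and Protocol
# Buffers; length-prefixed framing over a byte stream stands in here
# for the idea of streaming record batches between processes.
import io
import struct

def write_batches(stream, batches):
    # Each "batch" is an opaque bytes payload, prefixed with its length.
    for payload in batches:
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)

def read_batches(stream):
    # Read length-prefixed payloads until the stream is exhausted.
    batches = []
    while True:
        header = stream.read(4)
        if not header:
            break
        (n,) = struct.unpack("<I", header)
        batches.append(stream.read(n))
    return batches

buf = io.BytesIO()
write_batches(buf, [b"batch-0", b"batch-1"])
buf.seek(0)
print(read_batches(buf))  # [b'batch-0', b'batch-1']
```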
Arrow R Package
Exposes an interface to the Arrow C++ library
● Low-level access to the Arrow C++ API
● Higher-level access through a dplyr backend
Install the latest release from CRAN:
install.packages("arrow")
Install the latest nightly development build:
install.packages("arrow", repos =
c("https://arrow-r-nightly.s3.amazonaws.com", getOption("repos")))
Major Releases of Arrow
Arrow R Package Milestones
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = TRUE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct))
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 4.762 0.806 1.612
Parquet file: ~10 million rows, ~250 MB. Read into an R data frame. (Dev version of the arrow package.)
read_parquet("nyc-taxi/2015/09/data.parquet", as_data_frame = FALSE) %>%
filter(total_amount > 100) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 500) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 3 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 1 13.6 11714
#> 2 2 11.9 2892
#> 3 3 11.1 709
system.time(...)
#> user system elapsed
#> 3.446 1.012 0.605
Parquet file: ~10 million rows, ~250 MB. Read into an Arrow Table; collect() returns the result as an R data frame. (Dev version of the arrow package.)
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
#> # A tibble: 4 x 3
#> passenger_count avg_tip_pct n
#> <int> <dbl> <int>
#> 1 5 16.8 5806
#> 2 1 13.5 143087
#> 3 2 12.6 34418
#> 4 3 11.9 8922
125 Parquet files: ~2 billion rows, ~40 GB.
system.time(...)
#> user system elapsed
#> 3.319 0.247 1.111
(Dev version of the arrow package.)
open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
filter(total_amount > 100 & year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
mutate(tip_pct = tip_amount / total_amount * 100) %>%
to_duckdb() %>%
group_by(passenger_count) %>%
summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
filter(n > 5000) %>%
arrange(desc(avg_tip_pct)) %>%
collect()
● Creates a virtual DuckDB table backed by an Arrow data object
○ No data is loaded until collect() is called
○ Returns a dbplyr object for use in dplyr pipelines
(Dev version of the arrow package.)
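The deferred-execution behavior described above can be sketched with a toy lazy pipeline (pure Python; not DuckDB, dbplyr, or Arrow code): each step only records a function, and no work happens until collect() is called.

```python
# Toy illustration of deferred execution: steps are recorded, not run,
# until collect() is called. Invented for illustration only.
class LazyPipeline:
    def __init__(self, data):
        self._data = data
        self._steps = []

    def step(self, fn):
        self._steps.append(fn)   # record only; no work happens here
        return self

    def collect(self):
        result = self._data
        for fn in self._steps:   # all work happens at collect() time
            result = fn(result)
        return result

p = (LazyPipeline(range(10))
     .step(lambda xs: [x for x in xs if x % 2 == 0])
     .step(lambda xs: [x * 10 for x in xs]))
print(p.collect())  # [0, 20, 40, 60, 80]
```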
Coming Soon
Upcoming Arrow releases will bring additional
query execution capabilities to the Arrow C++ engine and R package
● Joins
● Window functions
● More scalar and aggregate functions
● Performance and efficiency improvements
Ibis
● The Arrow C++ engine currently lacks a high-level Python API
● Ibis can fill this gap
(taxi
    .filter(taxi.total_amount > 100)
    .projection(['tip_amount', 'total_amount', 'passenger_count'])
    .mutate(tip_pct = taxi.tip_amount / taxi.total_amount * 100)
    .group_by('passenger_count')
    .aggregate(n=lambda x: x.count(), avg_tip=lambda x: x.tip_pct.mean())
    .filter(lambda x: x.n > 500)
    .sort_by(ibis.desc('avg_tip'))
    .execute())
Engines and Interfaces
There are multiple efforts underway to develop Arrow-native query engines
● Arrow C++ engine
● Arrow DataFusion
● DuckDB
● …
Users want fluent interfaces to these engines from their preferred languages
● Python
● R
● JavaScript
● …
Users also want to run SQL queries on these engines
Past
Present
Future
Compute Intermediate Representation (IR)
The Arrow community has launched a collaboration to establish a Compute IR
● A standard serialized representation of compute expressions
● A common layer connecting APIs (front ends) and engines (back ends)
○ Is produced by APIs
○ Is consumed by engines
● Follow this initiative at substrait.io
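The IR idea can be sketched in miniature (a hypothetical format invented here, not the Substrait specification): a front end serializes a query plan into a neutral representation, and any back end that understands that representation can execute it.

```python
# Miniature stand-in for a compute IR: a front end emits a serialized
# plan; a back end interprets it. The JSON format here is invented for
# illustration and is not Substrait.
import json

# "Front end": serialize a filter-then-sum plan into a neutral format.
plan = json.dumps({
    "input": [1, 5, 10, 50, 100],
    "ops": [
        {"op": "filter_gt", "value": 5},
        {"op": "sum"},
    ],
})

# "Back end": any engine that understands the IR can execute the plan,
# independent of which API produced it.
def execute(serialized_plan):
    p = json.loads(serialized_plan)
    data = p["input"]
    for op in p["ops"]:
        if op["op"] == "filter_gt":
            data = [x for x in data if x > op["value"]]
        elif op["op"] == "sum":
            data = sum(data)
    return data

print(execute(plan))  # 160
```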
Thank you
Wes McKinney
@wesmckinn
arrow.apache.org
voltrondata.com
We’re hiring!
voltrondata.com/careers