TF On Spark

TensorFlowOnSpark allows running TensorFlow applications on large Spark clusters for scalable deep learning. It supports both synchronous and asynchronous training as well as model and data parallelism. TensorFlowOnSpark integrates TensorFlow applications with existing HDFS data pipelines and Spark/MLlib algorithms. It provides two modes for feeding data to TensorFlow from HDFS: using RDDs or direct HDFS access. TensorFlowOnSpark has been used successfully in production at Yahoo for training on petabytes of data using thousands of nodes.

TensorFlowOnSpark

Scalable TensorFlow Learning on Spark Clusters

Lee Yang, Andrew Feng
Yahoo Big Data ML Platform Team
What is TensorFlowOnSpark
Why TensorFlowOnSpark at Yahoo?
•  Major contributor to open-source Hadoop ecosystem
–  Originators of Hadoop (2006)
–  An early adopter of Spark (since 2013)
–  Open-sourced CaffeOnSpark (2016)
•  Large investment in production clusters
–  Tens of clusters
–  Thousands of nodes per cluster
•  Massive amounts of data
–  Petabytes of data
Private ML Clusters
Why TensorFlowOnSpark?
Scaling
•  Near-linear scaling
RDMA Speedup over gRPC
•  2.4X faster

http://www.mellanox.com/solutions/machine-learning/tensorflow.php
TensorFlowOnSpark Design Goals
•  Scale up existing TF apps with minimal changes
•  Support all current TensorFlow functionality
–  Synchronous/asynchronous training
–  Model/data parallelism
–  TensorBoard
•  Integrate with existing HDFS data pipelines and ML algorithms
–  e.g. Hive, Spark, MLlib
TensorFlowOnSpark
•  PySpark wrapper of TF app code
•  Launches distributed TF clusters using Spark executors
•  Supports TF data ingestion modes
–  feed_dict – RDD.mapPartitions()
–  queue_runner – direct HDFS access from TF
•  Supports TensorBoard during/after training
•  Generally agnostic to Spark/TF versions
Supported Environments
•  Python 2.7 - 3.x
•  Spark 1.6 - 2.x
•  TensorFlow 0.12, 1.x
•  Hadoop 2.x
Architectural Overview
TensorFlowOnSpark Basics
1.  Launch TensorFlow cluster
2.  Feed data to TensorFlow app
3.  Shutdown TensorFlow cluster
API Example

cluster = TFCluster.run(sc, map_fn, args, num_executors,
num_ps, tensorboard, input_mode)
cluster.train(dataRDD, num_epochs=0)
cluster.inference(dataRDD)
cluster.shutdown()
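As a rough illustration of how `num_executors` and `num_ps` might translate into TensorFlow roles, the sketch below gives the first `num_ps` executor slots the `ps` job and the rest the `worker` job. The helper name `assign_roles` and the ordering rule are assumptions made for illustration, not TensorFlowOnSpark's actual implementation.

```python
def assign_roles(num_executors, num_ps):
    """Hypothetical helper: map executor ids to (job_name, task_index) pairs.

    Assumption: the first num_ps executors act as parameter servers and the
    remaining ones as workers. The real library may order roles differently.
    """
    if not 0 <= num_ps < num_executors:
        raise ValueError("need at least one worker and a non-negative num_ps")
    roles = {}
    for executor_id in range(num_executors):
        if executor_id < num_ps:
            roles[executor_id] = ("ps", executor_id)
        else:
            roles[executor_id] = ("worker", executor_id - num_ps)
    return roles


# Three executors with one parameter server, as in the diagrams of this deck.
roles = assign_roles(num_executors=3, num_ps=1)
```

With three executors and one parameter server this yields `ps:0`, `worker:0`, and `worker:1`.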
Conversion Example
# diff -w eval_image_classifier.py
20a21,27
> from pyspark.context import SparkContext
> from pyspark.conf import SparkConf
> from tensorflowonspark import TFCluster, TFNode
> import sys
>
> def main_fun(argv, ctx):
27a35,36
> sys.argv = argv
>
84,85d92
<
< def main(_):
88a96,97
> cluster_spec, server = TFNode.start_cluster_server(ctx)
>
191c200,204
< tf.app.run()
---
> sc = SparkContext(conf=SparkConf().setAppName("eval_image_classifier"))
> num_executors = int(sc._conf.get("spark.executor.instances"))
> cluster = TFCluster.run(sc, main_fun, sys.argv, num_executors, 0, False, TFCluster.InputMode.TENSORFLOW)
> cluster.shutdown()

Input Modes
•  InputMode.SPARK
HDFS → RDD.mapPartitions → feed_dict

•  InputMode.TENSORFLOW
TFReader + QueueRunner ← HDFS
InputMode.SPARK
[Diagram: three Spark executors, each running a Python worker, hosting ps:0, worker:0, and worker:1. RDD partitions are fed through a queue into worker:0 and worker:1.]
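The queue-based hand-off in InputMode.SPARK can be mimicked with the standard library alone. This is a simulation under stated assumptions, not TensorFlowOnSpark code: `feed_partition` stands in for the function passed to `RDD.mapPartitions()`, `next_batch` stands in for the TensorFlow worker pulling records for a `feed_dict`, and the `None` end-of-data sentinel is a convention invented for this sketch.

```python
import queue
import threading


def feed_partition(partition, q):
    """Stand-in for the mapPartitions function: push each record into the queue."""
    for record in partition:
        q.put(record)
    q.put(None)  # sentinel marking the end of the data (assumption of this sketch)


def next_batch(q, batch_size):
    """Stand-in for the TF worker side: pull up to batch_size records."""
    batch = []
    while len(batch) < batch_size:
        record = q.get()  # blocks until the feeder provides a record
        if record is None:
            q.put(None)  # leave the sentinel in place for any later call
            break
        batch.append(record)
    return batch


q = queue.Queue()
feeder = threading.Thread(target=feed_partition, args=(range(5), q))
feeder.start()

# Consume batches until the queue reports that the data is exhausted.
batches = []
while True:
    batch = next_batch(q, batch_size=2)
    if not batch:
        break
    batches.append(batch)
feeder.join()
```

The feeder and the consumer run independently, which is why a retried feeding task does not by itself restart the consumer.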
InputMode.TENSORFLOW
[Diagram: three Spark executors, each running a Python worker, hosting ps:0, worker:0, and worker:1. The TensorFlow nodes read directly from HDFS.]
Failure Recovery
•  TF Checkpoints written to HDFS
•  InputMode.SPARK
–  TF worker runs in background
–  RDD data feeding tasks can be retried
–  However, TF worker failures will be “hidden” from Spark
•  InputMode.TENSORFLOW
–  TF worker runs in foreground
–  TF worker failures will be retried as Spark task
–  TF worker restores from checkpoint
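The retry-then-restore behaviour described for InputMode.TENSORFLOW can be sketched without TensorFlow or Spark. In this toy version a JSON file in a temporary directory plays the role of the checkpoint on HDFS, and a plain `while` loop plays the role of Spark retrying the failed task; the function `train` and the file name `ckpt.json` are hypothetical.

```python
import json
import os
import tempfile


def train(ckpt_path, total_steps, fail_at=None):
    """Run a toy training loop that resumes from ckpt_path if it exists."""
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]  # restore progress from the checkpoint
    while step < total_steps:
        step += 1
        if step == fail_at:
            raise RuntimeError("simulated worker failure")
        with open(ckpt_path, "w") as f:
            json.dump({"step": step}, f)  # checkpoint after each completed step
    return step


ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")

attempts = 0
while True:
    attempts += 1
    try:
        # Only the first attempt fails, mimicking a task that gets retried.
        final_step = train(ckpt_path, total_steps=5,
                           fail_at=3 if attempts == 1 else None)
        break
    except RuntimeError:
        with open(ckpt_path) as f:
            resumed_from = json.load(f)["step"]  # last step that was saved
```

The second attempt starts from the last saved step rather than from zero, so only the work since the last checkpoint is repeated.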
Failure Recovery
•  Executor failures are problematic
–  e.g. pre-emption
–  TF cluster_spec is statically-defined at startup
–  YARN does not re-allocate on same node
–  Even if possible, port may no longer be available.
•  Need dynamic cluster membership
–  Exploring options w/ TensorFlow team
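To see why a statically defined cluster_spec makes executor loss hard to recover from, consider this minimal sketch. It builds a mapping from job names to address lists, the shape that `tf.train.ClusterSpec` accepts, from a fixed set of reservations. The helper `build_cluster_spec`, the host names, and the port numbers are all hypothetical.

```python
def build_cluster_spec(reservations):
    """Group (job_name, host, port) reservations into a job -> addresses dict."""
    spec = {}
    for job_name, host, port in reservations:
        spec.setdefault(job_name, []).append("{}:{}".format(host, port))
    return spec


# Hypothetical hosts and ports, fixed once at startup.
reservations = [
    ("ps", "node1", 2222),
    ("worker", "node2", 2222),
    ("worker", "node3", 2222),
]
cluster_spec = build_cluster_spec(reservations)

# A replacement executor on a different node is not part of the original spec,
# so the surviving nodes have no way to address it.
replacement = "node4:2222"
known = any(replacement in addrs for addrs in cluster_spec.values())
```

Because every node holds the same fixed address list, a replacement on a new host or port is unknown to the others, which is the motivation for dynamic cluster membership.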
TensorBoard
TensorFlow App Development
Experimentation Phase
–  Single-node
–  Small scale data
–  TensorFlow APIs
tf.Graph
tf.Session
tf.InteractiveSession
TensorFlow App Development
Scaling Phase
–  Multi-node
–  Medium scale data (local disk)
–  Distributed TensorFlow APIs
tf.train.ClusterSpec
tf.train.Server
tf.train.Saver
tf.train.Supervisor
TensorFlow App Development
Production Phase
–  Cluster deployment
–  Upstream data pipeline
–  Model training w/ TensorFlowOnSpark APIs
TFCluster.run
TFNode.start_cluster_server
TFCluster.shutdown
–  Production inference w/ TensorFlow Serving
Example Usage

https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples
Common Gotchas
•  Single task (TF node) per executor
•  HDFS access (native libs/env)
•  Why doesn’t algorithm X scale linearly?
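For the first gotcha, Spark runs up to `spark.executor.cores / spark.task.cpus` tasks at once on each executor, so one way to get a single task per executor is to set the two values equal. The check below is a sketch only: `max_concurrent_tasks` is a hypothetical helper operating on a plain dict, and its fallback of 1 for missing keys is a simplification rather than Spark's real default in every deployment mode.

```python
def max_concurrent_tasks(conf):
    """Return how many tasks Spark may run at once on a single executor."""
    cores = int(conf.get("spark.executor.cores", 1))  # simplified default
    task_cpus = int(conf.get("spark.task.cpus", 1))
    return cores // task_cpus


# Four cores with one CPU per task lets four tasks share an executor.
shared = {"spark.executor.cores": "4", "spark.task.cpus": "1"}

# Matching the two settings leaves room for exactly one task per executor.
dedicated = {"spark.executor.cores": "4", "spark.task.cpus": "4"}
```

Calling `max_concurrent_tasks(dedicated)` gives one slot per executor, which is the condition this gotcha asks for.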
What’s New?
•  Community contributions
–  CDH compatibility
–  TFNode.DataFeed
–  Bug fixes
•  RDMA merged into TensorFlow repository
•  Registration server
•  Spark streaming
•  Pip packaging
Spark Streaming
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)
# Batch version: images = sc.textFile(args.images).map(lambda ln: parse(ln))
stream = ssc.textFileStream(args.images)
imageRDD = stream.map(lambda ln: parse(ln))
cluster = TFCluster.run(sc, map_fun, args,…)
predictionRDD = cluster.inference(imageRDD)
# Batch version: predictionRDD.saveAsTextFile(args.output)
predictionRDD.saveAsTextFiles(args.output)
ssc.start()
cluster.shutdown(ssc)
Pip packaging
pip install tensorflowonspark
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--archives ${TFoS_HOME}/tfspark.zip \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_Standalone
Next Steps
•  TF/Keras Layers
•  Failure recovery w/ dynamic cluster management (e.g. registration server)
Summary
TFoS brings deep learning to big data clusters
–  TensorFlow: 0.12 - 1.x
–  Spark: 1.6 - 2.x
–  Cluster manager: YARN, Standalone, Mesos
–  EC2 image provided
–  RDMA in TensorFlow
Thanks!

And our open-source contributors!

Questions?

https://github.com/yahoo/TensorFlowOnSpark
