Apache Spark Installation and Programming Guide
This is a step-by-step guide to installing Apache Spark. Spark can run in local mode, in
standalone mode, or under an external cluster manager such as YARN.
Standalone Deploy Mode
In this practical, you will configure Spark to run in standalone mode, where both the driver and
worker processes run on the same machine.
Since we use Java to write and run programs on Spark, ensure that Java 8 is pre-installed on
every machine on which you will run Spark jobs.
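You can confirm which Java version is installed with the standard version flag (the exact output format varies by JDK vendor):

```shell
# Prints the installed Java version; look for "1.8" (Java 8) in the output.
java -version
```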
To install Spark on the machine, download a prebuilt binary of Spark from the
http://spark.apache.org/downloads.html page.
You can also download Spark 1.6.1 directly with the following command:
wget http://mirror.fibergrid.in/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.4.tgz
Decompress the downloaded archive into the directory where you want to store Spark.
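Assuming the archive was downloaded to the current directory, a typical extraction looks like this (the target directory /mydirectory is an example; use your own path):

```shell
# Extract the Spark tarball into the chosen install directory.
tar -xzf spark-1.6.1-bin-hadoop2.4.tgz -C /mydirectory
```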
Make a softlink to the actual Spark directory (this will be helpful for any version upgrade in future):
ln -s spark-1.6.1-bin-hadoop2.4 spark
Add the following lines to your ~/.bashrc so that Spark is on your PATH:
export SPARK_HOME=/mydirectory/spark
export PATH=$SPARK_HOME/bin:$PATH
Source the changed .bashrc file with the command:
source ~/.bashrc
We have now successfully configured Spark in standalone mode. To verify the setup, launch the
Spark shell with the following command:
spark-shell
Inside the shell, check the Spark version:
sc.version
Writing a Program
Next, we will write a basic Java application that counts word occurrences in a file. Below is the
source code for the word count program in Apache Spark. Note that you need to import some Spark
classes into your program, and you need to set the path of the file to be processed.
import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;

// sc is an existing JavaSparkContext.
JavaRDD<String> textFile = sc.textFile("hdfs://...");
// Split each line into words.
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// Map each word to a (word, 1) pair.
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
// Sum the counts for each word.
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
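To run the program on your standalone installation, package it as a jar and submit it with spark-submit. A minimal sketch follows; WordCount and wordcount.jar are placeholder names for your own main class and jar file:

```shell
# Submit the word count job to Spark running locally with two worker threads.
# --class names the main class; the last argument is the application jar.
spark-submit \
  --class WordCount \
  --master local[2] \
  wordcount.jar
```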