Big Data Machine Learning using Apache Spark MLlib
Abstract—Artificial intelligence, and particularly machine learning, has been used in many ways by the research community to turn a variety of diverse and even heterogeneous data sources into high quality facts and knowledge, providing premier capabilities for accurate pattern discovery. However, applying machine learning strategies to big and complex datasets is computationally expensive, and it consumes a very large amount of logical and physical resources, such as data file space, CPU, and memory. A sophisticated platform for efficient big data analytics is becoming more important these days, as the amount of data generated on a daily basis exceeds quintillions of bytes. Apache Spark MLlib is one of the most prominent platforms for big data analysis, offering a set of excellent functionalities for different machine learning tasks ranging from regression, classification, and dimension reduction to clustering and rule extraction. In this contribution, we explore, from the computational perspective, the expanding body of the Apache Spark MLlib 2.0 as an open-source, distributed, scalable, and platform independent machine learning library. Specifically, we perform several real world machine learning experiments to examine the qualitative and quantitative attributes of the platform. Furthermore, we highlight current trends in big data machine learning research and provide insights for future work.

Index Terms—Apache Spark MLlib, Big Data Machine Learning, Big Data Analytics, Machine Learning.

I. INTRODUCTION

Digital datasets have been rapidly growing in size and complexity, and the large volume of daily generated data exceeds the boundary of normal processing capabilities, forcing us to adopt advanced computational infrastructures capable of parallel and distributed processing. Efficient mining of such massive amounts of data is an extremely challenging practice which requires more sophisticated platforms to get extensive big data analysis done accurately and in a timely fashion. Big data infrastructures have emerged to address the problem of big data analytics with fast, reliable, and scalable computational architectures, providing excellent quality attributes including elasticity, availability, and resource pooling with the ability of on-demand and easy-to-use self-services [1], [2], [3], [4]. Several big data machine learning frameworks are now available which, aside from lending practical contributions to other scientific disciplines, have demonstrated successful application in healthcare informatics [5], [6], [7], [8], [9], [10], [11], genomic data analysis [12], [13], [14], [15], text mining [16], [17], [18], [19], and stochastic modeling [20], [21], [22], just to name a few.

Apache Spark MLlib is one of the most highly demanded platform independent and open-source libraries for big data machine learning, one which benefits from a distributed architecture and automatic data parallelization. Apache Spark MLlib ships with Apache Spark [23], [24], and it offers a set of dominant functionalities for a variety of machine learning tasks, including regression, dimension reduction, classification, clustering, and rule extraction. While machine learning and its effective applications have been studied for a long time in the research community, the study of big data machine learning libraries, such as Apache Spark MLlib, has been very limited so far. This contribution is perhaps the first work that leverages a big data machine learning library, the Apache Spark MLlib 2.0, to tackle the problem of big data analytics. An objective of big data analytics is to employ advanced computational infrastructures so that large-scale data can be mined and analyzed in a timely and efficient manner. This constitutes the main motivation of the present work. Since big data analytics is computationally intensive, performance and user experience are impacted by different hardware and/or software configurations. In this paper, we evaluate the impact of different hardware and software configurations on a set of big data analysis tasks. Based on the findings of this study, we provide insights for future big data machine learning-oriented hardware, software, and model design.

This paper presents, from the computational perspective, the Apache Spark MLlib 2.0 and its capabilities and advantages for big data machine learning. We demonstrate through extensive experiments the benefits of such a highly scalable machine learning library, and highlight the insights that can be drawn by big data analytics. Using several datasets including tens of millions of data records, we demonstrate that we can reliably utilize Apache Spark MLlib for a variety of large-scale machine learning strategies ranging from big data classification to big data clustering and rule extraction. We briefly summarize our main contributions in the following:

• We utilize Apache Spark MLlib 2.0 to initiate a study of big data machine learning on massive datasets including tens of millions of data records. The first and main contribution of the paper is to introduce an exciting but
challenging area to the machine learning community. While machine learning and its applications in research and industrial communities have been studied for several years now, the study of big data machine learning has been very limited so far. The present work is expected to bridge the gap between the two areas, opening the doors for different interesting directions from the big data community to the fast-growing machine learning application area.

• We perform several large-scale real world experiments to examine a set of qualitative and quantitative attributes of Apache Spark MLlib 2.0. Furthermore, we establish a comparative study with the Weka library (Version 3.7.12) [25] (running on hadoop-2.7 [26]), a very well-known Java-based machine learning library which has been widely used in the community. We evaluate multiple common big data machine learning models, including classification and clustering on real data, and compare their performance under various hardware and software configurations.

The rest of the paper is organized as follows. We begin by introducing Apache Spark MLlib in Section 2. Section 3 explains the materials and methods, which consist of a list of Apache Spark MLlib 2.0 components we investigate, plus several large-scale datasets. We then delve into the experimental validations in Section 4. Finally, Section 5 concludes with a summary of our findings and a discussion of several possible directions for future work.
II. APACHE SPARK MLLIB 2.0

Apache Spark [24], [27], [28], a highly scalable, fast, and in-memory big data processing engine, was originally developed in the AMPLab at UC Berkeley, and it offers the ability to develop distributed applications using the Java, Python, Scala, and R programming languages. It comes with four major libraries: Apache Spark Streaming, Apache Spark SQL, Apache Spark GraphX, and Apache Spark MLlib [29]. While Apache Spark Streaming, as the core scheduling module of Spark, implements stream processing within a highly fault tolerant and batch analytics architecture, Apache Spark SQL implements relational queries to mine different database systems, introducing a data abstraction model called DataFrames [24], [30]. Apache Spark GraphX is a graph processing library on top of Apache Spark which provides distributed computational models to process two common data structures: graphs and collections. Apache Spark MLlib is a big data analytics library which provides more than 55 scalable machine learning algorithms that benefit from both data and process parallelization [31]. The library includes implementations of a variety of machine learning strategies, such as classification, clustering, regression, dimension reduction, and rule extraction, which enable easy and fast in-practice development of large-scale machine learning applications. Apache Spark MLlib also offers a set of multi-language APIs to evaluate machine learning methods, deploying several computational components that deal with optimization, latent Dirichlet allocation, linear algebra, and feature engineering pipelines [24], [31].
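To give a flavor of these APIs, the following minimal Java sketch (our own illustration rather than code from the paper; the input Dataset<Row> named trainingDF, with "text" and "label" columns, is an assumption) chains a feature engineering stage and a classifier into a single spark.ml pipeline:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;

// Tokenize raw text, hash tokens into feature vectors, then fit a classifier.
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
    .setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol())
    .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
// "trainingDF" is an assumed Dataset<Row>; fitting yields a reusable PipelineModel.
PipelineModel model = pipeline.fit(trainingDF);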
In recent years, many facets of data science solutions have been enhanced by this library [32], [33], [34], [35], [36], [37], and many machine learning scientists and engineers have entered the development of new Apache Spark MLlib components to contribute to the big data analytics community across the world. Figure 1 presents, from the development side, a pathway of the Apache Spark MLlib 2.0 in which the number of unique commits per release has been rapidly growing over the years. Apache Spark MLlib has been in very active development, and at the time of writing this paper, the number of Apache Spark MLlib contributors was above 1000 [38]. Here, we briefly review recent advances in Apache Spark MLlib applications. Tizghadam et al. [39] proposed an open source, scalable platform called CVST, intended for smart transportation application development. CVST consists of four major components for resource management, data dissemination, business intelligence, and applications. The business intelligence component, which is in charge of data analytics, uses MLlib to process the data and deliver it to the front-end. Lee et al. [40] proposed an architecture design for an academic information system providing services for analyzing students' record patterns. The proposed recommender system uses Apache Spark MLlib to predict and recommend courses for the following semester. SparkText is a text mining framework developed by Ye et al. [41]. The proposed system, which employs Apache Spark machine learning and streaming methods along with the Cassandra NoSQL database, has been executed on a big dataset of medical articles to classify cancer types. Arora [42] analyzed mobile data collected from the Web using the K-means algorithm on Apache Spark MLlib. The author offers an effective way of calculating the number of users of the network by clustering based on latitude and longitude values. Uddin et al. [43] introduced ALMD, a feature descriptor that considers motion and appearance and employs the Apache Spark machine learning library's random forest to recognize human activities. An attempt to create a framework for analyzing population structure using next generation sequencing data was made by Hryhorzhevska et al. [44]. The authors proposed a distributed computing framework using ADAM combined with MLlib, H2O, and Apache SystemML. An architecture for automatic machine learning is proposed by Sparks et al. [45]. The system consists of resource allocation, estimator tuning, and optimization components. The entire system is built upon Apache Spark, and it leverages MLlib and other Spark components. bigNN, developed by Tafti et al. [46], is another interesting big data analytics component implemented on top of Apache Spark, and it is capable of tackling very large-scale biomedical sentence classification. There still exists a list of valuable contributions on big data analytics; interested readers are referred to [40], [47], [48], [49], [50], [51], [52], [53], [54], [55] for further reading.

III. MATERIALS AND METHODS

In this section we further explain the materials and methods of the current research study. We begin with the Apache Spark MLlib components, and then introduce the datasets.
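As a brief illustration of how such datasets can be brought into MLlib, the sketch below (our own example under stated assumptions: a comma-separated file whose first field is the class label, as in the UCI HIGGS and SUSY data, with jsc being the JavaSparkContext created in Appendix A) maps raw text lines to LabeledPoint records:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Hypothetical loader: each line is "label,f1,f2,...,fn".
JavaRDD<LabeledPoint> points = jsc.textFile("data/HIGGS.csv").map(line -> {
    String[] parts = line.split(",");
    double label = Double.parseDouble(parts[0]);
    double[] features = new double[parts.length - 1];
    for (int i = 1; i < parts.length; i++) {
        features[i - 1] = Double.parseDouble(parts[i]);
    }
    return new LabeledPoint(label, Vectors.dense(features));
});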
Fig. 2: Running time of Apache Spark MLlib compared to Weka under different classification experiments ((c) results obtained on the HIGGS dataset; (d) results obtained on the SUSY dataset). NB stands for Naïve Bayes, DT stands for Decision Tree, and RF refers to Random Forest.
TABLE III: Area under ROC associated with the classification algorithms performed by Apache Spark MLlib and the Weka library.

Environment  Dataset  Algorithm      MLlib   Weka
ENV1         SUSY     SVM            0.7932  0.7829
                      Random Forest  0.7614  0.7731
                      Decision Tree  0.7528  0.7690
                      Naïve Bayes    0.7853  0.8031
             HIGGS    SVM            0.5690  0.5543
                      Random Forest  0.5680  0.5712
                      Decision Tree  0.5829  0.5873
                      Naïve Bayes    0.5741  0.5722
             FLIGHT   SVM            0.5189  0.5312
                      Random Forest  0.5305  0.5581
                      Decision Tree  0.5736  0.5819
                      Naïve Bayes    0.5570  0.5384
             HEPMASS  SVM            0.9591  0.9543
                      Random Forest  0.8948  0.8961
                      Decision Tree  0.8995  0.8971
                      Naïve Bayes    0.9114  0.9398
ENV2         SUSY     SVM            0.7932  0.7829
                      Random Forest  0.7614  0.7731
                      Decision Tree  0.7528  0.7690
                      Naïve Bayes    0.7853  0.8031
             HIGGS    SVM            0.5690  0.5543
                      Random Forest  0.5680  0.5712
                      Decision Tree  0.5829  0.5873
                      Naïve Bayes    0.5741  0.5722
             FLIGHT   SVM            0.5189  0.5312
                      Random Forest  0.5305  0.5581
                      Decision Tree  0.5736  0.5819
                      Naïve Bayes    0.5570  0.5384
             HEPMASS  SVM            0.9591  0.9543
                      Random Forest  0.8948  0.8961
                      Decision Tree  0.8995  0.8971
                      Naïve Bayes    0.9114  0.9398
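The area under the ROC curve values in Table III can be computed with MLlib's BinaryClassificationMetrics component; the minimal sketch below (our own illustration, reusing the modelSVM and test split produced by the sample code in Appendix A) shows the typical pattern:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import scala.Tuple2;

// Clear the default threshold so predict() returns raw scores instead of 0/1 labels.
modelSVM.clearThreshold();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(p ->
    new Tuple2<Object, Object>(modelSVM.predict(p.features()), p.label()));
BinaryClassificationMetrics metrics =
    new BinaryClassificationMetrics(scoreAndLabels.rdd());
double auROC = metrics.areaUnderROC();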
TABLE IV: The general performance of the Weka and Apache Spark MLlib K-Means algorithm.

Dataset      MLlib                         Weka
HETROACT I   Time: 0 min, 48 sec           Time: 11 min, 40 sec
             Number of Clusters: 5         Number of Clusters: 5
             SSE: 1.2403975185657699E34    SSE: 1.2419981124641322E34
HETROACT II  Time: 19 min, 57 sec          Time: 53 min, 16 sec
             Number of Clusters: 5         Number of Clusters: 5
             SSE: 1.2312227116685513E35    SSE: 1.2619357121584791E35
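The K-means runs summarized in Table IV follow the standard MLlib pattern; a minimal sketch (our own illustration; the input path, feature parsing, k = 5, and the iteration count are assumptions) that also computes the within-set sum of squared errors (SSE) reported above is:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Parse whitespace-separated feature rows into dense vectors (hypothetical input file).
JavaRDD<Vector> vectors = jsc.textFile("data/HETROACT.txt").map(line -> {
    String[] parts = line.split(" ");
    double[] values = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
        values[i] = Double.parseDouble(parts[i]);
    }
    return Vectors.dense(values);
});
vectors.cache();

// Cluster into k = 5 groups with up to 20 iterations.
KMeansModel clusters = KMeans.train(vectors.rdd(), 5, 20);
// Within-set sum of squared errors, the SSE measure reported in Table IV.
double sse = clusters.computeCost(vectors.rdd());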
the detailed parameters of each algorithm in randomly selecting train and test instances while performing 4-fold cross validation, or they could come from the very detailed internal parameters of each algorithm.

• The comparative study demonstrates that Apache Spark MLlib, as expected, is faster than the Weka components we have utilized; performing a t-test on the running times, matched by either classification algorithm or the clustering method, shows statistically significant differences (at p < 0.01) between Apache Spark MLlib and Weka.

solve the problem of pattern discovery from large-scale data sources more efficiently, and Apache Spark MLlib is one of the most widely-used big data machine learning libraries available to the community.

Apache Spark MLlib is a very strong tool for big data analytics, and as the results of our study show, it presents spectacular performance in terms of running time. Weka, on the other hand, works more slowly than Apache Spark MLlib under big loads of data samples. Considering the different file systems and configurations used by Apache Spark MLlib and Weka, this comparison may not be deemed entirely fair: we ran MLlib on the Spark distributed file system and used the Hadoop distributed file system for Weka. But our goal is to show how Spark performs with large datasets, and Weka is used here as a reliable baseline accepted by the research community. There are several features of Weka with which Spark cannot compete: (1) a big pool of documents and resources is available for users, (2) it is very straightforward and easy to use for non-expert users, and (3) it has an excellent graphical user interface. Weka also supports a vast variety of machine learning algorithms.

Apache Spark MLlib offers fast, flexible, and scalable implementations of a variety of machine learning components, ranging from ensemble learning and principal component analysis (PCA) to optimization and clustering analysis. Apache Spark MLlib also offers options for distributed processing through parallelization and support for big data tools that utilize distributed architectures. These criteria decrease the processing time required and, at the same time, increase the time available to interpret analytics results. This becomes very important when the machine learning task has many predictions to calculate. The distributed architecture can also take advantage of some of the big data tool sets available to help break apart the machine learning component and improve overall running time. Integration is another advantage of Apache Spark MLlib, meaning that MLlib benefits from several software components available in the Spark ecosystem, such as Spark GraphX, Spark SQL, and Spark Streaming; in addition, a wide range of well-organized documentation, including code samples, is publicly and freely available to the machine learning community.

Our future work will focus on adding more practical experiments with most of the Apache Spark MLlib components by utilizing a variety of bigger datasets. As part of our future work, we are also working to run an experimental evaluation of Apache Spark MLlib under a diversity of programming languages (e.g., Python and R), clusters, and hardware and/or software configurations, employing a set of big datasets with a variety of characteristics within the data.
[4] A. Abbasi, S. Sarker, and R. Chiang, "Big data research in information systems: Toward an inclusive research agenda," Journal of the Association for Information Systems, vol. 17, no. 2, p. 3, 2016.
[5] J. Archenaa and E. M. Anita, "Interactive big data management in healthcare using spark," in Proceedings of the 3rd International Symposium on Big Data and Cloud Computing Challenges (ISBCC-16). Springer, 2016, pp. 265–272.
[6] R. Pita, C. Pinto, P. Melo, M. Silva, M. Barreto, and D. Rasella, "A spark-based workflow for probabilistic record linkage of healthcare data," in EDBT/ICDT Workshops, 2015, pp. 17–26.
[7] I. a. Q. M. J. How, D. I. M. H. I. Leave, Y. S. I. F. Question, and T. Y. Why, "A case study: Apache spark."
[8] A. P. Tafti, E. LaRose, J. C. Badger, R. Kleiman, and P. Peissig, "Machine learning-as-a-service and its application to medical informatics," in International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer, 2017, pp. 206–219.
[9] R. Fang, S. Pouyanfar, Y. Yang, S.-C. Chen, and S. Iyengar, "Computational health informatics in the big data age: a survey," ACM Computing Surveys (CSUR), vol. 49, no. 1, p. 12, 2016.
[10] J. D. Van Horn, "Opinion: Big data biomedicine offers big higher education opportunities," Proceedings of the National Academy of Sciences, vol. 113, no. 23, pp. 6322–6324, 2016.
[11] Z. Lv, J. Chirivella, and P. Gagliardo, "Bigdata oriented multimedia mobile health applications," Journal of Medical Systems, vol. 40, no. 5, p. 120, 2016.
[12] M. S. Wiewiórka, A. Messina, A. Pacholewska, S. Maffioletti, P. Gawrysiak, and M. J. Okoniewski, "Sparkseq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision," Bioinformatics, p. btu343, 2014.
[13] M. Masseroli, P. Pinoli, F. Venco, A. Kaitoua, V. Jalili, F. Palluzzi, H. Muller, and S. Ceri, "Genometric query language: a novel approach to large-scale genomic data management," Bioinformatics, vol. 31, no. 12, pp. 1881–1888, 2015.
[14] D. Ding, D. Wu, and F. Yu, "An overview on cloud computing platform spark for human genome mining," in Mechatronics and Automation (ICMA), 2016 IEEE International Conference on. IEEE, 2016, pp. 2605–2610.
[15] M. Syed, "Using apache spark for scalable gene sequence analysis," Ph.D. dissertation, Texas A&M University-Commerce, 2016.
[16] J. Ryan, "Rapidminer for text analytic fundamentals," Text Mining and Visualization: Case Studies Using Open-Source Tools, vol. 40, p. 1, 2016.
[17] C. Rong et al., "Using mahout for clustering wikipedia's latest articles: a comparison between k-means and fuzzy c-means in the cloud," in Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, 2011, pp. 565–569.
[18] A. P. Tafti, J. Badger, E. LaRose, E. Shirzadi, A. Mahanke, J. Mayer, Z. Ye, D. Page, and P. Peissig, "Adverse drug event discovery using biomedical literature: a big data neural network adventure," JMIR Medical Informatics, 2017.
[19] A. García-Pablos, M. Cuadros, and G. Rigau, "V3: Unsupervised aspect based sentiment analysis for semeval-2015 task 12," SemEval-2015, p. 714, 2015.
[20] H. S. Bhat, R. Madushani, and S. Rawat, "Scalable sde filtering and inference with apache spark," Journal of Machine Learning Research W&CP, 2016.
[21] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska, "Mli: An api for distributed machine learning," in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1187–1192.
[22] H. Ji, S. H. Weinberg, M. Li, J. Wang, and Y. Li, "An apache spark implementation of block power method for computing dominant eigenvalues and eigenvectors of large-scale matrices," in Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), 2016 IEEE International Conferences on. IEEE, 2016, pp. 554–559.
[23] "Apache spark," http://spark.apache.org/.
[24] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2.
[25] "Weka library," http://www.cs.waikato.ac.nz/ml/weka/.
[26] "Apache hadoop releases," http://hadoop.apache.org/releases.html.
[27] "Apache spark mllib," http://spark.apache.org/mllib/.
[28] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," HotCloud, vol. 10, no. 10-10, p. 95, 2010.
[29] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin et al., "Apache spark: a unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[30] M. Frampton, Mastering Apache Spark. Packt Publishing Ltd, 2015.
[31] X. Meng, J. Bradley, B. Yuvaz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., "Mllib: Machine learning in apache spark," JMLR, vol. 17, no. 34, pp. 1–7, 2016.
[32] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, "Automating model search for large scale machine learning," in Proceedings of the Sixth ACM Symposium on Cloud Computing. ACM, 2015, pp. 368–380.
[33] F. Liang, C. Feng, X. Lu, and Z. Xu, "Performance benefits of datampi: a case study with bigdatabench," in Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Springer, 2014, pp. 111–123.
[34] M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin, and M. Zaharia, "Scaling spark in the real world: performance and usability," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1840–1843, 2015.
[35] E. R. Sparks, A. Talwalkar, M. J. Franklin, M. I. Jordan, and T. Kraska, "Tupaq: An efficient planner for large-scale predictive analytic queries," arXiv preprint arXiv:1502.00068, 2015.
[36] C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin, "Large-scale logistic regression and linear support vector machines using spark," in Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014, pp. 519–528.
[37] A. G. Shoro and T. R. Soomro, "Big data analysis: Apache spark perspective," Global Journal of Computer Science and Technology, vol. 15, no. 1, 2015.
[38] "Github - apache spark," https://github.com/apache/spark/.
[39] A. Tizghadam and A. Leon-Garcia, "Application platform for smart transportation," in Future Access Enablers of Ubiquitous and Intelligent Infrastructures. Springer, 2015, pp. 26–32.
[40] M.-S. Lee, E. Kim, C.-S. Nam, and D.-R. Shin, "Design of educational big data application using spark," in Advanced Communication Technology (ICACT), 2017 19th International Conference on. IEEE, 2017, pp. 355–357.
[41] Z. Ye, A. P. Tafti, K. Y. He, K. Wang, and M. M. He, "Sparktext: Biomedical text mining on big data framework," PLoS ONE, vol. 11, no. 9, p. e0162721, 2016.
[42] S. Arora, "Analyzing mobile phone usage using clustering in spark mllib and pig," International Journal, vol. 8, no. 1, 2017.
[43] M. A. Uddin, J. bibi Joolee, A. Alam, and Y.-K. Lee, "Human action recognition using adaptive local motion descriptor in spark," IEEE Access, 2017.
[44] A. Hryhorzhevska, M. Wiewiórka, M. Okoniewski, and T. Gambin, "Scalable framework for the analysis of population structure using the next generation sequencing data," in International Symposium on Methodologies for Intelligent Systems. Springer, 2017, pp. 471–480.
[45] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, "Automating model search for large scale machine learning," in Proceedings of the Sixth ACM Symposium on Cloud Computing. ACM, 2015, pp. 368–380.
[46] A. P. Tafti, E. Behravesh, M. Assefi, E. LaRose, J. Badger, J. Mayer, A. Doan, D. Page, and P. Peissig, "bignn: an open-source big data toolkit focused on biomedical sentence classification," in Proceedings of the IEEE BIG DATA, 2017.
[47] R. Shyam, B. G. HB, S. Kumar, P. Poornachandran, and K. Soman, "Apache spark a big data analytics platform for smart grid," Procedia Technology, vol. 21, pp. 171–178, 2015.
[48] S. Gopalani and R. Arora, "Comparing apache spark and map reduce with performance analysis using k-means," International Journal of Computer Applications, vol. 113, no. 1, 2015.
[49] N. Yousefi, M. Georgiopoulos, and G. C. Anagnostopoulos, "Multi-task learning with group-specific feature space sharing," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 120–136.
[50] M. S. Fazli, S. A. Vella, S. N. Moreno, and S. Quinn, "Computational motility tracking of calcium dynamics in toxoplasma gondii," arXiv preprint arXiv:1708.01871, 2017.
[51] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut, "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919, 2017.
[52] A. Gandomi and M. Haider, "Beyond the hype: Big data concepts, methods, and analytics," International Journal of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[53] W. Raghupathi and V. Raghupathi, "Big data analytics in healthcare: promise and potential," Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
[54] Z. Lv, H. Song, P. Basanta-Val, A. Steed, and M. Jo, "Next-generation big data analytics: State of the art, challenges, and future research topics," IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1891–1899, 2017.
[55] M. Pecaric, K. Boutis, J. Beckstead, and M. Pusic, "A big data and learning analytics approach to process-level feedback in cognitive simulations," Academic Medicine, vol. 92, no. 2, pp. 175–184, 2017.
[56] "Uci machine learning repository," http://archive.ics.uci.edu/ml/index.html.
[57] "Us government's bureau of transportation research and innovative technology administration (rita)," http://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.
[58] "Hepmass dataset," http://archive.ics.uci.edu/ml/datasets/HEPMASS.
[59] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, "Parameterized machine learning for high-energy physics," arXiv preprint arXiv:1601.07913, 2016.
[60] "Susy dataset," http://archive.ics.uci.edu/ml/datasets/SUSY.
[61] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for exotic particles in high-energy physics with deep learning," Nature Communications, vol. 5, 2014.
[62] "Higgs dataset," http://archive.ics.uci.edu/ml/datasets/HIGGS.
[63] "Heterogeneity activity recognition dataset," http://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition.
[64] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, "Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition," in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015, pp. 127–140.
[65] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Elsevier, 2011.
[66] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (roc) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[67] S. Venkataraman, Z. Yang, M. J. Franklin, B. Recht, and I. Stoica, "Ernest: Efficient performance prediction for large-scale advanced analytics," in NSDI, 2016, pp. 363–378.
[68] S. Ma and Z. Liang, "Design and implementation of smart city big data processing platform based on distributed architecture," in Intelligent Systems and Knowledge Engineering (ISKE), 2015 10th International Conference on. IEEE, 2015, pp. 428–433.
[69] "Spark standalone mode," https://spark.apache.org/docs/2.1.0/spark-standalone.html.
APPENDIX A
HOW TO RUN APACHE SPARK ON A CLUSTER

The SparkContext object in the main program, which is also called the driver program, is utilized to run Apache Spark applications on a cluster environment.

public static void main(String[] args) {
    // Note: local[*] runs Spark locally; on a cluster, pass the
    // master URL (e.g., spark://HOST:PORT) instead.
    SparkConf sparkConf = new SparkConf().setAppName("JavaSample");
    sparkConf.setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

The SparkContext is able to connect to a variety of cluster managers, such as YARN and Mesos, which manage the allocation of resources across different applications. Once connected, Apache Spark procures executors on available nodes in the cluster environment. It then sends the application code (e.g., JAR files) to the executors, and finally, the SparkContext sends tasks to the executors to run.

Spark Standalone Mode

Apache Spark also provides an easy-to-use standalone deployment mode, in which we can put a compiled version of Apache Spark on every single node of the cluster environment [69]. In this Appendix, as an example, we use only two nodes, a master node and a slave node, and assume the Java programming language is employed. In doing so, we have to perform the following steps:

(1) Install Java on both nodes.
(2) Set the JAVA_HOME environment variable on both nodes.
(3) Make sure both nodes can ping each other by host-name.

We can then start a standalone master server by:

./sbin/start-master.sh

The master node should print a spark://HOST:PORT URL, which we can use to connect workers to the master node, or pass as the master argument to the SparkContext. We are able to start one or more worker nodes and connect them to the master using:

./sbin/start-slave.sh <master-spark-URL>

Further information is available at [69]. Here, we also present sample code showing how to use Apache Spark MLlib's SVM and Decision Tree classification components:

// SVM
// Load LibSVM-formatted data and draw a 75% sample for training.
String path = Resource.getPath("data/HEPMASS.txt");
JavaRDD<LabeledPoint> data =
    MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
JavaRDD<LabeledPoint> training = data.sample(false, 0.75, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);
int numIterations = 120;
// Train a linear SVM with stochastic gradient descent.
final SVMModel modelSVM =
    SVMWithSGD.train(training.rdd(), numIterations);

// Decision Tree
// Randomly split the data: 75% training, 25% testing.
JavaRDD<LabeledPoint>[] splits =
    data.randomSplit(new double[]{0.75, 0.25});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];
Integer numClasses = 3;
// Empty map: all features are treated as continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
String impurity = "gini";
Integer maxDepth = 5;
Integer maxBins = 32;

final DecisionTreeModel modelDT =
    DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins);
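The sample code above trains the two models but does not score them; a short sketch of a typical follow-up evaluation step (our own addition, reusing the modelDT and testData split defined above) might look as follows:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Predict on the held-out split and compute the fraction of misclassified points.
JavaPairRDD<Double, Double> predictionAndLabel =
    testData.mapToPair(p ->
        new Tuple2<>(modelDT.predict(p.features()), p.label()));
double testErr =
    predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count()
        / (double) testData.count();
System.out.println("Decision Tree test error: " + testErr);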