Big Data Machine Learning using Apache Spark MLlib
Abstract—Artificial intelligence, and particularly machine learning, has been used in many ways by the research community to turn a variety of diverse and even heterogeneous data sources into high quality facts and knowledge, providing premier capabilities for accurate pattern discovery. However, applying machine learning strategies to big and complex datasets is computationally expensive, and it consumes a very large amount of logical and physical resources, such as data file space, CPU, and memory. A sophisticated platform for efficient big data analytics is becoming more important these days, as the amount of data generated on a daily basis exceeds quintillions of bytes. Apache Spark MLlib is one of the most prominent platforms for big data analysis, offering a set of excellent functionalities for different machine learning tasks ranging from regression, classification, and dimension reduction to clustering and rule extraction. In this contribution, we explore, from the computational perspective, the expanding body of the Apache Spark MLlib 2.0 as an open-source, distributed, scalable, and platform independent machine learning library. Specifically, we perform several real world machine learning experiments to examine the qualitative and quantitative attributes of the platform. Furthermore, we highlight current trends in big data machine learning research and provide insights for future work.

Index Terms—Apache Spark MLlib, Big Data Machine Learning, Big Data Analytics, Machine Learning.

I. INTRODUCTION

Digital datasets have been rapidly growing in size and complexity, and the large volume of daily generated data exceeds the boundary of normal processing capabilities, forcing us to adopt advanced computational infrastructures capable of parallel and distributed processing. Efficient mining of such massive amounts of data is an extremely challenging practice which requires more sophisticated platforms to get extensive big data analysis done accurately and in a timely fashion. Big data infrastructures have emerged to address the problem of big data analytics with fast, reliable, and scalable computational architectures, providing excellent quality attributes including elasticity, availability, and resource pooling with the ability of on-demand and easy-to-use self-services [1], [2], [3], [4]. Several big data machine learning frameworks are now available which, aside from lending practical contributions to other scientific disciplines, have demonstrated successful application in healthcare informatics [5], [6], [7], [8], [9], [10], [11], genomic data analysis [12], [13], [14], [15], text mining [16], [17], [18], [19], and stochastic modeling [20], [21], [22], just to name a few.

Apache Spark MLlib is one of the most highly demanded platform independent and open-source libraries for big data machine learning, one which benefits from a distributed architecture and automatic data parallelization. Apache Spark MLlib ships with Apache Spark [23], [24], and it offers a set of dominant functionalities for a variety of machine learning tasks, including regression, dimension reduction, classification, clustering, and rule extraction. While machine learning and its effective applications have been studied for a long time in the research community, the study of big data machine learning libraries, such as Apache Spark MLlib, has been very limited so far. This contribution is perhaps the first work that leverages a big data machine learning library, the Apache Spark MLlib 2.0, to tackle the problem of big data analytics. An objective of big data analytics is to employ advanced computational infrastructures so that large-scale data can be mined and analyzed in a timely and efficient manner. This constitutes the main motivation of the present work. Since big data analytics is computationally intensive, performance and user experience are impacted by different hardware and/or software configurations. In this paper, we evaluate the impact of different hardware and software configurations on a set of big data analysis tasks. Based on the findings of this study, we provide insights for future big data machine learning-oriented hardware, software, and model design.

This paper presents, from the computational perspective, the Apache Spark MLlib 2.0 and its capabilities and advantages for big data machine learning. We demonstrate through extensive experiments the benefits of such a highly scalable machine learning library, and highlight the insights that can be drawn by big data analytics. Using several datasets including tens of millions of data records, we demonstrate that we can reliably utilize Apache Spark MLlib for a variety of large-scale machine learning strategies ranging from big data classification to big data clustering and rule extraction. We briefly summarize our main contributions in the following:

• We utilize Apache Spark MLlib 2.0 to initiate a study of big data machine learning on massive datasets including tens of millions of data records. The first and main contribution of the paper is to introduce an exciting but
challenging area to the machine learning community. While machine learning and its applications in research and industrial communities have been studied for several years now, the study of big data machine learning has been very limited so far. The present work is expected to bridge the gap between the two areas, opening the doors for different interesting directions from the big data community to the fast-growing machine learning application area.

• We perform several large-scale real world experiments to examine a set of qualitative and quantitative attributes of Apache Spark MLlib 2.0. Furthermore, we establish a comparative study with the Weka library (Version 3.7.12) [25] (running on hadoop-2.7 [26]), a very well-known Java-based machine learning library which has been widely used in the community. We evaluate multiple common big data machine learning models, including classification and clustering on real data, and compare their performance under various hardware and software configurations.

The rest of the paper is organized as follows. We begin by introducing Apache Spark MLlib in Section 2. Section 3 explains the materials and methods, which consist of a list of Apache Spark MLlib 2.0 components we investigate, plus several large-scale datasets. We then delve into the experimental validations in Section 4. Finally, Section 5 concludes with a summary of our findings and a discussion of several possible directions for future work.
II. APACHE SPARK MLLIB 2.0

Apache Spark [24], [27], [28], a highly scalable, fast, and in-memory big data processing engine, was originally developed in the AMPLab at UC Berkeley, and it offers the ability to develop distributed applications using the Java, Python, Scala, and R programming languages. It comes with four major libraries: Apache Spark Streaming, Apache Spark SQL, Apache Spark GraphX, and Apache Spark MLlib [29]. While Apache Spark Streaming, as the core scheduling module of Spark, implements stream processing within a highly fault tolerant and batch analytics architecture, Apache Spark SQL implements relational queries to mine different database systems, introducing a data abstraction model called DataFrames [24], [30]. Apache Spark GraphX is a graph processing library on top of Apache Spark which provides distributed computational models to process two common data structures: graphs and collections. Apache Spark MLlib is a big data analytics library which provides more than 55 scalable machine learning algorithms that benefit from both data and process parallelization [31]. The library includes implementations of a variety of machine learning strategies, such as classification, clustering, regression, dimension reduction, and rule extraction, which enable easy and fast in-practice development of large-scale machine learning applications. Apache Spark MLlib also offers a set of multi-language APIs to evaluate machine learning methods, deploying several computational components that deal with optimization, latent Dirichlet allocation, linear algebra, and feature engineering pipelines [24], [31].
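To give a flavor of these APIs, the following minimal Java sketch (our own illustration rather than code from the paper; the input Dataset<Row> named trainingDF, with "text" and "label" columns, is an assumption) chains a feature engineering stage and a classifier into a single spark.ml pipeline:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;

// Tokenize raw text, hash tokens into feature vectors, then fit a classifier.
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
    .setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol())
    .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
// "trainingDF" is an assumed Dataset<Row>; fitting yields a reusable PipelineModel.
PipelineModel model = pipeline.fit(trainingDF);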
In recent years, many facets of data science solutions have been enhanced by this library [32], [33], [34], [35], [36], [37], and many machine learning scientists and engineers have entered the development of new Apache Spark MLlib components to contribute to the big data analytics community across the world. Figure 1 presents, from the development side, a pathway of the Apache Spark MLlib 2.0 in which the number of unique commits per release has been rapidly growing over the years. Apache Spark MLlib has been in very active development, and at the time of writing this paper, the number of Apache Spark MLlib contributors was above 1000 [38]. Here, we briefly review recent advances in Apache Spark MLlib applications. Tizghadam et al. [39] proposed an open source, scalable platform called CVST, intended for smart transportation application development. CVST consists of four major components for resource management, data dissemination, business intelligence, and applications. The business intelligence component, which is in charge of data analytics, uses MLlib to process the data and deliver it to the front-end. Lee et al. [40] proposed an architecture design for an academic information system providing services for analyzing students' record patterns. The proposed recommender system uses Apache Spark MLlib to predict and recommend courses for the following semester. SparkText is a text mining framework developed by Ye et al. [41]. The proposed system, which employs Apache Spark machine learning and streaming methods along with the Cassandra NoSQL database, has been executed on a big dataset of medical articles to classify cancer types. Arora [42] analyzed mobile data collected from the Web using the K-means algorithm on Apache Spark MLlib. The author offers an effective way of calculating the number of users of the network by clustering based on latitude and longitude values. Uddin et al. [43] introduced ALMD, a feature descriptor that considers motion and appearance and employs the Apache Spark machine learning library's random forest to recognize human activities. An attempt to create a framework for analyzing population structure using next generation sequencing data was made by Hryhorzhevska et al. [44]. The authors proposed a distributed computing framework using ADAM combined with MLlib, H2O, and Apache SystemML. An architecture for automatic machine learning is proposed by Sparks et al. [45]. The system consists of resource allocation, estimator tuning, and optimization components. The entire system is built upon Apache Spark, and it leverages MLlib and other Spark components. bigNN, developed by Tafti et al. [46], is another interesting big data analytics component implemented on top of Apache Spark, and it is capable of tackling very large-scale biomedical sentence classification. There still exists a list of valuable contributions on big data analytics; interested readers are referred to [40], [47], [48], [49], [50], [51], [52], [53], [54], [55] for further reading.

III. MATERIALS AND METHODS

In this section we further explain the materials and methods of the current research study. We begin with the Apache Spark MLlib components, and then introduce the datasets.
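As a brief illustration of how such datasets can be brought into MLlib, the sketch below (our own example under stated assumptions: a comma-separated file whose first field is the class label, as in the UCI HIGGS and SUSY data, with jsc being the JavaSparkContext created in Appendix A) maps raw text lines to LabeledPoint records:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Hypothetical loader: each line is "label,f1,f2,...,fn".
JavaRDD<LabeledPoint> points = jsc.textFile("data/HIGGS.csv").map(line -> {
    String[] parts = line.split(",");
    double label = Double.parseDouble(parts[0]);
    double[] features = new double[parts.length - 1];
    for (int i = 1; i < parts.length; i++) {
        features[i - 1] = Double.parseDouble(parts[i]);
    }
    return new LabeledPoint(label, Vectors.dense(features));
});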
Fig. 2: Running time of Apache Spark MLlib compared to Weka under different classification experiments ((c) results obtained on the HIGGS dataset; (d) results obtained on the SUSY dataset). NB stands for Naïve Bayes, DT stands for Decision Tree, and RF refers to Random Forest.
TABLE III: Area under ROC associated with the classification algorithms performed by Apache Spark MLlib and the Weka library.

Environment  Dataset  Algorithm      MLlib   Weka
ENV1         SUSY     SVM            0.7932  0.7829
                      Random Forest  0.7614  0.7731
                      Decision Tree  0.7528  0.7690
                      Naïve Bayes    0.7853  0.8031
             HIGGS    SVM            0.5690  0.5543
                      Random Forest  0.5680  0.5712
                      Decision Tree  0.5829  0.5873
                      Naïve Bayes    0.5741  0.5722
             FLIGHT   SVM            0.5189  0.5312
                      Random Forest  0.5305  0.5581
                      Decision Tree  0.5736  0.5819
                      Naïve Bayes    0.5570  0.5384
             HEPMASS  SVM            0.9591  0.9543
                      Random Forest  0.8948  0.8961
                      Decision Tree  0.8995  0.8971
                      Naïve Bayes    0.9114  0.9398
ENV2         SUSY     SVM            0.7932  0.7829
                      Random Forest  0.7614  0.7731
                      Decision Tree  0.7528  0.7690
                      Naïve Bayes    0.7853  0.8031
             HIGGS    SVM            0.5690  0.5543
                      Random Forest  0.5680  0.5712
                      Decision Tree  0.5829  0.5873
                      Naïve Bayes    0.5741  0.5722
             FLIGHT   SVM            0.5189  0.5312
                      Random Forest  0.5305  0.5581
                      Decision Tree  0.5736  0.5819
                      Naïve Bayes    0.5570  0.5384
             HEPMASS  SVM            0.9591  0.9543
                      Random Forest  0.8948  0.8961
                      Decision Tree  0.8995  0.8971
                      Naïve Bayes    0.9114  0.9398
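The area under the ROC curve values in Table III can be computed with MLlib's BinaryClassificationMetrics component; the minimal sketch below (our own illustration, reusing the modelSVM and test split produced by the sample code in Appendix A) shows the typical pattern:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import scala.Tuple2;

// Clear the default threshold so predict() returns raw scores instead of 0/1 labels.
modelSVM.clearThreshold();
JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map(p ->
    new Tuple2<Object, Object>(modelSVM.predict(p.features()), p.label()));
BinaryClassificationMetrics metrics =
    new BinaryClassificationMetrics(scoreAndLabels.rdd());
double auROC = metrics.areaUnderROC();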
TABLE IV: The general performance of the Weka and Apache Spark MLlib K-Means algorithm.

Dataset      MLlib                         Weka
HETROACT I   Time: 0 min, 48 sec           Time: 11 min, 40 sec
             Number of Clusters: 5         Number of Clusters: 5
             SSE: 1.2403975185657699E34    SSE: 1.2419981124641322E34
HETROACT II  Time: 19 min, 57 sec          Time: 53 min, 16 sec
             Number of Clusters: 5         Number of Clusters: 5
             SSE: 1.2312227116685513E35    SSE: 1.2619357121584791E35
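The K-means runs summarized in Table IV follow the standard MLlib pattern; a minimal sketch (our own illustration; the input path, feature parsing, k = 5, and the iteration count are assumptions) that also computes the within-set sum of squared errors (SSE) reported above is:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Parse whitespace-separated feature rows into dense vectors (hypothetical input file).
JavaRDD<Vector> vectors = jsc.textFile("data/HETROACT.txt").map(line -> {
    String[] parts = line.split(" ");
    double[] values = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
        values[i] = Double.parseDouble(parts[i]);
    }
    return Vectors.dense(values);
});
vectors.cache();

// Cluster into k = 5 groups with up to 20 iterations.
KMeansModel clusters = KMeans.train(vectors.rdd(), 5, 20);
// Within-set sum of squared errors, the SSE measure reported in Table IV.
double sse = clusters.computeCost(vectors.rdd());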
the detailed parameters of each algorithm in randomly selecting train and test instances while performing 4-fold cross validation, or they could come from the very detailed internal parameters of each algorithm.

• The comparative study demonstrates that Apache Spark MLlib, as expected, is faster than the Weka components we have utilized; performing a t-test on the running times, matched by either classification algorithm or the clustering method, shows statistically significant differences (at p < 0.01) between Apache Spark MLlib and Weka.

solve the problem of pattern discovery from large-scale data sources more efficiently, and Apache Spark MLlib is one of the most widely-used big data machine learning libraries available to the community.

Apache Spark MLlib is a very strong tool for big data analytics, and as the results of our study show, it presents spectacular performance in terms of running time. Weka, on the other hand, works more slowly than Apache Spark MLlib under big loads of data samples. Considering the different file systems and configurations used by Apache Spark MLlib and Weka, this comparison may not be deemed entirely fair: we ran MLlib on the Spark distributed file system and used the Hadoop distributed file system for Weka. But our goal is to show how Spark performs with large datasets, and Weka is used here as a reliable baseline accepted by the research community. There are several features of Weka with which Spark cannot compete: (1) a big pool of documents and resources is available for users, (2) it is very straightforward and easy to use for non-expert users, and (3) it has an excellent graphical user interface. Weka also supports a vast variety of machine learning algorithms.

Apache Spark MLlib offers fast, flexible, and scalable implementations of a variety of machine learning components, ranging from ensemble learning and principal component analysis (PCA) to optimization and clustering analysis. Apache Spark MLlib also offers options for distributed processing through parallelization and support for big data tools that utilize distributed architectures. These criteria decrease the processing time required and, at the same time, increase the time available to interpret analytics results. This becomes very important when the machine learning task has many predictions to calculate. The distributed architecture can also take advantage of some of the big data tool sets available to help break apart the machine learning component and improve overall running time. Integration is another advantage of Apache Spark MLlib, meaning that MLlib benefits from several software components available in the Spark ecosystem, such as Spark GraphX, Spark SQL, and Spark Streaming; in addition, a wide range of well-organized documentation, including code samples, is publicly and freely available to the machine learning community.

Our future work will focus on adding more practical experiments with most of the Apache Spark MLlib components by utilizing a variety of bigger datasets. As part of our future work, we are also working to run an experimental evaluation of Apache Spark MLlib under a diversity of programming languages (e.g., Python and R), clusters, and hardware and/or software configurations, employing a set of big datasets with a variety of characteristics within the data.
[4] A. Abbasi, S. Sarker, and R. Chiang, "Big data research in information systems: Toward an inclusive research agenda," Journal of the Association for Information Systems, vol. 17, no. 2, p. 3, 2016.
[5] J. Archenaa and E. M. Anita, "Interactive big data management in healthcare using spark," in Proceedings of the 3rd International Symposium on Big Data and Cloud Computing Challenges (ISBCC-16). Springer, 2016, pp. 265–272.
[6] R. Pita, C. Pinto, P. Melo, M. Silva, M. Barreto, and D. Rasella, "A spark-based workflow for probabilistic record linkage of healthcare data," in EDBT/ICDT Workshops, 2015, pp. 17–26.
[7] I. a. Q. M. J. How, D. I. M. H. I. Leave, Y. S. I. F. Question, and T. Y. Why, "A case study: Apache spark."
[8] A. P. Tafti, E. LaRose, J. C. Badger, R. Kleiman, and P. Peissig, "Machine learning-as-a-service and its application to medical informatics," in International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer, 2017, pp. 206–219.
[9] R. Fang, S. Pouyanfar, Y. Yang, S.-C. Chen, and S. Iyengar, "Computational health informatics in the big data age: a survey," ACM Computing Surveys (CSUR), vol. 49, no. 1, p. 12, 2016.
[10] J. D. Van Horn, "Opinion: Big data biomedicine offers big higher education opportunities," Proceedings of the National Academy of Sciences, vol. 113, no. 23, pp. 6322–6324, 2016.
[11] Z. Lv, J. Chirivella, and P. Gagliardo, "Bigdata oriented multimedia mobile health applications," Journal of Medical Systems, vol. 40, no. 5, p. 120, 2016.
[12] M. S. Wiewiórka, A. Messina, A. Pacholewska, S. Maffioletti, P. Gawrysiak, and M. J. Okoniewski, "Sparkseq: fast, scalable, cloud-ready tool for the interactive genomic data analysis with nucleotide precision," Bioinformatics, p. btu343, 2014.
[13] M. Masseroli, P. Pinoli, F. Venco, A. Kaitoua, V. Jalili, F. Palluzzi, H. Muller, and S. Ceri, "Genometric query language: a novel approach to large-scale genomic data management," Bioinformatics, vol. 31, no. 12, pp. 1881–1888, 2015.
[14] D. Ding, D. Wu, and F. Yu, "An overview on cloud computing platform spark for human genome mining," in Mechatronics and Automation (ICMA), 2016 IEEE International Conference on. IEEE, 2016, pp. 2605–2610.
[15] M. Syed, "Using apache spark for scalable gene sequence analysis," Ph.D. dissertation, Texas A&M University-Commerce, 2016.
[16] J. Ryan, "Rapidminer for text analytic fundamentals," Text Mining and Visualization: Case Studies Using Open-Source Tools, vol. 40, p. 1, 2016.
[17] C. Rong et al., "Using mahout for clustering wikipedia's latest articles: a comparison between k-means and fuzzy c-means in the cloud," in Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, 2011, pp. 565–569.
[18] A. P. Tafti, J. Badger, E. LaRose, E. Shirzadi, A. Mahanke, J. Mayer, Z. Ye, D. Page, and P. Peissig, "Adverse drug event discovery using biomedical literature: a big data neural network adventure," JMIR Medical Informatics, 2017.
[19] A. García-Pablos, M. Cuadros, and G. Rigau, "V3: Unsupervised aspect based sentiment analysis for semeval-2015 task 12," SemEval-2015, p. 714, 2015.
[20] H. S. Bhat, R. Madushani, and S. Rawat, "Scalable sde filtering and inference with apache spark," Journal of Machine Learning Research W&CP, 2016.
[21] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska, "Mli: An api for distributed machine learning," in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1187–1192.
[22] H. Ji, S. H. Weinberg, M. Li, J. Wang, and Y. Li, "An apache spark implementation of block power method for computing dominant eigenvalues and eigenvectors of large-scale matrices," in Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), 2016 IEEE International Conferences on. IEEE, 2016, pp. 554–559.
[23] "Apache spark," http://spark.apache.org/.
[24] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2.
[25] "Weka library," http://www.cs.waikato.ac.nz/ml/weka/.
[26] "Apache hadoop releases," http://hadoop.apache.org/releases.html.
[27] "Apache spark mllib," http://spark.apache.org/mllib/.
[28] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," HotCloud, vol. 10, no. 10-10, p. 95, 2010.
[29] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin et al., "Apache spark: a unified engine for big data processing," Communications of the ACM, vol. 59, no. 11, pp. 56–65, 2016.
[30] M. Frampton, Mastering Apache Spark. Packt Publishing Ltd, 2015.
[31] X. Meng, J. Bradley, B. Yuvaz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., "Mllib: Machine learning in apache spark," JMLR, vol. 17, no. 34, pp. 1–7, 2016.
[32] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, "Automating model search for large scale machine learning," in Proceedings of the Sixth ACM Symposium on Cloud Computing. ACM, 2015, pp. 368–380.
[33] F. Liang, C. Feng, X. Lu, and Z. Xu, "Performance benefits of datampi: a case study with bigdatabench," in Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware. Springer, 2014, pp. 111–123.
[34] M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin, and M. Zaharia, "Scaling spark in the real world: performance and usability," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1840–1843, 2015.
[35] E. R. Sparks, A. Talwalkar, M. J. Franklin, M. I. Jordan, and T. Kraska, "Tupaq: An efficient planner for large-scale predictive analytic queries," arXiv preprint arXiv:1502.00068, 2015.
[36] C.-Y. Lin, C.-H. Tsai, C.-P. Lee, and C.-J. Lin, "Large-scale logistic regression and linear support vector machines using spark," in Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014, pp. 519–528.
[37] A. G. Shoro and T. R. Soomro, "Big data analysis: Apache spark perspective," Global Journal of Computer Science and Technology, vol. 15, no. 1, 2015.
[38] "Github - apache spark," https://github.com/apache/spark/.
[39] A. Tizghadam and A. Leon-Garcia, "Application platform for smart transportation," in Future Access Enablers of Ubiquitous and Intelligent Infrastructures. Springer, 2015, pp. 26–32.
[40] M.-S. Lee, E. Kim, C.-S. Nam, and D.-R. Shin, "Design of educational big data application using spark," in Advanced Communication Technology (ICACT), 2017 19th International Conference on. IEEE, 2017, pp. 355–357.
[41] Z. Ye, A. P. Tafti, K. Y. He, K. Wang, and M. M. He, "Sparktext: Biomedical text mining on big data framework," PLoS ONE, vol. 11, no. 9, p. e0162721, 2016.
[42] S. Arora, "Analyzing mobile phone usage using clustering in spark mllib and pig," International Journal, vol. 8, no. 1, 2017.
[43] M. A. Uddin, J. bibi Joolee, A. Alam, and Y.-K. Lee, "Human action recognition using adaptive local motion descriptor in spark," IEEE Access, 2017.
[44] A. Hryhorzhevska, M. Wiewiórka, M. Okoniewski, and T. Gambin, "Scalable framework for the analysis of population structure using the next generation sequencing data," in International Symposium on Methodologies for Intelligent Systems. Springer, 2017, pp. 471–480.
[45] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, "Automating model search for large scale machine learning," in Proceedings of the Sixth ACM Symposium on Cloud Computing. ACM, 2015, pp. 368–380.
[46] A. P. Tafti, E. Behravesh, M. Assefi, E. LaRose, J. Badger, J. Mayer, A. Doan, D. Page, and P. Peissig, "bignn: an open-source big data toolkit focused on biomedical sentence classification," in Proceedings of the IEEE BIG DATA, 2017.
[47] R. Shyam, B. G. HB, S. Kumar, P. Poornachandran, and K. Soman, "Apache spark a big data analytics platform for smart grid," Procedia Technology, vol. 21, pp. 171–178, 2015.
[48] S. Gopalani and R. Arora, "Comparing apache spark and map reduce with performance analysis using k-means," International Journal of Computer Applications, vol. 113, no. 1, 2015.
[49] N. Yousefi, M. Georgiopoulos, and G. C. Anagnostopoulos, "Multi-task learning with group-specific feature space sharing," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 120–136.
[50] M. S. Fazli, S. A. Vella, S. N. Moreno, and S. Quinn, "Computational motility tracking of calcium dynamics in toxoplasma gondii," arXiv preprint arXiv:1708.01871, 2017.
[51] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut, "A brief survey of text mining: Classification, clustering and extraction techniques," arXiv preprint arXiv:1707.02919, 2017.
[52] A. Gandomi and M. Haider, "Beyond the hype: Big data concepts, methods, and analytics," International Journal of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[53] W. Raghupathi and V. Raghupathi, "Big data analytics in healthcare: promise and potential," Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
[54] Z. Lv, H. Song, P. Basanta-Val, A. Steed, and M. Jo, "Next-generation big data analytics: State of the art, challenges, and future research topics," IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1891–1899, 2017.
[55] M. Pecaric, K. Boutis, J. Beckstead, and M. Pusic, "A big data and learning analytics approach to process-level feedback in cognitive simulations," Academic Medicine, vol. 92, no. 2, pp. 175–184, 2017.
[56] "Uci machine learning repository," http://archive.ics.uci.edu/ml/index.html.
[57] "Us government's bureau of transportation research and innovative technology administration (rita)," http://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.
[58] "Hepmass dataset," http://archive.ics.uci.edu/ml/datasets/HEPMASS.
[59] P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, and D. Whiteson, "Parameterized machine learning for high-energy physics," arXiv preprint arXiv:1601.07913, 2016.
[60] "Susy dataset," http://archive.ics.uci.edu/ml/datasets/SUSY.
[61] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for exotic particles in high-energy physics with deep learning," Nature Communications, vol. 5, 2014.
[62] "Higgs dataset," http://archive.ics.uci.edu/ml/datasets/HIGGS.
[63] "Heterogeneity activity recognition dataset," http://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition.
[64] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, "Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition," in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015, pp. 127–140.
[65] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Elsevier, 2011.
[66] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (roc) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[67] S. Venkataraman, Z. Yang, M. J. Franklin, B. Recht, and I. Stoica, "Ernest: Efficient performance prediction for large-scale advanced analytics," in NSDI, 2016, pp. 363–378.
[68] S. Ma and Z. Liang, "Design and implementation of smart city big data processing platform based on distributed architecture," in Intelligent Systems and Knowledge Engineering (ISKE), 2015 10th International Conference on. IEEE, 2015, pp. 428–433.
[69] "Spark standalone mode," https://spark.apache.org/docs/2.1.0/spark-standalone.html.
APPENDIX A
HOW TO RUN APACHE SPARK ON A CLUSTER

The SparkContext object in the main program, which is also called the driver program, is utilized to run Apache Spark applications on a cluster environment.

public static void main(String[] args) {
    // Note: local[*] runs Spark locally; on a cluster, pass the
    // master URL (e.g., spark://HOST:PORT) instead.
    SparkConf sparkConf = new SparkConf().setAppName("JavaSample");
    sparkConf.setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

The SparkContext is able to connect to a variety of cluster managers, such as YARN and Mesos, which manage the allocation of resources across different applications. Once connected, Apache Spark procures executors on available nodes in the cluster environment. It then sends the application code (e.g., JAR files) to the executors, and finally, the SparkContext sends tasks to the executors to run.

Spark Standalone Mode

Apache Spark also provides an easy-to-use standalone deployment mode, in which we can put a compiled version of Apache Spark on every single node of the cluster environment [69]. In this Appendix, as an example, we use only two nodes, a master node and a slave node, and assume the Java programming language is employed. In doing so, we have to perform the following steps:

(1) Install Java on both nodes.
(2) Set the JAVA_HOME environment variable on both nodes.
(3) Make sure both nodes can ping each other by host-name.

We can then start a standalone master server by:

./sbin/start-master.sh

The master node should print a spark://HOST:PORT URL, which we can use to connect workers to the master node, or pass as the master argument to the SparkContext. We are able to start one or more worker nodes and connect them to the master using:

./sbin/start-slave.sh <master-spark-URL>

Further information is available at [69]. Here, we also present sample code showing how to use Apache Spark MLlib's SVM and Decision Tree classification components:

// SVM
// Load LibSVM-formatted data and draw a 75% sample for training.
String path = Resource.getPath("data/HEPMASS.txt");
JavaRDD<LabeledPoint> data =
    MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
JavaRDD<LabeledPoint> training = data.sample(false, 0.75, 11L);
training.cache();
JavaRDD<LabeledPoint> test = data.subtract(training);
int numIterations = 120;
// Train a linear SVM with stochastic gradient descent.
final SVMModel modelSVM =
    SVMWithSGD.train(training.rdd(), numIterations);

// Decision Tree
// Randomly split the data: 75% training, 25% testing.
JavaRDD<LabeledPoint>[] splits =
    data.randomSplit(new double[]{0.75, 0.25});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];
Integer numClasses = 3;
// Empty map: all features are treated as continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
String impurity = "gini";
Integer maxDepth = 5;
Integer maxBins = 32;

final DecisionTreeModel modelDT =
    DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins);
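The sample code above trains the two models but does not score them; a short sketch of a typical follow-up evaluation step (our own addition, reusing the modelDT and testData split defined above) might look as follows:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Predict on the held-out split and compute the fraction of misclassified points.
JavaPairRDD<Double, Double> predictionAndLabel =
    testData.mapToPair(p ->
        new Tuple2<>(modelDT.predict(p.features()), p.label()));
double testErr =
    predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count()
        / (double) testData.count();
System.out.println("Decision Tree test error: " + testErr);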