Big Educational Data & Analytics Survey
ABSTRACT The proliferation of mobile devices and the rapid development of information and
communication technologies (ICT) have led to increasingly large volumes and varieties of data being
generated at an unprecedented pace. Big data have started to demonstrate significant value in higher
education. This paper makes several contributions to the state of the art for Big data in higher
education and learning technologies research. Currently, there is no comprehensive survey or
literature review for Big educational data. Most existing literature reviews have focused on one of
two fields, educational data mining or learning analytics, with discussions of one or two aspects
such as Big data technologies without an educational focus, social media data in education, etc.
Most of these literature reviews are too short to provide an inclusive review of Big educational
data. In this paper, we present a comprehensive literature review
of the current and emerging paradigms for Big educational data. The survey is presented in five parts: (1) The
first part presents an overview and classification of Big education research to show the full landscape in this
field, which also gives a concise summary of the overall scope of this paper; (2) The second part presents a
discussion for the various data sources from education platforms or systems including learning management
systems (LMS), massive open online courses (MOOC), learning object repository (LOR), OpenCourseWare
(OCW), open educational resources (OER), social media, linked data and mobile learning contributing to
Big education data; (3) The third part presents the data collection, data mining and databases in Big education
data; (4) The fourth part presents the technological aspects including Big data platforms and architectures
such as Hadoop, Spark, Samza and Big data tools for Big education data; and (5) The fifth part presents
different approaches of data analytics for Big education data. This part provides a more inclusive discussion
on data analytics which is beyond traditional forms of learning analysis in higher education. This includes
predictive analytics, learning analytics including collaborative, behavior, personal learnings and assessment,
followed by recommendation systems, graph analytics, visual analytics, immersive learning and analytics,
etc. The final part of the paper discusses social (e.g. privacy and ethical issues) and technological challenges
for Big data in education. This part also illustrates the technological challenges faced by giving an example
for utilizing graph-based analytics for a cross-institution learning analytics scenario.
INDEX TERMS Big data, learning technologies, educational data, learning analytics.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
116392 VOLUME 8, 2020
K. L.-M. Ang et al.: Big Educational Data & Analytics: Survey, Architecture and Challenges
such as course management and learning management systems (LMS), massive open online courses (MOOC), OpenCourseWare (OCW), Open Educational Resources (OER), and social media sites such as Twitter, Facebook, YouTube and personal learning environments (PLE). The scalability of data processing and analysis enables the development of new insights and valuable information from these educational data and has further shown promise in higher education to benefit academics, students and the whole education ecosystem. Since Big data and analytics is employed to draw useful insights or values (the 4th V) from the educational data, we use the term Big educational data to describe this emerging field. There has been growing interest in the education community to gain insights from Big educational data to improve the learning performance of students, recommend courses, analyze learning patterns, predict dropout, improve the working effectiveness of instructors and reduce administrative workload.

Big data technologies comprise architectures and technologies which are designed to extract valuable information from very large volumes of data from a wide variety of data sources. Some common platforms for Big data technologies which have been developed are Hadoop, Samza and Spark. Hadoop is commonly used for the information processing of complex Big data systems and off-line processing. Samza is mainly used for high-rate, large-volume stream data processing, and Spark is often used for rapid off-line Big data processing. In the context of Big data in education, some specific Big data architectures or frameworks [1]–[10] have been proposed for education. The authors in [1] proposed a distributed architecture for the information processing of Big education data and predicting student performance with and without sentiment analytics. The authors in [2] proposed a five-layered architecture termed the Concept Definition for Big Data Architecture for education. The authors in [3] proposed a cloud-based architecture to analyze educational data from the Moodle system in the cloud using Apache Hadoop. The authors in [4] proposed a Big data architecture for education using Spark to identify patterns of lecture data that students have taken for the year and semester. The authors in [5] proposed a logging architecture for an E-Learning Big Data Ecosystem. The authors in [6] proposed a Big data infrastructure using the Hadoop platform. The platform is deployed within the e-learning infrastructure of a laboratory. The authors in [7] proposed an architecture based on the Apache Hadoop distributed computing architecture to process the Big data of Holland vocational interest theory. Other works on frameworks and platforms for Big education data can be found in [8]–[10]. Further details will be discussed later in the paper. Big data analytics is changing the educational industry and gives new opportunities for both learners and instructors. In general, there are three challenges for Big educational data analysis to be addressed: (1) The huge amount of data to be processed; (2) The complex and unstructured data analytics; and (3) The difficulty of finding the hidden value in the Big education data in a timely manner. The authors in [153] reported on a case study applying a Big data framework to an LMS, conducted at the Catholic University of Murcia. The authors commented on the challenges of managing the large volume of data generated by users in the LMS and employed statistical and association rule techniques to speed up the statistical analysis of the data. In this study, the size of the Big data generated by the LMS was 70 GB, from data sources such as student activity, learning modality (e.g. on-campus, online, and blended), number of accesses to the LMS, tools employed by students and their associated events. In the era of Big education data, educational data mining (EDM) and data analytics are becoming essential tools to address the challenges. Data mining, also termed knowledge discovery, is known for its effectiveness in discovering hidden information embedded in the educational data. A recent literature review paper on EDM can be found in [11]. This review work presented twenty years of data mining research in e-learning environments, from an educational perspective. This paper presented a wide-scale review of 525 papers in which both the terms "data mining" and "education" were analyzed and used as keywords. The review included 72 papers focused on teaching-learning evaluation. The analyzed papers showed that research in EDM has expanded into several different sub-areas and themes.

Other literature review papers on EDM can be found in [12]–[18]. Learning analytics (LA), sometimes referred to as academic analytics, and EDM are interconnected areas in education research. A recent literature review paper on EDM and LA together for 21st century higher education can be found in [19]. Different authors define LA differently. Some authors define it in terms of the use of student-generated data for the prediction of educational outcomes for tailoring education, whereas other authors define LA as a tool to help educators examine, understand and support student study behaviors and change their learning environments. A literature review of the current landscape of the usage of LA in higher education can be found in [20]. This study was based on the analysis of 252 papers on learning analytics in higher education published between 2012 and 2018. The work by [21] presented a literature review of the LA landscape covering its evolution, status and trends. The authors discussed LA as arising from a knowledge discovery paradigm to understand the learning process. The work by [22] discussed the evidence on four propositions of LA, including whether LA improves learning outcomes and student retention, completion and progression. The work by [23] focused on the current research trends of LA and its limitations and methods. Another literature review focused on the use of LA in higher educational settings can be found in [24]. Up to this point, we can see that there is no comprehensive survey or review for Big educational data. Most reviews have focused on either EDM or LA from only the education aspects. There are some short papers on Big education data but they only provide short overviews of Big data in education and challenges. Therefore, there is a need for a solid review that combines all aspects of both technologies and education for Big education data. A comprehensive literature review of Big education data which emphasizes all aspects of Big data technologies, architectures and data analytics for education is the major contribution of this paper.

TABLE 1. Overall classification of big educational data research.
The literature review in this paper has been comprehensively carried out using an extensive search of the relevant databases including IEEE Xplore, Springer, ScienceDirect, ACM conference proceedings and other sources, using combinations of keywords such as "Big data", "Education", "Learning analytics", "Education data mining", "Learning management system", "MOOC", "immersive learning", etc. For example, when using IEEE Xplore, a search with the keyword combination of "Big data" and "Education" returned 585 journal and 1452 conference papers. Of these, recent papers most relevant to Big educational data were surveyed.

In this paper, the data sources from education platforms or systems including LMS, MOOC, learning object repository (LOR), OCW, OER, social media, linked data and mobile learning contributing to Big education data are discussed. This is followed by the data collection, data mining and databases for education. This paper also discusses the technological aspects, which include Big data platforms such as Hadoop, Spark and Samza and Big data tools for Big education data. The Big data architectures or frameworks specifically proposed for education are reviewed and discussed in detail. The most challenging part of this paper is to present a comprehensive literature review on data analytics from both technology and education aspects, which goes beyond traditional forms of analysis in education. The works on data analytics are classified into predictive analytics and learning analytics, which includes collaborative and interactive learning, behavior learning, personal learning and others. Recommendation systems (recommenders) for education, an emerging topic in data analytics, are also presented. Other emerging analytics such as graph analytics, visual analytics, immersive learning and analytics are also included. The final part of the paper provides some experimental insights for utilizing graph analytics for a university-based learning analytics scenario. The technological and social challenges for Big data in education and insights for future direction are also discussed. The rest of the paper is organized as follows. Section II gives background information and research classifications. Section III describes the data sources from education systems that form the Big education data. Section IV reviews the data collection, mining and databases in education systems. Section V presents the technological aspects for Big education data. Section VI gives a comprehensive literature review on data analytics. Section VII discusses future challenges for Big data in education. This section also illustrates the usefulness and technological challenges faced by giving an example for utilizing graph-based analytics for a cross-institution learning analytics scenario. The paper is concluded with some comments and remarks in Section VIII.

II. OVERVIEW AND RESEARCH CLASSIFICATION
The paper first presents the overview and classification of Big educational data and analytics research as shown in Table 1 to give a concise summary of the overall scope of this paper. The research works are classified into the following categories: (1) Big educational data; (2) Technological aspects for Big data for education; (3) Data analytics for Big education data; and (4) Future challenges for Big education data. Table 1 also allows the reader to see the full landscape of the research field of Big education data.

III. DATA SOURCES FROM EDUCATION SYSTEMS CONTRIBUTING TO BIG EDUCATION DATA
Data from education systems can be found in various sources such as student information systems, student administrative systems, learning management systems and library information systems. New education developments and applications of information technology together with Internet technology have led to the online education industry. Higher education institutions are increasingly offering and delivering online learning, resulting in a large volume and availability of educational digital libraries, storage repositories and tools. Furthermore, enrolled students and offered courses from massive open online courses (MOOC) are becoming large and diverse, resulting in a growing abundance of data for analytics. There are also increasingly varied formats of audio, video, text, and images besides the data in relational databases from institutions. This section presents sources that contribute to Big educational data by reviewing the current education systems or platforms. Fig. 1 shows a pictorial overview of research areas and data sources in Big education data.

FIGURE 1. Overview of research areas and data sources for big education data.

A. LEARNING MANAGEMENT SYSTEMS (LMS)
Learning management systems (LMS) are educational management platforms for the administration, delivery, tracking and reporting of educational curriculum and courses. Moodle [28] is one of the most popular open source LMS options available today. Other examples of LMS [29] are Canvas [151], Sakai [152], ATutor, Eliademy, Forma LMS, Dokeos and OpenOLAT. The LMS concept emerged from e-Learning. In general, LMS have three major functions: (1) Management of educational courses and students; (2) Management of online assessments and tracking student progress and attendance; and (3) Providing feedback to users and students. The LMS provides services and tools to instructors to create course content which contains text, images, tables, interactive tests, and slideshows. The LMS can also be used to engage the student with contact tools and control access to the educational content. For instructors, the LMS enables the management of courses and modules, enrollment of students, and generation of reports on students. Most modern LMS are web-based information technology systems. With the advancement of technology, various tools and strategies can be employed for embedding content into LMS such as SCORM (Sharable Content Object Reference Model) [26], and LTI (Learning Tools Interoperability) [27].

B. MASSIVE OPEN ONLINE COURSES (MOOC)
Massive Open Online Courses (MOOC) employ web-based learning technologies to enroll large numbers of students worldwide. MOOC learning materials and contents can be delivered as text-based or video-based materials. Two pedagogical categories, c-MOOC and x-MOOC, are often used to distinguish MOOC [30]. The c-MOOC emphasize openness and networking among learners and facilitators, where anyone can contribute to the contents, whereas x-MOOC are more facilitator-centric; the contents are prepared by the facilitators. Coursera [31] and edX [32] are two established MOOC. Other examples of MOOC [33] include Udacity, Duolingo, Treehouse and Google Primer.

C. OPEN EDUCATIONAL RESOURCES (OER) & OpenCourseWare
Open educational resources (OER) are educational materials that are freely available in the public domain. The OER include licensed text, media, and other digital assets that are useful for teaching, learning, and assessment. The term OER was introduced at the 2002 UNESCO Forum on Open Courseware [34]. Some examples of OER include: (1) University curriculum and courses, video lectures and assignments; (2) Interactive simulations about a specific topic (e.g. mathematics, chemistry, etc.); (3) Digital textbooks that are supported with additional learning materials; (4) Lesson plans, worksheets and learning activities; and (5) Translations and adaptations of previously-published OER. Some well-known examples of OER [35] include Khan Academy, OpenStax CNX, Open Textbook Library, Curriki, and Wikimedia Commons. OpenCourseWare (OCW) [36] is a subset of OER. OCW refers to the free and open digital publication
of high-quality college and university-level educational materials. Examples of OCW include MIT OCW, Johns Hopkins OCW and CORE (China Open Resources for Education).

D. SOCIAL MEDIA
Social media sites such as Twitter, Facebook and YouTube provide a platform for learners to share their educational experiences, emotions, concerns about the learning process and seek social support from peers. These digital data provide knowledge and perspectives for instructors to understand the student's experiences outside the classroom environment. The data from social-based environments can provide valuable knowledge to inform on student learning and assist institutional decision-making on interventions for at-risk students, improve education quality and increase student retention and success [37]. The abundance and diversity of the social media data raises challenges for algorithms to capture the embedded information within the data.

data collection about student learning and experiences. Educational data can be collected at a rapid pace with the advance of online technologies (e.g. MOOC and LMS) which have the capability to track and collect a huge amount of educational data about learner experience. The Experience API (xAPI) [25] is an open data specification for data collection across learning tools. The authors in [43] use the xAPI standard to collect, track and store educational data retrieved from an e-learning environment called Kalboard 360. The tracked data is classified into three features (behavioral, demographic and academic background features). Another major source of educational data can be obtained from social media (e.g. blogs, online social networks, microblogs). It is challenging to collect social media data related to student learning experiences and behavior because of the variety and diversity of the language used. The authors in [37] performed data collection from Twitter using an educational account on a commercial social media monitoring tool.
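As a rough illustration of the xAPI-based collection described above, the sketch below builds one learner-activity record in the actor/verb/object shape that xAPI statements use and serializes it to JSON. The learner address, activity URL and the helper name `make_statement` are hypothetical placeholders and are not taken from [25] or [43]; a deployed system would send such statements to a learning record store rather than print them.

```python
import json
from datetime import datetime, timezone


def make_statement(learner_email, verb_id, verb_label, activity_id, activity_name):
    """Build a minimal xAPI-style statement (actor, verb, object) as a dict.

    All argument values used below are illustrative assumptions.
    """
    return {
        "actor": {"objectType": "Agent", "mbox": f"mailto:{learner_email}"},
        "verb": {"id": verb_id, "display": {"en-US": verb_label}},
        "object": {
            "objectType": "Activity",
            "id": activity_id,
            "definition": {"name": {"en-US": activity_name}},
        },
        # Time at which the learning event was recorded (UTC, ISO 8601).
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


statement = make_statement(
    learner_email="student@example.org",            # hypothetical learner
    verb_id="http://adlnet.gov/expapi/verbs/completed",
    verb_label="completed",
    activity_id="http://example.org/courses/algebra-1/quiz-3",  # hypothetical activity
    activity_name="Algebra 1, Quiz 3",
)
print(json.dumps(statement, indent=2))
```

Keeping every event in the same actor/verb/object form is what lets records from different learning tools be stored and analyzed together.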
the Greek University Open Data, and the Linking Italian University Statistics Project.

C. EDUCATIONAL DATA MINING
Data mining techniques are increasingly gaining significance in the education sector and the outcomes from data mining techniques can provide invaluable support for decision making. The field of data mining in education is termed Educational Data Mining (EDM). EDM is an emerging discipline that focuses on applying data mining tools and techniques to education-related data. This section reviews the literature review and survey papers on EDM and highlights their main contributions. A recent literature review or survey paper can be found in [11]. This review presents twenty years of data mining research in e-learning environments, from an educational perspective. The authors identified and classified challenges for research to improve student learner performances. Another literature review paper by [19], published in 2019, focused on EDM and learning analytics in higher education. The work in this literature review covered four main areas: (1) computer-supported learning analytics (CSLA) and the use of DM techniques to derive actionable information based on student interaction in LMS environments; (2) computer-supported predictive analytics (CSPA) and the use of EDM and LA to predict student performance and retention in courses based on assessment, engagement and domain knowledge in a learning activity; (3) computer-supported behavioral analytics (CSBA) and the use of DM techniques to identify student behavioral patterns and preferences when participating in online learning activities; and (4) computer-supported visualization analytics (CSVA) and the combination of information visualization techniques with advances in data mining and knowledge representation to offer a visual analysis of student behavior with respect to the learning activity.

Other review papers on EDM for education can be found in the works by [12]–[18], [52], [53]. Table 2 shows a summary of the various surveys which have been proposed for EDM. The table gives various details including the year, survey objectives, and remarks and comments. The authors in [15] surveyed the history and applications of data mining techniques in the educational field for traditional educational systems, web-based educational systems, intelligent tutoring systems, and e-learning. The authors discussed concepts for EDM such as prediction, clustering, relationship mining, outlier detection, text mining, and social network analysis. In [12], the authors aimed to highlight the main data mining techniques applied in the e-learning environment and proposed three useful orientations for EDM research: (1) Orientation towards students and using EDM to recommend activities, resources and learning tasks to learners based on the tasks already accomplished by the learner and their successes; (2) Orientation towards educators and using EDM to obtain objective feedback for instruction, evaluate the structure of the course content and its effectiveness on the learning process; and (3) Orientation towards academics and administrators and using EDM to set parameters to improve site efficiency and adapt it to the behavior of users.

The authors in [13] presented a systematic review on EDM focusing on clustering algorithms and their applicability and usability in the context of EDM. The authors term this approach, when applied to analyze datasets from educational systems, Educational Data Clustering (EDC). Different approaches for EDC were reviewed including 166 studies for e-learning and clustering, examination failure and clustering, intelligent tutor system and clustering, learning style and clustering, student modeling and clustering, student motivation and clustering, student profiling and clustering, etc. In [14], the authors performed a literature review focused on the different agents in the educational context such as students, educators, researchers, institutions, and managers. The survey reviewed DM techniques applied to education, and models to provide updated information and improve institutional efficiency. The review of techniques included forecast performance modelling, undesired behaviour detection, monitoring support, recommendation planning and scheduling, and intelligent tutoring. Other review works on EDM can be found in [16]–[18]. The literature review paper of [16] discussed an explanation of the DM techniques in order of relevance, tendencies, and limitations faced by e-learning environments. In [17], the authors introduced a new perspective on the individualization and interaction between the educational actors and highlighted the trends and challenges of EDM from the perspectives of educational actors. In [18], the authors discussed the results of research on behavior detection, personalization and student performance evaluation obtained by DM techniques such as clustering, classification, and regression. In the work by [52], the authors focused on detecting the students' circumvention risks through predictive models and providing a custom recommendation to students by identifying their needs and learning disabilities. The objectives were to present a literature review of EDM focused on student retention and evasion, recommendation systems and course administration. The work in [53] covered ubiquitous and pervasive data mining applied to education for fraud detection and identification of students that require special attention.

V. TECHNOLOGICAL ASPECTS FOR BIG EDUCATION DATA
In this section, some common platforms for Big data such as Hadoop, Spark and Samza will be discussed. Hadoop, Samza and Spark are currently the popular systems for Big data analysis. Hadoop is used for off-line and complex educational Big data processing, Samza is mainly used for high-rate, large-volume streaming education data processing, and Spark is often used for rapid off-line education Big data processing. The authors in [6] provided a general overview of Big data computing and discussed main characteristics such as data organization, decision-making, domain specific tools and platform tools. The authors illustrate the infrastructure that enables users to extract the maximum
1) HADOOP PLATFORM
Hadoop is an open source distributed data processing infrastructure developed by the Apache Software Foundation. It enables distributed and parallel processing of large data sets across clusters of many computers. It features low cost, high efficiency, high reliability, high scalability, and high fault tolerance. Hadoop consists of the HDFS distributed file system, MapReduce and several general-purpose tools.

MapReduce: MapReduce is a paradigm of parallel programming across big datasets working with many computers (nodes). It supports the use of inexpensive computer clusters to perform distributed parallel computing on large datasets up to petabytes. The data can be structured or unstructured (e.g. weblog records, e-commerce click trails, binary or multi-line records). It is mainly composed of two functions: (1) Map function; and (2) Reduce function. The Map function is responsible for processing standardized data whereas the Reduce function mainly summarizes the results after the Map function.

HDFS: HDFS is a distributed, scalable and portable filesystem for the Hadoop framework written in Java. HDFS stores large files (from gigabytes to terabytes) across many servers. HDFS provides unstructured data storage for Big data. HDFS is characterized by "write once read many times" and is very suitable for reading Big data. HDFS has a typical master-slave architecture. HDFS has the advantages of high fault tolerance and high scalability.

Hive: Hive is a data warehouse infrastructure built on top of Hadoop which provides summarization of data, query and analysis. Hive supports analysis of big datasets stored in HDFS, Amazon S3 file system etc. It provides an SQL-like language called HiveQL, supporting indexes.

NoSQL: NoSQL is a database system providing a mechanism for storage and retrieval of data with fewer constraints than traditional SQL (relational) databases.

Hadoop Common: Hadoop Common provides Java libraries and utilities which are required by other Hadoop modules.

Mahout: Mahout is an open source library based on Hadoop that implements many machine learning and data mining algorithms.

Other Business Intelligence (BI) Tools: Although the growth of Big data technology is huge, it does not mean the end of classical BI tools such as Cognos, QlikView, SPSS and so on. The trend is for BI tools to work side by side with new Big data technologies.

Data Storage: NoSQL databases are inherently schemaless and highly scalable. These databases support frameworks like MapReduce, Dryad etc. for the parallel processing of large amounts of data. The paper by [54] investigated educational technology for Big data analysis and the exploration of the development trend for online education. The authors gathered data, attached importance to the basic function and value of education data, and explored the education technology that matches the Big data analysis. The work by [55] discussed the relationship between Big data and cloud computing, Big data storage systems and Apache Hadoop technology.

2) SPARK PLATFORM
Apache Spark is a distributed computing framework like MapReduce but maintains data in Resilient Distributed Datasets (RDDs). It is useful for algorithms that perform iterative operations and data flow processing. Spark provides Shark, an interactive query analyzer; Bagel, a high-volume graph processor and analyzer; Spark Streaming, a real-time analyzer; and MLlib, a machine learning library.

3) SAMZA PLATFORM
Samza is a distributed stream processing framework for real-time data processing. In Samza, the data stream is partitioned, and each message within a partition is identified by an offset. Samza places the storage and processing on the same machine and does not load additional memory while maintaining processing efficiency and providing a framework for a flexible pluggable API. Fig. 2 shows the Samza technology core architecture [56].

FIGURE 2. Samza technology core architecture [56].
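To make the Map and Reduce functions described under the Hadoop platform more concrete, the following is a minimal single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop code: the comma-separated log format, the record values and the function names (`map_fn`, `shuffle`, `reduce_fn`, `run_job`) are assumptions chosen for illustration. The example counts LMS events per student.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit a (student_id, 1) pair for each LMS log record."""
    student_id, _event = record.split(",")
    yield student_id, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: summarize all values for one key (here, a simple sum)."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    intermediate = (pair for record in records for pair in map_fn(record))
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())


# Hypothetical LMS log lines in "student_id,event" form.
log = ["s1,login", "s2,login", "s1,quiz_submit", "s1,forum_post", "s2,video_play"]
counts = run_job(log)
print(counts)  # {'s1': 3, 's2': 2}
```

On a real cluster the map and reduce calls would run in parallel on different nodes and the grouping step would move data across the network; the sketch keeps only the data flow.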
data including teaching, education management, scientific research, campus life and so on. The Logic Layer is the core part of the whole system and carries the value of the data mining. The Presentation Layer provides a visual interface for users. The graphical data analysis interface can help users to perform the Holland analysis, curriculum optimization and student employment decisions. Other works on frameworks and platforms for Big education data can be found in [8]–[10]. The work in [8] used the Hadoop platform to conduct parallel mining of educational literature on Big data. The paper analyzed the main function of text mining technology, and combined Canopy and the k-means algorithm to analyze the educational Big data literature. The authors in [9] presented a framework for a Big data education system based on Hadoop. They examined the MapReduce system for the education system, with the huge volumes of data stored in HDFS. The authors in [10] provided a comparison of the Hadoop, Spark and Samza platforms, and presented an architecture of Spark for education.

VI. DATA ANALYTICS FOR BIG EDUCATION DATA
This section gives comprehensive discussions for data analytics for Big education data from two areas: (1) Predictive analytics; and (2) Learning analytics. A brief literature review of some emerging trends and opportunities in applications of Big data in educational data mining and learning analytics can be found in [57] and [58].

A. PREDICTIVE ANALYTICS (PA)
The prediction of how well a student or a group will perform on a learning task is one of the most popular and useful applications of educational predictive analytics. It can also be used to identify at-risk students who are likely to fail. However, this is a challenging problem to solve due to the large number of circumstances that can impact student performance, such as socioeconomic status, cultural background, demographic characteristics and psychological profile. This section gives discussions for predictive analytics from three application areas: (1) Student performance; (2) Dropout prediction and academic early warning systems; and (3) Courses selection.

1) STUDENT PERFORMANCE PREDICTION
The authors in [1] provide a discussion on Big data, learning analytics and the use of natural language processing (NLP) in higher education. They proposed an integrated analytics model with predictive analytics for student performance on their Big data architecture with data access, storage and processing layers. The architecture has been discussed in Section V. Their analytics model utilizes different types of data to predict student performance and support student progress. The authors incorporate sentiment analysis in their predictive analytics and employ a distributed technology system capable of supporting academic authorities and advisors at educational institutions in making decisions. Their experiment results showed that the features derived from unstructured data gave a 10% improvement in the accuracy of results compared with the traditional single predictive model. The authors in [59] proposed an approach using predictive analytics for e-learning with the Hadoop Big data platform. Their work used the decision tree classification approach (C4.5) in a Hadoop framework to predict student performance. The C4.5 algorithm was chosen because: (1) It is able to handle both discrete and continuous attributes; (2) It can process partially complete training data sets with missing values; (3) Pruning can be done while constructing the trees to prevent the over-fitting problem. The work by [60] proposed a two-stage model, supported by data mining techniques, that uses the information available at the end of the first year of students’ academic career (path) to predict their overall academic performance. This study proposed to segment students based on the evidence of failure or high performance at the beginning of the degree program, and the students’ performance levels predicted by the model. A data set of 2459 students spanning the years from 2003 to 2015 from a European Engineering School of a public research University was used to validate the proposed methodology. The empirical results demonstrated the ability of the proposed model to predict the students’ performance level with an accuracy above 95%.
The ASSISTment [61] system designed by Worcester Polytechnic Institute and Carnegie Mellon University can tutor students and assess student learning at the same time. This system targets the problem that instructors wish to assist and assess at the same time in class. The system gives assessment results by predicting the student’s performance on a standard test given by an official assessment system such as MCAS (the Massachusetts Comprehensive Assessment System). It collects the student’s reaction information (such as accuracy, speed, the number of hints required and performance on sub-steps) and predicts the student’s performance based on a correlation model trained on data from past months and years. Since the students work on the system every week, the ASSISTment system can keep updating the value of metrics and provide increasingly accurate predictions. The authors in [62] developed a predictive model to forecast student performance in higher-level modules based on contextual factors. The authors analyzed data from 1037 students across various specializations, with different modes of study, age groups, genders and sponsors. The Rapid Miner open source tool for predictive analytics and visualization was chosen for the study. The outcome of the work showed that a negative correlation exists between age and academic performance, whereas a positive correlation exists between performance in lower-level and higher-level modules.
Other examples of predictive analytics for student performance can be found in [63]–[70]. The authors in [63] used student information like attendance, class test, seminar and assignment marks collected from the student management system to predict the performance at the end of the semester. This paper investigated the accuracy of decision
tree techniques for predicting student performance. The work in [64] analyzed live video streaming and the students’ online learning behaviors and their performance in their courses. The student participation and login frequency, as well as the number of chat messages and questions that they submitted to their instructors, were analyzed together with the students’ final grades. The results of the study showed considerable variability in students’ questions and chat messages and revealed that combining EDM with traditional statistical analysis provides a strong and coherent analytical framework capable of enabling a deeper and richer understanding of students’ learning behaviors and experience. The authors in [65] explored the use of predictive modeling methods for identifying students in virtual learning environments (VLE) who will benefit most from tutor interventions. The methods discussed included decision-tree classification, support vector machine (SVM), general unary hypotheses automaton (GUHA), Bayesian networks, and linear and logistic regression. The methods were trialed through building and testing predictive models using data from several Open University (OU) modules. This work highlighted the importance of understanding how a student’s pattern of behavior changes during the course. The authors commented on two findings: (1) VLE activity is a useful data source to include for predicting student outcome, but it should not be viewed as an absolute measure of engagement but rather with reference to a student’s own past behavior; and (2) Feature selection has a big impact on the reliability of a model generated from the data regardless of which model type is chosen.
The work in [66] demonstrated how web usage mining can be applied in e-learning systems to predict the marks that university students will obtain in the final exam of a course. In this work, the authors developed a dedicated Moodle mining tool and compared the performance of different data mining techniques for classifying students. Several well-known classification methods were used such as statistical methods, decision trees, rule and fuzzy rule induction methods, and neural networks. The authors carried out several experiments using available and filtered data to try to obtain higher accuracy. The authors in [67] used predictive analytics to identify the factors influencing the performance of students in final examinations and to find a suitable data mining algorithm to predict the grade of students. The authors designed a neural network (multilayer perceptron, MLP) tool using the .NET framework to predict the grade of the student when given the various parameters as input and achieved an accuracy of 72%, which showed the potential efficiency of the MLP algorithm. The results obtained from hypothesis testing showed that the type of school did not influence student performance, whereas the parents’ occupation played a major role in predicting grades. The work in [68] proposed an approach to predict student performance through genetic programming. The authors used participation indicators derived from activity theory as inputs into a Genetic Programming (GP) model to develop a student performance prediction model. Their GP model was able to build a prediction model without assuming any a priori structure of functions. The proposed GP model also provided instructors with individualized suggestions to students in any performance state (at-risk, just survive, average or good) as well as increasing students’ awareness.
The authors in [69] proposed an educational data mining (EDM) case study based on the data collected from the learning management system (LMS) of the e-learning center and the electronic education system of Iran University of Science and Technology (IUST). The authors implemented a model to predict the GPA of graduated students. To achieve this goal, a common data mining methodology called CRISP was utilized. Their results showed that confident models can be built for predicting educational attributes. The work in [70] also used data mining as a predictive tool for performance improvement of engineering students. The authors applied the C4.5, ID3 and CART decision tree algorithms on engineering student data to predict their performance in the final exam. The authors showed that the outcome of the decision tree classifiers predicted the number of students who are likely to pass, fail or be promoted to the next year. Their results provided steps to improve the performance of the students who were predicted to fail or be promoted. The comparative analysis of the results also showed that the prediction helped the weaker students to improve and brought out better outcomes in the result.

2) DROPOUT PREDICTION AND ACADEMIC EARLY WARNING SYSTEMS
One of the biggest challenges every institution faces is how to improve student retention and reduce attrition. There could be several reasons for student attrition including academic issues (inadequate preparation, student disinterest in content or delivery method); motivational issues (low level of commitment to the institution, perceived irrelevance of the institution’s experience); psychosocial issues (social factors, emotional issues); and financial issues (inability to afford fees, perception that cost outweighs benefits) [71]. Two emerging areas to improve student retention and reduce attrition are (1) Dropout prediction; and (2) Development of academic early warning systems. Dropout prediction is one of the major research topics in learning analytics (LA) for Big education data. Dropout prediction is very useful to instructors, who can identify how likely a student is to drop out during the course. The instructor can then make adjustments during the teaching process to reduce that likelihood (e.g. send email reminders or give positive feedback to students who have been identified to be very likely to drop out during the course).
Some examples of LA for dropout prediction can be found in the works by [72]–[80]. The authors in [72] investigated dropout prediction in massive open online courses (MOOC). The objective was to predict from the student behavior log data the likelihood of students dropping out from the MOOC in the next ten days. In this work, the authors collected data from 39 courses on the XuetangX platform, which is one of the largest online learning platforms in China. The authors
used four supervised classification models (SVM, logistic regression, random forest and gradient boosting decision tree (GBDT)) to perform the dropout prediction task and achieved the highest classification accuracy of 88% with the GBDT. The work in [73] used machine learning (ML) techniques to demonstrate that categorizing student performance data and exercise sets were adequate parameters for identifying possible dropouts during a course. The authors used experimental data from a computer science course and showed that their ML techniques could provide automatic detection of student dropouts during the second week of the eight-week courses.
The work in [74] utilized education data mining to analyze the factors affecting student academic performance which contributed towards student failure and dropout. The authors showed that their techniques enabled the identification of weak students with poor performance. The authors in [75] used learning analytics to manage dropout rates based on a set of pedagogical actions in distance education courses and reported an average of 87% prediction accuracy and an average reduction of 11% in dropout rates. Other works for dropout prediction can be found in [76]–[80]. The authors in [77] conducted experiments using a dataset of 419 students to determine the best predictors of dropout at different stages in a course. The authors in [77] extracted features from student behavior from completed curriculum and applied machine learning algorithms to predict the dropout rate. The authors in [78] used data mining algorithms to predict student failure from high dimensional and imbalanced behavior data. A second emerging area in LA for Big education data is the development of academic early warning systems (AEWS). The objective of an AEWS is to discover and identify existing and potential academic problems of students in the early stages of education and inform students so that remedial actions can be taken to mitigate the risks. The authors in [81] proposed an AEWS based on Big education data collected from different departments of the university such as the academic affairs, library and other departments. The authors used principal component analysis (PCA) to locate the key predictors and utilized three machine learning algorithms to train and test their classifiers from their sample data. Their results showed that the naïve Bayesian algorithm gave the best accuracy rate of 86% for three-semester data and 85.4% for one-semester data.

3) COURSES SELECTION
This section focuses on the articles or works where learning analytics is used as a tool for courses selection. The authors in [82] proposed a system termed Degree Compass to be used by students who are not familiar with navigating their way through a degree program. The Degree Compass system combines data from hundreds of thousands of past students with the data of a particular student (course grades, standardized test scores, college transcript grades, etc.) to recommend the courses in which the student is most likely to achieve the best grade and which also fit the student’s program of study. The system has been shown to be able to correctly distinguish if the student will get either an ABC grade or a DF grade with 92% accuracy. The authors in [83] proposed an approach for Big data analytics for predicting academic course preference using Hadoop and MapReduce. In their work, they derived preferable courses for pursuing training for students based on course combinations. The input dataset collected from students is split into various clusters and provided to the mapper, which maps the data to output represented as <key, value> pairs. The output obtained from the mapper is then combined in the combiner and sent to the reducer. The authors in [84] used educational data mining to develop educational models that predict how learning materials might be designed to fit the knowledge of the student.

B. LEARNING ANALYTICS
Learning Analytics (LA) is the collection and analysis of usage data associated with student learning. This section gives discussions for LA from five areas: (1) Collaborative and interactive learning; (2) Behavior learning; (3) Personalized learning; (4) Social network analytics; and (5) Learning and assessment analytics.

1) COLLABORATIVE & INTERACTIVE LEARNING
Collaborative analytics are commonly used to deal with issues related to providing instructional strategies that support and enhance the collaboration process among students who work together in small groups. A collaborative learning environment (CLE) aims to improve continuous and reciprocal student-educator interaction, cooperation towards knowledge construction, and knowledge and experience exchange to reach common goals. The work in [85] presented an empirical case study to investigate the impact of collaborative learning patterns on student achievements with educational data captured from a CLE platform. The authors analyzed the progress time series reflecting students’ contributions to an assignment to investigate different styles of collaboration. By comparing the collaborative learning patterns of the same groups in completing different assignments, the authors explored the impact of these patterns on the grades received as a result of teacher assessments of these assignments and identified the characteristic patterns that lead to better learning outcomes either in terms of quality or efficiency. The authors showed that continuous focus, self-reflection, live collaboration, and even distribution of workload and contributions were more likely to lead to more refined and coherent assignments, and consequently to better marks. A different approach was taken by the authors in [86], who proposed using student interaction to measure the effectiveness of collaboration in virtual learning environments (VLE). In this work, the user activity logs from the learning platform were used as the main tool for inferring learners’ activities to fit certain behaviors and preferences. The work by [87] examined the effects of learning analytics as supporting tools for instructors to guide
cooperating groups. Other examples of papers on collaborative and interactive learning for LA can be found in the works by [88]–[91].

2) BEHAVIOUR LEARNING
The concept of behavior learning is important for understanding student learning and evaluating student performance. The authors in [92] proposed searching for student behavioral patterns while accessing and browsing educational resources. In this work, the authors extracted behavioral patterns related to the student interactions with the educational media. Their results demonstrated the usefulness of student perception and identified the trends regarding the use of educational media for learning. The authors in [93] developed an evaluation system for student learning through analysis of the factors that influence their behavior during media usage. The goal was to improve the evaluation method in order to improve the students’ behavior in relation to use of the learning media. To evaluate the level of students’ learning, the decision tree technique was used. The authors in [94] developed a system to explore and visualize generated data in virtual learning environments and analyzed these data using web-mining and statistical techniques to extract behavior patterns of the student. The authors in [95] grouped and analyzed access data in order to recognize behavior patterns (e.g. identify whether the instructions were inadequate or insufficient, or to identify visibility problems in the content posted) in order to review and organize the educational content. The authors in [96] presented a framework for analyzing student activity data in open-ended learning environments (OELE) that integrates model-driven behavior characterization and data-driven pattern discovery. The model-driven approach used linked task and strategy models to provide more precise interpretation of student activity sequences as learning and problem-solving strategies, while the pattern mining approach enables the identification of new variations of strategies and of gaps in the coverage of the current strategy model. Other examples of papers on behavior learning can be found in the works by [97], [98].

3) PERSONALIZED LEARNING
Personalized learning is aimed at customizing the learning journey of a student to maximize his/her learning potential and hence fulfill the goal of education and career with satisfaction and accomplishment. With the help of Big data technologies, learning can be made increasingly personalized, and instructors can watch learners and track which areas within a program of study they find challenging and spend most of their time on, the learning materials they revisit often, the sections they recommend to their peers, the learning styles they prefer, and the time of day they learn better [99]. With the emergence of various learning strategies such as micro-learning, multimedia learning and flipped classroom, learning personalization has been recognized as an effective and adaptable interface between the student and the knowledge to allow effective learning and knowledge transfer. For example, in micro-learning, information is delivered in small portions that are easy to learn effectively [100], and content can be delivered according to tailored knowledge composition patterns that are best retained by individual students. Personalized learning has been advocated as an effective approach that could be applied at different stages of the curriculum to ensure deep learning and leave students with knowledge absorbed quicker and retained longer.

4) SOCIAL LEARNING AND NETWORK-BASED ANALYTICS
Social and network-based learning and analytics benefit from the utilization of technology to establish connections between students, instructors, communities and resources [101]. The use of EDM and LA for social network analysis has been reported to be associated with student learning and building knowledge in social and cultural settings to discover patterns of collaboration, assessment and communications. The work by [102] showed that by collecting data about user behavior, LA could be useful for providing recommendations about learning resources and activities. The work by [103] showed that mining students’ online social interaction was important for recommending appropriate learning partners in a web-based cooperative learning environment. Another work for EDM and LA to aid educational decision makers by providing the environment to share and collaborate with other team members to take the appropriate actions for a given learning task can be found in [104].

5) LEARNING & ASSESSMENT ANALYTICS USING EXPERIENCE API
The Experience API (xAPI) standard is a specification for learning technologies which can be used for data collection describing the wide range of experiences of the learner in the context of formal learning, informal learning and social learning [105]. The authors in [109] gave two classifications for research works using the xAPI specification in the context of learning analytics: (1) The first category deals with the deficiencies of the xAPI specification such as limitations of learning interactions and inconsistency of learning behaviors across platforms in addressing specific issues related to the learning context; and (2) The second category deals with tracking and analyzing the learning experience using the xAPI specification. The work by [43] used the xAPI standard to track educational data from an e-learning environment called Kalboard 360. The tracked data is classified into behavioral, demographic and academic background features, and three data mining techniques (ANN, naive Bayes and decision tree classifier) were employed to evaluate the impact of such features on student performance. The experimental results showed that there was a strong relationship between learner behaviors and their academic achievement. The authors in [107] proposed a 3D design activity stream for STEM education based on xAPI. The xAPI can describe learner experiences as active statements with eight attributes (UUID, ACTOR, VERB, OBJECT, RESULT, CONTEXT, TIMESTAMP and VERSION). For example, the specification <ACTOR, VERB,
OBJECT, CONTEXT> composes a simple activity flow. Experiments were carried out at the Li Jun School in China. The authors collected more than 22,000 data elements and showed that their xAPI could completely record the learning paths of students. Their results also showed that students had different operating habits and learning paths, which provided the basis for the evaluation of students’ spatial thinking ability and engineering design skill in the interactive learning environment. The authors in [108] discussed some experiences and lessons learnt from implementing xAPI for projects in the Netherlands. The authors remarked on the need for a centralized approach for data collection to get a complete picture of student behavior, which may be stored on many heterogeneous IT systems. Furthermore, the xAPI recipes need to be seen in their infrastructural context. An ETL (Extract Transform Load) layer with communal best practices encoded in the transforms and applied across the higher education sector can enforce the authoritative standard and decrease the overall costs.
The authors in [106] discussed a case study to show the suitability of using xAPI (Tin Can API) for self-regulated learning (SRL). The authors proposed an extension of xAPI for recording SRL-related actions termed xAPI-SRL. Their monitoring system had several steps: (1) Author – filter statements from the selected author; (2) SRL – filter SRL related actions; (3) Time – select time window and organize records time wise; (4) Object – filter or organize statements attending to the object; (5) Grouping and analysis – analyze groups of statements attending to how they relate to each other. A recent work by [109] explored the use of xAPI in learning analytics for MOOC environments, which generate big assessment data (Big data) given the massive number of courses proposed and the high number of learners enrolled. These assessment data must be tracked, processed and analyzed as the learning data. The authors in [110] commented that assessment analytics has the potential to make valuable contributions to the field of learning analytics by extending its scope and increasing its usefulness. The authors also state that the role that assessment analytics could play in the learning process is significant and yet it is underdeveloped and underexplored.

C. RECOMMENDATION SYSTEMS
A recommendation system or recommender is an information filtering system that seeks to predict the rating or preference a user would give to an item. These systems have been very helpful in applications such as e-commerce (e.g. Amazon), entertainment (e.g. Netflix, YouTube and Spotify), service industries, and social media platforms (e.g. Twitter and Facebook). Recently, recommender systems have gained popularity in the education sector to generate various kinds of recommendations for learning institutions, instructors and students. This sub-section explores recommendation systems for Big data in education. The various recommendation techniques can be broadly categorized into four types [111]: (1) Collaborative-based filtering; (2) Content-based filtering; (3) Knowledge-based systems; and (4) Hybrid-based systems. In collaborative-based filtering systems, an item will be recommended to the user based on the preference of other similar users for the same item. The users whose past ratings correlate most strongly with those of the target user are identified as nearest neighbors, and the score of a new item is predicted from the scores given by these nearest neighbors. The correlation or log-likelihood ratio measures can be used to identify preferred items for the user. Content-based filtering recommender systems utilize a series of discrete and pre-tagged characteristics of an item in order to recommend additional items which have similar properties. Content-based recommendation systems find items of interest for users by analyzing item descriptions. These systems generate lists of item profiles for the users based on the data provided by users. They use two metrics called term frequency (TF) and inverse document frequency (IDF). The TF measures how many times a term occurs in a document, whereas the IDF measures how rare the term is across all documents. The product TF×IDF is used to identify the importance of the term. Knowledge-based recommendation systems are based upon the knowledge of a user’s need for an item and can therefore reason about the relationship between a need and a possible recommendation. The knowledge about the user needs, preferences, etc. is used to perform the recommendation. Current recommender systems typically combine two or more approaches into a hybrid recommendation system to improve the recommendation accuracy. Examples of recommendation systems for educational data can be found in [112]–[119].
For specific course recommendation in MOOC, some approaches such as collaborative filtering, content-based filtering and hybrid recommendation systems can be found in [113]–[116]. The authors in [113] proposed a systematic methodology for recommending personalized courses that considers the sequence of the learning curriculum. In their system, they considered a measurable context space with a Lipschitz condition, where the space is divided into many subspaces to represent different types of students. The course clusters are defined to capture the prerequisite dependencies among courses. Their dataset is composed of three parts: (1) Data of courses; (2) Context information of the students; and (3) Feedback reward records. The course data was obtained from the biggest MOOC platform in China called ‘iCourse’ which contains nearly all the Chinese online courses. The context information was collected from 4939 anonymized students in Huazhong University of Science and Technology and Central China Normal University (∼20,000 learning records). The reward records are the scores of courses and the degree of satisfaction. The authors in [114] proposed a Big data solution on the Hadoop platform for recommendation of pedagogical documents that meet the identified needs of the learner. The system uses Big data as a tool to analyze the performance and skill level of students individually and then create personalized learning experiences that fit into their specific learning paths. The authors used a semantic approach which recommends learning objects by comparing the textual contents of resources
that form a corpus of pedagogical documents and proposed Some examples of using graph-based analytics and machine
an algorithm for similarity measurement between the doc- learning to address challenges and opportunities for educa-
ument viewed by the learner and the documents of corpus tion can be found in the works by [127]–[129]. The authors
of pedagogical documents available in order to select from in [127] used observed prerequisite relations among courses
those which are most similar to the viewed document. Their to learn a directed universal concept graph and used the
work was implemented and tested on the Hadoop Big data induced graph to predict unobserved prerequisite relations
platform. For the implementation of the recommendation among a broader range of courses. This is particularly useful
algorithm, modules were coded in Python using scikit-learn to infer prerequisite relations among courses from different
and NLTK Python packages. For parallelization, MapReduce providers e.g. universities, MOOC, etc. The authors proposed
was leveraged to process the data stored in Google File a new framework called Concept Graph Learning (CGL)
System (GFS). The authors in [115] also designed and imple- for inference within and across two graphs at the course
mented a personalized recommendation system on Big data level and at the induced concept level. The explicit learn-
platform. Their system can help people to automatically exca- ing of the directed graph for universal concepts is the key
vate interesting and valuable information from target data. part of the framework. Once the concept graph is learned,
A personalized education resource recommendation system it could be used to predict unobserved prerequisite relations
which can handle Big data is studied and implemented. The among different courses including those not in the training
results showed that the personalized recommendation system set and from multiple sources. Their experiments showed
of educational resources based on Big data has been put promising results for cross-universities setting. The universal
into use in a university network and achieved the expected transferability is particularly desirable in MOOC environ-
design goal. This system, combining the discipline classi- ments where courses are offered by different universities and
fication tree and the recommended structure, provides the instructors.
resilient processing ability with the increase of data and the The authors in [128] addressed the graph analysis problem
personalized recommendation function based on the security, in multi-source relational learning for educational data. When
high efficiency and real-time of Big data. It provides effec- the numbers of nodes in multiple graphs are large, the labeled
tive help for the students and teachers to make use of the training instances are extremely sparse. Existing methods
valuable teaching resources. However, when they evaluated such as tensor factorization or tensor kernel machines do
their recommendation algorithm, the MovieLens dataset (not not work well because of the lack of convex formulation
educational data) was used to verify the performance. for the optimization, the poor scalability of the algorithms
Other educational recommendation systems can be found in handling combinatorial numbers of tuples and the non-
in [116]–[119]. The authors in [116] built a personalized transductive nature of the learning methods which limits their
English learning recommender system for students to set ability to leverage unlabeled data in training. The authors pro-
basic score of lessons. The collaborative filtering technique posed a Cross-graph Relational Learning (CGRL) approach
and content-based method was used. Another author [117] for predicting the strengths or labels of multi-relational tuples
developed a recommender system for predicting student of heterogeneous object types. They formulated the CGRL
performance. Their approach mapped educational data to as a convex optimization problem which enable transduc-
user/item. The matrix factorization technique was used to tive learning using both labeled and unlabeled tuples and
generate the recommendation and logistic regression to vali- proposed a scalable algorithm that guarantees the optimal
date their approach. An automated recommender system for solution and enjoys a linear time complexity with respect to
course selection can be found in [118]. The collaborative the sizes of input graphs. The authors conducted the experi-
recommendation technique was used to recommend elec- ments on 34,340 DBLP publication records in the domain of
tive courses to students by using association rule mining Artificial Intelligence. Tuples in the form of (Author, Paper,
to generate course association rules. The authors in [119] Venue) were extracted from the publication records leading to
built a semantic educational recommender system in for- 15,514 tuples (cross-graph interactions) after preprocessing.
mal e-learning scenarios. They used a conceptual approach The authors showed that their proposed method success-
which can be used as personalized recommender in e-learning fully scaled to the large cross-graph inference problem, and
scenarios in their work. Other examples of earlier works outperformed other representative approaches significantly.
on recommendation systems for e-learning can be found A recent work on graph analytics by [129] presented the
in [120]–[126]. The Recommendation Agent for e-learning early detection prediction of learning outcomes in online
systems is one of the first collaborative filtering educational short course via learner behaviors. Through evaluation on
recommendation systems that have been established [120]. data captured from three two-week courses hosted through
delivery platforms, the authors made three key observations:
D. GRAPH ANALYTICS (1) Behavioral data contains signals predictive of learning
Graph analytics can be used to determine the strength and outcomes in short-courses (with classifiers achieving AUCs
direction of relationships between objects in a graph. This ≥ 0.8 after the two weeks); (2) Early detection is possible
section discusses some research works to address challenges within the first week (AUCs ≥ 0.7 with the first week of
of Big data from online education data using graph analysis. data); and (3) Content features have an ‘‘earliest’’ detection
capability (with higher AUC in the first few days), while the SLN features become the more predictive
set over time as the network matures. They also discuss how their method can generate behavioral
analytics for instructors.

E. VISUAL ANALYTICS
Visual analytics (VA) focuses on analytical reasoning facilitated by interactive visual interfaces
and scientific visualization. This section reviews and discusses applications of VA in Big education
data. The authors in [130] presented a systematic review of the emerging field of visual learning
analytics of educational data. The authors found that: (1) Few works have been done to bring visual
learning analytics tools into classroom settings; (2) Few studies have considered background
information from the students such as demographics or prior performance; (3) Traditional statistical
visualization techniques such as bar plots and scatter plots are still commonly used in learning
analytics contexts; and (4) While some studies employ sophisticated visualizations, there is a lack
of studies that employ sophisticated visualizations and engage deeply with educational theories. Two
other studies on visual data mining can be found in [131] and [132]. The use of VA methods can help
turn the features of education into a visible type of representation, which can be seen and
interpreted by means of a variety of diagrams, charts, tables, infographics and other visual forms
[133]. For example, learning activities have geolocation characteristics which can be projected onto
a map, while knowledge resources can also be converted into map form. A map-based management and
visual analysis method will largely help users and researchers take advantage of Big data in
education.
The authors in [134] proposed a novel map-based method to manage and analyze mobile learning in Big
education data. They retrieved the geographic location information from the GPS for the activities of
participants recorded by the mobile learning systems and projected the data onto a map with a
geographic reference and projection parameters. The layers of the newly generated map can
subsequently be integrated with an open map service like Google Maps or Baidu Maps. The learning
activities and resources can be described as points, lines or polygons in vector form on the map. The
map-based representations provide new methods (e.g. map browsing) to explore learning practices. With
their approach, the activities of users scattered across space are reorganized on a geographic map
with location changes in time series, and the resources are geotagged with information from the
developers or adopters and converted to a map style according to their hierarchical structures. The
authors performed experiments using mobile learning data from the platform named M-starC of Central
China Normal University (CCNU), which allows participants to use a mobile learning application to
access the learning resources. Their experiment aimed to analyze personal learning patterns. Classes
were obtained from the data using the k-means method. The clusters revealed the spatial patterns of
the individuals who learned during the test duration. They also analyzed the group learning patterns
of mobile learners and the location distribution.
The authors in [135] developed a novel approach, Be the Data, which exploits embodiment in visual
analytics to invoke experiential learning. The authors designed and proposed a visual analytics
approach to teach students about exploring alternative two-dimensional (2D) projections of
high-dimensional data points using weighted multi-dimensional scaling. In their approach, each
student embodies a data point, and the position of students in a physical space represents a 2D
projection of the high-dimensional data. Students physically move within the room with respect to
each other to collaboratively construct alternative projections and receive visual feedback about
relevant data dimensions. The approach exploits a large interactive room called the Cube and includes
a large overhead display, a vision-based motion tracking system, and a software system for direct
manipulation of high-dimensional data. To use the system, a group of students enter the Cube and
embody virtual data points by wearing trackable hats which detect the locations of students in real
time. Their experimental findings indicate that the Be the Data approach provided the engagement to
enable students to quickly learn about high-dimensional data and analysis processes despite their
minimal prior knowledge. They identified student data analytical strategies that employ this form of
embodiment and found both qualitative and quantitative evidence of student improvement in
understanding high-dimensional data. Visual analytics approaches can also be usefully employed in
MOOC. For example, VisMOOC [136] is an interactive visual analytics system which can analyze video
clickstream data by using a seeking diagram, PeakVizor [137] uses a correlation view and a flow view
to uncover spatial and temporal information of peaks in video clickstreams from MOOC, and the
DropoutSeer system [138] uses a timeline view of stacked timelines and glyphs to uncover the
participants’ learning activities and patterns, and can also predict dropout.

F. IMMERSIVE LEARNING & ANALYTICS
The emergence of immersive learning approaches enabled by virtual reality (VR) technologies has given
instructors and educators more flexibility and tools in designing active-based learning environments.
Immersive learning techniques use computer graphics and human-computer interaction technologies to
create simulated virtual worlds in which, with suitable pedagogical approaches, students can learn
collaboratively [139], [140]. For example, the Second Life virtual world enables learners to create
avatars in the virtual world for interaction with virtual objects and virtual environments [141].
Compared to traditional learning environments, immersive learning environments allow learners to
explore problems and experience solutions in the virtual environment through experiential learning.
The authors in [142] presented an empirical study of designing and evaluating
an immersive learning experience for a MOOC termed the VirtualHK MOOC. The authors’ work showed that
an immersive learning experience may not directly impact the knowledge gain of learning but can
improve the overall learning experience by better motivating learners and making the learning more
enjoyable. The student feedback and sentiment analysis showed that 52.73% of the learners gave
"positive" comments and 47.27% gave "neutral" comments for the immersive learning experience.

G. SOCIAL MEDIA ANALYTICS
Student interactions and informal conversations on social media (e.g. Twitter, Facebook) give useful
insights into their educational experiences, emotions and concerns about the learning process.
However, data collection and analytics from social media data can be challenging due to its
complexity. The collection of social media data has been presented in the previous section
(Section IV). Normally, the student learning experiences acquired from social media content would
require human interpretation. However, the growing scale of data volume and variety demands automatic
data analytics techniques. This section gives a brief overview of mining social media data such as
Twitter, followed by inductive content analysis, which is frequently used in social media analytics,
and prominent themes. The previous section (Section IV) only presents the reviews of education data
mining research. Here we give some examples of studies on Twitter from the fields of data mining,
machine learning and natural language processing for education models and algorithms. The authors in
[37] presented a work on mining social media data for understanding student learning experience from
Twitter posts at Purdue University. The authors conducted a qualitative analysis of 25,000 tweets
from engineering students and implemented a classification algorithm for tweets reflecting the
students’ problems. Their work presented a methodology that showed how data from social media can be
used to provide insight into student learning experiences. The proliferation of multimedia technology
in social learning spaces allows student emotions and sentiments to be captured and automatically
classified from audio-visual devices such as web-cameras and microphones [149].

VII. CHALLENGES FOR BIG DATA IN EDUCATION AND LEARNING ANALYTICS
This section presents challenges for Big data in education and learning analytics from two
perspectives: (1) social challenges; and (2) technological challenges. The technological and
practical challenges are illustrated by giving an example of utilizing graph analytics for a
university-based learning analytics scenario.

A. SOCIAL CHALLENGES
As in many fields where large amounts of data are being collected, there are also several important
social challenges including privacy, ethical, security and safety issues to be addressed for Big
education data. The authors in [144] considered a scenario where learning analytics (LA) could be
used to track students, and their performance could be flagged to deny a student access to future
education programs based on the pre-conceived student ability for institutional decision-making,
leading to unintended outcomes. The authors in [145] remarked that LA presents significant student
privacy challenges for higher education institutions. In their work, the authors also posited four
points that LA must justify in relation to the use of student data: (1) LA systems should provide
controls for differential access to private student data; (2) Institutions must be able to justify
their data collection using specific criteria; (3) The actual or perceived positive consequences of
LA may not be equally beneficial for all students. A full accounting is required of how benefits are
distributed between institutions and students and among students; and (4) Students should be made
aware of collection and use of their data and permitted reasonable choices regarding collection and
use of that data. The authors in [146] remarked that privacy and data protection are major stumbling
blocks for a data-driven educational future. In this work, the authors proposed three principles to
guide the practical deployment of LA and Big education data systems: (1) Privacy and data protection
in LA are achieved by negotiating data sharing with each student; (2) How the educational institution
will use data and act upon the insights of analysis should be clarified in close dialogue with the
students; and (3) In negotiating privacy and data protection measures with students, schools and
universities should use this opportunity to strengthen their personal data literacies.

B. TECHNOLOGICAL CHALLENGES
There are several technological opportunities and challenges for employing Big data in education and
learning analytics due to the large and increasing amounts of online education data. As discussed in
Section V, Big education systems would require access to a high-performance computational
infrastructure which can handle a large amount of data for capture, storage, processing and
visualization. There are also several issues and considerations for practical deployment of Big
education data systems due to the lack of interoperability of institutional data systems and the
different forms of data storage in disparate databases [143]. The absence of cross-institutional
policies for data sharing and integration creates another major challenge to be addressed for Big
education systems [38].
To illustrate some technological challenges and the usefulness and potential of exploiting
cross-institutional Big education data, we performed an investigation for practical deployment of a
Big education data system across some institutions in Australia. The objective of the system is to
detect the unobserved prerequisite dependencies among online courses for different universities in
Australia. This system would be useful for students to infer prerequisite relations among courses
from different providers (e.g. universities, MOOC) to chart their learning pathways.
Our approach is based on graph-based analytics like the techniques proposed by [127], [128]. The
graph-based
TABLE 3. Data statistics for crawled university subject data.
TABLE 4. Performance using MAP for cross-institution subject data.
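As an aid to reading Table 4, the following is a minimal sketch of how MAP is commonly computed, assuming that MAP here denotes mean average precision over ranked lists of predicted prerequisites. The function names and the toy subject codes are hypothetical illustrations and are not taken from the crawled data or from the system described above.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked prediction list against the true set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           truth: Dict[str, Set[str]]) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    total = sum(average_precision(ranked, truth.get(query, set()))
                for query, ranked in rankings.items())
    return total / len(rankings)


# Hypothetical toy data: for each target subject, candidate prerequisites
# ranked by predicted score, plus the observed (true) prerequisites.
rankings = {
    "COMP200": ["COMP100", "MATH100", "COMP150"],
    "STAT300": ["MATH100", "STAT200"],
}
truth = {
    "COMP200": {"COMP100", "COMP150"},
    "STAT300": {"STAT200"},
}
map_score = mean_average_precision(rankings, truth)
```

In this toy case the two per-subject average precision values are 5/6 and 1/2, so the MAP is 2/3. A higher MAP means that true prerequisites tend to appear nearer the top of each ranked list.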
Learning@Scale, AIED (Artificial Intelligence in Education) and periodicals such as JEDM (Journal of
Educational Data Mining), IEEE Transactions on Learning Technologies for the latest research.

REFERENCES
[1] A. S. Alblawi and A. A. Alhamed, “Big data and learning analytics in higher education: Demystifying variety, acquisition, storage, NLP and analytics,” in Proc. IEEE Conf. Big Data Analytics (ICBDA), Nov. 2017, pp. 124–129.
[2] P. Michalik, J. Stofa, and I. Zolotova, “Concept definition for big data architecture in the education system,” in Proc. IEEE 12th Int. Symp. Appl. Mach. Intell. Informat. (SAMI), Jan. 2014, pp. 331–334.
[3] R. Machova, J. Komarkova, and M. Lnenicka, “Processing of big educational data in the cloud using apache Hadoop,” in Proc. Int. Conf. Inf. Soc. (i-Soc.), Oct. 2016, pp. 46–49.
[4] M.-S. Lee, E. Kim, C.-S. Nam, and D.-R. Shin, “Design of educational big data application using spark,” in Proc. 19th Int. Conf. Adv. Commun. Technol. (ICACT), Feb. 2017, pp. 355–357.
[5] Q. Zheng, H. He, T. Ma, N. Xue, B. Li, and B. Dong, “Big log analysis for E-Learning ecosystem,” in Proc. IEEE 11th Int. Conf. e-Bus. Eng., Nov. 2014, pp. 258–263.
[6] D. Marjanovic, M. Milovanovic, and B. Radenkovic, “Hadoop infrastructure for education,” in Proc. 14th Int. Symp. New Bus. Models Sustain. Competitiveness, 2014, pp. 365–370.
[7] C. Zhenyu, “The application of big data in higher vocational education based on holland vocational interest theory,” in Proc. Int. Conf. Ind. Informat. Comput. Technol., Intell. Technol., Ind. Inf. Integr. (ICIICII), Dec. 2017, pp. 37–40.
[8] H. Wang, Q. Wang, and W. Wang, “Text mining for educational literature on big data with Hadoop,” in Proc. IEEE Int. Conf. Smart Cloud (SmartCloud), Sep. 2018, pp. 166–170.
[9] R. Swathi, N. P. Kumar, L. Kirankranth, L. S. Madhav, and R. Seshadri, “Systematic approach on big data analytics in education systems,” in Proc. Int. Conf. Intell. Comput. Control Syst. (ICICCS), Jun. 2017, pp. 420–423.
[10] J. Chen, J. Tang, Q. Jiang, Y. Wang, C. Tao, X. Zhang, and J. Liao, “Research on architecture of education big data analysis system,” in Proc. IEEE 2nd Int. Conf. Big Data Anal. (ICBDA), Mar. 2017, pp. 601–605.
[11] M. W. Rodrigues, S. Isotani, and L. E. Zárate, “Educational data mining: A review of evaluation process in the e-learning,” Telematics Informat., vol. 35, no. 6, pp. 1701–1717, Sep. 2018.
[12] C. Romero and S. Ventura, “Educational data mining: A survey from 1995 to 2005,” Expert Syst. Appl., vol. 33, no. 1, pp. 135–146, Jul. 2007.
[13] A. Dutt, M. A. Ismail, and T. Herawan, “A systematic review on educational data mining,” IEEE Access, vol. 5, pp. 15991–16005, 2017.
[14] C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 6, pp. 601–618, Nov. 2010.
[15] R. Sachin and M. Vijay, “A survey and future vision of data mining in educational field, advanced computing communication technologies (ACCT),” in Proc. 2nd Int. Conf., Jan. 2012, pp. 96–100.
[16] S. K. Mohamad and Z. Tasir, “Educational data mining: A review,” in Proc. 9th Int. Conf. Cognit. Sci., vol. 97, pp. 320–324, Nov. 2013.
[17] R. Jindal and M. D. Borah, “A survey on educational data mining and research trends,” Int. J. Database Manage. Syst., vol. 5, no. 3, pp. 53–73, Jun. 2013.
[18] A. Peña-Ayala, “Educational data mining: A survey and a data mining-based analysis of recent works,” Expert Syst. Appl., vol. 41, no. 4, pp. 1432–1462, Mar. 2014.
[19] H. Aldowah, H. Al-Samarraie, and W. M. Fauzy, “Educational data mining and learning analytics for 21st century higher education: A review and synthesis,” Telematics Informat., vol. 37, pp. 13–49, Apr. 2019.
[20] O. Viberg, M. Hatakka, O. Bälter, and A. Mavroudi, “The current landscape of learning analytics in higher education,” Comput. Hum. Behav., vol. 89, pp. 98–110, Dec. 2018.
[21] A. Peña-Ayala, “Learning analytics: A glance of evolution, status, and trends according to a proposed taxonomy,” Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 8, no. 3, May 2018, Art. no. e1243.
[22] R. Ferguson and D. Clow, “Where is the evidence?: A call to action for learning analytics,” in Proc. 7th Int. Learn. Analytics Knowl. Conf., Mar. 2017, pp. 56–65.
[23] P. Leitner, M. Khalil, and M. Ebner, “Learning analytics in higher education—A literature review,” in Learning Analytics: Fundaments, Applications, and Trends (Studies in Systems, Decision and Control), vol. 94, A. Peña-Ayala, Ed. Cham, Switzerland: Springer, 2017, pp. 1–23.
[24] Z. K. Papamitsiou and A. A. Economides, “Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence,” Educ. Technol. Soc., vol. 17, no. 4, pp. 49–64, Oct. 2014.
[25] L. K. Chew, “Using xAPI and learning analytics in education,” in Elearning Forum Asia, 2016, pp. 13–15.
[26] O. Bohl, J. Scheuhase, R. Sengler, and U. Winand, “The sharable content object reference model (SCORM)—A critical review,” in Proc. Int. Conf. Comput. Edu., Dec. 2002, pp. 950–951.
[27] J. P. Leal and R. Queirós, “Using the learning tools interoperability framework for LMS integration in service oriented architectures,” Technol. Enhanced Learn. Tech-Educ., to be published.
[28] M. Dougiamas and P. Taylor, “Moodle: Using learning communities to create an open source course management system,” in Proc. EdMedia+ Innovate Learn., Assoc. Advancement Comput. Educ. (AACE), 2003, pp. 171–178.
[29] The Top Open Source Learning Management Systems. Accessed: Feb. 2020. [Online]. Available: https://elearningindustry.com/top-open-source-learning-management-systems
[30] M. H. Mohamed and M. Hammond, “MOOCs: A differentiation by pedagogy, content and assessment,” Int. J. Inf. Learn. Technol., vol. 35, no. 1, pp. 2–11, Jan. 2018.
[31] A. Agrawal, A. Kumar, and P. Agrawal, “Massive open online courses: EdX.org, Coursera.com and NPTEL, a comparative study based on usage statistics and features with special reference to India,” INFLIBNET Centre, Tech. Rep., 2015.
[32] S. I. El Ahrache, H. Badir, Y. Tabaa, and A. Medouri, “Massive open online courses: A new dawn for higher education,” Int. J. Comput. Sci. Eng., vol. 5, no. 5, p. 323, 2013.
[33] [Online]. Available: http://sociallearningcommunity.com/10-of-the-best-mooc-providers/
[34] [Online]. Available: https://en.unesco.org/events/experts-meeting-defining-open-educational-resources-oer-indicators
[35] [Online]. Available: http://discourse.col.org/t/what-are-examples-of-oer/27
[36] J. C. Taylor, “Open courseware futures: Creating a parallel universe,” e-JIST, vol. 10, no. 1, pp. 1–7, 2007.
[37] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan, “Mining social media data for understanding students’ learning experiences,” IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246–259, Jul. 2014.
[38] A. Dix, “Challenge and potential of fine grain, cross-institutional learning data,” in Proc. 3rd ACM Conf. Learn. Scale (L@S), 2016, pp. 261–264.
[39] C. K. Pereira, S. W. M. Siqueira, B. P. Nunes, and S. Dietze, “Linked data in education: A survey and a synthesis of actual research and future challenges,” IEEE Trans. Learn. Technol., vol. 11, no. 3, pp. 400–412, Jul. 2018.
[40] D. Taibi and S. Dietze, “Fostering analytics on learning analytics research: The LAK dataset,” in Proc. CEUR Workshop, vol. 974, 2013, pp. 5–7.
[41] R. Meymandpour and J. G. Davis, “Ranking universities using linked open data,” J. Stud. Int. Educ., vol. 18, no. 2, pp. 318–327, 2007.
[42] B. E. Penteado, “Correlational analysis between school performance and municipal indicators in Brazil supported by linked open data,” in Proc. 25th Int. Conf. Companion World Wide Web (WWW Companion), 2016, pp. 507–512.
[43] E. A. Amrieh, T. Hamtini, and I. Aljarah, “Preprocessing and analyzing educational data set using X-API for improving student’s performance,” in Proc. IEEE Jordan Conf. Appl. Electr. Eng. Comput. Technol. (AEECT), Nov. 2015, pp. 1–5.
[44] C. Keßler, M. d’Aquin, and S. Dietze, “Linked data for science and education,” Semantic Web, vol. 4, no. 1, pp. 1–2, 2013.
[45] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “Dbpedia: A nucleus for a Web of open data,” in The Semantic Web. Berlin, Germany: Springer, 2007, pp. 722–735.
[46] K. Bollacker, R. Cook, and P. Tufts, “Freebase: A shared database of structured general human knowledge,” in Proc. AAAI, vol. 7, Jul. 2007, pp. 1962–1963.
[47] T. Rebele, F. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, “YAGO: A multilingual knowledge base from wikipedia, wordnet, and geonames,” in Proc. Int. Semantic Web Conf. Cham, Switzerland: Springer, Oct. 2016, pp. 177–185.
[48] N. Bassiliades, “Collecting university rankings for comparison using Web extraction and entity linking techniques,” in Information and Communication Technologies in Education, Research, and Industrial Applications (Communications in Computer and Information Science), vol. 469, 2014, pp. 23–46.
[49] J. Robinson, J. Stan, and M. Ribière, “Using linked data to reduce learning latency for e-book readers,” in Proc. Extended Semantic Web Conf., 2012, pp. 28–34.
[50] L. D. Rubenstein, “Using TED talks to inspire thoughtful practice,” Teacher Educator, vol. 47, no. 4, pp. 261–267, Oct. 2012.
[51] [Online]. Available: http://data.linkededucation.org/linkedup/catalog/
[52] R. A. Huebner, “A survey of educational data-mining research,” Res. Higher Educ. J., vol. 19, no. 4, pp. 1–13, 2013.
[53] P. Guleria and M. Sood, “Data mining in education: A review on the knowledge discovery perspective,” Int. J. Data Mining Knowl. Manage. Process, vol. 4, no. 5, pp. 47–60, Sep. 2014.
[54] S. Yu, D. Yang, and X. Feng, “A big data analysis method for online education,” in Proc. 10th Int. Conf. Intell. Comput. Technol. Autom. (ICICTA), Oct. 2017, pp. 291–294.
[55] I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, and S. U. Khan, “The rise of ‘big data’ on cloud computing: Review and open research issues,” Inf. Syst., vol. 47, pp. 98–115, Jan. 2015.
[56] S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell, “Samza: Stateful scalable stream processing at LinkedIn,” Proc. VLDB Endowment, vol. 10, no. 12, pp. 1634–1645, Aug. 2017.
[57] S. Roy and S. N. Singh, “Emerging trends in applications of big data in educational data mining and learning analytics,” in Proc. 7th Int. Conf. Cloud Comput., Data Sci. Eng. Confluence, Jan. 2017, pp. 193–198.
[58] L. Cen, D. Ruta, and J. Ng, “Big education: Opportunities for big data analytics,” in Proc. IEEE Int. Conf. Digit. Signal Process. (DSP), Jul. 2015, pp. 502–506, doi: 10.1109/ICDSP.2015.7251923.
[59] M. S. Vyas and R. Gulwani, “Predictive analytics for e learning system,” in Proc. Int. Conf. Inventive Syst. Control (ICISC), Jan. 2017, pp. 1–4.
[60] V. L. Miguéis, A. Freitas, P. J. V. Garcia, and A. Silva, “Early segmentation of students according to their academic performance: A predictive modelling approach,” Decis. Support Syst., vol. 115, pp. 36–51, Nov. 2018.
[61] M. Feng, N. Heffernan, and K. Koedinger, “Addressing the assessment challenge with an online system that tutors as it assesses,” User Model. User-Adapted Interact., vol. 19, no. 3, pp. 243–266, Aug. 2009.
[62] M. Jose, P. S. Kurian, and V. Biju, “Progression analysis of students in a higher education institution using big data open source predictive modeling tool,” in Proc. 3rd MEC Int. Conf. Big Data Smart City (ICBDSC), Mar. 2016, pp. 1–5.
[63] B. Kumar and S. Pal, “Mining educational data to analyze students performance,” Int. J. Adv. Comput. Sci. Appl., vol. 2, no. 6, pp. 1–8, 2012.
[64] M. H. Abdous, H. Wu, and C. J. Yen, “Using data mining for predicting relationships between online question theme and final grade,” J. Educ. Technol. Soc., vol. 15, no. 3, p. 77, 2012.
[65] A. Wolff, Z. Zdrahal, D. Herrmannova, and P. Knoth, “Predicting student performance from combined data sources,” in Educational Data Mining, 2013.
[66] C. Romero, P. G. Espejo, A. Zafra, J. R. Romero, and S. Ventura, “Web usage mining for predicting final marks of students that use moodle courses,” Comput. Appl. Eng. Edu., vol. 21, no. 1, pp. 135–146, Mar. 2013.
[67] V. Ramesh, P. Parkavi, and K. Ramar, “Predicting student performance: A statistical and data mining approach,” Int. J. Comput. Appl., vol. 63, no. 8, pp. 35–39, 2012.
[70] K. Bunkar, U. K. Singh, B. Pandya, and R. Bunkar, “Data mining: Prediction for performance improvement of graduate students using classification,” in Proc. 9th Int. Conf. Wireless Opt. Commun. Netw. (WOCN), Sep. 2012, pp. 1–5.
[71] [Online]. Available: web.ysu.edu/gen/ysu_generated_bin/documents/basic_module/Key_Causes_of_Student_AttritionComprehensive_Retention_Plan.pdf
[72] J. Liang, J. Yang, Y. Wu, C. Li, and L. Zheng, “Big data application in education: Dropout prediction in edx MOOCs,” in Proc. IEEE 2nd Int. Conf. Multimedia Big Data (BigMM), Apr. 2016, pp. 440–443.
[73] R. Kanth, M.-J. Laakso, P. Nevalainen, and J. Heikkonen, “Future educational technology with big data and learning analytics,” in Proc. IEEE 27th Int. Symp. Ind. Electron. (ISIE), Jun. 2018, pp. 906–910.
[74] A. Pradeep, S. Das, and J. J. Kizhekkethottam, “Students dropout factor prediction using EDM techniques,” in Proc. Int. Conf. Soft-Comput. Netw. Secur. (ICSNS), Feb. 2015, pp. 1–7.
[75] W. L. Cambruzzi, S. J. Rigo, and J. L. Barbosa, “Dropout prediction and reduction in distance education courses with the learning analytics multitrail approach,” J. UCS, vol. 21, no. 1, pp. 23–47, 2015.
[76] C. Márquez-Vera, A. Cano, C. Romero, A. Y. M. Noaman, H. Mousa Fardoun, and S. Ventura, “Early dropout prediction using data mining: A case study with high school students,” Expert Syst., vol. 33, no. 1, pp. 107–124, Feb. 2016.
[77] G. Dekker, M. Pechenizkiy, and J. Vleeshouwers, “Predicting students drop out: A case study,” Educ. Data Mining, to be published.
[78] C. Márquez-Vera, A. Cano, C. Romero, and S. Ventura, “Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data,” Int. J. Speech Technol., vol. 38, no. 3, pp. 315–330, Apr. 2013.
[79] G. Dekker, M. Pechenizkiy, and J. Vleeshouwers, “Predicting students drop out: A case study,” presented at the Educ. Data Mining, Jul. 2009.
[80] J. Bayer, H. Bydzovská, J. Géryk, T. Obsivac, and L. Popelinsky, “Predicting drop-out from social Behaviour of students,” Int. Educ. Data Mining Soc., to be published.
[81] Z. Wang, C. Zhu, Z. Ying, Y. Zhang, B. Wang, X. Jin, and H. Yang, “Design and implementation of early warning system based on educational big data,” in Proc. 5th Int. Conf. Syst. Informat. (ICSAI), Nov. 2018, pp. 549–553.
[82] T. Denley, “Degree compass: A course recommendation system,” Educause Rev. Online, pp. 1–5, Jun. 2013.
[83] P. Guleria and M. Sood, “Big data analytics: Predicting academic course preference using Hadoop inspired mapreduce,” in Proc. 4th Int. Conf. Image Inf. Process. (ICIIP), Dec. 2017, pp. 1–4.
[84] A. Pejic and P. S. Molcer, “Exploring data mining possibilities on computer based problem solving data,” in Proc. IEEE 14th Int. Symp. Intell. Syst. Informat. (SISY), Aug. 2016, pp. 171–176.
[85] L. Cen, D. Ruta, L. Powell, and J. Ng, “Learning alone or in a group - an empirical case study of the collaborative learning patterns and their impact on student grades,” in Proc. Int. Conf. Interact. Collaborative Learn. (ICL), Dec. 2014.
[86] Á. F. Agudo-Peregrina, S. Iglesias-Pradas, M. Á. Conde-González, and Á. Hernández-García, “Can we predict success from log data in VLEs? Classification of interactions for learning analytics and their relation with performance in VLE-supported F2F and online learning,” Comput. Hum. Behav., vol. 31, pp. 542–550, Feb. 2014.
[87] A. van Leeuwen, J. Janssen, G. Erkens, and M. Brekelmans, “Supporting teachers in guiding collaborating students: Effects of learning analytics in CSCL,” Comput. Edu., vol. 79, pp. 28–39, Oct. 2014.
[88] J. Janssen, G. Erkens, and G. Kanselaar, “Visualization of agreement and discussion processes during computer-supported collaborative learning,” Comput. Hum. Behav., vol. 23, no. 3, pp. 1105–1125, May 2007.
[89] R. Cerezo, M. Sánchez-Santillán, M. P. Paule-Ruiz, and J. C. Núñez, “Students’ LMS interaction patterns and their relationship with achievement: A case study in higher education,” Comput. Edu., vol. 96, pp. 42–54, May 2016.
[68] W. Xing, R. Guo, E. Petakovic, and S. Goggins, ‘‘Participation- [90] Á. Fidalgo-Blanco, M. L. Sein-Echaluce, F. J. García-Peñalvo, and
based student final performance prediction model through interpretable M. Á. Conde, ‘‘Using learning analytics to improve teamwork assess-
genetic programming: Integrating learning analytics, educational data ment,’’ Comput. Hum. Behav., vol. 47, pp. 149–156, Jun. 2015.
mining and theory,’’ Comput. Hum. Behav., vol. 47, pp. 168–181, [91] P. Williams, ‘‘Assessing collaborative learning: Big data, analytics and
Jun. 2015. university futures,’’ Assessment Eval. Higher Edu., vol. 42, no. 6,
[69] M. Nasiri, B. Minaei, and F. Vafaei, ‘‘Predicting GPA and aca- pp. 978–989, Aug. 2017.
demic dismissal in LMS using educational data mining: A case min- [92] L. dos Santos Machado and K. Becker, ‘‘Distance education: A Web usage
ing,’’ in Proc. 6th Nat. 3rd Int. Conf. E-Learn. E-Teach., Feb. 2012, mining case study for the evaluation of learning sites,’’ in Proc. 3rd IEEE
pp. 53–58. Int. Conf. Adv. Technol., Jul. 2003, pp. 360–361.
[93] L. Wang, J. Li, L. Ding, and P. Li, ‘‘E-learning evaluation system based on data mining,’’ in Proc. 2nd Inf. Eng. Electron. Commerce (IEEC), Jul. 2010, pp. 1–3.
[94] V. Pascual-Cid, L. Vigentini, and M. Quixal, ‘‘Visualising virtual learning environments: Case studies of the Website exploration tool,’’ in Proc. 14th Int. Conf. Inf. Visualisation, Jul. 2010, pp. 149–155.
[95] I. L. M. Ricarte and G. R. F. Junior, ‘‘A methodology for mining data from computer-supported learning environments,’’ Informática na educação: Teoria Prática, vol. 14, no. 2, pp. 83–94, 2011.
[96] J. S. Kinnebrew, J. R. Segedy, and G. Biswas, ‘‘Integrating model-driven and data-driven techniques for analyzing learning behaviors in open-ended learning environments,’’ IEEE Trans. Learn. Technol., vol. 10, no. 2, pp. 140–153, Apr. 2017.
[97] A. Nussbaumer, E.-C. Hillemann, C. Gütl, and D. Albert, ‘‘A competence-based service for supporting self-regulated learning in virtual environments,’’ J. Learn. Anal., vol. 2, no. 1, pp. 101–133, 2015.
[98] J. L. Sabourin, B. W. Mott, and J. C. Lester, ‘‘Early prediction of student self-regulation strategies by combining multiple models,’’ Int. Educ. Data Mining Soc., to be published.
[99] K. Pietrosanti. When E-Learning Technologies Embrace Big Data. Accessed: Feb. 2020. [Online]. Available: https://www.docebo.com/2013/12/06/when-elearning-technologiesembrace-big-data-2/
[100] K. Habitzel, T. D. Mrk, B. Stehno, and S. Prock, ‘‘Microlearning: Emerging concepts, practices and technologies after e-learning,’’ Proc. Microlearning Learn. Work. New Media, vol. 5, no. 3, 2006.
[101] R. Ferguson and S. B. Shum, ‘‘Social learning analytics: Five approaches,’’ presented at the Proc. 2nd Int. Conf. Learn. Anal. Knowl., 2012.
[102] E. Duval, ‘‘Attention please!: Learning analytics for visualization and recommendation,’’ presented at the Proc. 1st Int. Conf. Learn. Anal. Knowl., 2011.
[103] C.-M. Chen, C.-M. Hong, and C.-C. Chang, ‘‘Mining interactive social network for recommending appropriate learning partners in a Web-based cooperative learning environment,’’ in Proc. IEEE Conf. Cybern. Intell. Syst., Sep. 2008, pp. 642–647.
[104] E. A. Heathcote and S. P. Dawson, ‘‘Data mining for evaluation, benchmarking and reflective practice in a LMS,’’ presented at the E-Learn World Conf. E-Learn. Corporate, Government, Healthcare Higher Educ., Vancouver, BC, Canada, Oct. 2005.
[105] [Online]. Available: https://xapi.com/overview/
[106] M. Manso-Vazquez, M. Caeiro-Rodriguez, and M. Llamas-Nistal, ‘‘XAPI-SRL: Uses of an application profile for self-regulated learning based on the analysis of learning strategies,’’ in Proc. IEEE Frontiers Edu. Conf. (FIE), Oct. 2015, pp. 1–8.
[107] Y. Wu, S. Guo, and L. Zhu, ‘‘Design and implementation of data collection mechanism for 3D design course based on xAPI standard,’’ Interact. Learn. Environments, pp. 1–18, Dec. 2019.
[108] A. Berg, M. Scheffel, H. Drachsler, S. Ternier, and M. Specht, ‘‘Dutch cooking with xAPI recipes: The good, the bad, and the consistent,’’ in Proc. IEEE 16th Int. Conf. Adv. Learn. Technol. (ICALT), Jul. 2016, pp. 234–236.
[109] A. Nouira, L. Cheniti-Belcadhi, and R. Braham, ‘‘An enhanced xAPI data model supporting assessment analytics,’’ Procedia Comput. Sci., vol. 126, pp. 566–575, Jan. 2018.
[110] C. Ellis, ‘‘Broadening the scope and increasing the usefulness of learning analytics: The case for assessment analytics,’’ Brit. J. Educ. Technol., vol. 44, no. 4, pp. 662–664, Jul. 2013.
[111] L. Cao, ‘‘Non-IID recommender systems: A review and framework of recommendation paradigm shifting,’’ Engineering, vol. 2, no. 2, pp. 212–224, Jun. 2016.
[112] S. Dwivedi and V. S. K. Roshni, ‘‘Recommender system for big data in education,’’ in Proc. 5th Nat. Conf. E-Learn. E-Learn. Technol. (ELELTECH), Aug. 2017, pp. 1–4.
[113] Y. Hou, P. Zhou, J. Xu, and D. O. Wu, ‘‘Course recommendation of MOOC with big data support: A contextual online learning approach,’’ in Proc. IEEE INFOCOM Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), Apr. 2018, pp. 106–111.
[114] M. Qbadou, I. Salhi, and K. Mansouri, ‘‘Towards an educational recommendation system based on big data techniques-case of Hadoop,’’ in Proc. 4th Int. Conf. Optim. Appl. (ICOA), Apr. 2018, pp. 1–5.
[115] L. Feng and G. Wei-wei, ‘‘Design and implementation of personalized recommendation system under big data platform,’’ in Proc. 11th Int. Conf. Intell. Comput. Technol. Autom. (ICICTA), Sep. 2018, pp. 291–294.
[116] M.-H. Hsu, ‘‘A personalized English learning recommender system for ESL students,’’ Expert Syst. Appl., vol. 34, no. 1, pp. 683–688, Jan. 2008.
[117] N. Thai-Nghe, L. Drumond, A. Krohn-Grimberghe, and L. Schmidt-Thieme, ‘‘Recommender system for predicting student performance,’’ Procedia Comput. Sci., vol. 1, no. 2, pp. 2811–2819, 2010.
[118] O. C. Santos and J. G. Boticario, ‘‘Requirements for semantic educational recommender systems in formal E-learning scenarios,’’ Algorithms, vol. 4, no. 2, pp. 131–154, 2011.
[119] O. R. Zaiane, ‘‘Building a recommender agent for e-learning systems,’’ in Proc. Int. Conf. Comput. Edu., Dec. 2002, pp. 55–59.
[120] J. Lu, ‘‘Personalized e-learning material recommender system,’’ in Proc. Int. Conf. Inf. Technol. Appl., 2004, pp. 374–379.
[121] F.-H. Wang and H.-M. Shao, ‘‘Effective personalized recommendation based on time-framed navigation clustering and association mining,’’ Expert Syst. Appl., vol. 27, no. 3, pp. 365–377, Oct. 2004.
[122] N. Baloian, P. Galdames, C. A. Collazos, and L. A. Guerrero, ‘‘A model for a collaborative recommender system for multimedia learning material,’’ in Proc. Int. Conf. Collaboration Technol., Sep. 2004, pp. 281–288.
[123] C.-M. Chen, H.-M. Lee, and Y.-H. Chen, ‘‘Personalized e-learning system using item response theory,’’ Comput. Edu., vol. 44, no. 3, pp. 237–255, Apr. 2005.
[124] M. Gomez-Albarran and G. Jimenez-Diaz, ‘‘Recommendation and students’ authoring in repositories of learning objects: A case-based reasoning approach,’’ Int. J. Emerg. Technol. Learn. (iJET), vol. 4, pp. 35–40, Oct. 2009.
[125] M. K. Khribi, M. Jemni, and O. Nasraoui, ‘‘Toward a hybrid recommender system for e-learning personalization based on Web usage mining techniques and information retrieval,’’ in Proc. World Conf. E-Learn. Corporate, Government, Healthcare Higher Educ., Oct. 2007, pp. 6136–6145.
[126] Y. Yang, H. Liu, J. Carbonell, and W. Ma, ‘‘Concept graph learning from educational data,’’ in Proc. 8th ACM Int. Conf. Web Search Data Mining (WSDM), 2015, pp. 159–168.
[127] H. Liu and Y. Yang, ‘‘Cross-graph learning of multi-relational associations,’’ in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 2235–2243.
[128] W. Chen, C. G. Brinton, D. Cao, A. Mason-Singh, C. Lu, and M. Chiang, ‘‘Early detection prediction of learning outcomes in online short-courses via learning behaviors,’’ IEEE Trans. Learn. Technol., vol. 12, no. 1, pp. 44–58, Jan. 2019.
[129] C. Vieira, P. Parsons, and V. Byrd, ‘‘Visual learning analytics of educational data: A systematic literature review and research agenda,’’ Comput. Edu., vol. 122, pp. 119–135, Jul. 2018.
[130] J. Yoo, S. Yoo, C. Lance, and J. Hankins, ‘‘Student progress monitoring tool using treeview,’’ presented at the ACM SIGCSE Bulletin, 2006.
[131] L. P. Macfadyen and P. Sorenson, ‘‘Using LiMS (the learner interaction monitoring system) to track online learner engagement and evaluate course design,’’ presented at the Educ. Data Mining, Jun. 2010.
[132] J. Zeitz, N. Self, L. House, J. R. Evia, S. Leman, and C. North, ‘‘Bringing interactive visual analytics to the classroom for developing EDA skills,’’ J. Comput. Sci. Colleges, vol. 33, no. 3, pp. 115–125, 2018.
[133] D. Zhou, H. Li, S. Liu, B. Song, and T. Hu, ‘‘A map-based visual analysis method for patterns discovery of mobile learning in education with big data,’’ in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 3482–3491.
[134] X. Chen, J. Zeitz Self, L. House, J. Wenskovitch, M. Sun, N. Wycoff, J. Robertson Evia, S. Leman, and C. North, ‘‘Be the data: Embodied visual analytics,’’ IEEE Trans. Learn. Technol., vol. 11, no. 1, pp. 81–95, Mar. 2018.
[135] C. Shi, S. Fu, Q. Chen, and H. Qu, ‘‘VisMOOC: Visualizing video clickstream data from massive open online courses,’’ in Proc. IEEE Pacific Visualizat. Symp. (PacificVis), Apr. 2015, pp. 159–166.
[136] Q. Chen, Y. Chen, D. Liu, C. Shi, Y. Wu, and H. Qu, ‘‘PeakVizor: Visual analytics of peaks in video clickstreams from massive open online courses,’’ IEEE Trans. Vis. Comput. Graphics, vol. 22, no. 10, pp. 2315–2330, Oct. 2016.
[137] Y. Chen, Q. Chen, M. Zhao, S. Boyer, K. Veeramachaneni, and H. Qu, ‘‘DropoutSeer: Visualizing learning patterns in massive open online courses for dropout reasoning and prediction,’’ in Proc. IEEE Conf. Vis. Analytics Sci. Technol. (VAST), Oct. 2016, pp. 111–120.
[138] J. Herrington, T. C. Reeves, and R. Oliver, ‘‘Immersive learning technologies: Realism and online authentic learning,’’ J. Comput. Higher Edu., vol. 19, no. 1, pp. 80–99, Sep. 2007.
[139] Z. Pan, A. D. Cheok, H. Yang, J. Zhu, and J. Shi, ‘‘Virtual reality and mixed reality for virtual learning environments,’’ Comput. Graph., vol. 30, no. 1, pp. 20–28, Feb. 2006.
[140] S. C. Baker, R. K. Wentz, and M. M. Woods, ‘‘Using virtual worlds in education: Second Life as an educational tool,’’ Teach. Psychol., vol. 36, no. 1, pp. 59–64, Jan. 2009.
[141] H. H. S. Ip, C. Li, S. Leoni, Y. Chen, K.-F. Ma, C. H.-T. Wong, and Q. Li, ‘‘Design and evaluate immersive learning experience for massive open online courses (MOOCs),’’ IEEE Trans. Learn. Technol., vol. 12, no. 4, pp. 503–515, Oct. 2019.
[142] B. Daniel, ‘‘Big data and analytics in higher education: Opportunities and challenges,’’ Brit. J. Educ. Technol., vol. 46, no. 5, pp. 904–920, Sep. 2015.
[143] B. K. Daniel, ‘‘Big data and data science: A critical review of issues for educational research,’’ Brit. J. Educ. Technol., vol. 50, no. 1, pp. 101–113, Jan. 2019.
[144] A. Rubel and K. M. L. Jones, ‘‘Student privacy in learning analytics: An information ethics perspective,’’ Inf. Soc., vol. 32, no. 2, pp. 143–159, Mar. 2016.
[145] T. Hoel and W. Chen, ‘‘Privacy and data protection in learning analytics should be motivated by an educational maxim - Towards a proposal,’’ Res. Pract. Technol. Enhanced Learn., vol. 13, no. 1, pp. 1–14, Dec. 2018.
[146] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, ‘‘Big data and its technical challenges,’’ Commun. ACM, vol. 57, no. 7, pp. 86–94, Jul. 2014.
[147] R. H. L. Ip, L.-M. Ang, K. P. Seng, J. C. Broster, and J. E. Pratley, ‘‘Big data and machine learning for crop protection,’’ Comput. Electron. Agricult., vol. 151, pp. 376–383, Aug. 2018.
[148] K. P. Seng, L. M. Ang, and C. S. Ooi, ‘‘A combined rule-based & machine learning audio-visual emotion recognition approach,’’ IEEE Trans. Affect. Comput., vol. 9, no. 1, pp. 3–13, Jan./Mar. 2018.
[149] Y. Zhang, R. Jin, and Z.-H. Zhou, ‘‘Understanding bag-of-words model: A statistical framework,’’ Int. J. Mach. Learn. Cybern., vol. 1, nos. 1–4, pp. 43–52, Dec. 2010.
[150] [Online]. Available: https://www.instructure.com/canvas/
[151] B. Flanagan and H. Ogata, ‘‘Learning analytics platform in higher education in Japan,’’ Knowl. Manage. E-Learn. (KM&EL), vol. 10, no. 4, pp. 469–484, Nov. 2018.
[152] M. Cantabella, R. Martínez-España, B. Ayuso, J. A. Yáñez, and A. Muñoz, ‘‘Analysis of student behavior in learning management systems through a big data framework,’’ Future Gener. Comput. Syst., vol. 90, pp. 262–272, Jan. 2019.
[153] O. K. Akputu, K. P. Seng, Y. Lee, and L.-M. Ang, ‘‘Emotion recognition using multiple kernel learning toward E-learning applications,’’ ACM Trans. Multimedia Comput., Commun., Appl., vol. 14, no. 1, pp. 1–20, Jan. 2018.

KENNETH LI-MINN ANG (Senior Member, IEEE) received the B.Eng. and Ph.D. degrees from Edith Cowan University, Australia. He was an Associate Professor of networked and computer systems with the School of Information and Communication Technology (ICT), Griffith University. He is currently a Professor with the School of Science and Engineering, University of Sunshine Coast. His research interests include big data analytics, multimedia Internet-of-Things, embedded systems, wireless multimedia sensor systems, reconfigurable computing and the development of real-world computer systems, and machine learning. He has published over 180 articles in journals and international refereed conferences. He is a Fellow of the Higher Education Academy, U.K.

FENG LU GE received the B.Sc. degree in information engineering from the Dalian University of Technology, China, the M.Sc. degree from the University of Wollongong, Australia, and the Ph.D. degree from Charles Sturt University. He is currently an Engineer with Pacific Telecom & Navigation Ltd., Hong Kong. He was previously a Postdoctoral Researcher with Charles Sturt University. His research interests include data analytics, computer vision, and robotics.

KAH PHOOI SENG (Member, IEEE) received the B.Eng. and Ph.D. degrees from the University of Tasmania, Australia. She is currently an Adjunct Professor with the School of Engineering and Information Technology, UNSW. Before returning to Australia, she was a Professor and the Department Head of computer science and networked system with Sunway University. Before joining Sunway University, she was an Associate Professor with the School of Electrical and Electronic Engineering, Nottingham University. She has published over 230 articles in journals and international refereed conferences. She is the lead author of the book Multimodal Analytics for Next-Generation Big Data Technologies and Applications. Her research interests include data analytics, big data, machine learning, artificial intelligence (AI) and intelligent systems, the Internet of Things (IoT), multimodal signal processing, pervasive computing and sensor networks, HCI and affective computing, and mobile software development.