The Rise of Big Data Science: A Survey of Techniques, Methods and Approaches in The Field of Natural Language Processing and Network Theory
cognitive computing
Article
Jeffrey Ray 1 , Olayinka Johnny 2 , Marcello Trovati 1, *, Stelios Sotiriadis 3 and Nik Bessis 1
1 Department of Computer Science, Edge Hill University, Ormskirk, L39 4QP, UK; Rayj@edgehill.ac.uk (J.R.);
Nik.Bessis@edgehill.ac.uk (N.B.)
2 Department of Electronics, Computing and Mathematics, University of Derby,
Derby DE22 1GB, UK; fabyinka@yahoo.com
3 Department of Computer Science and Information Systems, Birkbeck University of London,
London WC1E 7HX, UK; steliosot@msn.com
* Correspondence: trovatim@edgehill.ac.uk
Received: 30 May 2018; Accepted: 31 July 2018; Published: 2 August 2018
Abstract: The continuous creation of data has posed new research challenges due to its complexity,
diversity and volume. Consequently, Big Data has increasingly become a fully recognised scientific
field. This article provides an overview of the current research efforts in Big Data science,
with particular emphasis on its applications, as well as theoretical foundation.
Keywords: Big Data; text mining; NLP; network theory; Bayesian networks
1. Introduction
Data-driven approaches have become a crucial part of most scientific fields, as well as of
the business, social sciences, humanities and financial sectors. Given that data are continuously
created via human activity, financial transactions and sensor information, the ability to identify actionable
insights and useful trends has become a priority for many organisations [1].
Big Data research focuses on four main properties, although in other contexts a larger
number of such properties is considered [2]:
• Volume: The amount of data produced daily is enormous. The combination of real-time and historical
data provides a wealth of information to identify the appropriate and best decision process.
• Velocity: Real-time data raise numerous challenges as suitable processing power must be allocated
to allow an efficient assessment within specific time constraints. However, depending on the
sources, type and dynamics of such data, various techniques need to be implemented to provide
sufficient efficiency.
• Variety: Data consist of various types, structures, and formats. For example, information is
collected from audio or video sources, as well as from sensors and textual sources, to name but a
few. This diversity requires suitable tools and techniques that can be applied to efficiently deal
with the different data types.
• Veracity: Data are likely to contain contradictory and erroneous information, which could
jeopardise the whole process of acquisition, assessment, and management of information.
Decision Models have been developed based on various techniques and methods, which share
numerous inter-dependencies. This article aims to provide a survey of some specific approaches,
techniques and methods in Big Data, as depicted in Figure 1, with particular emphasis on automated
decision support and modelling. In particular, Sections 2 and 3 discuss Machine Learning techniques
(with emphasis on Natural Language Processing) and Network Theory within Big Data. Section 4
provides an overview of Bayesian Networks with specific application to data analysis, assessment and
extraction. Section 5 focuses on the identification of data inconsistencies and general approaches to
address this challenge. Finally, Section 6 concludes the article.
Figure 1. The main research areas discussed in this article, and their mutual inter-dependencies.
from a dataset. Examples of Unsupervised Learning include K-means for clustering and Apriori for
rule or association discovery [3].
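As an illustration of the clustering idea mentioned above, the following is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points and the choice to initialise the centroids with the first k points are assumptions made for this example only; practical implementations use more careful initialisation and vectorised arithmetic.

```python
def kmeans(points, k, iterations=20):
    """Minimal K-means sketch on a list of equal-length numeric tuples."""
    # Simplifying assumption: the first k points serve as initial centroids.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of unlabelled points (invented data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied at any point; the grouping emerges only from the distances between the points, which is what distinguishes this from the Supervised Learning setting.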
The Semi-Supervised Learning methodology is a combination of both Supervised and Unsupervised
Learning, which addresses problems that contain both labelled and unlabelled data. Many real-world
problems fall into this area as it can be expensive to utilise experts in a particular area to label an entire
dataset. Unsupervised Learning discovers the data structure and the Supervised Learning creates best
guess predictions for the unlabelled data [4]. An example of implementation of Semi-Supervised Learning
is the Python VADER sentiment analysis tool, which assesses the sentiment polarity of each word from
social media platforms [5].
The Reinforcement Learning methodology is also widely used, and it utilises complex algorithms
to take actions based on its current state. It then re-evaluates the outcome to make a further decision based
on its new condition. The machine is trained to assess different scenarios, including making specific
decisions in a training environment. This allows a trial and error approach until the most appropriate
option is identified. Popular Reinforcement Learning methods include Markov Decision Processes and
the neural-network-based NEAT (Evolving Neural Networks through Augmenting Topologies) [6].
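To make the Markov Decision Process setting more concrete, the sketch below runs value iteration on a small invented corridor world. The states, actions, rewards and the discount factor `gamma` are all assumptions chosen for illustration, and value iteration is only one of several ways to solve such a model.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-8):
    """Compute state values by repeatedly applying the Bellman optimality update."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (reward(s, a, s2) + gamma * values[s2]) for s2, p in transition(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values, states, actions, transition, reward, gamma=0.9):
    """Pick, in each state, the action with the highest expected return."""
    return {
        s: max(
            actions,
            key=lambda a: sum(
                p * (reward(s, a, s2) + gamma * values[s2]) for s2, p in transition(s, a)
            ),
        )
        for s in states
    }


# Hypothetical corridor: states 0..3, state 3 is an absorbing goal.
states = [0, 1, 2, 3]
actions = ["left", "right"]


def transition(s, a):
    """Deterministic moves, returned as a list of (next_state, probability)."""
    if s == 3:
        return [(3, 1.0)]
    return [(max(0, s - 1), 1.0)] if a == "left" else [(s + 1, 1.0)]


def reward(s, a, s2):
    """Reward of 1 for entering the goal, 0 otherwise."""
    return 1.0 if s != 3 and s2 == 3 else 0.0


values = value_iteration(states, actions, transition, reward)
policy = greedy_policy(values, states, actions, transition, reward)
print(values)
print(policy)
```

The agent re-evaluates each state from the values of the states it can reach, which mirrors the act, observe and re-decide cycle described above.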
The ability to produce intelligent analytics makes Machine Learning well suited to address various
challenges in Big Data. In fact, Machine Learning is not restricted to one data type and its highly
versatile analytical process can lead to rapid decision-making assessments and processes.
• VB is the linking verb, which needs to be associated with an influence type of relation.
One aspect of NLP, which has been extensively investigated, focuses on sentiment analysis, which aims
to detect “opinions” or polarity from textual data sources [13]. This can be particularly useful in
supporting the specific information extracted. In fact, if the overall opinion related to a specific context
is “positive”, then it may suggest that the corresponding information is discussed in positive terms [14].
As discussed in the next sections, the concepts and mutual relationships naturally create a network
structure, whose investigation can provide useful tools to investigate the overall modelling system.
3. Network Theory
Network theory has become increasingly popular in numerous research fields, including
mathematics, computer science, biology, and the social sciences [15–17]. In particular, the ability
to model complex and evolving systems has enabled its applicability to decision-making approaches
and knowledge discovery systems. The aim of this section is to provide a general overview of some
properties relevant to Decision Models, rather than an in-depth discussion. Refer to [17] for an
exhaustive analysis of Network Theory.
Networks are defined as sets of nodes $V = \{v_i\}_{i=1}^{n}$, which are connected as specified by the
edge-set $E = \{e_{ij}\}_{i \neq j = 1}^{n}$ [18]. Real-world networks are utilised to model complex systems, which often
consist of numerous components. Therefore, the resulting complexity can lead to models that are
computationally demanding. To balance accuracy with efficiency, in [12,19,20], the authors proposed a
method, based on data and text mining techniques, to determine and assess the optimal topological
reduction approximating specific real-world datasets. In [21], the topological properties of such
networks are further analysed to identify the connecting paths, which are sequences of adjacent edges.
This approach enables the identification of the mutual influences of any two concepts corresponding
to specific nodes.
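The notion of a connecting path can be sketched with a breadth-first search over an adjacency list. This is not the method of [21]; the function `shortest_path` and the small concept network are hypothetical and only show how a sequence of adjacent edges between two nodes can be recovered.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one shortest sequence of nodes from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the recorded parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None  # the two nodes are not connected


# Invented undirected concept network, stored as adjacency lists.
concepts = {
    "smoking": ["tar exposure", "stress"],
    "tar exposure": ["smoking", "lung cancer"],
    "stress": ["smoking"],
    "lung cancer": ["tar exposure"],
    "liver damage": [],
}

print(shortest_path(concepts, "smoking", "lung cancer"))
print(shortest_path(concepts, "smoking", "liver damage"))
```

A returned path indicates that the two concepts may influence each other through intermediate nodes, while `None` indicates that the network contains no such chain.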
The importance of such a process is that it firstly allows the identification of a topological structure
which can give an insight into the corresponding datasets. Secondly, it is possible to extract information
on the system modelled by such a network that can be used to determine relevant intelligence.
The algorithms utilised for the reduced network topology extraction process are introduced,
and the reader can refer to that article for further details. Furthermore, these algorithms also allow
the identification of the long-tail distribution in the case of scale-free networks, resulting in a more
accurate and relevant extraction [12,20].
Random networks are defined by probabilistic processes, which govern their overall topology,
and the existence of any edge is based on a probability p. Such networks have been extensively
investigated, and several associated properties have been identified depending on their theoretical,
or applied context. More specifically, the fraction $p_k$ of nodes with degree $k$ is characterised by the
following equation
\[ p_k \approx \frac{z^k e^{-z}}{k!}, \]
where $z = (n - 1)p$ [18].
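The Poisson approximation above can be checked numerically. The sketch below generates a random network in which each edge exists independently with probability p, and compares the observed degree fractions with the formula. The values of `n`, `p` and the seed are arbitrary assumptions, and the agreement is only approximate for a finite network.

```python
import math
import random
from collections import Counter


def random_graph_degrees(n, p, seed=0):
    """Degrees of a random network where each pair is linked with probability p."""
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree


def poisson_pmf(k, z):
    """Poisson approximation of the fraction of nodes with degree k."""
    return z ** k * math.exp(-z) / math.factorial(k)


n, p = 1000, 0.004
z = (n - 1) * p
degrees = random_graph_degrees(n, p)
counts = Counter(degrees)
empirical = {k: counts[k] / n for k in sorted(counts)}

# Print degree, observed fraction and Poisson prediction side by side.
for k in range(9):
    print(k, round(empirical.get(k, 0.0), 3), round(poisson_pmf(k, z), 3))
```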
When random networks are used to model real-world scenarios, the relationships among the
nodes are purely random. In such cases, the relationships captured by
the edges are unlikely to be associated with meaningful influence. In fact, if a random network is
associated with a purely randomised system, then the relations between nodes do not follow a specific
law [21].
Scale-free networks appear in numerous contexts, including World Wide Web links,
biological and social networks [18], and the continuous enhancement of data analysis tools is leading
to the identification of more examples of such networks.
Big Data Cogn. Comput. 2018, 2, 22 6 of 18
These are characterised by a node degree distribution, which follows a power law. In particular,
for large values of $k$, the fraction $p_k$ of nodes in the network having degree $k$ is defined as
\[ p_k \approx k^{-\gamma}, \tag{1} \]
where γ has been empirically shown to be typically in the range 2 < γ < 3 [18].
A consequence of Equation (1) is the likely existence of highly connected hubs, which
suggests that in scale-free networks the way information spreads across them tends to exhibit a
preferential behaviour [18].
Another important property is that, when new nodes are created, they are likely to be connected
to existing nodes that are already well linked. Furthermore, since the connectivity of nodes follows
a distribution which is not purely random, networks that are topologically reduced to scale-free
structures are likely to capture influence relations between the corresponding nodes, and their
dynamics provide predictive capabilities related to their evolution.
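The tendency of new nodes to link to already well-connected nodes can be simulated directly. The following sketch grows a network by preferential attachment, in the spirit of the model in [18]; the function name, the parameters `n` and `m` and the seed are assumptions, and the code is a simplified illustration rather than a reference implementation.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Grow a network in which each new node links to m degree-weighted targets."""
    rng = random.Random(seed)
    # Start from a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` selects a node with probability proportional to its degree.
    stubs = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


n, m = 2000, 2
edges = preferential_attachment(n, m)
degree = Counter(v for edge in edges for v in edge)
ordered = sorted(degree.values())
median_degree = ordered[len(ordered) // 2]
max_degree = ordered[-1]
print("median degree:", median_degree, "largest hub:", max_degree)
```

Most nodes keep a degree close to m, while a few early nodes accumulate many links and become hubs, which is the heavy-tailed behaviour expressed by Equation (1).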
Since the Dionysus and Mapper algorithms are based on the point cloud properties of datasets,
the data are to be embedded onto a specific co-ordinate system. Furthermore, Mapper allows the
analysis of two-dimensional and one-dimensional datasets, which enables a more efficient method for
data analysis.
However, the Dionysus library has some limitations in the construction of an alpha shape
filtration [32], which makes the Mapper algorithm and Python Mapper solutions more suitable than
the Dionysus library.
The Manifold Learning algorithms contained within the Scikit-Learn package require the
embedding of the data onto a low dimensional sub-manifold, as opposed to the Mapper algorithm.
Furthermore, the corresponding dataset must be locally uniform and smooth. However, the Mapper
algorithm output is not intended to faithfully reconstruct the data or reform the data to suit a data
model, as it provides a representation of the data structure.
The Manifold learning and Mapper solutions provide a useful set of data analysis tools,
which enable a suitable representation of data structures and they can be selected once the structure
of the corresponding dataset has been identified. Since both libraries are native to the Python
programming language, this allows an integration with other popular data science Python packages.
The Mapper algorithm has been extensively utilised for commercial data applications, due to its
capability of analysing large datasets containing over 500,000 features. This also allows the analysis of
Big Data without deploying Hadoop, MapReduce or SQL databases, which provides further flexibility
and reliability.
\[ P(b \mid a) = \frac{P(a \mid b)\, P(b)}{P(a)}, \tag{2} \]
where P( a) and P(b) are the probability of a and b, respectively, and P( a|b) is the probability of a
given that b has occurred. Equation (2) can be also expressed in more general terms by considering a
hypothesis H updated by additional evidence E and past experience c [34]. More specifically,
\[ P(H \mid E, c) = \frac{P(H \mid c)\, P(E \mid H, c)}{P(E \mid c)}. \tag{3} \]
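A short numerical example may help to read Equation (2). In the sketch below, P(a) is expanded with the law of total probability over b and its complement; the function name and the probabilities are invented for illustration.

```python
def bayes_posterior(prior_b, likelihood_a_given_b, likelihood_a_given_not_b):
    """Return P(b|a) from P(b), P(a|b) and P(a|not b), following Equation (2)."""
    # P(a) via the law of total probability over b and its complement.
    evidence_a = (
        likelihood_a_given_b * prior_b
        + likelihood_a_given_not_b * (1.0 - prior_b)
    )
    return likelihood_a_given_b * prior_b / evidence_a


# Invented numbers: b is rare (1%), a is observed in 90% of cases when b holds
# and in 5% of cases when it does not.
posterior = bayes_posterior(prior_b=0.01, likelihood_a_given_b=0.9, likelihood_a_given_not_b=0.05)
print(round(posterior, 4))
```

Even with a strong likelihood P(a|b), the posterior stays modest because the prior P(b) is small, which is the kind of update that Equation (3) generalises to a hypothesis H, evidence E and past experience c.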
Automated NLP understanding systems rely on several data sources which are often only partially
known. As a result, tasks such as the integration of disambiguation, and consequently
the use of probabilistic tools, have proved to be very challenging. In fact, even though such knowledge
sources are well known to be probabilistic, with well-defined models of some specific linguistic levels,
the combination of the probabilistic knowledge sources is still little understood [35]. Bayesian network
applications to NLP have clear advantages. In particular, they allow the evaluation of the impact of
different independence assumptions in a uniform framework, as well as the possibility of modelling
the behaviour of highly structured linguistic knowledge sources [33].
The ambiguity of the syntax and semantics within natural language makes the development
of rule-based approaches very challenging, even for very limited domains of text. This has
led to probabilistic approaches where models of natural language are learnt from large text sets.
A probabilistic model of a natural language subtask consists of a set of random variables with certain
probabilities, associated with lexical, syntactic, semantic, and discourse features [34]. The use
of Bayesian networks applied to multiple natural language processing subtasks in a single model
supports inferencing mechanisms which improve on simple classification techniques [36].
demonstrated to have important applications to the processes of semantic growth. Such properties
are based on statistical properties linked to theoretical properties of the associated semantic
networks. Furthermore, such networks exhibit small-world structures characterised by highly
clustered neighbourhoods and a short average path length [17]. Such networks also show a scale-free
organisation [18] defined by a relatively small number of well-connected nodes, with the distribution
of node connectivities governed by a power law.
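The two small-world indicators mentioned above, clustering and average path length, can be computed on a toy network as follows. The four-node graph is invented, and the functions assume an undirected, connected network stored as a dictionary of neighbour sets.

```python
from collections import deque
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adjacency[node]
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(neighbours, 2))
    linked = sum(1 for u, v in pairs if v in adjacency[u])
    return linked / len(pairs)


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pairs += len(distance) - 1
    return total / pairs


# Invented network: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(average_clustering(graph))   # (1 + 1 + 1/3 + 0) / 4
print(average_path_length(graph))  # 8 / 6 over the six unordered pairs
```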
Causal inference plays a fundamental role in any question-answering technique and reasoning
process with important Artificial Intelligence applications such as decision-making and diagnosis in
BNs [40,41]. On the other hand, the investigation of the properties of BNs enables effective causal
inference especially in complex domains [42]. The conditional dependencies in a Bayesian Network
are often based on known statistical and computational techniques and contain much information,
which can be successfully analysed to extract causal relations [8]. Often, for any two concepts linked by
paths in a network defined by the relationships extracted from text, it can be difficult to fully identify
the corresponding influence (or causality) they may represent. This is usually due to
either the topological structure of the network not being fully known, or partial knowledge of the
structure of the paths between them. An important concept for understanding the influence between
two concepts is causality discovery [40], which aims to pinpoint the causal relationship between them
when it is not explicitly defined. Typically, semantic similarity measurement plays a significant role in
semantic and information retrieval in contexts where detection of conceptually close but not identical
entities is essential. Similarity measurement is often carried out by comparing common and different
features such as parts, attributes and functions. In [43], a method based on adding thematic roles as
an additional type of features to be compared, is introduced. Semantic distance is closely linked to
causal relationship as it describes how closely two concepts are connected. However, much of the work
on this topic is concerned with the linguistic or semantic similarity of terms based on both the context
and the lexicographic properties of words [40]. One of the main setbacks of this approach is that a
hierarchical structure of the concepts can lead to an oversimplification of the problem. The important
question is not merely how far apart two concepts are, but how influential one concept is with respect to
another. The difference is subtle but crucial when dealing with causal discovery. Semantic distance
can also be applied to information retrieval methods in order to improve the automated assignment of
indexing-based descriptors, as well as to semantic vocabulary integration, which enables the choice of the
closest related concepts while translating in and out of multiple vocabularies.
• Finally, the tense of the verb, which can be either active or passive. If it cannot be determined,
then it is defined as unknown.
Consider, for example, the following two statements: “smoking causes lung cancer”, and “there is no proven
direct dependency between antidepressants and liver damage”. In the former, “smoking” and “lung cancer”
are linked by a direct (causal) relationship, whereas, in the latter, “antidepressants” and “liver damage”
are not linked by any relation. Subsequently, the network generated by the concepts and relations
extracted above, is analysed to identify its topological properties, which lead to the most appropriate
BNs related to specific term-queries. In particular, the dynamical properties of the network are assessed
to investigate the global behaviour of concepts and their mutual relations. For example, there are
instances of biomedical concepts previously considered as independent, for which subsequent research has
suggested the opposite. Furthermore, depending on the data sources, claims can be substantiated
or argued against. Therefore, it is crucial to consider this type of “information fluctuations” and
assess the parameters influencing its dynamics to identify the most accurate relation. The evaluation
results demonstrate the potential of this approach, especially in providing valuable resources to BN
modellers to facilitate the decision-making process.
and integration. Figure 3 shows the architecture of the major phases in the integration and fusion of
heterogeneous datasets and the inconsistency levels that are addressed.
The first phase is schema matching, where the schematic mapping between the contents
of the respective data sources is carried out. This is essentially the extraction phase in a typical
Extract–Transform–Load (ETL) framework, where schema inconsistencies are identified and resolved.
The second phase is duplicate detection, where objects that refer to the same real-world entities
are identified and resolved. This operates at the tuple level, and corresponds to the transform phase in the ETL process.
The final phase is data fusion, which involves combining multiple records
that represent the same real-world object into a single, consistent state. This is the phase where the
process attempts to resolve conflicts associated with the datasets. The identification of
data value inconsistency is the final stage. Therefore, such identification is only possible when both
schema inconsistencies and data representation inconsistencies have been resolved. These kinds of
inconsistencies are not universal; rather they are hidden and contextual.
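The three phases can be sketched on two tiny invented sources. The field names, the mappings and the duplicate key below are hypothetical, and the matching rules are deliberately naive; the point is only to show why value conflicts become visible after schemas and representations have been aligned.

```python
from collections import defaultdict

# Hypothetical mappings from source-specific field names to a shared schema.
SOURCE_A_MAPPING = {"full_name": "name", "weight_kg": "weight"}
SOURCE_B_MAPPING = {"patient": "name", "mass": "weight"}


def match_schema(record, mapping):
    """Phase 1: rename source-specific fields to the shared schema."""
    return {mapping.get(field, field): value for field, value in record.items()}


def entity_key(record):
    """Crude duplicate key: the name, ignoring case and extra whitespace."""
    return " ".join(record["name"].lower().split())


def detect_duplicates(records):
    """Phase 2: group records that appear to describe the same entity."""
    groups = defaultdict(list)
    for record in records:
        groups[entity_key(record)].append(record)
    return groups


def fuse(group):
    """Phase 3: merge a group into one record and report conflicting values."""
    fused, conflicts = {}, {}
    fields = sorted({field for record in group for field in record})
    for field in fields:
        values = []
        for record in group:
            if field in record and record[field] not in values:
                values.append(record[field])
        fused[field] = values[0]  # naive resolution: keep the first value seen
        if len(values) > 1:
            conflicts[field] = values
    return fused, conflicts


source_a = [{"full_name": "Ada Smith", "weight_kg": 70}]
source_b = [{"patient": "ada  smith", "mass": 72}]

records = [match_schema(r, SOURCE_A_MAPPING) for r in source_a]
records += [match_schema(r, SOURCE_B_MAPPING) for r in source_b]
groups = detect_duplicates(records)
fused, conflicts = fuse(groups["ada smith"])
print(fused)
print(conflicts)  # reports both the name variants and the differing weights
```

Before the first two phases, the two weights sit under different field names and different spellings of the name, so the disagreement between 70 and 72 cannot even be detected.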
of inconsistency give rise to conflicting circumstances, which present themselves as a data inconsistency
problem in Big Data analysis and integration.
in different data sources use different units of measurement. A potential drawback of [52] is that, since tuples
might be used as qualitative measures of uncertainty while processing queries over incompatible domains,
it is possible that inconsistencies among common attributes could be ignored. Furthermore,
the identification of inconsistencies at the data value level would enable the identification of hidden
inconsistencies, which might not be universal but rather contextual [53]. In particular, when large datasets
are analysed, the probability of generating inconsistencies, such as cycles or different probability evaluations
representing the same real-world entity, increases almost exponentially.
Appendix A provides a further description of the main approaches in the identification and
assessment of data inconsistencies.
6. Conclusions
With the continuous creation of data, Big Data research has become increasingly crucial within the
majority of data-driven fields. Consequently, it has attracted considerable attention from multi-disciplinary
research areas. However, data exhibit highly dynamical properties, which need to be harnessed to facilitate
the knowledge discovery and the decision modelling processes. Furthermore, there is compelling evidence
that cutting-edge algorithms and methods need to be continuously introduced to address the multiple
challenges posed by the diverse and large quantity of data. Moreover, new frontiers of Big Data have been
opened due to the interconnections with disciplines and topics previously considered as unrelated to data
analysis. Therefore, further research and investigation are required to enhance the current state-of-the-art
understanding of data. This article focuses on a survey of specific research areas with particular emphasis on
decision-making techniques, including Network Theory, Bayesian Networks, NLP and Machine Learning,
which have enhanced our capability of identifying, extracting and assessing actionable insights from Big
Data.
Author Contributions: Jeffrey Ray investigated the main topics related to Sections 2 and 3, and Marcello Trovati
specifically focused on Section 4. Olayinka Johnny designed and led the discussion of Data Inconsistencies in
Section 5. Finally, Nik Bessis and Stelios Sotiriadis contributed to the overall discussion and organisation of the
article.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A
Each approach is listed with its description, category of inconsistency, approaches strategy framework, performance benchmark, and pros (+) and cons (-).

Approach: Conflict hypergraph [54]
Description: Nodes in the conflict hypergraph represent database facts, and hyperedges are sets of facts giving rise to a violation of the integrity constraints.
Category of inconsistency: Schema level
Approaches strategy framework: Deciding
Performance benchmark: Execution time of computing the hypergraph and conflict detection in queries.
Pros (+) and cons (-):
+ considers inconsistencies in the relational model
+ considers graphical representation of inconsistencies
- does not consider Big Data sets
- does not consider risks to data and format inconsistencies

Approach: [55]
Description: It reduces conjunctive query with certainty to binary integer programming.
Category of inconsistency: Schema level
Approaches strategy framework: Deciding
Performance benchmark: Performance overhead of computing a consistent first-order conjunctive query.
Pros (+) and cons (-):
+ considers schema inconsistencies in relational databases
- does not consider Big Data sets
- does not consider risks to data and format inconsistencies

Approach: Multiplex [56]
Description: Constructs an approximation of the true set of records, with a lower bound set of records and an upper bound set of records.
Category of inconsistency: Schema and data level
Approaches strategy framework: Deciding
Performance benchmark: No benchmark
Pros (+) and cons (-):
+ considers data value inconsistencies in relational databases
+ defines an approximation framework to resolve data inconsistencies
- does not consider contextual inconsistencies
- does not provide an algorithm

Approach: FusionPlex [57]
Description: It qualifies each individual source of data and then uses meta-data of such qualification to resolve conflicts among data.
Category of inconsistency: Data level
Approaches strategy framework: Deciding
Performance benchmark: No benchmark
Pros (+) and cons (-):
+ considers inconsistencies in multiple data sources
+ identifies and resolves inconsistencies
- prone to error as inconsistencies are subjectively defined by users
- manual resolution of inconsistencies
- does not provide specific algorithms

Approach: DUMAS [58]
Description: The algorithm considers a tuple as a single string and applies a string similarity measure to extract the most similar tuple pairs.
Category of inconsistency: Schema and data representation level
Approaches strategy framework: Deciding
Performance benchmark: Effectiveness of the schema matching algorithm in finding a complete matching of two schemas, given K duplicates.
Pros (+) and cons (-):
+ considers schema matching in the relational data model
+ describes an algorithm
+ detects duplicates in datasets
- does not consider contextual inconsistencies
- does not consider data value inconsistencies

Approach: [44]
Description: Applies a fuzzy multi-attribute decision making approach based on data source quality criteria to select the “best” data source’s data as the data inconsistency solution.
Category of inconsistency: Schema and data value level
Approaches strategy framework: Mediating
Performance benchmark: Used a round robin strategy to test the performance effectiveness of the algorithm. Reports ideal performance.
Pros (+) and cons (-):
+ considers a fuzzy multi-attribute decision making approach
+ describes an algorithm
- does not consider contextual inconsistencies
- does not consider Big Data sets

Approach: [59]
Description: Maps conflicting attributes to common domains by means of a mechanism of virtual attributes and then applies algebraic operations to the resulting partial values.
Category of inconsistency: Schema and data value level
Approaches strategy framework: Mediating
Performance benchmark: No benchmark
Pros (+) and cons (-):
+ considers semantically related attributes in data sources
- assumes source data have the same entity type, which may result in false positives
- models imprecise information, and the lack of conflict between the tuples described does not guarantee that the data is consistent

Approach: [55]
Description: Approach based on Dempster-Shafer theory, which assigns probabilities to attribute values.
Category of inconsistency: Data value level
Approaches strategy framework: Mediating
Pros (+) and cons (-):
+ considers dataset conflicts
+ applies evidential theory to resolve data inconsistencies
- does not consider textual similarities
- difficult to determine the probabilities and where the values come from
- the sources may not always have a common key

Approach: [56]
Description: Uses probabilistic partial values by associating the uncertain answer-tuples of a query with degrees of uncertainty.
Category of inconsistency: Data value level
Approaches strategy framework: Mediating
Pros (+) and cons (-):
+ considers domain mismatch in relational databases
+ considers value attribute types in relations
- does not consider Big Data sets
- does not consider contextual inconsistencies

Approach: Active Atlas [57]
Description: Uses a decision tree forest to learn both duplicate detection rules and weights for string transformations, which are used for comparing fields.
Category of inconsistency: Data value
Approaches strategy framework: Deciding
Performance benchmark: Accuracy of learning mapping rules
Pros (+) and cons (-):
+ training-based framework
+ considers mapping rules for objects
+ creates functions to identify inconsistencies
- does not consider Big Data sets
- challenges of learning the data sets
- could result in false positives

Approach: [58]
Description: Uses co-reference resolution to convert the textual data into a TextGraph structure and then addresses synonyms by learning synonym patterns from TextGraph triples.
Category of inconsistency: Data level
Approaches strategy framework: Mediating
Performance benchmark: Measures the confidence values of the semantic links and evaluates the degree of recall for frequently used queries.
Pros (+) and cons (-):
+ considers textual datasets
+ considers semantic representation
+ creates graphical links between words and concepts
- does not present clear data inconsistency
- does not capture causality relations in texts

Approach: [60]
Description: The approach describes different types of conflicts and uses a rule-based approach to define conditions that signal a conflict in data.
Category of inconsistency: Data level
Approaches strategy framework: Deciding
Performance benchmark: Execution time of identifying conflicts in a number of statements. Reports scalability issues with increasing number of statements.
Pros (+) and cons (-):
+ classification of conflicts in ontologies
+ describes simplification in data relations
+ uses semantic mappings
- requires user intervention to identify conflicting statements
- does not consider risks to data and format inconsistencies
- reports scalability issues with increasing number of statements
Figure A1. The main approaches to data inconsistency, as discussed in Section 5 [54–60].
References
1. Molnar, E.; Kryvinska, N.; Gregus̆, M. Customer Driven Big-Data Analytics for the Companies’ Servitization.
In Proceedings of the Spring Servitization Conference 2014 (SSC 2014), Birmingham, UK, 12–14 May 2014;
Baines, T., Clegg, B., Harrison, D., Eds.; Aston Business School, Aston University: Birmingham, UK, 2014;
pp. 133–140.
2. Gupta, R.; Gupta, H.; Mohania, M. Cloud Computing and Big Data Analytics: What Is New from Databases
Perspective? In Big Data Analytics; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg,
Germany, 2012; pp. 42–61.
3. Baldi, P.; Brunak, S. Bioinformatics: A Machine Learning Approach; MIT Press: Cambridge, MA, USA, 2002.
4. Wissem, I.; Sabeur, A.; Haithem, M.; Mondher, M.; Engelbert, M.N. An Experimental Survey on Big Data
Frameworks. Future Gener. Comput. Syst. 2018, 86, 546–564.
5. Hutto, C.J.; Gilbert, E. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media
Text. In Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM-14),
Ann Arbor, MI, USA, 1–4 June 2014.
6. Stanley, K.; Miikkulainen, R. Evolving Neural Networks through Augmenting Topologies. Evolut. Comput.
2002, 10, 99–127.
7. Trovati, M.; Hayes, J.; Palmieri, F.; Bessis, N. Automated extraction of fragments of Bayesian networks from
textual sources. Appl. Soft Comput. 2017, 60, 508–519.
8. Sanchez-Graillet, O.; Poesio, M. Acquiring Bayesian Networks from Text. Available online: https://nats-www.
informatik.uni-hamburg.de/intern/proceedings/2004/LREC/pdf/240.pdf (accessed on 30 April 2018).
9. Feldman, R.; Sanger, J. The Text Mining Handbook; Cambridge University Press: Cambridge, UK, 2006.
10. Blei, D.M.; Ng, A.Y.; Jordan, M.; Lafferty, J. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
11. De Marneffe, M.F.; MacCartney, B.; Manning, C.D. Generating Typed Dependency Parses from Phrase
Structure Parses. In Proceedings of the 2006 5th International Conference on Language Resources and
Evaluation, Genoa, Italy, 22–28 May 2006.
12. Trovati, M.; Bessis, N.; Huber, A.; Zelenkauskaite, A.; Asimakopoulou, E. Extraction, Identification and
Ranking of Network Structures from Data Sets. In Proceedings of the 2014 Eighth International Conference
on Complex, Intelligent and Software Intensive Systems, Birmingham, UK, 2–4 July 2014; pp. 331–337.
13. Liu, B. Sentiment Analysis and Opinion Mining; Morgan and Claypool Publishers: San Rafael, CA, USA, 2012.
14. Ray, J.; Trovati, M. A Survey of Topological Data Analysis (TDA) Methods Implemented in Python. In
Advances in Intelligent Networking and Collaborative Systems. INCoS 2017; Lecture Notes on Data Engineering
and Communications Technologies, vol. 8; Springer: Berlin, Germany, 2017; Volume 60, pp. 508–519.
15. Trovati, M.; Asimakopoulou, E.; Bessis, N. An investigation on human dynamics in enclosed spaces.
J. Comput. Electr. Eng. 2018, 67, 195–209.
16. Bessis, N.; Dobre, C. Big Data and Internet of Things: A Roadmap for Smart Environments; Springer: Berlin,
Germany, 2014.
17. Watts, D.J.; Strogatz, S.H. Collective Dynamics of Small-World Networks. Nature 1998, 393, 440–442.
18. Barabási, A.-L.; Albert, R. Emergence of Scaling in Random Networks. Science 1999, 286, 509–512.
19. Trovati, M.; Asimakopoulou, E.; Bessis, N. An Analytical Tool to Map Big Data to Networks with Reduced
Topologies. In Proceedings of the 2014 International Conference on Intelligent Networking and Collaborative
Systems, Salerno, Italy, 10–12 September 2014; pp. 411–414.
20. Trovati, M. Reduced Topologically Real-World Networks: A Big-Data Approach. Int. J. Distrib. Syst. Technol.
2015.
21. Trovati, M.; Bessis, N. An influence assessment method based on co-occurrence for topologically reduced
Big Datasets. In Soft Computing; Springer: Berlin/Heidelberg, Germany, 2015.
22. Carlsson, G.; Harer, J. Topology and Data. Bull. Math. Soc. 2009, 46, 255–308.
23. Edelsbrunner, H.; Harer, J. Computational Topology: An Introduction; American Mathematical Society:
Providence, RI, USA, 2010.
24. Ray, J.; Trovati, M. A Survey of Topological Data Analysis (TDA) Methods Implemented in Python.
In Proceedings of the INCoS 2017 Advances in Intelligent Networking and Collaborative Systems, Toronto,
ON, Canada, 24–26 August 2017; pp. 594–600.
25. Goodman, J.E. Surveys on Discrete and Computational Geometry: Twenty Years Later; AMS-IMS-SIAM Joint
Summer Research Conference, Snowbird, Utah, 18–22 June 2006; American Mathematical Society: Providence,
RI, USA, 2008.
26. Morozov, D. Welcome to Dionysus Documentation! Available online: http://www.mrzv.org/software/dionysus/
(accessed on 1 June 2018).
27. Scikit-Learn 2.2. Manifold Learning: Scikit-Learn 0.18.1 Documentation. Available online:
http://scikit-learn.org/stable/modules/manifold.html (accessed on 1 June 2018).
28. Singh, G.; Memoli, F.; Carlsson, G. Mapper: A topological mapping tool for point cloud data. In Eurographics
Symposium on Point-Based Graphics; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1991.
29. Müllner, D.; Babu, A. Python Mapper: An open-source toolchain for data exploration, analysis, and
visualization. 2013. Available online: http://danifold.net/mapper/ (accessed on 1 June 2018).
30. Python Mapper Code. Available online:
https://github.com/calstad/mapper/blob/master/doc/source/installation/index.rst (accessed on 1 June 2018).
31. Chow, Y.Y. Application of Data Analytics to Cyber Forensic Data: A Major Qualifying Project Report;
MITRE Corporation: McLean, VA, USA, 2016.
32. Giesen, J.; Cazals, F.; Pauly, M.; Zomorodian, A. The conformal alpha shape filtration. Vis. Comput. 2006, 22,
531–540. [CrossRef]
33. Jensen, F.V. Bayesian networks. Wiley Interdiscip. Rev. Comput. Statist. 2009, 1, 307–315. [CrossRef]
34. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann
Publishers, Inc.: Burlington, MA, USA, 1988.
35. Narayanan, S.; Jurafsky, D. Bayesian Models of Human Sentence Processing. In Proceedings of the 20th
Annual Conference of the Cognitive Science Society, Madison, WI, USA, 1–4 August 1998; pp. 752–757.
36. Pedersen, T. Integrating Natural Language Subtasks with Bayesian Belief Networks. In Proceedings of the
1999 Pacific Asia Conference on Expert Systems, Los Angeles, CA, USA, 11–12 February 1999.
37. Trovati, M.; Bagdasar, O. Influence Discovery in Semantic Networks: An Initial Approach. In Proceedings of
the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, Cambridge,
UK, 26–28 March 2014.
38. Blanco, E.; Castell, N.; Moldovan, D. Causal Relation Extraction. In Proceedings of the Sixth International
Conference on Language Resources and Evaluation (LREC’08), Marrakesh, Morocco, 28–30 May 2008.
39. Steyvers, M.; Tenenbaum, J.B. The large-scale structure of semantic networks: Statistical analyses and a
model of semantic growth. Cogn. Sci. 2005, 29, 41–78. [CrossRef] [PubMed]
40. Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. Advances in Knowledge Discovery and Data
Mining; American Association for Artificial Intelligence: Menlo Park, CA, USA, 1996.
41. Jiang, J.J.; Conrath, D.W. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy.
In Proceedings of the 1997 10th International Conference on Research in Computational Linguistics (ROCLING
X), Taipei, Taiwan, 3 August 1997.
42. Ben-Gal, I. Bayesian Networks. In Encyclopedia of Statistics in Quality and Reliability; Ruggeri, F., Faltin, F.,
Kenett, R., Eds.; John Wiley & Sons: Hoboken, NJ, USA, 2007.
43. Janowicz, K. Extending Semantic Similarity Measurement with Thematic Roles. In Lecture Notes in Computer
Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3799.
44. Wang, X.; Huang, L.; Xu, X.; Zhang, Y.; Chen, J.Q. A Solution for Data Inconsistency in Data Integration.
J. Inf. Sci. Eng. 2011, 27, 681–695.
45. Bansal, S.K.; Kagemann, S. Integrating Big Data: A Semantic Extract-Transform-Load Framework.
IEEE Comput. Soc. 2015, 3, 42–50. [CrossRef]
46. Azzini, A.; Ceravolo, P. Consistent Process Mining over Big Data Triple Stores. In Proceedings of the 2013
IEEE International Congress on Big Data, Santa Clara, CA, USA, 27 June–2 July 2013; pp. 54–61.
47. Carol, I.; Kumar, S.B.R. Conflict Identification and Resolution in Heterogeneous Datasets: A Comprehensive
Survey. Int. J. Comput. Appl. 2015, 12, 113. [CrossRef]
48. Dong, X.L.; Naumann, F. Data fusion: Resolving data conflicts for integration. Proc. VLDB Endow. 2009, 2,
1654–1655. [CrossRef]
49. Zhang, D. On Temporal Properties of Knowledge Base Inconsistency. In Transactions on Computational Science
V; Lecture Notes in Computer Science Series; Springer: Berlin, Germany, 2009; Volume 5540, pp. 20–37.
50. Zhang, D. Granularities and inconsistencies in Big Data analysis. Int. J. Softw. Eng. Knowl. Eng. 2013, 23,
887–893. [CrossRef]
51. Chomicki, J.; Marcinkowski, J.; Staworko, S. Computing consistent query answers using conflict hypergraphs.
In Proceedings of the 2004 Thirteenth ACM International Conference on Information and Knowledge
Management, Washington, DC, USA, 8–13 November 2004; ACM: New York, NY, USA, 2004; pp. 417–426.
52. DeMichiel, L.G. Resolving database incompatibility: An approach to performing relational operations over
mismatched domains. IEEE Trans. Knowl. Data Eng. 1989, 1, 485–493. [CrossRef]
53. Trovati, M.; Castiglione, A.; Bessis, N.; Hill, R. Kuramoto Model Based Approach to Extract and Assess
Influence Relations. In Proceedings of the 2015 7th International Symposium on Computational Intelligence
and Intelligent Systems, Guangzhou, China, 21–22 November 2015.
54. Francis, W.N.; Kucera, H. The Brown Corpus: A Standard Corpus of Present-Day Edited American English;
Department of Linguistics, Brown University: Providence, RI, USA, 1979.
55. Ebel, H.; Mielsch, L.I.; Bornholdt, S. Scale-free Topology of E-mail Networks. Phys. Rev. E 2002, 66, 035103.
[CrossRef] [PubMed]
56. Wren, J.D. Using Fuzzy Set Theory and Scale-free Network Properties to Relate MEDLINE Terms.
Soft Comput. 2006, 10, 4. [CrossRef]
57. Niedermayer, D. An Introduction to Bayesian Networks and Their Contemporary Applications.
Available online: http://www.niedermayer.ca/papers/bayesian/bayes.html (accessed on 1 June 2018).
58. Qi, G.; Pan, J.Z. A Tableau Algorithm for Possibilistic Description Logic ALC. In Lecture Notes in Computer
Science; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5341.
59. Srinivas, K. OWL Reasoning in the Real World: Searching for Godot. In Proceedings of the 22nd International
Workshop on Description Logics (DL 2009), Oxford, UK, 27–30 July 2009.
60. Sharkey, N.E. Connectionist Natural Language Processing: Readings from Connection Science; Kluwer
Academic Publishers: Alphen aan den Rijn, The Netherlands, 1992.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).