Big Data Quality Assessment Model for Unstructured Data

Conference Paper · November 2018

DOI: 10.1109/INNOVATIONS.2018.8605945


Big Data Quality Assessment Model
for Unstructured Data
Ikbal Taleb, CIISE, Concordia University, Montreal, QC, Canada, i_taleb@live.concordia.ca
Mohamed Adel Serhani, College of Information Technology, UAE University, Al Ain, UAE, serhanim@uaeu.ac.ae
Rachida Dssouli, CIISE, Concordia University, Montreal, QC, Canada, rachida.dssouli@concordia.ca

Abstract: Big Data has gained enormous momentum in the past few years because of the tremendous volume of data generated and processed across diverse application domains. It is now estimated that 80% of all generated data is unstructured. Evaluating the quality of Big Data has been identified as essential to guarantee data quality dimensions such as completeness and accuracy. Current initiatives for unstructured data quality evaluation are still under investigation. In this paper, we propose a quality evaluation model to handle the quality of Unstructured Big Data (UBD). The model first captures and discovers key properties of unstructured big data and its characteristics, then provides comprehensive mechanisms to sample and profile the UBD dataset and to extract features and characteristics from heterogeneous data types in different formats. A Data Quality repository manages the relationships between data quality dimensions, quality metrics, feature extraction methods, mining methodologies, data types and data domains. An analysis of the samples provides a data profile of UBD. This profile is extended to a quality profile that contains the quality mapping with selected features for quality assessment. We developed a UBD quality assessment model that handles all the processes from UBD profiling and exploration to the quality report. The model provides an initial blueprint for quality estimation of unstructured Big Data. It also states a set of quality characteristics and indicators that can be used to outline an initial data quality schema of UBD.

Keywords: Big Data, Data Quality, Unstructured Data, Quality of Unstructured Big Data.

I. INTRODUCTION

Big data is commonly defined as the way we gather, store, manipulate, analyze and get insight from fast-increasing heterogeneous data. Most newly generated data is unstructured, owing to the growth of mobile and human-generated data from social media that combine text, pictures, audio and video in an unstructured way. Unstructured data is growing faster than all other types of data, industry analysts say. It will increase by as much as 800 percent during the next five years according to a survey conducted by [1]. This creates an urgent need to automatically characterize and categorize such data. These classifications are strongly coupled with the semantic meaning of what the data represents. In many cases, the data comes in a format and a quality state in which it is impossible to process immediately as it is, and even when it can be processed, the results cannot guarantee valuable analysis and insights.

Big Data quality assessment is an important phase integrated within data pre-processing. It is a phase where the data is prepared following the user or application requirements. When the data is well defined with a schema, or in a tabular format, its quality evaluation becomes easier, as the data description helps to map the attributes to quality dimensions and to set the quality requirements as a baseline for assessing the quality metrics. In the other case, when there is no structure to follow, an intermediary phase needs to be defined to parse, analyze, mine and detect a schema or a path the unstructured data went through. To achieve this objective, a set of techniques such as classification, clustering, searching or mining is used to derive a set of artifacts that act as a filter and a translator of UBD into a more efficient, readable format ready for quality evaluation. The amount of resulting data is generally equal to or far less than the input. A measure of this data reduction needs to be defined to assess the efficiency and impact of these techniques on the intermediate assessment results.

Assessing the quality of unstructured data is a tedious task. In the following, we enumerate some data characteristics and data quality aspects that add difficulty to the assessment process: (1) data size, (2) heterogeneity, (3) multiple data types and formats, (4) multiple sources and multiple files, (5) which data quality dimensions (DQDs) to choose, and (6) how to clearly define the quality of unstructured data, for example which quality dimensions apply to a UBD set containing text files, images, videos, audio files, Web pages, PDF files, Twitter data, Facebook data, etc. In the presence of such data diversity, we need to define the following:
a) A UBD quality project.
b) A set of requirements with default DQDs to start with.
c) A sampling strategy for UBD that considers the type and format of the data, since there are no attributes or observations that can be used as the basis for sampling.
d) How to extract the features, variables and attributes that characterize UBD, and how to evaluate its quality based on these findings.
e) The best techniques, methods and strategies to extract useful information from UBD or to convert it to schema-based data.

In the age of Big Data, many trending data analytics directions now focus on the analysis of customer behavior, feedback and comments about products or services. They mine social media data streams from Twitter, Facebook, YouTube, Instagram, websites, forums and text messaging to get valuable insights. Thousands of terabytes of data are available to be analyzed using techniques such as
sentiment analysis and deep learning. Assessing the quality of this data before engaging in large-scale processing that costs time and money is a must. Approximating the quality of such data sets is the first step towards a successful big data project. Finally, structured data is always easier for data analytics applications to handle as Big Data than unstructured data.

The rest of the paper is organized as follows: the next section introduces Big Data and data quality fundamentals, definitions, characteristics, and lifecycle. Section 3 surveys the most important research works on Unstructured Big Data quality evaluation and management. Section 4 introduces our Unstructured Big Data quality assessment model. Section 5 analyzes and discusses our model experimentation. Finally, the last section concludes the paper and points to some ongoing and challenging directions.

In this section, we introduce some Big Data foundations and all the elements that cooperate to contribute to such an ecosystem.

A. Big Data overview

We have always dealt with Big Data, from the moment we started gathering data and storing it in different ways. Big Data is being considered in every domain: in academia, in industry, in businesses, in social media, and in research. It has a lifecycle and characteristics to be defined and followed.

1) Big Data Lifecycle
Figure 1 describes the most important stages that the data goes through until it serves the purpose for which it was gathered. After its inception, the data is collected, transported through inter-networks, and saved into distributed storage around the world that offers the best quality/price ratio with a reliable network. It is then pre-processed to keep only the best quality data and forwarded to processing and analytics for insight extraction.

Fig. 1. Big Data Lifecycle

2) Big Data Characteristics
In the annual McKinsey Global Institute report [2], three data dimensions characterizing Big Data were introduced: Volume, Velocity and Variety, also called the 3 V's. Lately the number of dimensions has increased from 3 to 4, 7 and even 10 V's [2]–[7]. In Table 1, we compile the most important V's that describe Big Data. As the name suggests, Big Data is more than simply a matter of size; it is a prospect to unearth insights and make beneficial decisions. Thus, Visualization, Variability, Volatility, Virality, Vulnerability, Viscosity and Validity are extended characteristics.

B. Unstructured Big Data

To make decisions we need relevant information that is extracted from data using processing and analysis. In this rich context, data exists in numerous formats, with different types and from several sources and knowledge domains. Unstructured data is growing faster than structured data. This is explained by the number of Facebook posts, tweets, photos and emails created every second.

Table 1. The 5 V's characteristics of Big Data
# | Big Data V's | Description | Attributes & Metrics
1 | Volume | Scale and size of data in storage | Terabyte to Exabyte
2 | Velocity | Data generation frequency: the speed at which this data is generated, produced, created, refreshed, and streamed. Viscosity: how difficult is the data | Milliseconds to seconds. Batch, Near Time, Real Time, Streams.
3 | Variety | Multiple different forms of the data | Binary, raw, text, multimedia. Structured, unstructured, and semi-structured.
4 | Veracity | Uncertainty of the data that leads to confidence or trust in the data. How can we trust the data? What is its provenance? Is it reliable? Is it accurate? Is the data verifiable and truthful? What is Rigor in Data | Incompleteness, ambiguity, latency, trustfulness, and traceability (provenance).
5 | Value | Deriving business value and insights from the data. | Big Data strategy, Big Data project, target goals and suitable analytics process.

1) Unstructured Data
At first glance, the name unstructured data implies mess, noise and chaos in data organization. In fact, it refers to data that has no schema, no metadata, and no rules or constraints to follow when it is created, unlike the structured database model. It may have some basic low-level internal structure, but no pre-defined data model or schema. Unstructured data has two meanings: (1) no structure at all or (2) an unknown structure. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational (NoSQL) database. In Table 2, we illustrate some unstructured data domains and the data types they generate and manage.

Table 2. Unstructured Data Domains
Data Domains | Data types
Healthcare | Doctors' notes, X-rays, MRI, scanner images
Finance | Stock market data, bank transactions
Scientific Research | DNA data, satellite data
Customer Relationship Management (CRM) | Customer feedback, forum comments
Social Media | Facebook posts, Twitter
Media contents | Videos, images, audio, speech, music
Web Contents | Web pages, blogs, news
IoT | Sensor data, RFID data
Log files | Network logs, web page clicks, Facebook logs
Documents | Text, web, PDF, office docs, scanned docs

2) Unstructured Big Data Characteristics
In addition to the most noticeable difference, the columnar data model, the major difference is the ease of analyzing structured data compared with unstructured data. The existing pre-processing and analytics tools are very mature for structured
data, but still in an embryonic state for unstructured data. The Variety characteristic of Big Data covers different formats of data (e.g. documents, emails) that are not always stored in structured relational database systems. UBD falls into two classes:
• Human generated: text files, emails, social media, websites, mobile data messages, messaging chats, business applications, media files (audio, video, image).
• Machine generated: scientific data, satellite imagery, digital surveillance, sensor data, network logs, IoT devices.
Many different data sources in several domains feed this unstructured content, hence the data domains illustrated in Table 2. It is therefore important to note that unstructured big data is also characterized by velocity and volume.

3) Unstructured Data Management
Since big data can include both structured and unstructured data, the exploration of unstructured data can be handled using existing Big Data ecosystems. Such systems and tools include Hadoop, business intelligence software (analytics, data mining, reporting), data integration tools, document management systems, search and indexing tools, and the Unstructured Information Management Architecture (UIMA) [8], an IBM/Apache component software architecture for the analysis of unstructured data. In the following, we highlight some unstructured data types and some methodologies used to manage and analyze them.

a) Textual (text, PDF, scanned docs, email body)
Unstructured textual data is transformed and explored using a combination of techniques, as in text mining [9]–[12], such as data mining, machine learning, Natural Language Processing (NLP), information retrieval and knowledge management. Moreover, search engine tools are used for indexing, cataloguing and categorizing, which makes information and text search easier and helps characterize the unstructured data. Other techniques are also used, ranging from text analytics and OCR to pattern, term and topic detection and discovery, for the sake of structuring the textual data.

b) Social Media (Twitter, Facebook), CRM
For Twitter data, sentiment analysis [13]–[15] and opinion mining [16]–[18] are well-known techniques applied to extract trends in a multitude of areas such as elections and events. In CRM systems, a semantic analysis of multisource unstructured data is conducted to annotate, extract, and rate customer feedback.

c) Media (Video, Audio, Image)
Digital photos, videos, and audio files are stored in structured formats such as JPG/PNG, MOV/MP4, and WAV/MP3 respectively. However, these formats do not express any information about what is in the data. The data needs to be processed to comprehend its meaning. Automatic tagging, labelling and indexing of media data after analysis and processing help to search within the media files efficiently. Processing this kind of unstructured data requires advanced algorithms for image, audio, speech, and video processing to gather patterns or any information that can be indexed.

C. Data Quality (DQ)

According to [19], data quality is not easy to describe; its meaning is data domain dependent and context-aware. Overall, data quality is closely related to the quality of its data source [20].
Data quality is perceived differently in academia and in industry. In [21], data quality is defined following the ISO 25012 standard as “the capability of data to satisfy stated and implied needs when used under specified conditions”. However, in [22], it is summarized as “fitness for use” or “meeting user needs”.

1) Data Quality Dimensions (DQDs)
According to [22]–[24], the concept of the Data Quality Dimension (DQD) is introduced to measure and manage data quality. There are many quality dimensions, classified under categories that define them. In Figure 4, we summarize some essential DQD categories based on [12]–[14]: the contextual dimensions, which are associated with the information, and the intrinsic dimensions, which refer to objective and native data attributes. Examples of intrinsic data quality dimensions include Accuracy, Timeliness, Consistency, and Completeness. Each Data Quality Dimension is associated with specific metrics. A metric is a method or formula established to measure a score or ratio from the data by quantifying its DQDs. A metric specifies how to evaluate a DQD, from simple formulas to more complex multivariate expressions.

2) Data Quality Assessment
With a set of metrics, it is possible to evaluate quality quantitatively when following a data-driven strategy on existing data. For structured data, quality assessment is straightforward, as the data is available and the attributes with their corresponding values are accessible. Unstructured data, however, needs a different approach, since we do not know how it is organized or what we are going to assess. A module that extracts, discovers, or defines attributes and features with a specific DQD mapping is mandatory to proceed with the quality exploration.

Fig. 2. Data Quality and Data Structure

3) Unstructured Data Quality
We should initially ask the following question: “What and how should we measure, evaluate or assess in this big diversity of heterogeneous unstructured data?”. First, we need to note that not all DQDs apply here, since most of them are used for structured data. Even if they are all applicable, some intermediate dimensions might need to be defined for each type of unstructured data. For example, the readability of text data is assessed by its reading ease [25]. The data is then transformed into structured data that can be measured and queried. For example,
in text mining, a combination of techniques such as data mining, machine learning, Natural Language Processing (NLP), information retrieval and knowledge management is used to map the data to a schema. The quality assessment of the extracted relevant features that explain some quality indicators results in a set of quality scores that can be easily mapped to traditional DQDs [24], [26], [27]. So, the quality of the unstructured data will depend on the structure that the data will fit into. Figure 2 illustrates the relationship between data quality and data structure. The more the data is structured, the more its quality increases and the easier its evaluation becomes.
In [28], the authors extracted quality indicators from patient charts using a quantitative approach. They used Canary, an NLP software tool, to discover and mine knowledge hidden in unstructured clinical data. The authors in [29], [28], [30] used quality indicators specifically for unstructured text data mining processes, such as Interpretability, Relevancy, and Accuracy. To exploit these indicators, they must be converted into a number in the range [0,1], as expressed by quality metrics, to measure a structured DQD (e.g.

D. Quality of unstructured Big Data

A data-driven strategy is followed to handle the quality of unstructured data. A set of steps and settings or pre-requirements must be defined before proceeding with the quality assessment of Unstructured Big Data. The structured data model makes the quality assessment process easy, with a set of known columns (attributes) organized in rows (observations). This is not the case for unstructured data, which requires many intermediate processes or modules to either 1) convert the data to structured data and assess its quality or 2) use new techniques to extract meaningful features that represent the data and apply a quality evaluation methodology. We illustrate in Figure 3 the steps needed to accomplish the quality assessment of Unstructured Big Data.
1) What is the type, format, or data domain? Discovering this information, or extracting it from metadata or any description that came with the data, is invaluable, since it is essential to start with this step to explore the data contents.
2) What DQDs to use to map the quality? Depending on the data type, the DQDs are selected and then mapped to the unstructured data indicators. The effectiveness of DQD selection is related to the discovery of the attributes and features that are mapped to DQDs.
3) What quality metrics to consider? With unknown or new data features, it is mandatory to create, update, rewrite, or fork existing metrics to handle newly discovered or extracted data features. For example, the contrast ratio of an image is obtained by a metric that reads the digital picture and computes the contrast intensity percentage using the related formula [25]. In [31], [32] more metrics and techniques are enumerated to assess the quality of multimedia data.
4) How to identify attributes or features to evaluate? A list of attributes can be discovered easily for certain formats and data types. Other feature extraction algorithms based on data domain, format and type are also needed.
5) What sampling strategies to use? With large volumes of unstructured data, estimating quality on a representative population is mandatory, as we do not want the time and cost of preprocessing, especially for unknown unstructured data, to explode the big data project budget.
6) What quality assessment methodology to consider? The quality assessment for each tuple (DQD, feature, samples) is representative of the whole data; the choice of data sample size and number of iterations depends on the sampling strategy used.

Fig. 3. Unstructured Big Data Quality Proceeding

III. LITERATURE REVIEW

In this section, we survey the very few available works on unstructured data quality assessment and on UBD management and exploration, and we highlight at the end the remaining challenges that are still to be studied. Most works on unstructured big data quality are limited to specific cases of textual data and to the DQDs to consider when dealing with unstructured Big Data analysis [33]. The authors of [33] identified important DQDs such as relevance, comparability, timeliness, accuracy, coherence, accessibility and ambiguousness related to specific Big Data lifecycle phases. In [25], the authors identified some quality metrics to be used to evaluate unstructured data, such as image usefulness and textual data readiness. The work also offers an initial overview of quality assessment of UBD, including quality indicators that must be used for unstructured data. These indicators are used within the traditional DQDs and their scores are converted.
In [29], unstructured data quality is defined based on the similarity of input data to the data expected by its consumers, and to data representing the real world. The authors characterize the DQDs to be used in this similarity process and propose measurable quality indicators to assess UBD quality.
In [34], the authors insist on having a characterization of quality in order to exploit Web data; this characterization is materialized in data trustworthiness and provenance. They also target some aspects of Big Data quality using examples of sensor data quality.
Other authors targeted social media data as unstructured big data for the purpose of quality evaluation. In [35], the authors
redefined DQDs and metrics to adapt them to the Big Data context of unstructured Twitter feeds. They defined a set of metrics to evaluate quality, including readability, completeness and usefulness, then implemented a real-time system to evaluate the quality of the Twitter stream. In [36], Anne et al. introduced a new architectural solution to evaluate and manage the quality of social media data within the processing phase of the big data lifecycle. The objective was to improve business decision making by providing real-time customer insights from Twitter feed data using sentiment analysis for customer satisfaction [37].
To the best of our knowledge, an assessment model for Unstructured Big Data quality is still needed and is of paramount importance. Most of the existing works are limited to a particular aspect of quality assessment of big data or to a specific type of unstructured data. Moreover, no quality management of such data has been proposed. In this paper, we present a tentative model of Unstructured Big Data quality assessment, from data collection to quality report generation.

To address the challenges of assessing the quality of unstructured Big Data, we propose a quality assessment model that selects quality dimensions specific to each data type and evaluates its extracted features. Since unstructured data has no columnar values, we use a quantitative approach to data quality based on data contents. The model illustrated in Figure 4 has several components that the data goes through to produce, at the end, a quality assessment report.

Fig. 4. Unstructured Big Data Quality Assessment Model

In the following, we describe each component of the model depicted in Figure 4 and its interaction with the data flow transmitted between components.
1. Quality Requirements: the quality requirements are expressed as DQD acceptance scores, or as a set of DQD indicators. For some textual data, a golden corpus or set is also used as a baseline for similarity analysis. For media data such as images, a set of image characteristics is defined with its related DQDs and indicators. For example, a high-resolution image has a higher number of pixels and color levels.
2. Data Sources: represented as a collection of data sets or files, mixed or sorted by type, format or domain. This simple representation of data helps to avoid quality issues that may appear when auto-discovery of data types and domain is handled. This process is well known as data
3. Data Quality Repository: stores and manages all related components of the data, such as the list of data domains, data types, and subtypes; the DQDs, quality indicators and metrics by data type; the features of data types, such as videos and pictures, with their extraction functions; and the predefined mapping between DQDs, features and metrics. It also stores the
4. Sampling and Profiling: the data is sampled using the Bag of Little Bootstraps (BLB) [38], then profiled to extract any metadata or useful information. After profiling, the data is identified by its type and domain, if any. This eases the next process of selecting which data quality methodology to use based on the data type. The sampling of media data is done differently for each type; for example, for audio data, fragments are extracted and used to create a set of samples.
5. Preparation Process for Quality Evaluation: for textual data, a text mining [39], [40] methodology is used to extract useful information about the textual data and to assess its quality based on the extracted mining results. A definition of textual data quality indicators is of high importance in order to identify and create metrics that evaluate these indicators, then combine and quantify them into a quality dimension score. For video data, feature extraction consists of discovering the format of the video first, then sampling to extract video frames, and afterwards extracting the set of features linked to these videos. Video characteristics include, for example, resolution (SD, HD, 4K), color saturation, and picture brightness. All these characteristics represent quality indicators of the video once the associated metrics are computed. The same scenarios apply to image and audio data.
6. Feature Selection and Quality Mapping: for each feature extracted from the videos, a DQD is selected with its related metric to evaluate its quality. The DQ repository contains a list of all common file types with all the predefined quality indicators and metrics.
7. Quality Assessment: after all the previous steps are completed, the assessment process runs the evaluation algorithm on sample data (set of video frames, set of images, set of audio fragments) and a quality report is built.
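The flow from quality requirements through the repository mapping and sampling to the quality report can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the two text metrics, the `REPOSITORY` and `REQUIREMENTS` dictionaries, the threshold values and the `assess` function are all hypothetical names introduced here, and plain simple random sampling stands in for the BLB sampling cited in [38].

```python
import random
from statistics import mean

# Hypothetical feature-based metrics: each maps one text sample to a score in [0, 1].
def non_blank_line_ratio(text: str) -> float:
    """Share of lines that contain something other than whitespace."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(1 for line in lines if line.strip()) / len(lines)

def short_word_ratio(text: str, max_len: int = 6) -> float:
    """Crude readability proxy (an assumption): share of words of at most max_len characters."""
    words = text.split()
    if not words:
        return 0.0
    return sum(1 for word in words if len(word) <= max_len) / len(words)

# Toy stand-in for the Data Quality Repository: data type -> DQD -> metric.
REPOSITORY = {
    "text": {
        "completeness": non_blank_line_ratio,
        "readability": short_word_ratio,
    }
}

# Quality requirements expressed as DQD acceptance scores (illustrative values).
REQUIREMENTS = {"completeness": 0.8, "readability": 0.5}

def assess(files, data_type, sample_size, seed=0):
    """Sample the files, score every mapped DQD on the sample, and build a report."""
    rng = random.Random(seed)
    # Simple random sampling without replacement; NOT the BLB procedure of [38].
    sample = rng.sample(files, min(sample_size, len(files)))
    report = {}
    for dqd, metric in REPOSITORY[data_type].items():
        score = mean(metric(doc) for doc in sample)
        report[dqd] = {
            "score": round(score, 3),
            "accepted": score >= REQUIREMENTS[dqd],
        }
    return report

docs = [
    "good data here\nsecond line",
    "one\n\n\nfour",
    "internationalization considerations",
]
report = assess(docs, "text", sample_size=3)
print(report)
```

Keeping the DQD-to-metric mapping in a dictionary separate from the assessment loop mirrors the role of the repository in the model: supporting a new data type or indicator would mean adding an entry to the mapping rather than changing the assessment code.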
