Papers by Adriana Marotta
arXiv (Cornell University), Apr 22, 2022
The importance of context in data quality (DQ) was shown many years ago and is nowadays widely accepted. Early approaches and surveys defined DQ as fitness for use and showed the influence of context on DQ. This paper presents a Systematic Literature Review (SLR) investigating how context is taken into account in recent proposals for DQ management. We present the planning and execution of the SLR and the analysis criteria, and we report results that reflect the relationship between context and DQ in the state of the art, in particular how context is defined and used for DQ management. CCS Concepts: • Data Quality → Context in Data Quality Management.
Organizations face many challenges in obtaining information and value from data for the improvement of their operations. For example, business processes are rarely modeled explicitly, and their data is coupled with business data and implicitly managed by the information systems, hindering a process perspective. This paper presents a framework that integrates process and data mining techniques and algorithms, process compliance, data quality, and adequate tools to support evidence-based process improvement in organizations. It aims to reduce the effort of identifying and applying techniques, methodologies, and tools in isolation for each case, providing an integrated approach that guides each operative phase and expands the capabilities of analysis, evaluation, and improvement of business processes and organizational data.
Lecture Notes in Computer Science, 2014
The data collected and analyzed in any software engineering experiment are key artifacts; however, these data might contain errors. We propose a Data Quality model specific to data obtained from software engineering experiments, which provides a framework for analyzing and improving these data. We apply the model to two controlled experiments, which results in the discovery of data quality problems that need to be addressed. We conclude that data quality issues have to be considered before obtaining the experimental results.
Lecture Notes in Computer Science, 2023
A Data Warehouse (DW) is a database that stores information oriented to satisfying decision-making requests. It is a database with some particular features concerning the data it contains and its utilisation. These features cause the DW design process and strategies to differ from those for OLTP systems. We address the DW design problem through a schema transformation approach. We propose a set of schema transformation primitives, which are high-level operations that transform relational sub-schemas into other relational sub-schemas. We also provide some tools that can help in the DW design process: (a) the design trace, (b) a set of DW schema invariants, (c) a set of rules that specify how to correct schema-inconsistency situations generated by applications of primitives, and (d) some strategies for designing the DW through application of primitives.
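To illustrate the general idea of a schema transformation primitive with a design trace, here is a minimal Python sketch. The toy schema representation, the denormalization primitive, and all names are illustrative assumptions, not the paper's actual primitive catalog.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    """A toy relational schema element: name, attributes, primary key."""
    name: str
    attributes: list[str]
    key: list[str]

@dataclass
class TraceEntry:
    """One design-trace record: which primitive was applied to what."""
    primitive: str
    inputs: list[str]
    output: str

trace: list[TraceEntry] = []

def denormalize(parent: Relation, child: Relation, fk: str) -> Relation:
    """Merge a parent relation into a child along a foreign key, yielding a
    wider sub-schema of the kind used when flattening toward a star schema."""
    merged = Relation(
        name=f"{child.name}_{parent.name}",
        attributes=child.attributes + [a for a in parent.attributes if a != fk],
        key=child.key,
    )
    trace.append(TraceEntry("denormalize", [parent.name, child.name], merged.name))
    return merged

# Usage: flatten Product into Sale before deriving a fact table.
product = Relation("Product", ["prod_id", "category", "brand"], ["prod_id"])
sale = Relation("Sale", ["sale_id", "prod_id", "amount"], ["sale_id"])
fact = denormalize(product, sale, fk="prod_id")
print(fact.attributes)  # ['sale_id', 'prod_id', 'amount', 'category', 'brand']
print(trace)            # the trace records each primitive application
```

Recording every application in the trace is what later enables checking invariants and replaying or correcting design steps.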
Many researchers have presented the need to incorporate and maintain data quality in Data Warehousing Systems. However, there is no consensus in the research community on how to do it. Moreover, data loaded into the Data Warehouse come from different sources with different levels of quality. The analysis domains of these data can vary, and users can perceive quality in different ways, depending on their profile, the task to be performed, etc. Data quality thus depends on multiple factors: the sources, the task to be performed with the data, user preferences, and so on. Hence, data quality depends on the context in which the data will be used. For this reason, we present a proposal to assess data quality in Data Warehousing Systems with an approach based on contexts.
Journal of Data and Information Quality, Oct 15, 2020
Property Graph databases are increasingly used in industry as a powerful and flexible way to model real-world scenarios. With this flexibility comes a great challenge for profiling tasks, which must be adapted to these new models while taking advantage of the particularities of Property Graphs. This article proposes a set of data profiling tasks, built by integrating existing methods and techniques, and a taxonomy to classify them. In addition, an application pipeline is provided, and a formal specification of some tasks is given.
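As a hedged illustration of the kind of profiling task such a taxonomy covers, the sketch below computes per-label property completeness over a toy in-memory property graph; the graph representation and the specific task are assumptions for illustration, not the article's formal specification.

```python
from collections import Counter, defaultdict

# Toy property graph: nodes carry labels and a property map.
nodes = [
    {"labels": ["Person"], "props": {"name": "Ana", "age": 34}},
    {"labels": ["Person"], "props": {"name": "Luis"}},
    {"labels": ["City"],   "props": {"name": "Montevideo"}},
]

def property_completeness(nodes):
    """For each node label, report the fraction of nodes carrying each property."""
    label_counts = Counter()
    prop_counts = defaultdict(Counter)
    for n in nodes:
        for label in n["labels"]:
            label_counts[label] += 1
            prop_counts[label].update(n["props"].keys())
    return {
        label: {p: c / label_counts[label] for p, c in props.items()}
        for label, props in prop_counts.items()
    }

print(property_completeness(nodes))
# {'Person': {'name': 1.0, 'age': 0.5}, 'City': {'name': 1.0}}
```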
AMW, 2017
There is general agreement among data quality researchers that completeness is one of the most important data quality dimensions. In particular, data density can be a crucial factor in data processing and decision-making tasks. Most techniques for evaluating data quality with respect to density are limited to counting null values. However, density is not only about null values: not-null values where there should be null values degrade the quality of data too. Besides, the existence of a null value does not necessarily imply a data quality problem. In this work we present a technique for data quality assessment based on the application of data mining techniques. Our proposal consists of creating a classification model from available data containing null and not-null values, and then using that model to assess whether a particular attribute of a record should or should not have a null value. This technique allows us to evaluate whether a null value is an error, correct, or uncertain, and likewise whether a not-null value is acceptable, an error (it should be null), or uncertain.
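A minimal sketch of this idea, under stated assumptions: a classifier predicts from the other attributes whether the target attribute should be null, and its prediction is compared with the observed value. The synthetic data, the scikit-learn decision tree, and the confidence thresholds are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 2))   # context attributes of each record
y = (X[:, 0] == 0).astype(int)          # 1 = target attribute should be null
y[rng.random(200) < 0.05] ^= 1          # inject a few labeling "errors"

model = DecisionTreeClassifier(max_depth=3).fit(X, y)

def judge(record, observed_is_null, lo=0.3, hi=0.7):
    """Classify an observed (non-)null as correct, an error, or uncertain."""
    p_null = model.predict_proba([record])[0][1]
    if lo < p_null < hi:
        return "uncertain"
    predicted_null = p_null >= hi
    return "correct" if predicted_null == observed_is_null else "error"

print(judge([0, 1], observed_is_null=True))  # likely "correct"
print(judge([2, 2], observed_is_null=True))  # likely "error": should be non-null
```

The three-way outcome (correct, error, uncertain) mirrors the evaluation described in the abstract, for both null and not-null observations.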
Communications in computer and information science, 2021
Lecture Notes in Computer Science, 2016
Data Warehousing Systems (DWS) are of great relevance for supporting decision making and data analysis. This has been proven over time through the generalization of their development and use in all kinds of organizations. Many researchers have presented the need to incorporate and maintain Data Quality (DQ) in DWS. However, there is no consensus in the research community on how, or whether, it is possible to define a set of quality dimensions for DWS, since such a set may depend on the purpose for which the data are used. Moreover, quality requirements may vary among different domains and among different users. The contribution of this paper is twofold: a study of existing proposals that relate DQ to DWS and to contexts, and a proposal of a framework for assessing DQ in DWS. This proposal is the starting point of a broader and deeper investigation that will enable quality management in DWS.
The process of building Data Warehouses (DW) is well known, with well-defined stages, but at the same time it is mostly carried out manually by IT people in conjunction with business people. Web Warehouses (WW) are DW whose data sources are taken from the web. We define a flexible WW, which can be configured according to different domains through the selection of the web sources and the definition of data processing characteristics. A Business Process Management (BPM) System allows modeling and executing Business Processes (BPs), providing support for the automation of processes. To support the process of building flexible WW we propose two levels of BPs: a configuration process that supports the selection of web sources and the definition of schemas and mappings, and a feeding process that takes the defined configuration and loads the data into the WW. In this paper we present a proof of concept of both processes, with focus on the configuration process and the defined data.
Springer eBooks, 2015
The application of Data Mining (DM) techniques to DQ, often called Data Quality Mining (DQM), offers a wide range of possibilities for DQ assessment. The goal of this work is to propose a mechanism for data currency assessment using statistics and DM techniques. The proposed approach consists of estimating the validity period of the entities using a training set and then evaluating the probability that the last known data value for each entity is still current. The proposed scheme helps keep a database up to date in two ways: it can warn that a certain data value is becoming obsolete, and it can inform the data manager about the best frequency for updating data.
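A sketch of the currency-assessment idea, under an explicit modeling assumption of my own: fit the typical update interval from historical timestamps and use an exponential survival function as the probability that the last known value is still current. The exponential model and the warning threshold are illustrative; the paper combines statistics and DM techniques.

```python
import math

def mean_interval(update_times):
    """Estimate the mean validity period from past update timestamps (days)."""
    gaps = [b - a for a, b in zip(update_times, update_times[1:])]
    return sum(gaps) / len(gaps)

def currency_probability(elapsed, mean_gap):
    """P(value still current) under an exponential inter-update model."""
    return math.exp(-elapsed / mean_gap)

history = [0, 30, 58, 95, 121]      # days on which the entity was updated
mean_gap = mean_interval(history)   # about 30 days
p = currency_probability(elapsed=45, mean_gap=mean_gap)
print(f"P(current) = {p:.2f}")
if p < 0.3:
    print("value likely obsolete; schedule a refresh")  # warn the data manager
```

The same fitted interval also suggests an update frequency: refreshing roughly once per mean validity period keeps the currency probability high.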
Lecture Notes in Computer Science, 2018
Data quality management in document-oriented data stores has not been deeply explored yet, and it presents many challenges that arise from the lack of a rigid schema associated with the data. Data quality is a critical aspect in this kind of data store, since it is neither controlled nor a priority at the data storage stage. Additionally, data quality evaluation and improvement are very difficult tasks due to the schema-less nature of the data. This paper presents a first step towards data quality management in document-oriented data stores. To address the problem, the paper proposes a strategy for defining data granularities for data quality evaluation and analyses some data quality dimensions relevant to document stores.
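As a hedged sketch of granularity-based evaluation, the example below measures completeness at field granularity against an expected field set and aggregates it to document and collection granularity. The three granularity levels and the expected-fields device are assumptions for illustration, not necessarily the paper's strategy.

```python
EXPECTED = {"name", "email", "age"}  # fields a "complete" document should carry

collection = [
    {"name": "Ana", "email": "ana@x.org", "age": 34},
    {"name": "Luis", "email": "luis@x.org"},
    {"name": "Mara"},
]

def doc_completeness(doc):
    """Document granularity: fraction of expected fields present and non-null."""
    present = {f for f in EXPECTED if doc.get(f) is not None}
    return len(present) / len(EXPECTED)

def collection_completeness(docs):
    """Collection granularity: average of the per-document scores."""
    return sum(doc_completeness(d) for d in docs) / len(docs)

for d in collection:
    print(d.get("name"), round(doc_completeness(d), 2))
print("collection:", round(collection_completeness(collection), 2))
# Ana 1.0 / Luis 0.67 / Mara 0.33 / collection: 0.67
```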
Clei Electronic Journal, Aug 1, 2016
The process of building Data Warehouses (DW) is well known, with well-defined stages, but at the same time it is mostly carried out manually by IT people in conjunction with business people. Web Warehouses (WW) are DW whose data sources are taken from the web. We define a flexible WW, which can be configured according to different domains through the selection of the web sources and the definition of data processing characteristics. A Business Process Management (BPM) System allows modeling and executing Business Processes (BPs), providing support for the automation of processes. To support the process of building flexible WW we propose two levels of BPs: a configuration process that supports the selection of web sources and the definition of schemas and mappings, and a feeding process that takes the defined configuration and loads the data into the WW. In this paper we present a proof of concept of both processes, with focus on the configuration process and the defined data.
ICIQ, 2016
The increasing amount of data published on the Web poses the challenge of making these data exploitable by different kinds of users and organizations. Additionally, the quality of published data is highly heterogeneous, and the worst problem is that it is unknown to the data consumer. In this context, we consider Web Warehouses (WW), i.e., Data Warehouses populated by web data sources, a valuable tool for analysis and decision making based on open data. In previous work we proposed the construction of WW with BPMN 2.0 Business Processes, automating the construction process through a two-phase approach: system configuration and system feeding. In this paper, we focus on the problem of including data quality management in the WW system, based on data quality model definitions, allowing data quality assessment and data quality aware integration. To achieve this, we extend previous work with the modeling of the extra activities for data quality management in BPMN 2.0 and their implementation in a BPMS.
A Data Warehouse (DW) is a database that stores information oriented to satisfying decision-making requests. It is a database with some particular features concerning the data it contains and its utilisation. These features cause the DW design process and strategies to differ from those for OLTP systems. This work presents a brief description of different approaches and techniques that address the DW design problem.
Acknowledgments: I would like to thank my advisor, Professor Raúl Ruggia, who guided me throughout the research and writing of this thesis. I would also like to thank Professors Regina Motz, Alejandro Gutiérrez, and Nora Szasz, from whom I received valuable input at different stages of this work, and all the members of the CSI Group for their support. Abstract: A Data Warehouse (DW) is a database that stores information oriented to satisfying decision-making requests. It is a database with some particular features concerning the data it contains and its utilisation. In this work we concentrate on DW design and DW evolution. The features of DWs cause the DW design process and strategies to differ from those for OLTP systems. We address the DW design problem through a schema transformation approach. We propose a set of schema transformation primitives, which are high-level operations that transform relational sub-schemas into other relational sub-schemas. We also provide some tools that can help in the DW design process: (a) the design trace, (b) a set of DW schema invariants, (c) a set of rules that specify how to correct schema-inconsistency situations generated by applications of primitives, and (d) some strategies for designing the DW through application of primitives. Schema evolution in a DW can be triggered by two different causes: (i) a change in the source schema or (ii) a change in the DW requirements. In this work we address the problem of source schema evolution. We separate this problem into two phases: (1) determination of the changes that must be made to the DW schema and to the trace, and (2) application of the evolution to the DW. To solve (1) we use the transformation trace that was generated during design. To solve (2) we propose an adaptation of existing models and techniques for database schema evolution to DW schema evolution, taking into account the features that differentiate DWs from traditional operational databases. Keywords: Data Warehouse, DW design, DW schema evolution, schema transformation, Relational DW, DW design trace
The socio-technical system supporting an organization's daily operations is becoming more complex, with distributed infrastructures integrating heterogeneous technologies, enacting business processes, and connecting devices, people, and data. This situation produces large amounts of data in heterogeneous sources, covering both business processes and organizational data. Obtaining valuable information and knowledge from these data, in order to make evidence-based improvements, is a challenge. Process mining and data mining techniques are well known and have been widely used for decades. However, although there are a few methodologies to guide mining efforts, there are still elements that must be defined and carried out project by project, without much guidance. In previous work we presented the PRICED framework, which defines a general strategy supporting mining efforts to provide organizations with evidence-based business intelligence. In this paper, we refine those ideas by presenting a concrete methodology. It defines the phases, disciplines, activities, roles, and artifacts needed to provide guidance and support from obtaining the execution data, through its integration and quality assessment, to mining and analyzing it to find improvement opportunities.
Information & Software Technology, Nov 1, 2022
Clepsydra. Revista de Estudios de Género y Teoría Feminista
During the last decades, the presence of women in Computer Science has decreased in most countries. In recent years, at the Facultad de Ingeniería (School of Engineering), Universidad de la República, Uruguay, we have carried out several activities with the goal of bringing girls closer to Computer Science, on the occasion of Girls in ICT Day. Through these activities, we intend to counter certain preconceived ideas and negative stereotypes that drive girls away from careers in this area. This paper presents the experience of three workshops carried out during 2021 (held virtually, due to the COVID-19 pandemic), which focused on programming, data, and geographic information systems, together with an analysis of the virtual platforms considered for each workshop.