Papers by Helena Galhardas
arXiv (Cornell University), Dec 12, 2016
To extract value from ever-growing volumes of data, coming from a number of different sources, and to drive decision making, organizations frequently resort to the composition of data processing workflows, since they are expressive, flexible, and scalable. The typical workflow model enforces strict temporal synchronization across processing steps without accounting for the actual effect of intermediate computations on the final workflow output. However, this is not the most desirable behavior in a multitude of scenarios. We identify a class of applications for continuous data processing where the workflow output changes slowly and without great significance within a short-to-medium time window, so that current approaches waste compute resources and energy. To overcome this inefficiency, we introduce a novel workflow model, for continuous and data-intensive processing, capable of relaxing triggering semantics according to the impact that input data is assessed to have on the workflow output. To assess this impact, learn the correlation between input and output variation, and guarantee correctness within a given tolerated error constant, we rely on Machine Learning. The functionality of this workflow model is implemented in SmartFlux, a middleware framework which can be effortlessly integrated with existing workflow managers. Experimental results indicate that we are able to save a significant amount of resources while not deviating the workflow output beyond a small error constant, with a high confidence level.
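A minimal sketch of this impact-aware triggering idea, assuming a hypothetical RelaxedTrigger class, a small warm-up window, and a linear model relating input variation to output variation; none of these names reflect the actual SmartFlux interfaces:

```python
# Illustrative sketch of relaxed triggering semantics (not the SmartFlux API).
# A regression model learns how input variation translates into output
# variation; the workflow step is re-executed only when the predicted output
# change exceeds the tolerated error constant epsilon.
import numpy as np
from sklearn.linear_model import LinearRegression

class RelaxedTrigger:
    def __init__(self, epsilon, warmup=10):
        self.epsilon = epsilon          # tolerated output deviation
        self.warmup = warmup            # observations needed before relaxing triggers
        self.model = LinearRegression()
        self.inputs, self.outputs = [], []

    def observe(self, input_delta, output_delta):
        # Record how a past input variation actually changed the output.
        self.inputs.append(input_delta)
        self.outputs.append(output_delta)
        if len(self.inputs) >= self.warmup:
            self.model.fit(np.array(self.inputs), np.array(self.outputs))

    def should_run(self, input_delta):
        # With too little history, keep the strict (always execute) semantics.
        if len(self.inputs) < self.warmup:
            return True
        predicted = float(self.model.predict(np.array([input_delta]))[0])
        return abs(predicted) > self.epsilon
```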
Transactions on Large-Scale Data- and Knowledge-Centered Systems LI
Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown, or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised missing values (DMVs, in short) are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability, or inapplicability of the information. In this work, we tackle the detection of a particular kind of DMV: texts freely entered by human users. This problem is not addressed by DMV detection methods focused on numeric or categorical data; further, it also escapes DMV detection methods based on value frequency, since such free texts are often different from each other, thus most DMVs are unique. We encountered this problem within the ConnectionLens [6,7,8,12] project, where heterogeneous data is integrated into large graphs. We present two DMV detection methods for our specific problem: (i) leveraging Information Extraction, already applied in ConnectionLens graphs; and (ii) through text embeddings and classification. We detail their performance/precision trade-offs on real-world datasets.
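As a rough illustration of the second strategy (text embeddings plus classification), the sketch below trains a binary classifier over character n-gram TF-IDF features; the labelled examples are invented and TF-IDF merely stands in for whichever text embedding is used in practice:

```python
# Generic sketch of detecting free-text disguised missing values (DMVs) by
# embedding each value and training a binary classifier.  TF-IDF character
# n-grams stand in for a real text embedding; the data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: 1 = disguised missing value, 0 = real content.
values = ["n/a", "to be filled later", "unknown author", "Paris, 75001",
          "no comment provided", "Contract signed on 2021-03-04"]
labels = [1, 1, 1, 0, 1, 0]

dmv_classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
dmv_classifier.fit(values, labels)

# New free-text values are flagged when the classifier marks them as DMVs.
print(dmv_classifier.predict(["information not available", "Rue de Rivoli 10"]))
```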
Data journalism is the field of investigative journalism work based first and foremost on digital data. As more and more of human activity leaves strong digital traces, data journalism is an increasingly important trend. Important journalism projects increasingly involve diverse data sources, having heterogeneous data models, different structures, or no structure at all; the Offshore Leaks is a prime example. Inspired by our collaboration with Le Monde, a leading French newspaper, we designed a novel content management architecture, together with an algorithm for exploiting such heterogeneous corpora through keyword search: given a set of search terms, find links between them within and across the different datasets, which we interconnect in a graph. Our work is reminiscent of keyword search in structured and unstructured data, but data heterogeneity makes the problem computationally harder. We analyze the performance of our algorithm on real-life datasets.
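The toy breadth-first sketch below conveys the flavour of connecting keyword matches inside an integrated graph; it is deliberately simplistic and is not the algorithm analyzed in the paper (the graph, the keyword matches, and the tie-breaking rule are all illustrative):

```python
# Simplified illustration of keyword search over an integrated graph: starting
# from the nodes matching each search term, grow paths outwards and report a
# node from which every term can be reached with minimal total distance.
from collections import deque

def bfs_distances(graph, sources):
    dist, queue = {s: 0 for s in sources}, deque(sources)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def connect_keywords(graph, matches_per_keyword):
    # matches_per_keyword: one set of matching nodes per search term.
    per_keyword = [bfs_distances(graph, m) for m in matches_per_keyword]
    candidates = set.intersection(*(set(d) for d in per_keyword))
    # Pick a connection point minimising the total path length to all terms.
    return min(candidates, key=lambda n: sum(d[n] for d in per_keyword), default=None)

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(connect_keywords(graph, [{"a"}, {"d"}]))  # a node on a shortest connection
```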
Health personnel deal with medicines on a daily basis and need access to comprehensive information about medicines as fast as possible. Several books and websites are at their disposal, as well as independent software packages with extra search capabilities that can be used on Pocket PCs or mobile phones. The general public is also interested in quickly accessing information about medicines. Despite all the electronic possibilities available nowadays, the search functionalities provided are usually keyword-based or class-oriented (allowing, for instance, a search by laboratory or by ATC classification). We propose Medicine.Ask, a question-answering system about medicines that couples state-of-the-art techniques in Information Extraction and Natural Language Processing. It supplies information about medicines through a (controlled) set of questions posed in Natural Language (Portuguese). An example of such a question is: “Which are the medicines for influenza th...
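A toy illustration of answering one controlled Portuguese question, assuming a hand-written template and an invented indication table; the actual Medicine.Ask pipeline relies on Information Extraction and NLP rather than a single regular expression:

```python
# Toy sketch: a regular-expression template extracts the disease name from a
# controlled Portuguese question, and the answer is looked up in an invented
# indication table (example data only, not medical advice).
import re

INDICATIONS = {                     # hypothetical data: disease -> medicines
    "gripe": ["paracetamol", "ibuprofeno"],
}

QUESTION_TEMPLATE = re.compile(
    r"quais (?:são )?os medicamentos para (?P<disease>[\w\s]+)\??", re.IGNORECASE)

def answer(question):
    match = QUESTION_TEMPLATE.match(question.strip())
    if not match:
        return "Pergunta não suportada."
    disease = match.group("disease").strip(" ?")
    return INDICATIONS.get(disease, [])

print(answer("Quais são os medicamentos para gripe?"))  # -> paracetamol, ibuprofeno
```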
arXiv, 2018
Scientific research requires access, analysis, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. The bootstrapping of this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still does so eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, due to the availability of the entire data beforehand. In this paper, we propose a novel ETL approach for scientific data integration, as a hybrid of eager and lazy ETL approaches, applied to both data and metadata. This way, Hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We ...
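A rough sketch of the hybrid idea under invented interfaces (fetch_metadata and fetch_data are placeholders, not the paper's actual APIs): metadata is loaded eagerly at bootstrap, while data sets are materialised lazily and cached on first access:

```python
# Hybrid eager/lazy ETL sketch: the catalog (metadata) is built eagerly from
# every source; the corresponding data is pulled only when a query needs it.
class HybridRepository:
    def __init__(self, sources):
        self.catalog = {name: src.fetch_metadata() for name, src in sources.items()}
        self.sources = sources
        self.cache = {}                                  # lazily materialised data

    def query(self, source_name, dataset_id):
        key = (source_name, dataset_id)
        if key not in self.cache:
            self.cache[key] = self.sources[source_name].fetch_data(dataset_id)
        return self.cache[key]

class FakeSource:
    def __init__(self, metadata, data):
        self._metadata, self._data = metadata, data
    def fetch_metadata(self):
        return self._metadata
    def fetch_data(self, dataset_id):
        print(f"materialising {dataset_id}")             # happens once per dataset
        return self._data[dataset_id]

repo = HybridRepository({"obs": FakeSource({"tables": ["sky"]}, {"sky": [1, 2, 3]})})
print(repo.query("obs", "sky"))
print(repo.query("obs", "sky"))                          # served from the cache
```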
An Information Management System (IMS) is a software platform that enables the management of a vast array of information about products, customers, employees, suppliers, projects, production, assets, and finances of an organization. Scientific research labs and education centers also benefit from the use of an IMS. This paper presents the design, implementation, and validation of a new IMS for research labs. The system is modular in nature: each module is independent of the others and designed to be integrated with other external software systems. A set of performance and usability tests was performed to ensure that the system meets its requirements.
In order to properly determine which of several possible meanings an acronym A in sentence s has, any system that aims to find the correct meaning for A must understand the
The W3C RDB2RDF Working Group proposed a standard language, called R2RML, to map relational data into RDF triples. However, creating R2RML mappings may sometimes be a difficult task, because it involves the creation of views (within the mappings or not) and referring to them in the R2RML mapping. To overcome this difficulty, this paper first proposes algebraic correspondence assertions, which simplify the definition of relational-to-RDF mappings and yet are expressive enough to cover a wide range of mappings. Algebraic correspondence assertions include data-metadata mappings (where data elements in one schema serve as metadata components in the other), mappings containing custom value functions (e.g., data format transformation functions), and union, intersection, and difference between tables. Then, the paper shows how to automatically compile algebraic correspondence assertions into R2RML mappings.
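To make the compilation step concrete, the sketch below turns an extremely simple correspondence (one table, one subject template, one column-to-property map) into an R2RML triples map; the assertion format is invented for illustration, while the rr: terms come from the W3C R2RML vocabulary:

```python
# Illustrative compilation of a very simple correspondence assertion
# (table -> RDF class, column -> property) into an R2RML triples map in Turtle.
def assertion_to_r2rml(table, subject_template, rdf_class, column_to_property):
    pom = " ;\n".join(
        f'    rr:predicateObjectMap [ rr:predicate <{prop}> ; '
        f'rr:objectMap [ rr:column "{col}" ] ]'
        for col, prop in column_to_property.items())
    return f"""@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#{table}TriplesMap>
    rr:logicalTable [ rr:tableName "{table}" ] ;
    rr:subjectMap [ rr:template "{subject_template}" ; rr:class <{rdf_class}> ] ;
{pom} .
"""

print(assertion_to_r2rml(
    "EMP", "http://data.example.com/emp/{EMPNO}",
    "http://example.com/ns#Employee",
    {"ENAME": "http://example.com/ns#name"}))
```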
Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), 2021
Time series anomaly detection is an active research area, combining dozens of state-of-the-art methods that take heterogeneous views on what constitutes an anomaly. This diversity of views (local and global, point and segment, univariate and multivariate, context-free and context-aware anomalies) is associated with moderate-to-high output divergence between methods. As a result, the user is faced with the difficult and laborious task of selecting the most appropriate methods and identifying cross-method consensus in an attempt to optimize recall and precision. Despite the relevance of establishing agreement criteria, existing principles are scarce and suffer from major problems: (1) they show biases towards methods with correlated/redundant anomaly profiles; (2) they depend on anomaly score thresholding; (3) they prevent online detection; and (4) they offer consensus not subjected to sound statistical testing. This work proposes UNIANO (UNIfied ANOmaly), an approach that combines simple yet effective empirical multivariate distribution statistics to address these drawbacks, guaranteeing a parameter-free and statistically robust integration of heterogeneous anomaly views. In this context, anomalies detected by less prevalent and concordant anomaly profiles, such as context-aware profiles in the presence of complementary variables, are not undervalued. Given an n-length time series and m views, UNIANO is aided by adequate data structures to achieve O(n log m²n) training time and linear O(m) testing-and-updating time. The gathered results confirm the relevance of the proposed approach.
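The following sketch is not UNIANO itself; it only illustrates, under simplifying assumptions, how heterogeneous anomaly scores can be merged through empirical distribution statistics without per-method thresholds, here via empirical tail probabilities combined with Fisher's method:

```python
# Illustration only (not the UNIANO algorithm): each method's scores become
# empirical tail probabilities (no per-method threshold), then the m views are
# combined with Fisher's method into one consensus p-value per time point.
import numpy as np
from scipy import stats

def consensus_pvalues(scores):
    # scores: (m, n) array with one row of anomaly scores per detection method.
    m, n = scores.shape
    ranks = np.argsort(np.argsort(scores, axis=1), axis=1)   # 0 = lowest score
    tail_probs = 1.0 - ranks / n                             # values in [1/n, 1]
    statistic = -2.0 * np.log(tail_probs).sum(axis=0)        # Fisher's method
    return stats.chi2.sf(statistic, df=2 * m)

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 200))
scores[:, 120] += 4.0            # an anomaly most of the three views agree on
print(np.argmin(consensus_pvalues(scores)))    # very likely prints 120
```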
Most medium-sized enterprises run their databases on inexpensive off-the-shelf hardware; still, answers to quite complex queries, like ad-hoc Decision Support System (DSS) ones, are required within a reasonable time window. It therefore becomes increasingly important that the chosen database system and its tuning be optimal for the specific database size and design. Such optimization could occur in-house, based on tests with academic database benchmarks adapted to the small-scale, easy-to-use requirements of a medium-sized enterprise. This paper focuses on the industry-standard TPC-H database benchmark, which aims at measuring the performance of ad-hoc DSS queries. Since the only available TPC-H results feature large databases and run on high-end hardware, we attempt to assess whether the standard test is meaningfully downscalable and can be performed on off-the-shelf hardware common in medium-sized enterprises. We present in detail the benchmark and the steps that a non-expert must take to run a benchmark test following the TPC-H specifications. In addition, we report our own benchmark tests, comparing an open-source and a commercial database server running on off-the-shelf inexpensive hardware under a number of equivalent configurations, varying parameters that affect the performance of DSS queries.
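As an indication of how little machinery a downscaled, TPC-H-flavoured test needs, the sketch below populates a simplified lineitem-style table in SQLite and times one ad-hoc aggregate query; it is a stand-in for, not a replacement of, the official TPC-H kit and specification:

```python
# Minimal timing harness in the spirit of a downscaled DSS test: a simplified,
# TPC-H-flavoured lineitem table is populated with random rows and an ad-hoc
# aggregate query is timed on off-the-shelf hardware.
import sqlite3, time, random

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE lineitem (
    l_returnflag TEXT, l_quantity REAL, l_extendedprice REAL, l_shipdate TEXT)""")
rows = [(random.choice("ARN"), random.uniform(1, 50), random.uniform(900, 100000),
         f"1998-{random.randint(1, 12):02d}-01") for _ in range(100_000)]
con.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?)", rows)

start = time.perf_counter()
result = con.execute("""
    SELECT l_returnflag, SUM(l_quantity), AVG(l_extendedprice), COUNT(*)
    FROM lineitem WHERE l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag ORDER BY l_returnflag""").fetchall()
print(result, f"{time.perf_counter() - start:.3f}s")
```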
Relational Database Systems often support activities like data warehousing, cleaning, and integration. All these activities require performing some sort of data transformation. Since data often resides in relational databases, data transformations are often specified using SQL, which is based on relational algebra. However, many useful data transformations cannot be expressed as SQL queries, due to the limited expressive power of relational algebra. In particular, an important class of data transformations that produce several output tuples for a single input tuple cannot be expressed in that way. In this report, we analyze alternatives for processing one-to-many data transformations using Relational Database Systems, and compare them in terms of expressiveness, optimizability, and performance.
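One concrete instance of such a one-to-many transformation, with invented table and column names: every input row carrying a semicolon-separated phone list must yield one output row per phone number. The sketch implements one plausible alternative, in which the multiplication of rows is done in a host language around the database:

```python
# A one-to-many transformation: each customer row with a semicolon-separated
# phone list produces one output row per phone number.  The splitting runs in
# the host language and writes the multiplied rows back to the database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, phones TEXT)")
con.execute("CREATE TABLE customer_phones (id INTEGER, phone TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "217000000;912345678"), (2, "961111111")])

for cid, phones in con.execute("SELECT id, phones FROM customers").fetchall():
    con.executemany("INSERT INTO customer_phones VALUES (?, ?)",
                    [(cid, p) for p in phones.split(";")])

print(con.execute("SELECT * FROM customer_phones").fetchall())
# -> [(1, '217000000'), (1, '912345678'), (2, '961111111')]
```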
Journées Bases de Données Avancées, 2000
Application scenarios such as legacy data migration, Extract-Transform-Load (ETL) processes, and data cleaning require the transformation of input tuples into output tuples. Traditional approaches to implementing these data transformations encapsulate the solution either as Persistent Stored Modules (PSM) executed by an RDBMS or as transformation code written with a commercial ETL tool. Neither of these is easily maintainable or optimizable. A third approach consists of combining SQL queries with external code written in a programming language. However, this solution is not expressive enough to specify an important class of data transformations that produce several output tuples for a single input tuple.
Informatica (Slovenia), 2003
This article describes the Decision Support System (DSS) for Academic Information being developed at Instituto Superior Técnico, the Engineering School of the Technical University of Lisbon. In Portuguese, this project has been given the acronym SADIA (Sistema de Apoio à Decisão da Informação Académica). This paper focuses on the early phases of the DSS
The scientific community, public organizations and administrations have generated a large amount of data concerning the environment. There is a need to allow sharing and exchange of this type of information by various kinds of users, including scientists, decision-makers and public authorities. Metadata arises as the solution to support these requirements. We present a formal framework for classification of metadata that will give a uniform definition of what metadata
Data quality concerns arise when correcting anomalies in a single data source, or integrating data coming from multiple sources into a single data repository. The information handled may also need to undergo a formatting and normalization process so that the resulting data is structured and presented according to the application requirements. The main causes of data anomalies are: (1) the absence of universal