This work formalizes a new framework for anomaly detection, called active anomaly detection. In practice, this framework has the same cost as unsupervised anomaly detection, but with the possibility of much better results. We show that unsupervised anomaly detection is an undecidable problem and that a prior over the anomaly probability distribution must be assumed in order to obtain performance guarantees. Finally, we also present a new layer that can be attached to any deep learning model designed for unsupervised anomaly detection to transform it into an active anomaly detection method, presenting results on both synthetic and real anomaly detection datasets.
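The core idea of active anomaly detection can be illustrated with a minimal query loop: repeatedly show the analyst the most anomalous unlabeled point and collect feedback. The sketch below is an illustration under assumptions, not the paper's method; `active_anomaly_loop`, `oracle`, and `budget` are hypothetical names.

```python
import numpy as np

def active_anomaly_loop(scores, oracle, budget):
    """Minimal sketch of an active anomaly detection loop:
    query the analyst (oracle) about the highest-scoring unlabeled
    point, record the feedback, and repeat up to `budget` times.
    All names here are illustrative, not from the paper."""
    labels = {}  # index -> True (anomaly) / False (nominal)
    for _ in range(budget):
        # Rank unlabeled points by current anomaly score, highest first.
        candidates = [i for i in np.argsort(scores)[::-1] if i not in labels]
        if not candidates:
            break
        i = candidates[0]          # most anomalous unlabeled point
        labels[i] = oracle(i)      # ask the analyst for a label
        # A real method would update the model with this feedback;
        # here we only down-weight confirmed nominals as a placeholder.
        if not labels[i]:
            scores[i] = -np.inf
    return labels
```

A deep-learning variant would replace the placeholder update with a gradient step on the queried labels, which is where an attachable output layer fits naturally.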
Early-stage detection of cutaneous melanoma can vastly increase the chances of cure. Excision biopsy followed by histological examination is considered the gold standard for diagnosing the disease, but it requires long, high-cost processing and may be biased, as it involves qualitative assessment by a professional. In this paper, we present a new machine learning approach that uses raw skin Raman spectra as input. The approach is highly effective for classifying benign versus malignant skin lesions (AUC 0.98, 95% CI 0.97-0.99). Furthermore, we present a high-performance model (AUC 0.97, 95% CI 0.95-0.98) using a miniaturized spectral range (896-1039 cm-1), thus demonstrating that only a single fragment of the biological fingerprint Raman region is needed to produce an accurate diagnosis. These findings could favor the future development of a cheaper, dedicated Raman spectrometer for fast and accurate cancer diagnosis.
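The miniaturized-range idea amounts to restricting each spectrum to the 896-1039 cm-1 window before classification. The sketch below shows that preprocessing step only; the function and argument names are assumptions, not the paper's code.

```python
import numpy as np

def restrict_band(wavenumbers, intensities, low=896.0, high=1039.0):
    """Keep only the part of a Raman spectrum inside a wavenumber
    band (in cm^-1). The 896-1039 cm^-1 defaults match the window
    reported in the abstract; the names are illustrative."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    return wavenumbers[mask], intensities[mask]

# Usage: a synthetic spectrum sampled every 1 cm^-1 over 400-1800 cm^-1.
wn = np.arange(400.0, 1800.0)
inten = np.random.rand(wn.size)
wn_band, inten_band = restrict_band(wn, inten)
```

A classifier trained on `inten_band` then sees only the fingerprint fragment, which is what would let a cheaper spectrometer cover a much narrower range.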
2018 International Joint Conference on Neural Networks (IJCNN), 2018
Corpora used to learn open-domain Question Answering (QA) models are typically collected from a wide variety of topics or domains. Since QA requires understanding natural language, open-domain QA models generally need very large training corpora. A simple way to alleviate data demand is to restrict the domain covered by the QA model, thus leading to domain-specific QA models. While learning improved QA models for a specific domain is still challenging due to the lack of sufficient training data in the topic of interest, additional training data can be obtained from related topic domains. Thus, instead of learning a single open-domain QA model, we investigate domain adaptation approaches in order to create multiple improved domain-specific QA models. We demonstrate that this can be achieved by stratifying the source dataset, without the need to search for complementary data, unlike many other domain adaptation approaches. We propose a deep architecture that jointly exploits convolutional and recurrent networks for learning domain-specific features while transferring domain-shared features. That is, we use transferable features to enable model adaptation from multiple source domains. We consider different transference approaches designed to learn span-level and sentence-level QA models. We found that domain adaptation greatly improves sentence-level QA performance, and that span-level QA benefits from sentence information. Finally, we also show that a simple clustering algorithm may be employed when the topic domains are unknown, and the resulting loss in accuracy is negligible.
2021 International Joint Conference on Neural Networks (IJCNN), 2021
A particular challenge while designing duplex steel is the minimization of surface defects such as heating slivers, since these defects may significantly increase production costs: they remain mostly undetected, being observed only during final product inspection. Heating slivers may originate at different stages of the steelmaking process. To identify their formation in duplex stainless steel, we propose a feature decomposition approach that resulted in hundreds of thousands of predictive models learned from defective and non-defective duplex plates, thus offering diverse models regarding their incidence. We grouped these models based on their competing explanations, as a means to isolate the different possible root causes of heating slivers. Finally, we search for optimal models in the graphs derived from intra-cluster feature relationships.
Online social networks are used on a daily basis by millions of users around the world. More and more people use these networks to interact, to voice opinions, and to share content about many different topics, such as entertainment, weather, work, family, traffic, and even their health conditions. In summary, social networks have become another social space, with their own meaning, evolving dynamically. Many events are perceived and mentioned late by traditional media, but unfold on social networks in real time, making it possible to detect them, and such detection can support the construction of predictive models. The goal of this work is to exploit content available on social networks to detect the occurrence of, and to predict, epidemic surges, in particular those associated with dengue. Starting from the collection and processing of data from social networks, we identify how surges manifest in social networks and exploit these features to build an alarm system, as part of a rumor-based epidemic surveillance system. The results show that we were able to achieve high precision (more than 99%), as well as to predict surges in advance.
Proceedings of the 18th Brazilian symposium on Multimedia and the web - WebMedia '12, 2012
Due to the growing popularity of the Web, an increasing number of people perform e-business transactions. On the other hand, this popularity has also attracted the attention of criminals, raising the number of fraud cases on the Web and financial losses that reach billions of dollars per year. This paper proposes a methodology, based on the knowledge discovery process, to detect fraud in online payment systems. In order to evaluate this methodology, we define the concept of economic efficiency and apply it to an actual dataset from one of the largest Latin American electronic payment systems. The results show very good performance for our proposal, providing gains of up to 46.5% in comparison with the strategy currently employed.
Models have gained the spotlight in many discussions surrounding COVID-19. The urgency for timely decisions has resulted in a multitude of models, as informed policy actions must be made even while many uncertainties about the pandemic remain. In this paper, we use machine learning algorithms to build intuitive country-level COVID-19 motion models described by death toll velocity and acceleration. Model explainability techniques provide insightful data-driven narratives about these motion models: while velocity is explained by factors that are currently increasing or reducing the death toll pace, acceleration anticipates the effects of public health measures on slowing that pace. This allows policymakers and epidemiologists to understand the factors driving the outbreak and to evaluate the impacts of different public health measures.
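The "motion" quantities have a simple kinematic reading: velocity is the (smoothed) daily increase of the cumulative death toll, and acceleration is the day-to-day change in that velocity. The sketch below computes both under assumptions; the smoothing window and function names are illustrative, not the paper's exact method.

```python
import numpy as np

def motion_features(cumulative_deaths, window=7):
    """Illustrative death-toll 'motion' quantities: velocity as the
    window-averaged daily increase, acceleration as the day-to-day
    change in velocity. Window choice and names are assumptions."""
    daily = np.diff(cumulative_deaths)                   # new deaths per day
    kernel = np.ones(window) / window
    velocity = np.convolve(daily, kernel, mode="valid")  # smoothed pace
    acceleration = np.diff(velocity)                     # change in pace
    return velocity, acceleration
```

Feeding such features, alongside country-level covariates, to an explainable model is one plausible way to obtain the velocity/acceleration narratives the abstract describes.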
Efficiently retrieving and understanding messages from social media is challenging, considering that shorter messages are strongly dependent on context. Assuming that their audience is aware of background and real-world events, users can shorten their messages without compromising communication. However, traditional data mining algorithms do not account for contextual information. We argue that exploiting context can lead to advancements in the analysis of social media messages. Recall increases if context is taken into account, leading to context-aware methods for filtering messages without resorting only to keywords. A novel approach for subject classification of social media messages, using computational linguistics techniques, is proposed, employing both textual and extra-textual (contextual) information. Experimental analysis over sports-related messages indicates over 50% improvement in retrieval rate over text-based approaches due to the use of contextual information.
Name ambiguity in bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples to distinguish ambiguous author names are among the most effective solutions to the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, such systems urgently need to address (i) the automatic acquisition of examples and (ii) highly effective disambiguation even when only a few examples are available. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need for any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on ...
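The first step, automatic example acquisition via clustering, can be sketched with a deliberately simple rule: group citation records that share a coauthor name, then emit (record, cluster id) pairs as pseudo-labeled examples for the supervised second step. This transitive-sharing rule is an illustrative stand-in for the paper's clustering method; all names below are assumptions.

```python
def acquire_examples(records):
    """Step-1 sketch of the self-training idea: automatically acquire
    training examples by grouping citation records that share at least
    one coauthor name. Each record is a dict with a 'coauthors' list;
    this schema is hypothetical, not from the paper."""
    clusters = []  # each cluster: (set_of_coauthor_names, list_of_records)
    for rec in records:
        coauthors = set(rec["coauthors"])
        for names, members in clusters:
            if names & coauthors:        # shares at least one coauthor
                names |= coauthors       # grow the cluster's name set
                members.append(rec)
                break
        else:
            clusters.append((coauthors, [rec]))
    # Pseudo-labeled examples: (record, cluster_id) pairs for step 2.
    return [(rec, cid) for cid, (_, members) in enumerate(clusters)
            for rec in members]
```

A second, supervised step would then train a disambiguator on these pseudo-labeled pairs, removing the manual labeling bottleneck the abstract describes.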
Papers by Adriano Veloso