Papers by Tatiana Shavrina
ArXiv, 2022
The General QA field has been developing its methodology with the Stanford Question Answering Dataset (SQuAD) as the principal benchmark. However, compiling factual questions requires time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs...
Cornell University - arXiv, Apr 15, 2022
Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia (the 20201101 dump version for each language) and the Colossal Clean Crawled Corpus (https://tensorflow.org/datasets/catalog/c4). We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; the Deepspeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low-resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on a wide spectrum of multilingual tasks, including classification, generation, sequence labeling, and knowledge probing. The models were evaluated with zero-shot and few-shot methods. Furthermore, we compared the classification tasks with the state-of-the-art multilingual model XGLM. The source code and the mGPT XL model are publicly released.
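The perplexity measurement mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes the model has already assigned a natural-log probability to each token of the evaluation text.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    the exponential of the negative mean log-likelihood."""
    avg_ll = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_ll)

# A model that assigns uniform probability 1/4 to every token
# has perplexity 4, regardless of sequence length.
uniform = [math.log(0.25)] * 8
```

Lower perplexity means the model finds the text more predictable, which is why it is a convenient single number to report across all covered languages.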
Natural Language Engineering
Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency for the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English ca...
Artificial General Intelligence (AGI) is showing growing performance in numerous applications: beating human performance in Chess and Go, using knowledge bases and text sources to answer questions (SQuAD), and even passing human examinations (the Aristo project). In this paper, we describe the results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning, and text generation. Competing systems pass the final native-language exam (in Russian), including versatile grammar tasks (test and open questions) and an essay, achieving a high score of 69%, with 68% being the average human result. During the competition, a baseline for the task and essay parts was proposed, and 80+ systems were submitted, showing different approaches to task understanding and reasoning. All the data and solutions can be found on GitHub: https://github.com/sberbank-ai/combined_solution_aij2019
This paper reports on the first competition on automatic spelling correction for the Russian language, SpellRuEval, held within the framework of "Dialogue Evaluation". The competition aims to bring together groups of Russian academic researchers and IT companies in order to gain and exchange experience in automatic spelling correction, concentrating especially on social media texts. The data for the competition was taken from the Russian segment of LiveJournal. Seven teams took part in the competition; the best results were achieved by the model using edit distance and phonetic similarity for candidate search and an n-gram language model for candidate reranking. We discuss in detail the algorithms used by the teams, as well as the methodology of evaluation for automatic spelling correction.
This paper describes an automatic spelling correction system for Russian. The system utilizes information from different levels, using edit distance for candidate search and a combination of weighted edit distance and a language model for candidate hypothesis selection. The hypotheses are then reranked by logistic regression using the edit distance score, language model score, etc. as features. We also experimented with morphological and semantic features but gained no advantage from them. Our system won the first SpellRuEval competition for Russian spell checkers on all metrics and achieved an F1 measure of 75%.
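The candidate-search-plus-reranking pipeline described above can be sketched as follows. This is a simplified illustration, not the paper's system: candidates come from a toy vocabulary, and a unigram frequency table stands in for the weighted edit distance, n-gram language model, and logistic-regression reranker.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def correct(word, vocab_freq, max_dist=2):
    """Pick the in-vocabulary candidate closest in edit distance,
    breaking ties by corpus frequency."""
    cands = [(w, edit_distance(word, w)) for w in vocab_freq]
    cands = [(w, d) for w, d in cands if d <= max_dist]
    if not cands:
        return word  # no close candidate: leave the word unchanged
    return min(cands, key=lambda wd: (wd[1], -vocab_freq[wd[0]]))[0]
```

For example, `correct("speling", {"spelling": 12, "spoiling": 3})` selects "spelling" (edit distance 1) over "spoiling" (edit distance 2). A real system would score candidates in sentence context with the language model rather than by frequency alone.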
Computational Linguistics and Intellectual Technologies, 2020
The paper presents the results of GramEval 2020, a shared task on Russian morphological and syntactic processing. The objective is to process Russian texts starting from provided tokens to parts of speech (POS), grammatical features, lemmas, and labeled dependency trees. To encourage multi-domain processing, five genres of Modern Russian are selected as test data: news, social media and electronic communication, wiki-texts, fiction, and poetry; Middle Russian texts are used as the sixth test set. The data annotation follows the Universal Dependencies scheme. Unlike in many similar tasks, a collection of existing resources, whose annotation is not perfectly harmonized, is provided for training, so the variability in annotations is a further source of difficulty. The main metrics are the average accuracy of POS, feature, and lemma tagging, and LAS. In this report, the organizers of GramEval 2020 overview the task, training and test data, evaluation methodology, submission ro...
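The LAS metric mentioned above (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation; its unlabeled counterpart, UAS, checks the head only. A minimal sketch, assuming each token is represented as a (head_index, deprel) pair:

```python
def attachment_scores(gold, pred):
    """gold and pred are parallel lists of (head_index, deprel)
    per token. Returns (UAS, LAS)."""
    assert len(gold) == len(pred) and gold
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head + label
    return uas, las
```

A parse with every head correct but one mislabeled relation thus scores UAS = 1.0 and LAS below 1.0, which is why the two are reported together.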
Journal of Linguistics/Jazykovedný casopis, 2017
The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CoNLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data, and the GICR corpus with resolved homonymy, all exhibiting different tagsets, lemmatization rules, pipeline architectures, technical solutions, and error systematicity. The collection includes both normative texts (news and modern literature) and more informal discourse (social media and spoken data); the texts are available under the CC BY-NC-SA 3.0 license.
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Panel discussion with authors of papers about 'Intelligence measures, AGI Benchmarks and Evaluations' at the AGI-20 conference. Held live on Underline, 06/24/2020.
Texts, as an unlimited but indirect trace of human thinking processes, are the richest source for AI modelling that we have. How to handle them, and where SuperGLUE comes in: let's find out together.
ArXiv, 2021
The new generation of pre-trained NLP models pushes the SOTA to new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework for comparing language models, compatible with the jiant environment, which supports over 50 NLU tasks, including the SuperGLUE benchmark and multiple probing suites. We demonstrate its applicability for two GLUE-like suites in different languages.
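Measuring inference time and memory footprint alongside quality, as described above, can be sketched with standard-library tools. This is an illustrative stand-in, not MOROCCO's actual instrumentation: `tracemalloc` only sees Python-level allocations, so it is a rough proxy for the memory footprint rather than a GPU measurement.

```python
import time
import tracemalloc

def profile_call(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)        # the model's inference step
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping each model's prediction function this way yields comparable (time, memory) pairs to report next to the downstream-task scores.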
Voprosy Jazykoznanija
The article discusses current research in the field of applied linguistics dedicated to the evaluation of artificial intelligence (AI) systems. Linguistic tests are used as the principal tool for evaluating the level of intelligence of such systems, being the most affordable way of training AI systems and, at the same time, having the high variability necessary for the formulation of intellectual tasks. This paper provides an overview of current methodology for training and testing AI systems and describes the gold standards of textual tasks (benchmarks) in the General Language Understanding Evaluation (GLUE) methodology. We also present an overview of how the theoretical apparatus and practices of linguistics are used to create a Russian-language test for examining the abilities of AI systems, the Russian SuperGLUE. Further convergence of machine learning and linguistic methods can fill gaps both in the practice of evaluating AI systems and in their effective training.
A test set that contains manually annotated sentences with gapping. The test set was compiled from SynTagRus (v. 2015), the dependency treebank for Russian that provides comprehensive manually corrected morphological and syntactic annotation.
This paper is part of a larger study examining genre features extracted directly from texts and suitable both for classification tasks and for adapting automatic morphological and syntactic tagging models to data from various genres. The purpose of this work is to identify and describe significant genre features obtained from big data using the texts alone (without metadata, information about the author, date, literary style and method, etc.). Based on the selected features, we explore how texts can be delimited according to one of the most widely used philological classifications and test the hypothesis that texts can be grouped by genre successfully relying solely on text-internal features.
Lecture Notes in Computer Science
A corpus manager is the next stage of text-material processing after downloading and cleaning. A corpus manager makes it possible to build a text-processing pipeline quickly and efficiently without constructing the chain by hand, which suits small or preliminary corpus studies.