Papers by Tatiana Shavrina
ArXiv, 2022
The General QA field has been developing its methodology with the Stanford Question Answering Dataset (SQuAD) as the principal benchmark. However, compiling factual questions requires time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs...
Cornell University - arXiv, Apr 15, 2022
Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia (the 20201101 dump version for each language) and the Colossal Clean Crawled Corpus (https://tensorflow.org/datasets/catalog/c4). We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; the Deepspeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low-resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on a wide spectrum of multilingual tasks, including classification, generation, sequence labeling, and knowledge probing. The models were evaluated with zero-shot and few-shot methods. Furthermore, we compared the classification tasks with the state-of-the-art multilingual model XGLM. The source code and the mGPT XL model are publicly released.
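The perplexity measurement mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes the model has already assigned a natural-log probability to each token of the evaluation text.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    the exponential of the negative mean log-likelihood."""
    avg_ll = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg_ll)

# A model that assigns uniform probability 1/4 to every token
# has perplexity 4, regardless of sequence length.
uniform = [math.log(0.25)] * 8
```

Lower perplexity means the model finds the text more predictable, which is why it is a convenient single number to report across all covered languages.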
Natural Language Engineering
Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency for the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English ca...
Artificial General Intelligence (AGI) is showing growing performance in numerous applications: beating human performance in Chess and Go, using knowledge bases and text sources to answer questions (SQuAD), and even passing human examinations (the Aristo project). In this paper, we describe the results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning, and text generation. Competing systems pass the final native-language exam (in Russian), including versatile grammar tasks (test and open questions) and an essay, achieving a high score of 69%, with 68% being the average human result. During the competition, a baseline for the task and essay parts was proposed, and 80+ systems were submitted, showing different approaches to task understanding and reasoning. All the data and solutions can be found on GitHub: https://github.com/sberbank-ai/combined_solution_aij2019
This paper reports on the first competition on automatic spelling correction for the Russian language, SpellRuEval, held within the framework of "Dialogue Evaluation". The competition aims to bring together groups of Russian academic researchers and IT companies in order to gain and exchange experience in automatic spelling correction, concentrating especially on social media texts. The data for the competition was taken from the Russian segment of LiveJournal. Seven teams took part in the competition; the best results were achieved by the model using edit distance and phonetic similarity for candidate search and an n-gram language model for candidate reranking. We discuss in detail the algorithms used by the teams, as well as the methodology of evaluation for automatic spelling correction.
This paper describes an automatic spelling correction system for Russian. The system utilizes information from different levels, using edit distance for candidate search and a combination of weighted edit distance and a language model for candidate hypothesis selection. The hypotheses are then reranked by logistic regression using the edit distance score, language model score, etc. as features. We also experimented with morphological and semantic features but gained no advantage from them. Our system won the first SpellRuEval competition for Russian spell checkers on all metrics and achieved an F1 measure of 75%.
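The candidate-search-plus-reranking pipeline described above can be sketched as follows. This is a simplified illustration, not the paper's system: candidates come from a toy vocabulary, and a unigram frequency table stands in for the weighted edit distance, n-gram language model, and logistic-regression reranker.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def correct(word, vocab_freq, max_dist=2):
    """Pick the in-vocabulary candidate closest in edit distance,
    breaking ties by corpus frequency."""
    cands = [(w, edit_distance(word, w)) for w in vocab_freq]
    cands = [(w, d) for w, d in cands if d <= max_dist]
    if not cands:
        return word  # no close candidate: leave the word unchanged
    return min(cands, key=lambda wd: (wd[1], -vocab_freq[wd[0]]))[0]
```

For example, `correct("speling", {"spelling": 12, "spoiling": 3})` selects "spelling" (edit distance 1) over "spoiling" (edit distance 2). A real system would score candidates in sentence context with the language model rather than by frequency alone.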
Computational Linguistics and Intellectual Technologies, 2020
The paper presents the results of GramEval 2020, a shared task on Russian morphological and syntactic processing. The objective is to process Russian texts starting from provided tokens to parts of speech (POS), grammatical features, lemmas, and labeled dependency trees. To encourage multi-domain processing, five genres of Modern Russian are selected as test data: news, social media and electronic communication, wiki-texts, fiction, and poetry; Middle Russian texts are used as the sixth test set. The data annotation follows the Universal Dependencies scheme. Unlike in many similar tasks, a collection of existing resources, whose annotation is not perfectly harmonized, is provided for training, so the variability in annotations is a further source of difficulty. The main metrics are the average accuracy of POS, feature, and lemma tagging, and LAS. In this report, the organizers of GramEval 2020 overview the task, training and test data, evaluation methodology, submission ro...
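The LAS metric mentioned above (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation; its unlabeled counterpart, UAS, checks the head only. A minimal sketch, assuming each token is represented as a (head_index, deprel) pair:

```python
def attachment_scores(gold, pred):
    """gold and pred are parallel lists of (head_index, deprel)
    per token. Returns (UAS, LAS)."""
    assert len(gold) == len(pred) and gold
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head + label
    return uas, las
```

A parse with every head correct but one mislabeled relation thus scores UAS = 1.0 and LAS below 1.0, which is why the two are reported together.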
Journal of Linguistics/Jazykovedný casopis, 2017
The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CoNLL-U). The sources of the data were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data, and the GICR corpus with resolved homonymy, all exhibiting different tagsets, lemmatization rules, pipeline architectures, technical solutions, and error systematicity. The collection includes both normative texts (news and modern literature) and more informal discourse (social media and spoken data); the texts are available under the CC BY-NC-SA 3.0 license.
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)
Panel discussion with authors of papers about 'Intelligence measures, AGI Benchmarks and Evaluations' at the AGI-20 conference. Held live on Underline, 06/24/2020.
Texts, as an unlimited but indirect trace of human thinking processes, are the richest source for AI modelling that we have. How to handle them, and where SuperGLUE comes in: let's find out together.
ArXiv, 2021
The new generation of pre-trained NLP models pushes the SOTA to new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework for comparing language models, compatible with the jiant environment, which supports over 50 NLU tasks, including the SuperGLUE benchmark and multiple probing suites. We demonstrate its applicability for two GLUE-like suites in different languages.
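Measuring inference time and memory footprint alongside quality, as described above, can be sketched with standard-library tools. This is an illustrative stand-in, not MOROCCO's actual instrumentation: `tracemalloc` only sees Python-level allocations, so it is a rough proxy for the memory footprint rather than a GPU measurement.

```python
import time
import tracemalloc

def profile_call(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)        # the model's inference step
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping each model's prediction function this way yields comparable (time, memory) pairs to report next to the downstream-task scores.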
Voprosy Jazykoznanija
The article discusses current research in the field of applied linguistics dedicated to the evaluation of artificial intelligence (AI) systems. Linguistic tests are used as the principal tool for evaluating the level of intelligence of such systems, being the most affordable way of training AI systems and, at the same time, having the high variability necessary for the formulation of intellectual tasks. This paper provides an overview of current methodology for training and testing AI systems and describes the gold standards of textual tasks (benchmarks) in the General Language Understanding Evaluation (GLUE) methodology. We also present an overview of how the theoretical apparatus and practices of linguistics are used to create a Russian-language test for examining the abilities of AI systems, the Russian SuperGLUE. Further convergence of machine learning and linguistic methods can fill gaps both in the practice of evaluating AI systems and in their effective training.
A test set that contains manually annotated sentences with gapping. The test set was compiled from SynTagRus (v. 2015), the dependency treebank for Russian that provides comprehensive manually corrected morphological and syntactic annotation.
This paper is part of a larger study examining genre features extracted directly from texts and suitable both for classification tasks and for adapting automatic morphological and syntactic tagging models to data from various genres. The purpose of this work is to identify and describe significant genre features obtained from big data using the texts alone (without metadata, information about the author, date, literary style and method, etc.). Based on the selected features, we explore how texts can be delimited according to one of the most widely used philological classifications and test the hypothesis that texts can be grouped by genre successfully relying solely on text-internal features.
Lecture Notes in Computer Science
A corpus manager is the next stage of text-material processing after downloading and cleaning. A corpus manager makes it possible to build a text-processing pipeline quickly and efficiently without constructing the chain by hand, which suits small or preliminary corpus studies.