Vuk Batanović
I am a researcher at the Innovation Center of the School of Electrical Engineering, University of Belgrade, Serbia. My work is centered on the field of natural language processing, particularly on computational models of semantic similarity and sentiment analysis. My research also involves other semantic problems, such as coreference resolution, distributional semantics, and the impact of morphological normalization on semantic tasks. One of my main points of interest is dealing with the particularities of short-text processing. In addition, I am focused on creating solutions which are easily applicable not only to English, but to other, less prominent languages as well.
Address: Serbia
Address: Serbia
less
Related Authors
Francesco Perono Cacciafoco
Xi'an Jiaotong-Liverpool University
Peter Revesz
University of Nebraska Lincoln
Imran Zualkernan
American University of Sharjah
Pruthwik Mishra
International Institute of Information Technology, Hyderabad
Prafulla Bafna
Sicsr
InterestsView All (13)
Uploads
Theses by Vuk Batanović
Metodologija obuhvata sve faze izrade statističkih rešenja, od prikupljanja tekstualnog sadržaja, preko anotacije podataka, do formulisanja, obučavanja i evaluacije modela mašinskog učenja. Njena upotreba je detaljno ilustrovana na dva semantička problema – analizi sentimenta i određivanju semantičke sličnosti. Kao primer jezika sa ograničenim resursima korišćen je srpski jezik, ali se predložena metodologija može primeniti i na druge jezike iz ove kategorije.
Pored opšte metodologije, u doprinose ove disertacije spada razvoj novog, fleksibilnog sistema označavanja sentimenta kratkih tekstova, nove metrike za utvrđivanje ekonomičnosti anotacije, kao i nekoliko novih modela za određivanje semantičke sličnosti kratkih tekstova. Rezultati disertacije uključuju i kreiranje prvih javno dostupnih anotiranih skupova podataka za probleme analize sentimenta i određivanja semantičke sličnosti kratkih tekstova na srpskom jeziku, razvoj i evaluaciju većeg broja modela na ovim problemima, i prvu komparativnu evaluaciju većeg broja alata za morfološku normalizaciju na kratkim tekstovima na srpskom jeziku.
/
Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication, as well as the greater complexity regarding their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks – sentiment analysis and semantic textual similarity. The Serbian language is utilized as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis consist of the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, as well as several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.
Papers by Vuk Batanović
/
Brza integrisana procena (engl.RIA) je procedura Programa Ujedinjenih nacija za razvoj koja podrazumeva poređenje državnih strateških dokumenata o razvoju i ciljeva održivog razvoja koje su definisale Ujedinjene nacije. U ovom radu predstavljamo srpski AutoRIA sistem koji automatizuje ovu proceduru na srpskom, jeziku sa ograničenim resursima, a razvijenom morfologijom. Razmatramo probleme koji se tiču pretprocesiranja podataka za ovaj zadatak, kao i opštu arhitekturu i jezičke specifičnosti sistema.Takođe evaluiramo efekte različitih podešavanja sistema na njegove performanse koristeći rezultate ranije, ručno sprovedene RIA procedure za Srbiju.
In this talk we present a series of collaborations between researchers developing such datasets for Slovene, Croatian and Serbian, three languages with just a few million speakers each. Close relatedness of these languages brings an opportunity for a synchronized approach to the development of resources and technologies, to the benefit of all parties. Due to the complex political environment, however, such an approach has not been established until the start of the ReLDI (Regional Linguistic Data Initiative) project. The main synergistic effect of the collaborations presented here is achieved by drastically lowering the efforts required to produce datasets in additional languages, primarily in the areas of (1) the development of annotation guidelines, (2) setting up the technical requirements for the annotation campaigns and (3) pre-annotation of data with models trained for another, but very close language.
The linguistic levels covered in the resulting datasets are those of tokenisation, sentence segmentation, normalisation, morphosyntax, lemmatisation, dependency parsing, semantic role labeling, named entity recognition and coreference resolution. Two varieties of each of the three languages are covered: the standard variety and the variety of the language of the Internet.
Metodologija obuhvata sve faze izrade statističkih rešenja, od prikupljanja tekstualnog sadržaja, preko anotacije podataka, do formulisanja, obučavanja i evaluacije modela mašinskog učenja. Njena upotreba je detaljno ilustrovana na dva semantička problema – analizi sentimenta i određivanju semantičke sličnosti. Kao primer jezika sa ograničenim resursima korišćen je srpski jezik, ali se predložena metodologija može primeniti i na druge jezike iz ove kategorije.
Pored opšte metodologije, u doprinose ove disertacije spada razvoj novog, fleksibilnog sistema označavanja sentimenta kratkih tekstova, nove metrike za utvrđivanje ekonomičnosti anotacije, kao i nekoliko novih modela za određivanje semantičke sličnosti kratkih tekstova. Rezultati disertacije uključuju i kreiranje prvih javno dostupnih anotiranih skupova podataka za probleme analize sentimenta i određivanja semantičke sličnosti kratkih tekstova na srpskom jeziku, razvoj i evaluaciju većeg broja modela na ovim problemima, i prvu komparativnu evaluaciju većeg broja alata za morfološku normalizaciju na kratkim tekstovima na srpskom jeziku.
/
Statistical approaches to natural language processing typically require considerable amounts of labeled data, and often various auxiliary language tools as well, limiting their applicability in resource-limited settings. This thesis presents a methodology for developing statistical solutions in the semantic processing of natural languages with limited resources. In these languages, not only are existing language resources limited, but so are the capabilities for developing new datasets and dedicated tools and algorithms. The proposed methodology focuses on short texts due to their prevalence in digital communication, as well as the greater complexity regarding their semantic processing.
The methodology encompasses all phases in the creation of statistical solutions, from the collection of textual content, to data annotation, to the formulation, training, and evaluation of machine learning models. Its use is illustrated in detail on two semantic tasks – sentiment analysis and semantic textual similarity. The Serbian language is utilized as an example of a language with limited resources, but the proposed methodology can also be applied to other languages in this category.
In addition to the general methodology, the contributions of this thesis consist of the development of a new, flexible short-text sentiment annotation system, a new annotation cost-effectiveness metric, as well as several new semantic textual similarity models. The thesis results also include the creation of the first publicly available annotated datasets of short texts in Serbian for the tasks of sentiment analysis and semantic textual similarity, the development and evaluation of numerous models on these tasks, and the first comparative evaluation of multiple morphological normalization tools on short texts in Serbian.
/
Brza integrisana procena (engl.RIA) je procedura Programa Ujedinjenih nacija za razvoj koja podrazumeva poređenje državnih strateških dokumenata o razvoju i ciljeva održivog razvoja koje su definisale Ujedinjene nacije. U ovom radu predstavljamo srpski AutoRIA sistem koji automatizuje ovu proceduru na srpskom, jeziku sa ograničenim resursima, a razvijenom morfologijom. Razmatramo probleme koji se tiču pretprocesiranja podataka za ovaj zadatak, kao i opštu arhitekturu i jezičke specifičnosti sistema.Takođe evaluiramo efekte različitih podešavanja sistema na njegove performanse koristeći rezultate ranije, ručno sprovedene RIA procedure za Srbiju.
In this talk we present a series of collaborations between researchers developing such datasets for Slovene, Croatian and Serbian, three languages with just a few million speakers each. Close relatedness of these languages brings an opportunity for a synchronized approach to the development of resources and technologies, to the benefit of all parties. Due to the complex political environment, however, such an approach has not been established until the start of the ReLDI (Regional Linguistic Data Initiative) project. The main synergistic effect of the collaborations presented here is achieved by drastically lowering the efforts required to produce datasets in additional languages, primarily in the areas of (1) the development of annotation guidelines, (2) setting up the technical requirements for the annotation campaigns and (3) pre-annotation of data with models trained for another, but very close language.
The linguistic levels covered in the resulting datasets are those of tokenisation, sentence segmentation, normalisation, morphosyntax, lemmatisation, dependency parsing, semantic role labeling, named entity recognition and coreference resolution. Two varieties of each of the three languages are covered: the standard variety and the variety of the language of the Internet.