Machine Learning to Geographically Enrich Understudied Sources: A Conceptual Approach

Lorella Viola; Jaap Verheul


2020

This paper discusses the added value of applying machine learning (ML) to contextually enrich digital collections. In this study, we employed ML as a method to geographically enrich historical datasets. Specifically, we used a sequence tagging tool (Riedl and Padó 2018) which implements TensorFlow to perform NER on a corpus of historical immigrant newspapers. Afterwards, the entities were extracted and geocoded. The aim was to prepare large quantities of unstructured data for a conceptual historical analysis of geographical references. The intention was to develop a method that would assist researchers working in spatial humanities, a recently emerged interdisciplinary field focused on geographic and conceptual space. Here we describe the ML methodology and the geocoding phase of the project, focussing on the advantages and challenges of this approach, particularly for humanities scholars. We also argue that, by choosing to use largely neglected sources such as immigrant newspapers (also known as ethnic newspapers), this study contributes to the debate about diversity representation and archival biases in digital practices.

Lorella Viola¹ ᵃ and Jaap Verheul² ᵇ

¹ Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg, Belval Campus, Maison des Sciences Humaines, 11, Porte des Sciences, L-4366 Esch-sur-Alzette, Luxembourg
² Department of History and Art History, Utrecht University, Drift 6, 3512 BS, Utrecht, The Netherlands

Keywords: Machine Learning, Sequence Tagging, Spatial Humanities, Geographical Enrichment, Immigrant Newspapers.

1 INTRODUCTION

Advances in machine learning (ML) are allowing researchers both in computer science and the humanities to develop new tools and methods for exploring digital collections.
At the same time, libraries are increasingly turning to ML methods to maximise the potential of their collections, to improve the user experience, and to identify the technical requirements for facilitating the discovery and use of digital collections. One effective application of ML is enriching digital data with geographical information. Because ML uses contextual information to disambiguate entities, the method goes beyond the state of the art of place name extraction and disambiguation based on gazetteers or ensembles of NER tools (Canale, Lisena, and Troncy 2018; Won, Murrieta-Flores, and Martins 2018; Coll Ardanuy and Sporleder 2017; Coll Ardanuy 2017).

This article discusses the added value as well as the challenges of using an ML method as preparation for conceptual Named Entity Recognition (cNER), a way to identify subjective and situational geographical markers and connect them to explicit geo-references to space. In doing so, cNER aligns with what has been referred to as Spatial Turns, a revision of past approaches to the study of space and place (Murrieta-Flores & Martins, 2019), which acknowledges that place and space are distinct entities, as place is created through social experiences and can be both real and imagined. As its evidence, the study uses a corpus of Italian American historical newspapers (ChroniclItaly, Viola 2018) published between 1898 and 1920. We argue that cNER opens up a way to explore the cultural context of geographical place and that, by choosing to use a largely neglected data source such as immigrant newspapers, this study as a whole contributes to the debate about language diversity representation and archival biases in digital practices.

ᵃ https://orcid.org/0000-0001-9994-0841
ᵇ https://orcid.org/0000-0001-6978-7698

Viola, L. and Verheul, J. Machine Learning to Geographically Enrich Understudied Sources: A Conceptual Approach.
DOI: 10.5220/0009094204690475. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020), pages 469-475. ISBN: 978-989-758-395-7. Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
ARTIDIGH 2020 - Special Session on Artificial Intelligence and Digital Heritage: Challenges and Opportunities

2 CONTEXT OF THE STUDY

The study stands at the crossroads of migration studies, linguistics, history, and digital humanities. It was developed within the context of a larger project, the GeoNewsMiner (GNM; Viola et al., 2019), which aimed to create a tool for visualising geo-references. The tool was intended to assist researchers working in spatial humanities, a recently emerged interdisciplinary field focussing on geographic and conceptual space, especially from a historical perspective (Bodenhamer, Corrigan, & Harris, 2010, 2015a). Essentially, spatial humanities are based on Geographic Information Systems (GIS). GIS are used to geo-reference a dataset, map it, display it, and analyse it by cross-referencing different layers of information. This process of cross-referencing has provided researchers working in fields such as environmental history, historical demography, and economic, urban and medieval history with new perspectives, leading them to challenge traditional positions and even explore new questions.

One of the conceptual challenges that digital humanities is engaging with as it evolves out of GIS mapping is to identify the layers of meaning that humans have historically attached to geographical spaces. Only when humans invest meaning in such landscapes, locales and dwellings is the objective space turned into a “place” that they can be emotionally attached to, and that can be part of collective narratives of pride, longing or loss.
Whereas geographical space is “a realm without meaning […] a ‘fact of life’, which, like time, produces the basic coordinates for human life,” the concept of place is space in which humans have invested meaning (Cresswell, 2010, p. 10; Gregory, 2014; Withers, 2009). One of the central aims of spatial humanities has become to connect these cultural constructs to a Euclidean grid of space, for instance by performing “deep mapping” (Bodenhamer, Corrigan, & Harris, 2015b, pp. 26–28, passim).

Importantly, the subjective attachment to place is expressed in the way such locations are given proper names. Whereas GIS codes represent digital definitions of geographical space, toponyms or place names are often much more ambiguous cultural markers that represent attachment, fear, longing or other sentiments. Spatial humanities has begun to connect these subjective connotations to the toponyms that can be extracted as named entities from digitized texts, such as travel narratives, novels and newspapers (Donaldson, Gregory, & Taylor, 2017; Tally, 2011; Taylor, Donaldson, Gregory, & Butler, 2018).

Although much work has been done in the field of geographic text analysis (GTA), one of the next challenges within spatial humanities, and the one this article addresses, is how such geographical markers change over time as a result of human movement and migration (White, 2010, pp. 17, 18, passim). Here we attempt to tackle questions concerning the way places are mentioned by migrants, individuals who are displaced and have to make sense of their lives across contesting cultural values and traditions. The challenge is to trace how toponyms serve as diasporic idiotopes that migrants use to negotiate the attachment to their lost homeland and the new host society. We argue that spatial humanities can help to extract such indicators of a sense of place in large heritage collections and map the “persistence of place in a hectic world” (Pascual-de-Sans, 2004).
Drawing on the application of ML, GIS, data mining, and visualisation, the paper discusses how performing cNER on ethnic newspapers can provide researchers with new perspectives on the geographies of the past, so as to open up new ways to study the socio-cognitive dimension of migration history.

3 THE TASK OF NER

In order to conceptualise the historicization of place name disambiguation, a method is needed to identify toponymic identifiers (such as place names, borders, and nations) in big data collections within their proper historical context. Current Named Entity Recognizers (such as the much-used Stanford Named Entity Tagger) assume rigid designators and are historically static, as they are often based on contemporary word lists. This means that they cannot interpret historical changes in place names (e.g., St. Petersburg - Petrograd - Leningrad) within their proper historical context and cannot deal with culturally ambiguous geographical references (e.g., America, United States, the New World, Washington) or homonymic identifiers (e.g., Limerick – limerick) (Marrero et al., 2013; Neudecker, 2014).

The most common way to overcome these limitations is to train classifiers, employing semi-supervised ML, sometimes using crowdsourcing, or to use hand-crafted grammar-based systems that require intensive supervision by computational linguists. The required annotation labour prohibits application to the big datasets of historical periodicals and other serial texts that are currently available (Ju et al., 2016). An alternative route is to disambiguate entities on the basis of contextual information sources such as Wikipedia (Coll Ardanuy, 2017; Coll Ardanuy & Sporleder, 2017; Zhang & Iria, 2009) or to combine place name taggers with a scoring system within an ensemble (Canale, Lisena, & Troncy, 2018; Won, Murrieta-Flores, & Martins, 2018).
As recent surveys confirm, however, ML algorithms based on neural networks now outperform all methods that are based on gazetteers or static databases (Yadav & Bethard, 2019). ML-based methods present two major advantages for text enrichment: 1) they may be based on the historical context of a historical corpus (McDonough, Moncla, & van de Camp, 2019); 2) they are able to recognize toponyms in a dynamic way as a geographical concept (Eijnatten, 2019). We propose to use the term conceptual NER (cNER) for a level of analysis that enriches place name analysis.

4 METHODOLOGY AND DATASET

To establish and maintain internal cohesion whilst distinguishing themselves from others, diasporic groups adopt a collective concept of identity. This shared identity is constructed through a variety of markers, which may be religious, linguistic, performative or of another nature, for example geographical. Diasporic identity markers of this type are conveyed through language to express both the bond with a ‘remembered homeland’ and the connection with the host society. cNER applies a pragmatic perspective to the task of NER by assessing the socio-linguistic information and historical context in which places are mentioned and discussed. This is done by combining the visualisation of place name references with linguistic, social, and historical data, both automatically (for instance by means of sentiment analysis) and non-automatically, through the qualitative analysis of sample excerpts. The overarching aim is to visualise the narratives behind distant and local places and obtain deeper insights into how such links were maintained and renegotiated over time for contemporary purposes and future generations. Here we focus on the ML (cf. 4.1) and geocoding (cf. 4.2) parts of the study.

4.1 Machine Learning

The collection was first tagged for entities using an advanced ML sequence tagging tool that implements TensorFlow (Riedl and Padó, 2018).
The novelty of the tool lies in its combination of a BiLSTM, a CRF, and character embeddings. The two methods were tested separately against four datasets to compare both their individual performance and that of their combination. Riedl and Padó found that combining a BiLSTM with a CRF as top layer consistently outperforms CRFs with hand-coded features when enough data is available. They concluded that modern RNNs have much to recommend them to researchers working in NER, as they consistently yield the best performance. Methodologically, they used pre-trained word embeddings together with character embeddings, the latter being trained along with the model itself. They also used character- and subword-based word embeddings computed with FastText (Bojanowski et al., 2017) which, by retrieving embeddings for unknown words through the incorporation of subword information, was found to significantly alleviate issues with out-of-vocabulary words.

4.2 Geocoding

Once the tagging task was completed, locations were geocoded using the Google Geocoding API¹. Geocoding with Google was a two-stage process. First, the Google Geocoding API provides users with a Place ID for each location. The Place ID uniquely identifies a place as it is stored in the Google Places database and on Google Maps. Because the language of the dataset was Italian, the language of the API was set to Italian. It was found that setting the API language to the language of the dataset improved the accuracy of the geocoding results. At the same time, however, this meant that the results were also returned in Italian. Therefore, in order to have the results returned in English, only the Place ID was extracted in this first stage. Second, once the Place ID was received, it was possible to use the Google Geocoding API to perform reverse geocoding, that is, to obtain all the details for the location in English (e.g., geo-coordinates, administrative level).
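The two-stage lookup described in 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: the endpoint and the `address`, `place_id`, `language` and `key` parameters follow Google's public Geocoding API documentation, while the function names, the API key placeholder and the mock response values are hypothetical. The sketch only builds request URLs and parses an already-decoded JSON response, so it makes no network calls.

```python
from typing import Optional
from urllib.parse import urlencode

# Public JSON endpoint of the Google Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def place_id_request(place_name: str, api_key: str) -> str:
    """Stage 1: query the place name with the API language set to Italian."""
    params = {"address": place_name, "language": "it", "key": api_key}
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


def details_request(place_id: str, api_key: str) -> str:
    """Stage 2: resolve the Place ID again, asking for English output."""
    params = {"place_id": place_id, "language": "en", "key": api_key}
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


def first_place_id(response: dict) -> Optional[str]:
    """Keep only the Place ID of the top result from a stage 1 response."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    return response["results"][0]["place_id"]


def location_details(response: dict) -> Optional[dict]:
    """Extract name, coordinates and location types from a stage 2 response."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    top = response["results"][0]
    point = top["geometry"]["location"]
    return {
        "name": top["formatted_address"],
        "lat": point["lat"],
        "lng": point["lng"],
        "types": top["types"],
    }


# Hypothetical response, shaped like the API's JSON output (values invented).
MOCK_RESPONSE = {
    "status": "OK",
    "results": [{
        "place_id": "hypothetical-place-id",
        "formatted_address": "Rome, Italy",
        "geometry": {"location": {"lat": 41.9, "lng": 12.5}},
        "types": ["locality", "political"],
    }],
}

if __name__ == "__main__":
    print(place_id_request("Roma", "YOUR_API_KEY"))
    print(details_request(first_place_id(MOCK_RESPONSE), "YOUR_API_KEY"))
    print(location_details(MOCK_RESPONSE))
```

Keeping only the Place ID from the Italian-language query and requesting everything else in a second, English-language query mirrors the trade-off described above: the first call benefits from matching the language of the source, the second returns details in the language needed for analysis.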
4.3 ChroniclItaly

To demonstrate the potential of cNER, we used a corpus of Italian ethnic newspapers (i.e., ChroniclItaly, Viola 2018) as an example of diasporic media published in the United States between 1898 and 1920. ChroniclItaly is an open access collection that includes all front pages of seven Italian language newspapers published in California, Massachusetts, Pennsylvania, Indiana, Vermont, and West Virginia between 1898 and 1920. The corpus, which was extracted from the Chronicling America newspaper collection of the Library of Congress, includes 4,810 issues for a total of 16,624,571 words. Featuring mainstream (prominenti), radical (sovversivi), and politically independent newspapers, ChroniclItaly is a well-balanced resource for the study of the Italian immigrant press of the time. Moreover, because it is entirely digital, this corpus is a powerful tool for conducting text-based searches and analysis, both quantitative and qualitative. The newspapers’ titles are: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia.

¹ https://developers.google.com/maps/documentation/geocoding/start

Although immigrant newspapers have often been used by migration historians, including to study questions of belonging in relation to space, this socio-cognitive dimension of migration remains largely unexplored digitally. Immigrant newspapers are underused not only as a source for data analysis but also as a starting point for creating research and analysis tools. Thus, by using ChroniclItaly, we also aimed to contribute to the debate about the lack of diversity, archival biases and silences in the archives in digital scholarship.
Thanks to larger amounts of data, the digital analysis of place name references in immigrant storytelling today allows researchers to understand how individuals made sense of their diasporic identities within the host community and perhaps to reconsider previous interpretations. Finally, it is worth mentioning that digital scholars wishing to carry out research in languages other than English often find themselves confronted with a relative lack of appropriate computational resources, for instance already trained models in the desired language. Thus, by both using and creating resources in Italian, the study also addresses the issue of underrepresentation of languages other than English in digital scholarship.

5 TAGGING THE CORPUS

The sequence tagging model for the Italian language was trained on I-CAB (Italian Content Annotation Bank), an open access corpus annotated for entities (i.e. persons-PER, organizations-ORG, locations-LOC, and geopolitical entities-GPE), temporal expressions, and relations between entities. I-CAB contains 525 news articles taken from the Italian newspaper L'Adige and totals around 180,000 words. Embeddings were computed on Italian Wikipedia and trained using FastText with 300 dimensions². Once the training was complete, the output had the following format (Figure 1):

Figure 1: Output of the sequence tagger for ChroniclItaly.

² https://fasttext.cc/docs/en/crawl-vectors.html

The first column is the input word, the second column specifies the pre-processed, lowercased word, and the third column contains a flag indicating whether the word was seen during training (KNOWN) or not (UNKNOWN). If labels are assigned to the input file, these will appear in the third column. The last column contains the predicted tags. The no-entity tag is O. Because some entities (e.g., Stati Uniti "United States") have multiple words, the tagging scheme distinguishes between the beginning (tag B-...)
and the inside of an entity (tag I-...). Figure 2 shows the tags:

Figure 2: Tags of the sequence tagger.

6 ML RESULTS

The sequence tagger retrieved 1,369 unique locations (both LOC and GPE), which occurred 214,110 times throughout the whole corpus. Because each individual document was time-stamped, the number of references to each location could be quantified at any given time within the timeframe of ChroniclItaly (i.e., 1898-1920). The overall scores for the Italian models are shown in Table 1:

Table 1: F1 score for Italian models.

Type         Score
accuracy     98.15%
precision    83.64%
recall       82.14%
FB1          82.88

Table 2 shows the scores for each entity type:

Table 2: F1 score for Italian models per entity.

Entity    Precision    Recall    FB1
GPE       83.90%       86.18%    85.02
LOC       69.70%       44.23%    54.12
ORG       73.36%       73.08%    73.22
PER       89.78%       87.59%    88.68

The tagged version of ChroniclItaly is ChroniclItaly 2.0 (Viola, 2019) and it is available as an open access resource³.

7 VISUALISATION

To visualise the results, we chose to use a Shiny⁴ app, the GeoNewsMiner (GNM; Viola et al., 2019). This allowed us to present and analyse the data in an intuitive, interactive, and reproducible way.

Figure 3: Static image of the GNM app.

Within GNM, references to place names in ChroniclItaly can be explored according to five different levels of aggregation:

- Time: from 1898 to 1920
- Newspaper’s title
- Type of frequency visualization: absolute, percentage
- Geographical information: include/exclude references to cities/regions
- Percentile: select the least/most mentioned places

In order to reflect the changing geo-political borders within the analysed period (1898-1920), users can additionally choose between three historical world maps displaying the different borders at three crucial points in history which intersect with the timeframe of ChroniclItaly: 1880, 1914, 1920. By default, GNM displays a contemporary (1994) world map.
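As a rough illustration of how tagged tokens can become the per-place, per-year counts that a tool such as GNM filters and displays, the sketch below merges B-/I- tagged tokens into full place names and tallies them by year, in absolute and percentage terms. It is an assumption-laden sketch rather than the project's own code: the function names, the `(word, tag)` input shape and the sample tokens are hypothetical, and only the B-/I-/O scheme and the LOC and GPE labels come from the text above.

```python
from collections import Counter

# Entity labels treated as places, following the LOC/GPE distinction above.
PLACE_TYPES = {"LOC", "GPE"}


def extract_places(tagged_tokens):
    """Merge B-/I- tagged tokens into full place names (e.g. 'Stati Uniti')."""
    places, current, current_type = [], [], None
    for word, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" and entity_type in PLACE_TYPES:
            # A new place starts; close any place still being built.
            if current:
                places.append(" ".join(current))
            current, current_type = [word], entity_type
        elif prefix == "I" and current and entity_type == current_type:
            # Continuation of the place currently being built.
            current.append(word)
        else:
            # O tag, non-place entity, or stray I- tag: close the open place.
            if current:
                places.append(" ".join(current))
            current, current_type = [], None
    if current:
        places.append(" ".join(current))
    return places


def count_by_year(issues):
    """Count place mentions per (place, year); issues yields (year, tokens)."""
    counts = Counter()
    for year, tokens in issues:
        for place in extract_places(tokens):
            counts[(place, year)] += 1
    return counts


def percentage_share(counts, year):
    """Express each place's mentions in a year as a share of that year's total."""
    total = sum(n for (_, y), n in counts.items() if y == year)
    if total == 0:
        return {}
    return {place: 100 * n / total
            for (place, y), n in counts.items() if y == year}


if __name__ == "__main__":
    # Hypothetical tagged tokens from two issues published in the same year.
    issue_a = [("negli", "O"), ("Stati", "B-GPE"), ("Uniti", "I-GPE"),
               ("e", "O"), ("Roma", "B-GPE")]
    issue_b = [("Roma", "B-GPE")]
    counts = count_by_year([(1900, issue_a), (1900, issue_b)])
    print(counts)
    print(percentage_share(counts, 1900))
```

Keying the counter on `(place, year)` keeps the absolute and percentage views derivable from the same table, which is one simple way to support both frequency types listed above.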
Finally, users can also share their results, download the raw data, which may reflect their filter selection, and download the visualised map as a .png file. GNM is available as an open access resource⁵; full documentation of the project is also available on GitHub⁶. Figure 3 shows a static image of the GNM app.

³ https://doi.org/10.24416/UU01-4MECRO
⁴ https://shiny.rstudio.com/

8 DISCUSSION

The method presented a number of technical challenges. The first remark concerns the performance of the sequence tagger. Although the overall F1 score was satisfactory, the performance for the entity LOC was rather poor. However, in ChroniclItaly GPE entities significantly outnumbered LOC entities (77.2% GPE vs 22.8% LOC), thus at least partially compensating for this limitation.

This first issue can be attributed to the second challenge of the study, namely OCR issues. The newspapers in ChroniclItaly were digitized primarily from microfilm holdings. In addition to the well-known limitations of OCR processes, such as unusual text styles or very small fonts, other limitations occur when dealing with old material, including markings on the pages or a generally poor condition of the original text. Such limitations also apply to the OCR-generated searchable texts in ChroniclItaly, which therefore contain errors. However, the OCR quality was found to be better in the most recent texts, perhaps due to a better conservation status or better initial condition of the originals, which overall improved over the course of the nineteenth century. Therefore, the quality of the OCR data can vary greatly even within the same newspaper. The OCR error limitation could, however, be at least partially overcome in two ways. First, it was reasonable to assume that important concept words would have been repeated several times within an article, thus increasing the likelihood that OCR read them correctly in at least some of the passages.
Second, the geo-coding was restricted to place names that were referred to at least 8 times across the whole collection, as the highest number of false positives was found for places with fewer than 8 occurrences. Other minor issues mainly concerned multi-name locations (e.g., Costa del Pacifico ‘Pacific Coast’), which were tagged as individual parts, as shown in Figure 4:

Figure 4: Example of tags of the sequence tagger for multi-name locations.

⁵ https://utrecht-university.shinyapps.io/GeoNewsMiner/
⁶ https://github.com/lorellav/GeoNewsMiner

The third challenge concerned the geocoding phase. In addition to the geo-coordinates, Google provides further details, such as the types[] field indicating why those specific geo-coordinates have been attributed to a location. Understanding the types[] of a location is therefore very important, especially when working with historical data. The Google Places database stores places based on a contemporary world map; however, the locations in a historical dataset may have changed name or may no longer exist. Moreover, categories such as country, city, region, municipality, etc., which Google uses to determine the location types[], are highly dependent on the location itself and, consequently, certain categories may not apply or may change from country to country. In geocoding the places in ChroniclItaly 2.0, we encountered cases in which the type and level given by Google required a manual edit. The full list of remarks and manual edits is available in the GNM GitHub repository.

Despite the described challenges, we argue in favour of ML methods to enrich digital collections. One of the biggest advantages of using ML for tagging, for instance, is that it is not based on gazetteers; place name extraction and disambiguation are therefore more reliable.
Another advantage is that models can be continuously improved by increasing the amount and quality of data the algorithms learn from, so that they can make faster and more accurate predictions. Finally, once a trained model is available, ML methods offer the huge benefit of requiring no manual supervision, and thus bear great potential also for researchers without advanced computational training, such as humanities scholars. Indeed, because solid training in ML was not necessary for using the sequence tagger, we found that the most challenging part of the project was the geocoding phase. Therefore, ML bears great potential for the humanities: with a modest investment of time, it may be possible to automatically enrich large amounts of data, saving scholars precious time and resources that can be dedicated to investigating new avenues of data analysis.

9 CONCLUSIONS

This paper discussed the added value of applying ML to contextually enrich digital collections. In this study, we employed ML as a method to geographically enrich a historical dataset. Specifically, we used a deep learning architecture for NER tasks (Riedl and Padó 2018) which implements TensorFlow to perform NER on a corpus of historical immigrant newspapers (ChroniclItaly, Viola 2018). The aim was to prepare large quantities of unstructured data for a conceptual historical analysis of geographical references, which we called conceptual Named Entity Recognition, cNER. Triangulating the quantitative information provided by ML with data visualisation and a qualitative (i.e., sentiment), socio-historical and linguistic analysis, cNER enables us to trace and understand the changing cultural constructions that are attached to place names, as they are derived from the historical context. This further pragmatic level of analysis will help us to establish how imagined “place” is defined over time in relation to changing realities of space.
We argue that, overall, the cNER method does better justice to the historical ambiguities that are embedded in the texts themselves than extraction on the basis of gazetteers or static external information allows us to achieve. Despite a number of limitations, which were mainly encountered during the geocoding phase, we found that the method has much to recommend it, particularly to humanities scholars, who are more and more confronted with the challenge of exploring collections that are larger than before and in a digital format. Finally, we also argued that, by choosing to use largely neglected sources such as Italian immigrant newspapers, this study contributed to the debate about language diversity representation and archival biases in digital practices.

REFERENCES

Bodenhamer, D. J., Corrigan, J., & Harris, T. M. (Eds.). (2010). The spatial humanities: GIS and the future of humanities scholarship. Bloomington, Ind.: Indiana Univ. Press.
Bodenhamer, D. J., Corrigan, J., & Harris, T. M. (Eds.). (2015a). Deep maps and spatial narratives. Bloomington: Indiana University Press.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Canale, L., Lisena, P., & Troncy, R. (2018). A Novel Ensemble Method for Named Entity Recognition and Disambiguation Based on Neural Network. In D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, … E. Simperl (Eds.), The Semantic Web – ISWC 2018 (Vol. 11136, pp. 91–107). https://doi.org/10.1007/978-3-030-00671-6_6
Coll Ardanuy, M. (2017). Entity-Centric Text Mining for Historical Documents. Georg-August-Universität Göttingen, Göttingen.
Coll Ardanuy, M., & Sporleder, C. (2017). Toponym disambiguation in historical documents using semantic and geographic features. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage - DATeCH2017, 175–180. https://doi.org/10.1145/3078081.3078099
Cresswell, T. (2010). Place: A short introduction (Repr.). Malden, Mass.: Blackwell.
Donaldson, C., Gregory, I. N., & Taylor, J. E. (2017). Locating the beautiful, picturesque, sublime and majestic: Spatially analysing the application of aesthetic terminology in descriptions of the English Lake District. Journal of Historical Geography, 56, 43–60. https://doi.org/10.1016/j.jhg.2017.01.006
Eijnatten, J. V. (2019). Something about the Weather. Using Digital Methods to Mine Geographical Conceptions of Europe in Twentieth-Century Dutch Newspapers. BMGN - Low Countries Historical Review, 134(1), 28–61. https://doi.org/10.18352/bmgn-lchr.10655
Gregory, I. N. (2014). Further Reading: From Historical GIS to Spatial Humanities: An Evolving Literature. In I. N. Gregory & A. Geddes (Eds.), Toward spatial humanities: Historical GIS and spatial history (pp. 186–202). Bloomington, Ind.: Indiana Univ. Press.
Ju, Y., Adams, B., Janowicz, K., Hu, Y., Yan, B., & McKenzie, G. (2016). Things and Strings: Improving Place Name Disambiguation from Short Texts by Combining Entity Co-Occurrence with Topic Modeling. In E. Blomqvist, P. Ciancarini, F. Poggi, & F. Vitali (Eds.), Knowledge Engineering and Knowledge Management (Vol. 10024, pp. 353–367). https://doi.org/10.1007/978-3-319-49004-5_23
Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J., & Gómez-Berbís, J. M. (2013). Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5), 482–489. https://doi.org/10.1016/j.csi.2012.09.004
McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: Geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33(12), 2498–2522. https://doi.org/10.1080/13658816.2019.1620235
Murrieta-Flores, P., & Martins, B. (2019). The geospatial humanities: Past, present and future. International Journal of Geographical Information Science, 33(12), 2424–2429.
Neudecker, C. (2014, March 3). Named Entity Recognition for digitised newspapers – Europeana Newspapers. Retrieved 10 November 2019, from http://www.europeana-newspapers.eu/named-entity-recognition-for-digitised-newspapers/
Pascual-de-Sans, A. (2004). Sense of place and migration histories: Idiotopy and idiotope. Area, 36(4), 348–357. https://doi.org/10.1111/j.0004-0894.2004.00236.
Riedl, M., & Padó, S. (2018). A Named Entity Recognition Shootout for German. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), 120–125. Melbourne, Australia, July 15-20, 2018.
Tally, R. T. (Ed.). (2011). Geocritical explorations: Space, place, and mapping in literary and cultural studies. New York: Palgrave Macmillan.
Taylor, J., Donaldson, C. E., Gregory, I. N., & Butler, J. O. (2018). Mapping Digitally, Mapping Deep: Exploring Digital Literary Geographies. Literary Geographies, 4(1), 10–19.
Viola, L. (2018). ChroniclItaly: A corpus of Italian American newspapers from 1898 to 1920. Utrecht University. Retrieved from https://public.yoda.uu.nl/ilab/UU01/T4YMOW.html
Viola, L. (2019). ChroniclItaly 2.0. A corpus of Italian American newspapers annotated for entities, 1898-1920 (Version 2.0). Retrieved from https://doi.org/10.24416/UU01-4MECRO
Viola, L., De Bruin, J., van Eijden, K., & Verheul, J. (2019). The GeoNewsMiner (GNM): An interactive spatial humanities tool to visualize geographical references in historical newspapers (v1.0.0). Retrieved from https://github.com/lorellav/GeoNewsMiner
White, R. (2010). Spatial History Project. Retrieved 8 November 2019, from https://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
Withers, C. W. J. (2009). Place and the ‘Spatial Turn’ in Geography and in History. Journal of the History of Ideas, 70(4), 637–658. https://doi.org/10.1353/jhi.0.0054
Won, M., Murrieta-Flores, P., & Martins, B. (2018). Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora. Frontiers in Digital Humanities, 5. https://doi.org/10.3389/fdigh.2018.00002
Yadav, V., & Bethard, S. (2019). A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. ArXiv:1910.11470 [Cs]. Retrieved from http://arxiv.org/abs/1910.11470
Zhang, Z., & Iria, J. (2009). A novel approach to automatic gazetteer generation using Wikipedia. Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources, 1–9. Retrieved from http://dl.acm.org/citation.cfm?id=1699765.1699766
