Web Science and Digital Libraries Research Group

Showing posts with the label RegEx

2022-05-19: Regular Expression Rule-Based Approach for Table and Figure Reference Extraction from Scientific Papers

By Yasasi Abeysinghe - May 19, 2022

Tables and figures are an essential part of a well-written scientific paper. Scientific papers use tables to present the bulk of the detailed information, such as results and their associations. Many of the basic concepts, process flows, key natural trends, and key discoveries are presented in figures. In this blog post, I present a simple but effective rule-based approach using regular expressions (RegEx) for extracting table and figure references from the text of scientific papers.

What does a table or figure reference mean?

In scientific papers, tables and figures are referred to in the body text to support the claims. Below are some examples where tables and figures are referred to in body text.

As seen in Table 3, there are 3 cross-listed top 10 features identified by both ANOVA-F and MI (in blue text).

Figure 4 shows that evaluation results using the core features exhibit significantly different performances.

Overview of the rule-based approach

Prior to using the rule-base...
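To make the idea of a table or figure reference concrete, here is a minimal Python sketch that pulls such references out of the two example sentences quoted in the excerpt. The pattern and the helper name `extract_references` are assumptions made for illustration; they are not the rules from the original post, which are not shown in this excerpt.

```python
import re

# Assumed pattern (not the post's actual rule): the word "Table", "Figure",
# or the abbreviation "Fig." followed by whitespace and a number.
REFERENCE_PATTERN = re.compile(r"\b(Table|Figure|Fig\.)\s+(\d+)", re.IGNORECASE)


def extract_references(text):
    """Return a list of (kind, number) pairs, one per reference found."""
    references = []
    for match in REFERENCE_PATTERN.finditer(text):
        label = match.group(1).lower()
        kind = "table" if label == "table" else "figure"
        references.append((kind, int(match.group(2))))
    return references


# The two example sentences quoted in the excerpt.
sentences = (
    "As seen in Table 3, there are 3 cross-listed top 10 features identified "
    "by both ANOVA-F and MI (in blue text). Figure 4 shows that evaluation "
    "results using the core features exhibit significantly different performances."
)

print(extract_references(sentences))
# prints [('table', 3), ('figure', 4)]
```

Note that plain numbers such as "3 cross-listed" or "top 10" are not matched, because the pattern requires a leading "Table" or "Figure" keyword. A real rule set would likely also need to handle cases this sketch ignores, such as ranges ("Tables 1 and 2") or sub-figure labels ("Figure 4a").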

2020-06-07: Regular Expression — A Powerful Tool to Parse Text with Visually Identifiable Patterns

By Muntabir Choudhury - June 07, 2020

In the previous blog post, I discussed how tesseract-OCR performed on scanned Electronic Theses and Dissertations (ETDs). As that post showed, the process started with converting the cover page of each scanned ETD into an image. Then, tesseract-OCR was applied and the extracted result was saved into text files. We also saw that OpenCV OCR failed on scanned ETDs. We could try a widely used open-source tool such as GROBID, designed for scholarly papers. However, this article shows that GROBID is intended for extracting bibliographic metadata from born-digital academic papers. Finally, we decided to apply tesseract-OCR to extract the text from the cover page of scanned ETDs. Afterward, a series of regular expressions (RegEx) was applied to extract seven metadata fields, including titles, authors, academic-programs, institutions, advisors, and years. In this blog post, I will introduce how RegEx can be a powerful tool to quickly p...
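As a rough illustration of the last step described above (running regular expressions over OCR output to pull out metadata fields), the following Python sketch extracts a year and an author from a block of text. The sample cover-page text, the two patterns, and the function name `extract_fields` are all hypothetical; the post's actual expressions are not shown in this excerpt, and the sketch takes already-extracted text as input rather than calling tesseract-OCR.

```python
import re

# Assumed pattern: a four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

# Assumed pattern: a line that starts with "by" followed by the author's name.
AUTHOR_PATTERN = re.compile(r"^[ \t]*by[ \t]+(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_fields(ocr_text):
    """Return a dict with the first year and author found, or None for each."""
    year = YEAR_PATTERN.search(ocr_text)
    author = AUTHOR_PATTERN.search(ocr_text)
    return {
        "year": year.group(0) if year else None,
        "author": author.group(1).strip() if author else None,
    }


# Hypothetical OCR output for a cover page (made up for this example).
cover_page = (
    "A Hypothetical Thesis Title\n"
    "by Jane Doe\n"
    "Example University\n"
    "May 1987\n"
)

print(extract_fields(cover_page))
# prints {'year': '1987', 'author': 'Jane Doe'}
```

Patterns like these depend on visually identifiable cues in the text (a "by" line, a four-digit number), so OCR noise or unusual cover-page layouts would need additional rules.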