
A Framework for Duplicate Detection from Online Job Postings

IEEE/WIC/ACM International Conference on Web Intelligence

Yanchang Zhao, Haohui Chen, Claire M. Mason
Data61, CSIRO, Australia
yanchang.zhao@csiro.au, caronhaohui.chen@data61.csiro.au, claire.mason@data61.csiro.au

ABSTRACT

Online job boards have greatly improved the efficiency of job searching and have also provided valuable data for labour market research. However, there is a high proportion of duplicate job postings on most (if not all) job boards, because recruiters and job boards seek to improve their coverage of the market by integrating job postings from many different sources. These duplicate postings undermine the usability of job boards and the quality of labour market analytics derived from them. In this paper, we tackle the challenging problem of duplicate detection from online job postings. Specifically, we design a framework for duplicate detection from online job postings and, under the framework, implement and test 24 methods built with four different tokenisers, three vectorisers and six similarity measures. We conduct a comparative study and experimental evaluation of the 24 methods and compare their performance with a baseline approach. All methods are tested on a real-world dataset from a job board platform and are evaluated with six performance metrics. The experiments reveal that the top two methods are Overlap with skip-gram (OS) and Overlap with n-gram (OG), followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), and that all four of these methods outperform the baseline approach in detecting duplicates.

CCS CONCEPTS

• Applied computing → Document analysis; • Computing methodologies → Information extraction; • Information systems → Data cleaning.

KEYWORDS

duplicate detection, job posting, text mining, document analysis

ACM Reference Format:
Yanchang Zhao, Haohui Chen, and Claire M. Mason. 2021. A Framework for Duplicate Detection from Online Job Postings. In IEEE/WIC/ACM International Conference on Web Intelligence (WI-IAT '21), December 14–17, 2021, ESSENDON, VIC, Australia. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3486622.3493928

1 INTRODUCTION

There are many online job/career platforms, such as SEEK, CareerOne, Adzuna and LinkedIn, which have collected a large volume of job postings. However, because recruiters often post jobs to multiple platforms and platform providers scrape job postings from one another (to improve their coverage of the market), duplicate job postings are common. A recent study by Jijkoun [9] shows that an average job ad can be re-posted two to five times and that the fraction of duplicates can be as high as 50-80%. Identifying duplicate job postings is important for two reasons. First, the usability of a platform is diminished when search results include many duplicates or users receive multiple alerts for the same job. Second, job postings have become a valued source of data for monitoring labour markets, since they provide near-real-time data on employer demand for workers and skills [5, 8, 14]. Both the efficiency of users' searches for relevant job postings and the analysis of job trends and in-demand skills are substantially impacted when duplicate job postings are so common.

In this paper we describe the development of a framework for duplicate detection from online job postings. Under the framework, we have developed and experimentally evaluated 24 methods on a real-world dataset captured from Adzuna Australia, a job board and search engine for the Australian job market. The 24 methods are built with:
• four different tokenisers: a word tokeniser, a word tokeniser with stop-word removal, an n-gram tokeniser and a skip-gram tokeniser, referred to respectively as word, word-2, n-gram and skip-gram in the rest of this paper,
• three vectorisers: Document-Term Matrix (DTM), Term Frequency–Inverse Document Frequency (TFIDF) and Latent Semantic Analysis (LSA) [6], and
• six similarity measures: Jaccard, Overlap, Cosine, LSA-cosine, TFIDF-Euclidean and TFIDF-cosine.

In our experiments, the above methods were compared with Apollo [2], an existing approach to duplicate detection from job postings. The performance of each method was evaluated with six measures: correlation, AUC, accuracy, precision, recall and F1-score. Experimental results show that Overlap with skip-gram (OS) and Overlap with n-gram (OG) were the two best methods, followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), and that all four of these methods outperformed Apollo [2]. Contributions of this work include:

• a framework for detecting duplicates from job postings,
• a blocking and re-grouping method to speed up the process,
• a comparative study of 24 methods built with four different tokenisers, three vectorisers and six similarity measures, and
• an experimental evaluation of the above methods and a comparison with an existing approach on real-world job posting data.

2 RELATED WORK

2.1 Near-Duplicate Web Page Detection

Broder et al. [1] developed a method to determine the syntactic similarity of documents and used it to cluster documents on the Internet, with possible applications such as a "lost-and-found" service for documents, filtering web search results and identifying plagiarism. Lecocq [11] summarised multiple methods for near-duplicate content detection, including bag of words, shingling, hashing, MinHash and SimHash. Henzinger [7] evaluated two near-duplicate detection algorithms on a set of 1.6B web pages; the results showed that both algorithms achieved high precision for near-duplicate pairs on different sites, but worked poorly on pairs from the same site. Manku et al. [12] investigated the problem of detecting near-duplicates for web crawling, i.e., assessing whether a newly crawled web page is a near-duplicate of a previously crawled one. Their study showed that Charikar's SimHash fingerprinting technique [4] is practically useful for identifying near-duplicate web pages.
They also developed a method to quickly find all fingerprints that differ from a given one in a small number of bit positions, which is fast in both online and batch near-duplicate detection.

2.2 Duplicate Job Ads Detection

Jijkoun [9] presented a methodology used at Textkernel for duplicate detection from online job postings. Their study suggested that the fraction of duplicates can be as high as 50-80% in job postings crawled from online job boards and other sites. In that method, a statistical classifier was trained to predict whether two postings referred to the same job; shingling was used to measure the textual similarity between two documents, and a locality-sensitive hashing scheme and ElasticSearch were used to improve efficiency.

Burk et al. [2] developed a system named Apollo for near-duplicate detection of job postings in the online recruitment domain. Using a range of techniques including blocking, shingling, boilerplate text removal and Jaccard similarity, their experimental results showed that the system achieved higher precision, recall and F-score than SimHash and Shingling. In their method, the time gap between job postings was not considered, fixed heuristic thresholds and 5-shingles (i.e., 5-grams) were used, and stop words were not removed. We take this method as a baseline and compare our methods against it in our experiments.

2.3 Detecting Duplicate Postings in Online Classifieds

In 2016, Kaggle [10] hosted a competition on detecting duplicate postings in online classifieds, where duplicate postings were to be identified based on their contents, including Russian-language text and images. The postings were collected from online marketplaces where people buy and sell items, rather than from job boards. The competition provided a training dataset containing pairs of postings labelled as duplicates or non-duplicates. In contrast, we take an unsupervised approach, which is advantageous because in many instances training datasets are unavailable or very costly to obtain.

3 PROBLEM STATEMENT

There are various types of duplicate job postings. For example, a job ad may have been posted on multiple sites and therefore end up with multiple copies when crawled. Even on a single site, a job ad can be re-posted multiple times within a few weeks and sometimes even within a few days. In the collected data, we found job ad series, each composed of a single job ad re-posted multiple times within a short timeframe. Some ads are re-posted after two months, which is likely to be re-advertising of a vacancy that has not yet been filled. However, there are also series in which the same ad is re-posted almost daily or weekly; these are likely to be spam or postings by bots. Based on our exploration of the job postings data and discussions with domain experts, we applied a window of 60 days to identify duplicates within such series. That is, a re-posting of an original ad after 60 days is taken as a new job ad rather than a duplicate. The same window size has been used for job ad de-duplication in existing work [3].

We define a duplicate job posting as follows. A job posting is a duplicate if it has the same title, the same location and a very similar job description to one or more other postings posted within the 60 days prior to it. Assume A and B are two job postings. A is a duplicate of B if

\[
\begin{cases}
\mathrm{Timestamp}(A) > \mathrm{Timestamp}(B),\\
0 \le \mathrm{Date}(A) - \mathrm{Date}(B) \le 60,\\
\mathrm{Title}(A) = \mathrm{Title}(B),\\
\mathrm{Location}(A) = \mathrm{Location}(B), \text{ and}\\
\mathrm{Sim}(A, B) \ge \tau,
\end{cases}
\tag{1}
\]

where Date, Timestamp, Title and Location are respectively the post date, timestamp, job title and location of a job posting, Sim is a similarity measure and τ is a similarity threshold. If two postings have very similar job descriptions but different locations or different job titles, they are not considered duplicates of each other, because they are most likely two vacancies at different locations or with similar but distinct roles.
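The definition in Equation (1) translates directly into a predicate. The following is a minimal Python sketch of it; the field names (timestamp, posted, title, location, description) and the sim callback are illustrative assumptions, not the paper's implementation.

```python
def is_duplicate(a: dict, b: dict, sim, tau: float, window_days: int = 60) -> bool:
    # a, b: job postings with 'timestamp', 'posted' (a date), 'title',
    # 'location' and 'description' fields (illustrative field names).
    # sim: a similarity measure over the two job descriptions (Section 4.3);
    # tau: the similarity threshold for that measure.
    gap_days = (a["posted"] - b["posted"]).days
    return (a["timestamp"] > b["timestamp"]            # A was posted after B
            and 0 <= gap_days <= window_days           # within the 60-day window
            and a["title"] == b["title"]               # same job title
            and a["location"] == b["location"]         # same location
            and sim(a["description"], b["description"]) >= tau)  # very similar text
```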
4 METHODOLOGY

This section presents our methodology for duplicate detection. The framework, shown in Figure 1, is composed of four steps. First, the body text of the job postings is cleaned by converting it to lower case, removing non-alphanumeric symbols, collapsing multiple spaces and removing stop words. Next, tokens are extracted with four tokenisers, and the job postings are vectorised with the Document-Term Matrix (DTM), Term Frequency–Inverse Document Frequency (TFIDF) and Latent Semantic Analysis (LSA) [6]. After that, six similarity measures are used to calculate a similarity score between every pair of job postings. Finally, pairs with similarity scores above a given threshold are identified as duplicates. Under the framework, 24 methods are built from the combinations of the four tokenisers, three vectorisers and six similarity measures.

[Figure 1: A Framework for Duplicate Detection from Job Postings]

4.1 Tokenisers

We use four different tokenisers: a word tokeniser, a word tokeniser with stop-word removal (referred to as "word-2"), an n-gram tokeniser and a skip-gram tokeniser.

4.1.1 Word Tokeniser. A word tokeniser simply takes every single word as a token, splitting text on spaces and punctuation marks. Take the short sentence "This is a simple example of text tokenisation" as an example. The tokens produced by the word tokeniser are shown in the second row of Table 1. If stop words are removed, the remaining tokens are shown in the third row ("word-2"), where stop words such as "this", "is", "a" and "of" are excluded.

4.1.2 N-Gram. Rather than taking every single word as a token, an n-gram tokeniser extracts n consecutive words as one token. This can be done with or without stop-word removal. Since stop words contribute little to the analysis in this work, they are removed before extracting n-grams. The 2-grams for the same sentence are shown in the fourth row ("n-gram (n=2)") of Table 1.

4.1.3 Skip-Gram. Skip-grams are similar to n-grams, but the former skip over gaps when extracting tokens. More specifically, a k-skip-n-gram is a subsequence of n words in which the words occur no more than k words away from each other. Skip-grams alleviate the data sparsity problem of n-grams. The 1-skip-2-grams (with stop words removed) produced from the same text are shown in the last row of Table 1; they include "simple text" and "example tokenisation" in addition to all of the 2-grams.

Table 1: An Example Illustrating the Four Tokenisers

Tokeniser             Tokens
word                  this, is, a, simple, example, of, text, tokenisation
word-2                simple, example, text, tokenisation
n-gram (n=2)          simple example, example text, text tokenisation
skip-gram (n=2, k=1)  simple example, simple text, example text, example tokenisation, text tokenisation
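The three non-trivial tokenisers are easy to sketch. The short Python illustration below, whose only assumption is a toy stop-word list, reproduces the tokens in Table 1.

```python
import re

STOP_WORDS = {"this", "is", "a", "of"}  # toy stop-word list for the example

def word_tokenise(text, remove_stop_words=False):
    # 'word' tokeniser: split on spaces/punctuation; 'word-2' also drops stop words
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS] if remove_stop_words else tokens

def ngrams(tokens, n):
    # n-gram tokeniser: every n consecutive words form one token
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k=1):
    # k-skip-2-grams: word pairs at most k words apart (a superset of the 2-grams)
    return [f"{tokens[i]} {tokens[j]}"
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

words = word_tokenise("This is a simple example of text tokenisation",
                      remove_stop_words=True)
print(ngrams(words, 2))        # ['simple example', 'example text', 'text tokenisation']
print(skip_bigrams(words, 1))  # the 2-grams plus 'simple text' and 'example tokenisation'
```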
4.2 Vectorisation

With the tokens extracted from the job postings, a document-term matrix and a TFIDF matrix are built to vectorise the job posting data. In these matrices, each row represents a document (i.e., a job posting) and each column represents a term (i.e., a token). In a document-term matrix M, an entry m_{i,j} is the frequency of term j in document i. In a TFIDF matrix W, an entry is defined as

\[
w_{i,j} = \mathrm{tf}_{i,j} \times \log(n/\mathrm{df}_j),
\]

where tf_{i,j} is the frequency of term j in document i, df_j is the number of documents containing term j and n is the total number of documents. In addition to DTM and TFIDF, we also use Latent Semantic Analysis [6] for vectorisation. The vectors produced by the above three methods are then used to calculate the similarity between every pair of postings.

4.3 Similarity Measures

Following the above vectorisation, six metrics (Jaccard, Overlap, Cosine, LSA-cosine, TFIDF-Euclidean and TFIDF-cosine) are used to calculate similarity scores between job postings.

4.3.1 Jaccard Index. Assume that A and B are respectively the token sets of two job postings A and B. The Jaccard index is calculated as

\[
\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|},
\tag{2}
\]

where |·| denotes the size of a set. The Jaccard index lies in [0, 1]. A value close to one indicates that two postings are very similar to each other and, therefore, likely to be duplicates.

4.3.2 Overlap. The Overlap index is similar to Jaccard, but the union of the two sets in the denominator is replaced with the smaller set:

\[
\mathrm{Overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)},
\tag{3}
\]

where A and B are the token sets of two job postings A and B.

In the context of job postings, two postings can have the same job description and therefore advertise the same position, but one of them may carry extra information about the recruitment agency or the employer at the very beginning, or an extra statement about equal employment opportunity at the very end. In those situations, Overlap is more effective than the Jaccard index in identifying duplicates, as evidenced by our experimental results in Section 5.4.

4.3.3 Cosine. Cosine similarity is calculated as

\[
\mathrm{Cosine}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|},
\tag{4}
\]

where A and B are numeric vectors representing the two job postings A and B, "·" is the dot product and ‖·‖ is the L2-norm. For example, A and B can be the corresponding rows for the two postings in a document-term matrix of term frequencies.

4.3.4 TFIDF-Euclidean and TFIDF-cosine. From the term-document matrix, we derive a TFIDF matrix, based on which Euclidean and cosine similarities are calculated.

4.3.5 LSA-cosine. The last measure is LSA-cosine, where Latent Semantic Analysis (LSA) [6] is applied to the TFIDF matrix and the vectors in the latent space are then used to calculate cosine similarity.
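As an illustration of Sections 4.2 and 4.3, the sketch below implements the TFIDF weighting and the set-based and cosine measures in plain Python. It is a minimal sketch (a production system would typically use a library vectoriser, and LSA is omitted); the function names and input format are assumptions for illustration.

```python
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    # Jaccard index (Eq. 2): shared tokens over the union of both token sets
    return len(a & b) / len(a | b)

def overlap(a: set, b: set) -> float:
    # Overlap index (Eq. 3): shared tokens over the smaller token set
    return len(a & b) / min(len(a), len(b))

def cosine(u: dict, v: dict) -> float:
    # Cosine similarity (Eq. 4) between two sparse term-weight vectors
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

def tfidf_vectors(docs: list) -> list:
    # TFIDF weighting (Section 4.2): w_ij = tf_ij * log(n / df_j),
    # with each document given as a list of tokens
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

# e.g. TFIDF-cosine between the first two of a list of tokenised postings:
# vecs = tfidf_vectors(tokenised_postings)   # hypothetical input
# score = cosine(vecs[0], vecs[1])
```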
4.4 Speeding Up with Sliding Window, Blocking and Re-Grouping

Calculating the similarity between every single pair of job postings is computationally intensive, and vectorising a large number of postings also consumes a great deal of RAM. To speed up duplicate detection, we use sliding-window, blocking and re-grouping techniques.

Under the framework in Figure 1, every single pair of job postings would have to be compared to detect all duplicates, with a time complexity of O(n²t), where n is the number of job postings and t is the time for computing the similarity between two postings. Fortunately, according to the definition of duplicates in Section 3, we do not have to calculate the similarity between every pair of postings. Instead, we only need to check pairs posted within 60 days of each other, so we use a sliding window of 60 days to reduce the number of similarity calculations. Moreover, since two duplicate postings must have the same job title and the same location, we use a blocking technique to speed things up further, grouping postings by title and location and then detecting duplicates within every single group in parallel.

Nevertheless, the space of job titles and locations can be very sparse, especially for small occupations in a small suburb or a remote region. This creates many tiny groups, each consisting of only a few job postings, which introduces computing overhead when those groups are processed. To reduce this overhead, we re-group the postings by merging very small groups (specifically, those with fewer than k job postings) into bigger ones (with at least k job postings), reducing the number of groups and further improving processing speed.

To find the optimal group size k, we conducted experiments with various settings of k, ranging from 10 to 2000 (on a computer with the configuration given in Section 5). The results are shown in Figure 2, where a group size of zero means no re-grouping was applied. The left-most two bars (k = 0 and k = 10) show that, compared to simple blocking without re-grouping (k = 0), the run time is almost halved by re-grouping with a group size of 10. The run time is further reduced as the group size increases to 20, 40 and so on. When the group size grows beyond 300, the run time starts to increase because, with more job postings in every single group, the RAM needed to process each group increases and so does the number of similarity calculations. Based on this result, we use a group size of 300 to re-group tiny groups in the rest of our experiments. Note that the optimal group size depends to a large degree on the average size of the job postings, the number of parallel processes and the available RAM of the computer. However, the method is generic and can be applied to find the optimal k for other applications and computer configurations.

[Figure 2: Running Time vs Group Size]
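The blocking and re-grouping steps can be sketched as follows. This is a minimal illustration, assuming dictionary-style postings and a simple size-based packing heuristic for merging small groups; the paper does not prescribe a particular merging order.

```python
from collections import defaultdict

def block_and_regroup(postings: list, k: int = 300) -> list:
    """Blocking and re-grouping sketch for Section 4.4."""
    # Blocking: group postings by the (title, location) pair
    blocks = defaultdict(list)
    for p in postings:
        blocks[(p["title"], p["location"])].append(p)
    # Remove singletons: a lone posting cannot have a duplicate
    groups = [g for g in blocks.values() if len(g) > 1]
    # Re-grouping: keep big groups as-is, pack small ones into batches of >= k
    merged = [g for g in groups if len(g) >= k]
    batch = []
    for g in (g for g in groups if len(g) < k):
        batch.extend(g)
        if len(batch) >= k:
            merged.append(batch)
            batch = []
    if batch:
        merged.append(batch)  # leftover postings form the last group
    return merged
```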
4.5 The Process of Duplicate Detection

We process the job postings in temporal order. Given a set of job postings, the process of detecting duplicates is composed of the six steps below (a sketch of the overall pipeline follows the list).

1) Loading data. Load the given dataset of postings (such as those posted in a certain month, week or day), together with the postings from the 60 days prior to them.
2) Removing invalid postings. Job postings with empty descriptions or invalid post dates are removed.
3) Blocking. Split the data into groups by location and title, with each group containing all job postings for a unique combination of location and title.
4) Removing singles. Remove every group consisting of one single posting, because such a posting cannot have a duplicate.
5) Re-grouping. Reduce the number of groups by merging small groups (with size < k) into bigger ones, so that every new group has k or more postings. This reduces the number of processes in parallel computing. The optimal value of k is set according to the experimental results given in Section 4.4 and Figure 2.
6) Detecting duplicates within every group. Follow the process described in Figure 1 to detect duplicates within every group in parallel. For each group, which may contain multiple combinations of location and title (due to re-grouping):
   a. clean the job descriptions by converting them to lower case, removing non-alphanumeric symbols and collapsing multiple white spaces,
   b. extract tokens, such as words, n-grams and skip-grams, from the job descriptions,
   c. vectorise the tokens with the Document-Term Matrix, TFIDF or LSA,
   d. calculate similarities (such as Jaccard, Overlap, Cosine, TFIDF-Euclidean, TFIDF-cosine and LSA-cosine) between each pair of postings, and
   e. label a posting as a duplicate if there are one or more other postings
      • posted within the 60 days prior to it,
      • with the same title and location as it, and
      • with a similarity score greater than a given threshold τ (see Table 4 for values).

Step 6 runs in parallel, with each process handling one group at a time. The number of processes is set based on the number of available CPU cores, the size of RAM and the memory needed by each process. Steps 4 and 5 reduce the number of groups and make the parallel computation faster.
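The sketch below ties the six steps together, reusing the hypothetical helpers from the earlier sketches (is_duplicate, block_and_regroup). It illustrates the shape of the pipeline rather than the authors' actual code.

```python
from multiprocessing import Pool

def detect_in_group(group, sim, tau, window_days):
    # Steps 6a-6e for one group: a posting is flagged as a duplicate if an
    # earlier, sufficiently similar posting with the same title/location exists
    # (the title/location check is inside is_duplicate, so mixed re-grouped
    # groups are handled correctly).
    dups = []
    for a in group:
        for b in group:
            if a is not b and is_duplicate(a, b, sim, tau, window_days):
                dups.append(a)
                break
    return dups

def detect_duplicates(postings, sim, tau, k=300, window_days=60):
    # Steps 1-2: take the postings of interest (plus the prior 60 days)
    # and drop those with empty descriptions or invalid post dates.
    postings = [p for p in postings if p.get("description") and p.get("posted")]
    # Steps 3-5: block by (title, location), drop singletons, re-group.
    groups = block_and_regroup(postings, k)
    # Step 6: process the groups in parallel (sim must be a module-level,
    # picklable function for multiprocessing to work).
    with Pool() as pool:
        results = pool.starmap(detect_in_group,
                               [(g, sim, tau, window_days) for g in groups])
    return [d for group_dups in results for d in group_dups]
```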
5 EXPERIMENTAL EVALUATION

We conducted extensive experiments to evaluate the performance of all the developed methods on a real-world dataset of job postings and compared them with a baseline approach, Apollo [2]. We also compared the impact of different hyper-parameters for n-gram and skip-gram, as well as vocabulary size and run time. The experiments were conducted on a MacBook Pro with a 2.9GHz quad-core Intel i7 CPU and 16 GB RAM, running macOS Catalina.

5.1 Data

The job postings data used in our experiments were kindly provided by Adzuna Australia (http://www.adzuna.com.au), a job board and search engine that has been operating in Australia since 2013. The job postings in their database are not only sourced from job posters who list postings directly on the Adzuna Australia platform: postings published on other Australian online job boards can be listed on the platform free of charge, Adzuna Australia provides online listings of all the job postings published in one of Australia's largest newspapers, and it also scrapes data from other websites (e.g., large employers' websites) to capture a wider set of job postings. The range of sources captured in the database means that it is likely to provide good coverage of job postings nationally, but it also increases the potential to capture duplicates. The dataset used in this work contains approximately 9.3 million job postings (17GB in size), covering the period from March 2015 to May 2019.

From the above raw data, we created pairs of job postings, first specifying that each pair must have the same title and location. These criteria were used to identify possible duplicate job postings that were being treated by the system as two different legitimate postings. We randomly selected a sample of such pairs and had them labelled as duplicates (dup=1) or non-duplicates (dup=0) by five domain experts. This provided us with a gold dataset of 1498 pairs of job postings. The median length of these postings was 274 words and the maximum length was 1034 words. The characteristics of the duplicate and non-duplicate pairs are given in Table 2, where the last three columns give the difference in length of a pair in number of words. The table shows that duplicate pairs tend to have a smaller discrepancy in length than non-duplicate ones.

Table 2: Characteristics of Duplicate and Non-duplicate Job Ads

                                              Length Difference*
Duplicate Flag  Number of Pairs  Percentage   Min  Median  Max
dup=0                       936         62%     0      90  684
dup=1                       562         38%     0      15  326
Total                      1498        100%     0      52  684

* Difference in length of two job postings (in number of words)

5.2 Experiment Settings

All the methods, together with the Apollo approach [2], were applied to the dataset, and the duplicate scores (i.e., similarity scores) they produced were then compared against the labels provided by the five domain experts. The parameter settings of every tokeniser and their vocabulary sizes are given in Table 3. After computing all similarity scores, Youden's index [13, 15] was used to select the best cut-point for every method (the selected thresholds are displayed in Table 4). The threshold for Apollo was set to 0.5, the same setting as that used by Burk et al. [2].

Table 3: Tokenisers and Vocabulary Sizes

Tokeniser  Settings    Tokens                    Stop-word Removal  Vocabulary Size
word       -           1-grams                   no                 13,788
word-2     -           1-grams                   yes                13,670
n-gram     n=1,2,3     1-, 2- & 3-grams          yes                374,392
skip-gram  n=1,2; k=1  1- & 2-grams with 1-skip  yes                308,175
Apollo     n=5         5-grams                   no                 366,183

Table 4: Optimal Thresholds Chosen Using Youden's Index [13, 15]

Tokeniser  Jacc.   Overlap  Cosine  LSA-Cosine  TFIDF-Euc.  TFIDF-Cosine
word       0.6625  0.8741   0.8575  0.9201      0.8498      0.7687
word-2     0.6364  0.8318   0.7654  0.8887      0.7827      0.7581
n-gram     0.5318  0.8053   0.7474  0.9388      0.7742      0.6866
skip-gram  0.5366  0.8061   0.7491  0.9301      0.7772      0.6936
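The paper selects cut-points with the cutpointr R package [13]. An equivalent selection of the threshold maximising Youden's J can be sketched in Python with scikit-learn, assuming gold labels and similarity scores are available as arrays.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(labels, scores):
    # Youden's J statistic: J = sensitivity + specificity - 1 = TPR - FPR.
    # The cut-point maximising J is chosen as the similarity threshold tau.
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return thresholds[np.argmax(tpr - fpr)]

# e.g. tau = youden_threshold(gold_labels, similarity_scores)  # hypothetical inputs
```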
5.3 Evaluation Measures

To evaluate the effectiveness of duplicate detection, six performance measures are used: correlation (i.e., the correlation between predicted values and actual labels), accuracy (A = (TP+TN)/(TP+FP+TN+FN)), precision (P = TP/(TP+FP)), recall (R = TP/(TP+FN)), AUC (the area under the ROC curve) and F1-score (F1 = 2×P×R/(P+R)). In these formulae, TP and FP are respectively the numbers of true positives and false positives, and TN and FN the numbers of true negatives and false negatives.
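A small sketch computing the six measures from gold labels, similarity scores and a threshold is given below; it assumes scikit-learn for AUC and treats "correlation" as the Pearson correlation between scores and labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(labels, scores, tau):
    # The six measures of Section 5.3, computed from scores and threshold tau
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pred = (scores >= tau).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    tn = int(np.sum((pred == 0) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    p = tp / (tp + fp)   # precision
    r = tp / (tp + fn)   # recall
    return {
        "correlation": float(np.corrcoef(scores, labels)[0, 1]),  # assumed: Pearson
        "AUC": float(roc_auc_score(labels, scores)),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": p,
        "recall": r,
        "F1": 2 * p * r / (p + r),
    }
```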
5.4 Experimental Results

Figure 3 shows box-plots of the similarity scores produced by the various methods. The orange boxes (dup=0) show the distribution of similarity scores between pairs of job postings that are not duplicates, and the blue ones (dup=1) are for duplicate pairs. Within each box-plot, the bar in the middle shows the median similarity score and the box shows the interquartile range (IQR). The figure clearly shows that duplicate pairs have similarity scores close to 1, while the similarities between non-duplicate pairs are much lower. The orange boxes in the sub-figures for Jaccard, Overlap and TFIDF-cosine are further away from their corresponding blue boxes, which suggests that these three measures are more effective in detecting duplicates than the rest. Among them, the four variations showing the best differentiation between duplicates and non-duplicates are Overlap with n-gram (OG) and Overlap with skip-gram (OS), and TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS), in that their orange and blue boxes are more compact and farther apart from each other.

[Figure 3: Box-Plots of Similarity Scores for Duplicate (in blue) and Non-Duplicate (in orange) Pairs with Four Tokenisers]

These descriptive statistics are confirmed by the experimental results reported in Table 5, where the methods are ordered descendingly first by AUC and then by F1-score. Based on AUC and F1-score, OS and OG are the best methods, followed by TCG and TCS. All four of these methods perform better than Apollo [2], which uses a 5-gram tokeniser with Jaccard similarity. Although duplicate pairs tend to have a smaller discrepancy in length (see Table 2), there are many duplicate pairs whose lengths are very different from each other, owing to extra text about the employer or about equal employment opportunity added to the very beginning or end of one posting but not the other. From checking the gold-label dataset, we discovered that extra text was sometimes introduced at the end of a job ad by recruiters (or job boards) inserting standard wording or contact information at the end of their postings. This feature of job postings explains why Overlap is more effective than Jaccard for detecting duplicate job postings.

Table 5: Experimental Results (ordered descendingly first by AUC and then by F1-score)

Method*     Similarity Measure  Tokeniser  Correlation  AUC     Accuracy  Precision  Recall  F1-score
OS          Overlap             Skip-gram  0.9303       0.9952  0.9760    0.9503     0.9875  0.9686
OG          Overlap             n-Gram     0.9312       0.9951  0.9760    0.9503     0.9875  0.9686
OW2         Overlap             Word-2     0.9144       0.9949  0.9720    0.9392     0.9893  0.9636
OW          Overlap             Word       0.9138       0.9949  0.9733    0.9469     0.9840  0.9651
TCG         TFIDF-Cosine        n-Gram     0.9417       0.9931  0.9760    0.9566     0.9804  0.9684
TCS         TFIDF-Cosine        Skip-gram  0.9404       0.9929  0.9753    0.9549     0.9804  0.9675
TCW2        TFIDF-Cosine        Word-2     0.9149       0.9910  0.9700    0.9449     0.9769  0.9606
TCW         TFIDF-Cosine        Word       0.9104       0.9899  0.9660    0.9353     0.9769  0.9556
Apollo [2]  Jaccard             5-gram     0.9341       0.9896  0.9686    0.9432     0.9751  0.9589
JG          Jaccard             n-Gram     0.9330       0.9886  0.9700    0.9374     0.9858  0.9610
JS          Jaccard             Skip-gram  0.9326       0.9885  0.9700    0.9374     0.9858  0.9610
CG          Cosine              n-Gram     0.9200       0.9885  0.9673    0.9325     0.9840  0.9576
CS          Cosine              Skip-gram  0.9192       0.9884  0.9673    0.9325     0.9840  0.9576
JW2         Jaccard             Word-2     0.9246       0.9868  0.9680    0.9401     0.9769  0.9581
JW          Jaccard             Word       0.9220       0.9864  0.9680    0.9401     0.9769  0.9581
CW2         Cosine              Word-2     0.8919       0.9861  0.9619    0.9174     0.9875  0.9512
CW          Cosine              Word       0.8082       0.9810  0.9499    0.8998     0.9751  0.9360
TES         TFIDF-Euclidean     Skip-gram  0.8236       0.9799  0.9579    0.9179     0.9751  0.9456
TEG         TFIDF-Euclidean     n-Gram     0.8221       0.9798  0.9579    0.9179     0.9751  0.9456
TEW2        TFIDF-Euclidean     Word-2     0.8270       0.9794  0.9593    0.9253     0.9698  0.9470
TEW         TFIDF-Euclidean     Word       0.8021       0.9765  0.9539    0.9157     0.9662  0.9403
LCW2        LSA-Cosine          Word-2     0.6981       0.9699  0.9179    0.8351     0.9733  0.8989
LCW         LSA-Cosine          Word       0.6638       0.9690  0.9126    0.8404     0.9466  0.8904
LCS         LSA-Cosine          Skip-gram  0.6301       0.9579  0.8972    0.8148     0.9395  0.8727
LCG         LSA-Cosine          n-Gram     0.6225       0.9567  0.8985    0.8233     0.9288  0.8729

* Names of the methods (except for Apollo) are derived from the acronyms of the similarity measures and tokenisers (columns 2-3).

5.5 Results for N-Gram and Skip-Gram

Since TFIDF-Euclidean and LSA-cosine were not as effective as the other methods, they were excluded from further analysis. The remaining four methods were then tested with various parameter settings for n-gram and skip-gram, with results shown in Figure 4 and Table 6. The best performance is achieved by Overlap with a 1-skip-4-gram tokeniser. The results also show that, from 1-gram to 5-gram, i.e., as n increases, the similarity scores become more effective in separating duplicates from non-duplicates. The same observation holds for skip-grams.

[Figure 4: Box-Plots of Similarity Scores for Duplicate Pairs (in blue) and Non-Duplicate Ones (in orange) with Various n-Gram and Skip-Gram Tokenisers]

Table 6: Experimental Results for N-Grams and Skip-Grams (in descending order of AUC)

Similarity Measure  Tokeniser      Correlation  AUC     Accuracy  Precision  Recall  F1-score
Overlap             1-skip-4-gram  0.9402       0.9956  0.9746    0.9471     0.9875  0.9669
Overlap             1-skip-3-gram  0.9364       0.9954  0.9760    0.9503     0.9875  0.9686
Overlap             5-gram         0.9364       0.9953  0.9760    0.9503     0.9875  0.9686
Overlap             4-gram         0.9343       0.9952  0.9760    0.9488     0.9893  0.9686
Overlap             1-skip-2-gram  0.9303       0.9952  0.9760    0.9503     0.9875  0.9686
Overlap             3-gram         0.9312       0.9951  0.9760    0.9503     0.9875  0.9686
Overlap             2-gram         0.9264       0.9950  0.9746    0.9471     0.9875  0.9669
Overlap             1-gram         0.9144       0.9949  0.9720    0.9392     0.9893  0.9636
TFIDF-cosine        1-skip-4-gram  0.9478       0.9944  0.9746    0.9533     0.9804  0.9667
TFIDF-cosine        5-gram         0.9455       0.9938  0.9753    0.9549     0.9804  0.9675
TFIDF-cosine        1-skip-3-gram  0.9455       0.9937  0.9760    0.9582     0.9786  0.9683
TFIDF-cosine        4-gram         0.9441       0.9935  0.9753    0.9549     0.9804  0.9675
TFIDF-cosine        3-gram         0.9417       0.9931  0.9760    0.9566     0.9804  0.9684
TFIDF-cosine        1-skip-2-gram  0.9404       0.9929  0.9753    0.9549     0.9804  0.9675
TFIDF-cosine        2-gram         0.9367       0.9924  0.9746    0.9517     0.9822  0.9667
TFIDF-cosine        1-gram         0.9149       0.9910  0.9700    0.9449     0.9769  0.9606

5.6 Vocabulary Size and Run Time

To further evaluate the performance of the different methods, we also studied their vocabulary sizes (see Figure 5) and run times (see Figure 6). With the increase of n in n-gram (or skip-gram), both vocabulary size and run time grow exponentially. Although the performance of both n-gram and skip-gram improves with n (see Figure 4 and Table 6), both require a much larger vocabulary and therefore more RAM and a longer run time. The run time of LSA-cosine is longer than that of the other methods by one to two orders of magnitude and is therefore not included in Figure 6. For example, with a 3-gram tokeniser, LSA-cosine takes 323 seconds while Overlap takes only 5 seconds.

[Figure 5: Vocabulary Size]

[Figure 6: Run Time]
6 DISCUSSION AND CONCLUSION

We have developed a framework for duplicate detection from online job postings, using four tokenisers, three vectorisers and six similarity metrics, and speeding it up with sliding-window, blocking and re-grouping techniques. We have conducted a comparative study of all the methods, together with an existing approach, using job postings from a real-world job board. The experiments revealed that: 1) Overlap with skip-gram (OS) and Overlap with n-gram (OG) achieved the highest AUC and F1-score, followed by TFIDF-cosine with n-gram (TCG) and TFIDF-cosine with skip-gram (TCS); and 2) the best setting for job posting duplicate detection (in this work) is Overlap with a 1-skip-4-gram tokeniser. The best method has been used for duplicate detection from monthly job posting data for the Skills Dashboard system at Data61, CSIRO, a national research institute in Australia.

The findings of this study are not limited to improving the usability of job boards and the quality of analytics derived from online job postings. They also provide direction for other research efforts aimed at detecting duplicate content in a large body of written material, such as detecting instances of academic plagiarism and duplicate web pages.

In this work, we take the body text of a job posting as a whole when calculating similarity. However, different components within a job posting carry different meanings and can therefore be of different importance in assessing whether two postings are similar. Further research could explore whether natural language processing techniques can be used to extract the various components from a job posting and provide better input to the similarity calculations. Another direction is to use word-embedding-based methods and deep learning techniques for vectorisation. Applying the developed framework and methods to duplicate detection in other domains also represents an area for future research.

ACKNOWLEDGMENTS

We would like to thank Adzuna Australia for kindly providing the data for this research, and the Job Skills Dashboard Team at Data61, CSIRO for providing domain knowledge.

REFERENCES

[1] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic Clustering of the Web. In Selected Papers from the Sixth International Conference on World Wide Web (Santa Clara, California, USA). Elsevier, Essex, UK, 1157–1166.
[2] H. Burk, F. Javed, and J. Balaji. 2017. Apollo: Near-Duplicate Detection for Job Ads in the Online Recruitment Domain. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW). 177–182.
[3] Anthony P. Carnevale, Tamara Jayasundera, and Dmitri M. Repnikov. 2014. Understanding Online Job Ads Data. Technical Report. Georgetown University. https://cew.georgetown.edu/wp-content/uploads/2014/11/OCLM.Tech_.Web_.pdf
[4] Moses S. Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (Montreal, Quebec, Canada) (STOC '02). ACM, New York, NY, USA, 380–388.
[5] D. Deming and L. B. Kahn. 2018. Skill requirements across firms and labor markets: Evidence from job postings for professionals.
Journal of Labor Economics 36, S1 (2018), 337–369.
[6] Susan T. Dumais. 2004. Latent Semantic Analysis. Annual Review of Information Science and Technology 38, 1 (2004), 188–230. https://doi.org/10.1002/aris.1440380105
[7] Monika Henzinger. 2006. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA) (SIGIR '06). ACM, New York, NY, USA, 284–291. https://doi.org/10.1145/1148170.1148222
[8] Brad Hershbein and Lisa B. Kahn. 2018. Do Recessions Accelerate Routine-Biased Technological Change? Evidence from Vacancy Postings. American Economic Review 108, 7 (July 2018), 1737–72. https://doi.org/10.1257/aer.20161570
[9] Valentin Jijkoun. 2016. Online job postings have many duplicates. But how can you detect them if they are not exact copies of each other? https://www.textkernel.com/online-job-posting-many-duplicates-can-detect-not-exact-copies/
[10] Kaggle. 2016. Avito Duplicate Ads Detection. https://www.kaggle.com/c/avito-duplicate-ads-detection
[11] Dan Lecocq. 2015. Near-Duplicate Detection. https://moz.com/devblog/near-duplicate-detection
[12] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting Near-Duplicates for Web Crawling. In Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada) (WWW '07). ACM, New York, NY, USA, 141–150.
[13] Christian Thiele. 2020. cutpointr: Determine and Evaluate Optimal Cutpoints in Binary Classification Tasks. R package version 1.0.32. https://CRAN.R-project.org/package=cutpointr
[14] James Thurgood, Arthur Turrell, David Copple, Jyldyz Djumalieva, and Bradley Speigner. 2018. Using Online Job Vacancies to Understand the UK Labour Market from the Bottom-Up. Bank of England Working Paper 742 (July 2018).
[15] W. J. Youden. 1950. Index for Rating Diagnostic Tests. Cancer 3, 1 (1950), 32–35.