Lect08
NLP Applications
By Ivan Wong
Introduction
NLP task                  | Use                                                                    | Nature of data
Search                    | Find relevant content for a given user query.                         | World Wide Web / large collection of documents
Topic modeling            | Find topics and hidden patterns in a set of documents.                | Large collection of documents
Text summarization        | Create a shorter version of the text with the most important content. | Typically a single document
Recommendations           | Showing related articles.                                              | Large collection of documents
Machine translation       | Translate from one language to another.                                | A single document
Question answering system | Get answers to queries directly instead of a set of documents.        | A single document or a large collection of documents
Search and Information Retrieval
• When a user searches using a query, the search engine collects a
ranked list of documents that match the query.
• For this to happen, an “index” of the documents and the vocabulary
used in them must be built first; this index is then used to search
and rank results.
• One popular scheme for indexing textual data and ranking search
results in search engines is TF-IDF (see the sketch after this list).
• Recent developments in DL models for NLP can also be used for
this purpose.
• For example, Google recently started ranking search results and showing
search snippets using the BERT model. They claim that this has improved
the quality and relevance of their search results. This is an important
example of NLP’s usefulness in a modern-day search engine.
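To make this concrete, here is a minimal sketch of TF-IDF indexing and ranking with scikit-learn; the corpus and query below are illustrative, not from the lecture:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny illustrative document collection standing in for a crawled corpus.
docs = [
    "Marie Curie won the Nobel Prize in Physics and Chemistry.",
    "The Nobel Prize ceremony is held in Stockholm every year.",
    "Pierre Curie collaborated with Marie Curie on radioactivity research.",
]

# Indexing: represent every document as a TF-IDF vector.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Searching: vectorize the query with the same vocabulary,
# then rank documents by cosine similarity to the query.
query_vector = vectorizer.transform(["marie curie nobel prize"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```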
Search and Information Retrieval
• Spelling correction:
• The user entered an incorrect spelling, and the search engine offered a suggestion
showing the correct spelling.
• Related queries:
• The “People also ask” feature shows other related questions people ask about Marie
Curie.
• Snippet extraction:
• Each search result shows a text snippet containing the query terms.
• Biographical information extraction:
• On the right-hand side, there’s a small snippet showing Marie Curie’s biographical
details along with some specific information extracted from text. There are also
some quotes and a list of people related to her in some way.
• Search results classification:
• On top, there are categories of search results: all, news, images, videos, and so on.
Components of a Search Engine
• Crawler
• Collects all the content for the search engine.
• Indexer
• Parses and stores the content that the crawler collects and builds an
“index” so it can be searched and retrieved efficiently.
• Searcher
• Searches the index and ranks the search results for the user query
based on the relevance of the results to the query.
• Feedback
Tracks and analyzes user interactions with the search engine, such
as click-throughs and time spent on searching and on each clicked
result, and uses this data to continuously improve the search system.
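To illustrate how these components fit together, here is a toy in-memory sketch (all names are illustrative): the document set stands in for crawled content, build_index plays the indexer, and search plays the searcher:

```python
from collections import defaultdict

# Indexer: build an inverted index mapping each term to the documents containing it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Searcher: return documents containing every query term (boolean AND retrieval).
def search(index, query):
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return results or set()

# "Crawled" content: in a real system this comes from the crawler.
docs = {
    1: "Marie Curie won the Nobel Prize in Physics",
    2: "The Nobel Prize is awarded in Stockholm",
    3: "Curie also won the Nobel Prize in Chemistry",
}
index = build_index(docs)
print(search(index, "nobel curie"))  # {1, 3}
```

A real searcher would also rank these matches (e.g., with TF-IDF, as sketched earlier), and the feedback component would adjust that ranking over time.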
A Typical Enterprise Search Pipeline
• Crawling/content acquisition
• We don’t really need a crawler in this case, as we don’t need
data from external websites.
• Text normalization
• Once we collect the content, depending on its format, we first
extract the main text and discard additional information (e.g.,
newspaper headers). A sketch of this step follows this list.
• Indexing
• For indexing, we have to vectorize the text. TF-IDF is a popular
scheme for this, as we discussed earlier.
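Here is a minimal sketch of the normalization step (lowercasing, tokenization, stopword removal); the stopword list is a tiny illustrative sample, not a standard one:

```python
import re

# A tiny illustrative stopword list; real pipelines use fuller lists (e.g., NLTK's).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "for"}

def normalize(text):
    # Lowercase, keep alphanumeric tokens only, drop stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The report describes the quarterly sales in Europe."))
# ['report', 'describes', 'quarterly', 'sales', 'europe']
```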
A Typical Enterprise Search Pipeline
• The pipeline typically consists of the following steps:
• Query processing and execution: The search query is passed
through the text normalization process as above. Once the query
is framed, it’s executed, and results are retrieved and ranked
according to some notion of relevance. Search engine libraries
like Elasticsearch even provide custom scoring functions to
modify the ranking of documents retrieved for a given query.
• Feedback and ranking:
• To evaluate search results and make them more relevant to the
user, user behavior is recorded and analyzed, and signals such as
clicks on results and time spent on a result page are used to
improve the ranking algorithm.
https://www.elastic.co/downloads/elasticsearch
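As a sketch of indexing and querying with the official Elasticsearch Python client, assuming an instance running locally on the default port with security disabled (the index name and documents are illustrative):

```python
from elasticsearch import Elasticsearch

# Connect to a locally running Elasticsearch instance (see the download link above).
es = Elasticsearch("http://localhost:9200")

# Indexing: store a few illustrative documents in an index named "articles".
docs = [
    {"title": "Marie Curie", "body": "Marie Curie won two Nobel Prizes."},
    {"title": "Nobel Prize", "body": "The Nobel Prize is awarded annually."},
]
for i, doc in enumerate(docs):
    es.index(index="articles", id=i, document=doc)
es.indices.refresh(index="articles")

# Query processing and execution: a full-text match query, ranked by relevance.
resp = es.search(index="articles", query={"match": {"body": "nobel prize"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```

By default Elasticsearch ranks matches with BM25; the custom scoring functions mentioned above (e.g., function_score queries) can override this ranking.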
Topic Modeling
• Say we’re given a large collection
of documents, and we’re asked to
“make sense” out of it.
• Given the large volume of
documents, going through each of
them manually is not an option.
• One way to approach this is to surface the words that best
describe the corpus, such as its most frequent words, and
visualize them as a word cloud (sketched below).
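A minimal word cloud sketch using the third-party wordcloud package (the corpus is illustrative):

```python
from wordcloud import WordCloud  # pip install wordcloud

# An illustrative stand-in for a large corpus.
corpus = " ".join([
    "Topic models discover latent topics in large document collections.",
    "A word cloud shows the most frequent words in a corpus.",
    "Frequent words give a rough sense of what a corpus is about.",
])

# Generate an image in which more frequent words are drawn larger, and save it.
wc = WordCloud(width=800, height=400, background_color="white").generate(corpus)
wc.to_file("wordcloud.png")
```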
Topic Modeling
• Topic modeling generally refers to a collection of
unsupervised statistical learning methods to discover
latent topics in a large collection of text documents.
• Some of the popular topic modeling algorithms are
latent Dirichlet allocation (LDA), latent semantic
analysis (LSA), and probabilistic latent semantic
analysis (PLSA).
• In practice, the technique that’s most commonly used is
LDA.
LDA
• Learning a topic model on this collection using LDA produces a
set of topics, each represented by its most probable words, as in
the sketch below.
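A minimal LDA sketch with scikit-learn on a toy corpus (the documents and the choice of two topics are illustrative assumptions); each topic is printed as its top words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# An illustrative toy corpus; real topic modeling needs far more documents.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors worry about market volatility and stocks",
]

# LDA works on raw term counts, not TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print each topic as its five most probable words.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top}")
```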