Lect08
NLP Applications
By Ivan Wong
Introduction
NLP task                  | Use                                                                    | Nature of data
Search                    | Find relevant content for a given user query.                         | World Wide Web / large collection of documents
Topic modeling            | Find topics and hidden patterns in a set of documents.                | Large collection of documents
Text summarization        | Create a shorter version of the text with the most important content. | Typically a single document
Recommendations           | Showing related articles.                                              | Large collection of documents
Machine translation       | Translate from one language to another.                                | A single document
Question answering system | Get answers to queries directly instead of a set of documents.        | A single document or a large collection of documents
Search and Information Retrieval
• When a user searches using a query, the search engine collects a
ranked list of documents that match the query.
• For this to happen, an “index” of the documents and the vocabulary
used in them must be built first; this index is then used to search
and rank results.
• One popular scheme for indexing textual data and ranking search
results in search engines is TF-IDF (see the sketch after this list).
• Recent developments in DL models for NLP can also be used for
this purpose.
• For example, Google recently started ranking search results and showing
search snippets using the BERT model. They claim that this has improved
the quality and relevance of their search results. This is an important
example of NLP’s usefulness in a modern-day search engine.
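To make this concrete, here is a minimal sketch of TF-IDF indexing and ranking with scikit-learn; the corpus and query below are illustrative, not from the lecture:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny illustrative document collection standing in for a crawled corpus.
docs = [
    "Marie Curie won the Nobel Prize in Physics and Chemistry.",
    "The Nobel Prize ceremony is held in Stockholm every year.",
    "Pierre Curie collaborated with Marie Curie on radioactivity research.",
]

# Indexing: represent every document as a TF-IDF vector.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Searching: vectorize the query with the same vocabulary,
# then rank documents by cosine similarity to the query.
query_vector = vectorizer.transform(["marie curie nobel prize"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```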
Search and Information Retrieval
• Spelling correction:
• The user entered an incorrect spelling, and the search engine offered a suggestion
showing the correct spelling.
• Related queries:
• The “People also ask” feature shows other related questions people ask about Marie
Curie.
• Snippet extraction:
• Each search result shows a text snippet containing the query terms.
• Biographical information extraction:
• On the right-hand side, there’s a small snippet showing Marie Curie’s biographical
details along with some specific information extracted from text. There are also
some quotes and a list of people related to her in some way.
• Search results classification:
• On top, there are categories of search results: all, news, images, videos, and so on.
Components of a Search Engine
• Crawler
• Collects all the content for the search engine.
• Indexer
• Parses and stores the content that the crawler collects and builds an
“index” so it can be searched and retrieved efficiently.
• Searcher
• Searches the index and ranks the search results for the user query
based on the relevance of the results to the query.
• Feedback
Tracks and analyzes user interactions with the search engine, such
as click-throughs and time spent on searching and on each clicked
result, and uses this data to continuously improve the search system.
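To illustrate how these components fit together, here is a toy in-memory sketch (all names are illustrative): the document set stands in for crawled content, build_index plays the indexer, and search plays the searcher:

```python
from collections import defaultdict

# Indexer: build an inverted index mapping each term to the documents containing it.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Searcher: return documents containing every query term (boolean AND retrieval).
def search(index, query):
    results = None
    for term in query.lower().split():
        matches = index.get(term, set())
        results = matches if results is None else results & matches
    return results or set()

# "Crawled" content: in a real system this comes from the crawler.
docs = {
    1: "Marie Curie won the Nobel Prize in Physics",
    2: "The Nobel Prize is awarded in Stockholm",
    3: "Curie also won the Nobel Prize in Chemistry",
}
index = build_index(docs)
print(search(index, "nobel curie"))  # {1, 3}
```

A real searcher would also rank these matches (e.g., with TF-IDF, as sketched earlier), and the feedback component would adjust that ranking over time.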
A Typical Enterprise Search Pipeline
• Crawling/content acquisition
• We don’t really need a crawler in this case, as we don’t need
data from external websites.
• Text normalization
• Once we collect the content, depending on its format, we first
extract the main text and discard additional information (e.g.,
newspaper headers). A sketch of this step follows this list.
• Indexing
• For indexing, we have to vectorize the text. TF-IDF is a popular
scheme for this, as we discussed earlier.
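Here is a minimal sketch of the normalization step (lowercasing, tokenization, stopword removal); the stopword list is a tiny illustrative sample, not a standard one:

```python
import re

# A tiny illustrative stopword list; real pipelines use fuller lists (e.g., NLTK's).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "for"}

def normalize(text):
    # Lowercase, keep alphanumeric tokens only, drop stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The report describes the quarterly sales in Europe."))
# ['report', 'describes', 'quarterly', 'sales', 'europe']
```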
A Typical Enterprise Search Pipeline
• The pipeline typically consists of the following steps:
• Query processing and execution: The search query is passed
through the text normalization process as above. Once the query
is framed, it’s executed, and results are retrieved and ranked
according to some notion of relevance. Search engine libraries
like Elasticsearch even provide custom scoring functions to
modify the ranking of documents retrieved for a given query.
• Feedback and ranking:
• To evaluate search results and make them more relevant to the
user, user behavior is recorded and analyzed, and signals such as
clicks on results and time spent on a result page are used to
improve the ranking algorithm.
https://www.elastic.co/downloads/elasticsearch
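As a sketch of indexing and querying with the official Elasticsearch Python client, assuming an instance running locally on the default port with security disabled (the index name and documents are illustrative):

```python
from elasticsearch import Elasticsearch

# Connect to a locally running Elasticsearch instance (see the download link above).
es = Elasticsearch("http://localhost:9200")

# Indexing: store a few illustrative documents in an index named "articles".
docs = [
    {"title": "Marie Curie", "body": "Marie Curie won two Nobel Prizes."},
    {"title": "Nobel Prize", "body": "The Nobel Prize is awarded annually."},
]
for i, doc in enumerate(docs):
    es.index(index="articles", id=i, document=doc)
es.indices.refresh(index="articles")

# Query processing and execution: a full-text match query, ranked by relevance.
resp = es.search(index="articles", query={"match": {"body": "nobel prize"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```

By default Elasticsearch ranks matches with BM25; the custom scoring functions mentioned above (e.g., function_score queries) can override this ranking.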
Topic Modeling
• Say we’re given a large collection
of documents, and we’re asked to
“make sense” out of it.
• Given the large volume of
documents, going through each of
them manually is not an option.
• One way to approach this is to surface the words that best
describe the corpus, such as its most frequent words, and
visualize them as a word cloud (sketched below).
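A minimal word cloud sketch using the third-party wordcloud package (the corpus is illustrative):

```python
from wordcloud import WordCloud  # pip install wordcloud

# An illustrative stand-in for a large corpus.
corpus = " ".join([
    "Topic models discover latent topics in large document collections.",
    "A word cloud shows the most frequent words in a corpus.",
    "Frequent words give a rough sense of what a corpus is about.",
])

# Generate an image in which more frequent words are drawn larger, and save it.
wc = WordCloud(width=800, height=400, background_color="white").generate(corpus)
wc.to_file("wordcloud.png")
```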
Topic Modeling
• Topic modeling generally refers to a collection of
unsupervised statistical learning methods to discover
latent topics in a large collection of text documents.
• Some of the popular topic modeling algorithms are
latent Dirichlet allocation (LDA), latent semantic
analysis (LSA), and probabilistic latent semantic
analysis (PLSA).
• In practice, the technique that’s most commonly used is
LDA.
LDA
• Learning a topic model on this collection using LDA produces a
set of topics, each represented by its most probable words, as in
the sketch below.
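A minimal LDA sketch with scikit-learn on a toy corpus (the documents and the choice of two topics are illustrative assumptions); each topic is printed as its top words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# An illustrative toy corpus; real topic modeling needs far more documents.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "the stock market fell sharply today",
    "investors worry about market volatility and stocks",
]

# LDA works on raw term counts, not TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print each topic as its five most probable words.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top}")
```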