UNIT IV
INTRODUCTION - LEVELS OF TEXT REPRESENTATIONS
Computers are brilliant when dealing with numbers. They are faster than humans at calculations & at decoding patterns, by many orders of magnitude. But what if the data is not numerical? What if it's language? What happens when the data is in characters, words & sentences? How do we make computers process our language? How do Alexa, Google Home & many other smart assistants understand & reply to our speech?
TOPICS
• One-Hot encoding
• Bag-of-words representation (BOW)
• Basic BOW — CountVectorizer
• Advanced BOW — TF-IDF
ONE-HOT ENCODING
The intuition behind one-hot encoding is that each bit represents a possible category, & if a particular variable cannot fall into multiple categories, then a single bit is enough to represent it.
For example, a four-word sentence over a four-word vocabulary can be encoded as:
sentence = [ [1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1] ]
As you may have grasped, the length of each word's vector depends on the vocabulary size. This is not scalable for a very large corpus, which could contain 100,000 unique words or even more.
SNIPPET USING PYTHON
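The original snippet was an image that did not survive extraction; here is a minimal sketch of one-hot encoding in plain Python, assuming a hypothetical four-word sentence:

# One-hot encode a toy sentence (hypothetical example data).
sentence = ["i", "love", "nlp", "deeply"]
vocab = {word: index for index, word in enumerate(sentence)}

def one_hot(word):
    vec = [0] * len(vocab)   # one slot per vocabulary word
    vec[vocab[word]] = 1     # set the single bit for this word's category
    return vec

encoded = [one_hot(word) for word in sentence]
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]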
ADVANTAGES AND DISADVANTAGES OF ONE-HOT ENCODING
Advantages of one-hot encoding:
• It is intuitive & very easy to implement and understand.
Disadvantages of one-hot encoding:
• The vectors are sparse & their length grows with the vocabulary size, so the representation does not scale to large corpora.
• It cannot capture any relationship or similarity between words.
BAG-OF-WORDS REPRESENTATION
The intuition behind the BOW representation is that documents containing similar words are similar, irrespective of where those words occur.
Basic BOW — CountVectorizer
The CountVectorizer computes the frequency of occurrence of each word in a document. It converts a corpus of multiple sentences (say, product reviews) into a matrix of reviews & words, & fills it with the frequency of each word in each sentence.
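The slides' CountVectorizer snippet was likewise lost in extraction; the following is a minimal sketch using scikit-learn, assuming a made-up sentence in which "nlp" occurs twice & sorts to index 3 of the vocabulary, so it matches the observations below:

# Count word frequencies with scikit-learn's CountVectorizer (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["nlp is awesome, nlp is fun"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.vocabulary_)  # {'nlp': 3, 'is': 2, 'awesome': 0, 'fun': 1}
print(counts.toarray())        # [[1 1 2 2]]: frequency of each vocabulary word
print(counts.toarray()[0][3])  # 2: "nlp" occurs twice & sits at index 3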
As you can see, the word "nlp" occurs twice in the sentence & falls at index 3 of the vocabulary, which we can see in the output of the final print statement.
There are various parameters that can be tweaked as part of the CountVectorizer to get the desired results, including text preprocessing parameters like lowercase, strip_accents & preprocessor.
COUNTVECTORIZER
Disadvantages of CountVectorizer:
• This method ignores the position of words in the sentence, so it is not possible to grasp the meaning of a word from this representation.
• The intuition that high-frequency words are more important or give more information about the sentence fails for stop words like "is, the, an, I", & also when the corpus is context-specific. For example, in a corpus about COVID-19, the word "coronavirus" may not add a lot of value.
DISTRIBUTED OR
CONTINUOUS TEXT
REPRESENTATIONS
Distributed or continuous text representations, also known as word
embeddings, are a popular technique in data science for representing text
data in a numerical form that can be used for machine learning tasks.
Traditionally, text data has been represented using one-hot encoding,
where each word in a vocabulary is assigned a unique index and
represented as a vector with a 1 in the index corresponding to the word
and 0s elsewhere.
However, one-hot encoding has several limitations, such as the inability to
capture relationships between words and the high dimensionality of the
resulting representation.
Word embeddings, on the other hand, are dense, low-dimensional vectors that
represent words in a continuous vector space.
They are learned by training a neural network on a large corpus of text data, with the
goal of predicting the context in which each word appears.
The resulting word embeddings capture both syntactic and semantic relationships
between words, such that similar words are represented by similar vectors in the
vector space.
Word embeddings have numerous applications in data science, including natural
language processing (NLP), text classification, sentiment analysis, and machine
translation.
They have been shown to improve the performance of many NLP tasks compared to
traditional representations such as bag-of-words or one-hot encoding.
Some popular algorithms for generating word embeddings include Word2Vec, GloVe,
and FastText.
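As a rough illustration of how such embeddings are trained, here is a minimal sketch using the gensim library's Word2Vec (a hypothetical toy corpus; the parameters are illustrative, not prescriptive):

# Train tiny word embeddings with gensim's Word2Vec (toy data).
from gensim.models import Word2Vec

sentences = [
    ["word", "embeddings", "capture", "meaning"],
    ["similar", "words", "get", "similar", "vectors"],
    ["embeddings", "represent", "words", "as", "dense", "vectors"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["embeddings"][:5])               # first 5 dimensions of one word vector
print(model.wv.similarity("words", "vectors"))  # cosine similarity between two words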
Dimensionality here generally refers to the number of features you have for each sample in the problem you are trying to solve. For example, the famous Iris flower dataset only includes 4 features (sepal length, sepal width, petal width, petal length), and would be considered a low-dimensional dataset.
VECTOR SPACE MODEL
The Vector Space Model (VSM) is a mathematical model used in information
retrieval to represent text documents as vectors of features. The model
represents documents as high-dimensional vectors, where each dimension
corresponds to a specific term or feature
The VSM is based on the idea that documents with similar content will have
similar vector representations.
It is used to calculate the similarity between documents based on their
vector representations.
The most common measure of similarity used in the VSM is the cosine
similarity, which measures the angle between two vectors.
In the VSM, documents are represented as a set of terms, and each term is
assigned a weight that reflects its importance in the document.
The most commonly used weighting scheme is the Term Frequency-Inverse
Document Frequency (TF-IDF) scheme.
This scheme assigns a higher weight to terms that are frequent in the
document but rare in the corpus.
The VSM has many applications in information retrieval, including text
classification, document clustering, and recommender systems. It is also
used in natural language processing tasks such as text summarization and
information extraction.
VECTOR SPACE MODEL
EXAMPLE
Suppose we have a corpus of three documents:
Document 1: "The quick brown fox jumps over the lazy dog“
To apply the VSM, we first preprocess the text by removing stopwords (e.g., "the", "is",
"in") and stemming the remaining words (e.g., "jumps" and "jumping" become "jump").
Next, we create a term-document matrix that represents the frequency of each term in
each document. Here's what the term-document matrix looks like for our example:
        quick  brown  fox  jump  dog  lazy  sleep  sun
Doc 1     1      1     1    1     1     1     0     0
Doc 2     1      1     1    1     1     0     0     0
Doc 3     0      0     0    0     1     1     1     1
Each row in the matrix represents a document, and each column
represents a term. The numbers in the matrix represent the frequency of
each term in each document.
Next, we use the TF-IDF weighting scheme to calculate the weight of each
term in each document. The TF-IDF weight of a term is a product of its Term
Frequency (TF) in the document and its Inverse Document Frequency (IDF)
across the corpus. Here's what the weighted term-document matrix looks
like for our example:
        quick  brown  fox   jump  dog   lazy  sleep  sun
Doc 1   0.29   0.29   0.29  0.29  0.29  0.29   0      0
Doc 2   0.29   0.29   0.29  0.29  0.29   0     0      0
Doc 3    0      0      0     0    0.29  0.29  0.45   0.45
Finally, we can represent each document as a vector in a high-dimensional space, where each dimension corresponds to a term. Each row of the weighted term-document matrix is the vector representation of the corresponding document. For example, the vector representation of Document 1 is:
[0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0, 0]
We can then use the cosine similarity measure to compare the similarity
between any two documents.
For example, the cosine similarity between Document 1 and Document 2 works out to about 0.91 with the vectors above, which indicates that they are highly similar.
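Here is a sketch of the same pipeline with scikit-learn (illustrative only: TfidfVectorizer uses smoothed IDF & L2 normalization and does no stemming, so its exact weights differ from the hand-worked numbers above):

# Build TF-IDF vectors & compare documents with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the dog",
    "The lazy dog sleeps in the sun",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # weighted term-document matrix

print(vectorizer.get_feature_names_out())     # the terms (matrix columns)
print(cosine_similarity(tfidf[0], tfidf[1]))  # similarity of Document 1 & Document 2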
CALCULATING COSINE
SIMILARITY
Cosine similarity is a measure of similarity between two non-zero vectors of
an inner product space. It measures the cosine of the angle between two
vectors, which ranges from -1 (opposite directions) to 1 (same direction). A
value of 0 indicates the vectors are orthogonal (perpendicular) to each
other.
Here's an example of how to calculate cosine similarity between two
vectors:
Suppose we have two vectors, A and B:
A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]
To calculate the cosine similarity between A and B, we first need to compute the dot
product of A and B. The dot product is the sum of the element-wise products of the
two vectors:
A . B = (1 * 2) + (2 * 4) + (3 * 6) + (4 * 8) + (5 * 10) = 2 + 8 + 18 + 32 + 50 = 110
Next, we need to compute the magnitude (length) of each vector. The magnitude of
a vector is the square root of the sum of the squares of its elements:
||A|| = sqrt((1^2) + (2^2) + (3^2) + (4^2) + (5^2)) = sqrt(55) = 7.42
||B|| = sqrt((2^2) + (4^2) + (6^2) + (8^2) + (10^2)) = sqrt(220) = 14.83
Finally, we can compute the cosine similarity between A and B using the dot product and magnitudes:
similarity = A . B / (||A|| * ||B||) = 110 / (7.42 * 14.83) = 1.0
The cosine similarity between A and B is 1.0, which makes sense: B = 2A, so the two vectors point in exactly the same direction in the vector space.
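A quick numeric check of this worked example (a sketch using NumPy):

# Verify the cosine similarity of A & B numerically.
import numpy as np

A = np.array([1, 2, 3, 4, 5])
B = np.array([2, 4, 6, 8, 10])

similarity = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(similarity)  # ~1.0, since B is a scalar multiple of A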
TF-IDF WEIGHTING SCHEME
TF-IDF stands for term frequency-inverse document frequency and is a
commonly used weighting scheme in information retrieval and text mining.
It is used to assign weights to each term in a document based on how
important it is in the context of the document and the collection of
documents.
Here's an example of how to calculate TF-IDF for a term in a document:
Suppose we have a collection of documents and we want to calculate the
TF-IDF weight for the term "data" in one of the documents, called
Document A.
Document A: "The data science team analyzed data and found some
interesting insights."
First, we need to calculate the term frequency (TF) of the term "data" in
Document A. The term frequency is simply the number of times the term
appears in the document divided by the total number of terms in the
document.
TF = (number of times "data" appears in Document A) / (total number of
terms in Document A)
In this case, the term "data" appears twice in Document A and the total
number of terms in the document is 11, so:
TF = 2 / 11 = 0.18
Next, we need to calculate the inverse document frequency (IDF) of the term "data"
across the entire collection of documents. The IDF measures how rare or common the
term is in the collection of documents and is calculated as the logarithm of the total
number of documents in the collection divided by the number of documents that
contain the term.
IDF = log (total number of documents in the collection / number of documents
containing the term "data")
Suppose our collection contains 100 documents and the term "data" appears in 50 of
them:
IDF = log (100 / 50) = log (2) = 0.30
Finally, we can calculate the TF-IDF weight for the term "data" in Document A by
multiplying the TF and IDF values:
TF-IDF = TF * IDF = 0.18 * 0.30 = 0.054
So the TF-IDF weight of the term "data" in Document A is 0.054. This is a fairly low weight: although "data" is frequent within Document A, it appears in half the documents of the collection, so it does little to distinguish Document A from the rest of the corpus.
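The same calculation as a short Python sketch (illustrative only; log base 10, as in the worked example):

# Reproduce the TF-IDF hand calculation for the term "data".
import math

doc_a = "The data science team analyzed data and found some interesting insights".lower().split()

tf = doc_a.count("data") / len(doc_a)  # 2 / 11 = 0.18
idf = math.log10(100 / 50)             # log(2) = 0.30
print(tf * idf)                        # 0.055 (the slides round 0.18 * 0.30 to 0.054)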
SINGLE DOCUMENT VISUALIZATION
https://jcsites.juniata.edu/faculty/rhodes/ida/textDocViz.html#tagcloud
Single document visualization is a type of data visualization that focuses on
the content of a single document.
It is used to help users quickly understand and absorb the core content of a
document, as well as its key features and relationships.
Single document visualization techniques can be used to represent a wide
variety of document content, including text, images, tables, and graphs.
There are many different single document visualization techniques available, each with
its own strengths and weaknesses. Some of the most common techniques include:
• Word clouds: Word clouds are a simple but effective way to visualize the frequency of
words in a document. They can be used to identify the most important topics in a
document, as well as to identify relationships between different words.
• Tag clouds: Tag clouds are similar to word clouds, but they use tags instead of words.
Tags are short, descriptive words or phrases that are used to categorize documents. Tag
clouds can be used to help users find documents that are relevant to their interests.
• Document maps: Document maps are a more complex type of visualization that shows
the structure of a document. They can be used to help users understand the flow of
information in a document, as well as to identify the relationships between different
sections of the document.
• Timelines: Timelines are a useful way to visualize the chronological order of events in a
document. They can be used to help users understand the development of a story, as
well as to identify key events in a document.
DOCUMENT COLLECTION
VISUALIZATION
Document collection visualization is a type of data visualization that focuses on the
content of a collection of documents
There are many different document collection visualization techniques available, each
with its own strengths and weaknesses. Some of the most common techniques include:
• Document maps: Document maps are a type of visualization that shows the structure of
a document collection. They can be used to help users understand the relationships
between different documents in a collection.
• Term maps: Term maps are a type of visualization that shows the frequency of terms in
a document collection. They can be used to help users identify the most important
topics in a document collection, as well as to identify relationships between different
topics.
• Co-occurrence maps: Co-occurrence maps are a type of visualization that shows the
relationships between different terms in a document collection. They can be used to
help users identify relationships between different concepts in a document collection.
• Topic models: Topic models are a type of statistical model that can be used to identify
the topics in a document collection. They can be used to help users identify the most
important topics in a document collection, as well as to identify relationships between
different topics.
EXTENDED DOCUMENT
VISUALIZATION
Extended document visualization is a type of data visualization that focuses on the content of a long document, such as a book or a research paper.
It is used to help users quickly understand and absorb the core content of
a long document, as well as its key features and relationships.
There are many different extended text visualization techniques available, each with
its own strengths and weaknesses. Some of the most common techniques include:
• Document maps: Document maps are a type of visualization that shows the
structure of a long document. They can be used to help users understand the
relationships between different sections of a document.
• Term maps: Term maps are a type of visualization that shows the frequency of terms
in a long document. They can be used to help users identify the most important topics
in a long document, as well as to identify relationships between different topics.
• Co-occurrence maps: Co-occurrence maps are a type of visualization that shows the
relationships between different terms in a long document. They can be used to help
users identify relationships between different concepts in a long document.
• Topic models: Topic models are a type of statistical model that can be used to
identify the topics in a long document. They can be used to help users identify the
most important topics in a long document, as well as to identify relationships between
different topics.
• Sentiment analysis: Sentiment analysis is a technique that can be used to identify
the sentiment of a document, such as whether it is positive, negative, or neutral. This
can be helpful for tasks such as identifying the emotional impact of a document.
INTERACTION OPERATIONS
Interaction operations let users repeatedly perform actions on the data shown in a visualization. They are a powerful tool for data visualization, as they can be used to create dynamic and interactive visualizations.
There are many different types of interaction operations, each with its own strengths and weaknesses. Some of the most common types of interaction operations include:
• Filtering: Filtering is used to remove data that does not meet certain criteria. This can be useful for cleaning up data or for focusing on a specific subset of data.
• Aggregation: Aggregation is used to combine data into a single value. This can
be useful for summarizing data or for creating new features.
• Transformation: Transformation is used to change the format of data. This can
be useful for making data more understandable or for preparing data for
further analysis.
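As a toy sketch of these three operation types (hypothetical data; in a real tool these operations would drive an interactive view), using pandas:

# Filtering, aggregation and transformation on a small table.
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2021, 2022],
                   "sales": [100, 120, 80, 150]})

filtered = df[df["year"] >= 2021]                       # filtering: keep only a subset
aggregated = df.groupby("year")["sales"].mean()         # aggregation: combine into single values
transformed = df.assign(log_sales=np.log(df["sales"]))  # transformation: change the data's format

print(filtered)
print(aggregated)
print(transformed)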
Interaction operations are a powerful tool for data visualization. By using interaction operations, you can create dynamic and interactive visualizations that can help you to better understand your data.
Here are some examples of how interaction operations can be used in data visualization:
• To create a dynamic visualization that updates as the data changes. For example, you
could use a filtering operation to create a visualization that only shows data for a specific
time period. As the data changes, the visualization would update to reflect the new data.
• To create an interactive visualization that allows users to explore the data. For example,
you could use an aggregation operation to create a visualization that shows the average
value of a data set over time. Users could then interact with the visualization to explore
the data by different time periods.
• To create a visualization that is easier to understand and interpret. For example, you could use a transformation operation to show the data in a different format, one that makes its patterns easier to read.
INTERACTION OPERANDS
AND SPACES
In data visualization, interaction operands and spaces are the two key
components that allow users to interact with and explore data.
• Interaction operands are the objects or entities that users can interact with.
They can be individual data points, groups of data points, or even the entire
visualization itself.
• Interaction spaces are the areas of the visualization where users can interact
with the operands. They can be the entire screen, a specific region of the
screen, or even a specific element of the visualization.
The combination of interaction operands and spaces allows users to explore
data in a variety of ways. For example, users can select individual data points
to see more information about them, or they can zoom in on a specific region
of the visualization to get a closer look.
Here are some examples of interaction operands and spaces:
• Interaction operands:
• Individual data points
• Groups of data points
• The entire visualization
• Interaction spaces:
• The entire screen
• A specific region of the screen
• A specific element of the visualization
BENEFITS OF INTERACTION
OPERANDS AND SPACES
• Improved understanding of data: they allow users to explore data in a more interactive way, which can help them better understand the data.
• Enhanced data exploration: they allow users to explore data in a more flexible way, helping them find patterns and trends they might not otherwise have found.
• Improved decision-making: they help users make better decisions by giving them a better understanding of the data.
• Enhanced problem-solving: they help users solve problems by giving them a better understanding of the data.
A UNIFIED FRAMEWORK
A unified framework for data visualization is a set of principles and
guidelines that can be used to create effective and informative
visualizations.
The framework should be based on a solid understanding of the cognitive
and perceptual processes involved in human visual perception.
It should also take into account the different types of data that can be
visualized, the different audiences that will be viewing the visualizations,
and the different purposes for which the visualizations will be used.
A unified framework for data visualization can help to ensure that
visualizations are:
• Effective: They communicate the intended message to the intended
audience in a clear and concise way.
• Informative: They provide insights into the data that would not be possible
to see from the data alone.
• Perceptually accurate: They are easy to understand and interpret.
• Elegant: They are visually appealing and engaging.
There are a number of different unified frameworks for data visualization that
have been proposed. Some of the most well-known frameworks include:
• The Grammar of Graphics: This framework was developed by Leland Wilkinson and is based on the idea that all visualizations can be decomposed into a small number of basic elements, such as marks, scales, and legends.
• The Information Visualization Design Framework: This framework was developed by Ben Shneiderman and is based on the idea that visualizations can be designed around a set of seven tasks, such as overview first, zoom and filter, and details-on-demand.
• The Data-Ink Ratio: This framework was developed by Edward Tufte and is based on the idea that the most effective visualizations use as little ink as possible to communicate the intended message.
SCREEN SPACE
In data visualization, screen space is the area of the screen that is used to display the
visualization.
It is important to consider screen space when designing a visualization, as you want
to make sure that the visualization is easy to read and understand.
There are a few things to keep in mind when considering screen space:
• The size of the screen: The size of the screen will obviously affect the amount of
screen space that is available. A larger screen will provide more space for the
visualization, while a smaller screen will provide less space.
• The resolution of the screen: The resolution of the screen will also affect the amount
of screen space that is available. A higher resolution screen will provide more pixels
per inch, which will make it easier to read small text and details.
• The type of visualization: The type of visualization will also affect the amount of
screen space that is needed. Some visualizations, such as bar charts and pie charts,
can be relatively compact, while others, such as heat maps and scatterplots, can take
up more space.
When designing a visualization, it is important to consider the amount of
screen space that is available and to make sure that the visualization is easy to
read and understand. Here are a few tips for using screen space effectively:
• Use whitespace: Whitespace is the empty space around the visualization. It can
be used to make the visualization easier to read and understand.
• Use clear and concise labels: The labels should be clear and concise, and they
should be easy to read.
• Use a consistent style: The visualization should use a consistent style, which
will make it look more professional and polished.
OBJECT SPACE
In data visualization, object space is the space needed to display the data objects themselves. There are a few things to keep in mind when considering object space:
• The size of the data objects: The size of the data objects will obviously affect the amount of object space that is required. Larger data objects will require more space, while smaller data objects will require less space.
• The density of the data objects: The density of the data objects will also affect
the amount of object space that is required. A high-density of data objects will
require more space, while a low-density of data objects will require less space.
• The type of visualization: The type of visualization will also affect the amount of
object space that is needed. Some visualizations, such as bar charts and pie
charts, can be relatively compact, while others, such as heat maps and
scatterplots, can take up more space.
DATA SPACE
In data visualization, data space is the area of the visualization that is used to display
the data. Data space can be divided into two main types: physical space and
information space.
Physical space is the actual space on the screen or page that is used to display the
data. Information space is the conceptual space that is used to represent the data.
There are a number of factors that can affect the relationship between the physical
space and information space. These factors include:
• The type of data: The type of data will affect the way that it is represented in the
physical space. For example, numerical data is often represented as points, lines, or
bars, while categorical data is often represented as shapes or colors.
• The amount of data: The amount of data will affect the size of the physical space that
is needed. More data will require more space.
• The complexity of the data: The complexity of the data will affect the way that it is
represented in the physical space. More complex data will require more space and
more complex representations.
ATTRIBUTE SPACE
Attribute space is a term used in data visualization to describe the space in
which data attributes are represented.
Attribute space can be thought of as a multidimensional space, with each
dimension representing a different attribute.
For example, a data set with three attributes, such as age, gender, and
income, would have three dimensions in attribute space.
Attribute space can also be used to identify patterns and trends in data.
For example, a heatmap can be used to visualize the distribution of data
values within a multidimensional space.
By assigning different colors to different values, a heatmap can show how
data values are distributed across the different dimensions of attribute space.
Here are some examples of how attribute space can be used in data
visualization:
• Scatterplots: Scatterplots can be used to visualize the relationship between
two variables. For example, a scatterplot could be used to visualize the
relationship between age and income.
• Heatmaps: Heatmaps can be used to visualize the distribution of data values
within a multidimensional space. For example, a heatmap could be used to
visualize the distribution of population density within a city.
• Treemaps: Treemaps can be used to visualize hierarchical data. For example, a
treemap could be used to visualize the organization of a company.
• Bubble charts: Bubble charts can be used to visualize the relationship between
three variables. For example, a bubble chart could be used to visualize the
relationship between age, income, and education level.
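As a minimal sketch, two dimensions of attribute space (age and income, with made-up values) can be plotted as a scatterplot using matplotlib:

# Scatterplot of two attributes: age vs. income (hypothetical values).
import matplotlib.pyplot as plt

age = [23, 35, 45, 52, 61, 30]
income = [30000, 48000, 62000, 71000, 65000, 41000]

plt.scatter(age, income)
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Two dimensions of attribute space")
plt.show()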
VISUALIZATION STRUCTURE
A visualization structure is a way of organizing data so that it can be easily visualized.
There are many different types of visualization structures, each with its own advantages
and disadvantages.
Some common visualization structures include:
• Trees: Trees are a good way to visualize hierarchical data, such as the organization of a
company.
• Graphs: Graphs are a good way to visualize relationships between entities, such as the
connections between people on social media.
• Matrices: Matrices are a good way to visualize data that can be represented as a table,
such as the results of a survey.
• Charts: Charts are a good way to visualize data that can be represented as a series of
points, such as the stock market.
• Maps: Maps are a good way to visualize data that is related to a geographic location,
such as the distribution of population density.
Here are some examples of how visualization structures can be used: