Lecture 6 - From Unstructured Text to Structured Data I

FROM UNSTRUCTURED TEXT TO STRUCTURED DATA – PART I

DR. MICHAEL FIRE

Lecture Motivations:
- Learn to analyze massive amounts of text
- Transform unstructured text into structured data
Bag-of-Words
A simple text representation model. Using the model,
we count the number of times each word appears in
a document.
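
The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name bag_of_words and the tokenization rule (lowercase, keep runs of letters, digits and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count in text."""
    # Assumed tokenization: lowercase, then keep runs of letters/digits/apostrophes.
    # Punctuation is dropped and word order is not recorded.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


doc = "The cat sat on the mat. The mat was warm."
counts = bag_of_words(doc)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

Each distinct word becomes one feature, which is why the feature count grows with the vocabulary size.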
Bag-of-Words
Pros:
• Easy to understand and simple to use
• Usually provides decent results

Cons:
• Creates many features (tens of thousands of
features)
• Doesn’t take into account other documents
• Doesn’t provide state-of-the-art results
• Removing punctuation can discard informative
characters
N-Grams
“N-gram is a contiguous sequence of n items from a
given sample of text or speech. The items can be
phonemes, syllables, letters, words or base pairs
according to the application” (from Wikipedia)
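
As a rough sketch of the definition above, the helper below slides a window of length n over any sequence, so the same code yields word n-grams from a token list and character n-grams from a string. The name ngrams and the sample inputs are illustrative assumptions.

```python
def ngrams(items, n):
    """Return all contiguous length-n subsequences of items as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n simply yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "to be or not to be"
word_bigrams = ngrams(sentence.split(), 2)
char_trigrams = ["".join(g) for g in ngrams("text", 3)]

print(word_bigrams)
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(char_trigrams)  # ['tex', 'ext']
```

Counting the resulting tuples (for example with collections.Counter) gives an n-gram feature vector in the same way as Bag-of-Words.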
N-Grams
Pros:
• Easy to understand and simple to use
• Can be applied to either characters or words
• Can utilize useful punctuation and other special
characters
• Usually provides decent results
Cons:
• Creates many features (hundreds of thousands of
features)
• Doesn’t take into account other documents
• Doesn’t provide state-of-the-art results
Term Frequency–Inverse Document Frequency (TF-IDF)
• TF-IDF is used to reflect how important a word is to
a document in a collection
• A word’s TF-IDF value increases proportionally to the
number of times it appears in a document and is
offset by the number of documents in the corpus that
contain it
Term Frequency
The term frequency, tf(t,d), can be the number of times a
term t appears in a document d (other measures are also
used)

see Wikipedia
Inverse Document Frequency
The inverse document frequency, idf(t,D), measures
whether a term is common or rare across all documents.

Usually it is calculated as the logarithm of the number
of documents in the collection divided by the number
of documents in which the term appears.

tfidf(t,d,D) = tf(t,d)*idf(t,D)
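
A minimal sketch of these definitions, assuming raw counts for tf and the natural logarithm for idf (libraries often use smoothed or normalized variants instead). The function names, the toy corpus, and the choice to return 0.0 for a term that appears in no document are assumptions made for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw count of term in one tokenized document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """log(number of documents / number of documents containing term)."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(corpus) / df)


def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tfidf(term, corpus[0], corpus), 3))
```

In this toy corpus "the" occurs in every document, so its idf is log(3/3) = 0 and its TF-IDF is 0 despite appearing twice, while "mat" occurs in only one document and receives the highest weight, log(3).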

see Wikipedia
TF-IDF
Pros:
• Easy to understand and simple to use
• Usually provides decent results
• Takes into account other documents
Cons:
• Creates many features (tens of thousands of
features)
• Doesn’t provide state-of-the-art results
Topic Model
A topic model is a statistical model for discovering
the abstract "topics" that occur in a collection of
documents. Topic model algorithms are used to
discover hidden subjects in a large collection of
unstructured texts
Latent Dirichlet Allocation
“A generative statistical model that allows sets of
observations to be explained by unobserved groups
that explain why some parts of the data are similar.”

• A document is a collection of topics
• A topic is a collection of keywords
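
The two bullet points above can be illustrated with a sketch of the generative story only; it does not fit a model to data. The topic-word probabilities are written by hand here as an assumption (in full LDA they are themselves drawn from a Dirichlet prior and then inferred from the corpus), and the names TOPICS, sample_dirichlet and generate_document are hypothetical.

```python
import random

# Hand-written topics (an assumption for illustration): each topic is a
# probability distribution over keywords.
TOPICS = [
    {"goal": 0.5, "team": 0.3, "match": 0.2},       # a "sports"-like topic
    {"vote": 0.4, "party": 0.4, "election": 0.2},   # a "politics"-like topic
]


def sample_dirichlet(alphas, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet([alpha] * len(topics), rng)  # the document's topic mixture
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab = list(topics[k])
        weights = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])   # pick a keyword
    return theta, words


rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=0.5, length=10, rng=rng)
print([round(p, 2) for p in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic keyword distributions.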
Named-Entity Recognition
Entity Extraction
Entity Extraction is a task that seeks to locate and
classify named entity mentions in unstructured text into
predefined categories.
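
To make the "locate and classify" idea concrete, here is a toy dictionary lookup (a gazetteer). It is a deliberately simple stand-in, not how modern entity extractors work, since those are usually statistical or neural models. The names, the categories, and the function extract_entities are made up for the example.

```python
import re

# Hypothetical gazetteer: known entity names mapped to predefined categories.
GAZETTEER = {
    "Alice Smith": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Paris": "LOCATION",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start_offset, mention, category) tuples sorted by position."""
    entities = []
    for name, label in gazetteer.items():
        # Whole-word, case-sensitive match of each known name.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), name, label))
    return sorted(entities)


text = "Alice Smith joined Acme Corp in Paris."
entities = extract_entities(text)
for start, mention, label in entities:
    print(start, mention, label)
```

A lookup like this cannot recognize names it has never seen or resolve ambiguous ones, which is the gap that learned models are meant to close.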
Sentiment Analysis
Sentiment analysis (also referred to as opinion mining)
uses NLP to identify, extract, quantify, and study
affective states and subjective information.
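
One simple way to quantify sentiment is a word-list (lexicon) score, sketched below. The tiny word lists and the scoring rule, (positive - negative) / (positive + negative), are assumptions for illustration; the sketch ignores negation, sarcasm and context, which real systems have to handle.

```python
import re

# Tiny illustrative lexicons (assumptions, not a published resource).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}


def sentiment_score(text):
    """Return a score in [-1, 1]; 0.0 when no lexicon word is found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


print(sentiment_score("I love this great phone"))   # 1.0
print(sentiment_score("The battery is terrible"))   # -1.0
print(sentiment_score("Good screen, bad battery"))  # 0.0
```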
Word Embeddings
“Word embeddings is the collective name for a set
of language modeling and feature learning
techniques in natural language processing (NLP)
where words or phrases from the vocabulary are
mapped to vectors of real numbers” (Wikipedia)
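
Once words are vectors, similarity between words becomes a geometric computation. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from large corpora and typically have hundreds of dimensions. EMBEDDINGS, cosine_similarity and most_similar are names chosen for this example.

```python
import math

# Made-up toy vectors (assumption): real embeddings are learned, not hand-written.
EMBEDDINGS = {
    "king": [0.8, 0.7, 0.1],
    "queen": [0.8, 0.6, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings=EMBEDDINGS):
    """Return the other vocabulary word whose vector is closest to word's."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("king"))  # queen
```

A word that is not in the vocabulary has no vector at all, so a lookup like EMBEDDINGS["kng"] fails outright.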
Word Embeddings
Pros:
• Easy to use
• In many tasks provides state-of-the-art results
Cons:
• Small feature space
• Can be domain-sensitive (out-of-vocabulary words,
sensitivity to spelling)
Recommended Reading:
• Topic Modeling with Gensim by Selva Prabhakaran
• A Practitioner's Guide to Natural Language Processing (Part I)—Processing &
Understanding Text by Dipanjan (DJ) Sarkar
• Word Embeddings in Python with Spacy and Gensim by Shane Lynn
• Word Embedding Using BERT In Python by Anirudh S
• BERT Word Embeddings Tutorial by Chris McCormick and Nick Ryan
