Lecture 6-Text Mining and Sentiment Analysis
Describe text mining and understand the need for text mining
Differentiate among text analytics, text mining and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Appreciate the different methods to introduce structure to text-based data
Describe sentiment analysis
Develop familiarity with popular applications of sentiment analysis
Learn the common methods for sentiment analysis
Become familiar with speech analytics as it relates to sentiment analysis
Learn three facets of Web analytics—content, structure, and usage mining
Know social analytics including social media and social network analyses
TEXT ANALYTICS AND TEXT MINING
• Put simply: Text Analytics = Information Retrieval + Text Mining
TEXT ANALYTICS AND TEXT MINING
Text Analytics, Related Application Areas, and Enabling Disciplines.
TEXT MINING CONCEPTS
• 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
• Unstructured corporate data is doubling in size every 18 months
• Tapping into these information sources is not optional; it is necessary to stay competitive
• Answer: text mining
• A semi-automated process of extracting knowledge from unstructured data sources
• a.k.a. text data mining or knowledge discovery in textual databases
DATA MINING VERSUS TEXT MINING
• Both seek novel and useful patterns
• Both are semi-automated processes
• Difference is the nature of the data:
• Structured versus unstructured data
• Structured data: in databases
• Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
• To perform text mining: first impose structure on the data, then mine the structured data.
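The two-step idea above (impose structure, then mine) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the sample notes, the "order #" pattern, and the field names are all hypothetical, and regular expressions stand in for whatever structuring technique a real project would use.

```python
import re
from collections import Counter

# Hypothetical free-text notes; the "order #NNNN" convention is an assumption.
notes = [
    "Customer alice@example.com reports order #1042 arrived damaged.",
    "bob@example.org asks about order #1042 and order #2001.",
    "No contact details given, but the refund was approved.",
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
ORDER = re.compile(r"order #(\d+)")

def to_record(text):
    """Step 1: impose structure by mapping free text onto fixed fields."""
    return {
        "emails": EMAIL.findall(text),
        "orders": [int(n) for n in ORDER.findall(text)],
        "n_words": len(text.split()),
    }

records = [to_record(t) for t in notes]

# Step 2: mine the structured records, here with a simple frequency count.
order_mentions = Counter(o for r in records for o in r["orders"])

print(records[0])
print(order_mentions.most_common(1))
```

Once the text is reduced to records like these, any ordinary data mining technique can be applied to the fields.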
TEXT MINING CONCEPTS
• The benefits of text mining are most obvious in text-rich data environments
• e.g., law (court orders), academic research (research articles), finance (quarterly
reports), medicine (discharge summaries), biology (molecular interactions),
technology (patent files), marketing (customer comments), etc.
• Electronic communication records (e.g., email)
• Spam filtering
• Email prioritization and categorization
• Automatic response generation
TEXT MINING APPLICATION AREA
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
TEXT MINING TERMINOLOGY
• Unstructured or semistructured data
• Corpus (and corpora)
• Terms
• Concepts
• Stemming
• Stop words (and include words)
• Synonyms (and polysemes)
• Tokenizing
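Three of the terms above (tokenizing, stop words, stemming) can be shown together in a short Python sketch. The stop-word list is a tiny illustrative sample, and `naive_stem` is a crude suffix stripper written for this example; it is not the Porter algorithm or any library's stemmer.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little differentiating value."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Crude suffix stripping for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The miners are mining texts and mined the reports."
tokens = tokenize(text)
terms = [naive_stem(t) for t in remove_stop_words(tokens)]
print(terms)
```

Note how "mining" and "mined" collapse to the same stem, which is the point of stemming: different inflections are counted as one term.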
TEXT MINING TERMINOLOGY
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix
• Occurrence matrix
• Singular value decomposition
• Latent semantic indexing
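A term-by-document (occurrence) matrix can be built directly from word frequencies. The sketch below uses three made-up one-line documents and plain whitespace splitting, both of which are assumptions for illustration; rows are terms, columns are documents, and each cell is a raw count.

```python
from collections import Counter

# Hypothetical mini-corpus of three documents.
docs = {
    "d1": "text mining extracts knowledge from text",
    "d2": "data mining finds patterns in data",
    "d3": "sentiment analysis mines opinions in text",
}

# Word frequency per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Term dictionary: every distinct term in the corpus, sorted.
terms = sorted(set().union(*counts.values()))
doc_names = list(docs)

# Rows are terms, columns are documents; a missing term counts as 0.
tdm = [[counts[d][t] for d in doc_names] for t in terms]

for term, row in zip(terms, tdm):
    print(f"{term:<10}", row)
```

Most cells are zero even in this tiny corpus. Singular value decomposition is commonly applied to a matrix of this kind to reduce its dimensionality, which is the basis of latent semantic indexing.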
NATURAL LANGUAGE PROCESSING (NLP)
• What is "understanding"?
• Humans understand; what about computers?
• Natural language is vague, context driven
• True understanding requires extensive knowledge of a topic
• Can (or will) computers ever understand natural language as accurately as we do?
NATURAL LANGUAGE PROCESSING (NLP)
• Challenges in NLP
• Part-of-speech tagging
• Text segmentation
• Word sense disambiguation
• Syntax ambiguity
• Imperfect or irregular input
• Speech acts
• Dream of the AI community
• to have algorithms that are capable of automatically reading and obtaining knowledge from text
NATURAL LANGUAGE PROCESSING (NLP)
• WordNet
• A laboriously hand-coded database of English words, their definitions, sets of
synonyms, and various semantic relations between synonym sets.
• A major resource for NLP.
• Needs automation to be completed.
• Sentiment Analysis
• A technique used to detect favorable and unfavorable opinions toward specific
products and services
• SentiWordNet
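A lexicon-based approach is one simple way to detect favorable and unfavorable opinions. The sketch below is a minimal illustration, assuming a hand-made six-word lexicon with scores of +1 and -1; it does not use SentiWordNet, and the example reviews are invented.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
# Real lexical resources are far larger and more fine-grained.
LEXICON = {
    "great": 1, "love": 1, "reliable": 1,
    "poor": -1, "broken": -1, "slow": -1,
}

def polarity(text):
    """Sum the lexicon scores of all tokens and map the total to a label."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"

reviews = [
    "Great phone, I love the reliable battery.",
    "Poor screen and slow, broken charger.",
    "It arrived on Tuesday.",
]
labels = [polarity(r) for r in reviews]
print(labels)
```

A word-counting scheme like this ignores negation ("not great") and sarcasm, which is part of why sentiment analysis is harder than it first appears.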
NLP TASK CATEGORIES
• Question answering
• Automatic summarization
• Natural language generation & understanding
• Machine translation
• Foreign language reading & writing
• Speech recognition
• Text proofing
• Optical character recognition
TEXT MINING APPLICATIONS
• Marketing applications
• Enables better CRM
• Security applications
• ECHELON, OASIS
• Deception detection (…example in the textbook)
• Medicine and biology
• Literature-based gene identification (…)
• Academic applications
• Research stream analysis
APPLICATION CASE 7.3
Mining for Lies
• Deception detection
• A difficult problem
• If detection is limited to only text, then the problem is
even more difficult
• The study
• Analyzed text-based testimonies of persons of interest at military bases
• Used only text-based features (cues)
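To make "text-based features (cues)" concrete, the sketch below computes a few simple cues from a statement: word count, sentence count, and average sentence length. It is an illustrative assumption, not the procedure used in the study; the sample statement is invented, and cues such as verb count would need a part-of-speech tagger, which is omitted here.

```python
import re

def cues(statement):
    """Compute simple quantity and complexity cues from raw text."""
    # Split on runs of sentence-ending punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[A-Za-z']+", statement)
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

# Hypothetical statement for illustration.
statement = "I left the office at five. I did not see anyone. Then I drove home."
result = cues(statement)
print(result)
```

Each statement becomes one row of numeric features, which a classifier can then use to separate truthful from deceptive statements.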
APPLICATION CASE 7.3
Mining for Lies
Figure 7.3 Text-Based Deception-Detection Process.
Source: Fuller, C. M., D. Biros, & D. Delen. (2008, January). Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection. Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99.
APPLICATION CASE 7.3
Mining for Lies
Categories and Examples of Linguistic Features Used in Deception Detection.
Number  Construct (Category)  Example Cues
1       Quantity              Verb count, noun phrase count, etc.
2       Complexity            Average number of clauses, average sentence length, etc.