Tweet Segmentation and Its Application to Named Entity Recognition

This paper proposes a framework called HybridSeg for segmenting tweets into meaningful segments to help downstream natural language processing applications like named entity recognition. Segmenting tweets helps address issues with the noisy and short nature of tweets. The paper evaluates two unsupervised named entity recognition algorithms based on tweet segmentation - one using co-occurrence of entities and random walks, the other using part-of-speech tags of constituent words. The part-of-speech based method is shown to outperform the random walk based method on two datasets. Quality of tweet segmentation significantly impacts the accuracy of named entity recognition.

Uploaded by

Aman Foru

TWEET SEGMENTATION AND ITS APPLICATION TO NAMED ENTITY RECOGNITION

ABSTRACT:

Twitter can be useful for arranging a time and place to get together; it works like a
conference call conducted through text messaging. It has attracted millions of users who share
and disseminate up-to-date information. A targeted Twitter stream is usually constructed by
filtering tweets with predefined selection criteria. However, many applications in Information
Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the noisy and short
nature of tweets, whose text often contains grammatical and spelling errors. In this paper, we
propose a novel framework for tweet segmentation in batch mode, called HybridSeg. By splitting
tweets into meaningful segments, the semantic or context information is well preserved and
easily extracted by downstream applications. The segments produced by the segmentation model
are then passed to Named Entity Recognition (NER) algorithms, which identify named entities and
are used to evaluate the segmentation.
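The idea of splitting a tweet into meaningful segments can be illustrated with a minimal sketch. The abstract does not describe how HybridSeg scores candidate segments, so the phrase scores, the default single-word score, and the maximum segment length below are all hypothetical values chosen for illustration. The sketch simply picks the split whose segment scores sum to the highest total, using dynamic programming; it is not the HybridSeg implementation.

```python
# Hypothetical phrase scores (made up for illustration); a real system
# would derive them from corpus statistics or an external knowledge base.
PHRASE_SCORES = {
    ("new", "york"): 3.0,
    ("new", "york", "city"): 5.0,
    ("times", "square"): 3.5,
}
DEFAULT_WORD_SCORE = 1.0  # assumed score of any single-word segment
MAX_SEGMENT_LEN = 3       # assumed upper bound on words per segment


def segment_score(segment):
    """Score a candidate segment; unknown multi-word segments are rejected."""
    if len(segment) == 1:
        return DEFAULT_WORD_SCORE
    return PHRASE_SCORES.get(segment)  # None means "not a known phrase"


def segment_tweet(words):
    """Return the split of `words` that maximizes the summed segment score."""
    words = tuple(w.lower() for w in words)
    n = len(words)
    # best[i] holds (total score, segments) for the first i words.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - MAX_SEGMENT_LEN), end):
            segment = words[start:end]
            score = segment_score(segment)
            if score is None:
                continue
            prev_score, prev_segments = best[start]
            candidates.append((prev_score + score, prev_segments + [" ".join(segment)]))
        # A single-word segment is always valid, so candidates is never empty.
        best[end] = max(candidates, key=lambda c: c[0])
    return best[n][1]


print(segment_tweet("I love New York City".split()))
# prints ['i', 'love', 'new york city']
```

With these assumed scores, "new york city" is kept as one segment because its score (5.0) exceeds both the three single words (3.0) and "new york" plus "city" (4.0).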
EXISTING SYSTEM:

Because of the limited length of a tweet (i.e., 140 characters) and the lack of restrictions on
writing style, tweets often contain grammatical errors, misspellings, and informal abbreviations.
The error-prone and short nature of tweets often makes word-level language models for tweets
less reliable. For example, given the tweet “I call her, no answer. Her phone in the bag, she
dancing.”, there is no clue to its true theme if word order is disregarded (i.e., under a
bag-of-words model).
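The loss of word order under a bag-of-words model can be seen in a small sketch. The tokenizer below is a deliberately simple assumption (whitespace split, punctuation stripped, lowercased), and the reordered tweet is an invented variant that uses exactly the same words. Both texts map to the same bag, so a model that sees only word counts cannot tell them apart.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased tokens, ignoring order and surrounding punctuation."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    return Counter(token for token in tokens if token)


original = "I call her, no answer. Her phone in the bag, she dancing."
# Invented reordering with the same words but a different meaning.
reordered = "She call her, no answer. Her bag in the phone, I dancing."

print(bag_of_words(original) == bag_of_words(reordered))
# prints True
```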

DISADVANTAGES OF EXISTING SYSTEM:

• This problem is further exacerbated by the limited context provided by a tweet. That is,
different readers could derive more than one interpretation of this tweet if it is
considered in isolation.
• On the other hand, despite the noisy nature of tweets, the core semantic information is
well preserved in the form of named entities or semantic phrases.
PROPOSED SYSTEM:

We propose and evaluate two segment-based NER algorithms. Both algorithms are
unsupervised and take tweet segments as input. One algorithm exploits the co-occurrence
of named entities in targeted Twitter streams by applying a random walk (RW), under the
assumption that named entities are more likely to co-occur with one another. The other algorithm
uses the Part-of-Speech (POS) tags of the constituent words in segments: segments that are
likely to be noun phrases are considered named entities.
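The RW-based idea can be sketched as follows. The text does not give the transition probabilities or the scoring formula, so this example substitutes a PageRank-style random walk over a segment co-occurrence graph; the function names, the damping factor of 0.85, and the iteration count are assumptions made for illustration. Segments that co-occur often with other well-connected segments end up with higher scores.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(segmented_tweets):
    """Build symmetric edge weights: how often two segments share a tweet."""
    weights = defaultdict(lambda: defaultdict(float))
    for segments in segmented_tweets:
        for a, b in combinations(sorted(set(segments)), 2):
            weights[a][b] += 1.0
            weights[b][a] += 1.0
    return weights


def random_walk_scores(weights, damping=0.85, iterations=50):
    """PageRank-style walk (an assumed stand-in for the RW step)."""
    nodes = sorted(weights)  # segments that never co-occur are not in the graph
    if not nodes:
        return {}
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            total = sum(weights[node].values())
            for neighbor, weight in weights[node].items():
                # Move probability mass along edges in proportion to weight.
                updated[neighbor] += damping * scores[node] * weight / total
        scores = updated
    return scores


# Toy input: each inner list holds the segments of one tweet.
tweets = [
    ["obama", "white house", "today"],
    ["obama", "white house"],
    ["lol", "today"],
]
scores = random_walk_scores(cooccurrence_graph(tweets))
for segment in sorted(scores, key=scores.get, reverse=True):
    print(segment, round(scores[segment], 3))
```

In this toy stream, "obama" and "white house" co-occur twice and therefore score higher than "lol", which appears with only one other segment.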

ADVANTAGES OF PROPOSED SYSTEM:

• The quality of tweet segmentation significantly affects the accuracy of NER.
• The POS-based NER method outperforms the RW-based method on both datasets.
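The POS-based method can be illustrated with a small sketch that operates on segments already tagged with Penn Treebank tags. The rule used here (the last word is a noun and every earlier word is a noun, adjective, or determiner) is an assumed heuristic for "likely to be a noun phrase"; the document does not specify the actual rule or the tagger, and the example segments are invented.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
# Assumed set of tags allowed before the head noun of a noun phrase.
MODIFIER_TAGS = {"JJ", "DT"} | NOUN_TAGS


def looks_like_noun_phrase(tagged_segment):
    """Return True if a list of (word, tag) pairs resembles a noun phrase."""
    if not tagged_segment:
        return False
    tags = [tag for _, tag in tagged_segment]
    return tags[-1] in NOUN_TAGS and all(tag in MODIFIER_TAGS for tag in tags[:-1])


def candidate_entities(tagged_segments):
    """Keep the segments whose tag pattern passes the noun-phrase check."""
    return [
        " ".join(word for word, _ in segment)
        for segment in tagged_segments
        if looks_like_noun_phrase(segment)
    ]


# Invented, pre-tagged segments of one tweet.
segments = [
    [("New", "NNP"), ("York", "NNP")],
    [("is", "VBZ"), ("dancing", "VBG")],
    [("the", "DT"), ("white", "JJ"), ("house", "NN")],
]
print(candidate_entities(segments))
# prints ['New York', 'the white house']
```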
SYSTEM SPECIFICATION:

SOFTWARE REQUIREMENTS:

• Operating System : Windows 7 & XP
• Front End : Visual Studio 2008, 2010, ASP.NET, C#
• Backend : SQL Server 2005, 2008 R2

HARDWARE REQUIREMENTS:

• Processor : Pentium Dual Core
• Speed : 1.1 GHz
• RAM : 1 GB
• Hard Disk : 20 GB
• Floppy Drive : 1.44 MB
• Keyboard : Standard Windows Keyboard
• Mouse : Two or Three Button Mouse