TWEET SEGMENTATION AND ITS APPLICATION TO NAMED ENTITY RECOGNITION

ABSTRACT:

Twitter can be useful for arranging a time and place to meet; it works like a conference call
conducted by text message. It has attracted millions of users who share and disseminate
up-to-date information. A targeted Twitter stream is usually constructed by filtering tweets
with predefined selection criteria. However, many applications in Information Retrieval (IR) and
Natural Language Processing (NLP) suffer severely from the noisy and short nature of tweets,
whose sentences often contain grammatical and spelling errors. In this paper, we propose a novel
framework for tweet segmentation in batch mode, called HybridSeg. By splitting tweets into
meaningful segments, the semantic or contextual information is well preserved and easily
extracted by downstream applications. The resulting tweet segments are then used as input to
Named Entity Recognition (NER), and the NER algorithms are evaluated on them.

EXISTING SYSTEM:

Because of the limited length of a tweet (i.e., 140 characters) and the lack of restrictions on
writing style, tweets often contain grammatical errors, misspellings, and informal abbreviations.
The error-prone and short nature of tweets often makes word-level language models for tweets
less reliable. For example, given the tweet “I call her, no answer. Her phone in the bag, she
dancing.”, there is no clue to its true theme if word order is disregarded (i.e., under a
bag-of-words model).
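
The loss of word order under a bag-of-words model can be illustrated with a minimal Python sketch. The tokenizer below and the reordered tweet are assumptions made for illustration only; they are not part of the system described here.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase each whitespace-separated token, strip basic punctuation,
    and count the words. Word order is discarded by construction."""
    tokens = [w.strip(".,!?\"'").lower() for w in text.split()]
    return Counter(t for t in tokens if t)

original = "I call her, no answer. Her phone in the bag, she dancing."
# Hypothetical reordering of the same words with a different meaning.
shuffled = "She call her, no bag. Her answer in the phone, I dancing."

# Both tweets map to the same bag, so the model cannot tell them apart.
print(bag_of_words(original) == bag_of_words(shuffled))
```

Because the two bags are identical, any model that sees only word counts treats the two tweets as the same input.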

DISADVANTAGES OF EXISTING SYSTEM:


• This problem is further exacerbated by the limited context that a tweet provides. That is,
different readers could derive more than one interpretation of a tweet if it is considered
in isolation.
• On the other hand, despite the noisy nature of tweets, the core semantic information is
well preserved in the form of named entities or semantic phrases.

PROPOSED SYSTEM:

We propose and evaluate two segment-based NER algorithms. Both algorithms are
unsupervised and take tweet segments as input. The first exploits the co-occurrence of named
entities in targeted Twitter streams by applying a random walk (RW), under the assumption
that named entities are more likely to co-occur with one another. The second uses the
Part-of-Speech (POS) tags of the constituent words in segments: segments that are likely to
be noun phrases are considered named entities.
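
The random-walk idea can be sketched as follows. This is a generic PageRank-style walk over a segment co-occurrence graph, not necessarily the exact formulation used in this work; the function names, the damping value, and the toy segmented tweets are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(segmented_tweets):
    """Undirected weighted graph: nodes are segments, and an edge weight
    counts how many tweets contain both segments."""
    graph = defaultdict(lambda: defaultdict(float))
    for segments in segmented_tweets:
        for a, b in combinations(sorted(set(segments)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph

def random_walk_scores(graph, damping=0.85, iterations=50):
    """Power iteration for a random walk with uniform teleportation.
    The damping factor 0.85 is an assumed, conventional value."""
    nodes = sorted(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Probability mass flowing in from each neighbour, in proportion
            # to the edge weight relative to that neighbour's total weight.
            incoming = sum(
                scores[nbr] * graph[nbr][node] / sum(graph[nbr].values())
                for nbr in graph[node]
            )
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores

# Toy input: each tweet is already split into segments (hypothetical data).
tweets = [
    ["barack obama", "white house", "today"],
    ["barack obama", "white house"],
    ["white house", "press briefing"],
    ["barack obama", "speech"],
    ["today", "lunch"],
]
scores = random_walk_scores(cooccurrence_graph(tweets))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Segments that co-occur often with other well-connected segments accumulate a higher score. Segments that never co-occur with another segment do not enter the graph, so a real system would need a separate rule for them.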

ADVANTAGES OF PROPOSED SYSTEM:


• The quality of tweet segmentation significantly affects the accuracy of NER.
• The POS-based NER method outperforms the RW-based method on both datasets.
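
The POS-based method treats segments that look like noun phrases as named entities. A minimal sketch of such a filter is shown below, assuming the segments have already been tagged with Penn Treebank POS tags. The tag pattern (optional adjectives or nouns followed by a noun) and the hand-tagged examples are assumptions, not the rule used in this work.

```python
# Penn Treebank noun tags; adjectives (JJ) are allowed as modifiers.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ"} | NOUN_TAGS

def looks_like_noun_phrase(tagged_segment):
    """True if the last word is a noun and every earlier word is an
    adjective or a noun. tagged_segment is a list of (word, tag) pairs."""
    if not tagged_segment:
        return False
    tags = [tag for _, tag in tagged_segment]
    return tags[-1] in NOUN_TAGS and all(t in MODIFIER_TAGS for t in tags[:-1])

# Hand-tagged segments (hypothetical; a real system would run a POS tagger).
segments = [
    [("new", "JJ"), ("york", "NNP")],
    [("she", "PRP"), ("dancing", "VBG")],
    [("phone", "NN")],
    [("no", "DT"), ("answer", "NN")],
]
candidates = [
    " ".join(word for word, _ in seg)
    for seg in segments
    if looks_like_noun_phrase(seg)
]
print(candidates)  # ['new york', 'phone']
```

A bare tag pattern like this one also accepts common nouns such as "phone", so it is best read as a candidate filter that would need further scoring.
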
SYSTEM SPECIFICATION:

SOFTWARE REQUIREMENTS:

• Operating System : Windows 7 and XP
• Front End : Visual Studio 2008, 2010, ASP.NET, C#
• Back End : SQL Server 2005, 2008 R2

HARDWARE REQUIREMENTS:

• Processor : Pentium Dual-Core
• Speed : 1.1 GHz
• RAM : 1 GB
• Hard Disk : 20 GB
• Floppy Drive : 1.44 MB
• Keyboard : Standard Windows Keyboard
• Mouse : Two or Three Button Mouse
