Prevention of Privacy Leaks in Social Networking Post Using NLP
FAREESA BEGUM
M.Tech Student,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
fareesabegum12345@gmail.com
SYEDA AMBAREEN RANA
Assistant Professor,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
ambareen.rana@mjcollege.ac.in
ABSTRACT
Online social networking (OSN) sites such as Facebook, LinkedIn, Twitter, and Instagram are
widely used by people from all walks of life. On these sites, users share personal information
that needs to be kept secure. Because users interact with many different kinds of users and
social groups without knowing how to share privately, they may expose personal information,
location, financial details, and so on. When users chat or post on OSNs, such privacy leaks can
occur, and hackers or intruders can use the leaked information for fraud. Many techniques,
including machine learning algorithms, NLP, and data mining approaches, have been used to
protect privacy in OSNs by evaluating sensitive information in chats or posts as it is being
published, but these techniques are inaccurate, time consuming, and not feasible. The proposed
work focuses on privacy involving users' personal information, banking details, location, and
similar data. When a user chats with friends on an OSN, the message is annotated using NLP,
the RAKE algorithm, and NER. The approach splits the message and matches the annotated words
against predefined privacy rules; if a privacy word is found, it is reported to the user. The
user can then recheck the message and decide whether to alter it or send it. Compared with
other techniques, the proposed approach is accurate and makes it easy to protect privacy in a
post or message.
[...] purposing, provision to third parties, and [...] sites are used by people of all walks
of life, [...] Instagram are widely used nowadays. In these, users share their personal
information that needs to be secure. It seems this is not necessarily due to a poor knowledge
about the risks for privacy, but rather to a [...] and behavior: the study accomplished in
[...] Users don't pay much attention to [...]
For users, sharing this information may expose themselves to a wide [...] not only relatives
and close friends, but also strangers and even stalkers. In other words, the readers of their
posts are anonymous. This will raise serious cybersecurity issues if their [...] There are
several main negative [...] Blackburn and Brody (2015) indicate that a [...]
surveillance in a romantic relationship. More seriously, some people can gain control of their
target users by gathering their personal information leaked on OSNs (McFarlane & Bocij, 2013),
which will cause harm to the target users.
2. Identity theft: Large quantities of personal information can be collected for identity
theft, which may cause huge financial loss to users (Humphreys, Gill & Krishnamurthy, 2010).
A report by the Department of Justice in the USA shows that 17 million cases of identity theft
caused losses of almost 25 billion dollars in 2015 (Francia III, Hutchinson & Francia, 2015).
3. Phishing: Cybercriminals can easily conduct phishing attacks by collecting revealing
personal information about the targeted victim from OSNs. After collecting sufficient
information, the attackers craft a scenario that looks realistic. Because of the scenario, it
is very simple for the attackers to obtain the trust of the victim. They then use that trust
to trick victims into surrendering private information, e.g., credit card details.
OBJECTIVE
Social sites like Twitter are used by people from all walks of life. Users share their feelings
in the form of posts and unintentionally share personal information that needs to be secure.
The objective is a prevention system that finds privacy leaks in a social post before the post
is shared. The proposed prevention system is based on predefined privacy words that might
relate to users' personal information, banking details, and location. The techniques used to
protect privacy are NLP with the RAKE algorithm, common regular expressions, and NER.
PROBLEM STATEMENT
While social networks have become a necessary channel for social interactions, they also raise
ethical and privacy issues. A well-known fact is that social networks leak information about
users that may be sensitive. However, performing accurate real-world online privacy attacks in
a reasonable time frame remains a challenging task.
EXISTING SYSTEM
Previous research demonstrated that privacy preservation is also conditioned by
(i) optimism bias
(ii) overconfidence bias
(iii) People are not able to evaluate all the relevant parameters for estimating privacy
risks. Thus, their privacy decisions are compromised by incomplete information
and bounded rationality. Moreover, we leverage a technique previously proposed for the
recognition of useful text fragments within discussions among developers. The technique relies
on the identification of natural language patterns within sentences contained in the target
texts, using the Stanford Typed Dependencies (SD) representation of these sentences. The SD
representation of a sentence models the sentence's grammatical structure as a list of triples,
in which each triple describes the grammatical relation existing between two words: the
governor and the dependent.
PROPOSED SYSTEM
For all these reasons, it is important to provide users with a technology that assesses the
sensitivity of information as it is being published by the user and alerts them to the risk of
sharing that information. Current solutions that SNs adopt for protecting users' privacy are
mainly based on setting customizable privacy preferences, a mechanism that turns out to be
insufficient for countering privacy leakage, as it does not allow users to have control over
data. The solution that we propose with this system uses the RAKE algorithm and NER to identify
when a sentence (i.e., a post to a SN) entails a risk of privacy leakage. To prevent privacy
leaks in a post, we make use of the RAKE algorithm, common regular expressions, and spaCy NER
(named entity recognizer). If any privacy leak is found at the time of posting, the system
alerts the user immediately, before the post is published, so it works in real time. For that
purpose, I obtained access from Twitter Developer so that the system can access my Twitter
account and prevent privacy leaks at posting time.
Advantages of Proposed System
1. The proposed work prevents privacy leaks in social accounts by using NLP (i.e., based on
predefined privacy words).
2. Before a user shares a post on social media, the post is checked for privacy; if the user
is about to share private information, it is detected automatically.
3. Users can detect privacy leaks before sharing a post on social sites; in existing systems,
privacy leaks are detected only after the post has been published, when users have already
lost their information.
4. The proposed work operates at run time.
MODULES
TWEEPY
Tweepy is an open source Python package that gives you a very convenient way to access the
Twitter API with Python. Tweepy includes a set of classes and methods that represent Twitter's
models and API endpoints, and it transparently handles various implementation details, such as:
Data encoding and decoding
HTTP requests
POST TWEET:
Twitter is an online news and social networking service where users post and interact with
messages. These posts are known as "tweets". Twitter is known as the social media site for
robots. We can use Python to post tweets without even opening the website. A Python library
known as Tweepy is used to reach the Twitter API. Here, we use Tweepy for this purpose.
SCAN POST:
Providing security for data is a major concern in cloud storage systems even though it comes
with attractive benefits. Internal and external threats cause the data in the cloud to be
deleted, corrupted, or tampered with. Specifically, any external adversary [...]
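As an illustration of the POST TWEET module, the following is a minimal sketch, assuming Tweepy 4.x and its `tweepy.Client.create_tweet` call. The function names `make_client` and `post_tweet`, the credential parameters, and the length limit are hypothetical choices for this example and are not taken from the original implementation.

```python
MAX_TWEET_LENGTH = 280  # assumed limit for a standard tweet


def make_client(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build a Tweepy client from developer-account credentials.

    Tweepy is imported lazily so that the rest of this sketch can be
    used without the package installed or network access.
    """
    import tweepy  # third-party package: pip install tweepy

    return tweepy.Client(
        consumer_key=consumer_key,
        consumer_secret=consumer_secret,
        access_token=access_token,
        access_token_secret=access_token_secret,
    )


def post_tweet(client, text):
    """Post `text` through any object exposing `create_tweet(text=...)`."""
    text = text.strip()
    if not text:
        raise ValueError("tweet text is empty")
    if len(text) > MAX_TWEET_LENGTH:
        raise ValueError("tweet text exceeds the assumed length limit")
    return client.create_tweet(text=text)
```

Passing the client in as an argument keeps the posting step separate from authentication, so the privacy check can be exercised with a stand-in client before any real credentials are involved.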
Table 3.3.1 List of privacy words to be detected
Category             Privacy words
Financial Details    Father income
                     Bank name
                     Branch name
                     Debit card no
                     Credit card no
                     IFSC code
                     Net banking password
Location             Organization
                     House no
                     Street no
                     Road no
                     City
                     State
                     Country
                     Zip code
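The table can be turned into a simple lookup. The sketch below is illustrative only: the dictionary is transcribed from the rows of Table 3.3.1 that survive in the text, the function name `find_privacy_words` is hypothetical, and whole-phrase, case-insensitive matching is an assumption about how the predefined words are compared with a post.

```python
import re

# Phrases transcribed from Table 3.3.1. Placing "organization" under
# Location is an assumption; the table layout does not make it explicit.
PRIVACY_WORDS = {
    "Financial Details": [
        "father income", "bank name", "branch name", "debit card no",
        "credit card no", "ifsc code", "net banking password",
    ],
    "Location": [
        "organization", "house no", "street no", "road no",
        "city", "state", "country", "zip code",
    ],
}


def find_privacy_words(message):
    """Return {category: [matched phrases]} for phrases found in message."""
    # Lower-case and collapse whitespace so "IFSC  Code" matches "ifsc code".
    normalized = re.sub(r"\s+", " ", message.lower())
    hits = {}
    for category, phrases in PRIVACY_WORDS.items():
        for phrase in phrases:
            # Word boundaries keep "city" from matching inside "capacity".
            if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
                hits.setdefault(category, []).append(phrase)
    return hits
```

For example, `find_privacy_words("My house no is 12")` reports a hit in the Location category, while a post with none of the listed phrases returns an empty dictionary.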
SYSTEM ARCHITECTURE
Fig. 4.1.1 System Architecture
1. The user wants to share a post on a social networking site; that post is automatically
[...] common regular expression), and that message [...]
3. From that generalized data, it should be mapped against the predefined database [...]
4. If privacy words are found, it will report to [...] privacy found, report to the user by
alerting the user.
FLOW CHART
[Flow chart: START -> Share post/message in social networking site -> Scan post using NLP ->
Check post by privacy word (personal info) -> Found (YES): Recheck; Not found (NO): No privacy
detected, user can share -> STOP]
Fig. 4.2.1 Flow Chart
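The decision flow in the chart can be sketched as a small loop. This is a hedged illustration rather than the authors' code: `review_and_share` and its three callback parameters are hypothetical names, and treating an unchanged message as "send anyway" is an assumption based on the abstract's statement that the user decides to alter or send the message.

```python
def review_and_share(message, scan, ask_user, share):
    """Scan a post, alert the user on findings, recheck, then share.

    scan(message)            -> list of privacy findings (empty if none)
    ask_user(message, hits)  -> a revised message, the same message to
                                send it anyway, or None to cancel
    share(message)           -> publishes the message
    Returns the message that was shared, or None if the user cancelled.
    """
    while True:
        findings = scan(message)
        if not findings:
            # No privacy word detected: the user can share.
            share(message)
            return message
        revised = ask_user(message, findings)
        if revised is None:
            return None
        if revised == message:
            # The user rechecked and chose to send the post unchanged.
            share(message)
            return message
        # The user altered the post: scan the new text again.
        message = revised
```

Keeping the scanner, the prompt, and the posting step as parameters means the same loop works whether the scan uses keyword lists, regular expressions, or NER.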
The RAKE algorithm is described in the [...]
1. Candidates are extracted from the text [...]
5. The top T keywords are then extracted [...] number of words in the graph.
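Because only the first and last steps survive in the text, the following is a minimal sketch of RAKE-style scoring as it is commonly described, not a transcription of the authors' code. The stop-word list, the function names, and the default of taking one third of the distinct content words as T are assumptions made for illustration.

```python
import re
from collections import defaultdict

# A tiny stop-word list for illustration only; real use needs a fuller list.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is",
    "it", "my", "of", "on", "or", "the", "to", "with",
}


def candidate_phrases(text):
    """Split text into runs of content words, breaking at stop words
    and punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, top_t=None):
    """Score each phrase by the sum of degree(w) / frequency(w) over its
    words and return the top T (phrase, score) pairs."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting itself
    scores = {
        phrase: sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    if top_t is None:
        top_t = max(1, len(freq) // 3)  # assumed default for T
    return [(" ".join(phrase), score) for phrase, score in ranked[:top_t]]
```

Longer phrases made of rarely repeated words score highest, which is why a multi-word term such as a card or account description tends to rank above isolated words.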
[...] have taken from Twitter Developer for accessing my Twitter account for preventing
privacy leaks at that time only.
Working of a System: When the code is run, it automatically accesses my Twitter account
through the Twitter API from Python, without opening the Twitter website. It opens the Twitter
account, and when a post is written and submitted, the system immediately alerts the user if
any privacy leaks are found. Privacy leaks are detected using the RAKE algorithm, common
regular expressions, and NER (named entity recognizer).
Rake Algorithm: RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent
keyword extraction algorithm that tries to determine key phrases in a body of text by
analyzing the frequency of word appearance and its co-occurrence with other words in the text.
Common Regular Expression: A regular expression, also referred to as a rational expression, is
a sequence of characters that defines a search pattern. Usually such patterns are used by
string-searching algorithms for "find" operations on strings, or for input validation. A
regular expression is a special sequence of characters that helps you match or find other
strings or sets of strings. Regular expressions are widely used for validation purposes, such
as email validation, URL validation, and phone number validation.
Spacy NER (Named entity recognizer): Named entity recognition is a standard NLP problem which
involves spotting named entities (people, places, organizations etc.) in a chunk of text and
classifying them into a predefined set of categories. Some of the practical applications of
NER include:
Scanning news articles for the people, organizations and locations reported.
Providing concise features for search optimization: instead of searching the entire content,
one may simply search for the major entities involved.
Quickly retrieving geographical locations talked about in Twitter posts.
COMPARATIVE ANALYSIS:
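The regular-expression and NER checks described for the working system can be sketched as follows. This is an illustration under stated assumptions: the four patterns, their labels, and the function names are hypothetical and are not the expressions used in the original system; the NER helper assumes spaCy and its `en_core_web_sm` model are installed and is not needed for the regex part.

```python
import re

# Illustrative patterns only; the formats below are assumptions.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # 16 digits, optionally separated by single spaces or hyphens
    "card_number": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    # four letters, a zero, then six letters or digits
    "ifsc_code": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    # six-digit postal code (assumed format)
    "postal_code": re.compile(r"\b\d{6}\b"),
}


def regex_findings(text):
    """Return (label, matched text) pairs for every pattern hit."""
    return [
        (label, match.group())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def ner_findings(text, labels=("PERSON", "ORG", "GPE", "LOC")):
    """Return (label, entity text) pairs for selected entity types.

    spaCy is imported lazily; this function requires the third-party
    package and the en_core_web_sm model to be installed.
    """
    import spacy

    nlp = spacy.load("en_core_web_sm")
    return [
        (ent.label_, ent.text)
        for ent in nlp(text).ents
        if ent.label_ in labels
    ]
```

Regular expressions suit values with a fixed shape, such as card numbers or codes, while NER is better suited to free-form names of people, organizations, and places, so the two checks complement each other.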
Snapshots
Fig. 5.3.1 Twitter home page and writing tweet
Fig. 5.3.6 Writing tweet and posting
Fig. 5.3.9 Privacy detected for location
CONCLUSION
FUTURE WORK
In future, the system can be extended not only to posts on social sites but also to messages
on all social networking sites. It can also identify language patterns for other categories of
sensitive information. This application currently supports a single user; in future it can be
used for multiple users, for which permission can be obtained from the developers.