Prevention of Privacy Leaks in Social Networking Post Using NLP

FAREESA BEGUM
M.Tech student,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
fareesabegum12345@gmail.com
SYEDA AMBAREEN RANA
Assistant Professor,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
ambareen.rana@mjcollege.ac.in
ABSTRACT
Online social networking sites (OSNs) such as Facebook, LinkedIn, Twitter, and Instagram are
widely used by people from all walks of life. On these sites, users share personal information
that needs to be kept secure. The risk arises because users increasingly interact with different
types of users and social groups without knowing what private information they are sharing, which
could include personal information, location, financial details, etc. When users chat or post in
OSNs, privacy leaks may occur, and hackers or intruders can use the leaked information for fraud.
Many techniques, such as machine learning algorithms, NLP, and data mining approaches, have been
used to protect privacy in OSNs by evaluating sensitive information as it is published and by
analyzing chats or posts, but these techniques are inaccurate, time consuming, and not feasible.
The proposed work focuses on privacy involving users' personal information, banking details,
location, etc. When users chat with friends in OSNs, the message is annotated using NLP, the RAKE
algorithm, and NER. The approach works by splitting the message; after annotation, the words are
matched against predefined privacy rules, and if a privacy word is found it is reported to the
user. The user then rechecks the message to be posted and decides whether to alter or send it.
Compared with the other techniques, the proposed approach is accurate and makes it easy to
protect privacy in a post or message.
INTRODUCTION
Privacy concerns with social networking services are a subset of data privacy, involving personal
privacy concerning the storing, re-purposing, provision to third parties, and displaying of
information about oneself via the Internet. Types of privacy harm include intrusion, information
collection, and information processing. Social networking sites such as Facebook, LinkedIn,
Twitter, and Instagram are widely used by people from all walks of life. On these sites, users
share personal information that needs to be kept secure. This risky sharing does not seem to be
due to poor knowledge about the risks to privacy, but rather to a cognitive dissonance between
knowledge and behavior: one study found little correlation between participants' broader concern
about privacy on Twitter and their posting behavior [1].

Users don't pay much attention to the massive amount of private information that can be accessed
publicly through OSNs, e.g., likes and dislikes, email addresses, education background, hometown,
activities attended, and anything else. For users, sharing this information allows them to
maintain long-term relationships with friends or to communicate with people with similar
interests [3]. However, users may expose themselves to a wide range of observers, which include
not only relatives and close friends but also strangers and even stalkers. In other words, the
readers of their posts are anonymous. This raises serious cybersecurity issues if their private
information is abused. The main negative consequences of private information leakage on OSNs are
as follows.

1. Cyberstalking: Cyberstalking has become increasingly prevalent. LeFebvre, Blackburn and Brody
(2015) indicate that a large number of users conduct online surveillance in a romantic
relationship. More seriously, some people can gain control of their target users by gathering
their personal information leaked on OSNs (McFarlane & Bocij, 2013), which causes harm to the
target users.
2. Identity theft: Large quantities of personal information can be collected for identity theft,
which may cause huge financial loss to users (Humphreys, Gill & Krishnamurthy, 2010). A report by
the Department of Justice in the USA shows that 17 million cases of identity theft caused losses
of almost 25 billion dollars in 2015 (Francia III, Hutchinson & Francia, 2015).
3. Phishing: Cybercriminals can easily conduct phishing attacks by collecting revealing personal
information about the targeted victim from OSNs. After collecting sufficient information, the
attackers craft a scenario that looks realistic. Because of the scenario, it is very simple for
the attackers to obtain the trust of the victim. They then use the trust gained to trick victims
into surrendering private information, e.g., credit card details.

OBJECTIVE
Social sites like Twitter are used by people from all walks of life. Users share their feelings
in the form of posts. Unintentionally, they share personal information that needs to be kept
secure. The objective is a prevention system that finds privacy leaks in a social post before the
post is shared. The proposed prevention system is based on predefined privacy words that may
concern users' personal information, banking details, and location. The techniques used to
protect privacy are NLP with the RAKE algorithm, common regular expressions, and NER.

PROBLEM STATEMENT
While social networks have become a necessary means of social interaction, they also raise
ethical and privacy issues. It is well known that social networks leak information about users
that may be sensitive. However, performing accurate real-world online privacy attacks in a
reasonable time frame remains a challenging task.

EXISTING SYSTEM
Previous research demonstrated that privacy preservation is also conditioned by
(i) optimism bias
(ii) overconfidence bias
(iii) People are not able to evaluate all the relevant parameters for estimating privacy risks.
Thus, their privacy decisions are compromised by incomplete information and bounded rationality.
Moreover, the existing system leverages a technique previously proposed for the recognition of
useful text fragments within discussions among developers. The technique relies on the
identification of natural language patterns within sentences contained in the target texts, using
the Stanford Typed Dependencies (SD) representation of these sentences. The SD representation of
a sentence models the sentence's grammatical structure as a list of triples, in which each triple
describes the grammatical relation existing between two words: the governor and the dependent.

PROPOSED SYSTEM
For all these reasons, it is important to provide users with a technology that assesses the
sensitivity of information as it is being published and alerts them to the risk of sharing it.
Current solutions that SNs adopt for protecting users' privacy are mainly based on setting
customizable privacy preferences, a mechanism that turns out to be insufficient for countering
privacy leakage, as it does not allow users to have control over their data. The solution that we
propose uses the RAKE algorithm and NER to identify when a sentence (i.e., a post to a SN) entails
a risk of privacy leakage. To prevent privacy leaks in the post, we make use of the RAKE
algorithm, common regular expressions, and spaCy NER (named entity recognizer). If any privacy
leak is found at the time of posting, the system alerts the user immediately, before the post is
published, so it works as a real-time detector of privacy leaks. For that purpose, I obtained
permission from Twitter developer to access my Twitter account.

Advantages of Proposed System
1. The proposed work is based on the prevention of privacy leaks in social accounts by using NLP
(i.e., based on predefined privacy words).
2. When users share a post on social media, the post is checked for privacy before posting, and
any private information is detected automatically.
3. Users can detect privacy leaks before sharing a post on social sites; in existing systems,
privacy leaks can be detected only after posting, when users have already lost their information.
4. The proposed work operates at run time.

MODULES
TWEEPY
Tweepy is an open source Python package that gives you a very convenient way to access the
Twitter API with Python. Tweepy includes a set of classes and methods that represent Twitter's
models and API endpoints, and it transparently handles various implementation details, such as:

• Data encoding and decoding
• HTTP requests
• Results pagination
• OAuth authentication
• Rate limits
• Streams

If you weren't using Tweepy, then you would have to deal with low-level details having to do with
HTTP requests, data serialization, authentication, and rate limits. This could be time consuming
and prone to error. Instead, thanks to Tweepy, you can focus on the functionality you want to
build.

POST TWEET:
Twitter is an online news and social networking service where users post and interact with
messages. These posts are known as "tweets". Twitter is known as the social media site for
robots. We can use Python to post tweets without even opening the website. The Python library
used to reach the Twitter API is known as Tweepy, and we use Tweepy here for that purpose.

SCAN POST:
Providing security for data is a major concern in cloud storage systems, even though cloud
storage comes with attractive benefits. Internal and external threats can cause the data in the
cloud to be deleted, corrupted, or tampered with. Specifically, an external adversary may try to
alter the content of the stored data while convincing the owner that the data stored in the cloud
is correct and intact. This is done for high profit. Hence it becomes essential to verify the
correctness and integrity of the outsourced data moved to the cloud. Several schemes and auditing
protocols have been introduced for protecting the integrity of outsourced data using the RAKE
algorithm for keyword extraction.
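The keyword extraction that SCAN POST relies on can be illustrated with a simplified RAKE-style sketch. This is not the authors' implementation: the stop-word list, the function names, and the tokenization rule are assumptions made for illustration, and the sketch leaves out adjoining keywords and the top-T cut-off. It splits the text into candidate phrases at stop words and punctuation, scores each word as degree divided by frequency, and ranks phrases by the sum of their word scores.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (an assumption; real RAKE uses a much larger one).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "my", "to", "at", "on", "for", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9@']+", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def word_scores(phrases):
    """Score each word as degree / frequency over all candidate phrases."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # The word itself plus the other words it appears with in this phrase.
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    scores = word_scores(phrases)
    ranked = {" ".join(p): sum(scores[w] for w in p) for p in phrases}
    return sorted(ranked.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for phrase, score in rake("My bank name is State Bank and my credit card number is 1234"):
        print(phrase, score)
```

Because the score is degree over frequency, longer phrases such as "credit card number" rank above single words, which suits privacy terms that are usually multi-word.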
Table 3.3.1 List of privacy words to be detected

LIST OF WORDS TO BE DETECTED IN THE POST:

Privacy types          Predefined DB words
Personal information   Name, Phone no, Gmail ID, Address, Father phone no, Father Gmail ID,
                       Father Office name, Time
Financial Details      Father income, Bank name, Branch name, Debit card no, Credit card no,
                       IFSC code, Net banking password
Location               Organization, House no, Street no, Road no, City, State, Country,
                       Zip code
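A minimal sketch of how the predefined words of Table 3.3.1 and common regular expressions might be combined to scan a post is given below. The dictionary holds only a subset of the table, and the three patterns (10-digit phone number, email address, 16-digit card number) as well as the function name are assumptions for illustration, not the rules used in the paper.

```python
import re

# Predefined privacy words grouped by privacy type (a subset of Table 3.3.1).
PRIVACY_WORDS = {
    "Personal information": ["name", "phone no", "gmail id", "address"],
    "Financial details": ["bank name", "debit card no", "credit card no", "ifsc code"],
    "Location": ["house no", "street no", "city", "state", "country", "zip code"],
}

# Hypothetical patterns for values that a word list alone cannot catch.
PATTERNS = {
    "Personal information": [
        re.compile(r"\b\d{10}\b"),                    # 10-digit phone number
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email address
    ],
    "Financial details": [
        re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),    # 16-digit card number
    ],
}


def find_privacy_types(post):
    """Return the sorted privacy types that the post appears to leak."""
    text = post.lower()
    found = set()
    for privacy_type, words in PRIVACY_WORDS.items():
        if any(re.search(r"\b" + re.escape(w) + r"\b", text) for w in words):
            found.add(privacy_type)
    for privacy_type, patterns in PATTERNS.items():
        if any(p.search(post) for p in patterns):
            found.add(privacy_type)
    return sorted(found)


if __name__ == "__main__":
    print(find_privacy_types("Call me at 9876543210"))
    print(find_privacy_types("I live in Hyderabad city"))
```

Matching on whole words keeps short entries such as "city" from firing inside longer words, at the cost of missing inflected forms.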
SYSTEM ARCHITECTURE
Fig. 4.1.1 System Architecture

Explanation of the system architecture:
1. The user wants to share a post on a social networking site. The post is automatically taken by
the NLP component (RAKE algorithm, NER, and common regular expressions), and the message is split
using part-of-speech tagging, tokenization, etc.
2. The parts of the message are stored in a generalized database.
3. The contents of the generalized database are mapped against the predefined database to check
for privacy.
4. If privacy words are found, they are reported to the user by an alert.
5. If the user then changes the post, it is checked for privacy again. If no privacy leak is
found, the user can share the post. If a privacy leak is found, it is reported to the user by an
alert.
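The NER part of step 1 and the mapping of step 3 might look like the sketch below. It assumes a spaCy-style pipeline object is passed in as `nlp`; the label-to-privacy-type mapping and the function name are hypothetical, with organizations placed under Location only because Table 3.3.1 lists "Organization" there.

```python
# Mapping from spaCy entity labels to privacy types (an assumption for illustration).
LABEL_TO_PRIVACY_TYPE = {
    "PERSON": "Personal information",
    "ORG": "Location",
    "GPE": "Location",
    "LOC": "Location",
}


def ner_privacy_hits(nlp, post):
    """Run an NER pipeline over the post and keep entities that map to a privacy type."""
    doc = nlp(post)
    return [
        (ent.text, LABEL_TO_PRIVACY_TYPE[ent.label_])
        for ent in doc.ents
        if ent.label_ in LABEL_TO_PRIVACY_TYPE
    ]


# Typical use, assuming spaCy and its small English model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   hits = ner_privacy_hits(nlp, "I am meeting Rana in Hyderabad tomorrow")
```

Passing the pipeline in as an argument keeps the function independent of which model is loaded, so a different NER model can be swapped in without changing the mapping logic.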
FLOW CHART
START
1. The user shares a post/message in a social networking site.
2. The post is scanned using NLP and checked against the privacy words (personal info, banking
details, location).
3. Not found (NO): no privacy detected; the user can share the post.
4. Found (YES): privacy detected; report to the user.
5. The user can change the post/message, after which privacy is re-checked (RE-CHECK PRIVACY).
6. Recheck (YES): privacy detected; report to the user again. Recheck (NO): no privacy detected;
the user can share.
STOP
Fig. 4.2.1 Flow Chart
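The scan, report, and re-check loop of the flow chart can be sketched as follows. The function and parameter names are hypothetical: `scan` stands for any detector that returns a list of findings, `ask_user` for the alert shown to the user, and `send` for the call that publishes the post. With Tweepy 4.x, `send` could wrap `tweepy.Client.create_tweet`, as noted in the closing comment; that choice is an assumption, since the paper does not state which Tweepy call it uses.

```python
def share_with_privacy_check(post, scan, ask_user, send):
    """Scan a post, alert the user on findings, and re-check after each edit.

    scan(post) returns a list of detected privacy items (empty if none).
    ask_user(post, findings) returns an edited post, the same post to send it
    unchanged, or None to cancel.
    send(post) publishes the post.
    Returns the post that was sent, or None if the user cancelled.
    """
    while True:
        findings = scan(post)
        if not findings:
            send(post)            # no privacy detected: the user can share
            return post
        revised = ask_user(post, findings)   # report to the user
        if revised is None:
            return None           # the user decided not to post
        if revised == post:
            send(post)            # the user decided to send the post as it is
            return post
        post = revised            # the post was changed: re-check privacy


# Possible wiring with Tweepy 4.x (credentials are placeholders):
#   import tweepy
#   client = tweepy.Client(consumer_key="...", consumer_secret="...",
#                          access_token="...", access_token_secret="...")
#   send = lambda text: client.create_tweet(text=text)
```

Keeping the detector, the alert, and the publishing call as parameters lets the same loop be exercised offline with stand-ins before it is connected to a real account.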
Algorithm and Technique Used
RAKE Algorithm
The RAKE algorithm is described in the following steps:
1. Candidates are extracted from the text by finding strings of words that do not include phrase
delimiters or stop words (a, the, of, etc.). This produces the list of candidate
keywords/phrases.
2. A co-occurrence graph is built to identify how frequently words are associated together in
those phrases.
3. A score is calculated for each phrase as the sum of the individual word scores from the
co-occurrence graph. An individual word score is calculated as the degree of a word (number of
times it appears + number of additional words it appears with) divided by its frequency (number
of times it appears), which weights towards longer phrases.
4. Adjoining keywords are included if they occur more than twice in the document and score high
enough. An adjoining keyword is two keyword phrases with a stop word between them.
5. The top T keywords are then extracted from the content, where T is 1/3rd of the number of
words in the graph.

Project Description:
Social networking channels such as Twitter are used in all walks of life. Users express their
feelings in the form of posts. Unintentionally, they share personal data that needs to be kept
safe. The prevention mechanism detects privacy leaks before a post is published on a social
network (Twitter). To prevent privacy leaks in the post, we make use of the RAKE algorithm, common
regular expressions, and spaCy NER (named entity recognizer). If any privacy leak is found at the
time of posting, the system alerts the user immediately, before the post is published, so it
works as a real-time detector of privacy leaks. For that purpose, I obtained permission from
Twitter developer to access my Twitter account.

Working of a System: When the code is run, it accesses my Twitter account automatically through
Python and the Twitter API, without opening the Twitter website. It opens the Twitter account;
when a post is written and submitted, the system alerts the user at that moment if any privacy
leak is found. Privacy leaks are detected using the RAKE algorithm, common regular expressions,
and NER (named entity recognizer).

Rake Algorithm: RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent
keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing
the frequency of word appearance and its co-occurrence with other words in the text.

Common Regular Expression: A regular expression (sometimes referred to as a rational expression)
is a sequence of characters that defines a search pattern. Usually such patterns are used by
string-searching algorithms for "find" operations on strings, or for input validation. A regular
expression is a special sequence of characters that helps you match or find other strings or sets
of strings. Regular expressions are widely used for validation purposes, such as email
validation, URL validation, and phone number validation.

spaCy NER (named entity recognizer): Named entity recognition is a standard NLP problem which
involves spotting named entities (people, places, organizations, etc.) in a chunk of text and
classifying them into a predefined set of categories. Some of the practical applications of NER
include:

• Scanning news articles for the people, organizations, and locations reported.
• Providing concise features for search optimization: instead of searching the entire content,
one may simply search for the major entities involved.
• Quickly retrieving geographical locations talked about in Twitter posts.

COMPARATIVE ANALYSIS:
Fig. 5.3.1 Twitter home page and writing tweet
Fig. 5.3.2 Posting tweet to Twitter
Table 5.3.1 Comparative analysis
Fig. 5.3.4 Writing tweet and posting
Snapshots
Fig. 5.3.5 Privacy detected for personal information (contact number)
Fig. 5.3.6 Writing tweet and posting
Fig. 5.3.7 Privacy detected for location
Fig. 5.3.8 Writing tweet and posting
Fig. 5.3.9 Privacy detected for location

CONCLUSION
Social networking sites are used by millions of people around the country, who share all types of
information on them. We obtained permission from Twitter developer to access a Twitter account in
order to test privacy when sharing posts. We proposed a method that intercepts sentences that
disclose private information by identifying the domain of privacy that is concerned by the
leakage. Users unintentionally share personal information, which may lead to financial loss,
mental abuse, etc.; this project notifies them before the post is shared. Users therefore know
what they are sharing, and if they are sharing private information the system notifies them with
an alert. The system uses NLP with the RAKE algorithm to identify privacy words, which are then
matched against the predefined privacy database.

FUTURE WORK
In the future, the system can be extended to cover not only posts on social sites but also
messages on all social networking sites. It can also identify language patterns for other
categories of sensitive information. The application currently serves a single user; in the
future it can be used by multiple users, for which permission can be obtained from the developers.

REFERENCES
[1] Gerardo Canfora, Andrea Di Sorbo, Enrico Emanuele, Sara Forootani, and Corrado A. Visaggio.
2018. "A NLP-based Solution to Prevent from Privacy Leaks in Social Network Posts".

[2] Jose María Gomez-Hidalgo and Jose Miguel Martín-Abreu. 2017. "Prevention of inference attacks
for private information in social networking sites".

[3] A Praveena. 2017. "Prevention of Inference Attacks for Private Information in Social
Networking Sites".

[4] Neha Patil. 2015. "A Novel approach to prevent personal data on a social network using Graph
theory". IEEE privacy security.

[5] Young Min Baek, Eunmee Kim, and Young Bae. 2014. "My privacy is okay, but theirs is
endangered: Why comparative optimism matters in online privacy concerns."

[6] Greg Bigwood, Fehmi Ben Abdesslem, and Tristan Henderson. 2011. "Predicting location sharing
privacy preferences in social networking application".

[7] Young Min Baek, Eunmee Kim, and Young Bae. 2014. My privacy is okay, but theirs is
endangered: Why comparative optimism matters in online privacy concerns. Computers in Human
Behavior 31 (2014), 48–56. https://doi.org/10.1016/j.chb.2013.10.010

[8] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM
Press / Addison-Wesley. http://www.dcc.ufmg.br/irbook/

[9] Aylin Caliskan Islam, Jonathan Walsh, and Rachel Greenstadt. 2014. Privacy Detective:
Detecting Private Information and Collective Privacy Behavior in a Large Social Network. In
Proceedings of the 13th Workshop on Privacy in the Electronic Society (WPES '14). ACM, New York,
NY, USA, 35–46. https://doi.org/10.1145/2665943.2665958

[10] Colin Camerer. 1998. Bounded Rationality in Individual Decision Making. Experimental
Economics 1, 2 (01 Sep 1998), 163–183. https://doi.org/10.1023/A:1009944326196

[11] Paolo Cappellari, Soon Ae Chun, and Mark Perelman. 2017. A Tool for Automatic Assessment and
Awareness of Privacy Disclosure. In Proceedings of the 18th Annual International Conference on
Digital Government Research (dg.o '17). ACM, New York, NY, USA, 586–587.
https://doi.org/10.1145/3085228.3085259

[12] Fabio Celli, Fabio Pianesi, David Stillwell, Michal Kosinski, et al. 2013. Workshop on
computational personality recognition (shared task). In Proceedings of the Workshop on
Computational Personality Recognition.

[13] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating
typed dependency parses from phrase structure parses. In Proceedings of LREC, Vol. 6. Genoa,
449–454.

[14] Andrea Di Sorbo, Sebastiano Panichella, Corrado Aaron Visaggio, Massimiliano Di Penta,
Gerardo Canfora, and Harald C. Gall. 2015. Development Emails Content Analyzer: Intention Mining
in Developer Discussions (T). In 30th IEEE/ACM International Conference on Automated Software
Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015. 12–23.
https://doi.org/10.1109/ASE.2015.12

[15] Nicole B. Ellison, Jessica Vitak, Charles Steinfield, Rebecca Gray, and Cliff Lampe. 2011.
Negotiating Privacy Concerns and Social Capital Needs in a Social Media Environment. Springer
Berlin Heidelberg, Berlin, Heidelberg, 19–32. https://doi.org/10.1007/978-3-642-21521-6_3

[16] William B. Frakes and Ricardo Baeza-Yates. 1992. Information Retrieval: Data Structures and
Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[17] J. Neerbek, I. Assent, and P. Dolog. 2017. TABOO: Detecting Unstructured Sensitive
Information Using Recursive Neural Networks. In 2017 IEEE 33rd International Conference on Data
Engineering (ICDE). 1399–1400. https://doi.org/10.1109/ICDE.2017.195

[18] Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado A. Visaggio, Gerardo Canfora,
and Harald C. Gall. 2015. How Can I Improve My App? Classifying User Reviews for Software
Maintenance and Evolution. In Proceedings of the 2015 IEEE International Conference on Software
Maintenance and Evolution (ICSME) (ICSME '15). IEEE, Washington, DC, USA, 281–290.
https://doi.org/10.1109/ICSM.2015.7332474

[19] Shailendra Rathore, Pradip Kumar Sharma, Vincenzo Loia, Young-Sik Jeong, and Jong Hyuk Park.
2017. Social network security: Issues, challenges, threats, and solutions. Information Sciences
421 (2017), 43–69.

[20] William B. Frakes and Ricardo Baeza-Yates. 1992. Information Retrieval: Data Structures and
Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[21] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating
typed dependency parses from phrase structure parses. In Proceedings of LREC, Vol. 6. Genoa,
449–454.
