Prevention of Privacy Leaks in Social Networking Post Using NLP


PREVENTION OF PRIVACY LEAKS IN SOCIAL NETWORKING POST USING NLP

FAREESA BEGUM
M.Tech student,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
fareesabegum12345@gmail.com

SYEDA AMBAREEN RANA
Assistant Professor,
Muffakham Jah College of Engineering and Technology, Osmania University,
Hyderabad, Telangana State, India
ambareen.rana@mjcollege.ac.in

ABSTRACT

Online social networks (OSNs) such as Facebook, LinkedIn, Twitter and Instagram are widely used by people from all walks of life. Users share personal information on these sites that needs to be kept secure. As users interact with more kinds of users and social groups, often without understanding the privacy implications of what they share, they may disclose personal information, location, financial details and so on. When users chat or post in OSNs, such privacy leaks can be exploited by hackers or intruders for fraud. Many techniques, including machine learning algorithms, NLP and data mining approaches, have been used to protect privacy in OSNs by evaluating the sensitivity of information as it is published and by analyzing chats or posts, but these techniques are inaccurate, time consuming and not always feasible. The proposed work focuses on privacy involving users' personal information, banking details and location. When a user chats with friends in an OSN, the message is annotated using NLP, the RAKE algorithm and NER. The approach splits the message and, after annotation, matches the words against predefined privacy rules; if a privacy word is found, it is reported to the user. The user then rechecks the message and decides whether to alter it or send it. Compared with other techniques, the proposed approach is accurate and makes it easy to protect privacy in a post or message.

INTRODUCTION

Privacy concerns with social networking services are a subset of data privacy, involving personal privacy with respect to the storing, re-purposing, provision to third parties, and displaying of information about oneself via the Internet. Types of privacy harm include intrusion, information collection and information processing. Social networking sites such as Facebook, LinkedIn, Twitter and Instagram are widely used by people from all walks of life, and on them users share personal information that needs to be kept secure. This is not necessarily due to poor knowledge of the privacy risks, but rather to a cognitive dissonance between knowledge and behavior: one study found little correlation between participants' broader concern about privacy on Twitter and their posting behavior [1].

Users don't pay much attention to the massive amount of private information that can be accessed publicly through OSNs, e.g., likes and dislikes, email addresses, education background, hometown, activities attended and so on. For users, sharing this information allows them to maintain long-term relationships with friends or to communicate with people with similar interests [3]. However, users may expose themselves to a wide range of observers, including not only relatives and close friends but also strangers and even stalkers. In other words, the readers of their posts are anonymous. This raises serious cybersecurity issues if their private information is abused. The main negative consequences of private information leakage on OSNs are as follows.

1. Cyberstalking: Cyberstalking has become increasingly prevalent. LeFebvre, Blackburn and Brody (2015) indicate that a large number of users conduct online
surveillance in a romantic relationship. More seriously, some people can gain control of their target users by gathering their personal information leaked on OSNs (McFarlane & Bocij, 2013), which harms the target users.

2. Identity theft: Large quantities of personal information can be collected for identity theft, which may cause huge financial loss to users (Humphreys, Gill & Krishnamurthy, 2010). A report by the Department of Justice in the USA shows that 17 million cases of identity theft caused losses of almost 25 billion dollars in 2015 (Francia III, Hutchinson & Francia, 2015).

3. Phishing: Cybercriminals can easily conduct phishing attacks by collecting revealing personal information about the targeted victim from OSNs. After collecting sufficient information, the attackers craft a scenario that looks realistic. Because of the scenario, it is very simple for the attackers to obtain the trust of the victim. They then use that trust to trick victims into surrendering private information, e.g., credit card details.

OBJECTIVE
Social sites like Twitter are used by people from all walks of life. Users share their feelings in the form of posts and may unintentionally share personal information that needs to be kept secure. The prevention system finds privacy leaks in a post before it is shared. It is based on predefined privacy words that may relate to a user's personal information, banking details or location. The techniques used to protect privacy are NLP with the RAKE algorithm, common regular expressions and NER.

PROBLEM STATEMENT
While social networks have become a necessary channel for social interactions, they also raise ethical and privacy issues. It is well known that social networks leak information about users that may be sensitive. However, performing accurate real-world online privacy attacks in a reasonable time frame remains a challenging task.

EXISTING SYSTEM
Previous research demonstrated that privacy preservation is also conditioned by
(i) optimism bias
(ii) overconfidence bias
(iii) People are not able to evaluate all the relevant parameters for estimating privacy risks. Thus, their privacy decisions are compromised by incomplete information
and bounded rationality. Moreover, we leverage a technique previously proposed for recognizing useful text fragments within discussions among developers. The technique relies on identifying natural language patterns within sentences contained in the target texts, using the Stanford Typed Dependencies (SD) representation of these sentences. The SD representation of a sentence models the sentence's grammatical structure as a list of triples, in which each triple describes the grammatical relation existing between two words: the governor and the dependent.

PROPOSED SYSTEM
For all these reasons, it is important to provide users with a technology that assesses the sensitivity of information as it is being published and alerts them to the risk of sharing it. Current solutions that social networks (SNs) adopt for protecting users' privacy are mainly based on customizable privacy preferences, a mechanism that turns out to be insufficient for countering privacy leaks, as it does not allow users to have control over the data itself. The solution we propose uses the RAKE algorithm and NER to identify when a sentence (i.e., a post to an SN) entails a risk of privacy leakage. To prevent privacy leaks in a post, we use the RAKE algorithm, common regular expressions and spaCy NER (named entity recognizer). If any privacy leak is found at the time of posting, the system alerts the user immediately, before the post is published, so that it works as a real-time detector of privacy leaks. For this purpose, I obtained access from Twitter Developer so that my Twitter account could be used to check posts at posting time.

Advantages of Proposed System
1. The proposed work prevents privacy leaks in social accounts by using NLP (i.e., based on predefined privacy words).
2. Before a post is shared on social media, it is checked for privacy; if the user is about to share private information, it is detected automatically.
3. Users can detect privacy leaks before sharing a post on social sites; in existing systems, leaks can only be detected after posting, when users have already lost their information.
4. The proposed work operates at run time.

MODULES
TWEEPY
Tweepy is an open source Python package that gives you a very convenient way to access the Twitter API with Python. Tweepy includes a set of classes and methods that represent Twitter's models and API endpoints, and it transparently handles various implementation details, such as:

• Data encoding and decoding
• HTTP requests
• Results pagination
• OAuth authentication
• Rate limits
• Streams

If you weren't using Tweepy, then you would have to deal with low-level details having to do with HTTP requests, data serialization, authentication, and rate limits. This could be time consuming and prone to error. Instead, thanks to Tweepy, you can focus on the functionality you want to build.

POST TWEET:
Twitter is an online news and social networking service where users post and interact with messages. These posts are known as "tweets". Twitter is sometimes described as the social media site for robots. We can use Python to post tweets without even opening the website. A Python library known as Tweepy is used to reach the Twitter API, and we use Tweepy here for that purpose.

SCAN POST:
Providing security for data is a major concern in cloud storage systems, even though cloud storage comes with attractive benefits. Internal and external threats can cause the data in the cloud to be deleted, corrupted or tampered with. Specifically, an external adversary may try to alter the content of the stored data and convince the owner that the data stored in the cloud is correct and intact, which is done for profit. Hence it becomes essential to verify the correctness and integrity of the outsourced data moved to the cloud. Several schemes and auditing protocols have been introduced for protecting the integrity of outsourced data, using the RAKE algorithm for keyword extraction.
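The keyword-extraction step that SCAN POST relies on can be illustrated with a minimal sketch of RAKE scoring. It assumes a small hand-written stop-word list and treats punctuation as phrase delimiters; the names STOP_WORDS, candidate_phrases and rake_keywords are illustrative and are not taken from the paper's implementation. Each word is scored as its degree divided by its frequency, a phrase scores the sum of its words, and only the top third of phrases (relative to the number of distinct words) is kept. The adjoining-keyword refinement is left out for brevity.

```python
import re
from collections import defaultdict

# Assumed stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "for", "i", "in", "is", "it",
              "my", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into runs of words that contain no stop word or delimiter."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_keywords(text):
    """Return (phrase, score) pairs, best first, limited to the top third."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)  # number of times a word appears
    degree = defaultdict(int)     # appearances plus co-occurring words
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    ranked = sorted(phrase_score.items(), key=lambda item: (-item[1], item[0]))
    top = max(1, len(frequency) // 3)
    return ranked[:top]


print(rake_keywords("My credit card number is 1234 and my bank name is Axis"))
```

For the sample sentence, the sketch ranks "credit card number" (score 9.0) ahead of "bank name" (score 4.0), which shows how dividing degree by frequency favors longer phrases.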

Table 3.3.1 List of privacy words to be detected

LIST OF WORDS TO BE DETECTED IN THE POST:

Privacy types          Predefined DB words
---------------------  ----------------------------------------------
Personal information   Name, Phone no, Gmail ID, Address,
                       Father phone no, Father Gmail ID,
                       Father office name, Time
Financial details      Father income, Bank name, Branch name,
                       Debit card no, Credit card no, IFSC code,
                       Net banking password
Location               Organization, House no, Street no, Road no,
                       City, State, Country, Zip code
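As a rough illustration, the predefined database in Table 3.3.1 could be held as a mapping from privacy type to words and checked with whole-word matching. This is a sketch under assumptions: the names PRIVACY_WORDS and find_privacy_words are hypothetical, and the paper does not say how its database is stored or matched.

```python
import re

# Predefined privacy words, grouped by privacy type as in Table 3.3.1.
PRIVACY_WORDS = {
    "Personal information": [
        "name", "phone no", "gmail id", "address", "father phone no",
        "father gmail id", "father office name", "time",
    ],
    "Financial details": [
        "father income", "bank name", "branch name", "debit card no",
        "credit card no", "ifsc code", "net banking password",
    ],
    "Location": [
        "organization", "house no", "street no", "road no",
        "city", "state", "country", "zip code",
    ],
}


def find_privacy_words(post):
    """Return {privacy type: [matched words]} for whole-word matches in post."""
    text = post.lower()
    hits = {}
    for privacy_type, words in PRIVACY_WORDS.items():
        found = [w for w in words
                 if re.search(r"\b" + re.escape(w) + r"\b", text)]
        if found:
            hits[privacy_type] = found
    return hits


print(find_privacy_words("My phone no is 98765 and I stay at house no 12"))
```

Whole-word matching keeps "state" from firing on "statement", but overlapping entries still fire together: a post containing "bank name" also matches the shorter entry "name".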

SYSTEM ARCHITECTURE

Fig. 4.1.1 System Architecture

Explanation of the system architecture:

1. When a user wants to share a post on a social networking site, the post is automatically processed with NLP (RAKE algorithm, NER and common regular expressions), which splits the message through tokenization, part-of-speech tagging, etc.
2. The parts of the message are stored in a generalized database.
3. The contents of the generalized database are mapped against the predefined database to check for privacy words.
4. If privacy words are found, the system reports them to the user with an alert.
5. If the user then changes the post, it is checked for privacy again. If no privacy leaks are found, the user can share the post; if a leak is found, the user is alerted again.
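The regular-expression and NER parts of step 1 can be sketched as a single detector. The patterns, the label mapping and the function name detect are assumptions made for illustration, not the paper's rules; real phone and card formats vary by country. The nlp argument is expected to be a loaded spaCy pipeline (for example the result of spacy.load("en_core_web_sm"), assuming that model is installed), and when it is omitted only the regular expressions run. ORG is mapped to Location here because Table 3.3.1 lists Organization under Location.

```python
import re

# Assumed patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"(?<!\d)\d{10}(?!\d)"),
    "card number": re.compile(r"(?<!\d)(?:\d{4}[ -]?){3}\d{4}(?!\d)"),
}

# Assumed mapping from spaCy entity labels to the paper's privacy types.
NER_LABELS = {
    "PERSON": "Personal information",
    "ORG": "Location",
    "GPE": "Location",
    "LOC": "Location",
}


def detect(text, nlp=None):
    """Return (kind, matched text) pairs found by regexes and, optionally, NER."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((kind, match.group()))
    if nlp is not None:
        # doc.ents holds the named entities; label_ is the entity type string.
        for ent in nlp(text).ents:
            if ent.label_ in NER_LABELS:
                findings.append((NER_LABELS[ent.label_], ent.text))
    return findings


print(detect("Mail me at a.b@example.com or call 9876543210"))
```

Keeping the NER model as a parameter lets the regular expressions be tried on their own, and makes it easy to swap in a different pipeline.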

FLOW CHART

[Flow chart: START -> Share post/message in social networking site -> Scan post using NLP -> Check post by privacy word (personal info, banking details, location).
Not found (NO): No privacy detected -> User can share post.
Found (YES): Privacy detected -> Report to user -> User can change post/message and can share -> RE-CHECK PRIVACY: (YES) recheck; (NO) no privacy detected -> User can share -> STOP]

Fig. 4.2.1 Flow Chart
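The flow chart can be read as a scan, report, revise and re-check loop that ends with the post being shared. The sketch below is one possible reading, not the paper's code: share_post, DryRunAPI, ask_user and max_rounds are hypothetical names, and the scanner is passed in as a function. build_api uses Tweepy's OAuthHandler, set_access_token, API and update_status calls; depending on the Tweepy version and the Twitter API access level, tweepy.Client(...).create_tweet(text=...) may be needed instead.

```python
def build_api(consumer_key, consumer_secret, access_token, access_secret):
    """Create a Tweepy API client from Twitter developer credentials."""
    import tweepy  # third-party; imported here so the dry run works without it

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)


class DryRunAPI:
    """Stand-in with the same update_status call, for trying the flow offline."""

    def __init__(self):
        self.posted = []

    def update_status(self, status):
        self.posted.append(status)


def share_post(api, text, scan, ask_user, max_rounds=3):
    """Scan, report, let the user revise, re-check, then share.

    scan(text) returns a list of privacy words found.
    ask_user(text, leaks) reports the leaks and returns the text to use next;
    returning the text unchanged means the user chose to send it as is.
    Returns the posted text, or None if leaks remain after max_rounds revisions.
    """
    rounds = 0
    leaks = scan(text)
    while leaks and rounds < max_rounds:
        revised = ask_user(text, leaks)
        if revised == text:   # user decided to send the post unchanged
            break
        text = revised
        rounds += 1
        leaks = scan(text)    # re-check privacy on the revised post
    else:
        if leaks:
            return None       # still leaking after max_rounds revisions
    api.update_status(status=text)
    return text


api = DryRunAPI()
scan = lambda t: ["phone no"] if "phone no" in t else []
print(share_post(api, "My phone no is 98765", scan, lambda t, found: "Call me later"))
```

Passing the API object in keeps the privacy check independent of Twitter: the same loop runs against DryRunAPI for a trial and against the client returned by build_api for real posting.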

Algorithm and Technique Used

RAKE Algorithm

The RAKE algorithm is described in the following steps:

1. Candidates are extracted from the text by finding strings of words that do not include phrase delimiters or stop words (a, the, of, etc.). This produces the list of candidate keywords/phrases.
2. A co-occurrence graph is built to identify how frequently words are associated together in those phrases.
3. A score is calculated for each phrase as the sum of the individual word scores from the co-occurrence graph. An individual word score is calculated as the degree of a word (number of times it appears + number of additional words it appears with) divided by its frequency (number of times it appears), which weights towards longer phrases.
4. Adjoining keywords are included if they occur more than twice in the document and score high enough. An adjoining keyword is two keyword phrases with a stop word between them.
5. The top T keywords are then extracted from the content, where T is one third of the number of words in the graph.

Project Description:
Social networking channels such as Twitter are used by people from all walks of life. Users express their feelings in the form of posts and may unintentionally share personal data that needs to be kept safe. The prevention mechanism detects privacy leaks before a post is published on a social network (Twitter). To prevent privacy leaks in a post, we use the RAKE algorithm, common regular expressions and spaCy NER (named entity recognizer). If any privacy leak is found at the time of posting, the system alerts the user immediately, before the post is published, so that it works as a real-time detector of privacy leaks. For this purpose, I obtained access from Twitter Developer so that my Twitter account could be used to check posts at posting time.

Working of a System:
When the code is run, it accesses my Twitter account automatically through the Twitter API from Python, without opening the Twitter website. Once the account is open, the user writes a post; when the post is submitted, the system alerts the user at that moment if any privacy leaks are found. Privacy leaks are detected using the RAKE algorithm, common regular expressions and NER (named entity recognizer).

Rake Algorithm:
RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

Common Regular Expression:
A regular expression, also referred to as a rational expression, is a sequence of characters that defines a search pattern. Usually such patterns are used by string-searching algorithms for "find" operations on strings, or for input validation. A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings. Regular expressions are widely used for validation purposes, such as email validation, URL validation and phone number validation.

Spacy NER (Name entity recognizer):
Named entity recognition is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) in a chunk of text and classifying them into a predefined set of categories. Some of the practical applications of NER include:

• Scanning news articles for the people, organizations and locations reported.
• Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
• Quickly retrieving geographical locations talked about in Twitter posts.

COMPARATIVE ANALYSIS:
Table 5.3.1 Comparative analysis

Snapshots

Fig. 5.3.1 Twitter home page and writing tweet
Fig. 5.3.2 Posting tweet to Twitter
Fig. 5.3.4 Writing tweet and posting
Fig. 5.3.5 Privacy detected for personal information (contact number)
Fig. 5.3.6 Writing tweet and posting
Fig. 5.3.7 Privacy detected for location
Fig. 5.3.8 Writing tweet and posting
Fig. 5.3.9 Privacy detected for location

CONCLUSION

Social networking sites are used by millions of people around the country, who share all types of information. We obtained permission from Twitter Developer to access a Twitter account in order to test privacy in shared posts. We proposed a method that intercepts sentences that carry sensitive information by identifying the privacy domain concerned by the leak. Users may unintentionally share personal information, which can lead to financial loss, mental abuse and so on; this project notifies them before the post is shared, so that they know what they are sharing. If a post contains private information, the system alerts the user. It uses NLP with the RAKE algorithm to identify privacy words, which are then matched against the predefined privacy database.

FUTURE WORK
In future, the system will cover not only posts on social sites but also messages on all social networking sites. It can also identify language patterns for other categories of sensitive information. The application currently supports a single user; in future it can be extended to multiple users, for which permission can be obtained from the developers.

REFERENCES

[1] Gerardo Canfora, Andrea Di Sorbo, Enrico Emanuele, Sara Forootani, and Corrado A. Visaggio. 2018. "A Nlp-based Solution to Prevent from Privacy Leaks in Social Network Posts".

[2] Jose Marıa Gomez-Hidalgo and Jose Miguel Martın-Abreu. 2017. "Prevention of inference attacks for private information in social networking sites".

[3] A Praveena. 2017. "Prevention of Inference Attacks for Private Information in Social Networking Sites".

[4] Neha Patil. 2015. "A Novel approach to prevent personal data on a social network using Graph theory". IEEE privacy security.

[5] Young Min Baek, Eunmee Kim, and Young Bae. 2014. "My privacy is okay, but theirs is endangered: Why comparative optimism matters in online privacy concerns."

[6] Greg Bigwood, Fehmi Ben Abdesslem, and Tristan Henderson. 2011. "Predicting location sharing privacy preferences in social networking application".

[7] Young Min Baek, Eunmee Kim, and Young Bae. 2014. My privacy is okay, but theirs is endangered: Why comparative optimism matters in online privacy concerns. Computers in Human Behavior 31 (2014), 48–56. https://doi.org/10.1016/j.chb.2013.10.010

[8] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press / Addison-Wesley. http://www.dcc.ufmg.br/irbook/

[9] Aylin Caliskan Islam, Jonathan Walsh, and Rachel Greenstadt. 2014. Privacy Detective: Detecting Private Information and Collective Privacy Behavior in a Large Social Network. In Proceedings of the 13th Workshop on Privacy in the Electronic Society (WPES '14). ACM, New York, NY, USA, 35–46. https://doi.org/10.1145/2665943.2665958

[10] Colin Camerer. 1998. Bounded Rationality in Individual Decision Making. Experimental Economics 1, 2 (01 Sep 1998), 163–183. https://doi.org/10.1023/A:1009944326196

[11] Paolo Cappellari, Soon Ae Chun, and Mark Perelman. 2017. A Tool for Automatic Assessment and Awareness of Privacy Disclosure. In Proceedings of the 18th Annual International Conference on Digital Government Research (dg.o '17). ACM, New York, NY, USA, 586–587. https://doi.org/10.1145/3085228.3085259

[12] Fabio Celli, Fabio Pianesi, David Stillwell, Michal Kosinski, et al. 2013. Workshop on computational personality recognition (shared task). In Proceedings of the Workshop on Computational Personality Recognition.

[13] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, Vol. 6. Genoa, 449–454.

[14] Andrea Di Sorbo, Sebastiano Panichella, Corrado Aaron Visaggio, Massimiliano Di Penta, Gerardo Canfora, and Harald C. Gall. 2015. Development Emails Content Analyzer: Intention Mining in Developer Discussions (T). In 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015. 12–23. https://doi.org/10.1109/ASE.2015.12

[15] Nicole B. Ellison, Jessica Vitak, Charles Steinfield, Rebecca Gray, and Cliff Lampe. 2011. Negotiating Privacy Concerns and Social Capital Needs in a Social Media Environment. Springer Berlin Heidelberg, Berlin, Heidelberg, 19–32. https://doi.org/10.1007/978-3-642-21521-6_3

[16] William B. Frakes and Ricardo Baeza-Yates. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[17] J. Neerbek, I. Assent, and P. Dolog. 2017. TABOO: Detecting Unstructured Sensitive Information Using Recursive Neural Networks. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). 1399–1400. https://doi.org/10.1109/ICDE.2017.195

[18] Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado A. Visaggio, Gerardo Canfora, and Harald C. Gall. 2015. How Can I Improve My App? Classifying User Reviews for Software Maintenance and Evolution. In Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (ICSME '15). IEEE, Washington, DC, USA, 281–290. https://doi.org/10.1109/ICSM.2015.7332474

[19] Shailendra Rathore, Pradip Kumar Sharma, Vincenzo Loia, Young-Sik Jeong, and Jong Hyuk Park. 2017. Social network security: Issues, challenges, threats, and solutions. Information Sciences 421 (2017), 43–69.

[20] William B. Frakes and Ricardo Baeza-Yates. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

[21] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, Vol. 6. Genoa, 449–454.
