Deep Web

Part II.B. Techniques and Tools:


Network Forensics

CSF: Forensics Cyber-Security


Fall 2015
Nuno Santos
Summary

}  The Surface Web

}  The Deep Web

2 CSF - Nuno Santos 2015/16


Remember where we are
}  Our journey in this course:

}  Part I: Foundations of digital forensics

}  Part II: Techniques and tools

}  A. Computer forensics

}  B. Network forensics (current focus)

}  C. Forensic data analysis


Previously: Three key instruments in cybercrime

}  Anonymity systems: how criminals hide their IDs

}  Botnets: how to launch large-scale attacks

}  Digital currency: how to make untraceable payments


Today: One last key instrument – The Web itself

Offender

}  Web allows for accessing services for criminal activity
}  E.g., drug selling, weapon selling, etc.

}  Provides a huge source of information, used in:
}  Crime premeditation, privacy violations, identity theft, extortion, etc.

}  To find services and info, there are powerful search engines
}  Google, Bing, Shodan, etc.



The Web: powerful also for crime investigation

Investigator

}  Powerful investigation tool about suspects
}  Find evidence in blogs, social networks, browsing activity, etc.

}  The playground where the crime itself is carried out
}  Illegal transactions, cyber stalking, blackmail, fraud, etc.



An eternal cat & mouse race (who’s who?)

}  The sophistication of offenses (and investigations) is driven by the nature and complexity of the Web


The web is deep, very deep…
}  What’s “visible” through typical search engines is minimal



What can be found in the Deep Web?

}  The Deep Web is not necessarily bad: it's just that the content is not directly indexed

}  The part of the Deep Web where criminal activity is carried out is named the Dark Web


Some examples of services in the Web “ocean”



Offenders operate at all layers

}  Investigators too!



Roadmap

}  The Surface Web

}  The Deep Web



The Surface Web



The Surface Web

}  The Surface Web is that portion of the World Wide Web
that is readily available to the general public and
searchable with standard web search engines
}  AKA Visible Web, Clearnet, Indexed Web, Indexable Web or Lightnet

}  As of June 14, 2015, Google's index of the surface web contains about 14.5 billion pages


Surface Web characteristics
}  Distributed data
}  80 million web sites (hostnames responding) in April 2006
}  40 million active web sites (don’t redirect, …)

}  High volatility


}  Servers come and go …

}  Large volume
}  One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)


Surface Web characteristics
}  Unstructured data
}  Lots of duplicated content (30% estimate)
}  Semantic duplication much higher

}  Quality of data


}  No required editorial process
}  Many typos and misspellings (impacts IR)

}  Heterogeneous data


}  Different media
}  Different languages



Surface Web composition by file type

}  As of 2003, about 70% of Web content is images, HTML, PHP, and PDF files


How to find content and services?
}  Using search engines

1. A web crawler gathers a snapshot of the Web

2. The gathered pages are indexed for easy retrieval

3. The user submits a search query

4. The search engine ranks pages that match the query and returns an ordered list


How a typical search engine works
}  Architecture of a typical search engine (lots and lots of computers):
}  A Crawler fetches pages from the Web; an Indexer builds the Index from them; the Query Engine answers queries from Users through the Interface
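The indexer's central data structure is an inverted index: a map from each term to the set of pages containing it. A minimal sketch in Python (the URLs and page texts are made up for illustration; real engines also store positions and rank results):

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text. Returns term -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Unranked AND query: return URLs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))   # start from the first term's postings
    for t in terms[1:]:
        hits &= index.get(t, set())          # intersect with each remaining term
    return hits
```

The crawler fills `pages`, the indexer runs `build_index` offline, and the query engine calls `search` at query time; that separation is why stale pages can appear in results.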
What a Web crawler does

}  The Web crawler is a foundational species
}  Without crawlers, there would be nothing to search

}  Creates and repopulates a search engine's data by navigating the web, fetching docs and files


What a Web crawler is
}  In general, it's a program for downloading web pages
}  Crawler AKA spider, bot, harvester

}  Given an initial set of seed URLs, recursively download every page that is linked from pages in the set
}  A focused web crawler downloads only those pages whose content satisfies some criterion

}  The set of URLs not yet crawled is called the URL frontier
}  Can include multiple pages from the same host
Crawling the Web: Start from the seed pages

[Figure: starting from the seed pages, URLs are crawled and parsed; the URL frontier separates them from the unseen Web]


Crawling the Web: Keep expanding URL frontier

[Figure: a crawling thread pops URLs from the frontier; as pages are crawled and parsed, the frontier advances into the unseen Web]
Web crawler algorithm is conceptually simple

}  Basic Algorithm

Initialize queue (Q) with initial set of known URLs
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q
    If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...)
        continue loop
    If already visited L, continue loop
    Download page, P, for L
    If cannot download P (e.g. 404 error, robot excluded)
        continue loop
    Index P (e.g. add to inverted index or store cached copy)
    Parse P to obtain list of new links N
    Append N to the end of Q

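The pseudocode above can be sketched directly in Python. Here `fetch` is a stand-in for the real download-and-parse step (HTTP request plus link extraction), so the example runs against a toy in-memory "web":

```python
from collections import deque

def crawl(seeds, fetch, page_limit=100):
    """Breadth-first crawl mirroring the slide's algorithm.

    fetch(url) must return (page_text, outgoing_links) on success,
    or None on failure (404 error, robot excluded, ...)."""
    queue = deque(seeds)            # Q: the URL frontier
    visited = set()
    indexed = {}                    # url -> text; stand-in for a real index
    while queue and len(indexed) < page_limit:
        url = queue.popleft()       # pop L from the front of Q
        if url in visited:          # already visited: skip
            continue
        visited.add(url)
        if url.endswith((".gif", ".jpeg", ".ps", ".pdf", ".ppt")):
            continue                # not an HTML page: skip
        result = fetch(url)
        if result is None:          # download failed: skip
            continue
        text, links = result
        indexed[url] = text         # "index" P
        queue.extend(links)         # append new links N to the end of Q
    return indexed
```

Because the frontier is a FIFO queue, this crawls breadth-first; swapping `deque` for a stack would make it depth-first, one of the strategy choices discussed next.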


But not so simple to build in practice

}  Performance: How do you crawl 1,000,000,000 pages?

}  Politeness: How do you avoid overloading servers?

}  Failures: Broken links, timeouts, spider traps.

}  Strategies: How deep to go? Depth first or breadth first?

}  Implementations: How do we store and update the URL list and other data structures needed?

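Politeness, for instance, is usually handled by remembering the last request time per host and enforcing a minimum delay between hits. A small sketch (the class and parameter names are my own, not from any particular crawler; `now` is injectable so the logic can be tested without sleeping):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Track per-host fetch times and enforce a minimum inter-request delay."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_fetch = {}            # host -> timestamp of last request

    def wait_time(self, url, now=None):
        """Seconds the crawler must still wait before hitting this host."""
        host = urlparse(url).netloc
        now = time.time() if now is None else now
        last = self.last_fetch.get(host)
        if last is None:                # never seen this host: go ahead
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now=None):
        """Note that a request to this URL's host was just made."""
        host = urlparse(url).netloc
        self.last_fetch[host] = time.time() if now is None else now
```

A crawler thread would call `wait_time` before each fetch, sleep that long if nonzero, then call `record`; keying on the host rather than the URL is what prevents overloading a single server.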


Crawler performance measures
}  Completeness
Is the algorithm guaranteed to find a solution when there is one?

}  Optimality
Is this solution optimal?

}  Time complexity
How long does it take?

}  Space complexity
How much memory does it require?



No single crawler can crawl the entire Web
}  Crawling technique may depend on goal

}  Types of crawling goals:


}  Create large broad index
}  Create a focused topic or domain-specific index
}  Target topic-relevant sites
}  Index preset terms
}  Create subset of content to model characteristics
of the Web
}  Need to survey appropriately
}  Cannot use simple depth-first or breadth-first
}  Create up-to-date index
}  Use estimated change frequencies



Crawlers can also be used for nefarious purposes

}  Spiders can be used to collect email addresses for unsolicited communication
}  From: http://spiders.must.die.net


Crawler code available for free



Spider traps
}  A spider trap is a set of web pages that may be used to cause a web crawler to make an infinite number of requests or cause a poorly constructed crawler to crash
}  Sometimes set deliberately, to "catch" spambots or similar crawlers that waste a website's bandwidth

}  Common techniques used are:

•  Creation of indefinitely deep directory structures like
   http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
•  Dynamic pages, like calendars, that produce an infinite number of pages for a web crawler to follow
•  Pages filled with many chars, crashing the lexical analyzer parsing the page
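A crawler can defend itself with cheap heuristics before fetching: cap URL length, path depth, and repeated path segments. A sketch of such a guard (the thresholds are illustrative, not standard values):

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_len=200, max_depth=8, max_repeats=3):
    """Heuristic pre-fetch guard against spider traps.

    Rejects absurdly long URLs, very deep paths, and paths where the
    same segment repeats over and over (the foo/bar/foo/bar/... pattern)."""
    if len(url) > max_len:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # the same directory name appearing many times suggests a loop
    return any(segments.count(s) > max_repeats for s in set(segments))
```

In the crawler loop this check would run right before the download step, discarding suspicious URLs instead of fetching them; it cannot catch every trap (calendar-style dynamic pages need per-site page budgets), but it stops the classic recursive-directory case cheaply.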
Search engines run specific and benign crawlers

}  Search engines obtain their listings in two ways:
}  The search engines "crawl" or "spider" documents by following hypertext links from page to page
}  Authors may submit their own Web pages

}  As a result, only static Web content can be found on public search engines

}  Nevertheless, a lot of info can be retrieved by criminals and investigators, especially when using "hidden" features of the search engine


Google hacking
}  Google provides keywords for advanced searching
}  Logic operators in search expressions
}  Advanced query attributes: "login password filetype:pdf"

}  intitle, allintitle, inurl, allinurl, filetype, allintext, site, link, inanchor, daterange, cache, info
}  related, phonebook, rphonebook, bphonebook, author, group, msgid, insubject, stocks, define


There are entire books dedicated to Google hacking

}  Dornfest, Rael, Google Hacks, 3rd ed., O'Reilly (2006)

}  Ethical Hacking
http://www.nc-net.info/2006conf/Ethical_Hacking_Presentation_October_2006.ppt

}  A cheat sheet of Google search features:
http://www.google.com/intl/en/help/features.html

}  A Cheat Sheet for Google Search Hacks -- how to find information fast and efficiently
http://www.expertsforge.com/Security/hacking-everything-using-google-3.asp


Google hacking examples: Simple word search

}  A simple search: "cd ls .bash_history ssh"

}  Can return surprising results: this is the contents of a live .bash_history file
Google hacking examples: URL searches
}  inurl: find the search term within the URL

}  inurl:admin
}  inurl:admin users mbox
}  inurl:admin users passwords


Google hacking examples: File type searches

}  filetype: narrow down search results to a specific file type

}  filetype:xls "checking account" "credit card"


Google hacking examples: Finding servers

}  intitle:"Under construction" "does not currently have"

}  intitle:"Welcome to Windows 2000 Internet Services"


Google hacking examples: Finding webcams

}  To find open unprotected Internet webcams that broadcast to the web, use the following query:
}  inurl:/view.shtml

}  Can also search by manufacturer-specific URL patterns
}  inurl:ViewerFrame?Mode=
}  inurl:ViewerFrame?Mode=Refresh
}  inurl:axis-cgi/jpg
}  ...
Google hacking examples: Finding webcams

}  How to Find and View Millions of Free Live Web Cams
http://www.traveltowork.net/2009/02/how-to-find-view-free-live-web-cams/

}  How to Hack Security Cameras
http://www.truveo.com/How-To-Hack-Security-Cameras/id/180144027190129591

}  How to Hack Security Cams all over the World
http://www.youtube.com/watch?v=9VRN8BS02Rk&feature=related


And we’re just scratching the surface…

What can be found in the depths of the Web?



The Deep Web



The Deep Web

}  The Deep Web is the part of the Web which is not indexed by conventional search engines and therefore doesn't appear in search results

}  Why is it not indexed by typical search engines?


Some content can't be found through URL traversal

•  Dynamic web pages and searchable databases
–  Generated in response to a query or accessed only through a form
•  Unlinked contents
–  Pages without any backlinks
•  Private web
–  Sites requiring registration and login
•  Limited access web
–  Sites with captchas, no-cache pragma HTTP headers
•  Scripted pages
–  Pages produced by JavaScript, Flash, etc.


Other times, content won't be found
}  Crawling restrictions by site owner
}  Use a robots.txt file to keep files off limits from spiders

}  Crawling restrictions by the search engine
}  E.g.: a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
}  Most search engines will not read past the ? in that URL

}  Limitations of the crawling engine
}  E.g., real-time data – changes rapidly – too "fresh"
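Python's standard library can interpret robots.txt rules directly, which is how a well-behaved crawler decides what to skip. A small example using `urllib.robotparser` (the rules and site are made up for illustration):

```python
import urllib.robotparser

# robots.txt rules a site owner might publish (illustrative, not a real site's file)
rules = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)          # normally rp.set_url(...) + rp.read() fetches the live file

print(rp.can_fetch("*", "http://www.website.com/index.html"))    # True
print(rp.can_fetch("*", "http://www.website.com/private/data"))  # False
```

Note that robots.txt is advisory: it keeps compliant spiders out, but nothing technically stops a hostile crawler from ignoring it, which is why such exclusions don't make content truly private.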


How big is Deep Web?
}  Studies suggest it's approx. 500x the surface Web
}  But cannot be determined accurately

}  A 2001 study showed that the 60 largest deep web sites together exceeded the size of the surface web (at that time) by 40x


Distribution of Deep Web sites by content type

}  Back in 2001, the biggest fraction goes to databases


Approaches for finding content in Deep Web

1.  Specialized search engines

2.  Directories



Specialized search engines

}  Crawl deeper
}  Go beyond the top page, or homepage

}  Crawl focused
}  Choose sources to spider – topical sites only

}  Crawl informed
}  Indexing based on knowledge of the specific subject


Specialized search engines abound
}  There are hundreds of specialized search engines for almost every topic


Directories

}  Collections of pre-screened web sites organized into categories based on a controlled ontology
}  Including access to content in databases

}  Ontology: classification of human knowledge into topics, similar to traditional library catalogs

}  Two maintenance models: open or closed
}  Closed model: paid editors; quality control (Yahoo!)
}  Open model: volunteer editors (Open Directory Project)


Example of ontology
}  Ontologies allow for adding structure to Web content



A particularly interesting search engine

}  Shodan lets the user find specific types of computers connected
to the internet using a variety of filters
}  Routers, servers, traffic lights, security cameras, home heating systems
}  Control systems for water parks, gas stations, water plants, power grids,
nuclear power plants and particle-accelerating cyclotrons

}  Why is it interesting?
}  Many devices use "admin" as user name and "1234" as password, and the only software required to connect to them is a web browser


How does Shodan work?

"Google crawls URLs – I don't do that at all. The only thing I do is randomly pick an IP out of all the IPs that exist, whether it's online or not being used, and I try to connect to it on different ports. It's probably not a part of the visible web in the sense that you can't just use a browser. It's not something that most people can easily discover, just because it's not visual in the same way a website is."

John Matherly, Shodan's creator

}  Shodan collects data mostly on HTTP servers (port 80)
}  But also from FTP (21), SSH (22), Telnet (23), and SNMP (161)
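Shodan-style probing boils down to connecting to a port and recording whatever the service announces first (its banner). Services like SSH, FTP, and SMTP volunteer a banner on connect; HTTP requires sending a request first. A minimal sketch using only the standard `socket` module (host, port, and timeout values are illustrative; never scan systems you don't own):

```python
import socket

def grab_banner(sock, nbytes=128, timeout=2.0):
    """Read whatever the service sends first -- its banner."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes).decode("ascii", errors="replace").strip()
    except (socket.timeout, OSError):
        return ""                    # silent service or connection problem

def probe(host, port):
    """Connect to host:port and return the service banner, if any."""
    with socket.create_connection((host, port), timeout=2.0) as s:
        return grab_banner(s)
```

A banner like `SSH-2.0-OpenSSH_7.4` immediately reveals the service and version; indexing millions of such banners by IP and port is essentially what makes Shodan searchable.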


One can see through the eye of a webcam



Play with the controls for a water treatment facility



Find the creepiest stuff…
}  Controls for a crematorium; accessible from your computer



No words needed
}  Controls of Caterpillar trucks connected to the Internet


A Deep Web’s particular case

Dark Web



Dark Web

}  The Dark Web is the Web content that exists on darknets

}  Darknets are overlay networks which use the public Internet but require specific software or authorization to access
}  Delivered over small peer-to-peer networks
}  Or as hidden services on top of Tor

}  The Dark Web forms a small part of the Deep Web, the part of the Web not indexed by search engines


The Dark Web is a haven for criminal activities
}  Hacking services

}  Fraud and fraud services

}  Markets for illegal products

}  Hitmen

}  …
Surface Web vs. Deep Web

Surface Web:
}  Size: estimated to be 8+ billion (Google) to 45 billion (About.com) web pages
}  Static, crawlable web pages
}  Large amounts of unfiltered information
}  Limited to what is easily found by search engines

Deep Web:
}  Size: estimated to be 5 to 500x larger (BrightPlanet)
}  Dynamically generated content that lives inside databases
}  High-quality, managed, subject-specific content
}  Growing faster than the surface web (BrightPlanet)


Conclusions

}  The Web is a major source of information for both criminal and legal investigation activities

}  The Web content that is typically accessible through conventional search engines is named the Surface Web and represents only a small fraction of the whole Web

}  The Deep Web includes the largest bulk of the Web, with a small part of it (the Dark Web) being used specifically for carrying out criminal activities


References

}  Primary bibliography
}  Michael K. Bergman, "The Deep Web: Surfacing Hidden Value"
http://brightplanet.com/wp-content/uploads/2012/03/12550176481-deepwebwhitepaper1.pdf


Next class
}  Flow analysis and intrusion detection
