Deep Web

Part II.B. Techniques and Tools:


Network Forensics

CSF: Forensics Cyber-Security


Fall 2015
Nuno Santos
Summary

}  The Surface Web

}  The Deep Web

2 CSF - Nuno Santos 2015/16


Remember where we are
}  Our journey in this course:

}  Part I: Foundations of digital forensics

}  Part II: Techniques and tools

}  A. Computer forensics

}  B. Network forensics (current focus)

}  C. Forensic data analysis


Previously: Three key instruments in cybercrime

}  Anonymity systems: how criminals hide their IDs

}  Botnets: how to launch large-scale attacks

}  Digital currency: how to make untraceable payments


Today: One last key instrument – The Web itself

Offender

}  Web allows for accessing services for criminal activity
}  E.g., drug selling, weapon selling, etc.

}  Provides a huge source of information, used in:
}  Crime premeditation, privacy violations, identity theft, extortion, etc.

}  To find services and info, there are powerful search engines
}  Google, Bing, Shodan, etc.



The Web: powerful also for crime investigation

Investigator

}  Powerful investigation tool about suspects
}  Find evidence in blogs, social networks, browsing activity, etc.

}  The playground where the crime itself is carried out
}  Illegal transactions, cyber stalking, blackmail, fraud, etc.



An eternal cat & mouse race (who’s who?)

}  The sophistication of offenses (and investigations) is driven by the nature and complexity of the Web


The web is deep, very deep…
}  What’s “visible” through typical search engines is minimal



What can be found in the Deep Web?

}  The Deep Web is not necessarily bad: it's just that the content is not directly indexed

}  The part of the Deep Web where criminal activity is carried out is named the Dark Web


Some examples of services in the Web “ocean”



Offenders operate at all layers

}  Investigators too!



Roadmap

}  The Surface Web

}  The Deep Web



The Surface Web



The Surface Web

}  The Surface Web is that portion of the World Wide Web
that is readily available to the general public and
searchable with standard web search engines
}  AKA Visible Web, Clearnet, Indexed Web, Indexable Web or Lightnet

}  As of June 14, 2015, Google's index of the surface web contains about 14.5 billion pages


Surface Web characteristics
}  Distributed data
}  80 million web sites (hostnames responding) in April 2006
}  40 million active web sites (don’t redirect, …)

}  High volatility


}  Servers come and go …

}  Large volume
}  One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)


Surface Web characteristics
}  Unstructured data
}  Lots of duplicated content (30% estimate)
}  Semantic duplication much higher

}  Quality of data


}  No required editorial process
}  Many typos and misspellings (impacts IR)

}  Heterogeneous data


}  Different media
}  Different languages



Surface Web composition by file type

}  As of 2003, about 70% of Web content is images, HTML, PHP, and PDF files


How to find content and services?
}  Using search engines

1. A web crawler gathers a snapshot of the Web

2. The gathered pages are indexed for easy retrieval

3. The user submits a search query

4. The search engine ranks pages that match the query and returns an ordered list


How a typical search engine works
}  Architecture of a typical search engine (lots and lots of computers):
}  A Crawler fetches pages from the Web; an Indexer builds the Index from them; the Query Engine answers queries from Users through the Interface
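The indexer's central data structure is an inverted index: a map from each term to the set of pages containing it. A minimal sketch in Python (the URLs and page texts are made up for illustration; real engines also store positions and rank results):

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text. Returns term -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Unranked AND query: return URLs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))   # start from the first term's postings
    for t in terms[1:]:
        hits &= index.get(t, set())          # intersect with each remaining term
    return hits
```

The crawler fills `pages`, the indexer runs `build_index` offline, and the query engine calls `search` at query time; that separation is why stale pages can appear in results.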
What a Web crawler does

}  The Web crawler is a foundational species
}  Without crawlers, there would be nothing to search

}  Creates and repopulates a search engine's data by navigating the web, fetching docs and files


What a Web crawler is
}  In general, it's a program for downloading web pages
}  Crawler AKA spider, bot, harvester

}  Given an initial set of seed URLs, recursively download every page that is linked from pages in the set
}  A focused web crawler downloads only those pages whose content satisfies some criterion

}  The set of URLs not yet crawled is called the URL frontier
}  Can include multiple pages from the same host
Crawling the Web: Start from the seed pages

[Figure: starting from the seed pages, URLs are crawled and parsed; the URL frontier separates them from the unseen Web]


Crawling the Web: Keep expanding URL frontier

[Figure: a crawling thread pops URLs from the frontier; as pages are crawled and parsed, the frontier advances into the unseen Web]
Web crawler algorithm is conceptually simple

}  Basic Algorithm

Initialize queue (Q) with initial set of known URLs
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q
    If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...)
        continue loop
    If already visited L, continue loop
    Download page, P, for L
    If cannot download P (e.g. 404 error, robot excluded)
        continue loop
    Index P (e.g. add to inverted index or store cached copy)
    Parse P to obtain list of new links N
    Append N to the end of Q

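The pseudocode above can be sketched directly in Python. Here `fetch` is a stand-in for the real download-and-parse step (HTTP request plus link extraction), so the example runs against a toy in-memory "web":

```python
from collections import deque

def crawl(seeds, fetch, page_limit=100):
    """Breadth-first crawl mirroring the slide's algorithm.

    fetch(url) must return (page_text, outgoing_links) on success,
    or None on failure (404 error, robot excluded, ...)."""
    queue = deque(seeds)            # Q: the URL frontier
    visited = set()
    indexed = {}                    # url -> text; stand-in for a real index
    while queue and len(indexed) < page_limit:
        url = queue.popleft()       # pop L from the front of Q
        if url in visited:          # already visited: skip
            continue
        visited.add(url)
        if url.endswith((".gif", ".jpeg", ".ps", ".pdf", ".ppt")):
            continue                # not an HTML page: skip
        result = fetch(url)
        if result is None:          # download failed: skip
            continue
        text, links = result
        indexed[url] = text         # "index" P
        queue.extend(links)         # append new links N to the end of Q
    return indexed
```

Because the frontier is a FIFO queue, this crawls breadth-first; swapping `deque` for a stack would make it depth-first, one of the strategy choices discussed next.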


But not so simple to build in practice

}  Performance: How do you crawl 1,000,000,000 pages?

}  Politeness: How do you avoid overloading servers?

}  Failures: Broken links, timeouts, spider traps.

}  Strategies: How deep to go? Depth first or breadth first?

}  Implementations: How do we store and update the URL list and other data structures needed?

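Politeness, for instance, is usually handled by remembering the last request time per host and enforcing a minimum delay between hits. A small sketch (the class and parameter names are my own, not from any particular crawler; `now` is injectable so the logic can be tested without sleeping):

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Track per-host fetch times and enforce a minimum inter-request delay."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_fetch = {}            # host -> timestamp of last request

    def wait_time(self, url, now=None):
        """Seconds the crawler must still wait before hitting this host."""
        host = urlparse(url).netloc
        now = time.time() if now is None else now
        last = self.last_fetch.get(host)
        if last is None:                # never seen this host: go ahead
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now=None):
        """Note that a request to this URL's host was just made."""
        host = urlparse(url).netloc
        self.last_fetch[host] = time.time() if now is None else now
```

A crawler thread would call `wait_time` before each fetch, sleep that long if nonzero, then call `record`; keying on the host rather than the URL is what prevents overloading a single server.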


Crawler performance measures
}  Completeness
Is the algorithm guaranteed to find a solution when there is one?

}  Optimality
Is this solution optimal?

}  Time complexity
How long does it take?

}  Space complexity
How much memory does it require?



No single crawler can crawl the entire Web
}  Crawling technique may depend on goal

}  Types of crawling goals:


}  Create large broad index
}  Create a focused topic or domain-specific index
}  Target topic-relevant sites
}  Index preset terms
}  Create subset of content to model characteristics
of the Web
}  Need to survey appropriately
}  Cannot use simple depth-first or breadth-first
}  Create up-to-date index
}  Use estimated change frequencies



Crawlers can also be used for nefarious purposes

}  Spiders can be used to collect email addresses for unsolicited communication
}  From: http://spiders.must.die.net


Crawler code available for free



Spider traps
}  A spider trap is a set of web pages that may be used to cause a web crawler to make an infinite number of requests or cause a poorly constructed crawler to crash
}  Sometimes set deliberately, to "catch" spambots or similar crawlers that waste a website's bandwidth

}  Common techniques used are:

•  Creation of indefinitely deep directory structures like
   http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
•  Dynamic pages, like calendars, that produce an infinite number of pages for a web crawler to follow
•  Pages filled with many chars, crashing the lexical analyzer parsing the page
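A crawler can defend itself with cheap heuristics before fetching: cap URL length, path depth, and repeated path segments. A sketch of such a guard (the thresholds are illustrative, not standard values):

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_len=200, max_depth=8, max_repeats=3):
    """Heuristic pre-fetch guard against spider traps.

    Rejects absurdly long URLs, very deep paths, and paths where the
    same segment repeats over and over (the foo/bar/foo/bar/... pattern)."""
    if len(url) > max_len:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # the same directory name appearing many times suggests a loop
    return any(segments.count(s) > max_repeats for s in set(segments))
```

In the crawler loop this check would run right before the download step, discarding suspicious URLs instead of fetching them; it cannot catch every trap (calendar-style dynamic pages need per-site page budgets), but it stops the classic recursive-directory case cheaply.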
Search engines run specific and benign crawlers

}  Search engines obtain their listings in two ways:
}  The search engines "crawl" or "spider" documents by following hypertext links from page to page
}  Authors may submit their own Web pages

}  As a result, only static Web content can be found on public search engines

}  Nevertheless, a lot of info can be retrieved by criminals and investigators, especially when using "hidden" features of the search engine


Google hacking
}  Google provides keywords for advanced searching
}  Logic operators in search expressions
}  Advanced query attributes: "login password filetype:pdf"

}  intitle, allintitle, inurl, allinurl, filetype, allintext, site, link, inanchor, daterange, cache, info
}  related, phonebook, rphonebook, bphonebook, author, group, msgid, insubject, stocks, define


There are entire books dedicated to Google hacking

}  Dornfest, Rael, Google Hacks, 3rd ed., O'Reilly (2006)

}  Ethical Hacking
http://www.nc-net.info/2006conf/Ethical_Hacking_Presentation_October_2006.ppt

}  A cheat sheet of Google search features:
http://www.google.com/intl/en/help/features.html

}  A Cheat Sheet for Google Search Hacks -- how to find information fast and efficiently
http://www.expertsforge.com/Security/hacking-everything-using-google-3.asp


Google hacking examples: Simple word search

}  A simple search: "cd ls .bash_history ssh"

}  Can return surprising results: this is the contents of a live .bash_history file
Google hacking examples: URL searches
}  inurl: find the search term within the URL

}  inurl:admin
}  inurl:admin users mbox
}  inurl:admin users passwords


Google hacking examples: File type searches

}  filetype: narrow down search results to a specific file type

}  filetype:xls "checking account" "credit card"


Google hacking examples: Finding servers

}  intitle:"Under construction" "does not currently have"

}  intitle:"Welcome to Windows 2000 Internet Services"


Google hacking examples: Finding webcams

}  To find open unprotected Internet webcams that broadcast to the web, use the following query:
}  inurl:/view.shtml

}  Can also search by manufacturer-specific URL patterns
}  inurl:ViewerFrame?Mode=
}  inurl:ViewerFrame?Mode=Refresh
}  inurl:axis-cgi/jpg
}  ...
Google hacking examples: Finding webcams

}  How to Find and View Millions of Free Live Web Cams
http://www.traveltowork.net/2009/02/how-to-find-view-free-live-web-cams/

}  How to Hack Security Cameras
http://www.truveo.com/How-To-Hack-Security-Cameras/id/180144027190129591

}  How to Hack Security Cams all over the World
http://www.youtube.com/watch?v=9VRN8BS02Rk&feature=related


And we’re just scratching the surface…

What can be found in the depths of the Web?



The Deep Web



The Deep Web

}  The Deep Web is the part of the Web which is not indexed by conventional search engines and therefore doesn't appear in search results

}  Why is it not indexed by typical search engines?


Some content can't be found through URL traversal

•  Dynamic web pages and searchable databases
–  Generated in response to a query or accessed only through a form
•  Unlinked contents
–  Pages without any backlinks
•  Private web
–  Sites requiring registration and login
•  Limited access web
–  Sites with captchas, no-cache pragma HTTP headers
•  Scripted pages
–  Pages produced by JavaScript, Flash, etc.


Other times, content won't be found
}  Crawling restrictions by site owner
}  Use a robots.txt file to keep files off limits from spiders

}  Crawling restrictions by the search engine
}  E.g.: a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
}  Most search engines will not read past the ? in that URL

}  Limitations of the crawling engine
}  E.g., real-time data – changes rapidly – too "fresh"
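Python's standard library can interpret robots.txt rules directly, which is how a well-behaved crawler decides what to skip. A small example using `urllib.robotparser` (the rules and site are made up for illustration):

```python
import urllib.robotparser

# robots.txt rules a site owner might publish (illustrative, not a real site's file)
rules = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)          # normally rp.set_url(...) + rp.read() fetches the live file

print(rp.can_fetch("*", "http://www.website.com/index.html"))    # True
print(rp.can_fetch("*", "http://www.website.com/private/data"))  # False
```

Note that robots.txt is advisory: it keeps compliant spiders out, but nothing technically stops a hostile crawler from ignoring it, which is why such exclusions don't make content truly private.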


How big is Deep Web?
}  Studies suggest it's approx. 500x the surface Web
}  But cannot be determined accurately

}  A 2001 study showed that the 60 largest deep web sites together exceeded the size of the surface web (at that time) by 40x


Distribution of Deep Web sites by content type

}  Back in 2001, the biggest fraction goes to databases


Approaches for finding content in Deep Web

1.  Specialized search engines

2.  Directories



Specialized search engines

}  Crawl deeper
}  Go beyond the top page, or homepage

}  Crawl focused
}  Choose sources to spider – topical sites only

}  Crawl informed
}  Indexing based on knowledge of the specific subject


Specialized search engines abound
}  There are hundreds of specialized search engines for almost every topic


Directories

}  Collections of pre-screened web sites organized into categories based on a controlled ontology
}  Including access to content in databases

}  Ontology: classification of human knowledge into topics, similar to traditional library catalogs

}  Two maintenance models: open or closed
}  Closed model: paid editors; quality control (Yahoo!)
}  Open model: volunteer editors (Open Directory Project)


Example of ontology
}  Ontologies allow for adding structure to Web content



A particularly interesting search engine

}  Shodan lets the user find specific types of computers connected
to the internet using a variety of filters
}  Routers, servers, traffic lights, security cameras, home heating systems
}  Control systems for water parks, gas stations, water plants, power grids,
nuclear power plants and particle-accelerating cyclotrons

}  Why is it interesting?
}  Many devices use "admin" as user name and "1234" as password, and the only software required to connect to them is a web browser


How does Shodan work?

"Google crawls URLs – I don't do that at all. The only thing I do is randomly pick an IP out of all the IPs that exist, whether it's online or not being used, and I try to connect to it on different ports. It's probably not a part of the visible web in the sense that you can't just use a browser. It's not something that most people can easily discover, just because it's not visual in the same way a website is."

John Matherly, Shodan's creator

}  Shodan collects data mostly on HTTP servers (port 80)
}  But also from FTP (21), SSH (22), Telnet (23), and SNMP (161)
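Shodan-style probing boils down to connecting to a port and recording whatever the service announces first (its banner). Services like SSH, FTP, and SMTP volunteer a banner on connect; HTTP requires sending a request first. A minimal sketch using only the standard `socket` module (host, port, and timeout values are illustrative; never scan systems you don't own):

```python
import socket

def grab_banner(sock, nbytes=128, timeout=2.0):
    """Read whatever the service sends first -- its banner."""
    sock.settimeout(timeout)
    try:
        return sock.recv(nbytes).decode("ascii", errors="replace").strip()
    except (socket.timeout, OSError):
        return ""                    # silent service or connection problem

def probe(host, port):
    """Connect to host:port and return the service banner, if any."""
    with socket.create_connection((host, port), timeout=2.0) as s:
        return grab_banner(s)
```

A banner like `SSH-2.0-OpenSSH_7.4` immediately reveals the service and version; indexing millions of such banners by IP and port is essentially what makes Shodan searchable.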


One can see through the eye of a webcam



Play with the controls for a water treatment facility



Find the creepiest stuff…
}  Controls for a crematorium; accessible from your computer



No words needed
}  Controls of Caterpillar trucks connected to the Internet


A Deep Web’s particular case

Dark Web



Dark Web

}  The Dark Web is the Web content that exists on darknets

}  Darknets are overlay networks which use the public Internet but require specific software or authorization to access
}  Delivered over small peer-to-peer networks
}  Or as hidden services on top of Tor

}  The Dark Web forms a small part of the Deep Web, the part of the Web not indexed by search engines


The Dark Web is a haven for criminal activities
}  Hacking services

}  Fraud and fraud services

}  Markets for illegal products

}  Hitmen

}  …
Surface Web vs. Deep Web

Surface Web:
}  Size: estimated to be 8+ billion (Google) to 45 billion (About.com) web pages
}  Static, crawlable web pages
}  Large amounts of unfiltered information
}  Limited to what is easily found by search engines

Deep Web:
}  Size: estimated to be 5 to 500x larger (BrightPlanet)
}  Dynamically generated content that lives inside databases
}  High-quality, managed, subject-specific content
}  Growing faster than the surface web (BrightPlanet)


Conclusions

}  The Web is a major source of information for both criminal and legal investigation activities

}  The Web content that is typically accessible through conventional search engines is named the Surface Web and represents only a small fraction of the whole Web

}  The Deep Web includes the largest bulk of the Web, with a small part of it (the Dark Web) being used specifically for carrying out criminal activities


References

}  Primary bibliography
}  Michael K. Bergman, "The Deep Web: Surfacing Hidden Value"
http://brightplanet.com/wp-content/uploads/2012/03/12550176481-deepwebwhitepaper1.pdf


Next class
}  Flow analysis and intrusion detection
