Defensive Data Mining With Google
Dan Goldberg
MADJiC Consulting, Inc
[email protected]
Purpose
● Google (and other search engines) attempts to catalog the Internet
  – Spiders catalog everything they can find!!!
  – How can you find your exposures?
● Google can also be used to mine unintentional data about a site
  – Information considered public but not common knowledge
  – Data correlation
Disclaimer
● Items not covered:
  – Breaking into computer systems
  – Private data exposures
    ● You can see that at Black Hat next summer!!
● All information presented here:
  – Is public knowledge
  – Can be found freely on the Internet
  – Uses legitimate tools available from Google
What you will get
● We will cover tools and techniques for:
  – Running Google queries about a site;
  – Discovering and correlating public information;
  – Refining searches to find information leaks.
● How to format advanced queries
Getting started
● Defenders, penetration testers, and hackers all follow a similar methodology:
  – Step 1: Reconnaissance
  – Step 2: Scanning
  – Step 3: Exploit targets
  – Step 4: Keep access
  – Step 5: Cover tracks
● We are concerned with Step 1 today
What might you find?
● This is not hacking – it is data mining
  – Information may be used for good or evil
● Unintended Data Disclosure (UDD)
  – Some information is made public “by accident”
    ● Customer and order data
    ● Student records
  – Correlation of seemingly unrelated data
    ● Patterns involving encrypted versus clear-text messages
● How do you find this “stuff”?
Why use Google?
● Google has a massive snapshot of the Internet
● It is currently the industry leader in cataloging websites
● It ranks web sites by content AND by the number of links TO the site
● It catalogs many data types (more on this later)
What is a Web Spider?
● Catalogs web sites
● Follows hypertext links
● Attempts to guess names of resources provided by a site that are not linked
● May use domain registries to find domains
● Supposed to obey robots.txt directives
  – Do you trust this process?
Why bother with this at all?
● Information leaks exist
● You won't know what your leaks are unless you are looking
● See your web presence as the world sees it
● Permits you to close leaks, hopefully before someone else finds them
Closing an information leak
● After closing a leak, remove cached data from:
  – Google's cache
    http://www.google.com/webmasters/remove.html#uncache
  – The Wayback Machine (Internet Archive)
    http://www.archive.org/about/faqs.php#2
Preventing information leaks
● Pass all public content through a central clearing house
● Audit all web applications, especially those handling sensitive information
  – SSNs
  – Credit cards
  – Student information
● Some leaks cannot be prevented
  – Data correlation cannot always be anticipated
Google tools
● Today we will focus primarily on the Google search engine
  – http://www.google.com
● Other search engines offer similar tools
● Loosely based on the popular O'Reilly book “Google Hacking for Penetration Testers” by Johnny Long
  http://johnny.ihackstuff.com/
Let's get started ...
What tools?
● Google offers many tools for specific searches
  – Advanced operators
    ● Usage: operator:searchterm
      – Note there is no space between the operator and the term
      – Example:
        http://www.google.com/search?hl=en&q=site%3Avirginia.edu
  – A cache of web sites
  – A search of sites linking to a site or page
  – Much more!!!
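As a sketch of how the example URL above can be built by hand — the `google_url` helper name and the minimal encoding rules are mine, not a Google tool:

```shell
#!/bin/sh
# Build a Google search URL from an operator:term query.
# Only ":" and spaces are percent-encoded here -- enough for the queries
# in these slides, not a full URL encoder.
google_url() {
    query=$1
    encoded=$(printf '%s' "$query" | sed -e 's/:/%3A/g' -e 's/ /+/g')
    printf 'http://www.google.com/search?hl=en&q=%s\n' "$encoded"
}

google_url "site:virginia.edu"
# -> http://www.google.com/search?hl=en&q=site%3Avirginia.edu
```

The output matches the example URL on this slide; other operators work the same way.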
Google Advanced Operators
query types
● cache: (e.g. cache:www.example.com)
  [cache:] will show the version of the web page that Google has in its cache and highlight the search term(s).
● link: (e.g. link:www.virginia.edu)
  [link:] will list web pages that have links to the specified web page.
● related: (e.g. related:virginia.edu)
  [related:] will list web pages that are "similar" to a specified web page.
Google Advanced Operators
More alternate query types
● info: (e.g. info:www.virginia.edu)
  [info:] will present some information that Google has about that web page, including cached pages and pages that link to it, and will highlight the search term.
● site: (e.g. site:www.itc.virginia.edu)
  [site:] will restrict the results to those websites in the given domain.
● filetype: (e.g. filetype:pdf)
  [filetype:] will restrict the results to pages whose names end in the given suffix.
Google Advanced Operators
Query modifiers
● intitle: (e.g. intitle:student records)
  [intitle:] will restrict the results to documents containing that word in the title.
● allintitle: (e.g. allintitle:student records)
  [allintitle:] will restrict the results to those with all of the query words in the title.
● [intitle:student intitle:records] and [allintitle:student records] yield the same result.
Google Advanced Operators
More query modifiers
● inurl: (e.g. inurl:phpadmin)
  [inurl:] will restrict the results to documents containing that word in the URL.
● allinurl: (e.g. allinurl: student admin)
  [allinurl:] will restrict the results to those with all of the query words in the URL.
  – [allinurl:] works on words, not URL components.
  – [inurl:student inurl:admin] and [allinurl: student admin] yield the same results.
Additional operators
● + forces the inclusion of a common word
  – e.g. University +of Virginia
● ~ performs a synonym search
  – e.g. ~education
● .. performs a numeric-range search
  – e.g. 500..1000
Advanced operators in use
site:, inurl:, and filetype:
[Screenshot: example queries combining site:, inurl:, and filetype:]
● site: cannot search port numbers
● Some operators overlap, such as site: and inurl:
Advanced operators in use
intitle:
[Screenshot: example queries using intitle: and allintitle:]
Advanced Operator action
● Let's start to use some of these tools
● We will start by enumerating the virginia.edu domain...
● Our first search will be:
  site:virginia.edu
More advanced domain search
● Add a negative search:
  site:virginia.edu -www.virginia.edu
● You can string many of these together:
  site:virginia.edu -www.virginia.edu -www.people.virginia.edu
● Output of these searches is ugly
● Remember we are looking for Internet domains only
Searching with regular expressions
$ links -dump 'http://www.google.com/search?q=site:virginia.edu+-www.virginia.edu&num=100' > v.edu
$ sed -n 's/\. [[:alpha:]]*:\/\/[[:alnum:]]*.virginia.edu\//& /p' v.edu | awk '{print $2}' \
  | sort -u
http://aands.virginia.edu/
http://artsandsciences.virginia.edu/
http://etext.virginia.edu/
http://faculty.virginia.edu/
http://gwpapers.virginia.edu/
http://legion.virginia.edu/
http://millercenter.virginia.edu/
https://hoosonline.virginia.edu/
http://toolkit.virginia.edu/
...
Expanding your search
● Increasing domain searches can be done with many tools:
  – Google Sets http://labs.google.com/sets
  – Tilde search ~
  – Google API http://www.google.com/apis/index.html
  – See papers and presentations by SensePost
    www.sensepost.com
What to do now?
● With this list of domains in hand you can:
  – Dump interesting subdomains, repeating this process to learn additional subdomains
    e.g. the domain cs.virginia.edu contains
    http://acm.cs.virginia.edu/
    http://cheetah.cs.virginia.edu/
    http://naccio.cs.virginia.edu/
  – Who knows how I got this list??
  – Apply additional searches to look for information leaks
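The repeat-and-exclude process above can be sketched as a small helper — `build_query` is an illustrative name of mine, and it only assembles the query string (running the search is left to links/curl as shown earlier):

```shell
#!/bin/sh
# Build a "site:domain -host1 -host2 ..." query string, negating every
# host already found so the next search surfaces only new subdomains.
build_query() {
    domain=$1
    shift
    query="site:$domain"
    for host in "$@"; do
        query="$query -$host"
    done
    printf '%s\n' "$query"
}

build_query virginia.edu www.virginia.edu www.people.virginia.edu
# -> site:virginia.edu -www.virginia.edu -www.people.virginia.edu
```

Each new host the search reveals gets appended to the argument list, and the loop repeats until no new results appear.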
Examples and demonstrations
Domain scraping is fun!
Let's start to apply some of the other operators
And some examples ...
Searching with intext:
● Finding specific terms within a document's text body
  – intext:admin site:virginia.edu returns all documents containing the search term “admin”
● Want to narrow this down a bit? Try:
  – allinurl: admin php site:virginia.edu
    ● (note the use of spaces with allinurl:)
  – This could be a gold mine for someone with malicious intent. Better check my Bugtraq archives ...
File type searches
● Results can also be narrowed by focusing on certain file types (the characters after the “.”)
  – These focus on file-name suffixes
  – inurl:admin site:virginia.edu filetype:php
● File types can be excluded from a search
  – -filetype:xls
● Google indexes 13 non-HTML file formats
  – http://www.google.com/help/faq_filetypes.html
  – Not related to the filetype: search
Cache Search
● cache:domain+search_term
  – e.g. cache:cs.virginia.edu php
  – A cache search will return a cached version of the page and highlight search terms in the page.
● Google's cache has been known to contain personal information that was removed from the origin site
● The cache will reference the origin site for images
  – Append “strip=1” to the cache URL to view “text only”
  – This prevents a log entry at the origin server
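A minimal sketch of the strip=1 trick, assuming the cache URL takes the query-string form shown earlier — `cache_url` is an invented helper, not a Google tool:

```shell
#!/bin/sh
# Build a "text only" cache URL; strip=1 drops images, so fetching it
# never generates a log entry at the origin server.
cache_url() {
    host=$1
    printf 'http://www.google.com/search?q=cache:%s&strip=1\n' "$host"
}

cache_url cs.virginia.edu
# -> http://www.google.com/search?q=cache:cs.virginia.edu&strip=1
```

The resulting URL can then be fetched with links or curl, as in the earlier domain-scraping example.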
Some terms to try
● Here are some search terms you might try in your Google mining efforts:
  – “logged in as”
  – root
  – Administrator
  – cookie or session_id
  – “Welcome *”
Where to get more
● Google Hacking Database:
  – http://johnny.ihackstuff.com/
● Many suggested search terms to try
● Some are rated by other users who have put them to use; others are not rated
Google Dork Detection System
● A number of recent worms used automated Google searches to locate targets
● This increased attention to “Google hacking”
● Google has begun to filter many known “bad” searches with GDDS
● Filtered search results return:
  – Forbidden
  – Google error – We're sorry
GDDS evasion
● Change case
  – inurl:admin.php or inurl:admin.PHP or inurl:admin.PhP
● Reverse query order
  – inurl:phpbb2 inurl:admin.php
● Space injection
  – P o w e r e d by phpbb2
● Web clients must send a user-agent
  – Use the curl parameter: curl --user-agent “Mozilla/5.0”
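The change-case trick above is easy to automate. A sketch, where `case_variants` is an invented helper that emits the lower- and upper-case suffix forms of an inurl: term:

```shell
#!/bin/sh
# Emit inurl: variants of a file name with its suffix in lower and
# upper case (e.g. admin.php and admin.PHP), one per line.
case_variants() {
    term=$1
    base=${term%.*}          # e.g. "admin"
    ext=${term##*.}          # e.g. "php"
    printf 'inurl:%s.%s\n' "$base" "$(printf '%s' "$ext" | tr 'A-Z' 'a-z')"
    printf 'inurl:%s.%s\n' "$base" "$(printf '%s' "$ext" | tr 'a-z' 'A-Z')"
}

case_variants admin.php
# -> inurl:admin.php
# -> inurl:admin.PHP
```

Each variant can then be tried in turn until one slips past the filter.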
Wrap up
● We have covered:
  – Google advanced search tools
  – Basic domain scraping
  – Several advanced searches
  – GDDS
● Contact me at [email protected]
● Slides will be posted at http://www.madjic.net/papers/
Thank you
References
● Google:
  – http://www.google.com
  – http://www.googleguide.com
  – http://www.google.com/terms_of_service.html
● Unintended Information Revelation
  – http://www.cedar.buffalo.edu/~rohini/UIR/
  – http://www.contractoruk.com/news/002194.html
● Johnny Long's IhackStuff
  – http://johnny.ihackstuff.com/