Web Scraping
Some of you may cringe at the thought of combing through a site for potential architecture issues, but it's one of my favorite activities: an SEO treasure hunt, if you will.
For normal people, the overall site audit process can be daunting and time-consuming, but with tools like the Screaming Frog SEO Spider, the task can be made easier for newbs and pros alike. With a very user-friendly interface, Screaming Frog can be a breeze to work with, but the breadth of configuration options and functionality can make it hard to know where to begin.
With that in mind, I put together this comprehensive guide
to Screaming Frog to showcase the various ways that
SEO, PPC and other marketing folks can use the tool for
site audits, keyword research, competitive analysis, link
building and more!
To get started, simply select what it is that you are looking
to do:
Basic Crawling
I want to crawl my entire site
I want to crawl a single subdirectory
I want to crawl a specific set of subdomains or
subdirectories
I want a list of all of the pages on my site
I want a list of all of the pages in a specific subdirectory
I want to find a list of domains that my client is
currently redirecting to their money site
I want to find all of the subdomains on a site and
verify internal links
Internal Links
I want information about all of the internal and external links on my site (anchor text, directives, links per page, etc.)
I want to find broken internal links on a page or
site
I want to find broken outbound links on a page or
site (or all outbound links in general)
I want to find links that are being redirected
I am looking for internal linking opportunities
Site Content
I want to identify pages with thin content
I want a list of the image links on a particular
page
I want to find images that are missing alt text or
images that have lengthy alt text
I want to find every CSS file on my site
I want to find every JavaScript file on my site
I want to identify all of the jQuery plugins used on
the site and what pages they are being used on
I want to find where flash is embedded on-site
I want to find any internal PDFs that are linked
on-site
I want to understand content segmentation within
a site or group of pages
I want to find pages that have social sharing buttons
I want to find pages that are using iframes
I want to find pages that contain embedded video
or audio content
Sitemap
I want to create an XML Sitemap
I want to check my existing XML Sitemap
General Troubleshooting
I want to identify why certain sections of my site aren't being indexed or aren't ranking
I want to check if my site migration/redesign was
successful
I want to find slow loading pages on my site
I want to find malware or spam on my site
Scraping
I want to scrape the meta data for a list of pages
I want to scrape a site for all of the pages that
contain a specific footprint
URL Rewriting
I want to find and remove session id or other parameters from my crawled URLs
I want to rewrite the crawled URLs (e.g: replace
.com with .co.uk, or write all URLs in lowercase)
Keyword Research
I want to know which pages my competitors
value most
I want to know what anchor text my competitors
are using for internal linking
I want to know which meta keywords (if any) my
competitors have added to their pages
Link Building
I want to analyze a list of prospective link locations
I want to find broken links for outreach opportunities
I want to verify my backlinks and view the anchor text
I want to make sure that I'm not part of a link network
Bonus Round
Basic Crawling
How to crawl an entire site
By default, Screaming Frog only crawls the subdomain that you enter. Any additional subdomains that the spider encounters will be viewed as external links. In order to crawl additional subdomains, you must change the settings in the Spider Configuration menu. By checking "Crawl All Subdomains", you will ensure that the spider crawls any links that it encounters to other subdomains on your site.
Step 1:
Step 2:
To make your crawl go faster, don't check images, CSS, JavaScript, SWF, or external links.
Step 2:
Inclusion:
In the example below, we only wanted to crawl the English-language subdomains on havaianas.com.
PRO Tip:
If you tend to use the same settings for each crawl,
Screaming Frog now allows you to save your configuration
settings:
You can also use this method to identify domains that your
competitors own, and how they are being used. Check out
what else you can learn about competitor sites below.
PRO Tip:
If you find that your crawl is resulting in a lot of server errors, go to the Advanced tab in the Spider Configuration menu, and increase the value of the Response Timeout and of the 5xx Response Retries to get better results.
Internal Links
I want information about all of the
internal and external links on my site
(anchor text, directives, links per
page etc.)
If you do not need to check the images, JavaScript, Flash or CSS on the site, de-select these options in the Spider Configuration menu to save processing time and memory. Once the spider has finished crawling, use the Advanced Export menu to export a CSV of "All Links". This will provide you with all of the link locations, along with the corresponding anchor text and directives.
Site Content
How to identify pages with thin content
After the spider has finished crawling, go to the Internal
tab, filter by HTML, then scroll to the right to the Word
Count column. Sort the Word Count column from low to
high to find pages with low text content. You can drag and
drop the Word Count column to the left to better match
the low word count values to the appropriate URLs. Click
Export in the Internal tab if you prefer to manipulate the
data in a CSV instead.
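If you prefer the CSV route, the same sort is easy to script. Below is a minimal sketch, assuming an export with "Address" and "Word Count" columns (real Screaming Frog exports include many more); the 200-word threshold is an arbitrary example, not a rule.

```python
import csv
import io

def find_thin_pages(csv_text, threshold=200):
    """Return (address, word_count) pairs under the threshold, thinnest first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    thin = [(r["Address"], int(r["Word Count"])) for r in rows
            if int(r["Word Count"]) < threshold]
    return sorted(thin, key=lambda pair: pair[1])

# Hypothetical slice of an "Internal: HTML" export for illustration
sample = """Address,Word Count
https://example.com/,950
https://example.com/contact,85
https://example.com/tag/red,40
"""

print(find_thin_pages(sample))
```

Tune the threshold to your site; a 40-word tag page and a 40-word product page are very different problems.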
While the word count method above will quantify the actual text on the page, there's still no way to tell if the text found is just product names or if it sits in a keyword-optimized copy block. To figure out the word count of your text blocks, use ImportXML2 by @iamchrisle to scrape the text blocks on any list of pages, then count the characters from there. If XPath queries aren't your strong suit, the XPath Helper Chrome extension does a pretty solid job of figuring out the XPath for you. Obviously, you can also use these scraped text blocks to begin to understand the overall word usage on the site in question, but that, my friends, is another post.
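To make the XPath idea concrete, here is a rough sketch using Python's standard library, which supports a subset of XPath. The snippet and the `content` class name are invented for illustration; real pages are rarely well-formed XML, so in practice you would use an HTML parser such as lxml, which accepts the same query syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet; real HTML is rarely this tidy.
snippet = """<html><body>
<div class="content"><p>Hand-made leather wallet with two card slots.</p>
<p>Ships worldwide.</p></div>
<div class="nav"><a>Home</a></div>
</body></html>"""

root = ET.fromstring(snippet)
# Grab only the paragraphs inside the copy block, skipping the navigation.
paragraphs = root.findall(".//div[@class='content']/p")
copy_text = " ".join(p.text for p in paragraphs)
word_count = len(copy_text.split())
print(word_count)
```

The point is that the query targets the copy block only, so boilerplate and navigation text never inflate the count.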
PRO Tip:
Right click on any entry in the bottom window to copy or
open a URL.
Alternately, you can also view the images on a single page by crawling just that URL. Make sure that your crawl depth is set to 1 in the Spider Configuration settings, then once the page is crawled, click on the Images tab, and you'll see any images that the spider found.
noindex
follow
nofollow
noarchive
nosnippet
noodp
noydir
noimageindex
notranslate
unavailable_after
refresh
canonical
Sitemap
How to create an XML Sitemap
After the spider has finished crawling your site, click on the
Advanced Export menu and select XML Sitemap.
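If you want to sanity-check an existing sitemap outside the tool, for instance before uploading its URLs in List mode to verify their status codes, the `<loc>` entries can be pulled out with a few lines of Python. The sitemap below is a made-up example; in practice you would read the file from disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content for illustration
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Sitemaps live in the sitemaps.org namespace, so the query must use it.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(urls)
```

Save the resulting URLs to a .txt file and you have a ready-made List-mode upload.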
General Troubleshooting
How to identify why certain sections of my site aren't being indexed or aren't ranking
Wondering why certain pages aren't being indexed? First, make sure that they weren't accidentally put into the robots.txt or tagged as noindex. Next, you'll want to make sure that spiders can reach the pages by checking your internal links. Once the spider has crawled your site, simply export the list of internal URLs as a .CSV file, using the HTML filter in the Internal tab.
Open up the CSV file, and in a second sheet, paste the list of URLs that you want to check; any URL missing from the crawl export cannot be reached through internal links.
Next, select your file to upload, and press Start. See the
status code of each page by looking at the Internal tab.
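That comparison can also be done without spreadsheet formulas: load both lists and take a set difference. The URLs below are placeholders; in practice you would read the crawl export and your own page list from files.

```python
# Hypothetical data; load these from the crawl export and your page list in practice.
crawled = {
    "https://example.com/",
    "https://example.com/about",
}
should_exist = {
    "https://example.com/about",
    "https://example.com/hidden-page",
}

# Pages the spider never reached via internal links
unreached = sorted(should_exist - crawled)
print(unreached)
```

Any URL that shows up here is either orphaned, blocked, or linked only from somewhere the spider couldn't go.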
To check if your pages contain your GA code, check out
this post on using custom filters to verify Google Analytics
code by @RachaelGerson.
Scraping
How to scrape the meta data for a
list of pages
So, you've harvested a bunch of URLs, but you need more information about them? Set your mode to List, then upload your list of URLs in .txt or .csv format. After the spider is done, you'll be able to see status codes, outbound links, word counts and, of course, meta data for each page in your list.
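For a sense of what the tool is doing under the hood, the same meta data can be pulled from raw HTML with Python's standard-library parser. This is only a sketch: the `MetaParser` class and sample page are mine, not part of Screaming Frog, and fetching the pages is left out.

```python
from html.parser import HTMLParser

class MetaParser(HTMLParser):
    """Collects the <title> text and the meta description from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page source for illustration
page = ("<html><head><title>Example Widgets</title>"
        "<meta name='description' content='Hand-made widgets.'>"
        "</head><body></body></html>")

parser = MetaParser()
parser.feed(page)
print(parser.title, "|", parser.description)
```

Run one parser instance per page; the title and description accumulate as the HTML streams through `feed()`.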
URL Rewriting
How to find and remove session id or other parameters from my crawled URLs
To identify URLs with session ids or other parameters, simply crawl your site with the default settings. When the spider is finished, click on the URI tab and filter to "Dynamic".
Once you've added all of the desired rules, you can test your rules in the Test tab by entering a test URL in the space labeled "URL before rewriting". The "URL after rewriting" will be updated automatically according to your rules.
If you wish to set a rule that all URLs are returned in lowercase, simply select "Lowercase discovered URLs" in the Options tab. This will remove any duplication by capitalized URLs in the crawl.
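It can help to prototype a rewrite rule before committing it to the tool's settings. Here is a sketch assuming session parameters named sessionid, sid, or PHPSESSID (an assumption — adjust the set to match your site): it lowercases the URL and strips those parameters while keeping the rest of the query string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names to strip; adjust to match your site's URLs.
STRIP_PARAMS = {"sessionid", "sid", "phpsessid"}

def rewrite(url):
    """Lowercase the URL and drop session-id style query parameters."""
    parts = urlsplit(url.lower())
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in STRIP_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(rewrite("https://Example.com/Shop?SID=abc123&color=red"))
```

Meaningful parameters such as `color=red` survive; only the session noise is removed.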
Keyword Research
How to know which pages my competitors value most
Generally speaking, competitors will try to spread link popularity and drive traffic to their most valuable pages by linking to them internally. Any SEO-minded competitor will probably also link to important pages from their company blog. Find your competitors' prized pages by crawling their site, then sorting the Internal tab by the Inlinks column from highest to lowest, to see which pages have the most internal links.
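If you export the link data instead, the same inlink ranking can be reproduced with a counter over the Destination column. A sketch; the CSV sample below is invented, and real exports carry many more columns.

```python
import csv
import io
from collections import Counter

# Hypothetical slice of an "All Links" style export for illustration
sample = """Source,Destination
https://rival.com/,https://rival.com/widgets
https://rival.com/blog,https://rival.com/widgets
https://rival.com/blog,https://rival.com/about
https://rival.com/about,https://rival.com/widgets
"""

inlinks = Counter(row["Destination"] for row in csv.DictReader(io.StringIO(sample)))
for url, count in inlinks.most_common():
    print(count, url)
```

The pages at the top of this list are the ones the competitor is pushing hardest internally.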
PRO Tip:
Drag and drop columns to the left or right to improve your
view of the data.
Link Building
How to analyze a list of prospective link locations
If you've scraped or otherwise come up with a list of URLs that needs to be vetted, you can upload and crawl them in List mode to gather more information about the pages. When the spider is finished crawling, check for status codes in the Response Codes tab, and review outbound links, link types, anchor text and nofollow directives in the Out Links tab in the bottom window. This will give you an idea of what those pages link to, and how.
You can also export the full list of out links by clicking on "All Out Links" in the Advanced Export menu. This will not only provide you with the links going to external sites, but it will also show all internal links on the individual pages in your list.
For more great ideas for link building, check out these two awesome posts on link reclamation and using Link Prospector with Screaming Frog by SEER's own @EthanLyon and @JHTScherck.
Bonus Round
Final Remarks
In closing, I hope that this guide gives you a better idea of
what Screaming Frog can do for you. It has saved me
countless hours, so I hope that it helps you, too!