
So, I admit it: I love technical SEO audits.

Some of you may cringe at the thought of combing through a site for potential architecture issues, but it's one of my favorite activities: an SEO treasure hunt, if you will.
For normal people, the overall site audit process can be daunting and time-consuming, but with tools like the Screaming Frog SEO Spider, the task can be made easier for newbs and pros alike. With a very user-friendly interface, Screaming Frog can be a breeze to work with, but the breadth of configuration options and functionality can make it hard to know where to begin.
With that in mind, I put together this comprehensive guide to Screaming Frog to showcase the various ways that SEO, PPC and other marketing folks can use the tool for site audits, keyword research, competitive analysis, link building and more!
To get started, simply select what it is that you are looking to do:

Basic Crawling
I want to crawl my entire site
I want to crawl a single subdirectory
I want to crawl a specific set of subdomains or subdirectories
I want a list of all of the pages on my site
I want a list of all of the pages in a specific subdirectory
I want to find a list of domains that my client is currently redirecting to their money site
I want to find all of the subdomains on a site and verify internal links
I want to crawl an e-commerce site or other large site
I want to crawl a site hosted on an older server
I want to crawl a site that requires cookies
I want to crawl using a proxy
I want to crawl pages that require authentication

Internal Links
I want information about all of the internal and external links on my site (anchor text, directives, links per page etc.)
I want to find broken internal links on a page or site
I want to find broken outbound links on a page or site (or all outbound links in general)
I want to find links that are being redirected
I am looking for internal linking opportunities

Site Content
I want to identify pages with thin content
I want a list of the image links on a particular page
I want to find images that are missing alt text or images that have lengthy alt text
I want to find every CSS file on my site
I want to find every JavaScript file on my site
I want to identify all of the jQuery plugins used on the site and what pages they are being used on
I want to find where flash is embedded on-site
I want to find any internal PDFs that are linked on-site
I want to understand content segmentation within a site or group of pages
I want to find pages that have social sharing buttons
I want to find pages that are using iframes
I want to find pages that contain embedded video or audio content

Meta Data and Directives


I want to identify pages with lengthy page titles, meta descriptions, or URLs
I want to find duplicate page titles, meta descriptions, or URLs
I want to find duplicate content and/or URLs that need to be rewritten/redirected/canonicalized
I want to identify all of the pages that include meta directives e.g.: nofollow/noindex/noodp/canonical etc.
I want to verify that my robots.txt file is functioning as desired
I want to find or verify Schema markup or other microdata on my site

Sitemap
I want to create an XML Sitemap
I want to check my existing XML Sitemap

General Troubleshooting
I want to identify why certain sections of my site aren't being indexed or aren't ranking
I want to check if my site migration/redesign was successful
I want to find slow loading pages on my site
I want to find malware or spam on my site

PPC & Analytics

I want to verify that my Google Analytics code is on every page, or on a specific set of pages on my site
I want to validate a list of PPC URLs in bulk

Scraping
I want to scrape the meta data for a list of pages
I want to scrape a site for all of the pages that contain a specific footprint

URL Rewriting
I want to find and remove session id or other parameters from my crawled URLs
I want to rewrite the crawled URLs (e.g: replace .com with .co.uk, or write all URLs in lowercase)

Keyword Research
I want to know which pages my competitors value most
I want to know what anchor text my competitors are using for internal linking
I want to know which meta keywords (if any) my competitors have added to their pages

Link Building
I want to analyze a list of prospective link locations
I want to find broken links for outreach opportunities
I want to verify my backlinks and view the anchor text
I want to make sure that I'm not part of a link network
I am in the process of cleaning up my backlinks and need to verify that links are being removed as requested

Bonus Round
Basic Crawling
How to crawl an entire site
By default, Screaming Frog only crawls the subdomain that you enter. Any additional subdomains that the spider encounters will be viewed as external links. In order to crawl additional subdomains, you must change the settings in the Spider Configuration menu. By checking "Crawl All Subdomains", you will ensure that the spider crawls any links that it encounters to other subdomains on your site.
Step 1:

Step 2:
To make your crawl go faster, don't check images, CSS, JavaScript, SWF, or external links.

How to crawl a single subdirectory


If you wish to limit your crawl to a single folder, simply enter the URL and press start without changing any of the default settings. If you've overwritten the original default settings, reset the default configuration within the File menu.

If you wish to start your crawl in a specific folder, but want to continue crawling to the rest of the subdomain, be sure to select "Crawl Outside Of Start Folder" in the Spider Configuration settings before entering your specific starting URL.

How to crawl a specific set of subdomains or subdirectories
If you wish to limit your crawl to a specific set of subdomains or subdirectories, you can use RegEx to set those rules in the Include or Exclude settings in the Configuration menu.
Exclusion:
In this example, we crawled every page on havaianas.com, excluding the "about" pages on every subdomain.
Step 1:

Step 2:

Inclusion:
In the example below, we only wanted to crawl the English-language subdomains on havaianas.com.
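The Include and Exclude fields take plain regular expressions matched against the crawled URLs. As a rough sketch of the two examples above (the exact subdomain names here are hypothetical, so adjust them to the site in question):

Exclude the "about" pages on every subdomain:
.*/about/.*

Include only the English-language subdomains:
https?://(us|en)\.havaianas\.com/.*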

I want a list of all of the pages on my site

By default, Screaming Frog is set to crawl all images, JavaScript, CSS and flash files that the spider encounters. To crawl HTML only, you'll have to deselect "Check Images", "Check CSS", "Check JavaScript" and "Check SWF" in the Spider Configuration menu. Running the spider with these settings unchecked will, in effect, provide you with a list of all of the pages on your site that have internal links pointing to them. Once the crawl is finished, go to the "Internal" tab and filter your results by "HTML". Click "Export", and you'll have the full list in CSV format.

PRO Tip:
If you tend to use the same settings for each crawl,
Screaming Frog now allows you to save your configuration
settings:

I want a list of all of the pages in a specific subdirectory
In addition to de-selecting "Check Images", "Check CSS", "Check JavaScript" and "Check SWF", you'll also want to de-select "Check Links Outside Folder" in the Spider Configuration settings. Running the spider with these settings unchecked will, in effect, give you a list of all of the pages in your starting folder (as long as they are not orphaned pages).

How to find a list of domains that my client is currently redirecting to their money site
Enter the money site URL into ReverseInternet, then click the links in the top table to find sites that share the same IP address, nameservers, or GA code.
From here, you can gather your list of URLs using the Google Chrome extension Scraper to find all of the links with the anchor text "visit site". If Scraper is already installed, you can access it by right-clicking anywhere on the page and selecting "Scrape similar". In the pop-up window, you'll need to change your XPath query to:
//a[text()='visit site']/@href

Next, press "Scrape" and then "Export to Google Docs". From the Google Doc, you can then download the list as a .csv file.
Upload the .csv file to Screaming Frog, then use "List" mode to check the list of URLs.
When the spider is finished, you'll see the status codes in the "Internal" tab, or you can look in the "Response Codes" tab and filter by "Redirection" to view all of the domains that are being redirected to your money site or elsewhere.

NB: When uploading the .csv into Screaming Frog, you must select "CSV" as the filetype, otherwise the program will close in error.
PRO Tip:

You can also use this method to identify domains that your
competitors own, and how they are being used. Check out
what else you can learn about competitor sites below.

How to find all of the subdomains on a site and verify internal links
Enter the root domain URL into ReverseInternet, then click on the Subdomains tab to view a list of subdomains.
Then, use Scrape Similar to gather the list of URLs, using the XPath query:
//a[text()='visit site']/@href
Export your results into a CSV, then load the CSV into Screaming Frog using "List" mode. Once the spider has finished running, you'll be able to see status codes, as well as any links on the subdomain homepages, anchor text and duplicate page titles, among other things.

How to crawl an e-commerce site or other large site
Screaming Frog is not built to crawl hundreds of thousands of pages, but there are a couple of things that you can do to avoid breaking the program when crawling large sites. First, you can increase the memory allocation of the spider. Second, you can break down the crawl by subdirectory or only crawl certain parts of the site using your Include/Exclude settings. Third, you can choose not to crawl images, JavaScript, CSS and flash. By deselecting these options in the Configuration menu, you can save memory by crawling HTML only.
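Raising the memory allocation happens outside the interface: you edit the small .ini file that ships next to the Screaming Frog executable to increase the Java heap value, then restart the program. A rough sketch, assuming a Windows install and a machine with RAM to spare (the file name and default value may differ by version):

ScreamingFrogSEOSpider.l4j.ini
-Xmx4g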
PRO Tip:
Until recently, you might have found that your crawls timed out on large sites; however, with Screaming Frog version 2.11, you can tell the program to pause on high memory usage. This fail-safe setting helps to keep the program from crashing before you have the opportunity to save the data or increase the memory allocation. This is currently a default setting, but if you are planning on crawling a large site, be sure that "Pause On High Memory Usage" is checked in the "Advanced" tab of the Spider Configuration menu.

How to crawl a site hosted on an older server
In some cases, older servers may not be able to handle the default number of URL requests per second. To change your crawl speed, choose "Speed" in the Configuration menu, and in the pop-up window, select the maximum number of threads that should run concurrently. From this menu, you can also choose the maximum number of URLs requested per second.

PRO Tip:
If you find that your crawl is resulting in a lot of server errors, go to the "Advanced" tab in the Spider Configuration menu, and increase the value of the "Response Timeout" and of the "5xx Response Retries" to get better results.

How to crawl a site that requires cookies
Although search bots don't accept cookies, if you are crawling a site and need to allow cookies, simply select "Allow Cookies" in the "Advanced" tab of the Spider Configuration menu.

How to crawl using a proxy or a different user-agent
To crawl using a proxy, select "Proxy" in the Configuration menu, and enter your proxy information.
To crawl using a different user agent, select "User Agent" in the Configuration menu, then select a search bot from the drop-down or type in your desired user agent strings.

How to crawl pages that require authentication
When the Screaming Frog spider comes across a page that is password-protected, a pop-up box will appear, in which you can enter the required username and password.
In order to turn off authentication requests, deselect "Request Authentication" in the "Advanced" tab of the Spider Configuration menu.

Internal Links
I want information about all of the internal and external links on my site (anchor text, directives, links per page etc.)
If you do not need to check the images, JavaScript, flash or CSS on the site, de-select these options in the Spider Configuration menu to save processing time and memory.
Once the spider has finished crawling, use the "Advanced Export" menu to export a CSV of "All Links". This will provide you with all of the link locations, as well as the corresponding anchor text, directives, etc.

For a quick tally of the number of links on each page, go to the "Internal" tab and sort by "Outlinks". Anything over 100 might need to be reviewed.

Need something a little more processed? Check out this tutorial on visualizing internal link data with pivot tables by @JoshuaTitsworth, and this one about using NodeXL with Screaming Frog to visualize your internal link graph by @aleyda.

How to find broken internal links on a page or site
If you do not need to check the images, JavaScript, flash or CSS of the site, de-select these options in the Spider Configuration menu to save processing time and memory.
Once the spider has finished crawling, sort the "Internal" tab results by "Status Code". Any 404s, 301s or other status codes will be easily viewable.
Upon clicking on any individual URL in the crawl results, you'll see information change in the bottom window of the program. By clicking on the "In Links" tab in the bottom window, you'll find a list of pages that are linking to the selected URL, as well as anchor text and directives used on those links. You can use this feature to identify pages where internal links need to be updated.
To export the full list of pages that include broken or redirected links, choose "Redirection (3xx) In Links", "Client Error (4xx) In Links" or "Server Error (5xx) In Links" in the "Advanced Export" menu, and you'll get a CSV export of the data.

How to find broken outbound links on a page or site (or all outbound links in general)
After de-selecting "Check Images", "Check CSS", "Check JavaScript" and "Check SWF" in the Spider Configuration settings, make sure that "Check External Links" remains selected.
After the spider is finished crawling, click on the "External" tab in the top window, sort by "Status Code", and you'll easily be able to find URLs with status codes other than 200. Upon clicking on any individual URL in the crawl results and then clicking on the "In Links" tab in the bottom window, you'll find a list of pages that are pointing to the selected URL. You can use this feature to identify pages where outbound links need to be updated.
To export your full list of outbound links, click "Export" on the "Internal" tab. You can also set the filter to export links to external image files, external JavaScript, external CSS, external Flash files, and external PDFs. To limit your export to pages, filter by "HTML".
For a complete listing of all the locations and anchor text of outbound links, select "All Out Links" in the "Advanced Export" menu, then filter the "Destination" column in the exported CSV to exclude your domain.

How to find links that are being redirected
After the spider has finished crawling, select the "Response Codes" tab in the top window, then filter by "Redirection (3xx)". This will provide you with a list of any internal links and outbound links that are redirecting. Sort by "Status Code", and you'll be able to break the results down by type. Click on the "In Links" tab in the bottom window to view all of the pages where the redirecting link is used.
If you export directly from this tab, you will only see the data that is shown in the top window (original URL, status code, and where it redirects to).
To export the full list of pages that include redirected links, you will have to choose "Redirection (3xx) In Links" in the "Advanced Export" menu. This will return a CSV that includes the location of all your redirected links. To show internal redirects only, filter the "Destination" column in the CSV to include only your domain.
PRO Tip:
Use a VLOOKUP between the two export files above to match the Source and Destination columns with the final URL location.
Sample formula:
=VLOOKUP([@Destination],response_codes_redirection_(3xx).csv!$A$3:$F$50,6,FALSE)
(Where response_codes_redirection_(3xx).csv is the CSV file that contains the redirect URLs, and 50 is the number of rows in that file.)
Need to find and fix redirect chains? @dan_shure gives the breakdown on how to do it here.

I am looking for internal linking opportunities
Scaling Internal Link Building with Screaming Frog & Majestic by @JHTScherck. 'Nuff said.

Site Content
How to identify pages with thin content
After the spider has finished crawling, go to the "Internal" tab, filter by HTML, then scroll to the right to the "Word Count" column. Sort the "Word Count" column from low to high to find pages with low text content. You can drag and drop the "Word Count" column to the left to better match the low word count values to the appropriate URLs. Click "Export" in the "Internal" tab if you prefer to manipulate the data in a CSV instead.

PRO Tip for E-commerce Sites:

While the word count method above will quantify the actual text on the page, there's still no way to tell if the text found is just product names or if the text is in a keyword-optimized copy block. To figure out the word count of your text blocks, use ImportXML2 by @iamchrisle to scrape the text blocks on any list of pages, then count the characters from there. If xPath queries aren't your strong suit, the xPath Helper Chrome extension does a pretty solid job at figuring out the xPath for you. Obviously, you can also use these scraped text blocks to begin to understand the overall word usage on the site in question, but that, my friends, is another post.
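If you would rather run the scrape straight from a spreadsheet, Google Sheets' built-in IMPORTXML function takes a URL and an XPath expression; the formulas below are only a sketch, and the XPath assumes a hypothetical div that wraps the copy block:

=IMPORTXML("http://www.example.com/product-page","//div[@class='product-description']")
=LEN(JOIN(" ",A1:A10))

The second formula simply joins whatever IMPORTXML returned into one string and counts the characters.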

I want a list of the image links on a particular page
If you've already crawled a whole site or subfolder, simply select the page in the top window, then click on the "Image Info" tab in the bottom window to view all of the images that were found on that page. The images will be listed in the "To" column.

PRO Tip:
Right click on any entry in the bottom window to copy or open a URL.
Alternately, you can also view the images on a single page by crawling just that URL. Make sure that your crawl depth is set to "1" in the Spider Configuration settings, then once the page is crawled, click on the "Images" tab, and you'll see any images that the spider found.

Finally, if you prefer a CSV, use the "Advanced Export" menu to export "All Image Alt Text" to see the full list of images, where they are located and any associated alt text.

How to find images that are missing alt text or images that have lengthy alt text
First, you'll want to make sure that "Check Images" is selected in the Spider Configuration menu. After the spider has finished crawling, go to the "Images" tab and filter by "Missing Alt Text" or "Alt Text Over 100 Characters". You can find the pages where any image is located by clicking on the "Image Info" tab in the bottom window. The pages will be listed in the "From" column.
Alternately, in the "Advanced Export" menu, you can save time and export "All Image Alt Text" or "Images Missing Alt Text" into a CSV. The resulting file will show you all of the pages where each image is used on the site.

How to find every CSS file on my site
In the Spider Configuration menu, select "Check CSS" before crawling, then when the crawl is finished, filter the results in the "Internal" tab by "CSS".

How to find every JavaScript file on my site
In the Spider Configuration menu, select "Check JavaScript" before crawling, then when the crawl is finished, filter the results in the "Internal" tab by "JavaScript".

How to identify all of the jQuery plugins used on the site and what pages they are being used on
First, make sure that "Check JavaScript" is selected in the Spider Configuration menu. After the spider has finished crawling, filter the "Internal" tab by "JavaScript", then search for "jquery". This will provide you with a list of plugin files. Sort the list by the "Address" column for easier viewing if needed, then view "InLinks" in the bottom window or export the data into a CSV to find the pages where the file is used. These will be in the "From" column.

Alternately, you can use the "Advanced Export" menu to export a CSV of "All Links" and filter the "Destination" column to show only URLs with "jquery".
PRO Tip:
Not all jQuery plugins are bad for SEO. If you see that a site uses jQuery, the best practice is to make sure that the content that you want indexed is included in the page source and is served when the page is loaded, not afterward. If you are still unsure, Google the plugin for more information on how it works.

How to find where flash is embedded on-site
In the Spider Configuration menu, select "Check SWF" before crawling, then when the crawl is finished, filter the results in the "Internal" tab by "Flash".
NB: This method will only find .SWF files that are linked on a page. If the flash is pulled in through JavaScript, you'll need to use a custom filter.
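In that case, the custom filter is just a string that appears in the page source wherever the player is injected; footprints like the ones below would be a reasonable starting point (the second assumes the site uses the common SWFObject embedding library):

.swf
swfobject.embedSWF(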

How to find any internal PDFs that are linked on-site
After the spider has finished crawling, filter the results in the "Internal" tab by "PDF".

How to understand content segmentation within a site or group of pages
If you want to find pages on your site that contain a specific type of content, set a custom filter for an HTML footprint that is unique to that page. This needs to be set *before* running the spider. @stephpchang has a great tutorial on segmenting syndicated content from original content using custom filters.

How to find pages that have social sharing buttons
To find pages that contain social sharing buttons, you'll need to set a custom filter before running the spider. To set a custom filter, go into the Configuration menu and click "Custom". From there, enter any snippet of code from the page source.
In the example above, I wanted to find pages that contain a Facebook "like" button, so I created a filter for http://www.facebook.com/plugins/like.php.

How to find pages that are using iframes
To find pages that use iframes, set a custom filter for <iframe before running the spider.

How to find pages that contain embedded video or audio content
To find pages that contain embedded video or audio content, set a custom filter for a snippet of the embed code for YouTube, or any other media player that is used on the site.
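For example, standard YouTube embeds pull the player from a URL containing the path below, so a custom filter on that fragment would flag those pages (your site's embed markup may differ):

youtube.com/embed/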

Meta Data and Directives


How to identify pages with lengthy page titles, meta descriptions, or URLs
After the spider has finished crawling, go to the "Page Titles" tab and filter by "Over 70 Characters" to see the page titles that are too long. You can do the same in the "Meta Description" tab or in the "URI" tab.

How to find duplicate page titles, meta descriptions, or URLs
After the spider has finished crawling, go to the "Page Titles" tab, then filter by "Duplicate". You can do the same thing in the "Meta Description" or "URI" tabs.

How to find duplicate content and/or URLs that need to be rewritten/redirected/canonicalized
After the spider has finished crawling, go to the "URI" tab, then filter by "Underscores", "Uppercase" or "Non ASCII Characters" to view URLs that could potentially be rewritten to a more standard structure. Filter by "Duplicate" and you'll see all pages that have multiple URL versions. Filter by "Dynamic" and you'll see URLs that include parameters.
Additionally, if you go to the "Internal" tab, filter by "HTML" and scroll to the "Hash" column on the far right, you'll see a unique series of letters and numbers for every page. If you click "Export", you can use conditional formatting in Excel to highlight the duplicated values in this column, ultimately showing you pages that are identical and need to be addressed.

How to identify all of the pages that include meta directives e.g.: nofollow/noindex/noodp/canonical etc.
After the spider has finished crawling, click on the "Directives" tab. To see the type of directive, simply scroll to the right to see which columns are filled, or use the filter to find any of the following tags:
index
noindex
follow
nofollow
noarchive
nosnippet
noodp
noydir
noimageindex
notranslate
unavailable_after
refresh
canonical

How to verify that my robots.txt file is functioning as desired
By default, Screaming Frog will comply with robots.txt. As a priority, it will follow directives made specifically for the Screaming Frog user agent. If there are no directives specifically for the Screaming Frog user agent, then the spider will follow any directives for Googlebot, and if there are no specific directives for Googlebot, the spider will follow global directives for all user agents. The spider will only follow one set of directives, so if there are rules set specifically for Screaming Frog, it will only follow those rules, and not the rules for Googlebot or any global rules. If you wish to block certain parts of the site from the spider, use the regular robots.txt syntax with the user agent "Screaming Frog SEO Spider". If you wish to ignore robots.txt, simply select that option in the Spider Configuration settings.
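For example, a robots.txt block aimed only at the spider uses the standard syntax (the directory below is just a placeholder):

User-agent: Screaming Frog SEO Spider
Disallow: /staging/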

How to find or verify Schema markup or other microdata on my site
To find every page that contains Schema markup or any other microdata, you need to use custom filters. Simply click on "Custom" in the Configuration menu and enter the footprint that you are looking for.
To find every page that contains Schema markup, simply add the following snippet of code to a custom filter: itemtype=http://schema.org
To find a specific type of markup, you'll have to be more specific. For example, using a custom filter for span itemprop="ratingValue" will get you all of the pages that contain Schema markup for ratings.
You can enter up to 5 different filters per crawl. Finally, press OK and proceed with crawling the site or list of pages.
When the spider has finished crawling, select the "Custom" tab in the top window to view all of the pages that contain your footprint. If you entered more than one custom filter, you can view each one by changing the filter on the results.

Sitemap
How to create an XML Sitemap
After the spider has finished crawling your site, click on the "Advanced Export" menu and select "XML Sitemap".
Save your sitemap, then open it with Excel. Select "Read Only" and open the file "As an XML table". You may receive an alert that certain schema cannot be mapped to a worksheet. Just press "Yes".
Now that your Sitemap is in table form, you can easily edit the change frequency, priority and other values. Be sure to double-check that the Sitemap only includes a single, preferred (canonical) version of each URL, without parameters or other duplicating factors. Once any changes have been made, re-save your file as an XML file.

How to check my existing XML Sitemap
First, you'll need to have a copy of the Sitemap saved on your computer. You can save any live Sitemap by visiting the URL and saving the file, or by importing it into Excel. @RichardBaxter actually has great instructions for importing your Sitemap into Excel and checking it using SEOTools, but since we are talking about Screaming Frog, read on:
Once you have the XML file saved to your computer, go to the "Mode" menu in Screaming Frog and select "List". Then, click on "Select File" at the top of the screen, choose your file and start the crawl. Once the spider has finished crawling, you'll be able to find any redirects, 404 errors, duplicated URLs and more Sitemap dirt in the "Internal" tab.
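If you would rather pull the URL list out of the Sitemap yourself before uploading it in List mode, a short script will do it. Here is a minimal Python sketch, assuming a standard sitemap.xml saved locally (the file names are placeholders):

# Extract <loc> URLs from a standard XML sitemap into a .txt file
# that can be uploaded to Screaming Frog in List mode.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print("Extracted %d URLs" % len(urls))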

General Troubleshooting
How to identify why certain sections of my site aren't being indexed or aren't ranking
Wondering why certain pages aren't being indexed? First, make sure that they weren't accidentally put into the robots.txt or tagged as noindex. Next, you'll want to make sure that spiders can reach the pages by checking your internal links. Once the spider has crawled your site, simply export the list of internal URLs as a .CSV file, using the "HTML" filter in the "Internal" tab.
Open up the CSV file, and in a second sheet, paste the list of URLs that aren't being indexed or aren't ranking well. Use a VLOOKUP to see if the URLs in your list on the second sheet were found in the crawl.
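A minimal version of that lookup, assuming the crawl export sits on a sheet named "crawl" with URLs in column A, and your problem URLs start in cell A2 of the second sheet:

=IF(ISNA(VLOOKUP(A2,crawl!A:A,1,FALSE)),"not found in crawl","found in crawl")

Anything flagged as not found in the crawl has no internal links pointing to it (or is blocked), which is a good place to start digging.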
PRO tip:
If you really want to be fancy, try using my Pages Not Indexed Google Doc/Excel tool, which, in a couple of minutes, can provide you with the possible reasons why particular pages aren't indexed or ranking.

How to check if my site migration/redesign was successful
@ipullrank has an excellent Whiteboard Friday on this topic, but the general idea is that you can use Screaming Frog to check whether or not old URLs are being redirected by using the "List" mode to check status codes. If the old URLs are throwing 404s, then you'll know which URLs still need to be redirected.

How to find slow loading pages on my site
After the spider has finished crawling, go to the "Response Codes" tab and sort by the "Response Time" column from high to low to find pages that may be suffering from a slow loading speed.

How to find malware or spam on my site
First, you'll need to identify the footprint of the malware or the spam. Next, in the Configuration menu, click on "Custom" and enter the footprint that you are looking for.
You can enter up to 5 different footprints per crawl. Finally, press OK and proceed with crawling the site or list of pages.
When the spider has finished crawling, select the "Custom" tab in the top window to view all of the pages that contain your footprint. If you entered more than one custom filter, you can view each one by changing the filter on the results.

PPC & Analytics


How to verify that my Google Analytics code is on every page, or on a specific set of pages on my site
SEER Analytics star @RachaelGerson wrote a killer post on this subject: Use Screaming Frog to Verify Google Analytics Code. Check it out!

How to validate a list of PPC URLs in bulk
Save your list in .txt or .csv format, then change your "Mode" settings to "List".
Next, select your file to upload, and press "Start". See the status code of each page by looking at the "Internal" tab.
To check if your pages contain your GA code, check out this post on using custom filters to verify Google Analytics code by @RachaelGerson.
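The custom filter itself is just a distinctive string from your tracking snippet. With the classic analytics.js snippet, for instance, you might filter on your property ID or on the create call (the ID below is a placeholder):

ga('create', 'UA-XXXXXX-1', 'auto');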

Scraping
How to scrape the meta data for a list of pages
So, you've harvested a bunch of URLs, but you need more information about them? Set your mode to "List", then upload your list of URLs in .txt or .csv format. After the spider is done, you'll be able to see status codes, outbound links, word counts, and of course, meta data for each page in your list.

How to scrape a site for all of the pages that contain a specific footprint
First, you'll need to identify the footprint. Next, in the Configuration menu, click on "Custom" and enter the footprint that you are looking for.
You can enter up to 5 different footprints per crawl. Finally, press OK and proceed with crawling the site or list of pages. In the example below, I wanted to find all of the pages that say "Please Call" in the pricing section, so I found and copied the HTML code from the page source.

When the spider has finished crawling, select the "Custom" tab in the top window to view all of the pages that contain your footprint. If you entered more than one custom filter, you can view each one by changing the filter on the results.
PRO Tip:
If you are pulling product data from a client site, you could save yourself some time by asking the client to pull the data directly from their database. The method above is meant for sites that you don't have direct access to.

URL Rewriting
How to find and remove session id or other parameters from my crawled URLs
To identify URLs with session ids or other parameters, simply crawl your site with the default settings. When the spider is finished, click on the "URI" tab and filter to "Dynamic" to view all of the URLs that include parameters.
To remove parameters from being shown for the URLs that you crawl, select "URL Rewriting" in the Configuration menu, then in the "Remove Parameters" tab, click "Add" to add any parameters that you want removed from the URLs, and press OK. You'll have to run the spider again with these settings in order for the rewriting to occur.

How to rewrite the crawled URLs (e.g: replace .com with .co.uk, or write all URLs in lowercase)
To rewrite any URL that you crawl, select "URL Rewriting" in the Configuration menu, then in the "Regex Replace" tab, click "Add" to add the RegEx for what you want to replace.
Once you've added all of the desired rules, you can test your rules in the "Test" tab by entering a test URL in the space labeled "URL before rewriting". The "URL after rewriting" will be updated automatically according to your rules.
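As a concrete sketch of the .com-to-.co.uk example above, the rule pairs a regex with its replacement; escaping the dot keeps the pattern from matching any stray character before "com":

Regex: \.com
Replace: .co.uk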

If you wish to set a rule that all URLs are returned in lowercase, simply select "Lowercase discovered URLs" in the "Options" tab. This will remove any duplication by capitalized URLs in the crawl.
Remember that you'll have to actually run the spider with these settings in order for the URL rewriting to occur.

Keyword Research
How to know which pages my competitors value most
Generally speaking, competitors will try to spread link popularity and drive traffic to their most valuable pages by linking to them internally. Any SEO-minded competitor will probably also link to important pages from their company blog. Find your competitors' prized pages by crawling their site, then sorting the "Internal" tab by the "Inlinks" column from highest to lowest, to see which pages have the most internal links.

To view pages linked from your competitor's blog, deselect "Check links outside folder" in the Spider Configuration menu and crawl the blog folder/subdomain. Then, in the "External" tab, filter your results using a search for the URL of the main domain. Scroll to the far right and sort the list by the "Inlinks" column to see which pages are linked most often.

PRO Tip:
Drag and drop columns to the left or right to improve your
view of the data.

How to know what anchor text my competitors are using for internal linking
In the "Advanced Export" menu, select "All Anchor Text" to export a CSV containing all of the anchor text on the site, where it is used and what it's linked to.

How to know which meta keywords (if any) my competitors have added to their pages
After the spider has finished running, look at the "Meta Keywords" tab to see any meta keywords found for each page. Sort by the "Meta Keyword 1" column to alphabetize the list and visually separate the blank entries, or simply export the whole list.

Link Building
How to analyze a list of prospective link locations
If you've scraped or otherwise come up with a list of URLs that needs to be vetted, you can upload and crawl them in "List" mode to gather more information about the pages. When the spider is finished crawling, check for status codes in the "Response Codes" tab, and review outbound links, link types, anchor text and nofollow directives in the "Out Links" tab in the bottom window. This will give you an idea of what kinds of sites those pages link to and how. To review the "Out Links" tab, be sure that your URL of interest is selected in the top window.
Of course, you'll want to use a custom filter to determine whether or not those pages are linking to you already.

You can also export the full list of out links by clicking on "All Out Links" in the "Advanced Export" menu. This will not only provide you with the links going to external sites, but it will also show all internal links on the individual pages in your list.

For more great ideas for link building, check out these two awesome posts on link reclamation and using Link Prospector with Screaming Frog by SEER's own @EthanLyon and @JHTScherck.

How to find broken links for outreach opportunities
So, you found a site that you would like a link from? Use Screaming Frog to find broken links on the desired page or on the site as a whole, then contact the site owner, suggesting your site as a replacement for the broken link where applicable, or just offer the broken link as a token of good will.

How to verify my backlinks and view the anchor text
Upload your list of backlinks and run the spider in "List" mode. Then, export the full list of outbound links by clicking on "All Out Links" in the "Advanced Export" menu. This will provide you with the URLs and anchor text/alt text for all links on those pages. You can then use a filter on the "Destination" column of the CSV to determine if your site is linked and what anchor text/alt text is included.
@JustinRBriggs has a nice tidbit on checking infographic backlinks with Screaming Frog. Check out the other 17 link building tools that he mentioned, too.

How to make sure that I'm not part of a link network
Want to figure out if a group of sites are linking to each other? Check out this tutorial on visualizing link networks using Screaming Frog and Fusion Tables by @EthanLyon.

I am in the process of cleaning up my backlinks and need to verify that links are being removed as requested
Set a custom filter that contains your root domain URL, then upload your list of backlinks and run the spider in "List" mode. When the spider has finished crawling, select the "Custom" tab to view all of the pages that are still linking to you.

Bonus Round

Did you know that by right-clicking on any URL in the top window of your results, you could do any of the following?
Copy or open the URL
Re-crawl the URL or remove it from your crawl
Export URL Info, In Links, Out Links, or Image Info for that page
Check indexation of the page in Google, Bing and Yahoo
Check backlinks of the page in Majestic, OSE, Ahrefs and Blekko
Look at the cached version/cache date of the page
See older versions of the page
Validate the HTML of the page
Open robots.txt for the domain where the page is located
Search for other domains on the same IP
Likewise, in the bottom window, with a right-click, you can:
Copy or open the URL in the "To" or "From" column for the selected row

Final Remarks
In closing, I hope that this guide gives you a better idea of
what Screaming Frog can do for you. It has saved me
countless hours, so I hope that it helps you, too!
