How to Use Screaming Frog as a Content Scraper

November 7, 2016 | SEO

Many people use Screaming Frog solely as a means of crawling websites for key bits of SEO information, e.g. page titles, meta descriptions, H1s and H2s. It’s a very effective tool and a staple of many an SEO’s toolset; however, people overlook how good it is at bulk scraping websites for custom information.

If you’ve always wondered how to scrape content from a website in bulk but assumed it’s the realm of someone with programming experience, think again! Here’s an example we’ve put together to show you how easy it can be.

Let’s say you’re building a comparison website and want to collate lots of product information to share with your readers. For example, you’re reviewing laptops and want to build a list of the top Lenovo laptops currently on sale, along with key pieces of the product specification. We’ll take the John Lewis website as the source material:

http://www.johnlewis.com/browse/electricals/laptops-netbooks/laptops/_/N-a8f?No=0&Nrpp=180

This is the category page for the entire laptop range, and it’s what you’ll use as the starting point for the crawl.

(You’ll need to start from a page that contains links to your target pages; otherwise the rules you set up later will stop Screaming Frog from reaching the correct pages.)

 

1. Choose which pages to include in the crawl

To ensure that Screaming Frog only looks at the target Lenovo product pages, rather than crawling every link it finds (as it does by default), you’ll want to create an include rule.

Go to the main ribbon, click on ‘Configuration’ and then ‘Include’, and a text entry box will appear.


This feature tells Screaming Frog which URLs to follow. The best way to define it is with a regular expression (regex), a rule that lets you target multiple URLs sharing certain characteristics. For this example you’ll want to use the rule:

http://www\.johnlewis\.com/lenovo.*/p.*

It can be confusing at first, but this example essentially says “crawl any URL on www.johnlewis.com whose path starts with ‘lenovo’ and later contains ‘/p’”. The ‘/p’ section indicates a product URL, the page that holds all the key information.

The ‘.’ matches any single character and the ‘*’ means “zero or more of the preceding character”, so ‘.*’ effectively acts as a catch-all for any run of characters before or after the bits you care about.

The ‘\’ escapes special characters, which lets you keep characters in your regex string that would otherwise have a regex meaning of their own. We’ve done this with the ‘.’ characters in the domain http://www.johnlewis.com, as we want them treated literally rather than as “any character”.

There are plenty of regex tutorials out there that can easily be found with a Google search, but we’ve always found that testing regex is the best way to learn. The tool http://regexr.com/ is great for this sort of thing, as it lets you see in real time how your rule matches different bits of sample content.
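If you prefer to test away from the browser, here’s a minimal sketch using Python’s re module that runs the include rule over a few sample URLs (the URLs themselves are made up for illustration):

import re

# The include rule from above, with the dots in the domain escaped
include_rule = re.compile(r"http://www\.johnlewis\.com/lenovo.*/p.*")

# Hypothetical sample URLs, for illustration only
urls = [
    "http://www.johnlewis.com/lenovo-ideapad-310-laptop/p123456",    # Lenovo product page
    "http://www.johnlewis.com/browse/electricals/laptops-netbooks",  # category page
    "http://www.johnlewis.com/hp-pavilion-14-laptop/p654321",        # different brand
]

for url in urls:
    verdict = "crawl" if include_rule.search(url) else "skip"
    print(f"{verdict:>5}  {url}")

Only the first URL should come back as ‘crawl’; the category page and the non-Lenovo product get skipped, which is exactly the behaviour you want from the include rule.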

 

2. Crawl settings

Before testing your crawl there are two more settings you’ll want to check.
Go to ‘Configuration’ and then ‘Spider’ and, on the ‘Basic’ tab, tick ‘Crawl Outside Of Start Folder’. Then, on the ‘Advanced’ tab, tick ‘Respect Canonical’.


The first setting allows the crawler to access URLs outside the folder of your starting URL; the second makes it crawl only canonical URLs, ignoring duplicates and reducing the amount of clean-up work later on.

 

3. Do a test crawl

Once that’s set up, do a test run of your crawl. It may take a few minutes, but you should end up with a list of around 40 URLs, all of them Lenovo product pages. If that isn’t the case, run back through the earlier steps until you’re collecting the right data.

 

4. Target the content

Now that you know you’re crawling the right pages, it’s time to look at one of the product pages and find the information you want to scrape. We’re looking for the containing tag that the key information sits in, and Chrome’s Inspect tool is the best tool for the task.

If the site is well designed and uses a template for its pages, a single rule will be able to grab content from the same area of every product page. There are multiple options within Screaming Frog for selecting and extracting information from a page, but one of the easiest is CSS selectors, which work in exactly the same way as the selectors you’d use to style a page with CSS. If you’re not familiar with this approach, it isn’t a problem: right-click the element in Chrome’s Inspect tool and choose ‘Copy’ and then ‘Copy selector’ to put the selector on your clipboard. I won’t go into the details of CSS selectors here, but again there are plenty of tutorials online.


For this example, we’ll be collecting the product name and description from the product page with the CSS selectors below:

Product name: #prod-title
Product description: #tabinfo-care-info > span
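Before re-crawling, you can sanity-check the selectors outside Screaming Frog if you like. Here’s a minimal sketch using Python with the requests and BeautifulSoup libraries; the product URL is a made-up example, and it assumes the two selectors above still exist in the live markup:

import requests
from bs4 import BeautifulSoup

# Hypothetical product URL for illustration only; swap in a real one from your test crawl
url = "http://www.johnlewis.com/lenovo-ideapad-310-laptop/p123456"

response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# The same CSS selectors that will go into the Screaming Frog extractors
name = soup.select_one("#prod-title")
description = soup.select_one("#tabinfo-care-info > span")

print("Product name:", name.get_text(strip=True) if name else "selector not found")
print("Description:", description.get_text(strip=True) if description else "selector not found")

If a selector comes back empty here, it will almost certainly come back empty in Screaming Frog too, and you’ll need to re-copy it from the Inspect tool.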

Copy the selectors and head back to Screaming Frog. Go to the ‘Configuration’ drop-down, then click on ‘Custom’ and then ‘Extraction’.


Select ‘CSSPath’ and then name the extractors something relevant to the information being gathered; these names will become your column headers in any exports. Make sure you’re extracting text (using the drop-down on the right) so that you don’t have to strip HTML code out of your Excel doc later on.

Press OK and then re-crawl the website.

Head to the ‘Custom’ tab and change the drop-down to ‘Extraction’. You’ll see a list of your URLs and the content that has been scraped from each page.


You can export this in Excel format and you’ll have a list of all the Lenovo laptops on the website, including the product names and product descriptions.

This is just one example of how you can use Screaming Frog to scrape content, but there are many more. Once you’ve tried it out a few times you’ll realise there are plenty of other applications for this sort of technique, and it can be extremely useful for SEO work, especially on websites with large numbers of pages.

We’ve used this for a client that had thousands of landing pages that needed to be reviewed: some had optimised text copy and some didn’t, and we needed to identify those content gaps as well as provide recommendations for improving the existing copy.

By creating a simple include rule, we crawled each of the landing pages and scraped the content where it was present. We then used an Excel formula to check for instances of certain keywords within the scraped copy, letting us see at a glance whether there were optimisation opportunities on particular pages. This was far quicker and more efficient than manually reviewing each page.
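We did that check with an Excel formula, but the same idea is easy to script if you prefer. Here’s a minimal Python sketch that runs over an exported CSV; the file name, column label and keywords are all made up, so adjust them to match your own export and extractor names:

import csv

EXPORT_FILE = "landing_page_extraction.csv"   # hypothetical export file name
COPY_COLUMN = "Landing Page Copy 1"           # hypothetical extractor label
KEYWORDS = ["lenovo", "laptop", "sale"]       # example keywords only

with open(EXPORT_FILE, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row.get("Address", "")          # the URL column in the Screaming Frog export
        copy = (row.get(COPY_COLUMN) or "").lower()
        if not copy.strip():
            print("no copy found:", url)
            continue
        missing = [kw for kw in KEYWORDS if kw not in copy]
        if missing:
            print("missing", ", ".join(missing) + ":", url)

Pages flagged with ‘no copy found’ are your content gaps, and pages with missing keywords are the candidates for improved copy.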

Hopefully that shows you how simple it can be to use Screaming Frog for content scraping. There are lots of great uses for this technique, so if you have any of your own, let us know and share them in the comments!