
How Custom Extractions Can Level Up Your Reporting

Author: Michelle Race

Last updated: 24/01/2024

If you’ve ever had to review a website for a technical audit, content review or optimisation project, it’s likely that you’ve used a website crawler to perform the task. A crawler will crawl your website, following all links from the homepage, and report back on the URLs, page elements and issues found. Out of the box, crawlers are designed to report on these common aspects to suit a wide range of websites and industries, so you would expect to see page titles, descriptions, H1s, status codes and indexation status as standard.

While these standard reports are extremely useful to have, you may find yourself wishing that your crawler reported on more specific parts of your website. Each website is unique, and there are often distinct metrics and areas to cover based on your role or SEO specialty, whether you are a tech SEO, data analyst or content writer.

Luckily, most crawlers have a useful feature that allows you to pull additional information from web pages during a crawl, commonly known as a custom extraction. The beauty of custom extractions is that you can tailor them to extract the data that is valuable to you and your team, and use them alongside your normal crawl data.

In this article, I’ll share some custom extraction ideas, highlight the benefits these can bring, and show how custom extractions can help you level up your SEO reporting and analysis.

What are custom extractions?

A custom extraction is a way to scrape or extract specific data from a website at scale. The extractions you implement may be useful for just a single crawl or could be ongoing elements that you wish to keep track of. 

Here are some use cases for custom extractions:

Extracting content and links

The most common reason to extract particular elements or sections on the pages you’re crawling is so the extraction data can be reviewed and used for analysis.

Examples:

  • Analyse Content: Checking custom content at the top/bottom of e-commerce category pages

  • Check Schema: Examining schema elements like ‘datePublished’ timestamps

  • Identify Links: Pulling all links within a specified section such as the main navigation

Checking for the existence of an element

You can use custom extractions to check that an element exists on the pages being crawled, which is useful for auditing purposes.

Examples:

  • Tag Tracking: Extract the tag manager or analytics code to make sure it’s on every page

  • Test Rendering: Extract a certain content area to check that it is being rendered for search engines

  • Segmentation: Extract a unique template class or ID that can be used to segment a web crawl when there are no identifiers within the URL, e.g. id="PDP" on product pages

Checking for errors or issues

Sometimes there can be particular errors on your website that you may want to track via an extraction. This is especially useful if those URLs don’t return a 4xx status code, which is what would normally flag them as error pages in a crawler.

Examples:

  • Flag soft 404s: If some of your pages return a 200 status but display an error message, you may want to extract the error wording, e.g. ‘This page cannot be found’, and track error pages this way.

  • Find empty or broken category pages: Similar to the above, if there are concerns that some product category pages are empty, you can extract the message that is displayed on the page, e.g. ‘No products could be found’.

Save time and resources

You can also use custom extractions to save time and resources. 

Examples:

  • Identify certain subdomain or external links: If you wish to find links to a particular subdomain or external domain but don’t want to allow crawling of all external URLs, you can use an extraction to identify pages linking to just that domain or subdomain instead.

  • Identify image links or sizes: You might want to find pages that contain a specific banner image, or check that a template only contains images of a certain size. Using a custom extraction that pulls out the image URL or the image size you are looking for saves having to crawl all images and sort through the data afterwards.

How to set up custom extractions

The most common way to set up custom extractions is to use Regex, XPath or CSS Selectors; however, crawlers vary, so it’s best to check the documentation of whichever tool you’ll be using. I highly recommend looking at some online training courses and guides (e.g. Regex, XPath) to help you get up to speed with the syntax and methods.

Once you know what you want to extract, it’s time to look at the code and identify patterns and unique parts that relate to it. If you are using Chrome, right click on the page element, select ‘Inspect’ and it will show you the HTML.

Some elements, like schema, are not visible on the page and can only be seen in the HTML. For example, say you wanted to capture datePublished schema from your blog posts using regex, and the schema looked like this within the HTML:

"datePublished":"2022-09-26T22:18:45+00:00"

Your regex could then be as simple as this:

"datePublished":"(.*?)"

(the part between the brackets tells the crawler to capture everything within those quotation marks as the custom extraction).
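If you want to sanity-check a pattern like this before adding it to your crawler, a quick local test helps. Below is a minimal sketch using Python’s re module against an example snippet of HTML; the snippet and values are purely illustrative.

import re

# Illustrative snippet of JSON-LD as it might appear in a blog post's HTML
page_html = '{"@type":"BlogPosting","datePublished":"2022-09-26T22:18:45+00:00"}'

# The same pattern you would paste into the crawler's custom extraction field
pattern = r'"datePublished":"(.*?)"'

match = re.search(pattern, page_html)
if match:
    print(match.group(1))  # 2022-09-26T22:18:45+00:00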
 

If you wish to use XPath or Selectors but don’t know where to start, you can use Chrome DevTools to check the elements that you need. 

Go to the page, right-click and choose ‘Inspect’, which will open up the Elements view. Select the element you want the XPath for, right-click on it and select ‘Copy’; this will open up a set of options (such as Copy XPath or Copy selector) which can help you generate what you need. You will often still need to make this more specific, but it can set you on the right path.

Testing your custom extractions

There are free online tools you can use to help build your custom extractions, such as RegExr for Regex.

If you are using XPath or Selectors and want to quickly test on a per-page basis, you can use Chrome DevTools. Go to the Elements view (right-click and select Inspect), then press Ctrl + F to bring up the search bar and paste in your XPath/Selectors. Unfortunately you can’t use this to test Regex, but there are additional Chrome plugins that can help with this.

If you’re on a page that should match, it will highlight the element found when your XPath or Selector is pasted in. If you match more results than expected, it could be a sign that your extraction is too broad and needs to be refined. 

I find that the best way to check your custom extraction is working correctly is to identify some URLs that should match, and some URLs that shouldn’t, based on what you are testing. You should then crawl just those URLs and check the output. The reason to test URLs that shouldn’t match is that, if you are setting up an extraction to track a certain template such as a PDP and every URL comes back with a match, you’ll know the identifier isn’t unique enough. Doing this testing will prevent problems later.
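As a rough illustration of that workflow, here is a small sketch using Python’s requests library: it fetches a couple of hypothetical URLs (one that should match a PDP identifier and one that shouldn’t) and checks whether the pattern behaves as expected. The URLs and the id="PDP" identifier are assumptions for the example, not real values.

import re
import requests

# Hypothetical test URLs: the first should match the PDP identifier, the second should not
test_urls = {
    "https://www.example.com/product/blue-widget": True,
    "https://www.example.com/category/widgets": False,
}

# Hypothetical unique identifier for the product detail page template
pattern = re.compile(r'id="PDP"')

for url, should_match in test_urls.items():
    page_html = requests.get(url, timeout=10).text
    matched = bool(pattern.search(page_html))
    result = "OK" if matched == should_match else "check the extraction"
    print(f"{url}: matched={matched}, expected={should_match} -> {result}")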

Another thing to be aware of is that some websites use dynamic serving, where different HTML is provided depending on the user-agent. If a website returns the Vary: User-Agent HTTP header, different HTML could be served for a mobile versus desktop user-agent on the same URL. If this is the case, make sure that your extraction is based on the version you are crawling, as the class name or identifier could change between the two; for example, if you are crawling with a smartphone user-agent, base the extraction on the mobile HTML. You can check this by using a plugin or online tool that shows HTTP headers. The Robots Exclusion Checker by Sam Gibson is a fantastic Chrome plugin that, among its many SEO uses, shows HTTP headers.
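If you prefer to check this outside the browser, the sketch below uses Python’s requests library to look at the Vary header and compare the HTML served to a desktop and a mobile user-agent. The URL, the user-agent strings and the class name being searched for are all placeholders for the example.

import requests

url = "https://www.example.com/"  # placeholder URL

# Shortened desktop and mobile user-agent strings for the comparison
desktop_ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
mobile_ua = {"User-Agent": "Mozilla/5.0 (Linux; Android 13; Pixel 7)"}

# A Vary: User-Agent header suggests the HTML can differ by device
response = requests.head(url, headers=desktop_ua, timeout=10)
print("Vary header:", response.headers.get("Vary"))

# Check that the identifier your extraction relies on exists in both versions
identifier = 'class="pdp-description"'  # placeholder identifier
desktop_html = requests.get(url, headers=desktop_ua, timeout=10).text
mobile_html = requests.get(url, headers=mobile_ua, timeout=10).text
print("Identifier in desktop HTML:", identifier in desktop_html)
print("Identifier in mobile HTML:", identifier in mobile_html)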

It can take a few tests to ensure that everything works as expected but it will be worth it in the long run. 

Different custom extraction use cases by industry

Not sure where to start? Here are some custom extraction ideas split out by industry. These examples have been invaluable to me in the past when creating audits and helping to support business decisions.  

Ecommerce 

Extract stock status of PDPs (Product Detail Pages)

Extract the text that appears for stock status. You could choose to focus just on out of stock products, or also include in stock and discontinued products. Use the relevant sold out text, e.g. ‘Sold Out’ or ‘Out of Stock’, any unique identifiers, or extract the relevant availability field in the product schema, e.g. "availability":"http://schema.org/OutOfStock".

Tip: you may need to add extra checks to make sure the extraction only pulls out PDPs and not PLPs if you are extracting just the text, as the same wording could also appear in the listings.
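As an example of the schema approach, a regex that captures the availability value could look like the sketch below, tested here with Python’s re module against an illustrative JSON-LD snippet (the markup is not from a real site).

import re

# Illustrative Offer schema snippet as it might appear in a PDP's JSON-LD
page_html = '{"@type":"Offer","price":"49.99","availability":"http://schema.org/OutOfStock"}'

# Capture the availability value so in stock, out of stock and discontinued can all be reported
pattern = r'"availability":"https?://schema\.org/(\w+)"'

match = re.search(pattern, page_html)
if match:
    print(match.group(1))  # OutOfStock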

Why is this useful?

It allows you to:

  • Check how out of stock pages are being linked to internally

  • Check how much traffic is going to these pages

  • See whether out of stock URLs are linked in the XML sitemaps

  • Track high value products over time to see if or how often they go in and out of stock

Identify empty PLPs (Category Listing Pages)

Extract the text that appears if a listing page is empty, e.g. ‘no products found’.

Tip: if you are just adding in the text as your extraction, make sure that the casing used exactly matches what is on the page.

Why is this useful?

It allows you to:

  • See how many empty category pages there are and how they are being linked to; for example, are they linked to more frequently than non-empty category pages?

  • Track these pages as they could be seen as soft 404s

  • Check how much traffic is going to these pages

  • Check that these empty categories are actually empty and it’s not a bug

Extract the number of products within a PLP

If there is a count somewhere on the page, such as ‘x items’ surrounded by a div with a specific class, you can extract that element directly.

Or, if the count isn’t within its own tag, like <p>Showing 10 of 200</p>, the regex used could be: <p>Showing \d+ of (\d+)<\/p>.
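To check that kind of pattern quickly, here is a minimal sketch with Python’s re module against an illustrative fragment of listing page markup (the escaped slash before </p> is optional in most regex flavours, although some crawler fields may expect it).

import re

# Illustrative listing page fragment containing the product count
page_html = "<div class='listing-header'><p>Showing 10 of 200</p></div>"

# Capture the total number of products in the category
pattern = r"<p>Showing \d+ of (\d+)</p>"

match = re.search(pattern, page_html)
if match:
    print(int(match.group(1)))  # 200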

Why is this useful?

It allows you to:

  • Check for pages with either a low or high number of products (you may want to recommend reducing the number of low product categories by combining them with other categories, and/or further dividing categories with a high number of products).

Extract unique PLP content

Find the class or ID of the div surrounding the unique content on the page.

Tip: if there is a class with random-looking letters and numbers appended to the end, such as description_JnSzY, this class may update to something different at a later date and your extraction will no longer work. If you feel this may be the case with your extraction, you may want to look for a different identifier or just match on the first part, such as description_.
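If you go the XPath route, matching on the stable first part of the class can be done with starts-with(). Below is a minimal sketch using the lxml library as a local stand-in for the crawler; the markup and class name are illustrative, and the XPath expression itself is what you would paste into your tool.

from lxml import html

# Illustrative markup where the content div has a build-generated suffix on its class
page = html.fromstring('<div class="description_JnSzY"><p>Category intro copy.</p></div>')

# Match on the stable prefix so the extraction survives a suffix change
for div in page.xpath('//div[starts-with(@class, "description_")]'):
    print(div.text_content().strip())  # Category intro copy.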

Why is this useful?

It allows you to:

  • Review the content added to the page, check for duplicates, and identify categories that don’t currently contain content

Publishing

Extract published and/or modified dates for articles

Extract the schema element for ‘datePublished’ and/or ‘dateModified’, or extract the on-page elements.

Why is this useful?

It allows you to:

  • Check that the schema information matches the on-page information

  • Analyse traffic for older articles

  • Identify any articles which might benefit from being updated

Extract related tags or categories

Extract the content of the <div> using the identifier, class or ID for the tags.

Why is this useful?

It allows you to check the categories/tags are being utilised appropriately.

Identify article types & templates

Find a unique identifier for each type, e.g. how-to, video or listicle.

Why is this useful?

You can use this for segmentation or to compare performance and links for these different page types.

Identify pages with subscription/paywalled content

Look for a unique identifier such as a class, ID or unique text that relates to a gated article, or look for the schema "isAccessibleForFree": "False", which is the markup recommended for paywalled content. Keep in mind that some websites may have bot behaviour in place that keeps the page accessible to bots but not users.

Why is this useful?

It allows you to:

  • Analyse performance or linking to gated articles

  • Identify pages to check for potential issues with Googlebot accessing the content

Technology

Check CTA (call to action) text on buttons

Extract the text of the buttons used within CTAs by pulling out the <button> text.
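For example, an XPath such as //button//text() would pull the visible text from every button on the page. The sketch below tests it locally with the lxml library; the CTA markup is made up for illustration.

from lxml import html

# Illustrative CTA markup using <button> rather than a crawlable <a href> link
page = html.fromstring('<div class="cta"><button class="cta-button">Start your free trial</button></div>')

# Pull the visible text of every button for review
for text in page.xpath("//button//text()"):
    print(text.strip())  # Start your free trial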

Why is this useful?

  • Not all CTAs are links, so not all of them will be picked up by crawlers. Google and crawlers only follow links using <a href> and not <button>. If a website is using <button> within CTAs, an extraction will still enable you to pull the text of these for review.

  • It can also be useful to identify where <button> is being used, in case it’s supposed to be a crawlable link.

Look for topics or keywords within text content

Get a list of topics and/or keywords and include those words within your extraction.

Why is this useful?

  • Identifying topic or keyword mentions on these pages can be a great way of finding internal linking opportunities. 

Further Ideas 

Extract breadcrumbs

Extract the breadcrumb elements using the class/IDs, or use the schema if present.

Why is this useful?

It allows you to:

  • Capture the primary category structure of your pages, which is very useful if a page can belong to two or more categories

  • Check for errors within the generation of the breadcrumbs

Identify areas of the website built on different platforms

Look for a class or identifier you can extract within the HTML, like the platform name, e.g. WordPress: <meta name="generator" content="WordPress 6.0.2">.
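A simple XPath for this would be //meta[@name="generator"]/@content, which returns the content attribute of the generator tag. The sketch below tests it with the lxml library against an illustrative <head> snippet.

from lxml import html

# Illustrative <head> snippet containing the generator meta tag mentioned above
page = html.fromstring('<html><head><meta name="generator" content="WordPress 6.0.2"></head><body></body></html>')

# The content attribute identifies the CMS behind the page
print(page.xpath('//meta[@name="generator"]/@content'))  # ['WordPress 6.0.2']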

Why is this useful?

Identifying pages built with a different CMS allows you to determine whether an issue is sitewide or platform-specific.

Find links to development and/or staging websites

Get a list of known development websites and use an extraction method to pull the link and anchor text into an extraction.
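For instance, if your development environments sit on a known subdomain, a regex along the lines of the sketch below would capture both the link and its anchor text. The staging.example.com domain and the markup are placeholders; swap in your own list of development hostnames.

import re

# Illustrative markup containing a stray link to a staging environment
page_html = '<a href="https://staging.example.com/new-range">See the new range</a>'

# Capture the development URL and its anchor text in one extraction
pattern = r'<a href="(https?://(?:staging|dev)\.example\.com[^"]*)"[^>]*>(.*?)</a>'

match = re.search(pattern, page_html)
if match:
    print(match.group(1), "|", match.group(2))  # https://staging.example.com/new-range | See the new range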

Why is this useful?

Links to development domains can be classed as external URLs or disallowed by robots.txt, making them hard to spot. A custom extraction can highlight these links quickly, making issues faster to fix.

Track appearance of widgets, e.g. Latest News or Related Categories

Find a unique class or ID of each widget and extract that within the crawl.

Why is this useful?

It allows you to:

  • See which pages these widgets appear on and whether this could be improved

  • See if there are pages that should have the widgets but currently don’t

Reporting using custom extractions

Now that your custom extractions are set up and bringing back data, you can start using the results alongside your standard crawl results.

Segmentation or Tasks

If your crawler supports segmentation or something akin to task creation, you could choose to use your custom extractions to improve your issue reporting. 

For example:

  • If you have a custom extraction to track soft 404 errors, you could set up a task or segment to track these with each crawl. 

  • If you are tracking categories with few products, you could create a segment which only includes URLs with a specific product count, e.g. 1-3, so you can track the overall number of low-product URLs

  • If you are tracking website elements such as schema or templates and, in one crawl, the extraction suddenly returns empty, it means that either the extraction no longer matches, or changes have been made to the website that may need to be investigated. Having these extractions tied to an alert, task or segment can help raise the alarm sooner.

Help support SEO initiatives by combining extractions with other sources

Your custom extractions can also help to support new SEO initiatives that you may be considering. Here are a couple of examples: 

Evaluate adding ‘noindex’ to out of stock PDPs

It might be that you are considering adding a noindex tag to out of stock PDPs. Here you can use your out of stock extraction to identify affected PDPs. With Search Console and GA sources added to the crawl, you can see both the traffic, and internal links to these out of stock pages. 

The outcome of this may be that you decide to only noindex out of stock PDPs after they reach a low threshold of visits or clicks, and that you place out of stock PDPs at the end of category lists so that they are given less prominence. Going forward, this custom extraction can help you to keep track and monitor these pages. 

Improve category content on PLPs

If you are looking to get approval to add content to more PLPs, you could use an extraction to output the section with the custom content. 

Once you have added this, you’ll know which PLPs have existing content and which are missing content. You are then able to incorporate GSC and GA data to look at the performance of the two types of pages. The extraction would also be handy to identify duplicate content issues and where pages have outdated content. 

Using custom extractions as part of your monitoring strategy

Your extractions can also be considered as part of your monitoring strategy. If you are extracting elements that are vital to your website, you may want to consider setting up additional smaller or template-based crawls around them.

For example, if your site has different page templates, it’s a good idea to take a few sample pages from each template and crawl them on a regular cadence along with your extractions. If the extraction monitoring a content element on category pages suddenly comes back empty after a code release, you’ll have an early warning of a potential problem.

Conclusion

Hopefully I’ve provided some new custom extraction ideas that you’d like to try out, but you definitely shouldn’t feel limited to just the examples in this article. The sky really is the limit, and if there is something within your website that a crawler doesn’t natively pull out but is in the rendered HTML then there is a good chance you’ll be able to extract it. Have fun! 


Michelle Race - Senior Technical SEO, Lumar

Michelle is a Senior Technical SEO at Lumar. She loves going down rabbit holes to investigate Tech SEO issues and will often be found outside of work playing console games or curled up with a book.
