Data Quality
Making Science
Olivia Jiménez Delgado
SEO Consultant

An Introduction to regular expressions in SEO

Friday September 28th, 2018
1 min 30 s

What are regular expressions?

Regular expressions, also known as regex or rational expressions, are sequences of characters that define a search pattern. As such, they give us an efficient and flexible method of searching for and recognizing text strings.

Regular expressions allow us to:

  • Quickly analyze large amounts of text for specific character patterns.
  • Extract, edit, replace or delete substrings of text.
  • Add extracted strings to a collection in order to generate a report.

In many cases they become an indispensable tool.
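
To make the three capabilities above concrete, here is a minimal Python sketch using the standard `re` module. The sample text and URLs are made up for illustration and do not come from any real site.

```python
import re

# Hypothetical sample text; the URLs are invented for this example.
text = (
    "Visit https://www.example.com/en/blog/seo-tips or "
    "https://www.example.com/es/blog/consejos-seo for more."
)

# 1) Analyze: find every URL-like substring in the text.
url_pattern = re.compile(r"https?://[^\s]+")
urls = url_pattern.findall(text)

# 2) Replace: rewrite the /es/ language folder as /en/.
normalized = [re.sub(r"/es/", "/en/", url) for url in urls]

# 3) Collect: gather the last path segment (the slug) of each URL for a report.
slugs = [re.search(r"/([^/]+)$", url).group(1) for url in urls]

print(urls)
print(normalized)
print(slugs)
```

The same three steps (find, replace, collect) are what SEO tools perform behind the scenes when they accept a regex in a filter or extraction field.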

Regular expressions in SEO

In SEO, regular expressions can be very useful. With them we can define a pattern and find it quickly in one or several documents. A number of SEO tools on the market support regex, among which we can highlight:

  • Crawlers or crawling tools: Screaming Frog and DeepCrawl, among others.
  • Google Analytics: we can create custom filters to extract traffic from certain pages.
  • Google Sheets: in Google’s own spreadsheets we can use the =REGEXEXTRACT function to extract data from URL strings, among other uses.

Some basic regular expressions

Going a little deeper, this section shows a set of regular expressions that are very useful in SEO. They are effective and save a lot of work, especially when we have to extract specific information from several documents on a site, or when a website is so big that a full crawl is a nightmare and we prefer to crawl a specific path or exclude some paths.

Here are some examples of the use of regular expressions (regex) in crawler tools such as Screaming Frog:

  • If, on our blog https://www.makingscience.com/, we want to crawl only pages that contain ‘/en/’ in the URL path, in Screaming Frog we can go to the top menu, select “Configuration” > “Include”, and enter the following regular expression in the “Include” function: .*/en/.*
    As a result, only the URLs containing that path will be crawled, as can be seen in the following images:

In other words, whatever appears between the two “.*” signs is the text that must appear in the path of the URLs we want to crawl.

Another way to specify this expression, especially when the path we want to crawl appears right after the domain, is to include it in the following way:

https://www.makingscience.com/en/.*
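
The effect of an include pattern can be approximated outside the crawler. The following is a rough Python sketch, assuming the pattern must match the whole URL (which is why `fullmatch` is used); the list of URLs is hypothetical, and only the domain comes from the example above.

```python
import re

# Hypothetical URLs for illustration; not an actual crawl of the site.
urls = [
    "https://www.makingscience.com/en/blog/",
    "https://www.makingscience.com/es/blog/",
    "https://www.makingscience.com/en/contact/",
]

# Generic form: "/en/" anywhere in the URL.
include = re.compile(r".*/en/.*")

# Anchored form: "/en/" right after the domain (dots escaped to match literally).
anchored = re.compile(r"https://www\.makingscience\.com/en/.*")

# fullmatch requires the pattern to cover the entire URL.
included = [u for u in urls if include.fullmatch(u)]
included_anchored = [u for u in urls if anchored.fullmatch(u)]

print(included)
print(included_anchored)
```

Both patterns keep the same two `/en/` URLs here. The anchored form is stricter, since it would reject a URL where `/en/` appears deeper in the path or on another domain.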

If we want to select only URLs that contain a certain parameter, we can use the following expressions:

If, instead, we want to discard a set of URLs from the crawl, in the Screaming Frog menu we go to “Configuration” > “Exclude” and, as in the previous case, specify with regex which path we do not want to appear. URLs that match the exclusion will not appear in the crawl at all.

Some use cases can be found below:
To exclude the subdirectory or path “/en/” from our blog (https://www.makingscience.com/en/), we must enter the following syntax in the “Exclude” function: https://www.makingscience.com/en/.*

To exclude a folder or path that appears nested below other folders, we use the following expression: https://www.domain.com/.*/example-path.*

For example, in the following image we show how we could exclude from the crawl all the URLs that are part of the folder “/seo-social-media/”

If we want to exclude from the crawl all the JPG images that appear on our site, the regular expression would be similar to: .jpg$

In the following image it can be seen that, once the images are excluded, none of them has been crawled in Screaming Frog:

  • If we want to exclude pages that contain a specific term in the URL, such as “developer”, the regex would be: .*developer.*
  • If we want to exclude URLs that use the secure protocol (HTTPS), the regular expression would be: .*https.*

And if we want to exclude all pages served over HTTP, the regex would be: http://www.domain.com/.*
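
Several exclusion rules can be tried together before configuring a crawl. This is a minimal Python sketch, again assuming each pattern must match the whole URL; the URLs and the `is_excluded` helper are hypothetical, and the image pattern escapes the dot (`\.jpg`) so that it only matches a literal ".jpg" ending.

```python
import re

# Hypothetical URLs for illustration.
urls = [
    "https://www.domain.com/blog/seo-social-media/post-1",
    "https://www.domain.com/images/logo.jpg",
    "https://www.domain.com/developer/docs",
    "http://www.domain.com/old-page",
    "https://www.domain.com/en/services",
]

# Exclusion rules modeled on the examples in this section.
exclude_patterns = [
    r"https://www\.domain\.com/.*/seo-social-media.*",  # nested folder
    r".*\.jpg$",                                        # JPG images
    r".*developer.*",                                   # term in the URL
    r"http://www\.domain\.com/.*",                      # plain HTTP pages
]
compiled = [re.compile(p) for p in exclude_patterns]


def is_excluded(url):
    """Return True if any exclusion pattern matches the whole URL."""
    return any(p.fullmatch(url) for p in compiled)


kept = [u for u in urls if not is_excluded(u)]
print(kept)
```

Only the last URL survives, because each of the others matches exactly one of the four rules. Testing patterns on a small sample like this is a cheap way to catch an overly broad exclusion before launching a long crawl.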

  • As an example of a more complex regular expression, imagine we have a list of URLs from different domains in Google Sheets and we want to extract only the domains. We can use the following syntax:

=REGEXEXTRACT(A2;"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)")

Next, we show an example of this syntax in Google Sheets, applied to URLs from our own blog, so you can see the result of the process:
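
For readers who want to try the same pattern outside a spreadsheet, here is a rough Python equivalent. The `extract_domain` helper and the sample URLs are assumptions made for illustration. Python does not need the slashes escaped, so `\/` is written as `/`.

```python
import re

# Same pattern as the REGEXEXTRACT formula:
#   optional scheme, optional "user@" part, optional "www.",
#   then capture everything up to the first ":", "/" or newline.
DOMAIN_RE = re.compile(r"^(?:https?://)?(?:[^@\n]+@)?(?:www\.)?([^:/\n]+)")


def extract_domain(url):
    """Return the captured domain, or None if nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


# Hypothetical inputs; only the first domain comes from the article.
samples = [
    "https://www.makingscience.com/en/blog/",
    "http://blog.example.org:8080/path",
    "example.com/page",
]
for url in samples:
    print(url, "->", extract_domain(url))
```

The single capturing group plays the same role as in Google Sheets, where REGEXEXTRACT returns the content of the capture group rather than the whole match.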

Regular expressions cheat sheet

Regular expressions can be more complex, depending on the patterns we are interested in extracting. The following table is a cheat sheet to help you become familiar with the metacharacters most used in regex, so that you can build useful expressions that save time:
