How To Find Broken Links Using Selenium WebDriver?

How To Find Broken Links Using Selenium WebDriver?

What thoughts come to mind when you come across 404/Page Not Found/Dead Hyperlinks on a website? Aargh! You would find it annoying when you come across broken hyperlinks, which is the sole reason why you should continuously focus on removing the existence of broken links in your web product (or website). Instead of a manual inspection, you can leverage automation for broken link testing using Selenium WebDriver.

When a particular link is broken and a visitor lands on the page, it affects that page’s functionality and results in a poor user experience. Dead links could hurt your product’s credibility, as it ‘might’ give an impression to your visitors that there is a minimal focus on the experience.

If your web product has many pages (or links) that result in a 404 error (or page not found), the product rankings on search engines (e.g., Google) will also be badly affected. Removal of dead links is one of the integral parts of SEO (Search Engine Optimization) activity.

In this part of the Selenium WebDriver tutorial series, we deep dive into finding broken links using Selenium WebDriver. We have demonstrated broken link testing using Selenium Python, Selenium Java, Selenium C#, and Selenium PHP.

In simple terms, broken links (or dead links) in a website (or web app) are links that are not reachable and do not work as anticipated. The links could be temporarily down due to server issues or wrongly configured at the back end.

Selenium WebDriver

Apart from pages that result in 404 error, other prominent examples of broken links are malformed URLs, links to content (e.g., documents, pdf, images, etc.) that have been moved or deleted.

Here are some of the common reasons behind the occurrence of broken links (dead links or link rots):

  • Incorrect or misspelled URL entered by the user.
  • Structural changes in the website (i.e., permalinks) with URL redirects or internal redirects are not properly configured.
  • Links to content like videos, documents, etc. that are either moved or deleted. If the content is moved, the ‘internal links’ should be redirected to the designated links.
  • Temporary website downtime due to site maintenance making the website temporarily inaccessible.
  • Broken HTML tags, JavaScript errors, incorrect HTML/CSS customizations, broken embedded elements, etc., within the page leading, can lead to broken links.
  • Geolocation restrictions prevent access to the website from certain IP addresses (if they are blacklisted) or specific countries in the world. Geolocation testing with Selenium helps ensure that the experience is tailor-made for the location (or country) from where the site is accessed.

Broken links are a big turn-off for the visitors who land on your website. Here are some of the major reasons why you should check for broken links on your website:

  • Broken Links can hurt the user experience.
  • Removal of broken (or dead) links is essential for SEO (Search Engine Optimization), as it can affect the site’s rankings on search engines (e.g., Google).

Broken links testing can be done using Selenium WebDriver on a web page, which in turn can be used to remove the site’s dead links.

When a user visits a website, a request is sent by the browser to the site’s server. The server responds to the browser’s request with a three-digit code called the ‘ HTTP Status Code.’

An HTTP Status Code is the server’s response to a request sent from the web browser. These HTTP Status Codes are considered equivalent to the conversation between the browser (from which URL request is sent) and the server.

Though different HTTP Status Codes are used for different purposes, most of the codes are useful for diagnosing issues in the site, minimizing site downtime, the number of dead links, and more. The first digit of every three-digit status code begins with numbers 1~5. The status codes are represented as 1xx, 2xx.., 5xx for indicating the status codes in that particular range. As each of these ranges consists of a different class of server response, we would limit the discussion to HTTP Status Codes presented for broken links.

Here are the common status code classes that are useful in detecting broken links with Selenium:

Classes of HTTP Status CodeDescription
1xxThe Server is still thinking through the request.
2xxThe request sent by the browser was successfully completed and expected response was sent to the browser by the server.
3xxThis indicates that a redirect is being performed. For example, 301 redirect is popularly used for implementing permanent redirects on a website.
4xxThis indicates that either a particular page (or complete site) is not reachable.
5xxThis indicates that the server was unable to complete the request, even though a valid request was sent by the browser.

Here are some of the common HTTP Status Codes presented by the web server on encountering a broken link:

Capture.PNG

Irrespective of the language used with Selenium WebDriver, the guiding principles for broken link testing using Selenium remains the same. Here are the steps for broken links testing using Selenium WebDriver:

  1. Use the \< a > tag to collect details of all the links present on the webpage.
  2. Send an HTTP request for every link.
  3. Verify the corresponding response code received in response to the request sent in the previous step.
  4. Validate whether the link is broken or not based on the response code sent by the server.
  5. Repeat steps (2-4) for every link present on the page.

In this Selenium WebDriver tutorial, we would demonstrate how to perform broken link testing using Selenium WebDriver in Python, Java, C#, and PHP. The tests are conducted on (Chrome 85.0 + Windows 10) combination, and the execution is carried out on the cloud-based Selenium Grid provided by LambdaTest.

To get started with LambdaTest, create an account on the platform and note the user-name & access-key available from the profile section on LambdaTest. The browser capabilities are generated using LambdaTest Capabilities Generator.

Here is the test scenario used for finding broken links on a website using Selenium:

Test Scenario

  1. Go to LambdaTest Blog i.e. https://www.lambdatest.com/blog on Chrome 85.0
  2. Collect all the links present on the page
  3. Send HTTP request for each link
  4. Print whether the link is broken or not on the terminal

It is important to note that the time spent in broken links testing using Selenium depends on the number of links present on the ‘web page under test.’ The more the number of links on the page, the more time will be spent finding broken links. For example, LambdaTest has a huge number of links (~150+); hence, the process of finding broken links might take some time (approx a few minutes).

Implementation

package org.selenium4;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
/* import org.openqa.selenium.chrome.ChromeDriver; */
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossBrowserTest
{
    /*  protected static ChromeDriver driver; */
    WebDriver driver = null;
    String URL = "https://www.lambdatest.com/blog";
    public static String status = "passed";
    String username = "user-name";
    String access_key = "access-key";

    String url = "";
    HttpURLConnection urlconnection = null;
    int responseCode = 200;

    @BeforeClass
    public void testSetUp() throws MalformedURLException {
        /*
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        */
        DesiredCapabilities capabilities = new DesiredCapabilities();
        capabilities.setCapability("build", "[Java] Finding broken links on a webpage using Selenium");
        capabilities.setCapability("name", "[Java] Finding broken links on a webpage using Selenium");
        capabilities.setCapability("platform", "Windows 10");
        capabilities.setCapability("browserName", "Chrome");
        capabilities.setCapability("version","85.0");
        capabilities.setCapability("tunnel",false);
        capabilities.setCapability("network",true);
        capabilities.setCapability("console",true);
        capabilities.setCapability("visual",true);

        driver = new RemoteWebDriver(new URL("http://" + username + ":" + access_key + "@hub.lambdatest.com/wd/hub"), capabilities);
        System.out.println("Started session");
    }

    @Test
    public void test_Selenium_Broken_Links() throws InterruptedException {
        driver.navigate().to(URL);
        driver.manage().window().maximize();
        List<WebElement> links = driver.findElements(By.tagName("a"));
        Iterator<WebElement> link = links.iterator();

        /* For skipping email address */
        String mail_to = "mailto";
        String tel ="tel";
        String LinkedInPage = "https://www.linkedin.com";
        int valid_links = 0;
        int broken_links = 0;
        Boolean bLinkedIn = false;
        int LinkedInStatus = 999;

        Pattern pattern = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
        Matcher mat;

        while (link.hasNext())
        {
            url = link.next().getAttribute("href");
            System.out.println(url);
            bLinkedIn = false;

            if ((url == null) || (url.isEmpty()))
            {
                System.out.println("URL is either not configured for anchor tag or it is empty");
                continue;
            }

            /* String str="mailto:support@LambdaTest.com"; */
            if ((url.startsWith(mail_to)) || (url.startsWith(tel)))
            {
                System.out.println("Email address or Telephone detected");
                continue;
            }

            if(url.startsWith(LinkedInPage))
            {
                System.out.println("URL starts with LinkedIn, expected status code is 999");
                bLinkedIn = true;
            }

            try {
                urlconnection = (HttpURLConnection) (new URL(url).openConnection());
                urlconnection.setRequestMethod("HEAD");
                urlconnection.connect();
                responseCode = urlconnection.getResponseCode();
                if (responseCode >= 400)
                {
                    /* https://stackoverflow.com/questions/27231113/999-error-code-on-head-request-to-linkedin */
                    if ((bLinkedIn == true) && (responseCode == LinkedInStatus))
                    {
                        System.out.println(url + " is a LinkedIn Page and is not a broken link");
                        valid_links++;
                    }
                    else
                    {
                        System.out.println(url + " is a broken link");
                        broken_links++;
                    }
                }
                else
                {
                    System.out.println(url + " is a valid link");
                    valid_links++;
                }
            } catch (MalformedURLException e) {
                System.out.println(e.getMessage());
                e.printStackTrace();
            } catch (IOException e) {
                System.out.println(e.getMessage());
                e.printStackTrace();
            }
        }
        System.out.println("Detection of broken links completed with " + broken_links + " broken links and " + valid_links + " valid links\n");
    }

    @AfterClass
    public void tearDown()
    {
        if (driver != null) {
            ((JavascriptExecutor) driver).executeScript("lambda-status=" + status);
            driver.quit();
        }
    }
}

Code WalkThrough

1. Import the required packages

The methods in the HttpURLConnection package are used for sending HTTP requests and capturing the HTTP Status Code (or response).

The methods in the regex.Pattern package check if the corresponding link contains an email address or telephone number using a specialized syntax held in a pattern.

import java.net.HttpURLConnection;
import java.util.regex.Pattern;

2. Collect the links present on the page

The links present on the URL under test (i.e., LambdaTest Blog) are located using tagname in Selenium. The tag name used for identification of the element (or link) is ‘a’.

The links are placed in a list to iterate through the list to check broken links on the page.

List<WebElement> links = driver.findElements(By.tagName("a"));

3. Iterate through the URLs

The Iterator object is used for looping through the list created in Step (2)

Iterator<WebElement> link = links.iterator();

4. Identify and Verify the URLs

A while loop is executed till the time Iterator (i.e., link) does not have more elements to iterate. The ‘href’ of the anchor tag is retrieved, and the same is stored in the URL variable.

while (link.hasNext())
{
   url = link.next().getAttribute("href");

Skip checking the links if:

a. The link is null or empty

if ((url == null) || (url.isEmpty()))
{
 System.out.println("URL is either not configured for anchor tag or it is empty");
continue;
}

b. The link contains mailto or telephone number

if ((url.startsWith(mail_to)) || (url.startsWith(tel)))
{
   System.out.println("Email address or Telephone detected");
   continue;
}

When checking for the LinkedIn page, the HTTP status code is 999. A Boolean variable (i.e., LinkedIn) is set to true to indicate that it is not a broken link.

if(url.startsWith(LinkedInPage))
{
  System.out.println("URL starts with LinkedIn, expected status code is 999");
  bLinkedIn = true;
}

5. Validate the links through the Status Code

The methods in HttpURLConnection class provide the provision for sending HTTP requests and capturing the HTTP Status Code.

The openConnection method of the URL class opens the connection to the specified URL. It returns a URLConnection instance representing a connection to the remote object that is referred by the URL. It is type-casted to HttpURLConnection.

HttpURLConnection urlconnection = null;
..............................................
..............................................
..............................................

urlconnection = (HttpURLConnection) (new URL(url).openConnection());
urlconnection.setRequestMethod("HEAD");

The setRequestMethod in HttpURLConnection class sets the method for URL request. The request type is set to HEAD so that only Headers are returned. On the other hand, request type GET would have returned the document body, which is not required in this particular test scenario.

The connect method in HttpURLConnection class establishes the connection to the URL and sends an HTTP request.

urlconnection.connect();

The getResponseCode method returns the HTTP Status Code for the previously sent request.

responseCode = urlconnection.getResponseCode();

For HTTP Status Code is 400 (or more), the variable containing broken links count (i.e., broken_links) is incremented; else, the variable containing valid links (i.e., valid_links) is incremented.

if (responseCode >= 400)
{
   if ((bLinkedIn == true) && (responseCode == LinkedInStatus))
   {
     System.out.println(url + " is a LinkedIn Page and is not a broken link");
     valid_links++;
   }
   else
   {
     System.out.println(url + " is a broken link");
     broken_links++;
   }
 }
 else
 {
    System.out.println(url + " is a valid link");
    valid_links++;
 }

Execution

For broken links testing using Selenium Java, we created a project in IntelliJ IDEA. The basic pom.xml file was sufficient for the job!

Here is the execution snapshot, which indicates 169 valid links and 0 broken links on the LambdaTest Blog Page.

Selenium Java

The links containing the email addresses and phone numbers were excluded from the search list, as shown below.

screenshot

You can see the test being run in the below screenshot and getting completed in 2 min 35 seconds, as shown on LambdaTest’s automation logs.

LambdaTest’s automation logs

Implementation

import requests
import urllib3
import pytest
from requests.exceptions import MissingSchema, InvalidSchema, InvalidURL
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

capabilities = {
    "build" : "[Python] Finding broken links on a webpage using Selenium",
    "name" : "[Python] Finding broken links on a webpage using Selenium",
    "platform" : "Windows 10",
    "browserName" : "Chrome",
    "version" : "85.0"
}

user_name = "user-name"
app_key = "access-key"
broken_links = 0
valid_links = 0

# options = webdriver.ChromeOptions()
# options.add_argument("start-maximized")
# options.add_argument('disable-infobars')
# driver=webdriver.Chrome(options=options)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
remote_url = "http://" + user_name + ":" + app_key + "@hub.lambdatest.com/wd/hub"
driver = webdriver.Remote(command_executor = remote_url, desired_capabilities = capabilities)
driver.maximize_window()
driver.get('https://www.lambdatest.com/blog/')

# links = driver.find_elements_by_css_selector("a")
links = driver.find_elements(By.CSS_SELECTOR, "a")

for link in links:
    try:
        request = requests.head(link.get_attribute('href'), data ={'key':'value'})
        print("Status of " + link.get_attribute('href') + " is " + str(request.status_code))
        if (request.status_code == 404):
            broken_links = (broken_links + 1)
        else:
            valid_links = (valid_links + 1)
    except requests.exceptions.MissingSchema:
        print("Encountered MissingSchema Exception")
    except requests.exceptions.InvalidSchema:
        print("Encountered InvalidSchema Exception")     
    except:
        print("Encountered Some other execption")

print("Detection of broken links completed with " + str(broken_links) + " broken links and " + str(valid_links) + " valid links")

Code WalkThrough

1. Import Modules

Apart from importing the Python modules for Selenium WebDriver, we also import the requests module. The requests module lets you send all kinds of HTTP requests. It can also be used for passing parameters in URL, sending custom headers, and more.

import requests
import urllib3
from requests.exceptions import MissingSchema, InvalidSchema, InvalidURL

2. Collect the links present on the page

The links present on the URL under test (i.e., LambdaTest Blog) are found by locating the web elements by the CSS Selector “a” property.

links = driver.find_elements(By.CSS_SELECTOR, "a")

Since we want the element to be iterable, we use the find_elements method (and not the find_element method).

3. Iterate through the URLs for validation

The head method of the requests module is used to send a HEAD request to the specified URL. The _get_attribute_ method is used on every link for getting ‘href’ attribute of the anchor tag.

The head method is primarily used in scenarios where only _status_code_ or HTTP headers are required, and contents of the file (or URL) are not needed. The head method returns requests.Response object which also contains the HTTP Status Code (i.e. request.status_code).

for link in links:
    try:
        request = requests.head(link.get_attribute('href'), data ={'key':'value'})
        print("Status of " + link.get_attribute('href') + " is " + str(request.status_code))

The same set of operations are performed iteratively till all the ‘links’ present on the page have been exhausted.

4. Validate the links through the Status Code

If the HTTP response code for the HTTP request sent in step(3) is 404 (i.e., Page Not Found), it means that the link is a broken link. For links that are not broken, the HTTP Status Code is 200.

if (request.status_code == 404):
    broken_links = (broken_links + 1)
else:
    valid_links = (valid_links + 1)

5. Skip irrelevant requests

When applied on links that do not contain the ‘href’ attribute (e.g., mailto, telephone, etc.), the head method results in an exception (i.e., MissingSchema, InvalidSchema).

except requests.exceptions.MissingSchema:
    print("Encountered MissingSchema Exception")
except requests.exceptions.InvalidSchema:
    print("Encountered InvalidSchema Exception")
except:
    print("Encountered Some other execption")

These exceptions are caught, and the same is printed on the terminal.

Execution

We have used the PyUnit (or unittest) here, the default test framework in Python for broken links testing using Selenium. Run the following command on the terminal:

python Broken_Links.py

The execution would take around 2-3 minutes since the LambdaTest Blog page consists of approximately 150+ links. The execution screenshot below shows that the page has 169 valid links and zero broken links.

You would witness the InvalidSchema exception or MissingSchema exception at some places, which indicates that those links are skipped from the evaluation.

InvalidSchema exception

The HEAD request to LinkedIn (i.e.) results in an HTTP Status Code of 999. As stated in this thread on StackOverflow, LinkedIn filters the requests based on the user-agent, and the request resulted in ‘Access Denied’ (i.e., 999 as HTTP Status Code).

HTTP Status Code

We verified whether the LinkedIn link present on the LambdaTest blog page is broken or not by running the same test on the local Selenium Grid, which resulted in HTTP/1.1 200 OK.

Implementation

using OpenQA.Selenium;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Chrome;
using NUnit.Framework;
using System.Threading;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using OpenQA.Selenium.Remote;
using System;
using System.Threading;
using System.Net.Http;
using System.Threading.Tasks;

namespace ParallelLTSelenium
{
    [TestFixture("chrome", "85.0", "Windows 10")]
    public class ParallelLTTests
    {
        private String browser;
        private String version;
        private String os;

        IWebDriver driver;

        public ParallelLTTests(String browser, String version, String os)
        {
            this.browser = browser;
            this.version = version;
            this.os = os;
        }

        [SetUp]
        public void Init()
        {
            String username = "user-name";
            String accesskey = "access-key";
            String gridURL = "@hub.lambdatest.com/wd/hub";

            DesiredCapabilities capabilities = new DesiredCapabilities();

            capabilities.SetCapability("user", username);
            capabilities.SetCapability("accessKey", accesskey);
            capabilities.SetCapability("browserName", browser);
            capabilities.SetCapability("version", version);
            capabilities.SetCapability("platform", os);
            capabilities.SetCapability("build", "[C#] Finding broken links on a webpage using Selenium");
            capabilities.SetCapability("name", "[C#] Finding broken links on a webpage using Selenium");

            driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities, TimeSpan.FromSeconds(600));

            System.Threading.Thread.Sleep(2000);
        }

        [Test]
        public async Task LT_Broken_Links_Test()
        {
            int valid_links = 0, broken_links = 0;
            driver.Url = "https://www.lambdatest.com/blog";
            using var client = new HttpClient();
            var links = driver.FindElements(By.TagName("a"));

            Console.WriteLine("Looking at all the URLs in LambdaTest :");
            /* Loop through all the urls */
            foreach (var link in links)
            {
                if (!(link.Text.Contains("Email") || link.Text.Contains("https://www.linkedin.com") || link.Text == "" || link.Equals(null)))
                {
                    try
                    {
                        /* Get the URI */
                        HttpResponseMessage response = await client.GetAsync(link.GetAttribute("href"));
                        System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");
                        /* Reference - https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebresponse.statuscode?view=netcore-3.1 */
                        if (response.StatusCode == HttpStatusCode.OK)
                        {
                            valid_links++;
                        }
                        else
                        {
                            broken_links++;
                        }
                    }
                    catch (Exception ex)
                    {
                        if ((ex is ArgumentNullException) ||
                           (ex is NotSupportedException))
                        {
                            System.Console.WriteLine("Exception occured\n");
                        }
                    }
                }
            }
            /* Perform wait to check the output */
            System.Threading.Thread.Sleep(2000);
            Console.WriteLine("Detection of broken links completed with " + broken_links + " broken links and " + valid_links + " valid links");
        }

        [TearDown]
        public void Cleanup()
        {
            if (driver != null)
                driver.Quit();
        }
    }
}

Code WalkThrough

The NUnit framework is used for automation testing; our earlier blog on NUnit Test automation with Selenium C# can help you get started with the framework.

1. Include HttpClient

The HttpClient namespace is added for usage through the using directive. The HttpClient class in C# provides a base class for sending HTTP requests and receiving the HTTP response from a resource that is identified by URI.

Microsoft recommends using System.Net.Http.HttpClient instead of System.Net.HttpWebRequest; HttpWebRequest could also be used to detect broken links in Selenium C#.

using System.Net.Http;
using System.Threading.Tasks;

2. Define an async method that returns a task

An async test method is defined as using the GetAsync method that sends a GET request to the specified URI as an asynchronous operation.

public async Task LT_Broken_Links_Test()
{

3. Collect the links present on the page

Firstly, we create an instance of HttpClient.

using var client = new HttpClient();

The links present on the URL under test (i.e., LambdaTest Blog) are collected by locating the web elements by the TagName “a” property.

var links = driver.FindElements(By.TagName("a"));

The _find_elements_ method in Selenium is used for locating the links on the page as it returns an array (or list) that can be iterated to verify the workability of the links.

4. Iterate through the URLs for validation

The links located using the _find_elements_ method are verified in a for loop.

foreach (var link in links)
{

We filter the links that contain /email-addresses/telephone numbers/LinkedIn addresses. The links with no Link Text are also filtered out.

if (!(link.Text.Contains("Email") || link.Text.Contains("https://www.linkedin.com") || link.Text == "" || link.Equals(null)))
{

The GetAsync method of HttpClient class sends a GET request to the corresponding URI as an asynchronous operation. The argument to the GetAsync method is the value of the anchor’s ‘href’ attribute collected using the GetAttribute method.

The evaluation of the async method is suspended by the await operator until the completion of the asynchronous operation. On completion of the asynchronous operation, the await operator returns the HttpResponseMessage that includes the data and status code.

/* Get the URI */
HttpResponseMessage response = await client.GetAsync(link.GetAttribute("href"));
System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");

5. Validate the links through the Status Code

If the HTTP response code (i.e. response.StatusCode) for the HTTP request sent in step(4) is HttpStatusCode.OK (i.e., 200), it means that the request was completed successfully.

System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");
if (response.StatusCode == HttpStatusCode.OK)
{
    valid_links++;
}
else
{
    broken_links++;
}

NotSupportedException and ArgumentNullException exceptions are handled as a part of exception handling.

catch (Exception ex)
{
      if ((ex is ArgumentNullException) ||
         (ex is NotSupportedException))
      {
          System.Console.WriteLine("Exception occured\n");
      }
}

Execution

Here is the execution snapshot, which shows that the test was executed successfully.

selenium webdriver

Exceptions have occurred for links to the ‘share icons,’ i.e., WhatsApp, Facebook, Twitter, etc. Apart from these links, the rest of the links on the LambdaTest blog page return HttpStatusCode.OK (i.e. 200).

HttpStatusCode

Implementation

<?php
require 'vendor/autoload.php';

use PHPUnit\Framework\TestCase;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;

# accessKey:  AccessKey can be generated from automation dashboard or profile section
$GLOBALS['LT_USERNAME'] = "user-name";
$GLOBALS['LT_APPKEY'] = "access-key";

function check_nonlinks($test_url, $test_pattern)
{
    if (preg_match($test_pattern, $test_url) == false)
    { 
        return false;
    }
    else
    { 
        return true;
    } 
}

class BrokenLinksTest extends TestCase
{
  protected $webDriver;

  public function build_browser_capabilities(){
    /* $capabilities = DesiredCapabilities::chrome(); */
    $capabilities = array(
      "build" => "[PHP] Finding broken links on a webpage using Selenium",
      "name" => "[PHP] Finding broken links on a webpage using Selenium",
      "platform" => "Windows 10",
      "browserName" => "Chrome",
      "version" => "85.0"
    );
    return $capabilities;
  }

  public function setUp(): void
  {
    $url = "https://". $GLOBALS['LT_USERNAME'] .":" . $GLOBALS['LT_APPKEY'] ."@hub.lambdatest.com/wd/hub";
    $capabilities = $this->build_browser_capabilities();
    /* $this->webDriver = RemoteWebDriver::create('http://localhost:4444/wd/hub', $capabilities); */
    $this->webDriver = RemoteWebDriver::create($url, $capabilities);
  }

  public function tearDown(): void
  {
    $this->webDriver->quit();
  }

  /*
  * @test
  */ 
  public function test_Broken_Links()
  {
    $test_url = "https://www.lambdatest.com/blog";
    $pattern_1 = '/\baddtoany\b/';
    $pattern_2 = '/\bmailto\b/';
    $pattern_3 = '/\blinkedin\b/';
    $title = 'LambdaTest | A Cross Browser Testing Blog';

    $driver = $this->webDriver;
    $driver->get($test_url);
    $driver->manage()->window()->maximize();
    $this->assertEquals($title, $driver->getTitle());

    $brokenlinks = 0;
    $validlinks = 0;

    /* file_get_contents is used to get the page's HTML source */
    $html = file_get_contents($test_url);

    /* Instantiate the DOMDocument class */
    $htmlDom = new DOMDocument;

    /* The HTML of the page is parsed using DOMDocument::loadHTML */
    @$htmlDom->loadHTML($html);

    /* Extract the links from the page */
    $links = $htmlDom->getElementsByTagName('a');

    /* The DOMNodeList object is traversed to check for its validity */
    foreach($links as $link)
    {
        /* Get the Link Text */
        $linkText = $link->nodeValue;
        /* Get the details of the link in the HREF attribute */
        $linkHref = $link->getAttribute('href');
        /* print($linkHref); */
        /* print("\n"); */

        /* Skip the link if it is empty */
        if(strlen(trim($linkHref)) == 0)
        {
            continue;
        }

        /* Skip the link if it is a hashtag or anchor link */
        if($linkHref[0] == '#')
        {
            continue;
        }

        /* Skip if the link is an email address or telephone number or does not contain the URL (e.g. we have skipped addtoany) */
        if ((check_nonlinks($linkHref, $pattern_1))||(check_nonlinks($linkHref, $pattern_2)))
        {
            print("\nAdd_To_Any or email encountered");
            continue;
        }

        /* curl_init() for initializing a cURL session */
        $curl = curl_init($linkHref); 

        /* curl_setopt() is used for setting an option for cURL transfer  */
        curl_setopt($curl, CURLOPT_NOBODY, true); 

        /* curl_exec() for performing a cURL session */
        $result = curl_exec($curl);

        if ($result !== false)
        {
            /* curl_getinfo() to get information about the transfer */ 
            $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);  
            if ($statusCode >= 400)
            { 
                /* Return code for LinkedIn Page is 999 */
                /* https://stackoverflow.com/questions/27231113/999-error-code-on-head-request-to-linkedin */
                $linkedin_page_status = check_nonlinks($linkHref, $pattern_3);
                /* We skip the test for LinkedIn as it can only be verified on local Selenium Grid */
                if (($linkedin_page_status) && ($statusCode == 999))
                {
                    print("\nLink " . $linkHref . " is LinkedIn Page and status is " .$statusCode);
                    $validlinks++;
                }
                else
                {
                    print("\nLink " . $linkHref . " is broken link and status is " .$statusCode);
                    $brokenlinks++;
                }
            }
            else
            { 
                print("\nLink " . $linkHref . " is a valid link and status is " . $statusCode);
                $validlinks++;
            } 
        }
    }
    print("\n\nDetection of broken links completed with " . $validlinks . " valid links and " . $brokenlinks . " broken links");
  }
}
?>

Code WalkThrough

1. Read the page source

The file_get_contents function in PHP is used for reading the page’s HTML source into a String variable (e.g. $html).

$test_url = "https://www.lambdatest.com/blog";
$html = file_get_contents($test_url);

2. Instantiate the DOMDocument class

The DOMDocument class in PHP represents an entire HTML document and serves as the document tree’s root.

$htmlDom = new DOMDocument;

3. Parse HTML of the page

The DOMDocument::loadHTML() function is used for parsing the HTML source that is contained in $html. On successful execution, the function returns a DOMDocument object.

@$htmlDom->loadHTML($html);

4. Extract the links from the page

The links present on the page are extracted using the getElementsByTagName method of DOMDocument class. The elements (or links) are searched based on the ‘a’ tag from the parsed HTML source.

The getElementsByTagName function returns a new instance of DOMNodeList which contains the elements (or links) of local tag name (i.e. \< a > tag)

$links = $htmlDom->getElementsByTagName('a');

5. Iterate through the URLs for validation

The DOMNodeList, which was created in Step (4), is traversed for checking the validity of the links.

foreach($links as $link)
{
       $linkText = $link->nodeValue;

The details of the corresponding link are obtained using the ‘href’ attribute. The GetAttribute method is used for the same.

$linkHref = $link->getAttribute('href');

Skip checking the links if:

a. The link is empty

if(strlen(trim($linkHref)) == 0)
{
   continue;
}

b. The link is a hashtag or an anchor link

if($linkHref[0] == '#')
{
   continue;
}

c. The link contains mailto or addtoany (i.e., social sharing options).

function check_nonlinks($test_url, $test_pattern)
{
    if (preg_match($test_pattern, $test_url) == false)
    { 
        return false;
    }
    else
    { 
        return true;
    } 
}

public function test_Broken_Links()
{
    $pattern_1 = '/\baddtoany\b/';
    $pattern_2 = '/\bmailto\b/';

    ....................................................................
    ....................................................................
    ....................................................................

if ((check_nonlinks($linkHref, $pattern_1))||(check_nonlinks($linkHref, $pattern_2)))
{
   print("\nAdd_To_Any or email encountered");
   continue;
}
    ....................................................................
    ....................................................................
    ....................................................................
}

_preg_match_ function uses a regular expression (regex) for performing a case-insensitive search for mailto and addtoany. The regular expressions for mailto & addtoany are ‘/\bmailto\b/’ & ‘/\baddtoany\b/’ respectively.

6. Validate the HTTP Code using cURL

We use curl to get information regarding the status of the corresponding link. The first step is initializing a cURL session with the ‘link’ on which validation has to be done. The method returns a cURL instance that will be used in the latter part of the implementation.

$curl = curl_init($linkHref);

The _curl_setopt_ method is used for setting options on the given cURL session handle (i.e. $curl).

curl_setopt($curl, CURLOPT_NOBODY, true);

The _curl_exec_ method is called for execution of the given cURL session. It returns True on successful execution.

$result = curl_exec($curl);

This is the most important part of the logic that checks for broken links on the page. The _curl_getinfo_ function that takes the cURL session handle (i.e. $curl) and CURLINFO_RESPONSE_CODE (i.e. CURLINFO_HTTP_CODE) are used for getting information about the last transfer. It returns HTTP Status Code in response.

$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);

On successful completion of the request, HTTP Status Code of 200 is returned, and the variable holding the valid links count (i.e., $valid_links) is incremented. For links that result in the HTTP Status Code of 400 (or more), a check is performed if the ‘link under test’ was LambdaTest’s, LinkedIn Page. As mentioned earlier, the LinkedIn page’s status code will be 999; hence, $valid_links is incremented.

For all the other links that returned HTTP Status Code of 400 (or more), the variable holding the broken links count (i.e., _$broken_links_) is incremented.

if (($linkedin_page_status) && ($statusCode == 999))
{
 print("\nLink " . $linkHref . " is LinkedIn Page and status is " .$statusCode);
 $validlinks++;
}
else
{
 print("\nLink " . $linkHref . " is broken link and status is " .$statusCode);
 $brokenlinks++;
}

Execution

We use the PHPUnit framework for testing for broken links on the page. For downloading the PHPUnit framework, add the file composer.json in the root folder and run composer require on the terminal.

{
   "require":{
      "php":">=7.1",
      "phpunit/phpunit":"^9",
     "phpunit/phpunit-selenium": "*",
      "php-webdriver/webdriver":"1.8.0",
      "symfony/symfony":"4.4",
     "brianium/paratest": "dev-master"
   }
}

Run the following command on the terminal to check broken links in Selenium PHP.

vendor\bin\phpunit tests\BrokenLinksTest.php

Here is the execution snapshot that shows a total of 116 valid links and 0 broken links on the LambdaTest Blog. As links for social sharing (i.e., addtoany) and email address are ignored, the total count is 116 (169 in the Selenium Python test).

execution snapshot

Conclusion

Broken links, also called dead links or rot links, can hinder the user experience if they are present on the website. Broken links can also impact the rankings on search engines. Hence, broken link testing should be carried periodically for activities related to website development and testing.

Rather than relying on third-party tools or manual methods for checking broken links on a website, broken links testing can be done using Selenium WebDriver with Java, Python, C#, or PHP. The HTTP Status Code, returned when accessing any web page, should be used to check broken links using the Selenium framework.