Links are one of the critical SEO factors for a website. When creating or redesigning pages, we cannot ignore website audits, especially finding and tracking broken links regularly. Although there are many online tools, it is still worth learning how to implement such a tool ourselves for fun. To keep the programming simple, I decided to use Python. With its powerful libraries, it is fairly easy to crawl web pages, parse HTML elements, and check server response codes.
How to Check a Website for URL 404 Errors
To be SEO-friendly, a website should provide a sitemap.xml file, which helps search engines crawl all of its valid URLs. Based on that, we can implement the crawler in three steps:
- Read sitemap.xml to extract all web page links.
- Parse the HTML of every web page to collect internal and outbound links from the href attribute.
- Request each link and check the response code.
Installation
Beautiful Soup is a Python library for parsing HTML and XML documents. Install Beautiful Soup with the following command:
pip install beautifulsoup4
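To verify that the installation works, a throwaway snippet (not part of the crawler itself) can parse a tiny HTML fragment and read an href attribute:

from bs4 import BeautifulSoup

# Parse a one-line HTML fragment and print the href of the anchor tag.
soup = BeautifulSoup('<a href="http://example.com">example</a>')
print soup.a.get('href')  # http://example.com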
Implementing a Web Page Crawler in Python
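The snippets below share a few imports, a GAME_OVER sentinel and a build_request() helper that are not shown in this section. As a rough sketch, assuming build_request() simply wraps urllib2.Request with a browser-like User-Agent, the shared setup might look like this:

import signal
import threading
from urllib2 import Request, urlopen, HTTPError, URLError
from bs4 import BeautifulSoup

# Assumption: sentinel returned by the crawler methods when the user cancels.
GAME_OVER = 'Game Over!'

def build_request(url):
    # Assumption: the helper only attaches a browser-like User-Agent so that
    # servers are less likely to reject the crawler's requests.
    return Request(url, headers={'User-Agent': 'Mozilla/5.0'})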
Bind a keyboard interrupt handler so the program can be stopped at any time:
def ctrl_c(signum, frame):
    # Set the shared event so running loops can exit, then abort the program.
    global shutdown_event
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

shutdown_event = threading.Event()
signal.signal(signal.SIGINT, ctrl_c)
Read sitemap.xml with Beautiful Soup to collect all page URLs (the snippet is wrapped here in a readPages() function, since the original omits the enclosing definition):
def readPages():
    # Download sitemap.xml and return the URL found in every <loc> tag.
    pages = []
    try:
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        soup = BeautifulSoup(xml)
        urlTags = soup.find_all("url")
        print "The number of url tags in sitemap: ", str(len(urlTags))
        for sitemap in urlTags:
            link = sitemap.findNext("loc").text
            pages.append(link)
        f.close()
    except (HTTPError, URLError) as e:
        print e
    return pages
Parse the HTML content of each page to collect all links that start with http:
def queryLinks(self, result):
    # Extract every href value that points to an absolute http(s) URL.
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')
    for element in elements:
        if shutdown_event.isSet():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            print 'href error!!!'
            continue
    return links

def readHref(self, url):
    # Download the page in 10 KB chunks, then hand the content to queryLinks().
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        while not shutdown_event.isSet():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except (HTTPError, URLError) as e:
        print e
    if shutdown_event.isSet():
        return GAME_OVER
    return self.queryLinks(result)
Send a request for every link and check the response code, logging 404s to a file:
def crawlLinks(self, links, file=None):
    # Request every link; write the ones that return 404 to the report file.
    for link in links:
        if shutdown_event.isSet():
            return GAME_OVER
        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except HTTPError as e:
            status_code = e.code
        except URLError as e:
            print e.reason
        if status_code == 404:
            if file != None:
                file.write(link + '\n')
        print str(status_code), ':', link
    return GAME_OVER
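Finally, a small driver can tie the three steps together. This is only a sketch, under the assumption that the methods above live on a crawler class named LinkChecker (the class definition is not shown in this section):

if __name__ == '__main__':
    checker = LinkChecker()                 # assumed class holding the methods above
    report = open('broken_links.txt', 'w')  # collects every link that returned 404
    for page in readPages():
        if shutdown_event.isSet():
            break
        links = checker.readHref(page)
        if links == GAME_OVER:
            break
        checker.crawlLinks(links, report)
    report.close()

Running the script prints the status code of every link it visits and leaves the 404 URLs in the report file for review.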