Links are one of the critical SEO factors for a website. When creating or redesigning pages, we cannot ignore website audits, especially finding and tracking broken links regularly. Although there are many online tools, it is still worth learning how to implement such a tool ourselves for fun. To keep the programming simple, I decided to use Python. With its powerful libraries, it is fairly easy to crawl web pages, parse HTML elements, and check server response codes.
How to Check a Website for URL 404 Errors
To be SEO-friendly, a website should provide a sitemap.xml file, which helps search engines crawl all of its valid URLs. Based on that, we can implement the crawler in three steps:
- Read sitemap.xml to extract all web page links.
- Parse the HTML of every web page to collect internal and outbound links from the href attribute.
- Request each link and check the response code.
Installation
Beautiful Soup is a Python library for parsing HTML and XML documents. Install Beautiful Soup with the following command:
pip install beautifulsoup4
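To verify that the installation works, a throwaway snippet (not part of the crawler itself) can parse a tiny HTML fragment and read an href attribute:

from bs4 import BeautifulSoup

# Parse a one-line HTML fragment and print the href of the anchor tag.
soup = BeautifulSoup('<a href="http://example.com">example</a>')
print soup.a.get('href')  # http://example.com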
Implementing a Web Page Crawler in Python
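The snippets below share a few imports, a GAME_OVER sentinel and a build_request() helper that are not shown in this section. As a rough sketch, assuming build_request() simply wraps urllib2.Request with a browser-like User-Agent, the shared setup might look like this:

import signal
import threading
from urllib2 import Request, urlopen, HTTPError, URLError
from bs4 import BeautifulSoup

# Assumption: sentinel returned by the crawler methods when the user cancels.
GAME_OVER = 'Game Over!'

def build_request(url):
    # Assumption: the helper only attaches a browser-like User-Agent so that
    # servers are less likely to reject the crawler's requests.
    return Request(url, headers={'User-Agent': 'Mozilla/5.0'})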
Bind a keyboard interrupt handler so the program can be stopped at any time:
def ctrl_c(signum, frame):
    # Set the shared event so running loops can exit, then abort the program.
    global shutdown_event
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

shutdown_event = threading.Event()
signal.signal(signal.SIGINT, ctrl_c)
Read sitemap.xml with Beautiful Soup to collect all page URLs (the snippet is wrapped here in a readPages() function, since the original omits the enclosing definition):
def readPages():
    # Download sitemap.xml and return the URL found in every <loc> tag.
    pages = []
    try:
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        soup = BeautifulSoup(xml)
        urlTags = soup.find_all("url")
        print "The number of url tags in sitemap: ", str(len(urlTags))
        for sitemap in urlTags:
            link = sitemap.findNext("loc").text
            pages.append(link)
        f.close()
    except (HTTPError, URLError) as e:
        print e
    return pages
Parse the HTML content of each page to collect all links that start with http:
def queryLinks(self, result):
    # Extract every href value that points to an absolute http(s) URL.
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')
    for element in elements:
        if shutdown_event.isSet():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            print 'href error!!!'
            continue
    return links

def readHref(self, url):
    # Download the page in 10 KB chunks, then hand the content to queryLinks().
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        while not shutdown_event.isSet():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except (HTTPError, URLError) as e:
        print e
    if shutdown_event.isSet():
        return GAME_OVER
    return self.queryLinks(result)
Send a request for every link and check the response code, logging 404s to a file:
def crawlLinks(self, links, file=None):
    # Request every link; write the ones that return 404 to the report file.
    for link in links:
        if shutdown_event.isSet():
            return GAME_OVER
        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except HTTPError as e:
            status_code = e.code
        except URLError as e:
            print e.reason
        if status_code == 404:
            if file != None:
                file.write(link + '\n')
        print str(status_code), ':', link
    return GAME_OVER
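Finally, a small driver can tie the three steps together. This is only a sketch, under the assumption that the methods above live on a crawler class named LinkChecker (the class definition is not shown in this section):

if __name__ == '__main__':
    checker = LinkChecker()                 # assumed class holding the methods above
    report = open('broken_links.txt', 'w')  # collects every link that returned 404
    for page in readPages():
        if shutdown_event.isSet():
            break
        links = checker.readHref(page)
        if links == GAME_OVER:
            break
        checker.crawlLinks(links, report)
    report.close()

Running the script prints the status code of every link it visits and leaves the 404 URLs in the report file for review.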