A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
This tutorial is about building your own web crawler: not one that can scan the whole Internet (like Google's), but one that can extract all the links from a given webpage. For this tutorial, I will be extracting information from IMDb.
IMDb is an online database of information related to films, television programs and video games. I will parse IMDb Top 250 and IMDb: Years and extract information about the movies' ratings, years of release, star casts, directors, etc.
First, let's write a simple, hard-coded program to parse IMDb Top 250.
Going through the code in detail: for this simple parser I have used parse from lxml.html.
tree = parse('http://www.imdb.com/chart/top') parses the URL and returns a tree.
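As a minimal sketch of this first step (assuming lxml is installed), the setup is just:

```python
from lxml.html import parse

# Parse the IMDb Top 250 chart page into an lxml element tree
tree = parse('http://www.imdb.com/chart/top')
```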
Before going on to the next line, let's discuss XPath. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. XPath Tutorial by w3schools.com is a very good tutorial for XPath. In the XPath '//*[@id="main"]/table[2]/tr/td[3]/font/a':
// : Selects nodes in the document from the current node that match the selection, no matter where they are.
/ : Selects from the root node.
/tr/td[3] : Selects the third td element that is a child of the tr element.
To get the XPath of an element, you can use Google Chrome: right-click the element and click on Inspect Element.
Then select Copy XPath. This gives you the XPath to be used. Remember to remove the <tbody> element from the XPath, and also remove the [] index from tr, as you want to scrape the whole movie list rather than a single row.
Similarly, you can find the XPath for movies_rating.
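Putting those XPaths to work, the extraction step might look like the sketch below; the movies_rating path is an assumption, obtained the same way (Inspect Element, then Copy XPath), and depends on the page layout at the time:

```python
# Elements holding the movie name and its year of release (see get_movie_data below)
movies_data = tree.xpath('//*[@id="main"]/table[2]/tr/td[3]/font/a')

# Rating text for each row (hypothetical path, found via Inspect Element)
movies_rating = tree.xpath('//*[@id="main"]/table[2]/tr/td[2]/font/text()')
```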
Then, in get_movie_data:
```python
def get_movie_data(iterator):
    """ Returns movie_name, year_of_release as movies_data[element].itertext()
        would return an iterator containing these two elements """
    movie_data = (iterator.next(), iterator.next())
    return movie_data

mov_dict = {get_movie_data(movies_data[i].itertext())[0]:
                [int(get_movie_data(movies_data[i].itertext())[1].strip(' ()/I')),
                 movies_rating[i]]
            for i in range(len(movies_data))}
```
mov_dict is built as a dictionary mapping each movie name to a list containing its year of release and rating.
Let's go a step further. After writing a simple (but hard-coded) parser, I am going to write a more generic (yet simple) parser. For this, I have taken some concepts from scrapy (imitation is the sincerest form of flattery) and have used lxml for scraping.
I will be scraping the IMDb: Years page. This page contains links to the pages listing the Most Popular Titles released in each year. In the next blog, I will use the scraped data (a dictionary {year: [name, rating, genres, director, actors]}) for analysing trends in movies.
```python
import urllib2
import re
from lxml.html import parse


class Crawler():
    def __init__(self, settings):
        """ settings should be a dictionary containing
            domain:
            start_url:
            EXAMPLE
            settings = {'domain': 'http://www.imdb.com',
                        'start_url': '/year'}
        """
        self.settings = settings
        self.rules = {self.settings['start_url']: 'parse'}
        self.parsed_urls = []
        self.url_list = []

    def _get_all_urls(self, response):
        """ _get_all_urls returns all the urls in the page """
        tree = parse(response)
        url_list = tree.findall('//a')
        url_list = [url.attrib['href'] if url.attrib['href'].startswith('http://')
                    else urllib2.urlparse.urljoin(self.settings['domain'],
                                                  url.attrib['href'])
                    for url in url_list]
        return url_list

    def set_rules(self, rules):
        """ set_rules sets the rules for crawling
            rules are a dictionary in the form {url_pattern: parsing_function}
            EXAMPLE
            >>> settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
            >>> imdb_crawler = Crawler(settings)
            >>> imdb_crawler.set_rules({'/year/\d+': 'year_parser',
            ...                         '/title/\w+': 'movie_parser'})
        """
        self.rules = rules

    def _get_crawl_function(self, url):
        """ _get_crawl_function returns the crawl function to be used
            for a given url pattern """
        for pattern in self.rules.keys():
            if re.search(pattern, url):
                return self.rules[pattern]

    def parse(self, response):
        """ parse is the default parser to be called """
        pass

    def start_crawl(self):
        """ start_crawl is the method that starts the crawling
            EXAMPLE
            >>> foo_crawler = Crawler(settings)
            >>> foo_crawler.start_crawl()
        """
        response = urllib2.urlopen(
            urllib2.urlparse.urljoin(self.settings['domain'],
                                     self.settings['start_url']))
        self.url_list = self._get_all_urls(response)
        for url in self.url_list:
            if url not in self.parsed_urls:
                crawl_function = self._get_crawl_function(url)
                if crawl_function:
                    getattr(self, crawl_function)(urllib2.urlopen(url))
                self.parsed_urls.append(url)
```
A Crawler object has to be initialized with a settings dictionary {'domain': domain_of_page, 'start_url': start_url_page}. The Crawler class has an attribute url_list that contains all the urls to be parsed, and parsed_urls is a list of all the urls already parsed.
_get_all_urls
```python
def _get_all_urls(self, response):
    """ _get_all_urls returns all the urls in the page """
    tree = parse(response)
    url_list = tree.findall('//a')
    url_list = [url.attrib['href'] if url.attrib['href'].startswith('http://')
                else urllib2.urlparse.urljoin(self.settings['domain'],
                                              url.attrib['href'])
                for url in url_list]
    return url_list
```
_get_all_urls retrieves all the urls present in the web page. tree.findall('//a') returns all the <a> tags present in the page. If a url starts with http://, it is appended as is; but if the url is relative, the final url formed by joining it with the domain is appended instead.
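For example, a relative link such as /year/1992 gets joined with the domain (doctest-style, using the same urlparse helper exposed through urllib2 in Python 2):

```python
>>> import urllib2
>>> urllib2.urlparse.urljoin('http://www.imdb.com', '/year/1992')
'http://www.imdb.com/year/1992'
```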
parse(response) is the default parser of the Crawler. More sophisticated or complex parsers can be written for different urls using set_rules. For example:
set_rules
```python
settings = {'domain': 'http://www.imdb.com',
            'start_url': '/year'}
imdb_crawler = Crawler(settings)
# year_parser is the parser for scraping year pages
# movie_parser is the parser for scraping movie pages
imdb_crawler.set_rules({'/year/\d+': 'year_parser',
                        '/title/\w+': 'movie_parser'})
```
All the parsers should take response as an input parameter, as discussed below in detail; a minimal sketch of such a parser follows.
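For instance, a sketch of what such a parser looks like (the body here is only a placeholder; a real one would live on a Crawler subclass such as IMDbCrawler, discussed later):

```python
from lxml.html import parse

def year_parser(self, response):
    """ Every parser takes the opened urllib2 response as its input """
    tree = parse(response)
    # ... extract whatever the page holds, e.g. tree.findall('//a') ...
```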
start_crawl
```python
def start_crawl(self):
    """ start_crawl is the method that starts the crawling
        EXAMPLE
        >>> foo_crawler = Crawler(settings)
        >>> foo_crawler.start_crawl()
    """
    response = urllib2.urlopen(
        urllib2.urlparse.urljoin(self.settings['domain'],
                                 self.settings['start_url']))
    self.url_list = self._get_all_urls(response)
    for url in self.url_list:
        if url not in self.parsed_urls:
            crawl_function = self._get_crawl_function(url)
            if crawl_function:
                getattr(self, crawl_function)(urllib2.urlopen(url))
            self.parsed_urls.append(url)
```
start_crawl is the main function that initiates crawling. First, it gets all the urls present in the start page. It then looks up the parser to be called for each particular url using _get_crawl_function.
```python
def _get_crawl_function(self, url):
    """ _get_crawl_function returns the crawl function to be used
        for a given url pattern """
    for pattern in self.rules.keys():
        if re.search(pattern, url):
            return self.rules[pattern]
```
The matching parser is then called according to the rules set above. Finally, the url is appended to the parsed_urls list.
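Given the rules set earlier, the lookup behaves like this:

```python
>>> imdb_crawler._get_crawl_function('http://www.imdb.com/year/1992')
'year_parser'
>>> imdb_crawler._get_crawl_function('http://www.imdb.com/title/tt0477348/')
'movie_parser'
```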
Using the above Crawler, I have implemented IMDbCrawler.
movie_parser crawls a movie web page and adds the details of the movie, such as its rating, director and genres, to the dictionary movie_final_dict. Some movies have missing data (rating, actors, etc.), so I have included try and except statements.
imdb_crawler is an instance of IMDbCrawler. It is instantiated with the domain http://www.imdb.com and the start_url /year. Then the rules are set for the different urls: year_parser is called for all webpages of the form http://www.imdb.com/year/1992 and movie_parser for webpages of the form http://www.imdb.com/title/tt0477348/. A rough sketch of how these pieces fit together is shown below.
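The full IMDbCrawler listing is not reproduced here, so the following is only a minimal sketch of the idea, reusing the Crawler class from above; the parser bodies and XPaths are assumptions meant to illustrate the structure, not the actual implementation:

```python
from lxml.html import parse

class IMDbCrawler(Crawler):
    """ Minimal sketch: the real parsers extract ratings, directors, genres, etc. """
    movie_final_dict = {}

    def year_parser(self, response):
        # A /year/... page is mainly a hub of links to movie pages, so add its
        # urls to the crawl list; start_crawl's loop picks them up as it
        # keeps iterating over url_list
        self.url_list.extend(self._get_all_urls(response))

    def movie_parser(self, response):
        tree = parse(response)
        try:
            # Hypothetical XPaths; the real ones would be copied via Inspect Element
            title = tree.xpath('//h1//text()')[0].strip()
            rating = tree.xpath('//*[@itemprop="ratingValue"]/text()')[0]
            self.movie_final_dict[title] = {'rating': rating}
        except IndexError:
            # Some movies have missing data (rating, actors, etc.), so skip them
            pass


settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
imdb_crawler = IMDbCrawler(settings)
imdb_crawler.set_rules({'/year/\d+': 'year_parser',
                        '/title/\w+': 'movie_parser'})
imdb_crawler.start_crawl()
```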