Pythonic Musings

Data Analysis and Machine Learning in Python

Introduction to Machine Learning


Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. With a deluge of machine learning resources both online and offline, a newcomer to this field can easily get stranded by indecisiveness. This post is for all machine learning enthusiasts who have not yet found a way into Machine Learning (ML).

This tutorial doesn’t require a deep understanding of optimization, linear algebra or probability. It is about learning the basic concepts of Machine Learning and coding them. I will be using the Python library scikit-learn for the various ML applications.

Let’s start with a very simple Machine Learning algorithm: Linear Regression.

Linear Regression

Linear Regression is an approach to model the relationship between a scalar dependent variable y and one or more independent variables X.

n = number of samples
m = number of features

A linear regression model assumes a linear relationship between the dependent variable $y_i$ and the independent variables $X_i$:

$$y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + \dots + a_m x_{im}$$

where $a_0, a_1, \dots, a_m$ are some constants.

Linear Regression with One Variable (Univariate)

First we start by modelling a hypothesis $h_\theta(X)$:

$$h_\theta(X) = \theta_0 + \theta_1 X$$

The objective of linear regression is to estimate the values of $\theta_0$ and $\theta_1$ such that $h_\theta(X)$ approximates $y$. But how to do that? For this we define a cost function (or error function) as:

$$J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)^2$$

Linear Regression models are often fitted using the least squares approach, i.e. by minimizing the squared error function (or by minimizing a penalized version of the squared error function). For minimizing the error function we use the Gradient Descent algorithm. This method is based on the observation that if a function $F$ is defined and differentiable in the neighborhood of a point $a$, then $F$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$. So, we can find the minimum by updating the value of $\theta_j$ as:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where $\alpha$ is the step size, also called the learning rate.
Using the above concept, we can find the values of $\theta_0$ and $\theta_1$ as:

$$\theta_0 := \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_i$$

Replacing $h_\theta(x_i)$ with $\theta_0 + \theta_1 x_i$, we get a general formula for finding the optimal value of any $\theta_j$ (with $x_{i0} = 1$) as:

$$\theta_j := \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x_i) - y_i \right) x_{ij}$$
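To make the update rule above concrete, here is a minimal NumPy sketch of batch gradient descent for the univariate case. The data, learning rate and iteration count are made up purely for illustration; scikit-learn's LinearRegression, used below, solves least squares directly rather than by gradient descent.

gradient_descent
import numpy as np

#hypothetical one-feature data: X has shape (n, 1), y has shape (n, 1)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([[2.1], [4.2], [6.1], [8.3]])

n = X.shape[0]
X_b = np.hstack([np.ones((n, 1)), X])   #prepend a column of ones for theta_0
theta = np.zeros((2, 1))                #[theta_0, theta_1]
alpha = 0.05                            #learning rate (assumed value)

for _ in range(1000):
    error = X_b.dot(theta) - y          #h_theta(x_i) - y_i for every sample
    gradient = X_b.T.dot(error) / n     #partial derivatives of the cost function
    theta = theta - alpha * gradient    #simultaneous update of theta_0 and theta_1

#theta now approximates the least squares estimates of theta_0 and theta_1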

Phew! A lot of mathematics, right? But where is the code?

Let’s get our hands on some coding. For this tutorial I will be using scikit-learn for machine learning and matplotlib for plotting.

Suppose that, for a hypothetical city FooCity, data on population (in 10,000s) and profit (in $10,000s) are available. We want to predict the profit for a given population.
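For reference, example1.txt is assumed to be a plain comma-separated file with one sample per line, population first and profit second (the values below are made up for illustration):

example1.txt
6.1,17.5
5.5,9.1
8.5,13.6
7.0,11.8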

load_data
import numpy as np
import matplotlib.pyplot as plt

input_file = open('example1.txt')
lines = input_file.readlines()

data = [map(float, line.strip().split(',')) for line in lines]
#each line of example1.txt is "population,profit"

X = np.array([row[0] for row in data])
#X : population of the city in 10,000s
X = X.reshape(X.shape[0], 1)
#reshaping X from shape (97,) to (97, 1)

y = np.array([row[1] for row in data])
#y : profit in $10,000s
y = y.reshape(y.shape[0], 1)
#reshaping y from shape (97,) to (97, 1)

plt.plot(X, y, 'r+', label='Input Data')
#plotting house size vs house price
plt.ylabel('Profit in $10,000s')
plt.xlabel('Population of City in 10,000s')
plt.show()

It is visible from the plot that Population and Profit vary linearly, so we can apply linear regression and predict the profit for a given population.
To perform Linear Regression we use the LinearRegression class available in sklearn.linear_model.

linear_regression
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X, y)
#linear regression using scikit-learn is very simple.
#just call the fit method with X, y

We can now predict the value of Profit for any Population (such as 15.12 × 10,000) as clf.predict(15.12).
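To sanity-check the fit, the learned intercept and slope can be inspected directly; intercept_ and coef_ are part of scikit-learn's LinearRegression API. Note that newer versions of scikit-learn expect a 2D array for prediction, e.g. clf.predict([[15.12]]).

inspect_fit
print clf.intercept_, clf.coef_
#intercept_ corresponds to theta_0 and coef_ to theta_1
print clf.predict(15.12)
#predicted profit (in $10,000s) for a population of 151,200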

plot
x_ = np.linspace(np.min(X), np.max(X), 100).reshape(100, 1)
#x_ : array with 100 equally spaced elements from the
#min value of X up to the max value of X
y_ = clf.predict(x_)
plt.plot(x_, y_, 'b', label='Predicted values')
plt.legend(loc='best')
plt.show()
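To get a rough measure of how well the line fits the data, LinearRegression also provides a score method that returns the coefficient of determination R²:

score
print clf.score(X, y)
#R^2 of the fit; the closer to 1, the better the linear model explains the data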

Next, we will move on to Multivariate Linear Regression.

Building a Custom Parser


A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.

This tutorial is about building your own web crawler - not one that can scan the whole internet (like Google), but one that is able to extract all the links from a given webpage. For this tutorial, I will be extracting information from IMDb.

IMDb is an online database of information related to films, television programs and video games. I will be parsing IMDb Top 250 and IMDb: Years and extracting information about the movies' ratings, year of release, star cast, directors, etc.

First, let’s write a simple program (hard-code) to parse IMDb Top 250.

main.py
#!/usr/bin/env python

from lxml.html import parse

tree = parse('http://www.imdb.com/chart/top')

movies_data = tree.findall('//*[@id="main"]/table[2]/tr/td[3]/font/a')
movies_rating = tree.findall('//*[@id="main"]/table[2]/tr/td[2]/font')

# Removing unwanted data
movies_data.pop(0)
movies_rating.pop(0)

movies_rating = [float(movie.text) for movie in movies_rating]


def get_movie_data(iterator):
    """ Returns movie_name, year_of_release as movies_data[element].itertext()
    would return an iterator containing these two elements"""
    movie_data = (iterator.next(), iterator.next())
    return movie_data

mov_dict = {get_movie_data(movies_data[i].itertext())[0]:
            [int(get_movie_data(movies_data[i].itertext())[1].strip(' ()/I')),
             movies_rating[i]] for i in range(len(movies_data))}

Let's go through the code in detail. For this simple parser I have used parse from lxml.html.

tree = parse('http://www.imdb.com/chart/top') parses the url and returns a tree. Before going on to the next line, let's discuss XPath. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. XPath Tutorial by w3schools.com is a very good tutorial for XPath. In the XPath '//*[@id="main"]/table[2]/tr/td[3]/font/a':

// : Selects nodes in the document from the current node
     that match the selection no matter where they are.
/  : Selects from the root node.
/tr/td[3] : Selects the third td element that is a child of the tr element.

To get the XPath of an element, you can use Google Chrome: right-click the element and click on Inspect Element.

Then select Copy XPath. This gives you the XPath to be used. Remember to remove the <tbody> element from the XPath, and also remove the index in [] from tr, as you want to scrape the whole movies list. Similarly, you can find the XPath for movies_rating.
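To see such selectors in action on a tiny, self-contained snippet (the HTML below is made up and far simpler than the real IMDb markup), lxml's xpath method can be used:

xpath_example
from lxml.html import fromstring

snippet = fromstring(
    '<table><tr><td>1.</td><td>9.2</td><td>The Shawshank Redemption</td></tr></table>')
print snippet.xpath('//tr/td[3]/text()')
#['The Shawshank Redemption'] : the third td of every tr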

Then, in get_movie_data:

get_movie_data
def get_movie_data(iterator):
    """ Returns movie_name, year_of_release as movies_data[element].itertext()
    would return an iterator containing these two elements"""
    movie_data = (iterator.next(), iterator.next())
    return movie_data

mov_dict = {get_movie_data(movies_data[i].itertext())[0]:
            [int(get_movie_data(movies_data[i].itertext())[1].strip(' ()/I')),
             movies_rating[i]] for i in range(len(movies_data))}

mov_dict is built as a dictionary with

key   : movie_name
value : [year_of_release, movie_rating]
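With this structure, the dictionary can be queried directly. For example (purely illustrative), listing every Top 250 movie released in a given year:

mov_dict_usage
movies_1994 = [name for name, (year, rating) in mov_dict.items() if year == 1994]
#all movies in mov_dict released in 1994, since the first element of the value is the year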

Let’s go a step further. After writing a simple (but hard-coded) parser, I am going to write a more generic (yet simple) parser. For this, I have taken some concepts from scrapy (imitation is the sincerest form of flattery) and have used lxml for scraping.

I will be scraping the IMDb: Years page. This page contains links to pages listing the Most Popular Titles released in each year. In the next blog, I will be using the scraped data (a dictionary {year: [name, rating, genres, director, actors]}) for analysing trends in movies.

Crawler.py
import urllib2
import re
from lxml.html import parse


class Crawler():

    def __init__(self, settings):
        """
        settings should be a dictionary containing
        domain:
        start_url:

        EXAMPLE
        settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
        """
        self.settings = settings
        self.rules = {self.settings['start_url']: 'parse'}
        self.parsed_urls = []
        self.url_list = []

    def _get_all_urls(self, response):
        """
        _get_all_urls returns all the urls in the page
        """
        tree = parse(response)
        url_list = tree.findall('//a')
        url_list = [url.attrib['href']
                    if url.attrib['href'].startswith('http://')
                    else urllib2.urlparse.urljoin(self.settings['domain'],
                                                  url.attrib['href'])
                    for url in url_list]
        return url_list

    def set_rules(self, rules):
        """
        set_rules set the rules for crawling
        rules are dictionary in the form
        {url_pattern: parsing_function}

        EXAMPLE
        >>> settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
        >>> imdb_crawler = Crawler(settings)
        >>> imdb_crawler.set_rules({'/year/\d+': 'year_parser',
        ...                         '/title/\w+': 'movie_parser'})
        """
        self.rules = rules

    def _get_crawl_function(self, url):
        """
        _get_crawl_function returns the crawl function to be
        used for given url pattern
        """
        for pattern in self.rules.keys():
            if re.search(pattern, url):
                return self.rules[pattern]

    def parse(self, response):
        """
        parse is the default parser to be called
        """
        pass

    def start_crawl(self):
        """
        start_crawl is the method that starts crawling

        EXAMPLE
        >>> foo_crawler = Crawler(settings)
        >>> foo_crawler.start_crawl()
        """
        response = urllib2.urlopen(
            urllib2.urlparse.urljoin(self.settings['domain'],
                                     self.settings['start_url']))
        self.url_list = self._get_all_urls(response)
        for url in self.url_list:
            if url not in self.parsed_urls:
                crawl_function = self._get_crawl_function(url)
                if crawl_function:
                    getattr(self, crawl_function)(urllib2.urlopen(url))
                    self.parsed_urls.append(url)

A Crawler object has to be initialized with a settings dictionary {'domain': domain_of_page, 'start_url': start_url_page}. The Crawler class has an attribute url_list that contains all the urls to be parsed, and parsed_urls is a list of all the urls that have already been parsed.

_get_all_urls
def _get_all_urls(self, response):
    """
    _get_all_urls returns all the urls in the page
    """
    tree = parse(response)
    url_list = tree.findall('//a')
    url_list = [url.attrib['href']
                if url.attrib['href'].startswith('http://')
                else urllib2.urlparse.urljoin(self.settings['domain'],
                                              url.attrib['href'])
                for url in url_list]
    return url_list

_get_all_urls retrieves all the urls present in the web page. tree.findall('//a') returns all the <a> tags present in the page. If a url starts with http:// it is kept as is; but if the url is a relative url, the absolute url formed by joining it with the domain is appended:

urllib2.urlparse.urljoin(self.settings['domain'], url.attrib['href'])
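urljoin behaves exactly as needed here: relative urls are resolved against the domain, while absolute urls are returned unchanged. A quick illustration:

urljoin_example
import urllib2

print urllib2.urlparse.urljoin('http://www.imdb.com', '/year/1992')
#http://www.imdb.com/year/1992 : a relative url is resolved against the domain
print urllib2.urlparse.urljoin('http://www.imdb.com', 'http://example.com/page')
#http://example.com/page : an absolute url is left untouched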

parse(response) is the default parser of the Crawler. More sophisticated or complex parsers can be written for different urls using set_rules. For example:

set_rules
settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
imdb_crawler = Crawler(settings)
# year_parser is parser for scraping year pages
# movie_parser is parser for scraping movie pages
imdb_crawler.set_rules({'/year/\d+': 'year_parser',
                        '/title/\w+': 'movie_parser'})

All parsers should take response as an input parameter; this is discussed below in detail.
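As a minimal sketch (hypothetical; it assumes from Crawler import Crawler and from lxml.html import parse are in scope), a custom parser is just a method that receives the opened response and extracts whatever it needs:

parser_skeleton
class FooCrawler(Crawler):

    def year_parser(self, response):
        #every parser receives the opened url as a file-like response object
        tree = parse(response)
        print tree.getroot().findtext('.//title')
        #here we only print the page title; a real parser would extract data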

start_crawl
def start_crawl(self):
    """
    start_crawl is the method that starts crawling

    EXAMPLE
    >>> foo_crawler = Crawler(settings)
    >>> foo_crawler.start_crawl()
    """
    response = urllib2.urlopen(
        urllib2.urlparse.urljoin(self.settings['domain'],
                                 self.settings['start_url']))
    self.url_list = self._get_all_urls(response)
    for url in self.url_list:
        if url not in self.parsed_urls:
            crawl_function = self._get_crawl_function(url)
            if crawl_function:
                getattr(self, crawl_function)(urllib2.urlopen(url))
                self.parsed_urls.append(url)

start_crawl is the main function that initiates crawling. First, it gets all the urls present in the start page. It then looks up the parser to be called for each url using _get_crawl_function.

def _get_crawl_function(self, url):
    """
    _get_crawl_function returns the crawl function to be
    used for given url pattern
    """
    for pattern in self.rules.keys():
        if re.search(pattern, url):
            return self.rules[pattern]

The matching parser is then called according to the rules set above. Finally, the url is appended to the parsed_urls list.
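With the IMDb rules shown above, the pattern matching resolves as follows (the urls are illustrative):

rule_matching
print imdb_crawler._get_crawl_function('http://www.imdb.com/year/1992')
#'year_parser'
print imdb_crawler._get_crawl_function('http://www.imdb.com/title/tt0111161/')
#'movie_parser'
print imdb_crawler._get_crawl_function('http://www.imdb.com/help/')
#None, so start_crawl skips this url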

Using the above Crawler, I have implemented IMDbCrawler.

main.py
import urllib2

from lxml.html import parse
from Crawler import Crawler
from collections import defaultdict

movie_final_dict = defaultdict(list)

class IMDbCrawler(Crawler):

    def year_parser(self, response):
        tree = parse(response)
        year = tree.find('//*[@id="header"]/h1')
        print year.text
        list_even = tree.findall(
            '//table//tr[@class="even detailed"]/td[@class="title"]/a')
        list_odd = tree.findall(
            '//table//tr[@class="odd detailed"]/td[@class="title"]/a')
        movies_list = list_even + list_odd
        movies_list_url = [urllib2.urlparse.urljoin(self.settings['domain'],
                                                    movie.attrib['href'])
                           for movie in movies_list]
        for url in movies_list_url:
            self.url_list.append(url)

    def movie_parser(self, response):
        tree = parse(response)
        name = tree.find('//*[@id="overview-top"]/h1/span[1]').text
        print name
        try:
            genres = tree.findall('//div[@itemprop="genre"]//a')
            genres = [genre.text.strip() for genre in genres]
            director = tree.find(
                '//div[@itemprop="director"]//span[@itemprop="name"]')
            director = director.text.strip()
            rating = tree.find('//span[@itemprop="ratingValue"]')
            rating = float(rating.text)
            actors = tree.findall('//td[@itemprop="actor"]//a//span')
            actors = [actor.text.strip() for actor in actors]
            year = tree.find('//*[@id="overview-top"]/h1/span[2]/a').text
            movie_final_dict[year].append([name, rating,
                                           genres, director, actors])
        except (AttributeError, IndexError):
            pass

settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
imdb_crawler = IMDbCrawler(settings)
imdb_crawler.set_rules({'/year/\d+': 'year_parser',
                        '/title/\w+': 'movie_parser'})
imdb_crawler.start_crawl()

IMDbCrawler inherits from the Crawler class.

year_parser
def year_parser(self, response):
    tree = parse(response)
    year = tree.find('//*[@id="header"]/h1')
    print year.text
    list_even = tree.findall(
        '//table//tr[@class="even detailed"]/td[@class="title"]/a')
    list_odd = tree.findall(
        '//table//tr[@class="odd detailed"]/td[@class="title"]/a')
    movies_list = list_even + list_odd
    movies_list_url = [urllib2.urlparse.urljoin(self.settings['domain'],
                                                movie.attrib['href'])
                       for movie in movies_list]
    for url in movies_list_url:
        self.url_list.append(url)

year_parser crawls the year pages and appends the movie page urls to url_list. The XPath for extracting the movie urls and the year is obtained as discussed above.

movie_parser
def movie_parser(self, response):
    tree = parse(response)
    name = tree.find('//*[@id="overview-top"]/h1/span[1]').text
    print name
    try:
        genres = tree.findall('//div[@itemprop="genre"]//a')
        genres = [genre.text.strip() for genre in genres]
        director = tree.find(
            '//div[@itemprop="director"]//span[@itemprop="name"]')
        director = director.text.strip()
        rating = tree.find('//span[@itemprop="ratingValue"]')
        rating = float(rating.text)
        actors = tree.findall('//td[@itemprop="actor"]//a//span')
        actors = [actor.text.strip() for actor in actors]
        year = tree.find('//*[@id="overview-top"]/h1/span[2]/a').text
        movie_final_dict[year].append([name, rating,
                                       genres, director, actors])
    except (AttributeError, IndexError):
        pass

movie_parser crawls the movie web page and adds the details of the movie, such as its rating, director and genres, to the dictionary movie_final_dict. For some movies, data (rating, actors, etc.) is missing, so I have included try and except statements.

settings = {'domain': 'http://www.imdb.com', 'start_url': '/year'}
imdb_crawler = IMDbCrawler(settings)
imdb_crawler.set_rules({'/year/\d+': 'year_parser',
                        '/title/\w+': 'movie_parser'})
imdb_crawler.start_crawl()

imdb_crawler is an instance of IMDbCrawler. It has been instantiated with the domain http://www.imdb.com and start_url /year. The rules then map url patterns to parsers: year_parser is called for all webpages of the form http://www.imdb.com/year/1992 and movie_parser for webpages of the form http://www.imdb.com/title/tt0477348/.
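Once the crawl finishes, the collected dictionary can be saved for the analysis planned in the next post. A minimal sketch (the file name is arbitrary):

save_data
import json

with open('imdb_movies.json', 'w') as output_file:
    json.dump(movie_final_dict, output_file)
#movie_final_dict is a defaultdict, which json.dump serializes like a normal dict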