Web Crawler (or Spider): a program that starts by reading the content of a web page and follows its links to other pages until all reachable pages have been read, in order to create entries for a search engine.

Prerequisites :
  know the basics of Python
  know the structure of an HTML page
Tools :
  urllib2 library
  BeautifulSoup library (version 3.2.1)
  urlparse library (for Python version < 3)

I- Download the content of a web page using urllib2
 You should know that there are several libraries that offer the same functionality as urllib2, such as mechanize, PycURL, urllib3, httplib, httplib2... I'll use urllib2 because it belongs to the standard library, while the others are used in more specialised cases.
I.1- Example of use

import urllib2
#download content of http://www.crummy.com/software/BeautifulSoup/
website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)
#display the content
print fo.read()
#display the URL of the resource retrieved
print fo.geturl()
#display the meta-information of the page
print fo.info()
#display the HTTP status code of the response
print fo.getcode()
Visit the documentation if you are interested in the other features.
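One thing the example above skips is error handling: urlopen raises urllib2.HTTPError when the server answers with an error status code and urllib2.URLError when the request cannot reach the server at all (HTTPError is a subclass of URLError, so it must be caught first). A minimal sketch of how you might guard a download (the 10-second timeout is just an illustration):

import urllib2

website = "http://www.crummy.com/software/BeautifulSoup/"
try:
    fo = urllib2.urlopen(website, timeout=10)
except urllib2.HTTPError as e:
    #the server answered, but with an error status (404, 500, ...)
    print "HTTP error:", e.code
except urllib2.URLError as e:
    #the request never reached a server (DNS failure, connection refused, ...)
    print "URL error:", e.reason
else:
    print fo.getcode()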

II- Parsing HTML documents with BeautifulSoup
  I consider it the best tool for extracting data from an HTML page, but it is not a standard library, so you must download the package. If you don't want to install a new library, you can use the standard library's parsers or regular expressions instead, as sketched below.
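For comparison, here is a minimal sketch of the standard-library route, using the HTMLParser module (Python 2) to collect the href of every <a> tag (the LinkCollector class name is just for illustration); it works on clean pages but is far less forgiving with broken HTML than BeautifulSoup:

from HTMLParser import HTMLParser
import urllib2

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        #called for every opening tag; attrs is a list of (name, value) pairs
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.links.append(attrs['href'])

parser = LinkCollector()
parser.feed(urllib2.urlopen("http://www.crummy.com/software/BeautifulSoup/").read())
print parser.links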
II.1- Install BeautifulSoup
wget http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz
tar zxvf BeautifulSoup-3.2.1.tar.gz
cd BeautifulSoup-3.2.1
python setup.py install
II.2- Example of use
from BeautifulSoup import BeautifulSoup
import urllib2

website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)
soup = BeautifulSoup(fo)
#display page content with indent
print soup.prettify()
#get all paragraphs in page
paragraphs = soup.findAll("p")
#get all links that have an `href` attribute
links = soup.findAll("a", href=True)
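The objects returned by findAll are Tag objects: attribute values are read with dictionary-style indexing, and .string gives the text of a tag that contains nothing but text. For example, continuing with the soup built above:

#the title of the page
print soup.find("title").string
#the address of every link collected above
for link in links:
    print link['href']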
III- Break URLs with urlparse
Definition from the official documentation:
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais.
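For example, urlparse splits a URL into its named components, and urljoin (used in the crawler below) resolves a link found in a page against the URL of that page:

from urlparse import urlparse, urljoin

parts = urlparse("http://www.crummy.com/software/BeautifulSoup/")
print parts.scheme   #http
print parts.netloc   #www.crummy.com
print parts.path     #/software/BeautifulSoup/

#turn a relative link into an absolute URL
print urljoin("http://www.crummy.com/software/BeautifulSoup/", "download/3.x/")
#-> http://www.crummy.com/software/BeautifulSoup/download/3.x/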
IV- Start coding
#-*- coding: utf-8 -*-
from __future__ import division
import urllib2
import time
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import re
import threading
import Queue

_URLBASE = 'http://www.elektronique.fr/cours.php'
#_FOLLOW_EXTERN = 'N'#(y/n)|(yes/no)
_DEPTH = 1
_Result = list()
_follow = 0


def Curls(urls, q):
    #fetch each URL, extract the links it contains and put the set on the queue
    visitedurl = set()
    print 'Load.............'
    for url in urls:
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            #unreachable page, bad URL, HTTP error... skip it
            continue
        soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
        links = soup('a')
        for link in links:
            if 'href' in dict(link.attrs):
                #resolve relative links against the page they were found on
                absolute = urljoin(url, link['href'])
                #keep only well-formed http(s) URLs
                if re.match(r'^https?://(w{3}\.)?[a-zA-Z0-9._-]+\.[a-z]{2,}.+$', absolute):
                    visitedurl.add(absolute)
    q.put(visitedurl)

def getLinks(urls):
    global _follow
    global _Result
    q = Queue.Queue()
    _urls = list()
    if _follow <= _DEPTH:
        #one thread per batch of 10 URLs
        threads = []
        for start in range(0, len(urls), 10):
            t = threading.Thread(target=Curls, args=(urls[start:start + 10], q))
            t.start()
            threads.append(t)
        #wait for every batch, then collect the link sets they produced
        for t in threads:
            t.join()
        while not q.empty():
            _urls = _urls + list(q.get())
        _follow += 1
        _Result = _Result + _urls
        #crawl the newly discovered links, one level deeper
        return getLinks(_urls)


if __name__ == '__main__':
    start_time = time.time()
    getLinks([_URLBASE])
    #number of distinct URLs found during the crawl
    print len(list(set(_Result)))
    end_time = time.time() - start_time
    print str(end_time), "seconds"
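With _DEPTH = 1, the script fetches the start page and then every page it links to, collecting each discovered URL in _Result; the final print shows how many distinct URLs were found and how long the crawl took. Raising _DEPTH makes the crawl follow links one more level per increment, but the number of requests grows very quickly.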