Web Crawler (or Spider): a program that starts by reading the content of a web page and follows its links to other pages until all reachable pages have been read, in order to create entries for a search engine.

Prerequisites :
  know the basics of Python
  know the structure of an HTML page
Tools :
  urllib2 library
  BeautifulSoup library (version 3.2.1)
  urlparse library (for Python version < 3)

I- Download the content of a web page using urllib2
 You should know that there are several libraries that offer the same functionality as urllib2, such as mechanize, PycURL, urllib3, httplib, httplib2... I'll use urllib2 because it belongs to the standard library, while the others are used in more specialised cases.
I.1- Example of use

import urllib2
#download content of http://www.crummy.com/software/BeautifulSoup/
website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)
#display the content
print fo.read()
#display the URL of the resource retrieved
print fo.geturl()
#display the meta-information of the page
print fo.info()
#display the HTTP status code of the response
print fo.getcode()
Visit the documentation if you are interested in the other features.
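One thing the example above skips is error handling: urlopen raises urllib2.HTTPError when the server answers with an error status code and urllib2.URLError when the request cannot reach the server at all (HTTPError is a subclass of URLError, so it must be caught first). A minimal sketch of how you might guard a download (the 10-second timeout is just an illustration):

import urllib2

website = "http://www.crummy.com/software/BeautifulSoup/"
try:
    fo = urllib2.urlopen(website, timeout=10)
except urllib2.HTTPError as e:
    #the server answered, but with an error status (404, 500, ...)
    print "HTTP error:", e.code
except urllib2.URLError as e:
    #the request never reached a server (DNS failure, connection refused, ...)
    print "URL error:", e.reason
else:
    print fo.getcode()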

II- Parsing HTML documents with BeautifulSoup
  I consider it the best tool for extracting data from an HTML page, but it is not a standard library, so you must download the package. If you don't want to install a new library, you can use the standard library's parsers or regular expressions instead, as sketched below.
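For comparison, here is a minimal sketch of the standard-library route, using the HTMLParser module (Python 2) to collect the href of every <a> tag (the LinkCollector class name is just for illustration); it works on clean pages but is far less forgiving with broken HTML than BeautifulSoup:

from HTMLParser import HTMLParser
import urllib2

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        #called for every opening tag; attrs is a list of (name, value) pairs
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.links.append(attrs['href'])

parser = LinkCollector()
parser.feed(urllib2.urlopen("http://www.crummy.com/software/BeautifulSoup/").read())
print parser.links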
II.1- Install BeautifulSoup
wget http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz
tar zxvf BeautifulSoup-3.2.1.tar.gz
cd BeautifulSoup-3.2.1
python setup.py install
II.2- Example of use
from BeautifulSoup import BeautifulSoup
import urllib2

website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)
soup = BeautifulSoup(fo)
#display page content with indent
print soup.prettify()
#get all paragraphs in page
paragraphs = soup.findAll("p")
#get all links that have an `href` attribute
links = soup.findAll("a", href=True)
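The objects returned by findAll are Tag objects: attribute values are read with dictionary-style indexing, and .string gives the text of a tag that contains nothing but text. For example, continuing with the soup built above:

#the title of the page
print soup.find("title").string
#the address of every link collected above
for link in links:
    print link['href']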
III- Break URLs with urlparse
Definition from the official documentation:
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais.
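For example, urlparse splits a URL into its named components, and urljoin (used in the crawler below) resolves a link found in a page against the URL of that page:

from urlparse import urlparse, urljoin

parts = urlparse("http://www.crummy.com/software/BeautifulSoup/")
print parts.scheme   #http
print parts.netloc   #www.crummy.com
print parts.path     #/software/BeautifulSoup/

#turn a relative link into an absolute URL
print urljoin("http://www.crummy.com/software/BeautifulSoup/", "download/3.x/")
#-> http://www.crummy.com/software/BeautifulSoup/download/3.x/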
IV- Start coding
#-*- coding: utf-8 -*-
from __future__ import division
import urllib2
import time
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import re
import threading
import Queue

_URLBASE = 'http://www.elektronique.fr/cours.php'
#_FOLLOW_EXTERN = 'N'#(y/n)|(yes/no)
_DEPTH = 1
_Result = list()
_follow = 0


def Curls(urls, q):
    #fetch each URL, extract the links it contains and put the set on the queue
    visitedurl = set()
    print 'Load.............'
    for url in urls:
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            #unreachable page, bad URL, HTTP error... skip it
            continue
        soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
        links = soup('a')
        for link in links:
            if 'href' in dict(link.attrs):
                #resolve relative links against the page they were found on
                absolute = urljoin(url, link['href'])
                #keep only well-formed http(s) URLs
                if re.match(r'^https?://(w{3}\.)?[a-zA-Z0-9._-]+\.[a-z]{2,}.+$', absolute):
                    visitedurl.add(absolute)
    q.put(visitedurl)

def getLinks(urls):
    global _follow
    global _Result
    q = Queue.Queue()
    _urls = list()
    if _follow <= _DEPTH:
        #one thread per batch of 10 URLs
        threads = []
        for start in range(0, len(urls), 10):
            t = threading.Thread(target=Curls, args=(urls[start:start + 10], q))
            t.start()
            threads.append(t)
        #wait for every batch, then collect the link sets they produced
        for t in threads:
            t.join()
        while not q.empty():
            _urls = _urls + list(q.get())
        _follow += 1
        _Result = _Result + _urls
        #crawl the newly discovered links, one level deeper
        return getLinks(_urls)


if __name__ == '__main__':
    start_time = time.time()
    getLinks([_URLBASE])
    #number of distinct URLs found during the crawl
    print len(list(set(_Result)))
    end_time = time.time() - start_time
    print str(end_time), "seconds"
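With _DEPTH = 1, the script fetches the start page and then every page it links to, collecting each discovered URL in _Result; the final print shows how many distinct URLs were found and how long the crawl took. Raising _DEPTH makes the crawl follow links one more level per increment, but the number of requests grows very quickly.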