Prerequisites:
know the basics of Python
know the structure of an HTML page
Tools:
urllib2 library
BeautifulSoup library (version 3.2.1)
urlparse library (for Python versions < 3)
I- Download content of a web page using urllib2
You should know that there are several libraries that provide the same functionality as urllib2, such as mechanize, PycURL, urllib3, httplib, httplib2... I'll use urllib2 because it belongs to the standard library; the others tend to be used in more specialized cases.
I.1- Example of use
import urllib2

#download the content of http://www.crummy.com/software/BeautifulSoup/
website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)

#display the content
print fo.read()

#display the URL of the resource retrieved
print fo.geturl()

#display the meta-information of the page
print fo.info()

#display the HTTP status code of the response
print fo.getcode()

Visit the documentation if you are interested in the other features.
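Note that urlopen raises an exception when the request fails. Here is a minimal sketch of how that case can be handled with urllib2's HTTPError and URLError (the URL below is just a placeholder):

import urllib2

try:
    fo = urllib2.urlopen("http://www.example.com/does-not-exist")
    print fo.getcode()
except urllib2.HTTPError, e:
    # the server answered with an error status code (404, 500, ...)
    print "HTTP error:", e.code
except urllib2.URLError, e:
    # the request never reached a server (DNS failure, connection refused, ...)
    print "URL error:", e.reason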
II- Parsing HTML documents with BeautifulSoup
I consider it the best tool for extracting data from an HTML page, but it is not a standard library, so you must download the package. If you don't want to install a new library, you can use the standard libraries for parsing XML, or fall back on regular expressions.
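For instance, here is a rough sketch of the regular-expression alternative, using only the standard re and urllib2 modules (the pattern is deliberately naive and will not cope with every HTML variant):

import re
import urllib2

html = urllib2.urlopen("http://www.crummy.com/software/BeautifulSoup/").read()
#naive pattern: capture the value of every href="..." attribute
links = re.findall(r'href="([^"]+)"', html)
print links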
II.1- Install BeautifulSoup
wget http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.1.tar.gz
tar zxvf BeautifulSoup-3.2.1.tar.gz
cd BeautifulSoup-3.2.1
python setup.py install

II.2- Example of use
from BeautifulSoup import BeautifulSoup
import urllib2
website = "http://www.crummy.com/software/BeautifulSoup/"
fo = urllib2.urlopen(website)
soup = BeautifulSoup(fo)
#display page content with indent
print soup.prettify()
#get all paragraphs in page
paragraphs = soup.findAll("p")
#get all links that have a `href` attribute
links = soup.findAll("a", href=True)
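As a small follow-up to the example above, the objects returned by findAll can be used directly; a minimal sketch that prints the target and the text of each link found:

#iterate over the <a> tags found above
for link in links:
    print link['href'], link.string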
III- Break URLs with urlparse
Definition from the official documentation:
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais.
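Before moving on, here is a minimal sketch of how urlparse and urljoin are typically used (the URLs are only illustrative):

from urlparse import urlparse, urljoin

#split a URL into its components
parts = urlparse("http://www.crummy.com/software/BeautifulSoup/download/3.x/")
print parts.scheme   # 'http'
print parts.netloc   # 'www.crummy.com'
print parts.path     # '/software/BeautifulSoup/download/3.x/'

#resolve a relative link against a base URL
print urljoin("http://www.crummy.com/software/BeautifulSoup/", "download/3.x/")
# -> 'http://www.crummy.com/software/BeautifulSoup/download/3.x/'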
IV- Start coding
#-*- coding: utf-8 -*-
from __future__ import division
import urllib2
import time
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
import re
import threading
import Queue
_URLBASE = 'http://www.elektronique.fr/cours.php'  # start page of the crawl
#_FOLLOW_EXTERN = 'N'#(y/n)|(yes/no)
_DEPTH = 1        # how many rounds of link-following to perform
_Result = list()  # all URLs collected so far
_folow = 0        # current depth, compared against _DEPTH
def Curls(urls, q):
    #fetch every URL in `urls` and push the set of absolute links found into the queue `q`
    visitedurl = set()
    print 'Load.............'
    for url in urls:
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            #skip URLs that cannot be fetched
            continue
        soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
        links = soup('a')
        for link in links:
            if 'href' in dict(link.attrs):
                #resolve relative links against the page URL
                absolute = urljoin(url, link['href'])
                #keep only well-formed absolute http(s) URLs
                if re.match(r'^https?://(w{3}\.)?[a-zA-Z0-9._-]+\.[a-z]{2,}.+$', absolute):
                    visitedurl.add(absolute)
    q.put(visitedurl)
def getLinks(urls):
    global _folow
    global _Result
    #stop once the requested depth has been reached
    if _folow > _DEPTH:
        return _Result
    q = Queue.Queue()
    _urls = list()
    #one thread per slice of 10 URLs
    nbfor = len(urls) // 10
    for i in range(nbfor):
        start = i * 10
        end = (i + 1) * 10
        threading.Thread(target=Curls, args=(urls[start:end], q)).start()
        _urls = _urls + list(q.get())
    #remaining URLs (fewer than 10)
    if len(urls) % 10:
        threading.Thread(target=Curls, args=(urls[nbfor * 10:], q)).start()
        _urls = _urls + list(q.get())
    _folow += 1
    _Result = _Result + _urls
    return getLinks(_urls)
if __name__ == '__main__':
    start_time = time.time()
    getLinks([_URLBASE])
    #remove duplicates before counting
    print len(set(_Result))
    end_time = time.time() - start_time
    print str(end_time), "seconds"
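A quick note on the design: Curls fetches one slice of URLs, extracts every absolute http(s) link it finds and pushes the resulting set into the shared Queue; getLinks cuts the URL list into slices of ten, starts one thread per slice and collects one result set from the queue for each thread started, then recurses on the newly found URLs until _DEPTH rounds of link-following have been performed.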
