##master-page:HomepageTemplate
#format wiki
== Cenyongh ==
=== Template-based HTML scraper ===
An extension built on top of Beautiful Soup that scrapes HTML pages based on a template.
{{{#!python
'''
Assume the page being extracted looks like this:
content = "

Title	xxxxxxxxxxx
Date	yyyyyyyyyyy
Tags	zzzzzzzzzzzz

"

Example 1: To extract the title, date, and tags from the page, define a pattern like this:
pattern = "

*	$title
*	$date
*	$content

"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)  # returns all the tags that match the pattern
values = scraper.extract(ret[0])  # gets the values defined as '$xx'
# values is a dict; for the page and pattern above, it looks like this:
{'title':'xxxxxxx','date':'yyyyyyy','content':'zzzzzzzzz'}

Example 2: Extract just the title
pattern = "

* # use an asterisk to match any content that you don't care about

$title

"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])
# values is a dict; for the page and pattern above, it looks like this:
{'title':'xxxxxxx'}

Example 3: Extract the tag as a whole
pattern = "

*$content

...

]}

More examples can be found in scraper_test.py.
'''
}}}
The class and related test cases can be downloaded from here: [[attachment:scraper.rar]].

Email: <> ...
----
CategoryHomepage
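
The template idea used in the examples above (literal text must match, `*` matches anything, `$name` captures a value) can be sketched with the standard library alone. This is not the `Scraper` class from the attachment: the function names `match_node` and `scrape` are hypothetical, the table markup in the usage note is an assumption made for illustration, and `xml.etree.ElementTree` is used in place of Beautiful Soup, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET


def match_node(pattern, node, values):
    """Return True if `node` matches `pattern`, storing '$name' captures in `values`."""
    if pattern.tag != node.tag:
        return False

    p_text = (pattern.text or "").strip()
    n_text = (node.text or "").strip()
    if p_text.startswith("$"):
        # '$name' captures the text of the corresponding page element.
        values[p_text[1:]] = n_text
    elif p_text != "*" and p_text != n_text:
        # Anything other than the '*' wildcard must match literally.
        return False

    # In this sketch the child elements must line up one-to-one.
    p_children = list(pattern)
    n_children = list(node)
    if len(p_children) != len(n_children):
        return False
    return all(match_node(p, n, values) for p, n in zip(p_children, n_children))


def scrape(pattern_markup, page_markup):
    """Return one dict of captured values per page element that matches the pattern."""
    pattern = ET.fromstring(pattern_markup)
    page = ET.fromstring(page_markup)
    results = []
    for node in page.iter(pattern.tag):
        values = {}
        if match_node(pattern, node, values):
            results.append(values)
    return results
```

For instance, under the assumed markup `<table><tr><td>Title</td><td>xxx</td></tr><tr><td>Date</td><td>yyy</td></tr></table>`, the pattern `<tr><td>Title</td><td>$title</td></tr>` matches only the first row and captures its second cell under the key `title`, while `<tr><td>*</td><td>$value</td></tr>` matches both rows.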