##master-page:HomepageTemplate
#format wiki
== Cenyongh ==
=== Template-based HTML scraper ===
An extension built on top of Beautiful Soup that scrapes HTML pages based on a template.
{{{#!python
'''
Assume the page being extracted looks like this:
content = "
Title xxxxxxxxxxx
Date yyyyyyyyyyy
Tags zzzzzzzzzzzz
"

Example 1: To extract the title, date, and tags from the page, write:
pattern = "
* $title
* $date
* $content
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)  # returns all the tags that match the pattern
values = scraper.extract(ret[0])  # gets the values defined as '$xx'
# values is a dict; for the page and pattern above it looks like this:
{'title':'xxxxxxx','date':'yyyyyyy','content':'zzzzzzzzz'}

Example 2: Extract only the title.
pattern = "
* # use an asterisk to match any content that you don't care about
* $title
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])
# values is a dict; for the page and pattern above it looks like this:
{'title':'xxxxxxx'}

Example 3: Extract a tag as a whole.
pattern = "
*$content
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])
# values is a dict; for the page and pattern above it looks like this:
{'content':[
...
]}

More examples can be found in scraper_test.py.
'''
}}}
The class and the related test cases can be downloaded here: [[attachment:scraper.rar]].

Email: <> ...
----
CategoryHomepage
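
For readers without the attachment, the template idea from the examples above can be sketched with the Python standard library alone. This is not the code in scraper.rar and does not use Beautiful Soup: the names `Node`, `TreeBuilder`, `parse`, `match`, and `extract` are hypothetical, and the `<ul><li>` markup of the sample page is an assumption. It only illustrates how a `$name` placeholder can capture text and how `*` can skip a node.

```python
from html.parser import HTMLParser


class Node:
    """A tiny element node: a tag name plus child nodes or text strings."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML (every tag explicitly closed)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:  # never pop the synthetic root
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.stack[-1].children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def match(pattern, node, values):
    """Return True if node fits pattern; '$name' captures go into values."""
    if isinstance(pattern, str):
        if pattern == "*":  # wildcard: accept any single child
            return True
        if pattern.startswith("$"):  # placeholder: capture a text child
            if not isinstance(node, str):
                return False
            values[pattern[1:]] = node
            return True
        return pattern == node  # literal text must match exactly
    if isinstance(node, str) or pattern.tag != node.tag:
        return False
    if len(pattern.children) != len(node.children):
        return False
    return all(match(p, n, values)
               for p, n in zip(pattern.children, node.children))


def extract(pattern_html, page_html):
    """Return the captured values as a dict, or None if the page does not match."""
    values = {}
    if match(parse(pattern_html), parse(page_html), values):
        return values
    return None


# Assumed sample page; the real markup of the page above is not known.
page = "<ul><li>Title xxx</li><li>Date yyy</li><li>Tags zzz</li></ul>"
print(extract("<ul><li>$title</li><li>$date</li><li>$content</li></ul>", page))
print(extract("<ul><li>$title</li><li>*</li><li>*</li></ul>", page))
```

Unlike `Scraper.match`, which returns every matching tag in a document, this sketch only compares the whole pattern against the whole page, and `*` stands for exactly one child.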