##master-page:HomepageTemplate
#format wiki
== Cenyongh ==
=== Template-based HTML scraper ===
An extension built on top of Beautiful Soup that scrapes HTML pages based on templates.
{{{
#!python
'''
Assume the page being extracted looks like this:
content = "
Title |
xxxxxxxxxxx |
Date |
yyyyyyyyyyy |
Tags |
zzzzzzzzzzzz |
"
Example 1: To extract the title, date, and tags from the page, do the following:
pattern = "
* |
$title |
* |
$date |
* |
$content |
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc) # returns all the tags that match the pattern
values = scraper.extract(ret[0]) # get the values defined as '$xx'
# values is a dict; given the page and pattern above, it will look like this
{'title':'xxxxxxx','date':'yyyyyyy','content':'zzzzzzzzz'}
Example 2: just extract the title
pattern = "
* |
$title |
* # use an asterisk to match any content that you don't care about
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])
# values is a dict; given the page and pattern above, it will look like this
{'title':'xxxxxxx'}
Example 3: extract the tag as a whole
pattern = "
*$content
"
doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])
# values is a dict; given the page and pattern above, it will look like this
{'content':[]}
More examples can be found in scraper_test.py
'''
}}}
The class and related test cases can be downloaded from here: [[attachment:scraper.rar]].
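
The template idea described in the docstring above can be illustrated with a small standard-library sketch. This is not the `Scraper` class from the attachment and does not use Beautiful Soup; `TextCollector`, `text_nodes`, and `extract` are hypothetical names, and the table markup in `page` is an assumption, since the original page markup is not shown here. The sketch only matches text nodes in order: `*` skips one node, `$name` captures one node, and a trailing `*` absorbs whatever is left.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def text_nodes(html):
    """Return the stripped, non-empty text nodes found in html."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def extract(pattern, html):
    """Match whitespace-separated pattern tokens against text nodes.

    '*' skips one node, '$name' captures one node under 'name', and any
    other token must equal the node literally. A trailing '*' absorbs all
    remaining nodes. Returns a dict of captures, or None on a mismatch.
    """
    tokens = pattern.split()
    nodes = text_nodes(html)
    values = {}
    for i, token in enumerate(tokens):
        if token == '*' and i == len(tokens) - 1:
            return values  # trailing wildcard: ignore the rest of the page
        if i >= len(nodes):
            return None  # pattern is longer than the page
        if token.startswith('$'):
            values[token[1:]] = nodes[i]
        elif token != '*' and token != nodes[i]:
            return None  # literal token did not match
    return values if len(tokens) == len(nodes) else None


# Assumed page layout: a table with label/value cells.
page = (
    "<table>"
    "<tr><td>Title</td><td>xxx</td></tr>"
    "<tr><td>Date</td><td>yyy</td></tr>"
    "<tr><td>Tags</td><td>zzz</td></tr>"
    "</table>"
)

all_values = extract("* $title * $date * $content", page)
title_only = extract("* $title *", page)
```

Here `all_values` holds the three captured fields and `title_only` holds just the title, mirroring Examples 1 and 2. A real template matcher built on Beautiful Soup would compare tag structure as well as text, which this sketch deliberately leaves out.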
----
CategoryHomepage