Cenyongh
Template-based HTML scraper
An extension built on top of Beautiful Soup that scrapes HTML pages according to a template.
'''
Assume the page being extracted looks like this:

content = """
<html>
<body>
<div id='xx'>
<table>
<tr>
<td>Title</td>
<td>xxxxxxxxxxx</td>
</tr>
<tr>
<td>Date</td>
<td>yyyyyyyyyyy</td>
</tr>
<tr>
<td>Tags</td>
<td>zzzzzzzzzzzz</td>
</tr>
</table>
</div>
</body>
</html>
"""


Example 1: to extract the title, date and tags from the page, all you need to do is:


pattern = """
<div>
<table>
<tr>
<td>*</td>
<td>$title</td>
</tr>
<tr>
<td>*</td>
<td>$date</td>
</tr>
<tr>
<td>*</td>
<td>$content</td>
</tr>
</table>
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)  # returns all the tags that match the pattern
values = scraper.extract(ret[0])  # get the values defined as '$xx'

# values is a dict; for the page and pattern above it will look like this:
{'title':'xxxxxxxxxxx','date':'yyyyyyyyyyy','content':'zzzzzzzzzzzz'}

Example 2: extract just the title. A bare asterisk matches all the content you don't care about.
pattern = """
<div>
<table>
<tr>
<td>*</td>
<td>$title</td>
</tr>
*
</table>
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it will look like this:
{'title':'xxxxxxxxxxx'}


Example 3: extract the <table> tag as a whole
pattern = """
<div>
*$content
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it will look like this:
{'content':[<table><tr>...</tr></table>]}

More examples can be found in scraper_test.py.

'''
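To make the matching idea in Example 1 concrete, here is a minimal standalone sketch using only the standard library. It is not the code in scraper.rar and does not use Beautiful Soup: the function names `match_node` and `extract_all` are hypothetical, it assumes the page is well-formed XML, and it only understands `*` and `$name` as the text of a tag (the bare `*` of Example 2 and the `*$content` form of Example 3 are not handled).

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body><div id='xx'><table>
<tr><td>Title</td><td>xxxxxxxxxxx</td></tr>
<tr><td>Date</td><td>yyyyyyyyyyy</td></tr>
<tr><td>Tags</td><td>zzzzzzzzzzzz</td></tr>
</table></div></body></html>
"""

PATTERN = """
<div><table>
<tr><td>*</td><td>$title</td></tr>
<tr><td>*</td><td>$date</td></tr>
<tr><td>*</td><td>$content</td></tr>
</table></div>
"""


def match_node(pat, node, out):
    """Return True if node has the same shape as pat; store $name values in out."""
    # Tag names and the number of child tags must agree; attributes are ignored.
    if pat.tag != node.tag or len(pat) != len(node):
        return False
    pat_text = (pat.text or "").strip()
    node_text = (node.text or "").strip()
    if pat_text.startswith("$"):
        # '$name' captures the text of the corresponding tag.
        out[pat_text[1:]] = node_text
    elif pat_text and pat_text != "*" and pat_text != node_text:
        # Literal text must match exactly; '*' accepts any text.
        return False
    return all(match_node(p, n, out) for p, n in zip(pat, node))


def extract_all(pattern, page):
    """Return one dict of captured values per subtree of page that matches pattern."""
    pat = ET.fromstring(pattern.strip())
    results = []
    for node in ET.fromstring(page.strip()).iter(pat.tag):
        values = {}
        if match_node(pat, node, values):
            results.append(values)
    return results


print(extract_all(PATTERN, PAGE))
# [{'title': 'xxxxxxxxxxx', 'date': 'yyyyyyyyyyy', 'content': 'zzzzzzzzzzzz'}]
```

Real-world HTML is rarely well-formed XML, which is presumably why the actual class builds on Beautiful Soup rather than a strict XML parser.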
The class and its related test cases can be downloaded here: [attachment:scraper.rar].
Email: MailTo([email protected])
...