Revision 3 as of 2009-12-25 07:12:31
Cenyongh
Template-based HTML scraper
An extension built on top of Beautiful Soup that scrapes HTML pages based on a template.
'''
Assume the page being scraped looks like this:

content = """
<html>
  <body>
    <div id='xx'>
      <table>
        <tr>
          <td>Title</td>
          <td>xxxxxxxxxxx</td>
        </tr>
        <tr>
          <td>Date</td>
          <td>yyyyyyyyyyy</td>
        </tr>
        <tr>
          <td>Tags</td>
          <td>zzzzzzzzzzzz</td>
        </tr>
      </table>
    </div>
  </body>
</html>
"""

Example 1: to extract the title, date, and tags from the page, write a pattern
that mirrors the page structure (the tags are captured under the name 'content'):

pattern = """
<div>
  <table>
    <tr>
      <td>*</td>
      <td>$title</td>
    </tr>
    <tr>
      <td>*</td>
      <td>$date</td>
    </tr>
    <tr>
      <td>*</td>
      <td>$content</td>
    </tr>
  </table>
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)          # returns all the tags that match the pattern
values = scraper.extract(ret[0])  # gets the values bound to the '$xx' placeholders

# values is a dict; for the page and pattern above it looks like this:
{'title': 'xxxxxxxxxxx', 'date': 'yyyyyyyyyyy', 'content': 'zzzzzzzzzzzz'}

Example 2: extract only the title

pattern = """
<div>
  <table>
    <tr>
      <td>*</td>
      <td>$title</td>
    </tr>
    *   # an asterisk matches any content you don't care about
  </table>
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it looks like this:
{'title': 'xxxxxxxxxxx'}

Example 3: extract the <table> tag as a whole

pattern = """
<div>
  *$content
</div>
"""

doc = CustomizedSoup(content)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it looks like this:
{'content': [<table><tr>...</tr></table>]}

More examples can be found in scraper_test.py.

'''
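To make the matching idea concrete without the attachment, here is a minimal, self-contained sketch of template matching with `*` wildcards and `$name` captures. It is not the attached Scraper class: it assumes well-formed markup and uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup, and the names `match_node` and `scrape` are hypothetical. It covers only the behaviour shown in Examples 1 and 2.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body><div id='xx'><table>
<tr><td>Title</td><td>xxxxxxxxxxx</td></tr>
<tr><td>Date</td><td>yyyyyyyyyyy</td></tr>
<tr><td>Tags</td><td>zzzzzzzzzzzz</td></tr>
</table></div></body></html>
"""

PATTERN_ALL = """
<div><table>
<tr><td>*</td><td>$title</td></tr>
<tr><td>*</td><td>$date</td></tr>
<tr><td>*</td><td>$content</td></tr>
</table></div>
"""

PATTERN_TITLE = """
<div><table>
<tr><td>*</td><td>$title</td></tr>
*
</table></div>
"""


def match_node(pat, node, values):
    """Return True if `node` matches template element `pat`.

    Captured '$name' texts are stored in `values` (hypothetical helper).
    """
    if pat.tag != node.tag:
        return False
    pat_children = list(pat)
    node_children = list(node)
    if not pat_children:
        # Leaf of the template: '*', '$name', or literal text.
        text = (pat.text or "").strip()
        if text == "*":
            return True
        if text.startswith("$"):
            values[text[1:]] = (node.text or "").strip()
            return True
        return text == (node.text or "").strip()
    # A bare '*' after the last template child lets extra trailing
    # children of the page node pass unmatched.
    trailing_wildcard = (pat_children[-1].tail or "").strip() == "*"
    if trailing_wildcard:
        if len(node_children) < len(pat_children):
            return False
        node_children = node_children[:len(pat_children)]
    elif len(node_children) != len(pat_children):
        return False
    return all(match_node(p, n, values)
               for p, n in zip(pat_children, node_children))


def scrape(pattern, page):
    """Return one dict of captured values per page element matching `pattern`."""
    pat_root = ET.fromstring(pattern.strip())
    page_root = ET.fromstring(page.strip())
    results = []
    for candidate in page_root.iter(pat_root.tag):
        values = {}
        if match_node(pat_root, candidate, values):
            results.append(values)
    return results


print(scrape(PATTERN_ALL, PAGE))
print(scrape(PATTERN_TITLE, PAGE))
```

The sketch compares tag names only and ignores attributes, so `<div>` in the pattern matches `<div id='xx'>` in the page. Real-world HTML is rarely well-formed XML, which is presumably why the original builds on Beautiful Soup.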
The class and its related test cases can be downloaded here: scraper.rar.
Email: <[email protected]>
...