Diff for "nickcen" - Woodpecker Wiki for CPUG

Cenyongh

Template-based HTML scraper

An extension built on top of Beautiful Soup that scrapes HTML pages based on templates.

'''
Assume the page being scraped looks like this:

content = """
<html>
    <body>
        <div id='xx'>
            <table>
                <tr>
                    <td>Title</td>
                    <td>xxxxxxxxxxx</td>
                </tr>
                <tr>
                    <td>Date</td>
                    <td>yyyyyyyyyyy</td>
                </tr>
                <tr>
                    <td>Tags</td>
                    <td>zzzzzzzzzzzz</td>
                </tr>
            </table>
        </div>
    </body>
</html>
"""


Example 1: To extract the title, date and tags from the page (the tags are captured here as $content), write a pattern like this:

pattern = """
<div>
    <table>
        <tr>
            <td>*</td>
            <td>$title</td>
        </tr>
        <tr>
            <td>*</td>
            <td>$date</td>
        </tr>
        <tr>
            <td>*</td>
            <td>$content</td>
        </tr>
    </table>
</div>
"""

doc = CustomizedSoup(index_page)    # index_page holds the HTML of the page, e.g. content above
scraper = Scraper(pattern)
ret = scraper.match(doc)            # returns all the tags that match the pattern
values = scraper.extract(ret[0])    # gets the values defined as '$xx'

# values is a dict; for the page and pattern above it looks like this:
{'title':'xxxxxxxxxxx','date':'yyyyyyyyyyy','content':'zzzzzzzzzzzz'}

Example 2: Extract only the title. A bare asterisk matches all the content that you don't care about.

pattern = """
<div>
    <table>
        <tr>
            <td>*</td>
            <td>$title</td>
        </tr>
        *
    </table>
</div>
"""

doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it looks like this:
{'title':'xxxxxxxxxxx'}


Example 3: Extract the <table> tag as a whole.

pattern = """
<div>
    *$content
</div>
"""

doc = CustomizedSoup(index_page)
scraper = Scraper(pattern)
ret = scraper.match(doc)
values = scraper.extract(ret[0])

# values is a dict; for the page and pattern above it looks like this:
{'content':[<table><tr>...</tr></table>]}

More examples can be found in scraper_test.py.

'''
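
The scraper source itself is not reproduced on this page, so the following is only a rough sketch of how this kind of template matching could work. It is not the code from scraper.rar. It uses the standard library's xml.etree.ElementTree in place of Beautiful Soup, so it only handles well-formed markup, and it ignores attributes. The names match_node, scrape and PAGE are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, shaped like the one in the examples above.
PAGE = """
<html><body><div id='xx'><table>
    <tr><td>Title</td><td>xxxxxxxxxxx</td></tr>
    <tr><td>Date</td><td>yyyyyyyyyyy</td></tr>
    <tr><td>Tags</td><td>zzzzzzzzzzzz</td></tr>
</table></div></body></html>
"""


def match_node(pat, node, values):
    """Return True if `node` matches the template element `pat`.

    Captured '$name' values are stored in the `values` dict.
    Attributes are ignored in this sketch.
    """
    if pat.tag != node.tag:
        return False
    pat_children = list(pat)
    if not pat_children:
        text = (pat.text or "").strip()
        if text.startswith("*$"):      # '*$name': capture all child elements
            values[text[2:]] = list(node)
            return True
        if text.startswith("$"):       # '$name': capture the text
            values[text[1:]] = (node.text or "").strip()
            return True
        if text == "*":                # '*': accept any content
            return True
        return text == (node.text or "").strip()
    node_children = list(node)
    # A bare '*' after the last template child means "ignore the remaining siblings".
    open_ended = (pat_children[-1].tail or "").strip() == "*"
    if len(node_children) < len(pat_children):
        return False
    if not open_ended and len(node_children) != len(pat_children):
        return False
    return all(match_node(p, n, values) for p, n in zip(pat_children, node_children))


def scrape(pattern, content):
    """Return one dict of captured values per element that matches `pattern`."""
    pat = ET.fromstring(pattern.strip())
    results = []
    for node in ET.fromstring(content.strip()).iter(pat.tag):
        values = {}
        if match_node(pat, node, values):
            results.append(values)
    return results


if __name__ == "__main__":
    title_only = "<div><table><tr><td>*</td><td>$title</td></tr>*</table></div>"
    print(scrape(title_only, PAGE))
```

In this sketch a failed match simply discards its partially filled dict, so only complete matches are returned. A real implementation on top of Beautiful Soup would also have to cope with malformed HTML, which ElementTree rejects.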

The class and its related test cases can be downloaded here: scraper.rar.

Email: <[email protected]>

...

CategoryHomepage