Using BeautifulSoup Like a CSS Selector

jeff jie <[email protected]>
reply-to        [email protected]
to      [email protected],
[email protected]
date    Tue, Jun 16, 2009 at 11:04

subject [ZPyUG:1030] Using BeautifulSoup like a CSS selector - python-cn`CPyUG` 华蟒用户组 (Chinese Python User Group)

Sharing some code snippets. I kept saying I would share them, and then more than half a year slipped by... sigh.

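A while back I was writing a lot of scrapers. BeautifulSoup works well, but I got tired of hand-writing complicated DOM element lookup statements every time. Then it occurred to me: could it work like a CSS selector, where I pass in a single expression and get back what I want? I searched online and found nothing, and since the work involved looked small, I implemented one myself. It currently supports ID, class, tag and attribute selectors, plus the space (descendant), > (child) and + (adjacent sibling) operators, which covers most lookup needs.
The code is in the attached BTSelector.py, and the unit tests are attached as well.
Shared at: http://code.google.com/p/zqlib/source/browse/tangle/BTSelector/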
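To make the supported syntax concrete, here is a minimal sketch of how such a selector expression could be broken into steps before any matching takes place. It is not the code in BTSelector.py: the names `parse_simple` and `parse_selector` are hypothetical, it uses only the standard `re` module, and it assumes attribute values contain no spaces, `>` or `+`.

```python
import re

# Hypothetical names throughout; this is NOT the code in BTSelector.py.
TAG = re.compile(r"[a-zA-Z][\w-]*")
PART = re.compile(r"#([\w-]+)|\.([\w-]+)|\[([\w-]+)(?:=([^\]]*))?\]")


def parse_simple(text):
    """Parse one simple selector, e.g. "div#name.highlight[href]"."""
    step = {"tag": None, "id": None, "classes": [], "attrs": {}}
    pos = 0
    m = TAG.match(text)
    if m:
        step["tag"] = m.group(0)
        pos = m.end()
    while pos < len(text):
        m = PART.match(text, pos)
        if m is None:
            raise ValueError("cannot parse %r at offset %d" % (text, pos))
        ident, cls, attr, value = m.groups()
        if ident is not None:
            step["id"] = ident
        elif cls is not None:
            step["classes"].append(cls)
        else:
            # A value of None means "the attribute only has to be present".
            if value is not None:
                value = value.strip("\"'")
            step["attrs"][attr] = value
        pos = m.end()
    return step


def parse_selector(selector):
    """Split a selector into (combinator, step) pairs.

    The combinator is " " (descendant), ">" (child) or "+" (adjacent
    sibling); the first step is always paired with " ".
    """
    # Pad > and + with spaces so that split() isolates every token.
    tokens = re.sub(r"\s*([>+])\s*", r" \1 ", selector).split()
    steps, combinator = [], " "
    for token in tokens:
        if token in (">", "+"):
            combinator = token
        else:
            steps.append((combinator, parse_simple(token)))
            combinator = " "
    return steps


for combinator, step in parse_selector("#header > div#name > a.highlight"):
    print(repr(combinator), step)
```

Each resulting step could then be turned into a lookup on the parsed tree, with the combinator deciding whether to search all descendants, only direct children, or only the next sibling.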

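Note: this script depends on BeautifulSoup, so make sure it is installed before you use it.

Typical usage:

    # BeautifulSoup 3 import path (assumed; the original snippet omits it)
    from BeautifulSoup import BeautifulSoup
    from BTSelector import findAll

    soup = BeautifulSoup(htmlContent)  # htmlContent: the HTML string to parse
    nodes = findAll('div.navigator #notice', soup)
    # findAll returns a list of the DOM objects that match the selector;
    # in practice they are instances of BeautifulSoup's tag or string classes.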

Besides findAll, you can also call functions such as findById and findByTag directly; anyone interested can discover the rest by reading the source.

Some test cases as usage examples:

1. First, a slightly more complex case:

    # A method excerpted from the attached unit test; it assumes
    # BeautifulSoup and findAll have already been imported.
    def testMixSelection(self):
        target = "#header > div#name > a.highlight"
        html = '''
        <div id="header">
            <div id="name">
                <a class="target">test</a>
                <a class="highlight">right</a>
                <a class="highlight">ok</a>
            </div>
            <div id="your">
            </div>
        </div>
        <div id="body">fk
        </div>
        '''
        soup = BeautifulSoup(html)

        ret = findAll(target, soup)

        # Only the two a.highlight links are direct children of div#name.
        self.assertEqual(2, len(ret))
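As a cross-check on why `#header > div#name > a.highlight` should yield two nodes, the same selection can be sketched with the standard library alone. This swaps in `xml.etree.ElementTree` and its limited XPath support in place of BeautifulSoup and BTSelector, so it only works because this particular fragment happens to be well-formed; the `<root>` wrapper is an addition needed to give the fragment a single top-level element.

```python
import xml.etree.ElementTree as ET

HTML = """
<div id="header">
    <div id="name">
        <a class="target">test</a>
        <a class="highlight">right</a>
        <a class="highlight">ok</a>
    </div>
    <div id="your">
    </div>
</div>
<div id="body">fk
</div>
"""

# Wrap the fragment so that it has exactly one top-level element.
root = ET.fromstring("<root>" + HTML + "</root>")

# ".//*[@id='header']"  any element with id="header"   (#header)
# "/div[@id='name']"    a direct child div with id=name (> div#name)
# "/a[@class=...]"      a direct child link             (> a.highlight)
path = ".//*[@id='header']/div[@id='name']/a[@class='highlight']"
matches = root.findall(path)

print([a.text for a in matches])
```

One difference to keep in mind: `[@class='highlight']` compares the whole attribute string, whereas a CSS class selector matches any one of the space-separated class names.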

2. Next, a case that uses the positional operators:

    # A method excerpted from the attached unit test; it assumes
    # BeautifulSoup and findAll have already been imported.
    def testPosition(self):
        target = "h2 + ul > li > a"
        html = '''
        <h2>title</h2>
        <ul>
            <li><a href="#">nothing</a></li>
            <li><a href="#">ok</a></li>
            <li><a href="#">come on!</a></li>
        </ul>
        '''
        soup = BeautifulSoup(html)

        ret = findAll(target, soup)

        # The ul directly follows the h2, and each of its three li holds one link.
        self.assertEqual(3, len(ret))
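The `+` operator is the least familiar of the three, so here is a small sketch of the relation it expresses: an element that immediately follows another element under the same parent. Again this uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup, and `adjacent_siblings` is a hypothetical helper, not a function from BTSelector.py.

```python
import xml.etree.ElementTree as ET

HTML = """
<h2>title</h2>
<ul>
    <li><a href="#">nothing</a></li>
    <li><a href="#">ok</a></li>
    <li><a href="#">come on!</a></li>
</ul>
"""


def adjacent_siblings(root, first_tag, second_tag):
    """Yield each <second_tag> element that directly follows a <first_tag>
    element under the same parent (the CSS "first + second" relation)."""
    for parent in root.iter():
        children = list(parent)  # element children only, no text nodes
        for prev, node in zip(children, children[1:]):
            if prev.tag == first_tag and node.tag == second_tag:
                yield node


# Wrap the fragment so that it has exactly one top-level element.
root = ET.fromstring("<root>" + HTML + "</root>")

# "h2 + ul" picks the list, then "> li > a" walks two child levels down.
links = [a for ul in adjacent_siblings(root, "h2", "ul")
         for a in ul.findall("./li/a")]

print(len(links))  # 3
```

ElementTree keeps whitespace in `.text` and `.tail` rather than as separate nodes, which is why the pairing above is so simple. In a BeautifulSoup tree the whitespace between `</h2>` and `<ul>` appears as a string node of its own, so an implementation there would need to skip over string siblings to reach the next tag.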

More examples are in the attached unit tests. There is nothing very sophisticated in there, but I hope you like it.

pyquery

金浩 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Jun 16, 2009 at 11:10
subject [CPyUG:89577] Re: Using BeautifulSoup like a CSS selector

Actually, something similar probably exists already, for example pyquery: http://pyquery.org/

Also, I have heard many people say that BeautifulSoup performs very poorly.

GAE scheduled tasks in practice

mutou majia <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Jun 16, 2009 at 16:48
subject [CPyUG:89635] Why does my Google App Engine cron task not work properly?

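Problem

When I open the URL by hand, everything works, but when cron requests the same URL on its own, it always fails.

For example, the handler behind the cron URL writes a record to the database.

Entering the URL by hand works fine, but it does not work when cron runs it.

In Google's admin console, Cron Jobs always shows:

On time Failed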

fixed

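I found the cause. For the URL's handler, do not use

(r'^tasks/', include('tasks.views')),
but instead use
(r'^tasks/check', tasks.views.task_check),
and then it works.

It was a painful search. A whole week went by without my knowing where the problem was.

Even now, GAE's Django support is still somewhat lacking.

Still, thanks are due to Google. Who else offers a testing ground like this?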
