1. ZSPY
Scraping the web with Python.
The code lives at http://zspy.googlecode.com
Work in progress....
张沈鹏 (Zhang Shenpeng) [email protected] http://zsp.javaeye.com/
2008-1-23 16:42
1.1. Day One: PycURL
PycURL http://pycurl.sourceforge.net/
PycURL is a Python interface to libcurl. Since it is written in C, it is faster and more capable than urllib, and it can cap the redirect depth to protect against redirect (rewrite) loop traps. That makes it well suited to web crawlers and page fetching.
Download pycurl-ssl-7.16.4.win32-py2.5.exe from http://pycurl.sourceforge.net/download/ and install it.
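To confirm the install worked, pycurl reports its own and libcurl's versions. A minimal check, assuming a stock pycurl build:

import pycurl
# prints something like "PycURL/7.16.4 libcurl/7.16.4 OpenSSL/0.9.8 zlib/1.2.3"
print pycurl.version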
Test code, adapted from reference 1 below:
import StringIO

# a StringIO behaves like a file, but writes into a string
html = StringIO.StringIO()

import pycurl
c = pycurl.Curl()

c.setopt(pycurl.URL, 'http://www.baidu.com')

# write callback: every chunk of body data is handed to html.write
c.setopt(pycurl.WRITEFUNCTION, html.write)

c.setopt(pycurl.FOLLOWLOCATION, 1)

# maximum number of redirects to follow; guards against redirect traps
c.setopt(pycurl.MAXREDIRS, 5)

# perform the request; blocks until the transfer finishes
c.perform()

# prints 200 (the HTTP status code) and http://www.baidu.com (the effective URL)
print c.getinfo(pycurl.HTTP_CODE), c.getinfo(pycurl.EFFECTIVE_URL)

# print the HTML of the Baidu front page
#print html.getvalue()
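A crawler should not hang on a dead server, and perform() raises pycurl.error on failure. A minimal sketch of timeouts and error handling; the option names and the exception are standard pycurl, but the timeout values are arbitrary choices:

import StringIO
import pycurl

html = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.baidu.com')
c.setopt(pycurl.WRITEFUNCTION, html.write)
c.setopt(pycurl.CONNECTTIMEOUT, 10)  # give up connecting after 10 seconds
c.setopt(pycurl.TIMEOUT, 30)         # abort the whole transfer after 30 seconds
try:
    c.perform()
except pycurl.error, e:
    # e.args holds (curl error code, error message)
    print "fetch failed:", e
c.close()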
1.1.1. References
1.1.1.1. A quick look at PycURL
http://blog.donews.com/limodou/archive/2005/11/28/641257.aspx PycURL is a Python binding for libcurl, written in C. libcurl is a free and easy-to-use client-side URL transfer library. It is very capable; the PycURL homepage lists the following supported features:
- supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. libcurl supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunneling and more!
That long list of protocols is impressive by itself, and the proxy and user-authentication support are especially welcome. Unlike urllib2, this library is not pure Python: it wraps a C library, which makes it faster, but also less pythonic and somewhat harder to learn. It has been ported to many platforms, including Linux, Mac, Windows, and several flavors of Unix.
I installed it and tried a small piece of code. It is indeed a bit involved; the code follows:
import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://feeds.feedburner.com/solidot')
import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
# c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
# c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')
c.perform()
print b.getvalue()
The code above fetches the RSS feed of Solidot. If you sit behind a proxy server, just uncomment and adjust the two commented lines. The PycURL homepage also carries a multi-threaded fetching example; a single-threaded equivalent using CurlMulti is sketched below.
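Besides threads, pycurl can drive several transfers from one thread through its CurlMulti interface. A sketch assuming the standard CurlMulti API (add_handle/select/perform); the URL list is just an illustration:

import StringIO
import pycurl

urls = ['http://feeds.feedburner.com/solidot', 'http://www.baidu.com']
m = pycurl.CurlMulti()
handles = []
for url in urls:
    c = pycurl.Curl()
    buf = StringIO.StringIO()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    m.add_handle(c)
    handles.append((c, buf))

# pump the multi handle until every transfer finishes
num_handles = len(handles)
while num_handles:
    m.select(1.0)  # wait until any of the sockets is ready
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

for c, buf in handles:
    print c.getinfo(pycurl.EFFECTIVE_URL), len(buf.getvalue()), "bytes"
    m.remove_handle(c)
    c.close()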
1.1.1.2. Studying the pycurl module in Python
Author: [email protected]. Source: Evil Octal Security Team (www.eviloctal.com)
1. Use getinfo to obtain more information:
#! /usr/bin/env python
# vi:ts=4:et
# $Id: test_getinfo.py,v 1.18 2003/05/01 19:35:01 mfx Exp $
# Author BY MSN:[email protected]
import time
import pycurl


## Callback function invoked when progress information is updated
# reports how much of the download has completed so far
def progress(download_t, download_d, upload_t, upload_d):
    print "Total to download %d bytes, have %d bytes so far" % \
          (download_t, download_d)

url = "http://www.sohu.com/index.html"

print "Starting downloading", url
print
f = open("body.html", "wb")   # receives the returned page body
h = open("header.txt", "wb")  # receives the response headers
i = open("info.txt", "wb")    # receives the getinfo() results
c = pycurl.Curl()
c.setopt(c.URL, url)                    # the URL to fetch
c.setopt(c.WRITEDATA, f)                # write the page body into f
c.setopt(c.NOPROGRESS, 0)               # enable progress reporting
c.setopt(c.PROGRESSFUNCTION, progress)  # install the progress callback
c.setopt(c.FOLLOWLOCATION, 1)
c.setopt(c.MAXREDIRS, 5)
c.setopt(c.WRITEHEADER, h)              # write the response headers into h
c.setopt(c.OPT_FILETIME, 1)             # ask for the remote file time
c.perform()                             # perform the transfer

print
print "HTTP-code:", c.getinfo(c.HTTP_CODE)  # e.g. 200
buf = c.getinfo(c.HTTP_CODE)
i.write("HTTP-code:" + str(buf))            # record it in info.txt
print "Total-time:", c.getinfo(c.TOTAL_TIME)  # total transfer time, e.g. 0.795
buf = c.getinfo(c.TOTAL_TIME)
i.write('\r\n')
i.write("Total-time:" + str(buf))
print "Download speed: %.2f bytes/second" % c.getinfo(c.SPEED_DOWNLOAD)  # e.g. 261032.00 bytes/second
print "Document size: %d bytes" % c.getinfo(c.SIZE_DOWNLOAD)             # e.g. 207521 bytes
print "Effective URL:", c.getinfo(c.EFFECTIVE_URL)      # e.g. http://www.sohu.com/index.html
print "Content-type:", c.getinfo(c.CONTENT_TYPE)        # e.g. text/html
print "Namelookup-time:", c.getinfo(c.NAMELOOKUP_TIME)  # DNS lookup time, e.g. 0.065
print "Redirect-time:", c.getinfo(c.REDIRECT_TIME)      # e.g. 0.0
print "Redirect-count:", c.getinfo(c.REDIRECT_COUNT)    # e.g. 0
epoch = c.getinfo(c.INFO_FILETIME)
print "Filetime: %d (%s)" % (epoch, time.ctime(epoch))  # e.g. 1172361818 (Sun Feb 25 08:03:38 2007)
print
print "Header is in file 'header.txt', body is in file 'body.html'"

c.close()
f.close()
h.close()
i.close()
2. Basic usage:
#!c:\python25\python
# vi:ts=4:et
# $Id: test_cb.py,v 1.14 2003/04/21 18:46:10 mfx Exp $
# Author BY MSN:[email protected]

import sys
import pycurl

## Callback function invoked when body data is ready
def body(buf):
    # write body data to stdout
    sys.stdout.write(buf)


## Callback function invoked when header data is ready
def header(buf):
    # write header data to stdout as well
    sys.stdout.write(buf)

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.sohu.com/')  # the URL to fetch
c.setopt(pycurl.WRITEFUNCTION, body)          # body() receives the page body
c.setopt(pycurl.HEADERFUNCTION, header)       # header() receives the headers
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()                                   # perform the transfer
c.close()
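The feature list quoted earlier also mentions HTTP POST. A sketch of submitting form data; POSTFIELDS is a standard libcurl/pycurl option, but the URL and the form fields here are made up for illustration:

import StringIO
import urllib
import pycurl

b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.example.com/login')  # hypothetical endpoint
# setting POSTFIELDS turns the request into an HTTP POST
c.setopt(pycurl.POSTFIELDS, urllib.urlencode({'user': 'aaa', 'pwd': 'aaa'}))
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
print c.getinfo(pycurl.HTTP_CODE)
c.close()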