1. ZSPY

Scraping the web with Python

Code: http://zspy.googlecode.com

Work in progress....

张沈鹏 [email protected] http://zsp.javaeye.com/

2008-1-23 16:42

1.1. Day 1: PycURL

PycURL: http://pycurl.sourceforge.net/

A Python interface to the external libcurl library. Written in C, it is faster than urllib and more powerful; it supports a bounded redirect depth to guard against redirect-loop traps. Well suited to web crawlers and page fetching.

Download pycurl-ssl-7.16.4.win32-py2.5.exe from http://pycurl.sourceforge.net/download/ and install it.

Test code, following reference 1:

   import StringIO

   # a StringIO object lets us treat a string like a file
   html = StringIO.StringIO()

   import pycurl
   c = pycurl.Curl()

   c.setopt(pycurl.URL, 'http://www.baidu.com')

   # write callback
   c.setopt(pycurl.WRITEFUNCTION, html.write)

   c.setopt(pycurl.FOLLOWLOCATION, 1)

   # maximum number of redirects, guards against redirect traps
   c.setopt(pycurl.MAXREDIRS, 5)

   # fetch the page; blocks until the transfer finishes
   c.perform()

   # prints 200 (the HTTP status code) and http://www.baidu.com (the effective URL)
   print c.getinfo(pycurl.HTTP_CODE), c.getinfo(pycurl.EFFECTIVE_URL)

   # print the HTML of the Baidu front page
   #print html.getvalue()
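
The test code above has no timeouts, so one slow or dead server can stall a crawl indefinitely. Below is a minimal sketch of the same fetch with timeouts added, using pycurl's CONNECTTIMEOUT and TIMEOUT options; this sketch is an addition, not part of the original test code:

   import StringIO
   import pycurl

   buf = StringIO.StringIO()
   c = pycurl.Curl()
   c.setopt(pycurl.URL, 'http://www.baidu.com')
   c.setopt(pycurl.WRITEFUNCTION, buf.write)
   c.setopt(pycurl.FOLLOWLOCATION, 1)
   c.setopt(pycurl.MAXREDIRS, 5)
   c.setopt(pycurl.CONNECTTIMEOUT, 10)  # give up connecting after 10 seconds
   c.setopt(pycurl.TIMEOUT, 30)         # abort the whole transfer after 30 seconds
   try:
       c.perform()
   except pycurl.error, e:
       # pycurl raises pycurl.error with (errno, message) on failure, e.g. a timeout
       print "fetch failed:", e
   else:
       print c.getinfo(pycurl.HTTP_CODE), len(buf.getvalue())
   c.close()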

1.1.1. References

1.1.1.1. A quick study of PycURL

http://blog.donews.com/limodou/archive/2005/11/28/641257.aspx PycURL is a Python binding for libcurl, written in C. libcurl is a free, easy-to-use client-side URL transfer library. It is very powerful; the PycURL homepage lists its supported features as:

  • supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. libcurl supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunneling and more!

That long list of protocols is impressive by itself, especially the proxy server and user authentication support. Compared with urllib2, this library is not pure Python but a C library, which makes it faster; on the other hand it is not very pythonic and takes some effort to learn. It has been ported to many platforms: Linux, Mac, Windows, and various Unixes.
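
To make the "not very pythonic" point concrete, here is a small illustrative comparison (mine, not from the referenced article) of the same fetch in urllib2 and in pycurl; urllib2 hides the buffering that pycurl makes you wire up yourself:

   import urllib2

   # urllib2: one call, the library manages the connection and buffering
   html = urllib2.urlopen('http://www.baidu.com').read()

   import StringIO
   import pycurl

   # pycurl: you supply the buffer and the write callback yourself
   buf = StringIO.StringIO()
   c = pycurl.Curl()
   c.setopt(pycurl.URL, 'http://www.baidu.com')
   c.setopt(pycurl.WRITEFUNCTION, buf.write)
   c.perform()
   c.close()
   html2 = buf.getvalue()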

I installed it and tested a small piece of code. It is indeed somewhat involved; the code is as follows:

   import StringIO
   import pycurl

   c = pycurl.Curl()
   c.setopt(pycurl.URL, 'http://feeds.feedburner.com/solidot')
   b = StringIO.StringIO()
   c.setopt(pycurl.WRITEFUNCTION, b.write)
   c.setopt(pycurl.FOLLOWLOCATION, 1)
   c.setopt(pycurl.MAXREDIRS, 5)
   #c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
   #c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')
   c.perform()
   print b.getvalue()

The code above fetches the Solidot RSS feed. If you need a proxy server, just uncomment and adjust the two commented lines. The PycURL homepage also has an example of fetching multiple pages concurrently for those who are interested; a sketch of that idea follows.
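
As an illustrative sketch of that idea (the URLs are placeholders and this code is not from the referenced article), pycurl's CurlMulti interface can drive several transfers concurrently in a single thread:

   import StringIO
   import pycurl

   urls = ['http://www.baidu.com', 'http://www.sohu.com']
   m = pycurl.CurlMulti()
   handles = []
   for url in urls:
       c = pycurl.Curl()
       c.buf = StringIO.StringIO()  # keep each handle's buffer on the handle itself
       c.setopt(pycurl.URL, url)
       c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
       c.setopt(pycurl.FOLLOWLOCATION, 1)
       c.setopt(pycurl.MAXREDIRS, 5)
       m.add_handle(c)
       handles.append(c)

   num_active = len(handles)
   while num_active:
       # drive all transfers until libcurl stops asking to be called again
       while True:
           ret, num_active = m.perform()
           if ret != pycurl.E_CALL_MULTI_PERFORM:
               break
       m.select(1.0)  # wait for any socket to become ready

   for c in handles:
       print c.getinfo(pycurl.EFFECTIVE_URL), len(c.buf.getvalue())
       m.remove_handle(c)
       c.close()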

1.1.1.2. Studying the pycurl module in Python

Author: [email protected]. Source: Eviloctal Security Team (www.eviloctal.com)

1. Use getinfo to obtain more information:

   #! /usr/bin/env python
   # vi:ts=4:et
   # $Id: test_getinfo.py,v 1.18 2003/05/01 19:35:01 mfx Exp $
   # Author BY MSN:[email protected]
   import time
   import pycurl


   ## Callback function invoked when progress information is updated
   # reports how much of the download has completed so far
   def progress(download_t, download_d, upload_t, upload_d):
       print "Total to download %d bytes, have %d bytes so far" % \
           (download_t, download_d)

   url = "http://www.sohu.com/index.html"

   print "Starting downloading", url
   print
   f = open("body.html", "wb")   # f stores the returned page body
   h = open("header.txt", "wb")  # h stores the returned response headers
   i = open("info.txt", "wb")    # i stores the values retrieved with getinfo()
   c = pycurl.Curl()
   c.setopt(c.URL, url)          # the URL to fetch
   c.setopt(c.WRITEDATA, f)      # write the returned page body to f
   c.setopt(c.NOPROGRESS, 0)
   c.setopt(c.PROGRESSFUNCTION, progress)  # install the progress callback
   c.setopt(c.FOLLOWLOCATION, 1)
   c.setopt(c.MAXREDIRS, 5)
   c.setopt(c.WRITEHEADER, h)    # write the returned headers to h
   c.setopt(c.OPT_FILETIME, 1)
   c.perform()                   # perform the transfer configured above

   print
   print "HTTP-code:", c.getinfo(c.HTTP_CODE)  # e.g. 200
   buf = c.getinfo(c.HTTP_CODE)
   i.write("HTTP-code:" + str(buf))            # record the value in info.txt
   print "Total-time:", c.getinfo(c.TOTAL_TIME)  # total download time, e.g. 0.795
   buf = c.getinfo(c.TOTAL_TIME)
   i.write('\r\n')
   i.write("Total-time:" + str(buf))
   print "Download speed: %.2f bytes/second" % c.getinfo(c.SPEED_DOWNLOAD)  # e.g. 261032.00 bytes/second
   print "Document size: %d bytes" % c.getinfo(c.SIZE_DOWNLOAD)  # e.g. 207521 bytes
   print "Effective URL:", c.getinfo(c.EFFECTIVE_URL)  # e.g. http://www.sohu.com/index.html
   print "Content-type:", c.getinfo(c.CONTENT_TYPE)    # e.g. text/html
   print "Namelookup-time:", c.getinfo(c.NAMELOOKUP_TIME)  # DNS lookup time, e.g. 0.065
   print "Redirect-time:", c.getinfo(c.REDIRECT_TIME)      # e.g. 0.0
   print "Redirect-count:", c.getinfo(c.REDIRECT_COUNT)    # e.g. 0
   epoch = c.getinfo(c.INFO_FILETIME)
   print "Filetime: %d (%s)" % (epoch, time.ctime(epoch))  # remote file time, e.g. 1172361818 (Sun Feb 25 08:03:38 2007)
   print
   print "Header is in file 'header.txt', body is in file 'body.html'"

   c.close()
   f.close()
   h.close()
   i.close()

2. Simple usage:

   #!c:\python25\python
   # vi:ts=4:et
   # $Id: test_cb.py,v 1.14 2003/04/21 18:46:10 mfx Exp $
   # Author BY MSN:[email protected]

   import sys
   import pycurl

   ## Callback function invoked when body data is ready
   def body(buf):
       # print body data to stdout
       sys.stdout.write(buf)


   ## Callback function invoked when header data is ready
   def header(buf):
       # print header data to stdout as well
       sys.stdout.write(buf)

   c = pycurl.Curl()
   c.setopt(pycurl.URL, 'http://www.sohu.com/')  # the URL to fetch
   c.setopt(pycurl.WRITEFUNCTION, body)     # body() prints the response body
   c.setopt(pycurl.HEADERFUNCTION, header)  # header() prints the response headers
   c.setopt(pycurl.FOLLOWLOCATION, 1)
   c.setopt(pycurl.MAXREDIRS, 5)
   c.perform()  # perform the transfer
   c.close()
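
The same option style also covers sending data, not just fetching it. Below is a minimal sketch of an HTTP POST using pycurl's POSTFIELDS option; the URL and form fields are made up for illustration only:

   import sys
   import urllib
   import pycurl

   c = pycurl.Curl()
   c.setopt(pycurl.URL, 'http://www.example.com/login')  # hypothetical form URL
   # urlencode the form data; setting POSTFIELDS switches the request to POST
   c.setopt(pycurl.POSTFIELDS, urllib.urlencode({'user': 'guest', 'pwd': 'guest'}))
   c.setopt(pycurl.WRITEFUNCTION, sys.stdout.write)  # print the server's reply
   c.perform()
   print
   print c.getinfo(pycurl.HTTP_CODE)
   c.close()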
