1. ZSPY
Scraping the web with Python.
The code lives at http://zspy.googlecode.com
Work in progress....
张沈鹏 (Zhang Shenpeng) [email protected] http://zsp.javaeye.com/
2008-1-23 16:42
1.1. Day One: PycURL
PycURL http://pycurl.sourceforge.net/
PycURL is a Python interface to libcurl. Since it is written in C, it is faster and more capable than urllib, and it can cap the redirect depth to protect against redirect (rewrite) loop traps. That makes it well suited to web crawlers and page fetching.
Download pycurl-ssl-7.16.4.win32-py2.5.exe from http://pycurl.sourceforge.net/download/ and install it.
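To confirm the install worked, pycurl reports its own and libcurl's versions. A minimal check, assuming a stock pycurl build:

import pycurl
# prints something like "PycURL/7.16.4 libcurl/7.16.4 OpenSSL/0.9.8 zlib/1.2.3"
print pycurl.version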
Test code, adapted from reference 1 below:
import StringIO

# a StringIO behaves like a file, but writes into a string
html = StringIO.StringIO()

import pycurl
c = pycurl.Curl()

c.setopt(pycurl.URL, 'http://www.baidu.com')

# write callback: every chunk of body data is handed to html.write
c.setopt(pycurl.WRITEFUNCTION, html.write)

c.setopt(pycurl.FOLLOWLOCATION, 1)

# maximum number of redirects to follow; guards against redirect traps
c.setopt(pycurl.MAXREDIRS, 5)

# perform the request; blocks until the transfer finishes
c.perform()

# prints 200 (the HTTP status code) and http://www.baidu.com (the effective URL)
print c.getinfo(pycurl.HTTP_CODE), c.getinfo(pycurl.EFFECTIVE_URL)

# print the HTML of the Baidu front page
#print html.getvalue()
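A crawler should not hang on a dead server, and perform() raises pycurl.error on failure. A minimal sketch of timeouts and error handling; the option names and the exception are standard pycurl, but the timeout values are arbitrary choices:

import StringIO
import pycurl

html = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.baidu.com')
c.setopt(pycurl.WRITEFUNCTION, html.write)
c.setopt(pycurl.CONNECTTIMEOUT, 10)  # give up connecting after 10 seconds
c.setopt(pycurl.TIMEOUT, 30)         # abort the whole transfer after 30 seconds
try:
    c.perform()
except pycurl.error, e:
    # e.args holds (curl error code, error message)
    print "fetch failed:", e
c.close()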
1.1.1. References
1.1.1.1. A quick look at PycURL
http://blog.donews.com/limodou/archive/2005/11/28/641257.aspx PycURL is a Python binding for libcurl, written in C. libcurl is a free and easy-to-use client-side URL transfer library. It is very capable; the PycURL homepage lists the following supported features:
- supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. libcurl supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunneling and more!
That long list of protocols is impressive by itself, and the proxy and user-authentication support are especially welcome. Unlike urllib2, this library is not pure Python: it wraps a C library, which makes it faster, but also less pythonic and somewhat harder to learn. It has been ported to many platforms, including Linux, Mac, Windows, and several flavors of Unix.
I installed it and tried a small piece of code. It is indeed a bit involved; the code follows:
import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://feeds.feedburner.com/solidot')
import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
# c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
# c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')
c.perform()
print b.getvalue()
The code above fetches the RSS feed of Solidot. If you sit behind a proxy server, just uncomment and adjust the two commented lines. The PycURL homepage also carries a multi-threaded fetching example; a single-threaded equivalent using CurlMulti is sketched below.
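Besides threads, pycurl can drive several transfers from one thread through its CurlMulti interface. A sketch assuming the standard CurlMulti API (add_handle/select/perform); the URL list is just an illustration:

import StringIO
import pycurl

urls = ['http://feeds.feedburner.com/solidot', 'http://www.baidu.com']
m = pycurl.CurlMulti()
handles = []
for url in urls:
    c = pycurl.Curl()
    buf = StringIO.StringIO()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    m.add_handle(c)
    handles.append((c, buf))

# pump the multi handle until every transfer finishes
num_handles = len(handles)
while num_handles:
    m.select(1.0)  # wait until any of the sockets is ready
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

for c, buf in handles:
    print c.getinfo(pycurl.EFFECTIVE_URL), len(buf.getvalue()), "bytes"
    m.remove_handle(c)
    c.close()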
1.1.1.2. Studying the pycurl module in Python
Author: [email protected]. Source: Evil Octal Security Team (www.eviloctal.com)
1. Use getinfo to obtain more information:
#! /usr/bin/env python
# vi:ts=4:et
# $Id: test_getinfo.py,v 1.18 2003/05/01 19:35:01 mfx Exp $
# Author BY MSN:[email protected]
import time
import pycurl


## Callback function invoked when progress information is updated
# reports how much of the download has completed so far
def progress(download_t, download_d, upload_t, upload_d):
    print "Total to download %d bytes, have %d bytes so far" % \
          (download_t, download_d)

url = "http://www.sohu.com/index.html"

print "Starting downloading", url
print
f = open("body.html", "wb")   # receives the returned page body
h = open("header.txt", "wb")  # receives the response headers
i = open("info.txt", "wb")    # receives the getinfo() results
c = pycurl.Curl()
c.setopt(c.URL, url)                    # the URL to fetch
c.setopt(c.WRITEDATA, f)                # write the page body into f
c.setopt(c.NOPROGRESS, 0)               # enable progress reporting
c.setopt(c.PROGRESSFUNCTION, progress)  # install the progress callback
c.setopt(c.FOLLOWLOCATION, 1)
c.setopt(c.MAXREDIRS, 5)
c.setopt(c.WRITEHEADER, h)              # write the response headers into h
c.setopt(c.OPT_FILETIME, 1)             # ask for the remote file time
c.perform()                             # perform the transfer

print
print "HTTP-code:", c.getinfo(c.HTTP_CODE)  # e.g. 200
buf = c.getinfo(c.HTTP_CODE)
i.write("HTTP-code:" + str(buf))            # record it in info.txt
print "Total-time:", c.getinfo(c.TOTAL_TIME)  # total transfer time, e.g. 0.795
buf = c.getinfo(c.TOTAL_TIME)
i.write('\r\n')
i.write("Total-time:" + str(buf))
print "Download speed: %.2f bytes/second" % c.getinfo(c.SPEED_DOWNLOAD)  # e.g. 261032.00 bytes/second
print "Document size: %d bytes" % c.getinfo(c.SIZE_DOWNLOAD)             # e.g. 207521 bytes
print "Effective URL:", c.getinfo(c.EFFECTIVE_URL)      # e.g. http://www.sohu.com/index.html
print "Content-type:", c.getinfo(c.CONTENT_TYPE)        # e.g. text/html
print "Namelookup-time:", c.getinfo(c.NAMELOOKUP_TIME)  # DNS lookup time, e.g. 0.065
print "Redirect-time:", c.getinfo(c.REDIRECT_TIME)      # e.g. 0.0
print "Redirect-count:", c.getinfo(c.REDIRECT_COUNT)    # e.g. 0
epoch = c.getinfo(c.INFO_FILETIME)
print "Filetime: %d (%s)" % (epoch, time.ctime(epoch))  # e.g. 1172361818 (Sun Feb 25 08:03:38 2007)
print
print "Header is in file 'header.txt', body is in file 'body.html'"

c.close()
f.close()
h.close()
i.close()
2. Basic usage:
#!c:\python25\python
# vi:ts=4:et
# $Id: test_cb.py,v 1.14 2003/04/21 18:46:10 mfx Exp $
# Author BY MSN:[email protected]

import sys
import pycurl

## Callback function invoked when body data is ready
def body(buf):
    # write body data to stdout
    sys.stdout.write(buf)


## Callback function invoked when header data is ready
def header(buf):
    # write header data to stdout as well
    sys.stdout.write(buf)

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.sohu.com/')  # the URL to fetch
c.setopt(pycurl.WRITEFUNCTION, body)          # body() receives the page body
c.setopt(pycurl.HEADERFUNCTION, header)       # header() receives the headers
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()                                   # perform the transfer
c.close()
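The feature list quoted earlier also mentions HTTP POST. A sketch of submitting form data; POSTFIELDS is a standard libcurl/pycurl option, but the URL and the form fields here are made up for illustration:

import StringIO
import urllib
import pycurl

b = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://www.example.com/login')  # hypothetical endpoint
# setting POSTFIELDS turns the request into an HTTP POST
c.setopt(pycurl.POSTFIELDS, urllib.urlencode({'user': 'aaa', 'pwd': 'aaa'}))
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
print c.getinfo(pycurl.HTTP_CODE)
c.close()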