-- zuroc [2008-01-23T08:30:45Z]
1. ZSPY
Crawling the web with Python.
Code: http://zspy.googlecode.com
Work in progress.
张沈鹏 [email protected] http://zsp.javaeye.com/
2008-1-23 16:42
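1.1. Day 1: PycURL
PycURL: http://pycurl.sourceforge.net/
PycURL is a Python interface to the external libcurl library. It is written in C, is faster than urllib, and offers more features. It can cap the number of redirects it follows, which guards against redirect loops. It is a good fit for writing web crawlers and fetching pages.
Download pycurl-ssl-7.16.4.win32-py2.5.exe from http://pycurl.sourceforge.net/download/ and run it to install.
Test code, adapted from the first reference in section 1.1.1: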
# StringIO lets you treat a string like a file. "from cStringIO import StringIO"
# also works and should perform better.
import StringIO

import pycurl

html = StringIO.StringIO()

c = pycurl.Curl()

c.setopt(pycurl.URL, 'http://www.baidu.com')

# Write callback: libcurl passes each chunk of the response body to it
c.setopt(pycurl.WRITEFUNCTION, html.write)

# Follow HTTP redirects
c.setopt(pycurl.FOLLOWLOCATION, 1)

# Maximum number of redirects, which guards against redirect loops
c.setopt(pycurl.MAXREDIRS, 5)

# Perform the request; blocks until the transfer has finished
c.perform()

# Prints the HTTP status code and the effective URL, e.g. 200 http://www.baidu.com
print c.getinfo(pycurl.HTTP_CODE), c.getinfo(pycurl.EFFECTIVE_URL)

# Uncomment to print the HTML of the Baidu home page
#print html.getvalue()
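The write callback in the test code is simply html.write, but any callable that accepts a chunk of the body should work. The following is a minimal sketch of a callback that stops the download after a size limit, which can be handy in a crawler. The names BodyCollector, fetch and max_bytes are made up for this illustration and are not part of pycurl. It relies on the documented pycurl behaviour that a write callback returning a number different from the chunk length makes libcurl abort the transfer. It is written to run on both Python 2.6+ and Python 3, and has not been tested against the 7.16.4 build mentioned above.

```python
import io


class BodyCollector(object):
    """Hypothetical helper: collects response chunks up to max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.buffer = io.BytesIO()
        self.truncated = False

    def write(self, chunk):
        # pycurl calls this once for every chunk of the body it receives.
        remaining = self.max_bytes - self.buffer.tell()
        if len(chunk) > remaining:
            self.buffer.write(chunk[:remaining])
            self.truncated = True
            # A return value other than len(chunk) tells libcurl to abort.
            return remaining
        self.buffer.write(chunk)
        return None  # None means the whole chunk was consumed

    def getvalue(self):
        return self.buffer.getvalue()


def fetch(url, max_bytes=65536):
    """Fetch url with pycurl, keeping at most max_bytes of the body."""
    import pycurl  # imported here so BodyCollector is usable without pycurl

    collector = BodyCollector(max_bytes)
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, collector.write)
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    try:
        c.perform()
    except pycurl.error:
        # An aborted transfer raises pycurl.error; ignore it only if
        # the abort was caused by our own size limit.
        if not collector.truncated:
            raise
    finally:
        code = c.getinfo(pycurl.HTTP_CODE)
        c.close()
    return code, collector.getvalue()
```

Calling fetch('http://www.baidu.com', max_bytes=1024) would then return the status code and at most the first 1024 bytes of the page.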
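Next, an example that fetches several URLs concurrently. It uses libcurl's multi interface (pycurl.CurlMulti) in a single thread, not separate threads: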
import sys
from cStringIO import StringIO
import pycurl

urls = (
    "http://curl.haxx.se",
    "http://www.python.org",
    "http://pycurl.sourceforge.net",
    "http://pycurl.sourceforge.net/tests/403_FORBIDDEN",  # that actually exists ;-)
    "http://pycurl.sourceforge.net/tests/404_NOT_FOUND",
)

# Read list of URIs from file specified on commandline
try:
    urls = open(sys.argv[1], "rb").readlines()
except IndexError:
    # No file was specified
    pass

# init
m = pycurl.CurlMulti()
m.handles = []
for url in urls:
    c = pycurl.Curl()
    # save info in standard Python attributes
    c.url = url.rstrip()
    c.body = StringIO()
    c.http_code = -1
    m.handles.append(c)
    # pycurl API calls
    c.setopt(c.URL, c.url)
    c.setopt(c.WRITEFUNCTION, c.body.write)
    m.add_handle(c)

# get data
num_handles = len(m.handles)
while num_handles:
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.)
    m.select(1.0)

# close handles
for c in m.handles:
    # save info in standard Python attributes
    c.http_code = c.getinfo(c.HTTP_CODE)
    # pycurl API calls
    m.remove_handle(c)
    c.close()
m.close()

# print result
for c in m.handles:
    data = c.body.getvalue()
    if 0:
        print "**********", c.url, "**********"
        print data
    else:
        print "%-53s http_code %3d, %6d bytes" % (c.url, c.http_code, len(data))
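The example above reports only the HTTP status code, so a transfer that fails before any response arrives (for instance a DNS error) is hard to tell apart from an empty page. Below is a rough sketch of one way to record such failures with CurlMulti.info_read(), which returns the number of queued messages, a list of handles that finished successfully, and a list of (handle, errno, errmsg) tuples for those that failed. The function names summarize and fetch_all, the timeout parameter and the error attribute are invented for this illustration. The code is meant to run on Python 2.6+ and Python 3 and should be read as a starting point rather than tested crawler code.

```python
import io


def summarize(url, http_code, size, error=None):
    """Format one result line; error is a libcurl message or None."""
    if error is not None:
        return "%-53s FAILED: %s" % (url, error)
    return "%-53s http_code %3d, %6d bytes" % (url, http_code, size)


def fetch_all(urls, timeout=1.0):
    """Fetch urls concurrently and return one result line per URL."""
    import pycurl  # imported here so summarize() is usable without pycurl

    m = pycurl.CurlMulti()
    handles = []
    for url in urls:
        c = pycurl.Curl()
        # Plain Python attributes, as in the example above
        c.url = url.strip()
        c.body = io.BytesIO()
        c.error = None  # hypothetical attribute: libcurl error message
        c.setopt(pycurl.URL, c.url)
        c.setopt(pycurl.WRITEFUNCTION, c.body.write)
        m.add_handle(c)
        handles.append(c)

    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        # Drain the queue of finished transfers and note the failures.
        while True:
            num_queued, ok_list, err_list = m.info_read()
            for c, errno, errmsg in err_list:
                c.error = errmsg
            if num_queued == 0:
                break
        if num_active:
            # Wait for socket activity before calling perform() again.
            m.select(timeout)

    lines = []
    for c in handles:
        size = len(c.body.getvalue())
        lines.append(summarize(c.url, c.getinfo(pycurl.HTTP_CODE), size, c.error))
        m.remove_handle(c)
        c.close()
    m.close()
    return lines
```

Printing each line returned by fetch_all(urls) gives the same report format as the example above, with an extra FAILED line for transfers that libcurl could not complete.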
1.1.1. References
A Simple Introduction to PycURL (PycURL简单学习) http://blog.donews.com/limodou/archive/2005/11/28/641257.aspx
Learning the pycurl Module in Python (python中的pycurl模块学习) https://forum.eviloctal.com/read.php?tid=27337
