Revision 5 as of 2009-12-25 07:09:16
Size: 2000
Comment: converted to 1.6 markup
Extracting Images and Their Signature Information from Documents

Shuguang Yang <[email protected]>
reply-to: [email protected]
to: python-cn`CPyUG`华蟒用户组 <[email protected]>
date: Wed, Aug 20, 2008 at 15:50
subject: [CPyUG:62863] Extracting images from a document with Python; image signature information

This post revisits the "extracting images embedded in a document" section of "Using Python to Solve Some Practical Problems" (http://www.ibm.com/developerworks/cn/linux/tips/l-python/). The program in the original article takes the right approach, but its code is neither concise nor Pythonic. The improved code follows:
import sys

# Signature information for each image format:
# (marker string, offset of the marker from the first byte of the image, file extension)
headers = [(b'JFIF', 6, 'jpg'), (b'GIF', 0, 'gif'), (b'PNG', 1, 'png')]
marker = []
filename = '/path/to/your/file'
try:
    fid = open(filename, 'rb')
except IOError:
    sys.exit(1)
s = 0  # number of bytes read so far
for line in fid:  # iterate line by line
    for flag, offset, ext in headers:
        index = line.find(flag)
        while index >= 0:  # find() returns -1 when absent; 0 is a valid match
            pos = s + index - offset
            marker.append((pos, ext))
            index = line.find(flag, index + 1)
    s += len(line)
fid.seek(0)
imgnum = 0
if not marker:
    print('No images included in this document')
    sys.exit(1)
marker.sort()  # matches inside one line are collected per format, so restore file order
for index, (thispos, thisext) in enumerate(marker):
    try:
        nextpos = marker[index + 1][0]
    except IndexError:
        nextpos = s  # the last image runs to the end of the file
    gap = nextpos - thispos
    fid.seek(thispos)
    data = fid.read(gap)
    imgname = 'imgname%02d.%s' % (index, thisext)
    fid1 = open(imgname, 'wb')
    fid1.write(data)
    fid1.close()
    imgnum += 1
fid.close()
print('%02d images have been extracted' % imgnum)
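The offsets in `headers` reflect where each marker string sits relative to the first byte of the image: a JPEG/JFIF file starts with the bytes FF D8 FF E0 and a two-byte length before `JFIF`, a GIF file starts directly with `GIF`, and a PNG file starts with the byte 0x89 before `PNG`. The following is a minimal sketch of the same scan on an in-memory bytes object rather than line by line. It is not part of the original post; it assumes Python 3 and a file small enough to load whole, and the names `SIGNATURES`, `find_image_starts`, and `split_images` are hypothetical.

```python
# Hypothetical helper names; a sketch assuming Python 3 and data held in memory.
SIGNATURES = [(b'JFIF', 6, 'jpg'), (b'GIF', 0, 'gif'), (b'PNG', 1, 'png')]


def find_image_starts(data):
    """Return a sorted list of (start, ext) for every signature found in data."""
    starts = []
    for flag, offset, ext in SIGNATURES:
        index = data.find(flag)
        while index != -1:
            start = index - offset
            if start >= 0:  # skip a match too close to the beginning to be an image
                starts.append((start, ext))
            index = data.find(flag, index + 1)
    return sorted(starts)


def split_images(data):
    """Cut data into (ext, chunk) pieces, each running up to the next start."""
    starts = find_image_starts(data)
    ends = [pos for pos, _ in starts[1:]] + [len(data)]
    return [(ext, data[pos:end]) for (pos, ext), end in zip(starts, ends)]


# Synthetic input: two padding bytes, a fake PNG header, then a fake GIF header.
sample = b'xx' + b'\x89PNG\r\n\x1a\n' + b'pngbody' + b'GIF89a' + b'gifbody'
pieces = split_images(sample)
print([(ext, len(chunk)) for ext, chunk in pieces])  # [('png', 15), ('gif', 13)]
```

As in the script above, each chunk simply runs to the next signature (or to the end of the data), so it may carry trailing bytes that are not part of the image, and a marker string that happens to occur in ordinary text would be treated as an image start.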
Feedback
Created by -- ZoomQuiet [2008-08-20 07:55:11]