Differences between revisions 1 and 5 (spanning 4 versions)
Revision 1 as of 2008-08-20 07:55:11
Size: 2007
Editor: ZoomQuiet
Comment:
Revision 5 as of 2009-12-25 07:09:16
Size: 2000
Editor: localhost
Comment: converted to 1.6 markup
Line 4: Line 4:
[[TableOfContents]] <<TableOfContents>>
Line 6: Line 7:
[[Include(ZPyUGnav)]]
<<Include(ZPyUGnav)>>
Line 12: Line 12:
reply-to [email protected]
to python-cn`CPyUG` China Python User Group <[email protected]>
date Wed, Aug 20, 2008 at 15:50
subject [CPyUG:62863] Extracting images from a document with Python: image signature information
reply-to        [email protected]
to      python-cn`CPyUG` China Python User Group <[email protected]>
date    Wed, Aug 20, 2008 at 15:50
subject [CPyUG:62863] Extracting images from a document with Python: image signature information
Line 17: Line 17:
##startInc
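Regarding the article "Applying Python to Solve Some Practical Problems" (
Line 18: Line 20:
##startInc  * '''http://www.ibm.com/developerworks/cn/linux/tips/l-python/'''
and its section on extracting images embedded in documents: the original program's approach is sound, but the code is neither concise nor Pythonic.
The improved code follows:
Line 20: Line 24:
Regarding the article "Applying Python to Solve Some Practical Problems" (
 * '''http://www.ibm.com/developerworks/cn/linux/tips/l-python/'''
and its section on extracting images embedded in documents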
{{{#!python
{{{
#!python
Line 27: Line 29:

headers = [('JFIF', 6, 'jpg'), ('GIF', 0, 'gif'), ('PNG', 1, 'png')]
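# Why are the offsets in headers 6, 0 and 1? Are they the length of the data that precedes the signature? What is that data?
headers = [('JFIF', 6, 'jpg'), ('GIF', 0, 'gif'), ('PNG', 1, 'png')] # signature information for the different image formats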
Line 31: Line 31:
filename = '/path/to/a/file'
filename = '/path/to/your/file'
Line 37: Line 36:
Line 39: Line 37:

for line in fid:
for line in fid: # iterate line by line
Line 47: Line 44:
Line 49: Line 45:
j = len(marker)
Line 51: Line 46:
if j == 0: if len(marker) == 0:
Line 54: Line 49:

for i in range(j):
  
info = marker[i]
for info in marker:
Line 59: Line 52:
   if i == j-1:    index = marker.index(info)
   try:
       nextinfo = marker[index + 1]
       nextpos = nextinfo[0]
       gap = nextpos - thispos
   except IndexError:
Line 61: Line 59:
       gap = nextpos - thispos
   else:
       nextinfo = marker[i+1]
       nextpos = nextinfo[0]
Line 68: Line 62:
   imgname = 'imgname%02d.%s' % (i, thisext)    imgname = 'imgname%02d.%s' % (index, thisext)
Line 73: Line 67:
fid.close()
print '%02d images have been extracted' % imgnum
}}}
##endInc
----
 '''Feedback'''
Line 74: Line 74:
fid.close()
print '%02d images have been extracted' % imgnum
}}}

##endInc

----
'''Feedback'''

Created by -- ZoomQuiet [[[DateTime(2008-08-20T07:55:11Z)]]]
||<^>[[PageComment2]]||<^>[:/PageCommentData:PageCommentData]''||
Created by -- ZoomQuiet [<<DateTime(2008-08-20T07:55:11Z)>>]

Extracting Image Signature Information from Documents

Shuguang Yang <[email protected]>
reply-to        [email protected]
to      python-cn`CPyUG` China Python User Group <[email protected]>
date    Wed, Aug 20, 2008 at 15:50
subject [CPyUG:62863] Extracting images from a document with Python: image signature information

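Regarding the section on extracting images embedded in documents in the article "Applying Python to Solve Some Practical Problems":

  • http://www.ibm.com/developerworks/cn/linux/tips/l-python/

The original program's approach is sound, but the code is neither concise nor Pythonic. The improved code follows: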

import sys

# (flag, offset, extension): flag is a byte string found near the start of
# each image format; offset is the number of bytes that precede it there
headers = [(b'JFIF', 6, 'jpg'), (b'GIF', 0, 'gif'), (b'PNG', 1, 'png')]
marker = []
filename = '/path/to/your/file'
try:
    fid = open(filename, 'rb')
except IOError:
    sys.exit(1)
s = 0  # number of bytes read so far
for line in fid:  # iterate line by line
    for flag, offset, ext in headers:
        index = line.find(flag)
        if index >= 0:  # find() returns -1 when the flag is absent
            pos = s + index - offset
            marker.append((pos, ext))
    s += len(line)
marker.sort()  # keep the start positions in file order
fid.seek(0)
imgnum = 0
if len(marker) == 0:
    print('No images included in this document')
    sys.exit(1)
for info in marker:
    thispos = info[0]
    thisext = info[1]
    index = marker.index(info)
    try:
        nextinfo = marker[index + 1]
        nextpos = nextinfo[0]
        gap = nextpos - thispos
    except IndexError:
        # the last image runs to the end of the file
        nextpos = s
        gap = nextpos - thispos
    fid.seek(thispos)
    data = fid.read(gap)
    imgname = 'imgname%02d.%s' % (index, thisext)
    fid1 = open(imgname, 'wb')
    fid1.write(data)
    fid1.close()
    imgnum += 1
fid.close()
print('%02d images have been extracted' % imgnum)
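
The offsets 6, 0 and 1 in `headers` are the number of bytes that come before each flag at the start of an image, so subtracting the offset from the flag's position gives the first byte of the image. A JPEG/JFIF file opens with a 2-byte start-of-image marker, a 2-byte APP0 marker and a 2-byte length before the text `JFIF`; a GIF file begins directly with `GIF`; a PNG file has one byte (0x89) before `PNG`. The sketch below checks this against hand-built sample bytes. `SAMPLE_STARTS` and `flag_offsets` are illustrative names that do not appear in the original post, and the samples are typical leading bytes rather than data read from a real file.

```python
# Typical leading bytes of each format, hand-built for illustration only.
SAMPLE_STARTS = {
    'jpg': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00',  # SOI, APP0 marker, length, "JFIF\0"
    'gif': b'GIF89a',                             # signature plus version
    'png': b'\x89PNG\r\n\x1a\n',                  # 8-byte PNG signature
}

# Same (flag, offset, extension) triples as the headers list, as bytes.
HEADERS = [(b'JFIF', 6, 'jpg'), (b'GIF', 0, 'gif'), (b'PNG', 1, 'png')]


def flag_offsets():
    """Map each extension to the position of its flag in the sample bytes."""
    return {ext: SAMPLE_STARTS[ext].find(flag) for flag, _, ext in HEADERS}


# Print the measured position next to the offset declared in HEADERS.
measured = flag_offsets()
for flag, offset, ext in HEADERS:
    print(ext, flag, measured[ext], offset)
```

One consequence of this scheme is that JPEG files which do not carry a JFIF APP0 segment (for example, files that start with an Exif segment instead) would not be detected by the `JFIF` flag.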

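Scanning line by line finds at most one occurrence of each flag per line, and "lines" in binary data are arbitrary. As an alternative, here is a minimal sketch that assumes the whole document fits in memory: it searches the complete byte string for every flag, then slices from each start position to the next. The names `find_image_starts` and `split_images` are hypothetical helpers, not part of the original post, and the code targets Python 3.

```python
import re

# Same (flag, offset, extension) triples as the headers list, as bytes.
HEADERS = [(b'JFIF', 6, 'jpg'), (b'GIF', 0, 'gif'), (b'PNG', 1, 'png')]


def find_image_starts(data):
    """Return a sorted list of (start_position, extension) pairs."""
    starts = []
    for flag, offset, ext in HEADERS:
        for match in re.finditer(re.escape(flag), data):
            pos = match.start() - offset
            if pos >= 0:  # ignore a flag too close to the start of the data
                starts.append((pos, ext))
    return sorted(starts)


def split_images(data):
    """Slice data from each image start up to the next start (or the end)."""
    starts = find_image_starts(data)
    ends = [pos for pos, _ in starts[1:]] + [len(data)]
    return [(ext, data[pos:end]) for (pos, ext), end in zip(starts, ends)]


if __name__ == '__main__':
    # Hand-built sample: junk, a PNG signature with filler, then a GIF header.
    sample = b'junk' + b'\x89PNG\r\n\x1a\n' + b'AAAA' + b'GIF89a' + b'BB'
    for number, (ext, chunk) in enumerate(split_images(sample)):
        print('imgname%02d.%s' % (number, ext), len(chunk))
```

Like the line-based script, this treats everything up to the next flag as image data, so bytes that follow an image end up in its chunk, and a flag that happens to occur inside image data would cause a false split.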

  • Feedback

Created by -- ZoomQuiet [2008-08-20 07:55:11]

MiscItems/2008-08-20 (last edited 2009-12-25 07:09:16 by localhost)