1. Building a Large Dictionary

1.1. Problem

{{{wanzathe <[email protected]> reply-to [email protected], to "python-cn:CPyUG" <[email protected]>, date Jan 2, 2008 1:20 PM subject [CPyUG:37589]}}}[http://groups.google.com/group/python-cn/t/cb06efbb633fa35d Question about building a dictionary from a large amount of data]

I have a data file, test.dat, stored in binary format, with each record laid out as intA+intB+intC+intD. I want to build a dictionary {(intA,intB):(intC,intD)} from this file:

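import struct

myDict = {}
input_file = open('./test.dat','rb')
content = input_file.read()  # reads the whole file into memory
input_file.close()
record_number = len(content) // 16  # each record is 4 unsigned ints = 16 bytes
for i in range(0,record_number):
   a = struct.unpack("IIII",content[i*16:i*16+16])
   myDict[(a[0],a[1])] = (a[2],a[3])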

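Problem: test.dat is huge (150 MB, roughly 10 million records), so this approach is basically unworkable: it is slow and uses far too much memory :( I am completely lost at the moment and would be very grateful for any pointers!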
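To give a feel for why memory use grows so much, here is a rough sketch that compares the 16 bytes a record takes on disk with the size of the Python objects behind one dictionary entry. It assumes CPython, where `sys.getsizeof` reports per-object sizes. The helper name `estimate_entry_overhead` and the sample values are made up for illustration, and the figure ignores the dict's own hash table, so treat it as a rough lower bound only.

```python
import struct
import sys

RECORD_FORMAT = "IIII"                        # four unsigned ints per record
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # bytes per record on disk

def estimate_entry_overhead():
    """Rough bytes held by one {(a, b): (c, d)} entry, excluding the dict's table."""
    key = (70000, 70001)        # arbitrary sample values (hypothetical)
    value = (70002, 70003)
    tuples = sys.getsizeof(key) + sys.getsizeof(value)
    ints = sum(sys.getsizeof(n) for n in key + value)
    return tuples + ints

per_entry = estimate_entry_overhead()
print("bytes on disk per record:", RECORD_SIZE)
print("approx. bytes in memory per entry:", per_entry)
```

Multiplying the per-entry figure by about 10 million records gives an idea of the total, before counting the dictionary's internal table.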

1.2. Solution 1

Qiangning Hong <[email protected]>
reply-to        [email protected],
to      [email protected],
date    Jan 2, 2008 1:30 PM
subject [CPyUG:37592] Re: Question about building a dictionary from a large amount of data

If you really do need a dict object, the following code should use somewhat less memory:

import struct

f = open('./test.dat','rb')
def g():
    # Read and yield one 16-byte record at a time instead of loading the whole file.
    while True:
        x = f.read(16)
        if not x: break
        a = struct.unpack('IIII', x)
        yield (a[0], a[1]), (a[2], a[3])
mydict = dict(g())
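The generator approach can be tried on a small sample file before running it against the real data. The sketch below is one possible variant, not the original poster's code: `iter_records`, the `sample.dat` file name, and the three sample records are hypothetical, and it precompiles the format with `struct.Struct` as a convenience.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("IIII")  # precompiled format: intA, intB, intC, intD

def iter_records(path):
    """Yield ((intA, intB), (intC, intD)) pairs, reading one record at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (a trailing partial record is ignored)
            a, b, c, d = RECORD.unpack(chunk)
            yield (a, b), (c, d)

# Write three sample records to a temporary file, then load them back.
sample = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.dat")
with open(sample_path, "wb") as out:
    for rec in sample:
        out.write(RECORD.pack(*rec))

mydict = dict(iter_records(sample_path))
print(mydict[(5, 6)])
```

Only one 16-byte chunk is held at a time while reading, although the finished dict still holds every record in memory.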

If you only want dict-like access to the data, I suggest taking a look at the shelve module.

1.3. shelve

shelve is basically workable. The adapted code is as follows:

import shelve
import struct

my_file = open('test.dat','rb')
content = my_file.read()
my_file.close()
record_number = len(content) // 16

db  = shelve.open('test.dat.db')
for i in range(0,record_number):
   a = struct.unpack("IIII",content[i*16:i*16+16])
   # shelve keys must be strings, so (intA, intB) is encoded as "intA_intB"
   db[str(a[0])+'_'+str(a[1])] = (a[2],a[3])
db.sync()
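Looking a record up later has to use the same string key scheme. A minimal sketch of the round trip, assuming the "intA_intB" key format from the code above: the helper `make_key`, the `demo.db` file name, and the sample values are hypothetical, and a temporary directory is used so nothing is written next to the real data.

```python
import os
import shelve
import tempfile

def make_key(int_a, int_b):
    """Build the string key used when the shelf was written: 'intA_intB'."""
    return str(int_a) + '_' + str(int_b)

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Write a couple of sample entries.
db = shelve.open(db_path)
db[make_key(1, 2)] = (3, 4)
db[make_key(5, 6)] = (7, 8)
db.close()

# Reopen read-only and look entries up by (intA, intB).
db = shelve.open(db_path, flag="r")
found = db[make_key(5, 6)]
missing = db.get(make_key(9, 9))  # None when the key is absent
db.close()
print(found, missing)
```

Wrapping the key construction in one function keeps the writer and the reader from drifting apart on the key format.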