## page was renamed from MicroProj/2008-01-02
##language:zh
#pragma section-numbers on


::-- ZoomQuiet [<<DateTime(2008-01-02T05:40:31Z)>>]
<<TableOfContents>>
## Default navigation, please keep
<<Include(CPUGnav)>>


= Building a Large Dictionary =
== Problem ==
{{{wanzathe <wanzathe@gmail.com>
reply-to	python-cn@googlegroups.com,
to	"python-cn:CPyUG" <python-cn@googlegroups.com>,
date	Jan 2, 2008 1:20 PM
subject	[CPyUG:37589]}}}[http://groups.google.com/group/python-cn/t/cb06efbb633fa35d Question about building a dictionary from a large amount of data]

There is a data file, test.dat, stored in binary format (each record is intA+intB+intC+intD), and I want to build a dictionary from it of the form
`{(intA,intB):(intC,intD)}`:
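{{{#!python
import struct

myDict = {}
# Read the whole file into memory, then slice out one 16-byte record at a time.
with open('./test.dat', 'rb') as input_file:
    content = input_file.read()
record_number = len(content) // 16
for i in range(record_number):
    a = struct.unpack("IIII", content[i*16:i*16+16])
    myDict[(a[0], a[1])] = (a[2], a[3])
}}}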
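The problem:
because test.dat is huge (150 MB, roughly 10 million records), this approach is basically unworkable: it is slow and uses far too much memory :(
I am at a loss right now; any pointers would be greatly appreciated!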
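
A rough back-of-the-envelope estimate may help show why the naive approach is so memory-hungry. The sketch below is not from the original thread: it assumes 16-byte records in an explicit little-endian `<IIII` layout (the post uses native `IIII`), and the helper names `estimate_records` and `estimate_entry_overhead` are hypothetical. It uses `sys.getsizeof` to approximate the per-entry cost in CPython; the exact figures vary by Python version and platform.

{{{#!python
import struct
import sys

RECORD_FORMAT = "<IIII"  # assumption: four 4-byte unsigned ints, little-endian
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 16 bytes with standard sizes


def estimate_records(file_size_bytes):
    """Number of complete records in a file of the given size."""
    return file_size_bytes // RECORD_SIZE


def estimate_entry_overhead():
    """Rough lower bound on per-entry memory: two 2-tuples plus four ints.

    Ignores the dict's own hash table, so the real cost is higher.
    """
    key = (100000, 200000)
    value = (300000, 400000)
    tuples = sys.getsizeof(key) + sys.getsizeof(value)
    ints = sum(sys.getsizeof(n) for n in key + value)
    return tuples + ints


records = estimate_records(150 * 1024 * 1024)
per_entry = estimate_entry_overhead()
print("records:", records)
print("approx. bytes per entry (excluding dict table):", per_entry)
print("approx. total MB:", records * per_entry // (1024 * 1024))
}}}

Each 16-byte record on disk turns into several Python objects, so the in-memory dict is many times larger than the file itself.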

== Solution 1 ==
{{{
Qiangning Hong <hongqn@gmail.com>
reply-to	python-cn@googlegroups.com,
to	python-cn@googlegroups.com,
date	Jan 2, 2008 1:30 PM
subject	[CPyUG:37592] Re: Question about building a dictionary from a large amount of data
}}}
If you really do need a dict object, the following code should use less memory:
{{{#!python
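import struct

f = open('./test.dat', 'rb')

def g():
    # Yield one (key, value) pair per 16-byte record instead of loading the whole file.
    while True:
        x = f.read(16)
        if not x:
            break
        a = struct.unpack('IIII', x)
        yield (a[0], a[1]), (a[2], a[3])

mydict = dict(g())
f.close()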
}}}
If you just want to access the data in a dict-like way, take a look at the `shelve` module.
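
The generator approach from the reply above can be tried on a small synthetic file before running it on the real data. The following is only a sketch: the sample records, the temporary file `sample.dat`, and the name `iter_records` are made up for illustration, and the record layout is assumed to be the same native `IIII` format used in the post.

{{{#!python
import os
import struct
import tempfile

RECORD = struct.Struct("IIII")  # same native format as in the post


def iter_records(path):
    """Yield ((a, b), (c, d)) pairs, reading one record at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (or a truncated trailing record)
            a, b, c, d = RECORD.unpack(chunk)
            yield (a, b), (c, d)


# Hypothetical sample data, written to a temporary file for illustration.
sample = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.dat")
with open(sample_path, "wb") as out:
    for rec in sample:
        out.write(RECORD.pack(*rec))

small_dict = dict(iter_records(sample_path))
print(small_dict[(5, 6)])  # (7, 8)
}}}

Because `dict()` consumes the generator one pair at a time, only one raw record is held in memory alongside the growing dictionary, rather than the whole file contents.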


== shelve ==
shelve basically works; the adapted code is as follows:

{{{#!python
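import shelve
import struct

# Read the whole file, then store each record in a disk-backed shelf.
with open('test.dat', 'rb') as my_file:
    content = my_file.read()
record_number = len(content) // 16

db = shelve.open('test.dat.db')
for i in range(record_number):
    a = struct.unpack("IIII", content[i*16:i*16+16])
    # shelve keys must be strings, so join the two ints as "intA_intB"
    db[str(a[0]) + '_' + str(a[1])] = (a[2], a[3])
db.close()  # flush pending writes and close the shelf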
}}}
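
Once the shelf has been written, lookups need the same `intA_intB` string key. The sketch below shows one way this might look; it is not part of the original thread. The helper `make_key` and the temporary `sample.db` shelf are hypothetical, used here only so the example is self-contained.

{{{#!python
import os
import shelve
import tempfile


def make_key(a, b):
    """Build the 'intA_intB' string key used when the shelf was written."""
    return "%d_%d" % (a, b)


# Hypothetical sample shelf in a temporary directory, for illustration only.
shelf_path = os.path.join(tempfile.mkdtemp(), "sample.db")

db = shelve.open(shelf_path)
try:
    db[make_key(1, 2)] = (3, 4)
    db[make_key(5, 6)] = (7, 8)
finally:
    db.close()

# Reopen read-only and look up by (intA, intB).
db = shelve.open(shelf_path, flag="r")
try:
    found = db[make_key(5, 6)]
    missing = db.get(make_key(0, 0))  # None when the key is absent
finally:
    db.close()

print(found)    # (7, 8)
print(missing)  # None
}}}

Values are pickled on write and unpickled on read, so each lookup touches the disk-backed database instead of keeping every record in memory.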

##= Feedback =