##language:zh
#pragma section-numbers off
## General ZPyUG article template with section-index navigation
<>
## Default navigation, please keep
<>
= Counting Duplicate Items in a List =
##startInc
== Question ==
2008/12/11 卢熙

The goal is the following result:
{{{
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = fn(alist)
print adict   # {'aaa': 3, 'bbb': 1, 'ccc': 2}
}}}
In practice, `len(alist)` may well exceed 100,000. How should `fn` be written so that it does this job as efficiently as possible?

== Solution 1: for ==
{{{
萧萧
reply-to python-cn@googlegroups.com
to python-cn@googlegroups.com
date Thu, Dec 11, 2008 at 22:51
subject [CPyUG:73576] Re: 如何高效的统计列表里面的重复项 (How to efficiently count duplicate items in a list)
}}}

{{{
>>> alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
>>> adict = {}
>>> for i in alist:
...     try:
...         adict[i] += 1
...     except:
...         adict.setdefault(i, 1)
...
>>> adict
{'aaa': 3, 'bbb': 1, 'ccc': 2}
}}}
##endInc

== Solution 2: count() ==
{{{
萧萧
reply-to python-cn@googlegroups.com
to python-cn@googlegroups.com
date Fri, Dec 12, 2008 at 11:18
}}}

{{{
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
# Note: count() rescans the whole list once per distinct item.
adict = dict([(i, alist.count(i)) for i in set(alist)])
}}}

== Solution 3: fromkeys() ==
{{{
don li
reply-to python-cn@googlegroups.com
to python-cn@googlegroups.com
date Fri, Dec 12, 2008 at 11:53
}}}

{{{#!python
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
# Pre-fill every key with 0, then increment in a second pass.
adict = dict.fromkeys(alist, 0)
for a in alist:
    adict[a] += 1
}}}

== Solution 4: .get() ==
{{{#!python
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = {}
for e in alist:
    # get() returns 0 for a key that has not been seen yet.
    adict[e] = adict.get(e, 0) + 1
}}}

== Comparison ==
{{{
nathan.wu@krazypeons.com
reply-to python-cn@googlegroups.com
to python-cn`CPyUG`华蟒用户组
date Sat, Dec 13, 2008 at 01:26
subject [CPyUG:73653] Re: 如何高效的统计列表里面的重复项 (How to efficiently count duplicate items in a list)
}}}

{{{
time python test2.py
865149

real 0m4.840s
user 0m4.610s
sys 0m0.210s

time python test3.py
865113

real 0m5.724s
user 0m5.490s
sys 0m0.220s
}}}

test2.py::
{{{#!python
#!/usr/bin/env python
import random

li = []
d = {}
# Build 2,000,000 random integers in the range [0, 10**6).
for i in range(10 ** 6 * 2):
    li.append(int(random.random() * 10 ** 6))

# Count with an explicit membership test (has_key is Python 2 only).
for e in li:
    if d.has_key(e):
        d[e] = d[e] + 1
    else:
        d[e] = 1

print len(d)
}}}

test3.py::
{{{#!python
#!/usr/bin/env python
import random

li = []
d = {}
# Build 2,000,000 random integers in the range [0, 10**6).
for i in range(10 ** 6 * 2):
    li.append(int(random.random() * 10 ** 6))

# Count with try/except: the first occurrence of a key raises KeyError.
for e in li:
    try:
        d[e] = d[e] + 1
    except:
        d[e] = 1

print len(d)
}}}

fanlix@gmail.com::
{{{
reply-to python-cn@googlegroups.com
to python-cn@googlegroups.com
date Sat, Dec 13, 2008 at 03:05
subject [CPyUG:73655] Re: 如何高效的统计列表里面的重复项 (How to efficiently count duplicate items in a list)
}}}

{{{
$ time python test_dict_speed.py
(5.0090830326080322, 9.3741579055786133)

real 0m33.376s
user 0m32.002s
sys 0m0.872s
}}}

`$ cat test_dict_speed.py `
{{{#!python
# Sorry, my oversight; corrected version:
import random, time

MAX = 10**6
ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

# Variant 1: dict.get() with a default of 0.
t0 = time.time()
d = {}
for x in ls:
    d[x] = d.get(x, 0) + 1

# Variant 2: try/except with plain assignment.
t1 = time.time()
d = {}
for e in ls:
    try:
        d[e] = d[e] + 1
    except:
        d[e] = 1

# Variant 3: try/except with += and setdefault().
t2 = time.time()
d = {}
for e in ls:
    try:
        d[e] += 1
    except:
        d.setdefault(e, 1)
t3 = time.time()

print (t1 - t0, t2 - t1, t3 - t2)
}}}

Results (two runs)::
{{{
(1.5039999485015869, 2.1619999408721924, 2.2820000648498535)
(1.4950001239776611, 2.2029998302459717, 2.2360000610351562)
}}}

The ranking by elapsed time is the same in both runs, and this variant is still the better one: `for x in ls: d[x] = d.get(x, 0) + 1`

=== Conclusion ===
{{{
dengyuanzhong@gmail.com
reply-to python-cn@googlegroups.com
to python-cn`CPyUG`华蟒用户组
date Sun, Dec 14, 2008 at 23:40
subject [CPyUG:73747] Re: 如何高效的统计列表里面的重复项 (How to efficiently count duplicate items in a list)
}}}

{{{#!python
import random, time

MAX = 10**6
ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

# Variant 1: dict.get() with a default of 0.
t0 = time.time()
d = {}
for e in ls:
    d[e] = d.get(e, 0) + 1

# Variant 2: try/except with plain assignment.
t1 = time.time()
d = {}
for e in ls:
    try:
        d[e] = d[e] + 1
    except:
        d[e] = 1

# Variant 3: try/except with += and setdefault().
t2 = time.time()
d = {}
for e in ls:
    try:
        d[e] += 1
    except:
        d.setdefault(e, 1)

# Variant 4: defaultdict(int). The import falls inside the t3..t4 interval.
t3 = time.time()
from collections import defaultdict
d = defaultdict(int)
for e in ls:
    d[e] += 1
t4 = time.time()

print (t1 - t0, t2 - t1, t3 - t2, t4 - t3)
}}}

Results (three runs)::
{{{
(1.3619999885559082, 2.187000036239624, 2.3610000610351562, 1.4879999160766602)
(1.3420000076293945, 2.1319999694824219, 2.2860000133514404, 1.4579999446868896)
(1.3270001411437988, 2.1959998607635498, 2.2860000133514404, 1.4579999446868896)
}}}

This variant still comes out slightly ahead: `for x in ls: d[x] = d.get(x, 0) + 1`
##endInc

----
'''Feedback'''
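
One more option that could be tried alongside the approaches compared above is `collections.Counter`. This is only a sketch and rests on an assumption: `Counter` was added to the standard library in Python 2.7 and 3.1, after this thread was written, so it was not available to the posters. The function name `count_items` is hypothetical and stands in for the `fn` from the question. No timings were taken for it here, so it is not ranked against the measured variants.

```python
from collections import Counter


def count_items(items):
    """Return a plain dict mapping each distinct item to its count.

    Hypothetical helper standing in for the `fn` asked about in the question.
    Assumes Python 2.7+ or 3.1+, where collections.Counter is available.
    """
    # Counter is a dict subclass that tallies hashable items in one pass;
    # dict() converts the result to an ordinary dictionary.
    return dict(Counter(items))


alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = count_items(alist)
# Compare against the result requested in the question.
print(adict == {'aaa': 3, 'bbb': 1, 'ccc': 2})
```

To compare it fairly with the `d.get(x, 0) + 1` loop, it would need to be added as another timed section in the same benchmark script and run on the same interpreter.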