##language:zh
#pragma section-numbers off
##General template for ZPyUG articles with a section-index navigation
<<TableOfContents>>
## Default navigation, please keep
<<Include(ZPyUGnav)>>


= Counting Duplicate Items in a List =


##startInc

== Question ==
2008/12/11 卢熙 <luxi78@gmail.com>

    I want to achieve the following:
{{{
    alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
    adict = fn(alist)
    print(adict)  # expected: {'aaa': 3, 'bbb': 1, 'ccc': 2}
}}}
    In practice, len(alist) may well exceed 100,000. How should the function fn be written so that it does this job very efficiently?

== Solution 1: for ==
{{{
萧萧 <yaksavage@gmail.com>
reply-to	python-cn@googlegroups.com
to	python-cn@googlegroups.com
date	Thu, Dec 11, 2008 at 22:51
subject	[CPyUG:73576] Re: How to efficiently count duplicate items in a list
}}}
	

{{{
>>> alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
>>> adict = {}
>>> for i in alist:
...     try:
...         adict[i] += 1
...     except KeyError:  # first occurrence of this item
...         adict.setdefault(i, 1)
...
>>> adict
{'aaa': 3, 'bbb': 1, 'ccc': 2}
}}}

== Solution 2: count() ==
{{{
萧萧 <yaksavage@gmail.com>
reply-to	python-cn@googlegroups.com
to	python-cn@googlegroups.com
date	Fri, Dec 12, 2008 at 11:18
}}}

{{{
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
# One (item, count) pair per distinct item; alist.count(i) scans the whole list each time.
adict = dict([(i, alist.count(i)) for i in set(alist)])
}}}
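Solution 2 is compact, but `alist.count(i)` rescans the entire list once for every distinct value, so the work grows with (number of distinct values) x (list length). The following is only a rough sketch of that arithmetic: `elements_visited` is a hypothetical helper that does not appear in the thread, and it ignores the cost of building the set.

```python
def elements_visited(items):
    """Estimate element visits: (count() approach, single-pass approach).

    The cost of building set(items) is ignored in both figures.
    """
    distinct = len(set(items))
    # alist.count(x) scans the whole list, and it runs once per distinct value.
    via_count = distinct * len(items)
    # A dict-based loop touches each element exactly once.
    via_single_pass = len(items)
    return via_count, via_single_pass


print(elements_visited(['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']))  # (18, 6)
# Worst case at the size from the question: 100,000 items, all distinct.
print(elements_visited(list(range(100000))))  # (10000000000, 100000)
```

With few distinct values the count() approach can still be acceptable; with many, the single-pass solutions are the safer choice.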

== Solution 3: fromkeys() ==
{{{
don li <donne.cn@gmail.com>
reply-to	python-cn@googlegroups.com
to	python-cn@googlegroups.com
date	Fri, Dec 12, 2008 at 11:53
}}}

{{{#!python
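alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
# Pre-create every key with a count of 0 (first pass over the list).
adict = dict.fromkeys(alist, 0)

# Second pass: every key already exists, so a plain increment is safe.
for a in alist:
    adict[a] += 1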
}}}

== Solution 4: .get() ==

{{{#!python
alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = {}
for e in alist:
   adict[e] = adict.get(e, 0) + 1
}}}
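The question asked for a function `fn(alist)` that returns the dictionary. As a minimal sketch, the `.get()` loop above can be wrapped as follows; the docstring and the comparison at the end are additions for illustration, not part of the thread.

```python
def fn(alist):
    """Return a dict mapping each distinct item of alist to its number of occurrences."""
    adict = {}
    for e in alist:
        # get() returns 0 for an item not seen yet, so no special case is needed.
        adict[e] = adict.get(e, 0) + 1
    return adict


alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = fn(alist)
print(adict == {'aaa': 3, 'bbb': 1, 'ccc': 2})  # True
```

The result is compared with `==` instead of being printed directly because the printed key order of a dict depends on the Python version.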

== Comparison ==
{{{
<nathan.wu@krazypeons.com>
reply-to	python-cn@googlegroups.com
to	python-cn`CPyUG`华蟒用户组 <python-cn@googlegroups.com>
date	Sat, Dec 13, 2008 at 01:26
subject	[CPyUG:73653] Re: How to efficiently count duplicate items in a list
}}}	
{{{
time python test2.py
865149

real    0m4.840s
user    0m4.610s
sys     0m0.210s


time python test3.py
865113

real    0m5.724s
user    0m5.490s
sys     0m0.220s
}}}

 test2.py::
{{{#!python
#!/usr/bin/env python

import random

li = []
d = {}
for i in range(10 ** 6 * 2):
   li.append(int(random.random() * 10 ** 6))

for e in li:
   if d.has_key(e):
       d[e] = d[e] + 1
   else:
       d[e] = 1

print len(d)
}}}

 test3.py::
{{{#!python
#!/usr/bin/env python
import random

li = []
d = {}
for i in range(10 ** 6 * 2):
   li.append(int(random.random() * 10 ** 6))

for e in li:
   try:
       d[e] = d[e] + 1
   except:
       d[e] = 1

print len(d)
}}}


 fanlix@gmail.com::
{{{
reply-to	python-cn@googlegroups.com
to	python-cn@googlegroups.com
date	Sat, Dec 13, 2008 at 03:05
subject	[CPyUG:73655] Re: How to efficiently count duplicate items in a list
}}}
{{{
$ time python test_dict_speed.py
(5.0090830326080322, 9.3741579055786133)

real    0m33.376s
user    0m32.002s
sys     0m0.872s
}}}

`$ cat test_dict_speed.py `
{{{#!python
# Sorry, my oversight; here is the corrected version:
import random, time
MAX = 10**6

ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

t0 = time.time()

d = {}
for x in ls: d[x] = d.get(x, 0) + 1
t1 = time.time()

d = {}
for e in ls:
       try: d[e] = d[e] + 1
       except: d[e] = 1
t2 = time.time()

d = {}
for e in ls:
   try: d[e] += 1
   except: d.setdefault(e, 1)
t3 = time.time()

print (t1 - t0, t2 - t1, t3 - t2)
}}}
 Results (two runs)::
{{{
(1.5039999485015869, 2.1619999408721924, 2.2820000648498535)
(1.4950001239776611, 2.2029998302459717, 2.2360000610351562)
}}}
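Both runs rank the three variants in the same order, and the `dict.get()` version is still the fastest (about 1.5 s versus about 2.2 s for either `try`/`except` variant): `for x in ls: d[x] = d.get(x, 0) + 1`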


=== Conclusion ===
{{{
<dengyuanzhong@gmail.com>
reply-to	python-cn@googlegroups.com
to	python-cn`CPyUG`华蟒用户组 <python-cn@googlegroups.com>
date	Sun, Dec 14, 2008 at 23:40
subject	[CPyUG:73747] Re: How to efficiently count duplicate items in a list
}}}

{{{#!python
import random, time
MAX = 10**6

ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

t0 = time.time()

d = {}
for e in ls: d[e] = d.get(e, 0) + 1
t1 = time.time()

d = {}
for e in ls:
       try: d[e] = d[e] + 1
       except: d[e] = 1
t2 = time.time()

d = {}
for e in ls:
   try: d[e] += 1
   except: d.setdefault(e, 1)
t3 = time.time()

from collections import defaultdict
d = defaultdict(int)
for e in ls:
  d[e] += 1
t4 = time.time()

print (t1 - t0, t2 - t1, t3 - t2, t4 - t3)
}}}
 Results (three runs)::
{{{
(1.3619999885559082, 2.187000036239624, 2.3610000610351562, 1.4879999160766602)
(1.3420000076293945, 2.1319999694824219, 2.2860000133514404, 1.4579999446868896)
(1.3270001411437988, 2.1959998607635498, 2.2860000133514404, 1.4579999446868896)
}}}
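Even against `defaultdict(int)` (about 1.46 to 1.49 s), this version still comes out slightly ahead (about 1.33 to 1.36 s): `for x in ls: d[x] = d.get(x, 0) + 1`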
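The timings above come from single `time.time()` differences on one 2008 machine, so the ranking may not carry over to other interpreters or hardware. Below is a hedged sketch of how the comparison could be repeated with the standard `timeit` module. It assumes Python 3 syntax (the thread's scripts are Python 2), and the function names, `run_benchmark`, and its parameters are all hypothetical names chosen for illustration.

```python
import random
import timeit
from collections import defaultdict


def count_get(items):
    """dict.get() with a default of 0 (the variant the thread found fastest)."""
    d = {}
    for e in items:
        d[e] = d.get(e, 0) + 1
    return d


def count_try(items):
    """try/except: increment, and create the key on KeyError."""
    d = {}
    for e in items:
        try:
            d[e] += 1
        except KeyError:
            d[e] = 1
    return d


def count_defaultdict(items):
    """defaultdict(int): missing keys start at 0 automatically."""
    d = defaultdict(int)
    for e in items:
        d[e] += 1
    return d


def run_benchmark(n_items=200000, repeats=3, seed=0):
    """Return {function name: best wall-clock seconds over `repeats` runs}."""
    rng = random.Random(seed)
    # Same shape as the thread's data: about two items per distinct value.
    data = [rng.randint(1, n_items // 2) for _ in range(n_items)]
    results = {}
    for func in (count_get, count_try, count_defaultdict):
        timings = timeit.repeat(lambda: func(data), number=1, repeat=repeats)
        results[func.__name__] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(run_benchmark().items(), key=lambda kv: kv[1]):
        print("%-18s %.3f s" % (name, seconds))
```

Taking the minimum over several repeats reduces noise from other processes. The absolute numbers will differ from those in the thread, and the order may differ as well.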

##endInc

----
'''Feedback'''