| == 方案4:.get() == {{{#!python alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc'] adict = {} for e in alist: adict[e] = adict.get(e, 0) + 1 }}} | |
| Line 160: | Line 169: | 
| 不好意思,我的疏忽,改了: ========================================================= | |
| Line 173: | Line 184: | 
| try: d[e] = d[e] + 1 except: d[e] = 1 | try: d[e] = d[e] + 1 except: d[e] = 1 | 
| Line 177: | Line 188: | 
| print (t1 - t0, t2 - t1) }}} | d = {} for e in ls: try: d[e] += 1 except: d.setdefault(e, 1) t3 = time.time() print (t1 - t0, t2 - t1, t3 - t2) }}} 结果(运行了两次):: {{{ (1.5039999485015869, 2.1619999408721924, 2.2820000648498535) (1.4950001239776611, 2.2029998302459717, 2.2360000610351562) }}} 所耗时间排序一样,还是这个好一些:`for x in ls: d[x] = d.get(x, 0) + 1` === 结论 === {{{ [email protected]> reply-to [email protected] to python-cn`CPyUG`华蟒用户组 <[email protected]> date Sun, Dec 14, 2008 at 23:40 subject [CPyUG:73747] Re: 如何高效的统计列表里面的重复项 }}} {{{#!python import random, time MAX = 10**6 ls = [random.randint(1, MAX) for x in xrange(2*MAX)] t0 = time.time() d = {} for e in ls: d[e] = d.get(e, 0) + 1 t1 = time.time() d = {} for e in ls: try: d[e] = d[e] + 1 except: d[e] = 1 t2 = time.time() d = {} for e in ls: try: d[e] += 1 except: d.setdefault(e, 1) t3 = time.time() from collections import defaultdict d = defaultdict(int) for e in ls: d[e] += 1 t4 = time.time() print (t1 - t0, t2 - t1, t3 - t2, t4 - t3) }}} 结果(运行了三次):: {{{ (1.3619999885559082, 2.187000036239624, 2.3610000610351562, 1.4879999160766602) (1.3420000076293945, 2.1319999694824219, 2.2860000133514404, 1.4579999446868896) (1.3270001411437988, 2.1959998607635498, 2.2860000133514404, 1.4579999446868896) }}} 还是这个略胜一筹:`for x in ls: d[x] = d.get(x, 0) + 1` | 
| Line 183: | Line 260: | 
Created by ZoomQuiet [2008-12-12T01:33:16Z]
Counting duplicate items in a list
Question
2008/12/11 卢熙 <[email protected]>
- I want to achieve the following:
    alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
    adict = fn(alist)
    print adict  # {'aaa': 3, 'bbb': 1, 'ccc': 2}
- In practice, len(alist) may well exceed 100,000. How should fn be written so that it does this job efficiently?
Approach 1: for
萧萧 <[email protected]> reply-to [email protected] to [email protected] date Thu, Dec 11, 2008 at 22:51 subject [CPyUG:73576] Re: How to efficiently count duplicate items in a list
>>> alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
>>> adict = {}
>>> for i in alist:
...     try:
...             adict[i] += 1
...     except KeyError:
...             adict.setdefault(i, 1)
...
>>> adict
{'aaa': 3, 'bbb': 1, 'ccc': 2}
Approach 2: count()
萧萧 <[email protected]> reply-to [email protected] to [email protected] date Fri, Dec 12, 2008 at 11:18
    alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
    adict = dict([(i, alist.count(i)) for i in set(alist)])
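A note on cost, with a small sketch: list.count() scans the whole list on every call, so calling it once per distinct value takes time roughly proportional to len(alist) times the number of distinct values. The function name count_with_count is hypothetical and used only for illustration.

```python
def count_with_count(alist):
    """Count occurrences by calling list.count() once per distinct item."""
    # set(alist) yields each distinct value once; alist.count(i) rescans
    # the whole list each time, so this suits short lists only.
    return dict((i, alist.count(i)) for i in set(alist))


alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
adict = count_with_count(alist)
print(adict == {'aaa': 3, 'bbb': 1, 'ccc': 2})  # prints True
```

For the list sizes mentioned in the question (over 100,000 items), the single-pass approaches are likely to be a better fit.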
Approach 3: fromkeys()
don li <[email protected]> reply-to [email protected] to [email protected] date Fri, Dec 12, 2008 at 11:53
Approach 4: .get()
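A minimal sketch of the .get() approach: dict.get(e, 0) returns 0 for a key that has not been seen yet, so no exception handling is needed. Wrapping the loop in a function called fn, the name used in the question, is an assumption made here for illustration.

```python
def fn(alist):
    """Count how many times each item occurs in alist."""
    adict = {}
    for e in alist:
        # get() returns the current count, or 0 the first time e is seen
        adict[e] = adict.get(e, 0) + 1
    return adict


alist = ['aaa', 'ccc', 'bbb', 'aaa', 'aaa', 'ccc']
print(fn(alist) == {'aaa': 3, 'bbb': 1, 'ccc': 2})  # prints True
```

This makes a single pass over the list, which is why it scales to the large inputs the question describes.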
Comparison
[email protected]> reply-to [email protected] to python-cn`CPyUG`华蟒用户组 <[email protected]> date Sat, Dec 13, 2008 at 01:26 subject [CPyUG:73653] Re: How to efficiently count duplicate items in a list
time python test2.py
865149
real 0m4.840s
user 0m4.610s
sys 0m0.210s
time python test3.py
865113
real 0m5.724s
user 0m5.490s
sys 0m0.220s
- test2.py
- test3.py
reply-to [email protected] to [email protected] date Sat, Dec 13, 2008 at 03:05 subject [CPyUG:73655] Re: How to efficiently count duplicate items in a list
$ time python test_dict_speed.py
(5.0090830326080322, 9.3741579055786133)
real 0m33.376s
user 0m32.002s
sys 0m0.872s
$ cat test_dict_speed.py
Sorry, that was my oversight. Corrected version:
    import random, time
    MAX = 10**6

    # 2*MAX random integers in [1, MAX], so many values repeat
    ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

    t0 = time.time()

    # 1. dict.get() with a default of 0
    d = {}
    for x in ls: d[x] = d.get(x, 0) + 1
    t1 = time.time()

    # 2. try/except, plain assignment on a missing key
    d = {}
    for e in ls:
        try: d[e] = d[e] + 1
        except KeyError: d[e] = 1
    t2 = time.time()

    # 3. try/except, setdefault() on a missing key
    d = {}
    for e in ls:
        try: d[e] += 1
        except KeyError: d.setdefault(e, 1)
    t3 = time.time()

    # Python 2: prints the three elapsed times as a tuple
    print (t1 - t0, t2 - t1, t3 - t2)
- Results (two runs)
(1.5039999485015869, 2.1619999408721924, 2.2820000648498535)
(1.4950001239776611, 2.2029998302459717, 2.2360000610351562)
The ranking by elapsed time is the same in both runs, and this one is still the best: for x in ls: d[x] = d.get(x, 0) + 1
Conclusion
[email protected]> reply-to [email protected] to python-cn`CPyUG`华蟒用户组 <[email protected]> date Sun, Dec 14, 2008 at 23:40 subject [CPyUG:73747] Re: How to efficiently count duplicate items in a list
    import random, time
    from collections import defaultdict
    MAX = 10**6

    # 2*MAX random integers in [1, MAX], so many values repeat
    ls = [random.randint(1, MAX) for x in xrange(2*MAX)]

    t0 = time.time()

    # 1. dict.get() with a default of 0
    d = {}
    for e in ls: d[e] = d.get(e, 0) + 1
    t1 = time.time()

    # 2. try/except, plain assignment on a missing key
    d = {}
    for e in ls:
        try: d[e] = d[e] + 1
        except KeyError: d[e] = 1
    t2 = time.time()

    # 3. try/except, setdefault() on a missing key
    d = {}
    for e in ls:
        try: d[e] += 1
        except KeyError: d.setdefault(e, 1)
    t3 = time.time()

    # 4. defaultdict(int): missing keys start at 0
    d = defaultdict(int)
    for e in ls:
        d[e] += 1
    t4 = time.time()

    # Python 2: prints the four elapsed times as a tuple
    print (t1 - t0, t2 - t1, t3 - t2, t4 - t3)
- Results (three runs)
(1.3619999885559082, 2.187000036239624, 2.3610000610351562, 1.4879999160766602)
(1.3420000076293945, 2.1319999694824219, 2.2860000133514404, 1.4579999446868896)
(1.3270001411437988, 2.1959998607635498, 2.2860000133514404, 1.4579999446868896)
This one still comes out slightly ahead (about 1.33 to 1.36 s, against about 1.46 to 1.49 s for defaultdict): for x in ls: d[x] = d.get(x, 0) + 1
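To re-run this comparison on another machine or interpreter, a sketch along the following lines may help. It uses the standard timeit module instead of manual time.time() calls and keeps the fastest of several repetitions. The function names, the benchmark helper, and the smaller default input size are assumptions made for illustration, not part of the original thread, and the ranking it reports may differ from the 2008 figures above.

```python
import random
import timeit
from collections import defaultdict


def count_get(ls):
    d = {}
    for e in ls:
        d[e] = d.get(e, 0) + 1
    return d


def count_try(ls):
    d = {}
    for e in ls:
        try:
            d[e] += 1
        except KeyError:
            d[e] = 1
    return d


def count_defaultdict(ls):
    d = defaultdict(int)
    for e in ls:
        d[e] += 1
    return d


def benchmark(n=10**5, repeat=3):
    """Return {function name: best time in seconds} on 2*n random ints."""
    ls = [random.randint(1, n) for _ in range(2 * n)]
    results = {}
    for f in (count_get, count_try, count_defaultdict):
        # repeat() returns one timing per repetition; keep the fastest
        times = timeit.repeat(lambda: f(ls), number=1, repeat=repeat)
        results[f.__name__] = min(times)
    return results


if __name__ == '__main__':
    for name, seconds in sorted(benchmark().items(), key=lambda kv: kv[1]):
        print('%-20s %.3f s' % (name, seconds))
```

Taking the minimum over repetitions reduces noise from other processes. The proportion of repeated keys also matters, since the try/except variants pay the exception cost only on the first occurrence of each key.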
Feedback
