Contents
统计list里面重复次数最多的元素
相关
MiscItems/2008-12-12~统计列表重复项
再次问题
Vingel <[email protected]> sender-time Sent at 11:14 (GMT+08:00). Current time there: 11:37 PM. ✆ reply-to [email protected] to [email protected] date Mon, Jun 29, 2009 at 11:14 subject [CPyUG:90833] 关于在一个list里面找出重复次数最多的元素
问个问题,看看大家有没有好的方法,一个list里面包含很多重复元素,问:如何最快找出重复最多的N个元素。
比如,有列表[1,2,2,3,3,3,4,4,4]
我目前是思路是遍历list,每找到一个放到一个dict里去。最后比较值最大的。
我总觉得这样实现稍微ugly了点。看看大家有没有更为简单而优雅的方法。
比如去除重复元素: list(set(some_list)) 这样的方法。
py 3.1
@@ <[email protected]> sender-time Sent at 12:51 (GMT+08:00) 正好看到3.1的新特性里有
Added a collections.Counter class to support convenient counting of unique items in a sequence or iterable: >>> Counter(['red', 'blue', 'red', 'green', 'blue', 'blue']) Counter({'blue': 3, 'red': 2, 'green': 1})
Qiangning Hong
{{{<[email protected]> sender-time Sent at 16:46 (GMT+08:00). Current time there: 11:40 PM. ✆ reply-to [email protected] to [email protected] date Mon, Jun 29, 2009 at 16:46 subject [CPyUG:90882] Re: 关于在一个list里面找出重复次数最多的元素 }}}
2009/6/29 Leo Jay <[email protected]>: > 2009/6/29 Heroboy <[email protected]>: >> 2009/6/29 Leo Jay <[email protected]> >>> 或者: >>> >>> max(groupby(sorted(lst)), key=lambda x: x[0])[0] >>> 4 >> >> max(groupby(sorted(lst)), key=operator.itemgetter(0))[0] >>> > 只取一个元素的话,用itemgetter不划算,代码更长不说,还要import一个库。
operator.itemgetter 比 lambda 快。
另外,我前面的代码有个笔误,最后应该是 [1] 而不是 [0] ,不然取出来的是最大的重复次数,而不是列表里的值。
而 Heroboy 和 Leo Jay 的用max的key参数的实现也应该是 max(groupby(sorted(lst)), key=lambda x: len(list(x[1])))[0] 才对
zsp::频繁集统计
频繁集统计 python 封装 - 张沈鹏,在路上... - JavaEye技术网站
封装的是附件这篇paper的count
因为对比发现这个的综合性能比较好
code
xxx@ooo ~/zspal/zfeq/frequent-items/src $ cat Release/pyzlcl.py
1 #coding:utf-8
2 from pyzlcl import Lcl
3
4 #0.001 是要统计的频率下限
5 lcl = Lcl(0.001)
6
7 for i in xrange(200):
8 for j in xrange(100):
9 for k in xrange(j):
10 lcl.update(j, 1)
11
12 for i in xrange(1,100,30):
13 print i
14 print "出现的次数(估计值)",lcl.est(i)
15 print "estimate the worst case error in the estimate of a
16 particular item :" ,lcl.err(i)
17 print "---"*20
18
19 result = lcl.output(1000)
20 result.sort(key=lambda x:-x[1])
21 print result
22
23 print lcl.capacity()
usage
xxx@ooo ~/zspal/zfeq/frequent-items/src $ python Release/pyzlcl.py 1 出现的次数(估计值) 200 estimate the worst case error in the estimate of a particular item : 0 ------------------------------------------------------------ 31 出现的次数(估计值) 6200 estimate the worst case error in the estimate of a particular item : 0 ------------------------------------------------------------ 61 出现的次数(估计值) 12200 estimate the worst case error in the estimate of a particular item : 0 ------------------------------------------------------------ 91 出现的次数(估计值) 18200 estimate the worst case error in the estimate of a particular item : 0 ------------------------------------------------------------ [(99, 19800), (98, 19600), (97, 19400), (96, 19200), (95, 19000), (94, 18800), (93, 18600), (92, 18400), (91, 18200), (90, 18000), (89, 17800), (88, 17600), (87, 17400), (86, 17200), (85, 17000), (84, 16800), (83, 16600), (82, 16400), (81, 16200), (80, 16000), (79, 15800), (78, 15600), (77, 15400), (76, 15200), (75, 15000), (74, 14800), (73, 14600), (72, 14400), (71, 14200), (70, 14000), (69, 13800), (68, 13600), (67, 13400), (66, 13200), (65, 13000), (64, 12800), (63, 12600), (62, 12400), (61, 12200), (60, 12000), (59, 11800), (58, 11600), (57, 11400), (56, 11200), (55, 11000), (54, 10800), (53, 10600), (52, 10400), (51, 10200), (50, 10000), (49, 9800), (48, 9600), (47, 9400), (46, 9200), (45, 9000), (44, 8800), (43, 8600), (42, 8400), (41, 8200), (40, 8000), (39, 7800), (38, 7600), (37, 7400), (36, 7200), (35, 7000), (34, 6800), (33, 6600), (32, 6400), (31, 6200), (30, 6000), (29, 5800), (28, 5600), (27, 5400), (26, 5200), (25, 5000), (24, 4800), (23, 4600), (22, 4400), (21, 4200), (20, 4000), (19, 3800), (18, 3600), (17, 3400), (16, 3200), (15, 3000), (14, 2800), (13, 2600), (12, 2400), (11, 2200), (10, 2000), (9, 1800), (8, 1600), (7, 1400), (6, 1200), (5, 1000)] 44092
c中的用法演示
xxx@ooo ~/zspal/zfeq/frequent-items/src $ cat zlcl.cc
#include "prng.h" #include "lossycount.h" #include <iostream> size_t RunExact(uint32_t thresh, std::vector<uint32_t>& exact); template<class T> void generate_data(T* data,size_t number,uint32_t u32DomainSize,double dSkew); int main(int argc, char **argv) { size_t stNumberOfPackets = 10000000; // 样本数 double dPhi = 0.0001; //统计频率大于dPhi的元素,这里取万分之一 uint32_t u32DomainSize = 1048575; //样本取值范围 std::vector<uint32_t> exact(u32DomainSize + 1, 0);//精确统计,以便于做对比 //生成 Zipf 分布的数据 std::vector<uint32_t> data; generate_data(&data,stNumberOfPackets,u32DomainSize,1.0); //将测试数据分为20段运行 每运行一段 输出一次统计数据 size_t stRuns = 20; size_t stRunSize = data.size() / stRuns; size_t stStreamPos = 0; LCL_type* lcl = LCL_Init(dPhi); for (size_t run = 1; run <= stRuns; ++run) { for (size_t i = stStreamPos; i < stStreamPos + stRunSize; ++i) { exact[data[i]]+=1; } for (size_t i = stStreamPos; i < stStreamPos + stRunSize; ++i) { LCL_Update(lcl,data[i],1); } uint32_t thresh = static_cast<uint32_t>(floor(dPhi * run * stRunSize)); if (thresh == 0) thresh = 1; std::cout<<"Thresh is "<<thresh<<std::endl; size_t hh = RunExact(thresh, exact); std::cout << "Run: " << run << ", Exact: " << hh << std::endl; std::map<uint32_t, uint32_t> res; res = LCL_Output(lcl,thresh); std::cout << "LCL: " << run << ", Count: " << res.size() << std::endl; stStreamPos += stRunSize; } LCL_Destroy(lcl); printf("\n"); return 0; } size_t RunExact(uint32_t thresh, std::vector<uint32_t>& exact) { size_t hh = 0; for (size_t i = 0; i < exact.size(); ++i) if (exact[i] >= thresh) ++hh; return hh; } template<class T> void generate_data(T* data,size_t number,uint32_t u32DomainSize,double dSkew){ prng_type * prng; prng=prng_Init(44545,2); int64_t a = (int64_t) (prng_int(prng)% MOD); int64_t b = (int64_t) (prng_int(prng)% MOD); prng_Destroy(prng); Tools::Random r = Tools::Random(0xF4A54B); Tools::PRGZipf zipf = Tools::PRGZipf(0, u32DomainSize, dSkew, &r); size_t stCount = 0; for (int i = 0; i < number; ++i) { ++stCount; if (stCount % 500000 == 0) std::cout <<"Generate Data " << stCount << std::endl; uint32_t v = zipf.nextLong(); uint32_t value = hash31(a, b, v) & u32DomainSize; data->push_back(value); } }
下载
-- 弓长 孝文 、 王
反馈