Contents

大数据集:频繁测试成员关系
1. 题面儿
2. ZSP

CPyUG社区>

大数据集:频繁测试成员关系

题面儿

Roy Liu <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Mon, Mar 9, 2009 at 19:28
subject [CPyUG:80888] 大数据集，频繁测试成员关系，各位有何建议

需要编写一个针对技术文档翻译后的校对工作的小程序，不知各位能否给点建议。

我的需要是这样的：
- 在进行翻译之前，发包方会提供一个中英对照的标准术语翻译表（放在 Excel 文件中），翻译人员在翻译文档时，如果遇到相应的术语，就要按照对方提供的术语表来翻译。
- 这个程序的作用就是预先识别出待翻译文档中的所有术语，并用特殊格式显示出文档中包含的术语，以便提醒翻译人员。
我在做法是这样的：
1. 把术语表读取出来，做成一个字典 textPair{}，其中以英文术语作为键值，以中文翻译作为项值，即 textPair[En] = Ch。
2. 将待翻译文档中的内容，分割成一个一个句子(用翻译工具可以做到这个)，然后将这些句子装入一个列表 transPair[] 中。
3. 然后:

for term in textPair:
         for sentence in transPair:
                if term in sentence:
                       do something

但是，我发现这样做的效率很低，自己又没什么更好的方法，因此来这里看看，希望有高手能共同探讨一下。

ZSP

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Mon, Mar 9, 2009 at 19:53

囧英文的很好办

虽然我有一个能牛XX的多模式匹配不过这里还用不到;just 改一下循环

for sentence in transPair:
  result=[]
  for term in sentence.split(' '):
       if term in textPair:
          result.append("xxxxxxxxxxxxx")
       else:
           result.appned(term)
   print " ".join(result)

多模式匹配

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 07:33

算了我还是扔出牛逼无比的多模式匹配吧

感谢伟大的redsea前辈,因为代码是他抠出来的:)

建议

单个单词用我原来给的方法
多个单词用多模式匹配
这样效率最高

见附件: zspy.7z

2.5应该可以用;其他的要自己编译一下
当然,极度高效的做法是去改多模式匹配的封装修改他的回调函数不过还是2-3次扫描比较快
多模式匹配算法简介.pdf

类似这样???

 Liming_Do <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at

>>> text = '''This performance tuning guide is designed to help database administrators and developers configure Microsoft SQL Server 2000 for maximum performance and to assist in determining causes of poor performance of relational databases, including those used in data warehousing. SQL Server 2000 has been dramatically enhanced to create a largely auto-configuring and self-tuning database server.'''
>>>
>>> maptable = {
    'Microsoft SQL Server 2000' : '{{Microsoft SQL Server 2000}}' ,
    'SQL Server 2000 ' : '{{SQL Server 2000}} ',
    'data warehousing' : '数据仓库' ,
    'relational databases' : '关系数据库' ,
}
>>> import re
>>> def trans(maptable, text):
        for key in maptable:
            text = re.compile(key, re.M).sub(maptable[key], text.rstrip()) + '\n'
        print text
>>>
>>> trans(maptable, text)
This performance tuning guide is designed to help database administrators and developers configure {{Microsoft SQL Server 2000}} for maximum performance and to assist in determining causes of poor performance of 关系数据库, including those used in 数据仓库. {{SQL Server 2000}} has been dramatically enhanced to create a largely auto-configuring and self-tuning database server.

Trie树

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 07:39

其实,这个算法更合适

Trie树.7z

关键词/词组->分类识别系统

关键词/词组->分类识别系统设计+实现(与这个话题无关囧)

关键词/词组->分类识别系统设计+实现 - 张沈鹏,在路上...

设计

给出一篇文章,判断其属于某一个分类格式如下,{}中表示是一组词

>>> 广告事前
贵公司
{
办证
电话
}

>>> 广告事后
{
网站
代购
}
开发票

构思:

1.单个关键词,实际应用中一个单关键词最多属于一个分类,先做一下倒排

censor_single_keyword={
    "贵公司":1,
    "开发票":2,
}

配合多关键词查找(比如Aho-Corasick State Machine),很easy

a. 给每一个组分配id
    1:{
    办证
    电话
    },
    2:{
    网站
    代购
    }

b.  记录    类别id - 分组id
    用array

c.  记录    分组id - id的中内容长度
    用array
    'B' unsigned char int 1

d.    词做倒排
    词  <-> 分类id列表

e.    多关键词查找 统计出出现的词的set
   
f.    新建一个字典
    每一个出现词 分类id count+=1
    for xx in 这个字典
        if count == 分组id 中 id的中内容长度
    确认是这个分类(这里只考虑属于一个分类)

实现

~/algorithm/censor/lib/multi_pattern_search $ cat keyword2group.py

   1 ~/algorithm/censor $ cat lib/multi_pattern_search/keyword2group.py
   2 #coding:utf-8
   3 from array import array
   4 from collections import defaultdict
   5 from multi_pattern_search import MultiPatternSearch
   6 
   7 
   8 class CensorKeyword(object):
   9     def __init__(self):
  10         pass
  11 
  12     def _reload(self):
  13         self.group_name_list = []
  14         self.single_keyword_to_cat_id = {}
  15         self.group_keyword = []
  16         self.group_keyword_to_group_id = {}
  17         self.group_id_to_cat_id = array('I')
  18         self.group_keyword_length = array('B')
  19         self.single_search = MultiPatternSearch()
  20         self.group_search = MultiPatternSearch()
  21 
  22     def _init_search(self):
  23         for i in self.single_keyword_to_cat_id.keys():
  24             self.single_search.add_keyword(i)
  25         group_keyword_set = set()
  26         for i in self.group_keyword:
  27             for k in i:
  28                 group_keyword_set.add(k)
  29         for i in group_keyword_set:
  30             self.group_search.add_keyword(i)
  31 
  32     def _keyword_to_group_id(self):
  33         #词做倒排
  34         group_keyword_to_group_id = self.group_keyword_to_group_id
  35         for pos, group_keyword_list in enumerate(self.group_keyword):
  36             for keyword in group_keyword_list:
  37                 if keyword not in group_keyword_to_group_id:
  38                     group_keyword_to_group_id[keyword] = array('I')
  39                 group_keyword_to_group_id[keyword].append(pos)
  40 
  41     def load_txt(self, txt):
  42         self._reload()
  43         SINGE_KEYWORD = 0
  44         GROUP_KEYWORD = 1
  45         txt = txt.split("\n")
  46         state = SINGE_KEYWORD
  47         group_keyword = set()
  48         #处理文本
  49         for line in txt:
  50             line = line.strip()
  51             if not line:
  52                 continue
  53             if line.startswith(">>>"):
  54                 cat_id = len(self.group_name_list)
  55                 group_name_list = line[3:].strip()
  56                 self.group_name_list.append(group_name_list)
  57                 state = SINGE_KEYWORD
  58             elif len(line) == 1:
  59                 if line == '{':
  60                     state = GROUP_KEYWORD
  61                 elif line == "}":
  62                     state = SINGE_KEYWORD
  63                     if group_keyword:
  64                         group_keyword = tuple(group_keyword)
  65                         self.group_keyword.append(group_keyword)
  66                         self.group_id_to_cat_id.append(cat_id)
  67                         self.group_keyword_length.append(len(group_keyword))
  68                         group_keyword = set()
  69             elif state == SINGE_KEYWORD:
  70                 self.single_keyword_to_cat_id[line] = cat_id
  71             elif state == GROUP_KEYWORD:
  72                 group_keyword.add(line)
  73 
  74         self._keyword_to_group_id()
  75         self._init_search()
  76 
  77     def which_group_id(self, text):
  78         total_group = len(self.group_name_list)
  79         group = total_group
  80 
  81         single_search = self.single_search.count(text)
  82         all_keys =        single_search.keys()
  83         if single_search:
  84             for key in single_search.keys():
  85                 cat_id = self.single_keyword_to_cat_id[key]
  86                 if cat_id < group:
  87                     group = cat_id
  88 
  89         group_search = self.group_search.count(text)
  90         all_keys.extend(group_search.keys())
  91         group_count = defaultdict(int)
  92         for k in group_search.keys():
  93             group_id_list = self.group_keyword_to_group_id[k]
  94             for i in group_id_list:
  95                 group_count[i]+=1
  96 
  97         for k, v in group_count.iteritems():
  98             if self.group_keyword_length[k] <= v:
  99                 #返回最小的那个 事情比事后优先级高
 100                 cat_id = self.group_id_to_cat_id[k]
 101                 if cat_id < group:
 102                     group = cat_id
 103         if group < total_group:
 104             return group, all_keys
 105 
 106     def which_group_name(self, text):
 107         pos = self.which_group_id(text)
 108         if pos is not None:
 109             pos , keywords = pos
 110             keywords = " ".join(keywords)
 111             return self.group_name_list[pos], keywords
 112 
 113 
 114 if __name__ == "__main__":
 115     test_input = """
 116 >>> 广告事前
 117     贵公司
 118 {
 119     办证
 120     电话
 121 }
 122 {
 123     是
 124     网站
 125     促销
 126 }
 127 
 128 
 129 >>> 敏感文章事前
 130 
 131 
 132 >>> 广告事后
 133     开发票
 134 {
 135     网站
 136     代购
 137 }
 138     的
 139 
 140 
 141 >>> 敏感文章事后
 142 
 143 """
 144     censor_keyword = CensorKeyword()
 145     censor_keyword.load_txt(test_input)
 146     censor_keyword.load_txt(test_input)
 147     censor_keyword.load_txt(test_input)
 148     test = """
 149 牛奶@咖啡3月21日专场演出（免票）
 150 牛奶@咖啡3月21日专场演出（免票）
 151 活动介绍
 152 牛奶@咖啡自从巡演回来后一直没有在北京作专场演出，如果你对最近每次的小型演出还不过瘾，就到我们的这次专场上听个够吧！“一起不孤单！” 
 153 
 154 让更多的人来一起倾听牛奶@咖啡的歌声吧！如果你还不知道他们，去下面的网址了解一下吧！http://www.douban.com/group/milkcoffee/ 
 155 
 156 牛奶@咖啡3月21日周六下午1点半SOHO尚都专场，免票无需报名 
 157 
 158 地点：SOHO尚都西塔一层钢琴舞台（地铁一号线永安里站B口出倒126路公交芳草地站下车即是） 
 159 咨询电话：老徐：13801108173 
 160 活动网址：http://www.douban.com/event/10533960/
 161 同时还有市集可以逛：http://www.douban.com/event/10503500/
 162     """
 163 
 164     #print " ".join(censor_keyword.which_group_name("贵公司"))
 165     #print " ".join(censor_keyword.which_group_name("代购网站的"))
 166     print " ".join(censor_keyword.which_group_name(test))

索引法

shell909090 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 00:09

这个么——空间换时间，索引法。首先我做出如下假定：

len (textPair) = N
len (transPair) = M

sentence中平均有Q个单词。那么，我来标记一下你算法的时间消耗。

for term in textPair: ＃ N
        for sentence in transPair: ＃ M
               if term in sentence: ＃ N＊M＊len(term)＊len(sentence) ＝ N＊len(term)＊文章长度
                      do something

而后，我建议你构造一个开链hash表，来索引整个查找过程。令某个单词的索引key为这个单词的头5个字母(当然，具体你可以调整)，那么你连分割句子都不用做，只要持续的给我供给单词流，就可以鉴别这个单词流。

#索引建立过程
hash_table = {};
for word in tranPair:
    if word[:5] not in hash_table:
    hash_table[word[:5]] = [];
    hash_table[word[:5]].append (word);
#匹配过程
for word in f_get_word (trans_info): # 文章单词数
    if word[:5] in hash_table: # log2 (N)
    match = hash_table[word[:5]]; # log2 (N)
for term in match: # 碰撞率，通常是N/所有常见的开头。大约应该是2-3之间。
    if term match the rest of the sentence: # 较差估计为 平均碰撞率 * 文章单词数
        do something

具体的理论，计算过程，还有优化什么的我就不说了。上面的东西纯属于抛砖引玉，可能还有更好的办法。不过我用类似的方法存储分词辞典的(准确的说，当时用的是红黑树加tied树)。

逆向最大匹配

于海 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 16:10

      (2)循环的读入每一个句子S，设句子中的字数为n；
      (3)设置一个最大词长度，就是我们要截取的词的最大长度max；
      (4)从句子中取n-max到n的字符串subword，去字典中查找是否有这个词。如果有就走(5),没有就走(6)；
      (5)记住subword，从n-max付值给n，继续执行(4)，直到n=0。
      (6)将max-1，再执行(4)。

非常经典的算法。

反馈

创建 by -- ZoomQuiet [2009-03-10 03:02:14]