大数据集:频繁测试成员关系

题面儿

Roy Liu <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Mon, Mar 9, 2009 at 19:28
subject [CPyUG:80888] 大数据集，频繁测试成员关系，各位有何建议

需要编写一个针对技术文档翻译后的校对工作的小程序，不知各位能否给点建议。

我的需要是这样的：
- 在进行翻译之前，发包方会提供一个中英对照的标准术语翻译表（放在 Excel 文件中），翻译人员在翻译文档时，如果遇到相应的术语，就要按照对方提供的术语表来翻译。
- 这个程序的作用就是预先识别出待翻译文档中的所有术语，并用特殊格式显示出文档中包含的术语，以便提醒翻译人员。
我在做法是这样的：
1. 把术语表读取出来，做成一个字典 textPair{}，其中以英文术语作为键值，以中文翻译作为项值，即 textPair[En] = Ch。
2. 将待翻译文档中的内容，分割成一个一个句子(用翻译工具可以做到这个)，然后将这些句子装入一个列表 transPair[] 中。
3. 然后:

for term in textPair:
         for sentence in transPair:
                if term in sentence:
                       do something

但是，我发现这样做的效率很低，自己又没什么更好的方法，因此来这里看看，希望有高手能共同探讨一下。

ZSP

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Mon, Mar 9, 2009 at 19:53

囧英文的很好办

虽然我有一个能牛XX的多模式匹配不过这里还用不到;just 改一下循环

for sentence in transPair:
  result=[]
  for term in sentence.split(' '):
       if term in textPair:
          result.append("xxxxxxxxxxxxx")
       else:
           result.appned(term)
   print " ".join(result)

多模式匹配

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 07:33

算了我还是扔出牛逼无比的多模式匹配吧

感谢伟大的redsea前辈,因为代码是他抠出来的:)

建议

单个单词用我原来给的方法
多个单词用多模式匹配
这样效率最高

见附件: attachment:zspy.7z

2.5应该可以用;其他的要自己编译一下
当然,极度高效的做法是去改多模式匹配的封装修改他的回调函数不过还是2-3次扫描比较快
attachment:多模式匹配算法简介.pdf

类似这样???

 Liming_Do <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at

>>> text = '''This performance tuning guide is designed to help database administrators and developers configure Microsoft SQL Server 2000 for maximum performance and to assist in determining causes of poor performance of relational databases, including those used in data warehousing. SQL Server 2000 has been dramatically enhanced to create a largely auto-configuring and self-tuning database server.'''
>>>
>>> maptable = {
    'Microsoft SQL Server 2000' : '{{Microsoft SQL Server 2000}}' ,
    'SQL Server 2000 ' : '{{SQL Server 2000}} ',
    'data warehousing' : '数据仓库' ,
    'relational databases' : '关系数据库' ,
}
>>> import re
>>> def trans(maptable, text):
        for key in maptable:
            text = re.compile(key, re.M).sub(maptable[key], text.rstrip()) + '\n'
        print text
>>>
>>> trans(maptable, text)
This performance tuning guide is designed to help database administrators and developers configure {{Microsoft SQL Server 2000}} for maximum performance and to assist in determining causes of poor performance of 关系数据库, including those used in 数据仓库. {{SQL Server 2000}} has been dramatically enhanced to create a largely auto-configuring and self-tuning database server.

Trie树

张沈鹏 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 07:39

其实,这个算法更合适

attachment:Trie树.7z

关键词/词组->分类识别系统

关键词/词组->分类识别系统设计+实现

[http://zsp.javaeye.com/blog/346503 关键词/词组->分类识别系统设计+实现 - 张沈鹏,在路上... ]

设计

给出一篇文章,判断其属于某一个分类格式如下,{}中表示是一组词

>>> 广告事前
贵公司
{
办证
电话
}

>>> 广告事后
{
网站
代购
}
开发票

构思:

1.单个关键词,实际应用中一个单关键词最多属于一个分类,先做一下倒排

censor_single_keyword={
    "贵公司":1,
    "开发票":2,
}

配合多关键词查找(比如Aho-Corasick State Machine),很easy

a. 给每一个组分配id
    1:{
    办证
    电话
    },
    2:{
    网站
    代购
    }

b.  记录    类别id - 分组id
    用array

c.  记录    分组id - id的中内容长度
    用array
    'B' unsigned char int 1

d.    词做倒排
    词  <-> 分类id列表

e.    多关键词查找 统计出出现的词的set
   
f.    新建一个字典
    每一个出现词 分类id count+=1
    for xx in 这个字典
        if count == 分组id 中 id的中内容长度
    确认是这个分类(这里只考虑属于一个分类)

实现

~/algorithm/censor/lib/multi_pattern_search $ cat keyword2group.py

   1 #coding:utf-8
   2 from array import array
   3 from collections import defaultdict
   4 from multi_pattern_search import MultiPatternSearch
   5 
   6 
   7 class CensorKeyword(object):
   8     def __init__(self):
   9         self.cat_name = []
  10         self.single_keyword_to_cat_id = {}
  11         self.group_keyword = []
  12         self.group_keyword_to_group_id = {}
  13         self.group_id_to_cat_id = array('I')
  14         self.group_keyword_length = array('B')
  15         self.single_search = MultiPatternSearch()
  16         self.group_search = MultiPatternSearch()
  17 
  18     def _init_search(self):
  19         for i in self.single_keyword_to_cat_id.keys():
  20             self.single_search.add_keyword(i)
  21         group_keyword_set = set()
  22         for i in self.group_keyword:
  23             for k in i:
  24                 group_keyword_set.add(k)
  25         for i in group_keyword_set:
  26             self.group_search.add_keyword(i)
  27 
  28     def _keyword_to_group_id(self):
  29         #词做倒排
  30         group_keyword_to_group_id = self.group_keyword_to_group_id
  31         for pos, group_keyword_list in enumerate(self.group_keyword):
  32             for keyword in group_keyword_list:
  33                 if keyword not in group_keyword_to_group_id:
  34                     group_keyword_to_group_id[keyword] = array('I')
  35                 group_keyword_to_group_id[keyword].append(pos)
  36 
  37     def load_txt(self, txt):
  38         SINGE_KEYWORD = 0
  39         GROUP_KEYWORD = 1
  40         txt = txt.split("\n")
  41         state = SINGE_KEYWORD
  42         group_keyword = set()
  43         #处理文本
  44         for line in txt:
  45             line = line.strip()
  46             if not line:
  47                 continue
  48             if line.startswith(">>>"):
  49                 cat_id = len(self.cat_name)
  50                 cat_name = line[3:].strip()
  51                 self.cat_name.append(cat_name)
  52                 state = SINGE_KEYWORD
  53             elif len(line) == 1:
  54                 if line == '{':
  55                     state = GROUP_KEYWORD
  56                 elif line == "}":
  57                     state = SINGE_KEYWORD
  58                     if group_keyword:
  59                         group_keyword = tuple(group_keyword)
  60                         self.group_keyword.append(group_keyword)
  61                         self.group_id_to_cat_id.append(cat_id)
  62                         self.group_keyword_length.append(len(group_keyword))
  63                         group_keyword = set()
  64             elif state == SINGE_KEYWORD:
  65                 self.single_keyword_to_cat_id[line] = cat_id
  66             elif state == GROUP_KEYWORD:
  67                 group_keyword.add(line)
  68 
  69         self._keyword_to_group_id()
  70         self._init_search()
  71 
  72     def which_group_id(self, text):
  73         single_search = self.single_search.count(text)
  74         if single_search:
  75             key = single_search.keys()[0]
  76             cat_id = self.single_keyword_to_cat_id[key]
  77             return cat_id
  78         group_search = self.group_search.count(text)
  79         group_count = defaultdict(int)
  80         for k in group_search.keys():
  81             group_id_list = self.group_keyword_to_group_id[k]
  82             for i in group_id_list:
  83                 group_count[i]+=1
  84        
  85         group = None
  86         for k, v in group_count.iteritems():
  87             if self.group_keyword_length[k] <= v:
  88                 #返回最小的那个 事前比事后优先级高
  89                 cat_id = self.group_id_to_cat_id[k]
  90                 if group is None or (cat_id < group):
  91                     group = cat_id
  92         return group
  93    
  94     def which_group_name(self, text):
  95         pos = self.which_group_id(text)
  96         if pos is not None:
  97             return self.cat_name[pos]
  98 
  99 
 100 if __name__ == "__main__":
 101     test_input = """
 102 >>> 广告事前
 103     贵公司
 104 {
 105     办证
 106     电话
 107 }
 108 {
 109     网站
 110     促销
 111 }
 112 >>> 广告事后
 113 {
 114     网站
 115     代购
 116 }
 117     开发票
 118     """
 119     censor_keyword = CensorKeyword()
 120     censor_keyword.load_txt(test_input)
 121     print censor_keyword.which_group_name("贵公司")
 122     print censor_keyword.which_group_name("代购")
 123     print censor_keyword.which_group_name("代购网站")

索引法

shell909090 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 00:09

这个么——空间换时间，索引法。首先我做出如下假定：

len (textPair) = N
len (transPair) = M

sentence中平均有Q个单词。那么，我来标记一下你算法的时间消耗。

for term in textPair: ＃ N
        for sentence in transPair: ＃ M
               if term in sentence: ＃ N＊M＊len(term)＊len(sentence) ＝ N＊len(term)＊文章长度
                      do something

而后，我建议你构造一个开链hash表，来索引整个查找过程。令某个单词的索引key为这个单词的头5个字母(当然，具体你可以调整)，那么你连分割句子都不用做，只要持续的给我供给单词流，就可以鉴别这个单词流。

#索引建立过程
hash_table = {};
for word in tranPair:
    if word[:5] not in hash_table:
    hash_table[word[:5]] = [];
    hash_table[word[:5]].append (word);
#匹配过程
for word in f_get_word (trans_info): # 文章单词数
    if word[:5] in hash_table: # log2 (N)
    match = hash_table[word[:5]]; # log2 (N)
for term in match: # 碰撞率，通常是N/所有常见的开头。大约应该是2-3之间。
    if term match the rest of the sentence: # 较差估计为 平均碰撞率 * 文章单词数
        do something

具体的理论，计算过程，还有优化什么的我就不说了。上面的东西纯属于抛砖引玉，可能还有更好的办法。不过我用类似的方法存储分词辞典的(准确的说，当时用的是红黑树加tied树)。

逆向最大匹配

于海 <[email protected]>
reply-to        [email protected]
to      [email protected]
date    Tue, Mar 10, 2009 at 16:10

      (2)循环的读入每一个句子S，设句子中的字数为n；
      (3)设置一个最大词长度，就是我们要截取的词的最大长度max；
      (4)从句子中取n-max到n的字符串subword，去字典中查找是否有这个词。如果有就走(5),没有就走(6)；
      (5)记住subword，从n-max付值给n，继续执行(4)，直到n=0。
      (6)将max-1，再执行(4)。

非常经典的算法。

反馈

创建 by -- ZoomQuiet [DateTime(2009-03-10T03:02:14Z)]

PageComment2

[:/PageCommentData:PageCommentData]

MiscItems/2009-03-10

大数据集:频繁测试成员关系

题面儿

ZSP

多模式匹配

Trie树

关键词/词组->分类识别系统

设计

构思:

实现

索引法

逆向最大匹配