This article is from the Python Cookbook.

This translation is for personal study only; it has no connection with any commercial copyright dispute.

-- 61.182.251.99 [2004-09-21T22:29:37Z]

Description

Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something to every word in a file, similar to the foreach function of csh.

Solution

This is best handled by two nested loops, one on lines and one on the words in each line:

# Method 1. Note: xreadlines() exists only in Python 2 (it was removed in Python 3).
for line in open(thefilepath).xreadlines():
    for word in line.split():
        dosomethingwith(word)

This implicitly defines words as sequences of nonspaces separated by sequences of spaces (just as the Unix program wc does).
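
As a small illustration of this definition (a sketch in present-day Python 3, not part of the original recipe; the helper name count_words and the sample text are made up for the example), split() with no arguments splits on any run of whitespace and discards empty strings, so repeated spaces, tabs, and blank lines do not produce extra words:

```python
def count_words(text):
    """Count whitespace-separated words, line by line, the way line.split() sees them."""
    total = 0
    for line in text.splitlines():
        # split() with no arguments treats any run of whitespace as one separator
        total += len(line.split())
    return total


# Hypothetical sample: leading spaces, doubled spaces, a blank line, and a tab
sample = "  two  words\n\nthree more\twords\n"
print(count_words(sample))
```

Punctuation stays attached to the neighbouring word under this definition, which is one reason to consider the regular-expression variant.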

For other definitions of words, you can use regular expressions. For example:

import re
re_word = re.compile(r'[\w-]+')

for line in open(thefilepath).xreadlines():
    for word in re_word.findall(line):
        dosomethingwith(word)

In this case, a word is defined as a maximal sequence of alphanumerics and hyphens.

(Translator's note: "maximal" here presumably reflects the greedy matching of the + quantifier.)
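
To make the difference between the two definitions concrete, the following sketch (Python 3; the sample sentence is invented for illustration) runs both tokenizers on the same line. The pattern is the one from the recipe:

```python
import re

re_word = re.compile(r'[\w-]+')

# Hypothetical sample line containing hyphens, a comma, an apostrophe, and a question mark
line = "state-of-the-art text, isn't it?"

by_whitespace = line.split()          # punctuation stays attached to words
by_regex = re_word.findall(line)      # only alphanumerics, underscores, and hyphens survive

print(by_whitespace)
print(by_regex)
```

Note that \w also matches the underscore, and that the apostrophe is not part of the character class, so "isn't" comes back as two separate matches under the regular-expression definition.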

Discussion

For other definitions of words you will obviously need different regular expressions. The outer loop, on all lines in the file, can of course be done in many ways. The xreadlines method is good, but you can also use the list obtained by the readlines method, the standard library module fileinput, or, in Python 2.2, even just:

for line in open(thefilepath):

which is simplest and fastest.
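
The alternatives mentioned for the outer loop can be sketched side by side as follows. This is an illustration in present-day Python 3 (where xreadlines no longer exists), not code from the original recipe; the three function names and the temporary demo file are assumptions made for the example:

```python
import fileinput
import os
import tempfile


def words_via_readlines(path):
    # readlines() builds the whole list of lines in memory first
    with open(path) as f:
        return [word for line in f.readlines() for word in line.split()]


def words_via_fileinput(path):
    # fileinput can chain several files; here it is given just one
    words = []
    with fileinput.input(files=(path,)) as lines:
        for line in lines:
            words.extend(line.split())
    return words


def words_via_file_iteration(path):
    # iterating over the file object itself reads one line at a time
    with open(path) as f:
        return [word for line in f for word in line.split()]


# Write a small throwaway file so the sketch can run on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one two\nthree\n")
    demo_path = tmp.name

try:
    results = [
        words_via_readlines(demo_path),
        words_via_fileinput(demo_path),
        words_via_file_iteration(demo_path),
    ]
finally:
    os.remove(demo_path)

print(results)
```

All three produce the same words; they differ mainly in how much of the file is held in memory at once and in whether several input files can be chained.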

In Python 2.2, it's often a good idea to wrap iterations as iterator objects, most commonly by simple generators:

from __future__ import generators   # required only in Python 2.2

def words_of_file(thefilepath):
    for line in open(thefilepath):
        for word in line.split():
            yield word

for word in words_of_file(thefilepath):
    dosomethingwith(word)

This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration.

(Translator's note: compare the "separation of concerns" of aspect-oriented programming, AOP.)

Once you have cleanly encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements.

You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can then perform it in just one place (the definition of the iterator) rather than having to hunt for all uses. The advantages are thus very similar to those you obtain, in any programming language, by appropriately defining and using functions rather than copying and pasting pieces of code all over the place. With Python 2.2's iterators, you can get these advantages for looping control structures, too.
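
A sketch of that kind of reuse, in present-day Python 3: the generator is the recipe's words_of_file (with a with-statement added so the file is closed), while the two consumers, count_words and longest_word, and the demo file are hypothetical additions for illustration. Each consumer is a plain loop or built-in call over the same iterator, so a change to how words are defined would touch only words_of_file:

```python
import os
import tempfile


def words_of_file(thefilepath):
    """Yield every whitespace-separated word in the file, one at a time."""
    with open(thefilepath) as f:
        for line in f:
            for word in line.split():
                yield word


def count_words(path):
    # First use of the iterator: just count the items
    return sum(1 for _ in words_of_file(path))


def longest_word(path):
    # Second use of the same iterator: pick the longest item
    return max(words_of_file(path), key=len)


# Write a small throwaway file so the sketch can run on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a quick elephant\njumps over\n")
    demo_path = tmp.name

try:
    total = count_words(demo_path)
    longest = longest_word(demo_path)
finally:
    os.remove(demo_path)

print(total, longest)
```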

See Also

Documentation for the fileinput module in the Library Reference; PEP 255 on simple generators (http://www.python.org/peps/pep-0255.html); Perl Cookbook Recipe 8.3.