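This article is from the Python Cookbook. The translation was made purely for personal study and has nothing to do with any commercial copyright dispute.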
-- 61.182.251.99 [DateTime(2004-09-22T19:44:16Z)]
Description
Reading a Text File by Paragraphs
Credit: Alex Martelli, Magnus Lie Hetland
Problem
You need to read a file paragraph by paragraph, in which a paragraph is defined as a sequence of nonempty lines (in other words, paragraphs are separated by empty lines).
Solution
A wrapper class is, as usual, the right Pythonic architecture for this (in Python 2.1 and earlier):
class Paragraphs:
    def __init__(self, fileobj, separator='\n'):
        # Ensure that we get a line-reading sequence in the best way possible.
        # (Translator's note: puzzling, since xreadlines was already
        # deprecated by Python 2.3?)
        import xreadlines
        try:
            # Check if the file-like object has an xreadlines method
            self.seq = fileobj.xreadlines()
        except AttributeError:
            # No, so fall back to the xreadlines module's implementation
            self.seq = xreadlines.xreadlines(fileobj)
        self.line_num = 0    # current index into self.seq (line number)
        self.para_num = 0    # current index into self (paragraph number)
        # Ensure that separator string includes a line-end character at the end
        if separator[-1:] != '\n': separator += '\n'
        self.separator = separator
    def __getitem__(self, index):
        if index != self.para_num:
            # Only sequential access is supported; reject any other index
            raise TypeError, "Only sequential access supported"
        self.para_num += 1   # translator's note: para_num is 1, not 0, for the first paragraph
        # Start where we left off and skip 0+ separator lines
        while 1:
            # Propagate IndexError, if any, since we're finished if it occurs
            line = self.seq[self.line_num]   # translator's note: separator lines are counted but otherwise ignored
            self.line_num += 1
            if line != self.separator: break # translator's note: first nonseparator line found
        # Accumulate 1+ nonempty lines into result
        result = [line]
        while 1:
            # Intercept IndexError, since we have one last paragraph to return
            try:
                # Let's check if there's at least one more line in self.seq
                line = self.seq[self.line_num]
            except IndexError:
                # self.seq is finished, so we exit the loop
                break
            # Increment index into self.seq for next time
            self.line_num += 1
            if line == self.separator: break
            result.append(line)
        # Join the accumulated lines into a single paragraph string
        return ''.join(result)
# Here's an example function, showing how to use class Paragraphs:
def show_paragraphs(filename, numpars=5):
    pp = Paragraphs(open(filename))
    for p in pp:
        print "Par#%d, line# %d: %s" % (
            pp.para_num, pp.line_num, repr(p))
        # Translator's note: the original test was >, which printed
        # numpars+1 paragraphs (6 for the default of 5); >= prints numpars.
        if pp.para_num >= numpars: break
 
Discussion
Python doesn't directly support paragraph-oriented file reading, but, as usual, it's not hard to add such functionality. We define a paragraph as a string formed by joining a nonempty sequence of nonseparator lines, separated from any adjoining paragraphs by nonempty sequences of separator lines.
By default, a separator line is one that equals '\n' (empty line), although this concept is easy to generalize. We let the client code determine what a separator is when instantiating this class. Any string is acceptable, but we append a '\n' to it, if it doesn't already end with '\n' (since we read the underlying file line by line, a separator not ending with '\n' would never match).
We can get even more generality by having the client code pass us a callable that looks at any line and tells us whether that line is a separator or not.
In fact, this is how I originally architected this recipe, but then I decided that such an architecture represented a typical, avoidable case of overgeneralization (also known as overengineering and "Big Design Up Front"; see http://xp.c2.com/BigDesignUpFront.html), so I backtracked to the current, more reasonable amount of generality.
Indeed, another reasonable design choice for this recipe's class would be to completely forego the customizability of what lines are to be considered separators and just test for separator lines with line.isspace( ), so that stray blanks on an empty-looking line wouldn't misleadingly transform it into a nonseparator line.
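The difference between the exact-match test and the line.isspace() alternative can be illustrated with a small sketch in current Python. The helper name separator_test is hypothetical and not part of the recipe; it only packages the two choices discussed above (exact separator string, normalized to end with '\n', versus any whitespace-only line) as a callable.

```python
def separator_test(separator=None):
    """Return a callable that decides whether a line is a separator.

    Hypothetical helper: with no argument, any whitespace-only line counts
    (the line.isspace() design); otherwise the line must equal the
    separator string exactly, after a trailing '\n' is ensured.
    """
    if separator is None:
        return str.isspace
    if not separator.endswith('\n'):
        separator += '\n'
    return lambda line: line == separator

lines = ["alpha\n", "  \t\n", "\n", "beta\n"]
# Exact match: the line holding stray blanks is NOT seen as a separator.
exact = [separator_test('\n')(line) for line in lines]
# isspace: both empty-looking lines are treated as separators.
blank = [separator_test()(line) for line in lines]
```

With the exact test, the second line (blanks and a tab) would be glued into a paragraph; with isspace it separates paragraphs as a reader would expect.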
This recipe's adapter class is a special case of sequence adaptation by bunching. An underlying sequence (here, a sequence of lines, provided by xreadlines on a file or file-like object) is bunched up into another sequence of larger units (here, a sequence of paragraph strings). The pattern is easy to generalize to other sequence-bunching needs. Of course, it's even easier with iterators and generators in Python 2.2, but even Python 2.1 is pretty good at this already. Sequence adaptation is an important general issue that arises particularly often when you are sequentially reading and/or writing files; see Recipe 4.10 for another example.
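As a rough illustration of how the bunching pattern generalizes beyond lines and paragraphs, the following sketch uses itertools.groupby from current Python rather than the recipe's class. The function name bunch and its is_separator parameter are assumptions made for this example only.

```python
from itertools import groupby

def bunch(items, is_separator):
    """Group consecutive nonseparator items into lists, dropping separators.

    Hypothetical generalization of the recipe: items may be any iterable,
    and is_separator is any callable returning a true value for separators.
    """
    # bool() keeps differing truthy results from splitting one run in two
    for sep, group in groupby(items, key=lambda item: bool(is_separator(item))):
        if not sep:
            yield list(group)

records = ['a', 'b', '', 'c', '', '', 'd', 'e']
bunches = list(bunch(records, lambda item: item == ''))
```

Runs of separators of any length collapse, just as the recipe's first loop skips zero or more separator lines between paragraphs.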
For Python 2.1, we need an index of the underlying sequence of lines and a way to check that our __getitem__ method is being called with properly sequential indexes (as the for statement does), so we expose the line_num and para_num indexes as useful attributes of our object. Thus, client code can determine our position during a sequential scan, in regard to the indexing on the underlying line sequence, the paragraph sequence, or both, without needing to track it itself.
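The sequential __getitem__ protocol described here still works in current Python: when an object has no __iter__ method, a for loop calls __getitem__ with 0, 1, 2, ... until IndexError is raised. The toy class below is a hypothetical stand-in for Paragraphs, written only to show the index check and the exposed position attribute; it is not the recipe's code.

```python
class Countdown:
    """Sequential-only sequence in the style of the recipe's Paragraphs class."""

    def __init__(self, start):
        self.start = start
        self.pos = 0              # next index we are willing to serve

    def __getitem__(self, index):
        if index != self.pos:
            # Same policy as the recipe: no random access
            raise TypeError("Only sequential access supported")
        if index >= self.start:
            raise IndexError(index)   # signals the for loop to stop
        self.pos += 1
        return self.start - index

cd = Countdown(3)
# Client code can read cd.pos during the scan, like para_num in the recipe.
seen = [(cd.pos, value) for value in cd]
```

After the loop, asking for cd[0] again raises TypeError, because the object only remembers where the sequential scan left off.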
The code uses two separate loops, each in a typical pattern:
while 1:
    ...
    if xxx: break
The first loop skips over zero or more separators that may occur between arbitrary paragraphs. Then, a separate loop accumulates nonseparators into a result list, until the underlying file finishes or a separator is encountered.
It's an elementary issue, but quite important to performance, to build up the result as a list of strings and combine them with ''.join at the end. Building up a large string as a string, by repeated application of += in a loop, is never the right approach: it's slow and clumsy. Good Pythonic style demands using a list as the intermediate accumulator when building up a string.
The show_paragraphs function demonstrates all the simple features of the Paragraphs class and can be used to unit-test the latter by feeding it a known text file.
Python 2.2 makes it very easy to build iterators and generators. This, in turn, makes it very tempting to build a more lightweight version of the by-paragraph buncher as a generator function, with no classes involved:
from __future__ import generators
def paragraphs(fileobj, separator='\n'):
    if separator[-1:] != '\n': separator += '\n'
    paragraph = []
    for line in fileobj:
        if line == separator:
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph: yield ''.join(paragraph)
We don't get the line and paragraph numbers, but the approach is much more lightweight, and it works polymorphically on any fileobj that can be iterated on to yield a sequence of lines, not just a file or file-like object. Such useful polymorphism is always a nice plus, particularly considering that it's basically free. Here, we have merged the loops into one, and we use the intermediate list paragraph itself as the state indicator. If the list is empty, we're skipping separators; otherwise, we're accumulating nonseparators.
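To make the polymorphism concrete, here is a sketch of the generator approach adapted to current Python (no __future__ import is needed there), fed first with a plain list of lines and then with an in-memory file. The sample data and the use of enumerate to recover paragraph numbers are assumptions added for illustration, not part of the recipe.

```python
import io

def paragraphs(fileobj, separator='\n'):
    """Yield paragraphs from any iterable of lines (Python 3 adaptation)."""
    if separator[-1:] != '\n':
        separator += '\n'
    paragraph = []                    # empty list means "skipping separators"
    for line in fileobj:
        if line == separator:
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:                     # last paragraph, if the input did not end on a separator
        yield ''.join(paragraph)

source_lines = ["one\n", "two\n", "\n", "\n", "three\n"]
from_list = list(paragraphs(source_lines))
from_file = list(paragraphs(io.StringIO(''.join(source_lines))))
# Paragraph numbers can be recovered by the caller when they are needed.
numbered = list(enumerate(paragraphs(source_lines), 1))
```

The same function handles both inputs because it only requires that iterating over its argument yields lines.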
See Also
