Description

Reading a Text File by Paragraphs

Credit: Alex Martelli, Magnus Lie Hetland

Problem

You need to read a file paragraph by paragraph, in which a paragraph is defined as a sequence of nonempty lines (in other words, paragraphs are separated by empty lines).
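
To make the definition concrete, here is a minimal Python 3 sketch (not part of the original recipe) that groups a list of lines into paragraphs with `itertools.groupby`. The function name `split_paragraphs` and the sample data are hypothetical, chosen only for illustration; an empty line is assumed to be exactly `'\n'`.

```python
from itertools import groupby

def split_paragraphs(lines):
    """Group consecutive nonempty lines into paragraph strings."""
    paragraphs = []
    # groupby yields runs of lines that share the same key: True for
    # empty (separator) lines, False for nonempty lines.
    for is_empty, run in groupby(lines, key=lambda line: line == "\n"):
        if not is_empty:
            paragraphs.append("".join(run))
    return paragraphs

sample = ["first line\n", "second line\n", "\n", "\n", "third line\n"]
print(split_paragraphs(sample))
```

Two consecutive empty lines count as a single paragraph break, so the sample yields two paragraphs rather than three.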

Solution

A wrapper class is, as usual, the right Pythonic architecture for this (in Python 2.1 and earlier):

class Paragraphs:

    def __init__(self, fileobj, separator='\n'):

        # Ensure that we get a line-reading sequence in the best way possible.
        # (Translator's note: puzzling, since xreadlines was already
        # deprecated in Python 2.3(?))
        import xreadlines
        try:
            # Check if the file-like object has an xreadlines method, which
            # yields the object's lines as a sequence
            self.seq = fileobj.xreadlines()
        except AttributeError:
            # No, so fall back to the xreadlines module's implementation
            self.seq = xreadlines.xreadlines(fileobj)

        self.line_num = 0    # instance variable: current index into self.seq (line number)
        self.para_num = 0    # instance variable: current index into self (paragraph number)
        # Ensure that separator string includes a line-end character at the end
        if separator[-1:] != '\n': separator += '\n'
        self.separator = separator    # instance variable: the line that separates paragraphs


    def __getitem__(self, index):
        if index != self.para_num:
            # Only sequential access is supported, so reject any other index
            raise TypeError, "Only sequential access supported"

        self.para_num += 1
        # Start where we left off and skip 0+ separator lines
        while 1:
            # Propagate IndexError, if any, since we're finished if it occurs
            # (the bad-index case was already rejected above)
            line = self.seq[self.line_num]    # separator lines are counted but otherwise skipped
            self.line_num += 1
            if line != self.separator: break    # first nonempty line: the paragraph starts here
        # Accumulate 1+ nonempty lines into result
        result = [line]
 
        while 1:
            # Intercept IndexError, since we have one last paragraph to return
            try:
                # Let's check if there's at least one more line in self.seq
                line = self.seq[self.line_num]
            except IndexError:
                # self.seq is finished, so we exit the loop
                break
            # Increment index into self.seq for next time
            self.line_num += 1
            if line == self.separator: break
            result.append(line)    # accumulate the nonempty line

        return ''.join(result)    # join the accumulated lines into one paragraph string
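
The class works in a `for` loop because Python falls back to calling `__getitem__` with 0, 1, 2, ... until `IndexError` is raised when an object defines no `__iter__`. The sketch below isolates that protocol in Python 3 syntax; the `Countdown` class is hypothetical and exists only to illustrate the mechanism, not the recipe itself.

```python
class Countdown:
    """Hypothetical sequential-only sequence, for illustration."""

    def __init__(self, start):
        self.start = start
        self.next_index = 0    # the only index we are willing to serve next

    def __getitem__(self, index):
        if index != self.next_index:
            raise TypeError("Only sequential access supported")
        if index >= self.start:
            # IndexError tells the for loop that the sequence is exhausted
            raise IndexError(index)
        self.next_index += 1
        return self.start - index

# The for loop asks for items 0, 1, 2 and stops at the IndexError for 3.
for value in Countdown(3):
    print(value)
```

Asking for any index out of order, such as `Countdown(3)[1]`, raises `TypeError`, mirroring the check at the top of `Paragraphs.__getitem__`.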

# Here's an example function, showing how to use class Paragraphs:

def show_paragraphs(filename, numpars=5):
    pp = Paragraphs(open(filename))
    for p in pp:
        print "Par#%d, line# %d: %s" % (
            pp.para_num, pp.line_num, repr(p))
        if pp.para_num>numpars: break
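
The code above targets Python 2.1 and earlier, and it will not run on Python 3 (no `xreadlines`, and `print` and `raise` use the old statement syntax). As a rough modern counterpart, and not the authors' own solution, the following Python 3 sketch expresses the same skip-then-accumulate logic as a generator. The function name `paragraphs` and the `io.StringIO` sample are assumptions made for illustration.

```python
import io

def paragraphs(fileobj, separator="\n"):
    """Yield each paragraph of fileobj as a single string."""
    # As in the class above, make sure the separator ends with a newline.
    if not separator.endswith("\n"):
        separator += "\n"
    current = []
    for line in fileobj:
        if line == separator:
            # A separator line closes the paragraph in progress, if any;
            # runs of separator lines are skipped.
            if current:
                yield "".join(current)
                current = []
        else:
            current.append(line)
    # The last paragraph may not be followed by a separator line.
    if current:
        yield "".join(current)

text = io.StringIO("one\ntwo\n\n\nthree\n")
for number, paragraph in enumerate(paragraphs(text), 1):
    print("Par#%d: %r" % (number, paragraph))
```

A generator keeps the sequential-only behaviour of the wrapper class without needing the index check, since a generator can only be consumed in order.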

Discussion

...

Revision 2 as of 2004-09-22 19:54:34
  Size: 3699
  Editor: 61
Revision 3 as of 2004-09-22 20:30:37
  Size: 4640
  Editor: 61

Diff for "PyCkBk-4-9"

Description

Problem

Solution

Discussion

See Also