Diff for "TPiP/Chap3"

Differences between revisions 2 and 3

CHAPTER III -- REGULAR EXPRESSIONS

Regular expressions enable extremely valuable text processing techniques, but ones that warrant careful explanation. Python's [re] module, in particular, offers numerous enhancements to basic regular expressions (such as named backreferences, lookahead assertions, backreference skipping, non-greedy quantifiers, and others). A solid introduction to the subtleties of regular expressions is valuable to programmers engaged in text processing tasks.

The opening section of this chapter contains a tutorial on regular expressions that lets a reader unfamiliar with them move quickly from simple to complex elements of regular expression syntax. The tutorial is aimed primarily at beginners, but programmers familiar with regular expressions in other programming tools can benefit from a quick read, since it explains the particular regular expression dialect used in Python.

It is important to note up front that regular expressions, while very powerful, also have limitations. In brief, regular expressions cannot match patterns that nest to arbitrary depths. If that statement does not make sense, read Chapter 4, which discusses parsers; to a large extent, parsing exists to address the limitations of regular expressions. In general, if you have doubts about whether a regular expression is sufficient for your task, try to understand the examples in Chapter 4, particularly the discussion of how you might spell a floating point number.
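To make the nesting limitation concrete, here is a minimal sketch. The pattern `ONE_LEVEL` and the helper `balanced()` are illustrative names assumed for this example, not part of the original text. A pattern written to allow one level of nested parentheses fails on deeper nesting, while a simple counter (the kind of state a parser keeps) handles any depth.

```python
import re

# Matches a parenthesized group that contains at most one nested level.
ONE_LEVEL = re.compile(r'^\((?:[^()]|\([^()]*\))*\)$')

def balanced(text):
    """Return True if the parentheses in text balance, at any depth."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a close with no matching open
                return False
    return depth == 0

shallow = '(a (b) c)'
deep = '(a (b (c)) d)'
print(bool(ONE_LEVEL.match(shallow)), bool(ONE_LEVEL.match(deep)))
print(balanced(shallow), balanced(deep))
```

The pattern could be extended by hand to two or three levels, but each extra level needs another copy of the nested part, so no single fixed pattern of this kind covers every depth.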

Section 3.1 examines a number of text processing problems that are solved most naturally using regular expressions. As in other chapters, the solutions presented can generally be adopted directly as little utilities for performing tasks. However, as elsewhere, the larger goal in presenting problems and solutions is to address a style of thinking about a wider class of problems than those whose solutions are presented directly in this book. Readers who are interested in a range of ready utilities and modules will probably want to check additional resources on the Web, such as the Vaults of Parnassus <http://www.vex.net/parnassus/> and the Python Cookbook <http://aspn.activestate.com/ASPN/Python/Cookbook/>.

Section 3.2 is a "reference with commentary" on the Python standard library modules for doing regular expression tasks. Several utility modules and backward-compatibility regular expression engines are available, but for most readers, the only important module will be [re] itself. The discussions interspersed with each module try to give some guidance on why you would want to use a given module or function, and the reference documentation tries to contain more examples of actual typical usage than a plain reference does. In many cases, the examples and discussion of individual functions address common and productive design patterns in Python. The cross-references are intended to contextualize a given function (or other thing) in terms of related ones (and to help a reader decide which is right for her). The actual listings of functions, constants, classes, and the like are in alphabetical order within each category.

SECTION 0 -- A Regular Expression Tutorial

      Some people, when confronted with a problem, think "I know,
      I'll use regular expressions." Now they have two problems.
       -- Jamie Zawinski, '<alt.religion.emacs>' (08/12/1997)

TOPIC -- Just What is a Regular Expression, Anyway?

Many readers will have some background with regular expressions, but some will not have any. Those with experience using regular expressions in other languages (or in Python) can probably skip this tutorial section. But readers new to regular expressions (affectionately called 'regexes' by users) should read this section; even some with experience can benefit from a refresher.

A regular expression is a compact way of describing complex patterns in texts. You can use regular expressions to search for patterns and, once a pattern is found, to modify the matched text in complex ways. They can also be used to launch programmatic actions that depend on patterns.
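The three uses just described (search, modify, act) can be sketched with the standard [re] module as follows. This is only an illustration with a literal pattern; the sample text and variable names are assumptions made for the example.

```python
import re

text = "Mary had a little lamb"

# Search: locate the first occurrence of a pattern.
found = re.search('lamb', text)
print(found.start(), found.group())

# Modify: replace every occurrence of the pattern.
changed = re.sub('lamb', 'goat', text)
print(changed)

# Act: run code only when the pattern occurs.
if re.search('Mary', text):
    print("the text mentions Mary")
```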

Jamie Zawinski's tongue-in-cheek comment in the epigraph is worth thinking about. Regular expressions are amazingly powerful and deeply expressive. That is the very reason that writing them is just as error-prone as writing any other complex programming code. It is always better to solve a genuinely simple problem in a simple way; when you go beyond simple, think about regular expressions.

A large number of tools other than Python incorporate regular expressions as part of their functionality. Unix-oriented command-line tools like 'grep', 'sed', and 'awk' are mostly wrappers for regular expression processing. Many text editors allow search and/or replacement based on regular expressions. Many programming languages, especially other scripting languages such as Perl and Tcl, build regular expressions into the heart of the language. Even most command-line shells, such as Bash or the Windows console, allow restricted regular expressions as part of their command syntax.

Regular expression syntax varies somewhat between the tools that use it, but for the most part regular expressions are a "little language" that gets embedded inside bigger languages like Python. The examples in this tutorial section (and the documentation in the rest of the chapter) will focus on Python syntax, but most of this chapter transfers easily to working with other programming languages and tools.

As with most of this book, examples will be illustrated by use of Python interactive shell sessions that readers can type themselves, so that they can play with variations on the examples. However, the [re] module has little reason to include a function that simply illustrates matches in the shell. Therefore, the examples assume that the small wrapper program below is available:

    #---------- re_show.py ----------#
    import re

    def re_show(pat, s):
        """Print s with every match of pat wrapped in curly braces."""
        # re.M lets ^ and $ match at the start and end of each line
        print(re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()), '\n')

    s = '''Mary had a little lamb
    And everywhere that Mary
    went, the lamb was sure
    to go'''

Place the code in an external module and 'import' it. Those new to regular expressions need not worry about what the above function does for now. It is enough to know that the first argument to 're_show()' will be a regular expression pattern, and the second argument will be a string to be matched against. For purposes of matching beginnings and ends of lines, each line of the string is treated separately. The illustrated matches will be whatever is contained between curly braces (and is typographically marked for emphasis).

TOPIC -- Matching Patterns in Text: The Basics

The very simplest pattern matched by a regular expression is a literal character or a sequence of literal characters. Anything in the target text that consists of exactly those characters in exactly the order listed will match. A lowercase character is not identical with its uppercase version, and vice versa. A space in a regular expression, by the way, matches a literal space in the target (this is unlike most programming languages or command-line tools, where a variable number of spaces separate keywords).

    >>> from re_show import re_show, s
    >>> re_show('a', s)
    M{a}ry h{a}d {a} little l{a}mb
    And everywhere th{a}t M{a}ry
    went, the l{a}mb w{a}s sure
    to go

    >>> re_show('Mary', s)
    {Mary} had a little lamb
    And everywhere that {Mary}
    went, the lamb was sure
    to go
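The points about case and literal spaces can also be checked without the wrapper. The following is a small sketch using `re.findall()` from the standard [re] module; the sample string is an assumption for the example.

```python
import re

s = "Mary had a little lamb"

lower = re.findall('mary', s)         # lowercase 'm' does not match 'M'
exact = re.findall('Mary', s)         # same characters, same case
one_space = re.findall('a little', s)   # the space matches a literal space
two_spaces = re.findall('a  little', s) # two spaces do not match one

print(lower, exact, one_space, two_spaces)
```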

A number of characters have special meanings to regular expressions. A symbol with a special meaning can be matched literally, but to do so it must be prefixed with the backslash character (this includes the backslash character itself: to match one backslash in the target, the regular expression should include '\\'). In Python, a special way of quoting a string is available that does not process backslash escape sequences. Since regular expressions use many of the same backslash-prefixed codes as do Python strings, it is usually easier to compose regular expression strings by quoting them as "raw strings" with an initial "r".

    >>> from re_show import re_show
    >>> s = '''Special characters must be escaped.*'''
    >>> re_show(r'.*', s)   # Python 3.7+ prints an extra "{}" for the empty match at the end
    {Special characters must be escaped.*}

    >>> re_show(r'\.\*', s)
    Special characters must be escaped{.*}

    >>> re_show('\\\\', r'Python \ escaped \ pattern')
    Python {\} escaped {\} pattern

    >>> re_show(r'\\', r'Regex \ escaped \ pattern')
    Regex {\} escaped {\} pattern
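As a hedged illustration of why raw strings help, the sketch below compares the two spellings of the same pattern used in the session above. The variable names and the sample path are assumptions for the example. It also shows `re.escape()`, a standard [re] function that adds the backslashes for you.

```python
import re

plain = '\\\\'   # ordinary string: each typed pair becomes one backslash
raw = r'\\'      # raw string: both backslashes are kept as typed
print(plain == raw, len(raw))

# Either spelling is a regular expression for one literal backslash.
target = r'C:\temp'
print(re.findall(raw, target))

# re.escape() backslash-prefixes regex special characters.
escaped = re.escape('.*')
print(escaped)
```

Both strings hold the same two characters, so the choice between them is purely about how readable the source code is.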

Two special characters are used to mark the beginning and end of a line: caret ("^") and dollar sign ("$"). To match a caret or dollar sign as a literal character, it must be escaped (i.e., preceded by a backslash "\").

An interesting thing about the caret and dollar sign is that they match zero-width patterns. That is, the length of the string matched by a caret or dollar sign by itself is zero (but the rest of the regular expression can still depend on the zero-width match). Many regular expression tools provide another zero-width pattern for word-boundary ("\b"). Words might be divided by whitespace like spaces, tabs, newlines, or other characters like nulls; the word-boundary pattern matches the actual point where a word starts or ends, not the particular whitespace characters.

    >>> from re_show import re_show, s
    >>> re_show(r'^Mary', s)
    {Mary} had a little lamb
    And everywhere that Mary
    went, the lamb was sure
    to go

    >>> re_show(r'Mary$', s)
    Mary had a little lamb
    And everywhere that {Mary}
    went, the lamb was sure
    to go

    >>> re_show(r'$','Mary had a little lamb')
    Mary had a little lamb{}
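The zero-width behavior can be inspected directly. In this sketch (the sample string and variable names are assumptions), the `span()` of a match object reports equal start and end positions for "^" and "$", and "\b" matches at word edges without consuming a character.

```python
import re

s = "Mary had a little lamb"

# span() gives (start, end); equal values mean a zero-width match.
start_span = re.search(r'^', s).span()
end_span = re.search(r'$', s).span()
print(start_span, end_span)

# \b matches at each edge of a word, between characters.
edges = [m.start() for m in re.finditer(r'\b', s)]
print(edges)

# With \b on both sides, 'a' matches only as a standalone word.
words = re.findall(r'\ba\b', s)
print(words)
```

Note that the standalone 'a' is found once, even though the letter occurs four times in the string.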

In regular expressions, a period can stand for any character. Normally, the newline character is not included, but optional switches can force inclusion of the newline character also (see later documentation of [re] module functions). Using a period in a pattern is a way of requiring that "something" occurs here, without having to decide what.

Readers who are familiar with DOS command-line wildcards will know the question mark as filling the role of "some character" in command masks. But in regular expressions, the question mark has a different meaning, and the period is used as a wildcard.

    >>> from re_show import re_show, s
    >>> re_show(r'.a', s)
    {Ma}ry {ha}d{ a} little {la}mb
    And everywhere t{ha}t {Ma}ry
    went, the {la}mb {wa}s sure
    to go
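The optional switch mentioned above is, in Python's [re] module, the `re.DOTALL` flag (also spelled `re.S`). A minimal sketch, with a sample string assumed for the example:

```python
import re

two_lines = "lamb\ngo"

# By default the period matches any character except a newline.
default = re.findall(r'b.g', two_lines)

# With re.DOTALL the period matches a newline as well.
dotall = re.findall(r'b.g', two_lines, re.DOTALL)

print(default, dotall)
```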

Revision 2 as of 2007-10-15 12:55:43 (Size: 18674, Editor: lwl, Comment: none)
Revision 3 as of 2007-10-15 12:56:36 (Size: 18676, Editor: lwl, Comment: none)
 Line 220:
-=== 主题 -- 文本匹配模式： 基础 ===
+==== 主题 -- 文本匹配模式： 基础 ====
