TPiP/Intro

FRONTMATTER -- PREFACE

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one--and preferably only one--obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea--let's do more of those!
      --Tim Peters, "The Zen of Python"

SECTION 1 -- What is Text Processing?

At the broadest level, text processing is simply taking textual information and -doing something- with it. This doing might be restructuring or reformatting it, extracting smaller bits of information from it, algorithmically modifying the content of the information, or performing calculations that depend on the textual information. The lines between "text" and the even more general term "data" are extremely fuzzy; at an approximation, "text" is just data that lives in forms that people can themselves read--at least in principle, and maybe with a bit of effort. Most typically, computer "text" is composed of sequences of bits that have a "natural" representation as letters, numerals, and symbols; most often such text is delimited (if delimited at all) by symbols and formatting that can be easily pronounced as "next datum."
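
The four kinds of "doing something" listed above can be sketched in a few lines of Python. This is only an illustration: the sample string and every variable name are invented here and do not come from the book.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "name: Ada Lovelace; born: 1815; died: 1852"

# Restructuring/reformatting: one "key: value" pair per line.
reformatted = "\n".join(part.strip() for part in text.split(";"))

# Extracting smaller bits of information: pull out the four-digit years.
years = [int(y) for y in re.findall(r"\b\d{4}\b", text)]

# Algorithmically modifying the content: upper-case every field name.
modified = re.sub(r"(\w+):", lambda m: m.group(1).upper() + ":", text)

# Performing a calculation that depends on the textual information.
lifespan = years[1] - years[0]

print(reformatted)
print(years, lifespan)
print(modified)
```

Each step treats the same string differently, which is part of why the line between "text" and "data" is so hard to draw.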

The lines are fuzzy, but the data that seems least like text--and that, therefore, this particular book is least concerned with--is the data that makes up "multimedia" (pictures, sounds, video, animation, etc.) and data that makes up UI "events" (draw a window, move the mouse, open an application, etc.). As I said, the lines are fuzzy, and some representations of the most nontextual data are themselves pretty textual. But in general, the subject of this book is all the stuff on the near side of that fuzzy line.

Text processing is arguably what most programmers spend most of their time doing. The information that lives in business software systems mostly comes down to collections of words about the application domain--maybe with a few special symbols mixed in. Internet communications protocols consist mostly of a few special words used as headers, a little bit of constrained formatting, and message bodies consisting of additional wordish texts. Configuration files, log files, CSV and fixed-length data files, error files, documentation, and source code itself are all just sequences of words with bits of constraint and formatting applied.
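
To make "sequences of words with bits of constraint and formatting applied" concrete, here is a small sketch that reads CSV data and a protocol-style header line using only the standard library. The data, the header, and the variable names are assumptions made up for the example.

```python
import csv
import io

# Hypothetical CSV data, invented for this illustration.
csv_text = "host,status,bytes\nexample.org,200,512\nexample.net,404,0\n"

# The "constraint and formatting" here is just commas and newlines.
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_bytes = sum(int(row["bytes"]) for row in rows)

# A protocol-style header: a special word, a colon, then free text.
header_line = "Content-Type: text/plain; charset=utf-8"
name, _, value = header_line.partition(":")
header = (name.strip(), value.strip())

print(rows[0]["host"], total_bytes)
print(header)
```

In both cases the content is ordinary words; only the delimiters give it structure.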

Programmers and developers spend so much time with text processing that it is easy to forget that that is what we are doing. The most common text processing application is probably your favorite text editor. Beyond simple entry of new characters, text editors perform such text processing tasks as search/replace and copy/paste, which--given guided interaction with the user--accomplish sophisticated manipulation of textual sources. Many text editors go farther than these simple capabilities and include their own complete programming systems (usually called "macro processing"); in those cases where editors include "Turing-complete" macro languages, text editors suffice, in principle, to accomplish anything that the examples in this book can.

After text editors, a variety of text processing tools are widely used by developers. Tools like "File Find" under Windows, or "grep" on Unix (and other platforms), perform the basic chore of -locating- text patterns. "Little languages" like sed and awk perform basic text manipulation (or even nonbasic). A large number of utilities--especially in Unix-like environments--perform small custom text processing tasks: 'wc', 'sort', 'tr', 'md5sum', 'uniq', 'split', 'strings', and many others.
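
For comparison, the core jobs of a few of these utilities can be approximated in Python. The sketch below is a rough analogy under simplifying assumptions, not a reimplementation of 'grep', 'wc', or 'sort | uniq -c'; the input text and names are invented for the example.

```python
import re
from collections import Counter

# Hypothetical input, invented for this illustration.
text = "apple pie\nbanana split\napple tart\ncherry pie\napple pie\n"
lines = text.splitlines()

# Roughly what grep does: keep the lines that match a pattern.
matches = [line for line in lines if re.search(r"pie$", line)]

# Roughly what wc reports: line, word, and character counts.
counts = (len(lines), len(text.split()), len(text))

# Roughly "sort | uniq -c": each distinct line with its count, sorted.
tally = sorted(Counter(lines).items())

print(matches)
print(counts)
print(tally)
```

For one-off jobs the command-line tools are usually quicker to reach for; the Python version starts to pay off once the task needs more logic around it.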

At the top of the text processing food chain are general-purpose programming languages, such as Python. I wrote this book on Python in large part because Python is such a clear, expressive, and general-purpose language. But for all Python's virtues, text editors and "little" utilities will always have an important place for developers "getting the job done." As simple as Python is, it is still more complicated than you need to achieve many basic tasks. But once you get past the very simple, Python is a perfect language for making the difficult things possible (and it is also good at making the easy things simple).

SECTION 2 -- The Philosophy of Text Processing


SECTION 3 -- What You'll Need to Use This Book


SECTION 4 -- Conventions Used in This Book




SECTION 5 -- A Word on Source Code Examples


SECTION 6 -- External Resources






TPiP/Intro (last edited 2009-12-25 07:11:03 by localhost)