-------------------------------------------------------------------

    The cheapest, fastest and most reliable components of a
    computer system are those that aren't there.
      --Gordon Bell, Encore Computer Corporation

  If you are writing programs in Python to accomplish text
  processing tasks, most of what you need to know is in this
  chapter. Sure, you will probably need to know how to do some
  basic things with pipes, files, and arguments to get your text
  to process (covered in Chapter 1); but for actually
  -processing- the text you have gotten, the [string] module and
  string methods--and Python's basic data structures--do almost
  all of what you need, almost all the time. To a lesser
  extent, the various custom modules to perform encodings,
  encryptions, and compressions are handy to have around (and you
  certainly do not want the work of implementing them yourself).
  But at the heart of text processing are basic transformations of
  bits of text. That's what [string] functions and string
  methods do.

  There are a lot of interesting techniques elsewhere in this
  book. I wouldn't have written about them if I did not find
  them important. But be cautious before doing interesting
  things. Specifically, given a fixed task in mind, before
  cracking this book open to any of the other chapters, consider
  very carefully whether your problem can be solved using the
  techniques in this chapter. If you can answer this question
  affirmatively, you should usually eschew the complications of
  using the higher-level modules and techniques that other
  chapters discuss. By all means read all of this book for
  the insight and edification that I hope it provides; but still
  focus on the "Zen of Python," and prefer simple to complex when
  simple is enough.

  This chapter does several things. Section 2.1 looks at a number
  of common problems in text processing that can (and should) be
  solved using (predominantly) the techniques documented in this
  chapter. Each of these "Problems" presents working solutions that
  can often be adopted with little change to real-life jobs. But a
  larger goal is to provide readers with a starting point for
  adaptation of the examples. It is not my goal to provide mere
  collections of packaged utilities and modules--plenty of those
  exist on the Web, and resources like the Vaults of Parnassus
  <http://www.vex.net/parnassus/> and the Python Cookbook
  <http://aspn.activestate.com/ASPN/Python/Cookbook/> are worth
  investigating as part of any project/task (and new and better
  utilities will be written between the time I write this and when
  you read it). It is better for readers to receive a solid
  foundation and starting point from which to develop the
  functionality they need for their own projects and tasks. And
  even better than spurring adaptation, these examples aim to
  encourage contemplation. In presenting examples, this book tries
  to embody a way of thinking about problems and an attitude
  towards solving them. More than any individual technique, such
  ideas are what I would most like to share with readers.

  Section 2.2 is a "reference with commentary" on the Python
  standard library modules for doing basic text manipulations. The
  discussions interspersed with each module try to give some
  guidance on why you would want to use a given module or function,
  and the reference documentation tries to contain more examples of
  actual typical usage than does a plain reference. In many cases,
  the examples and discussion of individual functions address
  common and productive design patterns in Python. The
  cross-references are intended to contextualize a given function
  (or other thing) in terms of related ones (and to help you decide
  which is right for you). The actual listing of functions,
  constants, classes, and the like is in alphabetical order within
  each type of thing.

  Section 2.3 in many ways continues Section 2.1, but also provides
  some aids for using this book in a learning context. The
  problems and solutions presented in Section 2.3 are somewhat more
  open-ended than those in Section 2.1. As well, each section
  labeled as "Discussion" is followed by one labeled
  "Questions." These questions are ones that could be assigned
  by a teacher to students; but they are also intended to be
  issues that general readers will enjoy and benefit from
  contemplating. In many cases, the questions point to
  limitations of the approaches initially presented, and ask
  readers to think about ways to address or move beyond these
  limitations--exactly what readers need to do when writing their
  own custom code to accomplish outside tasks. However, each
  Discussion in Section 2.3 should stand on its own, even if the
  Questions are skipped over by the reader.


SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------

  PROBLEM: Quickly sorting lines on custom criteria
  --------------------------------------------------------------------

  Sorting is one of the real meat-and-potatoes algorithms of text
  processing and, in fact, of most programming. Fortunately for
  Python developers, the native `[].sort` method is extraordinarily
  fast. Moreover, Python lists with almost any heterogeneous
  objects as elements can be sorted--Python cannot rely on the
  uniform arrays of a language like C (an unfortunate exception to
  this general power was introduced in recent Python versions where
  comparisons of complex numbers raise a 'TypeError'; and
  '[1+1j,2+2j].sort()' dies for the same reason; Unicode strings in
  lists can cause similar problems).

  SEE ALSO, [complex]

  +++

  The list sort method is wonderful when you want to sort items in
  their "natural" order--or in the order that Python considers
  natural, in the case of items of varying types. Unfortunately, a
  lot of times, you want to sort things in "unnatural" orders. For
  lines of text, in particular, any order that is not simple
  alphabetization of the lines is "unnatural." But often text lines
  contain meaningful bits of information in positions other than
  the first character position: A last name may occur as the second
  word of a list of people (for example, with first name as the
  first word); an IP address may occur several fields into a server
  log file; a money total may occur at position 70 of each line;
  and so on. What if you want to sort lines based on this style of
  meaningful order that Python doesn't quite understand?

  The list sort method `[].sort()` supports an optional custom
  comparison function argument. The job this function has is to
  return -1 if the first thing should come first, return 0 if the
  two things are equal order-wise, and return 1 if the first thing
  should come second. The built-in function `cmp()` does this in a
  manner identical to the default `[].sort()` (except in terms of
  speed, 'lst.sort()' is much faster than 'lst.sort(cmp)'). For
  short lists and quick solutions, a custom comparison function is
  probably the best thing. In a lot of cases, one can even get by
  with an in-line 'lambda' function as the custom comparison
  function, which is a pleasant and handy idiom.
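
  For example, to sort a list of lines on each line's second word,
  a quick one-off might look like this (just a sketch; it assumes
  every line has at least two words):

      >>> lines.sort(lambda s,t: cmp(s.split()[1], t.split()[1]))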

  When it comes to speed, however, use of custom comparison
  functions is fairly awful. Part of the problem is Python's
  function call overhead, but a lot of other factors contribute to
  the slowness. Fortunately, a technique called "Schwartzian
  Transforms" can make for much faster custom sorts. Schwartzian
  Transforms are so named after Randal Schwartz, who proposed the
  technique for working with Perl; but the technique is equally
  applicable to Python.

  The pattern involved in the Schwartzian Transform technique
  consists of three steps (these can more precisely be called the
  Guttman-Rosler Transform, which is based on the Schwartzian
  Transform):

  1. Transform the list in a reversible way into one that sorts
      "naturally."

  2. Call Python's native `[].sort()` method.

  3. Reverse the transformation in (1) to restore the
      original list items (in new sorted order).

  The reason this technique works is that, for a list of size N,
  it only requires O(2N) transformation operations, which is easy
  to amortize over the necessary O(N log N) compare/flip
  operations for large lists. The sort dominates computational
  time, so anything that makes the sort more efficient is a win
  in the limit case (this limit is reached quickly).
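
  Schematically, the three steps are just "decorate-sort-undecorate"
  (a sketch, using a hypothetical 'key_of()' function that extracts
  the sort key from an item):

      #*---- decorate-sort-undecorate sketch ----#
      dec = [(key_of(item), item) for item in lst]  # (1) decorate
      dec.sort()                                    # (2) native sort
      lst = [item for key, item in dec]             # (3) undecorate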

  Below is an example of a simple, but plausible, custom sorting
  algorithm. The sort is on the fourth and subsequent words of
  a list of input lines. Lines that are shorter than four words
  sort to the bottom. Running the test against a file with about
  20,000 lines--about 1 megabyte--performed the Schwartzian
  Transform sort in less than 2 seconds, while taking over 12
  seconds for the custom comparison function sort (outputs were
  verified as identical). Any number of factors will change the
  exact relative timings, but a better than six times gain can
  generally be expected.

      #---------- schwartzian_sort.py ----------#
      # Timing test for "sort on fourth word"
      # Specifically, lines with >= 4 words will be sorted
      # lexicographically on the 4th, 5th, etc. words.
      # Any line with fewer than four words will be sorted to
      # the end, and will occur in "natural" order.

      import sys, string, time
      wrerr = sys.stderr.write

      # naive custom sort
      def fourth_word(ln1,ln2):
          lst1 = string.split(ln1)
          lst2 = string.split(ln2)
          #-- Compare "long" lines
          if len(lst1) >= 4 and len(lst2) >= 4:
              return cmp(lst1[3:],lst2[3:])
          #-- Long lines before short lines
          elif len(lst1) >= 4 and len(lst2) < 4:
              return -1
          #-- Short lines after long lines
          elif len(lst1) < 4 and len(lst2) >= 4:
              return 1
          else: # Natural order
              return cmp(ln1,ln2)

      # Don't count the read itself in the time
      lines = open(sys.argv[1]).readlines()

      # Time the custom comparison sort
      start = time.time()
      lines.sort(fourth_word)

      end = time.time()
      wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
      # open('tmp.custom','w').writelines(lines)

      # Don't count the read itself in the time
      lines = open(sys.argv[1]).readlines()

      # Time the Schwartzian sort
      start = time.time()
      for n in range(len(lines)): # Create the transform
          lst = string.split(lines[n])
          if len(lst) >= 4: # Tuple w/ sort info first
              lines[n] = (lst[3:], lines[n])
          else: # Short lines to end
              lines[n] = (['\377'], lines[n])

      lines.sort() # Native sort

      for n in range(len(lines)): # Restore original lines
          lines[n] = lines[n][1]

      end = time.time()
      wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
      # open('tmp.schwartzian','w').writelines(lines)

  Only one particular example is presented, but readers should be
  able to generalize this technique to any sort they need to
  perform frequently or on large files.


  PROBLEM: Reformatting paragraphs of text
  --------------------------------------------------------------------

  While I mourn the decline of plaintext ASCII as a communication
  format--and its eclipse by unnecessarily complicated and large
  (and often proprietary) formats--there is still plenty of life
  left in text files full of prose. READMEs, HOWTOs, email,
  Usenet posts, and this book itself are written in plaintext (or
  at least something close enough to plaintext that generic
  processing techniques are valuable). Moreover, many formats like
  HTML and LaTeX are frequently enough hand-edited that their
  plaintext appearance is important.

  One task that is extremely common when working with prose text
  files is reformatting paragraphs to conform to desired margins.
  Python 2.3 adds the module [textwrap], which performs more
  limited reformatting than the code below. Most of the time, this
  task gets done within text editors, which are indeed quite
  capable of performing the task. However, sometimes it would be
  nice to automate the formatting process. The task is simple
  enough that it is slightly surprising that Python has no standard
  module function to do this. There -is- the class
  `formatter.DumbWriter`, or the possibility of inheriting from and
  customizing `formatter.AbstractWriter`. These classes are
  discussed in Chapter 5; but frankly, the amount of customization
  and sophistication needed to use these classes and their many
  methods is way out of proportion for the task at hand.

  Below is a simple solution that can be used either as a
  command-line tool (reading from STDIN and writing to STDOUT) or
  by import to a larger application.

      #---------- reformat_para.py ----------#
      # Simple paragraph reformatter. Allows specification
      # of left and right margins, and of justification style
      # (using constants defined in module).

      LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'

      def reformat_para(para='',left=0,right=72,just=LEFT):
          words = para.split()
          if not words: return ''   # guard against empty paragraphs
          lines = []
          line = ''
          word = 0
          end_words = 0
          while not end_words:
              if len(words[word]) > right-left: # Handle very long words
                  line = words[word]
                  word +=1
                  if word >= len(words):
                      end_words = 1
              else: # Compose line of words
                  while len(line)+len(words[word]) <= right-left:
                      line += words[word]+' '
                      word += 1
                      if word >= len(words):
                          end_words = 1
                          break
              lines.append(line)
              line = ''
          if just==CENTER:
              r, l = right, left
              return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
          elif just==RIGHT:
              return '\n'.join([line.rjust(right) for line in lines])
          else: # left justify
              return '\n'.join([' '*left+line for line in lines])

      if __name__=='__main__':
          import sys
          if len(sys.argv) != 4:
              print "Please specify left_margin, right_margin, justification"
          else:
              left = int(sys.argv[1])
              right = int(sys.argv[2])
              just = sys.argv[3].upper()

              # Simplistic approach to finding initial paragraphs
              for p in sys.stdin.read().split('\n\n'):
                  print reformat_para(p,left,right,just),'\n'

  A number of enhancements are left to readers, if needed. You
  might want to allow hanging indents or indented first lines, for
  example. Or paragraphs meeting certain criteria might not be
  appropriate for wrapping (e.g., headers). A custom application
  might also determine the input paragraphs differently, either
  by a different parsing of an input file, or by generating
  paragraphs internally in some manner.
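
  If Python 2.3+ is available, the [textwrap] module mentioned
  above already handles the plain left-justified case (a sketch
  only; it offers nothing like the RIGHT and CENTER styles of
  'reformat_para()'):

      >>> import textwrap
      >>> print textwrap.fill(para, width=right,
      ...                     initial_indent=' '*left,
      ...                     subsequent_indent=' '*left)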


  PROBLEM: Column statistics for delimited or flat-record files
  --------------------------------------------------------------------

  Data feeds, DBMS dumps, log files, and flat-file databases all
  tend to contain ontologically similar records--one per line--with
  a collection of fields in each record. Usually such fields are
  separated either by a specified delimiter or by specific column
  positions where fields are to occur.

  Parsing these structured text records is quite easy, and
  performing computations on fields is equally straightforward. But
  in working with a variety of such "structured text databases," it
  is easy to keep writing almost the same code over again for each
  variation in format and computation.

  The example below provides a generic framework for every
  similar computation on a structured text database.

      #---------- fields_stats.py ----------#
      # Perform calculations on one or more of the
      # fields in a structured text database.

      import operator
      from types import *
      from xreadlines import xreadlines # req 2.1, but is much faster...
                                        # could use .readline() meth < 2.1
      #-- Symbolic Constants
      DELIMITED = 1
      FLATFILE = 2

      #-- Some sample "statistical" func (in functional programming style)
      nillFunc = lambda lst: None
      toFloat = lambda lst: map(float, lst)
      avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
      sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
      max_lst = lambda lst: reduce(max, toFloat(lst))

      class FieldStats:
          """Gather statistics about structured text database fields

          text_db may be either string (incl. Unicode) or file-like object
          style may be in (DELIMITED, FLATFILE)
          delimiter specifies the field separator in DELIMITED style text_db
          column_positions lists all field positions for FLATFILE style,
                           using one-based indexing (first column is 1).
                     E.g.: (1, 7, 40) would take fields one, two, three
                           from columns 1, 7, 40 respectively.
          field_funcs is a dictionary with column positions as keys,
                      and functions on lists as values.
               E.g.: {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
                      average of column one, the sum of column 4, and the
                      max of column 5. All other cols--incl 2,3, >=6--
                      are ignored.

          """
          def __init__(self,
                       text_db='',
                       style=DELIMITED,
                       delimiter=',',
                       column_positions=(1,),
                       field_funcs={} ):
              self.text_db = text_db
              self.style = style
              self.delimiter = delimiter
              self.column_positions = column_positions
              self.field_funcs = field_funcs

          def calc(self):
              """Calculate the column statistics
              """
              #-- 1st, create a list of lists for data (incl. unused flds)
              used_cols = self.field_funcs.keys()
              used_cols.sort()
              # one-based column naming: column[0] is always unused
              columns = []
              for n in range(1+used_cols[-1]):
                  # hint: '[[]]*num' creates refs to same list
                  columns.append([])

              #-- 2nd, fill lists used for calculated fields
                      # might use a string directly for text_db
              if type(self.text_db) in (StringType,UnicodeType):
                  for line in self.text_db.split('\n'):
                      fields = self.splitter(line)
                      for col in used_cols:
                          field = fields[col-1] # zero-based index
                          columns[col].append(field)
              else: # Something file-like for text_db
                  for line in xreadlines(self.text_db):
                      fields = self.splitter(line)
                      for col in used_cols:
                          field = fields[col-1] # zero-based index
                          columns[col].append(field)

              #-- 3rd, apply the field funcs to column lists
              results = [None] * (1+used_cols[-1])
              for col in used_cols:
                  results[col] = self.field_funcs[col](columns[col])

              #-- Finally, return the result list
              return results

          def splitter(self, line):
              """Split a line into fields according to curr inst specs"""
              if self.style == DELIMITED:
                  return line.split(self.delimiter)
              elif self.style == FLATFILE:
                  fields = []
                  # Adjust offsets to Python zero-based indexing,
                  # and also add final position after the line
                  num_positions = len(self.column_positions)
                  offsets = [(pos-1) for pos in self.column_positions]
                  offsets.append(len(line))
                  for pos in range(num_positions):
                      start = offsets[pos]
                      end = offsets[pos+1]
                      fields.append(line[start:end])
                  return fields
              else:
                  raise ValueError, \
                        "Text database must be DELIMITED or FLATFILE"

      #-- Test data
      # First Name, Last Name, Salary, Years Seniority, Department
      delim = '''
      Kevin,Smith,50000,5,Media Relations
      Tom,Woo,30000,7,Accounting
      Sally,Jones,62000,10,Management
      '''.strip() # no leading/trailing newlines

      # Comment     First     Last      Salary    Years  Dept
      flat = '''
      tech note     Kevin     Smith     50000     5      Media Relations
      more filler   Tom       Woo       30000     7      Accounting
      yet more...   Sally     Jones     62000     10     Management
      '''.strip() # no leading/trailing newlines

      #-- Run self-test code
      if __name__ == '__main__':
          getdelim = FieldStats(delim, field_funcs={3:avg_lst,4:max_lst})
          print 'Delimited Calculations:'
          results = getdelim.calc()
          print ' Average salary -', results[3]
          print ' Max years worked -', results[4]

          getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                                     style=FLATFILE,
                                     column_positions=(15,25,35,45,52))
          print 'Flat Calculations:'
          results = getflat.calc()
          print ' Average salary -', results[3]
          print ' Max years worked -', results[4]

  The example above includes some efficiency considerations that
  make it a good model for working with large data sets. In the
  first place, class 'FieldStats' can (optionally) deal with a
  file-like object, rather than keeping the whole structured text
  database in memory. The generator `xreadlines.xreadlines()` is
  an extremely fast and efficient file reader, but it requires
  Python 2.1+--otherwise use `FILE.readline()` or
  `FILE.readlines()` (for either memory or speed efficiency,
  respectively). Moreover, only the data that is actually of
  interest is collected into lists, in order to save memory.
  However, rather than require multiple passes to collect
  statistics on multiple fields, as many field columns and
  summary functions as wanted can be used in one pass.

  One possible improvement would be to allow multiple summary
  functions against the same field during a pass. But that is
  left as an exercise to the reader, if she desires to do it.
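
  As a hint toward that exercise: since 'field_funcs' values may be
  any callables on lists, one lightweight approach that needs no
  change to 'FieldStats' is a function returning a tuple of several
  statistics (a sketch reusing the lambdas and test data above):

      #*---- several summaries per column ----#
      several = lambda lst: (avg_lst(lst), sum_lst(lst), max_lst(lst))
      getdelim = FieldStats(delim, field_funcs={3:several})
      avg, total, mx = getdelim.calc()[3] # (average, sum, max) of col 3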


  PROBLEM: Counting characters, words, lines, and paragraphs
  --------------------------------------------------------------------

  There is a wonderful utility under Unix-like systems called
  'wc'. What it does is so basic, and so obvious, that it is
  hard to imagine working without it. 'wc' simply counts the
  characters, words, and lines of files (or STDIN). A few
  command-line options control which results are displayed, but I
  rarely use them.

  In writing this chapter, I found myself on a system without
  'wc', and felt a remedy was in order. The example below is
  actually an "enhanced" 'wc' since it also counts paragraphs
  (but it lacks the command-line switches). Unlike the external
  'wc', it is easy to use the technique directly within Python
  and is available anywhere Python is. The main trick--inasmuch
  as there is one--is a compact use of the `"".join()` and
  `"".split()` methods (`string.join()` and `string.split()` could
  also be used, for example, to be compatible with Python 1.5.2 or
  below).

      #---------- wc.py ----------#
      # Report the chars, words, lines, paragraphs
      # on STDIN or in wildcard filename patterns
      import sys, glob
      if len(sys.argv) > 1:
          c, w, l, p = 0, 0, 0, 0
          for pat in sys.argv[1:]:
              for file in glob.glob(pat):
                  s = open(file).read()
                  wc = len(s), len(s.split()), \
                       len(s.split('\n')), len(s.split('\n\n'))
                  print '\t'.join(map(str, wc)),'\t'+file
                  c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
          wc = (c,w,l,p)
          print '\t'.join(map(str, wc)), '\tTOTAL'
      else:
          s = sys.stdin.read()
          wc = len(s), len(s.split()), len(s.split('\n')), \
               len(s.split('\n\n'))
          print '\t'.join(map(str, wc)), '\tSTDIN'

  This little functionality could be wrapped up in a function,
  but it is almost too compact to bother with doing so. Most of
  the work is in the interaction with the shell environment, with
  the counting basically taking only two lines.

  The solution above is quite likely the "one obvious way to do
  it," and therefore Pythonic. On the other hand a slightly more
  adventurous reader might consider this assignment (if only for
  fun):

      >>> wc = map(len,[s]+map(s.split,(None,'\n','\n\n')))

  A real daredevil might be able to reduce the entire program to
  a single 'print' statement.
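
  One such single-statement version of the STDIN case might read as
  follows (shown purely as a stunt, not a recommendation):

      #*---- wc.py as one statement ----#
      print '\t'.join(map(str, map(len, (lambda s: [s]+map(s.split,
            (None,'\n','\n\n')))(__import__('sys').stdin.read())))) + '\tSTDIN'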


  PROBLEM: Transmitting binary data as ASCII
  --------------------------------------------------------------------

  Many channels require that the information that travels over them
  is 7-bit ASCII. Any byte with its high-order bit set may
  be handled unpredictably when transmitting data over protocols
  like Simple Mail Transport Protocol (SMTP), Network News
  Transport Protocol (NNTP), or HTTP (depending on content
  encoding), or even just when displaying them in many standard
  tools like editors. In order to encode 8-bit binary data as
  ASCII, a number of techniques have been invented over time.

  An obvious, but obese, encoding technique is to translate each
  binary byte into its hexadecimal digits. UUencoding is an older
  standard that developed around the need to transmit binary files
  over the Usenet and on BBSs. Binhex is a similar technique from
  the MacOS world. In recent years, base64--which is specified by
  RFC1521--has edged out the other styles of encoding. All of the
  techniques are basically 4/3 encodings--that is, four ASCII bytes
  are used to represent three binary bytes--but they differ
  somewhat in line ending and header conventions (as well as in the
  encoding as such). Quoted printable is yet another format, but of
  variable encoding length. In quoted printable encoding, most
  plain ASCII bytes are left unchanged, but a few special
  characters and all high-bit bytes are escaped.

  Python provides modules for all the encoding styles mentioned.
  The high-level wrappers [uu], [binhex], [base64], and [quopri]
  all operate on input and output file-like objects, encoding the
  data therein. They also each have slightly different method names
  and arguments. [binhex], for example, closes its output file
  after encoding, which makes it unusable in conjunction with a
  [cStringIO] file-like object. All of the high-level encoders
  utilize the services of the low-level C module [binascii].
  [binascii], in turn, implements the actual low-level block
  conversions, but assumes that it will be passed the right size
  blocks for a given encoding.

  The standard library, therefore, does not contain quite the
  right intermediate-level functionality for when the goal is
  just encoding the binary data in arbitrary strings. It is easy
  to wrap that up though:

      #---------- encode_binary.py ----------#
      # Provide encoders for arbitrary binary data
      # in Python strings. Handles block size issues
      # transparently, and returns a string.
      # Precompression of the input string can reduce
      # or eliminate any size penalty for encoding.

      import sys
      import zlib
      import binascii

      UU = 45               # binary bytes per uuencoded line
      BASE64 = 57           # binary bytes per base64 line (76 ASCII chars)
      BINHEX = sys.maxint   # binhex conversion takes any block size

      def ASCIIencode(s='', type=BASE64, compress=1):
          """ASCII encode a binary string"""
          # First, decide the encoding style
          if type == BASE64: encode = binascii.b2a_base64
          elif type == UU: encode = binascii.b2a_uu
          elif type == BINHEX: encode = binascii.b2a_hqx
          else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
          # Second, compress the source if specified
          if compress: s = zlib.compress(s)
          # Third, encode the string, block-by-block
          offset = 0
          blocks = []
          while 1:
              blocks.append(encode(s[offset:offset+type]))
              offset += type
              if offset >= len(s): # don't encode an empty extra block
                  break
          # Fourth, return the concatenated blocks
          return ''.join(blocks)

      def ASCIIdecode(s='', type=BASE64, compress=1):
          """Decode ASCII to a binary string"""
          # First, decide the encoding style
          if type == BASE64: s = binascii.a2b_base64(s)
          elif type == BINHEX: s = binascii.a2b_hqx(s)
          elif type == UU:
              s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
          # Second, decompress the source if specified
          if compress: s = zlib.decompress(s)
          # Third, return the decoded binary string
          return s

      # Encode/decode STDIN for self-test
      if __name__ == '__main__':
          decode, TYPE = 0, BASE64
          for arg in sys.argv:
              if arg.lower()=='-d': decode = 1
              elif arg.upper()=='UU': TYPE=UU
              elif arg.upper()=='BINHEX': TYPE=BINHEX
              elif arg.upper()=='BASE64': TYPE=BASE64
          if decode:
              print ASCIIdecode(sys.stdin.read(),type=TYPE)
          else:
              print ASCIIencode(sys.stdin.read(),type=TYPE)

  The example above does not attach any headers or delimit the
  encoded block (by design); for that, a wrapper like [uu],
  [mimify], or [MimeWriter] is a better choice. Or a custom
  wrapper around 'encode_binary.py'.
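
  As a quick sanity check, the encoders round-trip at the
  interactive prompt (assuming 'encode_binary.py' is on the module
  path):

      >>> from encode_binary import ASCIIencode, ASCIIdecode
      >>> data = ''.join(map(chr, range(256)))  # some binary data
      >>> ASCIIdecode(ASCIIencode(data)) == data
      1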


  PROBLEM: Creating word or letter histograms
  --------------------------------------------------------------------

  A histogram is an analysis of the relative occurrence frequency
  of each of a number of possible values. In terms of text
  processing, the occurrences in question are almost always
  either words or byte values. Creating histograms is quite
  simple using Python dictionaries, but the technique is not
  always immediately obvious to people thinking about it. The
  example below has a good generality, provides several utility
  functions associated with histograms, and can be used in a
  command-line operation mode.

      #---------- histogram.py ----------#
      # Create occurrence counts of words or characters
      # A few utility functions for presenting results
      # Avoids requirement of recent Python features

      from string import split, maketrans, translate, punctuation, digits
      import sys
      from types import *

      def word_histogram(source):
          """Create histogram of normalized words (no punct or digits)"""
          hist = {}
          trans = maketrans('','')
          if type(source) in (StringType,UnicodeType): # String-like src
              for word in split(source):
                  word = translate(word, trans, punctuation+digits)
                  if len(word) > 0:
                      hist[word] = hist.get(word,0) + 1
          elif hasattr(source,'read'): # File-like src
              try:
                  from xreadlines import xreadlines # Check for module
                  for line in xreadlines(source):
                      for word in split(line):
                          word = translate(word, trans, punctuation+digits)
                          if len(word) > 0:
                              hist[word] = hist.get(word,0) + 1
              except ImportError: # Older Python ver
                  line = source.readline() # Slow but mem-friendly
                  while line:
                      for word in split(line):
                          word = translate(word, trans, punctuation+digits)
                          if len(word) > 0:
                              hist[word] = hist.get(word,0) + 1
                      line = source.readline()
          else:
              raise TypeError, \
                    "source must be a string-like or file-like object"
          return hist

      def char_histogram(source, sizehint=1024*1024):
          hist = {}
          if type(source) in (StringType,UnicodeType): # String-like src
              for char in source:
                  hist[char] = hist.get(char,0) + 1
          elif hasattr(source,'read'): # File-like src
              chunk = source.read(sizehint)
              while chunk:
                  for char in chunk:
                      hist[char] = hist.get(char,0) + 1
                  chunk = source.read(sizehint)
          else:
              raise TypeError, \
                    "source must be a string-like or file-like object"
          return hist

      def most_common(hist, num=1):
          pairs = []
          for pair in hist.items():
              pairs.append((pair[1],pair[0]))
          pairs.sort()
          pairs.reverse()
          return pairs[:num]

      def first_things(hist, num=1):
          pairs = []
          things = hist.keys()
          things.sort()
          for thing in things:
              pairs.append((thing,hist[thing]))
          pairs.sort()
          return pairs[:num]

      if __name__ == '__main__':
          if len(sys.argv) > 1:
              hist = word_histogram(open(sys.argv[1]))
          else:
              hist = word_histogram(sys.stdin)

          print "Ten most common words:"
          for pair in most_common(hist, 10):
              print '\t', pair[1], pair[0]

          print "First ten words alphabetically:"
          for pair in first_things(hist, 10):
              print '\t', pair[0], pair[1]

          # a more practical command-line version might use:
          #     for pair in most_common(hist, len(hist)):
          #         print pair[1], '\t', pair[0]

  Several of the design choices are somewhat arbitrary. Words
  have all their punctuation stripped to identify "real" words.
  But on the other hand, words are still case-sensitive, which
  may not be what is desired. The sorting functions
  'first_things()' and 'most_common()' only return an initial
  sublist. Perhaps it would be better to return the whole list,
  and let the user slice the result. It is simple to customize
  around these sorts of issues, though.
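
  For instance, the utilities might be combined at the interactive
  prompt along these lines:

      >>> from histogram import char_histogram, most_common
      >>> hist = char_histogram("mary had a little lamb")
      >>> most_common(hist, 3)
      [(4, 'a'), (4, ' '), (3, 'l')]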


  PROBLEM: Reading a file backwards by record, line, or paragraph
  --------------------------------------------------------------------

  Reading a file line by line is a common task in Python, or in
  most any language. Files like server logs, configuration files,
  structured text databases, and others frequently arrange
  information into logical records, one per line. Very often, the
  job of a program is to perform some calculation on each record
  in turn.

  Python provides a number of convenient methods on file-like
  objects for such line-by-line reading. `FILE.readlines()`
  reads a whole file at once and returns a list of lines. The
  technique is very fast, but requires the whole contents of the
  file be kept in memory. For very large files, this can be a
  problem. `FILE.readline()` is memory-friendly--it just reads a
  line at a time and can be called repeatedly until the EOF is
  reached--but it is also much slower. The best solution for
  recent Python versions is `xreadlines.xreadlines()` or
  `FILE.xreadlines()` in Python 2.1+. These techniques are
  memory-friendly, while still being fast and presenting a
  "virtual list" of lines (by way of Python's new
  generator/iterator interface).

  The above techniques work nicely for reading a file in its
  natural order, but what if you want to start at the end of a
  file and work backwards from there? This need is frequently
  encountered when you want to read log files that have records
  appended over time (and when you want to look at the most
  recent records first). It comes up in other situations also.
  There is a very easy technique if memory usage is not an issue:

      >>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
      >>> fp = open('lines')
      >>> lines = fp.readlines()
      >>> lines.reverse()
      >>> for line in lines[1:5]:
      ...     # Processing suite here
      ...     print line,
      ...
      98
      97
      96
      95

  For large input files, however, this technique is not feasible.
  It would be nice to have something analogous to [xreadlines]
  here. The example below provides a good starting point (the
  example works equally well for file-like objects).

      #---------- read_backwards.py ----------#
      # Read blocks of a file from end to beginning.
      # Blocks may be defined by any delimiter, but the
      # constants LINE and PARA are useful ones.
      # Works much like the file object method '.readline()':
      # repeated calls continue to get "next" part, and
      # function returns empty string once BOF is reached.

      # Define constants
      from os import linesep
      LINE = linesep
      PARA = linesep*2
      READSIZE = 1000

      # Global variables
      buffer = ''

      def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
          """Read blocks of file backwards (return empty string when done)"""
          # Trick of mutable default argument to hold state between calls
          if not _init[0]:
              fp.seek(0,2)
              _init[0] = 1
          # Find a block (using global buffer)
          global buffer
          while 1:
              # first check for block in buffer
              delim = buffer.rfind(mode)
              if delim != -1: # block is in buffer, return it
                  block = buffer[delim+len(mode):]
                  buffer = buffer[:delim]
                  return block+mode
              #-- BOF reached, return remainder (or empty string)
              elif fp.tell()==0:
                  block = buffer
                  buffer = ''
                  return block
              else: # Read some more data into the buffer
                  readsize = min(fp.tell(),sizehint)
                  fp.seek(-readsize,1)
                  buffer = fp.read(readsize) + buffer
                  fp.seek(-readsize,1)

      #-- Self test of read_backwards()
      if __name__ == '__main__':
          # Let's create a test file to read in backwards
          fp = open('lines','wb')
          fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
          # Now open for reading backwards
          fp = open('lines','rb')
          # Read the blocks in, one per call (block==line by default)
          block = read_backwards(fp)
          while block:
              print block,
              block = read_backwards(fp)

  Notice that -anything- could serve as a block delimiter. The
  constants provided just happened to work for lines and block
  paragraphs (and block paragraphs only with current OS's style
  of line breaks). But other delimiters could be used. It would
  -not- be immediately possible to read backwards word-by-word--a
  space delimiter would come close, but would not be quite right
  for other whitespace. However, reading a line (and maybe
  reversing its words) is generally good enough.

  Another enhancement is possible with Python 2.2+. Using the
  new 'yield' keyword, 'read_backwards()' could be programmed as
  an iterator rather than as a multi-call function. The
  performance will not differ significantly, but the function
  might be expressed more clearly (and a "list-like" interface
  like `FILE.readlines()` makes the application's loop simpler).
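
  A minimal sketch of such a generator follows (Python 2.2 needs
  the 'from __future__' line, Python 2.3+ does not; the name
  'read_backwards_gen' is ours):

      #*---- generator version sketch ----#
      from __future__ import generators
      from os import linesep

      def read_backwards_gen(fp, mode=linesep, sizehint=1000):
          """Yield blocks of fp from the end to the beginning"""
          fp.seek(0, 2)                # start at EOF
          buffer = ''
          while 1:
              delim = buffer.rfind(mode)
              if delim != -1:          # block is in buffer, yield it
                  block = buffer[delim+len(mode):]
                  buffer = buffer[:delim]
                  yield block+mode
              elif fp.tell() == 0:     # BOF reached, yield remainder
                  if buffer: yield buffer
                  return
              else:                    # read more data into the buffer
                  readsize = min(fp.tell(), sizehint)
                  fp.seek(-readsize, 1)
                  buffer = fp.read(readsize) + buffer
                  fp.seek(-readsize, 1)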

  QUESTIONS:

  1. Write a generator-based version of 'read_backwards()' that
      uses the 'yield' keyword. Modify the self-test code to
      utilize the generator instead.

  2. Explore and explain some pitfalls with the use of a mutable
      default value as a function argument. Explain also how the
      style allows functions to encapsulate data and contrast
      with the encapsulation of class instances.


SECTION 2 -- Standard Modules
------------------------------------------------------------------------

  TOPIC -- Basic String Transformations
  --------------------------------------------------------------------

  The module [string] forms the core of Python's text manipulation
  libraries. That module is certainly the place to look before
  other modules. Most of the functions in the [string] module, you
  should note, are also available as methods of string objects from
  Python 1.6+. Moreover, methods of string objects are a little bit
  faster to use than are the corresponding module functions. A few
  new methods of string objects do not have equivalents in the
  [string] module, but are still documented here.

  SEE ALSO, [str], [UserString]

  =================================================================
    MODULE -- string : A collection of string operations
  =================================================================

  There are a number of general things to notice about the
  functions in the [string] module (which is composed entirely
  of functions and constants; no classes).

  1. Strings are immutable (as discussed in Chapter 1). This
      means that there is no such thing as changing a string "in
      place" (as we might do in many other languages, such as C,
      by changing the bytes at certain offsets within the
      string). Whenever a [string] module function takes a
      string object as an argument, it returns a brand-new
      string object and leaves the original one as is. However,
      the very common pattern of binding the same name on the
      left of an assignment as was passed on the right side
      within the [string] module function somewhat conceals this
      fact. For example:

      >>> import string
      >>> str = "Mary had a little lamb"
      >>> str = string.replace(str, 'had', 'ate')
      >>> str
      'Mary ate a little lamb'

      The first string object never gets modified per se; but
      since the first string object is no longer bound to any
      name after the example runs, the object is subject to
      garbage collection and will disappear from memory. In
      short, calling a [string] module function will not change
      any existing strings, but rebinding a name can make it
      look like they changed.

  2. Many [string] module functions are now also available as
      string object methods. To use these string object
      methods, there is no need to import the [string] module,
      and the expression is usually slightly more concise.
      Moreover, using a string object method is usually slightly
      faster than the corresponding [string] module function.
      However, the most thorough documentation of each
      function/method that exists as both a [string] module
      function and a string object method is contained in this
      reference to the [string] module.

  3. The form 'string.join(string.split(...))' is a frequent
      Python idiom. A more thorough discussion is contained in
      the reference items for `string.join()` and
      `string.split()`, but in general, combining these two
      functions is very often a useful way of breaking down a
      text, processing the parts, then putting together the
      pieces.
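
      For example, to uppercase each comma-separated field of a
      record (a small illustration of the split/process/join
      pattern):

      >>> rec = "alpha,beta,gamma"
      >>> ','.join([fld.upper() for fld in rec.split(',')])
      'ALPHA,BETA,GAMMA'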

  4. Think about clever `string.replace()` patterns. By
      combining multiple `string.replace()` calls with use of
      "place holder" string patterns, a surprising range of
      results can be achieved (especially when also manipulating
      the intermediate strings with other techniques). See the
      reference item for `string.replace()` for some discussion
      and examples.
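
      As a small preview, a place holder makes it safe to swap two
      substrings without the second replacement clobbering the
      first:

      >>> s = "spam and eggs"
      >>> s = s.replace('spam','\000').replace('eggs','spam')
      >>> s.replace('\000','eggs')
      'eggs and spam'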

  5. A mutable string of sorts can be obtained by using built-in
      lists, or the [array] module. Lists can contain a
      collection of substrings, each one of which may be replaced
      or modified individually. The [array] module can define
      arrays of individual characters, each position modifiable,
      including with slice notation. The function `string.join()`
      or the method `"".join()` may be used to re-create true
      strings; for example:

      >>> lst = ['spam','and','eggs']
      >>> lst[2] = 'toast'
      >>> print ''.join(lst)
      spamandtoast
      >>> print ' '.join(lst)
      spam and toast

      Or:

      >>> import array
      >>> a = array.array('c','spam and eggs')
      >>> print ''.join(a)
      spam and eggs
      >>> a[0] = 'S'
      >>> print ''.join(a)
      Spam and eggs
      >>> a[-4:] = array.array('c','toast')
      >>> print ''.join(a)
      Spam and toast

  CONSTANTS:

  The [string] module contains constants for a number of frequently
  used collections of characters. Each of these constants is itself
  simply a string (rather than a list, tuple, or other collection).
  As such, it is easy to define constants alongside those provided
  by the [string] module, should you need them. For example:

      >>> import string
      >>> string.brackets = "[]{}()<>"
      >>> print string.brackets
      []{}()<>

  string.digits
      The decimal numerals ("0123456789").

  string.hexdigits
      The hexadecimal numerals ("0123456789abcdefABCDEF").

  string.octdigits
      The octal numerals ("01234567").

  string.lowercase
      The lowercase letters; can vary by language. In English
      versions of Python (most systems):

      >>> import string
      >>> string.lowercase
      'abcdefghijklmnopqrstuvwxyz'

      You should not modify `string.lowercase` for a source
      text language, but rather define a new attribute, such as
      'string.spanish_lowercase' with an appropriate string
      (some methods depend on this constant).

  string.uppercase
      The uppercase letters; can vary by language. In English
      versions of Python (most systems):

      >>> import string
      >>> string.uppercase
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

      You should not modify `string.uppercase` for a source
      text language, but rather define a new attribute, such as
      'string.spanish_uppercase' with an appropriate string
      (some methods depend on this constant).

  string.letters
      All the letters (string.lowercase+string.uppercase).

  string.punctuation
      The characters normally considered as punctuation; can
      vary by language. In English versions of Python (most
      systems):

      >>> import string
      >>> string.punctuation
      '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

  string.whitespace
      The "empty" characters. Normally these consist of tab,
      linefeed, vertical tab, formfeed, carriage return, and
      space (in that order):

      >>> import string
      >>> string.whitespace
      '\011\012\013\014\015 '

      You should not modify `string.whitespace` (some methods
      depend on this constant).

  string.printable
      All the characters that can be printed to any device; can
      vary by language
      (string.digits+string.letters+string.punctuation+string.whitespace)

  FUNCTIONS:

  string.atof(s=...)
      Deprecated. Use `float()`.

      Converts a string to a floating point value.

      SEE ALSO, `eval()`, `float()`

  string.atoi(s=... [,base=10])
      Deprecated with Python 2.0. Use `int()` if no custom
      base is needed or if using Python 2.0+.

      Converts a string to an integer value (if the string
      should be assumed to be in a base other than 10, the base
      may be specified as the second argument).

      SEE ALSO, `eval()`, `int()`, `long()`

  string.atol(s=... [,base=10])
      Deprecated with Python 2.0. Use `long()` if no custom
      base is needed or if using Python 2.0+.

      Converts a string to an unlimited length integer value
      (if the string should be assumed to be in a base other
      than 10, the base may be specified as the second argument).

      SEE ALSO, `eval()`, `long()`, `int()`

  string.capitalize(s=...)
  "".capitalize()
      Return a string consisting of the initial character
      converted to uppercase (if applicable), and all other
      characters converted to lowercase (if applicable):

      >>> import string
      >>> string.capitalize("mary had a little lamb!")
      'Mary had a little lamb!'
      >>> string.capitalize("Mary had a Little Lamb!")
      'Mary had a little lamb!'
      >>> string.capitalize("2 Lambs had Mary!")
      '2 lambs had mary!'

      For Python 1.6+, use of a string object method is
      marginally faster and is stylistically preferred in most
      cases:

      >>> "mary had a little lamb".capitalize()
      'Mary had a little lamb'

      SEE ALSO, `string.capwords()`, `string.lower()`

  string.capwords(s=...)
  "".title()
      Return a string consisting of the capitalized words.
      An equivalent expression is:

      #*----- equivalent expression -----#
      string.join(map(string.capitalize,string.split(s)))

      But `string.capwords()` is a clearer way of writing it. An
      effect of this implementation is that whitespace is
      "normalized" by the process:

      >>> import string
      >>> string.capwords("mary HAD a little lamb!")
      'Mary Had A Little Lamb!'
      >>> string.capwords("Mary had a Little Lamb!")
      'Mary Had A Little Lamb!'

      With the creation of string methods in Python 1.6, the
      module function `string.capwords()` was renamed as a string
      method to `"".title()`.

      SEE ALSO, `string.capitalize()`, `string.lower()`,
                `"".istitle()`

  string.center(s=..., width=...)
  "".center(width)
      Return a string with 's' padded with symmetrical leading
      and trailing spaces (but not truncated) to occupy length
      'width' (or more).

      >>> import string
      >>> string.center(width=30,s="Mary had a little lamb")
      '    Mary had a little lamb    '
      >>> string.center("Mary had a little lamb", 5)
      'Mary had a little lamb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a little lamb".center(25)
      ' Mary had a little lamb  '

      SEE ALSO, `string.ljust()`, `string.rjust()`

  string.count(s, sub [,start [,end]])
  "".count(sub [,start [,end]])
      Return the number of nonoverlapping occurrences of 'sub'
      in 's'. If the optional third or fourth arguments are
      specified only the corresponding slice of 's' is
      examined.

      >>> import string
      >>> string.count("mary had a little lamb", "a")
      4
      >>> string.count("mary had a little lamb", "a", 3, 10)
      2

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary had a little lamb'.count("a")
      4

  "".endswith(suffix [,start [,end]])
      This string method does not have an equivalent in the
      [string] module. Return a Boolean value indicating whether
      the string ends with the suffix 'suffix'. If the optional
      second argument 'start' is specified, only consider the
      terminal substring after offset 'start'. If the optional
      third argument 'end' is given, only consider the slice
      '[start:end]'.

      SEE ALSO, `"".startswith()`, `string.find()`

  string.expandtabs(s=... [,tabsize=8])
  "".expandtabs([,tabsize=8])
      Return a string with tabs replaced by a variable number
      of spaces. The replacement causes text blocks to line up
      at "tab stops." If no second argument is given, the new
      string will line up at multiples of 8 spaces. A newline
      implies a new set of tab stops.

      >>> import string
      >>> s = 'mary\011had a little lamb'
      >>> print s
      mary    had a little lamb
      >>> string.expandtabs(s, 16)
      'mary            had a little lamb'
      >>> string.expandtabs(tabsize=1, s=s)
      'mary had a little lamb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary\011had a little lamb'.expandtabs(25)
      'mary                     had a little lamb'

  string.find(s, sub [,start [,end]])
  "".find(sub [,start [,end]])
      Return the index position of the first occurrence of 'sub'
      in 's'. If the optional third or fourth arguments are
      specified, only the corresponding slice of 's' is examined
      (but result is position in s as a whole). Return -1 if
      no occurrence is found. Position is zero-based, as with
      Python list indexing:

      >>> import string
      >>> string.find("mary had a little lamb", "a")
      1
      >>> string.find("mary had a little lamb", "a", 3, 10)
      6
      >>> string.find("mary had a little lamb", "b")
      21
      >>> string.find("mary had a little lamb", "b", 3, 10)
      -1

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary had a little lamb'.find("ad")
      6

      SEE ALSO, `string.index()`, `string.rfind()`

  string.index(s, sub [,start [,end]])
  "".index(sub [,start [,end]])
      Return the same value as does `string.find()` with same
      arguments, except raise 'ValueError' instead of returning
      -1 when sub does not occur in s.

      >>> import string
      >>> string.index("mary had a little lamb", "b")
      21
      >>> string.index("mary had a little lamb", "b", 3, 10)
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
        File "d:/py20sl/lib/string.py", line 139, in index
          return s.index(*args)
      ValueError: substring not found in string.index

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary had a little lamb'.index("ad")
      6

      SEE ALSO, `string.find()`, `string.rindex()`

  Several string methods return Boolean values indicating whether
  a string has a certain property. None of the '.is*()' methods,
  however, have equivalents in the [string] module:

  "".isalpha()
      Return a true value if all the characters are alphabetic.

  "".isalnum()
      Return a true value if all the characters are alphanumeric.

  "".isdigit()
      Return a true value if all the characters are digits.

  "".islower()
      Return a true value if all the characters are lowercase
      and there is at least one cased character:

      >>> "ab123".islower(), '123'.islower(), 'Ab123'.islower()
      (1, 0, 0)

      SEE ALSO, `"".lower()`

  "".isspace()
      Return a true value if all the characters are whitespace.

  "".istitle()
      Return a true value if all the string has title casing
      (each word capitalized).

      SEE ALSO, `"".title()`

  "".isupper()
      Return a true value if all the characters are uppercase
      and there is at least one cased character.

      SEE ALSO, `"".upper()`

  string.join(words=... [,sep=" "])
  "".join(words)
      Return a string that results from concatenating the
      elements of the list 'words' together, with 'sep' between
      each. The function `string.join()` differs from all
      other [string] module functions in that it takes a list
      (of strings) as a primary argument, rather than a string.

      It is worth noting `string.join()` and `string.split()`
      are inverse functions if 'sep' is specified to both; in
      other words, 'string.join(string.split(s,sep),sep)==s'
      for all 's' and 'sep'.
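
      For instance:

      >>> import string
      >>> s = "a, b,, c"
      >>> string.join(string.split(s, ','), ',') == s
      1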

      Typically, `string.join()` is used in contexts where it
      is natural to generate lists of strings. For example,
      here is a small program to output the list of
      all-capital words from STDIN to STDOUT, one per line:

      #---------- list_capwords.py ----------#
      import string,sys
      capwords = []

      for line in sys.stdin.readlines():
          for word in line.split():
              if word == word.upper() and word.isalpha():
                  capwords.append(word)
      print string.join(capwords, '\n')

      The technique in the sample 'list_capwords.py' script can
      be considerably more efficient than building up a string
      by direct concatenation. However, Python 2.0's augmented
      assignment reduces the performance difference:

      >>> import string
      >>> s = "Mary had a little lamb"
      >>> t = "its fleece was white as snow"
      >>> s = s +" "+ t # relatively "expensive" for big strings
      >>> s += " " + t # "cheaper" than Python 1.x style
      >>> lst = [s]
      >>> lst.append(t) # "cheapest" way of building long string
      >>> s = string.join(lst)

      For Python 1.6+, use of a string object method is
      stylistically preferred in some cases. However, just as
      `string.join()` is special in taking a list as a first
      argument, the string object method `"".join()` is unusual
      in being an operation on the (optional) 'sep' string, not
      on the (required) 'words' list (this surprises many new
      Python programmers).
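
      For example, the list argument and the separator swap
      places between the two spellings:

      >>> import string
      >>> lst = ['spam', 'and', 'eggs']
      >>> string.join(lst, ' ')   # list first, separator second
      'spam and eggs'
      >>> ' '.join(lst)           # method of the separator string
      'spam and eggs'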

      SEE ALSO, `string.split()`

  string.joinfields(...)
      Identical to `string.join()`.

  string.ljust(s=..., width=...)
  "".ljust(width)
      Return a string with 's' padded with trailing spaces (but
      not truncated) to occupy length 'width' (or more).

      >>> import string
      >>> string.ljust(width=30,s="Mary had a little lamb")
      'Mary had a little lamb        '
      >>> string.ljust("Mary had a little lamb", 5)
      'Mary had a little lamb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a little lamb".ljust(25)
      'Mary had a little lamb   '

      SEE ALSO, `string.rjust()`, `string.center()`

  string.lower(s=...)
  "".lower()
      Return a string with any uppercase letters converted to
      lowercase.

      >>> import string
      >>> string.lower("mary HAD a little lamb!")
      'mary had a little lamb!'
      >>> string.lower("Mary had a Little Lamb!")
      'mary had a little lamb!'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a Little Lamb!".lower()
      'mary had a little lamb!'

      SEE ALSO, `string.upper()`

  string.lstrip(s=...)
  "".lstrip([chars=string.whitespace])
      Return a string with leading whitespace characters
      removed. For Python 1.6+, use of a string object method
      is stylistically preferred in many cases:

      >>> import string
      >>> s = """
      ... Mary had a little lamb \011"""
      >>> string.lstrip(s)
      'Mary had a little lamb \011'
      >>> s.lstrip()
      'Mary had a little lamb \011'

      Python 2.3+ accepts the optional argument 'chars' to the
      string object method: all leading characters that occur in
      the string 'chars' will be removed.

      SEE ALSO, `string.rstrip()`, `string.strip()`

  string.maketrans(from, to)
      Return a translation table string, for use with
      `string.translate()`. The strings 'from' and 'to' must
      be the same length. A translation table is a string of
      256 successive byte values, where each position defines a
      translation from the `chr()` value of the index to the
      character contained at that index position.

      >>> import string
      >>> ord('A')
      65
      >>> ord('z')
      122
      >>> string.maketrans('ABC','abc')[65:123]
      'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
      >>> string.maketrans('ABCxyz','abcXYZ')[65:123]
      'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwXYZ'

      SEE ALSO, `string.translate()`

  string.replace(s=..., old=..., new=... [,maxreplace=...])
  "".replace(old, new [,maxreplace])
      Return a string based on 's' with occurrences of 'old'
      replaced by 'new'. If the fourth argument 'maxreplace' is
      specified, only replace 'maxreplace' initial occurrences.

      >>> import string
      >>> string.replace("Mary had a little lamb", "a little", "some")
      'Mary had some lamb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a little lamb".replace("a little", "some")
      'Mary had some lamb'

      A common "trick" involving `string.replace()` is to use
      it multiple times to achieve a goal. Obviously, simply
      to replace several different substrings in a string,
      multiple `string.replace()` operations are almost
      inevitable. But there is another class of cases where
      `string.replace()` can be used to create an intermediate
      string with "placeholders" for an original substring in
      a particular context. The same goal can always be
      achieved with regular expressions, but sometimes staged
      `string.replace()` operations are both faster and easier
      to program:

      >>> import string
      >>> line = 'variable = val # see comments #3 and #4'
      >>> # we'd like '#3' and '#4' spelled out within comment
      >>> string.replace(line,'#','number ') # doesn't work
      'variable = val number see comments number 3 and number 4'
      >>> place_holder=string.replace(line,' # ',' !!! ') # insert placeholder
      >>> place_holder
      'variable = val !!! see comments #3 and #4'
      >>> place_holder=place_holder.replace('#','number ') # almost there
      >>> place_holder
      'variable = val !!! see comments number 3 and number 4'
      >>> line = string.replace(place_holder,'!!!','#') # restore orig
      >>> line
      'variable = val # see comments number 3 and number 4'

      Obviously, for jobs like this, the placeholder must be
      chosen so that it never occurs within the strings
      undergoing "staged transformation"; but that is generally
      achievable, since a placeholder may be made as long as
      needed.

      SEE ALSO, `string.translate()`, `mx.TextTools.replace()`

  string.rfind(s, sub [,start [,end]])
  "".rfind(sub [,start [,end]])
      Return the index position of the last occurrence of 'sub'
      in 's'. If the optional third or fourth arguments are
      specified only the corresponding slice of 's' is examined
      (but result is position in 's' as a whole). Return -1 if
      no occurrence is found. Position is zero-based, as with
      Python list indexing:

      >>> import string
      >>> string.rfind("mary had a little lamb", "a")
      19
      >>> string.rfind("mary had a little lamb", "a", 3, 10)
      9
      >>> string.rfind("mary had a little lamb", "b")
      21
      >>> string.rfind("mary had a little lamb", "b", 3, 10)
      -1

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary had a little lamb'.rfind("ad")
      6

      SEE ALSO, `string.rindex()`, `string.find()`

  string.rindex(s, sub [,start [,end]])
  "".rindex(sub [,start [,end]])
      Return the same value as does `string.rfind()` with same
      arguments, except raise 'ValueError' instead of returning
      -1 when sub does not occur in 's'.

      >>> import string
      >>> string.rindex("mary had a little lamb", "b")
      21
      >>> string.rindex("mary had a little lamb", "b", 3, 10)
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
        File "d:/py20sl/lib/string.py", line 148, in rindex
          return s.rindex(*args)
      ValueError: substring not found in string.rindex

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> 'mary had a little lamb'.rindex("ad")
      6

      SEE ALSO, `string.rfind()`, `string.index()`

  string.rjust(s=..., width=...)
  "".rjust(width)
      Return a string with 's' padded with leading spaces (but
      not truncated) to occupy length 'width' (or more).

      >>> import string
      >>> string.rjust(width=30,s="Mary had a little lamb")
      '        Mary had a little lamb'
      >>> string.rjust("Mary had a little lamb", 5)
      'Mary had a little lamb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a little lamb".rjust(25)
      '   Mary had a little lamb'

      SEE ALSO, `string.ljust()`, `string.center()`

  string.rstrip(s=...)
  "".rstrip()
      Return a string with trailing whitespace characters
      removed. For Python 1.6+, use of a string object method
      is stylistically preferred in many cases:

      >>> import string
      >>> s = """
      ... Mary had a little lamb \011"""
      >>> string.rstrip(s)
      '\012 Mary had a little lamb'
      >>> s.rstrip()
      '\012 Mary had a little lamb'

      Python 2.3+ accepts the optional argument 'chars' to the
      string object method: all trailing characters that occur in
      the string 'chars' will be removed.

      SEE ALSO, `string.lstrip()`, `string.strip()`

  string.split(s=... [,sep=... [,maxsplit=...]])
  "".split([,sep [,maxsplit]])
      Return a list of nonoverlapping substrings of 's'. If the
      second argument 'sep' is specified, the substrings are
      divided around the occurrences of 'sep'. If 'sep' is not
      specified, the substrings are divided around -any-
      whitespace characters. The dividing strings do not
      appear in the resultant list. If the third argument
      'maxsplit' is specified, everything "left over" after
      splitting 'maxsplit' parts is appended to the list,
      giving the list length 'maxsplit'+1.

      >>> import string
      >>> s = 'mary had a little lamb ...with a glass of sherry'
      >>> string.split(s, ' a ')
      ['mary had', 'little lamb ...with', 'glass of sherry']
      >>> string.split(s)
      ['mary', 'had', 'a', 'little', 'lamb', '...with', 'a', 'glass',
      'of', 'sherry']
      >>> string.split(s,maxsplit=5)
      ['mary', 'had', 'a', 'little', 'lamb', '...with a glass of sherry']

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a Little Lamb!".split()
      ['Mary', 'had', 'a', 'Little', 'Lamb!']

      The `string.split()` function (and corresponding string
      object method) is surprisingly versatile for working with
      texts, especially ones that resemble prose. Its default
      behavior of treating all whitespace as a single divider
      allows `string.split()` to act as a quick-and-dirty word
      parser:

      >>> wc = lambda s: len(s.split())
      >>> wc("Mary had a Little Lamb")
      5
      >>> s = """Mary had a Little Lamb
      ... its fleece as white as snow.
      ... And everywhere that Mary went ... the lamb was sure to go."""
      >>> print s
      Mary had a Little Lamb
      its fleece as white as snow.
      And everywhere that Mary went ... the lamb was sure to go.
      >>> wc(s)
      23

      The function `string.split()` is very often used in
      conjunction with `string.join()`. The pattern involved is
      "pull the string apart, modify the parts, put it back
      together." Often the parts will be words, but this also
      works with lines (dividing on '\n') or other chunks. For
      example:

      >>> import string
      >>> s = """Mary had a Little Lamb
      ... its fleece as white as snow.
      ... And everywhere that Mary went ... the lamb was sure to go."""
      >>> string.join(string.split(s))
      'Mary had a Little Lamb its fleece as white as snow. And everywhere
      that Mary went ... the lamb was sure to go.'

      A Python 1.6+ idiom for string object methods expresses
      this technique compactly:

      >>> "-".join(s.split())
      'Mary-had-a-Little-Lamb-its-fleece-as-white-as-snow.-And-everywhere-
      that-Mary-went-...-the-lamb-was-sure-to-go.'

      SEE ALSO, `string.join()`,
                `mx.TextTools.setsplit()`,
                `mx.TextTools.charsplit()`,
                `mx.TextTools.splitat()`,
                `mx.TextTools.splitlines()`

  string.splitfields(...)
      Identical to `string.split()`.

  "".splitlines([keepends=0])
      This string method does not have an equivalent in the
      [string] module. Return a list of lines in the string.
      The optional argument 'keepends' determines whether line
      break character(s) are included in the line strings.
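
      For example:

      >>> 'mary\nhad a\nlittle lamb'.splitlines()
      ['mary', 'had a', 'little lamb']
      >>> 'mary\nhad a\nlittle lamb'.splitlines(1)
      ['mary\n', 'had a\n', 'little lamb']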

  "".startswith(prefix [,start [,end]])
      This string method does not have an equivalent in the
      [string] module. Return a Boolean value indicating whether
      the string begins with the prefix 'prefix'. If the optional
      second argument 'start' is specified, only consider the
      terminal substring after the offset 'start'. If the
      optional third argument 'end' is given, only consider the
      slice '[start:end]'.
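
      For example:

      >>> "mary had a little lamb".startswith("mary")
      1
      >>> "mary had a little lamb".startswith("had", 5)
      1
      >>> "mary had a little lamb".startswith("had", 5, 7)
      0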

      SEE ALSO, `"".endswith()`, `string.find()`

  string.strip(s=...)
  "".strip()
      Return a string with leading and trailing whitespace
      characters removed. For Python 1.6+, use of a string
      object method is stylistically preferred in many cases:

      >>> import string
      >>> s = """
      ... Mary had a little lamb \011"""
      >>> string.strip(s)
      'Mary had a little lamb'
      >>> s.strip()
      'Mary had a little lamb'

      Python 2.3+ accepts the optional argument 'chars' to the
      string object method: all leading and trailing characters
      that occur in the string 'chars' will be removed.

      >>> s = "MARY had a LITTLE lamb STEW"
      >>> s.strip("ABCDEFGHIJKLMNOPQRSTUVWXYZ") # strip caps
      ' had a LITTLE lamb '

      SEE ALSO, `string.rstrip()`, `string.lstrip()`

  string.swapcase(s=...)
  "".swapcase()
      Return a string with any uppercase letters converted to
      lowercase and any lowercase letters converted to uppercase.

      >>> import string
      >>> string.swapcase("mary HAD a little lamb!")
      'MARY had A LITTLE LAMB!'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a Little Lamb!".swapcase()
      'MARY had A LITTLE LAMB!'

      SEE ALSO, `string.upper()`, `string.lower()`

  string.translate(s=..., table=... [,deletechars=""])
  "".translate(table [,deletechars=""])
      Return a string, based on 's', with 'deletechars' deleted
      (if third argument is specified) and with any remaining
      characters translated according to translation 'table'.

      >>> import string
      >>> tab = string.maketrans('ABC','abc')
      >>> string.translate('MARY HAD a little LAMB', tab, 'Atl')
      'MRY HD a ie LMb'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases. However, if
      `string.maketrans()` is used to create the translation
      table, one will need to import the [string] module
      anyway:

      >>> 'MARY HAD a little LAMB'.translate(tab, 'Atl')
      'MRY HD a ie LMb'

      The `string.translate()` function is a -very- fast way to
      modify a string. Setting up the translation table takes
      some getting used to, but the resultant transformation is
      much faster than a procedural technique such as:

      >>> (new,frm,to,dlt) = ("",'ABC','abc','Atl')
      >>> for c in 'MARY HAD a little LAMB':
      ...     if c not in dlt:
      ...         pos = frm.find(c)
      ...         if pos == -1: new += c
      ...         else: new += to[pos]
      ...
      >>> new
      'MRY HD a ie LMb'

      SEE ALSO, `string.maketrans()`

  string.upper(s=...)
  "".upper()
      Return a string with any lowercase letters converted to
      uppercase.

      >>> import string
      >>> string.upper("mary HAD a little lamb!")
      'MARY HAD A LITTLE LAMB!'
      >>> string.upper("Mary had a Little Lamb!")
      'MARY HAD A LITTLE LAMB!'

      For Python 1.6+, use of a string object method is
      stylistically preferred in many cases:

      >>> "Mary had a Little Lamb!".upper()
      'MARY HAD A LITTLE LAMB!'

      SEE ALSO, `string.lower()`

  string.zfill(s=..., width=...)
      Return a string with 's' padded with leading zeros (but
      not truncated) to occupy length 'width' (or more). If a
      leading sign is present, it "floats" to the beginning of
      the return value. In general, `string.zfill()` is
      designed for alignment of numeric values, but no checking
      is done that a string looks number-like.

      >>> import string
      >>> string.zfill("this", 20)
      '0000000000000000this'
      >>> string.zfill("-37", 20)
      '-0000000000000000037'
      >>> string.zfill("+3.7", 20)
      '+00000000000000003.7'

      Based on the example of `string.rjust()`, one might
      expect a string object method `"".zfill()`; however,
      no such method exists.

      SEE ALSO, `string.rjust()`


  TOPIC -- Strings as Files, and Files as Strings
  --------------------------------------------------------------------

  In many ways, strings and files do a similar job. Both provide a
  storage container for an unlimited amount of (textual)
  information that is directly structured only by linear position
  of the bytes. A first inclination is to suppose that the
  difference between files and strings is one of persistence--files
  hang around when the current program is no longer running. But
  that distinction is not really tenable. On the one hand, standard
  Python modules like [shelve], [pickle], and [marshal]--and
  third-party modules like [xml_pickle] and [ZODB]--provide simple
  ways of making strings persist (but not thereby correspond in any
  direct way to a filesystem). On the other hand, many files are
  not particularly persistent: Special files like STDIN and STDOUT
  under Unix-like systems exist only for program life; other
  peculiar files like '/dev/cua0' and similar "device files" are
  really just streams; and even files that live on transient memory
  disks, or get deleted with program cleanup, are not very
  persistent.

  The real difference between files and strings in Python is no
  more or less than the set of techniques available to operate on
  them. File objects can do things like '.read()' and '.seek()' on
  themselves. Notably, file objects have a concept of a "current
  position" that emulates an imaginary "read-head" passing over the
  physical storage media. Strings, on the other hand, can be sliced
  and indexed--for example 'str[4:10]' or 'for c in str:'--and can
  be processed with string object methods and by functions of
  modules like [string] and [re]. Moreover, a number of
  special-purpose Python objects act "file-like" without quite
  being files; for example `gzip.open()` and `urllib.urlopen()`. Of
  course, Python itself does not impose any strict condition for
  just how "file-like" something has to be to work in a file-like
  context. A programmer has to figure that out for each type of
  object she wishes to apply techniques to (but most of the time
  things "just work" right).

  Happily, Python provides some standard modules to make files
  and strings easily interoperable.

  =================================================================
    MODULE -- mmap : Memory-mapped file support
  =================================================================

  The [mmap] module allows a programmer to create "memory-mapped"
  file objects. These special [mmap] objects enable most of the
  techniques you might apply to "true" file objects and
  simultaneously most of the techniques one might apply to "true"
  strings. Keep in mind the hinted caveat about "most," however:
  Many [string] module functions are implemented using the
  corresponding string object methods. Since a [mmap] object is
  only somewhat "string-like," it basically only implements the
  '.find()' method and those "magic" methods associated with
  slicing and indexing. This is enough to support most string
  object idioms.

  When a string-like change is made to a [mmap] object, that change
  is propagated to the underlying file, and the change is
  persistent (assuming the underlying file is persistent, and that
  '.flush()' is called on the object before destruction). [mmap]
  thereby
  provides an efficient route to "persistent strings."

  Some examples of working with memory-mapped file objects are
  worth looking at:

      >>> # Create a file with some test data
      >>> open('test','w').write(' #'.join(map(str, range(1000))))
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(),1000)
      >>> len(mm)
      1000
      >>> mm[-20:]
      '218 #219 #220 #221 #'
      >>> import string # apply a string module method
      >>> mm.seek(string.find(mm, '21'))
      >>> mm.read(10)
      '21 #22 #23'
      >>> mm.read(10) # next ten bytes
      ' #24 #25 #'
      >>> mm.find('21') # object method to find next occurrence
      402
      >>> try: string.rfind(mm, '21')
      ... except AttributeError: print "Unsupported string function"
      ...
      Unsupported string function
      >>> import re
      >>> '/'.join(re.findall('..21..',mm)) # regexes work nicely
      ' #21 #/#121 #/ #210 / #212 / #214 / #216 / #218 /#221 #'

  It is worth emphasizing that the bytes in a file on disk are in
  fixed positions. You may use the `mmap.mmap.resize()` method
  to write into different portions of a file, but you cannot
  expand the file from the middle, only by adding to the end.
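
  For example, continuing the session above, slice assignment
  overwrites bytes in place, but only with a string of exactly
  the slice's length (a small sketch; the exact error message
  may vary between Python versions):

      >>> mm[0:4] = 'XXXX'       # same-length overwrite works
      >>> mm[:10]
      'XXXX #2 #3'
      >>> try: mm[0:4] = 'XY'    # wrong length is an error
      ... except IndexError: print "Length must match!"
      ...
      Length must match!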

  CLASSES:

  mmap.mmap(fileno, length [,tagname]) (Windows)
  mmap.mmap(fileno, length [,flags=MAP_SHARED,
            prot=PROT_READ|PROT_WRITE]) (Unix)
      Create a new memory-mapped file object. 'fileno' is the
      numeric file handle to base the mapping on. Generally this
      number should be obtained using the '.fileno()' method of a
      file object. 'length' specifies the length of the mapping.
      Under Windows, the value 0 may be given for 'length' to
      specify the current length of the file. If a 'length'
      smaller than the current file size is specified, only the
      initial portion of the file is mapped. If a 'length'
      larger than the current file size is specified, the file
      can be extended with additional string content.

      The underlying file for a memory-mapped file object must be
      opened for updating, using the "+" mode modifier.

      According to the official Python documentation for Python
      2.1, a third argument 'tagname' may be specified. If
      it is, multiple memory-maps against the same file are
      created. In practice, however, each instance of
      `mmap.mmap()` creates a new memory-map whether or not a
      'tagname' is specified. In any case, this allows multiple
      file-like updates to the same underlying file, generally at
      different positions in the file.

      >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm1 = mmap.mmap(fp.fileno(),1000)
      >>> mm2 = mmap.mmap(fp.fileno(),1000)
      >>> mm1.seek(500)
      >>> mm1.read(10)
      '122 #123 #'
      >>> mm2.read(10)
      '0 #1 #2 #3'

      Under Unix, the third argument 'flags' may be MAP_PRIVATE
      or MAP_SHARED. If MAP_SHARED is specified for 'flags', all
      processes mapping the file will see the changes made to a
      [mmap] object. Otherwise, the changes are restricted to
      the current process. The fourth argument, 'prot', may be
      used to disallow certain types of access by other processes
      to the mapped file regions.

  METHODS:

  mmap.mmap.close()
      Close the memory-mapped file object. Subsequent calls to
      the other methods of the [mmap] object will raise an
      exception. Under Windows, the behavior of a [mmap] object
      after '.close()' is somewhat erratic, however. Note that
      closing the memory-mapped file object is not the same as
      closing the underlying file object. Closing the underlying
      file will make the contents inaccessible, but closing the
      memory-mapped file object will not affect the underlying
      file object.

      SEE ALSO, `FILE.close()`

  mmap.mmap.find(sub [,pos])
      Similar to `string.find()`. Return the index position of
      the first occurrence of 'sub' in the [mmap] object. If the
      optional second argument 'pos' is specified, the result is
      the offset returned relative to 'pos'. Return -1 if no
      occurrence is found:

      >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(), 0)
      >>> mm.find('21')
      74
      >>> mm.find('21',100)
      -26
      >>> mm.tell()
      0

      SEE ALSO, `mmap.mmap.seek()`, `string.find()`

  mmap.mmap.flush([offset, size])
      Write changes made in memory to the [mmap] object back to
      disk. The first argument 'offset' and second argument
      'size' must either both be specified or both omitted. If
      'offset' and 'size' are specified, only the segment
      starting at 'offset' and extending 'size' bytes is written
      back to disk.

      `mmap.mmap.flush()` is necessary to guarantee that changes
      are written to disk; however, no guarantee is given that
      changes -will not- be written to disk as part of normal
      Python interpreter housekeeping. [mmap] should not be used
      for systems with "cancelable" changes (since changes may
      not be cancelable).

      SEE ALSO, `FILE.flush()`

  mmap.mmap.move(target, source, length)
      Copy a substring within a memory-mapped file object. The
      length of the substring is the third argument 'length'. The
      target location is the first argument 'target'. The
      substring is copied from the position 'source'. It is
      allowable to have the substring's original position overlap
      its target range, but it must not go past the last position
      of the [mmap] object.

      >>> open('test','w').write(''.join([c*10 for c in 'ABCDE']))
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(),0)
      >>> mm[:]
      'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEE'
      >>> mm.move(40,0,5)
      >>> mm[:]
      'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDAAAAAEEEEE'

  mmap.mmap.read(num)
      Return a string containing 'num' bytes, starting at the
      current file position. The file position is moved to the
      end of the read string. In contrast to the '.read()'
      method of file objects, `mmap.mmap.read()` always requires
      that a byte count be specified, which makes a memory-map
      file object not fully substitutable for a file object when
      data is read. However, the following is safe for both true
      file objects and memory-mapped file objects:

      >>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(),0)
      >>> def safe_readall(file):
      ... try:
      ... length = len(file)
      ... return file.read(length)
      ... except TypeError:
      ... return file.read()
      ...
      >>> s1 = safe_readall(fp)
      >>> s2 = safe_readall(mm)
      >>> s1 == s2
      1

      SEE ALSO, `mmap.mmap.read_byte()`, `mmap.mmap.readline()`,
                `mmap.mmap.write()`, `FILE.read()`

  mmap.mmap.read_byte()
      Return a one-byte string from the current file position
      and advance the current position by one. Same as
      'mmap.mmap.read(1)'.

      SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.readline()`

  mmap.mmap.readline()
      Return a string from the memory-mapped file object,
      starting from the current file position and going to the
      next newline character. Advance the current file position
      by the amount read.
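
      A short example, in the style of the other sessions here:

      >>> open('test','w').write('spam\neggs\nand ham\n')
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(), 0)
      >>> mm.readline()
      'spam\n'
      >>> mm.readline()
      'eggs\n'
      >>> mm.tell()
      10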

      SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.read_byte()`,
                `FILE.readline()`

  mmap.mmap.resize(newsize)
      Change the size of a memory-mapped file object. This may
      be used to expand the size of an underlying file or merely
      to expand the area of a file that is memory-mapped. An
      expanded file is padded with null bytes ('\000') unless
      otherwise filled with content. As with other operations on
      [mmap] objects, changes to the underlying file system may
      not occur until a '.flush()' is performed.

      SEE ALSO, `mmap.mmap.flush()`

  mmap.mmap.seek(offset [,mode])
      Change the current file position. If a second argument
      'mode' is given, a different seek mode can be selected.
      The default is 0, absolute file positioning. Mode 1 seeks
      relative to the current file position. Mode 2 is relative
      to the end of the memory-mapped file (which may be smaller
      than the whole size of the underlying file). The first
      argument 'offset' specifies the distance to move the
      current file position--in mode 0 it should be positive, in
      mode 2 it should be negative, in mode 1 the current
      position can be moved either forward or backward.
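
      For example:

      >>> open('test','w').write('X'*100)
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(), 0)
      >>> mm.seek(10)       # mode 0: absolute position
      >>> mm.seek(5, 1)     # mode 1: relative to current position
      >>> mm.tell()
      15
      >>> mm.seek(-20, 2)   # mode 2: relative to end of map
      >>> mm.tell()
      80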

      SEE ALSO, `FILE.seek()`

  mmap.mmap.size()
      Return the length of the underlying file. The size of the
      actual memory-map may be smaller if less than the whole
      file is mapped:

      >>> open('test','w').write('X'*100)
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(),50)
      >>> mm.size()
      100
      >>> len(mm)
      50

      SEE ALSO, `len()`, `mmap.mmap.seek()`, `mmap.mmap.tell()`

  mmap.mmap.tell()
      Return the current file position.

      >>> open('test','w').write('X'*100)
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(), 0)
      >>> mm.tell()
      0
      >>> mm.seek(20)
      >>> mm.tell()
      20
      >>> mm.read(20)
      'XXXXXXXXXXXXXXXXXXXX'
      >>> mm.tell()
      40

      SEE ALSO, `FILE.tell()`, `mmap.mmap.seek()`

  mmap.mmap.write(s)
      Write 's' into the memory-mapped file object at the current
      file position. The current file position is updated to the
      position following the write. The method
      `mmap.mmap.write()` is useful for functions that expect to
      be passed a file-like object with a '.write()' method.
      However, for new code, it is generally more natural to use
      the string-like index and slice operations to write
      contents. For example:

      >>> open('test','w').write('X'*50)
      >>> fp = open('test','r+')
      >>> import mmap
      >>> mm = mmap.mmap(fp.fileno(), 0)
      >>> mm.write('AAAAA')
      >>> mm.seek(10)
      >>> mm.write('BBBBB')
      >>> mm[30:35] = 'SSSSS'
      >>> mm[:]
      'AAAAAXXXXXBBBBBXXXXXXXXXXXXXXXSSSSSXXXXXXXXXXXXXXX'
      >>> mm.tell()
      15

      SEE ALSO, `FILE.write()`, `mmap.mmap.read()`

  mmap.mmap.write_byte(c)
      Write a one-byte string to the current file position,
      and advance the current position by one. Same as
      'mmap.mmap.write(c)' where 'c' is a one-byte string.

      SEE ALSO, `mmap.mmap.write()`

  =================================================================
  MODULE -- StringIO : File-like objects that read from or
                       write to a string buffer
=================================================================
    MODULE -- cStringIO : Fast, but incomplete, StringIO
                          replacement
  =================================================================

  The [StringIO] and [cStringIO] modules allow a programmer to
  create "memory files," that is, "string buffers." These special
  [StringIO] objects enable most of the techniques you might apply
  to "true" file objects, but without any connection to a
  filesystem.

  The most common use of string buffer objects is when some
  existing techniques for working with byte-streams in files are to
  be applied to strings that do not come from files. A string
  buffer object behaves in a file-like manner and can "drop in" to
  most functions that want file objects.

  [cStringIO] is much faster than [StringIO] and should be used in
  most cases. Both modules provide a 'StringIO' class whose
  instances are the string buffer objects. `cStringIO.StringIO`
  cannot be subclassed (and therefore cannot provide additional
  methods), and it cannot handle Unicode strings. One rarely needs
  to subclass [StringIO], but the absence of Unicode support in
  [cStringIO] could be a problem for many developers. As well, a
  [cStringIO] buffer that is initialized with a string is
  read-only: It does not support write operations (see the
  CLASSES discussion below), although the effect of a write
  against such an in-memory file can be accomplished by normal
  string operations.

  A string buffer object may be initialized with a string (or
  Unicode for [StringIO]) argument. If so, that is the initial
  content of the buffer. Below are examples of usage (including
  Unicode handling):

      >>> from cStringIO import StringIO as CSIO
      >>> from StringIO import StringIO as SIO
      >>> alef, omega = unichr(1488), unichr(969)
      >>> sentence = "In set theory, the Greek "+omega+" represents the ordinal \n"+\
      ...     "limit of the integers, while the Hebrew\n"+\
      ...     alef+" represents their cardinality."
      >>> sio = SIO(sentence)
      >>> try:
      ... csio = CSIO(sentence)
      ... print "New string buffer from raw string"
      ... except TypeError:
      ... csio = CSIO(sentence.encode('utf-8'))
      ... print "New string buffer from ENCODED string"
      ...
      New string buffer from ENCODED string
      >>> sio.getvalue() == unicode(csio.getvalue(),'utf-8')
      1
      >>> try:
      ... sio.getvalue() == csio.getvalue()
      ... except UnicodeError:
      ... print "Cannot even compare Unicode with string, in general"
      ...
      Cannot even compare Unicode with string, in general
      >>> lines = csio.readlines()
      >>> len(lines)
      3
      >>> sio.seek(0)
      >>> print sio.readline().encode('utf-8'),
      In set theory, the Greek ω represents the ordinal
      >>> sio.tell(), csio.tell()
      (51, 124)

  CONSTANTS:

  cStringIO.InputType
      The type of a `cStringIO.StringIO` instance that has been
      opened in "read" mode. All `StringIO.StringIO` instances
      are simply InstanceType.

      SEE ALSO, `cStringIO.StringIO`

  cStringIO.OutputType
      The type of `cStringIO.StringIO` instance that has been
      opened in "write" mode (actually read/write). All
      `StringIO.StringIO` instances are simply InstanceType.

      SEE ALSO, `cStringIO.StringIO`

  CLASSES:

  StringIO.StringIO([buf=...])
  cStringIO.StringIO([buf])
      Create a new string buffer. If the first argument 'buf' is
      specified, the buffer is initialized with a string content.
      If the [cStringIO] module is used, the presence of the 'buf'
      argument determines whether write access to the buffer is
      enabled. A `cStringIO.StringIO` buffer with write access
      must be initialized with no argument, otherwise it becomes
      read-only. A `StringIO.StringIO` buffer, however, is
      always read/write.
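
      For example, a sketch of the difference (a read-only
      [cStringIO] input object simply lacks a '.write()' method,
      hence the AttributeError):

      >>> from cStringIO import StringIO
      >>> ro = StringIO('spam and eggs')   # initialized -> read-only
      >>> try: ro.write('more spam')
      ... except AttributeError: print "Read-only buffer!"
      ...
      Read-only buffer!
      >>> rw = StringIO()                  # no argument -> writable
      >>> rw.write('spam and eggs')
      >>> rw.getvalue()
      'spam and eggs'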

  METHODS:

  StringIO.StringIO.close()
  cStringIO.StringIO.close()
      Close the string buffer. No access is permitted after close.

      SEE ALSO, `FILE.close()`

  StringIO.StringIO.flush()
  cStringIO.StringIO.flush()
      Compatibility method for file-like behavior. Data in a string
      buffer is already in memory, so there is no need to finalize a
      write to disk.

      SEE ALSO, `FILE.flush()`

  StringIO.StringIO.getvalue()
  cStringIO.StringIO.getvalue()
      Return the entire string held by the string buffer. Does
      not affect the current file position. Basically, this is
      the way you convert back from a string buffer to a string.
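
      For example:

      >>> from StringIO import StringIO
      >>> sio = StringIO()
      >>> sio.write('spam and eggs')
      >>> sio.getvalue()
      'spam and eggs'
      >>> sio.tell()       # file position is not affected
      13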

  StringIO.StringIO.isatty()
  cStringIO.StringIO.isatty()
      Return 0. Compatibility method for file-like behavior.

      SEE ALSO, `FILE.isatty()`

  StringIO.StringIO.read([num])
  cStringIO.StringIO.read([num])
      If the first argument 'num' is specified, return a string
      containing the next 'num' characters. If 'num' characters
      are not available, return as many as possible. If 'num' is
      not specified, return all the characters from current file
      position to end of string buffer. Advance the current file
      position by the amount read.

      SEE ALSO, `FILE.read()`, `mmap.mmap.read()`,
      `StringIO.StringIO.readline()`

  StringIO.StringIO.readline([length=...])
  cStringIO.StringIO.readline([length])
      Return a string from the string buffer, starting from the
      current file position and going to the next newline
      character. Advance the current file position by the amount
      read.

      SEE ALSO, `mmap.mmap.readline()`,
                `StringIO.StringIO.read()`,
                `StringIO.StringIO.readlines()`,
                `FILE.readline()`

  StringIO.StringIO.readlines([sizehint=...])
  cStringIO.StringIO.readlines([sizehint])
      Return a list of strings from the string buffer. Each
      list element consists of a single line, including the
      trailing newline character(s). If an argument 'sizehint'
      is specified, only read approximately 'sizehint' characters
      worth of lines (full lines will always be read).

      SEE ALSO, `StringIO.StringIO.readline()`,
                `FILE.readlines()`

  cStringIO.StringIO.reset()
      Set the current file position to the beginning of the
      string buffer. Same as 'cStringIO.StringIO.seek(0)'.

      SEE ALSO, `StringIO.StringIO.seek()`

  StringIO.StringIO.seek(offset [,mode=0])
  cStringIO.StringIO.seek(offset [,mode])
      Change the current file position. If the second argument
      'mode' is given, a different seek mode can be selected.
      The default is 0, absolute file positioning. Mode 1 seeks
      relative to the current file position. Mode 2 is relative
      to the end of the string buffer. The first argument
      'offset' specifies the distance to move the current file
      position--in mode 0 it should be positive, in mode 2 it
      should be negative, in mode 1 the current position can be
      moved either forward or backward.

      SEE ALSO, `FILE.seek()`, `mmap.mmap.seek()`

  StringIO.StringIO.tell()
  cStringIO.StringIO.tell()
      Return the current file position in the string buffer.

      SEE ALSO, `StringIO.StringIO.seek()`

  StringIO.StringIO.truncate([len=0])
  cStringIO.StringIO.truncate([len])
      Reduce the length of the string buffer to the first
      argument 'len' characters. Truncate can only reduce
      characters later than the current file position (an initial
      'cStringIO.StringIO.reset()' can be used to assure
      truncation from the beginning).
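
      A short example (using the pure-Python [StringIO] variant):

      >>> from StringIO import StringIO
      >>> sio = StringIO('spam and eggs')
      >>> sio.seek(0)
      >>> sio.truncate(4)
      >>> sio.getvalue()
      'spam'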

      SEE ALSO, `StringIO.StringIO.seek()`,
                `cStringIO.StringIO.reset()`,
                `StringIO.StringIO.close()`

  StringIO.StringIO.write(s=...)
  cStringIO.StringIO.write(s)
      Write the first argument 's' into the string buffer at the
      current file position. The current file position is
      updated to the position following the write.

      SEE ALSO, `StringIO.StringIO.writelines()`,
                `mmap.mmap.write()`, `StringIO.StringIO.read()`,
                `FILE.write()`

  StringIO.StringIO.writelines(list=...)
  cStringIO.StringIO.writelines(list)
      Write each element of 'list' into the string buffer at the
      current file position. The current file position is
      updated to the position following the write. For the
      [cStringIO] method, 'list' must be an actual list. For the
      [StringIO] method, other sequence types are allowed. To be
      safe, it is best to coerce an argument into an actual list
      first. In either case, 'list' must contain only strings,
      or a 'TypeError' will occur.

      Contrary to what might be expected from the method name,
      `StringIO.StringIO.writelines()` never inserts newline
      characters. For the list elements actually to occupy
      separate lines in the string buffer, each element string
      must already have a newline terminator. Consider the
      following variants on writing a list to a string buffer:

      >>> from StringIO import StringIO
      >>> sio = StringIO()
      >>> lst = [c*5 for c in 'ABC']
      >>> sio.writelines(lst)
      >>> sio.write(''.join(lst))
      >>> sio.write('\n'.join(lst))
      >>> print sio.getvalue()
      AAAAABBBBBCCCCCAAAAABBBBBCCCCCAAAAA
      BBBBB
      CCCCC

      SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.write()`


  TOPIC -- Converting Between Binary and ASCII
  --------------------------------------------------------------------

  The Python standard library provides several modules for
  converting between binary data and 7-bit ASCII. At the low level,
  [binascii] is a C extension to produce fast string conversions.
  At a high level, [base64], [binhex], [quopri], and [uu] provide
  file-oriented wrappers to the facilities in [binascii].

  =================================================================
    MODULE -- base64 : Convert to/from base64 encoding (RFC1521)
  =================================================================

  The [base64] module is a wrapper around the functions
  `binascii.a2b_base64()` and `binascii.b2a_base64()`. As well
  as providing a file-based interface on top of the underlying
  string conversions, [base64] handles the chunking of binary
  files into base64 line blocks and provides for the direct
  encoding of arbitrary input strings. Unlike [uu], [base64]
  adds no content headers to encoded data; MIME standards for
  headers and message-wrapping are handled by other modules that
  utilize [base64]. Base64 encoding is specified in RFC1521.

  FUNCTIONS:

  base64.encode(input=..., output=...)
      Encode the contents of the first argument 'input' to the
      second argument 'output'. Arguments 'input' and 'output'
      should be file-like objects; 'input' must be readable and
      'output' must be writable.

  base64.encodestring(s=...)
      Return the base64 encoding of the string passed in the
      first argument 's'.

  base64.decode(input=..., output=...)
      Decode the contents of the first argument 'input' to the
      second argument 'output'. Arguments 'input' and 'output'
      should be file-like objects; 'input' must be readable and
      'output' must be writable.

  base64.decodestring(s=...)
      Return the decoding of the base64-encoded string passed in
      the first argument 's'.
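
  For example, the string-oriented functions make a simple round
  trip (`base64.encodestring()` appends a newline after the final
  encoding block):

      >>> import base64
      >>> base64.encodestring('Mary had a little lamb')
      'TWFyeSBoYWQgYSBsaXR0bGUgbGFtYg==\n'
      >>> base64.decodestring('TWFyeSBoYWQgYSBsaXR0bGUgbGFtYg==\n')
      'Mary had a little lamb'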


  SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes],
    `MimeWriter`, `mimify`, [binascii], [quopri]

  =================================================================
    MODULE -- binascii : Convert between binary data and ASCII
  =================================================================

  The [binascii] module is a C implementation of a number of
  styles of ASCII encoding of binary data. Each function in the
  [binascii] module takes either encoded ASCII or raw binary
  strings as an argument, and returns the string result of
  converting back or forth. Some restrictions apply to the
  length of strings passed to some functions in the module (for
  encodings that operate on specific block sizes).

  FUNCTIONS:

  binascii.a2b_base64(s)
      Return the decoded version of a base64-encoded string.
      A string consisting of one or more encoding blocks should
      be passed as the argument 's'.

  binascii.a2b_hex(s)
      Return the decoded version of a hexadecimal-encoded string.
      A string consisting of an even number of hexadecimal
      digits should be passed as the argument 's'.

  binascii.a2b_hqx(s)
      Return the decoded version of a binhex-encoded string.
      A string containing a complete number of encoded binary
      bytes should be passed as the argument 's'.

  binascii.a2b_qp(s [,header=0])
      Return the decoded version of a quoted printable string.
      A string containing a complete number of encoded binary
      bytes should be passed as the argument 's'. If the
      optional argument 'header' is specified, underscores will
      be decoded as spaces. New to Python 2.2.

  binascii.a2b_uu(s)
      Return the decoded version of a UUencoded string. A string
      consisting of exactly one encoding block should be passed
      as the argument 's' (for a full block, 62 bytes input, 45
      bytes returned).

  binascii.b2a_base64(s)
      Return the base64 encoding of a binary string (including
      the newline after block). A binary string no longer than
      57 bytes should be passed as the argument 's'.

  binascii.b2a_hex(s)
      Return the hexadecimal encoding of a binary string. A
      binary string of any length should be passed as the
      argument 's'.
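
      For example:

      >>> import binascii
      >>> binascii.b2a_hex('spam')
      '7370616d'
      >>> binascii.a2b_hex('7370616d')
      'spam'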

  binascii.b2a_hqx(s)
      Return the binhex4 encoding of a binary string. A
      binary string of any length should be passed as the
      argument 's'. Run-length compression of 's' is not
      performed by this function (use `binascii.rlecode_hqx()`
      first, if needed).

  binascii.b2a_qp(s [,quotetabs=0 [,istext=1 [header=0]]])
      Return the quoted printable encoding of a binary string. A
      binary string of any length should be passed as the argument
      's'. The optional argument 'quotetabs' specifies whether
      to escape spaces and tabs; 'istext' specifies -not- to
      escape newlines; 'header' specifies whether to encode
      spaces as underscores (and escape underscores). New to
      Python 2.2.

  binascii.b2a_uu(s)
      Return the UUencoding of a binary string (including
      the initial block specifier--"M" for full blocks--and
      newline after block). A binary string no longer than 45
      bytes should be passed as the argument 's'.

  binascii.crc32(s [,crc])
      Return the CRC32 checksum of the first argument 's'. If
      the second argument 'crc' is specified, it will be used as
      an initial checksum. This allows partial computation of a
      checksum and continuation. For example:

      >>> import binascii
      >>> crc = binascii.crc32('spam')
      >>> binascii.crc32(' and eggs', crc)
      739139840
      >>> binascii.crc32('spam and eggs')
      739139840

  binascii.crc_hqx(s, crc)
      Return the binhex4 checksum of the first argument 's',
      using initial checksum value in second argument. This
      allows partial computation of a checksum and continuation.
      For example:

      >>> import binascii
      >>> binascii.crc_hqx('spam and eggs', 0)
      17918
      >>> crc = binascii.crc_hqx('spam', 0)
      >>> binascii.crc_hqx(' and eggs', crc)
      17918

      SEE ALSO, `binascii.crc32()`

  binascii.hexlify(s)
      Identical to `binascii.b2a_hex()`.

  binascii.rlecode_hqx(s)
      Return the binhex4 run-length encoding (RLE) of first
      argument 's'. Under this RLE technique, '0x90' is used as
      an indicator byte. Independent of the binhex4 standard,
      this is a poor choice of precompression for encoded
      strings.

      SEE ALSO, `zlib.compress()`

  binascii.rledecode_hqx(s)
      Return the expansion of a binhex4 run-length encoded
      string.

  binascii.unhexlify(s)
      Identical to `binascii.a2b_hex()`.

  EXCEPTIONS:

  binascii.Error
      Generic exception that should only result from programming
      errors.

  binascii.Incomplete
      Exception raised when a data block is incomplete. Usually
      this results from programming errors in reading blocks, but
      it could indicate data or channel corruption.

  SEE ALSO, [base64], [binhex], [uu]

  =================================================================
    MODULE -- binhex : Encode and decode binhex4 files
  =================================================================

  The [binhex] module is a wrapper around the functions
  `binascii.a2b_hqx()`, `binascii.b2a_hqx()`,
  `binascii.rlecode_hqx()`, `binascii.rledecode_hqx()`, and
  `binascii.crc_hqx()`. As well as providing a file-based
  interface on top of the underlying string conversions, [binhex]
  handles run-length encoding of encoded files and attaches the
  needed header and footer information. Under MacOS, the
  resource fork of a file is encoded along with the data fork
  (not applicable under other platforms).

  FUNCTIONS:

  binhex.binhex(inp=..., out=...)
      Encode the contents of the first argument 'inp' to the
      second argument 'out'. Argument 'inp' is a filename;
      'out' may be either a filename or a file-like object.
      However, a `cStringIO.StringIO` object is not "file-like"
      enough since it will be closed after the conversion--and
      therefore, its value lost. You could override the
      '.close()' method in a subclass of `StringIO.StringIO` to
      solve this limitation.

  binhex.hexbin(inp=... [,out=...])
      Decode the contents of the first argument to an output
      file. If the second argument 'out' is specified, it will
      be used as the output filename, otherwise the filename
      will be taken from the binhex header. The argument 'inp'
      may be either a filename or a file-like object.

  CLASSES:

  A number of internal classes are used by [binhex]. They are
  not documented here, but can be examined in
  '$PYTHONHOME/lib/binhex.py' if desired (it is unlikely readers
  will need to do this).

  SEE ALSO, [binascii]

  =================================================================
    MODULE -- quopri : Convert to/from quoted printable encoding (RFC1521)
  =================================================================

  The [quopri] module is a wrapper around the functions
  `binascii.a2b_qp()` and `binascii.b2a_qp()`. The module
  [quopri] has the same methods as [base64]. Unlike [uu], [quopri]
  adds no content headers to encoded data; MIME standards for
  headers and message wrapping are handled by other modules that
  utilize [quopri]. Quoted printable encoding is specified in
  RFC1521.

  FUNCTIONS:

  quopri.encode(input, output, quotetabs)
      Encode the contents of the first argument 'input' to the
      second argument 'output'. Arguments 'input' and 'output'
      should be file-like objects; 'input' must be readable and
      'output' must be writable. If 'quotetabs' is a true
      value, escape tabs and spaces.

  quopri.encodestring(s [,quotetabs=0])
      Return the quoted printable encoding of the string passed
      in the first argument 's'. If 'quotetabs' is a true value,
      escape tabs and spaces.

  quopri.decode(input=..., output=... [,header=0])
      Decode the contents of the first argument 'input' to the
      second argument 'output'. Arguments 'input' and 'output'
      should be file-like objects; 'input' must be readable and
      'output' must be writable. If 'header' is a true value,
      encode spaces as underscores and escape underscores.

  quopri.decodestring(s [,header=0])
      Return the decoding of the quoted printable string passed
      in the first argument 's'. If 'header' is a true value,
      decode underscores as spaces.
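
  For example, a simple round trip through the string-oriented
  functions (a sketch; these functions assume the Python 2.2+
  [quopri] interfaces, and no newline is appended to an
  unterminated final line):

      >>> import quopri
      >>> quopri.encodestring('spam = "spam"')
      'spam =3D "spam"'
      >>> quopri.decodestring('spam =3D "spam"')
      'spam = "spam"'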

  SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes],
    `MimeWriter`, `mimify`, [binascii], [base64]

  =================================================================
    MODULE -- uu : UUencode and UUdecode files
  =================================================================

  The [uu] module is a wrapper around the functions
  `binascii.a2b_uu()` and `binascii.b2a_uu()`. As well as
  providing a file-based interface on top of the underlying
  string conversions, [uu] handles the chunking of binary files
  into UUencoded line blocks and attaches the needed header and
  footer.

  FUNCTIONS:

  uu.encode(in, out [,name=... [,mode=0666]])
      Encode the contents of the first argument 'in' to the
      second argument 'out'. Arguments 'in' and 'out' should be
      file objects, but filenames are also accepted (the latter
      is deprecated). The special filename "-" can be used to
      specify STDIN or STDOUT, as appropriate. When file objects
      are passed as arguments, 'in' must be readable and 'out'
      must be writable. The third argument 'name' can be used to
      specify the filename that appears in the UUencoding header;
      by default it is the name of 'in'. The fourth argument
      'mode' is the octal filemode to store in the UUencoding
      header.

  uu.decode(in [,out_file=... [,mode=...]])
      Decode the contents of the first argument 'in' to an output
      file. If the second argument 'out_file' is specified, it
      will be used as the output file; otherwise, the filename
      will be taken from the UUencoding header. Arguments 'in'
      and 'out_file' should be file objects, but filenames are
      also accepted (the latter is deprecated). If the third
      argument 'mode' is specified (and if 'out_file' is either
      unspecified or is a filename), open the created file in
      mode 'mode'.


  SEE ALSO, [binascii]


  TOPIC -- Cryptography
  --------------------------------------------------------------------

  Python does not come with any standard and general cryptography
  modules. The few included capabilities are fairly narrow in
  purpose and limited in scope. The capabilities in the standard
  library consist of several cryptographic hashes and one weak
  symmetrical encryption algorithm. A quick survey of cryptographic
  techniques shows what capabilities are absent from the standard
  library:

  *Symmetrical Encryption:* Any technique by which a plaintext
  message M is "encrypted" with a key K to produce a cyphertext C.
  Application of K--or some K' easily derivable from K--to C is
  called "decryption" and produces as output M. The standard module
  [rotor] provides a form of symmetrical encryption.

  *Cryptographic Hash:* Any technique by which a short "hash" H is
  produced from a plaintext message M that has several additional
  properties: (1) Given only H, it is difficult to obtain any M'
  such that the cryptographic hash of M' is H; (2) Given two
  plaintext messages M and M', there is a very low probability that
  the cryptographic hashes of M and M' are the same. Sometimes a
  third property is included: (3) Given M, its cryptographic hash
  H, and another hash H', examining the relationship between H and
  H' does not make it easier to find an M' whose hash is H'. The
  standard modules [crypt], [md5], and [sha] provide forms of
  cryptographic hashes.
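
  For example, a minimal sketch of computing such hashes with
  the standard [md5] and [sha] modules:

      #---------- hash_demo.py ----------#
      # Print cryptographic hashes of a fixed message
      import md5, sha
      message = "Mary had a little lamb"
      print "MD5:", md5.new(message).hexdigest()   # 32 hex digits
      print "SHA:", sha.new(message).hexdigest()   # 40 hex digits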

  *Asymmetrical Encryption:* Also called "public-key cryptography."
  Any technique by which a pair of keys K{pub} and K{priv} can be
  generated that have several properties. The algorithm for an
  asymmetrical encryption technique will be called "P(M,K)" in the
  following. (1) For any plaintext message M, M equals
  P(P(M,K{pub}),K{priv}). (2) Given only a public-key K{pub}, it is
  difficult to obtain a private-key K{priv} that assures the
  equality in (1). (3) Given only P(M,K{pub}), it is difficult to
  obtain M. In general, in an asymmetrical encryption system, a
  user generates K{pub} and K{priv}, then releases K{pub} to other
  users but retains K{priv} as a secret. There is no support for
  asymmetrical encryption in the standard library.

  *Digital Signatures:* Digital signatures are really just
  "public-keys in reverse." In many cases, the same underlying
  algorithm is used for each. A digital signature is any technique
  by which a pair of keys K{ver} and K{sig} can be generated that
  have several properties. The algorithm for a digital signature
  will be called S(M,K) in the following. (1) For any message M, M
  equals S(S(M,K{sig}),K{ver}). (2) Given only a verification key
  K{ver}, it is difficult to obtain a signature key K{sig} that
  assures the equality in (1). (3) Given only S(M,K{sig}), it is
  difficult to find any C' such that S(C',K{ver}) is a plausible
  message (in other words, the signature shows it is not a
  forgery). In general, in a digital signature system, a user
  generates K{ver} and K{sig}, then releases K{ver} to other users
  but retains K{sig} as a secret. There is no support for digital
  signatures in the standard library.

  -*-

  Those outlined are the most important cryptographic techniques.
  More detailed general introductions to cryptology and
  cryptography can be found at the author's Web site.
  A first tutorial is _Introduction to Cryptology Concepts I_:

    <http://gnosis.cx/publish/programming/cryptology1.pdf>

  Further material is in _Introduction to Cryptology Concepts II_:

    <http://gnosis.cx/publish/programming/cryptology2.pdf>

  And more advanced material is in _Intermediate Cryptology:
  Specialized Protocols_:

    <http://gnosis.cx/publish/programming/cryptology3.pdf>

  A number of third-party modules have been created to handle
  cryptographic tasks; a good guide to these third-party tools is
  the Vaults of Parnassus Encryption/Encoding index at
  <http://www.vex.net/parnassus/apyllo.py?i=94738404>. Only the
  tools in the standard library will be covered here
  specifically, since all the third-party tools are somewhat far
  afield of the topic of text processing as such. Moreover,
  third-party tools often rely on additional non-Python
  libraries, which will not be present on most platforms; and
  these tools will not necessarily be maintained as new Python
  versions introduce changes.

  The most important third-party modules are listed below. These
  are modules that the author believes are likely to be
  maintained and that provide access to a wide range of
  cryptographic algorithms.

  mxCrypto
  amkCrypto
      Marc-Andre Lemburg and Andrew Kuchling--both valuable
      contributors of many Python modules--have played a game of
      leapfrog with each other by releasing [mxCrypto] and
      [amkCrypto], respectively. Each release of either module
      builds on the work of the other, providing compatible
      interfaces and overlapping source code. Whatever is newest
      at the time you read this is the best bet. Current
      information on both should be obtainable from:

      <http://www.amk.ca/python/code/crypto.html>

  Python Cryptography
      Andrew Kuchling, who has provided a great deal of excellent
      Python documentation, documents these cryptography modules
      at:

      <http://www.amk.ca/python/writing/pycrypt/>

  M2Crypto
      The [mxCrypto] and [amkCrypto] modules are most readily
      available for Unix-like platforms. A similar range of
      cryptographic capabilities for a Windows platform is
      available in Ng Pheng Siong's [M2Crypto]. Information and
      documentation can be found at:

      <http://www.post1.com/home/ngps/m2/>

  fcrypt
      Carey Evans has created [fcrypt], which is a pure-Python,
      single-module replacement for the standard library's
      [crypt] module. While probably orders-of-magnitude slower
      than a C implementation, [fcrypt] will run anywhere that
      Python does (and speed is rarely an issue for this
      functionality). [fcrypt] may be obtained at:

      <http://home.clear.net.nz/pages/c.evans/sw/>


  =================================================================
    MODULE -- crypt : Create and verify Unix-style passwords
  =================================================================

  The 'crypt()' function is a frequently used, but somewhat
  antiquated, password creation/verification tool. Under
  Unix-like systems, 'crypt()' is contained in system libraries
  and may be called from wrapper functions in languages like
  Python. 'crypt()' is a form of cryptographic hash based on the
  Data Encryption Standard (DES). The hash produced by 'crypt()'
  is based on an 8-byte key and a 2-byte "salt." The output of
  'crypt()' is produced by repeated encryption of a constant
  string, using the user key as a DES key and the salt to
  perturb the encryption in one of 4,096 ways. Both the key and
  the salt are restricted to alphanumerics plus dot and slash.

  By using a cryptographic hash, passwords may be stored in a
  relatively insecure location. An imposter cannot easily
  produce a false password that will hash to the same value as
  the one stored in the password file, even given access to the
  password file. The salt is used to make "dictionary attacks"
  more difficult. If an imposter has access to the password
  file, she might try applying 'crypt()' to a candidate password
  and compare the result to every entry in the password file.
  Without a salt, the chances of matching -some- encrypted
  password would be higher. The salt (which should be chosen
  randomly) reduces the chance of such a match by a factor of
  4,096.

  The [crypt] module is installed on only some Python systems
  (indeed, on only some Unix systems). Moreover, the module, if
  installed, relies on an underlying system library. For a
  portable approach to password creation, the third-party
  [fcrypt] module provides a pure-Python reimplementation.

  FUNCTIONS:

  crypt.crypt(passwd, salt)
      Return an ASCII 13-byte encrypted password. The first
      argument 'passwd' must be a string up to eight characters
      in length (extra characters are truncated and do not
      affect the result). The second argument 'salt' must be a
      string up to two characters in length (extra characters are
      truncated). The value of 'salt' forms the first two
      characters of the result.

      >>> from crypt import crypt
      >>> crypt('mypassword','XY')
      'XY5XuULXk4pcs'
      >>> crypt('mypasswo','XY')
      'XY5XuULXk4pcs'
      >>> crypt('mypassword...more.characters','XY')
      'XY5XuULXk4pcs'
      >>> crypt('mypasswo','AB')
      'AB06lnfYxWIKg'
      >>> crypt('diffpass','AB')
      'ABlO5BopaFYNs'
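
      A stored 'crypt()' hash can later verify a password entry:
      since the first two characters of the hash are the salt
      itself, re-hashing a candidate with that salt must reproduce
      the stored value exactly. A sketch (the helper name is
      merely illustrative):

      >>> def check_password(candidate, stored):
      ...     return crypt(candidate, stored[:2]) == stored
      ...
      >>> stored = crypt('mypassword','XY')  # e.g., a passwd entry
      >>> check_password('mypassword', stored)
      1
      >>> check_password('wrongpass', stored)
      0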


  SEE ALSO, `fcrypt`, [md5], [sha]

  =================================================================
    MODULE -- md5 : Create MD5 message digests
  =================================================================

  RSA Data Security, Inc.'s MD5 cryptographic hash is a popular
  algorithm that is codified by RFC1321. Like [sha], and unlike
  [crypt], [md5] allows one to find the cryptographic hash of
  arbitrary strings (Unicode strings may not be hashed, however).
  Absent any other considerations--such as compatibility with other
  programs--Secure Hash Algorithm (SHA) is currently considered a
  better algorithm than MD5, and the [sha] module should be used
  for cryptographic hashes. The operation of [md5] objects is
  similar to `binascii.crc32()` hashes in that the final hash value
  may be built progressively from partial concatenated strings. The
  MD5 algorithm produces a 128-bit hash.

  CONSTANTS:

  md5.MD5Type
      The type of an `md5.new` instance.

  CLASSES:

  md5.new([s])
      Create an [md5] object. If the first argument 's' is
      specified, initialize the MD5 digest buffer with the
      initial string 's'. An MD5 hash can be computed in a
      single line with:

      >>> import md5
      >>> md5.new('Mary had a little lamb').hexdigest()
      'e946adb45d4299def2071880d30136d4'

  md5.md5([s])
      Identical to `md5.new`.

  METHODS:

  md5.copy()
      Return a new [md5] object that is identical to the current
      state of the current object. Different terminal strings
      can be concatenated to the clone objects after they are
      copied. For example:

      >>> import md5
      >>> m = md5.new('spam and eggs')
      >>> m.digest()
      '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
      >>> m2 = m.copy()
      >>> m2.digest()
      '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
      >>> m.update(' are tasty')
      >>> m2.update(' are wretched')
      >>> m.digest()
      '*\x94\xa2\xc5\xceq\x96\xef&\x1a\xc9#\xac98\x16'
      >>> m2.digest()
      'h\x8c\xfam\xe3\xb0\x90\xe8\x0e\xcb\xbf\xb3\xa7N\xe6\xbc'

  md5.digest()
      Return the 128-bit digest of the current state of the [md5]
      object as a 16-byte string. Each byte will contain a full
      8-bit range of possible values.

      >>> import md5 # Python 2.1+
      >>> m = md5.new('spam and eggs')
      >>> m.digest()
      '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'

      >>> import md5 # Python <= 2.0
      >>> m = md5.new('spam and eggs')
      >>> m.digest()
      '\265\201f\014\377\027\347\214\204\303\250J\320.g\205'

  md5.hexdigest()
      Return the 128-bit digest of the current state of the [md5]
      object as a 32-byte hexadecimal-encoded string. Each byte
      will contain only values in `string.hexdigits`. Each pair
      of bytes represents 8 bits of hash, and this format may be
      transmitted over 7-bit ASCII channels like email.

      >>> import md5
      >>> m = md5.new('spam and eggs')
      >>> m.hexdigest()
      'b581660cff17e78c84c3a84ad02e6785'

  md5.update(s)
      Concatenate additional strings to the [md5] object.
      Current hash state is adjusted accordingly. The number of
      concatenation steps that go into an MD5 hash does not
      affect the final hash, only the actual string that would
      result from concatenating each part in a single string.
      However, for large strings that are determined
      incrementally, it may be more practical to call
      `md5.update()` numerous times. For example:

      >>> import md5
      >>> m1 = md5.new('spam and eggs')
      >>> m2 = md5.new('spam')
      >>> m2.update(' and eggs')
      >>> m3 = md5.new('spam')
      >>> m3.update(' and ')
      >>> m3.update('eggs')
      >>> m1.hexdigest()
      'b581660cff17e78c84c3a84ad02e6785'
      >>> m2.hexdigest()
      'b581660cff17e78c84c3a84ad02e6785'
      >>> m3.hexdigest()
      'b581660cff17e78c84c3a84ad02e6785'
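
      For a large file, a digest can likewise be built without
      holding the whole file in memory, by feeding fixed-size
      chunks to `md5.update()`. A sketch (the filename is
      hypothetical):

      #---------- md5_file.py ----------#
      # Hash a large file in 64K chunks rather than all at once
      import md5
      m = md5.new()
      fp = open('huge.dat','rb')
      while 1:
          chunk = fp.read(65536)
          if not chunk: break
          m.update(chunk)
      print m.hexdigest()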


  SEE ALSO, [sha], [crypt], `binascii.crc32()`

  =================================================================
    MODULE -- rotor : Perform Enigma-like encryption and decryption
  =================================================================

  The [rotor] module is a bit of a curiosity in the Python standard
  library. The symmetric encryption performed by [rotor] is similar
  to that performed by the historically fascinating and important
  Enigma algorithm. Given Alan Turing's famous role not
  just in inventing the theory of computability, but also in
  cracking German encryption during WWII, there is a nice literary
  quality to the inclusion of [rotor] in Python. However, [rotor]
  should not be mistaken for a robust modern encryption algorithm.
  Bruce Schneier has commented that there are two types of
  encryption algorithms: those that will stop your little sister
  from reading your messages, and those that will stop major
  governments and powerful organizations from reading your messages.
  [rotor] is in the first category--albeit allowing for rather
  bright little sisters. But [rotor] will not help much against
  TLAs (three letter agencies). On the other hand, there is nothing
  else in the Python standard library that performs actual
  military-grade encryption, either.

  CLASSES:

  rotor.newrotor(key [,numrotors])
      Return a [rotor] object with rotor permutations and
      positions based on the first argument 'key'. If the second
      argument 'numrotors' is specified, a number of rotors other
      than the default of 6 can be used (more is stronger). A
      rotor encryption can be computed in a single line with:

      >>> import rotor
      >>> rotor.newrotor('mypassword').encrypt('Mary had a lamb')
      '\x10\xef\xf1\x1e\xeaor\xe9\xf7\xe5\xad,r\xc6\x9f'

      Object-style encryption and decryption are performed as
      follows:

      >>> import rotor
      >>> C = rotor.newrotor('pass2').encrypt('Mary had a little lamb')
      >>> r1 = rotor.newrotor('mypassword')
      >>> C2 = r1.encrypt('Mary had a little lamb')
      >>> r1.decrypt(C2)
      'Mary had a little lamb'
      >>> r1.decrypt(C) # Let's try it
      '\217R$\217/sE\311\330~#\310\342\200\025F\221\245\263\036\220O'
      >>> r1.setkey('pass2')
      >>> r1.decrypt(C) # Let's try it
      'Mary had a little lamb'

  METHODS:

  rotor.decrypt(s)
      Return a decrypted version of cyphertext string 's'. Prior
      to decryption, rotors are set to their initial positions.

  rotor.decryptmore(s)
      Return a decrypted version of cyphertext string 's'. Prior
      to decryption, rotors are left in their current positions.

  rotor.encrypt(s)
      Return an encrypted version of plaintext string 's'. Prior
      to encryption, rotors are set to their initial positions.

  rotor.encryptmore(s)
      Return an encrypted version of plaintext string 's'. Prior
      to encryption, rotors are left in their current positions.

  rotor.setkey(key)
      Set a new key for a [rotor] object.
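
      For instance, one might encrypt a file's contents to disk
      and recover them later; since `rotor.encrypt()` and
      `rotor.decrypt()` each reset the rotors first, a single
      [rotor] object can do both. A sketch (the filenames are
      merely illustrative):

      #---------- rotor_file.py ----------#
      # Encrypt a file's contents, then recover the plaintext
      import rotor
      r = rotor.newrotor('mypassword', 12)  # 12 rotors
      plaintext = open('secret.txt').read()
      open('secret.rot','wb').write(r.encrypt(plaintext))
      # ...later...
      cyphertext = open('secret.rot','rb').read()
      assert r.decrypt(cyphertext) == plaintext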

  =================================================================
    MODULE -- sha : Create SHA message digests
  =================================================================

  The National Institute of Standards and Technology's (NIST's)
  Secure Hash Algorithm is the best-known cryptographic hash
  for most purposes. Like [md5], and unlike [crypt], [sha] allows
  one to find the cryptographic hash of arbitrary strings (Unicode
  strings may not be hashed, however). Absent any other
  considerations--such as compatibility with other programs--SHA is
  currently considered a better algorithm than MD5, and the [sha]
  module should be used for cryptographic hashes. The operation of
  [sha] objects is similar to `binascii.crc32()` hashes in that the
  final hash value may be built progressively from partial
  concatenated strings. The SHA algorithm produces a 160-bit hash.

  CLASSES:

  sha.new([s])
      Create an [sha] object. If the first argument 's' is
      specified, initialize the SHA digest buffer with the
      initial string 's'. An SHA hash can be computed in a
      single line with:

      >>> import sha
      >>> sha.new('Mary had a little lamb').hexdigest()
      'bac9388d0498fb378e528d35abd05792291af182'

  sha.sha([s])
      Identical to `sha.new`.

  METHODS:

  sha.copy()
      Return a new [sha] object that is identical to the current
      state of the current object. Different terminal strings
      can be concatenated to the clone objects after they are
      copied. For example:

      >>> import sha
      >>> s = sha.new('spam and eggs')
      >>> s.digest()
      '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
      >>> s2 = s.copy()
      >>> s2.digest()
      '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
      >>> s.update(' are tasty')
      >>> s2.update(' are wretched')
      >>> s.digest()
      '\013^C\366\253?I\323\206nt\2443\251\227\204-kr6'
      >>> s2.digest()
      '\013\210\237\216\014\3337X\333\221h&+c\345\007\367\326\274\321'

  sha.digest()
      Return the 160-bit digest of the current state of the [sha]
      object as a 20-byte string. Each byte will contain a full
      8-bit range of possible values.

      >>> import sha # Python 2.1+
      >>> s = sha.new('spam and eggs')
      >>> s.digest()
      '\xbe\x87\x94\x8b\xad\xfdx\x14\xa5b\x1eC\xd2\x0f\xaa8 @\x0f\xa6'

      >>> import sha # Python <= 2.0
      >>> s = sha.new('spam and eggs')
      >>> s.digest()
      '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'

  sha.hexdigest()
      Return the 160-bit digest of the current state of the [sha]
      object as a 40-byte hexadecimal-encoded string. Each byte
      will contain only values in `string.hexdigits`. Each pair of
      bytes represents 8 bits of hash, and this format may be
      transmitted over 7-bit ASCII channels like email.

      >>> import sha
      >>> s = sha.new('spam and eggs')
      >>> s.hexdigest()
      'be87948badfd7814a5621e43d20faa3820400fa6'

  sha.update(s)
      Concatenate additional strings to the [sha] object.
      Current hash state is adjusted accordingly. The number of
      concatenation steps that go into an SHA hash does not
      affect the final hash, only the actual string that would
      result from concatenating each part in a single string.
      However, for large strings that are determined
      incrementally, it may be more practical to call
      `sha.update()` numerous times. For example:

      >>> import sha
      >>> s1 = sha.sha('spam and eggs')
      >>> s2 = sha.sha('spam')
      >>> s2.update(' and eggs')
      >>> s3 = sha.sha('spam')
      >>> s3.update(' and ')
      >>> s3.update('eggs')
      >>> s1.hexdigest()
      'be87948badfd7814a5621e43d20faa3820400fa6'
      >>> s2.hexdigest()
      'be87948badfd7814a5621e43d20faa3820400fa6'
      >>> s3.hexdigest()
      'be87948badfd7814a5621e43d20faa3820400fa6'


  SEE ALSO, [md5], [crypt], `binascii.crc32()`


  TOPIC -- Compression
  --------------------------------------------------------------------

  Over the history of computers, a large number of data compression
  formats have been invented, mostly as variants on Lempel-Ziv and
  Huffman techniques. Compression is useful for all sorts of data
  streams, but file-level archive formats have been the most widely
  used and known application. Under MS-DOS and Windows we have seen
  ARC, PAK, ZOO, LHA, ARJ, CAB, RAR, and other formats--but the ZIP
  format has become the most widespread variant. Under Unix-like
  systems, 'compress' (.Z) mostly gave way to 'gzip' (GZ); 'gzip'
  is still the most popular format on these systems, but 'bzip2'
  (BZ2) generally obtains better compression rates. Under MacOS,
  the most popular format is SIT. Other platforms have additional
  variants on archive formats, but ZIP--and to a lesser extent
  GZ--are widely supported on a number of platforms.

  The Python standard library includes support for several styles
  of compression. The [zlib] module performs low-level compression
  of raw string data and has no concept of a file. [zlib] is itself
  called by the high-level modules below for its compression
  services.

  The modules [gzip] and [zipfile] provide file-level interfaces to
  compressed archives. However, a notable difference in the
  operation of [gzip] and [zipfile] arises out of a difference in
  the underlying GZ and ZIP formats. 'gzip' (GZ) operates
  exclusively on single files--leaving the work of concatenating
  collections of files to tools like 'tar'. One frequently
  encounters (especially on Unix-like systems) files like
  'foo.tar.gz' or 'foo.tgz' that are produced by first applying
  'tar' to a collection of files, then applying 'gzip' to the
  result. ZIP, however, handles both the compression and archiving
  aspects in a single tool and format. As a consequence, [gzip] is
  able to create file-like objects based directly on the compressed
  contents of a GZ file. [zipfile] needs to provide more specialized
  methods for navigating archive contents and for working with
  individual compressed file images therein.

  Also see Appendix B (A Data Compression Primer).

  =================================================================
    MODULE -- gzip : Functions that read and write gzipped files
  =================================================================

  The [gzip] module allows the treatment of the compressed data
  inside 'gzip' compressed files directly in a file-like manner.
  Uncompressed data can be read out, and compressed data written
  back in, all without a caller knowing or caring that the file
  is a GZ-compressed file. A simple example illustrates this:

      #---------- gzip_file.py ----------#
      # Treat a GZ as "just another file"
      import gzip, glob
      print "Size of data in files:"
      for fname in glob.glob('*'):
          try:
              if fname[-3:] == '.gz':
                  s = gzip.open(fname).read()
              else:
                  s = open(fname).read()
              print ' ',fname,'-',len(s),'bytes'
          except IOError:
              print 'Skipping',fname
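
  Writing compressed data is just as transparent as reading it. A
  sketch that round-trips a string through a GZ file (the filename
  is merely illustrative):

      #---------- gzip_write.py ----------#
      # Write a string to a GZ file, then read it back unchanged
      import gzip
      s = 'spam and eggs\n' * 500
      fp = gzip.open('test.gz','wb')
      fp.write(s)
      fp.close()   # flush buffers and write the GZ trailer
      assert gzip.open('test.gz').read() == s
      print len(open('test.gz','rb').read()), 'compressed bytes'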

  The module [gzip] is a wrapper around [zlib], with the latter
  performing the actual compression and decompression tasks. In
  many respects, [gzip] is similar to [mmap] and [StringIO] in
  emulating and/or wrapping a file object.

  SEE ALSO, [mmap], [StringIO], [cStringIO]

  CLASSES:

  gzip.GzipFile([filename=... [,mode="rb" [,compresslevel=9 [,fileobj=...]]]])
      Create a [gzip] file-like object. Such an object supports
      most file object operations, with the exception of
      '.seek()' and '.tell()'. Either the first argument
      'filename' or the fourth argument 'fileobj' should be
      specified (likely by keyword, especially for the fourth
      argument 'fileobj').

      The second argument 'mode' takes the mode of 'fileobj' if
      specified, otherwise it defaults to 'rb' ('r', 'rb', 'a',
      'ab', 'w', or 'wb' may be specified with the same meaning
      as with the built-in `open()` function). The third argument
      'compresslevel' specifies the level of compression. The
      default is the highest level, 9; an integer down to 1 may
      be selected for less compression but faster operation
      (compression level of a read file comes from the file
      itself, however).
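
      Since the fourth argument 'fileobj' may be any file-like
      object, [gzip] can also compress into an in-memory buffer
      rather than a named file. A sketch using [StringIO]:

      #---------- gzip_mem.py ----------#
      # Compress into a StringIO buffer instead of a disk file
      import gzip, StringIO
      buf = StringIO.StringIO()
      gz = gzip.GzipFile(fileobj=buf, mode='wb')
      gz.write('spam and eggs ' * 100)
      gz.close()   # closes the gzip interface; buf stays open
      print len(buf.getvalue()), 'compressed bytes'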

  gzip.open(filename=... [,mode='rb' [,compresslevel=9]])
      Same as `gzip.GzipFile` but with extra arguments omitted.
      A GZ file object opened with `gzip.open` is always opened
      by name, not by underlying file object.

  METHODS AND ATTRIBUTES:

  gzip.close()
      Close the [gzip] object. No access is permitted after
      close. If the object was opened by file object, the
      underlying file object is not closed, only the [gzip]
      interface to the file.

      SEE ALSO, `FILE.close()`

  gzip.flush()
      Write outstanding data from memory to disk.

      SEE ALSO, `FILE.close()`

  gzip.isatty()
      Return 0. Compatibility method for file-like behavior.

      SEE ALSO, `FILE.isatty()`

  gzip.myfileobj
      Attribute holding the underlying file object.

  gzip.read([num])
      If the first argument 'num' is specified, return a string
      containing the next 'num' characters. If 'num' characters
      are not available, return as many as possible. If 'num' is
      not specified, return all the characters from current file
      position to end of string buffer. Advance the current file
      position by the amount read.

      SEE ALSO, `FILE.read()`

  gzip.readline([length])
      Return a string from the [gzip] object, starting from the
      current file position and going to the next newline
      character. The argument 'length' limits the read if specified.
      Advance the current file position by the amount read.

      SEE ALSO, `FILE.readline()`

  gzip.readlines([sizehint=...])
      Return a list of strings from the [gzip] object. Each
      list element consists of a single line, including the
      trailing newline character(s). If an argument 'sizehint'
      is specified, read only approximately 'sizehint' characters
      worth of lines (full lines will always be read).

      SEE ALSO, `FILE.readlines()`

  gzip.write(s)
      Write the first argument 's' into the [gzip] object at the
      current file position. The current file position is
      updated to the position following the write.

      SEE ALSO, `FILE.write()`

  gzip.writelines(list)
      Write each element of 'list' into the [gzip] object at the
      current file position. The current file position is
      updated to the position following the write. Most sequence
      types are allowed, but 'list' must contain only strings, or
      a 'TypeError' will occur.

      Contrary to what might be expected from the method name,
      `gzip.writelines()` never inserts newline characters. For
      the list elements actually to occupy separate lines in the
      string buffer, each element string must already have a
      newline terminator. See `StringIO.StringIO.writelines()`
      for an example.

      SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.writelines()`


  SEE ALSO, [zlib], [zipfile]

  =================================================================
    MODULE -- zipfile : Read and write ZIP files
  =================================================================

  The [zipfile] module enables a variety of operations on ZIP
  files and is compatible with archives created by applications
  such as PKZip, Info-Zip, and WinZip. Since the ZIP format
  allows inclusion of multiple file images within a single
  archive, the [zipfile] module does not behave in a directly
  file-like manner as [gzip] does. Nonetheless, it is possible to
  view the contents of an archive, add new file images to one,
  create a new ZIP archive, or manipulate the contents and
  directory information of a ZIP file.

  An initial example of working with the [zipfile] module gives a
  feel for its usage.

      >>> for name in 'ABC':
      ...     open(name,'w').write(name*1000)
      ...
      >>> import zipfile
      >>> z = zipfile.ZipFile('new.zip','w',zipfile.ZIP_DEFLATED) # new archv
      >>> z.write('A') # write files to archive
      >>> z.write('B','B.newname',zipfile.ZIP_STORED)
      >>> z.write('C','C.newname')
      >>> z.close() # close the written archive
      >>> z = zipfile.ZipFile('new.zip') # reopen archive in read mode
      >>> z.testzip() # 'None' returned means OK
      >>> z.namelist() # What's in it?
      ['A', 'B.newname', 'C.newname']
      >>> z.printdir() # details
      File Name                         Modified             Size
      A                          2001-07-18 21:39:36         1000
      B.newname                  2001-07-18 21:39:36         1000
      C.newname                  2001-07-18 21:39:36         1000
      >>> A = z.getinfo('A') # bind ZipInfo object
      >>> B = z.getinfo('B.newname') # bind ZipInfo object
      >>> A.compress_size
      11
      >>> B.compress_size
      1000
      >>> z.read(A.filename)[:40] # Check what's in A
      'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
      >>> z.read(B.filename)[:40] # Check what's in B
      'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
      >>> # For comparison, see what Info-Zip reports on created archive
      >>> import os
      >>> print os.popen('unzip -v new.zip').read()
      Archive: new.zip
       Length  Method   Size  Ratio    Date    Time    CRC-32    Name
       ------  ------   ----  -----    ----    ----    ------    ----
         1000  Defl:N     11    99%  07-18-01  21:39  51a02e01   A
         1000  Stored   1000     0%  07-18-01  21:39  7d9c564d   B.newname
         1000  Defl:N     11    99%  07-18-01  21:39  66778189   C.newname
       ------          ------  ----                              -------
         3000            1022   66%                              3 files

  Like [gzip], the [zipfile] module relies on [zlib] to perform
  the actual compression whenever the ZIP_DEFLATED method is used.

  CONSTANTS:

  Several string constants ([struct] formats) are used to
  recognize signature identifiers in the ZIP format. These
  constants are not normally used directly by end-users of
  [zipfile].

      #*----- zipfile constants -----#
      zipfile.stringCentralDir = 'PK\x01\x02'
      zipfile.stringEndArchive = 'PK\x05\x06'
      zipfile.stringFileHeader = 'PK\x03\x04'
      zipfile.structCentralDir = '<4s4B4H3l5H2l'
      zipfile.structEndArchive = '<4s4H2lH'
      zipfile.structFileHeader = '<4s2B4H3l2H'

  Symbolic names for the two supported compression methods are
  also defined.

      #*----- zipfile constants -----#
      zipfile.ZIP_STORED = 0
      zipfile.ZIP_DEFLATED = 8

  FUNCTIONS:

  zipfile.is_zipfile(filename=...)
      Check if the argument 'filename' is a valid ZIP archive.
      Archives with appended comments are not recognized as valid
      archives. Return 1 if valid, None otherwise. This function
      does not guarantee that the archive is fully intact, but it
      does provide a sanity check on the file type.

  CLASSES:

  zipfile.PyZipFile(pathname)
      Create a `zipfile.ZipFile` object that has the extra method
      `zipfile.ZipFile.writepy()`. This extra method allows you
      to recursively add all '*.py[oc]' files to an archive.
      This class is not general purpose, but a special feature to
      aid [distutils].

  zipfile.ZipFile(file=... [,mode='r' [,compression=ZIP_STORED]])
      Create a new `zipfile.ZipFile` object. This object is used
      for management of a ZIP archive. The first argument 'file'
      must be specified and is simply the filename of the
      archive to be manipulated. The second argument 'mode' may
      have one of three string values: 'r' to open the archive
      in read-only mode; 'w' to truncate the filename and create
      a new archive; 'a' to read an existing archive and add to
      it. The third argument 'compression' indicates the
      compression method--ZIP_DEFLATED requires that [zlib] and
      the zlib system library be present.

  zipfile.ZipInfo()
      Create a new `zipfile.ZipInfo` object. This object
      contains information about an individual archived filename
      and its file image. Normally, one will not directly
      instantiate `zipfile.ZipInfo` but only look at the
      `zipfile.ZipInfo` objects that are returned by methods like
      `zipfile.ZipFile.infolist()` and `zipfile.ZipFile.getinfo()`,
      or held in the attribute `zipfile.ZipFile.NameToInfo`.
      However, in special cases like `zipfile.ZipFile.writestr()`,
      it is useful to create a `zipfile.ZipInfo` directly.

  METHODS AND ATTRIBUTES:

  zipfile.ZipFile.close()
      Close the `zipfile.ZipFile` object, and flush any changes
      made to it. An object must be explicitly closed to perform
      updates.

  zipfile.ZipFile.getinfo(name=...)
      Return the `zipfile.ZipInfo` object corresponding to the
      filename 'name'. If 'name' is not in the ZIP archive, a
      'KeyError' is raised.

  zipfile.ZipFile.infolist()
      Return a list of `zipfile.ZipInfo` objects contained in the
      `zipfile.ZipFile` object. The return value is simply a
      list of instances of the same type. If the filename
      within the archive is known, `zipfile.ZipFile.getinfo()` is
      a better method to use. For enumerating over all archived
      files, however, `zipfile.ZipFile.infolist()` provides a
      nice sequence.

  zipfile.ZipFile.namelist()
      Return a list of the filenames of all the archived files
      (including nested relative directories).

  zipfile.ZipFile.printdir()
      Print to STDOUT a pretty summary of archived files and
      information about them. The results are similar to running
      Info-Zip's 'unzip' with the '-l' option.

  zipfile.ZipFile.read(name=...)
      Return the contents of the archived file with filename
      'name'.

  zipfile.ZipFile.testzip()
      Test the integrity of the current archive. Return the
      filename of the first `zipfile.ZipInfo` object with
      corruption. If everything is valid, return None.

  zipfile.ZipFile.write(filename=... [,arcname=... [,compress_type=...]])
      Add the file 'filename' to the `zipfile.ZipFile` object. If
      the second argument 'arcname' is specified, use 'arcname' as
      the stored filename (otherwise, use 'filename' itself). If
      the third argument 'compress_type' is specified, use the
      indicated compression method. The current archive must be
      opened in 'w' or 'a' mode.

  zipfile.ZipFile.writestr(zinfo=..., bytes=...)
      Write the data contained in the second argument 'bytes' to
      the `zipfile.ZipFile` object. Directory meta-information
      must be contained in attributes of the first argument
      'zinfo' (a filename, date, and time should be included;
      other information is optional). The current archive must
      be opened in 'w' or 'a' mode.
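
      A sketch of adding an in-memory string to the archive
      created earlier (the stored filename, timestamp, and
      contents are merely illustrative):

      #---------- zip_writestr.py ----------#
      # Archive a string without any corresponding disk file
      import zipfile
      zi = zipfile.ZipInfo()
      zi.filename = 'greeting.txt'
      zi.date_time = (2001, 7, 18, 21, 39, 36)  # y,mo,d,h,mi,s
      zi.compress_type = zipfile.ZIP_DEFLATED
      z = zipfile.ZipFile('new.zip', 'a')  # append to archive
      z.writestr(zi, 'Hello, archive!')
      z.close()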

  zipfile.ZipFile.NameToInfo
      Dictionary that maps filenames in archive to corresponding
      `zipfile.ZipInfo` objects. The method
      `zipfile.ZipFile.getinfo()` is simply a wrapper for a
      dictionary lookup in this attribute.

  zipfile.ZipFile.compression
      Compression type currently in effect for new
      `zipfile.ZipFile.write()` operations. Modify with due
      caution (most likely not at all after initialization).

  zipfile.ZipFile.debug = 0
      Attribute for level of debugging information sent to
      STDOUT. Values range from the default 0 (no output) to 3
      (verbose). May be modified.

  zipfile.ZipFile.filelist
      List of `zipfile.ZipInfo` objects contained in the
      `zipfile.ZipFile` object. The method
      `zipfile.ZipFile.infolist()` is simply a wrapper to
      retrieve this attribute. Modify with due caution (most
      likely not at all).

  zipfile.ZipFile.filename
      Filename of the `zipfile.ZipFile` object. DO NOT modify!

  zipfile.ZipFile.fp
      Underlying file object for the `zipfile.ZipFile` object.
      DO NOT modify!

  zipfile.ZipFile.mode
      Access mode of current `zipfile.ZipFile` object. DO NOT
      modify!

  zipfile.ZipFile.start_dir
      Position of start of central directory. DO NOT modify!

  zipfile.ZipInfo.CRC
      Hash value of this archived file. DO NOT modify!

  zipfile.ZipInfo.comment
      Comment attached to this archived file. Modify with due
      caution (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.compress_size
      Size of the compressed data of this archived file. DO NOT
      modify!

  zipfile.ZipInfo.compress_type
      Compression type used with this archived file. Modify with
      due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.create_system
      System that created this archived file. Modify with due
      caution (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.create_version
      PKZip version that created the archive. Modify with due
      caution (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.date_time
      Timestamp of this archived file. Modify with due caution
      (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.external_attr
      File attribute of archived file when extracted.

  zipfile.ZipInfo.extract_version
      PKZip version needed to extract the archive. Modify with
      due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.file_offset
      Byte offset to start of file data. DO NOT modify!

  zipfile.ZipInfo.file_size
      Size of the uncompressed data in the archived file. DO NOT
      modify!

  zipfile.ZipInfo.filename
      Filename of archived file. Modify with due caution (e.g.,
      for use with `zipfile.ZipFile.writestr()`).

  zipfile.ZipInfo.header_offset
      Byte offset to file header of the archived file. DO NOT
      modify!

  zipfile.ZipInfo.volume
      Volume number of the archived file. DO NOT modify!

  EXCEPTIONS:

  zipfile.error
      Exception that is raised when a corrupt ZIP file is
      processed.

  zipfile.BadZipFile
      Alias for `zipfile.error`.


  SEE ALSO, [zlib], [gzip]

  =================================================================
    MODULE -- zlib : Compress and decompress with zlib library
  =================================================================

  [zlib] is the underlying compression engine for all Python
  standard library compression modules. Moreover, [zlib] is
  extremely useful in itself for compression and decompression of
  data that does not necessarily live in files (or where data
  does not map directly to files, even if it winds up in them
  indirectly). The Python [zlib] module relies on the
  availability of the zlib system library.

  There are two basic modes of operation for [zlib]. In the
  simplest mode, one can simply pass an uncompressed string to
  `zlib.compress()` and have the compressed version returned.
  Using `zlib.decompress()` is symmetrical. In a more
  complicated mode, one can create compression or decompression
  objects that are able to receive incremental raw or compressed
  byte-streams, and return partial results based on what they have
  seen so far. This mode of operation is similar to the way one
  uses `sha.sha.update()`, `md5.md5.update()`,
  `rotor.encryptmore()`, or `binascii.crc32()` (albeit for a
  different purpose from each of those). For large byte-streams
  that are determined incrementally, it may be more practical to
  utilize compression/decompression objects than it would be to
  compress/decompress an entire string at once (for example, if
  the input or result is bound to a slow channel).
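
  A round trip through the simple mode looks like the sketch
  below; redundant data like this compresses to a small fraction
  of its original size:

      #---------- zlib_simple.py ----------#
      # Compress a whole string at once, then recover it exactly
      import zlib
      s = 'spam and eggs ' * 1000
      c = zlib.compress(s, zlib.Z_BEST_COMPRESSION)
      assert zlib.decompress(c) == s   # lossless round trip
      print '%d bytes compressed to %d bytes' % (len(s), len(c))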

  CONSTANTS:

  zlib.ZLIB_VERSION
      The installed zlib system library version.

  zlib.Z_BEST_COMPRESSION = 9
      Highest compression level.

  zlib.Z_BEST_SPEED = 1
      Fastest compression level.

  zlib.Z_HUFFMAN_ONLY = 2
      Intermediate compression level that uses Huffman codes,
      but not Lempel-Ziv.

  FUNCTIONS:

  zlib.adler32(s [,crc])
      Return the Adler-32 checksum of the first argument 's'.
      If the second argument 'crc' is specified, it will be used
      as an initial checksum. This allows partial computation
      of a checksum and continuation. An Adler-32 checksum can
      be computed much more quickly than a CRC32 checksum.
      Unlike [md5] or [sha], an Adler-32 checksum is not
      sufficient for cryptographic hashes, but merely for
      detection of accidental corruption of data.
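
      For instance, a checksum computed in two steps matches the
      checksum of the concatenated string:

      >>> import zlib
      >>> a = zlib.adler32('spam')
      >>> zlib.adler32(' and eggs', a) == zlib.adler32('spam and eggs')
      1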

      SEE ALSO, `zlib.crc32()`, [md5], [sha]

  zlib.compress(s [,level])
      Return the zlib compressed version of the string in the
      first argument 's'. If the second argument 'level' is
      specified, the compression technique can be fine-tuned.
      The compression level ranges from 1 to 9 and may also be
      specified using symbolic constants such as
      Z_BEST_COMPRESSION and Z_BEST_SPEED. The default value for
      'level' is 6 and is usually the desired compression level
      (usually within a few percent of the speed of
      Z_BEST_SPEED and within a few percent of the size of
      Z_BEST_COMPRESSION).

      SEE ALSO, `zlib.decompress()`, `zlib.compressobj`

  zlib.crc32(s [,crc])
      Return the CRC32 checksum of the first argument 's'. If
      the second argument 'crc' is specified, it will be used as
      an initial checksum. This allows partial computation of a
      checksum and continuation. Unlike [md5] or [sha], a
      CRC32 checksum is not sufficient for cryptographic hashes,
      but merely for detection of accidental corruption of data.

      Identical to `binascii.crc32()` (example appears there).

      SEE ALSO, `binascii.crc32()`, `zlib.adler32()`, [md5],
      [sha]

  zlib.decompress(s [,winsize [,buffsize]])
      Return the decompressed version of the zlib compressed
      string in the first argument 's'. If the second argument
      'winsize' is specified, it determines the base 2 logarithm
      of the history buffer size. The default 'winsize' is 15.
      If the third argument 'buffsize' is specified, it
      determines the size of the decompression buffer. The
      default 'buffsize' is 16384, but more is dynamically
      allocated if needed. One rarely needs to use 'winsize'
      and 'buffsize' values other than the defaults.

      SEE ALSO, `zlib.compress()`, `zlib.decompressobj`

  CLASS FACTORIES:

  [zlib] does not define true classes that can be specialized.
  `zlib.compressobj()` and `zlib.decompressobj()` are actually
  factory-functions rather than classes. That is, they return
  instance objects, just as classes do, but they do not have
  unbound data and methods. For most users, the difference is not
  important: To get a `zlib.compressobj` or `zlib.decompressobj`
  object, you just call that factory-function in the same manner
  you would a class object.

  zlib.compressobj([level])
      Create a compression object. A compression object is able
      to incrementally compress new strings that are fed to it
      while maintaining the seeded symbol table from previously
      compressed byte-streams. If argument 'level' is specified,
      the compression technique can be fine-tuned. The
      compression-level ranges from 1 to 9. The default value
      for 'level' is 6 and is usually the desired compression
      level.

      SEE ALSO, `zlib.compress()`, `zlib.decompressobj()`

  zlib.decompressobj([winsize])
      Create a decompression object. A decompression object is
      able to incrementally decompress new strings that are
      fed to it while maintaining the seeded symbol table from
      previously decompressed byte-streams. If the argument
      'winsize' is specified, it determines the base 2 logarithm
      of the history buffer size. The default 'winsize' is 15.

      SEE ALSO, `zlib.decompress()`, `zlib.compressobj()`

  METHODS AND ATTRIBUTES:

  zlib.compressobj.compress(s)
      Add more data to the compression object. If the symbol table
      becomes full, compressed data is returned, otherwise an
      empty string. All returned output from each repeated call
      to `zlib.compressobj.compress()` should be concatenated to
      a decompression byte-stream (either a string or a
      decompression object). The example below, if run in a
      directory with some files, lets one examine the buffering
      behavior of compression objects:

      #---------- zlib_objs.py ----------#
      # Demonstrate compression object streams
      import zlib, glob
      decom = zlib.decompressobj()
      com = zlib.compressobj()
      for file in glob.glob('*'):
          s = open(file).read()
          c = com.compress(s)
          print 'COMPRESSED:', len(c), 'bytes out'
          d = decom.decompress(c)
          print 'DECOMPRESS:', len(d), 'bytes out'
          print 'UNUSED DATA:', len(decom.unused_data), 'bytes'
          raw_input('-- %s (%s bytes) --' % (file, `len(s)`))
      f = com.flush()
      m = decom.decompress(f)
      print 'DECOMPRESS:', len(m), 'bytes out'
      print 'UNUSED DATA:', len(decom.unused_data), 'byte'

      SEE ALSO, `zlib.compressobj.flush()`,
                `zlib.decompressobj.decompress()`,
                `zlib.compress()`

  zlib.compressobj.flush([mode])
      Flush any buffered data from the compression object. As in
      the example in `zlib.compressobj.compress()`, the output of
      a `zlib.compressobj.flush()` should be concatenated to the
      same decompression byte-stream as `zlib.compressobj.compress()`
      calls are. If the first argument 'mode' is left empty, or
      the default Z_FINISH is specified, the compression object
      cannot be used further, and one should `del` it.
      Otherwise, if Z_SYNC_FLUSH or Z_FULL_FLUSH are specified,
      the compression object can still be used, but some
      uncompressed data may not be recovered by the decompression
      object.

      SEE ALSO, `zlib.compress()`, `zlib.compressobj.compress()`

  zlib.decompressobj.unused_data
      As indicated, `zlib.decompressobj.unused_data` is an
      instance attribute rather than a method. If any partial
      compressed stream cannot be decompressed immediately based
      on the byte-stream received, the remainder is buffered in
      this instance attribute. Normally, any output of a
      compression object forms a complete decompression block,
      and nothing is left in this instance attribute. However,
      if data is received in bits over a channel, only partial
      decompression may be possible on a particular
      `zlib.decompressobj.decompress()` call.

      SEE ALSO, `zlib.decompress()`,
      `zlib.decompressobj.decompress()`

  zlib.decompressobj.decompress(s)
      Return the decompressed data that may be derived from the
      current decompression object state and the argument 's'
      data passed in. If all of 's' cannot be decompressed in
      this pass, the remainder is left in
      `zlib.decompressobj.unused_data`.

  zlib.decompressobj.flush()
      Return the decompressed data from any bytes buffered by
      the decompression object. After this call, the
      decompression object cannot be used further, and you
      should `del` it.

  EXCEPTIONS:

  zlib.error
      Exception that is raised by compression or decompression
      errors.


  SEE ALSO, [gzip], [zipfile]


  TOPIC -- Unicode
  --------------------------------------------------------------------

  Note that Appendix C (Understanding Unicode) also discusses
  Unicode issues.

  Unicode is an enhanced set of character entities, well beyond
  the basic 128 characters defined in ASCII encoding and the
  codepage-specific national language sets that contain 128
  characters each. The full Unicode character set--evolving
  continuously, but with a large number of codepoints already
  fixed--can contain more than a million distinct characters.
  This allows the representation of a large number of national
  character sets within a unified encoding space, even the large
  character sets of Chinese-Japanese-Korean (CJK) alphabets.

  Although Unicode defines a unique codepoint for each distinct
  character in its range, there are numerous -encodings- that
  correspond to each character. The encoding called 'UTF-8'
  defines ASCII characters as single bytes with standard ASCII
  values. However, for non-ASCII characters, a variable number
  of bytes (up to 6) are used to encode characters, with the
  "escape" to Unicode being indicated by high-bit values in
  initial bytes of multibyte sequences. 'UTF-16' is similar,
  but uses either 2 or 4 bytes to encode each character (but
  never just 1). 'UTF-32' is a format that uses a fixed 4-byte
  value for each Unicode character. 'UTF-32', however, is not
  currently supported by Python.

  Native Unicode support was added to Python 2.0. On the face of
  it, it is a happy situation that Python supports Unicode--it
  brings the world closer to multinational language support in
  computer applications. But in practice, you have to be careful
  when working with Unicode, because it is all too easy to
  encounter glitches like the one below:

      >>> import unicodedata
      >>> alef, omega = unichr(1488), unichr(969)
      >>> unicodedata.name(alef)
      'HEBREW LETTER ALEF'
      >>> print alef
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeError: ASCII encoding error: ordinal not in range(128)
      >>> print chr(170)
      ?
      >>> if alef == chr(170): print "Hebrew is Roman diacritic"
      ...
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeError: ASCII decoding error: ordinal not in range(128)

  A Unicode string that is composed of only ASCII characters,
  however, is considered equal (but not identical) to a Python
  string of the same characters.

      >>> u"spam" == "spam"
      1
      >>> u"spam" is "spam"
      0
      >>> "spam" is "spam" # string interning is not guaranteed
      1
      >>> u"spam" is u"spam" # unicode interning not guaranteed
      1

  Still, the care you take should not discourage you from working
  with multilanguage strings, as Unicode enables. It is really
  amazingly powerful to be able to do so. As one says of a talking
  dog: It is not that he speaks so -well-, but that he speaks at
  all.

  =================================================================
    Built-In Unicode Functions/Methods
  =================================================================

  The Unicode string method `u"".encode()` and the built-in
  function `unicode()` are inverse operations. The Unicode
  string method returns a plain string with the 8-bit bytes
  needed to represent it (using the specified or default
  encoding). The built-in `unicode()` takes one of these encoded
  strings, and produces the Unicode object represented by the
  encoding. Specifically, suppose we define the function:

      >>> chk_eq = lambda u,enc: u == unicode(u.encode(enc),enc)

  The call `chk_eq(u,enc)` should return 1 for every value of
  'u' and 'enc'--as long as 'enc' is a valid encoding name and 'u'
  is capable of being represented in that encoding.
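
  For instance, a string that mixes ASCII with a Hebrew character
  round-trips losslessly through 'utf-8':

      >>> chk_eq(u'A'+unichr(1488), 'utf-8')
      1

  The same call with 'ascii' as the encoding raises a
  'UnicodeError' instead, since unichr(1488) cannot be represented
  in that encoding.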

  The set of encodings supported by both built-ins is listed
  below. Additional encodings may be registered using the
  [codecs] module. Each encoding is indicated by the string that
  names it, and the case of the string is normalized before
  comparison (case-insensitive naming of encodings):

  ascii, us-ascii
      Encode using 7-bit ASCII.

  base64
      Encode Unicode strings using the base64 3-to-4 encoding
      format.

  latin-1, iso-8859-1
      Encode using common European accent characters in high-bit
      values of 8-bit bytes. Latin-1 character's `ord()` values
      are identical to their Unicode codepoints.

  quopri
      Encode in quoted printable format.

  rot13
      Not really a Unicode encoding, but "rotate 13 chars" is
      included with Python 2.2+ as an example and convenience.

  utf-7
      Encode using variable byte-length encoding that is
      restricted to 7-bit ASCII octets. As with 'utf-8', ASCII
      characters encode themselves.

  utf-8
      Encode using variable byte-length encoding that preserves
      ASCII value bytes.

  utf-16
      Encode using 2/4 byte encoding. Include "endian" lead
      bytes (platform-specific selection).

  utf-16-le
      Encode using 2/4 byte encoding. Assume "little
      endian," and do not prepend "endian" indicator bytes.

  utf-16-be
      Encode using 2/4 byte encoding. Assume "big endian,"
      and do not prepend "endian" indicator bytes.

  unicode-escape
      Encode using Python-style Unicode string constants
      ('u"\uXXXX"').

  raw-unicode-escape
      Encode using Python-style Unicode raw string constants
      ('ur"\uXXXX"').

  The error modes for both built-ins are listed below. Errors in
  encoding transformations may be handled in any of several ways:

  strict
      Raise 'UnicodeError' for all decoding errors. Default
      handling.

  ignore
      Skip all invalid characters.

  replace
      Replace invalid characters with '?' (string target) or
      'u"\xfffd"' (Unicode target).

  u"".encode([enc [,errmode]])
  "".encode([enc [,errmode]])
      Return an encoded string representation of a Unicode string
      (or of a plain string). The representation is in the style
      of encoding 'enc' (or system default). This string is
      suitable for writing to a file or stream that other
      applications will treat as Unicode data. Examples show
      several encodings:

      >>> alef = unichr(1488)
      >>> s = 'A'+alef
      >>> s
      u'A\u05d0'
      >>> s.encode('unicode-escape')
      'A\\u05d0'
      >>> s.encode('utf-8')
      'A\xd7\x90'
      >>> s.encode('utf-16')
      '\xff\xfeA\x00\xd0\x05'
      >>> s.encode('utf-16-le')
      'A\x00\xd0\x05'
      >>> s.encode('ascii')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeError: ASCII encoding error: ordinal not in range(128)
      >>> s.encode('ascii','ignore')
      'A'

  unicode(s [,enc [,errmode]])
      Return a Unicode string object corresponding to the encoded
      string passed in the first argument 's'. The string 's'
      might be a string that is read from another Unicode-aware
      application. The representation is treated as conforming
      to the style of the encoding 'enc' if the second argument
      is specified, or the system default otherwise (usually
      'ascii'). Errors can be handled in the default 'strict'
      style or in a style specified in the third argument
      'errmode'.

  unichr(cp)
      Return a Unicode string object containing the single
      Unicode character whose integer codepoint is passed in the
      argument 'cp'.

  =================================================================
    MODULE -- codecs : Python Codec Registry, API, and helpers
  =================================================================

  The [codecs] module contains a lot of sophisticated
  functionality to get at the internals of Python's Unicode
  handling. Most of those capabilities are at a lower level than
  programmers who are just interested in text processing need to
  worry about. The documentation of this module, therefore, will
  break slightly with the style of most of the documentation and
  present only two very useful wrapper functions within the
  [codecs] module.

  codecs.open(filename=... [,mode='rb' [,encoding=...
    [,errors='strict' [,buffering=1]]]])
      This wrapper function provides a simple and direct means of
      opening a Unicode file, and treating its contents directly
      as Unicode. In contrast, the contents of a file opened with
      the built-in `open()` function are written and read as plain
      strings; reading and writing Unicode data to such a file
      involves extra passes through `u"".encode()` and `unicode()`.

      The first argument 'filename' specifies the name of the
      file to access. If the second argument 'mode' is
      specified, the read/write mode can be selected. These
      arguments work identically to those used by `open()`. If
      the third argument 'encoding' is specified, this encoding
      will be used to interpret the file (an incorrect encoding
      will probably result in a 'UnicodeError'). Error handling
      may be modified by specifying the fourth argument 'errors'
      (the options are the same as with the built-in `unicode()`
      function). A fifth argument 'buffering' may be specified
      to use a specific buffer size (on platforms that support
      this).

      An example of usage clarifies the difference between
      `codecs.open()` and the built-in `open()`:

      >>> import codecs
      >>> alef = unichr(1488)
      >>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
      >>> open('unicode_test').read() # Read as plain string
      'A\xd7\x90'
      >>> # Now read directly as Unicode
      >>> codecs.open('unicode_test', encoding='utf-8').read()
      u'A\u05d0'

      Data written back to a file opened with `codecs.open()`
      should likewise be Unicode data.

      SEE ALSO, `open()`

  codecs.EncodedFile(file=..., data_encoding=...
    [,file_encoding=... [,errors='strict']])
      This function allows an already opened file to be wrapped
      inside an "encoding translation" layer. The mode and
      buffering are taken from the underlying file. By
      specifying a second argument 'data_encoding' and a third
      argument 'file_encoding', it is possible to generate
      strings in one encoding within an application, then write
      them directly into the appropriate file encoding. As with
      `codecs.open()` and `unicode()`, an error handling style
      may be specified with the fourth argument 'errors'.

      The most likely purpose for `codecs.EncodedFile()` is where
      an application is likely to receive byte-streams from
      multiple sources, encoded according to multiple Unicode
      encodings. By wrapping file objects (or file-like objects)
      in an encoding translation layer, the strings coming in one
      encoding can be transparently written to an output in the
      format the output expects. An example clarifies:

      >>> import codecs
      >>> alef = unichr(1488)
      >>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
      >>> fp = open('unicode_test','rb+')
      >>> fp.read() # Plain string w/ two-byte UTF-8 char in it
      'A\xd7\x90'
      >>> utf16_writer = codecs.EncodedFile(fp,'utf-16','utf-8')
      >>> ascii_writer = codecs.EncodedFile(fp,'ascii','utf-8')
      >>> utf16_writer.tell() # Wrapper keeps same current position
      3
      >>> s = alef.encode('utf-16')
      >>> s # Plain string as UTF-16 encoding
      '\xff\xfe\xd0\x05'
      >>> utf16_writer.write(s)
      >>> ascii_writer.write('XYZ')
      >>> fp.close() # File should be UTF-8 encoded
      >>> open('unicode_test').read()
      'A\xd7\x90\xd7\x90XYZ'

      SEE ALSO, `codecs.open()`


  =================================================================
    MODULE -- unicodedata : Database of Unicode characters
  =================================================================

  The module [unicodedata] is a database of Unicode character
  entities. Most of the functions in [unicodedata] take as an
  argument one Unicode character and return some information about
  the character contained in a plain (non-Unicode) string. The
  function of [unicodedata] is essentially informational, rather
  than transformational. Of course, an application might make
  decisions about the transformations performed based on the
  information returned by [unicodedata]. The short utility below
  provides all the information available for any Unicode
  codepoint:

      #------------------ unichr_info.py ----------------------#
      # Return all the information [unicodedata] has
      # about the single unicode character whose codepoint
      # is specified as a command-line argument.
      # Arg may be any expression evaluating to an integer
      from unicodedata import *
      import sys
      char = unichr(eval(sys.argv[1]))
      print 'bidirectional', bidirectional(char)
      print 'category     ', category(char)
      print 'combining    ', combining(char)
      print 'decimal      ', decimal(char,0)
      print 'decomposition', decomposition(char)
      print 'digit        ', digit(char,0)
      print 'mirrored     ', mirrored(char)
      print 'name         ', name(char,'NOT DEFINED')
      print 'numeric      ', numeric(char,0)
      try: print 'lookup       ', `lookup(name(char))`
      except: print "Cannot lookup"

  The usage of 'unichr_info.py' is illustrated below by the runs
  with two possible arguments:

      #*--------------- Using unichr_info.py ------------------#
      % python unichr_info.py 1488
      bidirectional R
      category      Lo
      combining     0
      decimal       0
      decomposition
      digit         0
      mirrored      0
      name          HEBREW LETTER ALEF
      numeric       0
      lookup        u'\u05d0'

      % python unichr_info.py "ord('1')"
      bidirectional EN
      category      Nd
      combining     0
      decimal       1
      decomposition
      digit         1
      mirrored      0
      name          DIGIT ONE
      numeric       1.0
      lookup        u'1'

  For additional information on current Unicode character
  codepoints and attributes, consult:

    <http://www.unicode.org/Public/UNIDATA/UnicodeData.html>

  FUNCTIONS:

  unicodedata.bidirectional(unichr)
      Return the bidirectional characteristic of the character
      specified in the argument 'unichr'. Possible values are
      AL, AN, B, BN, CS, EN, ES, ET, L, LRE, LRO, NSM, ON, PDF, R,
      RLE, RLO, S, and WS. Consult the URL above for details on
      these. Particularly notable values are L (left-to-right), R
      (right-to-left), and WS (whitespace).
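
      For example:

      >>> from unicodedata import *
      >>> bidirectional(u'a'), bidirectional(u' ')
      ('L', 'WS')
      >>> bidirectional(unichr(1488)) # HEBREW LETTER ALEF
      'R'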

  unicodedata.category(unichr)
      Return the category of the character specified in the
      argument 'unichr'. Possible values are Cc, Cf, Cn, Ll,
      Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf,
      Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, and Zs. The first
      (capital) letter indicates L (letter), M (mark), N
      (number), P (punctuation), S (symbol), Z (separator), or
      C (other). The second letter is generally mnemonic within
      the major category of the first letter. Consult the URL
      above for details.
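
      For example:

      >>> from unicodedata import *
      >>> category(u'a'), category(u'A'), category(u'1')
      ('Ll', 'Lu', 'Nd')
      >>> category(u'$'), category(u' ')
      ('Sc', 'Zs')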

  unicodedata.combining(unichr)
      Return the numeric combining class of the character
      specified in the argument 'unichr'. These include values
      such as 218 (below left) or 210 (right attached). Consult
      the URL above for details.
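
      For example, ordinary letters have combining class 0, while
      U+0300 (COMBINING GRAVE ACCENT) attaches above its base
      character (class 230):

      >>> from unicodedata import *
      >>> combining(u'a'), combining(unichr(0x300))
      (0, 230)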

  unicodedata.decimal(unichr [,default])
      Return the numeric decimal value assigned to the character
      specified in the argument 'unichr'. If the second argument
      'default' is specified, return that if no value is assigned
      (otherwise raise 'ValueError').

  unicodedata.decomposition(unichr)
      Return the decomposition mapping of the character specified
      in the argument 'unichr', or empty string if none exists.
      Consult the URL above for details. An example shows that
      some characters may be broken into component characters:

      >>> from unicodedata import *
      >>> name(unichr(190))
      'VULGAR FRACTION THREE QUARTERS'
      >>> decomposition(unichr(190))
      '<fraction> 0033 2044 0034'
      >>> name(unichr(0x33)), name(unichr(0x2044)), name(unichr(0x34))
      ('DIGIT THREE', 'FRACTION SLASH', 'DIGIT FOUR')

  unicodedata.digit(unichr [,default])
      Return the numeric digit value assigned to the character
      specified in the argument 'unichr'. If the second argument
      'default' is specified, return that if no value is assigned
      (otherwise raise 'ValueError').

  unicodedata.lookup(name)
      Return the Unicode character with the name specified in
      the first argument 'name'. Matches must be exact, and
      'KeyError' is raised if no match is found. For example:

      >>> from unicodedata import *
      >>> lookup('GREEK SMALL LETTER ETA')
      u'\u03b7'
      >>> lookup('ETA')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      KeyError: undefined character name

      SEE ALSO, `unicodedata.name()`

  unicodedata.mirrored(unichr)
      Return 1 if the character specified in the argument
      'unichr' is a mirrored character in bidirectional text.
      Return 0 otherwise.
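
      For example:

      >>> from unicodedata import *
      >>> mirrored(u'('), mirrored(u'a')
      (1, 0)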

  unicodedata.name(unichr)
      Return the name of the character specified in the argument
      'unichr'. Names are in all caps and have a regular form
      by descending category importance. Consult the URL above
      for details.

      SEE ALSO, `unicodedata.lookup()`

  unicodedata.numeric(unichr [,default])
      Return the floating point numeric value assigned to the
      character specified in the argument 'unichr'. If the
      second argument 'default' is specified, return that if no
      value is assigned (otherwise raise 'ValueError').
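
      The three numeric lookups differ subtly. For example, a
      character like U+00B2 (SUPERSCRIPT TWO) has digit and
      numeric values assigned, but no decimal value:

      >>> from unicodedata import *
      >>> sup2 = unichr(0xB2)
      >>> name(sup2)
      'SUPERSCRIPT TWO'
      >>> decimal(sup2,-1), digit(sup2,-1), numeric(sup2,-1)
      (-1, 2, 2.0)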


SECTION 3 -- Solving Problems
------------------------------------------------------------------------

  EXERCISE: Many ways to take out the garbage
  --------------------------------------------------------------------

  DISCUSSION:

  Recall, if you will, the dictum in "The Zen of Python" that
  "There should be one--and preferably only one--obvious way to
  do it." As with most dictums, the real world sometimes fails
  our ideals. Also as with most dictums, this is not necessarily
  such a bad thing.

  A discussion on the newsgroup '<comp.lang.python>' in 2001 posed
  an apparently rather simple problem. The immediate problem was
  that one might encounter telephone numbers with a variety of
  dividers and delimiters inside them. For example, '(123)
  456-7890', '123-456-7890', or '123/456-7890' might all represent
  the same telephone number, and all forms might be encountered in
  textual data sources (such as ones entered by users of a
  free-form entry field). For purposes of this problem, the
  canonical form of this number should be '1234567890'.

  The problem mentioned here can be generalized in some natural
  ways: Maybe we are interested in only some of the characters
  within a longer text field (in this case, the digits), and the
  rest is simply filler. So the general problem is how to
  extract the content out from the filler.

  The first and "obvious" approach might be a procedural loop
  through the initial string. One version of this approach might
  look like:

      >>> s = '(123)/456-7890'
      >>> result = ''
      >>> for c in s:
      ...     if c in '0123456789':
      ...         result = result + c
      ...
      >>> result
      '1234567890'

  This first approach works fine, but it might seem a bit bulky for
  what is, after all, basically a single action. And it might also
  seem odd that you need to loop through the string
  character-by-character rather than just transform the whole
  string.

  One possibly simpler approach is to use a regular expression. For
  readers who have skipped to the next chapter, or who know regular
  expressions already, this approach seems obvious:

      >>> import re
      >>> s = '(123)/456-7890'
      >>> re.sub(r'\D', '', s)
      '1234567890'

  The actual work done (excluding defining the initial string and
  importing the [re] module) is just one short expression. Good
  enough, but one catch with regular expressions is that they are
  frequently far slower than basic string operations. This makes
  no difference for the tiny example presented, but for
  processing megabytes, it could start to matter.

  Using a functional style of programming is one way to express
  the "filter" in question rather tersely, and perhaps more
  efficiently. For example:

      >>> s = '(123)/456-7890'
      >>> filter(lambda c:c.isdigit(), s)
      '1234567890'

  We also get something short, without needing to use regular
  expressions. Here is another technique that utilizes string
  object methods and list comprehensions, and also pins some hopes
  on the great efficiency of Python dictionaries:

      >>> isdigit = {'0':1,'1':1,'2':1,'3':1,'4':1,
      ...            '5':1,'6':1,'7':1,'8':1,'9':1}.has_key
      >>> ''.join([x for x in s if isdigit(x)])
      '1234567890'
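
  One more approach deserves a sketch here, since Question 4 below
  hints at it: the `string.translate()` function can delete
  unwanted characters in a single pass through the string, in code
  implemented in C. Continuing the same session:

      >>> import string
      >>> allchars = string.maketrans('', '')
      >>> non_digits = allchars.translate(allchars, string.digits)
      >>> s.translate(allchars, non_digits)
      '1234567890'

  Whether this really is the fastest technique shown is left to
  the benchmarks that the questions below ask you to construct.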

  QUESTIONS:

  1. Which content extraction technique seems most natural to
      you? Which would you prefer to use? Explain why.

  2. What intuitions do you have about the performance of these
      different techniques, if applied to large data sets? Are
      there differences in comparative efficiency of techniques
      between operating on one single large string input and
      operating on a large number of small string inputs?

  3. Construct a program to verify or refute your intuitions
      about performance of the constructs.

  4. Can you think of ways of combining these techniques to
      maximize efficiency? Are there any other techniques available
      that might be even better (hint: think about what
      `string.translate()` does)? Construct a faster technique,
      and demonstrate its efficiency.

  5. Are there reasons other than raw processing speed to prefer
      some of these techniques over others? Explain these reasons,
      if they exist.


  EXERCISE: Making sure things are what they should be
  --------------------------------------------------------------------

  DISCUSSION:

  The concept of a "digital signature" was introduced in Section
  2.2.4. As was mentioned, the Python standard library does not
  include (directly) any support for digital signatures. One way to
  characterize a digital signature is as some information that
  -proves- or -verifies- that some other information really is what
  it purports to be. But this characterization actually applies to
  a broader set of things than just digital signatures. In the
  cryptology literature, one customarily talks about the "threat
  model" a crypto-system defends against. Let us look at a few.

  Data may be altered by malicious tampering, but it may also be
  altered by packet loss, storage-media errors, or by program
  errors. The threat of accidental damage to data is the easiest
  threat to defend against. The standard technique is to use a
  hash of the correct data and send that also. The receiver of
  the data can simply calculate the hash of the data
  herself--using the same algorithm--and compare it with the
  hash sent. A very simple utility like the one below does this:

      #---------- crc32.py ----------#
      # Calculate CRC32 hash of input files or STDIN
      # Incremental read for large input sources
      # Usage: python crc32.py [file1 [file2 [...]]]
      # or: python crc32.py < STDIN

      import binascii
      import fileinput
      filelist = []
      crc = binascii.crc32('')
      for line in fileinput.input():
          if fileinput.isfirstline():
              if fileinput.isstdin():
                  filelist.append('STDIN')
              else:
                  filelist.append(fileinput.filename())
          crc = binascii.crc32(line,crc)
      print 'Files:', ' '.join(filelist)
      print 'CRC32:', crc

  A slightly faster version could use `zlib.adler32()` instead of
  `binascii.crc32()`. The chance that a randomly corrupted file
  would have the right CRC32 hash is approximately 2**-32--unlikely
  enough not to worry about in most cases.
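
  A sketch of that substitution--assuming the rest of 'crc32.py'
  stays as above:

      #*---------- adler32 variant ----------#
      import zlib
      crc = zlib.adler32('')          # initial Adler-32 value
      # ...the same fileinput loop as above, but updating with:
      crc = zlib.adler32(line, crc)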

  A CRC32 hash, however, is far too weak to be used
  cryptographically. While a random data error will almost surely
  not create a chance hash collision, a malicious
  tamperer--Mallory, in crypto-parlance--can find one relatively
  easily. Specifically, if the true message is M, Mallory can
  find an M' such that CRC32(M) equals CRC32(M'). Moreover, even
  imposing the condition that M' appear plausible as a message to
  the receiver does not make Mallory's task particularly
  difficult.

  To thwart fraudulent messages, it is necessary to use a
  cryptographically strong hash, such as [SHA] or [MD5]. The
  utility to do so is almost the same as the one above:

      #---------- sha_hash.py ----------#
      # Calculate SHA hash of input files or STDIN
      # (named 'sha_hash.py' rather than 'sha.py' so the script
      # does not shadow the standard [sha] module on import)
      # Usage: python sha_hash.py [file1 [file2 [...]]]
      # or: python sha_hash.py < STDIN

      import sha, fileinput, os, sys
      filelist = []
      digest = sha.sha()
      for line in fileinput.input():
          if fileinput.isfirstline():
              if fileinput.isstdin():
                  filelist.append('STDIN')
              else:
                  filelist.append(fileinput.filename())
          digest.update(line[:-1]+os.linesep) # same as binary read
      sys.stderr.write('Files: '+' '.join(filelist)+'\nSHA: ')
      print digest.hexdigest()

  An SHA or MD5 hash cannot be forged practically, but if our
  threat model includes a malicious tamperer, we need to worry
  about whether the hash itself is authentic. Mallory, our
  tamperer, can produce a false SHA hash that matches her false
  message. With CRC32 hashes, a very common procedure is to attach
  the hash to the data message itself--for example, as the first or
  last line of the data file, or within some wrapper lines. This is
  called an "in band" or "in channel" transmission. One alternative
  is "out of band" or "off channel" transmission of cryptographic
  hashes. For example, a set of cryptographic hashes matching data
  files could be placed on a Web page. Merely transmitting the hash
  off channel does not guarantee security, but it does require
  Mallory to attack both channels effectively.

  By using encryption, it is possible to transmit a secured hash
  in channel. The key here is to encrypt the hash and attach
  that encrypted version. If the hash is appended with some
  identifying information before the encryption, that can be
  recovered to prove identity. Otherwise, one could simply
  include both the hash and its encrypted version. For the
  encryption of the hash, an asymmetrical encryption algorithm is
  ideal; however, with the Python standard library, the best we
  can do is to use the (weak) symmetrical encryption in [rotor].
  For example, we could use the utility below:

      #---------- hash_rotor.py ----------#
      #!/usr/bin/env python
      # Encrypt hash on STDIN using sys.argv[1] as password
      import rotor, sys, binascii
      cipher = rotor.newrotor(sys.argv[1])
      hexhash = sys.stdin.read()[:-1] # no newline
      print hexhash
      hash = binascii.unhexlify(hexhash)
      sys.stderr.write('Encryption: ')
      print binascii.hexlify(cipher.encrypt(hash))

  The utilities could then be used like:

      #*-------- hash_rotor at work --------#
      % cat mary.txt
      Mary had a little lamb
      % python sha_hash.py mary.txt | python hash_rotor.py mypassword >> mary.txt
      Files: mary.txt
      SHA: Encryption:
      % cat mary.txt
      Mary had a little lamb
      c49bf9a7840f6c07ab00b164413d7958e0945941
      63a9d3a2f4493d957397178354f21915cb36f8f8

  The penultimate line of the file now has its SHA hash, and the
  last line has an encryption of the hash. The password used will
  somehow need to be transmitted securely for the receiver to
  validate the appended document (obviously, the whole system makes
  more sense with longer and more proprietary documents than this
  example).

  QUESTIONS:

  1. How would you wrap up the suggestions in the small
      utilities above into a more robust and complete
      "digital_signatures.py" utility or module? What concerns
      would come into a completed utility?

  2. Why is CRC32 not suitable for cryptographic purposes? What
      sets SHA and MD5 apart (you should not need to know the
      details of the algorithm for this answer)? Why is
      uniformity of coverage of hash results important for any
      hash algorithm?

  3. Explain in your own words why hashes serve to verify
      documents. If you were actually the malicious attacker in
      the scenarios above, how would you go about interfering
      with the crypto-systems outlined here? What lines of
      attack are left open by the system you sketched out or
      programmed in (1)?

  4. If messages are subject to corruptions, including
      accidental corruption, so are hashes. The short length of
      hashes may make problems in them less likely, but not
      impossible. How might you enhance the document verification
      systems above to detect corruption within a hash itself?
      How might you allow more accurate targeting of corrupt
      versus intact portions of a large document (it may be
      desirable to recover as much as possible from a corrupt
      document)?

  5. Advanced: The RSA public-key algorithm is actually quite
      simple; it just involves some modular exponentiation
      operations and some large primes. An explanation can be
      found, among other places, at the author's -Introduction
      to Cryptology Concepts II-:

        <http://gnosis.cx/publish/programming/cryptology2.pdf>

      Try implementing an RSA public-key algorithm in Python, and
      use this to enrich the digital signature system you
      developed above.


  EXERCISE: Finding needles in haystacks (full-text indexing)
  --------------------------------------------------------------------

  DISCUSSION:

  Many texts you deal with are loosely structured and prose-like,
  rather than composed of well-ordered records. For documents of
  that sort, a very frequent question you want answered is, "What
  is (or isn't) in the documents?"--at a more general level than
  the semantic richness you might obtain by actually -reading- the
  documents. In particular, you often want to check a large
  collection of documents to determine the (comparatively) small
  subset of them that are relevant to a given area of interest.

  A certain category of questions about document collections has
  nothing much to do with text processing. For example, to locate
  all the files modified within a certain time period, and having a
  certain file size, some basic use of the [os.path] module
  suffices. Below is a sample utility to do such a search, which
  includes some typical argument parsing and help screens. The
  search itself is only a few lines of code:

      #---------- findfile1.py ----------#
      # Find files matching date and size
      _usage = """
      Usage:
         python findfile1.py [-start=days_ago] [-end=days_ago]
                             [-small=min_size] [-large=max_size] [pattern]
       Example:
         python findfile1.py -start=10 -end=5 -small=1000 -large=5000 *.txt
      """
      import os.path
      import time
      import glob
      import sys

      def parseargs(args):
          """Somewhat flexible argument parser for multiple platforms.

          Switches can start with - or /, keywords can end with = or :.
          No error checking for bad arguments is performed, however.
          """
          now = time.time()
          secs_in_day = 60*60*24
          start = 0 # start of epoch
          end = time.time() # right now
          small = 0 # empty files
          large = sys.maxint # max file size
          pat = '*' # match all
          for arg in args:
             if arg[0] in '-/':
                if arg[1:6]=='start': start = now-(secs_in_day*int(arg[7:]))
                elif arg[1:4]=='end': end = now-(secs_in_day*int(arg[5:]))
                elif arg[1:6]=='small': small = int(arg[7:])
                elif arg[1:6]=='large': large = int(arg[7:])
                elif arg[1] in 'h?': print _usage
             else:
                pat = arg
          return (start,end,small,large,pat)

      if __name__ == '__main__':
          if len(sys.argv) > 1:
              (start,end,small,large,pat) = parseargs(sys.argv[1:])
              for fname in glob.glob(pat):
                  if not os.path.isfile(fname):
                      continue # don't check directories
                  modtime = os.path.getmtime(fname)
                  size = os.path.getsize(fname)
                  if small <= size <= large and start <= modtime <= end:
                      print time.ctime(modtime),'%8d '%size,fname
          else: print _usage

  What about searching for text inside files? The `string.find()`
  function is good for locating contents quickly and could be
  used to search files for contents. But for large document
  collections, hits may be common. To make sense of search
  results, ranking the results by number of hits can help. The
  utility below performs a match-accuracy ranking (for brevity,
  without the argument parsing of 'findfile1.py'):

      #---------- findfile2.py ----------#
      # Find files that contain a word
      _usage = "Usage: python findfile.py word"
      import os.path
      import glob
      import sys

      if len(sys.argv) == 2:
          search_word = sys.argv[1]
          results = []
          for fname in glob.glob('*'):
              if os.path.isfile(fname): # don't check directories
                  text = open(fname).read()
                  fsize = len(text)
                  hits = text.count(search_word)
                  density = (fsize > 0) and float(hits)/(fsize)
                  if density > 0: # consider when density==0
                      results.append((density,fname))
          results.sort()
          results.reverse()
          print 'RANKING FILENAME'
          print '------- --------------------------'
          for match in results:
              print '%6d '%int(match[0]*1000000), match[1]
      else:
          print _usage

  Variations on these are, of course, possible. But generally
  you could build pretty sophisticated searches and rankings by
  adding new search options incrementally to 'findfile2.py'. For
  example, adding some regular expression options could give the
  utility capabilities similar to the 'grep' utility.

  The place where a word search program like the one above falls
  terribly short is in speed of locating documents in -very-
  large document collections. Even something as fast and well
  optimized as 'grep' simply takes a while to search a lot of
  source text. Fortunately, it is possible to -shortcut- this
  search time, as well as add some additional capabilities.

  A technique for rapid searching is to perform a generic search
  just once (or periodically) and create an index--i.e.,
  database--of those generic search results. Performing a later
  search need not -really- search contents, but only check the
  abstracted and structured index of possible searches. The utility
  'indexer.py' is a functional example of such a computed search
  index. The most current version may be downloaded from the
  book's Web site <http://gnosis.cx/TPiP/>.

  The utility 'indexer.py' allows very rapid searching for the
  simultaneous occurrence of multiple words within a file. For
  example, one might want to locate all the document files (or
  other text sources, such as VARCHAR database fields) that
  contain the words 'Python', 'index', and 'search'. Supposing
  there are many thousands of candidate documents, searching them
  on an ad hoc basis could be slow. But 'indexer.py' creates a
  comparatively compact collection of persistent dictionaries
  that provide answers to such inquiries.

  The full source code to 'indexer.py' is worth reading, but most
  of it deals with a variety of persistence mechanisms and with an
  object-oriented programming (OOP) framework for reuse. The
  underlying idea is simple, however. Create three dictionaries
  based on scanning a collection of documents:

      #*---------- Index dictionaries ----------#
      *Indexer.fileids: fileid --> filename
      *Indexer.files: filename --> (fileid, wordcount)
      *Indexer.words: word --> {fileid1:occurs, fileid2:occurs, ...}

  The essential mapping is '*Indexer.words'. For each word, what
  files does it occur in and how often? The mappings
  '*Indexer.fileids' and '*Indexer.files' are ancillary. The
  first just allows shorter numeric aliases to be used instead of
  long filenames in the '*Indexer.words' mapping (a performance
  boost and storage saver). The second, '*Indexer.files', also
  holds a total wordcount for each file. This allows a ranking
  of the importance of different matches. The thought is that a
  megabyte file with ten occurrences of 'Python' is less focused
  on the topic of Python than is a kilobyte file with the same
  ten occurrences.

  Both generating and utilizing the mappings above are
  straightforward. To search for multiple words, one simply needs
  the intersection of the results of several values of the
  '*Indexer.words' dictionary, one value for each word key.
  Generating the mappings involves incrementing counts in the
  nested dictionary of '*Indexer.words', but is not complicated.
  The sketch following this paragraph illustrates both operations.
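
  A toy version of the idea--not the actual 'indexer.py' code, and
  using hypothetical names like 'make_index()' and
  'find_all()'--might look like this:

      #*---------- toy index sketch ----------#
      def make_index(texts):       # texts: {filename : contents}
          fileids, files, words = {}, {}, {}
          fileid = 0
          for fname in texts.keys():
              wordlist = texts[fname].split()
              fileids[fileid] = fname
              files[fname] = (fileid, len(wordlist))
              for word in wordlist:
                  occurs = words.setdefault(word, {})
                  occurs[fileid] = occurs.get(fileid, 0) + 1
              fileid = fileid + 1
          return (fileids, files, words)

      def find_all(index, search_words):
          fileids, files, words = index
          result = None
          for word in search_words:
              occurs = words.get(word, {})
              if result is None:
                  result = occurs.copy()
              else:                # intersect with matches so far
                  for fileid in result.keys():
                      if not occurs.has_key(fileid):
                          del result[fileid]
          if not result: return []
          return [fileids[fileid] for fileid in result.keys()]

  Ranking by word density (using the wordcounts in the second
  mapping) is omitted for brevity. A quick session:

      >>> idx = make_index({'a.txt':'Python index search',
      ...                   'b.txt':'Python only'})
      >>> find_all(idx, ['Python','index','search'])
      ['a.txt']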

  QUESTIONS:

  1. One of the most significant--and surprisingly
      subtle--concerns in generating useful word indexes is
      figuring out just what a "word" is. What considerations
      would you bring to determine word identities? How might
      you handle capitalization? Punctuation? Whitespace? How
      might you disallow binary strings that are not "real"
      words. Try performing word-identification tests against
      real-world documents. How successful were you?

  2. Could other data structures be used to store word index
      information than those proposed above? If other data
      structures are used, what efficiency (speed) advantages or
      disadvantages do you expect to encounter? Are there other
      data structures that would allow for additional search
      capabilities than the multiword search of 'indexer.py'?
      If so, what other indexed search capabilities would have
      the most practical benefit?

  3. Consider adding integrity guarantees to index results.
      What if an index falls out of synchronization with the
      underlying documents? How might you address referential
      integrity? Hint: consider `binascii.crc32`, [sha], and
      [md5]. What changes to the data structures would be needed
      for integrity checks? Implement such an improvement.

  4. The utility 'indexer.py' has some ad hoc exclusions of
      nontextual files from inclusion in an index, based simply
      on some file extensions. How might one perform accurate
      exclusion of nontextual data? What does it mean for a
      document to contain text? Try writing a utility
      'istextual.py' that will identify text and nontext
      real-world documents. Does it work to your satisfaction?

  5. Advanced: 'indexer.py' implements several different
      persistence mechanisms. What other mechanisms might you
      use besides those implemented? Benchmark your mechanism.
      Does it do better than 'SlicedZPickleIndexer' (the best
      variant included, in both speed and space)?

}}}

TableOfContents

==CHAPTER II -- BASIC STRING OPERATIONS==
