This article is from the Python Cookbook. The translation is solely for personal study and has no connection to any commercial copyright disputes.
-- 大熊 [2004-10-08 16:42:30]
Description
12.2 Checking XML Well-Formedness
Credit: Paul Prescod
12.2.1 Problem
You need to check whether an XML document is well-formed (not whether it conforms to a DTD or schema), and you need to do this quickly.
12.2.2 Solution
SAX (presumably using a fast parser such as Expat underneath) is the fastest and simplest way to perform this task:
    from xml.sax.handler import ContentHandler
    from xml.sax import make_parser
    from glob import glob
    import sys

    def parsefile(file):
        # Parse with a do-nothing handler; the parser raises an
        # exception at the first well-formedness error it meets.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(file)

    for arg in sys.argv[1:]:
        # Expand wildcards here, for shells that do not expand them.
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))
12.2.3 Discussion
A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, its XML declaration (if it has one) is correct, it has a single root element, all tags are properly nested, attribute values are quoted, and so on.
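The rules listed above can be tried out on small in-memory samples. The following is a minimal sketch, not part of the original recipe: the helper name `is_well_formed` and the sample documents are made up for illustration, and it uses `xml.sax.parseString` in place of the recipe's file-based `parse`.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def is_well_formed(text):
    """Return True if the bytes in `text` parse as well-formed XML.

    Hypothetical helper, not part of the recipe.
    """
    try:
        xml.sax.parseString(text, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

# Illustrative samples: one good document and one violation per rule.
samples = {
    "properly nested": b"<root><a><b/></a></root>",
    "two root elements": b"<root/><root/>",
    "badly nested tags": b"<root><a><b></a></b></root>",
    "unquoted attribute": b"<root id=1/>",
}

for label, text in samples.items():
    print("%-20s %s" % (label, is_well_formed(text)))
```

Only the first sample should be reported as well-formed; each of the others breaks one of the rules named above.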
This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document's contents. But in this case, we only want to know whether the document meets the most fundamental syntax constraints of XML; therefore, there is no processing that we need to do, and the do-nothing handler suffices.
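For contrast, here is a rough sketch of a handler that does process the content, next to the do-nothing base class the recipe relies on. The `ElementCounter` class and the sample document are hypothetical and serve only to illustrate the difference.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    """Hypothetical handler that counts element start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser once per opening (or empty) tag.
        self.count += 1

doc = b"<root><a/><a/></root>"

counter = ElementCounter()
xml.sax.parseString(doc, counter)
print(counter.count)

# The base ContentHandler ignores every event, so this call only
# succeeds or raises; that is all a well-formedness check needs.
xml.sax.parseString(doc, ContentHandler())
```

Either handler triggers the same syntax checking in the parser; the base class simply skips the bookkeeping.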
The parsefile function parses the whole document and raises an exception if there is an error. The recipe's main code catches any such exception and prints it out like this:
    $ python wellformed.py test.xml
    test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that character 2 on line 1,002 has a mismatched tag.
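The location in that message can also be read from the exception object itself. The sketch below assumes the error surfaces as an `xml.sax.SAXParseException`, which is what the standard Expat-based reader raises; the `first_error` helper and the sample document are invented for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def first_error(text):
    """Return (line, column, message) for the first error, or None.

    Hypothetical helper, not part of the recipe.
    """
    try:
        xml.sax.parseString(text, ContentHandler())
    except xml.sax.SAXParseException as e:
        return e.getLineNumber(), e.getColumnNumber(), e.getMessage()
    return None

# The closing tag on the third line does not match the open <a>.
broken = b"<root>\n  <a>\n  </b>\n</root>"
print(first_error(broken))
```

Having the line and column as separate values is handy when the result feeds an editor or a report instead of a console message.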
This recipe does not check adherence to a DTD or schema. That is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal, irreducible core task.
12.2.4 See Also
Recipe 12.3, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API. The PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs. The PyRXP package from ReportLab (http://www.reportlab.com/xml/pyrxp.html) is a wrapper around the faster validating parser RXP and is available under the GPL license.