
Detecting XML Document Encodings with SAX and XNI

Source: 岁月联盟  Editor: zhuzhu  Date: 2009-01-10

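XML is defined in terms of Unicode characters. When transmitted and stored on modern computers, those Unicode characters must be stored as bytes, which a parser then decodes. Many encoding schemes serve this purpose: UTF-8, UTF-16, ISO-8859-1, Cp1252, SJIS, and others.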
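To make the byte-level point concrete, here is a minimal sketch showing how one character turns into different byte sequences under different encodings. The class name EncodingBytes and the hex helper are illustrative and not part of the original article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Renders the bytes of text in the given charset as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // the single character é
        System.out.println(hex(text, StandardCharsets.UTF_8));      // prints "C3 A9"
        System.out.println(hex(text, StandardCharsets.ISO_8859_1)); // prints "E9"
        System.out.println(hex(text, StandardCharsets.UTF_16BE));   // prints "00 E9"
    }
}
```

A parser that guesses the wrong scheme would decode these bytes into different characters, which is why the encoding matters whenever you handle the raw bytes yourself.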

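Usually, though not always, you don't really care about the underlying encoding. An XML parser converts whatever encoding the document is written in to Unicode strings and character arrays, and your program operates on the decoded strings. This article discusses the less common cases in which you really do care about the underlying encoding.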

The most common case is wanting to preserve the input encoding for the output.

Another case is storing the document in a database as a string or a character large object (CLOB) without parsing it.

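Similarly, some systems transmit XML documents over HTTP without reading the whole document, but still need to set the HTTP Content-type header to specify the correct encoding. In such cases, you need to know how the document is encoded.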

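Most of the time, you know the encoding of documents you wrote yourself. However, for a document you did not write and only received from somewhere else (from an Atom feed, for example), the best approach is to use a streaming API such as the Simple API for XML (SAX), the Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use a tree-based API such as the Document Object Model (DOM). However, tree-based APIs must read the entire document, even though the first 100 bytes (or fewer) are usually enough to determine the encoding. A streaming API can read only as much as it needs and stop parsing once it has the answer, which is more efficient.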
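The article's listings cover SAX and XNI only. As a complementary sketch for StAX, which is mentioned among the streaming APIs, the example below asks a javax.xml.stream.XMLStreamReader for the encoding named in the XML declaration without reading past the prolog. The class name StaxEncodingPeek and the helper method are illustrative assumptions. XMLStreamReader also offers getEncoding() for the input encoding where the implementation knows it, and that method may return null.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEncodingPeek {
    // Returns the encoding named in the XML declaration, or null if none is declared.
    static String declaredEncoding(InputStream in) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // The reader starts at START_DOCUMENT, so no further
                // parsing is needed to read the declaration.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close(); // does not close the underlying stream
            }
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("Could not read the XML prolog", ex);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E9</a>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

Because the reader is discarded immediately after the prolog, the body of the document is never parsed.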

　　SAX

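Today, most SAX parsers, including the one bundled with Sun's Java™ software development kit (JDK) 6, can be used to detect the encoding. The technique is not hard to implement, but it is not obvious either. In brief: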

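In the setDocumentLocator method, cast the Locator argument to Locator2.

Save the Locator2 object in a field.

In the startDocument method, call the getEncoding() method on the Locator2 field.

(Optional) If that is all the information you need, throw a SAXException to end parsing early.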

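Listing 1 demonstrates the technique with a simple program that prints the encoding of every URL given on the command line.

Listing 1. Determining a document's encoding with SAX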

    import org.xml.sax.*;
    import org.xml.sax.ext.*;
    import org.xml.sax.helpers.*;
    import java.io.IOException;

    public class SAXEncodingDetector extends DefaultHandler {

        public static void main(String[] args) throws SAXException, IOException {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            SAXEncodingDetector handler = new SAXEncodingDetector();
            parser.setContentHandler(handler);
            for (int i = 0; i < args.length; i++) {
                try {
                    parser.parse(args[i]);
                }
                catch (SAXException ex) {
                    // startDocument() always throws, so this is the normal exit path
                    System.out.println(handler.encoding);
                }
            }
        }

        private String encoding;
        private Locator2 locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            // Only Locator2 exposes getEncoding(); plain Locator does not
            if (locator instanceof Locator2) {
                this.locator = (Locator2) locator;
            }
            else {
                this.encoding = "unknown";
            }
        }

        @Override
        public void startDocument() throws SAXException {
            if (locator != null) {
                this.encoding = locator.getEncoding();
            }
            // Stop parsing as soon as the encoding is known
            throw new SAXException("Early termination");
        }

    }

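This approach works about 90% of the time, possibly a little more. However, SAX parsers are not required to support the Locator interface, let alone Locator2 and other extension interfaces. If you know you are using Xerces, a second option is XNI.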
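Since Locator2 support is optional, it can help to check for it before relying on the SAX technique. The sketch below queries the standard SAX2 feature flag http://xml.org/sax/features/use-locator2 and treats an unrecognized or unsupported feature as "no". The class and method names are illustrative and not from the original article. Whether a particular parser reports the flag accurately is an assumption you should verify for your own parser.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class Locator2Support {
    // Standard SAX2 feature: true if the parser's Locator objects implement Locator2.
    static final String USE_LOCATOR2 = "http://xml.org/sax/features/use-locator2";

    static boolean supportsLocator2(XMLReader reader) {
        try {
            return reader.getFeature(USE_LOCATOR2);
        } catch (SAXException ex) {
            // SAXNotRecognizedException or SAXNotSupportedException:
            // the parser cannot answer, so assume Locator2 is unavailable.
            return false;
        }
    }

    // Obtains the platform's default SAX parser through JAXP.
    static XMLReader defaultReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException ex) {
            throw new IllegalStateException("No SAX parser available", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("Locator2 supported: " + supportsLocator2(defaultReader()));
    }
}
```

If the check returns false, falling back to XNI (with Xerces) or to another streaming API is one option.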

　　Xerces Native Interface

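The XNI approach is very similar to the SAX one (in fact, in Xerces the SAX parser is a thin layer on top of the native XNI parser). If anything, it is a bit easier, because the encoding is passed directly to startDocument() as an argument. All you need to do is read it, as Listing 2 shows.

Listing 2. Determining a document's encoding with XNI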

    import java.io.IOException;
    import org.apache.xerces.parsers.*;
    import org.apache.xerces.xni.*;
    import org.apache.xerces.xni.parser.*;

    public class XNIEncodingDetector extends XMLDocumentParser {

        public static void main(String[] args) throws XNIException, IOException {
            XNIEncodingDetector parser = new XNIEncodingDetector();
            for (int i = 0; i < args.length; i++) {
                try {
                    // args[i] is used as the system ID (the document's URL)
                    XMLInputSource document = new XMLInputSource("", args[i], "");
                    parser.parse(document);
                }
                catch (XNIException ex) {
                    // startDocument() always throws, so this is the normal exit path
                    System.out.println(parser.encoding);
                }
            }
        }

        private String encoding = "unknown";

        @Override
        public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext context, Augmentations augs)
                throws XNIException {
            this.encoding = encoding;
            // Stop parsing as soon as the encoding is known
            throw new XNIException("Early termination");
        }
    }

Note that, for reasons unknown, this technique works only with the actual Xerces classes in org.apache.xerces, not with the repackaged Xerces classes in com.sun.org.apache.xerces.internal that are bundled with Sun's JDK 6.

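XNI offers one more capability that SAX lacks. In rare cases, the encoding declared in the XML declaration is not the actual encoding. SAX reports only the actual encoding, but XNI can also tell you the declared encoding, through the xmlDecl() method, as Listing 3 shows.

Listing 3. Determining a document's declared and actual encodings with XNI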

    import java.io.IOException;
    import org.apache.xerces.parsers.*;
    import org.apache.xerces.xni.*;
    import org.apache.xerces.xni.parser.*;

    public class AdvancedXNIEncodingDetector extends XMLDocumentParser {

        public static void main(String[] args) throws XNIException, IOException {
            AdvancedXNIEncodingDetector parser = new AdvancedXNIEncodingDetector();
            for (int i = 0; i < args.length; i++) {
                try {
                    XMLInputSource document = new XMLInputSource("", args[i], "");
                    parser.parse(document);
                }
                catch (XNIException ex) {
                    System.out.println("Actual: " + parser.actualEncoding);
                    System.out.println("Declared: " + parser.declaredEncoding);
                }
            }
        }

        private String actualEncoding = "unknown";
        private String declaredEncoding = "none";

        @Override
        public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext namespaceContext, Augmentations augs)
                throws XNIException {
            this.actualEncoding = encoding;
            this.declaredEncoding = "none"; // reset
        }

        @Override
        // this method is not called if there's no XML declaration
        public void xmlDecl(String version, String encoding,
            String standalone, Augmentations augs) throws XNIException {
            this.declaredEncoding = encoding;
        }

        @Override
        // the root element starts after any XML declaration, so stop here
        public void startElement(QName element, XMLAttributes attributes,
            Augmentations augs) throws XNIException {
            throw new XNIException("Early termination");
        }

    }

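Normally, a mismatch between the declared and actual encodings indicates a bug in the server. The most common cause is an HTTP Content-type header that specifies a different encoding from the one declared in the XML declaration. In this case, strict adherence to the specification requires that the HTTP header value take precedence. In practice, however, the value in the XML declaration is quite likely the correct one.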
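The strict precedence rule described above can be sketched as a small helper. This is an illustration under stated assumptions rather than a complete implementation. The class EncodingPrecedence and its methods are hypothetical names, the header parsing handles only a simple charset parameter, and the final fallback to UTF-8 is a simplification that ignores byte order marks.

```java
import java.util.Locale;

class EncodingPrecedence {
    // Extracts the charset parameter from a Content-type value, or returns null.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Strict reading: the HTTP header wins, then the XML declaration,
    // then UTF-8 as an assumed default.
    static String resolve(String contentType, String declaredEncoding) {
        String fromHeader = charsetOf(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        if (declaredEncoding != null) {
            return declaredEncoding;
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        // prints "ISO-8859-1": the header overrides the declaration
        System.out.println(resolve("application/xml; charset=ISO-8859-1", "UTF-8"));
    }
}
```

A lenient consumer might reverse the first two checks, since the declaration is often the more trustworthy value in practice.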

Conclusion

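Usually, you do not need to know the encoding of an input document. Simply process the input with a parser and write the output in UTF-8. In the cases where you do need to know the input encoding, SAX and XNI offer a fast and efficient way to find it.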
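For the common case, where the input encoding does not matter and the output should simply be UTF-8, one possible sketch uses the JAXP identity transformer. The class name Utf8Reencoder is illustrative, and this is only one of several ways to serialize a parsed document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class Utf8Reencoder {
    // Parses the document in whatever encoding it uses and re-serializes it as UTF-8.
    static byte[] toUtf8(byte[] xml) {
        try {
            // newTransformer() with no stylesheet yields an identity transform
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new ByteArrayInputStream(xml)),
                               new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException ex) {
            throw new IllegalStateException("Could not re-encode the document", ex);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E9</a>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(toUtf8(latin1), StandardCharsets.UTF_8));
    }
}
```

The parser detects the input encoding on its own here, so the program never needs to inspect it.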
