Accessing XML with windows-1252 encoding?

I’m working on a metadata agent for importing information from XML exported by DVD Profiler. The XML is encoded with windows-1252. I’m using the following code to load it:


collection = Core.storage.load(collectionFilename)
self.collectionXML = XML.ElementFromString(collection)




This, however, throws the exception "XMLSyntaxError: Unsupported encoding windows-1252" (full stack trace below). It seems that the lxml library used in Plex does not support this particular encoding. I've been able to work around the problem by manually changing the encoding to UTF-8 with Notepad++, but this turns all Scandinavian and other European special characters into question marks. A manual step is of course also not acceptable for the final agent.

So I have two questions:
1. What is the correct way to change the encoding so that the special characters would be preserved?
2. How would I do this within the agent code?

Any help is much appreciated.


2012-12-06 09:12:15,683 (218) :  CRITICAL (core:561) - Exception in the update function of agent named 'DVDP2Plex', called with guid 'com.plexapp.agents.dvdp2plex://5050629160611.16?lang=xn' (most recent call last):
  File "C:\Users\Henri\AppData\Local\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\agentkit.py", line 971, in _update
    agent.update(obj, media, lang)
  File "C:\Users\Henri\AppData\Local\Plex Media Server\Plug-ins\DVDP2Plex.bundle\Contents\Code\__init__.py", line 92, in update
    if not self.loadCollection(): return
  File "C:\Users\Henri\AppData\Local\Plex Media Server\Plug-ins\DVDP2Plex.bundle\Contents\Code\__init__.py", line 37, in loadCollection
    self.collectionXML = XML.ElementFromString(collection)
  File "C:\Users\Henri\AppData\Local\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 285, in ElementFromString
    return self._core.data.xml.from_string(string, encoding = encoding)
  File "C:\Users\Henri\AppData\Local\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\components\data.py", line 167, in from_string
    return etree.fromstring(markup)
  File "lxml.etree.pyx", line 2743, in lxml.etree.fromstring (src\lxml\lxml.etree.c:52665)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:71955)
XMLSyntaxError: Unsupported encoding windows-1252, line 1, column 42

I realised that I had done something wrong when manually changing the encoding to UTF-8 with Notepad++. The right feature is “Convert to UTF-8 without BOM”, located in the “Encoding” menu in Notepad++. The encoding attribute in the XML declaration at the beginning of the file must of course also be changed to UTF-8 by hand.



After these changes the XML file is parsed without problems by the lxml library and all the Scandinavian characters are shown properly. It also seems that this process removed some illegal characters which I previously had to remove manually.



To do this in Python, I did some research and found the following official guide quite useful:

http://docs.python.org/2/howto/unicode.html



There is also a link to a table of supported encodings:

http://docs.python.org/2/library/codecs.html#standard-encodings
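As a quick check that Python itself knows this encoding even though lxml rejects the name, something like the following could be run in a plain Python interpreter (outside Plex). This is only a minimal sketch; the byte values are an illustrative sample, not taken from the actual collection file.

```python
import codecs

# Look up the codec by the name used in the XML declaration.
info = codecs.lookup('windows-1252')
canonical_name = info.name  # the name Python uses internally for this codec

# Decode three sample windows-1252 bytes (0xE5, 0xE4, 0xF6) into text.
decoded = b'\xe5\xe4\xf6'.decode('windows-1252')

print('%s %r' % (canonical_name, decoded))
```

If the lookup succeeds, the same encoding name can be used when decoding the file content.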



With these I was able to build working code:


# load the xml collection file content into a variable
collectionStringWin1252 = Core.storage.load(collectionFilename)
# decode the windows-1252 bytes into a unicode object (Python 2)
collectionStringUTF8 = unicode(collectionStringWin1252, encoding="windows-1252")
# change the encoding named in the xml declaration to UTF-8 (first occurrence only)
collectionStringUTF8 = collectionStringUTF8.replace('windows-1252', 'UTF-8', 1)
# parse string to an XML object
self.collectionXML = XML.ElementFromString(collectionStringUTF8)
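A slightly more defensive variant of the same idea, as a sketch only: the helper name `to_utf8_xml` is hypothetical, the sample document is made up, and the standard library's `xml.etree.ElementTree` stands in for Plex's `XML.ElementFromString`, which is not available outside Plex. It rewrites the declared encoding regardless of case and returns real UTF-8 bytes, on the assumption that the parser accepts those.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding value inside a leading XML declaration, any case or quote style.
_DECL_ENCODING = re.compile(
    r'^(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']+(["\'])', re.IGNORECASE)


def to_utf8_xml(raw_bytes, source_encoding='windows-1252'):
    """Hypothetical helper: return raw_bytes re-encoded as UTF-8 XML bytes."""
    # Decode with the real source encoding so the special characters survive.
    text = raw_bytes.decode(source_encoding)
    # Rewrite only the encoding named in the declaration, not the document body.
    text = _DECL_ENCODING.sub(r'\g<1>UTF-8\g<2>', text, count=1)
    # Encode again so the bytes match what the declaration now claims.
    return text.encode('utf-8')


# Made-up sample document containing Scandinavian characters.
sample = (u'<?xml version="1.0" encoding="Windows-1252"?>'
          u'<Collection><Title>Sm\xe5 \xe4pplen</Title></Collection>'
          ).encode('windows-1252')

utf8_xml = to_utf8_xml(sample)
root = ET.fromstring(utf8_xml)
title = root.find('Title').text
```

Inside the agent, the result of `to_utf8_xml(Core.storage.load(collectionFilename))` would presumably be passed to `XML.ElementFromString` in place of the manually patched string, but I have not verified that against the Plex framework.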

