Hi folks
I am experiencing an encoding issue that I can’t seem to nail down. Some of the text on the Channel 9 site is in Russian.
http://channel9.msdn.com/posts/borxab/499085/
The web page doesn’t seem to specify any particular encoding.
When I view the web page in Safari, the Russian characters look fine.
When I view the page source in Safari, the Russian characters look fine.
However, when displayed in Plex it looks like the attached screenshot.
These are the same kind of garbled characters you get if you view the web page with the Latin-1 encoding.
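To illustrate the symptom, here is a minimal Python 3 sketch, using only the standard library, of how UTF-8 bytes read as Latin-1 turn into this kind of garbage, and how the mix-up can be reversed as long as no bytes were lost along the way. The function names `garble` and `repair` are made up for this example and are not part of the Plex framework.

```python
def garble(text):
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the bug: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Привет"
broken = garble(original)

# Each Cyrillic letter takes two bytes in UTF-8, so the garbled
# string has twice as many characters as the original.
print(len(original), len(broken))
print(repair(broken) == original)
```

If the text in Plex can be repaired this way, the page bytes were most likely decoded as Latin-1 somewhere between the fetch and the display.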
I have tried various combinations of .encode() and .decode() with “Latin-1” and “utf-8” but I’m clutching at straws a bit.
Can anyone help with this?
Thanks,
Charlie.
Hi Charlie!
The encoding of that channel9 webpage is UTF-8, or at least that’s what the HTTP headers say. Getting the character encoding right has always been trial and error for me. You can try adding encoding='utf-8' to your XML.ElementFromURL or HTTP.Request call. To prevent errors (like “'xxx' codec can’t decode byte xxx in position xxx”) with wrongly encoded characters, you can also add errors='ignore':
content = XML.ElementFromURL(url, isHTML=True, encoding='utf-8', errors='ignore')
content = HTTP.Request(url, encoding='utf-8', errors='ignore')
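For reference, this is roughly what the error handlers do in plain Python 3. It is only a sketch of the standard `bytes.decode` behaviour; I am assuming the framework's `errors` argument works the same way, which I have not verified.

```python
# Valid UTF-8 followed by one byte that can never occur in UTF-8.
raw = "Привет".encode("utf-8") + b"\xff"

try:
    raw.decode("utf-8")  # default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is silently dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

print(strict_failed, ignored, replaced)
```

Note that 'ignore' only suppresses the exception. It does not fix text that was decoded with the wrong codec in the first place, and Latin-1 never raises because every byte is valid in it.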
On one occasion this didn't do the trick for me either. The solution I ended up with was to use *HTTP.Request* to retrieve the contents of the webpage and use *XML.ElementFromString* to process it:
content = HTTP.Request(url)
programs = XML.ElementFromString(content, isHTML=True).xpath(.......)
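The idea behind fetching first and parsing second is that you control the decoding step yourself. Below is a minimal standard-library sketch of that step in Python 3: it reads the charset from a Content-Type header value and decodes the raw bytes with it. The helpers `charset_from_content_type` and `decode_body` are hypothetical names for this example, and how you get at the raw bytes and headers inside the plugin framework is not shown here.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode a response body with the header's charset, else `default`."""
    charset = charset_from_content_type(content_type) or default
    try:
        return raw_bytes.decode(charset)
    except LookupError:
        # The header named a codec Python does not know: use the default.
        return raw_bytes.decode(default)


body = "Привет".encode("utf-8")
print(decode_body(body, "text/html; charset=UTF-8"))
```

Once the text is decoded correctly, it can be handed to the parser as a proper text string rather than as bytes of unknown encoding.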
Using *unicode()* also worked for me sometimes to output characters correctly:
title = unicode(program.xpath('./a')[0].text)
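One caveat, assuming the plugin runs on Python 2 (which the use of unicode() suggests): called on a byte string without an encoding argument, unicode() decodes as ASCII and fails on non-ASCII bytes, so passing the encoding explicitly, as in unicode(value, 'utf-8'), is safer. The same idea as a small Python 3 sketch, where `to_text` is a made-up helper name:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


print(to_text("Привет".encode("utf-8")))  # bytes are decoded as UTF-8
print(to_text("Привет"))                  # text is passed through unchanged
```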
I hope this can be of some help for you!
Hi sander1,
Thanks very much for your reply.
I have tried all the tips you suggested, but unfortunately nothing seems to work. I really don’t know what is going on now…
I have added encoding='utf-8' where the data is being fetched. I also tried HTTP.Request.
I have tried using the unicode() function (and without).
I am using the unicode font (have also tried default).
Nothing seems to make any difference to the output.
I would really appreciate it if anyone could take a look at my plugin and see what I’m doing wrong, or even just provide some feedback on the plugin so far.
Thanks,
Charlie