RFE: Plex and Unicode

I'm struggling with a small plug-in where I may encounter characters that do not conform to regular Unicode decoding in Python 2.7.

 

Like the following:

 

A filename like Doppelgänger is reported as follows:

 

Plex Database: Doppelg%C3%A4nger

FileSystem: Doppelga%CC%88nger

 

My research suggests that even the normalize method doesn't work for all characters :(

 

But since Plex itself can handle this, I was wondering whether the function they use could be exposed in the framework, so that developers don't each have to maintain their own table?

 

Best Regards

 

Tommy

Hey dane22,
 
Ah, the fun of Unicode. The issue you're having is caused by Unicode decomposition. I won't launch into a long explanation of it, but there is a good Wikipedia article on the subject here.
 
The Plex scanners always use the NFKC normalization algorithm, which stores the filenames in their composed form. If you are comparing the database with the filesystem, you will need to apply the same normalization to strings from the filesystem. Here is a quick demonstration:
 
import urllib, unicodedata
filename = urllib.unquote('Doppelga%CC%88nger').decode('utf8')
composed_filename = unicodedata.normalize('NFKC', filename)
urllib.quote(composed_filename.encode('utf8')) # => Doppelg%C3%A4nger

Or you could decompose the ones from the Plex database using NFKD. 
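
To make the two directions concrete, here is a minimal sketch using only the standard `unicodedata` module. The variable names are purely illustrative; the two strings are the composed and decomposed spellings of the Doppelgänger example from the first post, written with explicit escapes so the difference is visible.

```python
import unicodedata

# Composed form: a single code point U+00E4 (Plex database example)
composed = u'Doppelg\u00e4nger'
# Decomposed form: plain 'a' followed by combining diaeresis U+0308 (filesystem example)
decomposed = u'Doppelga\u0308nger'

# The two strings render identically but are not equal as-is
print(composed == decomposed)

# Composing the filesystem string (NFKC) makes it match the database string
print(unicodedata.normalize('NFKC', decomposed) == composed)

# Decomposing the database string (NFKD) makes it match the filesystem string
print(unicodedata.normalize('NFKD', composed) == decomposed)
```

Either direction works as long as both sides of the comparison go through the same normalization form.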

There should be no reason why it won't work for all characters. Where were you finding this to be true?

Cheers,

Gray

First of all, sorry about the delay in responding here; somehow I wasn't notified about your reply as I normally am :(

Your code is a lot more advanced than the one I used, so thanks for that. However, I'm still faced with the same problem.

The code to grab the file names is:

####################################################################################################
# This function will scan the filesystem for files
####################################################################################################
@route(PREFIX + '/listTree')
def listTree(top, files=list()):
	Log.Debug("******* Starting ListTree with a path of %s***********" %(top))
	r = files[:]
	try:
		if not os.path.exists(top):
			Log.Debug("The file share [%s] is not mounted" %(top))
			return r
		for f in os.listdir(top):
			pathname = os.path.join(top, f)
			Log.Debug("Found a file named : %s" %(pathname))
			if os.path.isdir(pathname):
				r = listTree(pathname, r)
			elif os.path.isfile(pathname):
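				# os.listdir() returns raw (not percent-encoded) names, so only decoding is needed;
				# unquoting here would mangle a literal '%' in a file name
				filename = pathname.decode('utf8')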
				Log.Debug("Tommy 1 %s" %(filename))
				composed_filename = unicodedata.normalize('NFKC', filename)
				filename = urllib.quote(composed_filename.encode('utf8'))
				Log.Debug("Tommy 2 %s" %(filename))			
				r.append(filename)
			else:
				Log.Debug("Skipping %s" %(pathname))
		return r
	except UnicodeDecodeError:
		Log.Critical("Detected an invalid character in the file/directory following this : %s" %(pathname))
		# Return what was collected so far instead of an implicit None
		return r

When given the following parameter:

/share/MD0_DATA/FindMovies/Les Misérables/Les Misérables (1080p HD).m4v

it returns back the following filename:

/share/MD0_DATA/FindMovies/Les%20Mis%C3%A9rables/Les%20Mis%C3%A9rables%20%281080p%20HD%29.m4v

However, the following is registered in the database:

/share/MD0_DATA/FindMovies/Les%20Mise%CC%81rables/Les%20Mise%CC%81rables%20%281080p%20HD%29.m4v

So I needed some way of matching %C3%A9 with e%CC%81.

And the solution was simply to run your code against the database output as well, after which both ended up equal.
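
For anyone finding this later, the fix can be sketched roughly as below. The helper names `normalize_quoted` and `same_path` are made up for illustration and are not part of the Plex framework; the import fallback is only there so the snippet also runs outside Python 2.7.

```python
import unicodedata

try:
    from urllib import quote, unquote          # Python 2.7
except ImportError:
    from urllib.parse import quote, unquote    # Python 3

def normalize_quoted(path, form='NFKC'):
    """Hypothetical helper: return a percent-encoded path in a single normalization form."""
    raw = unquote(path)
    # Python 2 returns a byte string here, Python 3 already returns text
    if isinstance(raw, bytes):
        raw = raw.decode('utf8')
    composed = unicodedata.normalize(form, raw)
    return quote(composed.encode('utf8'))

def same_path(a, b):
    """Hypothetical helper: compare two percent-encoded paths after normalizing both."""
    return normalize_quoted(a) == normalize_quoted(b)

# The two values from this post: filesystem result and database entry
fs_path = '/share/MD0_DATA/FindMovies/Les%20Mis%C3%A9rables/Les%20Mis%C3%A9rables%20%281080p%20HD%29.m4v'
db_path = '/share/MD0_DATA/FindMovies/Les%20Mise%CC%81rables/Les%20Mise%CC%81rables%20%281080p%20HD%29.m4v'

print(same_path(fs_path, db_path))
```

The point is that both sides go through the same function before comparing, so it no longer matters which form either source happens to use.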

Sir, I salute you for the feedback and insight, as well as for sharing code.

Best Regards

Tommy
