Struggling here with a small plug-in, where I might encounter characters that do not conform to regular Unicode decoding in Python 2.7.
For example, a filename like Doppelgänger is reported as follows:
Plex Database: Doppelg%C3%A4nger
FileSystem: Doppelga%CC%88nger
So, researching shows that even using the normalize method doesn't work for all characters :(
But since Plex itself can handle this, I was wondering whether the function they use could be exposed in the framework, so that devs don't all have to maintain their own table?
Ah, the fun of Unicode. The issue you're having is caused by Unicode decomposition. I won't launch into a long explanation of it, but there is a good Wikipedia article on the subject here.
The Plex scanners always use the NFKC normalization algorithm which stores the filenames in their composed form. If you are comparing the database with the filesystem you will need to do the same with strings from the filesystem. Here is a quick demonstration:
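A minimal sketch of that comparison, using the two strings from the question. The helper names `to_unicode` and `compose` are made up for illustration, and the try/except import is only there so the snippet also runs outside Python 2.7:

```python
import unicodedata

try:
    from urllib import unquote        # Python 2.7
except ImportError:
    from urllib.parse import unquote  # Python 3


def to_unicode(quoted):
    """Undo the percent-encoding and return a unicode string (hypothetical helper)."""
    raw = unquote(quoted)
    if isinstance(raw, bytes):        # Python 2.7 returns a UTF-8 byte string here
        raw = raw.decode('utf-8')
    return raw


def compose(quoted):
    """Percent-decode, then normalize to NFKC (hypothetical helper)."""
    return unicodedata.normalize('NFKC', to_unicode(quoted))


db_name = 'Doppelg%C3%A4nger'    # as reported by the Plex database
fs_name = 'Doppelga%CC%88nger'   # as reported by the filesystem

# Without normalization the code points differ: U+00E4 versus 'a' + U+0308.
print(to_unicode(db_name) == to_unicode(fs_name))   # False

# After NFKC both names use the single composed code point.
print(compose(db_name) == compose(fs_name))         # True
```

The comparison only works if both sides are normalized with the same form, so it is safest to run the database string through `compose` as well rather than assume it is already composed.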
First of all... sorry about the delay in responding here, but somehow I wasn't notified of your reply as I normally am :(
Your code is a lot more advanced than what I used, so thanks for that; however, I'm still faced with the same problem.
The code to grab the file names is:
####################################################################################################
# This function will scan the filesystem for files
####################################################################################################
@route(PREFIX + '/listTree')
def listTree(top, files=list()):
    Log.Debug("******* Starting ListTree with a path of %s***********" %(top))
    r = files[:]
    try:
        if not os.path.exists(top):
            Log.Debug("The file share [%s] is not mounted" %(top))
            return r
        for f in os.listdir(top):
            pathname = os.path.join(top, f)
            Log.Debug("Found a file named : %s" %(pathname))
            if os.path.isdir(pathname):
                r = listTree(pathname, r)
            elif os.path.isfile(pathname):
                filename = urllib.unquote(pathname).decode('utf8')
                Log.Debug("Tommy 1 %s" %(filename))
                composed_filename = unicodedata.normalize('NFKC', filename)
                filename = urllib.quote(composed_filename.encode('utf8'))
                Log.Debug("Tommy 2 %s" %(filename))
                r.append(filename)
            else:
                Log.Debug("Skipping %s" %(pathname))
        return r
    except UnicodeDecodeError:
        Log.Critical("Detected an invalid character in the file/directory following this : %s" %(pathname))
        # Return what was collected so far, so a recursive caller never gets None
        return r
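For testing outside of Plex, the per-file step in the function above can be pulled into a small helper. This is only a sketch: `quoted_nfkc` is a hypothetical name, and it assumes the filesystem hands back UTF-8 bytes. It also skips the `urllib.unquote` call, since names coming from `os.listdir` are not percent-encoded and unquoting them would mangle a file name that contains a literal `%`.

```python
import unicodedata

try:
    from urllib import quote          # Python 2.7
except ImportError:
    from urllib.parse import quote    # Python 3


def quoted_nfkc(pathname, encoding='utf-8'):
    """Return the percent-encoded, NFKC-composed form of a path.

    Hypothetical helper: accepts either a byte string (assumed to be
    UTF-8, as os.listdir returns in Python 2.7 for a byte-string path)
    or a unicode string.
    """
    if isinstance(pathname, bytes):
        pathname = pathname.decode(encoding)
    composed = unicodedata.normalize('NFKC', pathname)
    # quote() leaves '/' unescaped by default, so path separators survive.
    return quote(composed.encode(encoding))


# Decomposed name as a filesystem might store it: 'a' followed by U+0308
print(quoted_nfkc(b'/media/Doppelga\xcc\x88nger.mkv'))
# /media/Doppelg%C3%A4nger.mkv
```

If the result still differs from the database value, logging `repr()` of both strings before quoting usually shows which code points are left over.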