A while ago I wrote a new series scanner (I posted it here before this subforum opened) and I’d like some feedback from people if possible.
I was rather frightened by the default scanners: they are quite densely coded and the regexes make them hard to read. In addition, they are marked as all rights reserved, so I didn’t want to step on anyone’s toes by hacking them around.
I’ve made the basis of a new series scanner, with the following properties:
Lightning fast
Can scan dump directories
Makes mistakes with titles much less often
Pickier about naming formats, but handles everything I have
I’d be interested in hearing from people whether they like this scanner, and if so, what mistakes it makes on their files. I’m also working on a scanner for sport (as the Ashes is on ;)), so if anyone’s interested in collaborating on a metadata provider for that, let me know.
The main motivation wasn’t the legal issue (I was about 95% sure you’d say ‘do what you like’ anyway). It was more that the hacks I’d seen for making the scanners handle dump directories were very slow for me: my unsorted-in directory has lots and lots of stuff in it, and the default scanner with a mod to allow top-level episodes was taking about 20 minutes to scan it. This one does it in a few seconds.
The main speedup is the SystemFilter decorator, which throws out subdirs based on a system call to find. This lets me find all directories that contain something that looks like a media file, and throw away the subtrees that don’t. That’s much faster than iterating through them.
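To illustrate the idea, here is a minimal sketch of how such a decorator might work. It is not the actual SystemFilter source: the names system_filter, dirs_with_media and VIDEO_EXTS are made up for the example, and it assumes a scanner called as scan(path, files, media_list, subdirs) with absolute paths, pruning subdirs in place.

```python
import functools
import os
import subprocess

# Assumed extension list; the real scanner would use VideoFiles.video_exts.
VIDEO_EXTS = ("avi", "mkv", "mp4", "m4v")


def dirs_with_media(root, exts=VIDEO_EXTS):
    """Return the directories under root that directly contain a media file."""
    # Builds: find ROOT -type f ( -iname *.avi -o -iname *.mkv ... ) -print0
    tests = []
    for ext in exts:
        tests += ["-o", "-iname", "*." + ext]
    cmd = ["find", root, "-type", "f", "("] + tests[1:] + [")", "-print0"]
    out = subprocess.check_output(cmd).decode("utf-8", "replace")
    return {os.path.normpath(os.path.dirname(p)) for p in out.split("\0") if p}


def system_filter(scan):
    """Hypothetical decorator: drop subdirs with no media anywhere below them."""
    @functools.wraps(scan)
    def wrapper(path, files, media_list, subdirs):
        keep = dirs_with_media(path)

        def has_media(subdir):
            prefix = os.path.normpath(subdir)
            return any(d == prefix or d.startswith(prefix + os.sep) for d in keep)

        # Prune in place, since the caller relies on the list being mutated.
        subdirs[:] = [d for d in subdirs if has_media(d)]
        return scan(path, files, media_list, subdirs)
    return wrapper
```

A scanner wrapped with @system_filter would then only ever see the subdirs worth descending into. Passing the command as an argument list avoids any shell quoting of the path.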
Writing this did give me some thoughts on the API for scanners though, especially in comparison to WSGI, which has a similar pipeline process.
It would be nice to just be able to return an iterable (i.e. also support generators) rather than relying on the fact that lists are mutable
I’d like to be given control of the process earlier than currently, i.e. just called with a path, without files and subdirs
The current behaviour would look something like this under that model:
Pseudo-python, obviously, but it shows how the API design goes to some lengths to provide all the information to the scanners instead of trusting us to go find it ourselves. The big win over the real design is that things like VideoFiles.Scan can be implemented with a simple filter callable rather than in-place manipulation of the lists.
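For anyone who wants something concrete to poke at, here is a rough, runnable sketch of the generator-style model suggested above. It is not the real scanner API: walk, video_files, scan and VIDEO_EXTS are all hypothetical names, and the extension set is an assumption.

```python
import os

# Assumed extension set; stands in for VideoFiles.video_exts.
VIDEO_EXTS = {".avi", ".mkv", ".mp4"}


def walk(path):
    """Source stage: lazily yield (dirpath, filenames) for path and its subtree."""
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames.sort()  # deterministic traversal order
        yield dirpath, sorted(filenames)


def video_files(entries):
    """Filter stage: a plain callable over an iterable, no in-place list edits."""
    for dirpath, files in entries:
        videos = [f for f in files
                  if os.path.splitext(f)[1].lower() in VIDEO_EXTS]
        if videos:
            yield dirpath, videos


def scan(path):
    """Hypothetical entry point: called with just a path, returns an iterable."""
    for dirpath, files in video_files(walk(path)):
        for name in files:
            yield os.path.join(dirpath, name)
```

Each stage wraps the previous one, much like WSGI middleware, so a filter can be swapped in or out without touching the others.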
Not very relevant, I know, as you've already settled on a format, but thought I'd give you my feedback from using it as an outsider to the devteam.
I made a small update to the scanner; it’s gone from 12+ min to 2-3 min to scan my library. The main problem with the old scanner was that it did a "find | grep" for each path and subdirectory. With my alteration it just searches the current path for files, not its subdirectories.
extensions = "|".join(".*"+ext+"$" for ext in VideoFiles.video_exts)
command = """find \"%s\" -maxdepth 1 -type f -iregex \"%s\"""" % (os.path.split(base)[0], extensions)
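For comparison, the same one-level search can be done without shelling out at all. This is a different technique from the find call above, sketched in plain Python; video_files_in and VIDEO_EXTS are hypothetical names, and the extension list is an assumption standing in for VideoFiles.video_exts.

```python
import os

# Assumed extensions (without the dot); stands in for VideoFiles.video_exts.
VIDEO_EXTS = ("avi", "mkv", "mp4")


def video_files_in(path, exts=VIDEO_EXTS):
    """List video files directly inside path, ignoring its subdirectories."""
    suffixes = tuple("." + ext.lower() for ext in exts)
    found = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        # Case-insensitive match on the extension, regular files only.
        if os.path.isfile(full) and name.lower().endswith(suffixes):
            found.append(full)
    return found
```

This sidesteps both the shell quoting and any differences in regex dialect between find implementations, at the cost of doing the listing in Python.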