Developing a method for automatic TV Episode name identification

As we all know, ripping TV shows from DVD/Blu-ray can be a real pain, especially when the show doesn't display an episode name as part of its intro. I've therefore been looking into ways of automating this process.

I've been looking into the Python dejavu library, which uses audio fingerprinting, and wondering if it might be a possible solution: fingerprint the primary audio track of each file and use that to identify the season and episode number, based on (for example) the first 5 minutes of each file.
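For what it's worth, the core of dejavu-style matching can be sketched with toy data. The string "hashes" and the episode database below are made up for illustration (real fingerprints are hashes of spectrogram peak pairs), and this is not dejavu's actual API; it just shows how hashes voting on a consistent (episode, time offset) pair identify a clip:

```python
from collections import Counter

# Toy fingerprint database: hash -> list of (episode, offset_in_seconds).
# Placeholder string hashes; real ones come from spectrogram peaks.
DB = {
    "h1": [("S01E01", 10)],
    "h2": [("S01E01", 12), ("S01E02", 40)],
    "h3": [("S01E01", 15)],
    "h4": [("S01E02", 5)],
}

def identify(sample):
    """sample: list of (hash, offset_in_sample) pairs.

    Each matching hash votes for an (episode, time delta) pair; a clip
    from an episode produces many votes with the SAME delta, so the
    top vote-getter is the match."""
    votes = Counter()
    for h, sample_offset in sample:
        for episode, db_offset in DB.get(h, []):
            votes[(episode, db_offset - sample_offset)] += 1
    if not votes:
        return None
    (episode, _delta), count = votes.most_common(1)[0]
    return episode, count

# A clip whose hashes all line up with S01E01 at a constant 10 s shift.
print(identify([("h1", 0), ("h2", 2), ("h3", 5)]))  # → ('S01E01', 3)
```

The nice property is that the delta voting makes the match robust to where in the episode your 5-minute sample starts.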

Current issues I see are:

  • Fingerprint size. It seems one needs quite a lot of storage to hold the fingerprints.
  • Sharing fingerprints. One needs some way of sharing the fingerprints easily, which is again difficult because of their size.

Currently I'm wondering if a possible solution might be that, instead of a single shared fingerprint database, one could download a database dump just for the show one is ripping. Identification would then go quicker, since there are fewer fingerprints to compare against, and the error rate should be much lower because you've already identified the show. It shouldn't be too much of a hassle for people to download one file per show either; certainly quicker than having to identify each episode manually.

Before I waste any more time on this route, I was wondering if anyone has experimented with this idea before? One could host the fingerprints as one file per TV show in a GitHub repo. The script would ask you to identify the show using (for instance) its id from themoviedb, and would then download the relevant fingerprints from the repo. For example, to run identification on a folder of MKVs ripped from The Big Bang Theory Blu-rays you would simply run:

python3 identify.py . -id 1418

where 1418 is the themoviedb id for the show.
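A rough sketch of what that identify.py entry point might look like. The repo name, URL layout (one fingerprint file per show, keyed by TMDB id), and file extension are all assumptions for illustration:

```python
import argparse

# Hypothetical GitHub raw-content URL; the repo and its layout are made up.
REPO_URL = "https://raw.githubusercontent.com/someuser/tv-fingerprints/main"

def fingerprint_url(tmdb_id):
    """Build the download URL for a show's fingerprint file."""
    return f"{REPO_URL}/{tmdb_id}.fp"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Identify ripped episodes by audio fingerprint")
    parser.add_argument("folder", help="folder of MKV files to identify")
    parser.add_argument("-id", dest="tmdb_id", type=int, required=True,
                        help="themoviedb id of the show")
    return parser.parse_args(argv)

# Equivalent of: python3 identify.py . -id 1418
args = parse_args([".", "-id", "1418"])
print(f"Would fetch {fingerprint_url(args.tmdb_id)} and scan {args.folder}")
```

The actual download could then be a single `urllib.request.urlretrieve` call, so the user never manages fingerprint files by hand.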

Any thoughts from fellow programmers?

I haven't ripped a Blu-ray, but programmatically I'm thinking of inspecting the metadata on the files and on the disc. If I can't consistently identify the episode from those, then I would move in the direction you've taken. My gut tells me I'd need three keyframes near minutes 5, 10, and 20; if those match, I've identified the episode.
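That keyframe comparison could use a perceptual average hash, which tolerates the small pixel differences compression introduces, unlike an exact byte compare. This toy version works on tiny grayscale pixel grids; decoding and downscaling real frames (e.g. with ffmpeg) is assumed away here:

```python
def average_hash(pixels):
    """Perceptual average hash of a small grayscale image given as a
    list of rows of 0-255 values: each bit is 1 if that pixel is
    brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Number of differing bits; small distance = likely the same frame."""
    return sum(x != y for x, y in zip(a, b))

# Two versions of the "same" 2x4 frame, the second with slight
# compression-like noise, plus a genuinely different frame.
frame_ref   = [[200, 200, 30, 30], [200, 200, 30, 30]]
frame_rip   = [[198, 203, 28, 33], [201, 197, 31, 29]]
frame_other = [[30, 200, 200, 30], [200, 30, 30, 200]]

print(hamming(average_hash(frame_ref), average_hash(frame_rip)))    # → 0
print(hamming(average_hash(frame_ref), average_hash(frame_other)))  # → 4
```

In practice you'd hash a downscaled 8x8 or 16x16 frame and accept matches below some small Hamming-distance threshold rather than requiring 0.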

Or it might be small and easy enough to sample 30 seconds of audio instead.

What have you learned so far?

I remember around 2001 a guy at Stanford mentioning to me that he was working on a project to identify, by fingerprint, the files being passed around the campus network to stop copyright violations. We got into a passionate argument, because from a science point of view it seemed ridiculous when the whole point of a computer was to store and reproduce exact copies of data, and the internet was going to contain the sum total of all human knowledge. But Napster was making a lot of waves on campus at the time.

So I think the software is out there or has been attempted, right?

As far as a no-storage-required solution goes, I'd probably consider using a headless browser to search subtitle text on Google, then use an algorithm to pick the titles out of the results. You could also download third-party subtitles for each episode and run a comparison against the Blu-ray subtitles to find matches; timing differences would be irrelevant, as you'd be comparing the text content. That's probably the most practical solution.

Otherwise, you'd be looking at really complex keyframe or fingerprint comparisons across different compression algorithms. If rips weren't timed perfectly, you'd end up extracting a lot of keyframes to find relative matches, then algorithmically deciding which color differences were the result of compression. You could also use machine learning to detect matches, but that seems like a lot of work compared to just matching text.
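The subtitle comparison could be as simple as stripping the SRT timing lines and comparing bags of words. A minimal sketch, where the SRT handling and the use of Jaccard similarity are my own simplifying assumptions:

```python
import re

def normalize(subtitle_text):
    """Drop SRT cue numbers and timestamp lines, lowercase the rest,
    and return the words -- so timing differences drop out entirely."""
    words = []
    for line in subtitle_text.splitlines():
        line = line.strip()
        # Skip cue indices and "00:00:05,000 --> 00:00:07,000" lines.
        if not line or line.isdigit() or "-->" in line:
            continue
        words += re.findall(r"[a-z']+", line.lower())
    return words

def similarity(a, b):
    """Jaccard similarity of the word sets of two subtitle files."""
    sa, sb = set(normalize(a)), set(normalize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Same dialogue, shifted a minute and with different punctuation.
rip = "1\n00:00:05,000 --> 00:00:07,000\nOur whole universe was in a hot dense state\n"
ref = "1\n00:01:05,000 --> 00:01:07,000\nour whole universe was in a hot, dense state\n"
print(similarity(rip, ref))  # → 1.0
```

The episode with the highest similarity against your ripped subtitles would be the match; even a fairly low threshold should work, since different episodes share little exact dialogue.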