Content-based file matching (Fingerprints)

There has been a lively exchange on this topic in the German section of the forum, along the lines of „what does that mean“ and „how can this work“.

Here is a summary in FAQ format:

Where is the problem anyway?
The current process uses filenames to identify the title for film content. That is wrong from the start. A filename is metadata like much else. The process should look at content.
The current process is: content => filename => (one) title => id.
A better process cuts the middle step: content => id

Yeah, but where is the problem for me?
Symptoms that result from the wrong start are:

  • Severe matching problems if file names do not follow the naming guide.
    Naming and organizing your Movie files | Plex Support
  • Still matching problems even if file names do follow the naming guide.
  • No good answer when films have multiple titles. The relation content:filename:title:id is 1:1:n:1.
  • Movies and TV-series must be separated. So it is difficult to keep related movies and series in one folder - like Star Trek, Star Wars, Disney classics, … - and difficult for themed folders - like „music“ or „children“.
  • It’s every user for him-/herself to fix this, rather than community based, where one user can fix it for everybody else. So millions of users world-wide are doing the same (stupid) work over and over again. Ouch!

But … it works like it is!
No, it does not (see above). At best it works ok for those who have squeezed into the corset of the naming rules and have learned to live with it.
It works the way horses worked before cars came around. The process was conceived when content-based matching was not possible. Since then it has served ok, but it is time to move on.

Well, go tell Plex
Exactly what this post is for. :slightly_smiling_face: But then, in a hidden corner, Plex already knows!

  • Plex locally uses fingerprints (or rather what Plex calls „fingerprints“) to recognize renamed files. So Plex already uses content for video identification, but only in a very limited way.
  • Plex uses fingerprints to recognize music. The way it should be. :clap:

„fingerprints“ to „identify“ „content“ - clarification (or trying to :grimacing: )

  • content : Literally content, the stuff that users are interested in. Many people recognize Michael Jackson performing „Beat it“ when they hear it, irrespective of whether it is played from tape or CD or streamed, in good or bad quality, as the radio cut or the LP edition.
  • content variations : Content can vary artistically and technically. Video content can vary by cut, re-master, quality (resolution, color depth), container, encoding, audio tracks, subtitles, and embedded noise / errors.
    A re-make is more than content variation. Again the music analogy: A cover version of „Beat it“ might still be the song, but it is not Michael Jackson performing. So for many people that is different content.
  • file : One specific expression of content. Files can differ significantly due to the above variations and still hold the same content. (Content does not need to come as a file, e. g. it can be an analog recording on tape, but that is out of scope for Plex.)
  • filename : It’s just a name. Really. Of course it is often useful when names relate to content, but that is not guaranteed. Usually users „own“ filenames. Unless filenames are fully shielded from the user, software should not depend on them and should only interact with filenames at the user’s request.
  • title : It’s just a name. Really. Video content can have more than one title. In fact, movies and series usually do, when they are made available in different countries. But even in the same country the title is sometimes changed over time.
  • metadata : Data about data. Metadata can be implicit in the nature of a video file (e. g. resolution) and explicit (e. g. if a tag additionally says „720p“). Titles are metadata for content.
  • identify : The process of technically recognizing content, similar in effect to how humans recognize content, and then putting an id (an identifier) on it. That identifier is typically a unique key. These identifiers are literally key to bringing all the wonderful goodies together … what Plex does so well: Posters, actors, ratings, …
  • fingerprinting : A technical method to identify content. The challenge: Even if the input varies a lot, the fingerprint should still identify the content correctly. Fingerprinting must be robust.
    Almost magic, but it actually works: Shazam for music and biometric identification for smartphones and laptops are common examples where „fuzzy“ signals must be analyzed to identify a track or a person.
    Plex uses fingerprinting for music identification.
  • hashing : A technical method to create a unique identifier from a file. The challenge: Just changing a single bit should also change the hash. So in that respect the opposite of a fingerprint.
    Plex uses hashing (but calls it „fingerprinting“) for local identification of video files.
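
To make the difference between hashing and fingerprinting tangible, here is a minimal sketch. It is not how Plex, Shazam or any real product works. The function names `file_hash` and `toy_fingerprint` are made up for this post, and the "fingerprint" is a deliberately crude toy: it only records whether the loudness rises or falls from one segment to the next. Real fingerprinting is far more sophisticated and survives far more than a volume change.

```python
import hashlib
import math
import struct

def file_hash(samples):
    """Exact hash over the raw sample bytes: any change gives a new digest."""
    raw = struct.pack(f"{len(samples)}d", *samples)
    return hashlib.sha256(raw).hexdigest()

def toy_fingerprint(samples, bands=16):
    """Toy fingerprint (assumption, for illustration only):
    one bit per pair of adjacent segments, set when the energy rises."""
    size = len(samples) // (bands + 1)
    energy = [sum(s * s for s in samples[i * size:(i + 1) * size])
              for i in range(bands + 1)]
    bits = 0
    for i in range(bands):
        bits = (bits << 1) | int(energy[i + 1] > energy[i])
    return bits

# A 440 Hz tone with slowly varying loudness stands in for real audio.
original = [(0.2 + abs(math.sin(n / 500))) * math.sin(2 * math.pi * 440 * n / 8000)
            for n in range(6800)]
# The "same content" at half the volume, i.e. a different file.
quieter = [0.5 * s for s in original]

print(file_hash(original) == file_hash(quieter))              # False: the bytes differ
print(toy_fingerprint(original) == toy_fingerprint(quieter))  # True: the loudness pattern is unchanged
```

The hash treats the quieter copy as a completely different thing, while the toy fingerprint still sees the same content. That is the property content matching needs.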

Phew!
Absolutely. :sweat_smile: But those clarifications were the hardest part.

How do you fingerprint video?
Well, maybe you don’t. It might be preferable to fingerprint the audio.

  • The technology is already there.
  • The language information is also recognized.
  • It probably uses fewer resources (CPU, storage).

Multi-language files will kill fingerprinting by audio
Quite the opposite. Multi-language files will help to fill the database of fingerprints with data.
If just one audio-track has a fingerprint for which the content-id is known, then that content-id can be mapped to the other audio-tracks in the database.
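
A rough sketch of that mapping, under stated assumptions: the database is reduced to a plain dictionary from fingerprint to content-id, fingerprints are simple strings, and `learn_from_file` is a hypothetical helper name. Nothing here reflects an actual Plex data model.

```python
def learn_from_file(known, track_fingerprints):
    """known: dict mapping fingerprint -> content-id (stand-in for the database).
    track_fingerprints: fingerprints of all audio tracks in one file.
    If exactly one content-id is known among the tracks, map it to the rest."""
    ids = {known[fp] for fp in track_fingerprints if fp in known}
    if len(ids) != 1:
        return None  # nothing known, or conflicting ids: leave it for QA
    content_id = ids.pop()
    for fp in track_fingerprints:
        known.setdefault(fp, content_id)  # never overwrite an existing entry
    return content_id

# Only the English track is known at first.
known = {"fp-en": "movie-123"}

# A multi-language file teaches the database the German and French tracks.
print(learn_from_file(known, ["fp-en", "fp-de", "fp-fr"]))  # movie-123

# A German-only file can now be identified as well.
print(learn_from_file(known, ["fp-de"]))                    # movie-123
```

So every multi-language file acts as a bridge between language versions instead of being a problem.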

Silent movies will kill fingerprinting by audio
Silent movies are not silent but have audio, too. They will work just fine.

But how could Plex fill a central database with fingerprints and video ids?
There are already millions of de-central databases in the homes of Plex users. The movie files could be fingerprinted and those fingerprints sent to the central database along with the existing id (which was determined old-school by filename or manual match).

Hold it … why should users let Plex do that?
Good angle: Plex should definitely ask users for permission. The user response is likely positive: This is how Gracenote and Wikipedia work.
If desired, Plex could add further incentives. But that’s probably not needed, because community efforts work and the benefits will be quickly visible, once Plex hands back suggestions to users where their local database might be off or could be otherwise improved.
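
The fill process and the permission step described above could fit together roughly like this. This is only a sketch: the record layout, the field names and the function `build_contributions` are assumptions for illustration, not an existing Plex API, and the actual upload is left out entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contribution:
    fingerprint: str   # audio fingerprint of one track
    content_id: str    # id the local library already holds
    match_source: str  # "filename" or "manual" (how the id was found)

def build_contributions(library, user_consented):
    """Turn a local library into records for the central database.
    Without the user's permission nothing is prepared at all."""
    if not user_consented:
        return []
    records = []
    for item in library:
        if item.get("content_id") is None:
            continue  # unmatched items have nothing to contribute
        for fp in item["track_fingerprints"]:
            records.append(Contribution(fp, item["content_id"], item["match_source"]))
    return records

# Hypothetical local library: one matched movie, one unmatched file.
library = [
    {"content_id": "movie-123", "match_source": "filename",
     "track_fingerprints": ["fp-en", "fp-de"]},
    {"content_id": None, "match_source": None,
     "track_fingerprints": ["fp-unknown"]},
]

print(len(build_contributions(library, user_consented=True)))  # 2
print(build_contributions(library, user_consented=False))      # []
```

Keeping the match source in each record would let the central side weigh a manual match differently from a filename match, if that turns out to be useful for QA.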

What is needed to make this fly?
Thankfully, key building blocks already exist – in production-quality for music, as proof-of-concept for video:

  • A fingerprinting mechanism for audio in video files
    (Already exists in the music section, must be adapted to audio tracks in video.)
  • A central database to hold fingerprints and ids
    (Already exists in the music section, must be adapted to audio tracks in video.)
  • A process to initially fill the central database
    (See the suggestion above.)
  • A process for QA (by community or in software by Plex)
  • A decision by Plex to actually do it. :slightly_smiling_face: