Content-based file matching (Fingerprints)

Magnolia · January 3, 2021, 12:39pm

As of now the step from a video file to an identified title is 100% based on the file name.

The process is finicky and fragile – it causes problems and lots of effort for users, as shown in the forum, where file mismatching is a recurring topic.
The current process also forces users into a corset when it comes to naming and organizing files.

The suggestion is to make the process for video identification content-based via fingerprinting.
Plex already uses fingerprints for music. Shazam and iTunes Match are prominent examples of fingerprinting in action: They do not need hints in the filename to identify a piece of music.

Furthermore, the backend should be community-based.
Manual matches that users make on their own files create new relations between a fingerprint and the respective title. Other users then benefit if they have files with a similar fingerprint. They can accept the match or, of course, reject it.
Gracenote for CDs is a good example of this process in action. Most of my CDs are recognized automatically. Some exotic CDs are not. I enter the data, send it back to Gracenote, and other users benefit in the future.

Some building blocks are already available in Plex.
This feature request is about putting them together.

Magnolia · January 4, 2021, 2:14pm

There has been a lively exchange on the topic in the German section of the forum. Along the lines of „what does that mean“ and „how can this work“.

Here is a summary in FAQ format:

Where is the problem anyway?
The current process uses filenames to identify the title for film content. That is wrong from the start. A filename is metadata like much else. The process should look at content.
The current process is: content => filename => (one) title => id.
A better process cuts the middle step: content => id

Yeah, but where is the problem for me?
Symptoms that result from the wrong start are:

Severe matching problems if file names do not follow the naming guide.
Naming and organizing your Movie files | Plex Support
Still matching problems even if file names do follow the naming guide.
No good answer when films have multiple titles. The relation content:filename:title:id is 1:1:n:1.
Movies and TV-series must be separated. So it is difficult to keep related movies and series in one folder - like Star Trek, Star Wars, Disney classics, … - and difficult for themed folders - like „music“ or „children“.
It’s every user for him-/herself to fix this, rather than community based, where one user can fix it for everybody else. So millions of users world-wide are doing the same (stupid) work over and over again. Ouch!

But … it works like it is!
No, it does not: see above. At best it works ok for those who have accepted the corset and learned to live with it.
It works the way horses worked before cars came around. The process was conceived when content-based matching was not possible. Since then it has served ok, but it is time to move on.

Well, go tell Plex
Exactly what this post is for. But then, in a hidden corner, Plex already knows!

Plex locally uses fingerprints (or rather what Plex calls „fingerprints“) to recognize renamed files. So Plex already uses content for video identification, but only in a very limited way.
Plex uses fingerprints to recognize music. The way it should be.

„fingerprints“ to „identify“ „content“ - clarification (or trying to)

content : Literally content, the stuff that users are interested in. Like many people recognize Michael Jackson performing „Beat it“ when they hear it. Irrespective of whether it is played from tape or CD or streamed, in good or bad quality, the radio cut or the LP edition.
content variations : Content can vary artistically and technically. Video content can vary through different cuts, re-masters, qualities (resolution, color depth), containers, encodings, audio tracks, subtitles, and embedded noise / errors.
A re-make is more than content variation. Again the music analogy: A cover version of „Beat it“ might still be the song, but it is not Michael Jackson performing. So for many people that is different content.
file : One specific expression of content. Files can differ significantly due to the above variations and still hold the same content. (Content does not need to come as a file, e. g. it can be an analog recording on tape, but that is out of scope for Plex.)
filename : It’s just a name. Really. Of course it is often useful when names relate to content, but that is not guaranteed. Usually users „own“ filenames. Unless filenames are fully shielded from the user, software should not depend on them and should only touch filenames at the user’s request.
title : It’s just a name. Really. Video content can have more than one title. In fact, movies and series usually do, when they are made available in different countries. But even in the same country the title is sometimes changed over time.
metadata : Data about data. Metadata can be implicit in the nature of a video file (e. g. resolution) and explicit (e. g. if a tag additionally says „720p“). Titles are metadata for content.
identify : The process of technically recognizing content similar in effect to how humans recognize content. Then put an id, an identifier on it. That identifier is typically a unique key. These identifiers are literally key to bring all the wonderful goodies together … what Plex does so well: Posters, actors, ratings, …
fingerprinting : A technical method to identify content. The challenge: Even if the input varies a lot, the fingerprint should still identify the content correctly. Fingerprinting must be robust.
Almost magic, but it actually works: Shazam for music and biometric identification for smartphones and laptops are common examples where „fuzzy“ signals must be analyzed to identify a track or a person.
Plex uses fingerprinting for music identification.
hashing : A technical method to create a unique identifier from a file. The challenge: Just changing a single bit should also change the hash. So in that respect the opposite of a fingerprint.
Plex uses hashing (but calls it „fingerprinting“) for local identification of video files.
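
To make the hashing side of this distinction concrete, here is a minimal Python sketch. It uses SHA-256 from the standard library as a stand-in for "a hash" (which hash Plex actually uses is not stated here), and the byte string is a made-up stand-in for file content. Flipping a single bit produces a completely unrelated hash, which is exactly why such a hash cannot recognize re-encoded or otherwise varied copies of the same content.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_lowest_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with one bit flipped in the byte at index."""
    changed = bytearray(data)
    changed[index] ^= 0x01
    return bytes(changed)


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many bits differ between two equally long hex digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Stand-in for the bytes of a media file (hypothetical data).
original = bytes(range(256)) * 64
modified = flip_lowest_bit(original, 1000)

h1 = sha256_hex(original)
h2 = sha256_hex(modified)
print(h1)
print(h2)
print(differing_bits(h1, h2), "of 256 hash bits differ")
```

A fingerprint is supposed to behave the other way round: small technical changes to the input should leave it (nearly) unchanged.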

Phew!
Absolutely. But those clarifications were the hardest part.

How do you fingerprint video?
Well, maybe you don’t. It might be preferable to fingerprint the audio.

The technology is already there.
The language information is also recognized.
Probably uses fewer resources (CPU, storage).

Multi-language files will kill fingerprinting by audio
Quite the opposite. Multi-language files will help to fill the database of fingerprints with data.
If just one audio-track has a fingerprint for which the content-id is known, then that content-id can be mapped to the other audio-tracks in the database.
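
A minimal sketch of that propagation, under loud assumptions: the central database is reduced to a plain dictionary, fingerprints are opaque strings, and lookup is by exact match (a real fingerprint lookup would be a similarity search). The names `identify_and_learn`, `fp-english` and `movie-42` are hypothetical.

```python
from typing import Optional


def identify_and_learn(track_fingerprints: list[str],
                       table: dict[str, str]) -> Optional[str]:
    """Identify a file by its audio tracks and teach the table the unknown tracks.

    Returns the content id if the known tracks agree on exactly one id,
    otherwise None (nothing known, or conflicting ids that need a human).
    """
    ids = {table[fp] for fp in track_fingerprints if fp in table}
    if len(ids) != 1:
        return None
    content_id = ids.pop()
    # Map every track of this file that was unknown so far to the same content id.
    for fp in track_fingerprints:
        table.setdefault(fp, content_id)
    return content_id


# Hypothetical starting state: only the English track is known.
fingerprint_table = {"fp-english": "movie-42"}

# A multi-language file teaches the table the German track ...
print(identify_and_learn(["fp-english", "fp-german"], fingerprint_table))
# ... so a later German/French file is recognized and adds the French track.
print(identify_and_learn(["fp-german", "fp-french"], fingerprint_table))
```

Returning None on conflicting ids is a design choice for this sketch: it keeps a single bad mapping from silently spreading to further tracks.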

Silent movies will kill fingerprinting by audio
Silent movies are not actually silent: their files have an audio track too (usually a musical score). They will work just fine.

But how could Plex fill a central database with fingerprints and video ids?
There are already millions of de-central databases in the homes of Plex users. The movie files could be fingerprinted and those fingerprints sent to the central database along with the existing id (which was determined old-school by filename or manual match).

Hold it … why should users let Plex do that?
Good angle: Plex should definitely ask users for permission. The user response is likely positive: This is how Gracenote and Wikipedia work.
If desired, Plex could add further incentives. But that’s probably not needed, because community efforts work and the benefits will be quickly visible, once Plex hands back suggestions to users where their local database might be off or could be otherwise improved.

What is needed to make this fly?
Thankfully, key building blocks already exist – in production-quality for music, as proof-of-concept for video:

A fingerprinting mechanism for audio in video files
(Already exists in the music section, must be adapted to audio tracks in video.)
A central database to hold fingerprints and ids
(Already exists in the music section, must be adapted to audio tracks in video.)
A process to initially fill the central database
(Suggestion see above)
A process for QA (by community or in software by Plex)
A decision by Plex to actually do it.

Magnolia · January 10, 2021, 11:40am

One example of a forum thread that has dragged on from 2015 to 2019 and mostly deals with matching / identification.

A lot of effort went into clarifying the issue, finding solutions, applying solutions, testing – effort on the user side and for contributors in the forum.

Plex incorrectly identifying TV episodes

The user experience, the effort for the helpful forum members and the result would be so much better with matching by fingerprinting.

Magnolia · February 4, 2021, 4:36pm

There is already a solid codebase to start from, the AcoustID project.

Key components are

a library that does the actual fingerprinting: chromaprint.
a server to hold the fingerprints and additional data: acoustid-server

The code is production ready. It is used by multiple applications; the most notable is probably MusicBrainz Picard, a client backed by the MusicBrainz database that does content-based file matching for music, so the music equivalent of what should happen for video content.

The code should in principle work for fingerprinting films as well by using the audio-tracks. Parameters should be adapted, however.
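
A minimal sketch of how the Chromaprint side could be driven from Python. It assumes the `fpcalc` command-line tool that ships with Chromaprint is installed and that its FFmpeg build can decode the audio in the given video container. The function names, the 120-second analysis length and leaving the choice of audio stream to fpcalc are illustrative assumptions, not tuned parameters and not anything Plex is known to do.

```python
import json
import subprocess


def build_fpcalc_command(media_path: str, length_seconds: int = 120) -> list[str]:
    """Build the fpcalc call: JSON output, analysing the first length_seconds of audio."""
    return ["fpcalc", "-json", "-length", str(length_seconds), media_path]


def parse_fpcalc_json(output: str) -> tuple[float, str]:
    """Extract (duration in seconds, fingerprint string) from fpcalc's JSON output."""
    data = json.loads(output)
    return float(data["duration"]), data["fingerprint"]


def fingerprint_file(media_path: str, length_seconds: int = 120) -> tuple[float, str]:
    """Run fpcalc on a media file; raises CalledProcessError if fpcalc fails."""
    result = subprocess.run(
        build_fpcalc_command(media_path, length_seconds),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_fpcalc_json(result.stdout)


if __name__ == "__main__":
    # Hypothetical path; replace with a real file to try it out.
    print(build_fpcalc_command("/media/Title (2023)/s01_disc01_title01.mkv"))
```

Which part of a film to analyse, and for how long, is exactly the kind of parameter that would need adapting for films as opposed to music tracks.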

It is all OSS in some way, but the license situation is mixed (part GPL 2.0, part MIT, part LGPL), due to other code being used like FFmpeg.

forresthopkinsa · February 6, 2021, 6:10am

Although this is long overdue, I think it would be best achieved by a third-party solution and then integrated into Plex, not least because I’m not overwhelmingly confident in the company’s dedication to open-source and community efforts.

Magnolia · February 18, 2021, 2:58pm

Indeed, this could be stand-alone software.
It could be delivered by Plex or by another party.
It just makes sense for Plex to dig into this imho and run this as a community effort.
Sense = … since Plex is a business after all (no offense).

Viability. Plex have in-house resources and a supportive community. Plex can mobilize the existing user base to crack the chicken-egg-problem of getting fingerprints into the database to make this work. Being in the lead helps Plex to steer the efforts so that Plex benefits the most.
Maintain and grow user base. What sets Plex apart from “the Netflix way” (and I like Netflix) is that Plex lets me integrate my own video content. But data management is a major barrier for entry to the Plex world and an ongoing cost. Reducing this cost and barrier will nurture the user base.
License. Even with OSS software and community efforts, there is money to be made. The suggested solution plays in a different league compared to everything else. I imagine a world where the default advice in user forums would be “Buy Plex Match and be happy”.

This is the way.

Beaster99 · March 1, 2021, 8:03am

+1, agree with the above; this is long overdue.

Magnolia · June 1, 2021, 4:44pm

One more example of a lengthy post which deals with matching issues:

The root cause again: The matching method relies on filenames. This can (and does!) break on many levels for many users – unless users apply rigid discipline and restrict themselves in how they name and organize their files.

If fingerprints (aka perceptual hashes) were used for matching, the user experience would be so much better in many ways.

meh123 · June 14, 2021, 7:55pm

This would be worth every single drop of blood, sweat and tears to implement, and then some. Even if the devs turn into total misanthropes by the end of it. They just might. (From memory) Gracenote has been around since the early 2000s and Shazam since not long after. Both have been rock solid since day 1.

Data must be suitably anonymized, being a statistical aggregation of fingerprint hashes. Plex will need to be very open and clear about this, perhaps with an example submission.
+1000, when this works it will be magical

Magnolia · August 5, 2021, 4:01pm

Thanks for the support, @meh123 .

Agree. Sooo worth it.

I have been in the process of reorganizing the collection for months now. (Admittedly, it is a large collection.) And every so often I frown at how many disadvantages the current method has.

Like here in the forum:
Plex incorrectly identifying TV episodes
Incorrect match keeps re-occurring no matter how many times I fix it
TV-Serien Fehler bei Episoden-Benennung
… and this list could go on and on.

The standard advice is always to stick exactly to the naming schemes. Even though this is tedious, error-prone and inflexible, and even though it keeps new users from using and liking Plex, I find little discussion about fixing the root cause … so far.

I hope that over time more users and decision makers at Plex will come to realize how deficient the current method is and that a much better way really is possible.

Magnolia · August 26, 2021, 2:12pm

And the list of forum threads relating to naming and organizing goes on.

Here is one started by @medicineman2500 that fits especially well into the context of this feature request:

meh123 · September 27, 2021, 6:19am

@Magnolia You’re welcome

Getting everything just nice and identified correctly is a lot of work. For some cases I simply have no solution, like
s01e01a
s01e01b
s01e02a
s01e02b

Movies can sort of handle this: it will play seamlessly, though it creates headaches with subtitles.
Some things never will be able to be sorted out properly. Like Looney Tunes. There are thousands, some restored, some not. Lots are hiding only in Disney vaults.

The new TV agent has had a lot of love and fixed a lot of things.

This is another feature request along the same lines, about fixing file structure. I fear these kinds of niggles need to be sorted out before we implement fuzzy hashing, or we are going to contaminate the database with bad matches. So it is more a case of: fix everything, then do a deep analysis, then push it upstream to Plex. That could make things a bit more fragile, though, because fewer users would actively go and push. Where it will really shine is on everything that isn’t matched during the initial library population or is a duplicate. By default the analysis could be throttled, or the deeper analysis run only for duplicates and unmatched files.

One can kind of imagine the hardware requirements for this at a bar-napkin level.
Say we used a 64-char hex hash. Saved on the hard drive as a raw text file, that would be 129 bytes.
10 samples per file: 1290 bytes
OK, it’s hard to guess how many different media files there are, so let’s say 1 billion: 1290 bytes × 10⁹ = 1.29×10¹² bytes.
1.29×10¹² ÷ 1024 = 1259765625 KiB
1259765625 ÷ 1024 = 1230239.868164062 MiB
1230239.868164062 ÷ 1024 = 1201.406121254 GiB
1201.406121254 ÷ 1024 = 1.1732 TiB

So a terabyte and change; factoring in file system and database overhead, call it 2 TiB.
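
The same estimate as a small Python sketch, so the inputs can be varied. The three constants are the figures from this post (129 bytes per stored hash, 10 samples per file, 1 billion files); they are guesses, not measurements, and the function name is made up for illustration.

```python
BYTES_PER_HASH = 129           # figure used above for one 64-char hex hash stored as text
SAMPLES_PER_FILE = 10          # assumed number of fingerprint samples per media file
MEDIA_FILES = 1_000_000_000    # assumed number of distinct media files


def storage_estimate(bytes_per_hash: int, samples_per_file: int, files: int) -> dict[str, float]:
    """Return the raw storage need in bytes and in binary units (KiB .. TiB)."""
    total_bytes = bytes_per_hash * samples_per_file * files
    sizes = {"bytes": float(total_bytes)}
    value = float(total_bytes)
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        value /= 1024
        sizes[unit] = value
    return sizes


estimate = storage_estimate(BYTES_PER_HASH, SAMPLES_PER_FILE, MEDIA_FILES)
for unit, value in estimate.items():
    print(f"{value:,.4f} {unit}")
```

This covers only the raw hashes; file system and database overhead come on top, as noted above.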

Bolt that onto a bunch of SSDs with 320 GB of RAM and dual 64-core CPUs, and you could probably pull 40 Gib/s to the wire. I highly doubt there will be 1 billion video samples out there that will be fed into Plex, so it’s entirely possible.

I’m sure there will be lots of hosts out there that would be very happy to host Plex, because it’s a core part of their infrastructure. You probably want one of the big Netherlands server farms, for privacy laws, price, and raw bandwidth cost. Plex might be able to pick up something at cost or very cheaply because of what it does for their hosting.

Magnolia · April 24, 2022, 10:52am

As I keep sorting through an extensive collection of movies and series, I cannot help but constantly stumble over the craziness of organizing by name.

Latest addition: Various films around Wallander.

There is a series of movies, which are “movies”.
And then there are series with movie-length episodes, which are “series”.

The result in the file system is confusing and arbitrary.
And the path to get there is tedious and error-prone.

Again:

The recognition of films should be based on content not filenames.
The method to do content-based recognition is the use of “fingerprints” or “perceptual hashes”.
The technology behind it is not magic (although it feels like that).
Many people have already used it for music via Shazam. Some might be familiar with MusicBrainz Picard and AcoustID.
A fingerprint gets the essence of the content like we humans do, so it is robust against variations like resolutions, logos, artefacts and glitches.
Plex has the advantage of a strong user base. I am sure that many (like myself) would support an effort by running fingerprint extractions on their collections to be used.
File naming and organisation can be changed automatically if the user so desires (e. g. iTunes offers that feature).

Volts · April 24, 2022, 11:02am

There are a billion patents that cover all aspects of this concept.

Here’s a fun one, partially because it’s got IBM’s high-gloss patent sheen, and partly because it has an awesome list of references at the bottom.

US Patent for Detecting usage of copyrighted video content using object recognition Patent (Patent # 11,301,714 issued April 12, 2022) - Justia Patents Search

Magnolia · May 7, 2022, 12:40pm

That is an interesting patent.
As patents often do: It tries to be as broad as possible.
As patents should be: Ultimately it describes one specific technique. You’re fine, if you don’t use this specific technique.

Hence it is important to strip off the hyperbole. There are far fewer than a billion patents out there.

Specifically: Recognition by audio tracks as suggested has advantages (outlined above) and is available from different sources, both commercial and FOSS. A prominent example is AcoustID, used by MusicBrainz Picard.

Magnolia · February 1, 2023, 7:32am

AI is achieving some seemingly miraculous feats these days – some tricks relate to “generating” stuff, others to “recognizing” stuff as is the case here.

Still, from what I understand about the technology behind it, training AI is a massive effort – especially without dedicated hardware, but even then. Even putting AI to work requires lots of resources.

With respect to Plex crowdsourcing the effort of building content-based recognition, I still think it is more feasible to primarily use fingerprints aka perceptual hashes on the audio tracks.

I can see myself donating resources to the effort by letting a fingerprinter run over my collection of films and series. I can’t see my lowly ARM-based NAS doing that for AI training, even if it is part of a large crowd of Plex supporters doing the same.

Short story:
I thought the use of AI is noteworthy for smarter content recognition.
But the way to go still seems to be traditional (by now, anyway) fingerprinting.

mgutt · April 13, 2023, 11:13am

I would really like to see a wizard in the Plex server settings that lets me run an “advanced” fingerprint-based scan of my TV show episodes. For example, I rip a new TV show and it creates files as follows:
/Title (2023)/s01_disc01_title01.mkv
/Title (2023)/s01_disc01_title02.mkv
…
/Title (2023)/s04_disc05_title22.mkv

Now I would like to use the wizard to name all these episodes correctly (S01E01 title.mkv). If the wizard’s database does not know which file is which episode, the user is asked to select it manually, and once, for example, three different Plex Pass users have done that and assigned all files to the same episodes, the mapping becomes valid for all Plex users.
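
The "three users agree" rule could be sketched like this. Everything here is hypothetical: the class name `MatchRegistry`, the opaque fingerprint strings and the user ids are made up for illustration, and the threshold of three is simply the example number from this post.

```python
from collections import Counter, defaultdict
from typing import Optional

REQUIRED_CONFIRMATIONS = 3  # example threshold: three different users


class MatchRegistry:
    """Collects manual matches and confirms one once enough users agree."""

    def __init__(self, required: int = REQUIRED_CONFIRMATIONS) -> None:
        self.required = required
        # fingerprint -> {user_id: episode}; one vote per user, the latest one counts
        self._votes: dict[str, dict[str, str]] = defaultdict(dict)

    def submit(self, fingerprint: str, user_id: str, episode: str) -> None:
        """Record that user_id matched this fingerprint to the given episode."""
        self._votes[fingerprint][user_id] = episode

    def confirmed_episode(self, fingerprint: str) -> Optional[str]:
        """Return the episode once at least `required` distinct users chose it."""
        votes = self._votes.get(fingerprint, {})
        if not votes:
            return None
        episode, count = Counter(votes.values()).most_common(1)[0]
        return episode if count >= self.required else None


registry = MatchRegistry()
for user in ("alice", "bob", "carol"):
    registry.submit("fp-disc01-title01", user, "S01E01")
print(registry.confirmed_episode("fp-disc01-title01"))
```

Keeping one vote per user means a single user re-submitting the same match cannot push it over the threshold. How to treat dissenting votes is left open in this sketch.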

After the files are named correctly, the default Plex indexing will find them as usual. So the default Plex indexer works as usual.

How to do fingerprinting
I think a simple SHA/MD5 hash of the audio tracks should be sufficient as a fingerprint. Example: if the English audio track has the hash 71ba776e2fb6e149f373519242fc8dca or the French audio track has the hash 2fbb57d13c6ca2dc6e3da2213586cdbb, then it is “S04E03 Name of Episode” of the show “Title (2023)”. This means the database must be able to store multiple hashes per episode to support multi-language and multi-codec matching. Original audio tracks are rarely re-encoded AND removed from the MKV file by the user, so this should be a good solution (if the user has removed all original audio tracks: tough luck!).
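
A minimal sketch of such a lookup, using the two example hashes from this post. Getting the raw bytes of an audio track out of the MKV (demuxing) is not shown and is assumed to happen elsewhere; the table and function names are hypothetical.

```python
import hashlib
from typing import Iterable, Optional


def md5_of_stream(chunks: Iterable[bytes]) -> str:
    """MD5 over the raw bytes of one audio track, fed in chunks to keep memory use low."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


# Several hashes (one per language or codec) point at the same episode.
EPISODE_BY_AUDIO_HASH = {
    "71ba776e2fb6e149f373519242fc8dca": ("Title (2023)", "S04E03"),  # English track
    "2fbb57d13c6ca2dc6e3da2213586cdbb": ("Title (2023)", "S04E03"),  # French track
}


def lookup_episode(track_hashes: Iterable[str],
                   table: dict[str, tuple[str, str]]) -> Optional[tuple[str, str]]:
    """Return (show, episode) for the first audio track whose hash is known."""
    for track_hash in track_hashes:
        if track_hash in table:
            return table[track_hash]
    return None


# A file with an unknown track plus the known French track is still identified.
print(lookup_episode(["0" * 32, "2fbb57d13c6ca2dc6e3da2213586cdbb"], EPISODE_BY_AUDIO_HASH))
```

The lookup is an exact match: it only succeeds if the track bytes are bit-for-bit identical to a copy that was hashed before.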

Why this would be a huge benefit:
As some discs contain a wrong order I need to check ALL mkv-files manually. I’m doing this as follows:

open FileBot twice
search in the second FileBot window for the episode, which returns the episode list
open the episode list on TMDB website
open the first file, jump fast through the complete episode and compare the screenshot, title, description with what I’m seeing
doing this for 1st, 3rd, 5th, … mkv file until I hit a wrong ordered episode. If yes I check ALL mkv files, if not, I assume that the 2nd, 4th, 6th mkv file has the correct order
add the file to the first FileBot window
copy the title from the second FileBot window
paste it into the first one
click the rename button

So this is a lot of work, which can be saved. I would even buy an additional “Plex Super Pass” to get such a feature.

Magnolia · April 15, 2023, 5:27pm

Hi @mgutt

I agree with your ideas on how this should work from a user perspective.

Your use cases are a good example of the many benefits that a content-based identification process would bring.

Indeed the Plex user community with their collections is a rich source to tap into to get this started.

One caveat: Using standard hash functions is imho not a good idea.
MD5 et al are common and light on resources.
But these hash functions are designed to create vastly different results if just one bit differs. This can easily be the case for the same content depending on how a file was created.

We actually need the opposite, a hash function that creates a unique value which is robust against „irrelevant“ variations of the content.
Basically how we humans recognize faces, voices, music, films, etc.: Even if technically our neurons never receive the exact same signal, we still recognize „Star Wars IV“, „Billie Jean“, Doris Day, ….
This is called a perceptual hash or a fingerprint.
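
A toy illustration of the idea in Python. This is not Chromaprint, Shazam or any real algorithm: the "fingerprint" here is just one bit per frame saying whether the frame is louder than the previous one, the signal is synthetic, and the frame size is arbitrary. It only shows the property we are after: the same content at half the volume keeps its fingerprint, while an ordinary hash of the samples does not match.

```python
import hashlib
import math
import struct

FRAME_SIZE = 256  # samples per frame; arbitrary choice for this toy


def toy_fingerprint(samples: list[float]) -> list[int]:
    """One bit per frame boundary: 1 if a frame has more energy than the one before."""
    energies = [
        sum(x * x for x in samples[start:start + FRAME_SIZE])
        for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE)
    ]
    return [1 if later > earlier else 0 for earlier, later in zip(energies, energies[1:])]


def hamming_distance(a: list[int], b: list[int]) -> int:
    """Number of positions where two equally long fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def sha256_of_samples(samples: list[float]) -> str:
    """Ordinary hash over the exact sample values, for comparison."""
    return hashlib.sha256(struct.pack(f"<{len(samples)}d", *samples)).hexdigest()


# Synthetic "audio" and the same content at half the volume.
signal = [math.sin(0.01 * i) * math.sin(0.37 * i) for i in range(4096)]
quieter = [0.5 * x for x in signal]

print("fingerprint distance:", hamming_distance(toy_fingerprint(signal), toy_fingerprint(quieter)))
print("sha256 identical:", sha256_of_samples(signal) == sha256_of_samples(quieter))
```

Halving the volume scales every frame energy by the same factor, so the louder/quieter ordering between frames, and with it the toy fingerprint, is unchanged. Real fingerprints aim for the same kind of invariance against re-encoding, noise and the like, and are compared by distance rather than by exact equality.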

Frankly, I find fingerprinting / perceptual hashes close to magic. But really the technology is well established by now.
My favorite example is Shazam, which often recognizes a track under the most absurd circumstances like taken from a cell phone mic in a noisy pub.
But there are various algorithms / libraries available under different licenses.

Magnolia · October 24, 2023, 1:22pm

Also discussed here (in German):

Magnolia · January 7, 2024, 10:17am

Also discussed here:
