Content-based file matching (Fingerprints)

@Magnolia You're welcome :slight_smile:

Getting everything identified correctly is a lot of work. Some cases I simply have no solution for, like episodes split into parts:
s01e01a
s01e01b
s01e02a
s01e02b

Movies can sort of handle this; it will play seamlessly, though it creates headaches with subtitles.
Some things will never be sorted out properly, like Looney Tunes. There are thousands, some restored, some not, and lots hiding only in Disney vaults.

The new TV agent has had a lot of love and has fixed a lot of things.

Another feature request along the same lines would be fixing file structure. I fear these kinds of niggles need to be sorted out before we implement fuzzy hashing, or we are going to contaminate the database with bad matches. The flow would be more like: fix everything, then do a deep analysis, then push it upstream back to Plex. That could make things a bit more fragile, though, because fewer users would actively go and push. Where it will really shine is on everything that isn't matched during the initial library population, or is a duplicate. By default the analysis could be throttled for performance, or the deeper analysis could run only for duplicates and unmatched files.
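To make the idea a bit more concrete, here is a rough sketch of what a sampled content fingerprint could look like. Everything in it is my own assumption, not anything Plex actually does: the function name `sample_fingerprints`, the 64 KiB chunk size, and the choice of SHA-256. SHA-256 is an exact hash, not a fuzzy one, so it only matches byte-identical samples; it stands in here because it ships with Python and produces a 64-character hex string.

```python
import hashlib
import os


def sample_fingerprints(path, samples=10, chunk_size=64 * 1024):
    """Return `samples` SHA-256 hex digests read at evenly spaced offsets.

    Hypothetical helper: SHA-256 is a stand-in for a real fuzzy hash.
    """
    size = os.path.getsize(path)
    digests = []
    with open(path, "rb") as fh:
        for i in range(samples):
            # Spread the sample offsets evenly across the file.
            offset = (size * i) // samples
            fh.seek(offset)
            chunk = fh.read(chunk_size)
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests
```

Reading only a handful of small chunks keeps the cost per file low, which is the kind of thing that could run on just the duplicates and unmatched files rather than the whole library.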

One can kind of imagine the hardware requirements for this at a back-of-the-envelope level.
Say we used a 64-character hex hash. Saved on the hard drive as a raw text file, that would be 129 bytes.
10 samples per file: 1290 bytes
OK, it's hard to guess how many different media files there are, so let's say 1 billion.
1.29×10¹² bytes ÷ 1024 = 1259765625 KiB
1259765625 ÷ 1024 ≈ 1230239.868 MiB
1230239.868 ÷ 1024 ≈ 1201.406 GiB
1201.406 ÷ 1024 ≈ 1.1732 TiB

So a terabyte and change; factoring in file system and database overhead, call it 2 TiB.
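The estimate is easy to replay with different guesses. This is just a sketch of the same arithmetic; the function names are made up for illustration. The 129 bytes per hash is the figure used above; 64 hex characters stored as plain ASCII would be nearer 65 bytes with a newline, and 32 bytes if stored as raw binary, so the real number could well be smaller.

```python
def index_size_bytes(files=1_000_000_000, samples_per_file=10, bytes_per_hash=129):
    """Raw size of the hash index, before file system or database overhead."""
    return files * samples_per_file * bytes_per_hash


def human(n):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.2f} {unit}"
        n /= 1024


print(human(index_size_bytes()))                   # 1.17 TiB
print(human(index_size_bytes(bytes_per_hash=32)))  # 298.02 GiB
```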

Bolt that onto a bunch of SSDs with 320 GB of RAM and dual 64-core CPUs, and you could probably push 40 Gbit/s to the wire. I highly doubt there will be 1 billion video files out there that get fed into Plex, so it's entirely feasible.

I'm sure there will be lots of hosts out there very happy to be hosting Plex, because it's a core part of their infrastructure. You'd probably want one of the big Netherlands server farms, for privacy laws, price, and raw bandwidth cost. It might be possible to pick something up at cost or very cheap because of what Plex does for their hosting.