Content-based file matching (Fingerprints)

Given advances in image and object recognition, there seems to be another way to do content-based file matching:

Recognizing key elements of a video, such as actors, gives valuable information about which film we might be dealing with.

  • Even for an actor like John Wayne, recognizing him in a movie quickly narrows the candidates from the many thousands of films out there to around 160.
  • There are even fewer films in which, e.g., Bill Murray, Dan Aykroyd and Sigourney Weaver appear together. So this simple content-based technique can narrow the selection down to Ghostbusters 1 or 2.
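The narrowing described above is essentially a set intersection over cast lists. Here is a minimal sketch, assuming the actors have already been recognized by some upstream model; `CAST_INDEX` and `candidate_films` are hypothetical names, and the index is a tiny hand-written sample with partial casts, not a real metadata source:

```python
# Hypothetical cast index: film title -> set of (some) credited actors.
# A real system would pull this from a movie metadata database.
CAST_INDEX = {
    "Ghostbusters": {"Bill Murray", "Dan Aykroyd", "Sigourney Weaver"},
    "Ghostbusters II": {"Bill Murray", "Dan Aykroyd", "Sigourney Weaver"},
    "Groundhog Day": {"Bill Murray", "Andie MacDowell"},
    "Alien": {"Sigourney Weaver", "Tom Skerritt"},
    "The Blues Brothers": {"Dan Aykroyd", "John Belushi"},
}


def candidate_films(recognized_actors, cast_index=CAST_INDEX):
    """Return the titles whose cast contains every recognized actor."""
    recognized = set(recognized_actors)
    # "recognized <= cast" is a subset test: all recognized actors are in the cast.
    return sorted(
        title for title, cast in cast_index.items() if recognized <= cast
    )


if __name__ == "__main__":
    print(candidate_films(["Bill Murray"]))
    print(candidate_films(["Bill Murray", "Dan Aykroyd", "Sigourney Weaver"]))
```

Each additional recognized actor can only shrink the candidate list, which is why a handful of faces may already be enough to get down to one or two titles.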

This approach can be extended to other elements for further cues, such as recognizing buildings, animals or landscapes, or identifying text in the credits or elsewhere.

This is complementary to using perceptual hashes on the audio track of video files.
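For readers who have not seen a perceptual hash before, here is a toy sketch of the idea. It uses a simple "average hash" over a list of numbers, which I am assuming stand in for something like per-window loudness of the audio track; real audio fingerprinting schemes are considerably more involved, and the function names and sample values are made up for illustration:

```python
def average_hash(values):
    """Return a bit string: '1' where a value is above the mean, else '0'."""
    mean = sum(values) / len(values)
    return "".join("1" if v > mean else "0" for v in values)


def hamming_distance(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Made-up sample values, e.g. loudness per time window.
    original = [10, 80, 20, 90, 15, 70, 30, 60]
    reencoded = [12, 78, 22, 85, 18, 72, 28, 63]   # same content, slightly altered
    unrelated = [90, 10, 80, 20, 70, 15, 60, 30]   # different content

    h = average_hash(original)
    print(hamming_distance(h, average_hash(reencoded)))
    print(hamming_distance(h, average_hash(unrelated)))
```

The point is that small changes to the signal (re-encoding, volume changes) leave the hash the same or nearly the same, unlike a cryptographic hash, where any change produces a completely different value.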

Taken together, these cues should quickly yield a very reliable match.


To those not familiar with content fingerprinting / perceptual hashes, here is an interesting (although sad) example of what is currently possible.

Not just possible as in “edge case with huge amounts of resources” but as in “built into regular Smart TVs from Samsung and LG.”

From the study:

Smart TVs implement a unique tracking approach called Automatic Content Recognition (ACR) to profile viewing activity of their users. ACR is a Shazam-like technology that works by periodically capturing the content displayed on a TV’s screen and matching it against a content library to detect what content is being displayed at any given point in time.

And further

ACR periodically captures frames (and/or audio), builds a fingerprint of the content, and then shares it with an ACR server for matching it against a database of known content (e.g., movies, ads, live feed). When the fingerprint matches, ACR server can determine exactly what piece of content is being watched on the smart TV. […] Fingerprints in ACR are essentially hash of the content, which can be matched at the server-side to identify the content.
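The server-side matching step can be pictured roughly like this. This is only a sketch of the general idea, not how any vendor's ACR actually works: I am assuming fingerprints are fixed-length integers and that a match tolerates a few differing bits, and `KNOWN_CONTENT`, `match_fingerprint` and `max_distance` are hypothetical names:

```python
# Hypothetical database: fingerprint (as an integer) -> content title.
KNOWN_CONTENT = {
    0b1011001110001111: "Movie A",
    0b0100110001110000: "Ad B",
}


def match_fingerprint(fingerprint, database, max_distance=2):
    """Return the closest known title, or None if nothing is close enough."""
    best_title = None
    best_distance = max_distance + 1
    for known_fingerprint, title in database.items():
        # XOR leaves a 1 bit wherever the two fingerprints differ.
        distance = bin(fingerprint ^ known_fingerprint).count("1")
        if distance < best_distance:
            best_title, best_distance = title, distance
    return best_title


if __name__ == "__main__":
    # One bit differs from "Movie A", so it still matches.
    print(match_fingerprint(0b1011001110001110, KNOWN_CONTENT))
    # Far from everything in the database, so no match.
    print(match_fingerprint(0b1111111111111111, KNOWN_CONTENT))
```

A linear scan like this is fine for a toy example; a database of real size would need some kind of index for near-duplicate lookup.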

Note: The suggestion for Plex achieves the same purpose minus the imho nefarious aspects.

The required database of known content could be built with the help of volunteer Plex users, who already have content that is mapped to titles via the Plex naming conventions.
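As a rough sketch of how such contributions might be collected: the title and year can be read from a file named in the "Title (Year)" style, and stored next to whatever fingerprint the client computed. Everything here is an assumption for illustration (the regex, the function names, the database layout, and the fingerprint value, which is treated as an opaque string):

```python
import re
from pathlib import PurePath

# Matches names in the "Title (Year)" style, e.g. "Ghostbusters (1984)".
PLEX_MOVIE_PATTERN = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)")


def parse_plex_movie_name(path):
    """Return (title, year) parsed from the file name, or None if it does not fit."""
    stem = PurePath(path).stem  # file name without directory and extension
    match = PLEX_MOVIE_PATTERN.match(stem)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


def add_contribution(database, fingerprint, path):
    """Record a volunteer's fingerprint under the title parsed from the path."""
    parsed = parse_plex_movie_name(path)
    if parsed is None:
        return False  # skip files that do not follow the naming convention
    # Keep a set per fingerprint so conflicting submissions stay visible.
    database.setdefault(fingerprint, set()).add(parsed)
    return True


if __name__ == "__main__":
    db = {}
    add_contribution(db, "fp-123", "Ghostbusters (1984)/Ghostbusters (1984).mkv")
    add_contribution(db, "fp-456", "random_clip.mp4")
    print(db)
```

Storing a set of titles per fingerprint, rather than a single title, would let the service notice mislabeled files when several volunteers disagree.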
