Please fix the OpenSubtitles metadata agent while waiting for a new subtitle management system

Many users have asked for a more efficient way to manage subtitles. While we wait for a new, unannounced feature (since you surely have the secret to solving the subtitle complaints), can you improve the OpenSubtitles metadata agent?

 

Currently this agent searches for the best subtitle for a video by asking opensubtitles.org for the list of subtitles matching the video file's hash and fetching only the one with the most downloads. It generally works, but there are more and more movies or show episodes with the same hash, and I often get the wrong subtitles for my videos.

 

Example: I have episode 7 of season 2 of Game of Thrones with the hash 4eaa05eb585cdc4, and opensubtitles.org returns for this hash: http://www.opensubtitles.org/en/search/sublanguageid-all/moviehash-4eaa05eb585cdc4/sort-7/asc-0.

Following the metadata agent's algorithm, the best response is Revenge S01E14. My episode is in fourth place :(

 

I think it would be quite easy to modify the metadata agent to add a check on the context (such as show name, season and episode) to eliminate inconsistent items from the opensubtitles.org answer.
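
To make the idea concrete, here is a minimal sketch of such a context check. It is not the agent's real code: the dictionary keys (`show`, `season`, `episode`, `downloads`) and the function names are hypothetical, and the download counts in the sample are made up for illustration.

```python
def filter_by_context(results, show, season, episode):
    """Keep only results that agree with what the library knows about the file."""
    wanted = show.strip().lower()
    return [
        r for r in results
        if r["show"].strip().lower() == wanted
        and r["season"] == season
        and r["episode"] == episode
    ]


def pick_subtitle(results, show, season, episode):
    """Apply the context check first, then fall back to 'most downloads'."""
    consistent = filter_by_context(results, show, season, episode)
    return max(consistent, key=lambda r: r["downloads"], default=None)


# Hypothetical answer for one hash (download counts are invented).
candidates = [
    {"show": "Revenge", "season": 1, "episode": 14, "downloads": 9000},
    {"show": "Game of Thrones", "season": 2, "episode": 7, "downloads": 1200},
    {"show": "Game of Thrones", "season": 2, "episode": 7, "downloads": 300},
]

best = pick_subtitle(candidates, "Game of Thrones", 2, 7)
print(best)
```

With the check in place, the Revenge entry is dropped before the download count is even looked at.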

 

Edit: a fork of the metadata agent is in progress to test some of the proposals from this thread. More info here: https://forums.plex.tv/topic/102923-mock-up-for-opensubtitles-metadata-agent-improvement/

You could always install Plex Tools :)

https://github.com/manuliege/PlexTools.bundle  

Thanks for the suggestion!

Thanks, Elan, for considering an improvement to the agent.

OrionShock, I just discovered Plex Tools yesterday and it's a good workaround.

I've looked at the metadata agent code. In fact it asks opensubtitles.org for the list of subs corresponding to a hash AND a file size. I'm very surprised to see so many cases with more than one match. Perhaps statistics can explain this, or perhaps the opensubtitles.org database is not very clean.

So under either hypothesis, the metadata agent needs to consolidate the items returned by the database, either to avoid real duplicates or to clean out wrong entries ;-)

I volunteer to beta test if needed.

Hi, there.

A couple of things, from my own experience working with the OS API since 2007:

What you see are not videos with the same hash, but sloppy clients, users and assumptions. I have stats in SolEol for something like 200 thousand videos and there're no collisions in the hashes.

The problem is that there're clients that blindly upload subs doing very sloppy matching (for example, BSPlayer is famous for uploading all subtitles in the current directory, for the video being played, regardless of name).

Then OpenSubtitles might make the assumption that if this subtitle Y, known to be matched to video A, is also matched to video B, then all subtitles matched to video A should work with video B. Similarly, all subtitles matched to video A should be matched to video B.

You can imagine how quickly mismatches then replicate here. It's like an infection propagating. You only need one user who has all subs and videos in the same folder to contaminate the whole database.

I've argued in the OS forums to fix this, but the contamination is so thorough and it's so hard to clean up, no major inroads have been made towards that goal.

The other point is that your search string lacks the bytesize, which makes the search way more open than it should be. Bytesize should NOT be optional, and leaving it out further muddies the results.

In SolEol, to be able to select the best subtitle, I assign a weight to all results, taking into account a lot of factors (of which download count is used only in case of a tie). Plex can do an even better job than my SolEol can, as it's got better metadata. I suggest:

-Filter out all subtitles where the Season doesn't match (episodes might be numbered differently due to DVD/Airing order, but the season rarely is).

-Filter out all subtitles with a rating between 1 and 4 (these are subtitles that made people so pissed off that they went through the horrible rating interface to rate them bad).

-Rating 10 and 0 have the same weight. Ratings between 5 and 9 are all equally good.

-Machine-translated flags, when reported, should get the subtitle out of the results pool as well. Same for "BAD" and "Not synced" flags.
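
The four rules above could be sketched roughly like this. It is an illustration, not SolEol's or the agent's code: the keys (`season`, `rating`, `machine_translated`, `bad`, `not_synced`, `downloads`) are hypothetical names, not the real API fields. Since the list doesn't say how the 0/10 group should rank against the 5 to 9 group, the sketch treats every surviving rating as equal and uses the download count only to break the tie.

```python
def is_acceptable(sub, season=None):
    """Apply the filtering rules; `season` is None for movies."""
    # Season mismatch: episode numbering may differ, the season rarely does.
    if season is not None and sub.get("season") != season:
        return False
    # Ratings 1 to 4 mean someone bothered to rate the subtitle as bad.
    if 1 <= sub.get("rating", 0) <= 4:
        return False
    # Any of these flags takes the subtitle out of the pool.
    if sub.get("machine_translated") or sub.get("bad") or sub.get("not_synced"):
        return False
    return True


def best_subtitle(subs, season=None):
    """Among acceptable subtitles, download count is only the tie-breaker."""
    pool = [s for s in subs if is_acceptable(s, season)]
    return max(pool, key=lambda s: s.get("downloads", 0), default=None)


# Invented sample data.
subs = [
    {"id": "a", "season": 1, "rating": 0, "downloads": 9000},   # wrong season
    {"id": "b", "season": 2, "rating": 3, "downloads": 5000},   # rated bad
    {"id": "c", "season": 2, "rating": 10, "downloads": 800,
     "machine_translated": True},                               # flagged
    {"id": "d", "season": 2, "rating": 0, "downloads": 400},
    {"id": "e", "season": 2, "rating": 7, "downloads": 650},
]

print(best_subtitle(subs, season=2)["id"])
```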

If Plex still had IMDB metadata, it would be trivial to use that to match, since the subtitles have IMDB data in them. Such a shame OS has IMDB so deeply intertwined they can't really get out of it (I have also lobbied for this, unsuccessfully). But what can still be done is, if the results include lots of results for the same season of the same show and only one of several other shows, it can be assumed those are false positives.
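
A rough sketch of that last heuristic, assuming each result carries a `show` and a `season` (hypothetical keys). The 50% threshold is my own arbitrary choice for "lots of results", not something taken from SolEol or the agent.

```python
from collections import Counter


def drop_outliers(results, min_share=0.5):
    """Keep only the dominant (show, season) group when there is a clear one."""
    if not results:
        return []
    counts = Counter((r["show"], r["season"]) for r in results)
    top_key, top_count = counts.most_common(1)[0]
    # No clear majority: leave the list untouched rather than guess.
    if top_count / len(results) <= min_share:
        return list(results)
    return [r for r in results if (r["show"], r["season"]) == top_key]


# Invented sample: four results for one show and season, one stray.
results = [
    {"show": "Game of Thrones", "season": 2},
    {"show": "Game of Thrones", "season": 2},
    {"show": "Game of Thrones", "season": 2},
    {"show": "Game of Thrones", "season": 2},
    {"show": "Revenge", "season": 1},
]

print(len(drop_outliers(results)))
```

Note that this works without knowing anything about the video itself, so it would still apply when the library metadata is missing or unreliable.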

Source: I've been working with OpenSubtitles and their API for seven years. I'm painfully aware of the problems it has. It's still the best around, but it takes a lot of thinking on the part of clients to get good results.

Eduo, your post is very interesting.

I don't know what priority the Plex dev team gives this update request. Eduo, do you want to help me prototype a better agent? I have some knowledge of how metadata agents work and you have the OS expertise.

PS: SolEol is my favorite app for fetching subs when Plex doesn't succeed. Thank you for your work.

I'm only familiar with opensubs via XBMC...

And it would seem to me the solution lay with opensubs not the agents we are using to find subtitles.

Even if you got info from IMDB and TMDB it still wouldn't help since the SRT file needs to be for that particular encoded source you have which can be millions of rips all with different sync.

I long for the day when SRT files are scrapped entirely in favor of subtitle tracks incorporated into the source.

They are just too hit and miss for my good.

Eduo, your post is very interesting.
I don't know what priority the Plex dev team gives this update request. Eduo, do you want to help me prototype a better agent? I have some knowledge of how metadata agents work and you have the OS expertise.

PS: SolEol is my favorite app for fetching subs when Plex doesn't succeed. Thank you for your work.

I've coupled this post with a stern one in the Developers forums in OpenSubtitles. I think that unless OpenSubtitles cleans this up the whole point of using it instead of other sources becomes moot.

Sure, I can help prototype a new agent, but who would code it? The current one could be forked to be improved, but I don't do Python.

I'm only familiar with opensubs via XBMC...

And it would seem to me the solution lay with opensubs not the agents we are using to find subtitles.

You may not be familiar with the way OpenSubtitles works and why it's usually chosen first when adding sources to Subtitles. This is OK, since it's not common knowledge or obvious unless you go into it. Oncleben13 mentioned it, but he assumed the people reading would already know.

Even if you got info from IMDB and TMDB it still wouldn't help since the SRT file needs to be for that particular encoded source you have which can be millions of rips all with different sync.

This is where the confusion exists. When searching in OpenSubtitles you don't search by name, but by "hash". Programs create a "hash" (think of it as a "digital fingerprint" of the file) and ask OpenSubtitles if it has subtitles matched to that hash and bytesize. This means that if there are a dozen subtitles for the same movie, the results will only include those that match the specific file you have.
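
For anyone curious what that "fingerprint" looks like in practice, here is a sketch of the hash as I understand OpenSubtitles documents it: the file size plus the 64-bit sums of the first and last 64 KiB of the file, truncated to 64 bits. The function name is mine, and this should be checked against the official reference before relying on it.

```python
import struct

CHUNK = 64 * 1024          # the hash only looks at the first and last 64 KiB
MASK = 0xFFFFFFFFFFFFFFFF  # keep the running sum within 64 bits


def opensubtitles_hash(path):
    """Return (hash as 16 hex digits, file size in bytes) for a video file."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to get the size
        size = f.tell()
        if size < 2 * CHUNK:
            raise ValueError("file too small to hash")
        total = size
        for offset in (0, size - CHUNK):
            f.seek(offset)
            data = f.read(CHUNK)
            # Sum the chunk as little-endian unsigned 64-bit integers.
            for (value,) in struct.iter_unpack("<Q", data):
                total = (total + value) & MASK
    return "%016x" % total, size
```

Both values go into the search, which is why re-encoding or otherwise modifying a file (anything beyond renaming it) makes it stop matching.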

This is a great advantage when trying to get the right one for you. The problem comes when your hash is not matched to any video or is matched to the wrong videos. In the first case, the plug-in falls back to a name-based search (there's little more it can do).

This means that if you convert or modify your videos in any way other than renaming them before getting subtitles, you lower the probability of getting subs for them, but it also means that when it finds matches they're much better than your average search and are usually in sync with your video.

Renaming them lowers the probability of finding stuff by name, since OpenSubtitles keeps track of release names as well and results can be prioritized by that too.

The problem is that OpenSubtitles is a very big database and errors and mismatches have crept into it, and it's very hard to clean them up. Imagine you're a pretty careless user and you have all your movies and your subtitles in a single big folder. You might not care much, since players do a decent job of picking out the ones named the same as your file.

But not all do, and some take ALL subtitles in, and present them to the user prioritizing by name, the best match first. This means that you may not notice but your player may be picking up all the other subtitles as valid.

Then you get your entrepreneur developer types (the ones that would code something like "take all subtitles in the folder and subfolder regardless of name") and he/she adds automatic upload of hash matches to OpenSubtitles (the intention is good, the results disastrous). You can see where this can lead, where a single movie hash is matched to a ton of subtitles (this needs to happen only once). Then OpenSubtitles, who knows already those subtitles belong to other movies, decides then that they're all the same movie. In a couple of days you've got such a mess it can take years to go back to a clean state again.

To your point: OpenSubtitles reports back the IMDB of the files it finds, while Plex has the IMDB of series and movies stored (from TMDB and TVDB). You can prioritize the results (which are already filtered by matched file hash) according to IMDB.
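
As a sketch of that prioritization, assuming each result has an `imdb_id` and a `downloads` key (hypothetical names) and that IDs may arrive with or without the "tt" prefix and leading zeros. The IDs in the sample are made up.

```python
def normalize_imdb(imdb_id):
    """Reduce 'tt0000123', '0000123' and 123 to the same string, '123'."""
    text = str(imdb_id).strip().lower()
    if text.startswith("tt"):
        text = text[2:]
    return text.lstrip("0")


def prioritize_by_imdb(results, library_imdb_id):
    """Put results whose IMDB ID matches the library first, then by downloads."""
    target = normalize_imdb(library_imdb_id)
    return sorted(
        results,
        key=lambda r: (normalize_imdb(r["imdb_id"]) != target, -r["downloads"]),
    )


# Invented sample: the most downloaded entry belongs to another title.
results = [
    {"name": "wrong title", "imdb_id": "0000002", "downloads": 9000},
    {"name": "right title, older sub", "imdb_id": "1", "downloads": 300},
    {"name": "right title, popular sub", "imdb_id": "tt0000001", "downloads": 1200},
]

ordered = prioritize_by_imdb(results, "tt0000001")
print([r["name"] for r in ordered])
```

Mismatched entries are pushed to the end rather than dropped, so a name-based fallback would still have something to work with.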

I long for the day when SRT files are scrapped entirely in favor of subtitle tracks incorporated into the source.

They are just too hit and miss for my good.

You might not have thought this through, since it's almost impossible to do except for very localized and specific situations.

Most of your subtitles out there are made by volunteers. The closer they are to the release date, the less likely it is that they come from an "official" source (this is especially true for TV Shows). Since there are so many groups doing subtitles, there are always multiple versions of them, sometimes of equal quality.

Subtitles also evolve over time. If you were to get subtitles for a series released in DVD long ago they'd probably be of great quality, but last week's episode of Walking Dead most likely had lower-quality subs for a whole week.

Add to this that in some markets there're never "official" subtitles so all existing are translations. And that a lot of translations are well-meaning but useless Google Translate ones that just muddy up the pool.

Essentially, the further you get from the English language and the US release dates the worse the subtitles tend to be.

Your "ideal" would only be met the day releases come with subtitles from the source for all languages (this is never going to happen, as volunteer subtitles are the only way to get some of the media into markets it'll never reach officially).

If you're just talking about English subs and only for month-old-at-the-earliest videos, then it's perfectly feasible.

I've been pushing for a "master" subtitle selection for each video in OpenSubtitles for a while, but this has proven impossible to do. For them this would mean delivering a single suggested result (one for full CC, one for just spoken text, actually). Where one subtitle for each release is flagged as "the best" (until a better one comes along) and that's prioritized in clients.

One note: you can see what the problem with OpenSubtitles is: it requires people to upload matches between a subtitle and a video. This is done by only a few volunteers (I calculate the proportion to be around 10K to 1, downloaders vs. uploaders), since most players either do not upload matches (like Plex) or do them wrong (like BSPlayer).

You may not be familiar with the way OpenSubtitles works and why it's usually chosen first when adding sources to Subtitles. This is OK, since it's not common knowledge or obvious unless you go into it. Oncleben13 mentioned it, but he assumed the people reading would already know.
 
 
 
This is where the confusion exists. When searching in OpenSubtitles you don't search by name, but by "hash". Programs create a "hash" (think of it as a "digital fingerprint" of the file) and ask OpenSubtitles if it has subtitles matched to that hash and bytesize. This means that if there are a dozen subtitles for the same movie, the results will only include those that match the specific file you have.

What you don't seem to realize is that most of us who remain Legal and do not Pirate Movies Rip our Movies from our own BluRay sources, which means our Hashes will NEVER correspond to what's in the OpenSubs database unless we upload those SRT files to it. We will have whatever language is available to us on the BluRay, but others who need a language that is not supported will have to add their own.
 

You might not have thought this through, since it's almost impossible to do except for very localized and specific situations.
 
Most of your subtitles out there are made by volunteers. The closer they are to the release date, the less likely it is that they come from an "official" source (this is especially true for TV Shows). Since there are so many groups doing subtitles, there are always multiple versions of them, sometimes of equal quality.

Which just goes to show what I'm saying. Even if you find an SRT, the system made for matching them is UNRELIABLE, either because the hashes don't match or because the VOLUNTEER did not do a very good job of making them.
And we haven't even discussed the Technical issues of using external SRTs in the case of Streaming and Media servers, where the player device does not see the folder the source is in and therefore can't use subtitles without transcoding, which is expensive compared to the way SRT was DESIGNED to work: as a LOCAL HINT file Co-Loaded with the Video Source.

Even if you find a matching subtitle, you still have issues using that subtitle on players like a CCast or a Roku which never sees the folder the SRT file is in and therefore the Server has to do the integration where SRT was designed to help the CLIENT side overlay the subs. This requires the CLIENT is able to see the entire folder of the source not just the source.

This is why I say the SRT method is antiquated, old and needs to be re-engineered not just from a matching POV but from a technical implementation POV.

My point is more technical. With the MKV container there is no longer a reason to have external subtitles, and those who need Subtitle support would be much better off finding an SRT that matches and merging it directly into the container. It may seem like more work, but it's no harder than trying 10 different SRT files hoping to hit on one that works while you're trying to watch a movie.
That is if you even have the option to do that which many player targets do not because they don't support external Subtitles at all.

What you don't seem to realize is that most of us who remain Legal and do not Pirate Movies Rip our Movies from our own BluRay sources, which means our Hashes will NEVER correspond to what's in the OpenSubs database unless we upload those SRT files to it. We will have whatever language is available to us on the BluRay, but others who need a language that is not supported will have to add their own.

It's not that I don't realize it. It's that this would be off-topic to the OP's stated problem, which is specifically about hashes. If you're talking about something else, maybe you didn't realize what the point originally was, or you're inadvertently hijacking the thread. Nonetheless, I included how OpenSubtitles and the plug-in search by name, for people like you who don't use hashes, assuming people like you would read it before complaining about it.

Regardless, it's understandable that you're not fully aware of how others use their media or the legality in other countries. Depending on your country it may not be illegal to download a movie you own (or even one you don't). Where I live, for example, if you have the original media you're allowed to have copies of it, even if you didn't make them yourself. This not only is not illegal but is explicitly defended as a right. I have several hundred DVDs and Blu-Rays yet I'd be an idiot to try and rip them myself, seeing as how it's faster to get it from others, usually better than I would've done it myself, and completely legal. 

I'm sure others would appreciate less of the holier-than-thou condescension and self-righteousness; avoiding moral judgements in a forum like this would be fantastic for everyone, all things considered (me? I couldn't care less, I'm used to ignorance about these subjects after years involved in this).

As stated, the plug-in searches by name and selects from the results. This would help you not only if the original Blu-ray lacked the subtitles but also if you didn't care about ripping them originally. The plug-in, though, doesn't filter results intelligently and, thus, the selection is all over the place. At any rate, matching the same IMDB is *bound* to produce better results on average than not (and selecting a subtitle that says "BDRIP" is *bound* to be a better selection than one that says "CAM"), so yes, you're affected by these problems (and would benefit from my comments) as well.

And we haven't even discussed the Technical issues of using external SRTs in the case of Streaming and Media servers, where the player device does not see the folder the source is in and therefore can't use subtitles without transcoding, which is expensive compared to the way SRT was DESIGNED to work: as a LOCAL HINT file Co-Loaded with the Video Source.

We haven't because, as the "legality of downloading vs ripping" before, it's off-topic to the OP's initial request. Your comment is absolutely useless to the problem at hand, since it seems to be a complaint about SRTs in general. 

Even if you find a matching subtitle, you still have issues using that subtitle on players like a CCast or a Roku which never sees the folder the SRT file is in and therefore the Server has to do the integration where SRT was designed to help the CLIENT side overlay the subs. This requires the CLIENT is able to see the entire folder of the source not just the source.

This is why I say the SRT method is antiquated, old and needs to be re-engineered not just from a matching POV but from a technical implementation POV.

My point is more technical. With the MKV container there is no longer a reason to have external subtitles, and those who need Subtitle support would be much better off finding an SRT that matches and merging it directly into the container. It may seem like more work, but it's no harder than trying 10 different SRT files hoping to hit on one that works while you're trying to watch a movie.
That is if you even have the option to do that which many player targets do not because they don't support external Subtitles at all.

You do realize that nothing of what you're trying to say has any bearing on whether the OpenSubtitles plugin search can be improved? You realize that you'll still need to "try 10 different SRT files" to get the one that should be merged?

You're confusing having a good subtitle delivery format (MKV with merged SRTs works for you, I'd add MP4 with TTXT because I prefer them, others might prefer AVIs with embedded subs) with a process on how to get those subtitles in the first place. I can't understand why you thought that was relevant to this topic.

I won't even go into the problems in logic in your assumptions and arguments because this thread has been broken enough, and it should get back on track. If you have comments on whether the OpenSubtitles plug-in should be improved and how (or why not), I assume that'd be welcome. Reading an ignorant rant about legality just to get to a point that has no relation to Plex (since Plex won't make those MKVs for you) and that solves a problem you have with ROKUs but doesn't solve other people's problems (like iOS devices, which don't play MKVs with merged SRTs but do play MP4 with TTXT) seems to me like a waste of time, personally.

You do realize that nothing of what you're trying to say has any bearing on whether the OpenSubtitles plugin search can be improved? You realize that you'll still need to "try 10 different SRT files" to get the one that should be merged?

And you do realize I covered that in the statements about how OpenSubtitles.org needs to change for any improvement to occur, right?
The issue (as I stated in my very first post in this thread) was that the Plugin is not going to solve a problem that resides on OpenSubs servers!

I'm only familiar with opensubs via XBMC...
 
And it would seem to me the solution lay with opensubs not the agents we are using to find subtitles.


They use an unreliable matching system and even when a match is found the SRT file in question may be useless!
And no agent on the planet is ever going to change that!
So either OpenSubs needs to do a better job of weeding out bad volunteer submissions, Find some other way to match SRT to Source other than Hash and Name and invent some convention that can designate which cut or version you have or this is pretty much a waste of time because all the Plugin improvements in the world will not make the content and SRT files work any better than they do now!

OK AsphyxNYC, we have read your opinion on OpenSubtitles. Fine. Currently the OS success rate is about 80% for my videos, and many other users use this agent as their main source of subs. It's not perfect, but I haven't identified any reliable alternative so far.

Now I want to improve the current solution while waiting for a totally new way of managing subtitles. I think it wouldn't take much work to get a real improvement. If you want to help, you can contribute to this topic; otherwise please avoid off-topic contributions here.

Actually, I have had a popular subtitle searching program with over 5 million successful matches that says the same. The success rate of OpenSubtitles as-is is already pretty high. With some (trivial and obvious) tweaks on the part of the client, the success rate easily gets close to 100%.

I'm talking here from experience, not from inference or assumptions :) 

In French we say "On n'est jamais mieux servi que par soi-même" (literally, "one is never better served than by oneself").

I'm testing some of Eduo's proposals in a fork of the OS agent. The results are good and very encouraging. It's not ready for production or beta testing yet. I will keep you informed.

PS : the good news is that @Sander1 is working on the OS agent update too.

You can find a prototype for testing in this thread : https://forums.plex.tv/topic/102923-mock-up-for-opensubtitles-metadata-agent-improvement/

An RC2 with some limitations is available for download. More information in the public thread: https://forums.plex.tv/topic/102923-beta-for-opensubtitles-metadata-agent-improvement/

Early 2021 clean-up: implemented (3rd party agents, Plex’ subtitle on demand feature)