Automate subtitle offsets with voice to text



So we've all had subtitles that are off by a few seconds, or that slowly drift because of differing frame rates or cuts in the media file. Subtitles that need adjusting multiple times during a single viewing are especially annoying.

My idea is to use a public library that can generate text from audio. I'm not proposing to generate subtitles on the fly, as that would need Siri/Cortana-grade speech recognition, and then there's the matter of subtitle presentation and timing aesthetics.
What I do propose is generating rough transcriptions from speech that are somewhat accurate. These results could be matched against the subtitles already present in Plex to find common patterns and ultimately determine an offset.
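To make the matching idea concrete, here's a minimal sketch in Python. It assumes some speech-to-text library has already produced `(timestamp, phrase)` pairs (that part is hypothetical and not shown); it then parses SRT cues and takes the median difference between audio time and subtitle time as the offset estimate:

```python
import re
from statistics import median

def parse_srt_times(srt_text):
    """Return a list of (start_seconds, text) tuples from SRT content."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> .*?\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    cues = []
    for h, m, s, ms, text in pattern.findall(srt_text):
        start = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
        cues.append((start, " ".join(text.split())))
    return cues

def estimate_offset(recognized, cues):
    """recognized: list of (time_seconds, phrase) pairs, as a
    speech-to-text library might produce them (hypothetical input).
    Returns the median of (audio time - subtitle time) over phrases
    that appear verbatim in some cue, or None if nothing matched."""
    diffs = []
    for t, phrase in recognized:
        for start, text in cues:
            if phrase.lower() in text.lower():
                diffs.append(t - start)
                break
    return median(diffs) if diffs else None
```

Taking the median rather than the first match should make the estimate tolerant of a few false matches, which seems unavoidable with mediocre recognition.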

I have rough ideas about how to implement the above, I can share those at a later point if anyone is interested.

What I'm unclear about is how exactly Plex handles subtitles. I know transcoding burns them in, so that would be the easiest case, as everything happens server-side. But what about direct streams? Are the subtitles transmitted in full, or streamed in chunks? In the former case the server would need to push (new) offsets to the client, while in the latter I could simply adjust future chunks.
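For the chunked case, adjusting future subtitle data could be as simple as shifting only the cues that haven't been shown yet. A sketch (the tuple layout and function name are my own, not anything Plex exposes):

```python
def shift_future_cues(cues, offset, position):
    """Apply `offset` seconds only to cues starting after the current
    playback `position`; earlier cues were already displayed.
    cues: list of (start_seconds, end_seconds, text) tuples."""
    return [
        (start + offset, end + offset, text) if start > position
        else (start, end, text)
        for start, end, text in cues
    ]
```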

In any case, my idea is to have the server determine offsets rather than pushing this logic client-side. Ideally this should be an implement-once, use-everywhere kind of solution.

But are we even able to build something like this into Plex? I'm not familiar with the API or the other exposed parts; I only know that the server itself is closed source. I don't think the above could be managed with metadata plugins.

So what would the options be here? To summarize, the key point is to determine offsets continuously for media during playback. I'm not interested in client-side-only solutions, as they reach a limited audience and are likely not future-proof.

Would love to hear your thoughts on this!


Doubtful, IMHO

In Python, you would have to isolate the audio stream, walk through it, and feed it to some (to me, as yet unknown) solution that can transcribe audio to text for every language supported by Plex.

If such a thing exists, we would also have to extract the relevant timestamps.

And determining the language spoken is a nightmare I simply can't imagine how to solve.

Not to be negative here, because it's a great idea, but I sadly think the technology simply isn't there yet.




Let's assume we limit this to the English language only for now.

Can you comment on the options for interfacing with the Plex server? Can we develop tooling that hooks into a viewing session and piggybacks on the open connection (i.e. can we add additional data to the stream on the fly)?


Regarding your remarks on technical feasibility: there are some decent Python libraries out there for offline speech recognition. We only need a mediocre-performing algorithm, and can apply heuristics from there to eventually determine an offset with reasonable confidence.
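Even with a mediocre recognizer, fuzzy string matching can salvage garbled output. A sketch using only the standard library's `difflib` (the function and threshold are my own illustration, not from any particular recognition library):

```python
from difflib import SequenceMatcher

def best_cue_match(phrase, cues, threshold=0.6):
    """Find the subtitle cue whose text is most similar to a possibly
    garbled recognized phrase. cues: list of (start_seconds, text).
    Returns (start_seconds, similarity_ratio), or None if nothing
    clears the threshold."""
    best = None
    for start, text in cues:
        ratio = SequenceMatcher(None, phrase.lower(), text.lower()).ratio()
        if best is None or ratio > best[1]:
            best = (start, ratio)
    if best and best[1] >= threshold:
        return best
    return None
```

For example, a misrecognized "helo there jeneral" still matches a cue reading "hello there general" well above the threshold, so the cue's timestamp can still feed into the offset heuristic.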

I'm not saying I have this all worked out, and there's a more-than-real chance it won't work well at all, but someone has to give it a try, right?


If I were given the option to choose how my PlexPass subscription is spent, better subtitle integration would be at the top of my list. The way the Plex team chose to implement subtitles only took very specific use cases into account. The rest of us were forgotten =(