Plex Media Server SEGFAULT with Nvidia GPU

Server Version#: 1.30.2.6563
Player Version#: N/A All Players
PMS will segfault randomly when a transcode starts. Exact kernel message:
kernel: [36699.008952] PMS ReqHandler[34863]: segfault at 7f710f9b73f0 ip 00007f7109944f81 sp 00007f7106ed7ed0 error 4 in libnvcuvid.so.525.60.13[7f7109931000+92a000]
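
To decode that line: here is a minimal sketch, assuming the standard x86-64 page-fault error-code meanings (the variable names are mine; the values are taken straight from the kernel message above):

```c
/* Minimal sketch, assuming standard x86-64 page-fault error-code bits.
 * Values are copied from the kernel line above; names are illustrative. */
#include <stdio.h>

int main(void) {
    unsigned long ip   = 0x7f7109944f81; /* faulting instruction pointer      */
    unsigned long base = 0x7f7109931000; /* libnvcuvid.so.525.60.13 map base  */
    unsigned long size = 0x92a000;       /* libnvcuvid.so.525.60.13 map size  */
    unsigned long err  = 4;              /* "error 4" from the kernel line    */

    /* Offset of the crash inside libnvcuvid (0x13f81 here), usable with a
     * symbolizer against the matching driver build. */
    printf("offset into libnvcuvid: 0x%lx (inside mapping: %s)\n",
           ip - base, (ip - base) < size ? "yes" : "no");

    /* error 4 = user-mode read of a page that is not present */
    printf("fault: %s-mode %s, page %spresent\n",
           (err & 4) ? "user" : "kernel",
           (err & 2) ? "write" : "read",
           (err & 1) ? "" : "not ");
    return 0;
}
```

So the crash is a user-mode read of an unmapped address, roughly 0x13f81 bytes into libnvcuvid.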

The crash does not seem related to load, as it can happen with 1 transcode stream or 20. I have tried numerous different Nvidia drivers, but all exhibit the same issue. This configuration is a little unique in that it is a vGPU on ESXi.
PMS runs in an Ubuntu VM on the latest ESXi 7.0.3, with 8 cores and 16 GB RAM. The host has an E5-2697A CPU and an Nvidia A16 GPU, using the A16-8Q vGPU profile. This issue does NOT happen with the same server running on Windows.

Can provide verbose logging and crash dump files.

This is SDK v12.0. We have seen problems with it and are working on them.

Recommend Nvidia GPU driver 515.86.01.

This will give you AV1 decode capability as well.
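
If you want to sanity-check the decode side outside of PMS, here is a minimal sketch (not how PMS itself probes the GPU) that asks NVDEC whether the installed driver/GPU combination exposes AV1 decode. It uses the public Video Codec SDK API and links against -lcuda -lnvcuvid; error handling is omitted:

```c
/* Minimal sketch (not PMS code): query NVDEC for AV1 decode support.
 * Requires the CUDA and Video Codec SDK headers; link with -lcuda -lnvcuvid. */
#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <nvcuvid.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   /* cuvidGetDecoderCaps needs a current context */

    CUVIDDECODECAPS caps;
    memset(&caps, 0, sizeof(caps));
    caps.eCodecType      = cudaVideoCodec_AV1;
    caps.eChromaFormat   = cudaVideoChromaFormat_420;
    caps.nBitDepthMinus8 = 0;    /* 8-bit */

    if (cuvidGetDecoderCaps(&caps) == CUDA_SUCCESS)
        printf("AV1 8-bit 4:2:0 decode supported: %s (max %ux%u)\n",
               caps.bIsSupported ? "yes" : "no",
               caps.nMaxWidth, caps.nMaxHeight);

    cuCtxDestroy(ctx);
    return 0;
}
```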

Here are some more details about it:

Also attached is the dump file from the crash (it auto-uploaded as well, if that helps). I know this is a very edge-case setup, but I’m hoping it’s something that can be fixed or worked around.
49aebc1d-820c-477e-f4854a99-55b174c3.zip (356.3 KB)

I can’t dissect DMP files here. Only Engineering can.

The logs (from PMS and/or the relevant line from syslog) tell a great deal.

Your referencing libnvcuvid.so told me where this is headed.

I do request you keep DEBUG logging on and VERBOSE logging OFF.
We don’t need VERBOSE unless we’re in deep water up to our watches :wink:

Will they let you pass the GPU through with 515.86.01 drivers?
I’m not certain after a first read of that doc, but it looks like it won’t.

You might have no choice but to pass the entire GPU to the VM.

Let me dig through the logs. I had debug logging on the past few days to try to spot the issue, so they may have already rolled over, but I will check.

The next driver version I can use is 510.108.03, which is what I was on a couple of weeks ago with the same problem. I can keep going back to older versions to see if any of them resolve it. Is CUDA 11.4 more stable than v12 in this instance?

Here’s how they align.

510.x is OK for PMS 1.29.2 and older.

Right at 510, the transcoder added AV1 decode support, which required the Nvidia driver bump.

The NAS vendors (QNAP most commonly) have 512+.
I personally use 515.86.01, so it’s known stable.

525 drivers introduce the new V12.0 CUDA API.

It’s not 100% upgrade compatible. That’s where I’m at as I work with Engineering: figuring out what is failing.

You are safe on CUDA 11.7.

If you’re running the V15 vGPU manager, then you require 525 drivers.

PMS has a problem with the V12 API; we’re working on it.
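
If you want to confirm from inside the VM which CUDA API a given driver actually exposes, here is a minimal sketch (not PMS code) using the public driver API; a 515.86.01 install should report 11.7 and a 525.x install should report 12.0:

```c
/* Minimal sketch: report the CUDA driver API version the installed
 * Nvidia driver exposes (11070 -> 11.7, 12000 -> 12.0). Link with -lcuda. */
#include <stdio.h>
#include <cuda.h>

int main(void) {
    int v = 0;
    if (cuDriverGetVersion(&v) == CUDA_SUCCESS)
        printf("CUDA driver API: %d.%d\n", v / 1000, (v % 1000) / 10);
    return 0;
}
```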

AV1 isn’t the end of the world for me; I just need stability. I have tested drivers all the way back to 510.47 and still get the same exact crash. I will see if there is anything on the Nvidia side that might indicate the problem.

510 drivers are fine for PMS HEVC HDR.

You’re trying to use the vGPU manager and having trouble.

When you go more basic and pass the entire GPU into the VM, to the exclusion of all other VMs, does it now work?

If it does, you have your answer: ESXi or the Nvidia vGPU manager.

Passing through a physical GPU does work, but it defeats the purpose of a VM that can be moved around when doing host maintenance. This exact same setup does work on Windows without any issues with the vGPU, but the database performance was abysmal and caused lockups. I will try a few different driver combinations to see if it improves, and I am also going to try the libnvcuvid.so module from the 515.86.01 driver installer to see if I can shoe-horn it into working.

Windows support is understandably better.

  1. Much higher visibility.
  2. Microsoft controls everything.

On Windows (from what I’m told), PMS doesn’t have direct GPU access. It must pass all requests into Windows to be handled. With Windows controlling the tasking, sharing the GPU is a lot easier.

I think you might be pushing the bleeding edge here a bit.

That said, when you get a memory access violation, it usually means something on the other end screwed up. vGPU might not be designed for video transcoding?

That makes sense in regard to Windows. If I could get the database to stop being so sluggish (it essentially locks up the server for 30-60 seconds any time new media is added), I would be using it.

I would agree with you about vGPU not being meant for transcoding, but I have 4 other identical machines running Tdarr/ffmpeg, processing the same media at hundreds of FPS without a single hiccup. Something about how PMS calls NVENC/NVDEC occasionally upsets the driver, and I’m not sure what it is. What’s even more upsetting is that the same media file will cause a crash, the server will restart, and then it works just fine.

I do appreciate the help. I was very hesitant to post because, as you said, this is bleeding edge. I would venture no one else has a setup like this, and I’m blazing a trail. I knew I was in deep when a Google search for the error message found only 2 hits.

I hope we can get it resolved, but I understand this probably has more to do with Nvidia and its drivers than it does with PMS. I may revisit Windows PMS again and see if I can get the DB to play nice, but that’s trading one headache for another at that point.

You can’t compare stock FFMPEG and the Plex Transcoder equally.

  1. Codecs
    – FFMPEG codecs are compiled in
    – Transcoder codecs are dynamic and loaded on demand, per the Dolby licenses (a generic on-demand loading sketch follows after this list).
    – (Dolby EAE Audio is a PITA to work with and the code cannot be changed by us)

  2. Processing output
    – FFMPEG writes text output and the target video stream to a static file
    – Transcoder sends progress status over a socket to PMS
    – Transcoder manages output buffer utilization and trims buffers as needed.

  3. Realtime vs Batch
    – FFMPEG is shoot and go until done
    – Transcoder is dynamic with pause, skip forward & back, etc.

Summary: It’s far more complex, so it can’t be compared to FFMPEG.
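
To illustrate the on-demand loading point above (illustrative only; the actual loader, module, and symbol names inside the Transcoder aren’t public), the general pattern is roughly:

```c
/* Illustrative sketch only, not the actual Plex Transcoder loader.
 * Shows the generic "load a codec shared object on demand" pattern;
 * the library and symbol names are hypothetical. Link with -ldl. */
#include <stdio.h>
#include <dlfcn.h>

typedef int (*decode_init_fn)(void);

int main(void) {
    /* hypothetical codec module, resolved at runtime instead of link time */
    void *handle = dlopen("libexample_codec.so", RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "codec load failed: %s\n", dlerror());
        return 1;
    }
    decode_init_fn init = (decode_init_fn)dlsym(handle, "example_decode_init");
    if (init)
        init();
    dlclose(handle);
    return 0;
}
```

Stock FFMPEG, by contrast, has its codecs compiled in at build time.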

Could there be, or is there probably, something which upsets the Nvidia drivers?
Yes, there can be and there might be. Finding it is the $MEGA question.

This is the first case I’ve seen where vGPUs are in use.
If this happens more often and warrants making PMS fully support ESXi vGPU, then I know they will.

I imagine this is just going to be a “me” problem; I doubt many other people, if any, are going to be going this route in the near future.

If they do decide to look into things, that would be great, but I understand that this would be a pretty low priority. I will continue some different tests to try to remediate it and update the thread if anything of benefit comes out of it.

Thanks

At the moment, I think so.

Given where you’re finding the error, and how VMs work, I am going to assert that the vGPU mechanism isn’t providing the 100%-safe context save & restore operation it should.

By the rules of VMs, the guest OS should never know anything happened to it.

I assert the same rules should apply to vGPU. The guest should never know the GPU is shared / floating.

For what it’s worth, the vGPU profile I have been using assigns the entire GPU to the single VM, so there is no sharing in that regard. The host doesn’t seem to care when PMS crashes; it sees the load on the GPU drop, but nothing else.

I’m guessing that the highly specialized build of ffmpeg that PMS uses is finding an extreme edge-case issue with vGPU, and that’s why I’m basically the first one to report it.

VERY likely.

I wanted to come back around.

We found the SDK 12.0 problem.
Also found the 1080p problem.

The issue we found was that our buffers were being allocated dynamically (as needed) during the transcode.

SDK 12.0 doesn’t allow this anymore.

Reading into the SDK, might static buffers vs dynamic buffers be why you’re having difficulty?

We’ve set the pool to a generous (not very efficient) size as an initial test.
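
In NVDEC terms, the change is roughly this shape: size the decode surface pool up front instead of growing it mid-session. A minimal sketch with illustrative values (not the actual engineering code; assumes a current CUDA context and the Video Codec SDK headers):

```c
/* Minimal sketch (not PMS engineering code): pre-allocate a fixed,
 * generous NVDEC surface pool at decoder creation instead of growing it
 * mid-stream. Field values are illustrative; assumes a current CUDA context. */
#include <string.h>
#include <nvcuvid.h>

CUvideodecoder create_decoder_with_fixed_pool(unsigned int width, unsigned int height)
{
    CUVIDDECODECREATEINFO info;
    memset(&info, 0, sizeof(info));

    info.CodecType       = cudaVideoCodec_HEVC;
    info.ChromaFormat    = cudaVideoChromaFormat_420;
    info.OutputFormat    = cudaVideoSurfaceFormat_NV12;
    info.bitDepthMinus8  = 0;
    info.ulWidth         = width;
    info.ulHeight        = height;
    info.ulTargetWidth   = width;
    info.ulTargetHeight  = height;
    info.DeinterlaceMode = cudaVideoDeinterlaceMode_Weave;

    /* The generous, fixed pool: sized up front rather than grown on demand,
     * since growing it during the session is what SDK 12.0 no longer tolerates. */
    info.ulNumDecodeSurfaces = 20;
    info.ulNumOutputSurfaces = 2;

    CUvideodecoder dec = NULL;
    cuvidCreateDecoder(&dec, &info);   /* check the CUresult in real code */
    return dec;
}
```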

Would you be willing to try?

-or-

This is engineering-grade code (pre-QA).

I would love to. I will get this installed tonight and report back if anything happens. I have a suspicion it could be something on the Nvidia driver side that they call out in their release notes in a roundabout way. I will try this first and see how it goes. It fails pretty regularly, so I should know within a day or so of the family watching.

Thanks for the follow up!

Well, I was hopeful that this had resolved my issue, but after a couple of days it’s back. It ran really well with zero issues for almost 3 days, so I was about to reply that it was fixed, but back to the drawing board I suppose. On a positive note, it did fix the “no decoder surfaces left” error, and it seems like the transcodes start up slightly faster, so thank you for that!!

Again, I know this is a very edge-case setup, and I do appreciate the help! Nvidia had a couple of ideas to try regarding how I have MMIO set up in the VM, to see if this is a memory-mapping issue, but they have no idea what Plex is or what I’m trying to do, haha.

Thanks again. I’ll report back if anything happens, just in case someone stumbles upon this post down the road.