Server Version#: 1.30.2.6563
Player Version#: N/A All Players
PMS will segfault randomly when transcode starts. Exact kernel message:
kernel: [36699.008952] PMS ReqHandler[34863]: segfault at 7f710f9b73f0 ip 00007f7109944f81 sp 00007f7106ed7ed0 error 4 in libnvcuvid.so.525.60.13[7f7109931000+92a000]
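For anyone trying to read that kernel line, here is a minimal sketch (not a Plex or Nvidia tool) that decodes it. It assumes the standard x86-64 Linux "segfault at" log format and the usual page-fault error bits (bit 0 = protection violation vs. page not mapped, bit 1 = write vs. read, bit 2 = user mode). The names `parse_segfault` and `SEGFAULT_RE` are made up for illustration.

```python
import re

# Assumed format of the x86-64 Linux "segfault at" kernel log line.
SEGFAULT_RE = re.compile(
    r"\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

def parse_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    info = {
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "ip": ip,
        "access": "write" if err & 2 else "read",
        "cause": "protection violation" if err & 1 else "page not mapped",
        "user_mode": bool(err & 4),
        "object": m["obj"],
        "offset": None,
    }
    if m["base"] is not None:
        # Offset of the faulting instruction from the start of the mapping.
        info["offset"] = ip - int(m["base"], 16)
    return info

LINE = ("kernel: [36699.008952] PMS ReqHandler[34863]: segfault at 7f710f9b73f0 "
        "ip 00007f7109944f81 sp 00007f7106ed7ed0 error 4 "
        "in libnvcuvid.so.525.60.13[7f7109931000+92a000]")

info = parse_segfault(LINE)
print(info["object"], hex(info["offset"]), info["access"], info["cause"])
```

Read this way, "error 4" is a user-mode read of an address that is not mapped, and the faulting instruction sits at offset 0x13f81 into the libnvcuvid.so mapping. Comparing that offset across crashes and driver versions may show whether it is always the same code path.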
Crash does not seem related to load as it can happen with 1 transcode stream or 20. I have tried numerous different nvidia drivers but all exhibit the same issue. This configuration is a little unique in that it is a vGPU on ESXi.
PMS is an Ubuntu VM on ESXi 7.0.3 latest, 8 cores and 16GB RAM. E5-2697A CPU and Nvidia A16 GPU in the host, A16-8Q vGPU profile. This issue does NOT happen with the same server operating on Windows.
Also attached is the dump file from the crash (it auto uploaded as well if that helps). I know this is a very edge-case setup but hoping it’s something that can be fixed or worked around. 49aebc1d-820c-477e-f4854a99-55b174c3.zip (356.3 KB)
Let me dig thru the logs, I had debug logging on the past few days to try and spot the issue so they may have already rolled over but I will check.
The next driver version I can use is “510.108.03” which is what I was on a couple weeks ago with the same problem. I can keep going back older to see if any of them resolve it. Is CUDA 11.4 more stable than v12 in this instance?
AV1 isn’t the end of the world for me, I just need stability. I have tested drivers all the way back to 510.47 and still get the same exact crash. I will see if there is anything on the Nvidia side that might indicate the problem.
Passing thru a physical GPU does work, but defeats the purpose of a VM that can be moved around when doing host maintenance. This exact same setup does work on Windows without any issues with the vGPU, but the database performance was abysmal and caused lockups. I will try a few different driver combinations to see if it improves. I am also going to try the libnvcuvid.so module from the 515.86.01 driver installer to see if I can shoe-horn it into working.
On Windows (from what I’m told), PMS doesn’t have direct GPU access. It must pass all requests into Windows to be handled. With Windows controlling the tasking, sharing the GPU is a lot easier.
I think you might be pushing the bleeding edge here a bit.
That said, when you get a memory access violation, it usually means something on the other end screwed up. vGPU might not be designed for video transcoding ??
That makes sense in regards to Windows. If I could get the database to stop being so sluggish (it essentially locks up the server for 30-60 seconds any time new media is added) I would be using it.
I would agree with you on the vGPU not for transcoding, but I have 4 other identical machines running Tdarr/ffmpeg processing the same media at hundreds of FPS without a single hiccup. Something about how PMS is calling the nvenc/nvdec occasionally upsets the driver, and I’m not sure what it is. What’s even more upsetting is the same media file will cause a crash, server will restart and then works just fine.
I do appreciate the help, I was very hesitant to post because, as you said, this is bleeding edge. I would venture no one else has a setup like this and I’m blazing a trail. I knew I was in deep when a google search for the error message found only 2 hits.
I hope we can get it resolved but I understand this probably has more to do with nvidia+drivers than it does PMS. I may re-visit the Windows PMS again and see if I can get the DB to play nice, but it’s trading one headache for another at that point.
You can’t compare stock FFMPEG and the Plex Transcoder equally.
Codecs
– FFMPEG codecs are compiled in
– Transcoder codecs are dynamic and loaded on demand as per Dolby licenses.
– (Dolby EAE Audio is a PITA to work with and the code cannot be changed by us)
Processing output
– FFMPEG writes text output and the target video stream to a static file
– Transcoder sends progress status over a socket to PMS
– Transcoder manages output buffer utilization and trims buffers as needed.
Realtime vs Batch
– FFMPEG is shoot and go until done
– Transcoder is dynamic with pause, skip forward & back, etc.
Summary: It’s far more complex, so it can’t be compared to FFMPEG
Could there be, and is there probably, something which upsets the Nvidia drivers?
Yes there can and might be. Finding it is the $MEGA question.
This is the first case I’ve seen where vGPUs are in use.
If this happens more often and warrants making PMS fully support ESXi vGPU then I know they will.
I imagine this is just going to be a “me” problem, I doubt many other people, if any, are going to be going this route in the near future.
If they do decide to look into things it would be great, but I understand that this would be a pretty low priority. I will continue some different tests to try and remediate it and update the thread if anything of benefit comes out of it.
For what it’s worth, the vGPU profile I have been using is assigning the entire GPU to the single VM, so there is no sharing in that regard. The host doesn’t seem to care when PMS crashes, it sees the load on the GPU drop but nothing else.
I’m guessing that the highly-specialized build of ffmpeg that PMS uses is finding an extreme edge case issue with vGPU and that’s why I’m basically the first one to report it.
I would love to, I will get this installed tonight and report back if anything happens. I have a suspicion it could be something on the Nvidia driver side that they call out in their release notes in a round-about way. I will try this first and see how it goes. It fails pretty regularly so I should know in a day or so of the family watching.
Well, I was hopeful that this had resolved my issue, but after a couple days it’s back. It ran really well with zero issues for almost 3 days, so I was about to reply back that it was fixed, but back to the drawing board I suppose. On a positive note, it did fix the “no decoder surfaces left” error and it seems like the transcodes start up slightly faster, so thank you for that!!
Again, I know this is a very edge case, and I do appreciate the help! Nvidia had a couple ideas to try in regards to how I have MMIO setup in the VM to see if this is a memory mapping issue, but they have no idea what Plex is or what I’m trying to do haha.
Thanks again, I’ll report back if anything happens just in case someone stumbles upon this post down the road.