Yes, I believe Debian 11 was originally based on the 5.10 kernel. Not sure what OpenMediaVault 6 shipped with originally.
Crap. You probably got the backports packages too?
Where I’m at is this:
- Ubuntu 20.04.5 LTS or Debian 11 - OK
- Nvidia P2200 - OK
- Nvidia drivers 515.86.01 - OK
Put those 3 together and it’s rock solid.
I don’t have a machine I can vet OMV with. Without that, I can’t come up with a recipe / template for everyone.
Theoretically,
- Debian 11 - stock and up to date kernel
- OMV base package without wonky backports which will jack up the kernel and drivers
- Nvidia 515.86.01
That should work because you’re on the Debian base with the OMV service apps on top.
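If you want to sanity-check where a box stands against that recipe, a few read-only checks will do it (the ~bpo11 suffix is how bullseye-backports packages are versioned):

```
# Expect the stock bullseye kernel (5.10.x)
uname -r

# Expect no output here; anything versioned with ~bpo11 came from backports
dpkg -l | grep '~bpo11'

# Expect 515.86.01 once the driver is installed
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```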
HW transcoding in Linux is still “Rocket Science” to a degree.
Yeah my install is full of backports. Sorting those out will be a ton of fun. Depending on how much time I have this evening, I might be able to get a VM spun up with a GTX1080Ti to see how it behaves. Hopefully with it being the same approximate generation of GPU, it should behave similarly. If I can find an appropriately-dimensioned PCI riser cable, and with a bit of wiggling, I might be able to cram the Tesla P4 in the case…
I don’t have a case I can move the P2200 into.
The best I can do is a NUC8 and use QSV which doesn’t help us here.
I’ve been keeping my eye on this thread, and it sure sounds like he’s describing the same behavior I’m seeing. Transcodes work for non-web clients, web client streams start properly even with transcoding, but changing quality in the web client breaks it. He’s also reporting the transcode runner threads being killed prematurely, which I think I’m seeing in the logs.
Have a look at this quick summary table I put together of my “testing matrix” (filtered to unique combinations only) - the combination of using the web client and having hardware-accelerated video encoding enabled seems to be important. I don’t know enough about the inner workings of Plex, but is there any reason for the web client to know about hardware acceleration and thus make the request slightly differently, or for Plex to handle requests from the web client differently from an app? I can’t imagine either answer is yes, but somehow when I combine those two factors I get problems.
Testing Matrix Summary.txt (2.2 KB)
EDIT: I tried using the Plex Media Player client on my Arch Linux desktop, and it also suffered from the same issue.
I spun up a VM with my 1080Ti passed through, running Debian 11 with kernel 5.10, driver 525, and PMS 1.31.1.6641 in docker and… it works perfectly. I’m going to keep playing around and see if I can break it, otherwise I’ll need to go ahead and transplant my Tesla P4.
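For reference, this is roughly how the GPU gets exposed to the PMS container in that VM - a sketch rather than my exact command, and it assumes the NVIDIA container toolkit is installed; the paths and timezone are placeholders:

```
# Official Plex image, GPU exposed via the NVIDIA container runtime.
# Swap /opt/plex/config, /srv/media and TZ for whatever you actually use.
docker run -d \
  --name plex \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e TZ="Etc/UTC" \
  -v /opt/plex/config:/config \
  -v /srv/media:/data \
  -p 32400:32400 \
  plexinc/pms-docker:latest
```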
I now have 525.60.13 with PMS 1.31.1.6638 and the P2200, nvdecExtraFrames=2 – no errors.
Looking at your PMS log file, you are getting similar types of errors to the ones I’m seeing!
(My words, not language copied straight from the logs.)
- 21:02:40.144 - HEVC test succeeded
- 21:02:40.209 - Killing job - but why is this job being killed proactively by PMS? This job is running transcode session “72ok1e2kps6zv8bart3hjzcm” (I think?)
- 21:02:40.210 - Transcode is going to start cleaning the directory for session “72ok1e2kps6zv8bart3hjzcm”
Client still trying to GET files that it thinks are transcoded and ready for use:
- Jan 26, 2023 21:02:40.326 [0x7f265f857b38] Debug — Completed: [192.168.2.150:33988] 404 GET /video/:/transcode/universal/session/72ok1e2kps6zv8bart3hjzcm/0/3.m4s (10 live) TLS GZIP 0ms 504 bytes (pipelined: 4)
- Jan 26, 2023 21:02:40.326 [0x7f265f857b38] Debug — Completed: [192.168.2.150:33966] 404 GET /video/:/transcode/universal/session/72ok1e2kps6zv8bart3hjzcm/1/3.m4s (10 live) TLS GZIP 0ms 504 bytes (pipelined: 33)
Client still waiting, has been for over 1 minute:
- Jan 26, 2023 21:03:57.752 [0x7f265f857b38] Debug — Completed: [192.168.2.150:33966] 404 GET /video/:/transcode/universal/session/72ok1e2kps6zv8bart3hjzcm/1/3.m4s (8 live) TLS GZIP 0ms 504 bytes (pipelined: 46)
Again, with the above: PMS decided to stop the transcode, PMS decided to start cleaning up the transcode session folder, but the client (Chrome/browser) is just left waiting.
But here’s a thought: why did it decide to stop session 72ok1e2kps6zv8bart3hjzcm, and why was it still checking VAAPI/NVENC even though the transcoder said it was happy to start session 72ok1e2kps6zv8bart3hjzcm? It may be nothing, and there may be a good reason. But I’ve seen a few comments along the lines of “why was the process killed with -9”, when it’s clear in the logs that the kill was sent by PMS and not by something else on the machine.
For me, when I’ve been doing my testing, I have one browser filling my screen showing the PMS console, and a second browser shrunk down so I can use that to try and watch something… At this point, if I see any “404 GET” errors, I know the video playback will fail. I can then just switch the bandwidth in the viewer and it may (usually does) start transcoding properly. Once it’s playing OK, if I adjust the Mbps again, it’ll fail… Again, this is what I’m seeing with my setup.
If I turn off all HW transcoding features, it works fine; turn it on again and I get this strange round-robin type of error.
After enabling the bullseye-backports repository and doing a full upgrade from it, the problem has returned! Not as frequently as I was seeing on OMV, but it is happening. So, sure looks like something in that repository is breaking it… not sure how to go about figuring out what, unless I start updating packages one-at-a-time (there’s a lot of them).
EDIT: nope, scratch that. I went back to confirm my findings, and got it to break without having updated from bullseye-backports. It’s very infrequent, but it does happen.
What happens if you leave “Use hardware acceleration when available” checked, but turn off “Use hardware-accelerated video encoding”?
What distro are you using?
@Adarnof I’m running an Ubuntu VM (from within ESXi), with passthrough for the GPU. For a lot of my recent testing I’ve had a snapshot running the latest experimental PMS plus the latest 525 Nvidia drivers. Today though, I’m just running my “normal snapshot”, which is PMS 1.29 and the 515 drivers.
What happens if you leave “Use hardware acceleration when available” checked, but turn off “Use hardware-accelerated video encoding”?
I nearly didn’t try this (your question above), thinking “that’s not going to do anything!”… But, erm, it actually seems to fix the problem with the browser??? How did you get on with testing this yourself? For me, I can switch between bandwidth settings (Mbps) from Chrome/browser, and though it’s slow to pick up again, it seems to just work 🤷♂️ I’ve tried on a couple of 4K files and a 1080p file and haven’t had one repeat of the issue. The only thing I saw, once, was it switched to non-NVENC transcoding, but it was just the once and I can’t get it to repeat.
I took a screenshot just to show HW transcoding was working.
I did have tone mapping off for this, so I’m just going to try again with it turned on. Though, to be fair, I had no colour issues with it off (and I used two 4K HDR files), which I think is quite odd!
Just posting back to say tone mapping didn’t alter performance. I tried it on and off, so the issue(s) isn’t connected to that.
So why would this (being unchecked) make such a difference? Why does PMS kill off the process but not say why?
Use hardware-accelerated video encoding
I only noticed the hardware-accelerated video encoding behaviour as I was migrating my plex install from a computer with a GT1030 (NVDEC support, but no NVENC) to one with a Tesla P4 (both NVDEC and NVENC), and that’s when this problem appeared. If you take a look at some test results I did earlier, are you seeing the same behavior?
TL;DR: I’m not really seeing the same behaviour, mainly because all my tests are from the browser; since that’s not working correctly, my results are meaningless.
That said, when I looked at your PMS logs, I could see the same pattern as I saw with mine when using web client… I don’t think that’s a coincidence…
…
I’m on a P2200 and, to be fair, it’s been solid as a rock until the latest public version of PMS. That’s when I had transcoding issues up the wazoo, and that’s when my other problems started. Specifically, the broken version of PMS was causing issues for remote family members who run the cheap Roku boxes; they all need transcoding since my libraries are all HEVC. So, to “test” what was going on, I started to use the Chrome/browser web client… and that’s when I noticed the strange transcoding issues.
So now I have a VM that I’ve “split in half” (so to speak) by using snapshots. One half is on PMS 1.29.x with the 515 Nvidia drivers (my daily-use VM); the other half (snapshot) is on the latest experimental version of PMS with the 525 drivers. It’s easier for me to do it this way since I only have one GPU in my server… The only other recent change to my whole setup, because of all the VAAPI rubbish, was to remove the ESXi/VM “svga gpu”. So now I only have one GPU that Ubuntu sees (in the VM) and it’s the Nvidia P2200.
This has made no difference to the VAAPI shenanigans, as we know, but at least it means things are a little cleaner as far as my VM and PMS are concerned.
So… All in all, my normal snapshot is running but has the strange “encoder” issue (when using a browser), and the experimental snapshot is in the same boat. Since I don’t have a Roku, I can’t properly test the situation I originally found myself in, and I have no way of talking my 76-year-old father-in-law through the hoops to test things for me… This leaves me in a strange position: I’m not helping the community with the latest transcoding and CUDA v12 issues because of the browser issue, but because of the browser issue, I can’t prove or disprove whether things are actually working OK with PMS in general.
It’s been a long time since I’ve really had to dig this deep into PMS. And it’s not helpful that the logs, debug logs, and transcoding debug logs are just hectic. Why would PMS decide to kill off/stop a transcode job but not actually give a reason?
I “borrowed” the 1080Ti from my Windows VM to create a Debian 11 VM (what OMV is based on, to see if it’s doing something screwy), and tested it to see how it behaved. I’ve attached the record of my attempts: in essence, I would try a quality, and if it worked, request the next lower quality; if it failed, I would request the previous higher quality. If it failed 3 times in a row going back and forth (lower failed, higher worked, lower failed, etc.), I would move to the next lower quality anyway (this only happened on the OMV tests).
- Debian 11 base: failed 23/100
- Debian 11 updated from bullseye-backports: failed 24/100
- OMV 6: failed 58/100
Yikes. Worth noting that OMV 6 tests were done on my server, not the VM on my desktop, so it’s different hardware. I’ll spin up an OMV VM with my 1080Ti to see if this is a hardware-specific difference. Also because I wasn’t randomly sampling, it could be that I took more samples from a quality which fails more often and thus skewed the results. Either way it shows that all the conditions fail.
@ChuckPa how many times did you test? I had a few stretches of 10+ succeeding, only for there to be a ton of failures after. See these results, where I list bitrate and pass/fail in the order I tested.
OMV6.txt (831 Bytes)
debian11_backports.txt (943 Bytes)
debian11_base.txt (932 Bytes)
UPDATE: made an OMV6-based VM with my 1080Ti and… failed 23/100. Very consistent with the others. I’d need to do some proper random sampling to ensure I wasn’t biasing the results from my OMV6+Tesla P4 server, but there could be a contribution from different hardware.
OMV6_VM.txt (938 Bytes)
Hi all, I’m also having the same issue as described here.
Hardware:
Intel Xeon E5-2690 v3
Nvidia Tesla P4
Environment:
ESXi 6.7u3
Ubuntu 22.04 virtual machine
Kernel 5.15.0-58-generic
P4 passthrough to Ubuntu 22.04 virtual machine
PMS 1.28.0.5999 installed directly through dpkg to Ubuntu 22.04
Nvidia drivers: “nvidia-utils-470-server and nvidia-driver-470-server”
Around 3:45-3:51pm on 1/30/2023 on my Android, I started playing The Lord Of The Rings: The Return of the King in h264 transcoded down to 720p and then in h265 transcoded down to 720p. Both of these streams worked perfectly without issue.
Around 3:54pm I started the same movie on the Windows client in h264, then transcoded down to 720p. When I initially did this, a few seconds were transcoded, and then I could see (in nvidia-smi) that the transcode process died. The stream continued playing for approximately 5 more seconds before it quit to the movie screen without error.
Around 3:57pm I started the same movie on the Windows client in h265, then transcoded down to 720p. I could initially see a very quick process start in nvidia-smi then die. I proceeded to see that the stream played for around 5-10 more seconds before dying as well, without error.
In the console, I can see that the Transcode is testing API vaapi
Then it fails with the error: [Req#50bc/Transcode] [FFMPEG] - libva: vaGetDriverNameByIndex() failed with unknown libva error, driver_name = (null)
Immediately followed by: [Req#50bc/Transcode] [FFMPEG] - Failed to initialise VAAPI connection: -1 (unknown libva error).
and finally: [Req#50bc/Transcode] Codecs: hardware transcoding: opening hw device failed - probably not supported by this system, error: I/O error
This tells me that the playback after changing quality settings may still have been coming from the buffer of the original quality; then, when it went to switch over to the transcoded files, they weren’t there, so it died.
After the stream dies, I see lots of these types of errors: [Req#48f5] Versions: skipping items for generator 1549: unable to generate version set query
I know, based on the previous comments, that we should be trying a newer driver version; this is just the one that Nvidia’s site recommends as the most recent for the P4.
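For what it’s worth, the newer branch is available in the stock Ubuntu 22.04 repos, so moving off 470-server would presumably look something like this (untested on my end, so just a sketch):

```
# Remove the 470-server packages, then pull in the 525-server branch.
# Package names assume Ubuntu's standard repositories.
sudo apt purge nvidia-driver-470-server nvidia-utils-470-server
sudo apt autoremove
sudo apt install nvidia-driver-525-server nvidia-utils-525-server libnvidia-encode-525-server
sudo reboot
```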
Please let me know if there is anything else I can help show.
Just got home. Will catch up after dinner
Do you have any way to test pure Debian 11 without OMV installed?
(peel the layers)
ALL.
Because of all the possible permutations involved, we need to reduce this to the lowest common denominator.
- PMS server
- Plex/Web playback in the browser - Playback quality 20 Mbps - 1080p
nvdecExtraFrames="2"
If it doesn’t work here, there’s no reason to proceed because the browser makes the server do all the work.
EDIT:
IF you can make a minor adjustment to nvdecExtraFrames and make it work, please annotate it this way:
e.g.
Test 2: FAIL (nvdecExtraFrames=2)
Test 2: PASS (nvdecExtraFrames=4)
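If you’re not sure where to set nvdecExtraFrames: it should just be another attribute on the <Preferences> element in Preferences.xml, the same place the other hidden settings live. A rough sketch for a .deb install (default path; stop PMS first and keep a backup):

```
PREFS="/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Preferences.xml"

sudo systemctl stop plexmediaserver
sudo cp "$PREFS" "$PREFS.bak"

# Remove any existing value, then add nvdecExtraFrames="2" to the <Preferences> element
sudo sed -i 's/ nvdecExtraFrames="[0-9]*"//' "$PREFS"
sudo sed -i 's/<Preferences /<Preferences nvdecExtraFrames="2" /' "$PREFS"

sudo systemctl start plexmediaserver
```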
Test 1:
Test 2:
Test 3:
Test 4:
Test 5:
Please reply using:
Distro Name and version:
Graphics card:
Driver version installed:
Test 1: Pass/Fail
Test 2: Pass/Fail
Test 3: Pass/Fail
Test 4: Pass/Fail/NA (Card cannot transcode HEVC/HDR)
Test 5: Pass/Fail/NA (Card cannot transcode VP9)
I will compile everyone’s results into a matrix and discuss with Engineering.
thanks.
Yes, I created a Debian 11 virtual machine using QEMU with a GTX 1080Ti passed through, kernel 5.10, driver 515, PMS 1.31.1.6641, and experienced the same issues. That’s the “debian11_base.txt” results in my post above - I listed the bitrates tested and the order I did them in. Notice that some bitrates work on one attempt and fail on another - not sure your request for data will capture this variability. I did no updates from what the installer provided, other than installing nvidia-driver, nvidia-smi, libnvidia-encode1, nvidia-container-runtime, and docker via the script they provide. Maybe I should try a VM without docker - I assume I’ll need to install CUDA properly in that case.
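For the record, the extra packages on that stock Debian 11 VM amounted to roughly the following (Debian non-free enabled for the NVIDIA bits, Docker from their convenience script, and nvidia-container-runtime from NVIDIA’s own repo):

```
# NVIDIA driver + NVENC runtime from Debian non-free
sudo apt install nvidia-driver nvidia-smi libnvidia-encode1

# Docker via the convenience script, then the NVIDIA container runtime
curl -fsSL https://get.docker.com | sh
sudo apt install nvidia-container-runtime
```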
After removing all nvidia packages manually, rebuilding the initramfs, and rebooting,
I installed:
sudo apt install nvidia-driver-525-server nvidia-settings
Here’s what you get:
[chuck@glockner ~.2006]$ dpkg -l | grep nvidia
ii libnvidia-cfg1-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-525-server 525.60.13-0ubuntu0.20.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-decode-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-525-server:amd64 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ml-dev 10.1.243-3 amd64 NVIDIA Management Library (NVML) development files
ii nvidia-compute-utils-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-cuda-dev 10.1.243-3 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 10.1.243-3 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 10.1.243-3 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 10.1.243-3 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver metapackage
ii nvidia-kernel-common-525-server 525.60.13-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
ii nvidia-opencl-dev:amd64 10.1.243-3 amd64 NVIDIA OpenCL development files
ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 10.1.243-3 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-utils-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver support binaries
ii nvidia-visual-profiler 10.1.243-3 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii xserver-xorg-video-nvidia-525-server 525.60.13-0ubuntu0.20.04.1 amd64 NVIDIA binary Xorg driver
[chuck@glockner ~.2007]$
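And a quick check that the 525.60.13 module is actually what’s loaded, not just what dpkg thinks is installed:

```
# Both should report 525.60.13 once the new driver is loaded
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
cat /proc/driver/nvidia/version
```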