NVIDIA hardware acceleration inconsistently working with web streaming

Thanks. Stand by a few minutes, please. We’re testing here.

Time to get serious… haha

Please change nvdecExtraFrames="8" and retest.

The plan is:

8 - 4 - 6 - 3 as test candidate values.
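
For anyone following along, here is a minimal Python sketch of switching the value between test runs. It assumes Preferences.xml is a single root element with settings stored as attributes (which is what the nvdecExtraFrames="8" syntax suggests). The sample XML string and the helper name `set_extra_frames` are made up for illustration, and it is probably safest to stop PMS before editing the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for Preferences.xml (assumption:
# settings are attributes on a single root element).
SAMPLE = '<Preferences FriendlyName="example" nvdecExtraFrames="4"/>'

def set_extra_frames(xml_text, value):
    """Return xml_text with nvdecExtraFrames set to value, or removed if value is None."""
    root = ET.fromstring(xml_text)
    if value is None:
        # Removing the attribute corresponds to "no setting at all".
        root.attrib.pop("nvdecExtraFrames", None)
    else:
        root.set("nvdecExtraFrames", str(value))
    return ET.tostring(root, encoding="unicode")

# Walk through the candidate values in the order given above.
for candidate in (8, 4, 6, 3):
    print(set_extra_frames(SAMPLE, candidate))
```

Passing `None` removes the setting entirely, which is handy for a "no override" baseline run.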

This is unclear.

How about if I ask this way?

  1. Play 1080p video but reduce bitrate to 2 Mbps 720p
  2. Play 4K video reducing bitrate to 4 Mbps

We see two different problems:

  1. Transcoding 4K will fail with SDK errors (that’s what should be fixed)
  2. Transcoding 1080p fails (should be fixed), but I think that’s what you’re still seeing?

Here’s a screenshot of “nothing happening” after waiting for 30 seconds. I’ve sent you the logs of situations like this, where no error appears, but nothing happens on the client. No activity is seen in nvidia-smi, no CPU usage is reported in the dashboard. That buffering reported on the progress bar is from before I changed quality.

While testing the various nvdecExtraFrames values, I have to try changing stream quality multiple times (reloading the page between attempts) before either the “No Decoder Surfaces Left” error appears or the stream works.

Do you have a real STANDARD video test file?

I don’t trust that 2K file.

I’m trying to establish a baseline.

Here’s a 50MB snippet of a proper 1920x1080 h264 file.

I’ve repeated testing with this file, and see “No Decoder Surfaces Left” when values of 4 and 3 are used, but both 8 and 6 work.

However, I’m still plagued by the “nothing happens most of the time” issue, which I think might be a separate problem to what you’re diagnosing. This happens when I play a 1080p file or 4K file, and happens at every bitrate I’ve tried, but not every time I test it. And again, only when requested from the web client, not the Android client.

I need to see these logs please.

Nice and clean. No other streaming if possible.

I need to know the time of each test, what’s being tested, and the result (build a matrix).

Given the successes we’ve had with others, I still think there’s something else going on.

What’s the default kernel for Debian 11 ?

Documentation shows me:

Linux 5.10 LTS kernel
Debian 11 (Bullseye) was released on 14 August 2021. It is based on the Linux 5.10 LTS kernel and will be supported for five years.

You’re showing me a 6.0 kernel.

Jan 27, 2023 19:23:25.206 [0x7f862b148b38] INFO - Plex Media Server v1.31.1.6638-fad659ac6 - Docker Docker Container x86_64 - build: linux-x86_64 debian - GMT -05:00
Jan 27, 2023 19:23:25.206 [0x7f862b148b38] INFO - Linux version: 6.0.0-0.deb11.6-amd64, language: en-US
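
A quick way to check this from the PMS log is sketched below in Python. The log line is copied from above; the helper name `kernel_major_minor` and the regex are just illustrative, and the "stock" value comes from the Debian documentation quoted above.

```python
import re

# Log line copied from the PMS log above.
LOG_LINE = ("Jan 27, 2023 19:23:25.206 [0x7f862b148b38] INFO - "
            "Linux version: 6.0.0-0.deb11.6-amd64, language: en-US")

# Debian 11's stock kernel series, per the documentation quoted above.
STOCK_KERNEL = (5, 10)

def kernel_major_minor(line):
    """Return (major, minor) from a 'Linux version:' log line, or None if absent."""
    match = re.search(r"Linux version: (\d+)\.(\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

found = kernel_major_minor(LOG_LINE)
if found == STOCK_KERNEL:
    print(found, "is the stock kernel series")
else:
    print(found, "is not the stock 5.10 kernel series")
```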

I’ve had a look and it seems OpenMediaVault ships with the bullseye-backports repository enabled - that’s where it’s pulled kernel 6.0 from while updating.

Not trying to be insensitive here. Please understand where I’m at.

I’ve not even tested main supported distributions with kernel 6.0.

I’m willing to bet this is OMV-specific. It’s a stickler for other things too. (it has the reputation).

I’m going to see if I can craft an OMV VM and then pass through a GPU.

This isn’t going to be an easy fix.

Can you downgrade to the Nvidia driver that is known to work (515.86.01, CUDA 11.7)?

I greatly appreciate all the effort you’re putting in to help debug this. It took some time, but I’ve put together a collection of logs and results from testing a few variables:

  • Client: web, or NVIDIA Shield app.
  • Source: either 1080p or 4k, starting at Direct Play for 1080p, “Maximum Quality” (24Mbps?) for 4k on web client, and Direct Play for 4k on NVIDIA Shield
  • Target bitrate: 20, 12, 10, 8, and 4 Mbps
  • nvdecExtraFrames: 8, 6, or none (setting removed from Preferences.xml)
  • PMS settings: hardware-accelerated video encoding enabled, or disabled
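
The variables above can be laid out as a matrix with a few lines of Python. This is only a sketch: the field names are made up, and the full cross product is listed here even though the actual testing may not have covered every combination.

```python
from itertools import product

# Test variables as listed above; the names are illustrative only.
clients = ["web", "NVIDIA Shield"]
sources = ["1080p", "4k"]
target_mbps = [20, 12, 10, 8, 4]
extra_frames = [8, 6, None]   # None = setting removed from Preferences.xml
hw_encoding = [True, False]   # hardware-accelerated video encoding on/off

# One row per combination, with empty columns to fill in while testing.
matrix = [
    {
        "client": client,
        "source": source,
        "target_mbps": mbps,
        "nvdecExtraFrames": frames,
        "hw_encoding": hw,
        "time": "",
        "result": "",
    }
    for client, source, mbps, frames, hw in product(
        clients, sources, target_mbps, extra_frames, hw_encoding
    )
]

print(len(matrix), "combinations")
print(matrix[0])
```

Filling in "time" and "result" for each row as you go makes it much easier to line the results up against the logs afterwards.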

For testing with the web client, I would press play, wait for the stream to start (it ALWAYS started, even when transcoding 4k), and then request a change in quality to the target bitrate. I would wait ~15-30 seconds; if I saw no activity, I would press “stop” and test again. I did not refresh the page between tests.

For testing with the NVIDIA Shield app, unfortunately I was not sitting in view of my computer, so the timestamps may be a bit off, but I wrote down the time on my phone. I would press play, wait for the stream to start (it always started due to direct streaming), and then attempt to switch quality to the target bitrate. Failure with this app would look like the episode had ended, prompting me to start the next episode of the series.

Not sure what I changed, but the weird “doing nothing” problem in the web client was much more predictable in this round of testing: either a combination of variables worked, or it didn’t, and it would or wouldn’t work every time. I swear I didn’t change anything - maybe I was just paying more attention to what I was doing this time… Good news at least is that it’s reproducible.

I’ve attached a zip of the log files from each combination of variables. There is a text file with the same name that includes the timestamps of each attempt, the source, the target bitrate, and the result (with comments, if I noticed anything). There were a few hiccups with the NVIDIA Shield app which required restarting it; the times are noted (the app has always been a bit unstable when switching resolutions and refresh rates).

Testing Matrix.zip (2.1 MB)

The tl;dr is that each of the variables I listed had an impact on the result. The NVIDIA Shield was weird, playing 4k perfectly every time but failing with 1080p. Maybe a bug in the app.

In terms of kernel and driver changes, that’s likely what I’ll try next, however I’m on day 3 of a really inconveniently timed hard drive recovery which would require restarting if interrupted. ETA Sunday, optimistically. When it finishes I will downgrade the driver and kernel and see what happens.

But wait, it gets weirder.

I was unhappy with the change in behavior in my last test, so I considered how my actions changed: when I was seeing weirdness before, I would stay in the same “playback session” and just keep flipping the quality until something worked; when producing that matrix, I would restart from the default (Direct Play or Maximum Quality) every time. So this time I flipped quality from within the same playback session and oh boy it’s weird.

Here are the logs and a text file explaining the order I did things in, with timestamps. Good news is everything is perfectly reproducible. Bad news is it makes no sense (to me, at least).

4k_switching_quality_different_orders.txt (1.4 KB)
4k_switching_quality_different_orders.zip (881.1 KB)

This was done with nvdecExtraFrames set to 8.

I see a pattern in the logs: when the change in quality fails, there are 404 errors every second for about a minute, then they stop; if it works, these messages are not present:

Jan 28, 2023 02:06:20.348 [0x7f793e277b38] DEBUG - Request: [192.168.2.150:35604 (Subnet)] GET /video/:/transcode/universal/session/aakj49ae1un24keudzm7q2jp/1/header (7 live) #25150 TLS GZIP Signed-in
Jan 28, 2023 02:06:20.348 [0x7f7940b7ab38] DEBUG - Completed: [192.168.2.150:35604] 404 GET /video/:/transcode/universal/session/aakj49ae1un24keudzm7q2jp/1/header (7 live) #25150 TLS GZIP 0ms 496 bytes (pipelined: 14)
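
To spot this pattern without scrolling through the whole log, something like the Python sketch below could count the 404s per transcode session. The regex is an assumption based only on the two lines above, so it may need adjusting for other log formats; `count_404s` is a made-up helper name.

```python
import re
from collections import Counter

# The two log lines from above, used as sample input.
SAMPLE_LOG = """\
Jan 28, 2023 02:06:20.348 [0x7f793e277b38] DEBUG - Request: [192.168.2.150:35604 (Subnet)] GET /video/:/transcode/universal/session/aakj49ae1un24keudzm7q2jp/1/header (7 live) #25150 TLS GZIP Signed-in
Jan 28, 2023 02:06:20.348 [0x7f7940b7ab38] DEBUG - Completed: [192.168.2.150:35604] 404 GET /video/:/transcode/universal/session/aakj49ae1un24keudzm7q2jp/1/header (7 live) #25150 TLS GZIP 0ms 496 bytes (pipelined: 14)
"""

# Matches "Completed: [...] 404 GET .../session/<id>/" and captures the session id.
PATTERN = re.compile(
    r"Completed: \[[^\]]+\] 404 GET /video/:/transcode/universal/session/([^/]+)/"
)

def count_404s(log_text):
    """Count completed 404 responses per transcode session id."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

for session, hits in count_404s(SAMPLE_LOG).items():
    print(session, hits)
```

A session with dozens of hits is a failed quality change; a working one should not show up at all.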

My next task will be to produce a matrix of starting versus ending quality, and seeing which combinations work. A tomorrow problem, I think, given the number of combinations…

Hard drive recovery finished on schedule so I was able to test driver and kernel changes today. I installed kernel 5.19.0-0.deb11.2-amd64 and the nvidia-tesla-470-driver group of packages. nvidia-smi confirms driver version 470.161.03 with CUDA 11.4. I also set nvdecExtraFrames to 16 for fun.

Attached are the logs from my attempt. Unfortunately it didn’t work. I see the same behaviour as before - the stream initially works, transcoding the source file to “Max Quality” with associated process visible in nvidia-smi, but upon requesting 1080p 8Mbps, a new process briefly appears in nvidia-smi before both disappear and no activity is seen. You can see in the logs at 16:06:42.095 I requested bitrate 8000, and while it goes through the motions of picking the encoder and decoder, no transcoding actually takes place.

kernel_5.19_driver_470.zip (66.5 KB)

EDIT: Looking more closely, you can see the transcoder process start at 16:06:44.306 with PID 571, and it looks like it does actually transcode from 16:06:44.637 to 16:06:44.677, but then process 571 gets killed at 16:06:45.055, and the web client keeps requesting data from that transcode session 4x96cvx0lm97qv8ks6ku9sdw for the next 5 minutes (I let it sit there to see what happened).

EDIT2: testing with build PMS 1.31.1.6641 from this post still has the same problem.

@Adarnof

With CUDA API 11.x you won’t need nvdecExtraFrames

You probably need driver version 510 or higher. 470 is very close to the line.
The driver bump was for AV1 decoding in the transcoder.

I recommend 515.86.01 – It’s known to work

I’ve just tried with kernel 5.19, driver 515.86.01, and CUDA 11.7, with no nvdecExtraFrames setting - same problem. In the logs you’ll see I request a quality change to 8Mbps at 17:32:27.640, it tests the various encoders and decoders, starts PID 588 for transcode session gsgr7hlmeo6n8x03mlot4wd2, it transcodes from 17:32:29.628 to 17:32:29.662, and then checks decoders again before killing PID 588 and stopping transcode session gsgr7hlmeo6n8x03mlot4wd2 at 17:32:30.363, after which the web client requests and receives a 404 for session gsgr7hlmeo6n8x03mlot4wd2 until I stop playback at 17:33:13.333.
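
To make the timeline easier to read, here is a small Python sketch that turns the timestamps quoted above into gaps between events. The event labels are my own shorthand, and all events are assumed to fall on the same day.

```python
from datetime import datetime, timedelta

# Timestamps quoted in the post above; labels are shorthand.
EVENTS = [
    ("quality change to 8Mbps requested", "17:32:27.640"),
    ("transcoding starts (PID 588)", "17:32:29.628"),
    ("transcoding stops", "17:32:29.662"),
    ("PID 588 killed, session stopped", "17:32:30.363"),
    ("playback stopped by me", "17:33:13.333"),
]

def gaps_ms(events):
    """Return the gap in milliseconds between each pair of consecutive events."""
    times = [datetime.strptime(stamp, "%H:%M:%S.%f") for _, stamp in events]
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(times, times[1:])]

for (label, _), gap in zip(EVENTS[1:], gaps_ms(EVENTS)):
    print(f"+{gap} ms -> {label}")
```

So the transcoder only produces output for 34 ms, is killed about 0.7 seconds later, and the client then polls a dead session for roughly 43 seconds.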

kernel_5.19_driver_515.zip (63.9 KB)

Try these videos please.

I’m seeing subtitles (subrip) in your ffmpeg invocation command lines, which disables HW.

Jan 29, 2023 17:32:13.228 [0x7f42bd383b38] DEBUG - Request: [127.0.0.1:42238 (Loopback)] PUT /video/:/transcode/session/e9ittgftyvk9ahaa4nha4cey/be37b542-8874-4903-9173-9ace50429733/progress/streamDetail?index=2&id=0&codec=subrip&type=subtitle&language=eng (9 live) #8dd Signed-in Token (Adarnof) (range: bytes=0-) 
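
To check other sessions for the same thing, a Python sketch like the one below could pull the stream details out of such a log line. The regex and the helper name `stream_detail` are illustrative and only based on the single line above.

```python
import re
from urllib.parse import parse_qs

# Log line copied from above.
LOG_LINE = (
    "Jan 29, 2023 17:32:13.228 [0x7f42bd383b38] DEBUG - Request: "
    "[127.0.0.1:42238 (Loopback)] PUT /video/:/transcode/session/"
    "e9ittgftyvk9ahaa4nha4cey/be37b542-8874-4903-9173-9ace50429733/progress/"
    "streamDetail?index=2&id=0&codec=subrip&type=subtitle&language=eng "
    "(9 live) #8dd Signed-in Token (Adarnof) (range: bytes=0-)"
)

def stream_detail(line):
    """Return the streamDetail query parameters as a dict, or None if not present."""
    match = re.search(r"PUT (\S*/progress/streamDetail\?\S+)", line)
    if match is None:
        return None
    _, _, query = match.group(1).partition("?")
    # parse_qs returns lists of values; keep the first value of each key.
    return {key: values[0] for key, values in parse_qs(query).items()}

detail = stream_detail(LOG_LINE)
if detail and detail.get("type") == "subtitle":
    print("subtitle stream:", detail["codec"], detail["language"])
```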

“The World in HDR” is 10-bit VP9 which my card can’t decode, so it used software decoding in that test. All three videos failed.

Interestingly, the Costa Rica footage would direct play when I started, and I was able to convert to 8Mbps twice. However, subsequent attempts to convert to 10Mbps (attempt 1) or 4Mbps (attempt 2) failed.

The_World_in_HDR.zip (55.8 KB)
LG_Colors_of_Journey_HDR_UHD_4K_Demo.zip (56.4 KB)
Costa_Rica_UltraHD4K_attempt_1.zip (68.1 KB)
Costa_Rica_UltraHD4K_attempt_2.zip (66.0 KB)

Without changing anything else: do you feel like downgrading the Nvidia driver back to 515.86.01?

I’m asking because we still don’t know the real root cause and I’d like you to have a working / workaround solution if possible.

I’m currently running driver 515.86.01 with kernel 5.19, and was for those files you sent me.

What’s the default kernel for your distro? 5.10?