Hardware Accelerated Decode (Nvidia) for Linux

I just don’t see the point of feeding it codecs it doesn’t support. VC1 is one that comes to mind, but I’d rather not have to enumerate every unsupported codec. Since we all know the supported codecs, it should be much easier, faster and more robust to only feed it the relevant ones, especially since thousands of different users use it.

Oh, and by inverting I mean whitelisting codecs instead of blacklisting.

While the wrapper script seems to be working and I can see activity in nvidia-smi, I still can’t get smooth playback of 4K content encoded in HEVC. I’m running on an E5-1650v3, 16GB of memory, a Quadro P2000, CentOS 7.6, and Nvidia driver 418.43. GPU usage during encode is 49%. However, on the CPU side, I can see 1 thread of 12 hitting 100%. I assume this is for the audio encode?

Just wanted to share my experience. Looks good! I put up a rig with a Core i5-4690, 16GB RAM and an Nvidia GTX 1070. I have movies on a separate box, accessed from this rig via an NFS share. As you can see below, the rig can easily handle simultaneous full hardware decoding and encoding of 5 videos, each a 4K video transcoded to 1080p (the CPU is only doing audio decode/encode). And there is plenty of room for more simultaneous transcoding :wink:

nice.

the recurring thing that bothers me is the memory usage.

If a transcode process takes ~1 GB of video RAM, small-RAM cards like the P400, the GTX 960, and anything else with 2 GB or less may not be able to handle multiple 4K streams.

on Windows, the transcodes use much less video RAM

True; to transcode multiple 4K streams on a GPU you need a fairly recent one, at least a 10xx. The 960 is actually an exception in the 9xx series and does support NVDEC, but it’s definitely too weak and, as you said, it doesn’t have enough VRAM. See the GPU Support Matrix (click the “GeForce/TITAN” button on the page to get the complete list).

Regarding the VRAM usage - I believe VRAM usage during transcoding is pretty much the same on Windows as on Linux, since it is Nvidia driver / CUDA dependent. Have you run the same Windows vs. Linux test on the same configuration and seen different numbers? (Note that transcoding with NVDEC requires significantly more VRAM.)

Yup, well aware of the chart. I wouldn’t say it’s weak; it works OK in Windows.

This was discussed at various points earlier in this thread.

VRAM usage in Windows was MUCH lower than what Linux is showing

Did you get these numbers directly from the nvidia-smi tool, or by checking the numbers in Windows Task Manager / Performance? The Windows Task Manager sometimes reports GPU memory usage with paged memory allocation included and sometimes without it. Also make sure you are testing exactly the same video transcoded into exactly the same output, because VRAM usage depends on all of these factors. In any case, for many simultaneous DEC+ENC 4K transcodes the 960 will be too weak. But yes, it should definitely be a very good choice for a few simultaneous DEC+ENC 4K transcodes, especially given its very good price on the second-hand market (cryptocurrency miners are dumping it)!
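For an apples-to-apples comparison, you can query the driver directly instead of trusting Task Manager. A sketch (run while a transcode is active; the numbers will obviously vary per system):

```shell
# Total VRAM in use, as reported by the driver itself
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Per-process VRAM usage (the transcoder process should show up here)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Refresh the full report every 2 seconds while transcoding
nvidia-smi -l 2
```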

I am in no way saying that the 960 can handle many 4K transcodes. Aside from the memory usage, I would expect maybe 3 or 4 max before it runs out of steam.

As far as VRAM goes, I am going by what is reported by Windows Task Manager; to my knowledge there isn’t an equivalent of nvidia-smi on Windows.

You might be right that Task Manager is not reporting the memory usage the same way.

Also, Windows does not actually use NVDEC; it uses Windows-native decoding (DXVA2), so that could also help explain the large difference in apparent memory usage.

Edit: let me clarify the above - Plex on Windows uses DXVA2 decoding.

Other applications may be able to use NVDEC on Windows; I have no idea about those.

https://support.plex.tv/articles/115002178853-using-hardware-accelerated-streaming/

Windows

  1. Windows native (DXVA2)*
  2. software decoder (libavformat)

You can find the nvidia-smi.exe tool here:

C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe

And yes, if DXVA2 is used, then the memory (as well as performance) numbers will definitely be different.


Are you able to transcode 4K HEVC/H.265 without periodic buffering during playback? I’ve been playing around with the settings, but I cannot seem to get it to play smoothly.

I am accessing the media files via NFS from a NAS. As a test, I copied the file locally onto the Plex server and it’s able to transcode and buffer smoothly. However, if the media is accessed over NFS, Plex doesn’t seem to be able to keep up. Network traffic is nowhere near maxing out the NIC. I am not quite sure what the problem is.

Throughput is something different from latency.

Whether it’s something client, network, or server related, it sounds like the NFS mount is not responding fast enough for the transcoder.

You might see if there are any NFS tuning optimizations you can do on either the client (mount options) or the server (disk/stripe caching adjustments, etc.).

I just dived into that, but no matter what tuning I tried, it did not alleviate the buffering. As a sanity check, I changed the mounts to CIFS and now it’s behaving properly. 4K playback starts very quickly and I can see it buffering ahead correctly. Pretty strange that something about NFS was preventing Plex from transcoding smoothly.

Also keep in mind that Plex transcodes ahead, so if you start 3 simultaneous transcodes, they will all run at more than 1x for a while and then stop for a while. So the GPU RAM utilization you’re seeing may be due to transcoding ahead, and thus higher than what is expected for a 1x transcode.

Looks like I spoke too soon. The same issue came back even with CIFS. The NAS is serving NFS/CIFS on a 10Gb link, and the Plex server is at most reading the .mkv file at a few MB/s. I don’t think network connectivity/latency is at play here. Could Plex not be respecting the throttle settings?

Based on your description, the issues are likely related to actual transcoding performance and not to network performance. You mentioned that 1 CPU thread is hitting 100% (you assume this is the audio encode). Try remuxing one of the 4K sources to MP3 stereo audio (use ffmpeg for this), which will disable the audio transcode (any player can direct play MP3 stereo) - to make sure there are no bottlenecks in the actual transcoding process.
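A minimal sketch of that remux (the file names are placeholders; the HEVC video is stream-copied so only the audio is touched):

```shell
# Keep the first video stream untouched, downmix the first audio
# track to stereo MP3 so every client can direct play the audio
ffmpeg -i input-4k.mkv \
  -map 0:v:0 -map 0:a:0 \
  -c:v copy \
  -c:a libmp3lame -ac 2 -b:a 192k \
  output-4k-mp3audio.mkv
```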

Also, when you do a direct copy of the file from the NAS at the OS level, is the copy smooth, with no interruptions?

No issue with performance for a straight copy. The source is TrueHD 7.1 / AC3. Would AAC 7.1 work for audio direct play?

According to the logs, Plex seems to grab chunks of the source file and encode them for playback, rinse and repeat. When the file is on a local disk, it’s just able to keep up. However, when it’s accessed over NFS/CIFS, it falls behind even though NIC utilization is very low. I managed to install fscache and enable it for the respective NFS mounts, and it seems to be streaming reliably now.
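For reference, enabling fscache for an NFS mount looks roughly like this on CentOS (a sketch; the package name and mount point are examples and vary per distro):

```
# install and start the userspace cache daemon
yum install -y cachefilesd
systemctl enable --now cachefilesd

# add 'fsc' to the NFS mount options in /etc/fstab, e.g.:
#   x.x.x.x:/mnt/nfs-server  /mnt/local-folder  nfs  noatime,fsc  0 0
# then remount the share so the option takes effect
umount /mnt/local-folder && mount /mnt/local-folder
```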


Direct play depends on the clients’ (players’) capabilities: if the client’s hardware & software can decode and play TrueHD 7.1, then Plex will not transcode it; otherwise it will force transcoding.

Hmm, normally the OS should be able to cache NFS reads on its own, but if fscache works for you, then that is also fine. How are you mounting the NFS shares? Try something like this in fstab:

x.x.x.x:/mnt/nfs-server /mnt/local-folder nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0

Try a different but similar video file. I’ve seen inconsistent behavior when trying to transcode 4K videos. I’m not 100% sure, but it seems that video files with “forced” PGS subtitles contribute to the problem. Since many clients (e.g., Apple TV) don’t support this native Blu-ray format, they need the subtitles ‘burned’ into the video. So when this happens, we are happily using hardware-assisted encoding, but the subtitles still need to be burned in, and that appears to be a single-threaded operation (bound to a single CPU core). At least that’s what I saw when I poked at this a few weeks ago (but gave up and moved on to other stuff). My test case was a 4K Blu-ray rip of Black Panther, which does have forced PGS subtitles, but a different 4K rip (without forced PGS) seemed to play fine.
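You can check a rip for forced PGS subtitle tracks with ffprobe before blaming the transcoder. A sketch (the file name is a placeholder):

```shell
# List subtitle streams with their codec and "forced" flag;
# a track with codec_name=hdmv_pgs_subtitle and forced=1 will
# typically get burned into the video for clients without PGS support
ffprobe -v error -select_streams s \
  -show_entries "stream=index,codec_name:stream_disposition=forced" \
  -of compact movie-4k.mkv
```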

I’ve purchased myself an RTX 2060 to use the improved NVENC/NVDEC on these cards. I could’ve opted for a P2000, but figured I’d see if I can get the hack for extra transcodes working and have a better quality stream. However, I’m running into some problems figuring out what settings to set up for my LXC container.

The following is from this guide: PMS installation guide when using a Proxmox 5.1 LXC container

If I run ls -l /dev/dri my output is:

crw-rw---- 1 root video 226, 0 Mar 17 14:10 card0
crw-rw---- 1 root video 226, 1 Mar 17 14:10 card1
crw-rw---- 1 root video 226, 128 Mar 17 14:10 renderD128

I’m not sure what settings to put in my LXC .conf file. My server has one VGA output controlled by a Matrox card - isn’t that one of the cards I’m seeing above? I wouldn’t want to pass that one through to my LXC container. Also, the lines you use:

lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file

aren’t mentioned in the guide. Should I use these as well, or are they specific to your use case? Thanks to anyone who can give me some insight!


My server looks the same (Gigabyte pesh2 board), but I am running Plex on bare-metal Proxmox and it just works.

I’d simply suggest starting with card0; if that doesn’t work, switch to card1.
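Rather than guessing, you can ask sysfs which kernel driver owns each DRM node; the Matrox card typically shows up as mgag200 and the Nvidia card as nvidia. A sketch:

```shell
# Map each card device to the kernel driver bound to it
for c in /sys/class/drm/card?; do
  echo "$(basename "$c") -> $(basename "$(readlink "$c/device/driver")")"
done
```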

lxc.cgroup.devices.allow = c 226:0 rwm <<<< card0; change to 226:1 for card1
lxc.cgroup.devices.allow = c 226:128 rwm <<<<< renderD128

Further, I’d recommend getting everything (encoding) working before messing with driver hacks and transcoder hacks (i.e., don’t complicate things until they actually work in a stock configuration).
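Putting the pieces together, a container config sketch might look like the following. This is an assumption-laden example, not a verified config: the 226:* numbers come from the `ls -l /dev/dri` output above, major 195 is the usual one for the /dev/nvidia* character devices, and nvidia-uvm’s major is assigned dynamically, so verify all of them on your host with `ls -l /dev/nvidia* /dev/dri` first.

```
# DRM nodes (majors from ls -l /dev/dri)
lxc.cgroup.devices.allow = c 226:0 rwm
lxc.cgroup.devices.allow = c 226:128 rwm
# Nvidia devices (195 is typical; nvidia-uvm's major varies - check it!)
lxc.cgroup.devices.allow = c 195:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
```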