I would really love this. I am using an Intel NUC as my HTPC which is running great. I don’t transcode at home, but need to transcode occasionally when I’m on the move.
I also share my library with a few friends and I noticed that I can only support 1 transcode stream at a time, which is okayish but I would love to support more.
The thing is: I know I could upgrade to a more powerful CPU, but the whole point of my small HTPC is the low power usage. A better hardware acceleration support would allow me to keep using low power CPUs for the occasional evening when I would like to stream 2-3 movies at the same time. I’m not going to buy an i7 for this though, that’s for sure.
Yes, having to spend more on a high-end PC isn’t a favored option. As for me, I do have an i7, and all my media is in MP4/AAC/H.264 format, so it doesn’t break a sweat remotely. I’m still limited by my ISP’s upload speed: 200 Mbps down and 20 Mbps up. Wish I had Google Fiber, 1 Gig up and 1 Gig down. Then I’m sure I would be CPU bottlenecked.
I think this is technically implemented, but only available on the Nvidia Shield.
I know so unfair lol.
Would love to see GPU transcoding working in unison with CPU transcoding.
I removed my video card from my Plex server because all it was doing was eating up 15 watts of power during idle. Onboard video was working just fine.
If you have an on-board Intel GPU, QuickSync is probably the better option for you because it doesn’t require an Nvidia card in addition. Unfortunately, it was much harder to get all the library and header dependencies together to build ffmpeg with QuickSync compared to the relatively trivial process of building it with NVENC (you can have support for both in the same build).
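For the curious, the gap in build effort shows up right at the configure step. A sketch of the two invocations, with flag names from the ffmpeg 2.x/3.x era (these are assumptions; check `./configure --help` for your version):

```shell
# Reference configure lines (echoed, not run here): NVENC only needs the
# nvEncodeAPI.h header at build time, while QuickSync additionally needs the
# Intel Media SDK (libmfx) libraries and headers, the painful part to assemble.
nvenc_cfg='./configure --enable-nonfree --enable-nvenc'
qsv_cfg='./configure --enable-nonfree --enable-libmfx'
printf '%s\n%s\n' "$nvenc_cfg" "$qsv_cfg"
```

Both flags can be combined in one build, as noted above.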
Let me be more serious about the topic this time.
Currently we have an 11,000 CPU Mark score for the i7-6700K, which means ~5.5 simultaneous 1080p transcodes. We already know that the Tegra X1 can transcode 2-3 streams at the same time. Let’s say 2.5.
If you compare the Tegra X1 with a GTX 1070 GPU:
CUDA cores: 1,920 vs 256 (7.5x)
Frequency: 1,500Mhz vs 1,000Mhz (1.5x)
Memory BW: 256GB/s vs 25GB/s (10x)
G3D Mark: 11,287 vs 1,300 (8.6x) (assuming the X1 has 40% of the performance of its desktop GTX 750 counterpart)
Assume that we may get 5-6x more transcoding performance with the GTX 1070 compared to the Tegra. Which leads us to the conclusion:
Performance-wise:
6700K: 5 streams at 91W, ~18W per stream
Tegra X1: 2 streams at 15W, 7.5W per stream
GTX 1070: 14 streams at 150W, ~10W per stream
Price-wise:
6700K: 5 streams, costs $325, $65 per stream
Tegra X1: 2 streams, costs $240, $120 per stream (I don’t know the price of the Tegra X1 itself)
GTX 1070: 14 streams, costs $400, $28 per stream
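The per-stream figures are just the quoted totals divided by the stream counts; as a quick sanity check of the arithmetic:

```shell
# Per-stream power and cost from the totals above (integer division).
echo "6700K:    $((91 / 5))W, \$$((325 / 5)) per stream"
echo "Tegra X1: 7W (15/2), \$$((240 / 2)) per stream"
echo "GTX 1070: ~$((150 / 14))W, \$$((400 / 14)) per stream"
```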
Yes, this is not pure science. But my point is: since Plex has already implemented GPU transcoding with Nvidia (assuming you are not using the ARM Cortex part of the SoC), why not port it to desktops? I am not trying to underestimate the development effort; I am pretty sure it is not a simple copy/paste of the transcoding code. But it looks like it is worth it.
By the way, my ultimate request is an ultra-smart transcoding engine. Why do we have to pick one? It should be smart enough to choose between transcoding engines. For example: if the GPU is loaded (playing games), use the CPU. If the CPU is loaded, say already transcoding 3 streams, then switch to the GPU for the next one; or use the GPU for background transcoding of Optimized Versions and the CPU for realtime transcodes, etc.
@thedenethor said:
Let me be more serious about the topic this time. Currently we have an 11,000 CPU Mark score for the i7-6700K, which means ~5.5 simultaneous 1080p transcodes.
Have you actually measured that? What software did you use to encode? At what quality preset? Did you use libx264 or QuickSync? The latter runs on the Intel GPU, not on the CPU.
A top of the line quad core x86 CPU can just about manage a single 1080p transcode with the highest quality / highest compression preset using libx264. I have tested this quite extensively. Of course, it can handle more than that at low quality / high speed preset - it depends on what exactly you are trying to achieve.
We already know that the Tegra X1 can transcode 2-3 streams at the same time. Let’s say 2.5.
The Tegra X1 is a Maxwell-generation GPU. According to the Nvidia spec, Kepler (the previous generation) can produce 8x realtime 1080p performance, i.e. 8x30 = 240fps. GeForce is limited to two simultaneous streams (call it 120fps each). Maxwell supposedly doubles that. However, that is on the fastest preset, which produces both very poor visual quality and very high bit rates. On the highest quality preset, a Kepler GPU can handle about 40fps of 1080p transcoding.
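Working through those spec numbers (all figures are Nvidia's claims for the fastest preset, not measurements):

```shell
# Kepler's claimed encode budget, split across the two-session GeForce limit.
kepler_total=$((8 * 30))            # 8x realtime at 1080p30 = 240fps total
per_session=$((kepler_total / 2))   # two simultaneous sessions on GeForce
maxwell_total=$((kepler_total * 2)) # Maxwell supposedly doubles the total
echo "Kepler: ${kepler_total}fps total, ${per_session}fps per session; Maxwell: ~${maxwell_total}fps"
```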
If you compare the Tegra X1 with a GTX 1070 GPU:
CUDA cores: 1,920 vs 256 (7.5x)
Frequency: 1,500Mhz vs 1,000Mhz (1.5x)
Memory BW: 256GB/s vs 25GB/s (10x)
G3D Mark: 11,287 vs 1,300 (8.6x) (assuming the X1 has 40% of the performance of its desktop GTX 750 counterpart)
All of the above is completely meaningless for the purpose of this discussion. Hardware encoding on Nvidia GPUs doesn’t run on the shader cores and is not implemented using CUDA. Nvidia’s NVENC is implemented as a proprietary ASIC built into the GPU itself, and it doesn’t consume shader processing power in any way. It was specifically designed for realtime encoding for game streaming; if it were implemented on the shader cores, it would degrade gaming performance dramatically and be extremely power inefficient. The NVENC ASIC does the job in a power envelope of about 5W.
This is why all Kepler GPUs regardless of number of shader cores have the same encoding performance. Similarly, all Maxwell generation GPUs are the same, and all Pascal GPUs are the same.
Assume that we may get 5-6x more transcoding performance with the GTX 1070 compared to the Tegra. Which leads us to the conclusion:
Performance-wise:
6700K: 5 streams at 91W, ~18W per stream
Tegra X1: 2 streams at 15W, 7.5W per stream
GTX 1070: 14 streams at 150W, ~10W per stream
Completely baseless and meaningless comparison. See above. It simply doesn’t work that way.
Price-wise:
6700K: 5 streams, costs $325, $65 per stream
Tegra X1: 2 streams, costs $240, $120 per stream (I don’t know the price of the Tegra X1 itself)
GTX 1070: 14 streams, costs $400, $28 per stream
Yes, this is not pure science. But my point is: since Plex has already implemented GPU transcoding with Nvidia (assuming you are not using the ARM Cortex part of the SoC), why not port it to desktops? I am not trying to underestimate the development effort; I am pretty sure it is not a simple copy/paste of the transcoding code. But it looks like it is worth it.
Yes and no. The Plex Transcoder is a slightly patched ffmpeg, which since 2.6.x (i.e. quite a while now) ships with all of the required code to build against Nvidia’s development headers for NVENC and work against the libraries that ship with the Nvidia driver. It would be quite trivial to build this with NVENC support.
However, last I checked Tegra works a little differently, and although NVENC hardware itself is the same, IIRC it is accessed somewhat differently. So if you are referring to the Shield implementation, it isn’t quite as portable as you might hope. Not that it actually matters in light of the previous paragraph.
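For anyone who does build ffmpeg themselves, a minimal NVENC transcode would look roughly like this. The filenames are placeholders, and the guard just makes the snippet degrade gracefully on a machine without an NVENC-enabled build:

```shell
# Check whether this ffmpeg build exposes the NVENC H.264 encoder.
nvenc_ok=0
if command -v ffmpeg >/dev/null 2>&1 \
   && ffmpeg -hide_banner -encoders 2>/dev/null | grep -q h264_nvenc; then
    nvenc_ok=1
fi

if [ "$nvenc_ok" -eq 1 ] && [ -f input.mkv ]; then
    # Placeholder filenames; -c:a copy sidesteps the (single-threaded) audio re-encode.
    ffmpeg -i input.mkv -c:v h264_nvenc -b:v 8M -c:a copy output.mp4
else
    echo "no NVENC-capable ffmpeg (or input file) found; command shown for reference only"
fi
```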
By the way, my ultimate request is an ultra-smart transcoding engine. Why do we have to pick one? It should be smart enough to choose between transcoding engines. For example: if the GPU is loaded (playing games), use the CPU. If the CPU is loaded, say already transcoding 3 streams, then switch to the GPU for the next one; or use the GPU for background transcoding of Optimized Versions and the CPU for realtime transcodes, etc.
See above. It doesn’t work that way. See my reasonably comprehensive post on the subject a few entries above in this thread. The key choice is quality/bitrate vs. power efficiency. The libx264 software encoder produces significantly superior results in terms of visual quality and bit rate. NVENC produces worse output, but does so on a much smaller power budget. Plus, of course, it makes transcoding possible in cases where the CPU simply isn’t up to the task.
From what Thedenethor mentions, it does kinda feel like Plex is treating the Nvidia Shield like game programmers treat PC games: they code it for a console, then use a cheap port for the PC, lol. Anyway, remember gordan79, I too have an i7 4930K and can transcode 8 1080p movies at a time remotely, but I re-encode them with Wondershare Video Converter to MP4/H.264/AAC. Now, we don’t have to beat around the bush: we all know the majority of people here “share” their files with other friends, so when they download them it comes in MKV, x265, AVI, etc. My point is, if Plex could at least make a standard for GPU transcoding, I say let it be MP4/H.264/AAC, since most devices old and new support it better than any other format. It would be less of a headache for the devs than having to work out how to make every format compatible with every other.
@gordan79 said:
@thedenethor said:
Let me be more serious about the topic this time. Currently we have an 11,000 CPU Mark score for the i7-6700K, which means ~5.5 simultaneous 1080p transcodes.
Have you actually measured that? What software did you use to encode? At what quality preset? Did you use libx264 or QuickSync? The latter runs on the Intel GPU, not on the CPU.
Since this is a Plex forum, the encoder is the Plex Transcoder and the presets are the ones listed in the Plex players.
And yes, I really love the guys who fell in love with software encoding. It is always inspiring to read them, especially on Doom9. No offence, I’m serious. It is just really painful to discuss this with you, so I am stopping here. You can always enjoy your ultra mega high quality encodes on your 5" smartphone screens.
@teabag1701 said:
My point is, if Plex could at least make a standard for GPU transcoding, I say let it be MP4/H.264/AAC, since most devices old and new support it better than any other format.
MP4 is a container format; it has nothing to do with acceleration.
There is no such thing as accelerated AAC encoding, and all available libraries (ffmpeg’s built-in encoder and the FDK codec) are single-threaded. This is in many (maybe even most) cases the bottleneck when transcoding.
H.264 (and, on Maxwell and Pascal, H.265) is all you get with hardware acceleration. It’s not like NVENC will help you with anything else, even if there were a plausible reason to use a different format.
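This is easy to check against any particular build: ffmpeg's `-encoders` list shows exactly which hardware encoders were compiled in (encoder names vary slightly across versions):

```shell
# List whatever hardware encoders this ffmpeg build exposes (empty if none).
if command -v ffmpeg >/dev/null 2>&1; then
    hw=$(ffmpeg -hide_banner -encoders 2>/dev/null | grep -E 'nvenc|qsv|vaapi' || true)
    echo "${hw:-no hardware encoders compiled into this build}"
else
    hw=""
    echo "ffmpeg not installed; nothing to list"
fi
```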
And to reiterate: GeForce cards are limited to 2 simultaneous encodes. You need a much more expensive Quadro if you want more than that.
@thedenethor said:
Since this is a Plex forum, the encoder is the Plex Transcoder and the presets are the ones listed in the Plex players.
The CPU preset is controlled via Plex Web interface (“make my CPU hurt” being the “veryslow” ffmpeg preset).
It is just really painful to discuss this with you, so I am stopping here.
The painful part was reading about doing hardware transcoding on CUDA cores as if whoever wrote it was under the impression that they had any idea how it works.
Being able to use Intel QuickSync for transcoding is something that I’d rate above any other improvements to Plex Server. I don’t like having my CPU get pegged to 100% every time I’d like to transcode something to my iPad! 
I just asked wneilson if this is something he could maybe implement in his PRT multi-server solution for those of us that use multiple servers to transcode.
Do tell, what is this "PRT multi-server solution"?
Sorry missed your reply.
So I just had a thought. I was pricing out a new 4x E7-4820 system when I thought about ASIC bitcoin miners, and how AMD (and other SoCs) use ASICs to do hardware transcoding. So it strikes me to ask: would it be possible to use some of those older ASIC bitcoin miner units to do transcodes? Thinking about it, if it were possible, we could save MASSIVE power running transcodes through those boxes, and USB 3 could easily handle multiple 4K streams. This could allow us to build an ASIC farm, just like we did for bitcoin mining, but for transcoding. The cost would be THROUGH THE FLOOR.
Grab a Raspberry Pi or Nvidia Shield or whatever with a few USB 3 ports (to be the server), add a few $50-150 ASIC miners (which would easily be the horsepower equivalent of the highest-end GPUs out there), link them together, and let it rip. For sub-$500 it could probably handle a few 4K streams at once. Anyone have any raw data for streams per ASIC?
To the guys going “but Nvidia can transcode 534553345 streams at 1080p, in real time…”: sorry, no, it can’t. NVEnc is limited to 2 encoding jobs; at least this is true for Kepler and Maxwell. I couldn’t find any numbers for Pascal, but I didn’t google hard enough, and I don’t see why it would have changed. Quadro cards can encode a lot more though, but those cards also cost a fortune.
So this leaves us with VAAPI on Linux, and QuickSync and NVEnc on Windows. (AMD VCE is doable on Windows, but you need a patched ffmpeg.)
NVEnc should work on Linux, so I guess supporting that there is doable, but QuickSync is a pain in the ass to set up on Linux and should be ignored completely.
So, realistically, all that would have to be supported for now is 3 encoding methods (1 on Linux, 2 on Windows), with some handling so that only X transcodes can run on the GPU at any one time (fairly trivial to do, though).
So, yeah, why hasn’t it happened yet? No idea. Emby did it in the span of a couple of months, and it’s even to the point where it’s integrated and working well. (It handled 5 live TV encodes via VAAPI on my i5’s HD 4600 GPU without issue, and with barely any CPU load.)
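For reference, the rough shape of the VAAPI invocation involved; the render node path and filenames are placeholders, and it needs an ffmpeg built with VAAPI support:

```shell
render=/dev/dri/renderD128  # typical Intel render node; the exact path is an assumption
vaapi_ok=0
if command -v ffmpeg >/dev/null 2>&1 && [ -e "$render" ] && [ -f input.mkv ]; then
    vaapi_ok=1
    # Upload decoded frames to the GPU and encode there with h264_vaapi.
    ffmpeg -vaapi_device "$render" -i input.mkv \
           -vf 'format=nv12,hwupload' -c:v h264_vaapi -b:v 4M -c:a copy output.mp4
else
    echo "no VAAPI setup found; command shown for reference only"
fi
```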
Links:
FFMpeg HW Acceleration info: https://trac.ffmpeg.org/wiki/HWAccelIntro
AMD VCE Patched FFMpeg: https://github.com/GPUOpen-LibrariesAndSDKs/AMF
GPU transcoding would be nice
Has any Plex dev shared any comments about GPU transcoding? Or plans…
@ufo56 said:
GPU transcoding would be nice. Has any Plex dev shared any comments about GPU transcoding? Or plans…
No. Radio silence. Some could get the impression that Plex doesn’t care, but that’s just how Plex rolls. Some day they’ll announce it.
Pretty sure most PMS devs are working on making Plex Cloud better, some are on the stream brain, and some are (I hope) trying to make the ShieldTV a decent server experience. Your bet is as good as mine as to when they’ll introduce a feature like this next.
It’s a real pity, as this feature would likely make the experience better and turn a lot more machines into potential servers in a single swoop. So many NAS devices and low-end servers have an Intel CPU with the ability to transcode in hardware, and it’s not being used.
It’s not a shiny feature you can tie marketing around, like PVR or Plex Cloud, but that doesn’t mean it wouldn’t have a large impact for many users.