Server: 1.21.0.3616
OS: Ubuntu 20.04
HDR tone mapping dependencies installed
CPU: Celeron quad core J5005 with QSV
Note: Standard transcoding works perfectly on this server up to 5 streams
Hi guys,
When I play a 4K HDR file and transcode it, everything is done in HW as expected, and both network and CPU usage are very low (~10%), but there is still buffering and I can't tell why.
At the command prompt I use bmon to check bandwidth and htop to view CPU usage; nothing is being taxed, so there shouldn't be any buffering.
I know there are other factors involved now with the extra packages, so I'm just curious what else I should be checking. Also, are there technical differences between tone mapping HDR10 and HLG that might cause this? Some of my test files are HLG.
You are using the J5005, whose memory is limited by design to 2400 MT/s, whereas Intel Core series parts are spec'd in GT/s.
Like its sibling SoCs, it has hard limits, so some buffering is expected. OpenCL tone mapping moves each video frame, as a raw image, from the QSV ASIC (decode) -> GPU (for OpenCL) across the internal PCI bus. The QSV ASIC is then invoked one last time to encode.
This differs from the non-tone-mapping process, which does all operations at once with a single hit on memory.
Even without tone mapping, J-series CPUs are limited to 3-5 HW transcodes by their memory speed.
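To put rough numbers on the difference, here is a back-of-the-envelope sketch in Python. The per-frame size matches the math further down; the bus-crossing counts are my assumptions about the flow described above, not measurements of the actual transcoder.

```python
frame_bytes = 3840 * 2160 * 4   # raw 4K frame at 4 bytes/pixel (see math below)
fps = 30

# Direct HW transcode: decode and encode both happen inside the QSV ASIC,
# so each raw frame makes roughly one trip through memory.
direct = frame_bytes * fps * 1

# OpenCL tone mapping: QSV decode -> GPU (OpenCL) -> QSV encode, so each
# raw frame crosses the internal bus several times (3 trips assumed here;
# the real count isn't documented).
tonemapped = frame_bytes * fps * 3

print(f"direct:      {direct / 1e9:.2f} GB/s")      # ~1.0 GB/s
print(f"tone-mapped: {tonemapped / 1e9:.2f} GB/s")  # ~3.0 GB/s
```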
HLG
I cannot speak to the HLG format until the team is back in the office on Monday. I do know this works for HDR UHD (Main 10). At no point did they mention that Dolby Vision or Technicolor support would be included in this initial offering (which was a bit of a holiday surprise).
Thanks @ChuckPa. As we both mentioned, the J5005 can typically handle transcoding 5 streams, and I can usually observe the bottlenecks as I hit those limits. In this case, though, I am not seeing any bottleneck; everything looks fine. So I suppose my question is: where can I observe the bottleneck, since none of the Linux resource tools I usually use (htop, bmon, etc.) indicate one in the CPU or GPU?
Linux doesn't have any tools that allow this level of debugging.
The reason is that you can't monitor PCI bus utilization without dedicated hardware.
Compound that with the fact that the data is being moved around on the GPU's PCI bus, internal to the CPU.
You need an ICE (in-circuit emulator) or HWA system to actually see this. With either, it's obvious.
With some math:
Calculate the number of bytes being shoveled around per frame (a raw uncompressed image):
a. 2160 x 3840 = 8,294,400 pixels
b. 8,294,400 * 4 bytes (3 x 10-bit color, padded) = 33,177,600 bytes [this is the minimum requirement]
c. 33,177,600 bytes per frame * 30 fps (rounding 29.97 up) = 995,328,000 bytes/sec
Compare that ~995 MB/sec against the total memory and I/O bus bandwidth limits (see the sketch below).
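The same steps in Python, compared against the theoretical peak of dual-channel DDR4-2400 (whether both channels are actually populated on a given J5005 box is an assumption):

```python
pixels = 3840 * 2160               # a. 8,294,400 pixels
frame_bytes = pixels * 4           # b. 4 bytes/pixel (3 x 10-bit color, padded)
per_second = frame_bytes * 30      # c. 30 fps (29.97 rounded up)
print(f"{per_second:,} bytes/sec") # 995,328,000 -> ~995 MB/sec

# Theoretical peak for dual-channel DDR4-2400: 2400 MT/s * 8 bytes * 2 channels.
# (Assumes both channels populated; single-channel halves this.)
ddr4_peak = 2400e6 * 8 * 2         # 38.4 GB/s

# Real-world sustained bandwidth sits well below the theoretical peak, and
# tone mapping multiplies the number of passes each frame makes through memory.
print(f"{per_second / ddr4_peak:.1%} of theoretical peak, per pass")
```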
Factor in the requirements of the other processes running on the host (a second sketch follows this list):
a. At 4K bytes per page (being extremely generous, as I can't confirm whether a full 4K-byte page is moved per request),
b. with ~1 GB/sec to move,
c. that's roughly 250K transfers/sec on top of the normal load.
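A quick sketch of step c, using the generous 4 KiB-per-transfer assumption stated above:

```python
bytes_per_sec = 995_328_000     # raw-frame traffic from the math above
page_size = 4096                # assumed: one full 4 KiB page moved per request
transfers = bytes_per_sec / page_size
print(f"~{transfers:,.0f} transfers/sec")  # 243,000 -> roughly the 250K above
```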
What happens when you have two SATA-3 SSDs being read raw (block I/O) simultaneously?
That’s a load you notice, isn’t it?
Now multiply by the number of simultaneous transcodes.