Plex Server version 1.13.4.5251-2e6e8f841 has been causing playpack issues and periodically crashes
I have had several clients that report a transcode error : Playback was terminated by the server Conversion failed. The transcoder exited due to an error.
That error happens periodically and seems to clear up after some time.
Secondly, a crash has also happened around the time of the transcoder issues. Crash Info: Jul 12, 2018 15:13:03.131 [0x7ff559878700] INFO - Crash Uploader - Platform: linux-ubuntu-x86_64 (3.10.0-693.11.6.el7.x86_64 (#1 SMP Thu Jan 4 01:06:37 UTC 2018)) ServerUUID: 1a4bbb9f4c29a0125c69c390d74c8fe1ce7277b1 UserId: xxxx@xxxx.com Version: 1.13.4.5251-2e6e8f841
System info:
Linux Centos 7
12GB of mem
32 CPU - Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz
There are a large amount of transcoder errors like these:
Jul 12, 2018 15:14:22.283 [0x7fc6e3bff700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 5632
Jul 12, 2018 15:14:22.284 [0x7fc6e93ff700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 1817).
Jul 12, 2018 15:14:22.286 [0x7fc6e8bfe700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 1821
Jul 12, 2018 15:14:22.288 [0x7fc6e5ffe700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 563).
Jul 12, 2018 15:14:22.290 [0x7fc6e4bff700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 567
Jul 12, 2018 15:14:22.291 [0x7fc6e77ff700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 514).
Jul 12, 2018 15:14:22.293 [0x7fc6e3bff700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 518
Jul 12, 2018 15:14:22.296 [0x7fc6e93ff700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 6578).
Jul 12, 2018 15:14:22.297 [0x7fc6e8bfe700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 6582
Jul 12, 2018 15:14:22.299 [0x7fc6e5ffe700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 1832).
Jul 12, 2018 15:14:22.301 [0x7fc6e4bff700] ERROR - [Transcoder] [h264 @ 0x19a1580] missing picture in access unit with size 1836
Jul 12, 2018 15:14:22.302 [0x7fc6e77ff700] ERROR - [Transcoder] [h264 @ 0x19a1580] Invalid NAL unit size (409945858 > 513).
Downgrading back to 1.13.2.5154 seems to resolve the issues.
Already looked at the logs with regards to the server crashes
The logs suggested to me some hardware issue or some problem with the binaries. Around the same time, five different processes crashed with segmentation error (SIGSEGV).
We had these separate processes all crash out with SIGSEGV around same time
Plex Media Server Process Jul 12, 2018 15:13:02.893 [0x7f651e1f7840] DEBUG - Job running: '/usr/lib/plexmediaserver/CrashUploader'
EasyAudioEncoder and two Plex Transcoder processes crashed at same time
Jul 12, 2018 15:13:02.400 [0x7f64f63ff700] DEBUG - Jobs: '/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Codecs/EasyAudioEncoder-141-linux-ubuntu-x86_64/EasyAudioEncoder/EasyAudioEncoder' exit code for process 15689 is -11 (signal: Segmentation fault)
Jul 12, 2018 15:13:02.417 [0x7f64f63ff700] DEBUG - Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 20677 is -11 (signal: Segmentation fault)
Jul 12, 2018 15:13:02.418 [0x7f64f63ff700] DEBUG - Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 21053 is -11 (signal: Segmentation fault)
Web search on centos 7 and sigsegv does find a number of hits
Now you have updated your post to say that 1.13.2.5154 is ok
But i see an identical crash on 11 July at 11:33:26 using that version.
What binary / package have you downloaded ? and what is the hardware exactly?
I would recommend looking for possible centos / hardware issue - unless we have a problem with the binaries and you need to tell me which version was clear. I know that 1.13.2.5154 crashed the same way
@sa2000 The SEGV was initialed when the server was unresponsive for some time after the transcoder errors. This is part of my monit script that will force dump the plex service as in the past when plex would deadlock and this was the procedure to get a good crash file for you guys to debug.
Information from my monitoring system: Edit: Updated the image to reflect accurate data from the logs and the downgrade blip
You may have seen that “crash” for this server ID on 1.13.2.5154 because I have the KillSignal=SIGSEGV set in /usr/lib/systemd/system/plexmediaserver.service and it will do that every time I restart the server/upgrade the server. I was planning removing that for normal restarts and upgrades. I just have not built a script for that yet.
Does that explain why the transcoder died for multiple clients all at the same time before the crash?
This is a virtual machine and It was running without transcoder failures or real crashes for quite some time. The transcoder failures happen pretty often on 1.13.4.5251-2e6e8f841 but does not seem to happen on 1.13.2.5154. It happened in about 3 different instances on 1.13.4.5251-2e6e8f841 before I posted it here. It just so happen that plex became completely unresponsive this time after the transcoder issues and monit restarted the server which in turn killed it with SIGSEGV as set in the service.
SHA sum of the exact RPM I installed on centos:
# shasum -a 1 plexmediaserver-1.13.4.5251-2e6e8f841.x86_64.rpm
a1e36e2d5513203dc828848823d1f255d63a00b3 plexmediaserver-1.13.4.5251-2e6e8f841.x86_64.rpm
Compared to what is listed for the Centos Binary:
"label": "CentOS 64-bit (RPM for CentOS 7 or newer)",
"build": "linux-ubuntu-x86_64",
"distro": "redhat",
"url": "https://downloads.plex.tv/plex-media-server/1.13.4.5251-2e6e8f841/plexmediaserver-1.13.4.5251-2e6e8f841.x86_64.rpm",
"checksum": "a1e36e2d5513203dc828848823d1f255d63a00b3"
Monit Log: (The failed protocol tests start at [EDT Jul 12 15:11:01], I could not display that here since it limited the post)
[EDT Jul 12 15:12:32] error : 'plexmediaserver' failed protocol test [DEFAULT] at [127.0.0.1]:32400 [TCP/IP] -- Cannot assign requested address
[EDT Jul 12 15:13:02] error : 'plexmediaserver' failed protocol test [HTTP] at [127.0.0.1]:32400/web/index.html [TCP/IP] -- Cannot assign requested address
[EDT Jul 12 15:13:02] info : 'plexmediaserver' exec: /usr/local/bin/slack.sh
[EDT Jul 12 15:13:02] error : 'plexmediaserver' failed protocol test [DEFAULT] at [127.0.0.1]:32400 [TCP/IP] -- Cannot assign requested address
[EDT Jul 12 15:13:02] info : 'plexmediaserver' exec: /usr/local/bin/slack.sh
[EDT Jul 12 15:13:02] error : 'plexmediaserver' failed protocol test [HTTP] at [127.0.0.1]:32400/web/index.html [TCP/IP] -- Cannot assign requested address
[EDT Jul 12 15:13:02] info : 'plexmediaserver' trying to restart
[EDT Jul 12 15:13:02] info : 'plexmediaserver' stop: /bin/systemctl
[EDT Jul 12 15:13:04] info : 'plexmediaserver' start: /bin/systemctl
[EDT Jul 12 15:13:05] error : 'plexmediaserver' failed protocol test [DEFAULT] at [localhost]:32401 [TCP/IP] -- Connection refused
[EDT Jul 12 15:13:05] error : 'plexmediaserver' failed protocol test [DEFAULT] at [127.0.0.1]:32400 [TCP/IP] -- Connection refused
[EDT Jul 12 15:13:05] info : 'plexmediaserver' trying to restart
[EDT Jul 12 15:13:35] error : 'plexmediaserver' uptime test failed for /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/plexmediaserver.pid -- current uptime is 30 seconds
For Transcoder errors, the only way to have them investigated is to have a short clip that recreates the errors plus logs confirming the issue with the clip. Transcoder errors do not impact the whole server responsiveness unless a number of processes are consuming all the available processing power
What is the exact request (or set of requests that your script makes) - well not exact because i do not want see the actual security token used in the requests
The script triggered at 15:13:02 - I am not sure what cannot assign requested address means in your script log. Requests would normally hang / timeout / get a response code of 503 or 401 or 200 etc. Is it saying that the server process is no longer listening on 32400? But the log shows requests still coming in and dealt with
Ok, I may upgrade again and see if I can reproduce. The issue has been that when the transcoder errors out on the clients. If you restart the stream most times it just starts working or you just restart the video from the beginning it goes away. It is more intermittent vs, when you get to X timecode it fails on a specific file and it seems to happen on several clients at the same time. Would it be helpful to get the actual session data in the transcode folder when this happens?
It does a GET request to http://127.0.0.1:32400/web/index.html and protocol tests. I am not sure what mechanism it uses for the protocol test. My searches have not turned up any details.
Monit Config: (Each Cycle is 30 seconds)
if failed host 127.0.0.1 port 32400 with timeout 30 seconds for 5 cycles then restart
if failed port 32401 with timeout 30 seconds for 5 cycles then restart
if failed url http://127.0.0.1:32400/web/index.html with timeout 30 seconds for 5 cycles then restart
I am not sure since this all happen pretty quick. I can investigate further when it happens again.
Agreed. I think the “crash” might be a red herring in this instance. .51 is my monitoring server and .64 is the reverse proxy that all outside traffic comes from as well as a script to dump the threads. https://plex.server.com/connections?X-Plex-Token=XXXXXXXXX
The transcoder failures is where I am going to focus my efforts and see if I can provide enough data back.
@sa2000 I looked again at the logs and what more is needed for this to be debugged?
I just want to make sure I can provide everything that is needed.
In the log 3 minuets before the service was restarted the logs are flooded with:
Jul 12, 2018 15:10:07.265 [0x7f64e07ff700] DEBUG - Play progress on 132650 'Asia Special: China & Japan' - got played 63000 ms by account 15040132!
Jul 12, 2018 15:10:07.280 [0x7f64e07ff700] DEBUG - We're going to try to auto-select an audio stream for account 15040132.
Jul 12, 2018 15:10:07.280 [0x7f64e07ff700] DEBUG - Selecting best audio stream for part ID 257358 (autoselect: 0 language: en)
Jul 12, 2018 15:10:07.280 [0x7f64e07ff700] DEBUG - Audio Stream: 602688, Subtitle Stream: -1
Jul 12, 2018 15:10:07.281 [0x7f64e07ff700] DEBUG - [Now] User is PLEXUSER (ID: 15040132)
Jul 12, 2018 15:10:07.282 [0x7f64e07ff700] DEBUG - [Now] Device is Xbox One (XboxOne).
Jul 12, 2018 15:10:07.282 [0x7f64e07ff700] DEBUG - [Now] Profile is Xbox One
Jul 12, 2018 15:10:07.282 [0x7f64e07ff700] DEBUG - [Now] Updated play state for /library/metadata/132650.
Jul 12, 2018 15:10:07.282 [0x7f64e07ff700] WARN - WebSocket: Ignoring message since we're closing the connection
Jul 12, 2018 15:10:07.282 [0x7f64e07ff700] WARN - WebSocket: Ignoring message since we're closing the connection
Jul 12, 2018 15:10:07.284 [0x7f64e07ff700] DEBUG - HubCache: Expiring '15040132/home.continue/hubs/en/count=14&excludeElements=Actor%2CCollection%2CCountry%2CDirector%2CGenre%2CLabel%2CMood%2CPart%2CProducer%2CRole%2CSimilar%2CWriter%2CPhoto%2CMedia&excludeFields=file%2Csummary%2Ctagline&excludePlaylists=1&includeLibraryPlaylists=1&includeTypeFirst=1&libraryHubsOnly=1' because event plex.event.play.session.state.update referenced an item the hub contained
Jul 12, 2018 15:10:07.286 [0x7f64e07ff700] DEBUG - Statistics: (o21sgdgfttr4kjz7w7a7chfg) Reporting active playback in state 2 of type 4 (scrobble: 0) for account 15040132
Jul 12, 2018 15:10:07.305 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 35955
Jul 12, 2018 15:10:07.318 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 2834
Jul 12, 2018 15:10:07.331 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 6911
Jul 12, 2018 15:10:07.349 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] Invalid NAL unit size (409945858 > 12356).
Jul 12, 2018 15:10:07.362 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] Invalid NAL unit size (409945858 > 14905).
Jul 12, 2018 15:10:07.374 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 6154
Jul 12, 2018 15:10:07.388 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 1323
Jul 12, 2018 15:10:07.400 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] missing picture in access unit with size 804
Jul 12, 2018 15:10:07.415 [0x7f64e07ff700] ERROR - [Transcoder] [h264 @ 0x329a580] Invalid NAL unit size (409945858 > 6448).
I can make the file ‘Asia Special: China & Japan’ available to you. In the logs it shows a user trying to play that file and it failed.
Edit: Just to reiterate this happens to many streams at the same time in spirts. If you play the file after the transcoder failure event the file plays properly.
Logs can get flooded with some errors and it does happen without a real impact to the operation of the server. If there is evidence that the flooding of logs with errors causes an issue, then this can be looked into
Ok I was able to reproduce. It seems that one of the video files might be corrupt and throws all those errors. Also when this video file is played it kills every other session on the server.
I did some more testing. The player in question was an Xbox one. If the clients tries to play the file and just sits there it will buffer indefinitely and the server gets inundated with the transcode errors which I believe, inturn causes the server to go unresponsive to most clients.
What is the current position with this?
Are you still having problems?
If you are, could i have server logs with debug logging covering the time of error and indication of when the issue arose. Also the sample we can use to try to replicate the problem