Ubuntu 16.04 - Playback stops, PMS becomes unreachable, but service is still "active and running"

server-linux

#1
  • PMS 1.11.3.4803 on Ubuntu 16.04
  • 8 cores of a 24 Core Xeon E5 @ 1.8Ghz
  • Media stored on G-Drive via plexdrive v5.0
  • Home ISP is symmetrical gigabit speed over fiber (Comcast Gigabit Pro)
  • Player is a local client: RasPlex on RPi2 (Direct Play)

Every time I use plex for long durations of time (over 2 consecutive hours), playback drops, and PMS becomes unreachable. In this linked Plex Media Server.log file, the crash happened at 10:17pm or 10:18pm. Host server logs show less than 10% resource utilization for RAM and CPU consumption. My home internet connection is rock solid, fiber backbone with ubiquti gear, and no drops or network problems in logs. systemctl status plexmediaserver shows PMS as "active(running)" even when it's unreachable.

With next to zero information to go on, I'm not even sure where to start. The only way I've found to get it working again is just to restart PMS and plexdrive in tandem, any time it happens (which is every day).

My knee jerk reaction was to blame plexdrive, but that wouldn't make the entire plex server unreachable, it would just make the media inaccessible.


#2

I need to see what precedes that please.

Settings - Server - Help - Download Logs.

I hope it’s still in the buffer


#3

My bad. Unfortunately that time frame was already out of rotation by the time I woke up. The good/bad news though is that it will just happen again sometime today, so I’ll be sure to capture a larger time frame next time.

Just to confirm though, is Plex Media Server.log the only log file data we need in this case?


#4

There are multiple Plex Media Server logs.

Settings - Server - Help - Download Logs . attach the ZIP file


#5

Just happened again at 10:34pm EST. Logs attached. 2 remote users and 1 local users were all direct-playing content at the time of the crash.


#6

Something is burying your processor or the disk is getting hammered.

May 19, 2018 02:38:02.251 [0x7f982e7fd700] DEBUG - Thread: Running async task started by thread 0x7f986bbff700.
May 19, 2018 02:38:07.104 [0x7f982effe700] DEBUG - Thread: Running async task started by thread 0x7f986bbff700.
May 19, 2018 02:38:07.137 [0x7f984ffff700] WARN - SLOW QUERY: It took 30810.000000 ms to retrieve 2 items.
May 19, 2018 02:38:07.137 [0x7f9860fff700] DEBUG - JobManager: child process killed by signal: 15 (Terminated)
May 19, 2018 02:38:07.137 [0x7f98773ab800] DEBUG - Shutting down with signal 15
May 19, 2018 02:38:07.138 [0x7f98773ab800] DEBUG - Ordered to stop server.
May 19, 2018 02:38:07.139 [0x7f98773ab800] WARN - Timed out waiting for server to finish.
May 19, 2018 02:38:07.140 [0x7f98773ab800] DEBUG - Stopping server...
May 19, 2018 02:38:07.140 [0x7f98773ab800] DEBUG - HttpServer: Stopping server.
May 19, 2018 02:38:07.143 [0x7f983abf9700] ERROR - Error issuing curl_easy_perform(handle): 52
May 19, 2018 02:38:07.143 [0x7f982dffc700] DEBUG - Thread: Running async task started by thread 0x7f986bbff700.
May 19, 2018 02:38:07.143 [0x7f984ffff700] WARN - SLOW QUERY: It took 4640.000000 ms to retrieve 2 items.
May 19, 2018 02:38:07.145 [0x7f983abf9700] WARN - HTTP error requesting GET http://127.0.0.1:43157/:/plugins/com.plexapp.agents.localmedia/messaging/function/X0FnZW50S2l0OlVwZGF0ZU1ldGFkYXRh/Y2VyZWFsMQoxCmxpc3QKMApyMAo_/Y2VyZWFsMQoxCmRpY3QKMTAKczIKZW5zNApsYW5nYjFzNQpmb3JjZWIwczgKcGVyaW9kaWNzNQo2NzAxOHM0CmRiaWRpMApzNwp2ZXJzaW9ubnMxMApwYXJlbnRHVUlEbnM4CnBhcmVudElEczcKVFZfU2hvd3MxMAptZWRpYV90eXBlczQzCmNvbS5wbGV4YXBwLmFnZW50cy50aGV0dmRiOi8vMTIxMzYxP2xhbmc9ZW5zNApndWlkczQKbnVsbHMyCmlkcjAK (0, No error) (Empty reply from server)

30 seconds to retrieve 15 items from the DB followed immediately by 4.6 seconds for 2 items.

How many are streaming? I can’t tell accurately


#7

During that time frame I had 2 remote users via direct play, and 1 local user via direct play.

Would that make the entire server unreachable though?30 seconds to retrieve 15 items from the DB followed immediately by 4.6 seconds for 2 items.

Also, I’ve got a 24 Core Xeon E5 @ 1.8Ghz processor, and it didn’t even break 30% for my entire server during that time frame.


#8

The processor shows as a E5-2650L. If you’re playing H.264, you should have been ok.
If you were software decoding HEVC SDR and/or trying to do HEVC HDR, it’s 13,000 Passmarks weren’t going to cut it.

It doesn’t have QSV capability so everything would fall on the CPU.


#9

Everything is post-processed into .mp4’s (h264) for direct play before it even enters the Plex Library. Like I said, CPU logs never show it even cracking 30%.


#10

Then first & foremost:

  1. Settings - Server - Scheduled Tasks
  2. Enable database optimization and other cleanup during scheduled tasks. Make certain the system is on during that time period
  3. Manually Optimize the database and restart PMS
  4. Empty Trash and Clean bundles again to make certain those are gone
  5. Manually optimize the database again

I also strongly recommend you upgrade from 1.11.3. Most of the useful information I need has already been lost due to the noisy log entries

May 19, 2018 01:58:25.162 [0x7f98267fe700] DEBUG - Thread: Running async task started by thread 0x7f9868bff700.

This was fixed.


#11

I verified that my optimization and cleanup tasks were all enabled and running during the scheduled tasks time window (which is 2am - 6am EST in my case). My server is always on, so no concern of it being off during the scheduled time.

Then, I:

  1. Manually optimized the database and restarted PMS
  2. Emptied the trash, cleaned the bundles, and restarted PMS
  3. Manually optimized the database again
  4. Shutdown PMS, installed version 1.13.1.5063 of PMS on ubuntu 16.04, restarted PMS
  5. repeated steps 1-3 just in case

I’ll give it a few more days and update with my results.


#12

Now PMS is reachable, media can be seen in the libraries, but every client receives a generic playback error. Also, my Movies library scans fine, but when my TV Shows library scans, it hangs. I’m not sure where it’s hanging, because the status readout at the bottom of the webgui is missing for some reason.

I generated the attached log file while the TV Shows scanning way still hung, and immediately after the playback errors occurred.


#13
  1. Database is still messed up. We may need take that to the next level of optimization (manual SQL)
  2. You should also resolve your domain issue at some point… It’s not your root cause but it will eventually cause you grief because .lan is for the LAN but PMS sees the domain name, doesn’t know about it, and therefore treats as remote.
May 19, 2018 19:48:47.246 [0x7fd2253ff700] DEBUG - Updating metadata item (save) () with ID 67214
May 19, 2018 19:48:47.249 [0x7fd2253ff700] DEBUG - Doing expensive tags write for 'Pretty Little Liars' because something changed.
May 19, 2018 19:48:47.256 [0x7fd2253ff700] DEBUG - Doing expensive tags write for 'Pretty Little Liars' because something changed.
May 19, 2018 19:48:47.278 [0x7fd2253ff700] DEBUG - Doing expensive tags write for 'Pretty Little Liars' because something changed.
May 19, 2018 19:48:47.334 [0x7fd2253ff700] DEBUG - Loaded metadata for Pretty Little Liars (ID 67213) in 550.000000 ms
May 19, 2018 19:48:47.334 [0x7fd2253ff700] DEBUG - There was a change for metadata item 67213 (Pretty Little Liars), saving.
May 19, 2018 19:48:47.334 [0x7fd2253ff700] DEBUG - Updating metadata item (save) (Pretty Little Liars) with ID 67213
May 19, 2018 19:48:47.336 [0x7fd2253ff700] DEBUG - Done with metadata update for 67213
May 19, 2018 19:48:48.849 [0x7fd212bff700] DEBUG - Killing job.
May 19, 2018 19:48:48.849 [0x7fd212bff700] DEBUG - Signalling job ID 15445 with 9
May 19, 2018 19:48:48.849 [0x7fd212bff700] DEBUG - Media Analyzer: Background analysis completed in 30.0 seconds, removing 1 IDs
May 19, 2018 19:48:48.852 [0x7fd243fff700] DEBUG - Jobs: '/usr/lib/plexmediaserver/Plex Media Scanner' exit code for process 15445 is -9 (signal: Killed)
May 19, 2018 19:48:48.853 [0x7fd23cbfd700] WARN - SLOW QUERY: It took 350.000000 ms to retrieve 3 items.
May 19, 2018 19:48:48.853 [0x7fd230fff700] WARN - SLOW QUERY: It took 350.000000 ms to retrieve 3 items.
May 19, 2018 19:48:48.855 [0x7fd2413fe700] DEBUG - Completed: [192.168.1.29:33598] 200 GET /library/metadata/67294 (18 live) 30013ms 2287 bytes (pipelined: 1)
May 19, 2018 19:48:48.855 [0x7fd2413fe700] DEBUG - Completed: [192.168.1.29:34154] 200 GET /library/metadata/67294 (18 live) 14998ms 2287 bytes (pipelined: 1)
May 19, 2018 19:48:48.857 [0x7fd2413fe700] DEBUG - Auth: authenticated user 1 as foogama
May 19, 2018 19:48:48.857 [0x7fd20f3f8700] DEBUG - Request: [192.168.1.29:34540 (Allowed Network)] GET /library/metadata/67271 (18 live) Signed-in Token (foogama)

Whatever is happening on your machine at this moment in time is bringing it to its kneeds.

  1. You state it stops responding
  2. I see metadata retrieval stops
  3. I see the database comes slamming down.

This is simply unacceptable on a Xeon class machine.

Why does the time you state not match the time in the logs? you say 02 and the logs have 23. When you say 22, the logs said 02.
Did you grab an incorrect one by accident? (I’ve done it so no big deal if you did) Is your machine in the right TimeZone / Date/Time sync’d with pool.ntp.org ?

Possible causes:

  1. CPU overload
  2. I/O overload
  3. Database fragmentation requiring manual intervention (dependent on the total number of indexed items)
  4. Completely junked up data directories from previously failed matching / retrieval attempts

First, metadata retrieval area cleanup.

  1. Stop PMS
  2. sudo rm -rf "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Caches/*"
  3. Start PMS
  4. Rescan the libraries and make sure all matching & metadata retrieval complete successfully.
  5. If metadata retrieval and/or web interface is still sluggish, (again, number of items dependent – load time), manually optimize the database (see below)
  6. Again attempt to complete matching.

Before next playback, If there’s anything you can shut down (other processes / background tasking) you have on the box while playing, try it. see what happens.

Manual database optimization

Stop PMS

sudo sh
cd "/var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Plug-in Support/Databases"
sqlite3 com.plexapp.plugins.library.db .dump > database.sql
ls -la database.sql    # verify size is  larger than binrary `.db` as this is the exported ascii source form
mv com.plexapp.plugins.library.db  com.plexapp.plugins.library.db.old
rm -f *-shm *-wal
sqlite3 com.plexapp.plugins.library.db < database.sql
chown plex:plex com.plexapp.plugins.library.db

Start PMS
Wait 30 seconds for database indexes to regenerate
Open Dashboard
Observe performance
Resume testing metadata retrieval and other functions.


#14

@ChuckPA said:
2. You should also resolve your domain issue at some point… It’s not your root cause but it will eventually cause you grief because .lan is for the LAN but PMS sees the domain name, doesn’t know about it, and therefore treats as remote.

Okay, how do I resolve something like that? Just use IP numerals instead? I can do that if necessary, but using local DNS has been helpful with other services.

Why does the time you state not match the time in the logs? you say 02 and the logs have 23. When you say 22, the logs said 02.
Did you grab an incorrect one by accident? (I’ve done it so no big deal if you did) Is your machine in the right TimeZone / Date/Time sync’d with pool.ntp.org ?

I’m not sure where I said 02 or 22 previously, or what that’s referring to in the logs, but PlexMediaServer is using UTC and I live on the east coast. Might that be the confusion?

root@PlexMediaServer:/# date
Sun May 20 00:12:54 UTC 2018

Possible causes:

  1. CPU overload
    Unlikely, given that it never spikes above 30% consumption, but I’ve been wrong plenty of times before.
  2. I/O overload
    I’m using an enterprise grade Intel 400GB SATA SSD that was brand new 2 months ago. No SMART errors are reporting on the drive, and I/O logs on the server during that time look normal.
  1. Database fragmentation requiring manual intervention (dependent on the total number of indexed items)
    I’ve got 2,700 movies, 106 TV shows w/5,700 episodes, and 0 music. I’ll try the sql repair you prescribed in your post.

#15
  1. EST5EDT vs UTC without telling the person helping you? OUCH :smiley: lol That does explain it.
  2. Enterprise grade? NVDIMM or NVME. Either is fast enough. Regardless of which, if the database is fragmented (the indirection links which connect records together in a particular table get too long, it’s hell on performance. SQLITE3 is an ISAM structure in a single file. This is why full export with re-import is so helpful. Each table is exported and output sequentially. When the DB is recreated, there are no indirect links to the next row
  3. I use .lan here as well. If you have a cert for the domain, import it into PMS. PMS already reports the players are on an Allowed Network (RFC1918 compliant) If PMS is complaining about the referring header but sees the sending IP, check the DNS server & resolver. Something is fishy there.

#16

:blush: yeaaaah, in hindsight, UTC timezone does feel like a pretty dumb oversight to exclude from an “omg help me, look at muh logs!” post. My bad. Perhaps when you guys are done working VR and live tv, you can work on mind reading :wink:

This is my exact SSD: https://ark.intel.com/products/86641/Intel-SSD-DC-S3610-Series-400GB-2_5in-SATA-6Gbs-20nm-MLC

I’ve got a few users enjoying some movies without issue (somehow) at the moment. I’ll try the manual sql database voodoo when they log off.


#17
# ls -la database.sql
-rw-r--r-- 1 root root 114542917 May 20 03:09 database.sql
# mv com.plexapp.plugins.library.db  com.plexapp.plugins.library.db.old
# rm -f *-shm *-wal
# sqlite3 com.plexapp.plugins.library.db < database.sql
Error: near line 653882: no such collation sequence: naturalsort

I’m a sqlite novice, so I’m entirely clueless what to make of this, sorry.


#18

Something about your SQLITE3 version is incompatible.

Move the previous .db.old back to be the .db

We will abort for now but the sizes do not look bad. This is not the root cause of the issue.

If you would like to ZIP the database.sql and if you can make it available for me to check, I’ll run it against my version of SQLITE3.


#19

I moved the old database back, restarted PMS, manually re-optimized the database, restarted PMS again, refreshed all metadata for both libraries (Movies, TV Shows), performed a scan on each library, and everything completed fine.

Then, I left it sitting for 2 hours, came back, and an automatic scan was hung again on the TV Shows library. No output was visible at the footer of the page. The attached logs were generated while it was still hung on scanning the TV Shows library.


#20

It’s still scanning and adding media including right up until you ask for the log files

===== One minute earlier,  still updating metadata =====
May 20, 2018 15:49:58.117 [0x7f2a88ffe700] DEBUG - There was a change for metadata item 67194 (Terry Crews), saving.
May 20, 2018 15:49:58.117 [0x7f2a88ffe700] DEBUG - Updating metadata item (save) (Terry Crews) with ID 67194
May 20, 2018 15:49:58.128 [0x7f2aacfff700] DEBUG - Request came in with unrecognized domain / IP 'plex.homesocket.lan' in header Referer; treating as non-local
May 20, 2018 15:49:58.128 [0x7f2aacfff700] DEBUG - Auth: authenticated user 1 as foogama
May 20, 2018 15:49:58.128 [0x7f2aa7fff700] DEBUG - Request: [192.168.1.35:51025 (Allowed Network)] GET /library/metadata/61887?skipRefresh=1 (11 live) TLS GZIP Signed-in Token (foogama)
May 20, 2018 15:49:58.133 [0x7f2a88ffe700] DEBUG - Loading season 23 from /var/lib/plexmediaserver/Library/Application Support/Plex Media Server/Metadata/TV Shows/5/690fa67d8e4861eebd47e221345794074e85fe3.bundle/Contents/_combined/seasons/23.
May 20, 2018 15:49:58.133 [0x7f2a88ffe700] DEBUG - Item 67205 (Season 23) Scanning metadata graphic elements in XML file "seasons"/23.xml
May 20, 2018 15:49:58.133 [0x7f2a88ffe700] DEBUG - Item 67205 (Season 23) Scanning metadata graphic elements in XML file "seasons"/23.xml
May 20, 2018 15:49:58.134 [0x7f2a88ffe700] DEBUG - There was a change for metadata item 67205 (Season 23), saving.
May 20, 2018 15:49:58.134 [0x7f2a88ffe700] DEBUG - Updating metadata item (save) () with ID 67205
May 20, 2018 15:49:58.177 [0x7f2aacfff700] DEBUG - Request came in with unrecognized domain / IP 'plex.homesocket.lan' in header Referer; treating as non-local
May 20, 2018 15:49:58.178 [0x7f2aacfff700] DEBUG - Auth: authenticated user 1 as foogama


===== LOG request =====
May 20, 2018 15:50:49.000 [0x7f2a897ff700] DEBUG - Request: [192.168.1.35:51130 (Allowed Network)] GET /diagnostics/logs (11 live) TLS GZIP Signed-in Token (foogama)
May 20, 2018 15:50:49.000 [0x7f2a897ff700] DEBUG - Diagnostics: Building logfile zip

Which browser please?
This is looking like a Plex/Web or browser interaction issue. Definitely not a server issue