latest-plex(1.13.2.5154)-spinning-out-of-control-swamping-network-connection

server-linux

#1

I’m seeing the same behavior Benjamin reported in his message back in 2016, but TODAY, almost 2 years later!

On the linux machine running plex server (the latest plexpass), I run this command (anyone can try it!):

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n

and this is the result when plex server is running:
1 established)
1 FIN_WAIT2
1 Foreign
2 FIN_WAIT1
49 LISTEN
53 CLOSE_WAIT
76 ESTABLISHED
13301 TIME_WAIT
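
A variant of the command above that skips netstat's header lines (the source of the stray "established)" and "Foreign" rows) might look like the sketch below. The function name count_tcp_states is made up for illustration, and it assumes the usual netstat -nat column layout with the state in the sixth field.

```shell
# Tally TCP connection states from `netstat -nat` output read on stdin.
# Only rows whose first field starts with "tcp" are counted, which drops the header lines.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -n
}

# Usage: netstat -nat | count_tcp_states
```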

No-one is accessing plex, apart from 1 plex player on the SAME machine running the server.

Turning the server off, this is the result of the same netstat command after about 60 seconds:
1 established)
1 FIN_WAIT2
1 Foreign
14 TIME_WAIT
18 SYN_SENT
33 LISTEN
49 CLOSE_WAIT
69 ESTABLISHED

from 13K TIME_WAIT connections down to 14!!!

And my machine is WAAAAAAAAAAAAAY more responsive!!

It’s been a while since I noticed that running plex makes a linux machine really sluggish, but I could never pinpoint the reason.
After installing grafana to monitor my machines, I was finally able to notice the huge 13/14k TIME_WAIT values on the machine running plex!

This is really a BIG problem, since there’s no reason to have so many connections lingering constantly in the machine. These connections tend to increase with time, to the point that plex becomes unresponsive to clients and needs to be restarted manually. (Now I know why Plex on my Projector android box can’t play movies sometimes, or sometimes stops in the middle of a movie saying “can’t connect to Plex Server”.)

This behavior also causes NFS, ssh, http connections to be slow… basically everything that relies on the tcp stack gets sluggish for no good reason!

From the kernel documentation, it seems to be a coding problem on the Plex server side, since the application is responsible for closing its connections instead of leaving them hanging to be closed by the kernel after a timeout.

And things get even worse now that kernel 4.12 removed tcp_tw_recycle, so the kernel WON’T kill TIME_WAIT connections anymore. I was able to make things a bit better by setting tcp_tw_reuse=1, so Plex re-uses its TIME_WAIT connections, but it only helps so much. (Without tcp_tw_reuse=1, I get 25K to 35K lingering TIME_WAIT connections.)
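
For anyone wanting to check or change this setting, a rough sketch follows. The helper name read_sysctl and the file name 99-tw-reuse.conf are hypothetical; the sysctl key is the one mentioned above.

```shell
# Print the current value of a sysctl key by reading it from /proc/sys.
# $1: key such as net.ipv4.tcp_tw_reuse
# $2: optional root directory (hypothetical override, handy for testing); defaults to /proc/sys
read_sysctl() {
    key_path=$(printf '%s' "$1" | tr '.' '/')
    cat "${2:-/proc/sys}/${key_path}"
}

# Check the current setting:
#   read_sysctl net.ipv4.tcp_tw_reuse
# Enable it until the next reboot (needs root):
#   sysctl -w net.ipv4.tcp_tw_reuse=1
# Keep it across reboots on distros that read /etc/sysctl.d (file name is an assumption):
#   echo 'net.ipv4.tcp_tw_reuse = 1' > /etc/sysctl.d/99-tw-reuse.conf
```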

I think this is a long-standing bug in linux plex server that needs to be fixed, instead of being ignored like it was when Ben reported it back in 2016.

my 2 cents…
-H

PS: I left Plex server off for about 10 minutes, and TIME_WAIT stayed at reasonable values, between 10-20. As soon as I turned Plex back on, the 14K TIME_WAIT connections were back, as you can clearly see in the grafana graph below:

and they stay there, doing absolutely NOTHING, apart from slowing the whole system down:

It’s interesting to see that all those connections are opened RIGHT at the start of Plex server… just like a loop going through a list and “forgetting” to close them!

after 30 minutes, no change:


#2

What Linux distro are you using?
Have you ever heard of Linux kernel tuning? Distributions with the standard 5 minute WAIT timeout values and a less-than-perfect TCP stack behave exactly as you are describing.


#3

I did mine twice with pms running:

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
4 CLOSE_WAIT
4 ESTABLISHED
21 LISTEN

netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
1 TIME_WAIT
4 CLOSE_WAIT
4 ESTABLISHED
21 LISTEN

-wbm


#4

ChuckPA:
Arch Linux, kernel 4.16.5

Yes, I know what Linux kernel tuning is, hence the information I gave about tcp_tw_reuse and tcp_tw_recycle.
As I said, the tcp_tw_recycle has been removed since kernel 4.12, sooo… no timeout tweaking!!

Which means that all distros starting with kernel 4.12 will NOT time out TIME_WAIT connections by themselves… the only option now is tcp_tw_reuse, which will attempt to use a pre-existing TIME_WAIT connection instead of creating a new one!

But if one has an application opening 14K connections without cleaning them up, they will hang there doing nothing until the process closes.

WillieBuckMerle:
What version of the kernel are you using? (cat /proc/version)


#5

And now you know why Arch isn’t supported. We’ve never tested on it.
The only advice I can give you is to refer you back to whomever you got the package from.


#6

Respectfully,

[chuck@lizum ~.104]$ uname -a
Linux lizum.hessen.lan 4.16.11-100.fc26.x86_64 #1 SMP Tue May 22 20:02:12 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[chuck@lizum ~.105]$ 

with this, and others, prominently available for tweaking.

net.ipv4.tcp_fin_timeout = 60

The names did change but timeouts are still there


#7

well… arch (as any distro) is just a bunch of scripts/apps running on top of a kernel.

Even if you don’t support it, other distros WILL get to a kernel newer than 4.12, and people will run into the same problem. THAT’s why I decided to report the bug.

You do realize that relying on the kernel to close opened connections instead of doing the housekeeping yourself is NOT how it should be done, right? (that’s why tcp_tw_recycle has been removed, to force people to fix their code)

I think it’s not a good way to support your software when someone is trying to report a bug, to just dismiss as “your distro is not supported”, since we’re talking about kernel here, not distro.

regarding your comment about tcp_fin_timeout, this is mine:

as you can see, since I first posted this bug this morning, TIME_WAIT is steady at 13k, and tcp_fin_timeout has been at 10 since I first installed this machine years ago.

also, as you may know, tcp_fin_timeout is a timeout for connections in the FIN-WAIT-2 state, not TIME_WAIT, as stated here:

https://www.frozentux.net/ipsysctl-tutorial/chunkyhtml/tcpvariables.html

So, sorry for the misunderstanding, but tcp_tw_recycle HAS BEEN REMOVED, while tcp_fin_timeout doesn’t replace it by any means since it’s not for the same thing. (and it has been part of the kernel tcp stack since kernel version 2.0, the names didn’t “change” as you stated!)

anyhow, all of this should be a moot point, since clearly PLEX server is opening a bunch of connections and abandoning them without closing them, in the hope (or the assumption) that the kernel will do the job instead. I just want to make it clear: The kernel is NOT doing the housekeeping for it anymore, and it should indeed take care of its own abandoned connections.

If you guys are going to fix it, one can only hope! (as a plexpass user for a few years now, I would expect WAY more from support than blaming a bug on linux distros like you did… very sad to read this!)

anyhow, I hope this helps someone else…
cheers…

-H


#8

I don’t get why you’re seeing it but I have two PMS’s running here and have zero issues nor is anyone else reporting an issue.
If they were, since I’m the primary support for the Linux forum, I’d have seen it. I see every thread created by email notification.

I’m not going to enter into a dispute but Arch isn’t supported and never has been, along with several others.

If you saw some of the bugs caused by some distro deciding to roll their own kernel, you’d understand my skepticism and unwillingness to accept this as a PMS fault. I have PMS source code access. I can see how sockets are opened and closed. I’ve been a developer for 35+ years and a Linux developer for some 25 now.

The four big hitters, Debian, Ubuntu, Redhat, Fedora (in no specific order) do not display this issue. If they did, we’d be all over it like flies.


#9

Did you read my uname -a output? Notice the kernel version?

Notice 4.16.11

[chuck@lizum ~.101]$ uname -a
Linux lizum.hessen.lan 4.16.11-100.fc26.x86_64 #1 SMP Tue May 22 20:02:12 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[chuck@lizum ~.102]$ 


#10

Arch runs the original kernel, no patches… I did see the uname -a from your example, which is fedora, and although it is 4.16.11, we know redhat does a lot of patching on their own kernel. (And that’s why we use Arch linux, so we have the vanilla kernel as it should be)

I’ll update my Arch to latest kernel code, and when I have time, I’ll download the latest kernel source from kernel.org and build it, just to see how plex behaves on it.

anyhow, there’s a bug somewhere or else plex wouldn’t be leaving connections hanging while running. Unfortunately the code is not open source (as far as I know), so I can’t see it as you can, or else I would take care of it myself instead of reporting it here.

The only thing I can say is that we use Arch linux on VFX production for about 5 years now, without issues (It actually solved a lot of issues we had due to kernel patches in other distros).

In this kernel version, there’s absolutely no other software behaving like Plex server does, so I think there’s a really small chance of the problem being in the kernel itself, but there is a chance.

anyhow, that’s it.

thanks anyway…
-H


#11

cat /proc/version
Linux version 4.4.0-128-generic (buildd@lcy01-amd64-019) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) ) #154-Ubuntu SMP Fri May 25 14:15:18 UTC 2018


#12

some more info. It seems those 13K connections keep being recreated over and over again, actually.

I just did a quick test reducing the maximum allowed tw connections to 500 using "sysctl net.ipv4.tcp_max_tw_buckets=500".

This basically instructed the kernel to immediately start closing TIME_WAIT connections, and the count went down to less than 500.

As soon as I returned tcp_max_tw_buckets to 65536, all the TIME_WAIT connections started to accumulate again, stopping at around 13K again.

So, I stand corrected. The TIME_WAIT connections are NOT being created at startup, but constantly. From the kernel docs, one case where TIME_WAIT accumulates is when many clients connect and disconnect very fast.
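
A back-of-the-envelope estimate of how fast connections are being created: if each closed connection sits in TIME_WAIT for a fixed period, then the steady-state count divided by that period gives the connection rate. The sketch below assumes the 60-second period that Linux uses by default (an assumption, not something measured in this thread), and the function name is made up. With tcp_tw_reuse=1 some sockets get reused early, so treat the result as a rough lower bound.

```shell
# Estimate new connections per second from a steady TIME_WAIT count.
# $1: number of sockets in TIME_WAIT
# $2: seconds a socket stays in TIME_WAIT (assumed 60 by default)
conn_rate_per_sec() {
    echo $(( $1 / ${2:-60} ))
}

conn_rate_per_sec 13301   # integer division, prints 221
```

So the 13K figure would correspond to something in the region of 200 new connections every second.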

One other piece of info I didn’t mention is that all the 13K connections are from 127.0.0.1:NNNNN to 127.0.0.1:32400, where NNNNN varies between 32768 and 60K:
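
A breakdown like this can be obtained with something along these lines. It is a sketch only: the function name time_wait_by_peer is hypothetical, and it assumes the usual netstat -nat layout with the foreign address in field 5 and the state in field 6.

```shell
# Count TIME_WAIT sockets per remote address:port from `netstat -nat` output on stdin.
# Prints "count address:port", busiest destination first.
time_wait_by_peer() {
    awk '$6 == "TIME_WAIT" { n[$5]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Usage: netstat -nat | time_wait_by_peer | head
```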

From now on, I’ll just keep posting whatever I can collect here to keep a record, in the hope of it being useful to someone else in the future.
-H


#13

I went to the plex server log, and found this:

it keeps spitting out this “ERROR - [PlexRelay] /dev/null:” over and over again, multiple times a second!

maybe it has something to do with multiple connections TIME_WAIT problem?
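
To get a feel for how often that error shows up, a small sketch like the following could be used. The function name is made up and the log path in the usage line is a placeholder, since the exact location depends on the install; the search string is the one quoted above.

```shell
# Count log lines containing the PlexRelay error quoted above; reads the log on stdin.
# -F treats the pattern as a fixed string, so the square brackets need no escaping.
count_relay_errors() {
    grep -c -F 'ERROR - [PlexRelay]'
}

# Usage (path is a placeholder): count_relay_errors < /path/to/plex-server.log
```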


#14

I’ve noticed a couple of “Plex Relay” processes running at 100% cpu each. After killing those 2 processes, the TIME_WAIT connections went down to normal levels, and Plex server is still running and working as it should!

So, it’s definitely these “Plex Relay” processes that are constantly creating tons of connections to 127.0.0.1:32400, hence creating the tons of TIME_WAIT states.
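
For anyone wanting to check whether they have the same runaway processes, a minimal sketch, assuming the process command line contains "Plex Relay" as it does here (the function name relay_cpu is hypothetical):

```shell
# List PID and %CPU for "Plex Relay" processes from `ps -eo pid,pcpu,args` output on stdin.
# The [R] in the pattern keeps this awk command itself from matching if it shows up in the ps listing.
relay_cpu() {
    awk 'NR > 1 && /Plex [R]elay/ { print $1, $2 }'
}

# Usage: ps -eo pid,pcpu,args | relay_cpu
```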

Funny thing… After killing the “Plex Relay” processes, my plex server is still working normally, actually much faster, responding almost instantaneously when I start a video, or jump forward and backwards! (not to mention my machine is working faster again, as when plex server is off)

What exactly does this “Plex Relay” process do?

for now, as a workaround, I’m creating a cron job to kill the “Plex Relay” process every minute, so I don’t run into this problem anymore!

But it definitely looks like something that should be fixed.

anyways…
hope this helps someone else with the same problem!
cheers…
-H


#15

https://support.plex.tv/articles/216766168-accessing-a-server-through-relay/


#16

@Peter_W said:
https://support.plex.tv/articles/216766168-accessing-a-server-through-relay/

That answers “What exactly does this Plex Relay process do?”, indeed! Thanks.

So, there’s still the issue “why” Plex Relay is creating so many connections per second to 127.0.0.1:32400, not to mention the error messages in the log:

And the fact that I had more than one process running on my machine, each consuming 100% cpu doing absolutely nothing apart from filling up the tcp stack with TIME_WAIT connections.

of course, in my case, I have the workaround of killing Plex Relay every minute with a cron job, and since I don’t need the relay service through plex servers, I actually don’t need multiple Plex Relay processes running all the time anyway, consuming resources!

Maybe a “use plex relay” setting would be a nice feature to add, so one like me who has the plex ports forwarded properly by the router, can disable it and save resources since there’s no need for it.

Anyhow, the issue is finally found to be in the Plex Relay process, and a workaround for anyone else running into the same problem is to set up a cron job to kill the “Plex Relay” process constantly, with something like this in ‘crontab -e’:

* * * * *    [ "$(pidof 'Plex Relay')" != "" ] && pkill -fc -9 "Plex Relay"

or in case your distro doesn’t have pidof:

* * * * *    [ "$(ps -AHfc | grep 'Plex Relay' | grep -v grep)" != "" ] && pkill -fc -9 "Plex Relay"

as long as you don’t actually need the Plex Relay service, like myself!

cheers…
-H