Perpetual Database Corruption

It all starts with timeouts and retries…

Feb 08, 2016 20:57:32 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:35 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=1): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:38 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:41 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=2): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:45 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:50 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=3): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:51 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:54 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=4): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:54 [0x6d6ef400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:56 [0x5ee22400] ERROR - BIF transcoder exited abnormally
Feb 08, 2016 20:57:58 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:58:01 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=5): Cannot begin transaction. database is locked
Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Waited one whole second for a busy database.

Until PMS seemingly decides to time out waiting for the lock and do something stupid anyway:

Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Took too long (1.030000 seconds) to start a transaction on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Transaction that was running was started on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:04 [0x69212400] WARN - Took too long (0.690000 seconds) to start a transaction on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:04 [0x69212400] WARN - Transaction that was running was started on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:05 [0x653b3400] WARN - Took too long (0.640000 seconds) to start a transaction on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:05 [0x653b3400] WARN - Transaction that was running was started on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:13 [0x69212400] ERROR - SQLITE3:(nil), 11, database corruption at line 57498 of [8a8ffc862e]
Feb 08, 2016 20:58:13 [0x69212400] ERROR - SQLITE3:(nil), 11, statement aborts at 109: [update metadata_items set library_section_id=:metadata_items_library_section_id,parent_id=:metadata_items_parent_id,metadata_type=:metadata_items_metadata_type,guid=:metadata_items_gu
Feb 08, 2016 20:58:13 [0x69212400] ERROR - Exception inside transaction (inside=1) (…/…/Library/MetadataItem.cpp:814): sqlite3_statement_backend::loadOne: database disk image is malformed
Feb 08, 2016 20:58:13 [0x69212400] ERROR - * Error loading episode.
Feb 08, 2016 20:58:13 [0x5c712400] ERROR - BIF transcoder exited abnormally
Feb 08, 2016 20:58:14 [0x69212400] ERROR - SQLITE3:(nil), 26, statement aborts at 1: [BEGIN IMMEDIATE] file is encrypted or is not a database
Feb 08, 2016 20:58:14 [0x69212400] ERROR - Failed to begin transaction (…/…/Library/MetadataItem.cpp:812) (tries=1): Cannot begin transaction. file is encrypted or is not a database
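
For reference, those "database is locked" errors are just SQLite returning SQLITE_BUSY because another connection is holding the single write lock. It is trivial to reproduce outside of Plex; a minimal Python sketch of my own (nothing Plex-specific, file name made up):

```python
# Two connections contending for SQLite's single write lock.
# isolation_level=None gives explicit transaction control.
import sqlite3

holder = sqlite3.connect("test.db", isolation_level=None)
holder.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")         # grab the write lock and sit on it
holder.execute("INSERT INTO t VALUES (1)")

waiter = sqlite3.connect("test.db", timeout=1.0,  # the one-second wait in the log
                         isolation_level=None)
try:
    waiter.execute("BEGIN IMMEDIATE")     # the statement the log shows aborting
except sqlite3.OperationalError as e:
    print(e)                              # -> database is locked
finally:
    holder.execute("ROLLBACK")
```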

And then the only cure is to restore from backup. But then it happens again, because a small single-core NAS isn’t magically going to get faster.

I’ve rolled back to a known good backup several times today, and every time a rescan kicks off, it all gets corrupted all over again.

Plex is becoming increasingly unusably broken.

Can you upload the entire log file?

And adding to @drzoidberg33:
Also state which QNAP you have, as well as what else is running on it.

/T

I eventually managed to get things back to a usable state by rolling back the ZFS snapshot of the Plex library directory by a month or so, to a known uncorrupted state. But it does still show that the SQLite database it maintains is prone to corruption. Perhaps a move to MySQL might be in order, to improve robustness by some orders of magnitude.

The QNAP I am running is TS-421, it is running RedSleeve EL6, the Plex binaries I am using are the QNAP ones repackaged as rpm, and everything is running on ZFS (fuse based).

(Disclosure: I am the maintainer of RedSleeve (armv5tel port of EL6 and EL7) and zfs-fuse.)

@gordan79 said:
snip

Database corruption should never be a problem unless there is some underlying issue causing it.

Reports of database corruption are rare, and it is usually caused by a hardware fault (or something like a power interruption). Not saying that is your issue, but a full log may still provide more information.

There was no power outage, and there is no disk-level data corruption (ZFS would have spotted that and either transparently fixed it (4-disk RAIDZ2, RAID6 equivalent) or blocked the operation and returned an error to the application). Having said that, any database that can suffer from data corruption due to a power outage is unfit for purpose. Loss of the most recent transaction may be unavoidable, but never data corruption.
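
Incidentally, for anyone wanting to check whether their library database is already damaged before restoring from backup: SQLite can self-check, and it will also tell you which journaling mode it is running in. A quick Python sketch (the relative path is from my install; adjust for yours):

```python
# Read-only sanity check of the Plex library database.
import sqlite3

db = "Plug-in Support/Databases/com.plexapp.plugins.library.db"
conn = sqlite3.connect(db)
print(conn.execute("PRAGMA journal_mode").fetchone())     # durability mode in use
print(conn.execute("PRAGMA integrity_check").fetchone())  # ('ok',) if intact
conn.close()
```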

Here is the full Plex Media Server Log:
http://pastebin.com/EqYPZkaL

Did this issue get resolved?

I didn’t see it get resolved. It seems to be reproducible only on slow machines (mine was a single-core 32-bit ARM with everything on zfs-fuse). On a faster machine, the transactions don’t time out, and the database thus doesn’t get corrupted, more by luck than by design.

Even though posted in the PMS QNAP forum, this hardly belongs here, since @gordan79 more or less switched out everything but the bare metal :slight_smile:

As such, moving this to the Linux forum, where people more into your port might be listening.
That said, I happen to know that ZFS does require some RAM, and a TS-421 comes with only 1 GB of it!

/T

I have exactly the same issue even with a fresh build on either Ubuntu 14.04 or 16.04

Has anyone managed to resolve this, or is it being investigated and worked on? I have paid for a subscription and simply cannot use Plex anymore.

I am having this same issue. I can use Plex like normal, but I have no dashboard.

It’s really annoying.

@haygoodhtpc said:
I am having this same issue. I can use Plex like normal, but I have no dashboard.

It’s really annoying.

I’ve had the same issues - try doing the Optimize/Clean Bundles/Empty Trash steps. Usually it’s the Optimize that’s effective and fixes the issue temporarily, about 75% of the time.

Another thing I found last night was that deleting the Music library helped immediately “fix” the corruption. It looks like whatever is giving my NAS fits is in my Music library. Using another app to stream music at the moment.

Tried this last night, and it seems to have sped up loading of the dashboard and media. I have a large photo library too - I think I’ll delete that as well and see if it makes a difference.

@dane22 said:
Even though posted in the PMS QNAP forum, this hardly belongs here, since @gordan79 more or less switched out everything but the bare metal :slight_smile:

As such, moving this to the Linux forum, where people more into your port might be listening.
That said, I happen to know that ZFS does require some RAM, and a TS-421 comes with only 1 GB of it!

zfs-fuse has very modest memory requirements. Obviously more is better for improved caching, but carving out about 128MB to dedicate to ZFS caching is hardly a show-stopper on a 1GB machine.

Also note that although I swapped out the OS, the kernel is the original QNAP one, and the Plex binaries are the original QNAP build, so I would argue that swapping out the rest of the userspace is completely irrelevant: Plex brings all of its dependencies with it, and the only thing it interacts with is the kernel.

And as other posters have chipped in, it isn’t an ARM- or zfs-fuse-specific issue. It’s a timeout/race issue. I moved the exact same image onto a much faster ARM machine, and it doesn’t occur there.

@haygoodhtpc said:
Another thing I found last night was that deleting the Music library helped immediately “fix” the corruption. It looks like whatever is giving my NAS fits is in my Music library. Using another app to stream music at the moment.

@sforan said:
Tried this last night, and it seems to have sped up loading of the dashboard and media. I have a large photo library too - I think I’ll delete that as well and see if it makes a difference.

What seems to be happening, and the logs to some extent confirm it, is that certain operations have assumed timeouts associated with them. If they don’t complete in time, they are assumed to have failed/crashed and a retry occurs. But interrupting DB transactions halfway through and retrying them on a lousy embedded database like SQLite, with at best highly questionable ACID compliance, just trashes the data. This is hardly rocket science, but it is clearly beyond many in the agile nu-age. The reason deleting your music/photo libraries helps is that you almost certainly have many more audio files than video files, which pushes the already struggling metadata database past its breaking point.

There are essentially two options for fixing this:

  1. Remove the retry timeouts on database operations (see the sketch after this list). As a DBA I can confidently say that if your database is so broken that it needs this kind of workaround, you might as well be storing data in /dev/null (it’s pretty web-scale) - which is exactly what ends up happening anyway with all the database corruption, so let’s just skip the part where it’s pretending to work.

  2. Move to a decent database (PostgreSQL and MySQL would both work), AND still remove the timeouts - because you won’t need them. If the DB is taking a certain amount of time to do something, it’s because it takes a certain amount of time to do something. It is not in the application’s remit to question how long a DB operation takes.
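
To make option 1 concrete, here is a minimal sketch (my own, not Plex code; the column name is illustrative, though metadata_items does appear in the log above): give SQLite a generous busy timeout, run each transaction to completion, and always unwind cleanly on failure instead of abandoning it halfway:

```python
# Option 1 in practice: let SQLite's busy handler do the waiting instead of
# the application aborting and retrying half-finished transactions.
import sqlite3

conn = sqlite3.connect("library.db",
                       timeout=300.0,         # wait minutes, not one second
                       isolation_level=None)  # explicit transaction control

def update_title(conn, item_id, title):
    """One transaction, run to completion, rolled back cleanly on any failure."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE metadata_items SET title = ? WHERE id = ?",
                     (title, item_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")              # never leave a transaction dangling
        raise
```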


@gordan79 said:
snip

I totally agree with you. My library is starting to get pretty big now, and my log shows a lot of timeouts when the server is under heavy load; under that load, users report crashes. It shouldn’t be necessary to upgrade to an SSD just to get a lousy 130 MB SQLite DB to work properly. I have run larger DBs on lousier hardware with MySQL without any problems before. It would be nice to have PostgreSQL and MySQL alternatives for “power users”.

Waitttt a second. I have this issue too. @gordan79 - are you saying that we have no options then? I do not think we have the option to change the Plex database from SQLite to MySQL. Or am I wrong? My server is crashing every few hours, and my only way to fix it is cronjobbing a PMS restart every 4 hours. Which is annoying, to say the least. Any ideas?

Thanks!

Just had the same issue, and I agree with @gordan79’s findings. The retry timeouts are too low for large amounts of data.

Would you mind creating a fresh thread please, with log files?

This thread is 2 years old. It was closed due to age before we migrated, but it must have somehow slipped through.

I am closing now so we can begin anew using current software versions.