Perpetual Database Corruption

It all starts with timeouts and retries…

Feb 08, 2016 20:57:32 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:35 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=1): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:38 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:41 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=2): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:45 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:50 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=3): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:51 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:54 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=4): Cannot begin transaction. database is locked
Feb 08, 2016 20:57:54 [0x6d6ef400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:57:56 [0x5ee22400] ERROR - BIF transcoder exited abnormally
Feb 08, 2016 20:57:58 [0x5d0d6400] WARN - Waited one whole second for a busy database.
Feb 08, 2016 20:58:01 [0x5d0d6400] ERROR - Failed to begin transaction (…/Statistics/StatisticsManager.h:185) (tries=5): Cannot begin transaction. database is locked
Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Waited one whole second for a busy database.

Until PMS seemingly decides to time out waiting for the lock and do something stupid anyway:

Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Took too long (1.030000 seconds) to start a transaction on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:04 [0x5d0d6400] WARN - Transaction that was running was started on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:04 [0x69212400] WARN - Took too long (0.690000 seconds) to start a transaction on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:04 [0x69212400] WARN - Transaction that was running was started on …/Statistics/StatisticsManager.h:185
Feb 08, 2016 20:58:05 [0x653b3400] WARN - Took too long (0.640000 seconds) to start a transaction on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:05 [0x653b3400] WARN - Transaction that was running was started on …/…/Library/MetadataItem.cpp:812
Feb 08, 2016 20:58:13 [0x69212400] ERROR - SQLITE3:(nil), 11, database corruption at line 57498 of [8a8ffc862e]
Feb 08, 2016 20:58:13 [0x69212400] ERROR - SQLITE3:(nil), 11, statement aborts at 109: [update metadata_items set library_section_id=:metadata_items_library_section_id,parent_id=:metadata_items_parent_id,metadata_type=:metadata_items_metadata_type,guid=:metadata_items_gu
Feb 08, 2016 20:58:13 [0x69212400] ERROR - Exception inside transaction (inside=1) (…/…/Library/MetadataItem.cpp:814): sqlite3_statement_backend::loadOne: database disk image is malformed
Feb 08, 2016 20:58:13 [0x69212400] ERROR - * Error loading episode.
Feb 08, 2016 20:58:13 [0x5c712400] ERROR - BIF transcoder exited abnormally
Feb 08, 2016 20:58:14 [0x69212400] ERROR - SQLITE3:(nil), 26, statement aborts at 1: [BEGIN IMMEDIATE] file is encrypted or is not a database
Feb 08, 2016 20:58:14 [0x69212400] ERROR - Failed to begin transaction (…/…/Library/MetadataItem.cpp:812) (tries=1): Cannot begin transaction. file is encrypted or is not a database
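
For reference, those "database is locked" errors are just SQLite returning SQLITE_BUSY because another connection is holding the single write lock. It is trivial to reproduce outside of Plex; a minimal Python sketch of my own (nothing Plex-specific, file name made up):

```python
# Two connections contending for SQLite's single write lock.
# isolation_level=None gives explicit transaction control.
import sqlite3

holder = sqlite3.connect("test.db", isolation_level=None)
holder.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
holder.execute("BEGIN IMMEDIATE")         # grab the write lock and sit on it
holder.execute("INSERT INTO t VALUES (1)")

waiter = sqlite3.connect("test.db", timeout=1.0,  # the one-second wait in the log
                         isolation_level=None)
try:
    waiter.execute("BEGIN IMMEDIATE")     # the statement the log shows aborting
except sqlite3.OperationalError as e:
    print(e)                              # -> database is locked
finally:
    holder.execute("ROLLBACK")
```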

And then the only cure is to restore from backup. But then it happens again, because a small single-core NAS isn’t magically going to get faster.

I’ve rolled back to a known good backup several times today, and every time a rescan kicks off, it all gets corrupted all over again.

Plex is becoming increasingly unusably broken.

Can you upload the entire log file?

And adding to @drzoidberg33:
Also state which QNAP you have, as well as what else is running on it.

/T

I eventually managed to get things back to a usable state by rolling back the ZFS snapshot of the Plex library directory by a month or so, to a known uncorrupted state. But it does still show that the SQLite database it maintains is prone to corruption. Perhaps a move to MySQL might be in order, to improve robustness by some orders of magnitude.

The QNAP I am running is TS-421, it is running RedSleeve EL6, the Plex binaries I am using are the QNAP ones repackaged as rpm, and everything is running on ZFS (fuse based).

(Disclosure: I am the maintainer of RedSleeve (armv5tel port of EL6 and EL7) and zfs-fuse.)

@gordan79 said:
snip

Database corruption should never be a problem unless there is some underlying issue causing it.

Reports of database corruption are rare, and it is usually caused by a hardware fault (or something like a power interruption). Not saying that is your issue, but a full log may still provide more information.

There was no power outage, and there is no disk-level data corruption (ZFS would have spotted that and either transparently fixed it (4-disk RAIDZ2, RAID6 equivalent) or blocked the operation and returned an error to the application). Having said that, any database that can suffer from data corruption due to a power outage is unfit for purpose. Loss of the most recent transaction may be unavoidable, but never data corruption.
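
Incidentally, for anyone wanting to check whether their library database is already damaged before restoring from backup: SQLite can self-check, and it will also tell you which journaling mode it is running in. A quick Python sketch (the relative path is from my install; adjust for yours):

```python
# Read-only sanity check of the Plex library database.
import sqlite3

db = "Plug-in Support/Databases/com.plexapp.plugins.library.db"
conn = sqlite3.connect(db)
print(conn.execute("PRAGMA journal_mode").fetchone())     # durability mode in use
print(conn.execute("PRAGMA integrity_check").fetchone())  # ('ok',) if intact
conn.close()
```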

Here is the full Plex Media Server Log:
http://pastebin.com/EqYPZkaL

Did this issue get resolved?

I didn’t see it get resolved. It seems to be reproducible only on slow machines (mine was a single-core 32-bit ARM with everything on zfs-fuse). On a faster machine, the transactions don’t time out, and the database thus doesn’t get corrupted, more by luck than by design.

Even though posted in the PMS QNAP forum, this hardly belongs here, since @gordan79 more or less switched out everything but the bare metal :slight_smile:

As such, moving this to the Linux forum, where people more into your port might be listening.
That said, I happen to know that ZFS does require some RAM, and a TS-421 comes with only 1 GB of it!

/T

I have exactly the same issue even with a fresh build on either Ubuntu 14.04 or 16.04

Has anyone managed to resolve this, or is it being investigated and worked on? I have paid for a subscription and simply cannot use Plex anymore.

I am having this same issue. I can use Plex like normal, but I have no dashboard.

It’s really annoying.

@haygoodhtpc said:
I am having this same issue. I can use Plex like normal, but I have no dashboard.

It’s really annoying.

I’ve had the same issues - try doing the Optimize/Clean Bundles/Empty Trash steps. Usually it’s the Optimize that’s effective and fixes the issue temporarily, about 75% of the time.

Another thing I found last night was that deleting the Music library helped immediately “fix” the corruption. It looks like whatever is giving my NAS fits is in my Music library. Using another app to stream music at the moment.

Tried this last night, and it seems to have sped up loading of the dashboard and media. I have a large photo library too - I think I’ll delete that as well and see if it makes a difference.

@dane22 said:
Even though posted in the PMS QNAP forum, this hardly belongs here, since @gordan79 more or less switched out everything but the bare metal :slight_smile:

As such, moving this to the Linux forum, where people more into your port might be listening.
That said, I happen to know that ZFS does require some RAM, and a TS-421 comes with only 1 GB of it!

zfs-fuse has very modest memory requirements. Obviously more is better for improved caching, but carving out about 128MB to dedicate to ZFS caching is hardly a show-stopper on a 1GB machine.

Also note that although I swapped out the OS, the kernel is the original QNAP one, and the Plex binaries are the original QNAP build, so I would argue that swapping out the rest of the userspace is completely irrelevant: Plex brings all of its dependencies with it, and the only thing it interacts with is the kernel.

And as other posters have chipped in, it isn’t an ARM- or zfs-fuse-specific issue. It’s a timeout/race issue. I moved the exact same image onto a much faster ARM machine, and it doesn’t occur there.

@haygoodhtpc said:
Another thing I found last night was that deleting the Music library helped immediately “fix” the corruption. It looks like whatever is giving my NAS fits is in my Music library. Using another app to stream music at the moment.

@sforan said:
Tried this last night, and it seems to have sped up loading of the dashboard and media. I have a large photo library too - I think I’ll delete that as well and see if it makes a difference.

What seems to be happening, and the logs to some extent confirm it, is that certain operations have assumed timeouts associated with them. If they don’t complete in time, they are assumed to have failed/crashed and a retry occurs. But interrupting DB transactions halfway through and retrying them on a lousy embedded database like SQLite, with at best highly questionable ACID compliance, just trashes the data. This is hardly rocket science, but it is clearly beyond many in the agile nu-age. The reason deleting your music/photo libraries helps is that you almost certainly have many more audio files than video files, which pushes the already struggling metadata database past its breaking point.

There are essentially two options for fixing this:

  1. Remove the retry timeouts on database operations (see the sketch after this list). As a DBA I can confidently say that if your database is so broken that it needs this kind of workaround, you might as well be storing data in /dev/null (it’s pretty web-scale) - which is exactly what ends up happening anyway with all the database corruption, so let’s just skip the part where it’s pretending to work.

  2. Move to a decent database (PostgreSQL and MySQL would both work), AND still remove the timeouts - because you won’t need them. If the DB is taking a certain amount of time to do something, it’s because it takes a certain amount of time to do something. It is not in the application’s remit to question how long a DB operation takes.
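
To make option 1 concrete, here is a minimal sketch (my own, not Plex code; the column name is illustrative, though metadata_items does appear in the log above): give SQLite a generous busy timeout, run each transaction to completion, and always unwind cleanly on failure instead of abandoning it halfway:

```python
# Option 1 in practice: let SQLite's busy handler do the waiting instead of
# the application aborting and retrying half-finished transactions.
import sqlite3

conn = sqlite3.connect("library.db",
                       timeout=300.0,         # wait minutes, not one second
                       isolation_level=None)  # explicit transaction control

def update_title(conn, item_id, title):
    """One transaction, run to completion, rolled back cleanly on any failure."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("UPDATE metadata_items SET title = ? WHERE id = ?",
                     (title, item_id))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")              # never leave a transaction dangling
        raise
```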


@gordan79 said:
snip

I totally agree with you. My library is starting to get pretty big now, and my log shows a lot of timeouts when the server is under heavy load; under that load, users report crashes. It shouldn’t be necessary to upgrade to an SSD just to get a lousy 130 MB SQLite DB to work properly. I have run larger DBs on lousier hardware with MySQL without any problems before. It would be nice to have PostgreSQL and MySQL alternatives for “power users”.

Waitttt a second. I have this issue too. @gordan79 - are you saying that we have no options then? I do not think we have the option to change the Plex database from SQLite to MySQL. Or am I wrong? My server is crashing every few hours, and my only way to fix it is cronjobbing a PMS restart every 4 hours. Which is annoying, to say the least. Any ideas?

Thanks!

Just had the same issue, and I agree with @gordan79’s findings. The retry timeouts are too low for large amounts of data.

Would you mind creating a fresh thread please, with log files?

This thread is 2 years old. It was closed due to age before we migrated, but it must have somehow slipped through.

I am closing now so we can begin anew using current software versions.