| From: | Andres Freund <andres-AT-anarazel.de> |
| To: | Andreas Dilger <adilger-AT-dilger.ca> |
| Subject: | Re: fsync() errors is unsafe and risks data loss |
| Date: | Wed, 11 Apr 2018 19:17:52 -0700 |
| Message-ID: | <20180412021752.2wykkutkmzh4ikbf@alap3.anarazel.de> |
| Cc: | 20180410184356.GD3563-AT-thunk.org, "Theodore Y. Ts'o" <tytso-AT-mit.edu>, Ext4 Developers List <linux-ext4-AT-vger.kernel.org>, Linux FS Devel <linux-fsdevel-AT-vger.kernel.org>, Jeff Layton <jlayton-AT-redhat.com>, "Joshua D. Drake" <jd-AT-commandprompt.com> |
Hi,

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:

It's not just postgres. dpkg (underlying apt, on Debian-derived distros), to take an example I just randomly guessed, does too:

    /* We want to guarantee the extracted files are on the disk, so that the
     * subsequent renames to the info database do not end up with old or zero
     * length files in case of a system crash. As neither dpkg-deb nor tar do
     * explicit fsync()s, we have to do them here.
     * XXX: This could be avoided by switching to an internal tar extractor. */
    dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3, but also on newer filesystems, it's performance-wise entirely infeasible to fsync() every single file individually - the performance becomes entirely atrocious if you do that.

I think there's some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO with buffering etc is just insane.

Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data.

Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening.

And that's *horrible*. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

Or even more extreme, you untar/zip/git clone a directory. Then do a sync. And you don't know whether anything actually succeeded.

The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane.

Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would *lynch* us.
We'd need a complicated scheme that sends file descriptors across sockets between processes, then deduplicates them on the receiving side, somehow figuring out which is the oldest file descriptor (handling clock drift safely).

Note that it'd be perfectly fine that we've "thrown away" the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just as we do after crashes and/or for replication).

There's already a per-process cache of open files.

Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. It's clearly not...

Sure. I don't think this is that PG specific, as explained above.

Greetings,

Andres Freund
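
Illustrative sketch (not from the original message, and not dpkg's actual code): the pattern discussed above, where files are written first and synced in a batch afterwards, only helps if every fsync() return value is checked and the caller aborts on failure. The helper names `sync_path` and `sync_all` are hypothetical. Note the caveat this thread is about: each file is reopened before fsync(), so a clean return here may not prove that earlier background writeback succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open a path (file or directory), fsync it, and
 * report any failure. Returns 0 on success, -1 on any error. */
static int sync_path(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        fprintf(stderr, "open(%s): %s\n", path, strerror(errno));
        return -1;
    }

    int rc = fsync(fd);
    if (rc != 0)
        fprintf(stderr, "fsync(%s): %s\n", path, strerror(errno));

    /* close() can fail too; do not let it mask an earlier fsync error. */
    if (close(fd) != 0 && rc == 0) {
        fprintf(stderr, "close(%s): %s\n", path, strerror(errno));
        rc = -1;
    }
    return rc;
}

/* Hypothetical helper: sync every listed file, then the containing
 * directory (so the directory entries themselves are durable).
 * Returns the number of paths that failed; the caller should abort
 * (e.g. refuse to rename files into place) if this is non-zero. */
int sync_all(const char *dir, const char *const *files, size_t nfiles)
{
    int failures = 0;

    for (size_t i = 0; i < nfiles; i++)
        if (sync_path(files[i]) != 0)
            failures++;

    if (sync_path(dir) != 0)
        failures++;

    return failures;
}
```

A caller would treat any non-zero result from `sync_all()` as fatal for the operation in progress, rather than continuing with possibly corrupted data.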
