From:	Robert Haas <robertmhaas-AT-gmail.com>
To:	Anthony Iliopoulos <ailiop-AT-altatus.com>
Subject:	Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:	Tue, 3 Apr 2018 17:47:01 -0400
Message-ID:	<CA+TgmoZJsNt_57LgmDd7y+TLOQL+36VT9x=JMda+sYuCoJ4aDw@mail.gmail.com>
Cc:	Andres Freund <andres-AT-anarazel.de>, Tom Lane <tgl-AT-sss.pgh.pa.us>, Craig Ringer <craig-AT-2ndquadrant.com>, Thomas Munro <thomas.munro-AT-enterprisedb.com>, Catalin Iacob <iacobcatalin-AT-gmail.com>, PostgreSQL Hackers <pgsql-hackers-AT-postgresql.org>

On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ailiop@altatus.com> wrote:

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD,
POSIX.1-2001, which on the first system I tested, it did. Also, the
summary should be changed from the current "fsync, fdatasync -
synchronize a file's in-core state with storage device" by adding ",
possibly by randomly undoing some of the changes you think you made to
the file".


No, that's not questionable at all. fsync() doesn't take any argument
saying which part of the file you care about, so the kernel is
in no way entitled to assume it knows to which writes a given
fsync() call was intended to apply.


I don't deny that somebody could have an application which is
indifferent to the fact that earlier modifications to a file failed
due to I/O errors, as long as later modifications can be flushed to
disk, but I don't think that's a normal thing to want.


Well, in PostgreSQL, we have a background process called the
checkpointer which is the process that normally does all of the
fsync() calls but only a subset of the write() calls. The
checkpointer does not, however, necessarily have every file open all
the time, so these fixes aren't sufficient to make sure that the
checkpointer ever sees an fsync() failure. What you have (or someone
has) basically done here is made an undocumented assumption about
which file descriptors might care about a particular error, but it
just so happens that PostgreSQL has never conformed to that
assumption. You can keep on saying the problem is with our
assumptions, but it doesn't seem like a very good guess to me to
suppose that we're the only program that has ever made them. The
documentation for fsync() gives zero indication that it's
edge-triggered, and so complaining that people wouldn't like it if it
became level-triggered seems like an ex post facto justification for a
poorly-chosen behavior: they probably think (as we did prior to a week
ago) that it already is.
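
To make the edge-triggered versus level-triggered distinction concrete,
here is a toy model in Python. It is only a sketch of the behavior as
described in this thread, not the kernel's actual implementation; the
names ToyFile, ToyHandle, and late_opener_sees_failure are invented for
illustration. In the edge-triggered variant, a handle only reports
failures that happened after it was opened, and reports each one once.

```python
class ToyFile:
    """Toy stand-in for a file; counts writeback failures (hypothetical model)."""

    def __init__(self):
        self.error_count = 0

    def fail_writeback(self):
        # Pretend the OS tried to write a dirty page and got an I/O error.
        self.error_count += 1


class ToyHandle:
    """Toy stand-in for an open file descriptor."""

    def __init__(self, file, level_triggered):
        self.file = file
        self.level_triggered = level_triggered
        # Edge-triggered: failures before open count as "already seen".
        self.seen = file.error_count

    def fsync(self):
        """Return True for success, False if an error is reported."""
        if self.level_triggered:
            # Keeps failing for as long as any failure is on record.
            return self.file.error_count == 0
        ok = self.file.error_count == self.seen
        self.seen = self.file.error_count  # reported once, then forgotten
        return ok


def late_opener_sees_failure(level_triggered):
    """Does a process that opens the file after the failure learn of it?"""
    f = ToyFile()
    f.fail_writeback()
    checkpointer = ToyHandle(f, level_triggered)
    return not checkpointer.fsync()
```

In this model, a checkpointer that opens the file only at checkpoint
time learns of the failure in the level-triggered case and gets a clean
return in the edge-triggered case.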


Well, the way PostgreSQL works today, we typically run with, say, 8GB of
shared_buffers even if the system memory is, say, 200GB. As pages are
evicted from our relatively small cache to the operating system, we
track which files need to be fsync()'d at checkpoint time, but we
don't hold onto the blocks. Until checkpoint time, the operating
system is left to decide whether it's better to keep caching the dirty
blocks (thus leaving less memory for other things, but possibly
allowing write-combining if the blocks are written again) or whether
it should clean them to make room for other things. This means that
only a small portion of the operating system memory is directly
managed by PostgreSQL, while allowing the effective size of our cache
to balloon to some very large number if the system isn't under heavy
memory pressure.
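
The pattern described above (write a block out, forget the block,
remember only that the file needs an fsync() later) can be sketched
roughly as follows. This is a simplified illustration in Python, not
PostgreSQL's actual code; the PendingSyncs class and its method names
are hypothetical.

```python
import os


class PendingSyncs:
    """Remembers which files were written since the last checkpoint."""

    def __init__(self):
        self._paths = set()

    def write_block(self, path, offset, data):
        # Hand the block to the OS page cache and close the file again;
        # only the file name is retained, not the data.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.pwrite(fd, data, offset)
        finally:
            os.close(fd)
        self._paths.add(path)

    def checkpoint(self):
        """Reopen and fsync every remembered file; return the paths synced."""
        synced = []
        for path in sorted(self._paths):
            # A fresh descriptor, opened long after the writes were issued.
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)  # raises OSError if the kernel reports a failure
            finally:
                os.close(fd)
            synced.append(path)
        # Reached only if every fsync succeeded.
        self._paths.clear()
        return synced
```

Note that checkpoint() calls fsync() on a descriptor that did not exist
when the writes were made, which is exactly the case where it matters
whether an earlier writeback error is still reported.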

Now, I hear the DIRECT_IO thing and I assume we're eventually going to
have to go that way: Linux kernel developers seem to think that "real
men use O_DIRECT" and so if other forms of I/O don't provide useful
guarantees, well that's our fault for not using O_DIRECT. That's a
political reason, not a technical reason, but it's a reason all the
same.

Unfortunately, that is going to add a huge amount of complexity,
because if we ran with shared_buffers set to a large percentage of
system memory, we couldn't allocate large chunks of memory for sorts
and hash tables from the operating system any more. We'd have to
allocate it from our own shared_buffers because that's basically all
the memory there is and using substantially more might run the system
out entirely. So it's a huge, huge architectural change. And even
once it's done it is in some ways inferior to what we are doing today
-- true, it gives us superior control over writeback timing, but it
also makes PostgreSQL play less nicely with other things running on
the same machine, because now PostgreSQL has a dedicated chunk of
whatever size it has, rather than using some portion of the OS buffer
cache that can grow and shrink according to memory needs both of other
parts of PostgreSQL and other applications on the system.


That would certainly be better than nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

URL: https://lwn.net/Articles/752101/
