To: Private List <evals@tux.org>
From: "Theodore Ts'o" <tytso@mit.edu>
Date: Sun, 19 Dec 2004 23:10:09 -0500
Subject: Re: [evals] ext3 vs reiser with quotas
User-Agent: Mutt/1.5.6+20040907i
List-Id: Private List <evals.tux.org>
Reply-To: Private List <evals@tux.org>
In-Reply-To: <41C0AD21.7040507@speakeasy.net>
Message-Id: <20041220041009.GC6988@thunk.org>
References: <20041216014939.GM24923@wiggy.net> <41C0EC1B.8060406@linux-sxs.org>
	<20041216021730.GD5228@kainx.org> <41C0F30E.2030903@linux-sxs.org>
	<20041216023808.GF5228@kainx.org> <41C08699.3080104@speakeasy.net>
	<41C0FA60.9000307@linux-sxs.org> <41C0AD21.7040507@speakeasy.net>

On Wed, Dec 15, 2004 at 01:31:13PM -0800, Sean Perry wrote:
> Net Llama! wrote:
> >I really have to ask what you were running when you had that FS 
> >corruption.  I've never seen anything like that, and i've been using XFS 
> >for about 3 years now on over 15 different boxes.
> >
> 
> simple as pie. Setup a machine running XFS. Start some disk activity 
> (maybe a test mail server). When the FS is getting solid activity yank 
> the power out. When I saw this happen I had to mkfs and start over.

What probably hit you here is caused by the very simple fact that
PC-class hardware is crap.

You see, when you yank the power cord out of the wall, not all parts
of the computer stop functioning at the same time.  As the voltage
starts dropping on the +5 and +12 volt rails, certain parts of the
system may last longer than other parts.  For example, the DMA
controller, hard drive controller, and hard drive unit may continue
functioning for several hundred milliseconds, long after the DIMMs,
which are very voltage sensitive, have gone crazy and are returning
totally random garbage.  If this happens while the filesystem is writing
critical sections of the filesystem metadata, well, you get to visit
the fun web pages at http://You.Lose.Hard.

I was actually told about this by an XFS engineer, who discovered this
the hard way.  Their solution was to add a power fail interrupt and
bigger capacitors in the power supplies in SGI hardware, and in Irix,
when the power fail interrupt triggers, the first thing the OS does is
to run around frantically aborting IO transfers to the disk.
Unfortunately, PC-class hardware doesn't have power fail interrupts.
Remember, PC-class hardware is cr*p.

Why doesn't ext3 get hit by this?  Well, because ext3 does physical
block journalling.  This means that we write the entire physical block
to the journal and, only after the updates to the journal are committed,
do we write the data to the final location on disk.  So if you yank
out the power cord, and inode tables get trashed, they will get
restored when the journal gets replayed.
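
A minimal sketch of the idea in Python, assuming a toy in-memory "disk".
The class and method names are hypothetical and this is not ext3's
actual code or on-disk format.  It only shows why keeping a full block
image in the journal lets replay repair a block that was scribbled on
while it was being written to its final location.

```python
BLOCK_SIZE = 16  # toy block size, chosen only for illustration


class PhysicalJournalDisk:
    """Toy disk with a physical-block journal (hypothetical names)."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.journal = []  # committed transactions: lists of (block_no, image)

    def commit(self, updates):
        """Record the ENTIRE new contents of each modified block."""
        self.journal.append([(no, bytes(data)) for no, data in updates.items()])

    def replay(self):
        """After a crash, rewrite every journalled block from its saved image."""
        for txn in self.journal:
            for no, data in txn:
                self.blocks[no] = data


disk = PhysicalJournalDisk(nblocks=4)
inode_table = b"inode-table-v2!!"  # exactly BLOCK_SIZE bytes
disk.commit({1: inode_table})

# Power fails while block 1 is being written to its final location,
# and the dying hardware leaves garbage there instead.
disk.blocks[1] = b"\xff" * BLOCK_SIZE

disk.replay()
recovered = disk.blocks[1]
print(recovered == inode_table)
```

The sketch never frees journal space; a real journal only protects
blocks whose transactions have not yet been checkpointed and reclaimed.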

XFS and Reiserfs do what is called logical journalling.  This means that
if you modify the mod time of an inode, instead of writing the entire
inode table block to the journal, they just write a note in the
journal stating that "inode X now has a mod time of Y".  This takes
much less space, and so you can pack many more journal updates into a
single disk block.  For filesystem benchmarks that try to saturate the
filesystem's write bandwidth and/or that have very high levels of
metadata updates, XFS and Reiserfs will tend to do much better than
ext3.  Fortunately, many real-world workloads don't have this
characteristic, which is why ext3 tends to perform just fine in
practice in many applications, despite what would appear to be much
worse benchmark numbers, at least for some benchmarks.

The problem with logical journalling is that if you do have an
unexpected power drop, and the storage system scribbles garbage on the
inode table, there's no way to recover.  In some cases, the only thing
that is left to be done is mkfs and restore from backups (backups?
you do keep backups, right?).
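
The same failure can be sketched against a toy logical journal.  Again
this is only an illustration under stated assumptions: the record
layout (inode number, field, new value) and all names are hypothetical,
not the XFS or Reiserfs log format.  Replay can reapply the logged
change, but everything else in the trashed block stays trashed.

```python
GARBAGE = -1  # stand-in for random bits written by failing hardware


class LogicalJournalDisk:
    """Toy disk with a logical journal (hypothetical names)."""

    def __init__(self):
        # One "inode table block" that happens to hold two inodes.
        self.inode_table = {
            7: {"mtime": 100, "size": 4096},
            8: {"mtime": 100, "size": 512},
        }
        self.journal = []  # records of the form (inode_no, field, new_value)

    def log_update(self, inode_no, field, value):
        """Log only the change: 'inode X now has field = value'."""
        self.journal.append((inode_no, field, value))

    def replay(self):
        """Reapply each logged change on top of whatever is on disk."""
        for inode_no, field, value in self.journal:
            self.inode_table.setdefault(inode_no, {})[field] = value


disk = LogicalJournalDisk()
disk.log_update(7, "mtime", 200)

# Power fails and the whole inode table block is overwritten with garbage.
disk.inode_table = {n: {"mtime": GARBAGE, "size": GARBAGE} for n in (7, 8)}

disk.replay()
unrecovered = [
    (n, field)
    for n, inode in sorted(disk.inode_table.items())
    for field, value in sorted(inode.items())
    if value == GARBAGE
]
print(unrecovered)
```

Only the one logged field comes back; the journal never held the rest
of the block, so there is nothing to restore it from.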

The practical upshot of this is that if you use XFS or reiserfs, and
you are on crappy PC-class hardware, you ***MUST*** have a UPS, and
use a serial/USB cable so that the system can do a graceful shutdown when
the UPS's batteries are exhausted.


This issue is completely different from the XFS issue of zeroing all
open files on an unclean shutdown, of course.  Having a UPS won't save
you from that particular XFS misfeature.  The reason why it is done is
to avoid a potential security problem, where a file could be left with
someone else's data.  Ext3 solves this problem by delaying the journal
commit until the data blocks are written, as opposed to trashing all
open files.  Again, it's a solution which can impact performance, but
at least in my opinion, for a filesystem, performance is Job #2.
Making sure you don't lose data is Job #1.
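
The ordering rule can be sketched as follows.  This is a toy model with
hypothetical names, not ext3 code: the point is only that data blocks
reach the disk before the metadata commit that makes them reachable, so
a crash at any moment leaves a file pointing either at nothing or at
its own data, never at a previous owner's leftovers.

```python
class OrderedModeFS:
    """Toy filesystem that flushes data before committing metadata."""

    def __init__(self):
        # Block 5 is free but still holds a previous owner's contents.
        self.data_blocks = {5: b"someone else's old data"}
        self.committed_inodes = {}  # metadata that survives a crash
        self.pending = []           # (name, block_no, data) awaiting commit

    def write(self, name, block_no, data):
        self.pending.append((name, block_no, data))

    def commit(self):
        # Step 1: write the data blocks to their final locations.
        for _, block_no, data in self.pending:
            self.data_blocks[block_no] = data
        # Step 2: only then commit the metadata that references them.
        for name, block_no, _ in self.pending:
            self.committed_inodes[name] = block_no
        self.pending.clear()

    def read_after_crash(self, name):
        """What a reader would see if the machine died right now."""
        block_no = self.committed_inodes.get(name)
        return None if block_no is None else self.data_blocks[block_no]


fs = OrderedModeFS()
fs.write("report.txt", 5, b"my new data")

before = fs.read_after_crash("report.txt")  # crash before the commit
fs.commit()
after = fs.read_after_crash("report.txt")   # crash after the commit
print(before, after)
```

The cost is that a commit has to wait for the data writes, which is the
performance impact mentioned above.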

> Install Debian on a laptop using XFS. Was attempting to use ACPI for 
> battery stats, sleeping, etc. Turned out ACPI crashes this thinkpad. Was 
> doing a apt-get install on a large package list and the box wedged, 
> hard. Turn it back on and there was nothing left of /var. Just rubble.

Your other mistake was trying to use ACPI.  :-)

							- Ted
_______________________________________________
Evals mailing list
Evals@tux.org
http://www.tux.org/mailman/listinfo/evals


To: Private List <evals@tux.org>
From: Joey Hess <joey@kitenet.net>
Date: Mon, 20 Dec 2004 09:17:04 -0500
Subject: Re: [evals] ext3 vs reiser with quotas
User-Agent: Mutt/1.5.6+20040907i
List-Id: Private List <evals.tux.org>
Reply-To: Private List <evals@tux.org>
In-Reply-To: <20041220035402.GB6988@thunk.org>
Message-Id: <20041220141704.GA26183@kitenet.net>
References: <20041216014939.GM24923@wiggy.net> <41C0EC1B.8060406@linux-sxs.org>
	<20041216021730.GD5228@kainx.org> <41C0F30E.2030903@linux-sxs.org>
	<20041216023808.GF5228@kainx.org> <41C08699.3080104@speakeasy.net>
	<20041216031819.GI5228@kainx.org> <20041216035308.GJ21637@zork.net>
	<20041216225045.GI22615@linuxmafia.com>
	<20041220035402.GB6988@thunk.org>

Theodore Ts'o wrote:
> For a very good time, create a few dozen files containing images of
> reiserfs filesystems on a reiserfs (scratch) filesystem, and force an
> fsck.reiser.  All of reiserfs is a single B-tree, that can be anywhere
> on disk.  So what fsck.reiser does is to search the entire disks for
> blocks that look vaguely like parts of the filesystem B-tree, and
> stiches them altogether.  Whee!!!!

Crap, now I have 400 gb of data that you just forced me to migrate to
ext3. Curse you Ted..

-- 
see shy jo

To: Private List <evals@tux.org>
From: Brian Chrisman <incubus@shell.izap.com>
Date: Mon, 20 Dec 2004 13:14:04 -0800
Subject: Re: [evals] ext3 vs reiser with quotas
User-Agent: Mutt/1.3.28i
List-Id: Private List <evals.tux.org>
Reply-To: Private List <evals@tux.org>
In-Reply-To: <20041220035402.GB6988@thunk.org>
Message-Id: <20041220211404.GS1789@shell.izap.com>
References: <20041216014939.GM24923@wiggy.net> <41C0EC1B.8060406@linux-sxs.org>
	<20041216021730.GD5228@kainx.org> <41C0F30E.2030903@linux-sxs.org>
	<20041216023808.GF5228@kainx.org> <41C08699.3080104@speakeasy.net>
	<20041216031819.GI5228@kainx.org> <20041216035308.GJ21637@zork.net>
	<20041216225045.GI22615@linuxmafia.com>
	<20041220035402.GB6988@thunk.org>

On Sun, Dec 19, 2004 at 10:54:02PM -0500, Theodore Ts'o wrote:
> On Thu, Dec 16, 2004 at 02:50:45PM -0800, Rick Moen wrote:
> > Quoting Nick Moffitt (nick@zork.net):
> > 
> > > Tell me, is fsck.reiser still equivalent to mkfs.reiser?
> > 
> > Pretty much.  It's the closest thing to filesystem Lotto.
> 
> For a very good time, create a few dozen files containing images of
> reiserfs filesystems on a reiserfs (scratch) filesystem, and force an
> fsck.reiser.  All of reiserfs is a single B-tree, that can be anywhere
> on disk.  So what fsck.reiser does is to search the entire disks for
> blocks that look vaguely like parts of the filesystem B-tree, and
> stiches them altogether.  Whee!!!!

Wow.. didn't know that... that's more accurately
labelled something like a filesystem 'horror reanimator'
than 'consistency check'.

As with all reanimations of the horror genre, extra 
arms, legs, tails, and bad attitudes are likely. :-)
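
The scan-and-stitch behaviour described in the quoted text can be
sketched like this.  The "TREE" signature and the block contents are
made up for illustration and are not the real reiserfs on-disk format;
the sketch only shows why a whole-disk scan cannot tell the outer
filesystem's nodes apart from nodes inside a filesystem image that is
stored as an ordinary file.

```python
MAGIC = b"TREE"  # hypothetical node signature, NOT the real reiserfs format


def scan_for_tree_blocks(disk_blocks):
    """Return the index of every block that looks like a tree node."""
    return [i for i, blk in enumerate(disk_blocks) if blk.startswith(MAGIC)]


# Nodes belonging to the real (outer) filesystem.
outer_nodes = [MAGIC + b":outer-root", MAGIC + b":outer-leaf"]

# Data blocks of a regular file that happens to contain a filesystem image.
image_file_contents = [MAGIC + b":inner-root", MAGIC + b":inner-leaf"]

disk = outer_nodes + [b"plain file data"] + image_file_contents

candidates = scan_for_tree_blocks(disk)
print(candidates)  # -> [0, 1, 3, 4]
```

Blocks 3 and 4 are file data, yet they pass the same test as blocks 0
and 1, so a rebuild based on this scan would stitch them into the tree.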

As I recall, there are several other btree based filesystems
out there (xfs.. etc) do they perform similar nefarious acts
under fsck?  Are there blended btree lookup/inode based file
systems out there?  Personally, when someone tells me they 
have a problem which involves two million files in a single
directory, I tell them it's an application architecture 
stupidity problem, not an OS problem.... 
Not that they often have any choice in the matter.....


To: Private List <evals@tux.org>
From: "Theodore Ts'o" <tytso@mit.edu>
Date: Mon, 20 Dec 2004 17:13:20 -0500
Subject: Re: [evals] ext3 vs reiser with quotas
User-Agent: Mutt/1.5.6+20040907i
List-Id: Private List <evals.tux.org>
Reply-To: Private List <evals@tux.org>
In-Reply-To: <20041220211404.GS1789@shell.izap.com>
Message-Id: <20041220221320.GA15245@thunk.org>
References: <41C0EC1B.8060406@linux-sxs.org> <20041216021730.GD5228@kainx.org>
	<41C0F30E.2030903@linux-sxs.org> <20041216023808.GF5228@kainx.org>
	<41C08699.3080104@speakeasy.net> <20041216031819.GI5228@kainx.org>
	<20041216035308.GJ21637@zork.net>
	<20041216225045.GI22615@linuxmafia.com>
	<20041220035402.GB6988@thunk.org>
	<20041220211404.GS1789@shell.izap.com>

On Mon, Dec 20, 2004 at 01:14:04PM -0800, Brian Chrisman wrote:
> As I recall, there are several other btree based filesystems
> out there (xfs.. etc) do they perform similar nefarious acts
> under fsck?  Are there blended btree lookup/inode based file
> systems out there?  Personally, when someone tells me they 
> have a problem which involves two million files in a single
> directory, I tell them it's an application architecture 
> stupidity problem, not an OS problem.... 

I haven't looked at XFS's fsck in detail, but remember that XFS isn't
using a single btree for the entire filesystem.  There are two
b-trees used to keep track of free space (which can be recomputed by
fsck without any problems).  XFS also uses separate b-trees for each
directory, so if that gets trashed you then get lots of inodes in
lost+found, and a b-tree for extent maps (and if that gets trashed you
only lose one file).

With ext3/htree, we use a separate b-tree for each directory.  In
addition, the location of each page in the b-tree is recorded in the
direct and indirect blocks of the ext3 inode.  So we never have to
search the entire disk looking for the pages of the b-tree.  It also
means that if the b-tree is corrupt, we can just simply throw away the
interior nodes and rebuild it, and 99% of the time we don't even have
inodes landing in lost+found.
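
A rough sketch of why that rebuild is cheap, under loud assumptions: the
data layout and function names are hypothetical, zlib.crc32 stands in
for the real htree hash, and a flat sorted list stands in for the
interior nodes.  What it illustrates is that the leaf blocks are found
through the inode's own block list, so the index can be regenerated
from them without scanning the disk.

```python
import zlib


def name_hash(name):
    """Stand-in hash; the real htree uses a different hash function."""
    return zlib.crc32(name.encode())


def rebuild_index(inode_block_list, blocks):
    """Rebuild a directory's hash index from its leaf blocks alone."""
    index = []
    for block_no in inode_block_list:  # located via the inode, no disk scan
        for name in blocks[block_no]:
            index.append((name_hash(name), name, block_no))
    index.sort()
    return index


def lookup(index, name):
    """Find which leaf block holds a directory entry."""
    wanted = name_hash(name)
    for h, entry, block_no in index:
        if h == wanted and entry == name:
            return block_no
    return None


# Toy disk: blocks 10 and 11 are the directory's leaves; block 99 belongs
# to something else and is never looked at.
blocks = {10: ["a.txt", "b.txt"], 11: ["c.txt"], 99: ["unrelated"]}
inode_block_list = [10, 11]

index = rebuild_index(inode_block_list, blocks)  # old interior nodes discarded
print(lookup(index, "c.txt"))
```

Because every entry lives in a leaf block that the inode still points
to, discarding the index loses no names.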

						- Ted