We have a number of embedded systems requiring r/w access to the filesystem, which resides on flash storage with block device emulation. Our oldest platform runs on CompactFlash, and these systems have been in use for over 3 years without a single fsck being run during bootup; so far we have had no failures attributed to the filesystem or the CF.

On our newest platform we used USB flash for the initial production and are now migrating to Disk-on-Module for r/w storage. A while back we had some issues with the filesystem on a lot of the devices running on USB storage, so I enabled e2fsck to see if that would help. As it turned out, we had received a shipment of bad flash memories, so once those were replaced the problem went away. I have since disabled e2fsck, since we had no indication that it made the system any more reliable and historically we have been fine without it.

Now that we have started putting in Disk-on-Module units I've started seeing filesystem errors again. Suddenly the system is unable to read/write certain files, and if I try to access a file from the emergency console I just get "Input/output error". I enabled e2fsck again and all the files were corrected.

O'Reilly's "Building Embedded Linux Systems" recommends running e2fsck on ext2 filesystems but does not mention it in relation to ext3, so I'm a bit confused as to whether I should enable it or not.

What are your takes on running fsck on an embedded system? We are considering putting the binaries on a r/o partition and only the files which have to be modified on a r/w partition on the same flash device, so that fsck can never accidentally delete important system binaries. Does anyone have any experience with that kind of setup (good/bad)?

+2  A: 

I think the answer to your question relates more to what kind of coherency requirements your application has relative to its data. That is, what has to be guaranteed if power is lost without a formal shutdown of the system? In general, none of the desktop operating system type file systems handle this all that well without the application specifically closing/syncing files and flushing the disk caches, etc., at key transaction points, to ensure that what you need to maintain is in fact committed to the media.
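
To make that concrete, here is a minimal sketch of committing a piece of state at a transaction point by forcing it out of the page cache; the path and record format are made up for illustration:

    /* Minimal sketch: commit a record to the media at a transaction point.
     * The path and record format are hypothetical. */
    #include <fcntl.h>
    #include <unistd.h>

    int commit_record(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;

        /* Treat a short write as a failure, for simplicity. */
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }

        /* Force the data out of the page cache and onto the flash before
         * considering the transaction committed. */
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }

        return close(fd);
    }

Without the fsync(), a power cut can lose data the application believed was safely written long ago.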

Running fsck fixes the file-system, but without the above care there are no guarantees about which of your changes will actually be kept. I.e., it's not exactly deterministic what you'll lose as a result of the power failure.

I agree that putting your binaries or other important read-only data on a separate read-only partition does help ensure that they can't erroneously get tossed due to an fsck correction to file-system structures. At a minimum, putting them in a different sub-directory off the root than where the R/W data is held will help. But in both cases, if you support software updates, you still need a scheme to deal with writing to the "read-only" areas anyway.
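
For the update case, one common approach is to remount the normally read-only area read-write just for the duration of the update. A rough sketch, with a hypothetical mount point:

    /* Sketch: temporarily remount a normally read-only partition read-write
     * for a software update. The mount point is hypothetical. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    static const char *FW_MOUNT = "/opt/firmware";  /* hypothetical */

    int main(void)
    {
        /* Open the area up for writing. */
        if (mount(NULL, FW_MOUNT, NULL, MS_REMOUNT, NULL) != 0) {
            perror("remount rw");
            return 1;
        }

        /* ... install the new binaries here ... */

        /* Flush everything, then lock the area down again. */
        sync();
        if (mount(NULL, FW_MOUNT, NULL, MS_REMOUNT | MS_RDONLY, NULL) != 0) {
            perror("remount ro");
            return 1;
        }
        return 0;
    }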

In our application, we actually maintain a pair of directories for things like binaries, and the system is set up to boot from either one of the two areas. During software updates, we update the first directory, sync everything to the media and verify the MD5 checksums on disk before moving on to the second copy's update. During boot, an area is only used if its MD5 checksum is good. This ensures that you are always booting a coherent image.
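
A rough sketch of that update sequence is below. The directory names and manifest file are hypothetical, install_image() is a placeholder for whatever actually copies the files, and the verification step assumes an MD5 manifest checked with md5sum -c:

    /* Sketch of a dual-area update: the second copy is only touched once
     * the first copy is verified good on the media. Paths, the manifest
     * name and install_image() are all hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Placeholder: copy the new binaries into 'dir'. */
    static int install_image(const char *dir)
    {
        (void)dir;
        return 0;
    }

    /* Verify the files in 'dir' against a pre-generated MD5 manifest. */
    static int verify_checksums(const char *dir)
    {
        char cmd[256];
        snprintf(cmd, sizeof(cmd),
                 "cd '%s' && md5sum -c image.md5 > /dev/null", dir);
        return system(cmd);
    }

    static int update_area(const char *dir)
    {
        if (install_image(dir) != 0)
            return -1;

        /* Push everything to the media before trusting the on-disk copy. */
        sync();

        return verify_checksums(dir);
    }

    int main(void)
    {
        if (update_area("/boot/area1") != 0) {
            fprintf(stderr, "area1 update failed; area2 left untouched\n");
            return 1;
        }
        if (update_area("/boot/area2") != 0) {
            fprintf(stderr, "area2 update failed; area1 is still good\n");
            return 1;
        }
        return 0;
    }

At boot, the loader applies the same checksum test and falls back to the other area if it fails.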

Tall Jeff
Actually we've had directories that are very rarely written to, like /lib/modules, deleted by fsck (!). I like your dual partition setup; I wanted to implement something like that here but it was given a very low priority by management.
David Holm
@David - Yes, deletes are certainly possible for anything that is ever modified. What fsck does is less than ideal, and there might even be bugs that cause it to toss more than it should/could. It fixes the file system integrity, but it also happens at the expense of some of the data.
Tall Jeff
+1  A: 

Dave,

I always recommend running the fsck after a number of reboots, but not every time.

The reason is that ext3 is journaled. So unless you mount it with data=writeback (which journals only the metadata, not the data), then most of the time your metadata/file-system tables should be in sync with your data (files).
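
If it matters for your data, the journaling mode can be pinned explicitly at mount time. A small illustration, with a hypothetical device and mount point (data=ordered is typically the default anyway):

    /* Illustration: mount ext3 with an explicit journaling mode.
     * Device and mount point are hypothetical. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* data=ordered writes file data before the metadata referring to it
         * is committed to the journal; data=writeback drops that ordering. */
        if (mount("/dev/sda2", "/data", "ext3", 0, "data=ordered") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }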

But as Jeff mentioned, that doesn't guarantee anything at the layer above the file-system. It means you can still get "corrupted" files, because some of the records probably didn't get written to the file system.

I'm not sure what embedded device you're running on, but how often does it get rebooted? If it's a controlled reboot, you can always do "sync;sync;sync" before the restart.
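
For a controlled restart the sequence can be as simple as the sketch below; note that reboot(2) does not flush buffers on its own, which is why the sync comes first (this needs root / CAP_SYS_BOOT, and normally the init system's shutdown path handles it for you):

    /* Sketch of a controlled restart: flush dirty data, then reboot. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/reboot.h>

    int main(void)
    {
        sync();  /* flush the page cache; reboot(2) won't do it for us */

        if (reboot(RB_AUTOBOOT) != 0) {
            perror("reboot");
            return 1;
        }
        return 0;
    }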

I've been using CF myself for years, and only on very rare occasions have I gotten file-system errors. fsck does help in those cases.

And about separating your partitions, I doubt the advantage of it. Every file on the file-system has metadata associated with it. Most of the time, if you don't change the files, e.g. binary/system files, then this metadata shouldn't change either. Unless you have faulty hardware, like cross-talk between writes and reads, those read-only files should be safe.

Most problems arise when you have something writable, and regardless of where you put it, it can cause problems if the application doesn't handle it well.

Hope that helps.

KOkon
We usually reboot once every 24 hours. Why three syncs, shouldn't one be enough?
David Holm
I'm just curious, why every 24 hours? Most embedded systems I know of should run forever (except for S/W upgrades). The three syncs were because of some bugs in older kernels, where sync was not fully synchronous; in conjunction with umount, they made sure the blocks of the mounted filesystem were flushed out before the unmount.
KOkon
The 24h reboot was implemented because we had to deal with some buggy and poorly supported drivers which occasionally stop working. Unloading them usually causes a kernel panic, so we decided on a controlled reboot rather than having the watchdog reset the board.
David Holm
If you happen to have the source code, you probably want to fix the driver rather than deal with the complexity of file-system corruption.
KOkon