Performance fix for O_SYNC behaviour Remove the unconditional mark_inode_dirty() which was being done on O_SYNC inodes, as that is a major performance hit if we are updating files in place and don't need an inode timestamp update to hit the journal (Linux really implements O_DATASYNC instead of O_SYNC). Cater for the data=journal case by forcing a commit after the write if needed; also do this for IS_SYNC() case. (Simply setting the inode dirty before the write is not sufficient in the data=journal case --- the write might get split over multiple transactions and there is no guarantee that later transactions would get flushed.) Add a bit of documentation around the ext3_file_write --- the interactions between the various sync modes and the core VFS (especially generic_osync_inode) are subtle. --- linux-2.4.19-ext3/fs/ext3/file.c.=K0007=.orig Thu Nov 15 21:37:55 2001 +++ linux-2.4.19-ext3/fs/ext3/file.c Fri Oct 11 15:52:01 2002 @@ -61,19 +61,52 @@ static ssize_t ext3_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos) { + int ret, err; struct inode *inode = file->f_dentry->d_inode; - /* - * Nasty: if the file is subject to synchronous writes then we need - * to force generic_osync_inode() to call ext3_write_inode(). - * We do that by marking the inode dirty. This adds much more - * computational expense than we need, but we're going to sync - * anyway. - */ - if (IS_SYNC(inode) || (file->f_flags & O_SYNC)) - mark_inode_dirty(inode); + ret = generic_file_write(file, buf, count, ppos); - return generic_file_write(file, buf, count, ppos); + /* Skip file flushing code if there was an error, or if nothing + was written. */ + if (ret <= 0) + return ret; + + /* If the inode is IS_SYNC, or is O_SYNC and we are doing + data-journaling, then we need to make sure that we force the + transaction to disk to keep all metadata uptodate + synchronously. */ + + if (file->f_flags & O_SYNC) { + /* If we are non-data-journaled, then the dirty data has + already been flushed to backing store by + generic_osync_inode, and the inode has been flushed + too if there have been any modifications other than + mere timestamp updates. + + Open question --- do we care about flushing + timestamps too if the inode is IS_SYNC? */ + if (!ext3_should_journal_data(inode)) + return ret; + + goto force_commit; + } + + /* So we know that there has been no forced data flush. If the + inode is marked IS_SYNC, we need to force one ourselves. */ + if (!IS_SYNC(inode)) + return ret; + + /* Open question #2 --- should we force data to disk here too? + If we don't, the only impact is that data=writeback + filesystems won't flush data to disk automatically on + IS_SYNC, only metadata (but historically, that is what ext2 + has done.) */ + +force_commit: + err = ext3_force_commit(inode->i_sb); + if (err) + return err; + return ret; } struct file_operations ext3_file_operations = {