Performance fix for O_SYNC behaviour

Remove the unconditional mark_inode_dirty() which was being done on O_SYNC
inodes, as that is a major performance hit if we are updating files in place
and don't need an inode timestamp update to hit the journal (Linux really
implements O_DATASYNC instead of O_SYNC).

Cater for the data=journal case by forcing a commit after the write if needed;
also do this for IS_SYNC() case.  (Simply setting the inode dirty before the
write is not sufficient in the data=journal case --- the write might get split
over multiple transactions and there is no guarantee that later transactions
would get flushed.)

Add a bit of documentation around the ext3_file_write --- the interactions
between the various sync modes and the core VFS (especially
generic_osync_inode) are subtle.

--- linux-2.4.19-ext3/fs/ext3/file.c.=K0007=.orig	Thu Nov 15 21:37:55 2001
+++ linux-2.4.19-ext3/fs/ext3/file.c	Fri Oct 11 15:52:01 2002
@@ -61,19 +61,52 @@
 static ssize_t
 ext3_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
 {
+	int ret, err;
 	struct inode *inode = file->f_dentry->d_inode;
 
-	/*
-	 * Nasty: if the file is subject to synchronous writes then we need
-	 * to force generic_osync_inode() to call ext3_write_inode().
-	 * We do that by marking the inode dirty.  This adds much more
-	 * computational expense than we need, but we're going to sync
-	 * anyway.
-	 */
-	if (IS_SYNC(inode) || (file->f_flags & O_SYNC))
-		mark_inode_dirty(inode);
+	ret = generic_file_write(file, buf, count, ppos);
 
-	return generic_file_write(file, buf, count, ppos);
+	/* Skip file flushing code if there was an error, or if nothing
+	   was written. */
+	if (ret <= 0)
+		return ret;
+	
+	/* If the inode is IS_SYNC, or is O_SYNC and we are doing
+           data-journaling, then we need to make sure that we force the
+           transaction to disk to keep all metadata uptodate
+           synchronously. */
+
+	if (file->f_flags & O_SYNC) {
+		/* If we are non-data-journaled, then the dirty data has
+                   already been flushed to backing store by
+                   generic_osync_inode, and the inode has been flushed
+                   too if there have been any modifications other than
+                   mere timestamp updates.
+		   
+		   Open question --- do we care about flushing
+		   timestamps too if the inode is IS_SYNC? */
+		if (!ext3_should_journal_data(inode))
+			return ret;
+
+		goto force_commit;
+	}
+
+	/* So we know that there has been no forced data flush.  If the
+           inode is marked IS_SYNC, we need to force one ourselves. */
+	if (!IS_SYNC(inode))
+		return ret;
+	
+	/* Open question #2 --- should we force data to disk here too?
+           If we don't, the only impact is that data=writeback
+           filesystems won't flush data to disk automatically on
+           IS_SYNC, only metadata (but historically, that is what ext2
+           has done.) */
+	
+force_commit:
+	err = ext3_force_commit(inode->i_sb);
+	if (err) 
+		return err;
+	return ret;
 }
 
 struct file_operations ext3_file_operations = {