Storage providers do not use diffs to store changes

description

At least in the file-system storage provider, each change is stored as a complete new file.
When revisions are compared, one complete file is compared against the other.

I believe this model is used by all storage providers (file system, DB, etc.).

Over time, as pages accumulate revisions, this leads to a large amount of storage bloat.


See (at least) the PmWiki model, which stores only the diffs between revisions and can still restore any version.
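
To make the idea concrete, here is a minimal sketch in Python of a diff-based history that keeps the latest text in full and every older revision as a reverse delta. It is only an illustration under stated assumptions: the names PageHistory, make_delta and apply_delta are hypothetical, Python's standard difflib stands in for whatever diff engine a provider would actually use, and the chained-delta layout is an assumption rather than PmWiki's exact on-disk format.

```python
import difflib


def make_delta(source, target):
    """Return ops that rebuild `target` (a list of lines) from `source`."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: store only the line range, not the text.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Lines that exist only in `target` must be stored verbatim.
            delta.append(("add", target[j1:j2]))
        # "delete" needs no entry: those source lines are simply not copied.
    return delta


def apply_delta(source, delta):
    """Rebuild the target lines from `source` and a delta from make_delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(source[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class PageHistory:
    """Hypothetical store: latest text in full, older revisions as reverse deltas."""

    def __init__(self, text):
        # Note: splitlines() drops a trailing newline; fine for a sketch.
        self.latest = text.splitlines()
        self.deltas = []  # deltas[i] rebuilds revision i from revision i + 1

    def save(self, text):
        """Record a new latest revision, keeping only a delta for the old one."""
        new = text.splitlines()
        self.deltas.append(make_delta(new, self.latest))
        self.latest = new

    def revision(self, number):
        """Return revision `number` (0 is the oldest) as text."""
        lines = self.latest
        # Walk backwards from the latest revision, one delta at a time.
        for delta in reversed(self.deltas[number:]):
            lines = apply_delta(lines, delta)
        return "\n".join(lines)
```

With this layout, rendering the current page needs no diff work at all; only viewing or comparing an old revision pays the cost of replaying deltas, and that cost grows with how far back the revision is.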

Adjusting the storage providers to store as diffs might be a breaking change for some extensions (or maybe not, I don't know).

If at all possible, we should probably make it a config option, with diffs as the default and full-copy storage available as an alternative.

comments

Mriswith wrote May 24, 2013 at 11:28 AM

I can confirm the SQL Server provider also stores the full content of every previous revision. Diffs are needed here as well.

Mriswith wrote Jun 7, 2013 at 12:56 PM

We could do with tracking down a diff library that allows recreation of a file from the diffs.

MichaelPaulukonis wrote Jun 10, 2013 at 5:13 PM

I'd be interested in taking ownership, but I can't start for a month or so.

Some links of interest:

http://code.google.com/p/google-diff-match-patch/

https://diffplex.codeplex.com/

Mriswith wrote Jun 11, 2013 at 2:32 PM

I think google-diff-match-patch will be easier to integrate. I'm also more convinced by it because it explains which algorithms it uses.

I had already spotted the DiffPlex project, but I wasn't convinced. I haven't managed to track down other C# ones myself, but I have not had time to check.

MichaelPaulukonis wrote Sep 19, 2013 at 3:42 PM

There is already a diff engine in STW; it's used to compare revisions.
However, each revision is still stored in toto.

Here are some notes on how MediaWiki handles revisions: http://www.mediawiki.org/wiki/MediaWiki_architecture_document/text/revision_2#Database_and_text_storage

And DokuWiki apparently stored complete page versions, diffing them for comparison: https://en.wikipedia.org/wiki/DokuWiki#Main_features

I'm more familiar with PmWiki's approach:
Inside of a page file, PmWiki stores the latest version of the markup text, and uses this to render the page. The page history is kept as a sequence of differences between the latest version of the page and each previous version.
If space is an issue (and I would think it might be), then compressing the previous page versions might be a partial answer, since decompression would only be needed when doing a diff (or so I believe).
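
A rough sketch of the compression idea, in Python for brevity: each old revision is still stored in full, but compressed with zlib, and is only decompressed when someone views or diffs it. The function names pack_revision and unpack_revision are hypothetical, and how much space is actually saved depends on the page content.

```python
import zlib


def pack_revision(text):
    """Compress one full revision before it is written to storage."""
    return zlib.compress(text.encode("utf-8"), 9)


def unpack_revision(blob):
    """Restore the original text; only needed when viewing or diffing."""
    return zlib.decompress(blob).decode("utf-8")


# Usage: repetitive wiki markup compresses well, and the round trip is lossless.
old_text = "== Heading ==\nSome wiki markup line.\n" * 200
blob = pack_revision(old_text)
restored = unpack_revision(blob)
```

This keeps the storage model unchanged (one blob per revision), so it should be less invasive than switching to diffs, at the price of smaller savings when consecutive revisions are nearly identical.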

The official word was that the file-storage version was slower than the DB version.
Anybody have an idea why that would be true?

PmWiki uses flat-files and remains speedy at tens of thousands of pages.