You may have noticed that we now show an edit summary on Community Wiki posts:
community wiki
220 revisions, 48 users
I'd like to also show the user who "most owns" the final content displayed on the page, as a percentage of the remaining text:
community wiki
220 revisions, 48 users
kronoz 87%
Yes, there could be top (n) "owners", but for now I want the top 1.
Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:
    User Id   Post-Text
    -------   ---------
    12        The quick brown fox jumps over the lazy dog.
    27        The quick brown fox jumps, sometimes.
    30        I always see the speedy brown fox jumping over the lazy dog.
Which of these users most "owns" the final text?
I'm looking for a reasonable algorithm -- it can be an approximation, it doesn't have to be perfect -- to determine the owner. Ideally expressed as a percentage score.
Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any Stack Overflow post with a decent revision history (not just retagging, but frequent post body changes) as a test corpus. Here's a good one, with 15 revisions from 14 different authors. Who is the "owner"?
http://stackoverflow.com/revisions/327973/list
Click "view source" to get the raw text of each revision.
I should warn you that a pure algorithmic solution might end up being a form of the Longest Common Substring Problem. But as I mentioned, approximations and estimates are fine too if they work well.
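To make the "percentage of the remaining text" idea concrete, here is a minimal sketch of one possible approximation, not a recommended answer. It assumes word-level granularity and uses Python's standard `difflib` to carry authorship forward: a word that survives an edit keeps its earlier owner, and anything new is credited to the editor. The function name `ownership_by_surviving_words` and the `(user_id, text)` tuple layout are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def ownership_by_surviving_words(revisions):
    """revisions: list of (user_id, text) pairs in chronological order.

    Returns {user_id: percentage of the final text's words credited to them}.
    Users with no surviving words are omitted from the result.
    """
    words = []   # words of the revision processed so far
    owners = []  # owners[i] is the user credited with words[i]
    for user_id, text in revisions:
        new_words = text.split()
        # Assumption: every word belongs to the editor unless it matches
        # a word that was already present in the previous revision.
        new_owners = [user_id] * len(new_words)
        matcher = SequenceMatcher(None, words, new_words, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_owners[block.b + k] = owners[block.a + k]
        words, owners = new_words, new_owners

    total = len(owners)
    if total == 0:
        return {}
    return {user: 100.0 * n / total for user, n in Counter(owners).items()}


revisions = [
    (12, "The quick brown fox jumps over the lazy dog."),
    (27, "The quick brown fox jumps, sometimes."),
    (30, "I always see the speedy brown fox jumping over the lazy dog."),
]
result = ownership_by_surviving_words(revisions)
```

On the three-row example this credits user 30 with roughly 83% and user 12 with roughly 17%, because only "brown fox" survives from the first revision and nothing unique to user 27 remains. The comparison is case- and punctuation-sensitive ("The" and "the" differ), which is one of several places where a real implementation might choose differently.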
Solutions in any language are welcome, but I prefer solutions that
- are fairly easy to translate into C#.
- are free of dependencies.
- put simplicity before efficiency.
It is extraordinarily rare for a post on SO to have more than 25 revisions. But the result should "feel" accurate, so if you eyeballed the edits you'd agree with the final decision. I encourage you to test your algorithm out on Stack Overflow posts with revision histories and see if you agree with the final output.
I have now deployed the following approximation, which you can see in action for every new saved revision on Community Wiki posts:
- do a line based diff of every revision where the body text changes
- sum the insertion and deletion lines for each revision as "editcount"
- each userid gets sum of "editcount" they contributed
- first revision author gets 2 × "editcount" as initial score, as a primary authorship bonus
- to determine final ownership percentage: each user's edited line count total divided by total number of edited lines in all revisions
(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, etcetera. The line-based diff makes it fairly speedy to recalc for all revisions; in a typical case of say 10 revisions it's ~50ms.)
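The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the deployed code: it uses Python's `difflib` for the line diff, treats the first revision's "editcount" as its line count (a diff against an empty body), and includes the 2x bonus in the denominator so the percentages sum to 100. The names `edit_count`, `ownership_percentages`, and `first_author_bonus` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def edit_count(old_text, new_text):
    """Lines deleted plus lines inserted between two revision bodies."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            count += (i2 - i1) + (j2 - j1)  # deleted lines + inserted lines
    return count


def ownership_percentages(revisions, first_author_bonus=2):
    """revisions: list of (user_id, text) pairs in chronological order."""
    scores = defaultdict(int)
    previous = ""
    for index, (user_id, text) in enumerate(revisions):
        if text == previous:
            continue  # body unchanged (for example, a retag), so no score
        count = edit_count(previous, text)
        if index == 0:
            # Assumption: the bonus multiplies the first revision's own count.
            count *= first_author_bonus
        scores[user_id] += count
        previous = text

    total = sum(scores.values())
    if total == 0:
        return {}
    return {user: 100.0 * score / total for user, score in scores.items()}


example = [
    (1, "a\nb\nc"),      # 3 lines inserted, doubled to 6
    (2, "a\nB\nc"),      # 1 line deleted + 1 inserted = 2
    (3, "a\nB\nc\nd"),   # 1 line inserted = 1
]
shares = ownership_percentages(example)
```

For `example`, the scores are 6, 2, and 1 out of 9, so user 1 gets about two thirds. A single-author history falls out as 100% without a special case, though the guard clauses mentioned above would avoid running the diffs at all.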
This works fairly well in my testing. It does break down a little when you have small 1 or 2 line posts that several people edit, but I think that's unavoidable. I'm accepting Joel Neely's answer as closest in spirit to what I went with, and I've upvoted everything else that seemed workable.