You may have noticed that we now show an edit summary on Community Wiki posts:

community wiki
220 revisions, 48 users

I'd like to also show the user who "most owns" the final content displayed on the page, as a percentage of the remaining text:

community wiki
220 revisions, 48 users
kronoz 87%

Yes, there could be top (n) "owners", but for now I want the top 1.

Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:

User Id     Post-Text
-------     ---------
12          The quick brown fox jumps over the lazy dog.
27          The quick brown fox jumps, sometimes.
30          I always see the speedy brown fox jumping over the lazy dog.

Which of these users most "owns" the final text?

I'm looking for a reasonable algorithm -- it can be an approximation, it doesn't have to be perfect -- to determine the owner. Ideally expressed as a percentage score.

Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any stackoverflow post with a decent revision history (not just retagging, but frequent post body changes) as a test corpus. Here's a good one, with 15 revisions from 14 different authors. Who is the "owner"?

http://stackoverflow.com/revisions/327973/list

Click "view source" to get the raw text of each revision.

I should warn you that a pure algorithmic solution might end up being a form of the Longest Common Substring Problem. But as I mentioned, approximations and estimates are fine too if they work well.

Solutions in any language are welcome, but I prefer solutions that are

  1. Fairly easy to translate into C#.
  2. Free of dependencies.
  3. Simple, even at the expense of efficiency.

It is extraordinarily rare for a post on SO to have more than 25 revisions. But it should "feel" accurate, so if you eyeballed the edits you'd agree with the final decision. I encourage you to test your algorithm out on stack overflow posts with revision histories and see if you agree with the final output.


I have now deployed the following approximation, which you can see in action for every new saved revision on Community Wiki posts:

  • do a line-based diff of every revision where the body text changes
  • sum the inserted and deleted lines for each revision as that revision's "editcount"
  • each user id gets the sum of the "editcount" they contributed
  • the author of the first revision gets 2× their "editcount" as an initial score, as a primary authorship bonus
  • to determine final ownership percentage: each user's edited-line total divided by the total number of edited lines across all revisions

(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, etcetera. The line-based diff makes it fairly speedy to recalc for all revisions; in a typical case of say 10 revisions it's ~50ms.)
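For illustration, here is a minimal Python sketch of that approximation. The names are mine (not the production code), and revisions is assumed to be a chronological list of (user_id, body_text) pairs:

import difflib
from collections import defaultdict

def ownership_percentages(revisions):
    edit_lines = defaultdict(int)
    for i, (user_id, text) in enumerate(revisions):
        prev = revisions[i - 1][1] if i > 0 else ''
        if i > 0 and text == prev:
            continue  # skip revisions that don't change the body (e.g. retag-only)
        diff = difflib.ndiff(prev.splitlines(), text.splitlines())
        editcount = sum(1 for line in diff if line.startswith(('+ ', '- ')))
        if i == 0:
            editcount *= 2  # primary authorship bonus for the first revision
        edit_lines[user_id] += editcount
    total = sum(edit_lines.values()) or 1
    return dict((user, 100.0 * count / total) for user, count in edit_lines.items())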

This works fairly well in my testing. It does break down a little when you have small one- or two-line posts that several people edit, but I think that's unavoidable. I'm accepting Joel Neely's answer as closest in spirit to what I went with, and I've upvoted everything else that seemed workable.

A: 

Not sure if it is possible, but you could count the number of characters added.

Example:

  • User 1 submits a question with 400 characters (User 1 has 400 of 400)
  • User 2 deletes 40 and adds 60 (User 1 had 360 and User 2 has 60)

If you revert you should also revert to the previous user/character count.
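For what it's worth, here is a minimal Python sketch of that bookkeeping. It tracks an owner per character so deletions can be charged to the right user; the names are mine, and SequenceMatcher is just one way to align the old and new text:

import difflib

def apply_revision(old_text, old_owners, new_text, editor):
    # old_owners[i] is the user who owns old_text[i]; returns the new owner list.
    new_owners = []
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            new_owners.extend(old_owners[i1:i2])     # unchanged characters keep their owner
        elif op in ('insert', 'replace'):
            new_owners.extend([editor] * (j2 - j1))  # new characters belong to the editor
        # 'delete': those characters, and their ownership, simply vanish
    return new_owners

Per-user counts are then just a tally of the owner list (e.g. collections.Counter).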

But... maybe it is just simpler and more fair to name the original poster...

On second thought: I edit a lot, but I never consider myself the "owner" of the text, because I just alter the presentation (formatting and grammar), not the contents.

Gamecat
+3  A: 

Off the top of my head I'd do something like this:

  • I think counting words, as opposed to lines or characters, makes sense
  • Tokenize the original revision into words, and attach the author to each
  • Step through the revision history and, as words are added, attach the author to them
  • If words are deleted, just forget them
  • If words are changed, you can either count them as a delete/insert, or use some sort of character-based threshold so that spelling corrections aren't attributed to a new author
  • You end up with a list of words recording who originally wrote each one (sketched below)
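A rough Python sketch of that word-level attribution, assuming revisions is a chronological list of (author, text) pairs (the names are illustrative, and the spelling-correction threshold is left out for brevity):

import difflib

def attribute_words(revisions):
    owned = []  # (word, author) pairs for the current text
    for author, text in revisions:
        old_words = [w for w, _ in owned]
        new_words = text.split()
        matcher = difflib.SequenceMatcher(None, old_words, new_words)
        result = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                result.extend(owned[i1:i2])  # surviving words keep their original author
            elif op in ('insert', 'replace'):
                result.extend((w, author) for w in new_words[j1:j2])
            # deleted words are just forgotten
        owned = result
    return owned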
Draemon
+17  A: 

Saw your tweet earlier. From the display of the 327973 link, it appears you already have a single-step diff in place. Based on that, I'll focus on the multi-edit composition:

  1. A, the original poster owns 100% of the post.

  2. When B, a second poster, makes edits such that e.g. 90% of the text is unchanged, the ownership is A:90%, B:10%.

  3. Now C, a third party, changes 50% of the text. (A:45%, B:5%, C:50%)

    In other words, when a poster makes edits such that x% is changed and y = (100-x)% is unchanged, then that poster now owns x% of the text and all previous ownership is multiplied by y%.

    To make it interesting, now suppose...

  4. A makes a 20% edit. Then A owns a "new" 20%, and the residual ownerships are now multiplied by 80%, leaving (A:36%, B:4%, C:40%). The "net" ownership is therefore (A:56%, B:4%, C:40%).
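Before the worked example, here is the update rule as a tiny Python sketch (the names are mine; ownership maps each author to a fraction, and changed is the fraction of text the edit touched):

def apply_edit(ownership, editor, changed):
    unchanged = 1.0 - changed
    result = dict((author, share * unchanged) for author, share in ownership.items())
    result[editor] = result.get(editor, 0.0) + changed
    return result

shares = {'A': 1.0}                     # A posts
shares = apply_edit(shares, 'B', 0.10)  # A: 90%, B: 10%
shares = apply_edit(shares, 'C', 0.50)  # A: 45%, B: 5%, C: 50%
shares = apply_edit(shares, 'A', 0.20)  # A: 56%, B: 4%, C: 40%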

Applying this to your specimen (327973) with everything rounded to the nearest percent:

Version 0: The original post.

  • Paul Oyster: 100%

Version 1: Your current diff tool shows pure addition of text, so all those characters belong to the second poster.

  • Paul Oyster: 91%
  • onebyone: 9%

Version 2: The diff shows replacement of a word. The new word belongs to the third poster, and the remaining text belongs to the prior posters.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 3: Tag-only edit. Since your question was about the text, I'm ignoring the tags.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 4: Addition of text.

  • Paul Oyster: 45%
  • onebyone: 4%
  • Blogbeard: 1%
  • Mark Harrison: 50%

I hope that's enough to give the sense of this proposal. It does have a couple of limitations, but I'm sliding these in under your statement that an approximation is acceptable. ;-)

  1. It distributes the effect of change bluntly across all prior owners. If A posts, B does a pure addition, and C edits half of what B added, this simplistic approach just applies C's ownership across the entire post, without trying to parse out whose prior text was changed the most.

  2. It accounts for additions or changes, but doesn't give any ownership credit for deletion, because the deleter adds 0% to the remaining text. You can either regard this as a bug or a feature. I chose door number 2.

Update: A bit more about issue #1 above. I believe that fully tracking the ownership of the part of a post that is edited would require one of two things (the margin of the web page is not big enough for a formal proof ;-):

  • Changing the way text is stored to reflect ownership of individual portions of the text (e.g. A owns words 1-47, B owns words 48-59, A owns words 60-94,...), applying the "how much remains" approach in my proposal to each portion, and updating the portion-ownership data.

  • Considering all versions from first to current (in effect, recomputing the portion-ownership data on the fly).

So this is a nice example of a trade-off between a quick-and-dirty approximation (at the cost of precision), a change to the entire database (at the cost of space), or every calculation having to look at the entire history (at the cost of time).

joel.neely
This is similar to Gamecat's proposal, but the problem here is around step 3, what if C changes 50% of the text, but they remove *all* of the text added by B? Should it really be A:50%, B:0%, C:50% in that case?
Adam Bellaire
@Adam, you can't just add an ownership label to each character. So it has to be an estimate.
Gamecat
it's actually quite complicated if you sit down and think out all the scenarios. In particular the "overwrote all of another user's content" one.
Jeff Atwood
"you can't just add a ownership label on each character." how about an ownership label on each word/token? why not?
Jeff Atwood
@Jeff: I like that better. It too would be an estimate, tho. Certain formatting changes can lead the current diff engine to think I added words when they were just moved around. Also, if I correct a typo or caps, the current highlighting says I added the whole word.
Adam Bellaire
@Gamecat: That's true of course, you have to draw the line somewhere. I guess the question is, what kind of estimate is good enough for Jeff? :)
Adam Bellaire
Ownership at the word level is reasonable for human-readable text IMHO.
joel.neely
@Adam Bellaire: I expanded the post to show a concrete example and commented that this simple-minded approach doesn't distribute ownership in a context-sensitive way (i.e. tracking who added the stuff changed). It's only an estimate.
joel.neely
C benefits from reading the text entered by B before removing all of it. C's contribution may be just rephrasing B. Or he could have learned from B's mistake. I don't think it's illogical for B to still get credit even though his original work has been eliminated. KISS.
Patrick McElhaney
+3  A: 

This will work if you want to implement/leverage a diff algorithm à la Python's difflib -- you will probably have to do some kind of diff in any case. This snippet calls the user with the most text churn the winner.

Excuse my hardcoding.

#!/usr/bin/env python
# Python 2 code (urllib2, iteritems, print statements).

import collections
import difflib
import logging
import pprint
import urllib2
import re

class OwnageDeterminer(object):

    # Relative weights for added vs. removed characters; tweak to taste.
    add_coefficient = 1
    remove_coefficient = .5

    def __init__(self, edits):
        self.edits = edits
        self.counts_by_username = {}

    def __call__(self):
        edits, counts_by_username = self.edits, self.counts_by_username
        for i, edit in enumerate(edits):
            username = edit['username']
            unique_counts = {'added': 0, 'removed': 0}
            existing_text = edits[i-1]['text'] if i > 0 else ''
            new_text = edits[i]['text']
            # ndiff on two strings produces a character-by-character diff.
            for char_diff in difflib.ndiff(existing_text, new_text):
                if char_diff.startswith('+'):
                    unique_counts['added'] += 1
                elif char_diff.startswith('-'):
                    unique_counts['removed'] += 1
            user_counts = counts_by_username.get(username, collections.defaultdict(int))
            user_counts['removed'] += self.remove_coefficient * unique_counts['removed']
            user_counts['added'] += self.add_coefficient * unique_counts['added']
            counts_by_username[username] = user_counts
        # The user with the most weighted churn (adds plus removes) wins.
        winner = None
        winning_score = 0
        score_by_username = {}
        for username, counts in counts_by_username.iteritems():
            score = counts['removed'] + counts['added']
            if score > winning_score:
                winner = username
                winning_score = score
            score_by_username[username] = score
        logging.debug('Scores: %s', pprint.pformat(score_by_username))
        return winner


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    site = urllib2.urlopen('http://stackoverflow.com/revisions/327973/list')
    contents = site.read()
    # Scrape (revision-source URI, author username) pairs from the revision list.
    regex = re.compile(r'(/revisions/viewmarkup/\d+).*?/users/\d+/([\w-]+)',
                       re.MULTILINE|re.DOTALL)
    revisions = regex.findall(contents)
    print revisions
    edits = []
    # Sort numerically by revision id so the edits are applied in order.
    for reluri, username in sorted(revisions, key=lambda t: int(t[0].rsplit('/', 1)[-1])):
        text = urllib2.urlopen('http://stackoverflow.com{0}'.format(reluri)).read()
        edit = {'username': username, 'text': text}
        edits.append(edit)
    od = OwnageDeterminer(edits)
    print od()

The output:

DEBUG:root:Scores: {'blorgbeard': 0.5,
 'dave-markle': 0.5,
 'dbr': 1172.0,
 'gatekiller': 69.5,
 'joseph-ferris': 0.0,
 'lkessler': 0.0,
 'mark-harrison': 592.0,
 'mdb': 3.0,
 'onebyone-livejournal-com': 0.0,
 'paul-oyster': 482.0,
 'rob-wells': 0.0,
 'simucal': 1070.5,
 'skiphoppy': 0.0,
 'thesoftwarejedi': 701.0}
dbr

Difflib docs on complexity:

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Another nice thing is that this winner calculation is linear, so you can cache the original results and do incremental updates on new edits, despite the large initialization load.

cdleary
hmm. looking at the revision history, SoftwareJedi mostly *deleted* content. (and Simucal can be disregarded as a rollback.) Should Jedi be the winner? Doesn't feel right.
Jeff Atwood
You can add arbitrary coefficients on the add and removal counts. Tweak it at will! :-)
cdleary
Also, even if every bit of the original user's question was modified, I think they should get some credit for having asked the original question. IMHO, whatever's decided, it should probably be weighted such that the original user gets at least X% credit (my suggestion: X = 10)
Phil
@Jeff, one thing you can be sure of: regardless of which algorithm you choose, there will always be people complaining ;-).
Gamecat
I think it makes more sense for the OP to start with a fixed offset instead of a percentage. They made the original contribution, but did they cultivate it?
cdleary
Given the "revisions ≈ time" ideas that are cropping up, I think there are some more interesting things you can do here. I'd use this algorithm if I wanted to keep it simple, but there's lots of other neat possibilities. +1 for simple/tweakable, -1 for boring. :-)
cdleary
I looked up Aaron Swartz's "Who Writes Wikipedia" (which everyone should read) at http://www.aaronsw.com/weblog/whowriteswikipedia because it's a similar problem, and he uses difflib.SequenceMatcher.find_longest_match. This answer uses difflib but a different function -- are the answers the same?
ShreevatsaR
+2  A: 

A question is whether removing characters is as valid as adding them. Using a diff may work well, but it doesn't always follow that the person who has added the most is the most valuable editor (some people can write in 10 characters what others take a page to write).

So I think you have two factors here:

  • Quantity of change
  • Quality of change

Hence I would be tempted to write something that simply maps words added/removed to a points score. Add to that some kind of quality factor (which would have to be an external parameter), and the combined value can then be used for ranking.

You could arguably generate some kind of primitive "quality factor" based on how long an item survived before being edited. So if something I wrote wasn't changed until the 12th edit, it couldn't have been too wrong (relative to lower-quality text that was changed as soon as it was added).
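As a rough Python sketch of that stability weighting: assume attributions[i] is the list of (word, author) pairs in effect after revision i (produced, say, by word-level attribution as in Draemon's answer); all names here are mine:

from collections import defaultdict

def stability_scores(attributions):
    # Every word present at a revision earns its author a point, so text
    # that survives many revisions accumulates proportionally more weight.
    scores = defaultdict(int)
    for attributed in attributions:
        for word, author in attributed:
            scores[author] += 1
    return scores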

good point -- "stability" of text over time should be a factor. If your text stayed in from versions 1 through 15, that should be weighted!
Jeff Atwood
Absolutely. Quality is a tricky concept to introduce here, but valid, I think. "Simplicity is the ultimate sophistication." ... Leonardo da Vinci
I like the idea of text remaining across several revisions being more valuable, but it seems prone to error given that you have no concept of real time. I can make five edits in five minutes, as I did in the above post. They could be small, which would make the rest of the text seem more stable.
cdleary
@cdleary that's true. Perhaps your consecutive edits would count as one edit. Or perhaps we need "pagerank" here :)
+1  A: 

Number of characters is tricky methinks.

How about:

Break latest revision into a set of sentences. 
//Sentence is any text fragment surrounded by punctuation

For each Sentence
    Find which user created that sentence. 
    Add 1 to the user who created the sentence 
    Add 1 to the number of sentences

For Each user 
    % ownership = Count for that user / Number of sentences.

Finding which user created that sentence:
Matching the sentence to a revision is easy if you want an exact match, but I'd be happier with a partial match.

To do this partial match...

Strip common words from the sentence fragment, and search for that stripped fragment in a stripped version of each revision. The earliest revision that contains it identifies the owner.

Options: (instead of common word stripping)

  • For each word in the sentence, condense each word to the first and last characters. This'll account for spelling mistakes, etc.
  • Only use the first and last two words in the sentence fragment.
  • Matching 80% of the words would probably be good enough to attribute ownership. (This is a difficult thing for the computer to do - and it can't be cached)

Stripping common words

You already do this for search, so the libraries, etc. are already there. Even if you use someone else's formula, you should maybe CommonWordStrip each revision before starting.
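A very rough Python sketch of the sentence-matching idea, with a tiny stop-word list standing in for real common-word stripping (all names are mine):

import re

COMMON_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'it', 'that'}

def strip_common(text):
    words = re.findall(r"[a-z']+", text.lower())
    return ' '.join(w for w in words if w not in COMMON_WORDS)

def sentence_ownership(revisions):
    # revisions: chronological list of (author, text) pairs
    final_author, final_text = revisions[-1]
    counts = {}
    for sentence in re.split(r'[.!?]+', final_text):
        stripped = strip_common(sentence)
        if not stripped:
            continue
        # the earliest revision containing the stripped sentence owns it
        owner = next((a for a, t in revisions if stripped in strip_common(t)),
                     final_author)
        counts[owner] = counts.get(owner, 0) + 1
    total = sum(counts.values()) or 1
    return dict((a, 100.0 * c / total) for a, c in counts.items())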

seanyboy
+4  A: 

If I am understanding your question correctly, it looks like you are trying to do what IBM did a while back with a Wikipedia research project: namely, seeing whose revisions to the text were the most accepted by other users, and how the overall text changed over time. The name of the project was History Flow, and the "history flow - how it works" page gives a pretty nice overview of how their algorithm worked.

Rob
I recently did something similar for Wikipedia text using Python's difflib module: http://hewgill.com/journal/entries/461-wikipedia-blame
Greg Hewgill
+1: I had the exact same idea, but couldn't think of what the hell it was called. Kudos.
Alex Lyman
+24  A: 

I think the idea is fundamentally flawed.

If someone writes a brilliant analysis with awful spelling and unclear examples, and I copy edit it extensively, have I created 60% of the work? Clearly not; the result is a derivative where most of the value comes from the initial poster. A useful measure is not possible based on character or word counts, but requires strong AI-level semantic analysis.

Quite apart from that, seeking credit based on “ownership” of articles would likely be entirely unhelpful and anti-wiki. On Wikipedia, for example, people who act as though they own articles are one of the most damaging influences.

Ahruman
I completely agree. Most of the users on SO understand wikis and revisions so displaying x revisions should suffice.
GateKiller
If this is the case, I don't think we should show the avatar/credentials of the last user to edit. Make it totally void of personal branding, with just an "edited [time] ago" link for more details.
cdleary
@cdleary, I agree with you on that.
John Sheehan
You deeply misunderstand; this is purely informational. In my post, above, John is the only user attached (right now). But I am the primary author. This is helpful information in assessing the post.
Jeff Atwood
@Jeff, please explain how it's helpful to know who is the primary author.
Patrick McElhaney
because we mix oil and water: wiki and reputation systems. We're a hybrid. We need both visible.
Jeff Atwood
As far as mixing fluids is concerned, SO feels far more reputation-systemy than wikiesque as it is.
Ahruman
@Ahruman: Agreed, it seems like we can either "trust the wiki", have no ownership, and trust the votes **OR** have "primary ownership" and try to hack the lower editing barrier of Community Wiki into the existing (working) reputation system.
cdleary
Of course, in Community Wiki, posts that were up-voted have a lower barrier to being *totally* replaced/removed in a newer revision, potentially making those up-votes no longer applicable. This is the point of trust. :-)
cdleary
+4  A: 

What about simply calculating the Levenshtein distance of each person's edit to the previous version? Then sum the distance scores for each user and calculate the user's percentage of the sum of all users' distance scores.

Here's some C# code to calculate the distances.
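For reference, here is a Python sketch of that scoring, using the two-row form of the Levenshtein computation (which also addresses the memory concern raised in the comments below; the names are mine):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ownership_by_distance(revisions):
    # revisions: chronological list of (author, text) pairs
    scores = {}
    for i, (author, text) in enumerate(revisions):
        prev_text = revisions[i - 1][1] if i > 0 else ''
        scores[author] = scores.get(author, 0) + levenshtein(prev_text, text)
    total = sum(scores.values()) or 1
    return dict((a, 100.0 * s / total) for a, s in scores.items())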

brien
wow, Levenshtein is *CRAZY* expensive. Like 200 milliseconds per post on a 3.5 GHz Core 2 Duo for a typical post, expensive!
Jeff Atwood
Well, Levenshtein *does* have quadratic runtime and there's not much that you can do about this (in this general case). Is there no way to cache the values, i.e. only calculate them once, when the edit was made?
Konrad Rudolph
Note that the C# code mentioned in the answer is the naive way to calculate the Levenshtein distance. Because it allocates a 2-dimensional array, it uses a gigantic amount of memory that is largely unneeded. By allocating only two 1-dimensional arrays, the algorithm is at least 4 times more memory-efficient.
Bas Leijdekkers
If I'm not mistaken, this is what I did, except I didn't normalize the scores. What do you gain by normalizing, anyway? The largest score wins in either case, no?
cdleary
Actually, there are different semantics for replacement. I classify as unique to either the first or second string, so replacement is both an addition and a removal, whereas the edit distance calculation has its own, separate, weight for replacement.
cdleary
+3  A: 

Nobody owns it. Awarding ownership violates the spirit of "community wiki" and may lead to counterproductive edit wars.

Patrick McElhaney
A: 

The initial concept must be given a weight. Spelling corrections must be given a weight. Grammar/structure must be given a weight. Wording must be given a weight. Conciseness must be given a weight. Etc...

Weights are the only fair way to do it but also impossible to determine.

Joe Philllips
+5  A: 
Lasse V. Karlsen
"I don't know how to solve the [rollbacks] problem though" we have a flag indicating rollback to revision; I will ignore content from a rollback as if it never got changed at all.
Jeff Atwood
I'll post the code to this when I get home, so you can experiment, if you're interested. The code relies on classes in my class library, but they're there for the taking if you fancy them.
Lasse V. Karlsen
Source (VS2008) can be found here: http://vkarlsen.serveftp.com:81/svn/StackOverflow/SO424220/ (use guest for both username and password.) Use a Subversion client to grab the source code.
Lasse V. Karlsen
+1  A: 

The key to having a good solution to this problem is to get extra information about what is going on with the editing. The extra information available in this medium is the voting and response rate to questions. So if someone makes an edit that leads to the question getting a lot of upvotes, comments, and responses, their edit was very valuable. Special cases probably exist for a brand new unanswered question that gets edited before it has been on the site for a very long time.

You should look at how much the post has changed using the Levenshtein distance algorithm and then weight their edit by the number of votes, comments, and answers the question receives after their edit.

Let n = total revisions
Let m = the revision number of a poster's edit
Let post[it] = the text of the post at revision 'it'
Let votes[it] = the votes that revision 'it' received (plus a bonus for comments/answers)

value = 0
for (it = m; it < n; ++it) {
  value += (Levenshtein(post[it-1], post[m]) / average_length_post) * votes[it]
}

If you calculate this value for each edit, the ownership of a post is the total value of all edits by that user divided by the sum of all edit values for that post.

MikeN
+2  A: 

It's a wiki—why not let each editor choose the significance of his or her change? Provide a drop-down with something like...

Please qualify your edit:

  1. A few minor edits (spelling, grammar, etc)
  2. Many minor edits
  3. A few significant edits (changes to specific facts, figures, etc)
  4. Many significant edits
  5. A few major edits (changes to thesis, conclusion, etc)
  6. Many major edits

Then use the combined responses to estimate ownership.
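One way to turn those self-reports into numbers; the weights below are arbitrary placeholders, and all names are mine:

def ownership_from_reports(reports):
    # reports: list of (editor, weight) pairs, where the weight reflects the
    # significance the editor chose (e.g. 1/2/4/8/16/32 for the six options),
    # and the first entry is the original post.
    scores = {}
    for editor, weight in reports:
        scores[editor] = scores.get(editor, 0) + weight
    total = sum(scores.values()) or 1
    return dict((e, 100.0 * s / total) for e, s in scores.items())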

inxilpro
+2  A: 

How about this idea?

  • Original poster's name is displayed, until the post has been changed more than x% from the original text
  • This percentage is displayed as something like "x users, y edits, z% of original"
  • The "last edited" info currently present could also show a diff amount -- so, for example, it would make clear that the first edit to your question was a minor addition

The point is that after a certain amount of change it doesn't really matter who owns it. I agree with the users saying that trying to assign "owners" to wiki content is counter-productive.

Ant P.