You may have noticed that we now show an edit summary on Community Wiki posts:

community wiki
220 revisions, 48 users

I'd like to also show the user who "most owns" the final content displayed on the page, as a percentage of the remaining text:

community wiki
220 revisions, 48 users
kronoz 87%

Yes, there could be top (n) "owners", but for now I want the top 1.

Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:

User Id     Post-Text
-------     ---------
12          The quick brown fox jumps over the lazy dog.
27          The quick brown fox jumps, sometimes.
30          I always see the speedy brown fox jumping over the lazy dog.

Which of these users most "owns" the final text?

I'm looking for a reasonable algorithm -- it can be an approximation, it doesn't have to be perfect -- to determine the owner. Ideally expressed as a percentage score.

Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any stackoverflow post with a decent revision history (not just retagging, but frequent post body changes) as a test corpus. Here's a good one, with 15 revisions from 14 different authors. Who is the "owner"?

http://stackoverflow.com/revisions/327973/list

Click "view source" to get the raw text of each revision.

I should warn you that a pure algorithmic solution might end up being a form of the Longest Common Substring Problem. But as I mentioned, approximations and estimates are fine too if they work well.

Solutions in any language are welcome, but I prefer solutions that are

  1. Fairly easy to translate into C#.
  2. Free of dependencies.
  3. Simple, even at the expense of efficiency.

It is extraordinarily rare for a post on SO to have more than 25 revisions. But it should "feel" accurate, so if you eyeballed the edits you'd agree with the final decision. I encourage you to test your algorithm out on stack overflow posts with revision histories and see if you agree with the final output.


I have now deployed the following approximation, which you can see in action for every new saved revision on Community Wiki posts:

  • do a line-based diff of every revision where the body text changes
  • sum the inserted and deleted lines for each revision as that revision's "editcount"
  • each user id gets the sum of the "editcount" they contributed
  • the author of the first revision gets 2× their "editcount" as an initial score, as a primary authorship bonus
  • to determine final ownership percentage: each user's edited-line total divided by the total number of edited lines across all revisions

(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, etcetera. The line-based diff makes it fairly speedy to recalc for all revisions; in a typical case of say 10 revisions it's ~50ms.)
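For illustration, here is a minimal Python sketch of that approximation. The names are mine (not the production code), and revisions is assumed to be a chronological list of (user_id, body_text) pairs:

import difflib
from collections import defaultdict

def ownership_percentages(revisions):
    edit_lines = defaultdict(int)
    for i, (user_id, text) in enumerate(revisions):
        prev = revisions[i - 1][1] if i > 0 else ''
        if i > 0 and text == prev:
            continue  # skip revisions that don't change the body (e.g. retag-only)
        diff = difflib.ndiff(prev.splitlines(), text.splitlines())
        editcount = sum(1 for line in diff if line.startswith(('+ ', '- ')))
        if i == 0:
            editcount *= 2  # primary authorship bonus for the first revision
        edit_lines[user_id] += editcount
    total = sum(edit_lines.values()) or 1
    return dict((user, 100.0 * count / total) for user, count in edit_lines.items())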

This works fairly well in my testing. It does break down a little when you have small one- or two-line posts that several people edit, but I think that's unavoidable. I'm accepting Joel Neely's answer as closest in spirit to what I went with, and I've upvoted everything else that seemed workable.

A: 

Not sure if it is possible, but you could count the number of characters added.

Example:

  • User 1 submits a question with 400 characters (User 1 has 400 of 400)
  • User 2 deletes 40 and adds 60 (User 1 had 360 and User 2 has 60)

If you revert you should also revert to the previous user/character count.
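For what it's worth, here is a minimal Python sketch of that bookkeeping. It tracks an owner per character so deletions can be charged to the right user; the names are mine, and SequenceMatcher is just one way to align the old and new text:

import difflib

def apply_revision(old_text, old_owners, new_text, editor):
    # old_owners[i] is the user who owns old_text[i]; returns the new owner list.
    new_owners = []
    matcher = difflib.SequenceMatcher(None, old_text, new_text)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == 'equal':
            new_owners.extend(old_owners[i1:i2])     # unchanged characters keep their owner
        elif op in ('insert', 'replace'):
            new_owners.extend([editor] * (j2 - j1))  # new characters belong to the editor
        # 'delete': those characters, and their ownership, simply vanish
    return new_owners

Per-user counts are then just a tally of the owner list (e.g. collections.Counter).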

But... maybe it is just simpler and more fair to name the original poster...

On second thought: I edit a lot, but I never consider myself the "owner" of the text, because I just alter the presentation (formatting and grammar), not the contents.

Gamecat
+3  A: 

Off the top of my head I'd do something like this:

  • I think counting words, as opposed to lines or characters, makes sense
  • Tokenize the original revision into words, and attach the author to each
  • Step through the revision history and, as words are added, attach the author to them
  • If words are deleted, just forget them
  • If words are changed, you can either count them as a delete/insert, or use some sort of character-based threshold so that spelling corrections aren't attributed to a new author
  • You end up with a list of words recording who originally wrote each one (sketched below)
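A rough Python sketch of that word-level attribution, assuming revisions is a chronological list of (author, text) pairs (the names are illustrative, and the spelling-correction threshold is left out for brevity):

import difflib

def attribute_words(revisions):
    owned = []  # (word, author) pairs for the current text
    for author, text in revisions:
        old_words = [w for w, _ in owned]
        new_words = text.split()
        matcher = difflib.SequenceMatcher(None, old_words, new_words)
        result = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                result.extend(owned[i1:i2])  # surviving words keep their original author
            elif op in ('insert', 'replace'):
                result.extend((w, author) for w in new_words[j1:j2])
            # deleted words are just forgotten
        owned = result
    return owned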
Draemon
+17  A: 

Saw your tweet earlier. From the display of the 327973 link, it appears you already have a single-step diff in place. Based on that, I'll focus on the multi-edit composition:

  1. A, the original poster owns 100% of the post.

  2. When B, a second poster, makes edits such that e.g. 90% of the text is unchanged, the ownership is A:90%, B:10%.

  3. Now C, a third party, changes 50% of the text. (A:45%, B:5%, C:50%)

    In other words, when a poster makes edits such that x% is changed and y = (100-x)% is unchanged, then that poster now owns x% of the text and all previous ownership is multiplied by y%.

    To make it interesting, now suppose...

  4. A makes a 20% edit. Then A owns a "new" 20%, and the residual ownerships are now multiplied by 80%, leaving (A:36%, B:4%, C:40%). The "net" ownership is therefore (A:56%, B:4%, C:40%).
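Before the worked example, here is the update rule as a tiny Python sketch (the names are mine; ownership maps each author to a fraction, and changed is the fraction of text the edit touched):

def apply_edit(ownership, editor, changed):
    unchanged = 1.0 - changed
    result = dict((author, share * unchanged) for author, share in ownership.items())
    result[editor] = result.get(editor, 0.0) + changed
    return result

shares = {'A': 1.0}                     # A posts
shares = apply_edit(shares, 'B', 0.10)  # A: 90%, B: 10%
shares = apply_edit(shares, 'C', 0.50)  # A: 45%, B: 5%, C: 50%
shares = apply_edit(shares, 'A', 0.20)  # A: 56%, B: 4%, C: 40%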

Applying this to your specimen (327973) with everything rounded to the nearest percent:

Version 0: The original post.

  • Paul Oyster: 100%

Version 1: Your current diff tool shows pure addition of text, so all those characters belong to the second poster.

  • Paul Oyster: 91%
  • onebyone: 9%

Version 2: The diff shows replacement of a word. The new word belongs to the third poster, and the remaining text belongs to the prior posters.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 3: Tag-only edit. Since your question was about the text, I'm ignoring the tags.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 4: Addition of text.

  • Paul Oyster: 45%
  • onebyone: 4%
  • Blogbeard: 1%
  • Mark Harrison: 50%

I hope that's enough to give the sense of this proposal. It does have a couple of limitations, but I'm sliding these in under your statement that an approximation is acceptable. ;-)

  1. It distributes the effect of change bluntly across all prior owners. If A posts, B does a pure addition, and C edits half of what B added, this simplistic approach just applies C's ownership across the entire post, without trying to parse out whose prior text was changed the most.

  2. It accounts for additions or changes, but doesn't give any ownership credit for deletion, because the deleter adds 0% to the remaining text. You can either regard this as a bug or a feature. I chose door number 2.

Update: A bit more about issue #1 above. I believe that fully tracking the ownership of the part of a post that is edited would require one of two things (the margin of the web page is not big enough for a formal proof ;-):

  • Changing the way text is stored to reflect ownership of individual portions of the text (e.g. A owns words 1-47, B owns words 48-59, A owns words 60-94,...), applying the "how much remains" approach in my proposal to each portion, and updating the portion-ownership data.

  • Considering all versions from first to current (in effect, recomputing the portion-ownership data on the fly).

So this is a nice example of a trade-off between a quick-and-dirty approximation (at the cost of precision), a change to the entire database (at the cost of space), or every calculation having to look at the entire history (at the cost of time).

joel.neely
This is similar to Gamecat's proposal, but the problem here is around step 3, what if C changes 50% of the text, but they remove *all* of the text added by B? Should it really be A:50%, B:0%, C:50% in that case?
Adam Bellaire
@Adam, you can't just add an ownership label to each character. So it has to be an estimate.
Gamecat
it's actually quite complicated if you sit down and think out all the scenarios. In particular the "overwrote all of another user's content" one.
Jeff Atwood
"you can't just add a ownership label on each character." how about an ownership label on each word/token? why not?
Jeff Atwood
@Jeff: I like that better. It too would be an estimate, tho. Certain formatting changes can lead the current diff engine to think I added words when they were just moved around. Also, if I correct a typo or caps, the current highlighting says I added the whole word.
Adam Bellaire
@Gamecat: That's true of course, you have to draw the line somewhere. I guess the question is, what kind of estimate is good enough for Jeff? :)
Adam Bellaire
Ownership at the word level is reasonable for human-readable text IMHO.
joel.neely
@Adam Bellaire: I expanded the post to show a concrete example and commented that this simple-minded approach doesn't distribute ownership in a context-sensitive way (i.e. tracking who added the stuff changed). It's only an estimate.
joel.neely
C benefits from reading the text entered by B before removing all of it. C's contribution may be just rephrasing B. Or he could have learned from B's mistake. I don't think it's illogical for B to still get credit even though his original work has been eliminated. KISS.
Patrick McElhaney
+3  A: 

This will work if you want to implement/leverage a diff algorithm à la Python's difflib -- you will probably have to do some kind of diff in any case. This snippet calls the user with the most text churn the winner.

Excuse my hardcoding.

#!/usr/bin/env python
# Python 2 code (urllib2, iteritems, print statements).

import collections
import difflib
import logging
import pprint
import urllib2
import re

class OwnageDeterminer(object):

    # Relative weights for added vs. removed characters; tweak to taste.
    add_coefficient = 1
    remove_coefficient = .5

    def __init__(self, edits):
        self.edits = edits
        self.counts_by_username = {}

    def __call__(self):
        edits, counts_by_username = self.edits, self.counts_by_username
        for i, edit in enumerate(edits):
            username = edit['username']
            unique_counts = {'added': 0, 'removed': 0}
            existing_text = edits[i-1]['text'] if i > 0 else ''
            new_text = edits[i]['text']
            # ndiff on two strings produces a character-by-character diff.
            for char_diff in difflib.ndiff(existing_text, new_text):
                if char_diff.startswith('+'):
                    unique_counts['added'] += 1
                elif char_diff.startswith('-'):
                    unique_counts['removed'] += 1
            user_counts = counts_by_username.get(username, collections.defaultdict(int))
            user_counts['removed'] += self.remove_coefficient * unique_counts['removed']
            user_counts['added'] += self.add_coefficient * unique_counts['added']
            counts_by_username[username] = user_counts
        # The user with the most weighted churn (adds plus removes) wins.
        winner = None
        winning_score = 0
        score_by_username = {}
        for username, counts in counts_by_username.iteritems():
            score = counts['removed'] + counts['added']
            if score > winning_score:
                winner = username
                winning_score = score
            score_by_username[username] = score
        logging.debug('Scores: %s', pprint.pformat(score_by_username))
        return winner


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    site = urllib2.urlopen('http://stackoverflow.com/revisions/327973/list')
    contents = site.read()
    # Scrape (revision-source URI, author username) pairs from the revision list.
    regex = re.compile(r'(/revisions/viewmarkup/\d+).*?/users/\d+/([\w-]+)',
                       re.MULTILINE|re.DOTALL)
    revisions = regex.findall(contents)
    print revisions
    edits = []
    # Sort numerically by revision id so the edits are applied in order.
    for reluri, username in sorted(revisions, key=lambda t: int(t[0].rsplit('/', 1)[-1])):
        text = urllib2.urlopen('http://stackoverflow.com{0}'.format(reluri)).read()
        edit = {'username': username, 'text': text}
        edits.append(edit)
    od = OwnageDeterminer(edits)
    print od()

The output:

DEBUG:root:Scores: {'blorgbeard': 0.5,
 'dave-markle': 0.5,
 'dbr': 1172.0,
 'gatekiller': 69.5,
 'joseph-ferris': 0.0,
 'lkessler': 0.0,
 'mark-harrison': 592.0,
 'mdb': 3.0,
 'onebyone-livejournal-com': 0.0,
 'paul-oyster': 482.0,
 'rob-wells': 0.0,
 'simucal': 1070.5,
 'skiphoppy': 0.0,
 'thesoftwarejedi': 701.0}
dbr

Difflib docs on complexity:

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Another nice thing is that this winner calculation is linear, so you can cache the original results and do incremental updates on new edits, despite the large initialization load.

cdleary
hmm. looking at the revision history, SoftwareJedi mostly *deleted* content. (and Simucal can be disregarded as a rollback.) Should Jedi be the winner? Doesn't feel right.
Jeff Atwood
You can add arbitrary coefficients on the add and removal counts. Tweak it at will! :-)
cdleary
Also, even if every bit of the original user's question was modified, I think they should get some credit for having asked the original question. IMHO, whatever's decided, it should probably be weighted such that the original user gets at least X% credit (my suggestion: X = 10)
Phil
@Jeff, one thing you can be sure of: regardless of which algorithm you choose, there will always be people complaining ;-).
Gamecat
I think it makes more sense for the OP to start with a fixed offset instead of a percentage. They made the original contribution, but did they cultivate it?
cdleary
Given the "revisions ≈ time" ideas that are cropping up, I think there are some more interesting things you can do here. I'd use this algorithm if I wanted to keep it simple, but there's lots of other neat possibilities. +1 for simple/tweakable, -1 for boring. :-)
cdleary
I looked up Aaron Swartz's "Who Writes Wikipedia" (which everyone should read) at http://www.aaronsw.com/weblog/whowriteswikipedia because it's a similar problem, and he uses difflib.SequenceMatcher.find_longest_match. This answer uses difflib but a different function -- are the answers the same?
ShreevatsaR
+2  A: 

A question is whether removing characters is as valid as adding them. Using a diff may work well, but it doesn't always follow that the person who has added the most is the most valuable editor (some people can write in 10 characters what others take a page to write).

So I think you have two factors here:

  • Quantity of change
  • Quality of change

Hence I would be tempted to write something that simply maps words added/removed to a points score. Add to that some kind of quality factor (which would have to be an external parameter), and the combined value can then be used for ranking.

You could arguably generate some kind of primitive "quality factor" based on how long an item survived before being edited. So if something I wrote wasn't changed until the 12th edit, it couldn't have been too wrong (relative to lower-quality text that was changed as soon as it was added).
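As a rough Python sketch of that stability weighting: assume attributions[i] is the list of (word, author) pairs in effect after revision i (produced, say, by word-level attribution as in Draemon's answer); all names here are mine:

from collections import defaultdict

def stability_scores(attributions):
    # Every word present at a revision earns its author a point, so text
    # that survives many revisions accumulates proportionally more weight.
    scores = defaultdict(int)
    for attributed in attributions:
        for word, author in attributed:
            scores[author] += 1
    return scores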

good point -- "stability" of text over time should be a factor. If your text stayed in from versions 1 through 15, that should be weighted!
Jeff Atwood
Absolutely. Quality is a tricky concept to introduce here, but valid, I think. "Simplicity is the ultimate sophistication." ... Leonardo da Vinci
I like the idea of text remaining across several revisions being more valuable, but it seems prone to error given that you have no concept of real time. I can make five edits in five minutes, as I did in the above post. They could be small, which would make the rest of the text seem more stable.
cdleary
@cdleary that's true. Perhaps your consecutive edits would count as one edit. Or perhaps we need "pagerank" here :)
+1  A: 

Number of characters is tricky methinks.

How about:

Break latest revision into a set of sentences. 
//Sentence is any text fragment surrounded by punctuation

For each Sentence
    Find which user created that sentence. 
    Add 1 to the user who created the sentence 
    Add 1 to the number of sentences

For Each user 
    % ownership = Count for that user / Number of sentences.

Finding which user created that sentence:
Matching the sentence to a revision is easy if you want an exact match, but I'd be happier with a partial match.

To do this partial match...

Strip common words from the sentence fragment, and search for that stripped fragment in a stripped version of each revision. The earliest revision that contains it identifies the owner.

Options: (instead of common word stripping)

  • For each word in the sentence, condense each word to the first and last characters. This'll account for spelling mistakes, etc.
  • Only use the first and last two words in the sentence fragment.
  • Matching 80% of the words would probably be good enough to attribute ownership. (This is a difficult thing for the computer to do - and it can't be cached)

Stripping common words

You already do this for search, so the libraries, etc. are already there. Even if you use someone else's formula, you should maybe CommonWordStrip each revision before starting.
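A very rough Python sketch of the sentence-matching idea, with a tiny stop-word list standing in for real common-word stripping (all names are mine):

import re

COMMON_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'it', 'that'}

def strip_common(text):
    words = re.findall(r"[a-z']+", text.lower())
    return ' '.join(w for w in words if w not in COMMON_WORDS)

def sentence_ownership(revisions):
    # revisions: chronological list of (author, text) pairs
    final_author, final_text = revisions[-1]
    counts = {}
    for sentence in re.split(r'[.!?]+', final_text):
        stripped = strip_common(sentence)
        if not stripped:
            continue
        # the earliest revision containing the stripped sentence owns it
        owner = next((a for a, t in revisions if stripped in strip_common(t)),
                     final_author)
        counts[owner] = counts.get(owner, 0) + 1
    total = sum(counts.values()) or 1
    return dict((a, 100.0 * c / total) for a, c in counts.items())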

seanyboy
+4  A: 

If I am understanding your question correctly, it looks like you are trying to do what IBM did a while back with a Wikipedia research project: namely, seeing whose revisions to the text were the most accepted by other users, and how the overall text changed over time. The name of the project was History Flow, and the "history flow - how it works" page gives a pretty nice overview of how their algorithm worked.

Rob
I recently did something similar for Wikipedia text using Python's difflib module: http://hewgill.com/journal/entries/461-wikipedia-blame
Greg Hewgill
+1: I had the exact same idea, but couldn't think of what the hell it was called. Kudos.
Alex Lyman
+24  A: 

I think the idea is fundamentally flawed.

If someone writes a brilliant analysis with awful spelling and unclear examples, and I copy edit it extensively, have I created 60% of the work? Clearly not; the result is a derivative where most of the value comes from the initial poster. A useful measure is not possible based on character or word counts, but requires strong AI-level semantic analysis.

Quite apart from that, seeking credit based on “ownership” of articles would likely be entirely unhelpful and anti-wiki. On Wikipedia, for example, people who act as though they own articles are one of the most damaging influences.

Ahruman
I completely agree. Most of the users on SO understand wikis and revisions so displaying x revisions should suffice.
GateKiller
If this is the case, I don't think we should show the avatar/credentials of the last user to edit. Make it totally void of personal branding, with just an "edited [time] ago" link for more details.
cdleary
@cdleary, I agree with you on that.
John Sheehan
You deeply misunderstand; this is purely informational. In my post, above, John is the only user attached (right now). But I am the primary author. This is helpful information in assessing the post.
Jeff Atwood
@Jeff, please explain how it's helpful to know who is the primary author.
Patrick McElhaney
because we mix oil and water: wiki and reputation systems. We're a hybrid. We need both visible.
Jeff Atwood
As far as mixing fluids is concerned, SO feels far more reputation-systemy than wikiesque as it is.
Ahruman
@Ahruman: Agreed, it seems like we can either "trust the wiki", have no ownership, and trust the votes **OR** have "primary ownership" and try to hack the lower editing barrier of Community Wiki into the existing (working) reputation system.
cdleary
Of course, in Community Wiki, posts that were up-voted have a lower barrier to being *totally* replaced/removed in a newer revision, potentially making those up-votes no longer applicable. This is the point of trust. :-)
cdleary
+4  A: 

What about simply calculating the Levenshtein distance of each person's edit to the previous version? Then sum the distance scores for each user and calculate the user's percentage of the sum of all users' distance scores.

Here's some C# code to calculate the distances.
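For reference, here is a Python sketch of that scoring, using the two-row form of the Levenshtein computation (which also addresses the memory concern raised in the comments below; the names are mine):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def ownership_by_distance(revisions):
    # revisions: chronological list of (author, text) pairs
    scores = {}
    for i, (author, text) in enumerate(revisions):
        prev_text = revisions[i - 1][1] if i > 0 else ''
        scores[author] = scores.get(author, 0) + levenshtein(prev_text, text)
    total = sum(scores.values()) or 1
    return dict((a, 100.0 * s / total) for a, s in scores.items())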

brien
wow, Levenshtein is *CRAZY* expensive. Like 200 milliseconds per post on a 3.5 GHz Core 2 Duo for a typical post, expensive!
Jeff Atwood
Well, Levenshtein *does* have quadratic runtime and there's not much that you can do about this (in this general case). Is there no way to cache the values, i.e. only calculate them once, when the edit was made?
Konrad Rudolph
Note that the C# code mentioned in the answer is the naive way to calculate the Levenshtein distance. Because it allocates a 2-dimensional array, it uses a gigantic amount of memory that is largely unneeded. By allocating only two 1-dimensional arrays, the algorithm is at least 4 times more memory-efficient.
Bas Leijdekkers
If I'm not mistaken, this is what I did, except I didn't normalize the scores. What do you gain by normalizing, anyway? The largest score wins in either case, no?
cdleary
Actually, there are different semantics for replacement. I classify as unique to either the first or second string, so replacement is both an addition and a removal, whereas the edit distance calculation has its own, separate, weight for replacement.
cdleary
+3  A: 

Nobody owns it. Awarding ownership violates the spirit of "community wiki" and may lead to counterproductive edit wars.

Patrick McElhaney
A: 

The initial concept must be given a weight. Spelling corrections must be given a weight. Grammar/structure must be given a weight. Wording must be given a weight. Conciseness must be given a weight. Etc...

Weights are the only fair way to do it but also impossible to determine.

Joe Philllips
+5  A: 
Lasse V. Karlsen
"I don't know how to solve the [rollbacks] problem though" we have a flag indicating rollback to revision; I will ignore content from a rollback as if it never got changed at all.
Jeff Atwood
I'll post the code to this when I get home, so you can experiment, if you're interested. The code relies on classes in my class library, but they're there for the taking if you fancy them.
Lasse V. Karlsen
Source (VS2008) can be found here: http://vkarlsen.serveftp.com:81/svn/StackOverflow/SO424220/ (use guest for both username and password.) Use a Subversion client to grab the source code.
Lasse V. Karlsen
+1  A: 

The key to having a good solution to this problem is to get extra information about what is going on with the editing. The extra information available in this medium is the voting and response rate to questions. So if someone makes an edit that leads to the question getting a lot of upvotes, comments, and responses, their edit was very valuable. Special cases probably exist for a brand new unanswered question that gets edited before it has been on the site for a very long time.

You should look at how much the post has changed using the Levenshtein distance algorithm and then weight their edit by the number of votes, comments, and answers the question receives after their edit.

Let n = total revisions
Let m = the revision number of a poster's edit
Let post[it] = the text of the post at revision 'it'
Let votes[it] = the votes that revision 'it' received (plus a bonus for comments/answers)

value = 0
for (it = m; it < n; ++it) {
  value += (Levenshtein(post[it-1], post[m]) / average_length_post) * votes[it]
}

If you calculate this value for each edit, the ownership of a post is the total value of all edits by that user divided by the sum of all edit values for that post.

MikeN
+2  A: 

It's a wiki—why not let each editor choose the significance of his or her change? Provide a drop-down with something like...

Please qualify your edit:

  1. A few minor edits (spelling, grammar, etc)
  2. Many minor edits
  3. A few significant edits (changes to specific facts, figures, etc)
  4. Many significant edits
  5. A few major edits (changes to thesis, conclusion, etc)
  6. Many major edits

Then use the combined responses to estimate ownership.
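One way to turn those self-reports into numbers; the weights below are arbitrary placeholders, and all names are mine:

def ownership_from_reports(reports):
    # reports: list of (editor, weight) pairs, where the weight reflects the
    # significance the editor chose (e.g. 1/2/4/8/16/32 for the six options),
    # and the first entry is the original post.
    scores = {}
    for editor, weight in reports:
        scores[editor] = scores.get(editor, 0) + weight
    total = sum(scores.values()) or 1
    return dict((e, 100.0 * s / total) for e, s in scores.items())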

inxilpro
+2  A: 

How about this idea?

  • Original poster's name is displayed, until the post has been changed more than x% from the original text
  • This percentage is displayed as something like "x users, y edits, z% of original"
  • The "last edited" info currently present could also show a diff amount -- so, for example, it would make clear that the first edit to your question was a minor addition

The point is that after a certain amount of change it doesn't really matter who owns it. I agree with the users saying that trying to assign "owners" to wiki content is counter-productive.

Ant P.