Is anyone working on something to render individual questions, or SO as a whole, with codeswarm? If so, can you post a link to your work that transforms SO questions into revisions that codeswarm can understand (e.g. an SVN log)?

It would be really, really cool to see SO played (as a whole) via codeswarm, so I hope not only to ask whether anyone is working on it, but also to see whether anyone is interested in trying to accomplish it.

Augmenting that, will database dumps be made available?

EDIT:

Database dumps have since been made available :) Enough with UserVoice, is anyone doing it? If so, what VCS did you mock?

+3  A: 

One needs a log from a version-control system to generate a code_swarm video. However, not even a static database dump has been provided yet. See the UserVoice ticket.

jetxee
Maybe the DB structure could be provided first ... at least then we could get to work on something :) Scraping the site just seems rude (and rather ugly)
Tim Post
I agree. Please vote for the ticket if you too would like to have a DB dump :)
jetxee
this answer is obsolete now :P
hasen j
yes, it's obsolete now
jetxee
+5  A: 

With the release of database dumps, this has essentially turned into a problem of generating an activity.xml file from the database contents. This wiki page on the codeswarm project site details the typical process of generating these files from common version control systems (most notably SVN, CVS, VSS, and Mercurial) and from MediaWiki. I've had a brief look at the code (specifically the convert_logs.py file), and the conversion code seems quite simple. In fact, the format of the actual activity.xml file, which is all you need to generate the end results (i.e. cool-looking videos), is very straightforward itself.

Here's an example of the activity.xml file I generated from the sample svn_log.txt file. I've just pasted a portion, since the entire contents is rather long.

<?xml version="1.0"?>
<file_events>
<event date="1213658962000" filename="/branches" author="(no author)" />
<event date="1213658962000" filename="/tags" author="(no author)" />
<event date="1213658962000" filename="/trunk" author="(no author)" />
<event date="1213867405000" filename="/prototype" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype/ColorAssigner.pde" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype/ColorBins.pde" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype/Edge.pde" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype/FileEvent.pde" author="michael.ogawa" />
<event date="1213867405000" filename="/trunk/prototype/FileNode.pde" author="michael.ogawa" />
<event date="1214050286000" filename="/trunk/prototype/code_swarm.pde" author="[email protected]" />
<event date="1214050286000" filename="/trunk/prototype/data/code_swarm-repository.xml" author="[email protected]" />
<event date="1214053719000" filename="/trunk/convert_logs" author="[email protected]" />
<event date="1214053719000" filename="/trunk/convert_logs/README" author="[email protected]" />
<event date="1214053719000" filename="/trunk/convert_logs/convert_logs.py" author="[email protected]" />
</file_events>

(The date attribute appears to be a Unix timestamp expressed in milliseconds, hence the trailing 000 on every value.)
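
The values in the sample are consistent with milliseconds since the Unix epoch: read that way, the first one, 1213658962000, is 2008-06-16 23:29:22 UTC. A minimal Ruby sketch of the conversion follows. The helper name to_epoch_millis is hypothetical, and treating a timestamp without a zone as UTC is an assumption, not something taken from codeswarm.

```ruby
require 'time'

# Hypothetical helper: convert an ISO 8601 timestamp string into milliseconds
# since the Unix epoch. Assumption: a string without a trailing "Z" is UTC.
def to_epoch_millis(timestamp)
  utc = timestamp.end_with?("Z") ? timestamp : "#{timestamp}Z"
  t = Time.iso8601(utc)
  # Integer arithmetic avoids floating-point rounding on the fraction.
  t.to_i * 1000 + t.usec / 1000
end

puts to_epoch_millis("2008-06-16T23:29:22")   # prints 1213658962000
```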

Now, I haven't yet investigated what the full extent of the format for this XML file is, but it would seem quite easy to generate from any suitable data source. (Indeed, it appears that these flashy videos can be generated with no more information.) Certainly, I see no reason why one would need to take the approach of mocking a VCS. The nature of StackOverflow content seems pretty well-suited to the format/codeswarm in general, so compatibility is not likely to be an issue in my opinion. Indeed, there already exists a MediaWiki converter.
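
To illustrate how little is needed, here is a minimal Ruby sketch that builds the same structure from in-memory records. The Event struct and the xml_attr and activity_xml helpers are hypothetical names; only the element and attribute names come from the sample above, and whether codeswarm accepts further attributes is not checked here.

```ruby
# Characters that must be escaped inside a double-quoted XML attribute.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape a value so it is safe inside a double-quoted XML attribute.
def xml_attr(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# One activity event: timestamp in milliseconds, a "file" name, and an author.
Event = Struct.new(:millis, :filename, :author)

# Render a list of events as an activity.xml document (returned as a string).
def activity_xml(events)
  lines = ['<?xml version="1.0"?>', "<file_events>"]
  events.each do |e|
    lines << %(<event date="#{e.millis}" filename="#{xml_attr(e.filename)}" author="#{xml_attr(e.author)}" />)
  end
  lines << "</file_events>"
  lines.join("\n") + "\n"
end

puts activity_xml([Event.new(1213658962000, "/trunk", "(no author)")])
```

Any data source that can supply a timestamp, a name, and an author per event could feed activity_xml, which is why mocking a full VCS looks unnecessary.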

So yeah, I really don't see that this should be too difficult a project. The idea of representing StackOverflow questions with codeswarm does quite intrigue me, so I am actually thinking about spending a bit of time writing a converter that takes the StackOverflow database dump (or a subset thereof) and converts it to the activity.xml format. If you haven't yet attempted anything yourself, please let me know, and I would be glad to at least create a quick and dirty converter (probably in C#).

Update

Here's my code in C# that generates the activity.xml output file that codeswarm uses. I have verified that the format is correct, but haven't yet gotten around to checking that the video generates correctly. (I had to install the JDK because even that wasn't on my machine.)

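using System;
using System.Globalization;
using System.Xml;

public class DumpConverter
{
    // Reference point for converting dates to Unix time.
    private static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Raised after each post is written; the int is the running post count.
    public event Action<object, int> ProgressChanged;

    public void ConvertToLog(XmlWriter outputWriter, XmlReader postsReader)
    {
        outputWriter.WriteStartDocument();
        outputWriter.WriteStartElement("file_events");

        int numPostsRead = 0;
        while (postsReader.Read())
        {
            // Only <row> elements carry post data; skip everything else.
            if (postsReader.NodeType != XmlNodeType.Element || postsReader.Name != "row")
                continue;

            // Parse independently of the machine's culture and treat the
            // dump's timestamps as UTC (an assumption about the dump).
            var postDate = DateTime.Parse(postsReader["CreationDate"],
                CultureInfo.InvariantCulture,
                DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
            // Note: Title and LastEditorDisplayName may be absent on some rows.
            var postFileName = postsReader["Title"];
            var postAuthor = postsReader["LastEditorDisplayName"];

            // Milliseconds since the Unix epoch, matching the sample file above.
            // A long is needed: the value does not fit in an int.
            var postMillis = (long)(postDate - UnixEpoch).TotalMilliseconds;

            outputWriter.WriteStartElement("event");
            outputWriter.WriteAttributeString("date",
                postMillis.ToString(CultureInfo.InvariantCulture));
            outputWriter.WriteAttributeString("filename", postFileName);
            outputWriter.WriteAttributeString("author", postAuthor);
            outputWriter.WriteEndElement();

            if (ProgressChanged != null)
                ProgressChanged(this, ++numPostsRead);
        }

        outputWriter.WriteEndElement();
        outputWriter.WriteEndDocument();
    }
}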

And a sample program that uses the DumpConverter class:

using System;
using System.IO;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        // Expects the directory containing posts.xml as the first argument.
        if (args.Length < 1)
            return;
        var inputPath = args[0];

        using (var outputWriter = XmlWriter.Create(Path.Combine(inputPath, "activity.xml")))
        using (var postsReader = XmlReader.Create(Path.Combine(inputPath, "posts.xml")))
        {
            var dumpConverter = new DumpConverter();

            string lastStatus = string.Empty;
            Console.Write("Posts converted: ");
            dumpConverter.ProgressChanged += (sender, postsConverted) =>
                {
                    // Erase the previous count with backspaces, then print the new one.
                    Console.Write(new string('\b', lastStatus.Length));
                    lastStatus = postsConverted.ToString("#,#0");
                    Console.Write(lastStatus);
                };
            dumpConverter.ConvertToLog(outputWriter, postsReader);
            Console.WriteLine();
        }
    }
}

I'll let you know if I can actually get the video rendered in codeswarm now, though once you generate the activity data (which doesn't take terribly long), it should be trivial provided that you have the correct environment and config file set up.

Hope that helps!

Noldorin
code_swarm was not my only motivation for considering the idea of putting the data dumps into a mock VCS format (though it was the main goal). I was also considering how useful the existing VCS tools would be for analyzing the dumps (i.e. graphs, etc.). However, I think you are correct, it's overkill. I was going to start work on a converter in C99, but if you've already got something in the oven I'd really like to see it :)
Tim Post
+9  A: 

The following Ruby script reads users.xml and the 880 MB posts.xml and creates a 20 MB activity.xml from them:

require 'time'

Dir.chdir("d:\\stackoverflow")

$stderr.puts "Reading users..."
id2username = []
IO.foreach("users.xml") { |line|
    id2username[$1.to_i] = $2 if line =~ /^\<row Id="(\d+)" .* DisplayName="([^"]+)" /
}
$stderr.puts "Got #{id2username.size} users"

io = File.open("activity.xml", "wb")
io.puts '<?xml version="1.0"?>'
io.puts "<file_events>"
$stderr.puts "Reading posts & writing activity.xml..."
IO.foreach("posts.xml") { |line|
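    # Only rows that carry a Title (i.e. questions) match; answers and comments are skipped (for now)
    if line =~ / CreationDate="([^"]+)\.[0-9]{3}" .* OwnerUserId="(\d+)" .* Title="([^"]+)" /
        # The sample activity.xml shown earlier in this thread uses millisecond
        # timestamps, so scale the epoch seconds to match
        date = Time.parse($1).to_i * 1000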
        filename = $3           # Use title for this
        author = id2username[$2.to_i] || "(no author)"
        io.puts "<event date=\"#{date}\" filename=\"#{filename}\" author=\"#{author}\" />"
    end
}
io.puts "</file_events>"
io.close
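
The script above only emits questions, since only those rows carry a Title. One possible extension, sketched below in Ruby, is to count each answer as another event on its question's title, so that busy questions behave like frequently edited files. This assumes that a question row appears before its answers and that answer rows carry a ParentId attribute; row_attr and post_events are hypothetical helper names.

```ruby
# Pull one attribute value out of a <row ... /> line; nil when it is absent.
# The leading space keeps "Id" from matching inside "ParentId".
def row_attr(line, name)
  m = line.match(/ #{Regexp.escape(name)}="([^"]*)"/)
  m && m[1]
end

# Turn post rows into [creation_date, owner_user_id, title] triples.
# Questions use their own title; answers borrow the title of their parent.
def post_events(lines)
  title_by_id = {}
  events = []
  lines.each do |line|
    id = row_attr(line, "Id")
    next unless id
    title = row_attr(line, "Title")
    if title
      title_by_id[id] = title                          # a question
    else
      title = title_by_id[row_attr(line, "ParentId")]  # an answer, if its parent was seen
    end
    next unless title
    events << [row_attr(line, "CreationDate"), row_attr(line, "OwnerUserId"), title]
  end
  events
end
```

Each triple still needs its date converted and its user id mapped to a display name, exactly as the script above does, before it is written out as an event.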

This took more time than anticipated, and now the day has ended :)

Currently I've done:

  • download code_swarm 0.1 from Google
  • download Ant from Apache
  • download the Java JDK 1.6
  • download a BitTorrent client
  • download the 200 MB torrent
  • extract the torrent to 1 GB
  • run the script to create activity.xml
  • edit the code_swarm config
  • run code_swarm

...and I've now been staring at a blank Java window for 15 minutes while java.exe takes 100% CPU...

Hmpf.

As an interim solution, I've put activity.xml up on an FTP server here for others to continue exploring (or maybe my computer will come to its senses tonight).

Rutger Nijlunsing
You Rock!!!! I am working on doing the same thing with C. I have the same problem with Java choking on my puny Atom, but will try again when I get back home tonight.
Tim Post
I would guess you would have to take it up with the code_swarm authors to work something out with them. It seems code_swarm is not that scalable. An alternative would be to start plotting a subset and going from there.
Rutger Nijlunsing
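
The subset idea can be sketched in Ruby as a filter over an existing activity.xml. This assumes one event per line with the date attribute first (which is how the Ruby script above writes them); subset_events is a hypothetical helper name.

```ruby
# Keep only the events whose date falls within [from_millis, to_millis].
# Lines that are not events (XML declaration, <file_events> wrapper) pass through.
def subset_events(lines, from_millis, to_millis)
  lines.select do |line|
    m = line.match(/<event date="(\d+)"/)
    m.nil? || (from_millis..to_millis).cover?(m[1].to_i)
  end
end
```

For example, subset_events(File.foreach("activity.xml"), from, to).join returns the trimmed document as a string, which can then be written to a smaller file and tried in code_swarm first.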