
I'm planning on running a Ruby process that may take a month to finish. If possible, I'd like to ensure that a blackout or hitting the wrong button won't cost me the whole month's work.

Is there an easy way to periodically save the program's state to disk? (Techniques that involve more effort would include adding code that marshals everything apart from the database, or possibly using a virtual machine for the process' operating system)

(For those interested: the process involves parsing a multi-gigabyte XML file of a well-known website, processing some information, and saving the information to an ActiveRecord database as it goes along. Twice.)

Edit: The project is this one, and the XML file is pages-articles.xml (eg enwiki-20090306-pages-articles.xml). Nothing proprietary, I just didn't want to be in "Plz halp" mode. The first pass gets a list of Wikipedia page titles, the next pass determines the first link from each page to another page, and then I calculate some statistics.

Continuing from where I left off, as suggested by some answerers, is probably a valid option. If it crashes during the first pass, then I probably could re-run it, telling it not to add entries that already exist. If it crashes during the second pass, then I should only ask it to build links for pages that haven't already had their link calculated. If it crashes during calculating the statistics, I could just re-calculate the statistics.
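
Roughly, I imagine the "skip what's already done" approach looking something like this (Page, title and first_link are placeholder names, and this assumes the ActiveRecord setup I already have):

# Hypothetical Page model with title and first_link columns.
# First pass: skip titles that are already in the database.
def record_title(title)
  Page.find_or_create_by(title: title)
end

# Second pass: only calculate links for pages that don't have one yet.
def build_links
  Page.where(first_link: nil).find_each do |page|
    page.update(first_link: first_link_for(page.title))  # first_link_for is my own parsing code
  end
end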

Another edit: A more general version of this question is asked at Save a process’ memory for later use? It looks like you can't easily back up long-running processes.

+1  A: 

I can't think of a super-easy way to do this, but if you're willing to modify your code a bit, you might get a bit of help from YAML (an easy-to-use markup library; yaml.org). Requiring the YAML library gives every object a #to_yaml method, which will serialize the entire object so it can be saved to a file, and objects can be restored from YAML as well. So that would require adding a bit of code to save periodically, but the actual saving bit could be relatively easy. Also, YAML is part of the standard library, so no download is required.

require "yaml"
def backup(objects_im_Using)
  out_file = File.open("prefix"+Time.now.strftime('%Y-%M-%d')+".yml","w")
  objects_im_Using.each {|object| out_file 

(Although I suppose the real Ruby way to do this would be to have the backup method yield a block or some such.)
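
For what it's worth, restoring could be just as short, assuming each object was written as its own YAML document the way the backup method above does it:

require "yaml"

# Each to_yaml call starts a new "---" document, so the backup file can be
# read back as a multi-document stream.
def restore(filename)
  YAML.load_stream(File.read(filename))
end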

Sorry, no better way I can think of. I'd be interested to read a better response to this question!

YenTheFirst
It's YAML Ain't Markup Language, actually.
Chuck
YAML used to be called "Yet Another Markup Language" as well according to Wikipedia.
Andrew Grimm
well, either way, it's certainly not 'yet another markup library'. I don't know what I was thinking when I wrote that. hmm.
YenTheFirst
+1  A: 

From the point of view of having had my work machines unexpectedly powered down last weekend (construction elsewhere in the building), I sympathise with the idea.

Is there any value in partitioning the task? Could the input file be reworked into many smaller ones?

Orders of magnitude smaller, I know, but I have a process that loads about 2 million rows across a few AR models each morning. To get around the appalling database latency issues that I suffer from (DB server in a different country - don't ask) I rewrite my input CSV files into 16 "fragments" each. Each fragment is recorded in the Fragment model, which helps me identify any completion failures for re-run. It works surprisingly well and restarts, when needed, are simple. Usual run time about 30 minutes.

If your XML input is reasonably well-structured, it should be fairly straightforward to extract sub-structures (I'm sure there's a better term than that) into separate files. I don't know how fast a SAX parser would be able to do this - probably not too horrific, but it could be done without an XML library at all if it was still too slow. Consider adding a column to the target model to identify the fragment that it was loaded from - that way stripping out incomplete runs is simple.
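
Roughly, the bookkeeping might look like this (Fragment and Page are hypothetical models here, and parse_pages stands in for whatever extraction code you end up with):

# Hypothetical Fragment(name, completed) and Page(title, fragment_name) models.
Dir.glob("fragments/*.xml").sort.each do |path|
  fragment = Fragment.find_or_create_by(name: File.basename(path))
  next if fragment.completed?            # loaded successfully on a previous run

  parse_pages(path) do |title|           # parse_pages: your own extraction code
    Page.create!(title: title, fragment_name: fragment.name)
  end
  fragment.update(completed: true)
end

# After a crash, strip out rows from any fragment that never completed:
incomplete = Fragment.where(completed: false).pluck(:name)
Page.where(fragment_name: incomplete).delete_all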

Beyond that, consider holding all the state in one class and using Marshal to save periodically?
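
Something along these lines, perhaps (ProcessingState and its fields are invented purely to illustrate the idea):

# All the in-memory state lives in one object...
class ProcessingState
  attr_accessor :titles_seen, :pages_processed

  def initialize
    @titles_seen = {}
    @pages_processed = 0
  end
end

# ...which gets dumped to disk every so often and reloaded on restart.
def save_state(state, filename = "state.dump")
  File.open(filename, "wb") { |f| f.write(Marshal.dump(state)) }
end

def load_state(filename = "state.dump")
  File.exist?(filename) ? Marshal.load(File.binread(filename)) : ProcessingState.new
end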

Mike Woodhouse
It's only about 5 million objects. (The computer I'm going to do it on is a 5-10 year old PC, with no DB optimization apart from indexing, and I'm probably doing it wrong). Partitioning helps to an extent (1st pass) but each object needs to know if string X is the title of another object (2nd pass).
Andrew Grimm
+1  A: 

It saves to the database as it goes along, but from your question it seems you can't pick up where you left off with that data alone.

So is there data in memory that you could persist in a temporary table or temporary column, that would let you pick up where you left off? Maybe you don't need the whole state - maybe a subset of the data would let you recreate the point where the power went off (or whatever).
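
For example, a single-row checkpoint might be enough - say a hypothetical Checkpoint model recording which pass you're on and how far through the file you've got, updated every few thousand pages:

# Hypothetical Checkpoint model with pass and file_offset columns.
def save_checkpoint(pass, file_offset)
  checkpoint = Checkpoint.first || Checkpoint.create
  checkpoint.update(pass: pass, file_offset: file_offset)
end

def resume_point
  checkpoint = Checkpoint.first
  checkpoint ? [checkpoint.pass, checkpoint.file_offset] : [1, 0]
end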

Sarah Mei
A: 

OK. Now that we know a little more, I think the whole question may be moot. I'm guessing, from a little Good Friday fooling around, that you should be able to extract the data you need in a matter of hours.

You'll probably need a couple of days to get yourself set up, figure out exactly what you need to store, how to store it and what to do with it when you've got it, but that's the fun part anyway.

Here's how I think you could approach the problem.

You know the file structure. It's a large (mind-bogglingly large, let's be honest) XML file; I see about 21GB. Structurally it's pretty simple, though. You need <page> elements, from which you need to extract some basic information: title, text (or at least the links within it) and maybe id. That's a pretty simple parsing job; no need for XML libraries or whatnot, a simple string-matching algorithm should suffice. For titles, use String#index to find the open and close tags and extract the bit between. For the first link in the text it's a bit trickier, because you have to determine the first real link according to the rules.
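
Something like this, perhaps (a hand-rolled helper rather than anything from a library; the tag names are just examples):

# Returns [text between the tags, index just past the close tag], or nil.
def extract_between(text, open_tag, close_tag, from = 0)
  start = text.index(open_tag, from)
  return nil unless start
  start += open_tag.length
  finish = text.index(close_tag, start)
  return nil unless finish
  [text[start...finish], finish + close_tag.length]
end

page = "<page><title>Anarchism</title><text>... [[political philosophy]] ...</text></page>"
title, = extract_between(page, "<title>", "</title>")  # => "Anarchism"
link,  = extract_between(page, "[[", "]]")              # => "political philosophy"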

Reading 21GB of text into memory would be a good trick, but of course you don't have to do that: you just need a useful-sized chunk to work on. A megabyte would seem reasonable. Or maybe 10K. It's not a big deal - chop off a GB or so to experiment on.
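
A sketch of the chunked read, carrying over any partial <page> from one chunk to the next so nothing is lost at the boundaries (each_page and the 1MB chunk size are, again, just illustrative):

CHUNK_SIZE = 1_048_576  # 1MB at a time; tune to taste

# Yields one complete <page>...</page> span at a time (plus whatever text
# preceded it in the buffer, which is harmless for title/link extraction).
def each_page(path)
  buffer = ""
  File.open(path, "r") do |f|
    while (chunk = f.read(CHUNK_SIZE))
      buffer << chunk
      while (close = buffer.index("</page>"))
        page_end = close + "</page>".length
        yield buffer[0...page_end]
        buffer = buffer[page_end..-1]
      end
    end
  end
end

each_page("enwiki-20090306-pages-articles.xml") do |page|
  title, = extract_between(page, "<title>", "</title>")
  # ... find the first link, write it out, etc.
end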

I have a script that extracts and writes to a text file about 250,000 title/first-link pairs a minute. It ignores "redirect" pages (so it's processing many more pages) and ignores links with a ":" (not smart enough by far, but I wanted to put some processing in there). No regexen, heck, no requires. About 30 lines of not very terse code. It found about 5.23 million titles (I think there are more non-required ones: files, projects and whatnot) and wrote a rather more focused and manageable 1.03GB of output (see below) in about 20 minutes. Ruby (MRI) 1.8.6, Windows Vista, 2GHz Core 2 Duo. And they say Ruby's slow.

The first 3 lines:

Anarchism, [[political philosophy]]
Autism, [[Neurodevelopmental disorder|brain development disorder]]
Albedo, [[Sun]]
Mike Woodhouse
A: 

A more general version of this question is asked at Save a process’ memory for later use? It looks like you can't easily back up long-running processes.

Andrew Grimm