views: 288
answers: 6

I have 4 reasonably complex R scripts that are used to manipulate CSV and XML files. These were created by another department, where they work exclusively in R.

My understanding is that while R is very fast when dealing with data, it's not really optimised for file manipulation. Can I expect to get significant speed increases by converting these scripts to Python? Or is this something of a waste of time?

A: 

My guess is that you probably won't see much of a speed-up. When comparing high-level languages, overhead in the language itself is typically not to blame for performance problems; typically, the problem is your algorithm.

I'm not very familiar with R, but you may find speed-ups by reading larger chunks of data into memory at once rather than smaller chunks (fewer system calls). If R doesn't give you the ability to change something like this, you will probably find that Python can be much faster simply because of that ability.
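
For what it's worth, R does offer this kind of control: readLines takes an n argument, so a script can pull the file in large blocks rather than line by line. A minimal sketch, with a placeholder file name and an arbitrary chunk size:

con <- file("data.txt", "r")
repeat {
  chunk <- readLines(con, n = 10000)   # read up to 10,000 lines per call
  if (length(chunk) == 0) break
  # ... process the whole chunk here ...
}
close(con)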

xyld
+1  A: 

Know where the time is being spent. If your R scripts are bottlenecked on disk IO (and that is very possible in this case), then you could rewrite them in hand-optimized assembly and be no faster. As always with optimization, if you don't measure first, you're just pissing into the wind. If they're not bottlenecked on disk IO, you would likely see more benefit from improving the algorithm than changing the language.
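
In R, the measuring itself is cheap: system.time() times a single expression, and Rprof()/summaryRprof() show where the time actually goes. A minimal sketch, with a placeholder file name and a read.csv call standing in for whatever the scripts really do:

# time one expression: reports user/system/elapsed seconds
system.time(dat <- read.csv("yourfile.csv"))

# profile a longer run to see which functions dominate
Rprof("profile.out")
dat <- read.csv("yourfile.csv")
Rprof(NULL)
summaryRprof("profile.out")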

Eloff
Very eloquent.
graham.reeds
+10  A: 

I write in both R and Python regularly. I find Python's modules for writing, reading and parsing information easier to use, maintain and update. Little niceties, like the way Python lets you work with lists of items compared with R's indexing, make things much easier to read.

I highly doubt you will gain any significant speed-up by switching the language. If you are becoming the new "maintainer" of these scripts and you find Python easier to understand and extend, then I'd say go for it.

Computer time is cheap ... programmer time is expensive. If you have other things to do then I'd just limp along with what you've got until you have a free day to putz with them.

Hope that helps.

JudoWill
+1: Having experience in both R and Python, I fully concur with you!
EOL
+1  A: 

What do you mean by "file manipulation"? Are you talking about moving files around, deleting, copying, etc.? In that case I would use a shell, e.g., bash. If you're talking about reading in the data, performing calculations, and perhaps writing out a new file, then you could probably use either Python or R. Unless maintenance is an issue, I would just leave it as R and find other fish to fry, as you're not going to see enough of a speedup to justify the time and effort of porting that code.

wescpy
A: 

Hi Dan,

There are a few rules for making R data manipulation fast. The basics are:

  1. vectorize
  2. use data.frames as little as possible (for example, only at the end)

Search for R time optimization and profiling and you will find many resources to help you.
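
As a rough, self-contained illustration of what those two rules mean in practice (not taken from the scripts in question):

x <- runif(1e6)

# rule 1: an explicit element-by-element loop ...
system.time({
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]
})
# ... versus the vectorised equivalent, which runs in compiled code
system.time(total <- sum(x))

# rule 2: repeated element access on a data.frame costs much more than on a matrix
df <- data.frame(a = x)
m <- matrix(x, ncol = 1)
system.time(for (i in 1:10000) df[i, 1])
system.time(for (i in 1:10000) m[i, 1])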

Tal Galili
Vectorise, yes. Avoid data frames, no.
hadley
Hi Hadley. Why do you say that avoiding data frames is not important (for performance reasons)?
Tal Galili
+2  A: 

A few weeks ago, I wrote a Python script to extract some rows from a large (280 MB) CSV file. More precisely, I wanted to extract all available information on companies in DBpedia that have an ISIN field. Later I tried the same in R, but as hard as I tried, the R script took 10x longer than the Python script (10 min vs 1 min on my rather old laptop). Maybe this is due to my knowledge of R, in which case I would appreciate any hints on how to make the script faster. Here is the Python code:

from time import clock

clock()  # timing: clock() is read again at the end to report the elapsed time
infile = "infobox_de.csv"
outfile = "companies.csv"

reader = open(infile, "rb")
writer = open(outfile, "w")

oldthing = ""
isCompany = False
hasISIN = False
matches = 0
buf = []          # lines belonging to the current subject ("thing")
key = value = ""  # avoid a NameError if the very first rows are short

for line in reader:
    row = line.strip().split("\t")
    if len(row) > 0: thing = row[0]
    if len(row) > 1: key = row[1]
    if len(row) > 2: value = row[2]
    # a new subject starts: flush the previous one if it was a company with an ISIN
    if (len(row) > 0) and (oldthing != thing):
        if isCompany and hasISIN:
            matches += 1
            for tup in buf:
                writer.write(tup)
        buf = []
        isCompany = False
        hasISIN = False
    isCompany = isCompany or ((key.lower() == "wikipageusestemplate") and (value.lower() == "template:infobox_unternehmen"))
    hasISIN = hasISIN or ((key.lower() == "isin") and (value != ""))
    oldthing = thing
    buf.append(line)

# flush the last subject as well
if isCompany and hasISIN:
    matches += 1
    for tup in buf:
        writer.write(tup)

reader.close()
writer.close()
print "finished after ", clock(), " seconds; ", matches, " matches."

And here is the R script (I no longer have the exact equivalent version, but a very similar one, which returns a data frame instead of writing a CSV file and does not check for the ISIN):

infile <- "infobox_de.csv"
maxLines <- 65000   # read the file in chunks of at most 65000 lines

reader <- file(infile, "r")
writer <- textConnection("queryRes", open = "w", local = TRUE)
writeLines("thing\tkey\tvalue\tetc", writer)   # header row (writeLines appends the newline)

oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
value <- ""
thing <- ""
buf <- c()   # lines belonging to the current subject ("thing")

repeat {
  lines <- readLines(reader, maxLines)
  if (length(lines) == 0) break
  for (line in lines) {
    lineNumber <- lineNumber + 1
    row <- unlist(strsplit(line, "\t"))
    if (length(row) > 0) thing <- row[1]
    if (length(row) > 1) key <- row[2]
    if (length(row) > 2) value <- row[3]
    # a new subject starts: flush the previous one if it had the company infobox
    if ((length(row) > 0) && (oldthing != thing)) {
      if (hasInfobox) {
        matches <- matches + 1
        writeLines(buf, writer)
      }
      buf <- c()
      hasInfobox <- FALSE
    }
    hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == tolower("template:infobox_unternehmen")))
    oldthing <- thing
    buf <- c(buf, line)
  }
}
# flush the last subject as well
if (hasInfobox) {
  matches <- matches + 1
  writeLines(buf, writer)
}
close(reader)
close(writer)
readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result

What I did explicitly was to restrict readLines to a maximum of 65000 lines. I did this because I thought my 500 MB RAM machine would run out of memory otherwise. I did not try it without this restriction.
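
For what it's worth, one way to apply the "vectorise" advice from earlier in this thread to this task is to read the whole file once and let R's vectorised string functions do the work. An untested sketch, which assumes the 280 MB file fits in memory (that may be tight with 500 MB of RAM) and, like the R version above, only checks for the company infobox:

lines <- readLines("infobox_de.csv")

# split every line at once and pull out the first three fields
parts <- strsplit(lines, "\t", fixed = TRUE)
thing <- sapply(parts, function(p) p[1])
key   <- tolower(sapply(parts, function(p) p[2]))
value <- tolower(sapply(parts, function(p) p[3]))

# subjects ("things") that carry the company infobox template
isCompany <- !is.na(key) & !is.na(value) &
  key == "wikipageusestemplate" & value == "template:infobox_unternehmen"
companyThings <- unique(thing[isCompany])

# keep every line whose subject is one of those companies
writeLines(lines[thing %in% companyThings], "companies.csv")

The idea is to replace the per-line state machine with a handful of whole-vector operations, which is where R tends to be comfortable.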

Karsten W.