A few weeks ago I wrote a Python script to extract some rows from a large (280 MB) CSV file. More precisely, I wanted to extract all available information on companies in DBpedia that have an ISIN field. Later I tried the same in R, but as hard as I tried, the R script took about 10x longer than the Python script (10 min vs. 1 min on my rather old laptop). Maybe this is due to my limited knowledge of R, in which case I would appreciate any hint on how to make the script faster. Here is the Python code:
from time import clock
clock()
infile = "infobox_de.csv"
outfile = "companies.csv"
reader = open(infile, "rb")
writer = open(outfile, "w")
oldthing = ""
isCompany = False
hasISIN = False
matches = 0
key = ""    # guard against rows with fewer than three fields before the first full one is seen
value = ""
buf = []    # buffer of all lines belonging to the current subject ("thing")
for line in reader:
    row = line.strip().split("\t")
    if len(row) > 0: thing = row[0]
    if len(row) > 1: key = row[1]
    if len(row) > 2: value = row[2]
    if (len(row) > 0) and (oldthing != thing):
        # the subject changed: flush the buffered block if it matched both criteria
        if isCompany and hasISIN:
            matches += 1
            for tup in buf:
                writer.write(tup)
        buf = []
        isCompany = False
        hasISIN = False
    isCompany = isCompany or ((key.lower() == "wikipageusestemplate") and (value.lower() == "template:infobox_unternehmen"))
    hasISIN = hasISIN or ((key.lower() == "isin") and (value != ""))
    oldthing = thing
    buf.append(line)
writer.close()
print "finished after ", clock(), " seconds; ", matches, " matches."
and here is the R script (I no longer have the exact equivalent, but this very similar version returns a data frame instead of writing a CSV file and does not check for the ISIN):
infile <- "infobox_de.csv"
maxLines <- 65000
reader <- file(infile, "r")
writer <- textConnection("queryRes", open = "w", local = TRUE)
writeLines("thing\tkey\tvalue\tetc", writer)   # header line; writeLines appends the newline itself
oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
value <- ""
thing <- ""
buf <- c()   # buffer of all lines belonging to the current subject ("thing")
repeat {
    lines <- readLines(reader, maxLines)   # read at most maxLines lines per chunk
    if (length(lines) == 0) break
    for (line in lines) {
        lineNumber <- lineNumber + 1
        row <- unlist(strsplit(line, "\t"))
        if (length(row) > 0) thing <- row[1]
        if (length(row) > 1) key <- row[2]
        if (length(row) > 2) value <- row[3]
        if ((length(row) > 0) && (oldthing != thing)) {
            # the subject changed: flush the buffered block if it matched
            if (hasInfobox) {
                matches <- matches + 1
                writeLines(buf, writer)
            }
            buf <- c()
            hasInfobox <- FALSE
        }
        hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == "template:infobox_unternehmen"))
        oldthing <- thing
        buf <- c(buf, line)
    }
}
close(reader)
close(writer)
readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result
One thing I did deliberately was to restrict readLines to at most 65,000 lines per call. I did this because I thought my 500 MB RAM machine would run out of memory otherwise. I did not try without this restriction.
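For clarity, the chunked-reading pattern I mean is just the following skeleton (a minimal sketch of my own loop with the parsing stripped out; the totalLines counter is only there for illustration and is not part of the scripts above):

infile <- "infobox_de.csv"
maxLines <- 65000                        # chunk size chosen for my 500 MB machine
con <- file(infile, "r")
totalLines <- 0
repeat {
    chunk <- readLines(con, maxLines)    # read at most maxLines lines per call
    if (length(chunk) == 0) break        # stop at end of file
    totalLines <- totalLines + length(chunk)
    # ... process the chunk line by line here ...
}
close(con)
totalLines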