A few weeks ago I wrote a Python script to extract some rows from a large (280 MB) CSV file. More precisely, I wanted to extract all available information on companies in DBpedia that have an ISIN field. Later I tried the same in R, but as hard as I tried, the R script took about 10x longer than the Python script (10 min vs. 1 min on my rather old laptop). Maybe this is due to my limited knowledge of R, in which case I would appreciate any hint on how to make the script faster. Here is the Python code:
from time import clock
clock()
infile = "infobox_de.csv"
outfile = "companies.csv"
reader = open(infile, "rb")
writer = open(outfile, "w")
oldthing = ""
isCompany = False
hasISIN = False
matches = 0
key = ""    # guard against rows with fewer than three fields before the first full one is seen
value = ""
buf = []    # buffer of all lines belonging to the current subject ("thing")
for line in reader:
    row = line.strip().split("\t")
    if len(row) > 0: thing = row[0]
    if len(row) > 1: key = row[1]
    if len(row) > 2: value = row[2]
    if (len(row) > 0) and (oldthing != thing):
        # the subject changed: flush the buffered block if it matched both criteria
        if isCompany and hasISIN:
            matches += 1
            for tup in buf:
                writer.write(tup)
        buf = []
        isCompany = False
        hasISIN = False
    isCompany = isCompany or ((key.lower() == "wikipageusestemplate") and (value.lower() == "template:infobox_unternehmen"))
    hasISIN = hasISIN or ((key.lower() == "isin") and (value != ""))
    oldthing = thing
    buf.append(line)
writer.close()
print "finished after ", clock(), " seconds; ", matches, " matches."
and here is the R script (I no longer have the exact equivalent, but this very similar version returns a data frame instead of writing a CSV file and does not check for the ISIN):
infile <- "infobox_de.csv"
maxLines <- 65000
reader <- file(infile, "r")
writer <- textConnection("queryRes", open = "w", local = TRUE)
writeLines("thing\tkey\tvalue\tetc", writer)   # header line; writeLines appends the newline itself
oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
value <- ""
thing <- ""
buf <- c()   # buffer of all lines belonging to the current subject ("thing")
repeat {
    lines <- readLines(reader, maxLines)   # read at most maxLines lines per chunk
    if (length(lines) == 0) break
    for (line in lines) {
        lineNumber <- lineNumber + 1
        row <- unlist(strsplit(line, "\t"))
        if (length(row) > 0) thing <- row[1]
        if (length(row) > 1) key <- row[2]
        if (length(row) > 2) value <- row[3]
        if ((length(row) > 0) && (oldthing != thing)) {
            # the subject changed: flush the buffered block if it matched
            if (hasInfobox) {
                matches <- matches + 1
                writeLines(buf, writer)
            }
            buf <- c()
            hasInfobox <- FALSE
        }
        hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == "template:infobox_unternehmen"))
        oldthing <- thing
        buf <- c(buf, line)
    }
}
close(reader)
close(writer)
readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result
One thing I did deliberately was to restrict readLines to at most 65,000 lines per call. I did this because I thought my 500 MB RAM machine would run out of memory otherwise. I did not try without this restriction.
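For clarity, the chunked-reading pattern I mean is just the following skeleton (a minimal sketch of my own loop with the parsing stripped out; the totalLines counter is only there for illustration and is not part of the scripts above):

infile <- "infobox_de.csv"
maxLines <- 65000                        # chunk size chosen for my 500 MB machine
con <- file(infile, "r")
totalLines <- 0
repeat {
    chunk <- readLines(con, maxLines)    # read at most maxLines lines per call
    if (length(chunk) == 0) break        # stop at end of file
    totalLines <- totalLines + length(chunk)
    # ... process the chunk line by line here ...
}
close(con)
totalLines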