Hi,
I find myself having to do this very often -- compare specific columns from 2 different files. The columns and formats are the same, but the columns that need comparison contain floating-point/exponential data, e.g. 0.0058104642437413175, -3.459017050577087E-4, etc.
I'm currently using the R code below:
library(sqldf)
# Read both pipe-delimited files (no header row)
test <- read.csv("C:/VBG_TEST/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")
prod <- read.csv("C:/VBG_PROD/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")
# Sum the columns of interest in each file and compare the totals by eye
sqldf("select sum(V10), sum(V15) from test")
sqldf("select sum(V10), sum(V15) from prod")
I read in the files, sum the specific columns -- V10 and V15 -- and then observe the values. This way I can ignore very small differences in the floating-point data per row.
However, going forward, I would like to set a tolerance percentage, i.e. flag a row if abs( (prod.V10 - test.V10) / prod.V10 ) > 0.01%, and only print the row numbers that exceed this tolerance limit.
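Roughly, what I have in mind is something like this (a sketch assuming the two files have their rows in the same order; 0.0001 is 0.01% written as a fraction):

# Relative difference per row for column V10 (rows assumed to line up)
rel_diff <- abs((prod$V10 - test$V10) / prod$V10)
# Row numbers where the relative difference exceeds 0.01%
bad_rows <- which(rel_diff > 0.0001)
bad_rows
# Side-by-side view of the offending values
data.frame(row = bad_rows, prod = prod$V10[bad_rows], test = test$V10[bad_rows])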
Also, if the data is not in the same order, how can I do the comparison by specifying columns that act as a composite primary key?
For example, if I did this in Sybase, I'd write something like:
select A.*, B.*
from tableA A, tableB B
where abs( (A.Col15 - B.Col15) / A.Col15 ) > 0.0001  -- i.e. 0.01%
and A.Col1 = B.Col1
and A.Col4 = B.Col4
and A.Col6 = B.Col6
If I try doing the same thing using sqldf in R, it does NOT work, as the files contain 500K+ rows of data.
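Is something along these lines, using merge() on the key columns, the right way to go? A rough sketch of what I mean, where V1, V4, V6 stand in for my composite key columns and V15 for the value column:

# Join the two data frames on the composite key columns
both <- merge(prod, test, by = c("V1", "V4", "V6"), suffixes = c(".prod", ".test"))
# Keep only rows where the relative difference in V15 exceeds 0.01%
tol <- 0.0001
mismatches <- both[abs((both$V15.prod - both$V15.test) / both$V15.prod) > tol, ]
mismatches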
Can anyone point me to how I can do the above in R?
Many thanks, Chapax.