ansaurus

Question

Reading correctly (alpha)numeric fields into R

Answer 1

+1 A:

the character and factor question is something only you can answer. It depends if you need to use them later as factors or characters. It also depends whether you need to clean them up at all afterwards. For example, if you plan to apply a number of ifelse() modifications to a factor afterwards you might as well just read it in as a character now and turn it into a factor later. Or, if you want to specifically code the factor in some way you will likely be better off reading it in as character.

As an aside, the reason you use read.delim over read.table is because of the default settings therefore don't bother setting the sep to the same as the default.

John 2010-04-29 10:17:17

Thank you. What matters most of all for me, is to be able to load the data back to the database without damaging it. As for the sep setting, it is still there because I initially tried to change it. http://stackoverflow.com/questions/2732397/why-the-field-separator-character-must-be-only-one-byte

gd047 2010-04-29 10:28:20

Answer 2

+4 A:

Would you suggest "character" or "factor" for varchar fields?

As John mentioned, this depends upon usage. It is simple to switch between the two, so don't worry too much about it. If the column represents a categorical variable, it should eventually be considered as a factor. If you intend on mining the text (e.g. comments fields), then character makes more sense.

Is it ok to use "character" for datetime ones?

It's fine for storing the dates in a data frame, but if you want them to be treated correctly for analysis purposes, you'll have to convert it to Date or POSIXct/POSIXlt form.

What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?

Values are read in to usual double accuracy (about 15 sig figs); in this particular example, 540912.69 is the best accuracy you can achieve. Compare

print(540912.68999999994)             # 540912.7
print(540912.68999999994, digits=22)  # 540912.69
print(540912.6899999994)              # 540912.7
print(540912.6899999994, digits=22)   # 540912.6899999994

EDIT: If you need more precision for your numbers, use the Rmpfr package.

I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.

The default for colClasses (when you don't specify it) does a pretty good job of guessing what columns should be. If you are doing things like using 01 as a character, then there's no way round explicitly specifying it.

Richie Cotton 2010-04-29 14:52:02

+1 Thank you very much. Regarding the precision issue: I completely understand that the values are read with correct accuracy. I just wanted to be able to get a *bit perfect comparison* between the initial text file and the one produced by the write.table command (when -as in my case- the only modification I did was to rearrange the columns), so that I can be confident that nothing messed up. Obviously this can't be done when a rounding of any kind is performed :(

gd047 2010-04-29 15:25:00

ansaurus

tags:

views:

answers:

Reading correctly (alpha)numeric fields into R

related questions