This cannot be done reliably, and that is not due to a limitation of Python or of any other programming language. A human being could not do this predictably either without guessing and following a few rules (usually called heuristics in this context).
So let's first design a few heuristics and then encode them in Python. Things to consider are:
- All the values are valid strings. That is the premise of the problem, so there is no point in checking for it. We should check everything else we can, and whatever falls through can simply be left as a string.
- Dates are the most obvious thing to check first. If they are formatted in a predictable manner, such as [YYYY]-[MM]-[DD] (the ISO 8601 date format), they are easy to distinguish from other bits of text that contain numbers. If the dates are in a format with just digits, like YYYYMMDD, then we are stuck, as these dates will be indistinguishable from ordinary numbers.
- We will do integers next, because all integers are valid floats but not all floats are valid integers. We could just check whether the text contains only digits (or digits and the letters A-F if hexadecimal numbers are possible) and, in that case, treat the value as an integer.
- Floats would be next, as they are numbers with some formatting (the decimal point). It is easy to recognise 3.14159265 as a floating point number. However, 5.0, which can be written simply as 5, is also a valid float; written as 5 it would have been caught by the previous step and not be recognised as a float, even if it was intended to be one.
- Any values that are left unconverted can be treated as strings.
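To make the overlaps in the list above concrete, here is a small sketch (written for Python 3) that asks several parsers whether they accept the same text. The helper names `parses_as`, `iso_date` and `compact_date` are made up for this illustration and are not part of any library.

```python
from datetime import datetime


def parses_as(value, parser):
    """Return True if parser accepts value without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def iso_date(value):
    # Dates written as YYYY-MM-DD.
    return datetime.strptime(value, "%Y-%m-%d")


def compact_date(value):
    # Hypothetical parser for dates written as YYYYMMDD.
    return datetime.strptime(value, "%Y%m%d")


# A digits-only date is also a perfectly good integer.
print(parses_as("20100120", compact_date), parses_as("20100120", int))  # True True

# "5" is accepted as an int and as a float.
print(parses_as("5", int), parses_as("5", float))  # True True

# A date with separators is rejected by both numeric parsers.
print(parses_as("2010-01-20", int), parses_as("2010-01-20", float))  # False False
```

Only the last case is unambiguous, which is why the separated date format is safe to check first.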
Due to the possible overlaps I have mentioned above, such a scheme can never be 100% reliable. Also, any new data type that you need to support (complex numbers, perhaps) would need its own set of heuristics and would have to be placed in the most appropriate position in the chain of checks. The more likely a check is to match only the desired data type, the higher up the chain it should be.
Now let's make this real in Python. Most of the heuristics I mentioned above are taken care of for us by Python; we just need to decide on the order in which to apply them:
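As an illustration of why position in the chain matters, the sketch below (Python 3) adds the built-in `complex` as an extra check. The names `convert_with`, `good_chain` and `bad_chain` are invented for this example. Since `complex` also accepts plain numbers, it is the least specific check and probably belongs at the end.

```python
from datetime import datetime


def iso_date(value):
    # Dates written as YYYY-MM-DD.
    return datetime.strptime(value, "%Y-%m-%d")


def convert_with(value, chain):
    """Return the result of the first converter in chain that accepts value."""
    for converter in chain:
        try:
            return converter(value)
        except ValueError:
            continue
    # Nothing matched, so leave the value as a string.
    return value


# Most specific checks first, least specific last.
good_chain = (iso_date, int, float, complex)
# complex placed too early swallows ordinary numbers.
bad_chain = (complex, iso_date, int, float)

print(repr(convert_with("16", good_chain)))    # 16
print(repr(convert_with("16", bad_chain)))     # (16+0j)
print(repr(convert_with("1+2j", good_chain)))  # (1+2j)
```

With `bad_chain`, every integer and float comes back as a complex number, which is almost certainly not what was intended.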
from datetime import datetime

# Checks are tried in this order: most specific first.
heuristics = (lambda value: datetime.strptime(value, "%Y-%m-%d"),
              int, float)

def convert(value):
    for converter in heuristics:
        try:
            return converter(value)
        except ValueError:
            continue
    # All other heuristics failed, so it is a string
    return value

values = ['3.14159265', '2010-01-20', '16', 'some words']
for value in values:
    converted_value = convert(value)
    print(converted_value, type(converted_value))

This outputs the following (on Python 3):

3.14159265 <class 'float'>
2010-01-20 00:00:00 <class 'datetime.datetime'>
16 <class 'int'>
some words <class 'str'>
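One thing to be aware of: `int` and `float` are more lenient than the "only digits" rule described earlier. For example, `float` accepts "nan" and "inf", and both accept surrounding whitespace. If that matters for your data, a stricter variant could check the text against a pattern first. This is only a sketch for Python 3; the patterns and the names `strict_int`, `strict_float` and `strict_convert` are my own assumptions about what "plain" numbers look like, so adjust them to your input.

```python
import re
from datetime import datetime

# Assumed shapes: optional sign, ASCII digits, and for floats a decimal point.
INTEGER = re.compile(r"[+-]?[0-9]+")
FLOAT = re.compile(r"[+-]?[0-9]+\.[0-9]+")


def iso_date(value):
    # Dates written as YYYY-MM-DD.
    return datetime.strptime(value, "%Y-%m-%d")


def strict_int(value):
    if INTEGER.fullmatch(value) is None:
        raise ValueError("not a plain integer: %r" % value)
    return int(value)


def strict_float(value):
    if FLOAT.fullmatch(value) is None:
        raise ValueError("not a plain float: %r" % value)
    return float(value)


def strict_convert(value):
    for converter in (iso_date, strict_int, strict_float):
        try:
            return converter(value)
        except ValueError:
            continue
    # Nothing matched, so leave the value as a string.
    return value


for text in ["16", "3.14159265", "nan", " 16 ", "some words"]:
    print(repr(text), "->", repr(strict_convert(text)))
```

Here "nan" and " 16 " stay strings, whereas the plain `int`/`float` chain would have converted them.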