I have some XML-tagged strings, as follows.

<Processor>AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ 2.31 GHz</Processor>
<ClockSpeed>2.31</ClockSpeed>
<NumberOfCores>2</NumberOfCores>
<InstalledMemory>2.00</InstalledMemory>
<OperatingSystem>Windows 7 Professional</OperatingSystem>

How can I detect the data type of each value automatically using Python? For example, "AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ 2.31 GHz" -> string, "2.31" -> float, and so on.

I need this because I want to create an SQLite table from the XML data, something like:

CREATE table ABC (Processor string, ClockSpeed float ... )
+2  A: 

Depending on the kinds of formats you expect, you could use regexes to detect floats and ints, and then assume that anything which can't be parsed into a number is a string, like so:

import re

FLOAT_RE = re.compile(r'^(\d+\.\d*|\d*\.\d+)$')
INT_RE = re.compile(r'^\d+$')

# ... code to get xml value into a variable ...

if FLOAT_RE.match(xml_value):
    value_type = 'float'
elif INT_RE.match(xml_value):
    value_type = 'int'
else:
    value_type = 'string'

This is only a basic first pass: numbers can be written in more complex formats (signed values or exponent notation, for example), and if you expect any of those you'd have to expand the patterns to handle them properly.
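As a rough illustration of that kind of expansion, here is a sketch that also accepts an optional sign and an exponent. The function name `detect_type` and the exact set of accepted formats are my own assumptions, not something the question specifies, so adjust the patterns to whatever your data actually contains.

```python
import re

# Assumed formats: optional sign, then either a decimal with an optional
# exponent ("2.31", ".5", "5.", "-1.5e-3") or whole digits with an exponent ("1e5").
FLOAT_RE = re.compile(r'[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?|[+-]?\d+[eE][+-]?\d+')
# Optional sign followed by digits only.
INT_RE = re.compile(r'[+-]?\d+')

def detect_type(xml_value):
    """Return 'int', 'float' or 'string' for a text value (hypothetical helper)."""
    text = xml_value.strip()
    # fullmatch() requires the whole string to match, so no ^...$ anchors are needed.
    if INT_RE.fullmatch(text):
        return 'int'
    if FLOAT_RE.fullmatch(text):
        return 'float'
    return 'string'
```

With these patterns a value such as "4400+" still falls through to 'string', because the trailing sign prevents a full match.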

Amber
A: 

BeautifulSoup is a good HTML/XML parser:

http://www.crummy.com/software/BeautifulSoup/

I'm not entirely sure whether it can convert data by type given an xsd/xsl, but it can detect encoding, so that might be a start.

eruciform
+3  A: 

One possibility is to try various types in a precise sequence, defaulting to str if none of them works. E.g.:

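def what_type(s, possible_types=((int, [0]), (float, ()))):
    # Each entry is (type, extra constructor args); int(s, 0) lets Python
    # infer the base from a prefix such as '0x'.
    for t, xargs in possible_types:
        try:
            t(s, *xargs)
        except ValueError:
            pass
        else:
            return t
    return str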

This is particularly advisable, of course, when you want to use exactly the same syntax conventions as Python: e.g., accept '0x7e' as int as well as '126', and so on. If you need different syntax conventions, then you should instead perform parsing on string s, whether via REs or by other means.
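To tie this back to the CREATE TABLE goal in the question, here is a minimal sketch under a few assumptions: the `<System>` wrapper element, the `SQLITE_TYPES` mapping and the `create_table_sql` helper are names I made up for illustration, and the tag names are assumed to be safe to use as column identifiers. The XML fragment in the question has no single root element, so it has to be wrapped before `xml.etree.ElementTree` can parse it.

```python
import sqlite3
import xml.etree.ElementTree as ET

def what_type(s, possible_types=((int, [0]), (float, ()))):
    # Same approach as above: the first type whose constructor accepts s wins.
    for t, xargs in possible_types:
        try:
            t(s, *xargs)
        except ValueError:
            pass
        else:
            return t
    return str

# Assumed mapping from Python types to SQLite column type names.
SQLITE_TYPES = {int: 'INTEGER', float: 'REAL', str: 'TEXT'}

# <System> is a hypothetical wrapper added so the fragment has one root.
XML = """<System>
<Processor>AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ 2.31 GHz</Processor>
<ClockSpeed>2.31</ClockSpeed>
<NumberOfCores>2</NumberOfCores>
<InstalledMemory>2.00</InstalledMemory>
<OperatingSystem>Windows 7 Professional</OperatingSystem>
</System>"""

def create_table_sql(xml_text, table_name):
    root = ET.fromstring(xml_text)
    # Tag names go straight into the SQL, so this assumes they are trusted.
    columns = ['%s %s' % (child.tag, SQLITE_TYPES[what_type(child.text or '')])
               for child in root]
    return 'CREATE TABLE %s (%s)' % (table_name, ', '.join(columns))

sql = create_table_sql(XML, 'ABC')
# Run the statement against an in-memory database to check that SQLite accepts it.
sqlite3.connect(':memory:').execute(sql)
```

Keep in mind that float() also accepts strings such as 'nan' and 'inf', so a text column containing those words would be typed as REAL with this approach.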

Alex Martelli
@Alex : How can I return "int", instead of <type 'int'>?
prosseek
Add a third value to each tuple in the `possible_types` sequence which is a string representation of that type; then modify the for loop to be `for t, xargs, stype in possible_types:` and change `return t` to `return stype`.
Amber
@Amber : Thanks, it works.
prosseek
@Amber's suggestion is good and very general, but `return t.__name__` will work if Python's exact type names are good for you.
Alex Martelli
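For reference, the two suggestions from the comments could look roughly like this. This is only a sketch: the function names `what_type_name` and `what_type` and the 'string' fallback label are assumptions, so pick whatever labels your schema needs.

```python
def what_type_name(s, possible_types=((int, [0], 'int'), (float, (), 'float'))):
    # Variant with a third tuple element: the label to return for each type.
    for t, xargs, stype in possible_types:
        try:
            t(s, *xargs)
        except ValueError:
            pass
        else:
            return stype
    return 'string'

def what_type(s, possible_types=((int, [0]), (float, ()))):
    # Variant that returns the type object itself.
    for t, xargs in possible_types:
        try:
            t(s, *xargs)
        except ValueError:
            pass
        else:
            return t
    return str

# With the second variant, __name__ gives Python's own type names.
name = what_type('2.31').__name__
```

The first variant lets you choose arbitrary labels, while `__name__` is limited to Python's names, so plain text comes back as 'str' rather than 'string'.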