Parsing a domain name

+1 Q:

I am parsing the domain name out of a string by using strchr() to find the last . (dot) and counting back until the dot before that (if any); then I know I have my domain.

This is a rather nasty piece of code, and I was wondering if anyone has a better way.

The possible strings I might get are:

domain.com
something.domain.com
some.some.domain.com

You get the idea. I need to extract the "domain.com" part.

Before you tell me to search Google, I already did. I found no answer, hence I am asking here.

Thank you for your help

EDIT:

The string I have contains a full hostname. It is usually of the form whatever.domain.com but can take other forms; as someone mentioned, it can also be whatever.domain.co.uk. Either way, I need to parse out the domain part of the hostname: domain.com or domain.co.uk

Not sure what flavor of C, but you probably want to tokenize the domain using "." as the separator.

Try this: http://www.metalshell.com/source_code/31/String_Tokenizer.html

As for the domain name, I'm not sure what your end goal is, but domains can have lots and lots of nodes: you could have a domain name like foo.baz.biz.boz.bar.co.uk.

If you just want the last 2 nodes, then use the above and take the last two tokens.

Joelio 2010-08-27 19:48:14

A domain name can only have 255 octets, not quite "infinite"...

bstpierre 2010-08-27 19:54:23

Just the name.something, where the .something can be .com, .net, etc., or of the form .co.uk, etc.

Jessica 2010-08-27 19:56:53

I tried before with strtok, but I need to keep on reading and saving strings... unless you know a good way to do it

Jessica 2010-08-27 19:57:37

Not sure what your program needs to do. Do you need to take http://www.foo.bar.co.uk and turn it into co.uk?

Joelio 2010-08-27 20:05:46

No, I get a string containing a full hostname. It is usually of the form whatever.domain.com but can take other forms; as someone mentioned, it can also be whatever.domain.co.uk. Either way, I need to parse out the domain part of the hostname: domain.com or domain.co.uk

Jessica 2010-08-27 20:07:57

Should be easy with strtok: use the example I noted, store each token in an array of strings, and keep track of the number of tokens. Then just join the last 2 or 3 tokens (depending on the extension) for your answer.
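
The strtok approach described here can be sketched roughly as follows. The function name last_labels_strtok, the 256-byte working buffer, and the 128-label cap are assumptions made for illustration, not part of the original suggestion. Since a 255-character string can hold at most 128 one-character labels, a fixed array of 128 char* is enough for any input the buffer accepts.

```c
#include <assert.h>
#include <string.h>

#define MAX_LABELS 128  /* a 255-character copy holds at most 128 labels */

/* Hypothetical helper: joins the last `want` dot-separated labels of
 * `host` into `out` (capacity `outsz`). If `host` has fewer labels,
 * all of them are used. Returns 0 on success, -1 on error; on error
 * the contents of `out` are unspecified. */
int last_labels_strtok(const char *host, size_t want, char *out, size_t outsz)
{
    char buf[256];
    char *labels[MAX_LABELS];
    size_t count = 0, i;
    char *tok;

    assert(host != NULL && out != NULL);
    if (want == 0 || outsz == 0 || strlen(host) >= sizeof buf)
        return -1;
    strcpy(buf, host);          /* strtok modifies its input, so work on a copy */

    /* Split on dots, remembering where each label starts. */
    for (tok = strtok(buf, "."); tok != NULL; tok = strtok(NULL, ".")) {
        if (count == MAX_LABELS)
            return -1;
        labels[count++] = tok;
    }
    if (count == 0)
        return -1;
    if (want > count)
        want = count;

    /* Join the last `want` labels back together with dots. */
    out[0] = '\0';
    for (i = count - want; i < count; i++) {
        /* Conservative bound: current text + label + '.' + NUL. */
        size_t need = strlen(out) + strlen(labels[i]) + 2;
        if (need > outsz)
            return -1;
        if (out[0] != '\0')
            strcat(out, ".");
        strcat(out, labels[i]);
    }
    return 0;
}
```

Called with "some.some.domain.com" and a count of 2, this writes "domain.com" into out. Note that strtok skips empty tokens, so a trailing dot or a doubled dot is silently ignored in this sketch.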

Joelio 2010-08-27 20:59:49

OK... how many char* do I need? 3? 4? 5?

Jessica 2010-08-28 10:14:48

+2 A:

Did you mean strrchr()?

I would probably approach this by doing:

1. strrchr to get the last dot in the string, save a pointer here, replace the dot with a NUL ('\0').
2. strrchr again to get the next to last dot in the string. The character after this is the start of the name you are looking for (domain.com).
3. Using the pointer you saved in #1, put the dot back where you set it NUL.

Beware that names can sometimes end with a dot. If this is a valid part of your input set, you'll need to account for it.

Edit: To handle the flexibility you need in terms of example.co.uk and others, the function described above would take an additional parameter telling it how many components to extract from the end of the name.

You're on your own for figuring out how to decide how many components to extract -- as Philip Potter mentions in a comment below, this is a Hard Problem.
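
A minimal sketch of the steps above, generalized to take the number of components as a parameter. The function name last_components and the cap of 8 components are assumptions for illustration only. The input must be a writable buffer, because dots are temporarily overwritten with NUL and then restored.

```c
#include <assert.h>
#include <string.h>

#define MAX_COMPONENTS 8    /* arbitrary cap chosen for this sketch */

/* Hypothetical helper: returns a pointer into `host` at the start of its
 * last `n` dot-separated components, or `host` itself if it has fewer.
 * Returns NULL if `n` is 0 or larger than MAX_COMPONENTS.
 * `host` is modified temporarily but restored before returning. */
char *last_components(char *host, size_t n)
{
    char *cut[MAX_COMPONENTS];  /* dots currently replaced with NUL */
    size_t ncut = 0;
    char *start = host;
    char *dot;

    assert(host != NULL);
    if (n == 0 || n > MAX_COMPONENTS)
        return NULL;

    while ((dot = strrchr(host, '.')) != NULL) {
        if (ncut + 1 == n) {    /* this dot sits just before the wanted part */
            start = dot + 1;
            break;
        }
        *dot = '\0';            /* hide this dot so the next strrchr finds the one before it */
        cut[ncut++] = dot;
    }

    while (ncut > 0)            /* put every hidden dot back */
        *cut[--ncut] = '.';
    return start;
}
```

With a writable copy of "some.some.domain.com" and n set to 2, the returned pointer refers to "domain.com" inside the same buffer, so nothing is allocated or copied. A trailing dot is not handled here: "domain.com." with n set to 2 yields "com.", so strip it first if your input can contain one.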

bstpierre 2010-08-27 19:50:10

Let me try it. However, as Philip Potter mentioned, how can I handle domain.co.uk?

Jessica 2010-08-27 20:01:49

@Jessica: you need to define what you want more clearly. For example, .jp, .uk, and .au have second-level general domains: co.uk, co.jp, com.au. But others such as .de, .es, .be, .tk don't: amazon.de etc. What exactly do you want? If you want to extract the last "meaningful" part of the domain, it's a Hard Problem.
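
To illustrate why this is hard, here is a rough sketch that decides between 2 and 3 components from a hard-coded table. The table holds only the three suffixes mentioned in this comment and is nowhere near complete; the function names are made up for the example, and the input is assumed to be lowercase already. A maintained data set such as the Public Suffix List is the usual source for a realistic table.

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: a realistic table of such suffixes is far longer. */
static const char *const two_label_suffixes[] = { "co.uk", "co.jp", "com.au" };

/* Returns 1 if `host` equals `suffix` or ends with "." followed by `suffix`. */
static int ends_with_suffix(const char *host, const char *suffix)
{
    size_t hl = strlen(host), sl = strlen(suffix);

    if (hl < sl)
        return 0;
    if (strcmp(host + hl - sl, suffix) != 0)
        return 0;
    /* Require a label boundary so "notco.uk" does not match "co.uk". */
    return hl == sl || host[hl - sl - 1] == '.';
}

/* Hypothetical helper: number of trailing components that make up the
 * registered domain, based solely on the tiny table above. */
size_t domain_component_count(const char *host)
{
    size_t i;

    assert(host != NULL);
    for (i = 0; i < sizeof two_label_suffixes / sizeof two_label_suffixes[0]; i++)
        if (ends_with_suffix(host, two_label_suffixes[i]))
            return 3;
    return 2;
}
```

The returned count can then be passed to whatever routine extracts the trailing components of the hostname.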

Philip Potter 2010-08-27 20:11:39

See edited question

Jessica 2010-08-27 20:13:01

+2 A:

This isn't a reply to the question itself, but an idea for an alternate approach:

In the context of already very nasty code, I'd argue that a good way to make it less nasty, and to get a good facility for parsing domain names and the like, is to use PCRE or a similar library for regular expressions. That will definitely help you out if you also want to validate that the TLD exists, for instance.

It may take some effort to learn initially, but if you need to make changes to existing matching/parsing code, or create more code for string matching - I'd argue that a regex-lib may simplify this a lot in the long term. Especially for more advanced matching.

Another library I recall that supports regex is glib.

Kvisle 2010-08-27 23:54:13

Thanks. I would definitely try regular expressions; however, I can't use any external libraries.

Jessica 2010-08-28 10:13:42

What does the project currently link with? Libc only?

Kvisle 2010-08-28 15:57:29

yes. only standard C and that's it.

Jessica 2010-08-30 15:22:44
