ansaurus

Question

Answer 1

+1 A:

Try:

"www\\.[a-z]+[a-z0-9]*\\.(com|edu|org)"

I removed the [0-9]+ and replaced [a-z][a-z]* with [a-z]+.

Bart Kiers 2009-11-14 15:34:56

I want this to be case insensitive, would the following be preferable then? : "www\\.[a-zA-Z]+[a-zA-Z0-9]*\\.(com|edu|org)"

cb 2009-11-16 15:13:54

You could do that, but that won't permit `.COM` or `.eDu` as a TLD. You could enable case insensitive matching by adding the `(?i)` flag in front of your regex: `"(?i)www\\.[a-z]+[a-z0-9]*\\.(com|edu|org)"` assuming `(?i)` is supported by the regex library you're using.

Bart Kiers 2009-11-16 15:53:19
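
POSIX regular expressions do not specify the inline `(?i)` modifier, so with regex.h the usual route is the `REG_ICASE` flag passed to `regcomp()`. The following is only a minimal sketch: the helper name `is_www_host_icase` is hypothetical, and the `^`/`$` anchors are an assumption that the whole string should be a single host name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): returns 1 if `s` matches,
   0 if it does not, and -1 if the pattern fails to compile. */
int is_www_host_icase(const char *s) {
    regex_t rx;
    int rc;

    /* REG_ICASE applies to the whole pattern, so the TLD alternatives
       (com|edu|org) are matched case-insensitively as well.
       REG_NOSUB: only a yes/no answer is needed, no submatch offsets. */
    if (regcomp(&rx, "^www\\.[a-z]+[a-z0-9]*\\.(com|edu|org)$",
                REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&rx, s, 0, NULL, 0);  /* 0 means "matched" */
    regfree(&rx);
    return rc == 0;
}
```

With this flag the character classes can stay as `[a-z]`; there is no need to spell out `[a-zA-Z]`.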

I am using regex.h in C on Ubuntu. I'll give (?i) a test. "www\\.[a-z]+[a-z0-9]*\\.(com|edu|org)" is working delightfully; now I need to add in the characters - and _. Where do I drop them in? I attempted to add them in as such: [a-z0-9-_], but it failed. What key point am I glossing over here? Thanks all, CB

cb 2009-11-16 16:12:44

The `-` might be seen as the range indicator by the regex library. Try `[a-z0-9_-]` instead (at the end, it should not be taken for a range indicator). Or else, try escaping the `-` like this: `[a-z0-9_\-]`.

Bart Kiers 2009-11-16 17:28:16
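
A minimal sketch of the first suggestion, with the hyphen placed last in the bracket expression. The helper name `is_www_host_dash` is hypothetical, and requiring the label to start with a letter is an assumption carried over from the earlier pattern. Note that in POSIX bracket expressions a backslash is an ordinary character, so `[a-z0-9_\-]` would most likely also accept a literal `\`; putting `-` last avoids that.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): returns 1 on match,
   0 on no match, -1 if the pattern fails to compile. */
int is_www_host_dash(const char *s) {
    regex_t rx;
    int rc;

    /* "_-" at the very end of the bracket expression: a '-' that is the
       last character in the list is taken literally, not as a range. */
    if (regcomp(&rx, "^www\\.[a-z][a-z0-9_-]*\\.(com|edu|org)$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&rx, s, 0, NULL, 0);  /* 0 means "matched" */
    regfree(&rx);
    return rc == 0;
}
```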

Answer 2

A:

Your slashes are going the wrong way. Remember the web uses forward slashes.

Rob 2009-11-14 15:56:29

S/He's escaping the DOT meta character, it's not meant to be a forward slash.

Bart Kiers 2009-11-14 15:58:10

Ack! I misread it. I know he was escaping the dot character but misread the rest.

Rob 2009-11-14 22:15:00

Answer 3

+1 A:

The problem is the (?: ) group. You just need (www)\\.([a-z][a-z]*[0-9]+[a-z0-9]*)\\.(com|edu|org).

Btw, your inner expression says: "at least one alpha character, then at least one numeric character, then any alphanumeric characters". Is that what you mean? If so, you can make it a little bit shorter: [a-z]+[0-9]+[a-z0-9]*.

egorius 2009-11-14 19:52:49
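
Since the pattern in this answer keeps plain capturing groups, here is a minimal sketch of how they might be read back through the `nmatch`/`pmatch` arguments of `regexec()`. The helper name `extract_label` is hypothetical; the `^`/`$` anchors and the simplified inner expression `[a-z]+[a-z0-9]*` are assumptions, not the asker's original pattern.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): copies the middle label
   (capturing group 2) into `out`. Returns 1 on success, 0 otherwise. */
int extract_label(const char *s, char *out, size_t outsz) {
    regex_t rx;
    regmatch_t m[4];  /* m[0] = whole match, m[1]..m[3] = the three groups */
    int ok = 0;

    if (outsz == 0)
        return 0;
    if (regcomp(&rx, "^(www)\\.([a-z]+[a-z0-9]*)\\.(com|edu|org)$",
                REG_EXTENDED) != 0)
        return 0;

    if (regexec(&rx, s, 4, m, 0) == 0) {
        /* rm_so/rm_eo are byte offsets into `s` for each group. */
        size_t len = (size_t)(m[2].rm_eo - m[2].rm_so);
        if (len < outsz) {
            memcpy(out, s + m[2].rm_so, len);
            out[len] = '\0';
            ok = 1;
        }
    }
    regfree(&rx);
    return ok;
}
```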

"at least one alpha character, then at least one numeric character, then any alphanumeric characters". Is it what you mean? No, that would be a mistake. I will give your regex a try and report back, thank you! CB

cb 2009-11-16 15:12:04

Then you may prefer [a-z]+[a-z0-9]*, but beware, domain names can start with a digit :)

egorius 2009-11-16 15:48:30

Answer 4

A:

From Coding Horror:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

What I mean is: are you sure a regular expression is the best way to solve your problem? Maybe you can test whether the string is a URL with some more lightweight method?

Edit

The following program on my computer, with output redirected to /dev/null, prints (to stderr)

rx time: 1.730000
lw time: 0.920000

Program Listing:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>
#include <string.h>
#include <time.h>

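/* Regex-based check: the pattern is compiled once and reused on every call. */
int goodurl_rx(const char *buf) {
  static regex_t rx;
  static int done = 0;
  int e;

  if (!done) {
    done = 1;
    if ((e = regcomp(&rx, "^www\\.[a-z][a-z0-9]*\\.(com|edu|org)$", REG_EXTENDED)) != 0) {
      /* Report on stderr: stdout is redirected to /dev/null in this test. */
      fprintf(stderr, "Error %d compiling regular expression.\n", e);
      exit(EXIT_FAILURE);
    }
  }
  return !regexec(&rx, buf, 0, NULL, 0);  /* regexec() returns 0 on a match */
}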

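/* Hand-written ("lightweight") check for the same shape: www.<label>.<tld> */
int goodurl_lw(const char *buf) {
  /* Literal "www." prefix, one character at a time. */
  if (*buf++ != 'w') return 0;
  if (*buf++ != 'w') return 0;
  if (*buf++ != 'w') return 0;
  if (*buf++ != '.') return 0;
  /* Label: one letter, then any number of letters or digits.
     Note: isalpha()/isalnum() also accept uppercase, unlike [a-z] in the regex. */
  if (!isalpha((unsigned char)*buf++)) return 0;
  while (isalnum((unsigned char)*buf)) buf++;
  if (*buf++ != '.') return 0;
  /* TLD must be exactly com, edu or org, followed by the terminating NUL. */
  if ((*buf == 'c') && (*(buf+1) == 'o') && (*(buf+2) == 'm') && (*(buf+3) == 0)) return 1;
  if ((*buf == 'e') && (*(buf+1) == 'd') && (*(buf+2) == 'u') && (*(buf+3) == 0)) return 1;
  if ((*buf == 'o') && (*(buf+1) == 'r') && (*(buf+2) == 'g') && (*(buf+3) == 0)) return 1;
  return 0;
}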

int main(void) {
  clock_t t0, t1, t2;
  char *buf[] = {"www.alphanumerics.com", "ww2.alphanumerics.com", "www.alphanumerics.net"};
  int times;

  t0 = clock();
  times = 1000000;
  while (times--) {
    printf("    %s: %s\n", buf[0], goodurl_rx(buf[0])?"pass":"invalid");
    printf("    %s: %s\n", buf[1], goodurl_rx(buf[1])?"pass":"invalid");
    printf("    %s: %s\n", buf[2], goodurl_rx(buf[2])?"pass":"invalid");
  }
  t1 = clock();
  times = 1000000;
  while (times--) {
    printf("    %s: %s\n", buf[0], goodurl_lw(buf[0])?"pass":"invalid");
    printf("    %s: %s\n", buf[1], goodurl_lw(buf[1])?"pass":"invalid");
    printf("    %s: %s\n", buf[2], goodurl_lw(buf[2])?"pass":"invalid");
  }
  t2 = clock();

  fprintf(stderr, "rx time: %f\n", (double)(t1-t0)/CLOCKS_PER_SEC);
  fprintf(stderr, "lw time: %f\n", (double)(t2-t1)/CLOCKS_PER_SEC);
  return 0;
}

pmg 2009-11-14 19:57:20

Hmmm, In this case I do believe a regex is the best answer, yet I am certainly open to valid suggestions. I am inputting a web address from a user in the general form: www.alphanumerics.com. Could you suggest a more lightweight method?

cb 2009-11-16 15:20:34

pmg, almost certainly this function is intended for validating user input, so a tiny fraction of a second really doesn't matter here. But what about development time, readability, supportability, etc.? Slight changes will require you to rewrite your code, ending up with a home-made FSA.

egorius 2009-11-16 20:07:54

For this simple example, I agree that the regular expression function is easier to deal with. Someday, however, when you want to "accept" `co.uk` and `net.au`, invalidate `ads*.*` but not `cads*.*`, ..., ...; neither will be good. When this happens a parser is the best option, but IMVHO, a solution based on regular expressions tends to keep using regular expressions: more of them and more awkward.

pmg 2009-11-16 21:26:45

Answer 5

A:

You probably should be using inet_pton() which is a standard POSIX function (replacing inet_aton()) and handles both IPv4 and IPv6 address formats.

dajobe 2009-11-14 23:05:12

I'm not sure that I fully understand the use of this. Would it be a more dignified way of checking the validity of an IP address versus the regex that I currently employ? (Defined above as IPEXPR) CB

cb 2009-11-16 15:18:16

Yes, especially since IPv6 has abbreviated forms that inet_pton will handle. It's not just 0-9 and .

dajobe 2009-11-16 21:17:08
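
A minimal sketch of what such a check with inet_pton() might look like. The helper name `ip_version` and its return convention are hypothetical, and this only tests whether a string is a valid address literal; it does not resolve host names.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the thread): returns 4 for a valid IPv4
   literal, 6 for a valid IPv6 literal, and 0 for anything else. */
int ip_version(const char *s) {
    struct in_addr v4;    /* receives the parsed IPv4 address */
    struct in6_addr v6;   /* receives the parsed IPv6 address */

    /* inet_pton() returns 1 only when the text is a valid address
       for the given address family. */
    if (inet_pton(AF_INET, s, &v4) == 1)
        return 4;
    if (inet_pton(AF_INET6, s, &v6) == 1)
        return 6;
    return 0;
}
```

Abbreviated IPv6 forms such as `::1` are accepted this way, which would be awkward to cover with a hand-written regular expression.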


URL regex with regex.h in c
