ansaurus

Question

Generating a URL pattern when provided a set of 5 or so URLs

Answer 1

+3 A:

A naive approach would be to split each URL into parts (say url.split("/")) and compare the resulting arrays position by position. If the parts at a position match in every URL, add them to the pattern as a constant string. If they don't, add a sub-pattern that matches any value at that position. Here is a simple implementation:

// requires: import java.util.regex.Pattern;
public static void main(String[] args) throws Exception {
    String[] urls = {
            "http://www.buy.com/prod/disney-s-star-struck/q/loc/109/213724402.html", 
            "http://www.buy.com/prod/samsung-f2380-23-widescreen-1080p-lcd-monitor-150-000-1-dc-8ms-1920-x/q/loc/101/211249863.html",
            "http://www.buy.com/prod/panasonic-nnh765wf-microwave-oven-countertop-1-6-ft-1250w-panasonic/q/loc/66357/202045865.html",
            "http://www.buy.com/prod/escape-by-calvin-klein-for-women-3-4-oz-edp-spray/q/loc/66740/211210860.html",
            "http://www.buy.com/prod/v-touch-8gb-mp3-mp4-2-8-touch-screen-2mp-camera-expandable-minisd-w/q/loc/111/211402014.html"
    };

    String all = "[^/]+";
    String[] pattern = urls[0].split("/");
    for (int i = 0; i < urls.length; i++) {
        String parts[] = urls[i].split("/");

        // TODO handle urls with different number of parts
        for (int j = 0; j < pattern.length; j++) {
            // intentionally match by reference
            if (pattern[j] != all && !pattern[j].equals(parts[j])) {
                pattern[j] = all;
            }
        }
    }

    // build pattern - use [^/]+ as a replacement (anything but a '/')
    StringBuilder buf = new StringBuilder();
    for (int i = 0; i < pattern.length; i++) {
        buf.append(pattern[i] == all ? all : Pattern.quote(pattern[i]));
        buf.append("/");
    }
    // strip last "/"
    buf.setLength(buf.length() - 1);

    // compile pattern
    Pattern p = Pattern.compile(buf.toString());

    // output
    System.out.println(p.pattern());
    for (int i = 0; i < urls.length; i++) {
        System.out.println(p.matcher(urls[i]).matches());
    }

}

Here's the output of this example:

\Qhttp:\E/\Q\E/\Qwww.buy.com\E/\Qprod\E/[^/]+/\Qq\E/\Qloc\E/[^/]+/[^/]+
true
true
true
true
true

As you can see, the pattern looks a bit odd. That's due to the Pattern quoting. Nevertheless, the pattern matches all URLs from this example. There's some work left though, most notably handling URLs with a different number of parts after the split, and common suffixes (.html).
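
One of the leftovers mentioned above, the common suffix, could be handled per part along these lines. This is only a sketch, still assuming that every URL splits into the same number of parts. The names SuffixAwarePattern, commonSuffix, allEqual and build are made up for illustration and are not part of the code above.

```java
import java.util.regex.Pattern;

class SuffixAwarePattern {

    // Longest string that every value ends with (may be empty).
    static String commonSuffix(String[] values) {
        String suffix = values[0];
        for (String v : values) {
            int i = suffix.length();
            int j = v.length();
            while (i > 0 && j > 0 && suffix.charAt(i - 1) == v.charAt(j - 1)) {
                i--;
                j--;
            }
            suffix = suffix.substring(i);
        }
        return suffix;
    }

    static boolean allEqual(String[] values) {
        for (String v : values) {
            if (!v.equals(values[0])) {
                return false;
            }
        }
        return true;
    }

    // Assumption: every URL splits into the same number of parts.
    static Pattern build(String[] urls) {
        String[][] parts = new String[urls.length][];
        for (int i = 0; i < urls.length; i++) {
            parts[i] = urls[i].split("/");
        }
        StringBuilder buf = new StringBuilder();
        for (int j = 0; j < parts[0].length; j++) {
            // collect the j-th part of every URL
            String[] column = new String[urls.length];
            for (int i = 0; i < urls.length; i++) {
                column[i] = parts[i][j];
            }
            if (j > 0) {
                buf.append("/");
            }
            if (allEqual(column)) {
                buf.append(Pattern.quote(column[0]));
            } else {
                // [^/]* rather than [^/]+ so that a part equal to the suffix still matches
                String suffix = commonSuffix(column);
                buf.append("[^/]*");
                if (!suffix.isEmpty()) {
                    buf.append(Pattern.quote(suffix));
                }
            }
        }
        return Pattern.compile(buf.toString());
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.buy.com/prod/disney-s-star-struck/q/loc/109/213724402.html",
            "http://www.buy.com/prod/escape-by-calvin-klein-for-women-3-4-oz-edp-spray/q/loc/66740/211210860.html"
        };
        Pattern p = build(urls);
        System.out.println(p.pattern());
        for (String url : urls) {
            System.out.println(p.matcher(url).matches());
        }
    }
}
```

With this variant the last part of the pattern keeps the quoted ".html" instead of being a bare wildcard, so a URL ending in ".htm" would no longer match.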

sfussenegger 2010-03-02 10:44:49

Hey, thanks for your answer, but I've done a similar thing. My code works for URLs with the same number of parts (I've gone with the assumption that all products are at the same level in the domain tree). I was just looking for some ideas on how to generate a regex for URLs with different numbers of parts. Thanks

ryan 2010-03-03 05:03:23

@ryan This depends heavily on your requirements. You could even say that such URLs just don't have a common pattern. I think it would be best if you gave some examples (obviously examples that are a bit more complex than what you've been able to do yourself) and an expected pattern. If you want the algorithm to decide what's best suited, it gets *extremely* complex (as in the question @DR linked above).

sfussenegger 2010-03-03 09:37:37

Ok, here's an example.

http://cgi.ebay.com/NEW-SKIP-HOP-Zoo-Packs-Baby-Kids-Animal-Backpack-MOUSE_W0QQitemZ160407569887QQcmdZViewItemQQptZLH_DefaultDomain_0?hash=item25590945df
http://cgi.ebay.com/ebaymotors/Harley-davidson-softail-41mm-lower-sliders_W0QQitemZ180474968580QQcmdZViewItemQQptZMotorcycles_Parts_Accessories?hash=item2a05257a04#ht_500wt_1182

This should give me http://cgi.motors.ebay.com/[^~]*

Or take this case. Here, say, bbb occurs in every URL:

http://aaaa/bbb/ccc.html
http://aaaa/rr/bbb/d.html
http://bbb/e.html

My pattern would be something like http://[^~/]*/bbb/[^~]*.html

ryan 2010-03-05 04:05:00

Actually, neither of the example patterns you gave will match all of the URLs, but I think I know what you meant. Implementing this is a bit more complex than what I've provided so far. I don't have time to provide something right now. But maybe I'll find some time to solve this tomorrow - it looks like a fun challenge for a lazy Saturday afternoon :)

sfussenegger 2010-03-05 08:52:35

Answer 2

+3 A:

You can try txt2re, a nice online tool where you enter an example string and it generates a regexp that matches it for you.

txt2re describes itself as:

headache relief for programmers :: regular expression generator

HeDinges 2010-03-02 11:03:12

+1 nice tool, thanks

sfussenegger 2010-03-02 14:38:51

Answer 3

A:

What is the expected output if, for example, the URLs were the following?

http://www.buy.com/prod/abc.html
http://www.buy.com/prod/xyzabc.html
http://www.buy.com/prod/abcpqr.html
http://www.buy.com/prod/xyzabcpqr.html

Is

http://www.buy.com/prod/*.html

sufficient? Or, is

http://www.buy.com/prod/*abc*.html

required? Some clarification may be helpful.

ArunSaha 2010-03-02 18:00:57

http://www.buy.com/prod/*.html is sufficient

ryan 2010-03-03 05:03:53
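
If a glob like that is enough, the longest common prefix and suffix of the raw URL strings already produce it, regardless of how many parts each URL has. Below is a rough sketch of that idea, not a definitive solution. The names GlobFromUrls, commonPrefix, commonSuffix, prefixAndSuffix, toGlob and toRegex are illustrative and do not come from the answers above. Note that the generated wildcard also matches across "/".

```java
import java.util.regex.Pattern;

class GlobFromUrls {

    // Longest string that every value starts with.
    static String commonPrefix(String[] values) {
        String prefix = values[0];
        for (String v : values) {
            int n = Math.min(prefix.length(), v.length());
            int i = 0;
            while (i < n && prefix.charAt(i) == v.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    // Longest string that every value ends with.
    static String commonSuffix(String[] values) {
        String suffix = values[0];
        for (String v : values) {
            int i = suffix.length();
            int j = v.length();
            while (i > 0 && j > 0 && suffix.charAt(i - 1) == v.charAt(j - 1)) {
                i--;
                j--;
            }
            suffix = suffix.substring(i);
        }
        return suffix;
    }

    // The suffix is computed on what is left after the prefix,
    // so prefix and suffix can never overlap in the shortest URL.
    static String[] prefixAndSuffix(String[] urls) {
        String prefix = commonPrefix(urls);
        String[] rest = new String[urls.length];
        for (int i = 0; i < urls.length; i++) {
            rest[i] = urls[i].substring(prefix.length());
        }
        return new String[] { prefix, commonSuffix(rest) };
    }

    static String toGlob(String[] urls) {
        String[] ps = prefixAndSuffix(urls);
        return ps[0] + "*" + ps[1];
    }

    // Same idea as a regex: quoted prefix, ".*" for the varying middle, quoted suffix.
    static Pattern toRegex(String[] urls) {
        String[] ps = prefixAndSuffix(urls);
        return Pattern.compile(Pattern.quote(ps[0]) + ".*" + Pattern.quote(ps[1]));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.buy.com/prod/abc.html",
            "http://www.buy.com/prod/xyzabc.html",
            "http://www.buy.com/prod/abcpqr.html",
            "http://www.buy.com/prod/xyzabcpqr.html"
        };
        System.out.println(toGlob(urls)); // prints http://www.buy.com/prod/*.html
        Pattern p = toRegex(urls);
        for (String url : urls) {
            System.out.println(p.matcher(url).matches());
        }
    }
}
```

Because only the two ends are compared, anything the URLs share in the middle (such as the "abc" in the example) is deliberately ignored.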
