Given a set of URLs, I need to generate a pattern that matches all of them.

For example:

http://www.buy.com/prod/disney-s-star-struck/q/loc/109/213724402.html
http://www.buy.com/prod/samsung-f2380-23-widescreen-1080p-lcd-monitor-150-000-1-dc-8ms-1920-x/q/loc/101/211249863.html
http://www.buy.com/prod/panasonic-nnh765wf-microwave-oven-countertop-1-6-ft-1250w-panasonic/q/loc/66357/202045865.html
http://www.buy.com/prod/escape-by-calvin-klein-for-women-3-4-oz-edp-spray/q/loc/66740/211210860.html
http://www.buy.com/prod/v-touch-8gb-mp3-mp4-2-8-touch-screen-2mp-camera-expandable-minisd-w/q/loc/111/211402014.html

The expected pattern is:

http://www.buy.com/prod/[^~]/q/loc/[^~].html

+3  A: 

A naive approach would be to split each URL into parts (say, url.split("/")) and compare the resulting arrays position by position. If the parts at a position match across all URLs, add them to the pattern as a constant string. If they don't, add a sub-pattern that matches any possible value. Here is a simple implementation:

import java.util.regex.Pattern;

public class UrlPattern {

    public static void main(String[] args) throws Exception {
        String[] urls = {
                "http://www.buy.com/prod/disney-s-star-struck/q/loc/109/213724402.html",
                "http://www.buy.com/prod/samsung-f2380-23-widescreen-1080p-lcd-monitor-150-000-1-dc-8ms-1920-x/q/loc/101/211249863.html",
                "http://www.buy.com/prod/panasonic-nnh765wf-microwave-oven-countertop-1-6-ft-1250w-panasonic/q/loc/66357/202045865.html",
                "http://www.buy.com/prod/escape-by-calvin-klein-for-women-3-4-oz-edp-spray/q/loc/66740/211210860.html",
                "http://www.buy.com/prod/v-touch-8gb-mp3-mp4-2-8-touch-screen-2mp-camera-expandable-minisd-w/q/loc/111/211402014.html"
        };

        // marker for a part that differs between URLs
        String all = "[^/]+";
        // start with the parts of the first URL and generalize from there
        String[] pattern = urls[0].split("/");
        for (int i = 0; i < urls.length; i++) {
            String[] parts = urls[i].split("/");

            // TODO handle urls with different number of parts
            for (int j = 0; j < pattern.length; j++) {
                // intentionally match by reference: only the marker object itself
                // counts as "already generalized", not a part that happens to equal it
                if (pattern[j] != all && !pattern[j].equals(parts[j])) {
                    pattern[j] = all;
                }
            }
        }

        // build pattern - use [^/]+ as a replacement (anything but a '/')
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < pattern.length; i++) {
            buf.append(pattern[i] == all ? all : Pattern.quote(pattern[i]));
            buf.append("/");
        }
        // strip last "/"
        buf.setLength(buf.length() - 1);

        // compile pattern
        Pattern p = Pattern.compile(buf.toString());

        // output: the pattern, then one match result per URL
        System.out.println(p.pattern());
        for (int i = 0; i < urls.length; i++) {
            System.out.println(p.matcher(urls[i]).matches());
        }
    }
}

Here's the output of this example:

\Qhttp:\E/\Q\E/\Qwww.buy.com\E/\Qprod\E/[^/]+/\Qq\E/\Qloc\E/[^/]+/[^/]+
true
true
true
true
true

As you can see, the pattern looks a bit weird. That's due to the quoting done by Pattern.quote, which wraps each constant part in \Q...\E. Nevertheless, the pattern matches all URLs from this example. There's some work left though, most notably handling URLs with a different number of parts after the split, and common suffixes (.html).
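One possible way to approach URLs with a different number of parts, as a rough sketch rather than a finished solution: keep only the leading parts that every URL shares and replace everything after them with `.*`. The class name `HeadPattern`, the method `commonHeadPattern`, and the second URL in `main` are made up for illustration; they are not part of the code above.

```java
import java.util.regex.Pattern;

class HeadPattern {

    // Hypothetical helper: quotes the leading parts shared by all URLs
    // and matches the remainder of each URL with ".*".
    static Pattern commonHeadPattern(String[] urls) {
        String[] head = urls[0].split("/", -1);
        int shared = head.length;
        for (String url : urls) {
            String[] parts = url.split("/", -1);
            // always leave at least the last part of every URL to ".*"
            shared = Math.min(shared, parts.length - 1);
            for (int j = 0; j < shared; j++) {
                if (!head[j].equals(parts[j])) {
                    shared = j; // first position where the URLs disagree
                    break;
                }
            }
        }

        StringBuilder buf = new StringBuilder();
        for (int j = 0; j < shared; j++) {
            buf.append(Pattern.quote(head[j])).append("/");
        }
        buf.append(".*");
        return Pattern.compile(buf.toString());
    }

    public static void main(String[] args) {
        String[] urls = {
                "http://www.buy.com/prod/disney-s-star-struck/q/loc/109/213724402.html",
                // assumed example of a shorter URL, not taken from the question
                "http://www.buy.com/prod/some-item.html"
        };
        Pattern p = commonHeadPattern(urls);
        System.out.println(p.pattern());
        for (String url : urls) {
            System.out.println(p.matcher(url).matches());
        }
    }
}
```

This deliberately gives up everything after the shared head, so it is much less specific than the fixed-length version. Whether that trade-off is acceptable depends on what the pattern is used for.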

sfussenegger
Hey, thanks for your answer, but I've already done something similar. My code works for URLs with the same number of parts (I've assumed that all products are at the same level in the domain tree). I was just looking for ideas on how to generate a regex for URLs with different numbers of parts. Thanks.
ryan
@ryan This depends heavily on your requirements. You could even say that such URLs just don't have a common pattern. I think it would be best if you gave some examples (obviously ones that are a bit more complex than what you've been able to handle yourself) along with the expected pattern. If you want the algorithm to decide what's best suited, it gets *extremely* complex (as in the question @DR linked above).
sfussenegger
Ok, here's an example.

http://cgi.ebay.com/NEW-SKIP-HOP-Zoo-Packs-Baby-Kids-Animal-Backpack-MOUSE_W0QQitemZ160407569887QQcmdZViewItemQQptZLH_DefaultDomain_0?hash=item25590945df
http://cgi.ebay.com/ebaymotors/Harley-davidson-softail-41mm-lower-sliders_W0QQitemZ180474968580QQcmdZViewItemQQptZMotorcycles_Parts_Accessories?hash=item2a05257a04#ht_500wt_1182

This should give me http://cgi.motors.ebay.com/[^~]*

Or take this case. Here, say, bbb occurs in every URL:

http://aaaa/bbb/ccc.html
http://aaaa/rr/bbb/d.html
http://bbb/e.html

My pattern would be something like http://[^~/]*/bbb/[^~]*.html
ryan
Actually, neither of the example patterns you gave will match all of its URLs, but I think I know what you meant. Implementing this is a bit more complex than what I've provided so far. I don't have time to write something right now, but maybe I'll find some time to solve this tomorrow. It looks like a fun challenge for a lazy Saturday afternoon :)
sfussenegger
+3  A: 

You can try txt2re, a nice online tool where you enter an example string and it generates a regexp that matches it for you.

txt2re describes itself as:

headache relief for programmers :: regular expression generator

HeDinges
+1 nice tool, thanks
sfussenegger
A: 

What is the expected output if, for example, the input URLs were the following?

http://www.buy.com/prod/abc.html
http://www.buy.com/prod/xyzabc.html
http://www.buy.com/prod/abcpqr.html
http://www.buy.com/prod/xyzabcpqr.html

Is

http://www.buy.com/prod/*.html

sufficient? Or, is

http://www.buy.com/prod/*abc*.html

required? Some clarification may be helpful.
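If the simpler form is enough, one way to get it is to take the longest common prefix and the longest common suffix of all URLs, character by character, and allow anything in between. The following is a minimal sketch under that assumption; the class `PrefixSuffixPattern` and its methods are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;

class PrefixSuffixPattern {

    // longest string that both a and b start with
    static String commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return a.substring(0, i);
    }

    // longest string that both a and b end with
    static String commonSuffix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) {
            i++;
        }
        return a.substring(a.length() - i);
    }

    static Pattern prefixSuffixPattern(String[] urls) {
        String prefix = urls[0];
        String suffix = urls[0];
        int minLen = urls[0].length();
        for (String url : urls) {
            prefix = commonPrefix(prefix, url);
            suffix = commonSuffix(suffix, url);
            minLen = Math.min(minLen, url.length());
        }
        // prefix and suffix must not overlap in the shortest URL
        int maxSuffix = minLen - prefix.length();
        if (suffix.length() > maxSuffix) {
            suffix = suffix.substring(suffix.length() - maxSuffix);
        }
        return Pattern.compile(Pattern.quote(prefix) + ".*" + Pattern.quote(suffix));
    }

    public static void main(String[] args) {
        String[] urls = {
                "http://www.buy.com/prod/abc.html",
                "http://www.buy.com/prod/xyzabc.html",
                "http://www.buy.com/prod/abcpqr.html",
                "http://www.buy.com/prod/xyzabcpqr.html"
        };
        Pattern p = prefixSuffixPattern(urls);
        System.out.println(p.pattern());
        for (String url : urls) {
            System.out.println(p.matcher(url).matches());
        }
    }
}
```

For the four URLs above this yields the prefix http://www.buy.com/prod/ and the suffix .html, which is the regex equivalent of http://www.buy.com/prod/*.html. Note that `.*` also matches "/", so the result accepts deeper paths as well; replacing it with `[^/]*` would restrict the wildcard to a single path part. Detecting the shared abc in the middle would need a different technique, such as a longest common substring search.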

ArunSaha
http://www.buy.com/prod/*.html is sufficient
ryan