I am going to develop a web crawler in Java to capture hotel room prices from hotel websites. I want to capture each room price together with its room type and meal type, so my algorithm needs to be intelligent enough to recognize them. For example: Room type: Deluxe, Meal type: HalfBoard, Price: $20.00

The main problem is that room prices can be presented in many different ways on different hotel sites, so my algorithm should be independent of any particular site.

I plan to treat the room types and meal types above as fuzzy sets and compare the words on a web page against those sets using a suitable membership function.
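
A minimal sketch of what such a membership function might look like, assuming normalized Levenshtein (edit-distance) similarity to the closest known term is an acceptable measure. The class name FuzzyTermSet, its methods, and the sample vocabulary are hypothetical and only illustrate the idea; a real project might choose a different membership function entirely.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a fuzzy set of known terms (e.g. room types).
// Membership of a word is its normalized edit-distance similarity to the
// closest term in the set. This is an assumed measure, not a standard one.
final class FuzzyTermSet {
    private final List<String> terms;

    FuzzyTermSet(String... terms) {
        this.terms = Arrays.asList(terms);
    }

    // Returns a degree of membership in [0, 1]; 1.0 means an exact match
    // (ignoring case) with one of the known terms.
    double membership(String word) {
        String w = word.toLowerCase();
        double best = 0.0;
        for (String term : terms) {
            int distance = levenshtein(w, term.toLowerCase());
            int longest = Math.max(w.length(), term.length());
            double similarity = longest == 0 ? 1.0 : 1.0 - (double) distance / longest;
            best = Math.max(best, similarity);
        }
        return best;
    }

    // Classic dynamic-programming edit distance (insert, delete, substitute),
    // keeping only two rows of the table at a time.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                    Math.min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        FuzzyTermSet roomTypes = new FuzzyTermSet("deluxe", "standard", "suite");
        System.out.println(roomTypes.membership("Delux"));
        System.out.println(roomTypes.membership("breakfast"));
    }
}
```

A word would then be accepted as a room type only when its membership exceeds some threshold, which would have to be tuned against real pages.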

Does anyone have experience with this, or an idea for my problem?

+2  A: 

There are two ways to approach this problem:

  1. You can customize your crawler to understand the formats used by different Websites; or

  2. You can come up with a general ("fuzzy") solution.

(1) will, by far, be the easiest. Ideally you want to create some tools that make this easier so you can create a filter for any new site in minimal time. IMHO your time will be best spent with this approach.
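
As a rough illustration of approach (1), the sketch below keeps one regular expression per site, so that supporting a new site means registering one more pattern. The class SiteFilters, the host name hotel-a.example, and the page text are all assumptions made up for this example; real pages would normally be parsed as HTML first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one regular expression per site, keyed by host name.
final class SiteFilters {
    private final Map<String, Pattern> patternsByHost = new LinkedHashMap<>();

    // Each pattern must define the named groups "room", "meal" and "price".
    void register(String host, String regex) {
        patternsByHost.put(host, Pattern.compile(regex));
    }

    // Returns "room | meal | price" for the first match, or null when the
    // host is unknown or the page text does not match its pattern.
    String extract(String host, String pageText) {
        Pattern pattern = patternsByHost.get(host);
        if (pattern == null) {
            return null;
        }
        Matcher matcher = pattern.matcher(pageText);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group("room") + " | "
            + matcher.group("meal") + " | "
            + matcher.group("price");
    }

    public static void main(String[] args) {
        SiteFilters filters = new SiteFilters();
        // Made-up format for a made-up site.
        filters.register("hotel-a.example",
            "Room type:\\s*(?<room>\\w+)\\s+Meal type:\\s*(?<meal>\\w+)"
                + "\\s+price\\s*:\\s*(?<price>\\$\\d+\\.\\d{2})");
        System.out.println(filters.extract("hotel-a.example",
            "Room type: Deluxe Meal type: HalfBoard price : $20.00"));
    }
}
```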

(2) has lots of problems. First, it will be unreliable. You will come across formats you don't understand or (worse) get wrong. Second, it will require a substantial amount of development to get something working. This is the sort of thing you use when you're dealing with thousands or millions of sites.

With hundreds of sites you will get better and more predictable results with (1).

cletus
+1 for sure - (1) is the way to go; (2) is death.
Hey cletus, thanks for your attention. I also agree with your first option, but this is my final-year university project and they asked me for a general solution, as you described in option 2. I am still a final-year undergraduate student and have no experience with this kind of problem. Do you have any experience with option 2? Anyway, thanks again for your attention :)
Kasun Chinthaka
@Kasun: This is why people leave universities thinking they have learned how to write software and are shattered to find out that computer science and software development are all but unrelated. This is going to be a massive undertaking. In the real world, you are far better off demonstrating that harnessing human intelligence is a far lower-cost endeavor than creating machine intelligence. There's no reason why that shouldn't be true in the academic world as well.
A: 

As with all problems, design can let you deliver value and adapt to situations you haven't considered much more quickly than the general solution.

Start by writing something that parses the data from one provider - the one with the simplest format to handle. Find a way to adapt that handler into your crawler. Be sure to encapsulate construction - you should always do this anyway...

public class RoomTypeExtractor
{
  private RoomTypeExtractor() { }

  public static RoomTypeExtractor GetInstance()
  {
    return new RoomTypeExtractor();
  }

  public String GetRoomType(String content)
  {
    // BEHAVIOR #1
    throw new UnsupportedOperationException("BEHAVIOR #1 goes here");
  }
}

The GetInstance() method lets you promote to a Strategy pattern practically for free.

Then add your second provider type. Say, for instance, that you have a slightly more complex data format which is a little more prevalent than the first format. Start by refactoring what was your concrete room type extractor class into an abstraction with a single variation behind it and have the GetInstance() method return an instance of the concrete type:

public abstract class RoomTypeExtractor
{
  public static RoomTypeExtractor GetInstance()
  {
    return SimpleRoomTypeExtractor.GetInstance();
  }

  public abstract String GetRoomType(String content);
}

public final class SimpleRoomTypeExtractor extends RoomTypeExtractor
{
  private SimpleRoomTypeExtractor() { }

  public static SimpleRoomTypeExtractor GetInstance()
  {
    return new SimpleRoomTypeExtractor();
  }

  @Override
  public String GetRoomType(String content)
  {
    // BEHAVIOR #1
    throw new UnsupportedOperationException("BEHAVIOR #1 goes here");
  }
}

Create another variation that implements the Null Object pattern...

public class NullRoomTypeExtractor extends RoomTypeExtractor
{
  private NullRoomTypeExtractor() { }

  public static NullRoomTypeExtractor GetInstance()
  {
    return new NullRoomTypeExtractor();
  }

  @Override
  public String GetRoomType(String content)
  {
    // whatever "no content" behavior you want... I chose returning null
    return null;
  }
}

Add a base class that will make it easier to work with the Chain of Responsibility pattern that is in this problem:

public abstract class ChainLinkRoomTypeExtractor extends RoomTypeExtractor
{
  private final RoomTypeExtractor next_;

  protected ChainLinkRoomTypeExtractor(RoomTypeExtractor next)
  {
    next_ = next;
  }

  @Override
  public final String GetRoomType(String content)
  {
    // Handle the content here if this link understands it; otherwise pass it on
    if (CanHandleContent(content))
    {
      return GetRoomTypeFromUnderstoodFormat(content);
    }
    else
    {
      return next_.GetRoomType(content);
    }
  }

  protected abstract boolean CanHandleContent(String content);
  protected abstract String GetRoomTypeFromUnderstoodFormat(String content);
}

Now, refactor the original implementation to have a base class that joins it into a Chain of Responsibility...

public final class SimpleRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
  private SimpleRoomTypeExtractor(RoomTypeExtractor next)
  {
    super(next);
  }

  public static SimpleRoomTypeExtractor GetInstance(RoomTypeExtractor next)
  {
    return new SimpleRoomTypeExtractor(next);
  }

  @Override
  protected boolean CanHandleContent(String content)
  {
    // return whether or not content contains the right format
    throw new UnsupportedOperationException("format check goes here");
  }

  @Override
  protected String GetRoomTypeFromUnderstoodFormat(String content)
  {
    // BEHAVIOR #1
    throw new UnsupportedOperationException("BEHAVIOR #1 goes here");
  }
}

Be sure to update RoomTypeExtractor.GetInstance():

  public static RoomTypeExtractor GetInstance()
  {
    RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();

    extractor = SimpleRoomTypeExtractor.GetInstance(extractor);

    return extractor;
  }

Once that's done, create a new link for the Chain of Responsibility...

public final class MoreComplexRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
  private MoreComplexRoomTypeExtractor(RoomTypeExtractor next)
  {
    super(next);
  }

  public static MoreComplexRoomTypeExtractor GetInstance(RoomTypeExtractor next)
  {
    return new MoreComplexRoomTypeExtractor(next);
  }

  @Override
  protected boolean CanHandleContent(String content)
  {
    // Check for presence of format #2
    throw new UnsupportedOperationException("format #2 check goes here");
  }

  @Override
  protected String GetRoomTypeFromUnderstoodFormat(String content)
  {
    // BEHAVIOR #2
    throw new UnsupportedOperationException("BEHAVIOR #2 goes here");
  }
}

Finally, add the new link to the chain. If this is a more common format, you might want to give it higher priority by putting it higher in the chain (the real forces that govern the order of the chain will become apparent when you do this):

  public static RoomTypeExtractor GetInstance()
  {
    RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();

    extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
    extractor = MoreComplexRoomTypeExtractor.GetInstance(extractor);

    return extractor;
  }
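
To see the whole chain run end to end, here is a compact, self-contained sketch of the same structure with the placeholder behaviors filled in. The two page formats (a "Room type: X" line and a td cell with class "room") are invented purely for illustration, the regular expressions are assumptions, and the static factory methods on the links are replaced by package-private constructors to keep the listing short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo entry point; both page formats below are made up for illustration.
final class ChainDemo {
    public static void main(String[] args) {
        RoomTypeExtractor extractor = RoomTypeExtractor.GetInstance();
        System.out.println(extractor.GetRoomType("Room type: Deluxe Meal type: HalfBoard")); // Deluxe
        System.out.println(extractor.GetRoomType("<td class=\"room\">Superior</td>"));       // Superior
        System.out.println(extractor.GetRoomType("no room information here"));               // null
    }
}

abstract class RoomTypeExtractor {
    static RoomTypeExtractor GetInstance() {
        RoomTypeExtractor extractor = new NullRoomTypeExtractor();
        extractor = new SimpleRoomTypeExtractor(extractor);
        extractor = new MoreComplexRoomTypeExtractor(extractor); // consulted first
        return extractor;
    }

    abstract String GetRoomType(String content);
}

final class NullRoomTypeExtractor extends RoomTypeExtractor {
    @Override
    String GetRoomType(String content) {
        return null; // end of the chain: no link understood the content
    }
}

abstract class ChainLinkRoomTypeExtractor extends RoomTypeExtractor {
    private final RoomTypeExtractor next;

    ChainLinkRoomTypeExtractor(RoomTypeExtractor next) {
        this.next = next;
    }

    @Override
    final String GetRoomType(String content) {
        return CanHandleContent(content)
            ? GetRoomTypeFromUnderstoodFormat(content)
            : next.GetRoomType(content);
    }

    abstract boolean CanHandleContent(String content);
    abstract String GetRoomTypeFromUnderstoodFormat(String content);
}

final class SimpleRoomTypeExtractor extends ChainLinkRoomTypeExtractor {
    // Assumed format #1: plain text such as "Room type: Deluxe".
    private static final Pattern FORMAT = Pattern.compile("Room type:\\s*(\\w+)");

    SimpleRoomTypeExtractor(RoomTypeExtractor next) { super(next); }

    @Override
    boolean CanHandleContent(String content) {
        return FORMAT.matcher(content).find();
    }

    @Override
    String GetRoomTypeFromUnderstoodFormat(String content) {
        Matcher matcher = FORMAT.matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }
}

final class MoreComplexRoomTypeExtractor extends ChainLinkRoomTypeExtractor {
    // Assumed format #2: a table cell such as <td class="room">Superior</td>.
    private static final Pattern FORMAT =
        Pattern.compile("<td class=\"room\">\\s*([^<]+?)\\s*</td>");

    MoreComplexRoomTypeExtractor(RoomTypeExtractor next) { super(next); }

    @Override
    boolean CanHandleContent(String content) {
        return FORMAT.matcher(content).find();
    }

    @Override
    String GetRoomTypeFromUnderstoodFormat(String content) {
        Matcher matcher = FORMAT.matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Because each link wraps the previous one, the link added last is asked first, and content nobody recognizes falls through to the Null Object.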

As time passes, you may want to add ways to dynamically add new links to the Chain of Responsibility, as pointed out by Cletus, but the fundamental principle here is Emergent Design. Start with high quality. Keep quality high. Drive with tests. Do those three things and you will be able to use the fuzzy logic engine between your ears to overcome almost any problem...
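
One possible way to add links dynamically, sketched under the assumption that each link can be described as a function from "the next extractor" to "a new extractor wrapping it". The names Extractor, DynamicChain, and prefixLink are hypothetical, and the prefix-based formats are deliberately trivial stand-ins for real format checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: a chain whose links are registered at run time.
interface Extractor {
    String extract(String content); // null means "not understood"
}

final class DynamicChain {
    private final List<Function<Extractor, Extractor>> linkFactories = new ArrayList<>();

    // Later registrations wrap earlier ones, so they are consulted first.
    DynamicChain add(Function<Extractor, Extractor> linkFactory) {
        linkFactories.add(linkFactory);
        return this;
    }

    Extractor build() {
        Extractor chain = content -> null; // Null Object at the end of the chain
        for (Function<Extractor, Extractor> factory : linkFactories) {
            chain = factory.apply(chain);
        }
        return chain;
    }

    // Makes a link that understands content starting with the given prefix
    // (an assumed, deliberately trivial format) and delegates otherwise.
    static Function<Extractor, Extractor> prefixLink(String prefix) {
        return next -> content -> content.startsWith(prefix)
            ? content.substring(prefix.length()).trim()
            : next.extract(content);
    }

    public static void main(String[] args) {
        Extractor chain = new DynamicChain()
            .add(prefixLink("Room type:"))
            .add(prefixLink("Room:"))
            .build();
        System.out.println(chain.extract("Room: Suite"));
        System.out.println(chain.extract("Room type: Deluxe"));
        System.out.println(chain.extract("nothing useful"));
    }
}
```

The list of link factories could just as well be populated from configuration, which is what makes the chain extensible without touching GetInstance().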

EDIT

Translated to Java. Hope I did that right; I'm a little rusty.

Hi MaxGuernseyIII, I'm really happy with your code and your effort. It will be really useful for my development, but for now I am hoping for help with recognizing room types and meal types along with the price (using an intelligent algorithm). Anyway, thanks a lot, Guernsey.
Kasun Chinthaka
@Kasun: The code is meant to demonstrate a process, not to be used directly. At this time, engaging in that kind of process - the process of letting a design emerge - is a far more fruitful activity than attempting to write an intelligent algorithm.