I'm using this library: http://benreeves.co.uk/objective-c-hmtl-parser/ to parse HTML for a little iPhone app I'm making. The code works so far, but it fails when the page contains an accented character (so far I have only seen this with é). This is the code I'm using:

NSError * error = nil;
HTMLParser * parser = [[HTMLParser alloc] initWithContentsOfURL:[NSURL URLWithString:@"http://intranet.westminster.org.uk/almanack/food.asp?nextweek=TRUE"] error:&error];

if (error) {
    NSLog(@"Error: %@", error);
    return nil;
}
HTMLNode * bodyNode = [parser body]; //Find the body tag
NSArray *individualMeals = [bodyNode findChildTags:@"font"];
for (HTMLNode *node in individualMeals) {
    if ([[node getAttributeNamed:@"color"] isEqual:@"green"]) {
        NSLog(@"%@",[node rawContents]);
    }
}

But it doesn't parse all of the text. It seems to give up as soon as it finds an accented character in the page at that URL. This is the output it produces when run:

2010-10-07 18:40:59.296 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.298 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.305 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.307 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.308 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon&#13;<br/>Hash Brown&#13;<br/>Baked Beans&#13;<br/>Breakfast special&#13;<br/>Three cheese omelets&#13;<br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/>Croissants &#13;<br/><br/> Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.309 Westminster[1011:207] <font color="green">Mulligatawny &#13;<br/>Black Olive &#13;<br/>RICE&#13;<br/>Roasted med veg in paella rice&#13;<br/>Hot and sticky wings on yellow rice&#13;<br/>Hoi Sin Pork Belly Steaks&#13;<br/>Vegetable Biriyani with a Mild Curry Sauce&#13;<br/>Babycorn Bamboo Shoots and Water Chestnuts &#13;<br/>Stir fried noodles with seaweed &#13;<br/>Lemon Sponge with Orange Sauce&#13;<br/>Vanilla Granola</font>
2010-10-07 18:40:59.310 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.312 Westminster[1011:207] <font color="green">Pea &amp; Ham &#13;<br/><br/>Black Olive &#13;<br/>Roast Chicken with Bread Sauce and Roast Jus&#13;<br/>Warm Salad of Salmon and Crispy Bacon&#13;<br/><br/><br/>Vegetarian Chilli&#13;<br/>With Sour Cream and Braised Rice&#13;<br/>Green Beans&#13;<br/><br/>Bubble &amp; Squeak&#13;<br/><br/>Tiramisu&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.313 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon&#13;<br/>Grilled Tomato&#13;<br/>Grilled mushrooms&#13;<br/>Fried Egg&#13;<br/><br/><br/><br/>Plain Porridge&#13;<br/><br/><br/><br/>Bread &#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.317 Westminster[1011:207] <font color="green">Root Vegetable&#13;<br/>Red Pesto &#13;<br/>WRAP&#13;<br/>Chimichanga&#13;&#13;<br/>Mexican fish tortillas&#13;&#13;<br/>Roast Leg of Lamb &#13;<br/>Gnocchi with Roasted Vegetables and Flaked Parmesan &#13;<br/>Broccoli &#13;<br/><br/><br/>Thyme Roasted Potatoes  &#13;<br/> Sticky Toffee Pudding and Toffee Sauce&#13;<br/>Banana Bread</font>
2010-10-07 18:40:59.318 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.318 Westminster[1011:207] <font color="green">Tomato with Basil Oil&#13;&#13;<br/>Red Pesto &#13;<br/>Beef Olives&#13;<br/><br/>Lamb with Ginger, Spring onion and Noodles&#13;<br/><br/><br/>Field Mushroom Pies&#13;&#13;<br/>Ratatouille &#13;<br/><br/>Creamed Potatoes&#13;<br/><br/>Lemon Tart&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.319 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon&#13;<br/>Baked Beans&#13;<br/>Grilled Tomato&#13;<br/>Breakfast special&#13;<br/>Avocado on toast&#13;<br/><br/>Plain Porridge &#13;<br/><br/><br/>Bread and banana bread&#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.333 Westminster[1011:207] <font color="green">(GREEK)&#13;<br/><br/>FLAT BREADS&#13;<br/>SPINACH, ROCKET AND FETA AND TOASTED SOUR DOUGHS &#13;<br/>SEAFOOD STUFFED PEPPERS&#13;<br/>STIFADO (beef)&#13;<br/><br/>LAMB FRICASSEE&#13;<br/>zucchini pie from Macedonia&#13;&#13;<br/>RICE&#13;<br/><br/>GIGANTIS PLAKI&#13;<br/><br/>ORANGE AND LEMON CAKE TOPPED WITH GREEK YOGURT AND HONEY</font>
2010-10-07 18:40:59.333 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.334 Westminster[1011:207] <font color="green">Roasted Vegetable&#13;<br/>FLAT BREADS&#13;<br/>Pork Steak Served with a Tomato, Tarragon and Mushroom sauce&#13;<br/>Roast beef and homemade horseradish sauce&#13;<br/><br/><br/>Lancashire Cheese Sausages with Onion Gravy&#13;&#13;<br/>Courgettes&#13;<br/><br/>Roast Potatoes&#13;<br/><br/>Mississippi Mud Pie&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.343 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon &#13;<br/>Hash Brown&#13;<br/>Grilled mushrooms&#13;<br/>Fried Egg&#13;<br/><br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/>Bread &#13;<br/><br/> Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.344 Westminster[1011:207] <font color="green">Leek, Blue Cheese and Potato &#13;&#13;<br/>Sunflower Seed&#13;<br/>COUS COUS&#13;<br/>Couscous with apricots, lemon and coriander &#13;<br/><br/>Couscous fried chicken with couscous and spiced tomato sauce &#13;&#13;<br/>Butchers Sausages &#13;<br/>Balsamic Roasted Vegetable Frittata&#13;<br/>Red Cabbage&#13;<br/><br/><br/>Mashed Potatoes&#13;<br/><br/>Jam Roly Poly &#13;<br/>Bakewell Slice</font>
2010-10-07 18:40:59.344 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.345 Westminster[1011:207] <font color="green">Curried Parsnip and Apple &#13;&#13;<br/>Sunflower Seed&#13;<br/>Spiced Sticky chicken pieces&#13;<br/>Mexican Beef Chilli Wraps with Natural Yogurt and Guacamole&#13;<br/><br/><br/>Roasted Teriyaki Tofu Steaks with Glazed Green Vegetables&#13;&#13;<br/>Spiced Aubergine&#13;<br/><br/>Rice and Peas&#13;<br/><br/>Mango Mousse&#13;<br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.351 Westminster[1011:207] <font color="green">Sausage&#13;<br/>Bacon&#13;<br/>Baked Beans&#13;<br/>Grilled Tomato&#13;<br/>Breakfast special&#13;<br/>Muffin bar&#13;<br/><br/>Plain Porridge&#13;<br/><br/><br/><br/>Croissants &#13;<br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.352 Westminster[1011:207] <font color="green">Carrot and Chilli &#13;&#13;<br/>Rosemary &#13;<br/>NOODLES&#13;<br/><br/>Crispy tofu&#13;<br/>Lemon chicken&#13;<br/>Fish with Traditional Crispy Batter&#13;<br/>Japanese Vegetable Curry with Rice Noodles and Tofu&#13;&#13;<br/>Garden peas&#13;<br/><br/><br/>Chips &#13;<br/>Viennese Jam Tart and Custard &#13;<br/>Fresh Fruit Salad</font>
2010-10-07 18:40:59.361 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.361 Westminster[1011:207] <font color="green">Three onion, spring, red and white&#13;<br/>Rosemary &#13;<br/>Pepperoni Pizza Topped with Boccaccio&#13;<br/>Bolognaise pasta bake&#13;<br/><br/>Vegetarian Plait&#13;&#13;<br/>Green Cabbage&#13;<br/><br/>Oven Baked Cajun Wedges&#13;<br/><br/>Ice&#13;<br/>Cream Sundae&#13;<br/><br/>3 Cheeses &amp; Biscuits</font>
2010-10-07 18:40:59.362 Westminster[1011:207] <font color="green">Sausage &#13;<br/>Bacon &#13;<br/>Hash Brown&#13;<br/>Grilled Mushrooms&#13;<br/>Poached Eggs&#13;<br/><br/><br/><br/>Plain Porridge &#13;<br/><br/><br/><br/><br/><br/>Natural Yogurt&#13;<br/>Dried Fruits &#13;<br/>Granola&#13;<br/>Honey</font>
2010-10-07 18:40:59.362 Westminster[1011:207] (null)
2010-10-07 18:40:59.363 Westminster[1011:207] <font color="green"/>
2010-10-07 18:40:59.363 Westminster[1011:207] <font color="green"/>

It gives up at the section with sautéed potatoes, and doesn't return any results from that or any of the later sections.

I think this might be because the website has not encoded the é characters. When I view the source I see a literal é rather than the entity &eacute; suggested by this page: http://www.w3.org/MarkUp/html3/latin1.html

Thanks for your time. If you know a better way to obtain what's for lunch from that website, I would love to hear it too.

A: 

I had a similar problem. The problem is that libxml2, as used by this parser, can only parse UTF-8 encoded documents. So you need to convert the HTML page to UTF-8 first.
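
A minimal sketch of that conversion, assuming the page is really served as ISO Latin 1 and that the project is built with ARC. UTF8DataFromLatin1Data is a hypothetical helper name, not part of the parser library; it uses Foundation calls only.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: re-encode bytes assumed to be ISO Latin 1 as UTF-8.
// Returns nil if the input is nil or cannot be decoded.
static NSData *UTF8DataFromLatin1Data(NSData *latin1Data) {
    if (latin1Data == nil) {
        return nil;
    }
    // Decode the raw bytes using the encoding the server is assumed to use.
    NSString *decoded = [[NSString alloc] initWithData:latin1Data
                                              encoding:NSISOLatin1StringEncoding];
    if (decoded == nil) {
        return nil;
    }
    // Re-encode the same text as UTF-8 for the parser.
    return [decoded dataUsingEncoding:NSUTF8StringEncoding];
}
```

The resulting data (or the decoded string) would then be handed to whichever data-based or string-based initializer the parser offers. The question only shows initWithContentsOfURL:error:, so treat any other initializer as an assumption to check against the library's header.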

Max
Ah. I have got it working now. I read the data as ISO Latin 1, then replaced all instances of é with the HTML entity, then sent the result to the parser.
Tom H
Pure ASCII text will work. In my application I cannot do that, because the text contains Asian characters.
Max
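
The workaround from the comments (decode as ISO Latin 1, then replace accented characters with HTML character references so that only ASCII reaches the parser) could be sketched roughly as follows. ASCIIOnlyHTMLFromString is a hypothetical helper name. The sketch assumes the string was decoded from ISO Latin 1, so every character fits in a single UTF-16 unit; it is not suitable as-is for text containing characters outside the Basic Multilingual Plane.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replace every non-ASCII character with a numeric
// character reference such as &#233; so the output is pure ASCII.
// Assumes each character is a single UTF-16 unit (true for Latin 1 text).
static NSString *ASCIIOnlyHTMLFromString(NSString *html) {
    NSMutableString *result = [NSMutableString stringWithCapacity:html.length];
    for (NSUInteger i = 0; i < html.length; i++) {
        unichar c = [html characterAtIndex:i];
        if (c < 0x80) {
            // Plain ASCII is copied through unchanged.
            [result appendFormat:@"%C", c];
        } else {
            // Everything else becomes a decimal character reference.
            [result appendFormat:@"&#%u;", (unsigned int)c];
        }
    }
    return result;
}
```

This replaces every non-ASCII character rather than just é, so other accented letters on the page are covered by the same pass.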