views:

4141

answers:

8

Does beautiful soup work with iron python? If so with which version of iron python? How easy is it to distribute a windows desktop app on .net 2.0 using iron python (mostly c# calling some python code for parsing html)?

A: 

I haven't tested it, but I'd say it'll most likely work with the latest IPy2.

As for distribution, it's very simple. Use the -X:SaveAssemblies option to compile your Python code down to a binary and then ship it with your other DLLs and the IPy dependencies.

Cody Brocious
+5  A: 

If BeautifulSoup doesn't work on IronPython, it's because IronPython doesn't implement the whole Python language (the same way CPython does). BeautifulSoup is pure-python, no C-extensions, so the only problem is the compatibility of IronPython with CPython in terms of Python source code.There shouldn't be one, but if there is, the error will be obvious ("no module named ...", "no method named ...", etc.). Google says that only one of BS's tests fails with IronPython. it probably works, and that test may be fixed by now. I wouldn't know.

Try it out and see, would be my advice, unless anybody has anything more concrete.

Devin Jeanpierre
+8  A: 

I've tested and used BeautifulSoup with both IPy 1.1 and 2.0 (forget which beta, but this was a few months back). Leave a comment if you are still having trouble and I'll dig out my test code and post it.

technomalogical
A: 

If you have the complete standard library and the real re module (google for IronPython community edition) it might work. But IronPython is an incredible bad python implementation, I wouldn't count on that.

Besides, give html5lib a try. That parser parses with the same rules firefox parses documents.

Armin Ronacher
I don't use IronPython, but what I've read so far about it does not certify the "incredibly bad python implementation" [typo fixed].
ΤΖΩΤΖΙΟΥ
I certainly don't consider IronPython to be incredibly bad. It does just fine on loads of stuff. Just don't expect it to be a drop-in replacement for CPython.
Jeff Brown
+1  A: 

Regarding the second part of your question, you can use the DLR Hosting APIs to run IronPython code from within a C# application. The DLR hosting spec is here. This blog also contains some sample hosting applications

srivatsn
+2  A: 

Also, regarding one of the previous comments about compiling with -X:SaveAssemblies - that is wrong. -X:SaveAssemblies is meant as a debugging feature. There is a API meant for compiling python code into binaries. This post explains the API and the difference between the two modes.

srivatsn
+27  A: 

I was asking myself this same question and after struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code I decided to go looking for an alternative native .NET solution. BeautifulSoup is a wonderful bit of code and at first it didn't look like there was anything comparable available for .NET, but then I found the HTML Agility Pack and if anything I think I've actually gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces a elegant XML DOM from it that can be queried via XPath. With a couple lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.

Edit

Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace GovParsingTest
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlWeb hw = new HtmlWeb();
            string url = @"http://www.house.gov/house/House_Calendar.shtml";
            HtmlDocument doc = hw.Load(url);

            HtmlNode docNode = doc.DocumentNode;
            HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");
            HtmlNodeCollection tableRows = div.SelectNodes(".//tr");

            foreach (HtmlNode row in tableRows)
            {
                HtmlNodeCollection cells = row.SelectNodes(".//td");
                HtmlNode dateNode = cells[0];
                HtmlNode eventNode = cells[1];

                while (eventNode.HasChildNodes)
                {
                    eventNode = eventNode.FirstChild;
                }

                Console.WriteLine(dateNode.InnerText);
                Console.WriteLine(eventNode.InnerText);
                Console.WriteLine();
            }

            //Console.WriteLine(div.InnerHtml);
            Console.ReadKey();
        }
    }
}
bouvard
HAP is a good solution, I've got a bunch of apps in production using it. I've used Mozilla Html Parser and there's not much difference.
Scott Cowan
+1  A: 

We are distributing a 40k line IronPython application. We have not been able to compile the whole thing into a single binary distributable. Instead we have been distributing it as a zillion tiny dlls, one for each IronPython module. This works fine though.

However, on the newer release, IronPython 2.0, we have a recent spike which seems to be able to compile everything into a single binary file. This also results in faster application start-up too (module importing is faster.) Hopefully this spike will migrate into our main tree in the next few days.

To do the distribution we are using WiX, which is a Microsoft internal tool for creating msi installs, that has been open-sourced (or made freely available, at least.) It has given us no problems, even though our install has some quite fiddly requirements. I will definitely look at using WiX to distribute other IronPython projects in the future.

Tartley