Hi,

How can I handle an HTML source in an SSIS package? After some Google searching I found an answer here: HTML Table as data source

I tried to follow it, but did not get the desired output.

Could anyone help me with this?

Thanks in advance

+3  A: 

The question is very vague, but this may get you started.

HTML is a rich language, so you're almost certainly going to have to build a custom parser. You can use a script component configured as a data source that calls the HTML Agility Pack to parse the page into an appropriate format for further transforms / destinations. The Agility Pack supports HTML -> XML conversion, XPath, XSLT etc., so you shouldn't have to write too much custom code.
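As a rough illustration of the parsing step, here is a minimal sketch that uses the HTML Agility Pack to turn the first table on a page into rows of cell text. The class and method names (`HtmlTableReader`, `ReadFirstTable`) are made up for this example, and it assumes the HtmlAgilityPack assembly is referenced by the script project. Nested tables would need a stricter XPath.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library; assumed to be referenced by the script project

static class HtmlTableReader
{
    // Hypothetical helper: returns each <tr> of the first <table> as a list of cell texts.
    public static List<List<string>> ReadFirstTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("(//table)[1]//tr");
        if (trNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trNodes)
        {
            // Header and data cells are both treated as plain cells here.
            HtmlNodeCollection cellNodes = tr.SelectNodes("./th|./td");
            if (cellNodes == null)
            {
                continue;
            }

            var cells = new List<string>();
            foreach (HtmlNode cell in cellNodes)
            {
                // Decode entities such as &amp; and trim surrounding whitespace.
                cells.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            }
            rows.Add(cells);
        }
        return rows;
    }
}
```

Inside a script component acting as a source, each returned row could then be copied into the output buffer, one column per cell, before the downstream transforms run.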

Steve Homer
Interesting - I learned something today!
rfonn
+1  A: 

Steve Homer's HTML Agility Pack answer is definitely worth investigating. I've never tried it myself, but the CodePlex description seems encouraging. Having said that, here's what I've done in the past to scrape an intranet web page and return a status code, using a C# script task:

    // Requires: using System.Text.RegularExpressions;
    static bool GetTextBetweenTextBlocks(string input_expression, string left_text, string right_text, out string matched_text)
    {
        // Build the pattern. Escape the delimiters so that characters such as
        // '?' and '.' in them are matched literally rather than as regex syntax.
        string regex_find = Regex.Escape(left_text) + "(?'text'.*?)" + Regex.Escape(right_text);

        // Find the first match.
        Match string_output = Regex.Match(input_expression, regex_find);

        // Return the captured text, or an empty string if nothing matched.
        if (string_output.Success)
        {
            matched_text = string_output.Groups["text"].Value;
            return true;
        }

        matched_text = "";
        return false;
    }

This function returns the first occurrence of the text that appears between two other strings of text. You could replace it with a more useful function for your specific need.
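If you need every occurrence rather than just the first, a variant along these lines might serve as a starting point. It is only a sketch: the class and method names (`HtmlText`, `GetAllTextBetweenTextBlocks`) are invented for this example, and the choices to escape the delimiters with `Regex.Escape` and to let `.` span newlines with `RegexOptions.Singleline` are assumptions about what you want.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HtmlText
{
    // Hypothetical helper: returns every piece of text found between
    // left_text and right_text, in the order it appears in the input.
    public static List<string> GetAllTextBetweenTextBlocks(string input_expression, string left_text, string right_text)
    {
        // Escape the delimiters so characters such as '?' or '.' are matched literally.
        string pattern = Regex.Escape(left_text) + "(?<text>.*?)" + Regex.Escape(right_text);

        var results = new List<string>();

        // Singleline lets '.' match newlines, so the input does not need
        // its line breaks stripped first.
        foreach (Match m in Regex.Matches(input_expression, pattern, RegexOptions.Singleline))
        {
            results.Add(m.Groups["text"].Value);
        }
        return results;
    }
}
```

For example, `HtmlText.GetAllTextBetweenTextBlocks("<td>A</td><td>B</td>", "<td>", "</td>")` returns a list containing `A` and `B`.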

    public void Main()
    {
        // Declare variables.
        int CaseSensitiveVariable = Convert.ToInt32(Dts.Variables["CaseSensitiveVariableFromPackage"].Value.ToString());
        string Internal_URL = "http://www.MySite.com/SomeWebPage.asp?cn=" + CaseSensitiveVariable.ToString("X");
        Boolean fireAgainFlag = true;
        Boolean StatusIWantToCheck = false;
        string SomethingIWantToCheck = "";

        // Try-Catch block.
        try
        {
            // The WebRequest.
            HttpWebRequest oWebrequest;
            oWebrequest = (HttpWebRequest)WebRequest.Create(Internal_URL);
            oWebrequest.Credentials = System.Net.CredentialCache.DefaultCredentials;
            oWebrequest.UserAgent = "My SSIS Server Name";
            oWebrequest.Method = "POST";
            oWebrequest.Timeout = (1000 * 60 * 10);
            oWebrequest.ProtocolVersion = HttpVersion.Version10;

            // The WebResponse.
            HttpWebResponse oWResponse;
            oWResponse = (HttpWebResponse)oWebrequest.GetResponse();
            Stream s = oWResponse.GetResponseStream();
            StreamReader sr = new StreamReader(s);
            String sReturnString = sr.ReadToEnd();
            oWResponse.Close();

            // Parse the response for the text of interest.  The flag is set to true below if one of the expected values is found.
            bool includes_what_I_want_to_check = GetTextBetweenTextBlocks(sReturnString.Replace("\n", ""), "<td>Is it there?  Let's check for this.</td>", "</td>", out SomethingIWantToCheck);
            if (includes_what_I_want_to_check == true)
            {
                // Log what I want to check to the SSIS Events Log.
                Dts.Events.FireInformation(0, "Something I Want To Check", SomethingIWantToCheck, "", 0, ref fireAgainFlag);
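                // Compare case-insensitively; a lowercased string can never contain a mixed-case literal.
                if (SomethingIWantToCheck.IndexOf("Do I have this value?", StringComparison.OrdinalIgnoreCase) >= 0 || SomethingIWantToCheck.IndexOf("Or Maybe I have this value?", StringComparison.OrdinalIgnoreCase) >= 0)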
                {
                    StatusIWantToCheck = true;
                }
            }
            else
            {
                // Log response and fail.
                Dts.Events.FireError(0, "I could not find what I wanted in the Web Response", sReturnString.Replace("\n", ""), "", 0);
                Dts.TaskResult = (int)ScriptResults.Failure;
                // Stop here so the failure is not overwritten by the success result at the end of Main.
                return;
            }
        }
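        catch (WebException e)
        {
            // Log the error and fail the task instead of falling through to the success result.
            Dts.Events.FireError(0, "WebException", e.Message, "", 0);
            Dts.TaskResult = (int)ScriptResults.Failure;
            return;
        }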

        // Log variable and write value to the package variable.
        Dts.Events.FireInformation(0, "Status I Want to Check", StatusIWantToCheck.ToString(), "", 0, ref fireAgainFlag);
        Dts.Variables["StatusIWantToCheck"].Value = StatusIWantToCheck;

        // Return success.
        Dts.TaskResult = (int)ScriptResults.Success;
    }

OK. The code above is full of things you may or may not want. It performs an HTTP POST to a web page, reads the response, searches the text for specific blocks, and uses if/else branches to process the relevant data. It also shows how to write values to the SSIS event log and back to a package variable to keep track of what's happening. I rely on the logging to troubleshoot errors, particularly when I am adjusting the code. The script task is also set to fail if the expected text blocks are not found in the response.

Good luck with whatever solution you attempt to implement. Let me know if you have any questions about this code snippet.

Registered User
