Hi,

I have some regular expression code that grabs the data between the title tags on a page:

<%
    Function UrlExists(sURL)
        Dim objXMLHTTP
        Dim thePage
        Dim strPTitle   
        Dim blnReturnVal
        Dim objRegExp
        Dim strTitleResponse

        'Create object
        Set objXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
        on error resume next

        'Get the head
        objXMLHTTP.Open "HEAD", sURL, false
        objXMLHTTP.setRequestHeader "User-Agent", Request.ServerVariables("HTTP_HOST")
        objXMLHTTP.Send ""

        '404?        
        If Err.Number <> 0 or objXMLHTTP.status <> 200 then blnReturnVal = "0|404 Error" Else blnReturnVal = "1|"
        '(ServerXMLHTTP has no Close method; the same object is reused for the GET below)

        'If not 404
        if left(blnReturnVal,1) = "1" then

            'Get the physical page
            objXMLHTTP.Open "GET", sURL, false
            objXMLHTTP.Send ""
            thePage = objXMLHTTP.responseText
            thePage = replace(thePage, vbCrlf, "")

            'Find title
            Set objRegExp = New Regexp

            objRegExp.IgnoreCase = true
            objregexp.Multiline = true
            objRegExp.Global = false
            objRegExp.Pattern = "<title[^>]*?>(.*)</title>" 

            set strPTitle =  objRegExp.Execute(thePage)
            strTitleResponse = strPTitle.Item(0).Value
            strTitleResponse = replace(strTitleResponse, vbCrlf, "")
            strTitleResponse = trim(strTitleResponse)
            if len(strTitleResponse) <1 OR strTitleResponse = "" then strTitleResponse = "(No Title)"

            set objRegExp = nothing
            strTitleResponse = replace(strTitleResponse,"</title>","")
            strTitleResponse = replace(strTitleResponse,"<title>","")
            strTitleResponse = replace(strTitleResponse,"'","&#39; ")
            blnReturnVal = blnReturnVal & strTitleResponse

        end if

        Set objXMLHTTP = nothing

        UrlExists = blnReturnVal
    End Function
%>        

This has worked fine for many months, but when I wrote it I (stupidly?) assumed each page would have at most one title tag. It has recently started throwing weird errors on the John Lewis page, because that page has two titles in its HTML:

    <title>John Lewis - Shop online at Britain's Favourite Retailer</title>
... bunch of html
<title>

    </title>

How can I modify the regexp to match only the first title pair, without getting confused by HTML like the above?

+1  A: 

Getting in before all the "you should use a parser" replies: make your regexp non-greedy:

objRegExp.Pattern = "<title[^>]*?>(.*?)</title>" 

Notice the added ? after .*. By default, .* matches as much as possible, so the capture runs on to the last </title> on the page. The additional ? inverts this behaviour: the pattern now matches as little as possible and stops at the first </title>.
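
To make the difference concrete, here is a rough sketch rather than a drop-in replacement. The FirstTitle helper and the sample string are made up for illustration and are not part of the original function. It uses the lazy pattern and reads the capture group through SubMatches(0), so the tags do not have to be stripped out with replace afterwards:

```vbscript
' Hypothetical helper (not in the original code): returns the text of the
' first <title> element in html, or "" when no title is found.
Function FirstTitle(html)
    Dim re, matches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False                            ' only the first match is needed
    re.Pattern = "<title[^>]*?>(.*?)</title>"    ' lazy: stop at the first </title>

    Set matches = re.Execute(html)
    If matches.Count > 0 Then
        ' SubMatches(0) is the capture group, i.e. the text without the tags
        FirstTitle = Trim(matches(0).SubMatches(0))
    Else
        FirstTitle = ""
    End If
    Set re = Nothing
End Function

' Made-up sample with two title elements, like the page in the question
Dim sample
sample = "<title>John Lewis</title><p>...</p><title> </title>"
' FirstTitle(sample) gives "John Lewis".
' With the greedy (.*) the capture would instead be
' "John Lewis</title><p>...</p><title> ".
```

Checking matches.Count before reading the match also avoids relying on "on error resume next" to hide the error when a page has no title at all.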

Warning: I know absolutely nothing about regular expressions in classic ASP (or "modern" ASP, if there is such a thing), but since the non-greedy (lazy) operator is already used in the <title> tag match, I reckon it will work.

jensgram
Works great, thanks!
Tom Gullen
You're welcome :)
jensgram