Hi,
I have some regular expression code that grabs the data between the title tags on a page:
<%
Function UrlExists(sURL)
Dim objXMLHTTP
Dim thePage
Dim strPTitle
Dim blnReturnVal
Dim objRegExp
Dim strTitleResponse
'Create object
Set objXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
on error resume next
'Get the head
objXMLHTTP.Open "HEAD", sURL, false
objXMLHTTP.setRequestHeader "User-Agent", Request.ServerVariables("HTTP_HOST")
objXMLHTTP.Send ""
'404?
If Err.Number <> 0 or objXMLHTTP.status <> 200 then blnReturnVal = "0|404 Error" Else blnReturnVal = "1|"
objXMLHTTP.close
'If not 404
if left(blnReturnVal,1) = "1" then
'Get the physical page
objXMLHTTP.Open "GET", sURL, false
objXMLHTTP.Send ""
thePage = objXMLHTTP.responseText
thePage = replace(thePage, vbCrlf, "")
objXMLHTTP.close
'Find title
Set objRegExp = New Regexp
objRegExp.IgnoreCase = true
objregexp.Multiline = true
objRegExp.Global = false
objRegExp.Pattern = "<title[^>]*?>(.*)</title>"
set strPTitle = objRegExp.Execute(thePage)
strTitleResponse = strPTitle.Item(0).Value
strTitleResponse = replace(strTitleResponse, vbCrlf, "")
strTitleResponse = trim(strTitleResponse)
if len(strTitleResponse) <1 OR strTitleResponse = "" then strTitleResponse = "(No Title)"
set objRegExp = nothing
strTitleResponse = replace(strTitleResponse,"</title>","")
strTitleResponse = replace(strTitleResponse,"<title>","")
strTitleResponse = replace(strTitleResponse,"'","' ")
blnReturnVal = blnReturnVal & strTitleResponse
end if
Set objXMLHTTP = nothing
UrlExists = blnReturnVal
End Function
%>
This works fine and has been for many months, but when I wrote it (stupidly?) I made the assumption each page would only have one or no title tags. It's recently started to throw weird errors on the John Lewis page because it has two titles in it's HTML:
<title>John Lewis - Shop online at Britain's Favourite Retailer</title>
... bunch of html
<title>
</title>
How can I modify the regexp to match only the first matched pair, not getting confused with the HTML above?