When you have `(State)` in a regex, it matches the term `State` in the input string and captures it as a group; it won't match literal parentheses in the input. You'll need to escape them, as you have with the `/`s: `/\(State\)<\/...`.
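To illustrate the difference, here is a minimal sketch. The input string is made up for the example, since the actual page markup isn't shown:

```php
<?php
// Hypothetical input; the real page content is an assumption.
$input = 'Name (State) Code';

// Unescaped: the parentheses form a capture group, so the pattern only looks for "State".
preg_match('/(State)/', $input, $unescaped);

// Escaped: the parentheses are literal characters that must be present in the input.
preg_match('/\(State\)/', $input, $escaped);

var_dump($unescaped[0]); // string(5) "State"
var_dump($escaped[0]);   // string(7) "(State)"
```

In the first call the whole match and group 1 are both `State`; only the second pattern requires the parentheses to actually appear in the text.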
Then there's the problem that there's a lot of whitespace around the text, including newlines, which the pattern has to allow for (e.g. with `\s*`, or with the `s` modifier if you rely on `.` to span lines; the `m` modifier only changes how `^` and `$` behave), and a `<b>` tag around the header, which you don't seem to have included in the regex. Even if you fix these problems, you're highly reliant on the exact markup used by the website you're scraping. This is a general problem you'll encounter when trying to parse HTML using regular expressions. It would be a better idea to use an HTML parser (e.g. creating a new `DOMDocument` and calling its `loadHTML` method).
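A rough sketch of the parser approach might look like the following. The markup here is a stand-in I made up (a `<b>` header surrounded by whitespace, as described above), so the tag names you query for will likely differ on the real page:

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = <<<HTML
<table>
  <tr>
    <td>
      <b>
        (State)
      </b>
    </td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
// Real-world pages are rarely valid HTML; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$headers = [];
foreach ($doc->getElementsByTagName('b') as $b) {
    // trim() strips the surrounding whitespace and newlines.
    $headers[] = trim($b->textContent);
}

print_r($headers);
```

Because the parser handles whitespace and nesting for you, small changes in the page's formatting are less likely to break this than a regex tied to the exact markup.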