I have a string of text that contains HTML with several types of links (relative, absolute, root-relative). I need a regex that can be run through PHP's preg_replace to replace all relative links with root-relative links, without touching any of the other links. I already have the root path.

Replaced links:

<tag ... href="path/to_file.ext" ... >   --->   <tag ... href="/basepath/path/to_file.ext" ... >
<tag ... href="path/to_file.ext" ... />   --->   <tag ... href="/basepath/path/to_file.ext" ... />

Untouched links:

<tag ... href="/any/path" ... >
<tag ... href="/any/path" ... />
<tag ... href="protocol://domain.com/any/path" ... >
<tag ... href="protocol://domain.com/any/path" ... />
A: 

I came up with this:

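// $root_path is assumed to end with a slash, e.g. "/basepath/".
// The replacement must re-emit href="..." because the match consumes it.
$html = preg_replace('#href=["\']([^/][^\':"]*)["\']#', 'href="'.$root_path.'$1"', $html);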

It might be a little too simplistic. The obvious flaw I see is that it will also match href="something" when it is outside of a tag, but hopefully it can get you started.
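To address the "outside of a tag" flaw, here is a minimal sketch using preg_replace_callback. The function name rootRelativeHrefs and the $rootPath parameter are made up for illustration, $rootPath is assumed to end with a slash, and PHP 7 or later is assumed for the type declarations. It only looks at href attributes that follow something shaped like an opening tag, and it skips empty, root-relative, fragment-only, query-only, and scheme-prefixed values. It is still a regex over HTML, so attribute values containing < or > can fool it.

```php
<?php
// Hypothetical helper; $rootPath is assumed to end with a slash, e.g. "/basepath/".
function rootRelativeHrefs(string $html, string $rootPath): string
{
    // Group 1: opening tag text up to and including href=
    // Group 2: the quote character, group 3: the attribute value.
    $pattern = '#(<[a-zA-Z][^<>]*?\shref=)(["\'])(.*?)\2#s';

    $result = preg_replace_callback($pattern, function (array $m) use ($rootPath): string {
        $url = $m[3];
        // Leave empty values, "/...", "#...", "?..." and "scheme:..." untouched.
        if ($url === '' || preg_match('~^(?:[/#?]|[a-zA-Z][a-zA-Z0-9+.\-]*:)~', $url)) {
            return $m[0];
        }
        // Rebuild the attribute with the original quote style.
        return $m[1] . $m[2] . $rootPath . $url . $m[2];
    }, $html);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Called as rootRelativeHrefs($html, '/basepath/'), this should rewrite href="path/to_file.ext" while leaving href="/any/path" and href="protocol://domain.com/any/path" alone.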

Keare
+1  A: 

If you just want to change the base URI, you can try the BASE element:

<base href="/basepath/">

But note that changing the base URI affects all relative URIs and not just relative URI paths.

Otherwise, if you really want to use a regular expression, note that the kind of relative path you want must be of the type path-noscheme (see RFC 3986):

path-noscheme = segment-nz-nc *( "/" segment )
segment       = *pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

So the beginning of the URI must match:

^([a-zA-Z0-9-._~!$&'()*+,;=@]|%[0-9a-fA-F]{2})+($|/)

But please use a proper HTML parser to parse the HTML and build a DOM out of it. Then you can query the DOM to get the href attributes and test each value with the regular expression above.
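The parser approach could look roughly like the following sketch, which uses PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name rewriteRelativeHrefs and the $rootPath parameter are hypothetical, and $rootPath is assumed to end with a slash. The pattern is the path-noscheme test from above, with one assumption added: the first segment may also be followed by "?" or "#", so that values like "page.html?x=1" count as relative.

```php
<?php
// Hypothetical helper; $rootPath is assumed to end with a slash, e.g. "/basepath/".
function rewriteRelativeHrefs(string $html, string $rootPath): string
{
    // First segment of a path-noscheme reference (RFC 3986): non-empty, no colon.
    // Assumption of this sketch: "?" or "#" may also end the first segment.
    $relative = '#^(?:[a-zA-Z0-9\-._~!$&\'()*+,;=@]|%[0-9a-fA-F]{2})+(?:$|[/?\#])#';

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every element that carries an href attribute.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@href]') as $element) {
        $href = $element->getAttribute('href');
        if (preg_match($relative, $href)) {
            $element->setAttribute('href', $rootPath . $href);
        }
    }

    // saveHTML() serializes the whole document, including the wrapper
    // elements that loadHTML() adds around a fragment.
    return (string) $doc->saveHTML();
}
```

Two caveats: loadHTML() wraps a fragment in html and body elements, so the output is a full document rather than the original snippet, and it does not assume UTF-8 unless the markup declares a charset. Fragment-only values such as "#test" do not match the pattern and are left as they are.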

Gumbo
The base tag worked almost perfectly, except that it seems to mess with links such as href="#test": it turns them into href="/basepath/#test". On the upside, href="link#test" turns into href="/basepath/link#test", which works. Is there a way for anchors to work with the base tag without knowing anything about the current URL?
Kendall Hopkins
The behavior described in the above comment is actually a bug in WebKit (Safari, Chrome) and IE; it works fine in Firefox.
Kendall Hopkins
@Kendall Hopkins: Just as I said: *all* relative URIs are affected. And `#test` is a relative URI. And I would rather say it’s a bug in Firefox not to resolve `#test` with a base URI of `/basepath/` to `/basepath/#test`. (I think Firefox uses the algorithm of RFC 2396 while the others use the one of RFC 3986, which obsoleted RFC 2396 five years ago.)
Gumbo