Can sed regex simulate lookbehind and lookahead?

merlin2011 picture merlin2011 · Feb 15, 2013 · Viewed 14.4k times · Source

I'm trying to write a sed script that will capture all "naked" URL's in a text file and replace them with <a href=[URL]>[URL]</a>. By "naked" I mean a URL that is not wrapped inside an anchor tag.

My initial thought was that I should match URL's that do not have a " or a > in front of them, and also do not have a < or a " after them. However, I am running into difficulty with expressing the concept of "do not have in front of or behind" because as far as I know sed does not have look-ahead or look-behind.

Sample Input:

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Sample Desired Output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foo.bar">http://foo.bar</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

Observe that the third line is unmodified, because it is already inside <a href>. On the other hand, both the first and second lines are modified. Finally, observe that all non-URL text is unmodified.

Ultimately, I am trying to do something like:

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I began by verifying that the following will correctly match and remove a URL:

sed 's/http:\/\/[^\s]\+//g'

I then tried this, but it is not able to match URL's that start at the beginning of file / input:

sed 's/[^\>"]http:\/\/[^\s]\+//g'

Is there a way to work around this in sed, either by simulating lookbehind / lookahead, or explicitly matching beginning of file and end of file?

Answer

Ed Morton picture Ed Morton · Feb 15, 2013

sed is an excellent tool for simple substitutions on a single line, for any other text manipulation problems just use awk.

Check the definition I'm using in the BEGIN section below for a regexp that matches URLs. It works for your sample but I don't know if it captures all possible URL formats. Even if it doesn't though it may be adequate for your needs.

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

       if (index(tail,href) == (RSTART - 6) ) {
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}