How can I extract all anchor tags, their hrefs and their anchor text within a string?

Ryan · May 7, 2014 · Viewed 8.9k times

I need to process the links within an HTML string in several different ways.

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>.';
$links = extractLinks($str);
foreach ($links as $link) {
    $pattern = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if (preg_match($pattern, $link['href'])) {
        // Process Remote links
        //   For example, replace url with short url,
        //   or replace long anchor text with truncated
    } else {
        // Process Local Links, Anchors

    }
}
function extractLinks($str) {
    // First, I tried DomDocument
    $dom = new DomDocument();
    $dom->loadHTML($str);
    return $dom->getElementsByTagName('a');
    // But this just returns:
    //   DOMNodeList Object
    //   (
    //       [length] => 3
    //   )

    // Then I tried Regex
    if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $str, $matches)) {
        print_r($matches);
    }
    // But this didn't work either.
}

Desired result of extractLinks($str):

[0] => Array(
           'str' => '<a href="http://example.com/abc" rel="link">string</a>',
           'href' => 'http://example.com/abc',
           'anchorText' => 'string'
       ),
[1] => Array(
           'str' => '<a href="/local/path" title="with attributes">number</a>',
           'href' => '/local/path',
           'anchorText' => 'number'
       ),
[2] => Array(
           'str' => '<a href="#anchor" data-attr="lots">links</a>',
           'href' => '#anchor',
           'anchorText' => 'links'
       )

I need all of these so I can do things like edit the href (add tracking, shorten, etc.), or replace the whole tag with something else (<a href="/u/username">username</a> could become username).


Answer

Javad · May 7, 2014

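Your DOMDocument attempt already finds all three links. You just need to loop over the returned DOMNodeList and read each node, like this: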

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

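// loadHTML() parses the fragment; getElementsByTagName() returns a DOMNodeList to iterate
$dom = new DOMDocument();
$dom->loadHTML($str);
$output = array();
foreach ($dom->getElementsByTagName('a') as $item) {
   $output[] = array (
      'str' => $dom->saveHTML($item),          // the whole tag; the node argument needs PHP 5.3.6+
      'href' => $item->getAttribute('href'),   // value of the href attribute
      'anchorText' => $item->nodeValue         // text between <a> and </a>
   );
}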

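The DOMNodeList that print_r showed as just [length] => 3 is not empty; it has to be iterated. Looping over it and calling getAttribute('href'), nodeValue and saveHTML($item) on each node gives you the desired output.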
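
If you want this as the extractLinks() function from the question, something like the following might work. It is only a sketch: isRemoteLink() is a hypothetical helper that is not part of the code above, and it swaps the question's regex for parse_url() on the assumption that "remote" simply means an http, https or ftp scheme. The libxml error handling is also an addition, there to keep sloppy markup from printing warnings.

```php
<?php
// Sketch: wraps the loop from the answer in the extractLinks() function
// named in the question. isRemoteLink() is a hypothetical helper.

function extractLinks($html) {
    if (trim($html) === '') {
        return array();                       // nothing to parse
    }
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = array(
            'str'        => $dom->saveHTML($a),         // the tag as re-serialized by libxml
            'href'       => $a->getAttribute('href'),
            'anchorText' => $a->nodeValue,
        );
    }
    return $links;
}

// Assumption: a link counts as remote when its scheme is http, https or ftp.
function isRemoteLink($href) {
    $scheme = parse_url($href, PHP_URL_SCHEME);         // null or false when there is no scheme
    return in_array(strtolower((string) $scheme), array('http', 'https', 'ftp'), true);
}

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';

foreach (extractLinks($str) as $link) {
    $kind = isRemoteLink($link['href']) ? 'remote' : 'local';
    echo $kind . ': ' . $link['href'] . ' (' . $link['anchorText'] . ")\n";
}
```

Note that 'str' is the tag as the parser writes it back out, so it may not match the original markup byte for byte if the input was unusual. To edit a link rather than just read it, DOMElement::setAttribute() and DOMNode::replaceChild() work on the same nodes, but saving the whole document back with saveHTML() wraps a fragment in html and body tags, so you would have to strip those again.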