I want to wrap paragraph tags around any plain text in an HTML document, while leaving tables and other elements alone. How do I do that? I guess it can be done with preg_replace somehow?
Here are a couple of functions that should help you do what you want:
// nl2p
// This function will convert newlines to HTML paragraphs
// without paying attention to HTML tags. Feed it a raw string and it will
// simply return that string sectioned into HTML paragraphs
function nl2p($str) {
    $out = '';
    foreach (explode("\n", $str) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $out .= '<p>' . $line . '</p>';
        }
    }
    return $out;
}
// nl2p_html
// This function will add paragraph tags around textual content of an HTML file, leaving
// the HTML itself intact
// This function assumes that the HTML syntax is correct and that the '<' and '>' characters
// are not used in any of the values for any tag attributes. If these assumptions are not met,
// mass paragraph chaos may ensue. Be safe.
function nl2p_html($str) {
    // Start with an empty output string so it is defined even when the input has no head section
    $out = '';
    // If we find the end of an HTML head, assume that this is part of a standard HTML file. Cut off everything up to and
    // including the end of the head and save it in our output string, then trim the head off of the input. This is mostly
    // because we don't want to surround anything like the HTML title tag or any style or script code in paragraph tags.
    $head_end = strpos($str, '</head>');
    if ($head_end !== false) {
        $out = substr($str, 0, $head_end + 7);
        $str = substr($str, $head_end + 7);
    }
    // First, we explode the input string based on wherever we find HTML tags, which start with '<'
    $arr = explode('<', $str);
    // Next, we loop through the array that is broken into HTML tags and look for textual content, or
    // anything after the '>'
    for ($i = 0; $i < count($arr); $i++) {
        if (strlen(trim($arr[$i])) > 0) {
            if ($i == 0) {
                // The first element is whatever came before the first '<', so it has no tag in front of it
                $html = '';
                $text = $arr[$i];
            } else {
                // Add the '<' back on since it became collateral damage in our explosion as well as the rest of the tag
                $tag_end = strpos($arr[$i], '>');
                $html = '<' . substr($arr[$i], 0, $tag_end + 1);
                // Take the portion of the string after the end of the tag. Since this is after
                // the end of the HTML tag, this must be textual content.
                $text = substr($arr[$i], $tag_end + 1);
            }
            // Explode the text by newline and initialize the output string for this next loop
            $sub_arr = explode("\n", $text);
            $paragraph_text = '';
            // Loop through this new array and add paragraph tags (<p>...</p>) around any element that isn't empty
            for ($j = 0; $j < count($sub_arr); $j++) {
                if (strlen(trim($sub_arr[$j])) > 0)
                    $paragraph_text .= '<p>' . trim($sub_arr[$j]) . '</p>';
            }
            // Put the text back onto the end of the HTML tag and put it in our output string
            $out .= $html . $paragraph_text;
        }
    }
    // Throw it back into our program
    return $out;
}
The first of these, nl2p(), takes a string as input and splits it into an array at each newline ("\n") character. It then goes through each element and, if it finds one that isn't empty, wraps <p></p> tags around it and adds it to a string, which is returned at the end of the function.
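As a quick illustration, here is a minimal sketch of calling nl2p() on a short string. The sample input is made up, and the function is repeated (with the same behaviour as above) only so the snippet runs on its own:

```php
<?php
// Same behaviour as nl2p() above, repeated so this snippet is self-contained
function nl2p($str) {
    $out = '';
    foreach (explode("\n", $str) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $out .= '<p>' . $line . '</p>';
        }
    }
    return $out;
}

// Hypothetical sample input with a blank line and some stray whitespace
$text = "First line\n\n  Second line  \n";
echo nl2p($text);
// Prints: <p>First line</p><p>Second line</p>
```

Because trim() also strips "\r", input with Windows line endings ("\r\n") should come out the same way.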
The second, nl2p_html(), is a more complicated version of the former. Pass an HTML file's contents to it as a string and it will wrap <p> and </p> tags around any non-HTML text. It does this by exploding the string into an array where the delimiter is the < character, which is the start of any HTML tag. Then, iterating through each of these elements, the code looks for the end of the HTML tag and takes anything that comes after it into a new string.
This new string is itself exploded into an array where the delimiter is a newline ("\n"). Looping through this new array, the code looks for elements that are not empty. When it finds some data, it wraps it in paragraph tags and adds it to an output string. When this loop is finished, this string is added back onto the HTML tag, and the two together are appended to an output buffer string, which is returned once the function has completed.
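To make those steps concrete, here is a rough walk-through of what happens to one tiny fragment. The fragment and the variable names are made up for illustration, and this only traces the idea rather than replacing the function:

```php
<?php
// Hypothetical fragment used only for illustration
$fragment = "<div>Hello\nWorld</div>";

// Step 1: split on '<'; the first element is whatever preceded the first tag
$pieces = explode('<', $fragment);
// $pieces is now ['', "div>Hello\nWorld", '/div>']

// Step 2: for one piece, separate the tag from the text that follows it
$piece   = $pieces[1];
$tag_end = strpos($piece, '>');
$tag     = '<' . substr($piece, 0, $tag_end + 1); // '<div>'
$text    = substr($piece, $tag_end + 1);          // "Hello\nWorld"

// Step 3: wrap each non-empty line of the text in paragraph tags
$wrapped = '';
foreach (explode("\n", $text) as $line) {
    if (trim($line) !== '') {
        $wrapped .= '<p>' . trim($line) . '</p>';
    }
}

// Step 4: glue the tag, the wrapped text and the closing tag back together
echo $tag . $wrapped . '<' . $pieces[2];
// Prints: <div><p>Hello</p><p>World</p></div>
```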
tl;dr: nl2p() will convert a string to HTML paragraphs without leaving any empty paragraphs and nl2p_html() will wrap paragraph tags around the contents of the body of an HTML document.
I tested this on a couple of small example HTML files to make sure that spacing and other things don't ruin the output. The code that's generated by nl2p_html() may not be W3C-compliant, either, as it will wrap anchors around paragraphs and the like rather than the other way around.
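If the nesting problem matters to you, a different technique from the string splitting above is to let PHP's DOM extension parse the markup and then wrap only the text nodes you choose. The following is a rough sketch, not a drop-in replacement: wrap_bare_text() is a hypothetical name, it assumes the DOM extension is available, it only touches text that sits directly inside body or div, and it makes one paragraph per text node rather than one per line.

```php
<?php
// Hypothetical helper, not part of the functions above: wraps bare text that
// sits directly inside <body> or <div> in <p> tags using the DOM extension.
function wrap_bare_text($html) {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select non-blank text nodes whose parent is <body> or <div>; text inside
    // table cells, anchors, scripts and so on is never matched
    $query = '//body/text()[normalize-space()] | //div/text()[normalize-space()]';
    // Copy the matches into an array first because the loop modifies the tree
    foreach (iterator_to_array($xpath->query($query)) as $text_node) {
        $p = $doc->createElement('p');
        $p->appendChild($doc->createTextNode(trim($text_node->nodeValue)));
        $text_node->parentNode->replaceChild($p, $text_node);
    }
    return $doc->saveHTML();
}

// Made-up sample: "Hello" should be wrapped, the table cell should be left alone
$sample = '<html><body><div>Hello<table><tr><td>cell</td></tr></table></div></body></html>';
echo wrap_bare_text($sample);
```

The exact serialization (doctype, line breaks) depends on the underlying libxml version, so treat the output as something to inspect rather than rely on byte for byte.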
Hope this helps.