This post shows how to extract the first paragraph from an HTML page in PHP, using strpos to locate the first <p> and </p> tags and substr to copy the content between them.
Using strpos and substr
Assuming the content to extract the paragraph from is in the variable $html (which may have come from a file, database, template or downloaded from an external website), use the following code to work out the position of the first <p> tag, the first </p> tag after that tag, and then get all the HTML between them including the opening and closing tags:
$start = strpos($html, '<p>');
$end = strpos($html, '</p>', $start);
$paragraph = substr($html, $start, $end-$start+4);
Line 1 gets the position of the first opening <p> tag.
Line 2 gets the position of the first </p> after that opening <p>, by starting the search at $start.
Line 3 then uses substr to get the HTML. The third parameter is the number of characters to copy, calculated by subtracting $start from $end and adding 4, the length of "</p>", so that the closing tag is included in the extracted HTML.
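The three lines assume that both tags are present. strpos returns false when it cannot find the search string, so a page with no <p> or no closing </p> would produce an unintended result. The following is a minimal sketch of the same technique wrapped in a function that checks for this; the name first_paragraph and the sample HTML are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: returns the first <p>...</p> block including its
// tags, or null when no complete paragraph is found.
function first_paragraph(string $html): ?string
{
    $start = strpos($html, '<p>');
    if ($start === false) {
        return null; // no opening <p> tag at all
    }

    // Search for the closing tag only after the opening tag.
    $end = strpos($html, '</p>', $start);
    if ($end === false) {
        return null; // the opening tag is never closed
    }

    // Copy from <p> up to and including </p>.
    return substr($html, $start, $end - $start + strlen('</p>'));
}

// Sample input, invented for this example.
$html = '<h1>Title</h1><p>First &amp; foremost.</p><p>Second.</p>';
echo first_paragraph($html), PHP_EOL; // <p>First &amp; foremost.</p>
```

Like the original code, this only matches a bare <p> tag. A paragraph written with attributes, such as <p class="intro">, or in upper case would be skipped, so it is best suited to HTML whose markup is known in advance.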
Converting to plain text
If the extracted paragraph needs to be in plain text rather than HTML, use the following to remove the HTML tags and decode the HTML entities into their plain-text characters:
$paragraph = html_entity_decode(strip_tags($paragraph));
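As a short worked example of that conversion, the sketch below runs the same two functions on a sample paragraph. The function name to_plain_text and the sample string are assumptions made for illustration, and the explicit flags and 'UTF-8' charset are one reasonable choice rather than something the original code specifies.

```php
<?php
// Hypothetical helper: strips tags first, then decodes entities.
function to_plain_text(string $paragraph): string
{
    // ENT_QUOTES decodes both quote styles; ENT_HTML5 uses the HTML5
    // entity table. Both are optional choices for this sketch.
    return html_entity_decode(strip_tags($paragraph), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Sample input, invented for this example.
$paragraph = '<p>Fish &amp; <em>chips</em> for &pound;5</p>';
echo to_plain_text($paragraph), PHP_EOL; // Fish & chips for £5
```

The order of the two calls matters. Stripping tags before decoding means that text such as &lt;b&gt;, which was written to be displayed literally, survives as <b> in the output. Decoding first would turn it into a real tag that strip_tags then removes.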