Updated Simple PHP Scraping Function

Going back over some of my old posts on web scraping and looking through the code, I’ve noticed a few places where there is room for improvement. One such example is the scrape_between() function used in Working With The Scraped Data.

Functionally, it works. But it’s not great.

  • Repeatedly overwriting the $data variable isn’t a great idea.
  • It assumes that both the $start and $end strings are found in the search string. If $start isn’t found, the function should terminate and return false rather than carry on performing operations on an empty variable. This isn’t necessarily that bad, but it’s not very elegant and could become a performance issue if the function is used in a large application processing thousands of pages.

With that said, here’s my rewrite, which is more readable, makes more sense and is structurally more sound.

<?php
    
/*
 * String Extract PHP Function
 *
 * Simple function for extracting a string from within a string, given a start and end point.
 *
 * Copyright (c) 2014 Jacob Ward (http://www.jacobward.co.uk)
 *
 * Licensed under the MIT (http://opensource.org/licenses/MIT) and GPL (http://www.gnu.org/copyleft/gpl.html) licenses.
 *
 */

    function stringExtract($item, $start, $end) {
    	if (($startPos = stripos($item, $start)) === false) {	// If $start string is not found
    		return false;	// Return false
    	} else if (($endPos = stripos($item, $end)) === false) {	// If $end string is not found
    		return false;	// Return false
    	} else {
    		$substrStart = $startPos + strlen($start);	// Assigning start position
    		return substr($item, $substrStart, $endPos - $substrStart);	// Returning string between start and end positions
    	}
    }
    
?>
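
As a quick illustration of how the function above behaves, here is a minimal usage sketch. The $html string is made-up markup for the example, and the function body is simply repeated so the snippet runs on its own.

```php
<?php

// Copy of stringExtract() from the post, repeated so this snippet is self-contained.
function stringExtract($item, $start, $end) {
    if (($startPos = stripos($item, $start)) === false) {
        return false;   // $start not found
    } else if (($endPos = stripos($item, $end)) === false) {
        return false;   // $end not found
    } else {
        $substrStart = $startPos + strlen($start);
        return substr($item, $substrStart, $endPos - $substrStart);
    }
}

// Hypothetical page source, invented for this example.
$html = '<html><head><title>Example Page</title></head><body></body></html>';

$title   = stringExtract($html, '<title>', '</title>');
$missing = stringExtract($html, '<h1>', '</h1>');

var_dump($title);   // string(12) "Example Page"
var_dump($missing); // bool(false), because '<h1>' is not in $html
```

Because the function can legitimately return an empty string, it is probably safest to compare the result against false with === rather than relying on a loose truthiness check.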

I’ve changed the name of the function and added some info before the function as it’s now hosted on GitHub as php-string-extract.

Just thought I’d post it here too for the sake of doing so.

By the way, I haven’t forgotten about the epic post I promised. I’m working on it most days but don’t have a great amount of free time at the moment. But trust me, the wait will be worth it!

2 thoughts on “Updated Simple PHP Scraping Function”

  1. Great script Jacob. I found this has a problem if your end position ever occurs before your start position. I’m trying to scrape a webpage where the string I want is terminated by “/class” in brackets. However, that is not the first occurrence of that word in the webpage.

    I’m trying to rework the script to use the optional third argument of stripos to tell it where to begin searching, like $endPos = stripos($item, $end, $substrStart), if I can figure out how to assign $substrStart first.
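
The idea in the comment above could be sketched roughly as follows: work out $substrStart first, then pass it to stripos() as the offset so the search for $end only begins after $start. This is a sketch rather than the version hosted on GitHub; the name stringExtractAfter and the sample $page string are made up for illustration.

```php
<?php

// Hypothetical variant of stringExtract(): $end is only searched for
// after the end of the first $start match.
function stringExtractAfter($item, $start, $end) {
    $startPos = stripos($item, $start);
    if ($startPos === false) {
        return false;   // $start not found
    }

    // Position of the first character after $start.
    $substrStart = $startPos + strlen($start);

    // Third argument of stripos() is the offset to begin searching from.
    $endPos = stripos($item, $end, $substrStart);
    if ($endPos === false) {
        return false;   // $end not found after $start
    }

    return substr($item, $substrStart, $endPos - $substrStart);
}

// Invented sample where the end marker also appears before the start marker.
$page = '[/class] noise [name]Jacob[/class]';

var_dump(stringExtractAfter($page, '[name]', '[/class]')); // string(5) "Jacob"
```

Assigning $substrStart before looking for $end means the early returns no longer fit neatly into a single if / else if chain, which is why this sketch uses separate checks.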
