Working With The Scraped Data [Part 2]

Web Scraping With PHP & CURL [Part 1] was pretty short and simple, so I thought I’d follow it up rather quickly with Part 2 – Working With The Scraped Data.

In this part, we’re going to create a function to work with the data we downloaded in Part 1: first scraping a specific section of data from the page, then breaking the page up into sections we can iterate over, scraping multiple pieces of similar data into an array for further use.

Also, we’re going to introduce a couple of modifications to our cURL PHP function.

Then, we’ll put everything together using a real world example.

The Scraping Function

In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.

<?php
	// Defining the basic scraping function
	function scrape_between($data, $start, $end){
		$data = stristr($data, $start);	// Stripping all data from before $start
		if ($data === FALSE) return "";	// Returning an empty string if $start isn't found
		$data = substr($data, strlen($start));	// Stripping $start
		$stop = stripos($data, $end);	// Getting the position of the $end of the data to scrape
		$data = substr($data, 0, $stop);	// Stripping all data from after and including the $end of the data to scrape
		return $data;	// Returning the scraped data from the function
	}
?>

The comments in the function should explain it pretty clearly, but just to clarify further:

1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, the point at which you wish to start scraping) and $end (string, the point at which you wish to finish scraping).

2. stristr() is used to strip all data from before the $start position.

3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.

4. stripos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.

5. The data we wanted scraped, in $data, is returned from the function.
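To see the function in action on its own, here’s a small usage example against a made-up snippet of HTML (the markup is purely illustrative). Note that because the function uses stristr() and stripos(), the $start and $end matching is case-insensitive:

```php
<?php
	// The basic scraping function, as defined above
	function scrape_between($data, $start, $end){
		$data = stristr($data, $start);	// Stripping all data from before $start
		$data = substr($data, strlen($start));	// Stripping $start
		$stop = stripos($data, $end);	// Getting the position of the $end of the data to scrape
		$data = substr($data, 0, $stop);	// Stripping all data from after and including the $end
		return $data;	// Returning the scraped data from the function
	}

	// A made-up snippet of HTML to scrape from
	$html = "<html><head><title>My Example Page</title></head></html>";

	echo scrape_between($html, "<title>", "</title>");	// Prints: My Example Page
?>
```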

In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.
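As a small preview of that, PHP’s built-in preg_match() can capture the same kind of data in a single call. This is just an illustrative sketch; we’ll cover Regex properly in that later part:

```php
<?php
	// A made-up snippet of HTML to scrape from
	$html = "<html><head><title>My Example Page</title></head></html>";

	// Capturing everything between the title tags (the pattern is illustrative only)
	if (preg_match('/<title>(.*?)<\/title>/i', $html, $matches)) {
		echo $matches[1];	// Prints: My Example Page
	}
?>
```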

Modifying The CURL Function

Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function from Part 1.

<?php	
	// Defining the basic cURL function
	function curl($url) {
		// Assigning cURL options to an array
		$options = Array(
			CURLOPT_RETURNTRANSFER => TRUE,	// Setting cURL's option to return the webpage data
			CURLOPT_FOLLOWLOCATION => TRUE,	// Setting cURL to follow 'location' HTTP headers
			CURLOPT_AUTOREFERER => TRUE,	// Automatically set the referer where following 'location' HTTP headers
			CURLOPT_CONNECTTIMEOUT => 120,	// Setting the amount of time (in seconds) before the request times out
			CURLOPT_TIMEOUT => 120,	// Setting the maximum amount of time for cURL to execute queries
			CURLOPT_MAXREDIRS => 10,	// Setting the maximum number of redirections to follow
			CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",	// Setting the useragent
			CURLOPT_URL => $url,	// Setting cURL's URL option with the $url variable passed into the function
		);
		
		$ch = curl_init();	// Initialising cURL 
		curl_setopt_array($ch, $options);	// Setting cURL's options using the previously assigned array data in $options
		$data = curl_exec($ch);	// Executing the cURL request and assigning the returned data to the $data variable
		curl_close($ch);	// Closing cURL 
		return $data;	// Returning the data from the function 
	}
?>

If you look at the function above, it may seem rather different to the one we created in Part 1; however, it’s essentially the same, just with a few minor tweaks.

The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().

Aside from that, and the extra settings introduced, this function works exactly the same way as the one from Part 1.

The extra cURL settings that have been added are CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT. Each is explained in the comments of the function above.

Putting It All Together

We place both of those functions in our PHP script and we can use them like so:

<?php
	$scraped_page = curl("http://www.imdb.com");	// Downloading IMDB home page to variable $scraped_page
	$scraped_data = scrape_between($scraped_page, "<title>", "</title>");	// Scraping downloaded data in $scraped_page for content between <title> and </title> tags
	
	echo $scraped_data;	// Echoing $scraped_data, should show "The Internet Movie Database (IMDb)"
?>

As you can see, this small scraper visits the IMDb website, downloads the page, scrapes the page title from between the <title> tags, then echoes the result.

Scraping Multiple Data Points From A Web Page

Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.

So, we’ll expand on this a bit and scrape multiple data points from a web page.

For this we’ll still be using IMDb as our target site, however, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.

First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:

http://www.imdb.com/search/title?title=goodfellas

The part after the equals sign (goodfellas) is the keyword being searched for.

The parameter name (title) is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless, as it’s only going to return a single result that we’d actually want. So, instead, let’s try searching by genre:

http://www.imdb.com/search/title?genres=action

Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.

Now we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script and we have a page with a list of results we want to scrape returned.
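As a side note, if you want to build these search URLs from variable parameters rather than hard-coding them, PHP’s http_build_query() does the job. A small sketch, using the genres parameter from above:

```php
<?php
	// Building the IMDb search URL from an array of query parameters
	$params = array("genres" => "action");
	$url = "http://www.imdb.com/search/title?" . http_build_query($params);

	echo $url;	// Prints: http://www.imdb.com/search/title?genres=action
?>
```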

Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.

<?php
	$url = "http://www.imdb.com/search/title?genres=action";	// Assigning the URL we want to scrape to the variable $url
	$results_page = curl($url);	// Downloading the results page using our curl() function
	
	$results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">");	// Scraping out only the middle section of the results page that contains our results
	
	$separate_results = explode("<td class=\"image\">", $results_page);	// Exploding the results into separate parts in an array
		
	// For each separate result, scrape the URL
	foreach ($separate_results as $separate_result) {
		if ($separate_result != "") {
			$results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title=");	// Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
		}
	}
	
	print_r($results_urls); // Printing out our array of URLs we've just scraped
?>

Now for an explanation of what’s happening here, if it’s not already clear:

1. Assigning the search results page URL we want to scrape to the $url variable.

2. Downloading the results page using our curl() function.

3. Using our scrape_between() function, we scrape out just the section of results we need, stripping away the header, sidebar, etc…

4. We need to identify each search result by a common string that can be used to explode the results. This string, which every result contains, is <td class="image">. We use this to explode the results into the array $separate_results.

5. For each separate result, if it’s not empty, we scrape the URL data from between the start point of href=" and the end point of " title= and add it to our $results_urls array. Because IMDb uses relative URLs instead of full paths, we also prepend http://www.imdb.com to each result to give us a full URL that can be used later.

6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
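To make the explode-and-loop pattern easier to follow without a live request, here’s the same logic run against a made-up fragment of a results page (the markup and title IDs are purely illustrative):

```php
<?php
	// The basic scraping function, as defined earlier in this post
	function scrape_between($data, $start, $end){
		$data = stristr($data, $start);	// Stripping all data from before $start
		$data = substr($data, strlen($start));	// Stripping $start
		$stop = stripos($data, $end);	// Getting the position of the $end of the data to scrape
		$data = substr($data, 0, $stop);	// Stripping all data from after and including the $end
		return $data;	// Returning the scraped data from the function
	}

	// A made-up fragment standing in for a downloaded results page
	$results_page = '<td class="image"><a href="/title/tt0099685/" title="Goodfellas">'
	              . '<td class="image"><a href="/title/tt0068646/" title="The Godfather">';

	$separate_results = explode("<td class=\"image\">", $results_page);	// Exploding the results into separate parts

	// For each separate result, scrape the URL
	$results_urls = array();
	foreach ($separate_results as $separate_result) {
		if ($separate_result != "") {
			$results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title=");
		}
	}

	print_r($results_urls);	// Prints the two full IMDb URLs built from the fragment
?>
```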

That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs. So, next time we’re going to cover traversing the pages of a website to scrape data from multiple pages and organise the data in a logical structure.

Up next time: Navigating Multiple Pages Of A Website With PHP & CURL


65 thoughts on “Working With The Scraped Data [Part 2]”

  1. Hi sir,
    I’ve tried the code given by u for web scraping.
    i used this code for retrieving data from thinkdigit.com/top10.html
    but its showing error as
    Parse error: syntax error, unexpected ‘product’ (T_STRING) in /home/vhosts/myapps.freevar.com/index.php on line 31
    what is this problem? can u help out for this?

  2. Hi! I am trying to echo a specific div. However, my code isn’t working. Can you take a look?

    $url = "http://www.zagat.com/r/lillys-french-cafe-wine-bar-venice";    // Assigning the URL we want to scrape to the variable $url
        $results_page = curl($url); // Downloading the results page using our curl() funtion
      
        $results_page = scrape_between($results_page, "", ""); // Scraping out only the middle section of the results page that contains our results
         
        $separate_results = explode("", $results_page);   // Expploding the results into separate parts into an array
    
    echo $results_page;
    
  3. hi jacob i want to scrap Company name|contact person name|address|Email ID and phone numbers from the following website http://www.bizzduniya.com and http://www.askme.com in to excel sheet? pls help the above mentioned websites are yellow pages/directories which will have list of companies in india with there contact details

    pls help me with this requirement.

  4. Hello Jacob,

    i need some help for a hobby project, my goal is to scape some game data for our Lords and Knights Clan. For example i wish to scape all castle data for a user to insert it in a mysql database an use it for game action calculating.

    I do it before with some grease monkey scripts, but i would like to do it over my php scripts.

    I found your very interesting Website explaining how to do it, but it is a little bit to difficult, i think i need some additional startup tips.

    Perhaps i can pay for your support, but it is only a hobby project.

    Until now i started with greasemonky scripts for firefox and Lords & Knights.

    Then i use Celtic Knights from Xyrality as iPad App an analyze the traffic from my iPad to the Xyrality Game servers with Charles Web Debugging Proxy Tool.

    But after a few month the shown data is encrypted with url encoding.

    At that point i give up.

    My idea is to pay for the php base script to scape the game data from Celtic Knights from xyrality. I would like to make more comfort to my php web tool for our clan. But for now i have to solve to much for that.

    Perhaps you are able to support me at this point.

    Kind regards

    Wolfgang Vogel

    1. Sorry for taking so long to respond to your comment. I have been away in France with no internet access all weekend.

      With regards to your project, I can definitely offer the assistance you require. I will contact you via email before the weekend, if this suits you?

        1. No, under contract with my publishers I can’t make posts which are covered in my book 🙁

          Though there are other topics which I will be posting about in the very near future, which will inadvertently cover what you are after…

  5. Great article series so far! Btw I’ve been following along, and the scripts haven’t output anything yet; hoping it’s all tied together in the next article.

    Very new to php, and learning head first. The way you explained array key vs array value was amazing. The key/value part just never really sunk in until you took the time to explain it.

  6. Hello!

    I am trying to use your scraper to extract all of the type contents out of a website so that I can further modify it to view errors with Google Analytics codes.

    here is what I’m using (removed the URL for anonymity’s sake) — Can you help me troubleshoot why it is not working? I’d like to store all the javascript codes in an array, then loop through them all to echo out only those that are Google Analytics related. (usually contain _gaq in the javascript)

    Thanks:

    $url = "removed";    // Assigning the URL we want to scrape to the variable $url
        $results_page = curl($url); // Downloading the results page using our curl() funtion
         
        $results_page = scrape_between($results_page, "", ""); // Scraping out only the middle section of the results page that contains our results
         
        $separate_results = explode("", $results_page);   // Expploding the results into separate parts into an array
             
        // For each separate result, scrape the URL
        foreach ($separate_results as $separate_result) {
            if ($separate_result != "") {
                $results_urls[] = "" . scrape_between($separate_result, "", ""); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
            }
        }
         
        print_r($results_urls); // Printing out our array of URLs we've just scraped
    
    1. It is likely not working because on line 4, you are not entering any strings to scrape between. If you want to scrape all JavaScript, try something like:

      $results_page = scrape_between($results_page, "<script", "</script>"); // Modify to the pages you are scraping
      
  7. I just realized that your comments strip out code!
    In seperate_results, i have the starting tag for javascript, i.e. script type=”text/javascript” and in scrape_between in the function, I have the starting and ending script tags.

  8. PART1, PART2, PART3 was the nice articles and worked fine for me.

    I am working on screen scraping with paging .

    I am able to scrap the first tow pages but our script didn’t display the page coming after 2nd page….like Next.

    I hope you understand my issue.
    Please do needful.

    Thanks in advance.

  9. Hi,
    I want to scrap data from some of the popular classified sites and store it in a database regularly. The data need to be scraped are ad title, price,images,contact details etc. from the automobile section of each classifieds website. Can you help me with this project requirements?

    Regards,
    Sreejith

  10. Hello Mr.Jacob,
    Excellent work. Amazing. Thanks you for providing such a nice and neatly presented work.

    Your example helped me very much.

    Thanks & Regards,
    Madhu
    [From India/Hyderabad]

  11. I really appreciate the effort put into this and was wondering if you had any ideas on how to cache the curl url request?
    Thanks in advance!

        function curl($url) {
            // Assigning cURL options to an array
            $options = Array(
                CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
                CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
                CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
                CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the request times out
                CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
                CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
                CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
                CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
            );
             
            $ch = curl_init();  // Initialising cURL 
            curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
            $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
            curl_close($ch);    // Closing cURL 
            return $data;   // Returning the data from the function 
        }
    
    1. The best option is to code your own caching library. However, as a quick and dirty fix, you can use file_put_contents($cache, $data); to save it then retrieve it with file_get_contents($cache);.
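To illustrate, here’s a minimal sketch of that quick and dirty approach wrapped around the curl() function from the article. The cache location and the one-hour expiry are arbitrary choices for this sketch:

```php
<?php
	// Minimal file-based cache around the curl() function defined in the article.
	// The cache path and the one-hour expiry are arbitrary choices.
	function cached_curl($url) {
		$cache = sys_get_temp_dir() . "/scrape_" . md5($url) . ".html";	// One cache file per URL

		// Serving from cache if the file exists and is less than an hour old
		if (file_exists($cache) && (time() - filemtime($cache)) < 3600) {
			return file_get_contents($cache);
		}

		$data = curl($url);	// Falling back to a live request
		file_put_contents($cache, $data);	// Saving the page for next time
		return $data;
	}
?>
```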

    1. Yes, I do use DOMDocument(). However, these posts are aimed at people new to PHP so using basic stuff here as a basic introduction to build upon in future posts. But, there are instances where this method is preferable, such as where context is relevant.

  12. Thank you for all of the information,
    When I try to scrape_between “div id’s” I get a blank page as a result. Is there a rule I should be following?

    Also how could I make the array of urls clickable?

    1. I’m assuming you’re talking about the returned array $results_urls. print_r() is just displaying what is in the array, if you’d like to print out a list of clickable urls, rather than just printing the array, you would do something along the lines of:

      // For each of the items in the array where: $key = the array key (0, 1, 2, etc...), and $value = the value of that key (in this case the URL)
      foreach ($results_urls as $key => $value) {
      	echo '<a href="' . $value . '">' . $value . '</a><br />';
      }
      
  13. Thx… nice code…

    btw i want ask something
    1. What different between using file_get_content and curl? and different from using htmldom and curl?
    2. how can i do post something in website that i grab ? what must i do to action and how i get the value?

    thx.. sorry a lot questions

    1. file_get_contents() is just a PHP function for getting the contents of a file, whereas cURL is a complete library for making requests: you can send and receive headers, use cookies and proxies, and, leading into your next question, make POST requests.

      2. First you must identify the form you wish to post to and all of its inputs. The name is the name of the input and the value is the value you want to submit for that input. The URL you submit to will be in the form’s action attribute; if this is blank then the URL is the same as the page you are on. I’ll include a simple cURL posting function below. It takes as its arguments the URL you wish to post to and an array of the names and values of the form, e.g.

      <?php
      
      $url = 'http://www.somewebsite.com/login.php';
      
      $postFields = array(
      'username' => 'yourusername', 
      'password' => 'yourpassword'
      );
      
      $dashboard = curlPost($url, $postFields);
      
      
      function curlPost($url, $postFields = null) {
      			
      			$useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';	// Setting useragent of a popular browser
      	
      			$cookie = 'cookie.txt';	// Setting a cookie file to store cookie
      	
      			$ch = curl_init();	// Initialising cURL session
      
      			// Setting cURL options
      			curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);	// Prevent cURL from verifying SSL certificate
      			curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);	// Script should fail silently on error
      			curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);	// Use cookies
      			curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
      			curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Returning transfer as a string
      			curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);	// Setting cookiefile
      			curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);	// Setting cookiejar
      			curl_setopt($ch, CURLOPT_USERAGENT, $useragent);	// Setting useragent
      			curl_setopt($ch, CURLOPT_URL, $url);	// Setting URL to POST to
      			
      			curl_setopt($ch, CURLOPT_POST, TRUE);	// Setting method as POST
      			curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));	// Setting POST fields as array
      			
      			$results = curl_exec($ch);	// Executing cURL session
      			curl_close($ch);	// Closing cURL session
      			
      			return $results;	
      		}
      
      ?>
      
      1. ok thx a lot that really helpfull…. but one more… u already know that there is anything input name username and password… What about if I dont know input name… so i must search input name and where will this input going to post.. how can i search and get input name and url action? sorry a lot questions..

  14. ….well, I think I saw a possible solution; seems like temperature is the first non integer number; all the rest can be easily traced. I wonder if there is any more general concept for doing these pattern retrievals

  15. …seems I did my job with these few lines; using fopen and no CURL at all… you see any problem with this approach?

    1. The main problem I see is that you can’t guarantee the temperature is always going to be a non-positive, non-integer, number.

      I would personally convert the file to a csv file (shouldn’t be too difficult for you to work out) and select the values from the column “Temp”.

  16. Hi Jacob
    I tried running this code but I am getting blank page only.What’s going wrong here.

    I am not able to post

    1. It’s not working because on line 3 you’re trying to execute a function that you haven’t declared: scrape_between()

      Add this into your script and you should be good to go:

      <?php
          // Defining the basic scraping function
          function scrape_between($data, $start, $end){
              $data = stristr($data, $start); // Stripping all data from before $start
              $data = substr($data, strlen($start));  // Stripping $start
              $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
              $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
              return $data;   // Returning the scraped data from the function
          }
      ?>
      
  17. Everything is working like a charm. but I have one little question. I would like to scrape the text of the link, Laptop (see code). But how do I use the scrape_between() function when it is not between anything??



    Laptops

    1. Well, according to your picture of the code Laptops seems to be between

      <a href="/laptops" rel="nofollow">

      and

      </a>

      . After you’ve scraped it, you can remove the whitespace with PHP’s trim() function. E.g.

      $laptops = trim(scrapeBetween($a, $b, $c));
  18. Hi Jacob,

    Need your help with this website, it uses some kind of application or it all javascript based. I was trying to scrape this link:

    Website link

    but i don’t know where to start. it’s really different. I can scrape other website, but this one i need your help.

    please gimme a hint to do this.

    THanks in advance

  19. Love your tutorials – it all works well – however some web pages require you to page down or scroll down before the site will download the entire page – often it may require several page downs.

    How do you get the full page in these cases or simulate the multiple page downs?

  20. Thank you for your script but I have problem with your code.
    This is the HTML part I want to scrap ( I want to extract “turtles”) :

    Like :
    Turtles

    This is the part of your code :

    $scraped_data = scrape_between($scraped_page, “Like :
    “, “”);

    It doesn’t work. scraped_data is blank and not detected.

    I try :

    $scraped_data = trim(scrape_between($scraped_page, “Like :
    “, “”));

    It doesn’t work too.

    I try :

    $scraped_data = scrape_between($scraped_page, “Like :”, “”));

    It doesn’t work too, removing the blanks…

    Your script work only with code without any spaces….

    Do you have a solution about this ?

    Thank you

  21. I personally prefer to use XPATH to work with scraped data, it makes finding elements and grabbing data much easier than trying to use string comparing and regex. Something to look into maybe!

  22. Hi Jacob

    In your book, it doesn’t explain how to scrape multiple occurrences of something using XPath, such as scraping all of the links from a page. How is this done?

  23. Thanks so much for writing this script. I have been looking for this for some time, and this is the only script that made sense to me.

  24. Hi Sir,
    i am university student, my task is to scrap data from a website in php. This is my first time to scrap data. I do not know any method to do Web scrapping. Recently i have read your article about Web Scrapping with PHP and CURL (Part 1 and Part 2), but i don’t know, where to use this code in my website.
    And second thing is, i want to scrap data automatically, i mean to say that after 24 hours my website automatically scrap the data from other website and show me on my website.
    Kindly help me for my these two queries.
    Thanks alot
