Navigating And Scraping Multiple Pages With PHP & CURL [Part 3]

If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.

But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?

Well, that’s what we’ll be covering today. We’ll use PHP and cURL to navigate the results pages, scrape multiple pages of the website for data and organise that data into a logical structure for further use.

So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.

If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk you through them afterwards.

<?php
	
	$continue = TRUE;	// Assigning a boolean value of TRUE to the $continue variable
	
	$results_urls = array();	// Initialising the array that will hold the results URLs
	
	$url = "http://www.imdb.com/search/title?genres=action";	// Assigning the URL we want to scrape to the variable $url
	
	// While $continue is TRUE, i.e. there are more search results pages
	while ($continue == TRUE) {
		
		$results_page = curl($url);	// Downloading the results page using our curl() function

		$results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">");	// Scraping out only the middle section of the results page that contains our results
		
		$separate_results = explode("<td class=\"image\">", $results_page);	// Exploding the results into separate parts into an array
		
		// For each separate result, scrape the URL
		foreach ($separate_results as $separate_result) {
			if ($separate_result != "") {
				$results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title=");	// Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
			}
		}

		// Searching for a 'Next' link. If it exists scrape the url and set it as $url for the next loop of the scraper
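		if (strpos($results_page, "Next&nbsp;&raquo;") !== FALSE) {	// strpos() returns 0 for a match at the very start, so we compare against FALSE explicitly
			$continue = TRUE;
			$url = scrape_between($results_page, "<span class=\"pagination\">", "</span>");
			if (strpos($url, "Prev</a>") !== FALSE) {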
				$url = scrape_between($url, "Prev</a>", ">Next");
			}
			$url = "http://www.imdb.com" . scrape_between($url, "href=\"", "\"");
		} else {
			$continue = FALSE;	// Setting $continue to FALSE if there's no 'Next' link
		}
		sleep(rand(3,5));	// Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
	}
?>

First up, we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Next, we check to see if there is a “Next” link to another page of results. If there is, we scrape its URL and loop through the script again to scrape the results from the next page. The loop keeps visiting the next page and scraping its results until there are no more pages of results.
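To make the “Next” link step easier to test on its own, here is a minimal sketch that pulls it out into a function. The function name next_page_url() and the three sample pagination strings are made up for illustration; the real IMDb markup may differ, so treat this as a sketch of the idea rather than something to paste in unchanged.

```php
<?php
// scrape_between() as defined earlier in this series
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);         // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);           // Getting the position of $end
    return substr($data, 0, $stop);         // Returning only the data before $end
}

// Hypothetical helper: returns the absolute URL of the next results page, or FALSE if there isn't one
function next_page_url($results_page, $base_url) {
    if (strpos($results_page, "Next&nbsp;&raquo;") === FALSE) {
        return FALSE;   // No 'Next' link, so this is the last page
    }
    $pagination = scrape_between($results_page, "<span class=\"pagination\">", "</span>");
    if (strpos($pagination, "Prev</a>") !== FALSE) {
        // Skipping past the 'Prev' link so that we don't scrape its href by mistake
        $pagination = scrape_between($pagination, "Prev</a>", ">Next");
    }
    return $base_url . scrape_between($pagination, "href=\"", "\"");
}

// Made-up pagination markup for a first, middle and last page
$first_page  = '<span class="pagination"><a href="/search/title?genres=action&start=51">Next&nbsp;&raquo;</a></span>';
$middle_page = '<span class="pagination"><a href="/search/title?genres=action&start=1">&laquo;&nbsp;Prev</a> | <a href="/search/title?genres=action&start=101">Next&nbsp;&raquo;</a></span>';
$last_page   = '<span class="pagination"><a href="/search/title?genres=action&start=51">&laquo;&nbsp;Prev</a></span>';

var_dump(next_page_url($first_page, "http://www.imdb.com"));
var_dump(next_page_url($middle_page, "http://www.imdb.com"));
var_dump(next_page_url($last_page, "http://www.imdb.com"));
```

The while loop can then carry on for as long as this function returns something other than FALSE.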

Now we have an array containing all of the results URLs, which we can loop over with foreach() to visit each URL and scrape the data from it. I’ll leave that to you; with what we’ve covered so far, it should be easy to figure out.

I’ll get you started:

foreach($results_urls as $result_url) {
	// Visit $result_url (Reference Part 1)
	// Scrape data from page (Reference Part 1)
	// Add to array or other suitable data structure (Reference Part 2)
}

In the next post in the series I’ll post up the code you should have ended up with, and then we’ll cover downloading images and other files.

Up next time: Downloading Images And Files With PHP & CURL


52 thoughts on “Navigating And Scraping Multiple Pages With PHP & CURL [Part 3]”

  1. thank you so much for these posts. I really appreciate your well commented code and your explanations. This is really useful for me as I am trying to get a handle on screen scraping. I look forward to the next lessons. Very informative!

  2. Great tutorial, thanks. How can the “Searching for a ‘Next’ link” part be limited? Can I set it so that the “navigate to next” loop runs for just 5 clicks?

    Best Regards
    Peter

    1. Sorry, been away all summer, back now though. If you haven’t yet found a solution to this, let me know and I’ll post something up. Off the top of my head though, rather than testing for the existence of a next link [ if/while (strpos($results_page, “Nextwhatever”)) { ], use a for loop, something like:

      // While $i (the loop counter) is less than or equal to 5 (the number of times you want to navigate)
      for ($i = 1; $i <= 5; $i++) {
      	
      	/* RUN CODE AND SCRAPE STUFF IN HERE */
      
      }    // End the loop
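
      A fixed count on its own will keep requesting pages even after the results have run out, so it can help to combine the cap with the ‘Next’ link check. The sketch below assumes a made-up fake_curl() function that returns canned pages (so it runs without a network connection), a hypothetical visit_pages() helper and a simplified ‘Next’ link format; swap in the real curl() call and markup for an actual site.

```php
<?php
// Hypothetical stand-in for the curl() function, returning canned pages
function fake_curl($url) {
    $pages = array(
        "/page1" => "Results for page 1 <a href=\"/page2\">Next</a>",
        "/page2" => "Results for page 2 <a href=\"/page3\">Next</a>",
        "/page3" => "Results for page 3",   // Last page: no 'Next' link
    );
    return isset($pages[$url]) ? $pages[$url] : "";
}

// Hypothetical helper: visits at most $max_pages pages, stopping early if there is no 'Next' link
function visit_pages($url, $max_pages) {
    $visited = array();
    for ($i = 1; $i <= $max_pages; $i++) {
        $results_page = fake_curl($url);
        $visited[] = $url;

        /* RUN CODE AND SCRAPE STUFF IN HERE */

        // Looking for the 'Next' link; leaving the loop if there isn't one
        if (!preg_match('/<a href="([^"]+)">Next<\/a>/', $results_page, $match)) {
            break;
        }
        $url = $match[1];   // The captured href becomes the URL for the next iteration
    }
    return $visited;
}

print_r(visit_pages("/page1", 2));  // Stops at the cap of 2 pages
print_r(visit_pages("/page1", 5));  // Stops after 3 pages, as there are no more results
```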
      
      
  3. Hey Jacob, great job, it’s so helpful, but I’ve got a problem. I have a site to scrape which has links on two pages. I wrote this code and it gets stuck in an infinite loop: it gives results from both pages, but they are printed again and again. Here is my piece of code. Can you please help me see where I am going wrong?

    while ($continue == TRUE) {  
    $results_page = curl($url); // Downloading the results page using our curl() 
    $results_page = scrape_between($results_page, "", ""); 
    // Scraping out only the middle section of the results page that contains our results
    	
    $separate_results =explode("Fiche Fonds d'Investissement", $results_page);  
     // Exploding the results into separate parts into an array 
     // For each separate result, scrape the URL
     foreach ($separate_results as $separate_result) {
      if ($separate_result != "") { 	
      $results_urls[] = "http://www.cfnews.net" . scrape_between($separate_result, "href=\"", "&gt;");
             
            }
        }
    if (strpos($results_page, "Suivant &gt;&gt;")) {
              
    $url = scrape_between($results_page, "", "");
    
    if (strpos($url, "Précédent</a>")) 
    { 
    $url = scrape_between($url, "&lt;&lt; Précédent</a>", "&gt;Suivant &gt;&gt;");
      }
    $url = "http://www.cfnews.net" . scrape_between($url, "href=\"", "&gt;");
    } 
    else {
    $continue = FALSE;  // Setting $continue to FALSE if there's no 'Next' link
     }   
    print_r($results_urls);
        }
    
    1. Sorry. I’ve been away from the blog all summer. If you’ve yet to find a solution to this, let me know and I’ll post something up tomorrow 🙂
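
      Two things stand out in the snippet above. The print_r() call sits inside the while loop, so the growing array is printed on every pass. It is also possible (this is an assumption, as the site’s markup isn’t shown) that the last page still carries a “Suivant” link pointing back to a page that has already been scraped, in which case the loop never ends. One way to guard against that is to remember which URLs have been visited. A minimal sketch, using a made-up fake_curl() with two canned pages that link to each other, a hypothetical collect_pages() helper and a simplified link format:

```php
<?php
// Hypothetical stand-in for the curl() function: two canned pages whose 'Suivant' links point at each other
function fake_curl($url) {
    $pages = array(
        "/page1" => "Results for page 1 <a href=\"/page2\">Suivant</a>",
        "/page2" => "Results for page 2 <a href=\"/page1\">Suivant</a>",
    );
    return isset($pages[$url]) ? $pages[$url] : "";
}

// Hypothetical helper: follows 'Suivant' links, but never requests the same URL twice
function collect_pages($url) {
    $seen = array();
    $continue = TRUE;
    while ($continue == TRUE) {
        if (isset($seen[$url])) {
            break;  // We have already scraped this page, so following it again would loop forever
        }
        $seen[$url] = TRUE;
        $results_page = fake_curl($url);

        /* SCRAPE THE RESULTS FROM $results_page IN HERE */

        if (preg_match('/<a href="([^"]+)">Suivant<\/a>/', $results_page, $match)) {
            $url = $match[1];
        } else {
            $continue = FALSE;  // No 'Suivant' link, so we're done
        }
    }
    return array_keys($seen);
}

$pages = collect_pages("/page1");
print_r($pages);    // Printed once, after the loop has finished
```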

  4. Great well written tutorials, the best i have been able to find online, not sure if you got to publish part 4 onwards, unfortunately couldn’t find it anywhere.
    Thanks for the first 3 parts anyway!

    1. Jamie Great job. I am working on a copy of your scraper and wondered if you had cleaned it up at all for the results page.

      thanks
      Bill

  5. Hi. Jacob. Your postings are very useful to me.

    I’ve read your postings with the sample website
    http://www.imdb.com/search/title?at=0&genres=action&sort=moviemeter,asc&start=1

    Your code samples are only for extracting URL of each listing. How can I get full information for each listing? That mean, after getting all URLs. I need to go to each URL to extract detail information.
    And in this case, PHP will need to read a set of URLs one by one.
    How can I do this?

    Sorry for this question, I’m a student and a new guy to PHP/cURL.

    Thanks

    1. With the URLs for each listing that we have scraped and are now in an array, we need to iterate over each array item (URL), visit each URL and scrape the required data.

      For example, given the code in this blog post, now we have the URLs of all the listings pages we want to scrape from in array $results_urls.

      Say we want to scrape the title of each listing from its individual page. We would do something like:

      foreach($results_urls as $result_url) {
      
          $listings_page = curl($result_url);    // Retrieving listings page
          
          $listing_titles[] = scrape_between($listings_page, "<span class=\"itemprop\" itemprop=\"name\">", "</span>");    // Scraping the listing title and adding to array
      
          print_r($listing_titles);    // Printing out the array of titles on screen
      }
      

      Then it’s just a case of expanding on this to scrape all of the other items from the page.

      Hopefully that’s helped a bit. If there’s anything else you’re unclear of, don’t hesitate to ask!

  6. Hi. jacob
    I’ve made the test with your codes for the part 3.
    And I received the timeout message:

    Fatal error: Maximum execution time of 30 seconds exceeded in C:\Program Files\EasyPHP-12.1\home\test_cURL.php on line 18

    Can you give me an idea? Thanks

    1. The script is taking longer than the max_execution_time you have set in your php.ini

      You can change this by editing this line in your php.ini file:

      max_execution_time = 600    ; 600 seconds = 10 minutes
      

      Alternatively, you can use PHP’s set_time_limit() function in your script. Personally, I would do this.

      So, for the code above, edit it like so:

      foreach($results_urls as $result_url) {
      
          set_time_limit(60);    // Setting execution time to 1 minute for each iteration of the loop
      
          $listings_page = curl($result_url);    // Retrieving listings page
           
          $listing_titles[] = scrape_between($listings_page, "<span class=\"itemprop\" itemprop=\"name\">", "</span>");    // Scraping the listing title and adding to array
          
          sleep(rand(3,5));   // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
       
      }
      
      print_r($listing_titles);    // Printing out the array of titles on screen
      
  7. Thanks Jaco

    I’ve just added the time limit into the script. But it ran and didn’t stop.
    I only had a little adjustment for your original script.

    I don’t know what problem was. Can you leave me your email address?
    I’m going to send you the script. please help me take a look it.
    Thanks

  8. I don’t know what the problem is with your contact form.
    After clicking “Send”, it ran and didn’t stop.

    Did you receive the script?

  9. Jacob. You’re the man.

    Seeing these comments about a part 4 has made me very anxious, but you’ve laid out a perfect base framework here already. Personally I hope the next post focuses on scraping, but with the intent for meaningful posting to a database. I’m working on a web app right now, and these articles have accelerated the whole process tremendously. Thanks again.

    Btw you need a subscribe to comments plugin, and should be converting all these commenters to your newsletter. I’m probably going to come back eventually to check for it, but much rather get an email when it comes out.

    Take care

    1. Awesome, thanks for your comments. I’m not sure when the next part will be coming as I’m currently busy writing a book. But, I’ll be sure to let you know when I do get round to it!

  10. Jacob

    Your tutorial is by far the best I have come across. I am really excited to see more of this subject.

    I know you are busy but I think I just may be getting the hang of this “scraper” thing.

    Thanks
    Bill

  11. When scraping a site like “ebay”, the returned html is quite complex. I am wanting to extract data from the CentralPanel in code between the “tbody” tags.

    When I scrape using the examples (with tag modifications done), I only receive a single result. When I use the entire “” tags (for the variable $results_page, I get multiple returns.

    Is there specific formatting of the $separate_results array when multiple data points are accessed? If so, what are the recommended modifications to allow the results to be displayed (i.e. Image, Title, Date Posted, Date Closed, Price)?

    Thanks
    Bill

    1. If I understand your question correctly, you will need to recursively scrape the data into a multi-dimensional array. E.g.

      array(
      	'1' => array(
      		'image' => 'image',
      		'title' => 'title',
      		// etc...
      	),
      	'2' => array(
      		'image' => 'image',
      		'title' => 'title',
      		// etc...
      	),
      	'3' => array(
      		'image' => 'image',
      		'title' => 'title',
      		// etc...
      	),
      	// and so on...
      )
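
      As a rough illustration of how such an array might be built, the sketch below explodes a results page into rows and scrapes two fields from each one. The markup, the row delimiter and the field names are all made up for the example (they are not eBay’s), and the array here is indexed from 0 rather than '1'.

```php
<?php
// scrape_between() as defined earlier in this series
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);         // Stripping all data from before $start
    $data = substr($data, strlen($start));  // Stripping $start
    $stop = stripos($data, $end);           // Getting the position of $end
    return substr($data, 0, $stop);         // Returning only the data before $end
}

// Made-up results markup with two listings
$results_page = '<table>'
    . '<tr class="item"><img src="/img/one.jpg"><span class="title">First item</span></tr>'
    . '<tr class="item"><img src="/img/two.jpg"><span class="title">Second item</span></tr>'
    . '</table>';

$separate_results = explode('<tr class="item">', $results_page);
array_shift($separate_results);  // Discarding everything before the first listing

$items = array();
foreach ($separate_results as $separate_result) {
    // One inner array per listing, holding each data point under its own key
    $items[] = array(
        'image' => scrape_between($separate_result, '<img src="', '"'),
        'title' => scrape_between($separate_result, '<span class="title">', '</span>'),
    );
}

print_r($items);
```

      Each extra data point (date posted, price and so on) is then just another key in the inner array.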
      
  12. I need to send mms messages with curl through the web, my livehttp this, help me, thanks a lot
    http://vietteltelecom.vn/gui-tin-nhan/nhan-tin-mms

    POST /gui-tin-nhan/nhan-tin-mms HTTP/1.1
    Host: vietteltelecom.vn
    User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: vi-vn,vi;q=0.8,en-us;q=0.5,en;q=0.3
    Accept-Encoding: gzip, deflate
    Referer: http://vietteltelecom.vn/gui-tin-nhan/nhan-tin-mms
    Cookie: symfony=nlr0pn84bqops6rt1g19eqlqv2; __utma=242468485.1260269048.1372845291.1372845291.1372847975.2; __utmc=242468485; __utmz=242468485.1372845291.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=242468485.2.10.1372847975
    Connection: keep-alive
    Content-Type: multipart/form-data; boundary=—————————25964292022012
    Content-Length: 1643
    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[_csrf_token]”

    6a835f198000c5ef4c5d92e401a920d1
    —————————–25964292022012
    Content-Disposition: form-data; name=”selectMsisdnSms”

    01663588699
    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[type]”

    0
    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[receiver]”

    0974863863
    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[message]”

    thu nao
    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[attach]”; filename=””
    Content-Type: application/octet-stream

    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[attach1]”; filename=””
    Content-Type: application/octet-stream

    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[attach2]”; filename=””
    Content-Type: application/octet-stream

    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[attach3]”; filename=””
    Content-Type: application/octet-stream

    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[attach4]”; filename=””
    Content-Type: application/octet-stream

    —————————–25964292022012
    Content-Disposition: form-data; name=”vtp_send_mms_form[captcha]”

    0329
    —————————–25964292022012–

    HTTP/1.1 302 Found
    Date: Wed, 03 Jul 2013 16:54:51 GMT
    Server: Apache
    X-Powered-By: PHP/5.3.14
    Location: http://vietteltelecom.vn/gui-tin-nhan/nhan-tin-mms
    Cache-Control: max-age=60, private, must-revalidate
    Expires: Wed, 03 Jul 2013 16:56:51 GMT
    Vary: Accept-Encoding,User-Agent
    Content-Length: 121
    Connection: close
    Content-Type: text/html; charset=utf-8

    1. I took a quick look at this and it would appear that vtp_send_mms_form[_csrf_token] is a unique token that is generated that requires the user to have an account, which I do not have. Sorry I can’t help further.

      If you would like me to take a deeper look at this project for you and can provide me your details, feel free to contact me via my contact form (http://www.jacobward.co.uk/contact/).

  13. I ran the code that you have given at the top of the page, but it displays an error.

    Fatal error: Call to undefined function scrape_between() in C:\xampp\htdocs\SS\test3\index4.php on line 12

    1. As mentioned in the instructions, you need to include the scrape_between() and curl() functions from the previous post.

      // Defining the basic scraping function
          function scrape_between($data, $start, $end){
              $data = stristr($data, $start); // Stripping all data from before $start
              $data = substr($data, strlen($start));  // Stripping $start
              $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
              $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
              return $data;   // Returning the scraped data from the function
          }
      
      // Defining the basic cURL function
          function curl($url) {
              // Assigning cURL options to an array
              $options = Array(
                  CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
                  CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
                  CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
                  CURLOPT_CONNECTTIMEOUT => 120,   // Setting the maximum amount of time (in seconds) to wait while connecting
                  CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time (in seconds) for the whole cURL request to execute
                  CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
                  CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
                  CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
              );
               
              $ch = curl_init();  // Initialising cURL
              curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
              $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
              curl_close($ch);    // Closing cURL
              return $data;   // Returning the data from the function
          }
      
  14. great tutorial Jacob, been searching for this tutorial for days.

    Anyway, you said that you are working on a book, is that book/eBook about scraping? If yes, then let me know when that book available, I’m interested.

    one more thing , why not create a specific eBook or a course about creating affiliate sites through scraping? like movie affiliate sites, merchandise affiliate sites, or else.

  15. Jacob,

    How would you get all the results without just scraping the screen. I mean if the UI did not have a next button, but you knew that there were more results than just those displayed, how would you get the results still?

    1. I’m not entirely sure what you mean. How do you know there are more results if they are not displayed? How would you access them from a web browser?

      Please provide an example and I will take a look for you.

    1. Well, the data in a frame is just another HTML page. So we can access the source URL of the frame’s content, e.g. src=”http://example.com/the-page-to-scrape.htm” and scrape that directly, as we’ve been doing with other pages.

  16. Jacob,
    thanks for your nice tutorial, but wouldn’t it be easier with regex?
    Can you please advise on the pros and cons of your way against the regex method?

    many thanks
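
    For comparison, here is a rough sketch of what the regex route could look like, using PHP’s preg_match_all() on made-up markup shaped like the results rows in the post. The pattern and the sample HTML are assumptions for illustration only. The regex collects every match in one call and avoids the explode() step, but the pattern can be harder to read, and it is just as fragile as scrape_between() when the site’s markup changes.

```php
<?php
// Made-up results markup containing two listings
$results_page = '<td class="image"><a href="/title/tt0000001/" title="Example One"><img src="one.jpg"></a></td>'
              . '<td class="image"><a href="/title/tt0000002/" title="Example Two"><img src="two.jpg"></a></td>';

// Capturing the href of the first link inside each image cell
preg_match_all('/<td class="image">\s*<a href="([^"]+)" title=/i', $results_page, $matches);

$results_urls = array();
foreach ($matches[1] as $path) {
    $results_urls[] = "http://www.imdb.com" . $path;    // Appending the scraped path to the IMDb URL
}

print_r($results_urls);
```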

  17. Very helpful tutorial, this is the first really useful thing I’ve used php for.

    One thing your script needed for me was set_time_limit() to a huge number because IMDB has 117k titles in the action category…

  18. If you use the sleep/timers (randomized at 10 to 25 secs) for page navigation or page fetching is it necessary to use multiple proxies? The app I am building would do a search just like I was doing it, one page at a time, and navigate it at the pace I would in real life so that I don’t put too much stress on the bandwidth. Of course I would use the proper useragent and referral. Total page navigation would be from about 50 to 250 max over a period of time.

    Main objective is to automate what I already do, don’t need it to work any faster than me, just want to reduce my personal time spent, mainly cause I am lazy 😀

  19. You are awesome!

    I am a beginner and am spending the time to learn.

    Thank you for sharing.

    How can I tell the difference between PHP and Java code?

    Links removed due to potential spam.

    Please reply with any suggestions.

    Cordially,

    Jon Alex
