Navigating And Scraping Multiple Pages With PHP & CURL [Part 3]

If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.

But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?

Well, that’s what we’ll be covering today: using PHP and cURL to navigate the results pages, scrape multiple pages of the website for data and organise that data into a logical structure for further use.

So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.

If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk you through them after.

	$continue = TRUE;	// Assigning a boolean value of TRUE to the $continue variable
	$url = "";	// Assigning the URL we want to scrape to the variable $url
	$results_urls = array();	// Initialising the array that will hold the scraped results URLs
	// While $continue is TRUE, i.e. there are more search results pages
	while ($continue == TRUE) {
		$results_page = curl($url);	// Downloading the results page using our curl() function

		$results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">");	// Scraping out only the middle section of the results page that contains our results
		$separate_results = explode("<td class=\"image\">", $results_page);	// Exploding the results into separate parts in an array
		// For each separate result, scrape the URL
		foreach ($separate_results as $separate_result) {
			if ($separate_result != "") {
				$results_urls[] = "" . scrape_between($separate_result, "href=\"", "\" title=");	// Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
			}
		}
		// Searching for a 'Next' link. If it exists scrape the URL and set it as $url for the next loop of the scraper
		if (strpos($results_page, "Next&nbsp;&raquo;")) {
			$continue = TRUE;
			$url = scrape_between($results_page, "<span class=\"pagination\">", "</span>");
			if (strpos($url, "Prev</a>")) {
				$url = scrape_between($url, "Prev</a>", ">Next");
			}
			$url = "" . scrape_between($url, "href=\"", "\"");
		} else {
			$continue = FALSE;	// Setting $continue to FALSE if there's no 'Next' link
		}
		sleep(rand(3, 5));	// Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
	}

First up we retrieve the initial results page. Then we scrape all of the results and add them to the $results_urls array. Next we check whether there is a “Next” link to another page of results; if there is, we scrape its URL and loop round to repeat the scraping of results on the next page. The loop keeps visiting the next page and scraping its results until there are no more pages of results.
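If you don’t have the earlier parts to hand, the two helpers the script leans on can be sketched roughly as follows. Treat these as a reminder of their shape rather than the canonical Part 1 versions: curl() is a thin wrapper around PHP’s cURL extension and scrape_between() is a simple substring helper.

```php
<?php
// Sketch of the curl() helper assumed above: download a page and return
// its body as a string instead of echoing it.
function curl($url) {
	$ch = curl_init($url);	// Initialise a cURL session for $url
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Return the page instead of printing it
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow any redirects
	$data = curl_exec($ch);	// Execute the request
	curl_close($ch);	// Close the session
	return $data;
}

// Sketch of the scrape_between() helper: return whatever sits between the
// first occurrence of $start and the next occurrence of $end, or an empty
// string if either marker is missing.
function scrape_between($data, $start, $end) {
	$pos = stripos($data, $start);	// Locate the opening marker (case-insensitive)
	if ($pos === FALSE) {
		return "";
	}
	$data = substr($data, $pos + strlen($start));	// Strip everything up to and including $start
	$stop = stripos($data, $end);	// Locate the closing marker
	if ($stop === FALSE) {
		return "";
	}
	return substr($data, 0, $stop);	// Keep only the text before $end
}
```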

Now we have an array containing all of the results URLs, which we can loop over with a foreach() to visit each URL and scrape the data. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.

I’ll get you started:

foreach ($results_urls as $result_url) {
	// Visit $result_url (Reference Part 1)
	// Scrape data from page (Reference Part 1)
	// Add to array or other suitable data structure (Reference Part 2)
}
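As an illustration of how that skeleton might be filled in, here is one possible shape. The "&lt;title&gt;" / "&lt;/title&gt;" markers below are placeholders rather than IMDb’s real attribute markup, so inspect the page source and swap in the markers that actually surround each attribute you want; the curl() and scrape_between() helpers are the ones from the earlier parts.

```php
<?php
// Illustrative sketch only: the "<title>" / "</title>" markers are
// placeholders, not IMDb's real markup. Assumes the curl() and
// scrape_between() helpers from Parts 1 and 2 are already defined.

// Scrape the attributes we care about out of one downloaded movie page.
// Extend with further scrape_between() calls for release date,
// description, director and so on.
function scrape_movie_page($page) {
	return array(
		"title" => scrape_between($page, "<title>", "</title>"),
	);
}

$scraped_data = array();	// One entry per movie page
if (!empty($results_urls)) {	// $results_urls comes from the pagination loop above
	foreach ($results_urls as $result_url) {
		$page = curl($result_url);	// Visit $result_url (Reference Part 1)
		$movie = scrape_movie_page($page);	// Scrape data from the page (Reference Part 1)
		$movie["url"] = $result_url;	// Keep the source URL alongside the data
		$scraped_data[] = $movie;	// Add to a suitable data structure (Reference Part 2)
		sleep(rand(3, 5));	// Same politeness delay as before
	}
}
```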

In the next post in the series I’ll post up the code you should have got and then we’ll cover downloading images and other files.

Up next time: Downloading Images And Files With PHP & CURL
