Using Proxies For Scraping With PHP & cURL

We’ve all been there: You finally get your scraper working perfectly, you run your short tests and all your data is being returned as expected, so you let your scraper go wild. Then BAM! Errors are being thrown left, right and center, and you realise the target site is blocking your scraper.

Usually the problem is that your IP has been blocked for requesting too many pages in too short a space of time. The solution? Use proxies.

Using proxies allows your requests to go through many different IP addresses and thus appear to be coming from different visitors.

You could use free proxies from one of the many free proxy lists, but you're going to run into issues pretty quickly: slow speeds, long connection times, poor anonymity, or the proxies simply going down between you collecting them and actually using them.

The best option is to use private proxies from a provider, such as my personal favorite My Private Proxy.

So, we’ve got our proxies sorted, now how do we go about using proxies in our PHP scraper scripts? Fortunately this is made very easy using PHP’s cURL library.

Code and Walkthrough: Using Proxies For Scraping With PHP & cURL

  1. First, we need to load our list of proxies into an array. Depending on your source of proxies, the entries will take slightly different forms. The most common formats are shown in the code below (the addresses here are placeholders from the documentation range; substitute your own proxies).
    $proxies = array();	// Declare an array to store the proxy list
    // Add the list of proxies to the $proxies array
    $proxies[] = 'user:password@192.0.2.1:8080';	// Some proxies require user, password, IP and port number
    $proxies[] = 'user:password@192.0.2.2:8080';
    $proxies[] = 'user:password@192.0.2.3:8080';
    $proxies[] = '192.0.2.4';	// Some proxies only require IP
    $proxies[] = '192.0.2.5';
    $proxies[] = '192.0.2.6:8080';	// Some proxies require IP and port number
    $proxies[] = '192.0.2.7:8080';
  2. Next up, we select a random proxy from our list to use.
    // Choose a random proxy
    if (!empty($proxies)) {	// If the $proxies array contains items, then
    	$proxy = $proxies[array_rand($proxies)];	// Select a random proxy from the array and assign it to the $proxy variable
    }
  3. Now, after initialising our cURL handle, we set the CURLOPT_PROXY option of cURL to our randomly selected proxy, set all our other cURL options, then execute the request and close the handle.
    $ch = curl_init();	// Initialise a cURL handle
    // Set the proxy option for cURL
    if (isset($proxy)) {	// If the $proxy variable is set, then
    	curl_setopt($ch, CURLOPT_PROXY, $proxy);	// Set CURLOPT_PROXY with the proxy in the $proxy variable
    }
    // Set any other cURL options that are required
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);	// Skips SSL certificate checks; only disable this if you understand the risk
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);	// $url holds the address of the page to scrape
    $results = curl_exec($ch);	// Execute the cURL request
    curl_close($ch);	// Close the cURL handle
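
Putting the three steps together, the whole flow can be wrapped in a small helper that also retries with a different proxy when a request fails. This is just a sketch under a few assumptions: the function name, the retry limit and the ten-second timeout are my own choices, and the proxy addresses are placeholders.

```php
<?php
// Fetch a URL through a randomly chosen proxy, retrying with a
// different random proxy if the request fails.
function fetch_via_proxy($url, array $proxies, $maxAttempts = 3)
{
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $proxy = $proxies[array_rand($proxies)];    // Pick a random proxy each attempt

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_PROXY, $proxy);        // Route the request through the proxy
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // Give up on slow or dead proxies
        curl_setopt($ch, CURLOPT_URL, $url);

        $result = curl_exec($ch);
        $error  = curl_error($ch);
        curl_close($ch);

        if ($result !== FALSE) {
            return $result;     // Success: hand back the page contents
        }
        // Otherwise note the failure and loop round with another proxy
        error_log("Proxy $proxy failed: $error");
    }
    return FALSE;   // Every attempt failed
}

// Example usage (placeholder proxy list and URL):
$proxies = array('user:password@192.0.2.1:8080', '192.0.2.2:3128');
$html = fetch_via_proxy('http://example.com/', $proxies);
```

Checking `curl_exec()` for `FALSE` before trusting the result is worth doing even without retries, since a dead proxy fails silently otherwise.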

And there we have it, our first cURL request using a proxy. There are a number of other techniques which can further mask the source of your scrapers’ requests which I hope to cover in future posts, so stay tuned.

