Using Proxies For Scraping With PHP & cURL

We’ve all been there: You finally get your scraper working perfectly, you run your short tests and all your data comes back as expected, so you let your scraper go wild. Then BAM! Errors are being thrown left, right and centre, and you realise the target site is blocking your scraper.

Usually the problem is that your IP has been blocked for requesting too many pages in too short a space of time. The solution? Use proxies.

Using proxies allows your requests to go through many different IP addresses and thus appear to be coming from different visitors.

You could use free proxies from one of the many free proxy lists, but you’re going to run into issues pretty quickly: slow speeds, long connection times, poor anonymity, or the proxies simply going down between you collecting them and actually using them.

The best option is to use private proxies from a provider, such as my personal favorite My Private Proxy.

So, we’ve got our proxies sorted; now how do we go about using them in our PHP scraper scripts? Fortunately, this is made very easy by PHP’s cURL library.

Code and Walkthrough: Using Proxies For Scraping With PHP & cURL

  1. First, we need to load our list of proxies into an array. Depending on your source of proxies, they will need to be entered in a variety of formats; the most common ones are shown in the code below.
    $proxies = array();	// Declaring an array to store the proxy list
    
    // Adding list of proxies to the $proxies array
    $proxies[] = 'user:password@173.234.11.134:54253';	// Some proxies require user, password, IP and port number
    $proxies[] = 'user:password@173.234.120.69:54253';
    $proxies[] = 'user:password@173.234.46.176:54253';
    $proxies[] = '173.234.92.107';	// Some proxies only require IP
    $proxies[] = '173.234.93.94';
    $proxies[] = '173.234.94.90:54253';	// Some proxies require IP and port number
    $proxies[] = '69.147.240.61:54253';
    
  2. Next up, we select a random proxy from our list to use.
    // Choose a random proxy
    if (!empty($proxies)) {	// If the $proxies array actually contains proxies (isset() alone would be TRUE even for an empty array), then
    	$proxy = $proxies[array_rand($proxies)];	// Select a random proxy from the array and assign it to the $proxy variable
    }
    
  3. Now, after initialising our cURL handle, we set the CURLOPT_PROXY option of cURL to our randomly selected proxy, set all our other cURL options, then execute the request and close the handle. (A consolidated, runnable version of all three steps follows this walkthrough.)
    
    $ch = curl_init();	// Initialise a cURL handle
    
    // Setting proxy option for cURL
    if (isset($proxy)) {	// If the $proxy variable is set, then
    	curl_setopt($ch, CURLOPT_PROXY, $proxy);	// Set CURLOPT_PROXY with proxy in $proxy variable
    }
    
    // Set any other cURL options that are required
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $url = 'http://example.com/';	// Target URL to scrape (example placeholder - set this to your own target)
    curl_setopt($ch, CURLOPT_URL, $url);
    
    $results = curl_exec($ch);	// Execute a cURL request
    curl_close($ch);	// Closing the cURL handle
    

And there we have it, our first cURL request using a proxy. There are a number of other techniques which can further mask the source of your scrapers’ requests which I hope to cover in future posts, so stay tuned.
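
For convenience, here is the walkthrough above assembled into a single runnable sketch. The proxies.txt filename, the example URL, and the added connection timeout are placeholder choices of mine rather than part of the original walkthrough; substitute your own values.

    $url = 'http://example.com/';	// Example target URL (placeholder)
    
    // Load the proxy list from a text file, one proxy per line (proxies.txt is a placeholder filename)
    $proxies = file('proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    
    $ch = curl_init();	// Initialise a cURL handle
    
    if (!empty($proxies)) {	// If any proxies were loaded, then
    	curl_setopt($ch, CURLOPT_PROXY, $proxies[array_rand($proxies)]);	// Set a randomly selected proxy
    }
    
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);	// Fail fast if the proxy is dead
    curl_setopt($ch, CURLOPT_URL, $url);
    
    $results = curl_exec($ch);	// Execute the request
    
    if ($results === FALSE) {	// With CURLOPT_RETURNTRANSFER set, curl_exec() returns FALSE on failure
    	echo 'cURL error: ' . curl_error($ch);	// Report what went wrong
    }
    
    curl_close($ch);	// Close the cURL handle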

65 thoughts on “Using Proxies For Scraping With PHP & cURL”

  1. Would suggest that you do a post on using PHP cURL and regex if possible, ASAP. Alternatively, please send me any well-commented code on this.
    Will greatly appreciate it!

  2. How do you know if the proxy is successful in connecting? Does this code work reliably? Ain’t you got any test case examples? Tired of these half-hearted tutorials leaving you hanging; anyone can copy and paste this from the manual…

    1. Thanks for the comments. I make no promises or guarantees with my posts, and if you don’t like them, feel free to go elsewhere or make a suggestion as to how they could be improved.

      With regards to your questions:
      1. Does this code work reliably? Yes, it does.
      2. …ain’t you got any test case examples? This isn’t really possible, because proxies are always going down, so it would be impossible to keep an up-to-date list here. There are plenty of other sites that do, as I mentioned in the post.
      3. How do you know if the proxy is successful in connecting? The cURL library does not currently have any built-in function to check for a successful connection to a proxy. The proxy’s response is written to the output, along with the rest of the request. E.g.

      HTTP/1.0 200 <-- proxy's response
      HTTP/1.0 200 OK <-- remote server's response
      

      Though, even this is conditional, so it’s not feasible to reliably check for a successful connection to the proxy server (a rough heuristic is sketched below).

      I hope this answers your questions.
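
      As that rough heuristic (with the above caveats, and assuming the $ch handle from the walkthrough): checking curl_errno() after curl_exec() at least distinguishes “couldn’t connect at all” from “connected, but the remote server returned an error”. With a proxy set, a connect failure usually points at the proxy itself.

      $results = curl_exec($ch);	// Execute the request through the proxy
      
      if (curl_errno($ch) == CURLE_COULDNT_CONNECT) {	// cURL couldn't connect at all
      	echo 'Connection failed (proxy likely down): ' . curl_error($ch);
      } else {
      	$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);	// Status code of the final response
      	echo 'Request completed with HTTP status ' . $httpCode;
      }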

  3. What are the differences between Shared Proxies and Private Proxies on My Private Proxy?

    I get that private ones are only used by you and are randomized monthly, but I need a solution that has constantly changing IP addresses, all the time, not just once a month. Do I get that with Shared Proxies?

    In other words, is shared a megapool, or is it just 50 IP addresses that you happen to be sharing with others?

    1. With My Private Proxy, as you state, private proxies are used only by you and are randomised monthly, while shared proxies are ones you happen to be sharing with other My Private Proxy clients.

      To have constantly randomised private proxies is not feasible, given the cost, though if you have the funds you could buy thousands and randomly cycle through them.

      Alternatively, you could connect to the Tor Network using Privoxy, which would provide you with what you need. This is a slightly more complicated process than is outlined in this post, so I will try to write a tutorial on this in the coming week.
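
      In the meantime, a minimal sketch of the Tor route, assuming a local Tor client listening on its default SOCKS port 9050 (if you go through Privoxy instead, point CURLOPT_PROXY at Privoxy’s default HTTP port 8118 and drop the CURLOPT_PROXYTYPE line):

      $ch = curl_init('http://example.com/');	// Example target URL (placeholder)
      curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050');	// Local Tor SOCKS proxy (default port)
      curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);	// SOCKS5, with DNS resolved through Tor (use CURLPROXY_SOCKS5 on older PHP builds)
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
      $results = curl_exec($ch);	// The request now exits via a Tor exit node
      curl_close($ch);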

  4. “Tired of these half-hearted tutorials leaving you hanging; anyone can copy and paste this from the manual…”
    lol

    These scraper tutorials are the best I’ve seen anywhere. Easy to follow and super useful for what I wanted. Thank you for making them.

  5. I’m completely new to the scraping world… Reading your articles helped me a great deal. You are doing an awesome job. Keep it up, sir.
    And thanks a lot for helping developers like me! 🙂

  6. I have problems also :S

    I tested this and other code for more than 10 hours; all I was able to do was fetch a webpage with a public proxy.

    When I try my paid (and tested) private proxies I get “could not connect to host” and response code 0.

    Here are the debug vars where I see that:

    $results['INF'] = curl_getinfo($ch);
    $results['ERR'] = curl_error($ch);

    I also tried a different way of passing user/pass:

    curl_setopt($ch, CURLOPT_PROXYUSERPWD, "username:password");

    Everything fails. I don’t know what else I can do. Any ideas? 🙂

  7. Thanks for this article.
    I have a special need: I am using Google App Engine as infrastructure, so all my requests are made through their filters, and unfortunately using a proxy the classic way, as described in this article, is forbidden.
    I am wondering if there is a proxy that offers HTTP rewriting, i.e. sending the target URL in a POST message to the proxy; the proxy connects to this URL, gets the page, and answers the initial request with the page it fetched.
    Any suggestion is welcome; thanks in advance.

    1. I have no experience of App Engine, so I can’t provide any specific details or any kind of answer relating to that.

      But your description of what you require is exactly what a proxy does.

      Maybe I’m misunderstanding; if you could explain in more detail, I may be able to help further.

  8. Dear Jacob Ward!
    This is great work, but there is a problem with this code: if the proxy is from a country whose language is not English, it does not show the characters correctly and misspells some words and characters…
    Kindly make a solution for this if possible…

    Thanks

  9. Hi Jacob,
    Good to see your superb coding guide for new developers. Friend, I have an issue when scraping data from a website: I am sending multiple continuous requests at a single time, anywhere between 1 and 1,000 at once. Some of the time my scraper stops scraping data from the website, or my IP gets blocked. Can you suggest a proper solution to the IP-blocking issue? I have put a lot of time into this project, and currently this problem occurs some of the time, so a better and more reliable solution would be good for me…
    Thanks,
    From Dinesh Kumawat

    1. Well, yes, using proxies is a good start. But if you’re making up to 1,000 requests at a time, you’re going to need tens of thousands of proxies.

      You could code your app to navigate the website as a human would, which is time-consuming and convoluted, but if it’s important, it works. Wait a short random period before making new requests, like sleep(rand(2, 5)), which would ‘pause’ your app for between 2 and 5 seconds before making the next request (sketched below). Changing your user agent to a search engine crawler’s, like Google’s or Bing’s, could also work (depending on whether the site identifies search engine crawlers by IP or by user agent).

      Why are you making up to 1,000 asynchronous requests? If you’re using cURL Multi, this isn’t really going to be much faster than making a smaller number of requests, since it waits until the longest request has completed before it moves on to the next batch of URLs, which is going to take ages. Imagine opening 1,000 tabs in your web browser all at the same time and think how long it’s going to take for all the pages to load…
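
      To make those two suggestions concrete, a minimal sketch (the URL list is a placeholder; the user-agent string shown is Googlebot’s published one):

      $urls = array('http://example.com/page1', 'http://example.com/page2');	// Placeholder URLs
      
      foreach ($urls as $url) {
      	$ch = curl_init($url);
      	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
      	curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');	// Crawler user agent
      	$results = curl_exec($ch);
      	curl_close($ch);
      
      	sleep(rand(2, 5));	// 'Pause' for a random 2-5 seconds before the next request
      }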

    1. I’m not sure of the error you’re getting, so I can’t really answer with any certainty. I have no experience of using WAMP, but if it’s like any other stack with PHP, you should be able to edit your php.ini file to enable cURL.

  10. Jacob, not sure if you’re still supporting this article, but I wanted to ask.

    I’m using MAMP and testing by scraping my own website, which is on shared hosting. What I am trying to do is change the $_SERVER['REMOTE_ADDR'] value which I am logging on the live website. Using your script as-is, it’s not giving me any of the randomly selected IPs. Is that the purpose of your example here? What could be wrong?

    1. Jacob, I’m an idiot 🙂 The proxies that you supplied might have been deleted, because after visiting the free website you gave, I found some real active proxies, and testing on my own website worked.

      Since we are talking about spoofing IP addresses using cURL, is there a way for the “other side” to see my real server address behind the proxy?

      1. Technically, yes. Practically, no.

        Personally, it’s not something I’d necessarily worry about under usual circumstances. There’s only edge cases where I’d really be concerned, and in those cases it takes more than just a proxy to deal with the issue anyway.

        1. What I found out is, practically speaking, if the proxy server doesn’t send the HTTP X-Forwarded-For header, you should be OK. So you just need to find good anonymous proxies to use. Just one more question:

          Using cURL, what if one of the proxies in your array gets blacklisted? When I tried using your array directly, those proxies weren’t active anymore and my real server IP address was being sent. Is there a way to test whether the proxy server is online before execution?

          1. You are essentially correct.

            The proxies in my array were never live; they were just there as examples of how different proxies need to be set up.

            Actually testing whether a proxy is live is another matter entirely, since we would need to make a request ‘using’ said proxy and check the headers.

            Too much to explain in a comment response – search Google and you’ll see what I mean, you likely won’t even find a viable solution to what you want to do – but I’ll add it to my list of posts I need to write. So make sure to keep checking back.

            I’ll ping you in this response when I publish the post.

        2. Ok I’ll keep an eye out for your future posts.

          As far as checking whether the proxy is live: as a workaround, I was thinking of making a blank page on my server and sending a cURL request to that page through a proxy, then checking the header. If my server address came back, then the proxy didn’t work; if the proxy address came back, then that proxy is live. Something like that.
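
          A minimal sketch of that idea, using httpbin.org/ip as a stand-in for a page on your own server (any page that echoes the requesting IP will do; the proxy address below is a placeholder):

          $proxy = '173.234.94.90:54253';	// Placeholder proxy to test

          $ch = curl_init('http://httpbin.org/ip');	// Returns the requesting IP as JSON
          curl_setopt($ch, CURLOPT_PROXY, $proxy);
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
          curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);	// Don't hang on a dead proxy
          $response = curl_exec($ch);
          curl_close($ch);

          if ($response !== FALSE && strpos($response, '173.234.94.90') !== FALSE) {
          	echo 'Proxy is live, and its IP is what the target sees';	// The echoed IP matches the proxy
          } else {
          	echo 'Proxy is down, or a different IP leaked through';
          }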

  11. Hi Jacob, thank you for the tutorial and the idea of using array_rand to pick a random IP from the list. I have a quick question about proxies: do I need to buy a proxy service that will get me proxy IPs for my cURL requests, or can I get them free and use them to send out my requests?

    1. Whoops! I previously just read the code, not the intro. I hope you understand we coders do this. Thank you for elaborating on every aspect. I need to ask one more thing, if you happen to know about it:
      1. When a proxy IP is provided, does the request travel through that proxy IP tunnel?
      2. Will the remote/target server also reply through that proxy IP tunnel?

  12. I’m sorry, I’m a newbie here. What exactly does ‘scraping’ mean? And what is the result of your script above? 🙂 Sorry for such a noob question.

  13. Sorry, I got a little busy with my work and couldn’t reply back…
    Here is what I get; I’m not sure the request is even going out. Can you please help me with some thoughts?

    Array
    (
        [url] =>
        [content_type] =>
        [http_code] => 0
        [header_size] => 0
        [request_size] => 0
        [filetime] => -1
        [ssl_verify_result] => 0
        [redirect_count] => 0
        [total_time] => 5.008
        [namelookup_time] => 0
        [connect_time] => 0
        [pretransfer_time] => 0
        [size_upload] => 0
        [size_download] => 0
        [speed_download] => 0
        [speed_upload] => 0
        [download_content_length] => -1
        [upload_content_length] => -1
        [starttransfer_time] => 0
        [redirect_time] => 0
        [redirect_url] =>
        [primary_ip] =>
        [certinfo] => Array ( )
        [primary_port] => 0
        [local_ip] =>
        [local_port] => 0
        [errno] => 28
        [errmsg] => Connection timed out after 5008 milliseconds
        [headers] =>
        [content] =>
    )

  14. Just to add to my above comment: it works fine unless I add this option:
    curl_setopt($handle, CURLOPT_PROXY, '69.7.113.4');
    I got this IP from your suggested free proxy website, and I can ping it.
    Even if I override my own local IP, it gives me the same error, so there could be something I have to enable.
