Web Scraping With PHP & CURL [Part 1]

Things have been a bit slow around here recently, so to keep things alive I figured I may as well start a series of posts.

As most of my freelancing work recently has been building web scraping scripts and/or scraping data from particularly tricky sites for clients, it would appear that scraping data from websites is extremely popular at the moment.

So, why not do a running series on using PHP with CURL for web data scraping?

We’ll start off simple, requesting and downloading a webpage and downloading images, then gradually move on to some more advanced topics, such as submitting forms (registration, login, etc.) and possibly even cracking captchas. In the end, we’ll roll everything we’ve learnt into one PHP class that can be used for quickly and easily building scrapers for almost any site.

So, first off, let’s write our first scraper in PHP and cURL to download a webpage:

<?php
	// Defining the basic cURL function
	function curl($url) {
		$ch = curl_init();	// Initialising cURL
		curl_setopt($ch, CURLOPT_URL, $url);	// Setting cURL's URL option with the $url variable passed into the function
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Setting cURL's option to return the webpage data
		$data = curl_exec($ch);	// Executing the cURL request and assigning the returned data to the $data variable
		curl_close($ch);	// Closing cURL
		return $data;	// Returning the data from the function
	}
?>

This function is then used like so:

<?php
	$scraped_website = curl("http://www.example.com");	// Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
?>
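
As a minimal sketch of how the function above could report failures, the variant below checks the return value of curl_exec(), which is FALSE when the request fails, and throws an exception carrying the curl_error() message. The name curl_checked() and the timeout values are assumptions made for illustration; they are not part of the original function.

```php
<?php
// Hypothetical variant of the curl() function above that reports failures
function curl_checked($url) {
	$ch = curl_init();	// Initialising cURL
	curl_setopt($ch, CURLOPT_URL, $url);	// Setting the URL to request
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Return the page data instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);	// Stop trying to connect after 5 seconds (arbitrary value)
	curl_setopt($ch, CURLOPT_TIMEOUT, 15);	// Abort the whole request after 15 seconds (arbitrary value)
	$data = curl_exec($ch);	// FALSE if the request failed
	if ($data === FALSE) {
		$error = curl_error($ch);	// Human-readable description of the last error
		curl_close($ch);
		throw new RuntimeException('cURL request failed: ' . $error);
	}
	curl_close($ch);	// Closing cURL
	return $data;	// Returning the page data
}

try {
	$scraped_website = curl_checked("http://www.example.com");
	echo 'Downloaded ' . strlen($scraped_website) . ' bytes';
} catch (RuntimeException $e) {
	echo 'Error: ' . $e->getMessage();
}
```

Throwing an exception keeps the calling code the same shape as before, while making a failed request impossible to mistake for an empty page.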

Ok, so now we know how to scrape the contents of a webpage. That’s the end of Part 1.

I know, it might not seem like a great deal has been accomplished here, but this basic function forms the basis of what we will be building on over the coming posts.

Up next time: Working With The Scraped Data

66 thoughts on “Web Scraping With PHP & CURL [Part 1]”

  1. Hi, I’m trying to email you via the Hire Me form but it’s not sending…

    My msg: Hi Jacob,

    Interested in knowing if you can make us a scraper to harvest freelancers’ profiles, such as skills, feedback etc., on freelance websites?

    Thanks,

    Kris

  2. Hi Jacob
    Not being very technical or programming-wise, I was wondering if you could help with some custom scripts for me, please.

    Best wishes

    Robert

    1. Sorry, due to my contract with my publisher, I can’t explicitly make a post about this. Though forthcoming posts will involve having to do this at some point, so you can extrapolate from there 😉

  3. Thank you for this scraping tutorial. However, the scraping doesn’t work…

    Fatal error: Call to undefined function curl_init()

    Any suggestions?

    1. If you can’t call curl_init() then it’s likely you don’t have the cURL extension installed or enabled on your server or in your development environment.

      It would be odd for a hosting provider not to have cURL. Are you running this from your own machine / development environment?
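
A quick way to confirm whether the extension is present is sketched below. The helper name curl_is_available() is made up for this example; extension_loaded() and function_exists() are standard PHP functions.

```php
<?php
// Hypothetical helper: TRUE when the cURL extension is loaded and curl_init() can be called
function curl_is_available() {
	return extension_loaded('curl') && function_exists('curl_init');
}

if (curl_is_available()) {
	echo 'cURL is available';
} else {
	echo 'cURL is missing: install or enable the PHP cURL extension, then restart the web server';
}
```

Running this from the same environment that produced the fatal error (web server or command line) matters, as the two can load different PHP configurations.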

    1. Yes, error handling is essential. But as an introduction to the subject, going into something like that is less enticing than actually getting down to ‘getting something done’.

      I personally use my own custom error handling class, which I may post up at some point in the future.

      In the meantime, you can use:

      	try {
      		// Code to execute
      	} catch (Exception $e) {
      		echo 'Error: ' . $e->getMessage();
      		// Action to perform on exception
      	}
      

      …as a basic alternative. It’s not really proper error handling, but I’m not willing to disclose some of my own code. I hope you understand.

  4. Hi, I’m trying to email you via the Hire Me form but it’s not sending…

    My msg: Hi Jacob,

    Do you know gscraper?

    We have 1000 web properties a week now that need 100,000 good backlinks submitted each within a week’s time

    “gscraper is not efficient for this”

    We need to be able to scrape raw lists, put them in your software, and have the software weed out the good links and submit the correct quantity of good links each week to the web properties we provide

    Each one getting the quantity we determine

    We are growing and will need this to be scalable to a much larger number of sites over time

    “this is mostly trackbacks, blog comments and guest book links”

    Thanks,

    Phil

  5. Hi Jacob
    The webpage which I want to scrape is displaying data from an API… How do I work with it?
    Any help on this?

    Thanks in advance

    1. Please provide an example and I’ll be able to give you some pointers. Ideally, rather than scraping the actual page, you’d send a request to the API and work with the returned data.
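
A rough sketch of the “request the API instead of the page” approach, assuming the API returns JSON. The function names decode_api_response() and fetch_api(), and the sample response body, are all made up for illustration; the real endpoint and its fields would come from the site in question.

```php
<?php
// Decode a JSON response body into an associative array.
// Returns NULL when the body is not a string or does not decode to a JSON object/array.
function decode_api_response($body) {
	if (!is_string($body)) {
		return NULL;	// curl_exec() returns FALSE on failure
	}
	$decoded = json_decode($body, TRUE);	// TRUE = decode objects as associative arrays
	return is_array($decoded) ? $decoded : NULL;
}

// Request a JSON endpoint with cURL and decode the result (the URL is supplied by the caller)
function fetch_api($url) {
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
	curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/json'));	// Ask for JSON
	$body = curl_exec($ch);
	curl_close($ch);
	return decode_api_response($body);
}

// Offline demonstration using a made-up response body
$sample = '{"symbol": "ABC", "price": 12.5}';
$data = decode_api_response($sample);
echo $data['symbol'] . ': ' . $data['price'];	// Prints "ABC: 12.5"
```

Working with the decoded array avoids parsing HTML altogether, which is usually less fragile than scraping the rendered page.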

  6. Hi. Can you please tell me how to scrape paginated data from a website and store it in a database using PHP?

    1. Due to contractual obligations with my publisher I can’t explicitly write a post on this, although I’m sure I’ll write a post in the future that will have this involved in it.

  7. I wonder if you could advise. I do a lot of ordering via the Amazon website. The products I order are very similar, but it’s boring and repetitive.

    Would I be able to use cURL to go between the various Amazon forms, completing the fields they need (for example, address etc.), and then to complete the order?

    I am trying to learn cURL, and so far I have found a neat PHP script using cURL to log in to my Amazon account and get to the home screen.

    But I thought I would ask your advice before trying to proceed any further.

    1. Yes, it should be possible to accomplish what you want with cURL.

      Although, you might have a better experience using Amazon’s API, so it
      might be worth having a look into that?

    1. I’m not going to personally coach you. Check out some of the posts here to get a start and if you have any specific questions, leave them in the comments so I can answer them and other people get to see the Q&A.

  8. It ran with a blank display and I had to add
    echo $scraped_website; to see the scraped page. I need to get the URLs on every page of the website to add to a database. Can you help?

  9. I copied the same code but it’s giving an error

    Call to undefined function scrape_between() in /opt/lampp/htdocs/curl.php on line 25

  10. Hi Mr. Jacob. First of all, let me thank you for the tutorial. I have tried the tutorial above, but when I run the code I only get the text. The CSS and images are not included. I also need to retrieve the CSS and images, basically the whole website. Can you tell me what is going wrong? Thanks.

    1. You would need to retrieve the CSS files and download the images. Retrieving the CSS files should be easy, just look through the source code for their URLs and grab them.

      Downloading the images is fairly simple, however I can’t write a blog post or tutorial on it as there is a chapter on it in my book on PHP web scraping and the contract with the publishers prevents me from doing so.
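
For the CSS part, “looking through the source code for their URLs” could be sketched as below, using PHP’s built-in DOMDocument class to collect the href of every stylesheet link. The function name extract_stylesheet_urls() and the sample HTML are assumptions for illustration; relative URLs would still need resolving against the page’s address before being requested.

```php
<?php
// Hypothetical helper: return the href of every <link rel="stylesheet"> in an HTML string
function extract_stylesheet_urls($html) {
	$urls = array();
	if (!is_string($html) || $html === '') {
		return $urls;	// Nothing to parse
	}
	$doc = new DOMDocument();
	libxml_use_internal_errors(TRUE);	// Real-world HTML is rarely valid; keep parser warnings quiet
	$doc->loadHTML($html);
	libxml_clear_errors();
	foreach ($doc->getElementsByTagName('link') as $link) {
		$rel = strtolower($link->getAttribute('rel'));
		$href = $link->getAttribute('href');
		if ($rel === 'stylesheet' && $href !== '') {
			$urls[] = $href;
		}
	}
	return $urls;
}

// Offline demonstration with made-up markup
$sample = '<html><head><link rel="stylesheet" href="/css/main.css"><link rel="icon" href="/favicon.ico"></head><body></body></html>';
print_r(extract_stylesheet_urls($sample));
```

Each URL returned can then be passed to the curl() function from the post to download the stylesheet itself.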

  11. Thanks for the great post. I’d heard that cURL is good, so now I know how to use it (I’ve used Python’s BeautifulSoup before, so was looking for something similar).

  12. Thanks for the posts; they are very informative. I am not new to PHP, but I am working on a new project, and maybe you can tell me if I’m on the right path. I want to be able to scan all of the pages in a website for keywords, and log them in a database. Is a web crawler/scraper the appropriate tool for this project, or should I be looking into something else? Thanks for your input

  13. Hi Jacob! I would like to get in contact with you about a web scraping question but the form isn’t working. Please contact me on my email!

  14. I am using it to get the daily stock exchange number, but when I run it, it shows me a blank Array( )

    Here is my code

    function callIt() {
        $curl = curl_init("http://www.bseindia.com/");
        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10;)");
        curl_setopt($curl, CURLOPT_FAILONERROR, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($curl);
        curl_close($curl);
        preg_match('/\(.*)\/', $html, $name);
        print_r($name);
    }

    echo callIt();

    1. To be honest with you, I didn’t even get your code to run. I just received a Warning: preg_match(): No ending delimiter ‘/’ found… error.

      Your regex is malformed. Please could you tell me what you’re trying to match? You’re escaping your opening parentheses, rendering the closing parentheses invalid and you’re escaping your closing slash with a dangling backslash. I don’t think it will match much of anything as it currently is.

      1. Oh, wait, there is an issue with the comment form: it strips out some tags. There is a div in the pattern, which is why you are getting the error.

        preg_match('/\(.*)\/', $html, $name);

    1. Since the regex is likely the problem anyway, with nothing being returned, how about you just tell me what it is you want to match and I’ll just write the regex for you?

        1. First off, I want to say that using regex to parse a div like that is not optimal – Parsing the DOM and selecting the div using XPath or a similar solution would be much better.

          Second, your array is returned empty because there is no value between those tags in the HTML of the page. The values are added to the DOM via a jQuery AJAX request, which obviously isn’t executed when using cURL, as it doesn’t run the JavaScript/jQuery. Take a look at this post about scraping AJAX requests, which may be of some help.
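
To illustrate the XPath suggestion, here is a minimal sketch using PHP’s DOMDocument and DOMXPath classes. The function name select_div_text(), the id "quote" and the sample markup are all hypothetical, since the actual div from the page was stripped from the comment. The id is assumed not to contain double quotes.

```php
<?php
// Hypothetical helper: return the trimmed text of the first <div> with the given id,
// or NULL if no such div exists. Assumes $id contains no double quotes.
function select_div_text($html, $id) {
	if (!is_string($html) || $html === '') {
		return NULL;
	}
	$doc = new DOMDocument();
	libxml_use_internal_errors(TRUE);	// Suppress warnings caused by imperfect HTML
	$doc->loadHTML($html);
	libxml_clear_errors();
	$xpath = new DOMXPath($doc);
	$nodes = $xpath->query('//div[@id="' . $id . '"]');
	if ($nodes === FALSE || $nodes->length === 0) {
		return NULL;
	}
	return trim($nodes->item(0)->textContent);
}

// A div whose value is present in the HTML
echo select_div_text('<div id="quote">27,000.15</div>', 'quote');	// Prints "27,000.15"

// A div that is only filled in later by JavaScript comes back as an empty string
var_dump(select_div_text('<div id="quote"></div>', 'quote'));
```

The second call shows the situation described above: the div exists in the downloaded HTML, but it has no content until the browser runs the AJAX request.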

  15. Hey Jacob, I never, and I mean never, reply on blog comments; I normally use and abuse the info and bounce. But I purchased your book on PHP web scraping, as I liked your posts and found them helpful, and WOW. I used “traversing multiple pages” in a project of my own on another URL source and managed to scrape all the pagination links, then cURL’d those links and got all the way down to each individual item. So I went a step further than your example to get all the item pages, and I am so happy. I’ve been working with APIs and you have given me some great tools for scraping; I never knew how to do multiple searches using cURL. VERY COOL, BIG THANKS. Don’t read the stupid Amazon review, the guy clearly was dumb as F****. This book is GREAT.

    1. Yes it is. Install a web browser plugin called Live HTTP Headers, or watch the headers being sent another way, when you submit the search form. You should then see what parameters and arguments are being passed with the request. You can then use these in your cURL script to emulate a search.

      If you need help with this, I have other posts on this site which go into more detail. Alternatively, reply with a link to the site you need to scrape and I will take a look for you.
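
As a rough sketch of emulating a search once the parameters are known, the snippet below builds the request URL with http_build_query(). It assumes the form is submitted with GET; the function name build_search_url(), the URL and the parameter names (q, page) are placeholders, and the real ones are whatever the observed request shows.

```php
<?php
// Hypothetical helper: append URL-encoded query parameters to a base URL
function build_search_url($base_url, array $params) {
	return $base_url . '?' . http_build_query($params, '', '&');
}

// Placeholder URL and parameter names; substitute the ones seen in the real request
$url = build_search_url('http://www.example.com/search', array('q' => 'php curl', 'page' => 2));
echo $url;	// Prints "http://www.example.com/search?q=php+curl&page=2"
```

The resulting URL can be passed straight to the curl() function from the post. For a form submitted with POST, the same parameter array would instead go into the CURLOPT_POSTFIELDS option.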
