Web Scraping With PHP & CURL [Part 1]

Things have been a bit slow around here recently, so to keep things alive I figured I may as well start a series of posts.

Most of my recent freelance work has been building web scraping scripts or pulling data from particularly tricky sites for clients, so it seems that scraping data from websites is extremely popular at the moment.

So, why not do a running series on using PHP with CURL for web data scraping?

We’ll start off simple: requesting and downloading a webpage, then downloading images, before gradually moving on to some more advanced topics, such as submitting forms (registration, login, etc.) and possibly even cracking captchas. In the end, we’ll roll everything we’ve learnt into one PHP class that can be used for quickly and easily building scrapers for almost any site.

So, first off, let’s write our first scraper in PHP and CURL to download a webpage:

	// Defining the basic cURL function
	function curl($url) {
		$ch = curl_init();	// Initialising cURL
		curl_setopt($ch, CURLOPT_URL, $url);	// Setting cURL's URL option with the $url variable passed into the function
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Setting cURL's option to return the webpage data
		$data = curl_exec($ch);	// Executing the cURL request and assigning the returned data to the $data variable
		curl_close($ch);	// Closing cURL
		return $data;	// Returning the data from the function
	}

The function is then used like so:

	$scraped_website = curl("http://www.example.com");	// Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
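
One thing worth knowing about this basic function: curl_exec() returns FALSE when the request fails (a bad URL, a timeout, no connection), so $scraped_website can end up holding FALSE rather than a webpage. Here is a minimal sketch of a variant that reports the failure instead of passing it along silently. The name curl_checked() and the 10 second connection timeout are my own assumptions, chosen purely for illustration:

```php
<?php
// Hypothetical variant of the curl() function: same request, but failures are reported
function curl_checked($url) {
	$ch = curl_init();	// Initialising cURL
	curl_setopt($ch, CURLOPT_URL, $url);	// Setting the URL to request
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Returning the webpage data instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);	// Assumed value: give up connecting after 10 seconds
	$data = curl_exec($ch);	// FALSE on failure, the webpage data otherwise
	if ($data === FALSE) {
		$error = curl_error($ch);	// Reading the error message before closing the handle
		curl_close($ch);	// Closing cURL
		throw new RuntimeException("cURL request failed: " . $error);
	}
	curl_close($ch);	// Closing cURL
	return $data;	// Returning the data from the function
}
```

Calling it inside a try/catch block that catches RuntimeException lets a scraper log the error message and move on to the next URL, rather than trying to work with FALSE as if it were a webpage.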

Ok, so now we know how to scrape the contents of a webpage. That’s the end of Part 1.

I know, it might not seem like a great deal has been accomplished here, but this basic function forms the basis of what we will be building on over the coming posts.

Up next time: Working With The Scraped Data
