Scraping Putlocker.bz Links to Movies

Maybe I was a bit of a bitch, promising some free scraping classes and then lunching it out, so here’s one to tide you over while you wait for my next posts. Essentially, it scrapes all of the alternative movie links for a given title from putlocker.bz.

Disclaimer: I in no way endorse the illegal watching or downloading of movies. Go buy the fucking DVD or watch it on Netflix or whatever…

It’s a very much simplified version of a part of something I was working on for a client recently.

<?php

	class Putlocker {
	
		public function searchMovieLinks($title) {
			
			// urlencode() turns spaces into + and also escapes characters such as & and #
			$query = urlencode(htmlspecialchars_decode($title, ENT_QUOTES));
			
			$searchPage = $this->curlGet('http://putlocker.bz/search/search.php?q=' . $query);
			
			$searchPageXPath = $this->returnXPathObject($searchPage);
			
			$searchPageLinks = $searchPageXPath->query('//div[@class="content-box"]/table[last()]/*/*/a/@href');
			$searchPageNames = $searchPageXPath->query('//div[@class="content-box"]/table[last()]/*/*/div/a');
			
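			$moviePageLink = null;	// Stays null if no search result matches the title
			$wantedTitle = trim(strtolower(htmlspecialchars_decode($title, ENT_QUOTES)));
			
			// Use < rather than <= so that item($i) never runs past the end of the node list
			for ($i = 0; $i < $searchPageLinks->length; $i++) {
				$nameNode = $searchPageNames->item($i);
				if ($nameNode !== null && trim(strtolower($nameNode->nodeValue)) == $wantedTitle) {
					$moviePageLink = $searchPageLinks->item($i)->nodeValue;
				}
			}
			
			// Without a matching result there is no movie page to download
			if ($moviePageLink === null) {
				throw new Exception('No search result matched the title: ' . $title);
			}
			
			$moviePage = $this->curlGet($moviePageLink);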
			$moviePageXPath = $this->returnXPathObject($moviePage);
			$moviePageLinks = $moviePageXPath->query('//td[@class="entry"]/a/@href');
			
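			$movieLinks = array();	// Initialised so an empty array is returned when no links are found
			
			// Start at index 2, skipping the first two matches on the page
			for ($i = 2; $i < $moviePageLinks->length; $i++) {
				$movieLinks[] = $moviePageLinks->item($i)->nodeValue;
			}
			
			return $movieLinks;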
				
        }
        
		// Method to return XPath object
		public function returnXPathObject($item) {
			$xmlPageDom = new DomDocument();	// Instantiating a new DomDocument object
			@$xmlPageDom->loadHTML($item);	// Loading the HTML from downloaded page
			$xmlPageXPath = new DOMXPath($xmlPageDom);	// Instantiating new XPath DOM object
			return $xmlPageXPath;	// Returning XPath object
		}	
		
		// Method for making a GET request using cURL
		public function curlGet($url) {
			$ch = curl_init();	// Initialising cURL session
			// Setting cURL options
			curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Returning transfer as a string
			curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
			curl_setopt($ch, CURLOPT_URL, $url);	// Setting URL
			$results = curl_exec($ch);	// Executing cURL session
			curl_close($ch);	// Closing cURL session
			return $results;	// Return the results
		}
			
	}

Save it as putlocker.php

To use it:

<?php

include('putlocker.php');

$putlocker = new Putlocker();

$title = 'whatever movie you want';

$links = array();	// Default value, so the check below still works if the search throws

try {
	$links = $putlocker->searchMovieLinks($title);
} catch (Exception $e) {
	// Add your error handling class in here...
}

if ($links) {
	print_r($links);
}

?>

…and you’ll get an array of the ‘alternative movie links’ from putlocker.bz.

See, I told you it was pointless posting uncommented code (well, where the only comments are from previously posted methods) without any reference or instructions to the new methods…now you can scrape a few links, but you probably don’t know how you did it.
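To make up for that a little, here is a minimal sketch of what the XPath part of the class is doing, run against a hard-coded snippet of HTML instead of a downloaded page. The markup and the example.com URLs are made up for illustration and are only assumed to resemble the real page; extractHrefs() is a hypothetical helper, while the query is the same one used for $moviePageLinks in the class.

```php
<?php

// Hypothetical stand-in for a downloaded movie page (the real markup may differ)
$html = '<html><body><table>'
	. '<tr><td class="entry"><a href="http://example.com/mirror-1">Mirror 1</a></td></tr>'
	. '<tr><td class="entry"><a href="http://example.com/mirror-2">Mirror 2</a></td></tr>'
	. '<tr><td class="other"><a href="http://example.com/about">About</a></td></tr>'
	. '</table></body></html>';

// Load an HTML string and return the values of every node matched by an XPath query
function extractHrefs($html, $query) {
	$dom = new DOMDocument();
	libxml_use_internal_errors(true);	// Collect parser warnings instead of printing them
	$dom->loadHTML($html);
	libxml_clear_errors();
	
	$xpath = new DOMXPath($dom);
	$values = array();
	foreach ($xpath->query($query) as $node) {
		$values[] = $node->nodeValue;	// For an @href match this is the link itself
	}
	return $values;
}

// Only links inside <td class="entry"> cells are matched
$hrefs = extractHrefs($html, '//td[@class="entry"]/a/@href');
print_r($hrefs);
```

The query reads as: find every td element with class "entry", step into its direct a children and return their href attributes. The "About" link sits in a cell with a different class, so it is left out.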

Ideas For Web Scraping & Automation Projects Or Posts

My blog has remained rather stagnant for the last few months, in part due to uni work, but mostly because I can’t think of any interesting topics to cover or tutorials to write. I’ve so far covered some really basic web scraping topics and gone slightly more in depth in my book, but I don’t know what it is you want to learn more about.

So, all suggestions are currently welcome! Please leave them in the comments section and if they’re appropriate I’ll cover them in future posts. So far I have a few ideas (feel free to leave some feedback):

  • Auto-tweeting images from RSS feeds – So far, none of the automation tools out there support natively tweeting images to Twitter. They all use third-party URL shortening services, which means your image doesn’t appear in your Twitter stream or in your ‘Photos and videos’ page of your Twitter profile. This pisses me off and I’m working on something to accomplish this.
  • OOP PHP programming – With the basics down, I think from now on all my posts and tutorials are going to use Object-Oriented Programming (OOP), as I do in my personal and clients’ projects. It’s far easier and cleaner when working on larger projects, and it makes it easier to scale our applications as we add more features. This will also entail using classes such as DOMDocument and PDO, among others, to make our applications more robust and easier to maintain.
  • Automating and scraping AJAX – With more websites than ever now using AJAX, I think automating and scraping these using PHP might be an interesting project to cover.
  • Basic captcha ‘cracking’ – I know this may be a somewhat ‘grey area’ topic, but I’ll approach it from a neutral perspective: using PHP and Optical Character Recognition (OCR) to crack basic, but commonly implemented, captchas.

If any of these topics take your fancy, or there’s something else you want covered, leave your responses in the comments below.

Experimentally Proving Hooke’s Law

Jacob Ward (jjw3g12 ID:25579363) “I am aware of the requirements of good academic practice and the potential penalties for any breaches”

Introduction

Hooke’s Law is a scientific law concerning itself with the elasticity of materials. It states that when a force is applied to a material, the displacement of that material will be directly proportional to the force applied.

Hooke’s Law equation is written as:

F = -kx
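where F is the restoring force exerted by the material in newtons (N); it is equal in size and opposite in direction to the force applied, hence the minus sign.
where k is the spring constant, a measure of the material’s stiffness, in newtons per metre (N/m).
where x is the displacement of the material in metres (m).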

Hooke’s Law applies so long as the material is within its elastic limit. When enough force has been applied to extend the material beyond its elastic limit, the material is in its plastic range, where applying further force causes permanent displacement of the material.

In this experiment three materials will be used to determine their behaviour according to Hooke’s Law, two within their elastic limit and one in its plastic range.


Apparatus and Method

Fig 1. Diagram of experiment apparatus

The apparatus required to carry out the experiment, as shown in Fig 1, is:

  • Clamps
  • Frame / Ring Stand
  • Material to be tested (Spring)
  • Known Mass
  • Metre Rule

In order to carry out the experiment, the apparatus is first set up as shown in Fig 1, followed by these steps for each material being tested:

  1. The top of the material to be tested is attached to the clamp at the top, so that it hangs parallel to the metre rule.
  2. A known mass is attached to the bottom of the material, causing the material to be displaced.
  3. The material’s displacement is measured using the metre rule and noted down.

Data and Analysis

Shown below in Fig 2 are the results collected after performing the experiment and analysing the data, where y1 is material 1, y2 is material 2 and z is material 3.

Fig 2. Table showing results and analysis of data
x (force applied, N) | y1 (deformation, mm) | y2 (deformation, mm) | z (deformation, mm)
1.00 | 3.00 | 2.26 | 2.38
2.00 | 4.50 | 4.32 | 9.38
3.00 | 6.00 | 6.37 | 28.38
4.00 | 7.50 | 8.43 | 65.38
5.00 | 9.00 | 10.49 | 126.38
6.00 | 10.50 | 12.55 | 217.38
7.00 | 13.00 | 14.61 | 344.38
8.00 | 14.00 | 16.67 | 513.38
9.00 | 15.00 | 18.72 | 730.38

As shown in Fig 3, OpenOffice was used to analyse and calculate the data.

Fig 3. Functions and formulae used to calculate y2 and z
Value | Math function | OpenOffice formula
y2 | f(x) = (a + 0.5)x + c | =(1.5583333333+0.5)*(A2:A10)+0.2
z | f(x) = x^3 + b | =(A2:A10^3)+1.375

where c = 0.2 (as given on data sheet).
where a = 1.5583333333 (as calculated using OpenOffice).
where b = 1.375 (as calculated using OpenOffice).
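
The values of a and b match what an ordinary least-squares straight-line fit through the y1 column gives. Assuming that is how OpenOffice arrived at them (the report does not name the function used), the calculation can be sketched in PHP as follows. linearFit() is a hypothetical helper written for this illustration, not part of the original analysis.

```php
<?php

// Force applied (N) and measured deformation of material 1 (mm), taken from Fig 2
$x  = array(1, 2, 3, 4, 5, 6, 7, 8, 9);
$y1 = array(3.00, 4.50, 6.00, 7.50, 9.00, 10.50, 13.00, 14.00, 15.00);

// Ordinary least-squares fit of y = slope * x + intercept
function linearFit(array $x, array $y) {
	$n = count($x);
	$meanX = array_sum($x) / $n;
	$meanY = array_sum($y) / $n;
	
	$sxy = 0.0;	// Sum of (x - meanX) * (y - meanY)
	$sxx = 0.0;	// Sum of (x - meanX) squared
	for ($i = 0; $i < $n; $i++) {
		$sxy += ($x[$i] - $meanX) * ($y[$i] - $meanY);
		$sxx += pow($x[$i] - $meanX, 2);
	}
	
	$slope = $sxy / $sxx;
	$intercept = $meanY - $slope * $meanX;
	return array($slope, $intercept);
}

list($a, $b) = linearFit($x, $y1);
printf("a = %.4f, b = %.4f\n", $a, $b);	// a = 1.5583, b = 1.3750
```

Here the slope works out as 93.5 / 60 and the intercept as 1.375, which agrees with the values quoted for a and b.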

Shown in Fig 4 and Fig 5 are graphs of the results plotted with trend lines.

Fig 4. Graph with trend lines plotting materials y1 and y2 against x

Fig 5. Graph with a trend line plotting z against x

From looking at Fig 4 it is possible to estimate the meeting point of y1 and y2 as (2.30, 5.00).

By resolving the simultaneous equations, the precise meeting point of y1 and y2 is calculated as:

y1 = 1.5583x + 1.375
y2 = 2.0583x + 0.2

2.0583x + 0.2 = 1.5583x + 1.375
2.0583x - 1.5583x = 1.375 - 0.2

0.5x = 1.175

x = 2.35

y = (a + 0.5)x + c
y = (1.5583 + 0.5) * 2.35 + 0.2

y = 5.037

Actual meeting point: (2.35, 5.037)
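
As a cross-check on the working above, the same meeting point can be found with a few lines of PHP. This is only a sketch: intersectLines() is a hypothetical helper, and the gradients and intercepts are the rounded trend-line values used in the equations above.

```php
<?php

// Find where y = m1 * x + c1 meets y = m2 * x + c2
// Returns array(x, y), or null if the lines are parallel
function intersectLines($m1, $c1, $m2, $c2) {
	if ($m1 == $m2) {
		return null;	// Parallel lines never meet (or are the same line)
	}
	$x = ($c1 - $c2) / ($m2 - $m1);	// From m2 * x + c2 = m1 * x + c1
	$y = $m2 * $x + $c2;	// Substitute x back into the second line
	return array($x, $y);
}

// y1 = 1.5583x + 1.375 and y2 = 2.0583x + 0.2
list($x, $y) = intersectLines(1.5583, 1.375, 2.0583, 0.2);
printf("x = %.2f, y = %.3f\n", $x, $y);	// x = 2.35, y = 5.037
```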


Conclusions

From the results obtained by this experiment it is possible to confirm that Hooke’s Law holds true.

A linear relationship between the force applied, x, and the displacement, y, is shown for materials 1 and 2, shown in the results as y1 and y2, respectively, showing that they are both within their elastic limit.

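Material 2 has a steeper trend line than material 1, which shows that less force is required to produce a given displacement; in other words, it is less stiff and has a lower spring constant.

The non-linear trend line for material 3, represented by z, shows that displacement is no longer proportional to the force applied, which is consistent with the material being permanently displaced and within its plastic range. Strictly, the z values follow the cubic function given in Fig 3 rather than an exponential one.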

Despite the results of this experiment showing what Hooke’s Law predicts, there are a number of possible areas of error which could lead to inaccurate results, such as:

  • The accuracy of the metre rule being used to take measurements.
  • Parallax error from the person performing the experiment reading the measurement from an angle.
  • The mass being used to exert force on the material not being accurately measured.
  • Rounding numbers down when resolving the simultaneous equations would lead to slightly inaccurate results.

If conducted again, it would be advantageous to minimise these potential sources of error in the experiment to achieve more accurate results.


Instant PHP Web Scraping Book Now Available!

If you’ve been following me on Twitter or contacted me privately, it’s likely you know this day has been approaching. Instant PHP Web Scraping was published on 26th July and is now available to buy!

For those that don’t already know, the content of the book is essentially where I had originally intended to head with the Web Scraping With PHP & CURL series I started. Aimed at novice PHP programmers who are new to web scraping, it will guide readers through the basics and provide a tool set to complete a number of web scraping tasks and give a firm basis for further learning on the subject.

NOTE: This book is intended to serve as a brief introduction to web scraping with PHP. I was under strict instruction and constraints by the publisher. The target audience of this book is the absolute beginner. If you have experience working with PHP, cURL, MySQL, etc… this book is not for you.

The book is available as an ebook from Packt Publishing or as a paperback from Amazon. In addition to the recipes contained in the book, there are also a number of bonus recipes which will be available online for anybody who has purchased the book, providing even more coverage of the subject matter. I will also be setting up an online forum here, where anybody who has read the book can post questions or ask for help from me personally.

Win A Free Copy!

Packt Publishing have 3 free copies of Instant PHP Web Scraping in ebook format which you can win. I will be putting a competition together in the coming days, so stay tuned to find out how to enter and be in with a chance to win!

Own A Website And Want A Free Copy?

If you own a website or blog and would like to review this book, please send me your details via my contact form and I will respond asap with full details.

Book Overview

Who this book is for

This book is aimed at those new to web scraping, with little or no previous programming experience. Basic knowledge of HTML and the Web is useful, but not necessary.

What you will learn from this book

  • Scrape and parse data from web pages using a number of different techniques
  • Create custom scraping functions
  • Download and save images and documents
  • Retrieve and scrape data from emails
  • Save scraped data into a MySQL database
  • Submit login and file upload forms
  • Use regular expressions for pattern matching
  • Process and validate scraped data
  • Crawl and scrape multiple pages of a website

In Detail

With the proliferation of the web, there has never been a larger body of data freely available for common use. Harvesting and processing this data can be a time consuming task if done manually. However, web scraping can provide the tools and framework to accomplish this with the click of a button. It’s no wonder, then, that web scraping is a desirable weapon in any programmer’s arsenal.

Instant Web Scraping With PHP How-to uses practical examples and step-by-step instructions to guide you through the basic techniques required for web scraping with PHP. This will provide the knowledge and foundation upon which to build web scraping applications for a wide variety of situations, such as data monitoring, research and data integration, all relevant to today’s online data-driven economy.

On setting up a suitable PHP development environment, you will quickly move to building web scraping applications. Beginning with a simple task of retrieving a single web page, you will then gradually build on this by learning various techniques for identifying specific data, crawling through numerous web pages to retrieve large volumes of data, and processing then saving it for future use. You will learn how to submit login forms for accessing password protected areas, along with downloading images, documents, and emails. Learning to schedule the execution of scrapers achieves the goal of complete automation, and the final introduction of basic object-oriented programming (OOP) in the development of a scraping class provides the template for future projects.

Armed with the skills learned in the book, you will be set to embark on a wide variety of web scraping projects.

Approach

Filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks. Short, concise recipes to learn a variety of useful web scraping techniques using PHP.

Table of contents

  • Preparing your development environment (Simple)
  • Making a simple cURL request (Simple)
  • Scraping elements using XPath (Simple)
  • The custom scraping function (Simple)
  • Scraping and saving images (Simple)
  • Submitting a form using cURL (Intermediate)
  • Traversing multiple pages (Intermediate)
  • Saving scraped data to a database (Intermediate)
  • Scheduling scrapes (Simple)
  • Building a reusable scraping class (Advanced)
  • + online bonus content covering a number of other topics!

Using Proxies For Scraping With PHP & cURL

We’ve all been there: You finally get your scraper working perfectly, you run your short tests and all your data is being returned as expected, so you let your scraper go wild. Then BAM! Errors are being thrown left, right and center, and you realise the target site is blocking your scraper.

Usually the problem is that your IP has been blocked for requesting too many pages in too short a space of time. The solution? Use proxies.

Using proxies allows your requests to go through many different IP addresses and thus appear to be coming from different visitors.

You could use free proxies from one of the many free proxy lists, but you’re going to run into issues pretty quickly, such as proxy speed, connection time, level of anonymity or the proxies simply going down between you collecting them and actually using them.

The best option is to use private proxies from a provider, such as my personal favorite, My Private Proxy.

So, we’ve got our proxies sorted, now how do we go about using proxies in our PHP scraper scripts? Fortunately this is made very easy using PHP’s cURL library.

Code and Walkthrough: Using Proxies For Scraping With PHP & cURL

  1. First, we need to load our list of proxies into an array. Depending on your source of proxies, they will need to be entered in a variety of different ways. The most common ones are shown in the code below.
    $proxies = array();	// Declaring an array to store the proxy list
    
    // Adding list of proxies to the $proxies array
    $proxies[] = 'user:password@173.234.11.134:54253';	// Some proxies require user, password, IP and port number
    $proxies[] = 'user:password@173.234.120.69:54253';
    $proxies[] = 'user:password@173.234.46.176:54253';
    $proxies[] = '173.234.92.107';	// Some proxies only require IP
    $proxies[] = '173.234.93.94';
    $proxies[] = '173.234.94.90:54253';	// Some proxies require IP and port number
    $proxies[] = '69.147.240.61:54253';
    
  2. Next up, we select a random proxy from our list to use.
    // Choose a random proxy
    if (!empty($proxies)) {	// If the $proxies array exists and contains items, then
    	$proxy = $proxies[array_rand($proxies)];	// Select a random proxy from the array and assign to $proxy variable
    }
    
  3. Now, after initialising our cURL handle, we set the CURLOPT_PROXY option of cURL to our randomly selected proxy, set all our other cURL options, then execute the request and close the handle.
    
    $ch = curl_init();	// Initialise a cURL handle
    
    // Setting proxy option for cURL
    if (isset($proxy)) {	// If the $proxy variable is set, then
    	curl_setopt($ch, CURLOPT_PROXY, $proxy);	// Set CURLOPT_PROXY with proxy in $proxy variable
    }
    
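    // Set any other cURL options that are required
    curl_setopt($ch, CURLOPT_HEADER, FALSE);	// Leave the response headers out of the returned string
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);	// Skip SSL certificate checks (insecure, so only use it if you accept the risk)
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);	// Start a new cookie session
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Return the transfer as a string
    curl_setopt($ch, CURLOPT_URL, $url);	// $url must already hold the address of the page to request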
    
    $results = curl_exec($ch);	// Execute a cURL request
    curl_close($ch);	// Closing the cURL handle
    

And there we have it, our first cURL request using a proxy. There are a number of other techniques which can further mask the source of your scrapers’ requests which I hope to cover in future posts, so stay tuned.
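
Putting the three steps together, here’s a rough sketch of a reusable function that picks a random proxy for each request and tries again with another random pick if the request fails. The function names pickProxy() and curlGetViaProxy(), the retry count and the timeout values are all assumptions made for illustration, so adjust them to suit your own proxies and target site.

```php
<?php

// Pick a random proxy from the list, or return null if the list is empty
function pickProxy(array $proxies) {
	if (empty($proxies)) {
		return null;
	}
	return $proxies[array_rand($proxies)];
}

// Make a GET request through a randomly chosen proxy, retrying up to $maxAttempts times
// Returns the page as a string, or false if every attempt failed
function curlGetViaProxy($url, array $proxies, $maxAttempts = 3) {
	for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
		$ch = curl_init();	// Initialise a cURL handle
		
		$proxy = pickProxy($proxies);
		if ($proxy !== null) {
			curl_setopt($ch, CURLOPT_PROXY, $proxy);	// Route this request through the proxy
		}
		
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Return the transfer as a string
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
		curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);	// Give up connecting after 10 seconds (assumed value)
		curl_setopt($ch, CURLOPT_TIMEOUT, 30);	// Give up on the whole request after 30 seconds (assumed value)
		curl_setopt($ch, CURLOPT_URL, $url);
		
		$results = curl_exec($ch);	// Returns false on failure, e.g. a dead proxy
		curl_close($ch);
		
		if ($results !== false) {
			return $results;
		}
	}
	return false;
}
```

Calling curlGetViaProxy('http://www.example.com/', $proxies) with the $proxies array from step 1 would then return the page, or false if none of the attempts got through. Because each attempt picks at random, the same dead proxy can be chosen twice; removing failed proxies from the list would be a sensible next step.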