bbp tweet: A WordPress bbPress Plugin

bbPress plugin to automatically tweet new topics and replies.

Download Plugin | Fork on GitHub | Donate


Screenshots: admin panel settings, OAuth authorisation, account dropdown, and connected accounts.



Future Functionality

  • Add support for custom prefix/suffix options for tweets.
  • Add stats tracking for tweet favourites, retweets, click-throughs to topics/replies, etc…
  • Add support for selecting which forums to include/exclude.

If you enjoy using this plugin and wish to support further development, you can send me a beer.

Reverse Engineering JavaScript Encryption Functions To Scrape Email Addresses


Disclaimer: I do not condone spamming (sending unsolicited emails). The information here is provided purely for educational purposes and to highlight problems with solutions implemented to try to combat the scraping of email addresses.

Since obfuscating data in this way goes against both the W3C's and Sir Tim Berners-Lee's goals of creating a Semantic Web, I have no issues discussing how to un-obfuscate it, so we'll have no discussions about ethics in the comments, thank you. [Updated] Ok, we'll have some discussion of ethics if you want.


Oftentimes when developing a website, a webmaster is smart enough to take into account the fact that web bots and scrapers will come across their site looking for information. One piece of information they really don't want scraped is email addresses, because many web scrapers harvest email addresses to spam.

Despite this, it is often desirable to display email addresses for human visitors, so a workaround is required. There are a number of methods in common usage, but today we'll be looking at encrypting email addresses using JavaScript encryption functions for frontend display, and at why this is not a viable solution for combating web scrapers.
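To give a feel for what these workarounds look like, here's a sketch of one of the simplest obfuscation techniques in common usage: encoding each character of the address as a decimal HTML entity. This is illustrative only, and not the method used by the site we examine below. Browsers render the entities normally, but a naive scraper grepping the raw HTML for name@domain patterns won't match them.

```php
<?php
// One common (and weak) obfuscation method: encode each character of the
// email address as a decimal HTML entity. Illustrative only; this is not
// the method used by the site examined below.
function obfuscateEmail($email) {
    $entities = '';
    foreach (str_split($email) as $char) {
        $entities .= '&#' . ord($char) . ';';
    }
    return $entities;
}

echo obfuscateEmail('a@b.com');    // &#97;&#64;&#98;&#46;&#99;&#111;&#109;
?>
```

Of course, html_entity_decode() reverses this in a single call, which is exactly why methods like this offer no real protection.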

For today's example we'll be looking at this wedding planning website, which goes to some lengths to try and protect its users' email addresses from bots whilst still displaying them for human visitors.

Reverse Engineering The Encryption Function

In our example the webmaster has gone to great lengths to hide the site's email addresses. However, as always, we can find a way around this.

To start off with, looking at the code we can see that there is a table with the class name of company-email. It would be logical then for our web bot to assume that any email data is going to be held and displayed within this table. And, if we look, we see the following:

<table class='company-email'><tr><td width='80px' valign='top'><script>shemword();</script></td><td><SCRIPT>sw('moc_yyz_rehpargotohpbytteb.ytteb.com');</SCRIPT></td></tr></table>

…two table cells with <script> </script> tags in them making calls to the functions shemword(); and sw();, with the second function passing a single parameter of moc_yyz_rehpargotohpbytteb.ytteb.com.

Now, although it may seem obvious that the second function is the one displaying the user's email address, we need to be thinking like a web bot here and try to reverse engineer both of these functions – both could contain useful data that we may want to scrape and store.

First we’ll start with the shemword(); function, which when we go through the linked script files we find to be:

function shemword()
{
    document.write(String.fromCharCode(69,109,97,105,108));
}

This is writing something out into the document for display.

The function String.fromCharCode(); is one of the first we should have in our reverse engineering class. What the function does is take, as arguments, a series of numeric character codes (here, plain ASCII codes) and convert them into a string of ordinary characters.

There is no direct equivalent function in PHP which takes a series of ASCII codes and converts them, although there is chr(), which will take a single ASCII code and return a single character. Using this function, we can take a string of codes, convert it to an array, and iterate over the individual codes in our method.

Our reverse encryption method for this use case should look something like:


class ReverseEncrypt {

    public static function stringFromCharCode( $arr_char_codes ) {
        return implode( array_map( 'chr', $arr_char_codes ) );
    }

}


// Call this method on a comma separated string
$str_shemword_chars = '69,109,97,105,108';

$str_shemword_decrypted = ReverseEncrypt::stringFromCharCode( explode( ',', $str_shemword_chars ) );

echo $str_shemword_decrypted;   // Outputs "Email"

When we run the previous information through it we get the result of "Email".

Having that returned as the data from the first cell of our table is great, because logic (and we should have this logic coded into our scraper somewhere) would have us assume that the other cell contains an actual email address.

Moving on to the second cell we hit the sw('moc_yyz_rehpargotohpbytteb.ytteb.com'); function, which when we find in the code we see to be:

function sw(t)
{
  t=t.substring(0,t.length-4)

  r=String.fromCharCode(95,121,121,122,95);

  while (r.indexOf("#") > -1)
    r=r.replace("#","");

  while (t.indexOf(".") > -1)
    t=t.replace(".","@");

  while (t.indexOf(r) > -1)
    t=t.replace(r,".");

  var s=""
  var l=t.length;

  for (i=0;i<=l;i=i+1)
  {
      s=s + t.charAt(l-i);
  }

  document.write('<a href=\'mailto:' + s + '?subject=Enquiry from theweddingplanner\'>' + s + '</a>');
}

This also appears to be writing something out to the page, and a quick look at that write reveals the obvious mailto: in there. Bingo, we have a winner!

We should now be able to add another method to our ReverseEncrypt class. Going through line-by-line:

First, the length of the encrypted string passed to the function is evaluated, 4 is subtracted from it, and a substring of the original string with those last four characters removed is returned.

t=t.substring(0,t.length-4)
$t = substr( $t, 0, strlen( $t ) - 4 );

Here we are converting charcodes again.

r=String.fromCharCode(95,121,121,122,95);
$r = implode( array_map( 'chr', array('95', '121', '121', '122', '95') ) );

You could also use our previously defined method to accomplish this, like so:

$r = self::stringFromCharCode( array('95', '121', '121', '122', '95') );

Here we have a while loop with the condition that as long as the # character is present in the string, it is replaced by an empty string.

while (r.indexOf("#") > -1)
    r=r.replace("#","");
while ( strpos( $r, '#' ) !== false ) {
    $r = str_replace( '#', '', $r );
}

Here, like previously, we have a while loop, this time evaluating whether the . character is present and, if it is, replacing it with an @ symbol.

while (t.indexOf(".") > -1)
    t=t.replace(".","@");
while ( strpos( $t, '.' ) !== false ) {
    $t = str_replace( '.', '@', $t );
}

Here we have yet another while loop, this time evaluating whether the previously assigned r variable is present and replacing it with a . character.

while (t.indexOf(r) > -1)
    t=t.replace(r,".");
while ( strpos ( $t, $r ) !== false ) {
    $t = str_replace( $r, '.', $t );
}

This line is kind of unnecessary for anything other than perhaps freeing up memory, or getting ready for another email address; it's just setting the s variable to an empty string.

var s=""
$s = '';

This section of code is actually pretty convoluted, in that all it is doing is reversing a string. But since JavaScript doesn’t have a ‘string reverse’ function, this is actually one of the easiest ways to accomplish it. I’ll go through it, even though in PHP we just have to run a simple strrev() on the string in question.

First the length of the string is being assigned. Then, for the length of the string, take the letter at the length minus the number of iterations and append it to the new string s:

var l=t.length;

for (i=0;i<=l;i=i+1)
  {
      s=s + t.charAt(l-i);
  }
$t = strrev( $t );

Lastly, the JavaScript is writing out a string to send mail to the address. Since we likely don’t want to do this, at least at this point in time, we’ll just return the email address as a string. Then we can put it in a database or something.

document.write('<a href=\'mailto:' + s + '?subject=Enquiry from theweddingplanner\'>' + s + '</a>');
return ( $t );

This gives us a final class that should look something like this:


class ReverseEncrypt {

    public static function stringFromCharCode( $arr_char_codes ) {
        return implode( array_map( 'chr', $arr_char_codes ) );
    }

    public static function decryptSwtEmail( $t ) {    // $t is the encrypted email string
        $t = substr( $t, 0, strlen( $t ) - 4 );

        $r = implode( array_map( 'chr', array('95', '121', '121', '122', '95') ) );

        while ( strpos( $r, '#' ) !== false ) {
            $r = str_replace( '#', '', $r );
        }

        while ( strpos( $t, '.' ) !== false ) {
            $t = str_replace( '.', '@', $t );
        }

        while ( strpos ( $t, $r ) !== false ) {
            $t = str_replace( $r, '.', $t );
        }

        $t = strrev( $t );

        return ( $t );
    }

}


// Call this method on an encrypted string

$str_swt_email_encrypted = 'moc_yyz_rehpargotohpbytteb.ytteb.com';

$str_swt_email_decrypted = ReverseEncrypt::decryptSwtEmail( $str_swt_email_encrypted );

echo $str_swt_email_decrypted; // Outputs betty@bettybphotographer.com

When we run a sample encrypted email address through the method, such as moc_yyz_rehpargotohpbytteb.ytteb.com, we get an output of betty@bettybphotographer.com. And if we were to run the scraper through the whole site (yes, I’ve done it) we now have over 4,000 decrypted email addresses that somebody didn’t want us to have.
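For the curious, a site-wide run like that can be sketched as follows, using a simplified standalone copy of the decryption logic. The regex and the sample $html here are illustrative; in practice $html would be each page's response fetched with cURL.

```php
<?php
// Standalone copy of our decryption logic. Note that PHP's str_replace()
// replaces all occurrences, so the while loops from the class aren't needed.
function decryptSwtEmail($t) {
    $t = substr($t, 0, strlen($t) - 4);                           // Strip the trailing ".com"
    $r = implode(array_map('chr', array(95, 121, 121, 122, 95))); // "_yyz_"
    $t = str_replace('.', '@', $t);                               // "." becomes "@"
    $t = str_replace($r, '.', $t);                                // "_yyz_" becomes "."
    return strrev($t);                                            // Reverse the string
}

// In practice $html is the response from a cURL request to each page
$html = "<td><SCRIPT>sw('moc_yyz_rehpargotohpbytteb.ytteb.com');</SCRIPT></td>";

// Grab the parameter of every sw('...') call on the page
preg_match_all("/sw\('([^']+)'\)/i", $html, $matches);

$emails = array_map('decryptSwtEmail', $matches[1]);
print_r($emails);    // [0] => betty@bettybphotographer.com
?>
```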


Taking the time to find more examples like this and extending our reverse engineering encryption class even further can prove to be a real time-saving endeavour, and one I think is worth looking into if you plan on automated scraping of various 'unknown' sources on a large scale. Next time I'll be looking at decrypting the popular Enkoder Form by Hivelogic.


In conclusion, webmasters looking to stop web scrapers from harvesting email addresses need to come up with better solutions. As shown here, it's easy for people like myself (or worse, spammers!) to get this information if you don't try harder to hide it.

As a side note, and referring back to the opening paragraph of this post, failing to display this information on your web page in plain text, semantically marked up as an email address (<address>admin@jacobward.co.uk</address>), is a detriment to the internet as a whole. I say "either display it properly or don't display it at all". What do you think?

Using PHP To Scrape Websites Generated By JavaScript, jQuery, AJAX & JSON

Scraping websites generated by JavaScript or jQuery using PHP is a topic that I’ve received many requests for and one that I’ve been wanting to cover for a while now. More often than not, it’s just a single page or form that people are having issues with, but I wanted to wait until I found an entire site that is generated using JavaScript where at no point would traditional PHP web scraping techniques work.

Today is that day, and the site is NCR Silver, a Point-of-Sale (POS) system with a web management interface generated entirely by JavaScript.

You'll need to sign up for an account, where you'll get a free 14-day trial: more than enough time to work through this material and learn the techniques involved.

NCR POS Signup Form

Sign up for a free 14-day trial at NCR POS.

When you receive your welcome email then we’ll be ready to get started!

Now, let’s navigate to the main login page and take a look. At first glance everything looks normal, wouldn’t you agree?

NCR POS JavaScript Login Page

When we take a first glance at the login page, even in the DOM inspector, everything looks normal.

But when we view the page source we see something else entirely.

We see that there are <noscript> </noscript> tags surrounding some HTML, displayed to clients without JavaScript enabled, informing them that they can't access the website without JavaScript. This could prove a problem for our web bot written in PHP and cURL, since cURL cannot process JavaScript.

For clients with JavaScript enabled we see a series of document.write() statements to display the HTML code for the login page. Now this could cause an issue for us if the HTML was dynamically generated and we needed JavaScript enabled to actually view it (more on this later). But, as it is, the HTML is hardcoded into the page and we can see the HTML that would be displayed if we had JavaScript enabled.

noscript login

Source of page showing what is displayed to clients without JavaScript.


script login

Source of page showing what is displayed to clients with JavaScript.
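Since the HTML is hardcoded inside those document.write() calls, we could even recover it from the raw source with a simple regex. A quick sketch; the form markup in $source here is invented for illustration, not NCR Silver's actual code, and in practice $source would be the response from a cURL request.

```php
<?php
// Sketch: pull the hardcoded HTML back out of document.write('...') calls.
// $source stands in for the raw page source returned by cURL; the markup
// is illustrative, not NCR Silver's actual code.
$source = <<<'JS'
<script type="text/javascript">
document.write('<form action="/app/Account/LogOn" method="post">');
document.write('<input name="username" type="text" />');
document.write('</form>');
</script>
JS;

preg_match_all("/document\.write\('(.*?)'\);/", $source, $matches);

// The HTML a JavaScript-enabled browser would see
echo implode("\n", $matches[1]);
?>
```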

From studying the HTML login form using Tools > Web Developer > Inspector we can ascertain what information we need in order to submit the login form to authenticate, and build an array from the data:

$credentials = array(
	'username' => $userEmail,       // Your email address
	'password' => $userPass,        // Your password
	'RememberMe' => 'true',         // Staying logged in
	'IsAjaxRequest' => 'false'      // Whether request is AJAX
);

At this point I'm going to introduce a new method of determining that data, as it is one we will be using heavily once we get into the admin area. First you need to download and install the Live HTTP Headers plugin: Firefox, Chrome. There are Internet Explorer alternatives, but since I'm not familiar with them and Internet Explorer is a piece of shit, they won't be covered here.

With the Live HTTP Headers plugin installed we can fire it up from Tools > Live HTTP Headers and make sure the Capture checkbox is selected. Now we manually submit the web login form and you should see the HTTP Headers window begin to fill up with data.

All we have to do now is navigate to the POST request for the login form, POST /app/Account/LogOn HTTP/1.1 and look at the data being submitted.

Live HTTP Headers Login

Live HTTP Headers plugin showing the headers sent when we submit the login form, including our login details.

Now that we have the required info we can just make a simple cURL POST request to get ourselves logged in.

// Function to submit form using cURL POST method
function curlPost($postUrl, $postFields) {
	
	$useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';	// Setting useragent of a popular browser
	
	$cookie = 'cookie.txt';	// Setting a cookie file to store cookie
	
	$ch = curl_init();	// Initialising cURL session

	// Setting cURL options
	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);	// Prevent cURL from verifying SSL certificate
	curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);	// Script should fail silently on error
	curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);	// Use cookies
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Returning transfer as a string
	curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);	// Setting cookiefile
	curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);	// Setting cookiejar
	curl_setopt($ch, CURLOPT_USERAGENT, $useragent);	// Setting useragent
	curl_setopt($ch, CURLOPT_URL, $postUrl);	// Setting URL to POST to
			
	curl_setopt($ch, CURLOPT_POST, TRUE);	// Setting method as POST
	curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);	// Setting POST fields as array
			
	$results = curl_exec($ch);	// Executing cURL session
	curl_close($ch);	// Closing cURL session
	
	return $results;
}

$url = 'https://mystore.ncrsilver.com/app/Account/LogOn';	// Login POST URL

// Array built from login credentials
$credentials = array(
	'username' => $userEmail,       // Your email address
	'password' => $userPass,        // Your password
	'RememberMe' => 'true',         // Staying logged in
	'IsAjaxRequest' => 'false'      // Whether request is AJAX
);



// Performing the login!
$request = curlPost($url, $credentials);
 
$login = json_decode($request); // Decoding the JSON response
 
if ($login->success == 1) {
    // Successful login
    $message = 'Successful login.'; // Assigning successful login message
    print_r($request);
    echo $message . "\n";
} elseif ($login->success == 0) {
    $message = $login->error;    // Assigning login error message returned by server
    echo $message . "\n";
    print_r($request);
    exit(); // Ending program
} else {
    $message = 'Unknown login error.';  // Assigning unknown login error message
    echo $message . "\n";
    print_r($request);
    exit(); // Ending program
}



Now, you may be surprised to find out that what is returned from the server is not the usual web page that we would expect from a form submission. Instead, the response is a JSON encoded string intended for the JavaScript application to handle our login request.

I’ve added a couple of print_r() statements in the code so we can actually see what is being returned by the server.

For an unsuccessful login we should receive:

{"success":false,"errorCode":"I","error":"The User Name or Password you entered is not correct. Please try again."}

For a successful login we should receive:

{"success":true,"isAdminUser":false,"isTrialUser":true,"trialDaysLeft":13,"totalTrialTime":14,"resetPassword":false,"posUserId":"JWARD","membershipUserId":"d4b0b737-d191-4f71-93cf-60672dd97d10","merchantId":519228,"merchantStatusCode":"1","eulaId":0,"eulaFileName":"","isPaymentRequired":false,"BillingInfo":"","isCompanyUser":true,"merchantStoreCount":1,"assignedUserStoreCount":1,"userId":1202,"merchantUserRoleId":1}

If you're not familiar with JSON, it's actually pretty simple: it's a string of keys and values, much the same as an array. In our instance here it's the "success" key we are looking for, and its value of true or false letting us know whether our login was successful or not.
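As a quick illustration, here's how the failure response above decodes with PHP's json_decode():

```php
<?php
// Decoding the error response shown above and reading its keys
$json = '{"success":false,"errorCode":"I","error":"The User Name or Password you entered is not correct. Please try again."}';

$response = json_decode($json);    // Returns a stdClass object by default

var_dump($response->success);      // bool(false)
echo $response->error . "\n";      // The human-readable error message
?>
```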

In our PHP script we decode the JSON string using the json_decode() function and store the object in $login. From this we can determine if our login was successful (true / 1) or if it failed (false / 0). With any luck, we should receive a successful login and our PHP scraper script will echo a success message:

Successful login.

…happy fucking days! Now we’re getting to the fun stuff >:)

Now we’re in. What do we want to do? How about get all the customer information?

In your browser navigate to CUSTOMERS > CUSTOMERS or just follow this link.

NCR Silver Customers

The customers admin panel with only one customer in it. This page is generated entirely from JavaScript.


Oh shit, there’s only one customer there, this is going to be boring. I guess we should add a few customers to work with.

Since what we're really interested in is the scraping of data from a JavaScript page, we're just going to use the import function of the website to add a bunch of customers. All you have to do is download this CSV file and import it on the site.

Import Button

Here is where we import the customers.

Importing Customers Into The POS

Screen showing the importing of our customer base to be scraped.

Now we’ve got some customer data visible in our browser, all displayed by the website using JavaScript and JSON.

Customers To Scrape

Here’s our customer base rendered with JavaScript ready to be scraped.

As you can see by viewing the source code of the page, nowhere can we see the information about the customers; all we see is lots of JavaScript includes, which do the rendering of the customer information. So where is this information coming from? Well, when the page is loaded in your browser, the web page makes a request for all of the customer data, which is returned as a JSON object and then rendered in your browser using JavaScript.

Customers Page JavaScript Source Code

All page content is being rendered using a collection of JavaScript applications. Nowhere can we see the actual rendered page content.

You may be thinking: if we can't see the customer information on the page, then when we request the page using cURL like we usually do, how can we scrape the data? Well, it's actually quite simple – we pretend to be the JavaScript web application requesting the data, and then we have a JSON object of all the data we require returned to us, which we can mine and scrape to our heart's content.

In order to do this, we must first figure out the request being made by the page, which we want to imitate. Back we go to our trusty Live HTTP Headers plugin. The best way to do this is not to 'load the page', as this will return lots of extraneous data such as markup and styling, but rather to mimic the performance of a search, as this should only return data about the customers – maybe if we perform a search with no search string we get a list of all of the customers? Let's give it a shot!

Live HTTP Headers for JavaScript Search Form

From the HTTP headers we can see the POST URL for the search form and the data being sent.

There we have it – our URL to make the POST request and all of the data to pass along with it. Let’s start building this up and hopefully we should see positive results.

<?php
    class NCRSilverScraper {

        // Class constructor method
        function __construct() {

            $this->useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';    // Setting useragent of a popular browser

            $handle = fopen('cookie.txt', 'w') or exit('Unable to create or open cookie.txt file.'."\n");   // Opening or creating cookie file
            fclose($handle);    // Closing cookie file
            $this->cookie = 'cookie.txt';    // Setting a cookie file to store cookie
            $this->timeout = 30; // Setting connection timeout in seconds

            $this->loginUrl = 'https://mystore.ncrsilver.com/app/Account/LogOn';

        }


        // User login method
        public function login() {

            // Supply your own account credentials here
            $emailAddress = 'you@example.com';
            $password = 'yourPassword';

            // Login values to POST as array
            $postValues = http_build_query(
                array(
                    'username' => $emailAddress,
                    'password' => $password,
                    'RememberMe' => 'true',
                    'IsAjaxRequest' => 'false'
                )
            );

            $request = $this->curlPostFields($this->loginUrl, $postValues);   // Making cURL POST request

            $login = json_decode($request); // Decoding the JSON response

            if ($login->success == 1) {
                // Successful login
                $message = 'Successful login.'; // Assigning successful message
                echo $message;
            } elseif ($login->success == 0) {
                $message = $login->error;    // Assigning login error message returned by server
                echo $message;
                exit(); // Ending program
            } else {
                $message = 'Unknown login error.';  // Assigning unknown login error message
                echo $message;
                exit(); // Ending program
            }
        }

        // User logout method
        public function logout() {
            $request = $this->curlPostFields('https://mystore.ncrsilver.com/app/Account/LogOff?CancelLogin=true&isAjaxRequest=true', null);  // Logging out
        }

        // Method to search and scrape existing members details
        public function scrapePersons($searchString = '') {

            $searchUrl = 'https://mystore.ncrsilver.com/app/Customer/GetCustomers';

            $postValues = array(
                'PageRowCount' => 1000,
                'RequestedPageNum' => 1,
                'TotalRowCount' => -1,
                'SearchArg' => $searchString,
                'SortDirection' => 'ASC',
                'SortColumn' => 'Name',
                'page' => 1,
                'start' => 0,
                'limit' => 1000,
                'sort' => '[{"property":"Name","direction":"ASC"}]',
                'isAjaxRequest' => true,
            );

            $search = $this->curlPostFields($searchUrl, $postValues);

            return $search;
        }

        // Method to make a POST request using form fields
        public function curlPostFields($postUrl, $postValues) {
            $_ch = curl_init(); // Initialising cURL session

            // Setting cURL options
            curl_setopt($_ch, CURLOPT_SSL_VERIFYPEER, FALSE);   // Prevent cURL from verifying SSL certificate
            curl_setopt($_ch, CURLOPT_FAILONERROR, TRUE);   // Script should fail silently on error
            curl_setopt($_ch, CURLOPT_COOKIESESSION, TRUE); // Use cookies
            curl_setopt($_ch, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow Location: headers
            curl_setopt($_ch, CURLOPT_RETURNTRANSFER, TRUE);    // Returning transfer as a string
            curl_setopt($_ch, CURLOPT_COOKIEFILE, $this->cookie);    // Setting cookiefile
            curl_setopt($_ch, CURLOPT_COOKIEJAR, $this->cookie); // Setting cookiejar
            curl_setopt($_ch, CURLOPT_USERAGENT, $this->useragent);  // Setting useragent
            curl_setopt($_ch, CURLOPT_URL, $postUrl);   // Setting URL to POST to
            curl_setopt($_ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);   // Connection timeout
            curl_setopt($_ch, CURLOPT_TIMEOUT, $this->timeout); // Request timeout

            curl_setopt($_ch, CURLOPT_POST, TRUE);  // Setting method as POST
            curl_setopt($_ch, CURLOPT_POSTFIELDS, $postValues); // Setting POST fields (array)

            $results = curl_exec($_ch); // Executing cURL session
            curl_close($_ch);   // Closing cURL session

            return $results;
        }


        // Class destructor method
        function __destruct() {
            // Empty
        }
    }


    // Let's run this baby and scrape us some data!
    $testScrape = new NCRSilverScraper();   // Instantiating new object

    $testScrape->login();    // Logging into server

    $data = json_decode($testScrape->scrapePersons());   // Scraping people records
    print_r($data);

    $testScrape->logout();   // Logging out

?>

And with that run we should have us some nice data scraped from a JavaScript and JSON website using nothing more than PHP and a little common sense.

Final Scraped JSON Data

Here’s the output of our scraper, printing out the contents of our PHP object.

Here we have each customer's code, full name, email address and phone number. It's one small step for web scraping, one giant leap for something or other. I don't know where I was going with that.

Of course, we don't have to store it as an object; we could always parse it into an array if you prefer working with your data like that, or whatever your preferred data structure is.
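Getting an array instead of an object only takes json_decode()'s second argument. A quick sketch; the JSON here is a cut-down illustrative response, not necessarily the exact field names the API returns:

```php
<?php
// json_decode()'s second argument returns nested associative arrays
// instead of stdClass objects. The JSON is a cut-down illustrative response.
$json = '{"Rows":[{"Name":"Ann Thomas","Email":"athomas12@yahoo.co.jp"}]}';

$asObject = json_decode($json);          // Access as $asObject->Rows[0]->Name
$asArray  = json_decode($json, true);    // Access as $asArray['Rows'][0]['Name']

echo $asArray['Rows'][0]['Email'] . "\n";
?>
```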

Now we have the data, it's up to you what to do with it. Personally, and just for the purposes of this post, I'm going to write a little method to format it in a nice HTML table to display below. You might want to do something more useful with your data, like store it in a database, CSV or something else, which I might cover in a future post.
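A minimal sketch of what such a formatting method might look like; the customer records and column names here are illustrative stand-ins, as in practice they'd come from the decoded JSON returned by scrapePersons():

```php
<?php
// Sketch: format an array of customer records as an HTML table.
// The records and column names are illustrative stand-ins for the
// decoded JSON data returned by our scraper.
function formatAsHtmlTable($records, $columns) {
    $html = "<table>\n<tr><th>" . implode('</th><th>', $columns) . "</th></tr>\n";
    foreach ($records as $record) {
        $cells = array();
        foreach ($columns as $column) {
            $cells[] = htmlspecialchars($record[$column]);  // Escape scraped values
        }
        $html .= '<tr><td>' . implode('</td><td>', $cells) . "</td></tr>\n";
    }
    return $html . '</table>';
}

$customers = array(
    array('Name' => 'Ann Thomas', 'Email' => 'athomas12@yahoo.co.jp'),
    array('Name' => 'Ann Walker', 'Email' => 'awalker15@intel.com')
);

echo formatAsHtmlTable($customers, array('Name', 'Email'));
?>
```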

Anyways, I hope this post has been somewhat informative and answered most of your questions regarding scraping JavaScript sites and JSON using PHP. As always, comments and questions are always welcome. You know what to do.

Happy web scraping!

User ID Full Name Email Address Phone Number
192 Andrea Fernandez afernandez2g@cafepress.com 9-(362)056-0581
142 Ann Thomas athomas12@yahoo.co.jp 6-(538)141-2725
145 Ann Walker awalker15@intel.com 7-(670)470-3724
203 Anna Carr acarr2r@boston.com 1-(382)463-0119
183 Ashley Kelly akelly27@mtv.com 1-(112)543-9709
184 Benjamin Dean bdean28@irs.gov 9-(780)063-9572
111 Bonnie Alvarez balvarez7@paypal.com 7-(240)691-0590
141 Brandon Murray bmurray11@tumblr.com 7-(612)179-5480
156 Carolyn Foster cfoster1g@cyberchimps.com 7-(614)558-2275
187 Cheryl Burke cburke2b@merriam-webster.com 8-(119)283-2599
135 Christine Wells cwellsv@wufoo.com 7-(415)042-8205
130 Craig Harper charperq@wired.com 3-(092)318-1942
104 Daniel Gonzales dgonzales0@zimbio.com 9-(313)370-0380
136 Denise Kelly dkellyw@live.com 2-(435)951-9920
178 Denise Vasquez dvasquez22@reddit.com 6-(800)841-4073
166 Diana Gardner dgardner1q@chicagotribune.com 9-(653)558-6654
200 Diana Nguyen dnguyen2o@vinaora.com 7-(016)965-4256
198 Diana Richards drichards2m@huffingtonpost.com 4-(783)241-6445
118 Diane Harvey dharveye@google.es 0-(422)620-9113
128 Diane Porter dportero@qq.com 8-(493)442-8581
201 Donald Roberts droberts2p@acquirethisname.com 2-(883)548-2431
115 Donna Reyes dreyesb@opera.com 2-(529)344-1126
127 Doris Berry dberryn@cargocollective.com 3-(364)519-6194
106 Dorothy Andrews dandrews2@google.ru 5-(727)310-0492
180 Dorothy Kelly dkelly24@multiply.com 3-(400)221-6843

Updated Simple PHP Scraping Function

Going back over some of my old posts on web scraping and looking through the code I’ve noticed a few places where there is room for improvement. One such example is the scrape_between() function used in Working With The Scraped Data.

Functionally, it works. But it’s not great.

  • Repeatedly overwriting the $data variable isn't a great idea.
  • It works on the assumption that both the $start and $end strings are found in the search string; if $start isn't found then the function should terminate and return false, not continue performing operations on an empty variable. This isn't necessarily that bad, but it's not very elegant and could become a performance issue if the function is used in a large application processing thousands of pages.

With that said, here’s my rewrite, which is more readable, makes more sense and is structurally more sound.

<?php
    
/*
 * String Extract PHP Function
 *
 * Simple function for extracting a string from within a string, given a start and end point.
 *
 * Copyright (c) 2014 Jacob Ward (http://www.jacobward.co.uk)
 *
 * Licensed under the MIT (http://opensource.org/licenses/MIT) and GPL (http://www.gnu.org/copyleft/gpl.html) licenses.
 *
 */

    function stringExtract($item, $start, $end) {
    	if (($startPos = stripos($item, $start)) === false) {	// If $start string is not found
    		return false;	// Return false
    	} else if (($endPos = stripos($item, $end)) === false) {	// If $end string is not found
    		return false;	// Return false
    	} else {
    		$substrStart = $startPos + strlen($start);	// Assigning start position
    		return substr($item, $substrStart, $endPos - $substrStart);	// Returning string between start and end positions
    	}
    }
    
?>
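A quick usage example (the function definition is repeated here so the snippet runs standalone):

```php
<?php
// Usage example for stringExtract(); definition repeated so this runs standalone
function stringExtract($item, $start, $end) {
    if (($startPos = stripos($item, $start)) === false) {
        return false;
    } else if (($endPos = stripos($item, $end)) === false) {
        return false;
    } else {
        $substrStart = $startPos + strlen($start);
        return substr($item, $substrStart, $endPos - $substrStart);
    }
}

$html = '<html><head><title>Instant PHP Web Scraping</title></head></html>';

echo stringExtract($html, '<title>', '</title>');    // Instant PHP Web Scraping
var_dump(stringExtract($html, '<h1>', '</h1>'));     // bool(false): start not found
?>
```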

I’ve changed the name of the function and added some info before the function as it’s now hosted on GitHub as php-string-extract.

Just thought I’d post it here too for the sake of doing so.

By the way, I haven’t forgotten about the epic post I promised, I’m working on it most days but don’t have a great amount of free time at the moment. But trust me, the wait will be worth it!

Instant PHP Web Scraping Book Now Available!

If you've been following me on Twitter or have contacted me privately, it's likely you know this day has been approaching: Instant PHP Web Scraping was published on 26th July and is now available to buy!

For those that don’t already know, the content of the book is essentially where I had originally intended to head with the Web Scraping With PHP & CURL series I started. Aimed at novice PHP programmers who are new to web scraping, it will guide readers through the basics and provide a tool set to complete a number of web scraping tasks and give a firm basis for further learning on the subject.

NOTE: This book is intended to serve as a brief introduction to web scraping with PHP. I was under strict instruction and constraints by the publisher. The target audience of this book is the absolute beginner. If you have experience working with PHP, cURL, MySQL, etc… this book is not for you.

The book is available as an ebook from Packt Publishing or as a paperback from Amazon. In addition to the recipes contained in the book, there are also a number of bonus recipes which will be available online for anybody who has purchased the book, providing even more coverage of the subject matter. I will also be setting up an online forum here, where anybody who has read the book can post questions or ask for help from me personally.

Win A Free Copy!

Packt Publishing have three free ebook copies of Instant PHP Web Scraping to give away. I will be putting a competition together in the coming days, so stay tuned to find out how to enter and be in with a chance to win!

Own A Website And Want A Free Copy?

If you own a website or blog and would like to review this book, please send me your details via my contact form and I will respond asap with full details.

Book Overview

Who this book is for

This book is aimed at those new to web scraping, with little or no previous programming experience. Basic knowledge of HTML and the Web is useful, but not necessary.

What you will learn from this book

  • Scrape and parse data from web pages using a number of different techniques
  • Create custom scraping functions
  • Download and save images and documents
  • Retrieve and scrape data from emails
  • Save scraped data into a MySQL database
  • Submit login and file upload forms
  • Use regular expressions for pattern matching
  • Process and validate scraped data
  • Crawl and scrape multiple pages of a website
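To give a flavour of two of the items above — pattern matching with regular expressions and validating scraped data — here is a minimal sketch (the HTML string and email address are invented for illustration and are not examples from the book):

```php
<?php
// Illustrative only: a made-up HTML snippet containing an email address.
$html = '<p>Contact us at <a href="mailto:info@example.com">info@example.com</a></p>';

// Pattern matching with a regular expression to pull out email addresses
preg_match_all('/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/', $html, $matches);

// Validate each scraped address before keeping it
$valid = array_filter($matches[0], function ($email) {
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
});

// De-duplicate (the address appears in both the href and the link text)
$emails = array_values(array_unique($valid));

print_r($emails); // prints the single validated address
```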

In Detail

With the proliferation of the web, there has never been a larger body of data freely available for common use. Harvesting and processing this data can be a time-consuming task if done manually. However, web scraping can provide the tools and framework to accomplish this with the click of a button. It’s no wonder, then, that web scraping is a desirable weapon in any programmer’s arsenal.

Instant Web Scraping With PHP How-to uses practical examples and step-by-step instructions to guide you through the basic techniques required for web scraping with PHP, providing the knowledge and foundation upon which to build web scraping applications for a wide variety of situations — data monitoring, research, and data integration — relevant to today’s online data-driven economy.

After setting up a suitable PHP development environment, you will quickly move on to building web scraping applications. Beginning with the simple task of retrieving a single web page, you will then gradually build on this by learning various techniques for identifying specific data, crawling through numerous web pages to retrieve large volumes of data, and processing and saving it for future use. You will learn how to submit login forms to access password-protected areas, along with downloading images, documents, and emails. Learning to schedule the execution of scrapers achieves the goal of complete automation, and the final introduction of basic object-oriented programming (OOP) in the development of a scraping class provides the template for future projects.

Armed with the skills learned in the book, you will be set to embark on a wide variety of web scraping projects.
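As a rough sketch, a reusable scraping class of the kind described above might wrap fetching and parsing along the lines of the stringExtract() function earlier on this page. All names here (Scraper, fetch, extractBetween) are my own invention, not taken from the book:

```php
<?php
// Hypothetical sketch of a reusable scraping class; class and method
// names are invented and not from the book.
class Scraper
{
    // Fetch a URL with cURL (requires the php-curl extension when called)
    public function fetch($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);
        return $html;
    }

    // Extract the substring between two markers, or false if absent
    public function extractBetween($haystack, $start, $end)
    {
        if (($s = stripos($haystack, $start)) === false) {
            return false;
        }
        $s += strlen($start);
        if (($e = stripos($haystack, $end, $s)) === false) {
            return false;
        }
        return substr($haystack, $s, $e - $s);
    }
}

// Parsing works on any string, so it can be exercised without going online:
$scraper = new Scraper();
echo $scraper->extractBetween('<h1>Hello</h1>', '<h1>', '</h1>'); // Hello
```

Separating fetching from parsing like this means the parsing logic can be tested against static strings, without hitting the network.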

Approach

Filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks. Short, concise recipes to learn a variety of useful web scraping techniques using PHP.

Table of contents

  • Preparing your development environment (Simple)
  • Making a simple cURL request (Simple)
  • Scraping elements using XPath (Simple)
  • The custom scraping function (Simple)
  • Scraping and saving images (Simple)
  • Submitting a form using cURL (Intermediate)
  • Traversing multiple pages (Intermediate)
  • Saving scraped data to a database (Intermediate)
  • Scheduling scrapes (Simple)
  • Building a reusable scraping class (Advanced)
  • + online bonus content covering a number of other topics!