Using PHP To Scrape Websites Generated By JavaScript, jQuery, AJAX & JSON

Scraping websites generated by JavaScript or jQuery using PHP is a topic that I’ve received many requests for and one that I’ve been wanting to cover for a while now. More often than not, it’s just a single page or form that people are having issues with, but I wanted to wait until I found an entire site that is generated using JavaScript where at no point would traditional PHP web scraping techniques work.

Today is that day, and the site is NCR Silver, a Point-of-Sale (POS) system with a web management interface generated entirely by JavaScript.

You’ll need to sign up for an account, where you’ll get a free 14-day trial. That’s more than enough time to work through this material and learn the techniques involved.

NCR POS Signup Form

Sign up for a free 14-day trial at NCR POS.

When you receive your welcome email, we’ll be ready to get started!

Now, let’s navigate to the main login page and take a look. At first glance everything looks normal, wouldn’t you agree?

NCR POS JavaScript Login Page

When we take a first glance at the login page, even in the DOM inspector, everything looks normal.

But when we view the page source we see something else entirely.

We see that there are <noscript> </noscript> tags surrounding some HTML to be displayed to clients without JavaScript enabled, informing them that they can’t access the website without it. This could prove a problem for our web bot written in PHP & cURL, since cURL cannot process JavaScript.

For clients with JavaScript enabled we see a series of document.write() statements to display the HTML code for the login page. Now this could cause an issue for us if the HTML was dynamically generated and we needed JavaScript enabled to actually view it (more on this later). But, as it is, the HTML is hardcoded into the page and we can see the HTML that would be displayed if we had JavaScript enabled.

noscript login

Source of page showing what is displayed to clients without JavaScript.


script login

Source of page showing what is displayed to clients with JavaScript.

From studying the HTML login form using Tools > Web Developer > Inspector we can ascertain what information we need in order to submit the login form and authenticate, and build an array from the data:

$credentials = array(
	'username' => $userEmail,       // Your email address
	'password' => $userPass,        // Your password
	'RememberMe' => 'true',         // Staying logged in
	'IsAjaxRequest' => 'false'      // Whether request is AJAX
);

At this point I’m going to introduce a new method of determining that data, as it is one we will be using heavily once we get into the admin area. First you need to download and install the Live HTTP Headers plugin: Firefox, Chrome. There are Internet Explorer alternatives, but since I’m not familiar with them and Internet Explorer is a piece of shit, they won’t be covered here.

With the Live HTTP Headers plugin installed we can fire it up from Tools > Live HTTP Headers and make sure the Capture checkbox is selected. Now manually submit the web login form and you should see the HTTP Headers window begin to fill up with data.

All we have to do now is navigate to the POST request for the login form, POST /app/Account/LogOn HTTP/1.1 and look at the data being submitted.

Live HTTP Headers Login

Live HTTP Headers plugin showing the headers sent when we submit the login form, including our login details.

Now that we have the required info we can just make a simple cURL POST request to get ourselves logged in.

// Function to submit form using cURL POST method
function curlPost($postUrl, $postFields) {
	
	$useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';	// Setting useragent of a popular browser
	
	$cookie = 'cookie.txt';	// Setting a cookie file to store cookie
	
	$ch = curl_init();	// Initialising cURL session

	// Setting cURL options
	curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);	// Don't verify the SSL certificate (insecure, but convenient for testing)
	curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);	// Fail on HTTP response codes of 400 or above
	curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);	// Start a new cookie session, ignoring stored session cookies
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);	// Follow Location: headers
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);	// Returning transfer as a string
	curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);	// Setting cookiefile
	curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);	// Setting cookiejar
	curl_setopt($ch, CURLOPT_USERAGENT, $useragent);	// Setting useragent
	curl_setopt($ch, CURLOPT_URL, $postUrl);	// Setting URL to POST to
			
	curl_setopt($ch, CURLOPT_POST, TRUE);	// Setting method as POST
	curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);	// Setting POST fields as array
			
	$results = curl_exec($ch);	// Executing cURL session
	curl_close($ch);	// Closing cURL session
	
	return $results;
}

$url = 'https://mystore.ncrsilver.com/app/Account/LogOn';	// Login POST URL

// Array built from login credentials
$credentials = array(
	'username' => $userEmail,       // Your email address
	'password' => $userPass,        // Your password
	'RememberMe' => 'true',         // Staying logged in
	'IsAjaxRequest' => 'false'      // Whether request is AJAX
);



// Performing the login!
$request = curlPost($url, $credentials);
 
$login = json_decode($request); // Decoding the JSON response
 
if ($login->success == 1) {
    // Successful login
    $message = 'Successful login.'; // Assigning successful login message
    print_r($request);
    echo $message . "\n";
} elseif ($login->success == 0) {
    $message = $login->error;    // Assigning login error message returned by server
    echo $message . "\n";
    print_r($request);
    exit(); // Ending program
} else {
    $message = 'Unknown login error.';  // Assigning unknown login error message
    echo $message . "\n";
    print_r($request);
    exit(); // Ending program
}



Now, you may be surprised to find out that what is returned from the server is not the usual web page that we would expect from a form submission. Instead, the response is a JSON encoded string intended for the JavaScript application to handle our login request.

I’ve added a couple of print_r() statements in the code so we can actually see what is being returned by the server.

For an unsuccessful login we should receive:

{"success":false,"errorCode":"I","error":"The User Name or Password you entered is not correct. Please try again."}

For a successful login we should receive:

{"success":true,"isAdminUser":false,"isTrialUser":true,"trialDaysLeft":13,"totalTrialTime":14,"resetPassword":false,"posUserId":"JWARD","membershipUserId":"d4b0b737-d191-4f71-93cf-60672dd97d10","merchantId":519228,"merchantStatusCode":"1","eulaId":0,"eulaFileName":"","isPaymentRequired":false,"BillingInfo":"","isCompanyUser":true,"merchantStoreCount":1,"assignedUserStoreCount":1,"userId":1202,"merchantUserRoleId":1}

If you’re not familiar with JSON, it’s actually pretty simple: it’s a string of keys and values, much the same as an array. In our case it’s the “success” key we are looking for, and its value of true or false lets us know whether our login was successful or not.
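To see this concretely without touching the network, here’s a minimal sketch that decodes the two sample responses shown above. The failure string is verbatim from the server output; the success string is shortened to its first few keys for readability:

```php
<?php
// Failure response, verbatim from the server output above
$failJson = '{"success":false,"errorCode":"I","error":"The User Name or Password you entered is not correct. Please try again."}';

// Success response, shortened to the keys we care about here
$okJson = '{"success":true,"isTrialUser":true,"trialDaysLeft":13}';

$fail = json_decode($failJson); // Decodes to a stdClass object
$ok   = json_decode($okJson);

var_dump($fail->success);   // bool(false)
var_dump($ok->success);     // bool(true)
echo $fail->error . "\n";   // The server's human-readable error message
```

Note that `->success` comes back as a real PHP boolean, which is why the `== 1` / `== 0` comparisons in our script work.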

In our PHP script we decode the JSON string using the json_decode() function and store the object in $login. From this we can determine if our login was successful (true / 1) or if it failed (false / 0). With any luck, we should receive a successful login and our PHP scraper script will echo a success message:

Successful login.

…happy fucking days! Now we’re getting to the fun stuff >:)

Now we’re in. What do we want to do? How about get all the customer information?

In your browser navigate to CUSTOMERS > CUSTOMERS or just follow this link.

NCR Silver Customers

The customers admin panel with only one customer in it. This page is generated entirely from JavaScript.


Oh shit, there’s only one customer there, this is going to be boring. I guess we should add a few customers to work with.

Since what we’re really interested in is the scraping of data from a JavaScript page, we’re just going to use the import function of the web site to add a bunch of customers. All you have to do is download this CSV file and import it on the site.

Import Button

Here is where we import the customers.

Importing Customers Into The POS

Screen showing the importing of our customer base to be scraped.

Now we’ve got some customer data visible in our browser, all displayed by the website using JavaScript and JSON.

Customers To Scrape

Here’s our customer base rendered with JavaScript ready to be scraped.

As you can see by viewing the source code of the page, nowhere is the customer information to be found; all we see is a load of JavaScript includes which do the rendering of the customer information. So where is this information coming from? When the page is loaded, the browser makes a request for all of the customer data, which is returned as a JSON object and then rendered in your browser using JavaScript.

Customers Page JavaScript Source Code

All page content is being rendered using a collection of JavaScript applications. Nowhere can we see the actual rendered page content.

You may be thinking: if we can’t see the customer information on the page, then when we request the page using cURL like we usually do, how can we scrape the data? Well, it’s actually quite simple – we pretend to be the JavaScript web application requesting the data, and the server returns a JSON object of all the data we require, which we can then mine and scrape to our hearts’ content.
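One detail worth knowing when pretending to be a JavaScript application: some JSON endpoints check for headers that a browser’s XMLHttpRequest sends automatically, such as X-Requested-With. NCR Silver signals this with an isAjaxRequest POST field instead, but if you hit a site that does check headers, you can attach them to your cURL handle yourself. A generic sketch (these header values are common browser defaults, not something specific to NCR Silver):

```php
<?php
// Build the extra headers a browser's XMLHttpRequest would normally carry.
// Returned as an array ready for CURLOPT_HTTPHEADER.
function ajaxHeaders() {
    return array(
        'X-Requested-With: XMLHttpRequest', // Marks the request as an XHR
        'Accept: application/json, text/javascript, */*; q=0.01' // Tell the server we want JSON back
    );
}

// Usage on an existing cURL handle:
// curl_setopt($ch, CURLOPT_HTTPHEADER, ajaxHeaders());
```

If a JSON endpoint works in the browser but returns a full HTML page to your bot, missing headers like these are one of the first things to check.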

In order to do this, we must first figure out the request being made by the page so that we can imitate it. Back we go to our trusty Live HTTP Headers plugin. The best way to do this is not to load the whole page, as that would capture lots of extraneous requests for markup and styling; instead we mimic the performing of a search, as this should only return data about the customers. Maybe if we perform a search with an empty search string we get a list of all of the customers? Let’s give it a shot!

Live HTTP Headers for JavaScript Search Form

From the HTTP headers we can see the POST URL for the search form and the data being sent.

There we have it – our URL to make the POST request and all of the data to pass along with it. Let’s start building this up and hopefully we should see positive results.

<?php
    class NCRSilverScraper {

        // Class constructor method
        function __construct() {

            $this->useragent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3';    // Setting useragent of a popular browser

            $handle = fopen('cookie.txt', 'w') or exit('Unable to create or open cookie.txt file.'."\n");   // Opening or creating cookie file
            fclose($handle);    // Closing cookie file
            $this->cookie = 'cookie.txt';    // Setting a cookie file to store cookie
            $this->timeout = 30; // Setting connection timeout in seconds

            $this->loginUrl = 'https://mystore.ncrsilver.com/app/Account/LogOn';

        }


        // User login method
        public function login($emailAddress = '', $password = '') {

            // Login values to POST, URL-encoded as a query string
            $postValues = http_build_query(
                array(
                    'username' => $emailAddress,
                    'password' => $password,
                    'RememberMe' => 'true',
                    'IsAjaxRequest' => 'false'
                )
            );

            $request = $this->curlPostFields($this->loginUrl, $postValues);   // Making cURL POST request

            $login = json_decode($request); // Decoding the JSON response

            if ($login->success == 1) {
                // Successful login
                $message = 'Successful login.'; // Assigning successful message
                echo $message;
            } elseif ($login->success == 0) {
                $message = $login->error;    // Assigning login error message returned by server
                echo $message;
                exit(); // Ending program
            } else {
                $message = 'Unknown login error.';  // Assigning unknown login error message
                echo $message;
                exit(); // Ending program
            }
        }

        // User logout method
        public function logout() {
            $request = $this->curlPostFields('https://mystore.ncrsilver.com/app/Account/LogOff?CancelLogin=true&isAjaxRequest=true', null);  // Logging out
        }

        // Method to search and scrape existing members details
        public function scrapePersons($searchString = '') {

            $searchUrl = 'https://mystore.ncrsilver.com/app/Customer/GetCustomers';

            $postValues = array(
                'PageRowCount' => 1000,
                'RequestedPageNum' => 1,
                'TotalRowCount' => -1,
                'SearchArg' => $searchString,
                'SortDirection' => 'ASC',
                'SortColumn' => 'Name',
                'page' => 1,
                'start' => 0,
                'limit' => 1000,
                'sort' => '[{"property":"Name","direction":"ASC"}]',
                'isAjaxRequest' => true,
            );

            $search = $this->curlPostFields($searchUrl, $postValues);

            return $search;
        }

        // Method to make a POST request using form fields
        public function curlPostFields($postUrl, $postValues) {
            $_ch = curl_init(); // Initialising cURL session

            // Setting cURL options
            curl_setopt($_ch, CURLOPT_SSL_VERIFYPEER, FALSE);   // Don't verify the SSL certificate (insecure, but convenient for testing)
            curl_setopt($_ch, CURLOPT_FAILONERROR, TRUE);   // Fail on HTTP response codes of 400 or above
            curl_setopt($_ch, CURLOPT_COOKIESESSION, TRUE); // Start a new cookie session, ignoring stored session cookies
            curl_setopt($_ch, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow Location: headers
            curl_setopt($_ch, CURLOPT_RETURNTRANSFER, TRUE);    // Returning transfer as a string
            curl_setopt($_ch, CURLOPT_COOKIEFILE, $this->cookie);    // Setting cookiefile
            curl_setopt($_ch, CURLOPT_COOKIEJAR, $this->cookie); // Setting cookiejar
            curl_setopt($_ch, CURLOPT_USERAGENT, $this->useragent);  // Setting useragent
            curl_setopt($_ch, CURLOPT_URL, $postUrl);   // Setting URL to POST to
            curl_setopt($_ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);   // Connection timeout
            curl_setopt($_ch, CURLOPT_TIMEOUT, $this->timeout); // Request timeout

            curl_setopt($_ch, CURLOPT_POST, TRUE);  // Setting method as POST
            curl_setopt($_ch, CURLOPT_POSTFIELDS, $postValues); // Setting POST fields (array)

            $results = curl_exec($_ch); // Executing cURL session
            curl_close($_ch);   // Closing cURL session

            return $results;
        }


        // Class destructor method
        function __destruct() {
            // Empty
        }
    }


    // Let's run this baby and scrape us some data!
    $testScrape = new NCRSilverScraper();   // Instantiating new object

    $testScrape->login($userEmail, $userPass);    // Logging into server with your credentials

    $data = json_decode($testScrape->scrapePersons());   // Scraping people records
    print_r($data);

    $testScrape->logout();   // Logging out

?>

And with that run we should have us some nice data scraped from a JavaScript and JSON website using nothing more than PHP and a little common sense.

Final Scraped JSON Data

Here’s the output of our scraper, printing out the contents of our PHP object.

Here we have the customer’s code, full name, email address and phone number. It’s one small step for web scraping, one giant leap for something or other. I don’t know where I was going with that.

Of course, we don’t have to keep it as an object; we could always decode it into an array if you prefer working with your data like that, or whatever your preferred data structure is.
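Getting an array is as simple as passing true as the second argument to json_decode(). A quick sketch; note that the field names here ("Rows", "Code", "Name", "Email") are placeholders I’ve made up for illustration, so substitute the real keys you see in your own print_r() output:

```php
<?php
// Hypothetical response shape; swap in the real keys from your print_r() output
$json = '{"Rows":[{"Code":192,"Name":"Andrea Fernandez","Email":"afernandez2g@cafepress.com"}]}';

$data = json_decode($json, true);   // true = decode to associative arrays instead of objects

foreach ($data['Rows'] as $customer) {
    // Prints: 192 Andrea Fernandez afernandez2g@cafepress.com
    echo $customer['Code'] . ' ' . $customer['Name'] . ' ' . $customer['Email'] . "\n";
}
```

Arrays tend to be handier if you’re going to feed the data straight into something like fputcsv() or a database insert.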

Now we have the data, it’s up to you what to do with it. Personally, and just for the purposes of this post, I’m going to write a little method to format it in a nice HTML table to display below…you might want to do something more useful with your data, like store it in a database or a CSV file, which I might cover in a future post.
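For instance, dumping the records to a CSV file takes only a few lines with fputcsv(). Again, the field names here are placeholders for illustration; swap in the real keys from your scraped data:

```php
<?php
// Hypothetical customer records; in practice this would come from json_decode($scrapedJson, true)
$customers = array(
    array('Code' => 192, 'Name' => 'Andrea Fernandez', 'Email' => 'afernandez2g@cafepress.com'),
    array('Code' => 142, 'Name' => 'Ann Thomas', 'Email' => 'athomas12@yahoo.co.jp')
);

$fp = fopen('customers.csv', 'w') or exit('Unable to open customers.csv for writing.'."\n");

fputcsv($fp, array('Code', 'Name', 'Email')); // Header row
foreach ($customers as $customer) {
    fputcsv($fp, array($customer['Code'], $customer['Name'], $customer['Email'])); // One row per customer
}

fclose($fp);
```

fputcsv() handles the quoting and escaping for you, so fields containing commas or quotes come out valid.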

Anyways, I hope this post has been somewhat informative and answered most of your questions regarding scraping JavaScript sites and JSON using PHP. As always, comments and questions are welcome. You know what to do.

Happy web scraping!

User ID Full Name Email Address Phone Number
192 Andrea Fernandez afernandez2g@cafepress.com 9-(362)056-0581
142 Ann Thomas athomas12@yahoo.co.jp 6-(538)141-2725
145 Ann Walker awalker15@intel.com 7-(670)470-3724
203 Anna Carr acarr2r@boston.com 1-(382)463-0119
183 Ashley Kelly akelly27@mtv.com 1-(112)543-9709
184 Benjamin Dean bdean28@irs.gov 9-(780)063-9572
111 Bonnie Alvarez balvarez7@paypal.com 7-(240)691-0590
141 Brandon Murray bmurray11@tumblr.com 7-(612)179-5480
156 Carolyn Foster cfoster1g@cyberchimps.com 7-(614)558-2275
187 Cheryl Burke cburke2b@merriam-webster.com 8-(119)283-2599
135 Christine Wells cwellsv@wufoo.com 7-(415)042-8205
130 Craig Harper charperq@wired.com 3-(092)318-1942
104 Daniel Gonzales dgonzales0@zimbio.com 9-(313)370-0380
136 Denise Kelly dkellyw@live.com 2-(435)951-9920
178 Denise Vasquez dvasquez22@reddit.com 6-(800)841-4073
166 Diana Gardner dgardner1q@chicagotribune.com 9-(653)558-6654
200 Diana Nguyen dnguyen2o@vinaora.com 7-(016)965-4256
198 Diana Richards drichards2m@huffingtonpost.com 4-(783)241-6445
118 Diane Harvey dharveye@google.es 0-(422)620-9113
128 Diane Porter dportero@qq.com 8-(493)442-8581
201 Donald Roberts droberts2p@acquirethisname.com 2-(883)548-2431
115 Donna Reyes dreyesb@opera.com 2-(529)344-1126
127 Doris Berry dberryn@cargocollective.com 3-(364)519-6194
106 Dorothy Andrews dandrews2@google.ru 5-(727)310-0492
180 Dorothy Kelly dkelly24@multiply.com 3-(400)221-6843