Ideas For Web Scraping & Automation Projects Or Posts

My blog has remained rather stagnant for the last few months, in part due to uni work, but mostly because I can’t think of any interesting topics to cover or tutorials to write. I’ve so far covered some really basic web scraping topics and gone slightly more in depth in my book, but I don’t know what it is you want to learn more about.

So, all suggestions are currently welcome! Please leave them in the comments section and if they’re appropriate I’ll cover them in future posts. So far I have a few ideas (feel free to leave some feedback):

  • Auto-tweeting images from RSS feeds – So far, none of the automation tools out there natively support tweeting images to Twitter. They all use third-party URL-shortening services, which means your image doesn’t appear in your Twitter stream or on the ‘Photos and videos’ page of your Twitter profile. This pisses me off, and I’m working on something to accomplish this.
  • OOP PHP programming – With the basics down, I think from now on all my posts and tutorials are going to use Object-Oriented Programming (OOP), as I do in my personal and clients’ projects. It’s far easier and cleaner when working on larger projects, and makes it easy to scale our applications as we add more features. This will also entail using classes such as DOMDocument and PDO, among others, to make our applications more robust and easier to maintain.
  • Automating and scraping AJAX – With more websites than ever now using AJAX, I think automating and scraping these using PHP might be an interesting project to cover.
  • Basic captcha ‘cracking’ – I know this may be a somewhat ‘grey area’ topic, but I’ll approach it from a neutral perspective: using PHP and Optical Character Recognition (OCR) to crack basic, but commonly implemented, captchas.
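To give a flavour of the OOP approach above, here is a minimal sketch of a scraper class wrapping DOMDocument. The class and method names are purely illustrative, not from any existing library:

```php
<?php
// Minimal sketch: a page parser built around DOMDocument
class PageParser
{
    private $dom;

    public function __construct($html)
    {
        $this->dom = new DOMDocument();
        // The @ suppresses warnings from imperfect real-world markup
        @$this->dom->loadHTML($html);
    }

    // Return the href of every <a> tag on the page
    public function links()
    {
        $hrefs = array();
        foreach ($this->dom->getElementsByTagName('a') as $a) {
            $hrefs[] = $a->getAttribute('href');
        }
        return $hrefs;
    }
}
```

Hiding the DOM behind a small class like this keeps the parsing details out of the rest of the application, which is exactly what makes larger projects easier to maintain.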
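For the AJAX idea, the usual trick is to skip the rendered HTML entirely and request the JSON endpoint the page calls behind the scenes (you can find it in your browser’s developer tools, under the network tab). A rough sketch with cURL, where the endpoint URL is a placeholder:

```php
<?php
// Decode a JSON response body into an associative array
function decodeAjax($body)
{
    return json_decode($body, true); // true = arrays rather than objects
}

// Fetch and decode a JSON endpoint with cURL
function fetchJson($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Some endpoints check for this header before responding
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
    $body = curl_exec($ch);
    curl_close($ch);
    return decodeAjax($body);
}

// e.g. $results = fetchJson('http://example.com/search.json?q=php');
```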
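As a taste of the captcha idea: the first step is usually cleaning the image (greyscale, then threshold to pure black and white) before handing it to an OCR engine such as Tesseract. Here the image is represented as a plain array of greyscale values (0-255) so the sketch is self-contained; in a real script you would read the pixels out with GD:

```php
<?php
// Threshold a greyscale image (an array of rows of 0-255 values) to 1s and
// 0s, where 1 = ink and 0 = paper, ready for an OCR pass
function binarize(array $pixels, $threshold = 128)
{
    $out = array();
    foreach ($pixels as $row) {
        $outRow = array();
        foreach ($row as $grey) {
            $outRow[] = ($grey < $threshold) ? 1 : 0;
        }
        $out[] = $outRow;
    }
    return $out;
}
```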

If any of these topics take your fancy, or there’s something else you want covered, leave your responses in the comments below.

9 thoughts on “Ideas For Web Scraping & Automation Projects Or Posts”

  1. Hi,

    Tried to send you a message via the “contact” section, but it just hung there and didn’t send the message.

    So, I just bought your ebook a few weeks ago and got stuck on the 9th chapter, about traversing multiple pages. I have several issues and questions regarding the ebook’s script:

    1. Always got the error message “Fatal error: Maximum execution time of 30 seconds exceeded in E:\xampp\htdocs\example\mpori.php on line 10”

    But I can get a normal result when I scrape another website, so is there something wrong with http://www.packtpub.com/books?keys=php ?

    2. How can I make the script traverse all of the website’s pages? With your script, it can only traverse the pages that are VISIBLE on the web (which is 13 pages; I tested on googleXML), while I was expecting the ebook to teach me how to make a script that can traverse all of a website’s pages.

    Looking forward to hearing from you soon.

    PS: can I have your private email address to discuss?

    Best Regards,
    Ligar

    1. 1. The target page may be taking longer than expected to load and your max execution time is set to 30 seconds. You can change this in your script with set_time_limit($seconds); where $seconds is the number of seconds as an integer.

      2. You should be able to extrapolate from there to crawl the entire website. Essentially, you would return all of the internal URLs on every page and scrape them, storing which pages you’ve already scraped to avoid duplicates. All of the info required to do this is in the book, despite there not being a specific example (it had to be cut due to the publisher’s restrictions on length).
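For point 1, the fix is a one-liner at the top of the script:

```php
<?php
// Allow the script up to five minutes before PHP aborts it
set_time_limit(300);

// Or lift the limit entirely (use with care on unattended scripts):
// set_time_limit(0);
```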
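And for point 2, the crawl loop described above might look something like this sketch. Here $fetch is any callable that returns the internal links on a page, which keeps the traversal logic separate from the networking:

```php
<?php
// Breadth-first crawl starting from $startUrl. $fetch($url) must return an
// array of internal URLs found on that page. Visited pages are recorded in
// $visited so nothing is scraped twice.
function crawl($startUrl, $fetch, $maxPages = 1000)
{
    $queue   = array($startUrl);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already scraped, skip the duplicate
        }
        $visited[$url] = true;

        foreach ($fetch($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

In a real crawler, $fetch would download the page (with cURL, say) and pull the anchors out with DOMDocument, filtering out external domains.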

      If you have more questions, I’d prefer to discuss them here rather than in private; then, if others are having the same issues, they can see too.

  2. I really like all of those ideas. Your book is supposed to arrive today, and I hope to read through it this weekend.

    I love the captcha-breaking idea. I have looked at companies like Death By Captcha, etc., but if OCR can do it, which I know it can to a point, it would be so much better.

    Posting pictures is something I never thought of with Twitter bots. I have run several for years, but that would add so much more. I am also working on a class to mimic a person on Twitter. I want to see if I can create a bot that acts “human” on Twitter.

    Thank you for this blog and your book! Keep up the good work.

  3. Hi Jacob,

    I am looking to program a stock screener for picking stocks. I have been using Google spreadsheets, but those are very limiting. I can use add-ons and scripts to import price quotes, 52-week highs, 52-week lows and such, but Google lacks a function to import dividend data. Some workarounds are available, but they have a lot of limitations; e.g. I can’t import the 5-year dividend history for a stock.

    I have been going to the NASDAQ site to look up historic dividends, e.g. http://www.nasdaq.com/symbol/cce/dividend-history

    A few years ago I took a PHP & MySQL course, so I can set up a basic web interface and enter data into a table using INSERT statements. What I would like to do is create a database and web interface that will take a list of stock ticker symbols, scrape the dividend history and then populate the database. If the dividend table is already populated, it would check the last entry date in the database and add only the more recent dividend or dividends.

    What I would like to do is feed the program a list of stocks, scrape the dividend history off the NASDAQ site (or similar) and populate the database. The objective is to identify stocks with a long history of increasing dividends; e.g. Coca-Cola has been increasing dividends for 51 years in a row. It’s a cash cow.

    Sound like a fun project?

    Thanks for your site.

    1. It not only ‘sounds like a fun project’, it is. I’ve done it before for multiple clients, including buy/sell arbitrage in other specific markets (mostly precious metals).

      Maybe I’ll do a post on it in the future, but it’s not the simplest of projects to complete, and there’s a chance of initial losses, which I don’t necessarily want to put people in a position to suffer. For example, developing one app for a client cost many thousands of dollars in testing losses before it worked correctly. Most people are not in a position to do this, hence my hesitance.

      But the basic idea of tracking stock prices, sure, I’ll do a post on that in the future.
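The ‘check the last entry date and add only newer dividends’ step could be sketched like this, using SQLite through PDO just for the example (swap the DSN for MySQL in practice); the table and column names here are invented:

```php
<?php
// Example schema; in a real screener this would live in MySQL
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE dividends (symbol TEXT, ex_date TEXT, amount REAL)');

// Insert only the scraped dividends newer than the latest one already stored
function addNewDividends(PDO $db, $symbol, array $scraped)
{
    $stmt = $db->prepare('SELECT MAX(ex_date) FROM dividends WHERE symbol = ?');
    $stmt->execute(array($symbol));
    $latest = $stmt->fetchColumn(); // null/false when no rows exist yet

    $insert = $db->prepare(
        'INSERT INTO dividends (symbol, ex_date, amount) VALUES (?, ?, ?)'
    );
    foreach ($scraped as $row) {
        // Each $row is array('ex_date' => 'YYYY-MM-DD', 'amount' => float);
        // ISO dates compare correctly as plain strings
        if (!$latest || $row['ex_date'] > $latest) {
            $insert->execute(array($symbol, $row['ex_date'], $row['amount']));
        }
    }
}
```

Running the scraper on a schedule and passing each symbol’s freshly scraped history through addNewDividends() keeps the table current without duplicating rows.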

    1. Sorry I’ve been slacking. I’ve been ill and now I’ve just moved house and have limited internet access. But don’t worry, I’m hoping to start a new little tutorial series today for using Twitter’s API.

Leave a Reply