Working on a recent freelance project, I ran into a few errors. No matter what I did, there was no way I was going to get a 100% success rate without writing around 15 extra individual scrapers to accommodate the 15 or so anomalies in the structure of the data.
Ultimately I ended up with around a 97% success rate on the data the customer required. This far exceeded their 90% acceptance rate. But it got me thinking – what is a realistic acceptable level of success when web scraping or data mining? Should we always be aiming for 100%, or is it sometimes just not worth it?
I haven’t seen this topic discussed elsewhere, though I’m sure it has been, so I thought I’d throw my 2 pennies into the pit and see what sort of response I get.
Initially, I’m thinking – what kind of volume of data are we talking about, where is it and how is it formatted? So for the purposes of this hypothetical I’ll take three examples that I have worked on within the last couple of months:
- 100 rows of 10 columns from an HTML table on a single page. (Table of statistics).
- 10k rows of 10 columns from pages linked off a single page. Badly formatted HTML, almost every page different in some way. (Wedding photographers' details).
<br><br> </td></tr></tbody></table> </td></tr></tbody></table> <h2>West Midlands</h2> Mahbub Ahmed <br> Mabz Photography<br> 4 Madeley Road<br> Birmingham<br> West Midlands<br> B11 1UX <a href="http://www.multimap.com/maps/?title=Mabz Photography+West Midlands&hloc=GB|B11 1UX " target="_blank"><img src="../art/map.gif" alt="View a Map for the location of Mabz Photography" height="12" border="0" width="22"></a><br> England<br><br> tel:- 07966140541<br> web address:- <a href="http://anonsite.co.uk/members/goto/rd.cgi?redir=http://www.mabzphotography.com/" title="website Ahmed">www.mabzphotography.com/</a> <br>blog:- <a href="http://anonsite.co.uk/members/goto/rd.cgi?redir=http://www.mabzphotography.com/blog">www.mabzphotography.com/blog</a> <br>e-mail:- <a href="<a href="mailto:email@example.com?subject=enquiry%20from%20SWPP%20-%20swpp.co.uk">mailto:firstname.lastname@example.org?subject=enquiry from SWPP - anonsite.co.uk</a>">Mahbub Ahmed</a> <br>facebook:- <a href="http://www.facebook.com/mabzphotography " target="_blank"> Mabz Photography</a> <br><br>Specialist Photographer for :-<br> Wedding photography<br> <p><a href="http://anonsite.co.uk/members/goto/rd.cgi?redir=http://www.mabzphotography.com/"><img src="../members_pictures/Ahmed126227.jpg" alt="an example of the images created by Mahbub Ahmed"></a></p> <br> Cutting edge wedding photographer, specialising in Hindu, Punjbi, Muslim weddings all over the UK and abroad.<br> <br> Freelance photographer<br> Full time professional<br> <br>Other Wedding photographic services<br><br>Wedding Traditional<br>Wedding Contemporary<br>Wedding Photojournalism<br>Storybook wedding<br>Muslim Weddings<br>Hindu Weddings<br>punjabi weddings<br>
- 100k+ rows of 15 columns from an entire site crawl and scrape. Highly semantically marked-up HTML. (Ecommerce site).
Take the first example, the single HTML table. Given the simplicity of scraping a single table of data, there really is no room for error here. Even if it were 100k rows, if it’s in a single table, there is no excuse for missing any data in the final deliverable.
- Minimum success rate: 100%
- Initial success rate: 100%
- Final success rate: 100%
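To show why a single table leaves so little room for error, here is a minimal sketch using only Python's standard library. The `TableParser` class, the `parse_table` helper and the sample markup are all made up for illustration, and it assumes a simple table with no nested tables or `rowspan`/`colspan` tricks.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample data, not taken from the real project.
sample = """
<table>
  <tr><th>Team</th><th>Played</th><th>Points</th></tr>
  <tr><td>Alpha</td><td>10</td><td>21</td></tr>
  <tr><td>Beta</td><td>10</td><td>17</td></tr>
</table>
"""

rows = parse_table(sample)
header, data = rows[0], rows[1:]

# A row counts as a success only if it has one value per header column.
complete = [r for r in data if len(r) == len(header)]
success_rate = len(complete) / len(data)
print(f"{len(complete)}/{len(data)} rows complete ({success_rate:.0%})")
```

Because every row follows the same pattern, checking the column count per row is usually enough to confirm that nothing was dropped.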
On to the second example, the wedding photographers' site. Given the complete lack of structure and almost no semantically relevant details, we should expect our success rate with this site to be much lower. However, every page that needs to be scraped is linked to nicely from one location, so there is no digging or categorisation to be done here. It’s also obviously a home-made website, and it’s on cheap hosting which doesn’t throttle the number of connections or the speed with which you can access the pages. Given that, we’re very unlikely to experience any issues from our IPs being banned. With a little bit of work we can build a scraper that should perform reasonably well under these conditions.
- Minimum success rate: 90%
- Initial success rate: 97%
- Final success rate: 97%
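For markup like the wedding photographers' pages, with no useful tags to hook into, one approach is to match on the text labels instead and count a page as a success only when the required fields turn up. The sketch below is an illustration under assumptions, not the scraper I actually used: the patterns, the `REQUIRED` field list and the two sample pages are invented, and the postcode pattern is only a loose approximation of UK postcodes.

```python
import re

# Label-based patterns, since the markup offers no classes or ids to target.
PATTERNS = {
    "tel": re.compile(r"tel:-\s*([\d\s+()]+)", re.I),
    "web": re.compile(r"web address:-\s*<a[^>]*>([^<]+)</a>", re.I),
    # Loose approximation of a UK postcode, e.g. "AB1 2CD".
    "postcode": re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})\b"),
}

# Fields a record must contain to count as successfully scraped (assumed).
REQUIRED = ("tel", "postcode")


def extract(html):
    """Return whichever labelled fields can be found in one page."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(html)
        if match:
            record[name] = match.group(1).strip()
    return record


def success_rate(pages):
    """Fraction of pages that yield every required field."""
    ok = sum(1 for html in pages if all(k in extract(html) for k in REQUIRED))
    return ok / len(pages)


# Two invented pages: one follows the usual layout, one is an anomaly.
good = (
    'Example Studio<br> 1 High Street<br> Anytown<br> AB1 2CD '
    '<a href="#">map</a><br> tel:- 01632 960000<br> web address:- '
    '<a href="http://www.example.com/">www.example.com/</a> <br>'
)
bad = (
    'Another Studio<br> Othertown<br> phone 01632 960001<br> '
    '<a href="http://www.example.org/">site</a>'
)

rate = success_rate([good, bad])
print(f"success rate: {rate:.0%}, meets 90% target: {rate >= 0.90}")
```

Each anomaly like the second page would need its own extra pattern, which is where the cost of chasing the last few percent comes from.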
The third example is a big ecommerce site, hosted on a dedicated server and set up well to restrict the kind of activity we are going to be performing. Despite this, with the markup being so semantically strict and consistent across every page, it should be very easy to find and scrape the data once we’ve accessed the page. The only real issue we face is the sheer number of pages and making sure we find every single one. If we use lots of proxies and go at it slowly, we should be aiming to get a very high success rate and scrape almost all of the data available.
- Minimum success rate: 95%
- Initial success rate: 96%
- Final success rate: ~99%
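The gap between an initial and a final success rate usually comes from going back over the pages that failed. Here is a rough sketch of that idea: crawl slowly, collect the failures, and retry them on later passes at an even slower pace. The `crawl` function, its parameters and the `flaky_fetch` stand-in are hypothetical. A real `fetch` would make the HTTP request (through a proxy if needed), which is deliberately left out here.

```python
import time


def crawl(urls, fetch, passes=3, delay=1.0):
    """Fetch every URL, re-queuing failures for later passes.

    `fetch` is any callable that returns page content or raises on failure.
    Returns (results, still_failed).
    """
    results = {}
    pending = list(urls)
    for _ in range(passes):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:
                failed.append(url)
            time.sleep(delay)  # go at it slowly
        pending = failed
        if not pending:
            break
        delay *= 2  # back off further before the next pass
    return results, pending


# Stand-in for a real fetcher: page 3 fails once, page 7 always fails.
calls = {}


def flaky_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/3") and calls[url] < 2:
        raise IOError("temporary failure")
    if url.endswith("/7"):
        raise IOError("permanent failure")
    return "<html>...</html>"


urls = [f"https://example.com/p/{i}" for i in range(10)]
results, failed = crawl(urls, flaky_fetch, passes=3, delay=0)
print(f"scraped {len(results)}/{len(urls)}, still failing: {failed}")
```

Whatever is left in `failed` after the last pass is the set of pages you either write special handling for or accept as lost.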
I’m not too sure where I was going with this post actually …oh yes, acceptable success rates. Ideally we obviously want a 100% success rate first time around, every time, or failing that, 100% after some tweaking and a few runs at the problem. This, however, is not very realistic in the field, and so sometimes we have to settle for something slightly less than perfect.