Go Articles Scraper [PHP]

This past week I’ve taken on a few freelance jobs, mostly data scraping and manipulation.

Anyway, one job was to scrape a whole host of article sites for content and then do some funky stuff with it.

I thought I’d share my script for scraping Go Articles here. It isn’t particularly wonderful or perfect (I literally threw it together in 10 minutes for this one-off job), but it does the job. I’ve got scripts for a few more sites too, which I’ll put up some time soon.

	// Go Articles Scraper Script by Jacob Ward
	// Created on Oct 17 2011
	// For updates or more info visit http://www.jacobward.co.uk/go-articles-scraper-php/

	$keyword = "keyword";	// Keyword to scrape articles for
	$num = "100";	// Number of articles to scrape
	$start_scrape = "1";	// From what point to start the scrape, useful if you need to resume the scrape from a later point.

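	// Function to scrape the content between a start string and an end string (case-insensitive)
	function scrape_result($item, $start, $end){
		$item = stristr($item, $start);
		if ($item === false) {
			return "";	// Start marker not found
		}
		$item = substr($item, strlen($start));
		$stop = stripos($item, $end);
		if ($stop === false) {
			return "";	// End marker not found
		}
		return substr($item, 0, $stop);
	}

	$search_results = file_get_contents("http://goarticles.com/search/?start=" . $start_scrape . "&limit=" . $num . "&q=" . urlencode($keyword));	// Performing search for keyword (URL-encoded so spaces and symbols are safe)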
	// Scraping links from page
	$scraped_array = array();	// Collected article URLs
	$article_urls_scrape = explode("<div class=\"s_article_info\">", $search_results);	// Breaking search results page up into separate results
	foreach ($article_urls_scrape as $url_to_scrape) {	// Foreach separate result scrape the url of the article
		$scraped_url = scrape_result($url_to_scrape, "<a href=\"", "\" target=\"newarticle\"");
		if (stristr($scraped_url, "/article/")) {
			$scraped_array[] = "http://goarticles.com" . $scraped_url;
		}
	}

	// Scraping articles and saving them to text files
	foreach($scraped_array as $article_page) {	// Foreach article url
		$article_to_scrape = file_get_contents($article_page);	// Getting page content
		if ($article_to_scrape === false) {
			continue;	// Skipping pages that fail to download
		}
		$title = scrape_result($article_to_scrape, "<h1 class=\"art_head\">", "&nbsp;&nbsp;");	// Scraping title
		$article = scrape_result($article_to_scrape, "<div class=\"KonaBody\">", "</div>");	// Scraping article
		$article = str_replace("<p>", "<br />", $article);	// Replacing <p></p> tags with <br />. This was needed for this project, so you may want to change it. Personally I'd strip all HTML tags and reformat the text accordingly.
		$article = str_replace("</p>", "<br />", $article);
		echo "Saving article: " . $title . "\n";	// Letting you know the article's being saved
		$file_name = preg_replace('/[^A-Za-z0-9 _-]/', '', $title);	// Removing characters that are unsafe in file names, such as slashes
		$handle = fopen("articles/" . $file_name . ".txt", "w");	// Opening file named after the article in directory articles/ (the directory must already exist)
		fwrite($handle, $article);	// Writing content to file
		fclose($handle);	// Closing file
	}
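
The comment in the script mentions that stripping all HTML tags and reformatting the text might be preferable to swapping <p> for <br />. Here is a rough sketch of how that could look. The function name article_html_to_text is made up for this example and isn't part of the original script; it also assumes the article body only needs paragraph and line breaks preserved.

```php
<?php
// Hypothetical helper (not part of the original script): converts an
// article's HTML fragment to plain text, keeping paragraph breaks.
function article_html_to_text($html) {
	// Turn paragraph ends and line breaks into newlines before the tags are removed
	$text = preg_replace('/<\/p\s*>/i', "\n\n", $html);
	$text = preg_replace('/<br\s*\/?>/i', "\n", $text);
	// Remove every remaining tag
	$text = strip_tags($text);
	// Decode entities such as &amp; back into plain characters
	$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
	// Collapse runs of three or more newlines into a single blank line
	$text = preg_replace('/\n{3,}/', "\n\n", $text);
	return trim($text);
}

// Example with a made-up fragment
echo article_html_to_text("<p>First &amp; foremost.</p><p>Second<br />line.</p>") . "\n";
```

In the script above, you could call this on $article in place of the two str_replace lines before writing the file.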

