Web Scraper
Web Scraper is a library to build APIs by scraping static sites and use data as models.
Installation
gem install web_scraper
require 'web_scraper'
Usage
Example
require 'web_scraper'
class Article < WebScraper
resource 'http://hbswk.hbs.edu/topics/it.html'
base css: '.tile-medium'
property :title, xpath: './/h4/a/text()'
property :date, xpath: './/li[1]/text()'
property :category, xpath: './/li[2]/a/text()'
property :description, xpath: './/p/text()'
key :title
end
puts "#{Article.count} articles were found"
puts
articles = Article.all
articles.each do |article|
header = article.title
puts header
puts '=' * header.length
puts
subheader = "#{article.date} #{article.category}"
puts subheader
puts '-' * subheader.length
puts
puts article.description
puts
end
article = Article.find('Tech Investment the Wise Way')
puts article.description
Output
Optimal Auction Design and Equilibrium Selection in Sponsored Search Auctions
=============================================================================
14 Jan 2010 Working Papers
--------------------------
Reserve prices may have an important impact on search advertising marketplaces. But the effect of reserve prices can be opaque, particularly because it is not always straightforward to compare "before" and "after" conditions. HBS professor Benjamin G. Edelman and Yahoo's Michael Schwarz use a pair of mathematical models to predict responses to reserve prices and understand which advertisers end up paying more.
The IT Leader’s Hero Quest
==========================
11 May 2009 Research & Ideas
----------------------------
Think you could be CIO? Jim Barton is a savvy manager but an IT newbie when he's promoted into the hot seat as chief information officer in , a novel by HBS professors and and coauthor . Can Barton navigate his strange new world quickly enough? Q&A with the authors, and book excerpt.
Reference
WebScraper.all
Loads html page, detects appropriate blocks,
wraps them in objects.
The result will be cached.
articles = Article.all
WebScraper.count
Returns number of objects found.
puts "#{Article.count} articles were found"
WebScraper.reset
Resets cache of the html data.
Article.reset
WebScraper.find(key)
Finds first object with required key.
article = Article.find('Tech Investment the Wise Way')
WebScraper.resource(_resource)
Defines resource -- url of the html page.
class Article < WebScraper
...
resource 'http://hbswk.hbs.edu/topics/it.html'
...
end
WebScraper.base(_base)
Defines base -- selector which determines blocks of content.
You can use css or xpath selectors.
class Article < WebScraper
...
base css: '.tile-medium'
...
end
WebScraper.property(*args)
Defines property -- name (and type optionally) and selector.
You can use css or xpath selectors.
Types determine returning values.
Available types (default is string): string, integer, float, node.
The node option means nokogiri node.
class Article < WebScraper
...
property :title, xpath: './/h4/a/text()'
property views: :integer, xpath: './/h4/span/text()'
...
end
WebScraper.key(_key)
Defines key -- property which will be used in find method.
class Article < WebScraper
...
key :title
...
end
WebScraper#css(*args)
Allows you to use nokogiri css method directly on your object.
It proxies it to nokogiri node.
WebScraper#xpath(*args)
Allows you to use nokogiri xpath method directly on your object.
It proxies it to nokogiri node.
WebScraper#property
WebScraper#method_missing(name, *args, &block)
Returns appropriate value for property if found.
Converts it to the defined type.
puts article.description
Author (Speransky Danil): Personal Page | LinkedIn | GitHub | StackOverflow