Scrape jobs from Hacker News with Mechanize · Web Scraping

06 Jan 2016
By Maxim Dzhuliy
web_scraping

Today we continue parsing Hacker News: in this article we will parse the Jobs page. If you don’t know what we are doing, read the article about scraping Hacker News newest posts with Mechanize.

Scrape jobs

Before doing anything else, we change new_links_parser.rb and extract a base class with helper methods. This class is small and simple.

require 'rubygems'
require 'mechanize'

module Parsers
  module HackerNews
    class BaseParser
      BASE_URL = 'https://news.ycombinator.com'
      attr_accessor :agent, :data

      def initialize
        @agent = Mechanize.new do |agent|
          agent.user_agent_alias = 'Mac Safari'
        end
        @data = []
      end

      # Pagination helper: yields once per page, then follows the "More" link
      # until `pages` pages have been visited or no "More" link is left
      def view_more(pages = 1)
        page_num = 1
        loop do
          yield

          next_link = @page.link_with(text: 'More')
          break if page_num >= pages || !next_link

          page_num += 1
          @page = next_link.click
        end
      end
    end
  end
end
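
To see how this kind of pagination loop behaves without touching the network, here is a minimal sketch that swaps Mechanize pages for stand-ins. FakePage, FakeLink and visit_pages are hypothetical names used only for illustration; they are not part of Mechanize or of the repository. The sketch assumes the intent is to visit at most `pages` pages.

```ruby
# Hypothetical stand-ins for Mechanize pages and links (illustration only).
FakeLink = Struct.new(:target) do
  # Clicking a link returns the page it points to.
  def click
    target
  end
end

FakePage = Struct.new(:name, :next_page) do
  # Mimics the shape of a link_with(text: 'More') lookup:
  # returns a link object, or nil when there is no next page.
  def link_with(text:)
    next_page && text == 'More' ? FakeLink.new(next_page) : nil
  end
end

# Visits at most `pages` pages, following the "More" link between them,
# and returns the names of the pages that were visited.
def visit_pages(first_page, pages = 1)
  page = first_page
  visited = []
  page_num = 1
  loop do
    visited << page.name

    next_link = page.link_with(text: 'More')
    break if page_num >= pages || !next_link

    page_num += 1
    page = next_link.click
  end
  visited
end

page3 = FakePage.new('page 3', nil)
page2 = FakePage.new('page 2', page3)
page1 = FakePage.new('page 1', page2)

p visit_pages(page1, 2)  # stops after two pages
p visit_pages(page1, 10) # stops earlier, when there is no "More" link
```

The loop has two exits: the page limit and a missing "More" link, so asking for more pages than exist is harmless.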

The jobs parser is simpler than the parser for new posts.

require_relative 'base_parser'

module Parsers
  module HackerNews
    class Jobs < BaseParser

      # Runs a parsing process
      # Params:
      # +limit_url+:: string URL at which parsing stops (may be nil)
      # +pages+:: number of pages for parsing (default: 1)
      def parse_links(limit_url = nil, pages = 1)
        @page = @agent.get "#{BASE_URL}/jobs"
        view_more(pages) do
          links = @page.search("//td[contains(@class,'title') and "\
                              "not(contains(.,'More'))]/a")

          links.each do |link|
            current_data = scrape_link_data(link)
            return @data if limit_url && current_data[:link] == limit_url

            break if @data.include?(current_data)
            @data << current_data
          end
        end
        # Return the collected jobs when pagination ends before limit_url is hit
        @data
      end

      # Scrape data from link
      def scrape_link_data(link)
        { title: link.text, link: link[:href] }
      end
    end
  end
end
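
The two stop conditions in parse_links can be tried on plain data. In the sketch below, collect_jobs is a hypothetical helper and the arrays of hashes stand in for the links scraped from each page; the example.com URLs are made up.

```ruby
# Hypothetical helper: walks pages of already-scraped jobs and applies the
# same two checks as parse_links.
def collect_jobs(pages_of_jobs, limit_url = nil)
  data = []
  pages_of_jobs.each do |jobs|
    jobs.each do |job|
      # Stop everything once the job with the limit URL is reached.
      return data if limit_url && job[:link] == limit_url

      # A job that was already collected ends processing of the current page.
      break if data.include?(job)
      data << job
    end
  end
  data
end

page_one = [
  { title: 'Job A', link: 'https://example.com/a' },
  { title: 'Job B', link: 'https://example.com/b' }
]
page_two = [
  { title: 'Job C', link: 'https://example.com/c' }
]

# Only the jobs listed before the one whose link equals limit_url are kept.
p collect_jobs([page_one, page_two], 'https://example.com/c')
```

Assuming the Jobs page lists the newest postings first, passing the link of the newest job from a previous run as limit_url would collect only the jobs posted since then.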

Result data:

[{:title=>"Flexport is hiring software engineers",
  :link=>"https://angel.co/flexport/jobs"},
 {:title=>
   "Streak hiring Senior Engineers to build large scale index of all business email",
  :link=>"https://www.streak.com/careers#SeniorBackend"},
 {:title=>
   "Automatic (YC S11) Is Hiring a Car-Loving Principal Server Engineer",
  :link=>"https://boards.greenhouse.io/automatic/jobs/63773#.VpAV-5MrKRs"},
 {:title=>"Zidisha (YC W14) spring microfinance internships (remote)",
  :link=>"https://www.zidisha.org/volunteer"},
 {:title=>"Airware (YC W13) is crunching drone data in SF",
  :link=>"item?id=10865630"},
 {:title=>
   "Benchling (YC S12) is hiring engineers to build the backbone of biotech",
  :link=>
   "https://jobs.lever.co/benchling/f916b4d9-59ea-4346-a4a0-daa18bf46fa4?lever-source=Hacker%20News"},
 #...........
 ]
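
Note that some links in the result are relative (for example item?id=10865630), because those jobs are hosted on Hacker News itself. One possible way to turn them into absolute URLs is URI.join from the Ruby standard library. This is a sketch: absolute_link and HN_BASE_URL are hypothetical names, not part of the parser above.

```ruby
require 'uri'

# Same host as BASE_URL in the parser, with a trailing slash for joining.
HN_BASE_URL = 'https://news.ycombinator.com/'

# Hypothetical helper: resolves a scraped href against the Hacker News base
# URL. Absolute URLs pass through unchanged.
def absolute_link(href)
  URI.join(HN_BASE_URL, href).to_s
end

puts absolute_link('item?id=10865630')
puts absolute_link('https://www.zidisha.org/volunteer')
```

A helper like this could be applied to link[:href] inside scrape_link_data if absolute URLs are preferred in the output.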

If you have any suggestions, leave them in the comments. Pull requests are welcome in the repository: WScraping/hacker_news
