Scrape UpWork skills with Nokogiri

Hello, dear visitors.

Today we will scrape programmers’ skills from UpWork with Nokogiri and, as a small bonus, we will also scrape them with Selenium. Maybe it will be helpful for someone.

Scraping Tools

What we need:

Gems:

  • nokogiri
  • selenium-webdriver

Scraping logic

  1. Open UpWork’s skills page
  2. Find all li tags with the class skill-item
  3. From each li tag, get the a tag and extract its text (the link text)
  4. Repeat for all available skills pages

Note: we could click each skill page link at the top (#, A, B, C, D, etc.), but I’m very lazy ;)

What I found: after you click to the next page (for example, B), the URL changes to https://www.upwork.com/i/freelancer-skills-b/

If you click to the next page (C), it becomes https://www.upwork.com/i/freelancer-skills-c/, and the same pattern holds for every page.

So we will simply open URLs of the form https://www.upwork.com/i/freelancer-skills-{needed_page}/
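
As a quick sanity check, here is how those per-letter URLs can be generated in Ruby (a throwaway demonstration, not part of the final parser):

('a'..'c').each do |page|
  puts "https://www.upwork.com/i/freelancer-skills-#{page}/"
end
# https://www.upwork.com/i/freelancer-skills-a/
# https://www.upwork.com/i/freelancer-skills-b/
# https://www.upwork.com/i/freelancer-skills-c/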

Scraping with Nokogiri

First of all, we need to create a new “project” like this:

~/web_scraping/up_work_skills$ tree
.
├── Gemfile
├── Gemfile.lock
├── parser.rb
├── README.md
└── selenium_parser.rb

Description of files:

  • Gemfile - describes the gem dependencies for our Ruby program
  • Gemfile.lock - the file where Bundler records the exact gem versions that were installed
  • parser.rb - our Nokogiri-based parser
  • selenium_parser.rb - the same parser, but built with Selenium
  • README.md - readme file for the GitHub repo

  1. First step: add dependencies for our scraper

Edit your Gemfile:

# the gem source; it can be changed
source 'https://rubygems.org'

# needed gems
gem 'nokogiri'
gem 'selenium-webdriver'

Save, and run

bundle install
  2. Second step: let’s scrape skills

Open your parser.rb, and let’s do some magic :) Require our gems:

require 'rubygems'
require 'net/http'
require 'nokogiri'

Open the skills page (the page variable will be defined in a moment):

uri = URI("https://www.upwork.com/i/freelancer-skills-#{page}/")

Create Nokogiri page with specific configuration:

  • noblanks - Remove blank nodes, for performance
  • nonet - Forbid network connections during parsing
  • noerror - Suppress error reports

doc = Nokogiri::HTML(Net::HTTP.get(uri)) do |config|
  config.strict.nonet.noblanks.noerror
end

Scraping logic, step 2:

Nokogiri supports two ways of finding elements:

  • CSS - finds elements by DOM selectors
  • XPath - a dedicated query language; we will use it in the Selenium parser

doc.css('.skill-item')

It returns all the needed li items as an array-like NodeSet. Let’s use it:

doc.css('.skill-item').each do |item|
  skill = item.css('a').text
end

What are we doing here? For each li, we find the a element and call #text on it (#text is a Nokogiri method). So on each iteration, the skill variable holds a skill name. Of course, we will wrap the scraping in a class, and store the skill names in an instance variable.

doc.css('.skill-item').each do |item|
  skill = item.css('a').text
  # #text always returns a String (never nil), so check for empty instead
  @data << skill unless skill.empty?
end

Currently, we scrape one page, but we want to scrape all pages from UpWork.

Let’s move on to step 4 of the scraping logic.

Create an initialize method with a pages parameter. What does it do? We pass in an array or a range of letters, like ['a','b','f','e','m'] or ('b'..'x'), and store it in the @pages variable.

def initialize(pages)
  @pages = pages || ('a'..'z')
  @data = []
end

And customize our scrape method:

def scrape_pages
  @pages.each do |page|
    uri = URI("https://www.upwork.com/i/freelancer-skills-#{page}/")

    # and other things
    # ...
  end
end
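
Putting the pieces together, here is a minimal sketch of the complete parser.rb (the class name UpworkSkillsScraper is my own illustrative choice, not necessarily what the repository uses):

require 'rubygems'
require 'net/http'
require 'nokogiri'

# Minimal sketch; the class name is illustrative
class UpworkSkillsScraper
  attr_reader :data

  def initialize(pages = nil)
    @pages = pages || ('a'..'z')
    @data = []
  end

  def scrape_pages
    @pages.each do |page|
      uri = URI("https://www.upwork.com/i/freelancer-skills-#{page}/")

      # Fetch and parse the page with the configuration described above
      doc = Nokogiri::HTML(Net::HTTP.get(uri)) do |config|
        config.strict.nonet.noblanks.noerror
      end

      # Collect the skill names
      doc.css('.skill-item').each do |item|
        skill = item.css('a').text
        @data << skill unless skill.empty?
      end
    end
    @data
  end
end

# Usage example:
# scraper = UpworkSkillsScraper.new(('a'..'c'))
# scraper.scrape_pages
# puts scraper.data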

Scrape skills with Selenium

What is Selenium?

Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.

  1. Create a Selenium browser

For parsing we use Firefox, but you can use a different browser. See the list of supported browsers and platforms.

@browser = Selenium::WebDriver.for :firefox

Selenium can find elements via any of these locators (a few examples follow the list):

  • Id
  • Class
  • Name
  • Link text
  • Tag name
  • CSS selector
  • XPath
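
To make those concrete, here are a few locator examples in the Ruby bindings (the selector values are hypothetical, just to show the API shape):

@browser.find_element(:id, 'layout')              # by id
@browser.find_elements(:class, 'skill-item')      # by class
@browser.find_element(:name, 'q')                 # by name
@browser.find_element(:link_text, 'Ruby')         # by link text
@browser.find_elements(:tag_name, 'li')           # by tag name
@browser.find_elements(:css, 'li.skill-item > a') # by CSS selector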

But how can we find the li elements with the links we need? XPath is a query language for addressing parts of an XML document; you can read more about it on Wikipedia.

links_xpath = "//*[contains(@class, 'skill-item')]/a"
links = @browser.find_elements(:xpath, links_xpath)

Breaking down the XPath:

  • // - means descendant-or-self, i.e. search anywhere in the document
  • *[...] - any element that satisfies the conditions in the brackets
  • contains(@class, 'skill-item') - the element’s class attribute contains skill-item
  • /a - for each matched element, select its child a node
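
To see the XPath in action without launching a browser, here is a tiny self-contained check against made-up markup (Nokogiri accepts the same XPath syntax as Selenium; the HTML is simplified for illustration):

require 'nokogiri'

html = <<~HTML
  <ul>
    <li class="skill-item"><a>Ruby</a></li>
    <li class="skill-item"><a>Rails</a></li>
    <li class="other-item"><a>Not a skill</a></li>
  </ul>
HTML

doc = Nokogiri::HTML(html)
# The same expression we pass to Selenium
puts doc.xpath("//*[contains(@class, 'skill-item')]/a").map(&:text)
# Output:
# Ruby
# Rails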

The links variable now holds an array of link elements. We need to get the text from each one.

links.map { |item| item.text }
# or
links.map(&:text)

The result is the same, but the second variant is more readable.
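
One thing the snippets above don’t show is actually navigating the browser to each skills page. Here is a minimal sketch of the whole selenium_parser.rb under the same page-letter scheme (the class name and overall structure are my own illustration):

require 'rubygems'
require 'selenium-webdriver'

# Minimal sketch; the class name is illustrative
class SeleniumSkillsScraper
  attr_reader :data

  def initialize(pages = nil)
    @pages = pages || ('a'..'z')
    @browser = Selenium::WebDriver.for :firefox
    @data = []
  end

  def scrape_pages
    @pages.each do |page|
      # Drive the real browser to each skills page
      @browser.navigate.to "https://www.upwork.com/i/freelancer-skills-#{page}/"
      links = @browser.find_elements(:xpath, "//*[contains(@class, 'skill-item')]/a")
      @data.concat(links.map(&:text))
    end
    @data
  ensure
    @browser.quit # always close the browser when we are done
  end
end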

Summary

You can find all the code in the GitHub repository: WScraping/up_work_skills

Ask questions in the comments, point out any mistakes, and suggest the website you would like to see scraped next.



