1 change: 1 addition & 0 deletions .rspec
@@ -0,0 +1 @@
--require spec_helper
7 changes: 7 additions & 0 deletions Gemfile
@@ -0,0 +1,7 @@
# frozen_string_literal: true

source "https://rubygems.org"

gem "rspec", "~> 3.13"
gem "nokogiri", "~> 1.19"
gem "ferrum", "~> 0.17"
65 changes: 65 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,65 @@
GEM
  remote: https://rubygems.org/
  specs:
    addressable (2.9.0)
      public_suffix (>= 2.0.2, < 8.0)
    base64 (0.3.0)
    concurrent-ruby (1.3.7)
    diff-lcs (1.6.2)
    ferrum (0.17.2)
      addressable (~> 2.5)
      base64 (~> 0.2)
      concurrent-ruby (~> 1.1)
      webrick (~> 1.7)
      websocket-driver (~> 0.7)
    nokogiri (1.19.4-arm64-darwin)
      racc (~> 1.4)
    public_suffix (7.0.5)
    racc (1.8.1)
    rspec (3.13.2)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.6)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.8)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.7)
    webrick (1.9.2)
    websocket-driver (0.8.2)
      base64
      websocket-extensions (>= 0.1.0)
    websocket-extensions (0.1.5)

PLATFORMS
  arm64-darwin-23

DEPENDENCIES
  ferrum (~> 0.17)
  nokogiri (~> 1.19)
  rspec (~> 3.13)

CHECKSUMS
  addressable (2.9.0) sha256=7fdf6ac3660f7f4e867a0838be3f6cf722ace541dd97767fa42bc6cfa980c7af
  base64 (0.3.0) sha256=27337aeabad6ffae05c265c450490628ef3ebd4b67be58257393227588f5a97b
  concurrent-ruby (1.3.7) sha256=4412caec3a5ea2e5fdc52076724c071a81f2c0593d83b2ac8cbb8ca63b3151b0
  diff-lcs (1.6.2) sha256=9ae0d2cba7d4df3075fe8cd8602a8604993efc0dfa934cff568969efb1909962
  ferrum (0.17.2) sha256=2c2540a850b211a46f4d81de21bfd62048f507e4c327d1807225c3823c17e6ee
  nokogiri (1.19.4-arm64-darwin) sha256=a46db9853286e6597b36ebc6953817d15acf3a299583eb3f89fdc6f91dd63527
  public_suffix (7.0.5) sha256=1a8bb08f1bbea19228d3bed6e5ed908d1cb4f7c2726d18bd9cadf60bc676f623
  racc (1.8.1) sha256=4a7f6929691dbec8b5209a0b373bc2614882b55fc5d2e447a21aaa691303d62f
  rspec (3.13.2) sha256=206284a08ad798e61f86d7ca3e376718d52c0bc944626b2349266f239f820587
  rspec-core (3.13.6) sha256=a8823c6411667b60a8bca135364351dda34cd55e44ff94c4be4633b37d828b2d
  rspec-expectations (3.13.5) sha256=33a4d3a1d95060aea4c94e9f237030a8f9eae5615e9bd85718fe3a09e4b58836
  rspec-mocks (3.13.8) sha256=086ad3d3d17533f4237643de0b5c42f04b66348c28bf6b9c2d3f4a3b01af1d47
  rspec-support (3.13.7) sha256=0640e5570872aafefd79867901deeeeb40b0c9875a36b983d85f54fb7381c47c
  webrick (1.9.2) sha256=beb4a15fc474defed24a3bda4ffd88a490d517c9e4e6118c3edce59e45864131
  websocket-driver (0.8.2) sha256=97c556b019bf3410b4961002ac501621e9322d3f8a7bc02161a09301cc4c4146
  websocket-extensions (0.1.5) sha256=1c6ba63092cda343eb53fc657110c71c754c56484aad42578495227d717a8241

BUNDLED WITH
   4.0.10
54 changes: 54 additions & 0 deletions files/gerhard-richter-paintings.html

Large diffs are not rendered by default.

51 changes: 51 additions & 0 deletions files/rene-magritte-paintings.html

Large diffs are not rendered by default.

53 changes: 53 additions & 0 deletions lib/file_scraper.rb
@@ -0,0 +1,53 @@
# frozen_string_literal: true

require "ferrum"

Author comment: I used Ferrum as the web driver because it is headless by default and simple to use, at least for this challenge.

Ferrum has not yet reached a 1.0 release, so selenium-webdriver could be used as an alternative.

require "json"
require "nokogiri"

class FileScraper
  DOMAIN_NAME = "https://www.google.com"

  def self.run(file_path)
    html = extract_html(file_path)

    document = Nokogiri::HTML(html)

    artworks = document.css("g-loading-icon + div").children

Author comment: This selection could be done in CSS alone with `g-loading-icon + div > *`.

I used the `children` method instead because it makes clearer what is being selected than the `> *` selector. The two are not strictly equivalent: `children` also returns text nodes, whereas `> *` matches element children only.


    result = artworks.map do |artwork|
      extensions = artwork.css("img + div").children.map do |extension|

Author comment: As with `artworks` above, this selection could be done in CSS alone with `img + div > *`.

I used the `children` method because it is clearer and consistent with the approach above.

        extension.text if extension && !extension.text.empty?
      end
      name = extensions.shift
      relative_path = artwork.at("a")["href"]
      data_image = artwork.at("img")["data-src"]
      src_image = artwork.at("img")["src"]
      {
        name:,
        extensions: (extensions unless extensions.compact.empty?),
        link: DOMAIN_NAME + relative_path,
        image: data_image || src_image,
      }.compact
    end

    JSON.generate(artworks: result)
  end
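
As a rough illustration of how `run` above shapes each entry, the following sketch applies the same `shift` and `compact` steps to plain arrays of strings, with no browser or Nokogiri involved. `build_artwork`, its keyword arguments, and the sample values are hypothetical and exist only for this example.

```ruby
# frozen_string_literal: true

require "json"

# Hypothetical helper mirroring the hash built inside FileScraper.run.
# `texts` stands in for the text of each child node (nil when the node is empty).
def build_artwork(texts, relative_path, data_image: nil, src_image: nil, domain: "https://www.google.com")
  extensions = texts.dup
  name = extensions.shift # the first child holds the artwork name

  {
    name: name,
    # nil when no extension text survives, so `compact` drops the key.
    extensions: (extensions unless extensions.compact.empty?),
    link: domain + relative_path,
    # Prefer the lazily loaded image, falling back to the inline one.
    image: data_image || src_image,
  }.compact
end

# Made-up inputs: one entry with an extension and an image, one with neither.
with_year = build_artwork(["The Starry Night", "1889"], "/search?q=a", src_image: "data:image/jpeg;base64,AAAA")
without_year = build_artwork(["Sunflowers", nil], "/search?q=b")

puts JSON.generate(artworks: [with_year, without_year])
```

The second entry ends up without `extensions` and `image` keys at all, which is the effect of the trailing `.compact`.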

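  # A bare `private` has no effect on methods defined with `def self.`,
  # so the class method is made private explicitly.
  private_class_method def self.extract_html(file_path)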

Author comment: I started by parsing the HTML files directly with Nokogiri, but found that a browser was needed to execute the JavaScript that loads the thumbnail images, so I switched to a web driver to render the page.

I kept HTML extraction in a separate method because:

  • It can be extended to scrape the page in different ways, such as with or without a web driver.
  • Keeping the web driver code inside this method makes it simpler to change the driver (e.g. swap Ferrum for Selenium).
    raise "Please use an HTML file" unless File.extname(file_path) == ".html"

    begin
      # Note: Ferrum drives Chrome or Chromium, so one of them must be installed.
      # Docs: https://docs.rubycdp.com/docs/ferrum/introduction/
      browser = Ferrum::Browser.new
      browser.go_to("file:///#{File.expand_path(file_path)}")
      browser.body
    ensure
      # `browser` is nil if Ferrum::Browser.new raised, so guard the call.
      browser&.quit
    end
  end
end
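
The cleanup in `extract_html` above relies on `ensure` to close the browser. The sketch below reproduces that shape with a stand-in object, to show when the cleanup runs and why guarding against a browser that never started can matter. `FakeBrowser` and `render_with` are hypothetical and are not Ferrum APIs.

```ruby
# frozen_string_literal: true

# Hypothetical stand-in for a browser driver; it only records whether quit ran.
class FakeBrowser
  attr_reader :quit_called

  def initialize(fail_on_load: false)
    @fail_on_load = fail_on_load
    @quit_called = false
  end

  def body
    raise IOError, "page failed to load" if @fail_on_load

    "<html></html>"
  end

  def quit
    @quit_called = true
  end
end

# Mirrors the begin/ensure shape of extract_html, with an injectable factory.
def render_with(factory)
  browser = factory.call
  browser.body
ensure
  # `browser` is still nil if the factory itself raised, so guard the call;
  # an unguarded `browser.quit` would raise NoMethodError and hide the real error.
  browser&.quit
end

browser = FakeBrowser.new
puts render_with(-> { browser })
puts browser.quit_called
```

The method still returns the value of `browser.body`; the `ensure` clause runs for its side effect only.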
62 changes: 62 additions & 0 deletions spec/file_scraper_spec.rb
@@ -0,0 +1,62 @@
# frozen_string_literal: true

require "file_scraper"

RSpec.describe FileScraper do
  # A local variable avoids defining a top-level constant from inside the block.
  file_paths = [
    "./files/van-gogh-paintings.html",
    "./files/rene-magritte-paintings.html",
    "./files/gerhard-richter-paintings.html",
  ]

  file_paths.each do |file_path|
    context "using #{file_path}" do
      before :all do
        json_response = FileScraper.run(file_path)
        @response = JSON.parse(json_response)
      end

      let(:expected_response) { JSON.parse(File.read("./files/expected-array.json")) }

      it "contains artworks array" do
        expect(@response["artworks"]).to be_a(Array)
      end

      it "artworks – name" do
        expect(@response["artworks"].first["name"]).to be_a(String)
        expect(@response["artworks"].first["name"]).not_to be_empty
      end

      it "artworks – extensions" do
        expect(@response["artworks"].first["extensions"]).to be_a(Array)
      end

      it "artworks – link" do
        expect(@response["artworks"].first["link"]).to be_a(String)
        expect(@response["artworks"].first["link"]).not_to be_empty
      end

      context "with thumbnail" do
        it "artworks – image" do
          expect(@response["artworks"].first["image"]).to be_a(String)
          expect(@response["artworks"].first["image"]).not_to be_empty
        end
      end

      context "without thumbnail" do
        it "artworks – image" do
          expect(@response["artworks"].last["image"]).to be_a(String)
          expect(@response["artworks"].last["image"]).not_to be_empty
        end
      end

      if file_path == "./files/van-gogh-paintings.html"
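        it "produces the expected response" do
          # The loop only walks the actual response, so check the count first
          # to catch artworks missing from the end.
          expect(@response["artworks"].length).to eq(expected_response["artworks"].length)
          @response["artworks"].each.with_index do |artwork, index|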

Author comment: This test could be simplified to:

expect(@response).to eq(expected_response)

I compared artwork by artwork instead because it is easier to debug: a failure points at a single artwork, whereas asserting on the whole response produces a single wall of text.
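
To make the debugging argument concrete, here is a small sketch in plain Ruby, outside RSpec. It lists the positions at which two arrays of hashes differ, which is roughly the information a per-artwork assertion surfaces. `mismatched_indexes` and the sample data are hypothetical.

```ruby
# frozen_string_literal: true

# Hypothetical helper: returns the positions where two arrays of hashes differ.
# It walks the longer array so that missing or extra entries are reported too.
def mismatched_indexes(actual, expected)
  longest = [actual.length, expected.length].max
  (0...longest).reject { |index| actual[index] == expected[index] }
end

# Made-up data: only the second entry differs.
expected = [{ "name" => "A" }, { "name" => "B" }, { "name" => "C" }]
actual   = [{ "name" => "A" }, { "name" => "X" }, { "name" => "C" }]

mismatched_indexes(actual, expected).each do |index|
  puts "artwork #{index}: expected #{expected[index].inspect}, got #{actual[index].inspect}"
end
```

A loop over the actual response alone would not notice entries missing from the end, which is why this helper iterates over the longer of the two arrays.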

            expect(artwork).to eq(expected_response["artworks"][index])
          end
        end
      end
    end
  end
end
98 changes: 98 additions & 0 deletions spec/spec_helper.rb
@@ -0,0 +1,98 @@
# This file was generated by the `rspec --init` command. Conventionally, all
# specs live under a `spec` directory, which RSpec adds to the `$LOAD_PATH`.
# The generated `.rspec` file contains `--require spec_helper` which will cause
# this file to always be loaded, without a need to explicitly require it in any
# files.
#
# Given that it is always loaded, you are encouraged to keep this file as
# light-weight as possible. Requiring heavyweight dependencies from this file
# will add to the boot time of your test suite on EVERY test run, even for an
# individual file that may not need all of that loaded. Instead, consider making
# a separate helper file that requires the additional dependencies and performs
# the additional setup, and require it from the spec files that actually need
# it.
#
# See https://rubydoc.info/gems/rspec-core/RSpec/Core/Configuration
RSpec.configure do |config|
  # rspec-expectations config goes here. You can use an alternate
  # assertion/expectation library such as wrong or the stdlib/minitest
  # assertions if you prefer.
  config.expect_with :rspec do |expectations|
    # This option will default to `true` in RSpec 4. It makes the `description`
    # and `failure_message` of custom matchers include text for helper methods
    # defined using `chain`, e.g.:
    # be_bigger_than(2).and_smaller_than(4).description
    # # => "be bigger than 2 and smaller than 4"
    # ...rather than:
    # # => "be bigger than 2"
    expectations.include_chain_clauses_in_custom_matcher_descriptions = true
  end

  # rspec-mocks config goes here. You can use an alternate test double
  # library (such as bogus or mocha) by changing the `mock_with` option here.
  config.mock_with :rspec do |mocks|
    # Prevents you from mocking or stubbing a method that does not exist on
    # a real object. This is generally recommended, and will default to
    # `true` in RSpec 4.
    mocks.verify_partial_doubles = true
  end

  # This option will default to `:apply_to_host_groups` in RSpec 4 (and will
  # have no way to turn it off -- the option exists only for backwards
  # compatibility in RSpec 3). It causes shared context metadata to be
  # inherited by the metadata hash of host groups and examples, rather than
  # triggering implicit auto-inclusion in groups with matching metadata.
  config.shared_context_metadata_behavior = :apply_to_host_groups

  # The settings below are suggested to provide a good initial experience
  # with RSpec, but feel free to customize to your heart's content.
=begin
  # This allows you to limit a spec run to individual examples or groups
  # you care about by tagging them with `:focus` metadata. When nothing
  # is tagged with `:focus`, all examples get run. RSpec also provides
  # aliases for `it`, `describe`, and `context` that include `:focus`
  # metadata: `fit`, `fdescribe` and `fcontext`, respectively.
  config.filter_run_when_matching :focus

  # Allows RSpec to persist some state between runs in order to support
  # the `--only-failures` and `--next-failure` CLI options. We recommend
  # you configure your source control system to ignore this file.
  config.example_status_persistence_file_path = "spec/examples.txt"

  # Limits the available syntax to the non-monkey patched syntax that is
  # recommended. For more details, see:
  # https://rspec.info/features/3-12/rspec-core/configuration/zero-monkey-patching-mode/
  config.disable_monkey_patching!

  # This setting enables warnings. It's recommended, but in some cases may
  # be too noisy due to issues in dependencies.
  config.warnings = true

  # Many RSpec users commonly either run the entire suite or an individual
  # file, and it's useful to allow more verbose output when running an
  # individual spec file.
  if config.files_to_run.one?
    # Use the documentation formatter for detailed output,
    # unless a formatter has already been configured
    # (e.g. via a command-line flag).
    config.default_formatter = "doc"
  end

  # Print the 10 slowest examples and example groups at the
  # end of the spec run, to help surface which specs are running
  # particularly slow.
  config.profile_examples = 10

  # Run specs in random order to surface order dependencies. If you find an
  # order dependency and want to debug it, you can fix the order by providing
  # the seed, which is printed after each run.
  # --seed 1234
  config.order = :random

  # Seed global randomization in this process using the `--seed` CLI option.
  # Setting this allows you to use `--seed` to deterministically reproduce
  # test failures related to randomization by passing the same `--seed` value
  # as the one that triggered the failure.
  Kernel.srand config.seed
=end
end