Trouble Scraping A Table Into R

I am trying and failing to scrape the table of average IQs by country from this web page into R. I'm trying to follow the process described in this blog post, but I can't seem to f

Solution 1:

Thanks to @hrbrmstr's pointer about JavaScript being the issue, I was able to get this done using phantomjs and following this tutorial. Specifically, I:

  1. Downloaded phantomjs for Windows from this site to my working directory;
  2. In Windows Explorer, manually extracted the file phantomjs.exe to my working directory (I guess I could have done 1 and 2 in R with download.file and unzip, but...);
  3. Copied the 'scrape_techstars.js' file shown in the tutorial mentioned above, pasted it into Notepad, edited it to fit my case, and saved it to my working directory as "scrape_iq.js";
  4. Back in my R console, ran system("./phantomjs scrape_iq.js");
  5. Back in Windows Explorer, looked in my working directory to find the html file created in step 4 ("iq.html"), right-clicked on that html file, and selected Open with > Google Chrome;
  6. In the Chrome tab that launched, right-clicked on the page and selected Inspect;
  7. Moused over the table I wanted to scrape and looked in the Elements window to confirm that it is a node of type "table"; and, finally,
  8. Ran the R code below to scrape that table from "iq.html".
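Steps 1 and 2 could probably be done from R with download.file and unzip, as hinted in step 2. Here is a minimal sketch of that idea, not something I actually ran for this answer. The helper name `fetch_phantomjs` and its `zip_url` argument are hypothetical; the download link is not given here, so it has to be supplied by the caller, and the sketch assumes the archive contains exactly one file named phantomjs.exe.

```r
# Hypothetical helper: download a PhantomJS zip archive and extract
# phantomjs.exe into dest_dir. zip_url is an assumption supplied by the
# caller; the original steps do not state the download link.
fetch_phantomjs <- function(zip_url, dest_dir = ".") {
  zip_path <- file.path(tempdir(), "phantomjs.zip")

  # mode = "wb" keeps the binary archive intact on Windows
  download.file(zip_url, destfile = zip_path, mode = "wb")

  # List the archive contents and pick out the executable
  entries <- unzip(zip_path, list = TRUE)$Name
  exe <- grep("phantomjs\\.exe$", entries, value = TRUE)
  stopifnot(length(exe) == 1)

  # junkpaths = TRUE drops the archive's folder structure, so the
  # executable lands directly in dest_dir
  unzip(zip_path, files = exe, exdir = dest_dir, junkpaths = TRUE)
  file.path(dest_dir, "phantomjs.exe")
}

# Usage (not run): pass the real download link, then continue with step 4
# fetch_phantomjs("<URL of the PhantomJS Windows zip>")
# system("./phantomjs scrape_iq.js")
```

Extracting only the executable keeps the working directory tidy, which matches what the manual extraction in step 2 does.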

Here's the .js file I created in step 3 and used in step 4:

// scrape_iq.js
var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'iq.html';

// Open the page in PhantomJS so its JavaScript runs, then save the rendered HTML
page.open('https://iq-research.info/en/page/average-iq-by-country', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

Here's the R code I used in step 8 to scrape the table from the local html file phantomjs had created and get a data frame called IQ in my workspace.

library(dplyr)
library(rvest)

IQ <- read_html("iq.html") %>%
  html_nodes('table') %>%
  html_table() %>%
  .[[1]]

And here's the head of the data frame that produced:

> head(IQ)
  Rank     Country  IQ
1    1   Hong Kong 108
2    1   Singapore 108
3    2 South Korea 106
4    3       Japan 105
5    3       China 105
6    4      Taiwan 104
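
One caveat about the `.[[1]]` step: it assumes the saved file contains at least one table. A slightly more defensive variant might look like the sketch below. It runs on an inline toy HTML string rather than on iq.html, and the names `toy_html`, `tables` and `toy`, as well as the rows in the toy table, are placeholders of mine and not data from the real page.

```r
library(rvest)

# Toy HTML standing in for iq.html; the rows are made-up placeholders,
# not values from the real page.
toy_html <- "<html><body><table>
  <tr><th>Rank</th><th>Country</th><th>Score</th></tr>
  <tr><td>1</td><td>A</td><td>10</td></tr>
  <tr><td>2</td><td>B</td><td>9</td></tr>
</table></body></html>"

# Collect every <table> node in the document
tables <- html_nodes(read_html(toy_html), "table")

# Stop with a clear message instead of a subscript error
# if the document has no table at all
if (length(tables) == 0) stop("no <table> nodes found")

# Convert the first table; the <th> row becomes the column names
toy <- html_table(tables[[1]])
```

For the real file, `read_html("iq.html")` would take the place of `read_html(toy_html)`.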
