Scrape Values From Html Select/option Tags In R
I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it but now a little unsure how to extr
Solution 1:
The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.
UPDATED: Incorporates the second request (see comments below)
library(rvest)
library(dplyr)
# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {
  # make the AJAX URL and grab the data
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  subunits <- read_html(url)
  # reformat into a data frame with the town data
  # ([-1,] drops the first <option> row)
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]
}
# get data from the first popup and put it into a data frame
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]
# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id, maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))
# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL
str(combined)
## 'data.frame': 1964 obs. of 4 variables:
##  $ town_id  : chr "611" "635" "625" "628" ...
##  $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
##  $ area_id  : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
##  $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...
head(combined)
##   town_id town_name     area_id area_name
## 1     611     AHERO 60603030101     AHERO
## 2     635     AKALA 60107050201     AKALA
## 3     625     AWASI 60603020101     AWASI
## 4     628    AWENDO 61103040101    ANINDO
## 5     628    AWENDO 61103050401      SARE
## 6     749    BAHATI 73101010101    BAHATI
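To try the selector logic without hitting the live site, here is a minimal sketch that runs the same rvest calls against an inline HTML string. The markup is a hypothetical stand-in for the town popup (only the `town` id and the first few values shown elsewhere on this page are reused), and it assumes an rvest version that provides `read_html()`.

```r
library(rvest)

# Hypothetical markup imitating the town <select>; not copied from the live site
snippet <- '<select id="town">
  <option value="0">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
</select>'

# read_html() accepts literal HTML as well as a URL
page <- read_html(snippet)
opts <- html_nodes(page, "#town option")

towns <- data.frame(town_id = html_attr(opts, "value"),
                    town_name = html_text(opts),
                    stringsAsFactors = FALSE)

# drop the placeholder by its value instead of by row position
towns <- towns[towns$town_id != "0", ]
rownames(towns) <- NULL
towns
```

Filtering on the placeholder's value is a design choice for this sketch: it keeps working if the placeholder is missing or not first, whereas `[-1,]` always removes the first row.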
Solution 2:
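Using XPath expressions with HTML is almost always a better choice than regex. Given this data, and assuming majidata_html holds the document already parsed with the XML package, you can extract what you're after with
library(XML)
# select every <option> inside the <select> whose id is "town"
options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
ids <- sapply(options, xmlGetAttr, "value")
names <- sapply(options, xmlValue)
data.frame(ID=ids, Name=names)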
which returns
   ID          Name
1   0 [SELECT TOWN]
2 611         AHERO
3 635         AKALA
4 625         AWASI
5 628        AWENDO
6 749        BAHATI
...
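The same XPath approach can be checked offline. The sketch below parses an inline string with `htmlParse(asText = TRUE)` from the XML package and lets the XPath predicate skip the placeholder option. The snippet and the variable names (`doc`, `town_names`) are illustrative assumptions, not taken from the original question.

```r
library(XML)

# Hypothetical markup imitating the town <select>; not copied from the live site
snippet <- '<select id="town">
  <option value="0">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
</select>'

# asText = TRUE tells htmlParse() the argument is HTML content, not a file or URL
doc <- htmlParse(snippet, asText = TRUE)

# the predicate [@value!='0'] leaves out the "[SELECT TOWN]" placeholder
options <- getNodeSet(doc, "//select[@id='town']/option[@value!='0']")
ids <- sapply(options, xmlGetAttr, "value")
town_names <- sapply(options, xmlValue)

result <- data.frame(ID = ids, Name = town_names, stringsAsFactors = FALSE)
result
```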