R may just have become preferable for simple webscraping jobs with the release of rvest. Before, this was something I'd rather do in Python, but the new R syntax seems to prevail. In this document I will give a small example of its syntax.
```r
library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)
library(GGally)
```
Heroes of the Storm
We retrieve the website simply via the `html` function:

```r
heroes <- html("http://www.heroesnexus.com/heroes")
```
This page contains many HTML nodes, which have classes. These are very useful for scraping and can be accessed via a CSS-selector string in `html_nodes`. The text of these nodes can then be retrieved via `html_text`. If you need a reminder of useful selectors, this source might help.
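To show what these two functions do before pointing them at the live site, here is a minimal sketch on a toy page; the markup below is made up for illustration and is not the real heroesnexus.com structure:

```r
library(rvest)

# A toy page with the same kind of class-tagged links the hero list uses
# (hypothetical markup, not the real site)
page <- read_html('<ul>
  <li><a class="hero-champion">Valla</a></li>
  <li><a class="hero-champion">Muradin</a></li>
</ul>')

# select every <a> with class "hero-champion", then pull out the text
names_found <- page %>%
  html_nodes("a.hero-champion") %>%
  html_text()

names_found
#> [1] "Valla"   "Muradin"
```

The same selector-then-text pattern drives everything that follows.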
```r
df <- data.frame(
  name        = heroes %>% html_nodes("a.hero-champion") %>% html_text,
  hp_txt      = heroes %>% html_nodes(".visual-quickinfo-cell .hero-hp") %>% html_text,
  attack_txt  = heroes %>% html_nodes(".visual-quickinfo-cell .hero-atk") %>% html_text,
  role        = heroes %>% html_nodes(".role") %>% html_text,
  attack_type = heroes %>% html_nodes(".hero-type :not(.role)") %>% html_text
)
```
The nodes that we've retrieved now only need their numerical values extracted. Through `mutate` we perform some ETL together with some functions from `stringr`. If `\\d*` confuses you, don't worry: it's a thing called a regex; go here if you want to know more.
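To see the extraction pattern in isolation, here is a small sketch on a made-up string shaped like the `hp_txt` column (the text itself is invented for the example):

```r
library(stringr)
library(dplyr)  # for the %>% pipe

# A hypothetical quickinfo string in the shape the hp_txt column contains
txt <- "HP: 1300 HP Regen: 2.69"

hp <- txt %>%
  str_extract("HP: \\d*") %>%   # grab the label plus the digits: "HP: 1300"
  str_replace("HP: ", "") %>%   # drop the label, leaving "1300"
  as.numeric()                  # convert the string to a number

hp
#> [1] 1300
```

`str_extract` narrows the string down to the labelled number, `str_replace` strips the label, and `as.numeric` finishes the job.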
```r
df <- df %>%
  mutate(
    hp         = hp_txt %>% str_extract("HP: \\d*") %>%
                   str_replace("HP: ", "") %>% as.numeric,
    attack     = attack_txt %>% str_extract("Damage: \\d+\\.?\\d*") %>%
                   str_replace("Damage: ", "") %>% as.numeric,
    attack_spd = attack_txt %>% str_extract("Speed: \\d\\.?\\d*") %>%
                   str_replace("Speed: ", "") %>% as.numeric
  )
```
Now that the dataframe is done, let's go for some massive visual exploring with `ggpairs` from GGally.
```r
df %>%
  select(hp, attack, attack_spd, attack_type) %>%
  ggpairs(data = ., color = "attack_type", title = "stats by melee/ranged")
```

```r
df %>%
  select(hp, attack, attack_spd, role) %>%
  ggpairs(data = ., color = "role", title = "stats by role")
```
Specialist and support heroes don't deal as much damage as assassins or warriors. Melee characters also seem to pack more of a punch. All these numbers make sense from a game balance perspective.
So in about 20 lines of (very straightforward) code we have:
- retrieved the html of a website
- parsed through the html to find the nodes of interest
- parsed the nodes of interest
- visualised the result with two plots
In Python, I would need Beautiful Soup as well as requests and matplotlib to get a similar result, and those feel like different styles of API. This is something R is becoming very good at: the entire language feels like one consistent API.
The `html_nodes` function feels lovely if you just want to quickly select a few things based on CSS selectors. It plays very nicely with the `%>%` operator too. It even has support for rarer CSS selectors like `:not`. Very cool.
R and python: always contending.
You can download the .Rmd file for this post here.
You should also be able to find this blog on: r-bloggers