library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(furrr)
Scraping lyrics from Vagalume
Today I woke up with the urge to stretch my web scraping skills, and I wanted to do it while listening to some music, so why not scrape some music data? In this post I scrape data on artists and their lyrics; in a future post I plan to build some visualizations on that data and train an LSTM on the lyrics, maybe even have it compose some new ones!
This whole post is about scraping data from a Brazilian lyrics website called Vagalume (literally Firefly in PT-BR), which stores lyrics for a lot of artists, not only Brazilian ones. The site's popularity ratings might make for some cool visualizations. And since it's accessed almost exclusively by Brazilians, it should say something about their musical tastes, at least on some level.
NOTE: The datasets generated by this post are stored on Kaggle here, in case you want them for your own analysis.
Some packages we will need for this task:
The packages httr, xml2 and rvest are the main ones for HTML-based web scraping these days. I also load tidyverse for general data wrangling and furrr, which combines the future and purrr packages for parallel mapping.
The website has an A-to-Z list of artists: a link for each letter, then a link for each artist starting with that letter, which in turn contains a list of song titles with the lyrics "inside". It also groups artists into music genre sections, which work very similarly to the A-to-Z sections and are what I will use. I'll need to go through lots of links in order to collect all the data I want.
1 Scraping artists data
First let's grab data on the artists; it's a natural first step to find out how many artists and songs the site has. By the end of this section we should have a data frame with data on the artists listed on the website.
To obtain that data frame we will need a list of links to artists' pages on the site, as well as a scraper function that receives a link to an artist's page and returns the data we want from it. Using the rvest package we can build such a scraper.
Let's first build a scraper that returns the data we need from a single artist. For that purpose I'm going to use the page for Green Day.
We can see that it contains the band's name, the list of all their lyrics, and a sub-page "Popularidade" with the band's popularity history according to the site's access logs. These elements are present for every artist, while other parts of the page may be missing for some of them. Here is the information we can get for an individual artist:
- Artist’s name;
- Genre labels;
- Number of musics in the site;
- Popularity;
Let’s get scraping!
Building the scraper
I'm going to start by scraping the band's name to illustrate how rvest works. The code below does that job for us. read_html() parses an HTML page into R. Then html_nodes() extracts the specific part (node) of the page that contains the data we want; you can find the node's CSS selector with Selector Gadget or by inspecting the page's source in the browser. A fun and very instructive tutorial on CSS selectors is CSS Diner. Finally, html_text() turns the node into plain text, stripping line breaks and that sort of thing.
# Getting an artist's name
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".darkBG") %>%
  html_text()
There, now let's try to get the band's music genre labels:
# Scraping data on genre
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".subHeaderTags") %>%
  as_list() %>%
  unlist() %>%
  paste(collapse = "; ")
In this case we parse the node to a list using xml2::as_list() and then flatten it to a vector with base unlist(). If we had used rvest::html_text() it would have concatenated all the labels with no separator. Green Day has the labels "Rock", "Punk Rock" and "Rock Alternativo" (Alternative Rock).
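For comparison, here is what the html_text() route looks like on the same node; it returns the labels glued together in a single string (a quick illustrative check, the exact output depends on the page's markup):

# html_text() on the same node concatenates the labels without a separator,
# e.g. something like "RockPunk RockRock Alternativo"
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".subHeaderTags") %>%
  html_text()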
Now let's get the number of songs for the band. Their main page lists all of their songs, so we can grab the titles and then store their count.
read_html("https://www.vagalume.com.br/green-day/") %>%
html_nodes(".nameMusic") %>%
html_text() %>%
unique() %>%
length()
The first part of the code above (up to html_text()), if run alone, returns a vector with all the song names on the page. However, the full list of songs is preceded by a Top 25 of the artist's songs, so we remove the duplicates with unique() and then count the remaining titles with length().
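Just to make the duplication visible, we can compare the raw count with the deduplicated one (an illustrative sketch; the exact numbers depend on the page at the time of scraping):

# Comparing raw vs. deduplicated song counts for Green Day
song_names <- read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".nameMusic") %>%
  html_text()

length(song_names)         # includes the Top 25 repeats
length(unique(song_names)) # distinct song titles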
Now, last but not least, the artist's popularity. This one is stored in a sub-page (/popularidade/) of each artist. At the end of that page the current popularity is shown in a paragraph that always begins with the word "Atualmente".
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>%
html_nodes(".text") %>%
html_text()
So we need to keep the paragraph that begins with that word (which means "Currently", by the way), and then extract the popularity value (9.0 in this case), which always sits between the words "em" and "pontos".
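The extraction relies on a look-behind/look-ahead regex; here is a minimal illustration of it on a made-up sentence in the same format:

# The regex grabs whatever sits between "está em " and " pontos"
str_extract("Atualmente está em 9,0 pontos", "(?<=está em )(.*)(?= pontos)")
#> [1] "9,0"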
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>%
html_nodes(".text") %>%
html_text() %>%
# Extracting last phrase
tail(1) %>%
# Using Regular Expressions to remove the number
str_extract(., "(?<=está em )(.*)(?= pontos)") %>%
# Replacing brazilian decimal "," by "." and arsing to numeric
str_replace(",", ".") %>%
as.numeric()
It took some work, but we got the band's popularity. It's a shame it's rounded to an integer, but it won't be completely useless.
Now we know how to get every information we need from a single artist, so let’s build our scraper function:
scrap_artist <- function(artist_link) {
  # Reading the artist's main page and popularity sub-page
  page <- read_html(paste0("https://www.vagalume.com.br", artist_link))
  pop_page <- read_html(paste0("https://www.vagalume.com.br", artist_link, "popularidade/"))

  # Getting desired variables
  A <- page %>%
    html_nodes(".darkBG") %>%
    html_text()

  G <- page %>%
    html_nodes(".subHeaderTags") %>%
    as_list() %>%
    unlist() %>%
    paste(collapse = "; ")

  S <- page %>%
    html_nodes(".nameMusic") %>%
    html_text() %>%
    unique() %>%
    length()

  P <- pop_page %>%
    html_nodes(".text") %>%
    html_text() %>%
    tail(1) %>%
    str_extract(., "(?<=está em )(.*)(?= pontos)") %>%
    str_replace(",", ".") %>%
    as.numeric()

  # Creating the resulting tibble
  res <- tibble(
    Artist = A,
    Genres = G,
    Songs = S,
    Popularity = P,
    Link = artist_link
  )

  return(res)
}
# Testing the scraper function
scrap_artist("/green-day/")
Nice, a single function that receives a link to an artist and returns all the data we defined for that artist. Notice that I defined the function so that it receives only the relative part of the link, which is what we will scrape from the site in order to build our data frame.
Obtaining the artists' links
Now, since this post is just to stretch my web scraping skills and have some fun, I won't scrape data for every artist, which could take ages since the process depends not only on processing power but also on the internet connection and the time it takes to reach each link. Instead, we will pick some music genres to play with:
- Rock
- Hip Hop
- Pop music
- Sertanejo (Basically the Brazilian version of Country Music)
- Funk Carioca (originated from 60s US Funk, but a completely different genre in Brazil nowadays)
- Samba (Typical Brazilian music)
In order to map our scraper over artists we need the links to their pages on the website. For that, we first need the links of the music genre sections we want to scrape:
<- read_html("https://www.vagalume.com.br/browse/style/")
page
<- page %>%
styles html_nodes(".itemTitle") %>%
html_text() %>%
str_replace(., "/", "-") %>%
str_replace(., "á", "a") %>%
str_replace(., "é", "e") %>%
str_replace(., "â", "a") %>%
str_replace(., "ó", "o") %>%
str_replace_all(., "ú", "u") %>%
str_replace(., "&", "-n-") %>%
str_replace(., " ", "-") %>%
str_replace(., "J-Pop-J-Rock", "j-pop") %>%
str_replace(., "Gospel-Religioso", "gospel") %>%
str_replace(., "Romantico", "romantica") %>%
tolower()
<- paste0("/browse/style/", styles, ".html") section_links
Now, on one specific section page ("Rock"), we need to figure out how to get the list of artists; we will need that list of all the artists in order to scrape the data with the scrap_artist function. Let's start by getting the nodes with class moreNamesContainer (".moreNamesContainer"), identified by inspecting the page's code. The table containing all the names on each section page is converted to a list (which unfortunately has 4 nesting levels), and the code below extracts the links from it. It's already wrapped in a function, get_artist_links(), that will be useful for us.
get_artist_links <- function(section_link) {
  # Reading the page
  page <- read_html(paste0("https://www.vagalume.com.br", section_link))

  # Extracting the desired node as a list
  nameList <- page %>%
    html_nodes(".moreNamesContainer") %>%
    as_list()

  # Removing undesired list levels and extracting the 'href' attribute
  # 'NOT ELEGANT AT ALL' ALERT
  artist_links <- nameList %>%
    unlist(recursive = F) %>%     # Removing first undesired level
    unlist(recursive = F) %>%     # Removing second undesired level
    unlist(recursive = F) %>%     # Removing third undesired level, ugh
    map(., ~ attr(., "href")) %>% # Mapping through the list to extract the attribute
    unlist() %>%
    as.character()                # Removing names and parsing to a vector

  return(unique(artist_links))
}
get_artist_links(section_link = "/browse/style/rock.html") %>% head(10)
Ouch, what inelegant code, probably due to my limited knowledge of purrr; I definitely need to go deeper than the map_*() functions in that package soon. But OK, let's map it over all values of section_links to obtain the links of every artist on the website. To speed things up we will use furrr::future_map(), which conveniently combines purrr::map() with the parallel processing provided by the future package.
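As an aside, the nested unlist() calls could likely be avoided by selecting the anchor tags directly and pulling their href attributes with html_attr(). Here is a hedged sketch of that idea; the ".moreNamesContainer a" selector is an assumption about the page markup that I haven't verified:

# A possibly tidier alternative: grab the <a> nodes inside the container
# and extract their "href" attributes directly (selector is an assumption)
get_artist_links_alt <- function(section_link) {
  read_html(paste0("https://www.vagalume.com.br", section_link)) %>%
    html_nodes(".moreNamesContainer a") %>%
    html_attr("href") %>%
    unique()
}

Either way, let's run the parallel map over the genre sections: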
# Planning parallel processing
plan(multisession)

# Getting all artists' links from the website
#all_artists_links <- get_artist_links("/browse/style/sertanejo.html")
all_artists_links <- future_map(section_links, ~ get_artist_links(.)) %>%
  unlist()
I got \(3.255\) links when I scraped the data on February 10th, 2019. That leads us to the first big step in this endeavour: our artists dataset.
Building the dataset
Now that we have the scraper function scrap_artist() and a list to map it over, we can build our dataset. Since our scraper returns a tibble, I will use the map_dfr() variation, which binds the resulting data frames by rows. Of course there is a furrr::future_map_dfr() to save us, since this operation is going to be much more demanding than the last one.
# Defining a safe version of the scraper with possibly()
p_scrap_artist <- possibly(scrap_artist,
                           otherwise = tibble(Artist = NA,
                                              Genres = NA,
                                              Songs = NA,
                                              Popularity = NA,
                                              Link = NA))

# Planning parallel processing
plan(multisession)

# Scraping the data on every artist
all_artists <- future_map_dfr(all_artists_links, ~ p_scrap_artist(.))
It took around 16 minutes to finish this scraping on my gaming laptop; your mileage may vary with internet connection and processing power. Notice that I used possibly() to create a new function: it wraps the original one so that, in case of an error, the code doesn't stop but instead returns whatever was passed to the otherwise argument, which here is a tibble of NAs. It's also possible to pass .progress = TRUE to the future_map_*() functions to follow the progress of big operations like this one.
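For reference, requesting the progress bar just means adding that argument to the same call we ran above (a sketch, not rerun here):

# Same scraping call, now with furrr's progress bar
plan(multisession)
all_artists <- future_map_dfr(all_artists_links, ~ p_scrap_artist(.), .progress = TRUE)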
Originally I wanted to keep only some genres, as shown below, but in the most recent revision of the data on Kaggle I scraped all the genres, so the code below is commented out; you can use it if you want to keep only specific genres:
# Selecting genre labels to keep
#genres_keep <- c("Rock", "Hip Hop", "Pop", "Sertanejo", "Funk Carioca", "Samba")
# Removing other genre labels
#all_artists_fixed <- all_artists %>%
# separate(Genres, c("G1", "G2", "G3"), sep = "; ") %>% # Separating Genres variable
# gather(G1:G3, key = "G", value = "Genre") %>% select(-G) %>%
# filter(Genre %in% genres_keep)
There, now we have data on all the artists. The number of rows grew to \(3622\) because some artists have more than one genre label. I'd rather keep it this way, since removing the duplicates might bias the results toward one genre; at least this way each artist weighs equally in every genre they are labeled with. The artists dataset is built! We can take one last step in the data scraping before we stop for today: scraping their lyrics!
2 Scraping lyrics data
I decided to already scrape some data on the lyrics to use in the future analysis post: some text analysis, maybe some sentiment analysis, and then I have plans for a lyrics-composing AI. I will create a little monster that will ruin the music industry... OK, I'll stop dreaming. Let's do it!
Building the scraper
We are going to do this similarly to how we did it with the artists: build a scraper to get all the individual links, then map through them. I already built a lyrics scraper, get_lyric(), that returns the lyric given a song's link. The code below defines it:
# Extracts a single lyric from a song link
get_lyric <- function(song_link){

  # Reading the html page and extracting the lyric text
  lyric <- read_html(paste0("https://www.vagalume.com.br", song_link)) %>%
    html_nodes("#lyrics") %>%
    html_text2()

  return(lyric)
}
# Testing it on Holiday from Green Day
get_lyric(song_link = "/green-day/holiday.html")
From this specific lyric we can see that there are still some special characters left. Instead of just treating this "\" issue here, I'll remove all special characters when analyzing the lyrics. Better, right? Let's get to the next step of our scraping adventure.
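For the record, the kind of cleanup I have in mind for the analysis post might look roughly like the sketch below; the character class is an illustrative assumption, not the final cleaning rule:

# Hypothetical cleanup step: keep letters, digits, spaces and basic
# punctuation, replace everything else with a space
clean_lyric <- function(lyric) {
  str_replace_all(lyric, "[^[:alnum:][:space:].,!?'-]", " ")
}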
Links to scrape the data from
Now we need the list of songs to map the scraper through. The code below defines get_lyrics_links(), which receives an artist's link and returns a tibble with the name and link of each of their songs. It also returns the artist's link that was passed in, so it can be used as an ID to join with the artists dataset later.
# Extracts the names and links of all songs from an artist link
get_lyrics_links <- function(artist_link){

  # Reading the html page of the artist
  page <- read_html(paste0("https://www.vagalume.com.br", artist_link))

  # Obtaining all the songs' names and links
  music_name_node <- page %>% html_nodes(".nameMusic")
  music_names <- music_name_node %>% html_text()
  music_links <- music_name_node %>% html_attr("href")

  # Building the final tibble
  res <- tibble(ALink = rep(artist_link, length(music_names)),
                SName = music_names,
                SLink = music_links)

  return(res)
}
# Testing final function
get_lyrics_links("/green-day/") %>% head(5)
Notice that I use html_attr("href") to extract the "href" attribute from the html nodes, which is the link to each song's page. Let's map this function through all the artists we have in order to obtain the links to the songs:
# Generating a safe version of get_lyrics_links()
p_get_lyrics_links <- possibly(get_lyrics_links,
                               otherwise = tibble(ALink = NA,
                                                  SName = NA,
                                                  SLink = NA))

# Getting all the links to the songs
plan(multisession)
all_songs_links <- future_map_dfr(all_artists_links, ~ p_get_lyrics_links(.))
After an 82.98-second wait in my case, we find that there are \(209.522\) songs for our \(3.355\) artists.
Building the lyrics dataset
Now all we have to do is map our scraper get_lyric() through all these songs to get our final lyrics dataset.
# Generating a safe version with possibly()
p_get_lyric <- possibly(get_lyric, otherwise = NA)

# Mapping it through the all_songs_links$SLink vector
plan(multisession)
all_lyrics <- future_map_chr(all_songs_links$SLink, ~ p_get_lyric(.)[1], .progress = TRUE)

# Adding the lyrics to the data frame
all_songs_links$Lyric <- all_lyrics %>% str_replace_all("\\. \\. ", ". ")
This one took around 9,064.08 seconds (151 minutes) to finish, quite some time!
You might be asking why I didn't build a single function to scrape the lyrics. Well, I tried that at first: a function that took the artist's link, mapped through all the song links inside it and returned a tibble with all of the artist's lyrics. But it was literally taking hours to run. I think this happens because furrr can't optimize functions that move large pieces of data around very well. Also, it was a function with a map call inside, and it seems it's not possible to call a future within a future call, so I had to separate the two steps in order to keep the running time reasonable.
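For illustration, such a one-shot function might have looked roughly like the sketch below; this is my reconstruction of the idea, not the original code:

# The nested version that was too slow: one call per artist, with an
# inner map over that artist's songs (kept here only for illustration)
scrap_artist_lyrics <- function(artist_link) {
  get_lyrics_links(artist_link) %>%
    mutate(Lyric = map_chr(SLink, ~ p_get_lyric(.)[1]))
}

# all_songs <- future_map_dfr(all_artists_links, ~ scrap_artist_lyrics(.))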
Finally, we can use the cld2 package to infer each lyric's language, to make future analysis easier:
library(cld2)

all_songs_links <- all_songs_links %>%
  mutate(language = detect_language(Lyric))
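A quick sanity check on the result could be to count how many lyrics fall into each detected language (illustrative; the actual counts aren't reported here):

# Counting lyrics per detected language
all_songs_links %>%
  count(language, sort = TRUE)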
3 That’s it!
Both datasets can be found on my Kaggle datasets page, here. Feel free to analyze them any way you want!
This post is the first of two parts of the music analysis. In the next one I plan to briefly analyze this data and train an LSTM on the lyrics of each genre. I hope you liked it so far! If you have any feedback, share it on my Twitter @a_neisse or in the contact section of this website.