Scraping lyrics from Vagalume
Today I woke up wanting to stretch my web scraping skills while listening to some music, so why not scrape some music data? In this post I will scrape some data on artists and their lyrics. In a future post I plan to build some visualizations of the data, train an LSTM on the lyrics and maybe make it compose some new ones!
This whole post is about scraping the data from a Brazilian lyrics website called Vagalume (literally “Firefly” in PT-BR), which stores lyrics for a lot of artists, not only Brazilian ones. The site’s popularity ratings might result in some cool visualizations. And of course, it’s accessed almost exclusively by Brazilians, so it will be informative about their tastes, on some level.
NOTE: The datasets generated by this post are stored on Kaggle here, in case you want to do some analysis on them.
Some packages we will need for this task:
library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(furrr)
The packages httr, xml2 and rvest are the main ones for HTML-based web scraping these days. I’m also loading tidyverse for general data wrangling and furrr, which combines the future and purrr packages.
The website has an A-to-Z list of artists: a link for each letter, then a link for each artist whose name starts with that letter, which leads to a list of song names with the lyrics “inside”. It also displays the artists in music genre sections, which work very similarly to the A-to-Z sections and are the ones I will use. I’ll need to go through lots of links in order to collect all the data I want.
1 Scraping artists data
First let’s grab data on the artists, a first step towards knowing how many artists and songs the site has. At the end of this section we should have a data frame containing data on the artists on the website.
To obtain our data frame we will need a list of links to artists’ pages on the site as well as a scraper function to extract the data we want from them. The scraper is the function that will receive a link to an artist’s page and return the data we want. Using the rvest package we can build a scraper.
Let’s first build a scraper that returns the data we need from a single artist. For that purpose I’m going to use the page for Green Day.
We can see that it contains the band’s name, the list of all the lyrics, and a subpage “Popularidade” with the popularity history of the band according to the site’s access history. Those pages are present for all artists, while other subpages aren’t, as some artists might not have them. Now, there is some information that we can get for individual artists:
- Artist’s name;
- Genre labels;
- Number of songs on the site;
- Popularity;
Let’s get scraping!
Building the scraper
I’m going to start with scraping the band’s name to illustrate how rvest works with web scraping. The code below does that job for us. read_html() parses an HTML page into R. Then, html_nodes() extracts a specific part (node) from the page that contains the data we want. You can find the node’s selector with Selector Gadget or by inspecting the page’s code in the browser. Also, a fun and very instructive tutorial on CSS selectors is CSS Diner. Finally, html_text() turns the node into plain text, dropping the HTML markup.
# Getting an artist's name
read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".darkBG") %>% html_text()
## [1] "Green Day"
There, now we will try to get the labels for music genre for the band:
# Scraping data on genre
read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".subHeaderTags") %>%
as_list() %>% unlist() %>% paste(collapse = "; ")
## [1] "Rock; Punk Rock; Rock Alternativo"
In this case we need to convert it to a list using xml2::as_list() and then use base::unlist() in order to transform the HTML into a vector. If we had used rvest::html_text() it would have concatenated all the labels with sep = "". Green Day has the labels “Rock”, “Punk Rock” and “Rock Alternativo” (Alternative Rock).
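To make the difference concrete, here is a small offline sketch. The HTML snippet is made up for illustration (the real Vagalume markup may differ), and the selector ".subHeaderTags a" assumes each label is a link inside the container.

```r
library(xml2)
library(rvest)

# Hypothetical markup, only loosely modelled on a genre-label container
page <- read_html(
  '<html><body><ul class="subHeaderTags"><li><a href="#">Rock</a></li><li><a href="#">Punk Rock</a></li></ul></body></html>'
)

# html_text() on the container glues the labels together
glued <- html_text(html_nodes(page, ".subHeaderTags"))

# Selecting each label node first keeps them as separate vector elements
labels <- html_text(html_nodes(page, ".subHeaderTags a"))
joined <- paste(labels, collapse = "; ")

print(glued)
print(joined)
```

If the labels really are individual nodes on the page, selecting them directly could be an alternative to the as_list() and unlist() route.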
Now let’s try and get the number of songs for the band. Their main page has the list of all their songs; we can get them and then store their count.
read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".nameMusic") %>% html_text() %>%
unique() %>% length()
## [1] 239
The first line of code above, if run alone, returns a vector with all the song names on the page. However, before the list of all songs there is a top 25 of the artist’s songs, so we remove duplicates with unique() and then count how many songs there are using length() applied to the resulting vector of unique song names.
Now, last but not least, the artist’s popularity. This one is stored in a sub-page (/popularidade/) of each artist. At the end of the page it shows the current popularity in a sentence which always begins with the word “Atualmente”.
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>% html_nodes(".text") %>% html_text()
## [1] "A primeira vez que Green Day apareceu no Vagalume foi em Junho de 2003.As 2 músicas mais acessadas na época eram Basket Case e When I Come Around."
## [2] "Nos meses de Março e Maio de 2005, Green Day ganhou popularidade com a música Boulevard Of Broken Dreams.\nNo período, a popularidade máxima atingiu 150,0 pontos em Março."
## [3] "Por 5 meses, entre Setembro de 2005 e Fevereiro de 2006, Green Day ganhou popularidade com a música Wake Me Up When September Ends.\nNo período, a popularidade máxima atingiu 178,0 pontos em Novembro de 2005."
## [4] "Atualmente, em Setembro de 2018, a popularidade está em 9,0 pontos e a música mais acessada é Wake Me Up When September Ends."
So we need to keep only the one that begins with that word (which means “Currently”, by the way); since it is the last one on the page, tail(1) does the job. Then we need to extract the popularity value (9.0 in this case), which always sits between the words “está em” and “pontos”.
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>% html_nodes(".text") %>% html_text() %>%
tail(1) %>% # Extracting last phrase
str_extract(., "(?<=está em )(.*)(?= pontos)") %>% # Using a regular expression to extract the number
str_replace(",", ".") %>% as.numeric() # Replacing the Brazilian decimal "," with "." and parsing to numeric
## [1] 9
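As an offline sketch of the same extraction, the snippet below works on two sentences adapted from the output above instead of the live page. It is only one possible variation: it filters by the leading word “Atualmente” rather than relying on position, and it uses a stricter pattern (digits, comma, digits) of my own choosing.

```r
library(stringr)

# Two sentences adapted from the output above, used as offline test data
phrases <- c(
  "No período, a popularidade máxima atingiu 150,0 pontos em Março.",
  "Atualmente, em Setembro de 2018, a popularidade está em 9,0 pontos e a música mais acessada é Wake Me Up When September Ends."
)

# Keep only the sentence that starts with "Atualmente"
current <- str_subset(phrases, "^Atualmente")

# Grab the number written right before " pontos" (digits, comma, digits)
raw_value <- str_extract(current, "[0-9]+,[0-9]+(?= pontos)")

# Swap the decimal comma for a dot and convert to numeric
popularity <- as.numeric(str_replace(raw_value, ",", "."))
print(popularity)
```

Filtering by the word itself should keep working even if the site ever adds another sentence after the current one.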
It took some work, but we got the popularity for the band. It’s a shame it seems to be rounded to a whole number, but it won’t be completely useless.
Now we know how to get every piece of information we need from a single artist, so let’s build our scraper function:
scrap_artist <- function(artist_link){
# Reading the entire pages
page <- read_html(paste0("https://www.vagalume.com.br", artist_link))
pop_page <- read_html(paste0("https://www.vagalume.com.br", artist_link, "popularidade/"))
# Getting desired variables
A <- page %>% html_nodes(".darkBG") %>% html_text()
G <- page %>% html_nodes(".subHeaderTags") %>%
as_list() %>% unlist() %>% paste(collapse = "; ")
S <- page %>% html_nodes(".nameMusic") %>% html_text() %>%
unique() %>% length()
P <- pop_page %>% html_nodes(".text") %>% html_text() %>%
tail(1) %>%
str_extract(., "(?<=está em )(.*)(?= pontos)") %>%
str_replace(",", ".") %>% as.numeric()
# Creating tibble
res <- tibble(Artist = A, Genres = G, Songs = S, Popularity = P, Link = artist_link)
return(res)
}
# Testing the scraper function
scrap_artist("/green-day/")
## # A tibble: 1 x 5
##   Artist    Genres                            Songs Popularity Link
##   <chr>     <chr>                             <int>      <dbl> <chr>
## 1 Green Day Rock; Punk Rock; Rock Alternativo   239          9 /green-day/
Nice, a single function that receives a link to an artist and returns all the data we defined for that artist. Notice that I defined the function so that it receives only the relevant part of the link, which is what we will scrape from the site in order to build our data frame.
Obtaining links of artists
Now, since this post is just to stretch my web scraping skills and have some fun, I won’t scrape data from all the artists. That could take ages, as the process depends not only on processing power but also on the internet connection and the time needed to reach each link. Instead, we will pick a few music genres to play with:
- Rock
- Hip Hop
- Pop music
- Sertanejo (Basically the Brazilian version of Country Music)
- Funk Carioca (Originated from 1960s US funk, but a completely different genre in Brazil nowadays)
- Samba (Typical Brazilian music)
In order to map our scraper over the artists we need the links to their pages on the website. For that we first need the links to the music genre sections we want to scrape:
section_links <- c("/browse/style/rock.html",
"/browse/style/hip-hop.html",
"/browse/style/pop.html",
"/browse/style/sertanejo.html",
"/browse/style/funk-carioca.html",
"/browse/style/samba.html")
Now, for one specific page (“Rock”), we need to figure out how to get the list of artists, since we will need a list of all the artists to scrape the data with the scrap_artist function. Let’s start by getting the objects with class=moreNamesContainer (“.moreNamesContainer”), as identified in the page’s code.
The table containing all the names for each section page is converted to a list (which unfortunately has 4 levels), and the code below extracts the links from it. It’s already wrapped in a function, get_artist_links, that will be useful for us.
get_artist_links <- function(section_link){
# Reading the page
page <- read_html(paste0("https://www.vagalume.com.br", section_link))
# Extracting desired node as a list
nameList <- page %>% html_nodes(".moreNamesContainer") %>% as_list()
# Removing undesired list levels and extracting 'href' attribute
# 'NOT ELEGANT AT ALL' ALERT
artist_links <- nameList %>%
unlist(recursive = FALSE) %>% # Removing first undesired level
unlist(recursive = FALSE) %>% # Removing second undesired level
unlist(recursive = FALSE) %>% # Removing third undesired level, ugh
map(., ~attr(., "href")) %>% # Mapping through list to extract attrs
unlist() %>% as.character() # Removing names and parsing to a vector
return(unique(artist_links))
}
get_artist_links(section_link = "/browse/style/rock.html") %>% head(10)
## [1] "/10000-maniacs/" "/12-stones/" "/3-doors-down/"
## [4] "/311/" "/4-non-blondes/" "/4seres/"
## [7] "/a-corte-animal/" "/a-cruz-esta-vazia/" "/a-day-to-remember/"
## [10] "/a-volta/"
Ouch, what inelegant code, probably due to my lack of knowledge of purrr. I definitely need to get deeper than map_*() in this package soon. But ok, let’s map it over all values of section_links in order to obtain the links to all artists in the chosen sections.
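For what it’s worth, a CSS descendant selector combined with html_attr() might avoid the nested unlist() calls altogether. The sketch below runs on made-up markup, so the assumption that every artist link is an a element somewhere inside .moreNamesContainer would need to be checked against the real page.

```r
library(xml2)
library(rvest)

# Hypothetical markup standing in for a genre section page
page <- read_html(
  '<html><body><div class="moreNamesContainer"><ul><li><a href="/12-stones/">12 Stones</a></li><li><a href="/311/">311</a></li><li><a href="/12-stones/">12 Stones</a></li></ul></div></body></html>'
)

# Select every link inside the container and read its href attribute
artist_links <- unique(html_attr(html_nodes(page, ".moreNamesContainer a"), "href"))
print(artist_links)
```

This version does not care how many levels sit between the container and the links, which is the part that made the list-based approach awkward.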
In order to improve performance we will use furrr::future_map(), a function that conveniently combines purrr::map() with the parallel processing provided by the future package.
# Planning parallel processing
plan(multisession)
# Getting all artists' links from the website
#all_artists_links <- get_artist_links("/browse/style/sertanejo.html")
all_artists_links <- future_map(section_links, ~get_artist_links(.)) %>% unlist()
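If you want to try the parallel mapping pattern without hitting the website, here is a toy sketch. The slow_square() function is just a stand-in for a slow scraper, and the choice of two workers is arbitrary.

```r
library(future)
library(furrr)

# Two background R sessions; the number of workers is an arbitrary choice here
plan(multisession, workers = 2)

# Stand-in for a slow scraping function
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Same idea as future_map(), but returning a numeric vector
squares <- future_map_dbl(1:4, slow_square)
print(squares)

# Back to sequential processing once done
plan(sequential)
```

The calls are split between the workers, so the total waiting time should shrink as the per-call delay grows relative to the overhead of starting the sessions.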
I got 3,255 links when I scraped the data on February 10th, 2019. That leads us to the first big step in this endeavour, our artists dataset.
Building the dataset
Now that we have the scraper function scrap_artist and a list to map it over, we can build our dataset. Since our scraper returns a tibble, I will use the map_dfr variation, which already binds the resulting data frames by rows. Of course there is a furrr::future_map_dfr() to save us, since this operation is going to be much more demanding than the last one.
# Defining a safe version of scrap_artist()
p_scrap_artist <- possibly(scrap_artist,
otherwise = tibble(Artist = NA,
Genres = NA,
Songs = NA,
Popularity = NA,
Link = NA))
# Planning parallel processing
plan(multisession)
# Scraping data on all artists
all_artists <- future_map_dfr(all_artists_links, ~p_scrap_artist(.))
It took around 16 minutes to finish this scraping on my gaming laptop; the time will depend on internet connection and processing power. Notice that I used possibly() to create a new function. It wraps a function so that, if the wrapped function throws an error, our code won’t stop; instead it returns what I passed to the otherwise argument, which is a tibble of NAs. It’s also possible to pass .progress = TRUE to the future_map_* functions in order to follow the progress of big operations like this one.
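Here is a tiny sketch of what possibly() does, using a toy function instead of the scraper. The names risky_sqrt and safe_sqrt are made up for this example.

```r
library(purrr)

# Toy function that fails for some inputs, like a scraper hitting a broken page
risky_sqrt <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# Safe version: returns NA instead of throwing an error
safe_sqrt <- possibly(risky_sqrt, otherwise = NA_real_)

# The failing input in the middle does not stop the whole map
results <- map_dbl(c(4, -1, 9), safe_sqrt)
print(results)
```

Without the wrapper, the second element would abort the map and the results for the other inputs would be lost.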
One last thing: to unify the genres, I want to keep only the labels we need:
# Selecting genre labels to keep
genres_keep <- c("Rock", "Hip Hop", "Pop", "Sertanejo", "Funk Carioca", "Samba")
# Removing other genre labels
all_artists_fixed <- all_artists %>%
separate(Genres, c("G1", "G2", "G3"), sep = "; ") %>% # Separating Genres variable
gather(G1:G3, key = "G", value = "Genre") %>% select(-G) %>%
filter(Genre %in% genres_keep)
There, now we have some data on all artists in the genres I specified. The number of rows grew to 3,622 because some artists have more than one of those labels. I would rather keep it this way, since removing the duplicates might bias the results towards one genre; this way each artist counts equally in every genre they belong to. The artists dataset is built! We can take one last step in the data scraping before we stop for today: scraping their lyrics!
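To see why the number of rows grows, here is the same reshaping on a made-up two-artist table. The artists and labels are invented, and I added fill = "right" only to silence the warning that separate() gives when an artist has fewer than three labels.

```r
library(dplyr)
library(tidyr)

# Invented data: artist A has two labels we keep, artist B has one
toy <- tibble(
  Artist = c("A", "B"),
  Genres = c("Rock; Pop; Punk Rock", "Samba")
)
toy_keep <- c("Rock", "Pop", "Samba")

toy_long <- toy %>%
  separate(Genres, c("G1", "G2", "G3"), sep = "; ", fill = "right") %>%
  gather(G1:G3, key = "G", value = "Genre") %>%
  select(-G) %>%
  filter(Genre %in% toy_keep)

print(toy_long)
```

Artist A ends up with one row for Rock and one for Pop, while the Punk Rock label and the empty slots are filtered out.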
2 Scraping lyrics data
I decided to scrape the lyrics right away to use in the future analysis post: some text analysis, maybe some sentiment analysis, and then I have plans for a lyrics-composing AI. I will create a little monster that will ruin the music industry... ok, I’ll stop dreaming. Let’s do it!
Building the scraper
We are going to do it similarly to how we did it with the artists: build a scraper, get all the individual links, then map the scraper over them. I already built a lyrics scraper, get_lyric(), that returns the lyric based on the song’s link. The code below defines it:
# Extracts a single lyric from a song link
get_lyric <- function(song_link){
# Reading the html page
lyric <- read_html(paste0("https://www.vagalume.com.br", song_link)) %>% html_nodes("#lyrics")
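# Creating a separator node to mark line breaks with ". "
dummy <- html_node(read_xml("<doc><span>. </span></doc>"), "span")
# Adding the separator after each line break
# (html_node()/html_nodes() replace the deprecated xml_node()/xml_nodes())
xml_add_sibling(html_nodes(lyric, "br"), dummy)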
res <- lyric %>% html_text()
return(res)
}
# Testing it on Holiday from Green Day
get_lyric("/green-day/holiday.html")
## [1] "Hear the sound of the falling rain. Coming down like an Armageddon flame (HEY!). The shame, the ones who died without a name. . Hear the dogs howling out of key. To a hymn called \"Faith and Misery\" (Hey!). And bleed, the company lost the war today. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. On holiday. . Hear the drum pounding out of time. Another protestor has crossed the line (HEY!). To find, the money's on the other side. . Can I get another Amen (AMEN!). There's a flag wrapped around a score of men (HEY!). A gag, A plastic bag on a monument. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. On holiday. . The representative from California has the floor. . Zieg Heil to the president gasman. Bombs away is your punishment. Pulverize the Eiffel towers. Who criticize your government. Bang bang goes the broken glass. Kill all the fags that don't agree. Trials by fire, setting fire. Is not a way that's meant for me. Just cause...(hey, hey). Just cause, because we're outlaws YEAH!. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. This is our lives on holiday"
From this specific lyric we can see that there are still some special characters. Instead of just treating this “\” issue, I’ll remove all special characters when analyzing the lyrics. Better, right? Let’s get to the next step on our scraping adventure.
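To see the line-break trick in isolation, here is an offline sketch on made-up markup. It uses only xml2 functions with XPath expressions (xml_find_all() and xml_find_first()), which is a slight variation on the function above, and the id "lyrics" is simply copied from the selector used there.

```r
library(xml2)

# Hypothetical lyrics markup with <br> line breaks
page <- read_html(
  '<html><body><div id="lyrics">Line one<br>Line two<br>Line three</div></body></html>'
)
lyric <- xml_find_all(page, "//div[@id='lyrics']")

# Small node holding the separator text
sep_node <- xml_find_first(read_xml("<doc><span>. </span></doc>"), ".//span")

# Insert a copy of the separator after every <br>
xml_add_sibling(xml_find_all(lyric, ".//br"), sep_node)

# The text now carries ". " wherever a line break used to be
flat <- xml_text(lyric)
print(flat)
```

Note that xml_add_sibling() modifies the parsed document in place, which is why its result is not assigned to anything.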
Links to scrap the data from
Now we need the list of songs to map the scraper through. The code below defines get_lyrics_links()
, which receives the link to the artist then returns a tibble with name and link to the music. It also returns the link to the artist that was passed, in order to use it as an ID for the artist dataset in the future.
# Extracts all song names and links from an artist link
get_lyrics_links <- function(artist_link){
# Reading the html page on the artist
page <- read_html(paste0("https://www.vagalume.com.br", artist_link))
# Obtaining all the songs' names and links
music_name_node <- page %>% html_nodes(".nameMusic")
music_names <- music_name_node %>% html_text()
music_links <- music_name_node %>% html_attr("href")
# Building final tibble
res <- tibble(ALink = rep(artist_link, length(music_names)),
SName = music_names,
SLink = music_links)
return(res)
}
# Testing final function
get_lyrics_links("/green-day/") %>% head(5)
## # A tibble: 5 x 3
##   ALink      SName                     SLink
##   <chr>      <chr>                     <chr>
## 1 /green-da~ Boulevard Of Broken Drea~ /green-day/boulevard-of-broken-drea~
## 2 /green-da~ Good Riddance (Time Of Y~ /green-day/good-riddance-time-of-yo~
## 3 /green-da~ Holiday                   /green-day/holiday.html
## 4 /green-da~ Wake Me Up When Septembe~ /green-day/wake-me-up-when-septembe~
## 5 /green-da~ Basket Case               /green-day/basket-case.html
Notice that I use html_attr("href") to extract the “href” attribute from the HTML node, which is the link to the song page. Let’s map this function over all the artists we have in order to obtain the links to the songs:
# Generating safe version of get_lyrics_links()
p_get_lyrics_links <- possibly(get_lyrics_links,
otherwise = tibble(ALink = NA,
SName = NA,
SLink = NA)) # Same columns as the regular output
# Returning all the links to the songs
plan(multisession)
all_songs_links <- future_map_dfr(all_artists_links, ~p_get_lyrics_links(.))
After an 82.98-second wait in my case, we found out that there are 209,522 songs for our 3,355 artists.
Building the lyrics dataset
Now all we have to do is map our scraper get_lyric() over all these songs to get our final lyrics dataset.
# Generating a safe version with possibly()
p_get_lyric <- possibly(get_lyric, otherwise = NA_character_)
# Mapping it over the song links in all_songs_links
plan(multisession)
all_lyrics <- future_map_chr(all_songs_links$SLink, ~p_get_lyric(.)[1], .progress = TRUE)
# Adding it to the dataframe
all_songs_links$Lyric <- all_lyrics %>% str_replace_all("\\. \\. ", ". ")
This one took around 9,064.08 seconds (151 minutes) to finish, quite some time!
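As a side note on the clean-up step in the code above: blank lines between verses leave a doubled separator (". . ") in the flattened lyric, and the str_replace_all() call collapses it. A quick sketch on an invented string:

```r
library(stringr)

# Invented text with the doubled separators left by blank lines between verses
flat <- "First verse line. . Second verse line. . Third verse line"

# Collapse every ". . " into a single ". "
cleaned <- str_replace_all(flat, "\\. \\. ", ". ")
print(cleaned)
```

The dots are escaped because an unescaped "." in a regular expression would match any character.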
You might be asking why I didn’t build a single function to scrape the lyrics. Well, I tried that at first: a function that took the artist link, mapped over all the song links inside it and returned a tibble with all of the artist’s lyrics. But it was literally taking hours to run. I think that happens because furrr can’t handle well functions that move big pieces of data around. Also, it was a function with a map call inside, and it seems that it’s not possible to call a future within a future call, so I had to separate them in order to reduce the running time.
3 That’s it!
Both datasets can be found on my Kaggle datasets page, here. Feel free to analyze them any way you want!
This post is the first of two parts of the music analysis. In the next one I plan on briefly analyzing this data and training an LSTM on the lyrics for each genre. I hope you liked it so far! If you have any feedback, share it on Twitter @a_neisse or in the contact section of this website.