Scraping lyrics from Vagalume

Today I woke up wanting to stretch my web scraping skills, and to do so while listening to some music, so why not scrape some music data? In this post I will scrape some data on artists and their lyrics, so that in a future post I can build some visualizations on the data, train an LSTM on the lyrics and maybe make it compose some new ones!

This whole post is about scraping data from a Brazilian lyrics website called Vagalume (literally "firefly" in PT-BR), which stores lyrics for a lot of artists, not only Brazilian ones. The site's popularity ratings might make for some cool visualizations. And of course, since it's accessed almost exclusively by Brazilians, it should say something about their musical tastes, on some level.

NOTE: The datasets generated by this post are stored on Kaggle here, in case you want to use them for some analysis of your own.

Some packages we will need for this task:

library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(furrr)

The packages httr, xml2 and rvest are the main ones for HTML-based web scraping these days. I'm loading tidyverse for all the usual reasons and furrr for the combination of the future and purrr packages.

The website has an A-to-Z list of artists: a link for each letter, then a link for each artist starting with that letter, which in turn contains a list of song names with the lyrics "inside". It also groups artists into music genre sections, which I will use; they work very similarly to the A-to-Z sections. I'll need to go through lots of links in order to collect all the data I want.

1 Scraping artists data

First let's grab data on the artists; it's a first step to know how many artists and songs the site has. At the end of this section we should have a data frame containing data on the artists on the website.

To obtain our data frame we will need a list of links to artists' pages on the site, as well as a scraper function to return the data we want from them. The scraper is the function that receives a link to an artist's page and returns the data we want. Using the rvest package we can build one.

Let's first build a scraper that returns the data we need from a single artist. For that purpose I'm going to use the page for Green Day.

We can see that it contains the band's name, the list of all the lyrics, and a sub-page "Popularidade" with the popularity history of the band according to the site's access history. These parts are present for every artist, while other sections aren't, since some artists might not have them. So, here is the information we can get for each individual artist:

  • Artist’s name;
  • Genre labels;
  • Number of songs on the site;
  • Popularity;

Let’s get scraping!

Building the scraper

I'm going to start by scraping the band's name to illustrate how rvest works for web scraping. The code below does that job for us. read_html() parses an HTML page into R. Then, html_nodes() extracts a specific part (node) of the page that contains the data we want. You can find the node's CSS selector with Selector Gadget or by inspecting the page's code in the browser. Also, a fun and very instructive tutorial on CSS selectors is CSS Diner. Finally, we turn the node into text, removing line breaks and that sort of stuff, with html_text().

# Getting an artist's name
read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".darkBG") %>% html_text()
## [1] "Green Day"

There, now let's try to get the music genre labels for the band:

# Scraping data on genre
read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".subHeaderTags") %>% 
  as_list() %>% unlist() %>% paste(collapse = "; ")
## [1] "Rock; Punk Rock; Rock Alternativo"

In this case we need to parse it to a list using xml2::as_list() and then use base::unlist() in order to transform the HTML into a vector. If we had used rvest::html_text() it would have concatenated all the labels together with no separator. Green Day has the labels "Rock", "Punk Rock" and "Rock Alternativo" (Alternative Rock).

Now let's try to get the number of songs for the band. Their main page has the list of all their songs, so we can get them and then store their count.

read_html("https://www.vagalume.com.br/green-day/") %>% html_nodes(".nameMusic") %>% html_text() %>% 
  unique() %>% length()
## [1] 239

The first line of code above, if run alone, returns a vector with all the song names on the page. However, before the full list of songs there is a top 25 of the artist's songs, so we remove duplicates using unique() and then count how many songs there are with length() applied to the resulting vector of unique song names.

Now, last but not least, the artist's popularity. This one is stored in a sub-page (/popularidade/) of each artist. At the end of that page the current popularity is shown in a paragraph that always begins with the word "Atualmente".

read_html("https://www.vagalume.com.br/green-day/popularidade/") %>% html_nodes(".text") %>% html_text()
## [1] "A primeira vez que Green Day apareceu no Vagalume foi em Junho de 2003.As 2 músicas mais acessadas na época eram Basket Case e When I Come Around."                                                              
## [2] "Nos meses de Março e Maio de 2005, Green Day ganhou popularidade com a música Boulevard Of Broken Dreams.\nNo período, a popularidade máxima atingiu 150,0 pontos em Março."                                     
## [3] "Por 5 meses, entre Setembro de 2005 e Fevereiro de 2006, Green Day ganhou popularidade com a música Wake Me Up When September Ends.\nNo período, a popularidade máxima atingiu 178,0 pontos em Novembro de 2005."
## [4] "Atualmente, em Setembro de 2018, a popularidade está em 9,0 pontos e a música mais acessada é Wake Me Up When September Ends."

So we need to keep only the paragraph that begins with that word (which means "currently", by the way). Then we need to extract the value of popularity (9.0 in this case), which always sits between the words "está em" and "pontos".

read_html("https://www.vagalume.com.br/green-day/popularidade/") %>% html_nodes(".text") %>% html_text() %>% 
  tail(1) %>% # Extracting last phrase
  str_extract(., "(?<=está em )(.*)(?= pontos)") %>% # Using regular expressions to extract the number
  str_replace(",", ".") %>% as.numeric() # Replacing the Brazilian decimal "," with "." and parsing to numeric
## [1] 9

It took some work, but we got the popularity for the band. A shame it's rounded to an integer, but it won't be completely useless.

Now we know how to get all the information we need from a single artist, so let's build our scraper function:

scrap_artist <- function(artist_link){
  # Reading the entire pages
  page <- read_html(paste0("https://www.vagalume.com.br", artist_link))
  pop_page <- read_html(paste0("https://www.vagalume.com.br", artist_link, "popularidade/"))
  
  # Getting desired variables
  A <- page %>% html_nodes(".darkBG") %>% html_text()
    
  G <- page %>% html_nodes(".subHeaderTags") %>% 
    as_list() %>% unlist() %>% paste(collapse = "; ")
  
  S <- page %>% html_nodes(".nameMusic") %>% html_text() %>% 
    unique() %>% length()
  
  P <- pop_page %>% html_nodes(".text") %>% html_text() %>% 
    tail(1) %>% 
    str_extract(., "(?<=está em )(.*)(?= pontos)") %>% 
    str_replace(",", ".") %>% as.numeric() 
  
  # Creating tibble
  res <- tibble(Artist = A, Genres = G, Songs = S, Popularity = P, Link = artist_link)
  return(res)
}

# Testing the scraper function
scrap_artist("/green-day/")
## # A tibble: 1 x 5
##   Artist    Genres                            Songs Popularity Link       
##   <chr>     <chr>                             <int>      <dbl> <chr>      
## 1 Green Day Rock; Punk Rock; Rock Alternativo   239          9 /green-day/

Nice, a single function that receives a link to an artist and returns all the data we defined for that artist. Notice that I defined the function so that it receives only the relevant part of the link, which is what we will scrape from the site in order to form our data frame.
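Speaking of which, we also need that list of artist links. I collected them from the genre (and A-to-Z) index pages mentioned at the beginning. The sketch below shows the general idea; the index URLs and the href-filtering pattern are assumptions of mine about the site's structure, so adjust them to whatever the pages actually look like.

# Sketch: collecting artist links from an index page (URLs and pattern are assumptions)
get_artist_links <- function(index_url){
  read_html(index_url) %>% 
    html_nodes("a") %>%              # every anchor on the page
    html_attr("href") %>%            # their links
    str_subset("^/[a-z0-9-]+/$") %>% # keep artist-looking paths such as "/green-day/"
    unique()
}

# Hypothetical genre index pages -- replace with the actual listing URLs on the site
genre_pages <- c("https://www.vagalume.com.br/browse/style/rock.html",
                 "https://www.vagalume.com.br/browse/style/samba.html")

all_artists_links <- map(genre_pages, get_artist_links) %>% unlist() %>% unique()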

Building the dataset

Now that we have the scraper function scrap_artist and a list of links to map it through, we can build our dataset. Since our scraper returns a tibble I will use the map_dfr variation, which already binds the resulting data frames by rows. Of course there is a furrr::future_map_dfr() to save us, since this operation is going to be much more demanding than the last one.

# Defining a safe version of the scraper with possibly()
p_scrap_artist <- possibly(scrap_artist, 
                           otherwise = tibble(Artist = NA, 
                                              Genres = NA, 
                                              Songs = NA, 
                                              Popularity = NA, 
                                              Link = NA))

# Planning parallel processing
plan(multisession)

# Mapping the scraper through all the artists' links
all_artists <- future_map_dfr(all_artists_links, ~p_scrap_artist(.))

It took around 16 minutes to finish this scraping on my gaming laptop; it may vary with internet connection and processing power. Notice that I used possibly() to create a new function. It wraps a function so that, in case the wrapped function throws an error, it won't stop our code; instead it returns whatever was passed to the otherwise argument, which here is a tibble of NAs. It's also possible to use the argument .progress = TRUE in the future_map_* functions to track the progress of big operations like this one.
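If you haven't used possibly() before, here is a tiny illustration of the behaviour (a toy example of my own, not part of the scraping):

# log() errors on a character input; the wrapped version returns NA instead
safe_log <- possibly(log, otherwise = NA_real_)
safe_log(10)
## [1] 2.302585
safe_log("a")
## [1] NA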

One last step is to tidy up the genres: I want to keep only the labels we need:

# Selecting genre labels to keep
genres_keep <- c("Rock", "Hip Hop", "Pop", "Sertanejo", "Funk Carioca", "Samba")


# Removing other genre labels
all_artists_fixed <- all_artists %>% 
  separate(Genres, c("G1", "G2", "G3"), sep = "; ") %>%  # Separating Genres variable
  gather(G1:G3, key = "G", value = "Genre") %>% select(-G) %>% # One row per artist-genre pair
  filter(Genre %in% genres_keep) # Keeping only the genres selected above

There, now we have data on all the artists in the genres I specified. The number of rows grew to \(3622\) because of artists that had more than one label. I'd rather keep it this way, since removing the duplicates might bias the results towards one genre; this way an artist weighs equally in every genre it belongs to. The dataset on artists is built! We can do one last step of data scraping before we stop for today: scraping the lyrics!
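A quick sanity check at this point is to count how many artist rows ended up in each genre (I'll omit the output, since it depends on the day you scrape):

# Number of artist rows per genre label
all_artists_fixed %>% count(Genre, sort = TRUE)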

2 Scraping lyrics data

I decided to already scrape some data on the lyrics to use in the future analysis post: some text analysis, maybe some sentiment analysis, and then I have plans for a lyrics-composing AI. I will create a little monster that will ruin the music industry... ok, I'll stop dreaming. Let's do it!

Building the scraper

We are going to do it similarly to how we did it with the artists: build a scraper that gets a single lyric and then map it through all the song links. I already built a lyrics scraper, get_lyric(), that returns the lyric given a song's link. The code below defines it:

# Extracts a single lyric from a song link
get_lyric <- function(song_link){
  
  # Reading the html page
  lyric <- read_html(paste0("https://www.vagalume.com.br", song_link)) %>% html_nodes("#lyrics")
  
  # Creating a separator node to replace line breaks with ". "
  dummy <- xml_node(read_xml("<doc><span>. </span></doc>"), "span")

  # Replacing line-breaks
  xml_add_sibling(xml_nodes(lyric, "br"), dummy)
  
  res <- lyric %>% html_text()
  
  return(res)
}

# Testing it on Holiday from Green Day
get_lyric("/green-day/holiday.html")
## [1] "Hear the sound of the falling rain. Coming down like an Armageddon flame (HEY!). The shame, the ones who died without a name. . Hear the dogs howling out of key. To a hymn called \"Faith and Misery\" (Hey!). And bleed, the company lost the war today. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. On holiday. . Hear the drum pounding out of time. Another protestor has crossed the line (HEY!). To find, the money's on the other side. . Can I get another Amen (AMEN!). There's a flag wrapped around a score of men (HEY!). A gag, A plastic bag on a monument. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. On holiday. . The representative from California has the floor. . Zieg Heil to the president gasman. Bombs away is your punishment. Pulverize the Eiffel towers. Who criticize your government. Bang bang goes the broken glass. Kill all the fags that don't agree. Trials by fire, setting fire. Is not a way that's meant for me. Just cause...(hey, hey). Just cause, because we're outlaws YEAH!. . I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. I beg to dream and differ from the hollow lies. This is the dawning of the rest of our lives. This is our lives on holiday"

From this specific lyric we can see that there are still some escaped special characters. Instead of treating that issue here, I'll remove all special characters when analyzing the lyrics. Better, right? Let's get to the next step of our scraping adventure.

Building the lyrics dataset

Now all we have to do is map our scraper get_lyric() through all the songs to get our final lyrics dataset.
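Before that, though, we need the links to every song. One way to get them is to revisit each artist page and pull the href attribute from the same .nameMusic nodes we used before; treating those nodes as anchor tags carrying the link is an assumption on my part, so check it against the real markup. A sketch:

# Sketch: collecting every song link from an artist's page (assumes .nameMusic nodes are <a> tags)
get_song_links <- function(artist_link){
  links <- read_html(paste0("https://www.vagalume.com.br", artist_link)) %>% 
    html_nodes(".nameMusic") %>% 
    html_attr("href") %>% 
    unique()
  tibble(ALink = artist_link, SLink = links)
}

# Safe version and parallel mapping, visiting each artist only once
p_get_song_links <- possibly(get_song_links, otherwise = tibble(ALink = NA, SLink = NA))
plan(multisession)
all_songs_links <- future_map_dfr(unique(all_artists_fixed$Link), ~p_get_song_links(.))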

# Generating a safe version with possibly()
p_get_lyric <- possibly(get_lyric, otherwise = NA)

# Mapping it through the all_songs_links$SLink vector
plan(multisession)
all_lyrics <- future_map_chr(all_songs_links$SLink, ~p_get_lyric(.)[1], .progress = TRUE)

# Adding it to the dataframe
all_songs_links$Lyric <- all_lyrics %>% str_replace_all("\\. \\. ", ". ")

This one took around 9,064.08 seconds (about 151 minutes) to finish, which is quite some time!

You might be asking why I didn't build a single function to scrape all of an artist's lyrics at once. Well, I tried that at first: a function that took the artist link, mapped through all the song links inside it and returned a tibble with all of the artist's lyrics. But it was literally taking hours to run. I think that happens because furrr can't optimize functions that move large pieces of data around very well. Also, it was a function with a map call inside, and it seems it's not possible to call a future within a future call, so I had to separate the two steps in order to optimize the running time.
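For the record, that slower all-in-one version looked roughly like this (a sketch, reusing the helpers defined above):

# Rough sketch of the abandoned approach: one call per artist, with a plain map inside
scrap_artist_lyrics <- function(artist_link){
  song_links <- get_song_links(artist_link)$SLink
  tibble(ALink = artist_link,
         SLink = song_links,
         Lyric = map_chr(song_links, ~p_get_lyric(.)[1]))
}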

3 That’s it!

Both datasets can be found on my Kaggle datasets page, here. Feel free to analyze them the way you want!

This post is the first of two parts of the music analysis. In the next one I plan on briefly analyzing this data and training an LSTM on the lyrics of each genre. I hope you liked it so far! If you have any feedback in mind, share it on my Twitter @a_neisse or in the contact section of this website.
