library(tidyverse)
library(httr)
library(xml2)
library(rvest)
library(furrr)
Scraping lyrics from Vagalume
Today I woke up with the urge to stretch my web scraping skills, and I wanted to do it while listening to some music, so why not scrape some music data? In this post I scrape data on artists and their lyrics; in a future post I plan to build some visualizations on that data and train an LSTM on the lyrics, maybe even have it compose some new ones!
This whole post is about scraping data from a Brazilian lyrics website called Vagalume (literally Firefly in PT-BR), which stores lyrics for a lot of artists, not only Brazilian ones. The site's popularity ratings might make for some cool visualizations. And since it's accessed almost exclusively by Brazilians, it should say something about their musical tastes, at least on some level.
NOTE: The datasets generated by this post are stored on Kaggle here, in case you want them for your own analysis.
Some packages we will need for this task:
The packages httr, xml2 and rvest are the main ones for HTML-based web scraping these days. I also load tidyverse for general data wrangling and furrr, which combines the future and purrr packages for parallel mapping.
The website has an A-to-Z list of artists: a link for each letter, then a link for each artist starting with that letter, which in turn contains a list of song titles with the lyrics "inside". It also groups artists into music genre sections, which work very similarly to the A-to-Z sections and are what I will use. I'll need to go through lots of links in order to collect all the data I want.
1 Scraping artists data
First let's grab data on the artists; it's a natural first step to find out how many artists and songs the site has. By the end of this section we should have a data frame with data on the artists listed on the website.
To obtain that data frame we will need a list of links to artists' pages on the site, as well as a scraper function that receives a link to an artist's page and returns the data we want from it. Using the rvest package we can build such a scraper.
Let's first build a scraper that returns the data we need from a single artist. For that purpose I'm going to use the page for Green Day.
We can see that it contains the band's name, the list of all their lyrics, and a sub-page "Popularidade" with the band's popularity history according to the site's access logs. These elements are present for every artist, while other parts of the page may be missing for some of them. Here is the information we can get for an individual artist:
- Artist’s name;
- Genre labels;
- Number of musics in the site;
- Popularity;
Let’s get scraping!
Building the scraper
I'm going to start by scraping the band's name to illustrate how rvest works. The code below does that job for us. read_html() parses an HTML page into R. Then html_nodes() extracts the specific part (node) of the page that contains the data we want; you can find the node's CSS selector with Selector Gadget or by inspecting the page's source in the browser. A fun and very instructive tutorial on CSS selectors is CSS Diner. Finally, html_text() turns the node into plain text, stripping line breaks and that sort of thing.
# Getting an artist's name
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".darkBG") %>%
  html_text()
There, now let's try to get the band's music genre labels:
# Scraping data on genre
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".subHeaderTags") %>%
  as_list() %>%
  unlist() %>%
  paste(collapse = "; ")
In this case we parse the node to a list using xml2::as_list() and then flatten it to a vector with base unlist(). If we had used rvest::html_text() it would have concatenated all the labels with no separator. Green Day has the labels "Rock", "Punk Rock" and "Rock Alternativo" (Alternative Rock).
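For comparison, here is what the html_text() route looks like on the same node; it returns the labels glued together in a single string (a quick illustrative check, the exact output depends on the page's markup):

# html_text() on the same node concatenates the labels without a separator,
# e.g. something like "RockPunk RockRock Alternativo"
read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".subHeaderTags") %>%
  html_text()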
Now let's get the number of songs for the band. Their main page lists all of their songs, so we can grab the titles and then store their count.
read_html("https://www.vagalume.com.br/green-day/") %>%
html_nodes(".nameMusic") %>%
html_text() %>%
unique() %>%
length()
The first part of the code above (up to html_text()), if run alone, returns a vector with all the song names on the page. However, the full list of songs is preceded by a Top 25 of the artist's songs, so we remove the duplicates with unique() and then count the remaining titles with length().
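Just to make the duplication visible, we can compare the raw count with the deduplicated one (an illustrative sketch; the exact numbers depend on the page at the time of scraping):

# Comparing raw vs. deduplicated song counts for Green Day
song_names <- read_html("https://www.vagalume.com.br/green-day/") %>%
  html_nodes(".nameMusic") %>%
  html_text()

length(song_names)         # includes the Top 25 repeats
length(unique(song_names)) # distinct song titles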
Now, last but not least, the artist's popularity. This one is stored in a sub-page (/popularidade/) of each artist. At the end of that page the current popularity is shown in a paragraph that always begins with the word "Atualmente".
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>%
html_nodes(".text") %>%
html_text()
So we need to keep the paragraph that begins with that word (which means "Currently", by the way), and then extract the popularity value (9.0 in this case), which always sits between the words "em" and "pontos".
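The extraction relies on a look-behind/look-ahead regex; here is a minimal illustration of it on a made-up sentence in the same format:

# The regex grabs whatever sits between "está em " and " pontos"
str_extract("Atualmente está em 9,0 pontos", "(?<=está em )(.*)(?= pontos)")
#> [1] "9,0"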
read_html("https://www.vagalume.com.br/green-day/popularidade/") %>%
html_nodes(".text") %>%
html_text() %>%
# Extracting last phrase
tail(1) %>%
# Using Regular Expressions to remove the number
str_extract(., "(?<=está em )(.*)(?= pontos)") %>%
# Replacing brazilian decimal "," by "." and arsing to numeric
str_replace(",", ".") %>%
as.numeric()
It took some work, but we got the band's popularity. It's a shame it's rounded to an integer, but it won't be completely useless.
Now we know how to get every information we need from a single artist, so let’s build our scraper function:
scrap_artist <- function(artist_link) {
  # Reading the artist's main page and popularity sub-page
  page <- read_html(paste0("https://www.vagalume.com.br", artist_link))
  pop_page <- read_html(paste0("https://www.vagalume.com.br", artist_link, "popularidade/"))

  # Getting desired variables
  A <- page %>%
    html_nodes(".darkBG") %>%
    html_text()

  G <- page %>%
    html_nodes(".subHeaderTags") %>%
    as_list() %>%
    unlist() %>%
    paste(collapse = "; ")

  S <- page %>%
    html_nodes(".nameMusic") %>%
    html_text() %>%
    unique() %>%
    length()

  P <- pop_page %>%
    html_nodes(".text") %>%
    html_text() %>%
    tail(1) %>%
    str_extract(., "(?<=está em )(.*)(?= pontos)") %>%
    str_replace(",", ".") %>%
    as.numeric()

  # Creating the resulting tibble
  res <- tibble(
    Artist = A,
    Genres = G,
    Songs = S,
    Popularity = P,
    Link = artist_link
  )

  return(res)
}
# Testing the scraper function
scrap_artist("/green-day/")
Nice, a single function that receives a link to an artist and returns all the data we defined for that artist. Notice that I defined the function so that it receives only the relative part of the link, which is what we will scrape from the site in order to build our data frame.
Obtaining the artists' links
Now, since this post is just to stretch my web scraping skills and have some fun, I won't scrape data for every artist, which could take ages since the process depends not only on processing power but also on the internet connection and the time it takes to reach each link. Instead, we will pick some music genres to play with:
- Rock
- Hip Hop
- Pop music
- Sertanejo (Basically the Brazilian version of Country Music)
- Funk Carioca (originated from 60s US Funk, but a completely different genre in Brazil nowadays)
- Samba (Typical Brazilian music)
In order to map our scraper over artists we need the links to their pages on the website. For that, we first need the links of the music genre sections we want to scrape:
<- read_html("https://www.vagalume.com.br/browse/style/")
page
<- page %>%
styles html_nodes(".itemTitle") %>%
html_text() %>%
str_replace(., "/", "-") %>%
str_replace(., "á", "a") %>%
str_replace(., "é", "e") %>%
str_replace(., "â", "a") %>%
str_replace(., "ó", "o") %>%
str_replace_all(., "ú", "u") %>%
str_replace(., "&", "-n-") %>%
str_replace(., " ", "-") %>%
str_replace(., "J-Pop-J-Rock", "j-pop") %>%
str_replace(., "Gospel-Religioso", "gospel") %>%
str_replace(., "Romantico", "romantica") %>%
tolower()
<- paste0("/browse/style/", styles, ".html") section_links
Now, on one specific section page ("Rock"), we need to figure out how to get the list of artists; we will need that list of all the artists in order to scrape the data with the scrap_artist function. Let's start by getting the nodes with class moreNamesContainer (".moreNamesContainer"), identified by inspecting the page's code. The table containing all the names on each section page is converted to a list (which unfortunately has 4 nesting levels), and the code below extracts the links from it. It's already wrapped in a function, get_artist_links(), that will be useful for us.
get_artist_links <- function(section_link) {
  # Reading the page
  page <- read_html(paste0("https://www.vagalume.com.br", section_link))

  # Extracting the desired node as a list
  nameList <- page %>%
    html_nodes(".moreNamesContainer") %>%
    as_list()

  # Removing undesired list levels and extracting the 'href' attribute
  # 'NOT ELEGANT AT ALL' ALERT
  artist_links <- nameList %>%
    unlist(recursive = F) %>%     # Removing first undesired level
    unlist(recursive = F) %>%     # Removing second undesired level
    unlist(recursive = F) %>%     # Removing third undesired level, ugh
    map(., ~ attr(., "href")) %>% # Mapping through the list to extract the attribute
    unlist() %>%
    as.character()                # Removing names and parsing to a vector

  return(unique(artist_links))
}
get_artist_links(section_link = "/browse/style/rock.html") %>% head(10)
Ouch, what inelegant code, probably due to my limited knowledge of purrr; I definitely need to go deeper than the map_*() functions in that package soon. But OK, let's map it over all values of section_links to obtain the links of every artist on the website. To speed things up we will use furrr::future_map(), which conveniently combines purrr::map() with the parallel processing provided by the future package.
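As an aside, the nested unlist() calls could likely be avoided by selecting the anchor tags directly and pulling their href attributes with html_attr(). Here is a hedged sketch of that idea; the ".moreNamesContainer a" selector is an assumption about the page markup that I haven't verified:

# A possibly tidier alternative: grab the <a> nodes inside the container
# and extract their "href" attributes directly (selector is an assumption)
get_artist_links_alt <- function(section_link) {
  read_html(paste0("https://www.vagalume.com.br", section_link)) %>%
    html_nodes(".moreNamesContainer a") %>%
    html_attr("href") %>%
    unique()
}

Either way, let's run the parallel map over the genre sections: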
# Planning parallel processing
plan(multisession)

# Getting all artists' links from the website
#all_artists_links <- get_artist_links("/browse/style/sertanejo.html")
all_artists_links <- future_map(section_links, ~ get_artist_links(.)) %>%
  unlist()
I got \(3.255\) links when I scraped the data on February 10th, 2019. That leads us to the first big step in this endeavour: our artists dataset.
Building the dataset
Now that we have the scraper function scrap_artist() and a list to map it over, we can build our dataset. Since our scraper returns a tibble, I will use the map_dfr() variation, which binds the resulting data frames by rows. Of course there is a furrr::future_map_dfr() to save us, since this operation is going to be much more demanding than the last one.
# Defining a safe version of the scraper with possibly()
p_scrap_artist <- possibly(scrap_artist,
                           otherwise = tibble(Artist = NA,
                                              Genres = NA,
                                              Songs = NA,
                                              Popularity = NA,
                                              Link = NA))

# Planning parallel processing
plan(multisession)

# Scraping the data on every artist
all_artists <- future_map_dfr(all_artists_links, ~ p_scrap_artist(.))
It took around 16 minutes to finish this scraping on my gaming laptop; your mileage may vary with internet connection and processing power. Notice that I used possibly() to create a new function: it wraps the original one so that, in case of an error, the code doesn't stop but instead returns whatever was passed to the otherwise argument, which here is a tibble of NAs. It's also possible to pass .progress = TRUE to the future_map_*() functions to follow the progress of big operations like this one.
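For reference, requesting the progress bar just means adding that argument to the same call we ran above (a sketch, not rerun here):

# Same scraping call, now with furrr's progress bar
plan(multisession)
all_artists <- future_map_dfr(all_artists_links, ~ p_scrap_artist(.), .progress = TRUE)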
Originally I wanted to keep only some genres, as shown below, but in the most recent revision of the data on Kaggle I scraped all the genres, so the code below is commented out; you can use it if you want to keep only specific genres:
# Selecting genre labels to keep
#genres_keep <- c("Rock", "Hip Hop", "Pop", "Sertanejo", "Funk Carioca", "Samba")
# Removing other genre labels
#all_artists_fixed <- all_artists %>%
# separate(Genres, c("G1", "G2", "G3"), sep = "; ") %>% # Separating Genres variable
# gather(G1:G3, key = "G", value = "Genre") %>% select(-G) %>%
# filter(Genre %in% genres_keep)
There, now we have data on all the artists. The number of rows grew to \(3622\) because some artists have more than one genre label. I'd rather keep it this way, since removing the duplicates might bias the results toward one genre; at least this way each artist weighs equally in every genre they are labeled with. The artists dataset is built! We can take one last step in the data scraping before we stop for today: scraping their lyrics!
2 Scraping lyrics data
I decided to already scrape some data on the lyrics to use in the future analysis post: some text analysis, maybe some sentiment analysis, and then I have plans for a lyrics-composing AI. I will create a little monster that will ruin the music industry... OK, I'll stop dreaming. Let's do it!
Building the scraper
We are going to do this similarly to how we did it with the artists: build a scraper to get all the individual links, then map through them. I already built a lyrics scraper, get_lyric(), that returns the lyric given a song's link. The code below defines it:
# Extracts a single lyric from a song link
get_lyric <- function(song_link){

  # Reading the html page and extracting the lyric text
  lyric <- read_html(paste0("https://www.vagalume.com.br", song_link)) %>%
    html_nodes("#lyrics") %>%
    html_text2()

  return(lyric)
}
# Testing it on Holiday from Green Day
get_lyric(song_link = "/green-day/holiday.html")
From this specific lyric we can see that there are still some special characters left. Instead of just treating this "\" issue here, I'll remove all special characters when analyzing the lyrics. Better, right? Let's get to the next step of our scraping adventure.
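For the record, the kind of cleanup I have in mind for the analysis post might look roughly like the sketch below; the character class is an illustrative assumption, not the final cleaning rule:

# Hypothetical cleanup step: keep letters, digits, spaces and basic
# punctuation, replace everything else with a space
clean_lyric <- function(lyric) {
  str_replace_all(lyric, "[^[:alnum:][:space:].,!?'-]", " ")
}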
Links to scrape the data from
Now we need the list of songs to map the scraper through. The code below defines get_lyrics_links(), which receives an artist's link and returns a tibble with the name and link of each of their songs. It also returns the artist's link that was passed in, so it can be used as an ID to join with the artists dataset later.
# Extracts the names and links of all songs from an artist link
get_lyrics_links <- function(artist_link){

  # Reading the html page of the artist
  page <- read_html(paste0("https://www.vagalume.com.br", artist_link))

  # Obtaining all the songs' names and links
  music_name_node <- page %>% html_nodes(".nameMusic")
  music_names <- music_name_node %>% html_text()
  music_links <- music_name_node %>% html_attr("href")

  # Building the final tibble
  res <- tibble(ALink = rep(artist_link, length(music_names)),
                SName = music_names,
                SLink = music_links)

  return(res)
}
# Testing final function
get_lyrics_links("/green-day/") %>% head(5)
Notice that I use html_attr("href") to extract the "href" attribute from the html nodes, which is the link to each song's page. Let's map this function through all the artists we have in order to obtain the links to the songs:
# Generating a safe version of get_lyrics_links()
p_get_lyrics_links <- possibly(get_lyrics_links,
                               otherwise = tibble(ALink = NA,
                                                  SName = NA,
                                                  SLink = NA))

# Getting all the links to the songs
plan(multisession)
all_songs_links <- future_map_dfr(all_artists_links, ~ p_get_lyrics_links(.))
After an 82.98-second wait in my case, we find that there are \(209.522\) songs for our \(3.355\) artists.
Building the lyrics dataset
Now all we have to do is map our scraper get_lyric() through all these songs to get our final lyrics dataset.
# Generating a safe version with possibly()
p_get_lyric <- possibly(get_lyric, otherwise = NA)

# Mapping it through the all_songs_links$SLink vector
plan(multisession)
all_lyrics <- future_map_chr(all_songs_links$SLink, ~ p_get_lyric(.)[1], .progress = TRUE)

# Adding the lyrics to the data frame
all_songs_links$Lyric <- all_lyrics %>% str_replace_all("\\. \\. ", ". ")
This one took around 9,064.08 seconds (151 minutes) to finish, quite some time!
You might be asking why I didn't build a single function to scrape the lyrics. Well, I tried that at first: a function that took the artist's link, mapped through all the song links inside it and returned a tibble with all of the artist's lyrics. But it was literally taking hours to run. I think this happens because furrr can't optimize functions that move large pieces of data around very well. Also, it was a function with a map call inside, and it seems it's not possible to call a future within a future call, so I had to separate the two steps in order to keep the running time reasonable.
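For illustration, such a one-shot function might have looked roughly like the sketch below; this is my reconstruction of the idea, not the original code:

# The nested version that was too slow: one call per artist, with an
# inner map over that artist's songs (kept here only for illustration)
scrap_artist_lyrics <- function(artist_link) {
  get_lyrics_links(artist_link) %>%
    mutate(Lyric = map_chr(SLink, ~ p_get_lyric(.)[1]))
}

# all_songs <- future_map_dfr(all_artists_links, ~ scrap_artist_lyrics(.))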
Finally, we can use the cld2 package to infer each lyric's language, to make future analysis easier:
library(cld2)

all_songs_links <- all_songs_links %>%
  mutate(language = detect_language(Lyric))
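A quick sanity check on the result could be to count how many lyrics fall into each detected language (illustrative; the actual counts aren't reported here):

# Counting lyrics per detected language
all_songs_links %>%
  count(language, sort = TRUE)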
3 That’s it!
Both datasets can be found on my Kaggle datasets page, here. Feel free to analyze them any way you want!
This post is the first of two parts of the music analysis. In the next one I plan to briefly analyze this data and train an LSTM on the lyrics of each genre. I hope you liked it so far! If you have any feedback, share it on my Twitter @a_neisse or in the contact section of this website.