Last updated: 2020-06-01
Checks: 7 passed, 0 failed
Knit directory: requestival/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200529) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8f40292. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: code/_spotify_secrets.R
Ignored: data/.DS_Store
Ignored: data/raw/.DS_Store
Ignored: data/raw/requestival_24_files/
Ignored: data/raw/requestival_25_files/
Ignored: data/raw/requestival_26_files/
Ignored: data/raw/requestival_27_files/
Ignored: data/raw/requestival_28_files/
Ignored: data/raw/requestival_29_files/
Ignored: data/raw/requestival_30_files/
Ignored: data/raw/requestival_31_files/
Ignored: output/01-scraping.Rmd/
Ignored: output/02-tidying.Rmd/
Ignored: output/03-augmentation.Rmd/
Ignored: output/04-exploration.Rmd/
Ignored: renv/library/
Ignored: renv/staging/
Untracked files:
Untracked: data/raw/requestival_29.html
Untracked: data/raw/requestival_30.html
Untracked: data/raw/requestival_31.html
Unstaged changes:
Modified: data/01-requestival-scraped.tsv
Note that any generated files, e.g. HTML, PNG, CSS, etc., are not included in this status report because it is okay for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-scraping.Rmd) and HTML (docs/01-scraping.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 107e6dc | Luke Zappia | 2020-05-29 | Add scraping file |
| html | 107e6dc | Luke Zappia | 2020-05-29 | Add scraping file |
source(here::here("code", "setup.R"))
We are going to start by scraping the data into a usable form, beginning with HTML files downloaded from the Triple J Recently Played page.
html_files <- fs::dir_ls(PATHS$html_dir, glob = "*.html")
Chunk time: 0.01 secs
There are 8 files, one for each day of the Requestival.
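Since we know how many days the event ran for, a quick defensive check that we picked up every file might look like this (a sketch):

# One HTML file per day of the Requestival
stopifnot(length(html_files) == 8)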
The HTML files are well structured and the entries for individual songs look like this:
</div></li><li class="view-playlistItem listeItem clearfix"><div class="time">12:07am</div>
<div class="comp-image">
<div class="thumbnail">
<img src="./requestival_24_files/http___www.abc.net.au_dig_covers_original_gorillaz_plastic.jpg" alt="">
</div>
</div>
<div class="info">
<div class="title">
<h5>Stylo</h5>
</div>
<div class="artist">Gorillaz</div>
<div class="release">Plastic Beach </div>
<ul class="search clearfix">
<li><a href="https://www.youtube.com/results?search_query=Gorillaz%20Stylo" target="_blank">YouTube</a></li>
<li>| <a href="https://play.spotify.com/search/results/artist:Gorillaz%20track:Stylo" target="_blank">Spotify</a></li>
</ul>
It’s pretty easy to pick out the information we are looking for, such as the time played, the artist, the song name and the album. Some of these elements also have special classes, which should make things easier.
Let’s try to scrape song information from these files using the {rvest} package. Much of this is based on a handy tutorial that does a similar thing for the Billboard Hot 100.
The play time, artist and release information are stored in divs with their own special classes, so let’s pull those out first. For now we will just work with the first HTML file.
html <- read_html(html_files[1])
times <- html %>%
  html_nodes(".time") %>%
  html_text()

artists <- html %>%
  html_nodes(".artist") %>%
  html_text()

releases <- html %>%
  html_nodes(".release") %>%
  html_text()
Chunk time: 0.29 secs
We have found 311 times, 361 artists and 361 releases. These aren’t quite the same length 😿. A quick look at the web page shows us that there is a “Most Played” section at the bottom of the page. This includes the 50 songs that Triple J are currently playing most often. Conveniently, this explains the extra information we have found 🎉! Only the recently played songs have times, so we can use this as the number of songs we expect to find, or just remove the last 50.
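We can turn that explanation into a quick sanity check (a sketch using the vectors extracted above):

# The extra entries should be the 50 "Most Played" songs,
# which have no associated play times (361 - 311 = 50)
stopifnot(length(artists) - length(times) == 50)
stopifnot(length(artists) == length(releases))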
Here is the information we have found so far:
n_played <- length(times)
played_idx <- seq_len(n_played)

played <- tibble(
  Time = times,
  Artist = artists[played_idx],
  Release = releases[played_idx]
)

played
Chunk time: 0.03 secs
The song names are stored in <h5> tags. Let’s extract those as well and see what we get.
h5s <- html %>%
  html_nodes("h5") %>%
  html_text()
Chunk time: 0.01 secs
This has given us a vector with 361 items. This is the same length as the artists and releases so it looks like this tag isn’t used for anything else on the site.
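It doesn’t hurt to make that assumption explicit with a quick check (a sketch using the vectors from above):

# Guard against <h5> being used elsewhere on the page
stopifnot(length(h5s) == length(artists))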
Let’s add the song names to the information we have so far:
played$Song <- h5s[played_idx]
played
Chunk time: 0.08 secs
That’s the most important information, but there are a few more things it might be useful to extract. Next to each song there is a set of links to YouTube, Spotify and Triple J Unearthed (a platform for new artists to share their work). Let’s see if we can scrape those as well. The links are in a list with class="search clearfix".
searches <- html %>%
  html_nodes(".search")
Chunk time: 0.03 secs
Selecting the search class gives us 362 items. This number can be explained by a search box plus the recently played and most played songs. It looks like this will give us what we want, as long as we ignore the first item.
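We can write that accounting down as a check too (a sketch based on the counts above):

# 1 search box + 311 recently played + 50 most played = 362
stopifnot(length(searches) == 1 + length(artists))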
Inside the list we have extracted there are items for each link. Not every song has all the links so we have to be a bit careful to make sure we are extracting them properly. Let’s just look at the first list to start with.
types_example <- searches[2] %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_text()

urls_example <- searches[2] %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_attr("href")

tibble(
  Type = types_example,
  URL = urls_example
)
Chunk time: 0.02 secs
By selecting the <li> tag and then the <a> tag we can get the information we want. In this case we want to extract both the text, to get the type of the link, and the href attribute, to get the URL (we could probably get the type from the URL but it’s already there so this is easier).

Let’s make this into a function that returns a tibble that we can apply to our list of search divs.
# Extract the link types and URLs from one of the .search list nodes
get_links <- function(search_div) {
  a_tags <- search_div %>%
    html_nodes("li") %>%
    html_nodes("a")

  tibble(
    Type = html_text(a_tags),
    URL = html_attr(a_tags, "href")
  )
}
Chunk time: 0 secs
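As a quick sanity check, running the function on the node we inspected manually should reproduce the tibble above:

# searches[1] is the page's search box, so the first song is at index 2
get_links(searches[2])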
Now we can run this for all the songs and see what we get. We will add a little bit of code to attach the song name and artist to the results. The artist is necessary because there can be several songs with the same name. We also select distinct links, as some songs have been played multiple times.
links <- purrr::map_dfr(played_idx, function(.idx) {
  get_links(searches[.idx + 1]) %>%
    mutate(
      Song = played$Song[.idx],
      Artist = played$Artist[.idx]
    )
}) %>%
  distinct()

links
Chunk time: 1.31 secs
This is currently in long format, where each row is a link, but as there is only one of each link type for each song it will be more convenient to have each type as a separate column.
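Here is a toy illustration of that reshape, using made-up values rather than the real links:

# One row per link becomes one row per song with a column per link type
tibble(
  Song = "Stylo",
  Type = c("YouTube", "Spotify"),
  URL = c("youtube_url", "spotify_url")
) %>%
  pivot_wider(names_from = Type, values_from = URL)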
links <- links %>%
  pivot_wider(names_from = Type, values_from = URL)

links
Chunk time: 0.08 secs
Now the links are in a form that is easy to join to the other song information (using the song name and artist as keys).
played <- played %>%
  left_join(links, by = c("Song", "Artist"))

played
Chunk time: 0.09 secs
We now have code for extracting all the information we want, so let’s put it together into a function that we can apply to each HTML file. We want the function to take the path to one of the HTML data files and do the following things:

- Read the HTML file
- Extract the play times, song names, artists and releases
- Extract the search links and reshape them into wide format
- Combine everything into a single tibble

The function looks like this.
scrape_songs <- function(html_path) {
  # Use the file name (minus extension) to identify which day this is
  file <- fs::path_ext_remove(fs::path_file(html_path))

  html <- read_html(html_path)

  # Only recently played songs have times, so this count excludes
  # the "Most Played" section at the bottom of the page
  times <- html %>%
    html_nodes(".time") %>%
    html_text()

  n_played <- length(times)
  played_idx <- seq_len(n_played)

  artists <- html %>%
    html_nodes(".artist") %>%
    html_text()

  releases <- html %>%
    html_nodes(".release") %>%
    html_text()

  songs <- html %>%
    html_nodes("h5") %>%
    html_text()

  search_divs <- html %>%
    html_nodes(".search")

  # Skip the first .search node (the search box) and collect the
  # links for each recently played song
  links <- purrr::map_dfr(played_idx, function(.idx) {
    get_links(search_divs[.idx + 1]) %>%
      mutate(
        Song = songs[.idx],
        Artist = artists[.idx]
      )
  }) %>%
    distinct() %>%
    pivot_wider(names_from = Type, values_from = URL)

  tibble(
    File = file,
    Time = times,
    Song = songs[played_idx],
    Artist = artists[played_idx],
    Release = releases[played_idx]
  ) %>%
    left_join(links, by = c("Song", "Artist"))
}
Chunk time: 0 secs
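Before mapping over all the files it’s worth testing the function on a single one, reusing the first file from above:

# Should match the results we built up step by step
scrape_songs(html_files[1])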
Let’s apply the scraping function to all the files and combine the results!
requestival <- purrr::map_dfr(html_files, scrape_songs)
requestival
Chunk time: 9.8 secs
We save this scraped dataset as a TSV for further analysis.
write_tsv(requestival, PATHS$scraped)
Chunk time: 0.03 secs
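If we are feeling cautious we can also round-trip the file to confirm nothing was lost on the way to disk (a sketch, assuming PATHS$scraped is the path written above):

# Read the TSV back and check that the dimensions survived
reloaded <- read_tsv(PATHS$scraped)
stopifnot(nrow(reloaded) == nrow(requestival))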
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.0 (2020-04-24)
os macOS Catalina 10.15.4
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Berlin
date 2020-06-01
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.0.0)
P backports 1.1.7 2020-05-13 [?] CRAN (R 4.0.0)
P base64enc 0.1-3 2015-07-28 [?] CRAN (R 4.0.0)
P blob 1.2.1 2020-01-20 [?] CRAN (R 4.0.0)
P broom 0.5.6 2020-04-20 [?] CRAN (R 4.0.0)
P cellranger 1.1.0 2016-07-27 [?] standard (@1.1.0)
P cli 2.0.2 2020-02-28 [?] CRAN (R 4.0.0)
P colorspace 1.4-1 2019-03-18 [?] standard (@1.4-1)
P conflicted * 1.0.4 2019-06-21 [?] standard (@1.0.4)
P crayon 1.3.4 2017-09-16 [?] CRAN (R 4.0.0)
P DBI 1.1.0 2019-12-15 [?] CRAN (R 4.0.0)
P dbplyr 1.4.4 2020-05-27 [?] CRAN (R 4.0.0)
P digest 0.6.25 2020-02-23 [?] CRAN (R 4.0.0)
P dplyr * 0.8.5 2020-03-07 [?] CRAN (R 4.0.0)
P ellipsis 0.3.1 2020-05-15 [?] CRAN (R 4.0.0)
P evaluate 0.14 2019-05-28 [?] standard (@0.14)
P fansi 0.4.1 2020-01-08 [?] CRAN (R 4.0.0)
P forcats * 0.5.0 2020-03-01 [?] CRAN (R 4.0.0)
P fs * 1.4.1 2020-04-04 [?] CRAN (R 4.0.0)
P generics 0.0.2 2018-11-29 [?] standard (@0.0.2)
P genius 2.2.2 2020-05-28 [?] CRAN (R 4.0.0)
P ggplot2 * 3.3.1 2020-05-28 [?] CRAN (R 4.0.0)
P git2r 0.27.1 2020-05-03 [?] CRAN (R 4.0.0)
P glue * 1.4.1 2020-05-13 [?] CRAN (R 4.0.0)
P gtable 0.3.0 2019-03-25 [?] standard (@0.3.0)
P haven 2.3.0 2020-05-24 [?] CRAN (R 4.0.0)
P here * 0.1 2017-05-28 [?] standard (@0.1)
P hms 0.5.3 2020-01-08 [?] CRAN (R 4.0.0)
P htmltools 0.4.0 2019-10-04 [?] standard (@0.4.0)
P httpuv 1.5.3.1 2020-05-26 [?] CRAN (R 4.0.0)
P httr 1.4.1 2019-08-05 [?] standard (@1.4.1)
P janeaustenr 0.1.5 2017-06-10 [?] CRAN (R 4.0.0)
P jsonlite 1.6.1 2020-02-02 [?] CRAN (R 4.0.0)
P knitr 1.28 2020-02-06 [?] CRAN (R 4.0.0)
P later 1.0.0 2019-10-04 [?] standard (@1.0.0)
P lattice 0.20-41 2020-04-02 [3] CRAN (R 4.0.0)
P lifecycle 0.2.0 2020-03-06 [?] CRAN (R 4.0.0)
P lubridate * 1.7.8 2020-04-06 [?] CRAN (R 4.0.0)
P magrittr 1.5 2014-11-22 [?] CRAN (R 4.0.0)
P Matrix 1.2-18 2019-11-27 [3] CRAN (R 4.0.0)
P memoise 1.1.0 2017-04-21 [?] standard (@1.1.0)
P modelr 0.1.8 2020-05-19 [?] CRAN (R 4.0.0)
P munsell 0.5.0 2018-06-12 [?] standard (@0.5.0)
P nlme 3.1-147 2020-04-13 [3] CRAN (R 4.0.0)
P pillar 1.4.4 2020-05-05 [?] CRAN (R 4.0.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.0.0)
P plyr 1.8.6 2020-03-03 [?] CRAN (R 4.0.0)
P promises 1.1.0 2019-10-04 [?] standard (@1.1.0)
P purrr * 0.3.4 2020-04-17 [?] CRAN (R 4.0.0)
P R6 2.4.1 2019-11-12 [?] CRAN (R 4.0.0)
P Rcpp 1.0.4.6 2020-04-09 [?] CRAN (R 4.0.0)
P readr * 1.3.1 2018-12-21 [?] standard (@1.3.1)
P readxl 1.3.1 2019-03-13 [?] standard (@1.3.1)
P reprex 0.3.0 2019-05-16 [?] standard (@0.3.0)
P reshape2 1.4.4 2020-04-09 [?] CRAN (R 4.0.0)
P rlang 0.4.6 2020-05-02 [?] CRAN (R 4.0.0)
P rmarkdown 2.1 2020-01-20 [?] CRAN (R 4.0.0)
P rprojroot 1.3-2 2018-01-03 [?] CRAN (R 4.0.0)
P rstudioapi 0.11 2020-02-07 [?] CRAN (R 4.0.0)
P rvest * 0.3.5 2019-11-08 [?] standard (@0.3.5)
P scales 1.1.1 2020-05-11 [?] CRAN (R 4.0.0)
P selectr 0.4-2 2019-11-20 [?] CRAN (R 4.0.0)
sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 4.0.0)
P SnowballC 0.7.0 2020-04-01 [?] CRAN (R 4.0.0)
P spotifyr * 2.1.1 2019-07-13 [?] CRAN (R 4.0.0)
P stringi 1.4.6 2020-02-17 [?] CRAN (R 4.0.0)
P stringr * 1.4.0 2019-02-10 [?] CRAN (R 4.0.0)
P tibble * 3.0.1 2020-04-20 [?] CRAN (R 4.0.0)
P tidyr * 1.1.0 2020-05-20 [?] CRAN (R 4.0.0)
P tidyselect 1.1.0 2020-05-11 [?] CRAN (R 4.0.0)
P tidytext 0.2.4 2020-04-17 [?] CRAN (R 4.0.0)
P tidyverse * 1.3.0 2019-11-21 [?] standard (@1.3.0)
P tokenizers 0.2.1 2018-03-29 [?] CRAN (R 4.0.0)
P vctrs 0.3.0 2020-05-11 [?] CRAN (R 4.0.0)
P whisker 0.4 2019-08-28 [?] standard (@0.4)
P withr 2.2.0 2020-04-20 [?] CRAN (R 4.0.0)
P workflowr 1.6.2 2020-04-30 [?] CRAN (R 4.0.0)
P xfun 0.14 2020-05-20 [?] CRAN (R 4.0.0)
P xml2 * 1.3.2 2020-04-23 [?] CRAN (R 4.0.0)
P yaml 2.2.1 2020-02-01 [?] CRAN (R 4.0.0)
[1] /Users/luke.zappia/Documents/Projects/requestival/renv/library/R-4.0/x86_64-apple-darwin17.0
[2] /private/var/folders/rj/60lhr791617422kqvh0r4vy40000gn/T/RtmpQWp4dI/renv-system-library
[3] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
P ── Loaded and on-disk path mismatch.
Chunk time: 0.2 secs