Last updated: 2020-06-01
Checks: 7 passed, 0 failed
Knit directory: requestival/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200529) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8f40292. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: code/_spotify_secrets.R
Ignored: data/.DS_Store
Ignored: data/raw/.DS_Store
Ignored: data/raw/requestival_24_files/
Ignored: data/raw/requestival_25_files/
Ignored: data/raw/requestival_26_files/
Ignored: data/raw/requestival_27_files/
Ignored: data/raw/requestival_28_files/
Ignored: data/raw/requestival_29_files/
Ignored: data/raw/requestival_30_files/
Ignored: data/raw/requestival_31_files/
Ignored: output/01-scraping.Rmd/
Ignored: output/02-tidying.Rmd/
Ignored: output/03-augmentation.Rmd/
Ignored: output/04-exploration.Rmd/
Ignored: renv/library/
Ignored: renv/staging/
Untracked files:
Untracked: data/raw/requestival_29.html
Untracked: data/raw/requestival_30.html
Untracked: data/raw/requestival_31.html
Unstaged changes:
Modified: data/01-requestival-scraped.tsv
Note that any generated files, e.g. HTML, PNG, CSS, etc., are not included in this status report because it is okay for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-scraping.Rmd) and HTML (docs/01-scraping.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 107e6dc | Luke Zappia | 2020-05-29 | Add scraping file |
| html | 107e6dc | Luke Zappia | 2020-05-29 | Add scraping file |
source(here::here("code", "setup.R"))
We are going to start by scraping the data into a usable form, beginning with HTML files downloaded from the Triple J Recently Played page.
html_files <- fs::dir_ls(PATHS$html_dir, glob = "*.html")
Chunk time: 0.01 secs
There are 8 files, one for each day of the Requestival.
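Since we know how many days the event ran for, a quick defensive check that we picked up every file might look like this (a sketch):

# One HTML file per day of the Requestival
stopifnot(length(html_files) == 8)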
The HTML files are well structured and the entries for individual songs look like this:
</div></li><li class="view-playlistItem listeItem clearfix"><div class="time">12:07am</div>
<div class="comp-image">
<div class="thumbnail">
<img src="./requestival_24_files/http___www.abc.net.au_dig_covers_original_gorillaz_plastic.jpg" alt="">
</div>
</div>
<div class="info">
<div class="title">
<h5>Stylo</h5>
</div>
<div class="artist">Gorillaz</div>
<div class="release">Plastic Beach </div>
<ul class="search clearfix">
<li><a href="https://www.youtube.com/results?search_query=Gorillaz%20Stylo" target="_blank">YouTube</a></li>
<li>| <a href="https://play.spotify.com/search/results/artist:Gorillaz%20track:Stylo" target="_blank">Spotify</a></li>
</ul>
It’s pretty easy to pick out the information we are looking for, such as the time played, the artist, the song name and the album. Some of these elements also have special classes, which should make things easier.
Let’s try to scrape song information from these files using the {rvest} package. Much of this is based on a handy tutorial that does a similar thing for the Billboard Hot 100.
The play time, artist and release information are stored in divs with their own special classes, so let’s pull those out first. For now we will just work with the first HTML file.
html <- read_html(html_files[1])
times <- html %>%
  html_nodes(".time") %>%
  html_text()

artists <- html %>%
  html_nodes(".artist") %>%
  html_text()

releases <- html %>%
  html_nodes(".release") %>%
  html_text()
Chunk time: 0.29 secs
We have found 311 times, 361 artists and 361 releases. These aren’t quite the same length 😿. A quick look at the web page shows us that there is a “Most Played” section at the bottom of the page. This includes the 50 songs that Triple J are currently playing most often. Conveniently, this explains the extra information we have found 🎉! Only the recently played songs have times, so we can use this as the number of songs we expect to find, or just remove the last 50.
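We can turn that explanation into a quick sanity check (a sketch using the vectors extracted above):

# The extra entries should be the 50 "Most Played" songs,
# which have no associated play times (361 - 311 = 50)
stopifnot(length(artists) - length(times) == 50)
stopifnot(length(artists) == length(releases))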
Here is the information we have found so far:
n_played <- length(times)
played_idx <- seq_len(n_played)

played <- tibble(
  Time = times,
  Artist = artists[played_idx],
  Release = releases[played_idx]
)

played
Chunk time: 0.03 secs
The song names are stored in <h5> tags. Let’s extract those as well and see what we get.
h5s <- html %>%
  html_nodes("h5") %>%
  html_text()
Chunk time: 0.01 secs
This has given us a vector with 361 items. This is the same length as the artists and releases so it looks like this tag isn’t used for anything else on the site.
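It doesn’t hurt to make that assumption explicit with a quick check (a sketch using the vectors from above):

# Guard against <h5> being used elsewhere on the page
stopifnot(length(h5s) == length(artists))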
Let’s add the song names to the information we have so far:
played$Song <- h5s[played_idx]
played
Chunk time: 0.08 secs
That’s the most important information, but there are a few more things it might be useful to extract. Next to each song there is a set of links to YouTube, Spotify and Triple J Unearthed (a platform for new artists to share their work). Let’s see if we can scrape those as well. The links are in a list with class="search clearfix".
searches <- html %>%
  html_nodes(".search")
Chunk time: 0.03 secs
Selecting the search class gives us 362 items. This number can be explained by a search box plus the recently played and most played songs. It looks like this will give us what we want, as long as we ignore the first item.
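We can write that accounting down as a check too (a sketch based on the counts above):

# 1 search box + 311 recently played + 50 most played = 362
stopifnot(length(searches) == 1 + length(artists))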
Inside the list we have extracted there are items for each link. Not every song has all the links so we have to be a bit careful to make sure we are extracting them properly. Let’s just look at the first list to start with.
types_example <- searches[2] %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_text()

urls_example <- searches[2] %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_attr("href")

tibble(
  Type = types_example,
  URL = urls_example
)
Chunk time: 0.02 secs
By selecting the <li> tag and then the <a> tag we can get the information we want. In this case we want to extract both the text, to get the type of the link, and the href attribute, to get the URL (we could probably get the type from the URL but it’s already there so this is easier).

Let’s make this into a function that returns a tibble that we can apply to our list of search divs.
# Extract the link types and URLs from one of the .search list nodes
get_links <- function(search_div) {
  a_tags <- search_div %>%
    html_nodes("li") %>%
    html_nodes("a")

  tibble(
    Type = html_text(a_tags),
    URL = html_attr(a_tags, "href")
  )
}
Chunk time: 0 secs
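As a quick sanity check, running the function on the node we inspected manually should reproduce the tibble above:

# searches[1] is the page's search box, so the first song is at index 2
get_links(searches[2])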
Now we can run this for all the songs and see what we get. We will add a little bit of code to attach the song name and artist to the results. The artist is necessary because there can be several songs with the same name. We also select distinct links, as some songs have been played multiple times.
links <- purrr::map_dfr(played_idx, function(.idx) {
  get_links(searches[.idx + 1]) %>%
    mutate(
      Song = played$Song[.idx],
      Artist = played$Artist[.idx]
    )
}) %>%
  distinct()

links
Chunk time: 1.31 secs
This is currently in long format, where each row is a link, but as there is only one of each link type for each song it will be more convenient to have each type as a separate column.
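Here is a toy illustration of that reshape, using made-up values rather than the real links:

# One row per link becomes one row per song with a column per link type
tibble(
  Song = "Stylo",
  Type = c("YouTube", "Spotify"),
  URL = c("youtube_url", "spotify_url")
) %>%
  pivot_wider(names_from = Type, values_from = URL)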
links <- links %>%
  pivot_wider(names_from = Type, values_from = URL)

links
Chunk time: 0.08 secs
Now the links are in a form that is easy to join to the other song information (using the song name and artist as keys).
played <- played %>%
  left_join(links, by = c("Song", "Artist"))

played
Chunk time: 0.09 secs
We now have code for extracting all the information we want, so let’s put it together into a function that we can apply to each HTML file. We want the function to take the path to one of the HTML data files and do the following things:

- Read the HTML file
- Extract the play times, song names, artists and releases
- Extract the search links and reshape them into wide format
- Combine everything into a single tibble

The function looks like this.
scrape_songs <- function(html_path) {
  # Use the file name (minus extension) to identify which day this is
  file <- fs::path_ext_remove(fs::path_file(html_path))

  html <- read_html(html_path)

  # Only recently played songs have times, so this count excludes
  # the "Most Played" section at the bottom of the page
  times <- html %>%
    html_nodes(".time") %>%
    html_text()

  n_played <- length(times)
  played_idx <- seq_len(n_played)

  artists <- html %>%
    html_nodes(".artist") %>%
    html_text()

  releases <- html %>%
    html_nodes(".release") %>%
    html_text()

  songs <- html %>%
    html_nodes("h5") %>%
    html_text()

  search_divs <- html %>%
    html_nodes(".search")

  # Skip the first .search node (the search box) and collect the
  # links for each recently played song
  links <- purrr::map_dfr(played_idx, function(.idx) {
    get_links(search_divs[.idx + 1]) %>%
      mutate(
        Song = songs[.idx],
        Artist = artists[.idx]
      )
  }) %>%
    distinct() %>%
    pivot_wider(names_from = Type, values_from = URL)

  tibble(
    File = file,
    Time = times,
    Song = songs[played_idx],
    Artist = artists[played_idx],
    Release = releases[played_idx]
  ) %>%
    left_join(links, by = c("Song", "Artist"))
}
Chunk time: 0 secs
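Before mapping over all the files it’s worth testing the function on a single one, reusing the first file from above:

# Should match the results we built up step by step
scrape_songs(html_files[1])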
Let’s apply the scraping function to all the files and combine the results!
requestival <- purrr::map_dfr(html_files, scrape_songs)
requestival
Chunk time: 9.8 secs
We save this scraped dataset as a TSV for further analysis.
write_tsv(requestival, PATHS$scraped)
Chunk time: 0.03 secs
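If we are feeling cautious we can also round-trip the file to confirm nothing was lost on the way to disk (a sketch, assuming PATHS$scraped is the path written above):

# Read the TSV back and check that the dimensions survived
reloaded <- read_tsv(PATHS$scraped)
stopifnot(nrow(reloaded) == nrow(requestival))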
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.0 (2020-04-24)
os macOS Catalina 10.15.4
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Berlin
date 2020-06-01
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date lib source
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.0.0)
P backports 1.1.7 2020-05-13 [?] CRAN (R 4.0.0)
P base64enc 0.1-3 2015-07-28 [?] CRAN (R 4.0.0)
P blob 1.2.1 2020-01-20 [?] CRAN (R 4.0.0)
P broom 0.5.6 2020-04-20 [?] CRAN (R 4.0.0)
P cellranger 1.1.0 2016-07-27 [?] standard (@1.1.0)
P cli 2.0.2 2020-02-28 [?] CRAN (R 4.0.0)
P colorspace 1.4-1 2019-03-18 [?] standard (@1.4-1)
P conflicted * 1.0.4 2019-06-21 [?] standard (@1.0.4)
P crayon 1.3.4 2017-09-16 [?] CRAN (R 4.0.0)
P DBI 1.1.0 2019-12-15 [?] CRAN (R 4.0.0)
P dbplyr 1.4.4 2020-05-27 [?] CRAN (R 4.0.0)
P digest 0.6.25 2020-02-23 [?] CRAN (R 4.0.0)
P dplyr * 0.8.5 2020-03-07 [?] CRAN (R 4.0.0)
P ellipsis 0.3.1 2020-05-15 [?] CRAN (R 4.0.0)
P evaluate 0.14 2019-05-28 [?] standard (@0.14)
P fansi 0.4.1 2020-01-08 [?] CRAN (R 4.0.0)
P forcats * 0.5.0 2020-03-01 [?] CRAN (R 4.0.0)
P fs * 1.4.1 2020-04-04 [?] CRAN (R 4.0.0)
P generics 0.0.2 2018-11-29 [?] standard (@0.0.2)
P genius 2.2.2 2020-05-28 [?] CRAN (R 4.0.0)
P ggplot2 * 3.3.1 2020-05-28 [?] CRAN (R 4.0.0)
P git2r 0.27.1 2020-05-03 [?] CRAN (R 4.0.0)
P glue * 1.4.1 2020-05-13 [?] CRAN (R 4.0.0)
P gtable 0.3.0 2019-03-25 [?] standard (@0.3.0)
P haven 2.3.0 2020-05-24 [?] CRAN (R 4.0.0)
P here * 0.1 2017-05-28 [?] standard (@0.1)
P hms 0.5.3 2020-01-08 [?] CRAN (R 4.0.0)
P htmltools 0.4.0 2019-10-04 [?] standard (@0.4.0)
P httpuv 1.5.3.1 2020-05-26 [?] CRAN (R 4.0.0)
P httr 1.4.1 2019-08-05 [?] standard (@1.4.1)
P janeaustenr 0.1.5 2017-06-10 [?] CRAN (R 4.0.0)
P jsonlite 1.6.1 2020-02-02 [?] CRAN (R 4.0.0)
P knitr 1.28 2020-02-06 [?] CRAN (R 4.0.0)
P later 1.0.0 2019-10-04 [?] standard (@1.0.0)
P lattice 0.20-41 2020-04-02 [3] CRAN (R 4.0.0)
P lifecycle 0.2.0 2020-03-06 [?] CRAN (R 4.0.0)
P lubridate * 1.7.8 2020-04-06 [?] CRAN (R 4.0.0)
P magrittr 1.5 2014-11-22 [?] CRAN (R 4.0.0)
P Matrix 1.2-18 2019-11-27 [3] CRAN (R 4.0.0)
P memoise 1.1.0 2017-04-21 [?] standard (@1.1.0)
P modelr 0.1.8 2020-05-19 [?] CRAN (R 4.0.0)
P munsell 0.5.0 2018-06-12 [?] standard (@0.5.0)
P nlme 3.1-147 2020-04-13 [3] CRAN (R 4.0.0)
P pillar 1.4.4 2020-05-05 [?] CRAN (R 4.0.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.0.0)
P plyr 1.8.6 2020-03-03 [?] CRAN (R 4.0.0)
P promises 1.1.0 2019-10-04 [?] standard (@1.1.0)
P purrr * 0.3.4 2020-04-17 [?] CRAN (R 4.0.0)
P R6 2.4.1 2019-11-12 [?] CRAN (R 4.0.0)
P Rcpp 1.0.4.6 2020-04-09 [?] CRAN (R 4.0.0)
P readr * 1.3.1 2018-12-21 [?] standard (@1.3.1)
P readxl 1.3.1 2019-03-13 [?] standard (@1.3.1)
P reprex 0.3.0 2019-05-16 [?] standard (@0.3.0)
P reshape2 1.4.4 2020-04-09 [?] CRAN (R 4.0.0)
P rlang 0.4.6 2020-05-02 [?] CRAN (R 4.0.0)
P rmarkdown 2.1 2020-01-20 [?] CRAN (R 4.0.0)
P rprojroot 1.3-2 2018-01-03 [?] CRAN (R 4.0.0)
P rstudioapi 0.11 2020-02-07 [?] CRAN (R 4.0.0)
P rvest * 0.3.5 2019-11-08 [?] standard (@0.3.5)
P scales 1.1.1 2020-05-11 [?] CRAN (R 4.0.0)
P selectr 0.4-2 2019-11-20 [?] CRAN (R 4.0.0)
sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 4.0.0)
P SnowballC 0.7.0 2020-04-01 [?] CRAN (R 4.0.0)
P spotifyr * 2.1.1 2019-07-13 [?] CRAN (R 4.0.0)
P stringi 1.4.6 2020-02-17 [?] CRAN (R 4.0.0)
P stringr * 1.4.0 2019-02-10 [?] CRAN (R 4.0.0)
P tibble * 3.0.1 2020-04-20 [?] CRAN (R 4.0.0)
P tidyr * 1.1.0 2020-05-20 [?] CRAN (R 4.0.0)
P tidyselect 1.1.0 2020-05-11 [?] CRAN (R 4.0.0)
P tidytext 0.2.4 2020-04-17 [?] CRAN (R 4.0.0)
P tidyverse * 1.3.0 2019-11-21 [?] standard (@1.3.0)
P tokenizers 0.2.1 2018-03-29 [?] CRAN (R 4.0.0)
P vctrs 0.3.0 2020-05-11 [?] CRAN (R 4.0.0)
P whisker 0.4 2019-08-28 [?] standard (@0.4)
P withr 2.2.0 2020-04-20 [?] CRAN (R 4.0.0)
P workflowr 1.6.2 2020-04-30 [?] CRAN (R 4.0.0)
P xfun 0.14 2020-05-20 [?] CRAN (R 4.0.0)
P xml2 * 1.3.2 2020-04-23 [?] CRAN (R 4.0.0)
P yaml 2.2.1 2020-02-01 [?] CRAN (R 4.0.0)
[1] /Users/luke.zappia/Documents/Projects/requestival/renv/library/R-4.0/x86_64-apple-darwin17.0
[2] /private/var/folders/rj/60lhr791617422kqvh0r4vy40000gn/T/RtmpQWp4dI/renv-system-library
[3] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
P ── Loaded and on-disk path mismatch.
Chunk time: 0.2 secs