I saw no robots.txt or T&C, but I did read through the (quite daunting) "APPLICATION TO USE RESTRICTED MICRODATA" (I had forgotten I have an account that can access IPUMS, though I don't recall ever using it). I'm impressed that they make the potentially sensitive nature of their data clear up front, before any download.
Since this metadata has no "microdata" in it (it appears the metadata is provided to help folks decide what data elements they can select) and since acquisition & use of it doesn't violate any of the stated restrictions, the following should be OK. If a rep of IPUMS sees this and disagrees, I'll gladly remove the answer and ask the SO admins to really delete it, too (for those who aren't aware, folks w/high enough rep can see deleted answers).
Now, you don't need Selenium or Splash for this, but you'll need to do some post-processing of the data retrieved by the code below.
The data that builds the metadata tables is in a JavaScript blob in a <script> tag (use "View Source" to see it; you're going to need it later). We can use some string munging & the V8 package to get it:
library(V8)
library(rvest)
library(jsonlite)
library(stringi)
pg <- read_html("https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section")
html_nodes(pg, xpath=".//script[contains(., 'Less than')]") %>%
html_text() %>%
stri_split_lines() %>%
.[[1]] -> js_lines
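# Index of the last line before the jQuery ready handler (first match only, so 1:idx stays valid)
idx <- which(stri_detect_fixed(js_lines, '$(document).ready(function() {'))[1] - 1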
That finds the target <script> element, gets its contents, splits them into lines and locates the last line before the jQuery $(document).ready() call, i.e. the point where the data ends. We pull out only the JavaScript that holds the data since the V8 engine in R isn't a full browser and can't execute the jQuery code after it.
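To make the line-splitting step concrete without hitting the site, here is a minimal base-R sketch on a toy script body. The toy JavaScript (including the "/example/path" value and the buildTable() call) is invented for illustration, and base R's strsplit()/grepl() stand in for the stringi calls used above.

```r
# Toy <script> contents: a data assignment followed by jQuery setup code.
# Everything inside this string is made up for illustration.
script_text <- paste(
  'var codeData = {',
  '  "jsonPath": "/example/path"',
  '};',
  '$(document).ready(function() {',
  '  buildTable(codeData);',
  '});',
  sep = "\n"
)

# Split the script into lines
js_lines <- strsplit(script_text, "\n", fixed = TRUE)[[1]]

# First line of the jQuery ready handler (fixed = TRUE: match literally, not as a regex)
ready_at <- which(grepl("$(document).ready(function() {", js_lines, fixed = TRUE))[1]

# Keep only the lines before it: this is the part V8 can evaluate on its own
data_js <- paste(js_lines[seq_len(ready_at - 1)], collapse = "\n")
cat(data_js, "\n")
```

Matching the handler literally is fragile by design: if the page changes how it wires up jQuery, the search returns NA and you'll know right away rather than evaluating the wrong chunk.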
We now create a "V8 context", extract the code, execute it in said V8 context and retrieve the result:
ctx <- v8()
ctx$eval(paste0(js_lines[1:idx], collapse="\n"))
code_data <- ctx$get("codeData")
str(code_data)
## List of 14
## $ jsonPath : chr "/international-action/frequencies/MIGYRSBR"
## $ samples :'data.frame': 6 obs. of 2 variables:
## ..$ name: chr [1:6] "br1960a" "br1970a" "br1980a" "br1991a" ...
## ..$ id : int [1:6] 2416 2417 2418 2419 2420 2651
## $ categories :'data.frame': 100 obs. of 5 variables:
## ..$ id : int [1:100] 4725113 4725114 4725115 4725116 4725117 4725118 4725119 4725120 4725121 4725122 ...
## ..$ label : chr [1:100] "Less than 1 year" "1" "2" "3" ...
## ..$ indent : int [1:100] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ code : chr [1:100] "00" "01" "02" "03" ...
## ..$ general: logi [1:100] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ longSamplesHeader : chr "<tr class=\"fullHeader grayHeader\">\n\n <th class=\"codesColumn\">Code</th>\n <th class=\"la"| __truncated__
## $ samplesHeader : chr "\n<tr class=\"fullHeader grayHeader\">\n <th class=\"codesColumn\">Code</th>\n <th class=\"labelColum"| __truncated__
## $ showCounts : logi FALSE
## $ generalWidth : int 2
## $ width : int 2
## $ interval : int 25
## $ isGeneral : logi FALSE
## $ frequencyType : NULL
## $ project_uses_survey_groups: logi FALSE
## $ variables_show_tab_1 : chr ""
## $ header_type : chr "short"
The jsonPath component suggests the page uses more data when building the codes & frequencies tables, so we can get that, too:
code_json <- fromJSON(sprintf("https://international.ipums.org%s", code_data$jsonPath))
str(code_json, 1)
## List of 6
## $ 2416:List of 100
## $ 2417:List of 100
## $ 2418:List of 100
## $ 2419:List of 100
## $ 2420:List of 100
## $ 2651:List of 100
Those "Lists of 100" are 100 numbers each.
You'll need to look at the code in the "View Source" (as suggested above) to see how you might be able to use those two bits of data to re-create the metadata table.
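As a starting point, here is a rough sketch of how the two pieces might be joined into one table. It runs on toy stand-ins, not the real objects: the sample names/ids and category ids/codes/labels are copied from the str() output above, but the frequency values are invented, and it ASSUMES each per-sample list is keyed by category id. I have not verified that against the page's JavaScript, so check the "View Source" code before relying on it. build_code_table() is a hypothetical helper name.

```r
# Toy stand-ins for code_data$samples and code_data$categories
samples <- data.frame(
  name = c("br1960a", "br1970a"),
  id   = c(2416L, 2417L),
  stringsAsFactors = FALSE
)
categories <- data.frame(
  id    = c(4725113L, 4725114L, 4725115L),
  code  = c("00", "01", "02"),
  label = c("Less than 1 year", "1", "2"),
  stringsAsFactors = FALSE
)

# Toy stand-in for code_json. ASSUMPTION: the outer list is keyed by sample id
# (as in the str() output) and each inner list is keyed by category id.
# The numbers are invented.
freqs <- list(
  `2416` = list(`4725113` = 10, `4725114` = 20, `4725115` = 30),
  `2417` = list(`4725113` = 1,  `4725114` = 2,  `4725115` = 3)
)

# Hypothetical helper: one row per category, one column per sample
build_code_table <- function(categories, samples, freqs) {
  out <- categories[, c("code", "label")]
  for (i in seq_len(nrow(samples))) {
    per_sample <- freqs[[as.character(samples$id[i])]]
    # Look up each category id; ids with no entry come back as NULL
    vals <- per_sample[as.character(categories$id)]
    out[[samples$name[i]]] <- vapply(
      vals,
      function(v) if (is.null(v)) NA_real_ else as.numeric(v),
      numeric(1),
      USE.NAMES = FALSE
    )
  }
  out
}

code_table <- build_code_table(categories, samples, freqs)
print(code_table)
```

If the inner lists turn out to be positional (i.e. the n-th number goes with the n-th row of categories) rather than keyed by id, the lookup line would become a plain unlist() of each sample's list.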
I do think you'd be better off following the path @alistaire started you on, but follow it fully. I saw no questions about obtaining "codes and frequencies" or "metadata" (such as this) in the forum (http://answers.popdata.org/) and read in at least 5 places that the IPUMS staff reads and answers questions in the forums and also at their info-email address: [email protected].
They obviously have this metadata somewhere electronically and could likely give you a complete dump of it across all data products to avoid further scraping (which I'm guessing is your goal, since I can't imagine a scenario where one would want to go through this trouble for one extract).