
I am trying to scrape the table from the codes tab on this website (the big table containing the x and . marks).

I thought one of the following would do the trick...

library(rvest)
library(tidyverse)
"https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section" %>%
  read_html() %>%
  html_table()

"https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section" %>%
  read_html() %>%
  html_nodes(".variablesList , #ui-id-1")

... but nothing of use comes back. I had a look at the source of the HTML file. I think the website is using some JavaScript to generate the table? Does this mean it is not possible to get the table?

Note: I cannot install RSelenium on my office PC

  • Possible duplicate of stumped on how to scrape the data from this site (using R) Commented Oct 16, 2017 at 5:20
  • @Hardikgupta I cannot install RSelenium on my office PC Commented Oct 16, 2017 at 6:08
  • That site has clear permitted methods for obtaining data which do not include scraping. Go through their channels and you'll likely get a nice flat file without breaking copyright.
    – alistaire
    Commented Oct 16, 2017 at 6:39
  • @alistaire I have done so; there is no option to get the metadata (the table I want to get). Commented Oct 16, 2017 at 6:42

2 Answers


I saw no robots.txt nor a T&C, but I did read through the (quite daunting) "APPLICATION TO USE RESTRICTED MICRODATA" (I forgot I had an account that can access IPUMS, though I don't recall ever using it). I'm impressed by how clearly they emphasize the potentially sensitive nature of their data up front, before download.

Since this metadata has no "microdata" in it (it appears the metadata is provided to help folks decide what data elements they can select) and since acquisition & use of it doesn't violate any of the stated restrictions, the following should be OK. If a rep of IPUMS sees this and disagrees, I'll gladly remove the answer and ask the SO admins to really delete it, too (for those who aren't aware, folks w/high enough rep can see deleted answers).

Now, you don't need Selenium or Splash for this, but you'll need to do some post-processing of the data retrieved by the code below.

The data that builds the metadata tables is in a JavaScript blob in a <script> tag (use "View Source" to see it; you're going to need it later). We can use some string munging and the V8 package to get it:

library(V8)
library(rvest)
library(jsonlite)
library(stringi)

pg <- read_html("https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section")

# grab the <script> tag holding the data blob ("Less than" is a category
# label that only appears there) and split its contents into lines
html_nodes(pg, xpath=".//script[contains(., 'Less than')]") %>% 
  html_text() %>% 
  stri_split_lines() %>% 
  .[[1]] -> js_lines

# index of the last data line (the one just before the $(document).ready() call)
idx <- which(stri_detect_fixed(js_lines, '$(document).ready(function() {')) - 1

That finds the target <script> element, gets its contents, converts them to lines, and finds the last line of data (the line just before the $(document).ready() call). We can only pull out the JavaScript code containing the data, since the V8 engine in R isn't a full browser and can't execute the jQuery code that follows it.

We now create a "V8 context", execute the extracted data lines in it, and retrieve the resulting object:

ctx <- v8()

# evaluate only the data-assignment lines in the V8 context, then pull the
# resulting codeData object back into R
ctx$eval(paste0(js_lines[1:idx], collapse="\n"))

code_data <- ctx$get("codeData")

str(code_data)
## List of 14
##  $ jsonPath                  : chr "/international-action/frequencies/MIGYRSBR"
##  $ samples                   :'data.frame': 6 obs. of  2 variables:
##   ..$ name: chr [1:6] "br1960a" "br1970a" "br1980a" "br1991a" ...
##   ..$ id  : int [1:6] 2416 2417 2418 2419 2420 2651
##  $ categories                :'data.frame': 100 obs. of  5 variables:
##   ..$ id     : int [1:100] 4725113 4725114 4725115 4725116 4725117 4725118 4725119 4725120 4725121 4725122 ...
##   ..$ label  : chr [1:100] "Less than 1 year" "1" "2" "3" ...
##   ..$ indent : int [1:100] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ code   : chr [1:100] "00" "01" "02" "03" ...
##   ..$ general: logi [1:100] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longSamplesHeader         : chr "<tr class=\"fullHeader grayHeader\">\n\n          <th class=\"codesColumn\">Code</th>\n          <th class=\"la"| __truncated__
##  $ samplesHeader             : chr "\n<tr class=\"fullHeader grayHeader\">\n      <th class=\"codesColumn\">Code</th>\n      <th class=\"labelColum"| __truncated__
##  $ showCounts                : logi FALSE
##  $ generalWidth              : int 2
##  $ width                     : int 2
##  $ interval                  : int 25
##  $ isGeneral                 : logi FALSE
##  $ frequencyType             : NULL
##  $ project_uses_survey_groups: logi FALSE
##  $ variables_show_tab_1      : chr ""
##  $ header_type               : chr "short"

The jsonPath component suggests the page pulls in more data when building the codes & frequencies tables, so we can get that, too:

code_json <- fromJSON(sprintf("https://international.ipums.org%s", code_data$jsonPath))

str(code_json, 1)
## List of 6
##  $ 2416:List of 100
##  $ 2417:List of 100
##  $ 2418:List of 100
##  $ 2419:List of 100
##  $ 2420:List of 100
##  $ 2651:List of 100

Those "Lists of 100" are 100 numbers each.

You'll need to look at the code in the "View Source" (as suggested above) to see how you might be able to use those two bits of data to re-create the metadata table.
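In case it helps, here's a rough, hedged sketch of how those two pieces might fit together. It assumes (check this against the page's JS before relying on it) that each per-sample list in code_json holds counts aligned row-for-row with the 100 categories:

# HYPOTHETICAL reconstruction: assumes each code_json element is a vector of
# per-category counts in the same order as code_data$categories
avail <- sapply(code_json, unlist)   # 100 x 6 matrix, one column per sample id

# swap the numeric sample ids in the column names for the sample names
colnames(avail) <- code_data$samples$name[
  match(colnames(avail), as.character(code_data$samples$id))
]

# attach the counts to the category codes & labels
codes_tbl <- cbind(code_data$categories[, c("code", "label")], avail)

# the "x" marks in the web table would then correspond to non-zero counts
codes_tbl[, -(1:2)] <- ifelse(codes_tbl[, -(1:2)] > 0, "x", ".")
head(codes_tbl)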

I do think you'd be better off following the path @alistaire started you on, but follow it fully. I saw no questions about obtaining "codes and frequencies" or "metadata" (such as this) in the forum (http://answers.popdata.org/) and read in at least 5 places that the IPUMS staff reads and answers questions in the forums and also at their info email address: [email protected].

They obviously have this metadata somewhere electronically and could likely give you a complete dump of it across all data products to avoid further scraping (which my guess is your goal, since I can't imagine a scenario where one would want to go through this trouble for a single extract).

  • Many thanks. I was not intending to scrape many of the metadata pages... just the three at the moment... never expected it to get so complicated. Commented Oct 17, 2017 at 10:32
  • I totally get how frustrating it must be having things like Selenium restricted on the work computer. It or Splash would have made quick work of this. Are you able to even install or use standalone (I'm assuming Windows) binaries? If so, I have a much smaller solution. Try installing webshot from CRAN and then running install_phantomjs(); if that works, ping me in a comment here. (A rough sketch of that route follows these comments.)
    – hrbrmstr
    Commented Oct 17, 2017 at 11:00
  • For now, please consider the metadata to have the same terms as the microdata (most importantly, please cite us; it's how we're able to keep the lights on!). Otherwise I think everything @hrbrmstr says looks good; please don't hit our servers too hard. We hope to develop an API so this is even easier, but can't discuss a timeline just yet.
    – GregF
    Commented Dec 4, 2017 at 21:44
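For what it's worth, here's an untested sketch of the webshot/PhantomJS route suggested above: have PhantomJS render the page, dump the rendered HTML to a file, and hand that to rvest. It assumes the phantomjs binary installed by webshot::install_phantomjs() is on your PATH, and the five-second wait is a guess:

# write a tiny PhantomJS script that prints the page HTML after the JS runs
writeLines(c(
  "var page = require('webpage').create();",
  "page.open('https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section', function() {",
  "  window.setTimeout(function() {",
  "    console.log(page.content);",
  "    phantom.exit();",
  "  }, 5000);",
  "});"
), "dump.js")

# run it, capture the rendered HTML, then parse as usual
system2("phantomjs", "dump.js", stdout = "rendered.html")
read_html("rendered.html") %>% html_table()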

See comment above about scraping, but in case it's helpful, we've just released the ipumsr package, which makes using IPUMS metadata in R a bit easier.

If you make an extract with MIGYRSBR in it, and then download the DDI (which is available even before the full microdata is), you can get the codes table using the command:

# install.packages("ipumsr")
library(ipumsr)
# the DDI filename comes from your own extract; yours will differ
ddi <- read_ipums_ddi("ipumsi_00020.xml")

ipums_val_labels(ddi, "MIGYRSBR")
#> # A tibble: 7 x 2
#>     val                              lbl
#>   <dbl>                            <chr>
#> 1     0                 Less than 1 year
#> 2     6 6 (6 to 10 1960-70, 6 to 9 1980)
#> 3    10                    10 (10+ 1980)
#> 4    11                 11 (11+ 1960-70)
#> 5    97                              97+
#> 6    98                          Unknown
#> 7    99            NIU (not in universe)

Or, you can load the full dataset and the value labels will be attached as labelled class vectors (from haven). See the value-labels vignette for more details.

data <- read_ipums_micro(ddi, verbose = FALSE)
data$MIGYRSBR <- as_factor(data$MIGYRSBR)

table(data$MIGYRSBR)
#> 
#>                 Less than 1 year                                1 
#>                           123862                            65529 
#>                                2                                3 
#>                            77190                            59908 
#>                                4                                5 
#>                            44748                            49590 
#> 6 (6 to 10 1960-70, 6 to 9 1980)                    10 (10+ 1980) 
#>                           185220                                0 
#>                 11 (11+ 1960-70)                              97+ 
#>                           318097                                0 
#>                          Unknown            NIU (not in universe) 
#>                             6459                          2070836

Note that the DDI alone won't have the availability / frequencies that are on the web; you'll need to calculate those from the data.
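If it's useful, here's a minimal sketch of one way to do that, assuming your extract also includes IPUMS's SAMPLE variable (adjust to whatever sample identifier your extract actually contains):

# SAMPLE is assumed to be in the extract; MIGYRSBR was converted to a factor above
table(as_factor(data$SAMPLE), data$MIGYRSBR)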
