I rly wish 'experts' would stop with the "you need Selenium/Headless Chrome" since it's almost never true and introduces a needless, heavyweight third-party dependency into data science workflows.
The site is an ASP.NET site so it makes heavy use of sessions and the programmers behind this particular one force that session to start at the home ("Hello, 2000 called and would like their session state preserving model back.")
Anyway, we need to start there and progress to your page. Here's what that looks like to your browser:
We can also see from 👆🏽 is that the site returns lovely JSON so we'll eventually grab that. Let's start modeling an R httr
workflow like the above session:
library(xml2)
library(httr)
library(rvest)
Start at the, um, start!
httr::GET(
url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
httr::verbose()
) -> res
Now, we need to get the HTML from that page as there are a number of hidden values we need to supply to the POST
that is made since that's part of how the brain-dead ASP.NET workflow works (again, follow the requests in the image above):
pg <- httr::content(res)
hinput <- html_nodes(pg, "input")
hinput <- as.list(setNames(html_attr(hinput, "value"), html_attr(hinput, "name")))
hinput$`header$txtBoxBuscador` <- ""
hinput$`__EVENTARGUMENT` <- ""
hinput$`__EVENTTARGET` <- "lnkBut01"
httr::POST(
url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
httr::add_headers(
`Referer` = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx"
),
encode = "form",
body = hinput
) -> res
Now we've done what we need to con the website into thinking we've got a proper session so let's make the request for the JSON content:
httr::GET(
url = "https://si3.bcentral.cl/siete/secure/cuadros/actions.aspx",
httr::add_headers(
`X-Requested-With` = "XMLHttpRequest"
),
query = list(
Opcion = "1",
idMenu = "IPC_VAR_MEN1_HIST",
codCuadro = "IPC_VAR_MEN1_HIST",
DrDwnAnioDesde = "",
DrDwnAnioHasta = "",
DrDwnAnioDiario = "",
DropDownListFrequency = "",
DrDwnCalculo = "NONE"
)
) -> res
And, boom:
str(
httr::content(res), 1
)
## List of 32
## $ CodigoCuadro : chr "IPC_VAR_MEN1_HIST"
## $ Language : chr "es-CL"
## $ DescripcionCuadro : chr "IPC, IPCX, IPCX1 e IPC SAE, variación mensual, información histórica"
## $ AnioDesde : int 1928
## $ AnioHasta : int 2018
## $ FechaInicio : chr "01-01-2010"
## $ FechaFin : chr "01-11-2018"
## $ ListaFrecuencia :List of 1
## $ FrecuenciaDefecto : NULL
## $ DrDwnAnioDesde :List of 3
## $ DrDwnAnioHasta :List of 3
## $ DrDwnAnioDiario :List of 3
## $ hsDecimales :List of 1
## $ ListaCalculo :List of 1
## $ Metadatos : chr " <img runat=\"server\" ID=\"imgButMetaDatos\" alt=\"Ver metadatos\" src=\"../../Images/lens.gif\" OnClick=\"jav"| __truncated__
## $ NotasPrincipales : chr ""
## $ StatusTextBox : chr ""
## $ Grid :List of 4
## $ GridColumnNames :List of 113
## $ Paginador : int 15
## $ allowEmptyColumns : logi FALSE
## $ FechaInicioSelected: chr "2010"
## $ FechaFinSelected : chr "2018"
## $ FrecuenciaSelected : chr "MONTHLY"
## $ CalculoSelected : chr "NONE"
## $ AnioDiarioSelected : chr "2010"
## $ UrlFechaBase : chr "Indizar_fechaBase.aspx?codCuadro=IPC_VAR_MEN1_HIST"
## $ FechaBaseCuadro : chr "Ene 2010"
## $ IsBoletin : logi FALSE
## $ CheckSelected :List of 4
## $ lnkButFechaBase : logi FALSE
## $ ShowFechaBase : logi FALSE
Dig around in the JSON for the data you need. I think it's in the Grid…
elements.
La funcionalidad Excel dinámico será descontinuada a partir del 31 de Octubre de 2018
. Translation: "The dynamic Excel function will be discontinued October 31, 2018."