Your target is a complex, dynamic website, which is why you cannot easily scrape it. To get to the page I think you are asking about, I have to first go to the home page, then click on "Cuentas Nacionales" in the left menu. That click issues a POST request whose form data tells the server which view to present next, and that state is apparently stored server-side in a session. This is why you cannot directly access the target URL: the same URL serves several different displays.

In order to scrape the page, you are going to need to script a browser to go through the steps to get to the page and then save the rendered page to an HTML file, at which point you should be able to use rvest to extract the data from the file. (@hrbrmstr points out that you do not absolutely need to script a browser to get the data, since you do not need to get the data by scraping a rendered page. More on that later.)

At this point in time (December 2018), PhantomJS has been deprecated and the best recommendation is to use headless Chrome. To script it so it can navigate through a multi-page site, use Selenium WebDriver with ChromeDriver to control headless Chrome. See this answer for a fully worked explanation of how to get this working with a Python script. The Selenium documentation includes information on how to use other programming languages, including Java, C#, Ruby, Perl, PHP, and JavaScript, so use whichever language you are comfortable with.

The general outline of the script (with Python snippets) would be as follows; a sketch putting these steps together appears after the list.

  • Start Chrome in headless mode
  • Fetch the home page
  • Wait for the page to fully load. I'm not sure of the best way to do that in this case, but you can probably poll the page for the table data and wait until it is filled in; see Selenium explicit and implicit waits.
  • Find the link by link text: link = driver.find_element_by_link_text("Cuentas Nacionales")
  • Click the link: link.click()
  • Again, wait for the page to load
  • Get the HTML from driver.page_source and save it to a file.
  • Feed that file into rvest
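
Putting those steps together, a script along these lines should work. Treat it as a rough sketch rather than a finished solution: it assumes a Selenium 3.x Python install with ChromeDriver on your PATH, and the home page URL, the output file name, and the CSS selector used to detect that the table has loaded are placeholders you will need to adapt to the actual site.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Start Chrome in headless mode
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    try:
        # Fetch the home page (placeholder URL -- substitute the real one)
        driver.get("https://example.com/home")

        # Explicit wait: block until the menu link is clickable, then click it
        wait = WebDriverWait(driver, 30)
        link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Cuentas Nacionales")))
        link.click()

        # Wait again, this time for the table to show up (the selector is a guess)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table")))

        # Save the rendered HTML so rvest can read it from disk
        with open("cuentas_nacionales.html", "w", encoding="utf-8") as f:
            f.write(driver.page_source)
    finally:
        driver.quit()

From there, rvest's read_html() can load the saved file like any other local HTML file.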

It looks like it may be possible to do all of this from within R using seleniumPipes. See its documentation for how to accomplish the above steps. Use findElement("link text", "Cuentas Nacionales") %>% elementClick to find and click the link, then use getPageSource() to get the page source and feed that into rvest or XML to find and parse the table.

Side note: @hrbrmstr points out that instead of scripting a browser to scrape the page, you could manually go through all the steps in the browser, use the browser's developer tools to capture the relevant requests and responses, and then script the HTTPS requests and response parsing needed to fetch the data you want directly. Since hrbrmstr has already done that for you, in this exact instance it will be easier to cut and paste their answer, but in general I do not recommend that approach: it is difficult to set up, very likely to break in the future, and difficult to fix when it does break. And if you don't care about long-term maintainability, since this table only changes monthly, you could even more easily navigate to the page manually, use the browser to save it to an HTML file, and load that file into your R script.
