Saturday, October 30, 2021

[FIXED] How to get the only PDF url from web page?

Labels: dom, java, javascript, selenium, web-scraping

Issue

I am trying to get some DOM elements using Selenium with Java, but I get this error when I run the code:

Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: stale element reference: element is not attached to the page document

I am still new to all this. The code I am using to retrieve the DOM element is:

    driver.get("https://www.qp.alberta.ca/570.cfm?frm_isbn=9780779808571&search_by=link");
    String pagePdfUrl = driver.findElement(By.xpath("//img[@alt='View PDF']//..//parent::a")).getAttribute("href");

I believe the error is that it cannot find the element for the given XPath, although this XPath exists. Any help would be appreciated.

Thank you.

Solution

The href attribute does contain a PDF-related URL, but that URL opens the PDF inside a web page.
So I read the href attribute, extracted the PDF name from it, and appended that name to the https://www.qp.alberta.ca/documents/Acts/ base URL.

You can write code like the following to get the PDF URL.

Code to get `PDF` URL:

    WebDriver driver = new ChromeDriver();
    /* The URL below is hard-coded. Parameterize it based on your requirement. */
    driver.get("https://www.qp.alberta.ca/570.cfm?frm_isbn=9780779808571&search_by=link");
    // The link wrapping the "View PDF" image holds the viewer page URL.
    String pagePdfUrl = driver.findElement(By.xpath("//img[@alt='View PDF']//..//parent::a")).getAttribute("href");
    System.out.println("Page PDF URL: " + pagePdfUrl);
    // Take the text between "page=" and ".cfm&", which is the PDF file name.
    String pdfName = StringUtils.substringBetween(pagePdfUrl, "page=", ".cfm&");
    // Open the PDF directly instead of the page that embeds it.
    driver.get("https://www.qp.alberta.ca/documents/Acts/" + pdfName + ".pdf");

Code to download `PDF`:

Required ChromeOptions:

    ChromeOptions options = new ChromeOptions();
    HashMap<String, Object> chromeOptionsMap = new HashMap<String, Object>();
    chromeOptionsMap.put("plugins.plugins_disabled", new String[] { "Chrome PDF Viewer" });
    chromeOptionsMap.put("plugins.always_open_pdf_externally", true);
    chromeOptionsMap.put("download.default_directory", "C:\\Users\\Downloads\\test\\");
    options.setExperimentalOption("prefs", chromeOptionsMap);
    options.addArguments("--headless");

Accessing PDF:

    WebDriver driver = new ChromeDriver(options);
    driver.get("https://www.qp.alberta.ca/570.cfm?frm_isbn=9780779808571&search_by=link");
    String pagePdfUrl = driver.findElement(By.xpath("//img[@alt='View PDF']//..//parent::a")).getAttribute("href");
    System.out.println("Page PDF URL: " + pagePdfUrl);
    String pdfName = StringUtils.substringBetween(pagePdfUrl, "page=", ".cfm&");
    System.out.println("Only PDF URL: "+"https://www.qp.alberta.ca/documents/Acts/" + pdfName + ".pdf");
    driver.get("https://www.qp.alberta.ca/documents/Acts/" + pdfName + ".pdf");

Output:

Page PDF URL: https://www.qp.alberta.ca/1266.cfm?page=2017ch18_unpr.cfm&leg_type=Acts&isbncln=9780779808571
Only PDF URL: https://www.qp.alberta.ca/documents/Acts/2017ch18_unpr.pdf

Import for StringUtils:

import org.apache.commons.lang3.StringUtils;
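
If adding Commons Lang only for `substringBetween` is not desirable, the same extraction can be sketched with plain JDK string methods. The class and method names below (`PdfUrlExtractor`, `extractPdfName`, `toDirectPdfUrl`) are hypothetical and not part of the original answer. The sketch assumes the href keeps the `page=<name>.cfm&` shape shown in the output above and that the `documents/Acts/` base URL stays valid for this site.

```java
import java.util.Optional;

final class PdfUrlExtractor {

    // Base URL taken from the answer above; assumed to stay valid for this site.
    private static final String PDF_BASE = "https://www.qp.alberta.ca/documents/Acts/";

    private PdfUrlExtractor() {
    }

    /**
     * Returns the text between the first "page=" and the next ".cfm&",
     * or an empty Optional when the href is null or a marker is missing.
     */
    public static Optional<String> extractPdfName(String href) {
        if (href == null) {
            return Optional.empty();
        }
        String open = "page=";
        String close = ".cfm&";
        int start = href.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        start += open.length();
        int end = href.indexOf(close, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(href.substring(start, end));
    }

    /** Builds the direct PDF URL from the viewer page href, if a name can be extracted. */
    public static Optional<String> toDirectPdfUrl(String href) {
        return extractPdfName(href).map(name -> PDF_BASE + name + ".pdf");
    }

    public static void main(String[] args) {
        String href = "https://www.qp.alberta.ca/1266.cfm?page=2017ch18_unpr.cfm&leg_type=Acts&isbncln=9780779808571";
        // prints https://www.qp.alberta.ca/documents/Acts/2017ch18_unpr.pdf
        System.out.println(toDirectPdfUrl(href).orElse("no PDF name found"));
    }
}
```

Returning an `Optional` instead of a possibly null string makes the "marker not found" case explicit, so the caller can decide what to do before passing the URL to `driver.get(...)`.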

Answered By - Nandan A

This answer was collected from Stack Overflow and tested by JavaFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
