Save complete web page (incl css, images) using python/selenium

Max Power · Dec 11, 2018 · Viewed 9.2k times

I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

import time
from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)

At that point I have a page on which I can manually click "save as" and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content that is generated dynamically when scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code below saves only the HTML, and does not leave me with a local file that looks like it does in the web browser, with images, etc.

content = driver.page_source
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(content)

I've also found this question/answer on SO, but the accepted answer only brings up the 'save as' box and does not provide a way to click it (as two commenters point out).

Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE - Follow-up question for James' answer

I ran James' code to generate a page.html (and associated files) and compared it to the HTML file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See the attached screenshot (manually saved page on the left; script-saved page, with the extra formatting text shown, on the right).

This is especially surprising to me because the raw HTML of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the HTML below, which appears the same in both files, although the text at issue is only rendered by the browser for the page saved by James' script:

<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST. 
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>

Any idea why this is happening?

Answer

FThompson · Dec 29, 2018

As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you could instead use an external automation library such as pyautogui.

pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.press('enter')

This code opens the Save as... window through its keyboard shortcut Ctrl+S, types a file name, and then saves the webpage and its assets into the default downloads location by pressing Enter. The file is named after the sequence to give it a unique name, though you could change this for your use case. If needed, you could also change the download location with some extra work using the Tab and arrow keys.
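
One caveat with naming the file after the sequence: the sequence here is already well over 100 characters, and file systems commonly cap file names at around 255 bytes, so longer queries may not save under their own name. A minimal sketch of one alternative is to derive a short name from a hash of the sequence. The helper name `filename_for`, the `blast_` prefix, and the 12-character digest length are assumptions for illustration, not part of the tested code.

```python
import hashlib


def filename_for(sequence, prefix_len=12):
    """Return a short, deterministic .html file name for a sequence.

    Hypothetical helper: the same sequence always maps to the same name,
    and different sequences are very unlikely to collide.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()
    return "blast_" + digest[:prefix_len] + ".html"
```

With this in place, `pyautogui.typewrite(filename_for(SEQUENCE))` could stand in for typing the raw sequence, at the cost of file names that are no longer human-readable.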

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
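
As a rough sketch of what that modification might look like, the modifier key can be chosen from `sys.platform`. This assumes the browser uses Cmd+S on macOS and Ctrl+S elsewhere; only the Ubuntu case was actually tested, and the helper name `save_hotkey` is made up for this example.

```python
import sys


def save_hotkey(platform=sys.platform):
    """Return the pyautogui key names for the browser's 'Save as...' shortcut.

    Assumption: Cmd+S on macOS ('darwin'), Ctrl+S on Linux and Windows.
    """
    modifier = "command" if platform == "darwin" else "ctrl"
    return (modifier, "s")
```

The result can be unpacked straight into the call, e.g. `pyautogui.hotkey(*save_hotkey())`.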


Full code, in which I also added conditional waits to improve speed:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element(By.ID, "seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element(By.ID, "b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.press('enter')
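
The script ends as soon as Enter is pressed, so if you close the browser or start the next query right away, the save may not have finished. One possible way to guard against that, sketched below, is to poll for the saved file before moving on. The helper name `wait_for_file` and its default timeout are assumptions, and the actual save path depends on your browser's download settings.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Poll until `path` exists and is non-empty.

    Returns True once the file is found, or False if `timeout` seconds
    pass first. Hypothetical helper, not part of Selenium or pyautogui.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        time.sleep(poll)
    return False
```

For example, assuming the browser saves into `~/Downloads`, you could call `wait_for_file(os.path.join(os.path.expanduser('~'), 'Downloads', SEQUENCE + '.html'))` and only call `driver.quit()` once it returns True. Note that a non-empty HTML file does not guarantee the assets folder has been fully written.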