Save complete web page (incl css, images) using python/selenium












15















I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:



from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)


At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.



content = driver.page_source
with open('webpage.html', 'w') as f:
f.write(content)


I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)



Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?



UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).
enter image description here



This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:



<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>


Any idea why this is happening?










share|improve this question

























  • check this question codereview.stackexchange.com/q/78775/179828

    – Moshe Slavin
    Dec 12 '18 at 13:58











  • thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

    – Max Power
    Dec 14 '18 at 16:31











  • Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

    – Tomas Farias
    Dec 24 '18 at 20:52











  • thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

    – Max Power
    Dec 27 '18 at 17:07













  • i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

    – iamklaus
    Dec 28 '18 at 16:29
















15















I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:



from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)


At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.



content = driver.page_source
with open('webpage.html', 'w') as f:
f.write(content)


I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)



Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?



UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).
enter image description here



This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:



<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>


Any idea why this is happening?










share|improve this question

























  • check this question codereview.stackexchange.com/q/78775/179828

    – Moshe Slavin
    Dec 12 '18 at 13:58











  • thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

    – Max Power
    Dec 14 '18 at 16:31











  • Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

    – Tomas Farias
    Dec 24 '18 at 20:52











  • thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

    – Max Power
    Dec 27 '18 at 17:07













  • i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

    – iamklaus
    Dec 28 '18 at 16:29














15












15








15


1






I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:



from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)


At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.



content = driver.page_source
with open('webpage.html', 'w') as f:
f.write(content)


I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)



Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?



UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).
enter image description here



This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:



<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>


Any idea why this is happening?










share|improve this question
















I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:



from selenium import webdriver

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
driver.get(URL)
time.sleep(5)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()
time.sleep(60)


At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.



content = driver.page_source
with open('webpage.html', 'w') as f:
f.write(content)


I've also found this question/answer on SO, but the accepted answer just brings up the 'save as' box, and does not provide a way to click it (as two commenters point out)



Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?



UPDATE - Follow up question for James' answer
So I ran James' code to generate a page.html (and associated files) and compared it to the html file I got from manually clicking save-as. The page.html saved via James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that's hidden in the manually save'd page. See attached screenshot (manually saved page on the left, script-saved page with extra formatting text shown on right).
enter image description here



This is especially surprising to me because the raw html of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the html below, which appears the same in both files, but the text at issue only appears in the browser-rendered page on the one saved by James' script:



<p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
These options control formatting of alignments in results pages. The
default is HTML, but other formats (including plain text) are available.
PSSM and PssmWithParameters are representations of Position Specific Scoring Matrices and are only available for PSI-BLAST.
The Advanced view option allows the database descriptions to be sorted by various indices in a table.
</p>


Any idea why this is happening?







python selenium web-crawler bioinformatics






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 29 '18 at 4:05







Max Power

















asked Dec 11 '18 at 17:18









Max PowerMax Power

2,70621840




2,70621840













  • check this question codereview.stackexchange.com/q/78775/179828

    – Moshe Slavin
    Dec 12 '18 at 13:58











  • thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

    – Max Power
    Dec 14 '18 at 16:31











  • Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

    – Tomas Farias
    Dec 24 '18 at 20:52











  • thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

    – Max Power
    Dec 27 '18 at 17:07













  • i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

    – iamklaus
    Dec 28 '18 at 16:29



















  • check this question codereview.stackexchange.com/q/78775/179828

    – Moshe Slavin
    Dec 12 '18 at 13:58











  • thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

    – Max Power
    Dec 14 '18 at 16:31











  • Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

    – Tomas Farias
    Dec 24 '18 at 20:52











  • thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

    – Max Power
    Dec 27 '18 at 17:07













  • i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

    – iamklaus
    Dec 28 '18 at 16:29

















check this question codereview.stackexchange.com/q/78775/179828

– Moshe Slavin
Dec 12 '18 at 13:58





check this question codereview.stackexchange.com/q/78775/179828

– Moshe Slavin
Dec 12 '18 at 13:58













thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

– Max Power
Dec 14 '18 at 16:31





thanks for this Moshe, although from the description that it saves "html content of page(without CSS) and also it looks for all images on page and save them too" it's not quite what I'm looking for. Also, more importantly to me, I want to leverage a crawling tool like Selenium or maybe Scrapy because I need to do a bit of crawling to get to my results page, I can't just give a URL as an input. I also prefer to use a tool like Selenium/Scrapy/bs4 than to use regex to parse html as this answer does.

– Max Power
Dec 14 '18 at 16:31













Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

– Tomas Farias
Dec 24 '18 at 20:52





Would you consider a valid solution one which uses a Firefox addon? Translating this answer to Python looks like a plausible task.

– Tomas Farias
Dec 24 '18 at 20:52













thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

– Max Power
Dec 27 '18 at 17:07







thanks for the suggestion, Tomas. Although that answer refers to the following dead link for the firefox plugin - addons.mozilla.org/de/firefox/addon/scrapbook. Googling, I find addons.mozilla.org/en-US/firefox/addon/web-scrapbook, but that has the following warning text in bold "This addon is under development. Every feature could change in the future. Use in production carefully and be sure to make a backup frequently." So I will probably not build around this for now, but I appreciate your pointing me to it

– Max Power
Dec 27 '18 at 17:07















i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

– iamklaus
Dec 28 '18 at 16:29





i faced similar issue regarding clicking the save as button..a solution would be to click on the save as button using pywin32.. stackoverflow.com/questions/1181464/…

– iamklaus
Dec 28 '18 at 16:29












3 Answers
3






active

oldest

votes


















3





+50









As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.



pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')


This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.



Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.





Full code, in which I also added conditional waits to improve speed:



import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)

# enter sequence into the query field and hit 'blast' button to search
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)

blast_button = driver.find_element_by_id("b1")
blast_button.click()

# wait until results are loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
pyautogui.typewrite(SEQUENCE + '.html')
pyautogui.hotkey('enter')





share|improve this answer


























  • hey thanks, this is perfect!

    – Max Power
    Dec 29 '18 at 17:45











  • I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

    – Max Power
    Dec 29 '18 at 17:50



















4














This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.



Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.



This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.



from selenium import webdriver
import chromedriver_binary
from lxml import html
import requests
import os

driver = webdriver.Chrome()
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
base = 'https://blast.ncbi.nlm.nih.gov/'

driver.get(URL)
seq_query_field = driver.find_element_by_id("seq")
seq_query_field.send_keys(SEQUENCE)
blast_button = driver.find_element_by_id("b1")
blast_button.click()

content = driver.page_source
# write the page content
os.mkdir('page')
with open('page/page.html', 'w') as fp:
fp.write(content)

# download the referenced files to the same path as in the html
sess = requests.Session()
sess.get(base) # sets cookies

# parse html
h = html.fromstring(content)
# get css/js files loaded in the head
for hr in h.xpath('head//@href'):
if not hr.startswith('http'):
local_path = 'page/' + hr
hr = base + hr
res = sess.get(hr)
if not os.path.exists(os.path.dirname(local_path)):
os.makedirs(os.path.dirname(local_path))
with open(local_path, 'wb') as fp:
fp.write(res.content)

# get image/js files from the body. skip anything loaded from outside sources
for src in h.xpath('//@src'):
if not src or src.startswith('http'):
continue
local_path = 'page/' + src
print(local_path)
src = base + src
res = sess.get(hr)
if not os.path.exists(os.path.dirname(local_path)):
os.makedirs(os.path.dirname(local_path))
with open(local_path, 'wb') as fp:
fp.write(res.content)


You should have a folder called page with a file called page.html in it with the content you are after.






share|improve this answer
























  • hey thanks, this looks great, I'll try this out soon

    – Max Power
    Dec 27 '18 at 17:10











  • hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

    – Max Power
    Dec 29 '18 at 4:06






  • 1





    Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

    – James
    Dec 29 '18 at 13:09



















-2














I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53729201%2fsave-complete-web-page-incl-css-images-using-python-selenium%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3





    +50









    As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.



    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')


    This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.



    Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.





    Full code, in which I also added conditional waits to improve speed:



    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')





    share|improve this answer


























    • hey thanks, this is perfect!

      – Max Power
      Dec 29 '18 at 17:45











    • I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

      – Max Power
      Dec 29 '18 at 17:50
















    3





    +50









    As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.



    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')


    This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.



    Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.





    Full code, in which I also added conditional waits to improve speed:



    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')





    share|improve this answer


























    • hey thanks, this is perfect!

      – Max Power
      Dec 29 '18 at 17:45











    • I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

      – Max Power
      Dec 29 '18 at 17:50














    3





    +50







    3





    +50



    3




    +50





    As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.



    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')


    This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.



    Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.





    Full code, in which I also added conditional waits to improve speed:



    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')





    share|improve this answer















    As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so instead to do so, you could use an external automation library like pyautogui.



    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')


    This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.



    Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.





    Full code, in which I also added conditional waits to improve speed:



    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Dec 29 '18 at 5:29

























    answered Dec 29 '18 at 4:54









    VulcanVulcan

    22.7k94076




    22.7k94076













    • hey thanks, this is perfect!

      – Max Power
      Dec 29 '18 at 17:45











    • I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

      – Max Power
      Dec 29 '18 at 17:50



















    • hey thanks, this is perfect!

      – Max Power
      Dec 29 '18 at 17:45











    • I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

      – Max Power
      Dec 29 '18 at 17:50

















    hey thanks, this is perfect!

    – Max Power
    Dec 29 '18 at 17:45





    hey thanks, this is perfect!

    – Max Power
    Dec 29 '18 at 17:45













    I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

    – Max Power
    Dec 29 '18 at 17:50





    I think for customizing save location, I will just save to default location, then use subprocess.call('mv [current-location] [new-location]') instead of relying on preset tab and arrow strokes via the gui.

    – Max Power
    Dec 29 '18 at 17:50













    4














    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.



    Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.



    This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.



    from selenium import webdriver
    import chromedriver_binary
    from lxml import html
    import requests
    import os

    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'

    driver.get(URL)
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    content = driver.page_source
    # write the page content
    os.mkdir('page')
    with open('page/page.html', 'w') as fp:
    fp.write(content)

    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base) # sets cookies

    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head
    for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
    local_path = 'page/' + hr
    hr = base + hr
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)

    # get image/js files from the body. skip anything loaded from outside sources
    for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
    continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)


    You should have a folder called page with a file called page.html in it with the content you are after.






    share|improve this answer
























    • hey thanks, this looks great, I'll try this out soon

      – Max Power
      Dec 27 '18 at 17:10











    • hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

      – Max Power
      Dec 29 '18 at 4:06






    • 1





      Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

      – James
      Dec 29 '18 at 13:09
















    4














    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.



    Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.



    This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.



    from selenium import webdriver
    import chromedriver_binary
    from lxml import html
    import requests
    import os

    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'

    driver.get(URL)
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    content = driver.page_source
    # write the page content
    os.mkdir('page')
    with open('page/page.html', 'w') as fp:
    fp.write(content)

    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base) # sets cookies

    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head
    for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
    local_path = 'page/' + hr
    hr = base + hr
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)

    # get image/js files from the body. skip anything loaded from outside sources
    for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
    continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)


    You should have a folder called page with a file called page.html in it with the content you are after.






    share|improve this answer
























    • hey thanks, this looks great, I'll try this out soon

      – Max Power
      Dec 27 '18 at 17:10











    • hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

      – Max Power
      Dec 29 '18 at 4:06






    • 1





      Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

      – James
      Dec 29 '18 at 13:09














    4












    4








    4







    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.



    Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.



    This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.



    from selenium import webdriver
    import chromedriver_binary
    from lxml import html
    import requests
    import os

    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'

    driver.get(URL)
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    content = driver.page_source
    # write the page content
    os.mkdir('page')
    with open('page/page.html', 'w') as fp:
    fp.write(content)

    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base) # sets cookies

    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head
    for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
    local_path = 'page/' + hr
    hr = base + hr
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)

    # get image/js files from the body. skip anything loaded from outside sources
    for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
    continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)


    You should have a folder called page with a file called page.html in it with the content you are after.






    share|improve this answer













    This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.



    Most of the javascript won't work due to cross origin request blocking. But the content will look (mostly) the same.



    This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.



    from selenium import webdriver
    import chromedriver_binary
    from lxml import html
    import requests
    import os

    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'
    base = 'https://blast.ncbi.nlm.nih.gov/'

    driver.get(URL)
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    content = driver.page_source
    # write the page content
    os.mkdir('page')
    with open('page/page.html', 'w') as fp:
    fp.write(content)

    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base) # sets cookies

    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head
    for hr in h.xpath('head//@href'):
    if not hr.startswith('http'):
    local_path = 'page/' + hr
    hr = base + hr
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)

    # get image/js files from the body. skip anything loaded from outside sources
    for src in h.xpath('//@src'):
    if not src or src.startswith('http'):
    continue
    local_path = 'page/' + src
    print(local_path)
    src = base + src
    res = sess.get(hr)
    if not os.path.exists(os.path.dirname(local_path)):
    os.makedirs(os.path.dirname(local_path))
    with open(local_path, 'wb') as fp:
    fp.write(res.content)


    You should have a folder called page with a file called page.html in it with the content you are after.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Dec 25 '18 at 4:19









    JamesJames

    13.3k11532




    13.3k11532













    • hey thanks, this looks great, I'll try this out soon

      – Max Power
      Dec 27 '18 at 17:10











    • hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

      – Max Power
      Dec 29 '18 at 4:06






    • 1





      Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

      – James
      Dec 29 '18 at 13:09



















    • hey thanks, this looks great, I'll try this out soon

      – Max Power
      Dec 27 '18 at 17:10











    • hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

      – Max Power
      Dec 29 '18 at 4:06






    • 1





      Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

      – James
      Dec 29 '18 at 13:09

















    hey thanks, this looks great, I'll try this out soon

    – Max Power
    Dec 27 '18 at 17:10





    hey thanks, this looks great, I'll try this out soon

    – Max Power
    Dec 27 '18 at 17:10













    hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

    – Max Power
    Dec 29 '18 at 4:06





    hey James, thanks this is really great, although I'm confused why it also shows a bunch of extra text which is hidden when the page is saved manually (see more details in the update to my question at the bottom). Do you understand why this is happening?

    – Max Power
    Dec 29 '18 at 4:06




    1




    1





    Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

    – James
    Dec 29 '18 at 13:09





    Those are elements that are normally hidden by javascript. One of the downloaded JS libraries may be calling another library that was not downloaded.

    – James
    Dec 29 '18 at 13:09











    -2














    I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
    All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).






    share|improve this answer






























      -2














      I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
      All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).






      share|improve this answer




























        -2












        -2








        -2







        I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
        All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).






        share|improve this answer















        I'll advise u to have a try on sikulix which is an image based automation tool for operate any widgets within PC OS, it supports python grammar and run with command line and maybe the simplest way to solve ur problem.
        All u need to do is just give it a screenshot, call sikulix script in ur python automation script(with OS.system("xxxx") or subprocess...).







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Dec 29 '18 at 2:45

























        answered Dec 27 '18 at 1:48









        Lau RealLau Real

        928




        928






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53729201%2fsave-complete-web-page-incl-css-images-using-python-selenium%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Angular Downloading a file using contenturl with Basic Authentication

            Olmecas

            Can't read property showImagePicker of undefined in react native iOS