For loop with lxml code used for scraping shows 'list index out of range' error but works for 2 instances



























We are Python beginners.
We have a list of links/websites with Donald Trump's utterances. Every link represents a whole interview, speech, etc. We want to access those sites, scrape them, and create a text file for every link. At the moment our code does that for 2 or 3 of the links, but then it shows this error:



Traceback (most recent call last):
  File "C:\Users\Lotte\AppData\Local\Programs\Python\Python37\Code\Corpus_create\Scrapen und alle inhalte laden und speichern - zusammengefügt.py", line 79, in <module>
    Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
IndexError: list index out of range


We experimented with the index element and tried [0], or even leaving it out; nothing worked. We then tried running the code with only one link and without the outer loop, and that works perfectly.



import lxml
from lxml import html
from lxml.html import fromstring
import requests
import re

Linklist=['https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019', 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019', 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018', 'https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018', 'https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018', 'https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018', 'https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018', 'https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018', 'https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018']


for item in Linklist:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
    page = requests.get(item, headers=headers)
    tree = html.fromstring(page.content)

    # loads everything Trump said
    Text = []
    for item2 in range(len(tree.xpath('//div[@class="media topic-media-row mediahover "]'))):
        Trump = (tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
        Text.append(Trump)

    print(Text, '\n')


We want only Trump's utterances from every link.
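The error pattern itself is easy to reproduce: `range(len(...))` counts one XPath expression's matches while indexing into a *different* expression's result list, and the two counts can disagree. A small self-contained sketch (the markup below is invented for illustration, not the real factba.se structure) shows why iterating directly over the matched elements cannot raise the IndexError:

```python
# Hypothetical markup (NOT the real factba.se HTML) demonstrating why
# indexing one XPath result by the length of another raises IndexError,
# while iterating over the matched elements directly cannot.
from lxml import html

snippet = """
<div class="row"><div><a>First utterance</a></div></div>
<div class="row"><div><a>Second utterance</a></div></div>
<div class="row"><span>No link here</span></div>
"""
tree = html.fromstring(snippet)

rows = tree.xpath('//div[@class="row"]')         # 3 rows match
links = tree.xpath('//div[@class="row"]/div/a')  # only 2 match the deeper path

# range(len(rows)) would eventually ask for links[2] -> IndexError.
# Iterating over the elements that actually matched sidesteps that:
texts = [a.text_content() for a in links]
print(texts)  # ['First utterance', 'Second utterance']
```

Printing `len(rows)` next to `len(links)` inside the loop is the quickest way to see on which URL the two expressions stop agreeing.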

































  • Presumably on the third link factba.se/transcript/… the div[3]/div/div[2] indexes don't work. Print out the data you are blindly assuming has that number of divs nested in that way and check that it is as you expect, or not. Full marks for a good MCVE, but nul points for debugging.

    – barny
    Jan 4 at 17:15













    See this lovely debug blog for help. @barny already gave you the straightforward technique.

    – Prune
    Jan 4 at 17:24











  • Thank you very much for the tips :) We actually made a mistake in the div structure, and after changing it to /div[3]/div/div/a it worked out perfectly :D

    – maomii
    Jan 6 at 14:13


















python loops lxml






asked Jan 4 at 17:04









maomii

1 Answer






































Here's a modified version of your script.





code.py:



from lxml import html
import requests
import re
from pprint import pprint


url_list = [
    "https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
    "https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
    "https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
    "https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
    "https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
    "https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
    "https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
    "https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
    "https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}

media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"

xpath_markers = [
    normal_xpath_marker,
    movieless_xpath_marker,
]


for url_index, url in enumerate(url_list):
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    lines = []
    media_row_list = tree.xpath(media_row_xpath_marker)
    if media_row_list:
        for xpath_marker in xpath_markers:
            post_list = tree.xpath(xpath_marker)
            if post_list:
                lines = [item.text_content() for item in post_list]
                break
    #pprint(lines)
    print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))


Notes:





  • The problem is that the 3rd URL is slightly different from the others: if you look at it, it has no YouTube video, so the XPath didn't match. Combined with the missing empty-list check, that produced the exception above. Now, 2 patterns are attempted:





    • movieless_xpath_marker - which will work for the "faulty" page


    • normal_xpath_marker - which will work on the rest of them (this is the 1st one attempted)


    When one pattern triggers some results, simply ignore the rest (if any)



  • I also refactored the code:


    • Got rid of a loop (and operations uselessly executed multiple times)

    • Variable renaming

    • Constant extraction

    • Code style

    • Other minor changes
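The try-each-pattern-until-one-matches logic above can also be factored into a small reusable helper. A minimal sketch (the name `first_match` is mine, not from the script above):

```python
def first_match(tree, xpath_list):
    """Return the result list of the first XPath expression that matches
    anything in the tree, or an empty list if none of them match."""
    for xp in xpath_list:
        hits = tree.xpath(xp)
        if hits:
            return hits
    return []
```

With that helper, the inner block of the loop collapses to `lines = [e.text_content() for e in first_match(tree, xpath_markers)]`, and adding a third page variant only means appending one more pattern to the list.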




Output (displaying the article count for each URL):




(py_064_03.06.08_test0) e:\Work\Dev\StackOverflow\q054043232>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006
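The question's stated goal was one text file per link; the script above stops at counting articles. A minimal sketch of the saving step, assuming the file is named after the last path segment of the URL and written to a `transcripts` directory (both naming choices are mine, not part of the original script):

```python
import os

def save_transcript(url, lines, out_dir="transcripts"):
    """Write one text file per transcript URL and return its path.
    The file name is derived from the URL's last path segment (an assumption)."""
    os.makedirs(out_dir, exist_ok=True)
    slug = url.rstrip("/").rsplit("/", 1)[-1]        # e.g. "donald-trump-remarks-..."
    path = os.path.join(out_dir, slug + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return path
```

Calling `save_transcript(url, lines)` at the end of the loop body would produce one UTF-8 `.txt` file per link, matching the original goal.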
































  • Perfectly good answer, but personally I think it would have been good practice for the OP to have to solve the problem himself.

    – barny
    Jan 4 at 19:42













  • @barny: you're right, but I spent quite some time figuring out the pattern for the faulty file (as I'm new to that area), and when it was all done, I felt that it would be a shame not to post it.

    – CristiFati
    Jan 4 at 19:47











  • First of all, thank you so much for the time you invested in our problem. We took some time to understand everything you did, but now everything works fine. Especially realizing that the 3rd link has a different structure than the others was very important, so thank you :)

    – maomii
    Jan 6 at 14:16











  • You're welcome!

    – CristiFati
    Jan 6 at 16:50












Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54043232%2ffor-loop-with-lxml-code-used-for-scraping-shows-list-index-out-of-range-error%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














Here's a modified version of your script.





code.py:



from lxml import html
import requests
import re
from pprint import pprint


url_list = [
"https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
"https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
"https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
"https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
"https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
"https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
"https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
"https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
"https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018"
]

headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}

media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"

xpath_markers = [
normal_xpath_marker,
movieless_xpath_marker,
]


for url_index, url in enumerate(url_list):
page = requests.get(url, headers=headers)
tree = html.fromstring(page.content)
lines =
media_row_list = tree.xpath(media_row_xpath_marker)
if media_row_list:
for xpath_marker in xpath_markers:
post_list = tree.xpath(xpath_marker)
if post_list:
lines = [item.text_content() for item in post_list]
break
#pprint(lines)
print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))


Notes:





  • The problem is that the 3rd URL is a little bit different than the others, if you look at it it doesn't have the YouTube, so the xpath didn't match. That combined with the lack of empty list test yielded the above exception. Now, 2 patterns are being attempted:





    • movieless_xpath_marker - which will work for the "faulty" page


    • normal_xpath_marker - which will work on the rest of them (this is the 1st one attempted)


    When one pattern triggers some results, simply ignore the rest (if any)



  • I also refactored the code:


    • Got rid of a loop (and operations uselessly executed multiple times)

    • Variable renaming

    • Constant extraction

    • Code style

    • Other minor changes




Output (displaying the article count for each URL):




(py_064_03.06.08_test0) e:WorkDevStackOverflowq054043232>"e:WorkDevVEnvspy_064_03.06.08_test0Scriptspython.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006






share|improve this answer


























  • Perfectly good answer, but personally I think it would have been good practice for the OP to have to solve the problem himself.

    – barny
    Jan 4 at 19:42













  • @barny: you're right, but I spent quite some time figuring out the pattern for the faulty file (as I'm new to that area), and when it was all done, I felt that it would be a shame not to post it.

    – CristiFati
    Jan 4 at 19:47











  • First of all thank you so much for the time invested in our problem . We took some time to understand everything you did but now everything works fine. Especially realizing that the 3rd link has a different structure than the others was very important, so thank you :)

    – maomii
    Jan 6 at 14:16











  • You're welcome!

    – CristiFati
    Jan 6 at 16:50
















0














Here's a modified version of your script.





code.py:



from lxml import html
import requests
from pprint import pprint


url_list = [
    "https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
    "https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
    "https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
    "https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
    "https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
    "https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
    "https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
    "https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
    "https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018",
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}

media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"

xpath_markers = [
    normal_xpath_marker,
    movieless_xpath_marker,
]


for url_index, url in enumerate(url_list):
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    lines = []
    media_row_list = tree.xpath(media_row_xpath_marker)
    if media_row_list:
        for xpath_marker in xpath_markers:
            post_list = tree.xpath(xpath_marker)
            if post_list:
                lines = [item.text_content() for item in post_list]
                break
    #pprint(lines)
    print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))


Notes:


  • The problem is that the 3rd URL is slightly different from the others: if you look at the page, it has no YouTube video, so the xpath didn't match. Combined with the missing empty-list check, that yielded the above exception. Now, 2 patterns are attempted:

    • normal_xpath_marker - works for the regular pages (this is the 1st one attempted)

    • movieless_xpath_marker - works for the "faulty" (video-less) page

    As soon as one pattern yields results, the remaining ones (if any) are ignored.


  • I also refactored the code:

    • Got rid of a loop (and of operations uselessly executed multiple times)

    • Renamed variables

    • Extracted constants

    • Code style and other minor changes



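More generally, xpath() always returns a list (possibly empty), so indexing its result with [0] (or [item2], as in the original loop) must be guarded. A minimal illustration of the failure mode and of a safe accessor (plain Python lists stand in for lxml results; first_text is a hypothetical helper, not part of lxml):

```python
def first_text(results, default=""):
    # xpath() returns a (possibly empty) list; indexing [0] on an empty
    # result raises IndexError, so fall back to a default instead.
    return results[0] if results else default

print(first_text(["Thank you very much."]))  # -> Thank you very much.
print(first_text([]))                        # -> prints an empty line
```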

Output (displaying the article count for each URL):




(py_064_03.06.08_test0) e:\Work\Dev\StackOverflow\q054043232>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006
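Since the original goal was to write one text file per link, the loop above could be extended with a small helper like the sketch below (the save_transcript name, the output directory, and deriving the file name from the last URL segment are my assumptions, not part of the answer):

```python
import os

def save_transcript(url, lines, out_dir="transcripts"):
    # Derive a file name from the URL's last path segment (assumption:
    # each factba.se transcript URL ends in a unique slug).
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, slug + ".txt")
    # One utterance per line, mirroring the text_content() list above.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return path
```

Calling save_transcript(url, lines) at the end of each loop iteration would then produce one .txt file per interview/speech.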














answered Jan 4 at 19:02 by CristiFati (15.2k), edited Jan 4 at 19:26













  • Perfectly good answer, but personally I think it would have been good practice for the OP to have to solve the problem himself. – barny, Jan 4 at 19:42

  • @barny: you're right, but I spent quite some time figuring out the pattern for the faulty file (as I'm new to that area), and when it was all done, I felt that it would be a shame not to post it. – CristiFati, Jan 4 at 19:47

  • First of all, thank you so much for the time invested in our problem. We took some time to understand everything you did, but now everything works fine. Especially realizing that the 3rd link has a different structure than the others was very important, so thank you :) – maomii, Jan 6 at 14:16

  • You're welcome! – CristiFati, Jan 6 at 16:50






































