for loop with lxml code used for scraping shows 'list index out of range' error but works for 2 instances
We are Python beginners.
We have a list of links/websites with Donald Trump's utterances. Every link represents a whole interview, speech, etc. We now want to access those sites, scrape them, and create a text file for every link. At the moment our code does that for two or three of the links, but then it just shows this error:
Traceback (most recent call last):
File "C:UsersLotteAppDataLocalProgramsPythonPython37CodeCorpus_createScrapen und alle inhalte laden und speichern - zusammengefügt.py", line 79, in <module>
Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
IndexError: list index out of range
We experimented with the index element, tried [0], and even tried leaving it out; nothing worked. We then ran the code with only one link and without the outer loop, and that works perfectly.
import lxml
from lxml import html
from lxml.html import fromstring
import requests
import re
Linklist=['https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019', 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019', 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018', 'https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018', 'https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018', 'https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018', 'https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018', 'https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018', 'https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018']
for item in Linklist:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
    page = requests.get(item, headers=headers)
    tree = html.fromstring(page.content)
    # loads everything Trump said
    Text = []
    for item2 in range(len(tree.xpath('//div[@class="media topic-media-row mediahover "]'))):
        Trump = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content()
        Text.append(Trump)
    print(Text, '\n')
We want only Trump's utterances from every link.
python loops lxml
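For reference, the inner loop can also be written without positional indexing, so that a page where the XPath matches nothing yields an empty list instead of an IndexError. A minimal sketch under that assumption, reusing the same XPath on a single link:
from lxml import html
import requests

url = 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
tree = html.fromstring(requests.get(url, headers=headers).content)

# Query once, then iterate over whatever nodes were actually found.
nodes = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')
Text = [node.text_content() for node in nodes]
print(len(Text), 'utterances found')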
Presumably on the third link factba.se/transcript/… the div[3]/div/div[2] indexes don't work. Print out the data you are blindly assuming has that number of divs nested in that way and check that it is as you expect, or not. Full marks for a good MCVE, but nul points for debugging.
– barny
Jan 4 at 17:15
See this lovely debug blog for help. @barny already gave you the straightforward technique.
– Prune
Jan 4 at 17:24
Thank you very much for the tips :) We actually made a mistake in the div structure, and after changing it to /div[3]/div/div/a it worked out perfectly :D
– maomii
Jan 6 at 14:13
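A quick way to follow that advice is to print, for one of the problematic links, how many rows the XPath matches and what the first matched row actually looks like. A small sketch (the URL below is the third link from the list, which the comment suspects is the odd one out):
from lxml import html
import requests

url = 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
tree = html.fromstring(requests.get(url, headers=headers).content)

rows = tree.xpath('//div[@class="media topic-media-row mediahover "]')
links = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')
print(len(rows), 'rows matched,', len(links), 'anchor elements matched')
if rows:
    # Dump the markup of the first row to see how the divs are really nested.
    print(html.tostring(rows[0], pretty_print=True).decode('utf-8'))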
1 Answer
Here's a modified version of your script.
code.py:
from lxml import html
import requests
import re
from pprint import pprint
url_list = [
"https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
"https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
"https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
"https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
"https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
"https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
"https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
"https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
"https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018"
]
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}
media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"
xpath_markers = [
normal_xpath_marker,
movieless_xpath_marker,
]
for url_index, url in enumerate(url_list):
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    lines = []
    media_row_list = tree.xpath(media_row_xpath_marker)
    if media_row_list:
        for xpath_marker in xpath_markers:
            post_list = tree.xpath(xpath_marker)
            if post_list:
                lines = [item.text_content() for item in post_list]
                break
    #pprint(lines)
    print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))
Notes:
The problem is that the 3rd URL is slightly different from the others: if you look at it, it doesn't have the YouTube video, so the XPath didn't match. That, combined with the missing empty-list check, yielded the exception above. Now, two patterns are attempted:
movieless_xpath_marker - which will work for the "faulty" page
normal_xpath_marker - which will work on the rest of them (this is the 1st one attempted)
As soon as one pattern returns results, the rest (if any) are simply ignored.
- I also refactored the code:
  - Got rid of a loop (and of operations needlessly executed multiple times)
  - Renamed variables
  - Extracted constants
  - Improved code style
  - Other minor changes
Output (displaying the article count for each URL):
(py_064_03.06.08_test0) e:\Work\Dev\StackOverflow\q054043232>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006
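The script above only prints the article count per URL; the original goal was a separate text file per link, so the collected lines could be written out as well. A minimal sketch of that last step, placed at the end of the loop body (the filename scheme, derived from the URL slug, is an assumption):
import os

# ... inside the `for url_index, url in enumerate(url_list):` loop, after `lines` is filled:
out_dir = "transcripts"
os.makedirs(out_dir, exist_ok=True)
file_name = url.rstrip("/").rsplit("/", 1)[-1] + ".txt"  # e.g. "donald-trump-remarks-cabinet-meeting-january-2-2019.txt"
with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as out_file:
    out_file.write("\n".join(lines))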
Perfectly good answer, but personally I think it would have been good practice for the OP to have to solve the problem himself.
– barny
Jan 4 at 19:42
@barny: you're right, but I spent quite some time figuring out the pattern for the faulty file (as I'm new to that area), and when it was all done, I felt that it would be a shame not to post it.
– CristiFati
Jan 4 at 19:47
First of all, thank you so much for the time invested in our problem. We took some time to understand everything you did, but now everything works fine. Especially realizing that the 3rd link has a different structure than the others was very important, so thank you :)
– maomii
Jan 6 at 14:16
You're welcome!
– CristiFati
Jan 6 at 16:50