for loop with lxml code used for scraping shows 'list index out of range' error but works for 2 instances
We are Python beginners.
We have a list of links/websites with Donald Trump's utterances. Every link represents a whole interview, speech, etc. We now want to access those sites, scrape them, and create a text file for every link. At the moment our code does that for two or three of the links, but then it just shows this error:
Traceback (most recent call last):
File "C:UsersLotteAppDataLocalProgramsPythonPython37CodeCorpus_createScrapen und alle inhalte laden und speichern - zusammengefügt.py", line 79, in <module>
Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
IndexError: list index out of range
We experimented with the index element, tried [0], and even tried leaving it out; nothing worked. We then ran the code with only one link and without the outer loop, and that works perfectly.
import lxml
from lxml import html
from lxml.html import fromstring
import requests
import re
Linklist=['https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019', 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019', 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018', 'https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018', 'https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018', 'https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018', 'https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018', 'https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018', 'https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018']
for item in Linklist:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
    page = requests.get(item, headers=headers)
    tree = html.fromstring(page.content)
    # loads everything Trump said
    Text = []
    for item2 in range(len(tree.xpath('//div[@class="media topic-media-row mediahover "]'))):
        Trump = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content()
        Text.append(Trump)
    print(Text, '\n')
We want only Trump's utterances from every link.
python loops lxml
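For reference, the inner loop can also be written without positional indexing, so that a page where the XPath matches nothing yields an empty list instead of an IndexError. A minimal sketch under that assumption, reusing the same XPath on a single link:
from lxml import html
import requests

url = 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
tree = html.fromstring(requests.get(url, headers=headers).content)

# Query once, then iterate over whatever nodes were actually found.
nodes = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')
Text = [node.text_content() for node in nodes]
print(len(Text), 'utterances found')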
Presumably on the third link factba.se/transcript/… the div[3]/div/div[2] indexes don't work. Print out the data you are blindly assuming has that number of divs nested in that way and check that it is as you expect, or not. Full marks for a good MCVE, but nul points for debugging.
– barny
Jan 4 at 17:15
See this lovely debug blog for help. @barny already gave you the straightforward technique.
– Prune
Jan 4 at 17:24
Thank you very much for the tips :) We actually made a mistake in the div structure, and after changing it to /div[3]/div/div/a it worked out perfectly :D
– maomii
Jan 6 at 14:13
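A quick way to follow that advice is to print, for one of the problematic links, how many rows the XPath matches and what the first matched row actually looks like. A small sketch (the URL below is the third link from the list, which the comment suspects is the odd one out):
from lxml import html
import requests

url = 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018'
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
tree = html.fromstring(requests.get(url, headers=headers).content)

rows = tree.xpath('//div[@class="media topic-media-row mediahover "]')
links = tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')
print(len(rows), 'rows matched,', len(links), 'anchor elements matched')
if rows:
    # Dump the markup of the first row to see how the divs are really nested.
    print(html.tostring(rows[0], pretty_print=True).decode('utf-8'))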
1 Answer
Here's a modified version of your script.
code.py:
from lxml import html
import requests
import re
from pprint import pprint
url_list = [
"https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
"https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
"https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
"https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
"https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
"https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
"https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
"https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
"https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
"https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
"https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
"https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018"
]
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}
media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"
xpath_markers = [
normal_xpath_marker,
movieless_xpath_marker,
]
for url_index, url in enumerate(url_list):
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    lines = []
    media_row_list = tree.xpath(media_row_xpath_marker)
    if media_row_list:
        for xpath_marker in xpath_markers:
            post_list = tree.xpath(xpath_marker)
            if post_list:
                lines = [item.text_content() for item in post_list]
                break
    #pprint(lines)
    print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))
Notes:
The problem is that the 3rd URL is slightly different from the others: if you look at it, it doesn't have the YouTube video, so the XPath didn't match. That, combined with the missing empty-list check, yielded the exception above. Now, two patterns are attempted:
movieless_xpath_marker - which will work for the "faulty" page
normal_xpath_marker - which will work on the rest of them (this is the 1st one attempted)
As soon as one pattern returns results, the rest (if any) are simply ignored.
- I also refactored the code:
  - Got rid of a loop (and of operations needlessly executed multiple times)
  - Renamed variables
  - Extracted constants
  - Improved code style
  - Other minor changes
Output (displaying the article count for each URL):
(py_064_03.06.08_test0) e:\Work\Dev\StackOverflow\q054043232>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006
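The script above only prints the article count per URL; the original goal was a separate text file per link, so the collected lines could be written out as well. A minimal sketch of that last step, placed at the end of the loop body (the filename scheme, derived from the URL slug, is an assumption):
import os

# ... inside the `for url_index, url in enumerate(url_list):` loop, after `lines` is filled:
out_dir = "transcripts"
os.makedirs(out_dir, exist_ok=True)
file_name = url.rstrip("/").rsplit("/", 1)[-1] + ".txt"  # e.g. "donald-trump-remarks-cabinet-meeting-january-2-2019.txt"
with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as out_file:
    out_file.write("\n".join(lines))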
Perfectly good answer, but personally I think it would have been good practice for the OP to have to solve the problem himself.
– barny
Jan 4 at 19:42
@barny: you're right, but I spent quite some time figuring out the pattern for the faulty file (as I'm new to that area), and when it was all done, I felt that it would be a shame not to post it.
– CristiFati
Jan 4 at 19:47
First of all, thank you so much for the time invested in our problem. We took some time to understand everything you did, but now everything works fine. Especially realizing that the 3rd link has a different structure than the others was very important, so thank you :)
– maomii
Jan 6 at 14:16
You're welcome!
– CristiFati
Jan 6 at 16:50