Spider not parsing data once it enters the page
I'm trying to scrape Amazon's website for products. After getting a basic scraping process working, I tried to add some "complexity" to the program.
My idea was to read certain keywords from a .txt file, use the search bar to find the products matching them, and scrape the data. That worked just fine.
The problem is that the parser needs to work differently depending on the keyword. For example, shoes have different sizes, colors, and so on, so the data I need to scrape from a "Shoes" product is different from the data I need from a "Laptop" product. And that's where I'm stuck.
With some help from people on this site, I was able to have a different parser called depending on the word the spider read from the .txt. The code looks something like this:
    # (Excerpt from the spider class; imports and the class definition are omitted.)
    def start_requests(self):
        with open('productosABuscar.txt', 'r') as txtfile:
            # strip() removes the trailing newline readlines() would keep
            keywords = [line.strip() for line in txtfile]
        for keyword in keywords:
            yield Request(self.search_url.format(keyword))

    def parse_item(self, response):
        # Here I get the keyword for comparison later
        category = re.sub(
            'Back to search results for |"', '',
            response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
        # Here I get the product URL for the next parser
        productURL = response.request.url
        if category == 'Laptop':
            yield response.follow(productURL, callback=self.parse_laptop)

    def parse_laptop(self, response):
        laptop_item = LaptopItem()
        # Parsing things
        yield laptop_item
This should work fine, but when I run the spider from the Anaconda console, no data is scraped. The weird thing is that the spider is actually accessing every "Laptop" item on the Amazon page, but not scraping any data from it.
In the console, I can see every link the spider is accessing; for example:
2018-12-27 10:02:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Acer-Aspire-i3-8130U-Memory-E5-576-392H/dp/B079TGL2BZ/ref=sr_1_3/ref=sr_1_acs_bss_3_4?ie=UTF8&qid=1545915651&sr=8-3-acs&keywords=Laptop> (referer: https://www.amazon.com/s?field-keywords=Laptop)
Is there something wrong with the arrangement of the parser or is it a deeper issue?
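A side note on start_requests above: readlines() keeps each line's trailing newline, so without stripping, the keyword substituted into search_url can carry a stray %0A. This can be checked with the standard library alone (file contents simulated here with io.StringIO; the real spider reads productosABuscar.txt):

```python
import io

# Simulated keyword file with one keyword per line.
txtfile = io.StringIO("Laptop\nShoes\n")

raw = txtfile.readlines()
print(raw)    # ['Laptop\n', 'Shoes\n'] -- newlines still attached

clean = [line.strip() for line in raw]
print(clean)  # ['Laptop', 'Shoes']
```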
python-3.x web-scraping scrapy
asked Dec 27 at 13:21
Manuel
486
1 Answer
Does it get to the parse_laptop function? And if it does, what do you get: an empty {}, nothing at all, or an error?
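One thing that can produce exactly this symptom: since productURL is response.request.url, response.follow() re-requests the page the spider is already on, and Scrapy's duplicate-request filter silently drops that request, so the callback never fires unless the request is created with dont_filter=True. Here is a toy model of that filtering behavior (illustrative only, not Scrapy's actual RFPDupeFilter, which fingerprints whole requests rather than tracking bare URLs):

```python
# Toy model of Scrapy's duplicate-request filtering: the scheduler drops
# any request whose URL it has already seen, with no error raised.
seen_urls = set()

def schedule(url, dont_filter=False):
    """Return True if the request would be dispatched, False if filtered."""
    if not dont_filter and url in seen_urls:
        return False  # dropped silently: no error, callback never runs
    seen_urls.add(url)
    return True

page = "https://www.amazon.com/dp/B079TGL2BZ"
print(schedule(page))                    # True: initial crawl of the page
print(schedule(page))                    # False: response.follow() on the same URL
print(schedule(page, dont_filter=True))  # True: dont_filter bypasses the filter
```

If this is what is happening, the DEBUG log should also contain "Filtered duplicate request" lines for the product pages.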
By the way it's coded, it should go to the parse_laptop function (I've seen some examples here that structure the function that way, and it works). Maybe the issue is exactly that: it's not reaching the correct function. Also, no error appears in the console when I run the program. I'll upload a photo of the console when I get back home. Thanks for the help!
– Manuel
Dec 27 at 14:52
answered Dec 27 at 14:08
ThunderMind
1017