Spider not parsing data once it enters the page
I'm trying to scrape Amazon's website for products. After getting a basic scraping process working, I tried to add some "complexity" to the program.
My idea was to read certain keywords from a .txt file, use the search bar to find the products matching them, and scrape the data. That worked just fine.
The problem is that the parser needs to work differently depending on the keyword. For example, shoes have different sizes, colors, and so on, so the data I need to scrape from a "Shoes" product is different from the data I need from a "Laptop" product. And that's where I'm stuck.
With some help from people on this site, I was able to have a different parser called depending on the word the spider read from the .txt. The code looks something like this:
    # (Excerpt from the spider class; imports and the class definition are omitted.)
    def start_requests(self):
        with open('productosABuscar.txt', 'r') as txtfile:
            # strip() removes the trailing newline readlines() would keep
            keywords = [line.strip() for line in txtfile]
        for keyword in keywords:
            yield Request(self.search_url.format(keyword))

    def parse_item(self, response):
        # Here I get the keyword for comparison later
        category = re.sub(
            'Back to search results for |"', '',
            response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
        # Here I get the product URL for the next parser
        productURL = response.request.url
        if category == 'Laptop':
            yield response.follow(productURL, callback=self.parse_laptop)

    def parse_laptop(self, response):
        laptop_item = LaptopItem()
        # Parsing things
        yield laptop_item
This should work fine, but when I run the spider from the Anaconda console, no data is scraped. The weird thing is that the spider is actually accessing every "Laptop" item on the Amazon page, but not scraping any data from it.
In the console, I can see every link the spider is accessing; for example:
2018-12-27 10:02:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Acer-Aspire-i3-8130U-Memory-E5-576-392H/dp/B079TGL2BZ/ref=sr_1_3/ref=sr_1_acs_bss_3_4?ie=UTF8&qid=1545915651&sr=8-3-acs&keywords=Laptop> (referer: https://www.amazon.com/s?field-keywords=Laptop)
Is there something wrong with the arrangement of the parser or is it a deeper issue?
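A side note on start_requests above: readlines() keeps each line's trailing newline, so without stripping, the keyword substituted into search_url can carry a stray %0A. This can be checked with the standard library alone (file contents simulated here with io.StringIO; the real spider reads productosABuscar.txt):

```python
import io

# Simulated keyword file with one keyword per line.
txtfile = io.StringIO("Laptop\nShoes\n")

raw = txtfile.readlines()
print(raw)    # ['Laptop\n', 'Shoes\n'] -- newlines still attached

clean = [line.strip() for line in raw]
print(clean)  # ['Laptop', 'Shoes']
```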
python-3.x web-scraping scrapy
asked Dec 27 at 13:21
Manuel
486
1 Answer
Does it get to the parse_laptop function? And if it does, what do you get: an empty {}, nothing at all, or an error?
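One thing that can produce exactly this symptom: since productURL is response.request.url, response.follow() re-requests the page the spider is already on, and Scrapy's duplicate-request filter silently drops that request, so the callback never fires unless the request is created with dont_filter=True. Here is a toy model of that filtering behavior (illustrative only, not Scrapy's actual RFPDupeFilter, which fingerprints whole requests rather than tracking bare URLs):

```python
# Toy model of Scrapy's duplicate-request filtering: the scheduler drops
# any request whose URL it has already seen, with no error raised.
seen_urls = set()

def schedule(url, dont_filter=False):
    """Return True if the request would be dispatched, False if filtered."""
    if not dont_filter and url in seen_urls:
        return False  # dropped silently: no error, callback never runs
    seen_urls.add(url)
    return True

page = "https://www.amazon.com/dp/B079TGL2BZ"
print(schedule(page))                    # True: initial crawl of the page
print(schedule(page))                    # False: response.follow() on the same URL
print(schedule(page, dont_filter=True))  # True: dont_filter bypasses the filter
```

If this is what is happening, the DEBUG log should also contain "Filtered duplicate request" lines for the product pages.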
By the way it's coded, it should go to the parse_laptop function (I've seen some examples here that structure the function that way, and it works). Maybe the issue is exactly that: it's not reaching the correct function. Also, no error appears in the console when I run the program. I'll upload a photo of the console when I get back home. Thanks for the help!
– Manuel
Dec 27 at 14:52
answered Dec 27 at 14:08
ThunderMind
1017