Spider not parsing data once it enters the page


























I'm trying to scrape Amazon's website for products. After getting a basic scraping process working, I tried to add some "complexity" to the program.



My idea was to read certain keywords from a .txt file. With those keywords I used the search bar to get the products that matched them and scrape the data. That worked just fine.
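
A minimal sketch of this keyword-to-URL step in plain Python, in case it helps to test it outside the spider. Note that `readlines()` keeps the trailing newline on every keyword, so it is stripped here. `SEARCH_URL` and `build_search_urls` are hypothetical names, and the template is an assumption modelled on the referer URL in the log output:

```python
from urllib.parse import quote_plus

# Assumed template, modelled on the referer URL in the log output.
SEARCH_URL = "https://www.amazon.com/s?field-keywords={}"


def build_search_urls(lines):
    """Turn raw lines from the keyword file into search URLs."""
    urls = []
    for line in lines:
        keyword = line.strip()  # readlines() keeps the trailing "\n"
        if not keyword:         # skip blank lines
            continue
        urls.append(SEARCH_URL.format(quote_plus(keyword)))
    return urls


if __name__ == "__main__":
    # Simulates the result of txtfile.readlines()
    for url in build_search_urls(["Laptop\n", "Running Shoes\n", "\n"]):
        print(url)
```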



The problem is that, depending on the keyword (for example, Laptop and Shoes), the parser needs to work differently: shoes have different sizes, colors and such, so the data I need to scrape from a "shoe" product is different from the data I need from a "Laptop" product. And that's where I'm at.
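
The per-category parsing can be pictured as a lookup table from category name to parser function. This is only a sketch in plain Python, not Scrapy code: `PARSERS`, `parse_product`, the two parser functions as written here, and all field names are hypothetical.

```python
def parse_laptop(raw):
    """Keep only the fields that matter for a laptop (hypothetical fields)."""
    return {"title": raw.get("title"), "ram": raw.get("ram"), "cpu": raw.get("cpu")}


def parse_shoes(raw):
    """Keep only the fields that matter for shoes (hypothetical fields)."""
    return {"title": raw.get("title"), "sizes": raw.get("sizes"), "colors": raw.get("colors")}


# Category name (lower-cased) -> parser function
PARSERS = {"laptop": parse_laptop, "shoes": parse_shoes}


def parse_product(category, raw):
    """Pick the parser for this category; return None if there is none."""
    parser = PARSERS.get(category.strip().lower())
    return parser(raw) if parser else None


if __name__ == "__main__":
    # Sample data made up for illustration
    print(parse_product("Laptop", {"title": "Some laptop", "ram": "8 GB", "cpu": "i3"}))
    print(parse_product("Shoes\n", {"title": "Some shoe", "sizes": [42, 43], "colors": ["red"]}))
    print(parse_product("Phone", {"title": "Some phone"}))
```

Normalizing the category with `strip().lower()` before the lookup keeps a stray newline or a capitalization difference from silently selecting no parser at all.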



With some help from the people on this site, I was able to have a different parser called depending on the word that the spider got from the .txt. The code looks something like this.



import re

from scrapy import Request

# Methods of the spider class (excerpt)
def start_requests(self):
    # Read the keywords, one per line, from the text file
    txtfile = open('productosABuscar.txt', 'r')
    keywords = txtfile.readlines()
    txtfile.close()

    for keyword in keywords:
        yield Request(self.search_url.format(keyword))

def parse_item(self, response):
    # Here I get the keyword for comparison later
    category = re.sub('Back to search results for |"', '', response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first())
    # Here I get the product URL for the next parser
    productURL = response.request.url

    if category == 'Laptop':
        yield response.follow(productURL, callback=self.parse_laptop)

def parse_laptop(self, response):
    laptop_item = LaptopItem()

    # Parsing things

    yield laptop_item
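
The category comparison is easy to test in isolation. The sketch below applies the same `re.sub` pattern to plain strings instead of a live response. `extract_category` is a hypothetical helper, and the sample strings are assumptions about what the breadcrumb text contains. It also trims whitespace and treats a missing or empty value as "no category", so an exact comparison such as `== 'Laptop'` does not fail on a stray space:

```python
import re


def extract_category(breadcrumb_text):
    """Strip the 'Back to search results for' prefix and the quotes."""
    if not breadcrumb_text:  # no breadcrumb found (None or empty string)
        return None
    return re.sub(r'Back to search results for |"', '', breadcrumb_text).strip()


if __name__ == "__main__":
    print(extract_category('Back to search results for "Laptop"'))
    print(extract_category('Back to search results for "Laptop" '))
    print(extract_category(None))
```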


This should work fine but, when I run the spider from the Anaconda console, no data is scraped. The weird thing is that the spider is actually accessing every "Laptop" item on the Amazon page but not scraping the data from it.



In the console, I can see every link the spider is accessing, with a statement like this one, for example:



2018-12-27 10:02:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Acer-Aspire-i3-8130U-Memory-E5-576-392H/dp/B079TGL2BZ/ref=sr_1_3/ref=sr_1_acs_bss_3_4?ie=UTF8&qid=1545915651&sr=8-3-acs&keywords=Laptop> (referer: https://www.amazon.com/s?field-keywords=Laptop)



Is there something wrong with the arrangement of the parser or is it a deeper issue?










      python-3.x web-scraping scrapy
















asked Dec 27 at 13:21
Manuel
























          1 Answer
Does it reach the parse_laptop function?
And if it does, what do you get: an empty {}, nothing at all, or an error?







answered Dec 27 at 14:08
ThunderMind
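
One way to check whether `parse_laptop` is ever reached is to log each time a callback starts and how many items it yields. The decorator below is a plain-Python sketch (`trace_callback` is a hypothetical name, and the `response` here is just a dict standing in for a real response). Inside a Scrapy spider the same messages could go through `self.logger` instead:

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("spider-debug")


def trace_callback(func):
    """Log when a generator callback starts running and how many items it yielded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("entered %s", func.__name__)
        count = 0
        for item in func(*args, **kwargs):
            count += 1
            yield item
        log.debug("%s yielded %d item(s)", func.__name__, count)
    return wrapper


@trace_callback
def parse_laptop(response):
    # Stand-in for the real callback; 'response' is just a dict here.
    yield {"title": response.get("title")}


if __name__ == "__main__":
    print(list(parse_laptop({"title": "Some laptop"})))
```

If the "entered" line never shows up in the log, the callback is not being called at all; if it shows up with zero items, the problem is in the extraction code.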




• The way it's coded, it should go to the parse_laptop function (I have seen some examples here that set up the callback that way and it works). Maybe the issue is exactly that: it's not going to the correct function. Also, no error appears in the console when I run the program. I'll upload a photo of the console when I get back home. Thanks for the help!
  – Manuel
  Dec 27 at 14:52
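
One possible reason for a callback never running, offered as a guess because the full spider isn't shown: Scrapy's scheduler drops requests it considers duplicates unless they are created with `dont_filter=True`, and in the question's `parse_item` the followed URL is `response.request.url`, that is, the page that was just crawled. The toy model below is not Scrapy code. `ToyCrawler`, `crawl`, and the example URL are hypothetical and only imitate a seen-URL filter to show the effect:

```python
class ToyCrawler:
    """Tiny stand-in for a crawler with a seen-URL duplicate filter."""

    def __init__(self):
        self.seen = set()
        self.queue = []
        self.calls = []  # names of callbacks that actually ran

    def request(self, url, callback, dont_filter=False):
        if not dont_filter and url in self.seen:
            return  # silently dropped, like a filtered duplicate
        self.seen.add(url)
        self.queue.append((url, callback))

    def run(self):
        while self.queue:
            url, callback = self.queue.pop(0)
            self.calls.append(callback.__name__)
            callback(url)


def crawl(dont_filter):
    crawler = ToyCrawler()

    def parse_laptop(url):
        pass  # item extraction would happen here

    def parse_item(url):
        # Same URL as the page being parsed, as in the question
        crawler.request(url, parse_laptop, dont_filter=dont_filter)

    crawler.request("https://example.com/dp/ITEM", parse_item)
    crawler.run()
    return crawler.calls


if __name__ == "__main__":
    print(crawl(dont_filter=False))  # parse_laptop is never reached
    print(crawl(dont_filter=True))   # parse_laptop runs
```

If that is what happens here, passing `dont_filter=True` to `response.follow`, or calling `self.parse_laptop(response)` directly from `parse_item` and yielding its results, would avoid depending on a second request for the same page.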



















