Web scraping certain web page cannot finish

I'm learning web scraping with Node 8 and followed this:

    npm install --save request-promise cheerio puppeteer

The code is simple:

    const rp = require('request-promise');
    const url = 'https://www.examples.com'; // good

    // Fetch the page and print the raw HTML, or the error if the request fails.
    rp(url).then((html) => {
        console.log(html);
    }).catch((e) => {
        console.log(e);
    });

Now if the url is examples.com, I can see the plain HTML output, great.

Q1: If the url is yahoo.com, it outputs binary data, e.g.
�i��,a��g�Z.~�Ż�ڔ+�<ٵ�A�y�+�c�n1O>Vr�K�#,bc���8�����|����U>��p4U>mś0��Z�M�Xg"6�lS�2B�+�Y�Ɣ���? ��*
Why is this?

Q2: Then with nasdaq.com,

    const url = 'https://www.nasdaq.com/earnings/report/msft';

the code above never finishes; it seems to hang.

Why is this, please?

  • Can you share some of the "binary data" that is output for yahoo.com?

    – nareddyt
    Jan 2 at 3:05

  • I have tried another HTTP client package called "Axios", and the result is the same. Maybe it's just how Yahoo returns its data?

    – Felix Fong
    Jan 2 at 3:10

  • @FelixFong Maybe, I don't know much about this stuff, but if you run it in a browser, everything is fine. The second question is even more confusing: it just returns nothing and hangs there.

    – user3552178
    Jan 2 at 3:17

node.js puppeteer

edited Jan 2 at 3:07
user3552178

asked Jan 2 at 2:56
user3552178


1 Answer

I'm not sure about Q2, but I can answer Q1.

It seems like Yahoo is detecting you as a bot and preventing you from scraping the page. A common way for sites to detect bots is the User-Agent header. When you make a request using request-promise (which uses the request library internally), it does not set this header at all. Websites can therefore infer that your request came from a program rather than a web browser, because there is no User-Agent header. They may then treat you as a bot and send back gibberish or never serve you content.
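To make the idea concrete, here is a minimal sketch of the kind of check a server could run on incoming headers. It is purely hypothetical: the function name `looksLikeBrowser`, the rule it applies, and the sample header objects are assumptions for illustration, and real sites such as Yahoo do not publish their detection rules.

```javascript
// Hypothetical server-side check, for illustration only. Real sites use
// their own (usually more elaborate) rules that are not public.
function looksLikeBrowser(headers) {
    // Node's http module lower-cases incoming header names.
    const userAgent = headers['user-agent'];
    if (!userAgent) {
        return false; // no User-Agent header at all
    }
    // Mainstream browsers send a User-Agent that starts with "Mozilla/5.0".
    return userAgent.startsWith('Mozilla/5.0');
}

// A request without a User-Agent header (the other values are placeholders).
const bareRequest = { host: 'www.yahoo.com' };

// The same request with a browser-like User-Agent string.
const withUserAgent = {
    host: 'www.yahoo.com',
    'user-agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'
};

console.log(looksLikeBrowser(bareRequest));   // false
console.log(looksLikeBrowser(withUserAgent)); // true
```

A check this simple is easy to satisfy by sending a browser-like header, which is why it only catches clients that send no User-Agent or an obviously non-browser one.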



You can work around this by manually setting a User-Agent header to mimic a browser. This seems to work for Yahoo, but it might not work for every website; other sites may use more advanced techniques to detect bots.



    const rp = require('request-promise');
    const url = 'https://www.yahoo.com'; // good

    // Send a browser-like User-Agent so the request is not flagged for lacking one.
    const options = {
        url,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'
        }
    };

    rp(options).then((html) => {
        console.log(html);
    }).catch((e) => {
        console.log(e);
    });


Q2 might be related to this, but the code above does not solve it. Nasdaq might be running more sophisticated bot detection, such as checking various other headers.
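As a starting point for experimenting with Q2, the sketch below builds a fuller options object for request-promise. It relies on the request library's documented `headers`, `gzip`, and `timeout` options. The helper name `buildOptions`, the extra header values, and the 10-second default are assumptions chosen for illustration. Nothing here is known to get past Nasdaq's checks, and whether compression plays any part in the binary output from Q1 is likewise only a guess.

```javascript
// Builds an options object for request-promise. The header values and the
// default timeout are illustrative assumptions, not values known to satisfy
// Nasdaq or any other site.
function buildOptions(url, timeoutMs = 10000) {
    return {
        url,
        headers: {
            'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5'
        },
        gzip: true,         // request compressed content and decode it in the response
        timeout: timeoutMs  // milliseconds to wait before failing with an error
    };
}

const options = buildOptions('https://www.nasdaq.com/earnings/report/msft');
console.log(options.timeout); // 10000
```

With request-promise installed, the object can be passed as `rp(buildOptions(url))` and chained with `.then` and `.catch` in the same way as the earlier code. If the server never answers, the timeout should make the promise reject, so the `.catch` handler runs and the script does not hang indefinitely.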






  • Good one, let me read a little bit more, many thanks!

    – user3552178
    Jan 2 at 3:44

  • No problem! In general, websites want you to use their APIs instead of web scraping because APIs are easier to monetize. Nasdaq has a real-time quote API that costs money, which is probably why they block bots from scraping. I would suggest looking for other APIs to solve your problem instead of web scraping. This might be a good place to start.

    – nareddyt
    Jan 2 at 3:48

  • Thanks again, you rock!

    – user3552178
    Jan 2 at 4:16

answered Jan 2 at 3:39
nareddyt
