Scrapy getting stuck after some time
I have a master-worker setup on AWS EC2 using the Dask distributed library. For now I have one master machine and one worker machine. The master exposes a REST API (Flask) for scheduling Scrapy jobs on the worker machine. Both the master and the worker run in Docker, so the master container and the worker container communicate with each other through Dask distributed.
When I schedule a Scrapy job, crawling starts successfully and Scrapy uploads data to S3 as well. But after some time Scrapy gets stuck at one point and nothing happens after that.
Please check the attached log file for more info:
log.txt
2019-01-02 08:05:30 [botocore.hooks] DEBUG: Event needs-retry.s3.PutObject: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x7f1fe54adf28>>
Scrapy gets stuck at the point above.
Commands used to run Docker:
sudo docker run --network host -d crawler-worker # for worker
sudo docker run -p 80:80 -p 8786:8786 -p 8787:8787 --net=host -d crawler-master # for master
I am facing this issue on a fresh EC2 machine as well.
python amazon-web-services docker scrapy dask-distributed
edited Jan 2 at 15:00
suraj deshmukh
asked Jan 2 at 10:25
suraj deshmukh
2 Answers
I solved the problem. The cause was the subprocess call I was using to run Scrapy: I passed stdout=subprocess.PIPE, and, as the subprocess documentation warns, wait() can deadlock when stdout=subprocess.PIPE or stderr=subprocess.PIPE is used and the child process writes enough output to fill the OS pipe buffer.
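The code that launched Scrapy is not shown in the question, so the following is only a minimal sketch of two ways to avoid this kind of deadlock, not the original fix. NOISY_CHILD, run_with_communicate and run_with_logfile are hypothetical names made up for illustration; the child process is a stand-in that simply prints a lot, the way a crawl with DEBUG logging would.

```python
import subprocess
import sys

# Hypothetical stand-in for the Scrapy command: a child that writes more
# output than an OS pipe buffer can hold (commonly 64 KiB on Linux).
NOISY_CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]

# Deadlock-prone pattern (left commented out on purpose):
#   proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
#   proc.wait()  # nobody reads the pipe, so the child blocks once it is full


def run_with_communicate(cmd):
    """Capture output safely: communicate() reads the pipe while it waits."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return proc.returncode, output


def run_with_logfile(cmd, log_path):
    """Write output to a file instead of a pipe, so wait() cannot block on it."""
    with open(log_path, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        return proc.wait()
```

In a setup like the one in the question, cmd would be the Scrapy invocation (for example ["scrapy", "crawl", "<spider name>"], assuming the job is started through the scrapy command line). The log-file variant is probably the better fit for long crawls, since communicate() holds all captured output in memory until the process exits.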
answered Jan 3 at 12:01
suraj deshmukh
(This would be a comment but I don't yet have the points to do so.)
You're probably encountering some sort of anti-DDoS protection. Have you tried scraping a control site?
answered Jan 2 at 16:19
benas