Ignore a link in parenthesis while trying to extract other links

I'm trying to extract links from a p block, but I'd like to ignore anything within parentheses. For example:
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
I would like to select only the links outside the parentheses, so in the above case just the link_text2 link. I currently grab the links like this:
ps = content.find_all('p', recursive=False)
for p in ps:
    # "as" is a reserved word in Python, so the result needs another name
    links = p.find_all('a', recursive=False)
I think I have to use a regex, but I'm not sure how to incorporate it so that it ignores any links in parentheses. This regex works to isolate anything in parentheses: \(.*?\)
Is anyone able to help?
python beautifulsoup
asked Dec 28 '18 at 23:50 by pad11, edited Dec 29 '18 at 0:16
1 Answer
You can walk the elements in the paragraph's .contents to find all a tags, then filter them so that the text nodes on either side do not form a ( and ) pair around the link:
from bs4 import BeautifulSoup as soup

def is_valid(ind: int, content: list, flag: bool = False) -> bool:
    # Tags always pass; a text node fails if it contains '(' (flag=False) or ')' (flag=True)
    return not isinstance(content[ind], str) or (['(', ')'][flag] not in content[ind])

s = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""
d = soup(s, 'html.parser').p.contents
# (index, tag) pairs for every direct child of the paragraph that is an <a> tag
l = [[i, a] for i, a in enumerate(d) if getattr(a, 'name', None) == 'a']
# Keep a link if it is the first or last child, or if the node before it has no '(' and the node after it has no ')'
new_l = [a for i, a in l if (not i or i == len(d)-1) or (is_valid(i-1, d) and is_valid(i+1, d, True))]
print(new_l)
Output:
[<a href="link_text2">link_text2</a>]
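As an alternative that follows the regex idea from the question, here is a minimal sketch that strips every parenthesised span from the raw markup before parsing. It is not part of the original answer, and it assumes that parentheses are never nested and never appear inside tag attributes; the variable names are only illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""

# Remove every "(...)" span from the raw markup before parsing.
# Assumption: parentheses are not nested and do not occur inside attributes.
stripped = re.sub(r"\([^()]*\)", "", html)

# Whatever links remain were outside the parentheses.
links = BeautifulSoup(stripped, "html.parser").p.find_all("a")
print(links)
```

Because the links inside the parentheses are gone before Beautiful Soup sees the markup, this also copes with several links or nested tags between ( and ), as long as the assumptions above hold.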
So I ran into an issue with hasattr(a, 'a'): it seems like it should work, but it doesn't isolate just a tags. If I change the sample to include <b>test</b> right before (, that gets picked up as a link as well. Shouldn't hasattr filter to 'a' tags?
– pad11
Dec 29 '18 at 3:12
@pad11 Please see my recent edit. I, too, find it strange that <b>test</b> also yields true when its object is used in hasattr(a, 'a'). However, I replaced hasattr with getattr to inspect the value of the name attribute.
– Ajax1234
Dec 29 '18 at 3:17
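On the hasattr question: as far as I know, Beautiful Soup treats attribute access on a Tag as shorthand for .find(), so tag.a returns None when there is no matching child instead of raising AttributeError, and hasattr therefore reports True for any tag. The short sketch below illustrates this and shows a name-based check; the is_link helper is a hypothetical name, not something from the answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

p = BeautifulSoup("<p><b>test</b> (<a href='x'>x</a>)</p>", "html.parser").p
b_tag = p.b

# b_tag.a behaves like b_tag.find('a'): it returns None rather than
# raising AttributeError, so hasattr() cannot tell tags apart.
print(hasattr(b_tag, "a"))
print(b_tag.a)

def is_link(node):
    # Hypothetical helper: compare the tag name instead of probing attributes.
    return isinstance(node, Tag) and node.name == "a"

links = [node for node in p.contents if is_link(node)]
print(links)
```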
That does help! Don't hate me though :( What I gave is more of a basic sample, simply because I don't know how complex it can get. If the text within the parentheses has more text and links, the checks on -1/+1 don't work as expected.
– pad11
Dec 29 '18 at 4:08
@pad11 No problem. Do you have an example along with your desired output?
– Ajax1234
Dec 29 '18 at 4:11
I just want to completely ignore whatever is in the parentheses, as those links will never be relevant, but I cannot determine ahead of time what will be in the parentheses. It can be multiple links with text surrounding the links, or cite references like this... ``` <p> Some text <b>test</b>(even more text <a href='link_text'>link_text</a><a href='link_text'>link_text</a><sup class="reference"><a href='cite_text'>cite_text</a></sup> ) another link <a href='link_text2'>link_text2</a> </p>```
– pad11
Dec 29 '18 at 5:26
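For the more complex sample in the last comment, one possible approach (a sketch, not the answer's method) is to walk the paragraph's descendants in document order and track the parenthesis depth seen so far in the text nodes, keeping only the a tags that start at depth zero. The function name links_outside_parens is hypothetical, and the sketch assumes the parentheses in the text are balanced.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<p> Some text <b>test</b>(even more text <a href='link_text'>link_text</a><a href='link_text'>link_text</a><sup class="reference"><a href='cite_text'>cite_text</a></sup> ) another link <a href='link_text2'>link_text2</a> </p>"""

def links_outside_parens(paragraph):
    """Return the <a> tags that begin while no '(' is open (hypothetical helper)."""
    depth = 0
    kept = []
    # .descendants yields tags and text nodes in document order,
    # with each tag visited before its own children.
    for node in paragraph.descendants:
        if isinstance(node, NavigableString):
            for ch in node:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
        elif isinstance(node, Tag) and node.name == "a" and depth == 0:
            kept.append(node)
    return kept

p = BeautifulSoup(html, "html.parser").p
print(links_outside_parens(p))
```

Since the depth counter is updated from every text node, it does not matter how many links, sup tags, or other elements sit between the opening and closing parenthesis.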
answered Dec 28 '18 at 23:58 by Ajax1234, edited Dec 29 '18 at 3:16