Ignore a link in parentheses while trying to extract other links

I'm trying to extract links from a p block, but I'd like to ignore anything within parentheses. For example:

    <p>
      Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
    </p>

I would like to select only the links outside the parentheses, so in the above case just the link_text2 link. I currently grab the links like this:

    ps = content.find_all('p', recursive=False)
    for p in ps:
        # direct child links of this paragraph
        links = p.find_all('a', recursive=False)

I think I have to use a regex, but I'm not sure how to incorporate it so that any links in parentheses are ignored. This regex isolates anything in parentheses: \(.*?\)

Anyone able to help?
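For reference, the regex idea can be sketched as a pre-processing step: strip every parenthesised span from the raw markup, then parse what is left as before. This is only a sketch under assumptions: `strip_parenthesised` and `PAREN_GROUP` are hypothetical names, and running a regex over raw HTML will misbehave if parentheses appear inside attribute values or are unbalanced.

```python
import re

html = (
    "<p>Some text (even more text <a href='link_text'>link_text</a>) "
    "another link <a href='link_text2'>link_text2</a></p>"
)

# Matches one innermost "( ... )" group, i.e. one with no parentheses inside it.
PAREN_GROUP = re.compile(r"\([^()]*\)")


def strip_parenthesised(markup: str) -> str:
    """Remove parenthesised spans, repeating so nested groups are removed too."""
    previous = None
    while previous != markup:
        previous = markup
        markup = PAREN_GROUP.sub("", markup)
    return markup


cleaned = strip_parenthesised(html)
print(cleaned)
```

The cleaned string can then be handed to the parser, and the existing find_all('a') call would only see links that were outside parentheses.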

edited Dec 29 '18 at 0:16

asked Dec 28 '18 at 23:50

pad11

python beautifulsoup


1 Answer

You can walk the elements of the paragraph's .contents list to find all a tags. These can then be filtered so that the text immediately before and after each tag does not form a ( and ) pair around it:

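    from bs4 import BeautifulSoup as soup

    def is_valid(ind: int, content: list, flag: bool = False) -> bool:
        # flag=False: the sibling before the link must not contain '('
        # flag=True:  the sibling after the link must not contain ')'
        # Siblings that are not strings (other tags) always pass.
        return not isinstance(content[ind], str) or (['(', ')'][flag] not in content[ind])

    s = """
     <p>
       Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
     </p>
    """

    # Direct children of <p>: text nodes and tags, in document order
    d = soup(s, 'html.parser').p.contents
    # Pair each <a> tag with its index in d
    l = [[i, a] for i, a in enumerate(d) if getattr(a, 'name', None) == 'a']
    # A link at either end of d is kept as-is; otherwise both neighbours are checked
    new_l = [a for i, a in l if (not i or i == len(d)-1) or (is_valid(i-1, d) and is_valid(i+1, d, True))]
    print(new_l)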

Output:

[<a href="link_text2">link_text2</a>]

edited Dec 29 '18 at 3:16

answered Dec 28 '18 at 23:58

Ajax1234


So I ran into an issue: hasattr(a, 'a') seems like it should work, but it doesn't isolate just a tags. If I change the sample to include test right before (, that gets picked up as a link as well. Shouldn't hasattr filter to 'a' tags?

– pad11
Dec 29 '18 at 3:12

@pad11 Please see my recent edit. I, too, find it strange that test also yields true when its object is used in hasattr(a, 'a'). However, I replaced hasattr with getattr to inspect the value of the name attribute.

– Ajax1234
Dec 29 '18 at 3:17
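A short sketch of why hasattr behaves this way, assuming the extra "test" in the modified sample was wrapped in its own tag (a `<b>` here, which is a guess). Attribute access on a BeautifulSoup Tag is shorthand for find(), so `tag.a` returns None when no a descendant exists rather than raising, and hasattr therefore reports True for every tag. Comparing the tag's own name avoids that.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed variant of the sample: "test" sits in its own tag before the "(".
soup = BeautifulSoup("<p><b>test</b>(<a href='x'>x</a>)</p>", "html.parser")
bold = soup.p.b

# bold.a is shorthand for bold.find('a'); <b> contains no <a>, so this is None.
print(bold.a)

# hasattr only checks that the lookup did not raise, so it is True here.
print(hasattr(bold, "a"))

# Checking the node's own name tells <a> apart from other tags and from text.
is_link = [isinstance(node, Tag) and node.name == "a" for node in soup.p.contents]
print(is_link)
```

The isinstance check also skips text nodes explicitly, which serves the same purpose as the getattr(a, 'name', None) call in the answer.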

That does help! Don't hate me though :( What I gave is more of a basic sample, simply because I don't know how complex it can get. If the text within the parentheses has more text and links, the checks on -1/+1 don't work as expected.

– pad11
Dec 29 '18 at 4:08

@pad11 No problem. Do you have an example along with your desired output?

– Ajax1234
Dec 29 '18 at 4:11

I just want to completely ignore whatever is in the parentheses, as those links will never be relevant, but I cannot determine ahead of time what will be in the parentheses. It can be multiple links with text surrounding the links, or cite references like this: ``` Some text test(even more text <a href='link_text'>link_text</a><a href='link_text'>link_text</a><a href='cite_text'>cite_text</a> ) another link <a href='link_text2'>link_text2</a> ```

– pad11
Dec 29 '18 at 5:26
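One possible way to handle arbitrary content inside the parentheses is to stop looking at neighbours and instead walk the paragraph in document order while counting open parentheses; a link is kept only when the count is zero. This is a sketch rather than a tested solution from the thread: `links_outside_parens` is a hypothetical helper, and it assumes parentheses in the text are reasonably balanced.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def links_outside_parens(paragraph: Tag) -> list:
    """Return the <a> tags that are not enclosed by '(' ... ')' in the text."""
    depth = 0   # number of currently open parentheses
    kept = []
    # .descendants yields text nodes and tags in document order.
    for node in paragraph.descendants:
        if isinstance(node, NavigableString):
            for ch in node:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
        elif isinstance(node, Tag) and node.name == "a" and depth == 0:
            kept.append(node)
    return kept


sample = (
    "<p>Some text test(even more text <a href='link_text'>link_text</a>"
    "<a href='link_text'>link_text</a><a href='cite_text'>cite_text</a> ) "
    "another link <a href='link_text2'>link_text2</a></p>"
)
paragraph = BeautifulSoup(sample, "html.parser").p
hrefs = [a["href"] for a in links_outside_parens(paragraph)]
print(hrefs)
```

Because the counter is updated from every text node, it does not matter how many links or how much text sit between the opening and closing parenthesis. A stray ")" with nothing open is ignored rather than driving the count negative.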
