Ignore a link in parentheses while trying to extract other links

I'm trying to extract links from a p block, but I'd like to ignore anything within parentheses. For example:

<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>

I would like to select only the links that are not inside the parentheses, so in the above case just the link_text2 link. I currently grab the links using this:

ps = content.find_all('p', recursive=False)
for p in ps:
    # "as" is a reserved word in Python, so the result needs a different name
    links = p.find_all('a', recursive=False)

I think I have to use a regex, but I'm not sure how to incorporate it so that it ignores any links in parentheses. This regex works to isolate anything in parentheses: \(.*?\)

Anyone able to help?
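For reference, the regex idea mentioned above could be sketched roughly as follows. This is only an illustration under two assumptions: parentheses are never nested, and they never appear inside tag attributes. The names html, PAREN_RE, cleaned and hrefs are made up for the example, and the final href extraction uses a second regex only to keep the sketch dependency-free; in practice the cleaned string would more likely be passed to BeautifulSoup.

```python
import re

html = """<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>"""

# The parentheses are escaped so they match literally; [^()]* keeps each
# match inside a single, non-nested pair of parentheses.
PAREN_RE = re.compile(r"\([^()]*\)")

# Drop every parenthesised span (including any markup inside it).
cleaned = PAREN_RE.sub("", html)

# Collect the href values of the anchors that remain.
hrefs = re.findall(r"<a\s+href='([^']*)'", cleaned)
print(hrefs)  # ['link_text2']
```

Removing text from raw HTML with a regex is fragile, so this is best treated as a quick pre-processing step rather than a general solution.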










python beautifulsoup

edited Dec 29 '18 at 0:16
pad11

asked Dec 28 '18 at 23:50
pad11
























1 Answer






You can walk the elements in the paragraph's .contents list to find all a tags. These can then be filtered so that the text immediately before and after each tag does not form a ( and ) pair:

from bs4 import BeautifulSoup as soup

def is_valid(ind: int, content: list, flag=False) -> bool:
    # flag=False looks for '(' (text before the link); flag=True looks for ')' (text after it).
    # A neighbour that is a tag rather than a string is always treated as valid.
    return not isinstance(content[ind], str) or (['(', ')'][flag] not in content[ind])

s = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""
d = soup(s, 'html.parser').p.contents
# Pair each direct <a> child with its index in the contents list
l = [[i, a] for i, a in enumerate(d) if getattr(a, 'name', None) == 'a']
# Keep links at either end of the paragraph, or whose neighbours do not enclose them in ( ... )
new_l = [a for i, a in l if (not i or i == len(d)-1) or (is_valid(i-1, d) and is_valid(i+1, d, True))]
print(new_l)

Output:

[<a href="link_text2">link_text2</a>]






edited Dec 29 '18 at 3:16

answered Dec 28 '18 at 23:58
Ajax1234













• So I ran into an issue with hasattr(a, 'a'): it seems like it should work, but it doesn't isolate just the a tags. If I change the sample to include <b>test</b> right before (, that gets picked up as a link as well. Shouldn't hasattr filter to 'a' tags?
– pad11
Dec 29 '18 at 3:12













• @pad11 Please see my recent edit. I, too, find it strange that <b>test</b> also yields True when its object is passed to hasattr(a, 'a'). However, I replaced hasattr with getattr to inspect the value of the name attribute.
– Ajax1234
Dec 29 '18 at 3:17
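The hasattr behaviour discussed in these comments can be reproduced with a short sketch. On a BeautifulSoup Tag, attribute access such as tag.a is shorthand for tag.find('a'), which returns None instead of raising AttributeError when nothing is found, so hasattr reports True for any tag. The markup and the names doc, b_tag and a_tag below are made up for illustration.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup("<p><b>test</b> text <a href='x'>x</a></p>", "html.parser")
b_tag = doc.b
a_tag = doc.a

# tag.a behaves like tag.find('a'): no <a> inside <b>, so the result is None
# rather than an AttributeError.
print(b_tag.a)              # None

# Because no AttributeError is raised, hasattr is True even for the <b> tag.
print(hasattr(b_tag, "a"))  # True

# Comparing the tag's name is what actually distinguishes <a> from <b>.
print(b_tag.name == "a")    # False
print(a_tag.name == "a")    # True
```

This is why the getattr(a, 'name', None) == 'a' check in the answer filters correctly: plain strings in .contents have no tag name, and other tags have a different one.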











• That does help! Don't hate me though :( What I gave is more of a basic sample, simply because I don't know how complex it can get. If the text within the parentheses has more text and links, the checks on -1/+1 don't work as expected.
– pad11
Dec 29 '18 at 4:08











• @pad11 No problem. Do you have an example along with your desired output?
– Ajax1234
Dec 29 '18 at 4:11











• I just want to completely ignore whatever is in the parentheses, as the links there will never be relevant, but I cannot determine ahead of time what will be in the parentheses. It can be multiple links with text surrounding the links, or cite references like this:
<p> Some text <b>test</b>(even more text <a href='link_text'>link_text</a><a href='link_text'>link_text</a><sup class="reference"><a href='cite_text'>cite_text</a></sup> ) another link <a href='link_text2'>link_text2</a> </p>
– pad11
Dec 29 '18 at 5:26
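One possible way to handle the more complex case in the last comment is to walk every node of the paragraph in document order and keep a running count of open parentheses. This is a sketch rather than a tested general solution: it is not part of the answer above, the function name links_outside_parens is made up, and it assumes that parentheses in the text are balanced and that any ( or ) inside link text should count as well.

```python
from bs4 import BeautifulSoup, NavigableString

def links_outside_parens(paragraph):
    """Return the <a> tags that are not enclosed in ( ... ) in the paragraph text."""
    depth = 0   # number of currently open parentheses
    kept = []
    # .descendants yields tags and strings in document order, at any nesting level
    for node in paragraph.descendants:
        if isinstance(node, NavigableString):
            for ch in node:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
        elif node.name == "a" and depth == 0:
            kept.append(node)
    return kept

html = """<p> Some text <b>test</b>(even more text <a href='link_text'>link_text</a><a href='link_text'>link_text</a><sup class="reference"><a href='cite_text'>cite_text</a></sup> ) another link <a href='link_text2'>link_text2</a> </p>"""

paragraph = BeautifulSoup(html, "html.parser").p
kept = links_outside_parens(paragraph)
print(kept)  # [<a href="link_text2">link_text2</a>]
```

Because the walk covers all descendants, links nested inside other tags (such as the sup reference) are skipped too, which the index-based -1/+1 check cannot do.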


















