html / php - image clean up with HTML parsing or regular expression?


























I'm trying to clean up some legacy content in my CMS to unify the way tags are used. I wanted to start with images, but I ran into some problems.



Attributes of image tags such as alt, srcset and sizes are not always present, and when they are, they do not always appear in the same order. I tried two different approaches to clean up my code:




  1. HTML parsing


I tried PHP Simple HTML DOM and then PHP's built-in DOMDocument, with the following code (it's just an example; I don't actually want the class to be set to "blabla"):



$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image)
{
    $image->setAttribute('class', 'blabla');
}
$html = $dom->saveHTML();


When I do this, the image classes are updated correctly, but many DIVs in $html have vanished even though I haven't modified them. I originally had something like this:



<section id="mysection">
<div class="mydiv">test</div>
<section>


and I end up with this:



<section id="mysection">
test
<section>


As this first method was actually making my problem worse, I tried a regular expression instead.




  2. REGEX


I went the easy way and used the following pattern: /<img(.*)>/. I would then explode the captured result to identify all the attributes inside the img tag. The problem is that the match does not stop at the closing > of the img tag and ends up capturing a lot of unwanted HTML. It should stop at the first occurrence of a >, but I don't know how to do this.



I suppose HTML parsing should be the preferred method for this kind of operation, but the parser destroys my markup.



Do you have any idea which method I should use to solve this issue?






























  • 1




    Add a ? after the * so it is not greedy, or use the U modifier, which inverts greediness. You should use the parser, though. (Also note that you aren't using PHP Simple HTML DOM; you are using DOMDocument.)
    – user3783243
    Dec 27 '18 at 16:20
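
The regex suggestion in the comment above can be sketched as follows. This is only an illustration: the sample markup in $html is hypothetical, and the negated character class [^>]* is an additional variant that the comment does not mention. All three patterns are compared on the same input.

```php
<?php
// Hypothetical sample markup, not taken from the CMS in the question.
$html = '<p><img src="a.png" alt="A"> text <img class="x" src="b.png"></p>';

// Greedy: (.*) runs to the LAST ">" on the line and swallows other markup.
preg_match_all('/<img(.*)>/', $html, $greedy);

// Lazy: (.*?) stops at the FIRST ">" that follows "<img".
preg_match_all('/<img(.*?)>/', $html, $lazy);

// Negated class: [^>]* cannot cross a ">"; \b avoids matching names like "<imgfoo".
preg_match_all('/<img\b([^>]*)>/i', $html, $negated);

// $lazy[0] and $negated[0] hold the full img tags; index 1 holds the attribute text.
print_r($lazy[0]);
print_r($negated[1]);
```

Both non-greedy variants would still break on a literal > inside a quoted attribute value, which is one reason the parser is usually the safer route.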








  • 1




    Can you please post a full example? That example has no img. It appears to work for me: 3v4l.org/OeHZL (the closing section tag in your example is missing a /).
    – user3783243
    Dec 27 '18 at 16:28
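
A minimal sketch of the parser route, under stated assumptions: the fragment in $html is hypothetical (modelled on the example in the question, with the closing tag corrected and an img added), and the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags are an assumption about what suits a CMS fragment, not something taken from the question. They ask libxml not to add html/body wrappers or a doctype.

```php
<?php
// Hypothetical fragment with a single root element.
$html = '<section id="mysection"><div class="mydiv">test <img src="a.png"></div></section>';

$dom = new DOMDocument();

// HTML5 elements such as <section> can make libxml emit warnings;
// collect them internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Same loop as in the question: set an attribute on every img element.
foreach ($dom->getElementsByTagName('img') as $image) {
    $image->setAttribute('class', 'blabla');
}

$result = $dom->saveHTML();
echo $result;
```

With LIBXML_HTML_NOIMPLIED, a fragment that has several top-level elements may be restructured on load, so wrapping the content in one container element before parsing is a common workaround.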


















php preg-replace html-parsing














edited Dec 27 '18 at 16:19









Toto











asked Dec 27 '18 at 16:04









Laurent








