html / php - image clean up with HTML parsing or regular expression?
I'm trying to clean up some legacy content in my CMS to unify the way tags are used. I wanted to start with images, but I ran into some problems.
The attributes in image tags (alt, srcset, sizes, ...) are not always present, and when they are, they do not always appear in the same order. I tried two different approaches to clean up my code:
- HTML parsing
I have tried PHP Simple HTML DOM, and also PHP's built-in DOMDocument with the following code (it's just an example; I don't actually want class to be set to "blabla"):
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image)
{
    $image->setAttribute('class', 'blabla');
}
$html = $dom->saveHTML();
When I do this, the image classes are updated correctly, but many divs in $html vanish even though I haven't modified them. I originally had something like this:
<section id="mysection">
<div class="mydiv">test</div>
<section>
and I end up with this:
<section id="mysection">
test
<section>
As this first method was actually making my problem worse, I tried working with a regex.
- REGEX
I went the easy way and used the following pattern: /<img(.*)>/ and then I would explode the result to identify all the attributes inside img. The problem is that the regex ignores the closing > of the expression and ends up capturing a lot of unwanted HTML. It should stop at the first occurrence of a > but I don't know how to do this.
I suppose HTML parsing should be the preferred method for this kind of operation, but the parsing destroys my code.
Do you have any idea which method I should use to solve this issue?
php preg-replace html-parsing
Use a ? so the * is not greedy... or you can use the U modifier so it functions opposite. You should use the parser though. (Also note you aren't using PHP simple html dom, you are using domdocument.)
– user3783243, Dec 27 '18 at 16:20
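To illustrate the suggestion in this comment, here is a minimal sketch comparing the greedy pattern from the question with a lazy quantifier, the U modifier, and a negated character class. The sample $html string is made up for the demonstration, and all of these variants assume that no attribute value contains a literal > character, which a regex of this kind cannot handle reliably.

```php
<?php
// Made-up sample markup: two <img> tags with attributes in different orders.
$html = '<p><img src="a.png" alt="A"> text <img alt="B" src="b.png" class="x"></p>';

// Greedy (the pattern from the question): (.*) runs on to the LAST ">" in the line.
preg_match_all('/<img(.*)>/', $html, $greedy);

// Lazy: the ? after * makes the quantifier stop at the FIRST ">".
preg_match_all('/<img(.*?)>/', $html, $lazy);

// U modifier: inverts greediness for the whole pattern, so (.*) becomes lazy.
preg_match_all('/<img(.*)>/U', $html, $ungreedy);

// Negated character class: match anything except ">", so the capture cannot cross a tag end.
preg_match_all('/<img([^>]*)>/', $html, $negated);

echo count($greedy[0]), " greedy match\n";   // the single match swallows both tags
echo count($lazy[0]), " lazy matches\n";     // one match per <img>
print_r($lazy[1]);                           // the raw attribute strings
```

The negated character class is often the easiest of the three to read, since the stopping condition is spelled out in the pattern itself rather than implied by a quantifier flag.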
Can you please post a full example? That example has no img. It appears to work for me, 3v4l.org/OeHZL (your closing section is missing a / in the example)
– user3783243, Dec 27 '18 at 16:28
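For the parser route recommended in the comments, the following is a rough sketch of how the img attributes could be put into a fixed order with DOMDocument. The function name normalizeImgAttributes, the $order list and the sample markup are all hypothetical and not taken from the question. The sketch also does not explain the vanishing divs, which could not be reproduced from the snippet posted. It assumes a non-empty, ASCII-only HTML fragment.

```php
<?php
// Hypothetical helper (not from the question): reorders the attributes of every <img>.
function normalizeImgAttributes(string $html, array $order): string
{
    // libxml's HTML parser may report HTML5 tags such as <section> as errors;
    // collect them internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        // Copy name => value pairs first, then strip the element bare.
        $attrs = [];
        foreach ($img->attributes as $attr) {
            $attrs[$attr->name] = $attr->value;
        }
        foreach (array_keys($attrs) as $name) {
            $img->removeAttribute($name);
        }

        // Attributes listed in $order come first; the rest keep their original order.
        $sorted = [];
        foreach ($order as $name) {
            if (array_key_exists($name, $attrs)) {
                $sorted[$name] = $attrs[$name];
            }
        }
        $sorted += $attrs;

        foreach ($sorted as $name => $value) {
            $img->setAttribute($name, $value);
        }
    }

    // loadHTML() wraps the fragment in <html><body>; serialize only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Made-up sample: the surrounding <section> and <div> should pass through untouched.
$html = '<section id="mysection"><div class="mydiv"><img alt="A" class="old" src="a.png"></div></section>';
echo normalizeImgAttributes($html, ['src', 'alt', 'class']);
```

Serializing only the children of body is one way to avoid the doctype and html/body wrapper that saveHTML() would otherwise add around a fragment.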
asked Dec 27 '18 at 16:04 by Laurent
edited Dec 27 '18 at 16:19 by Toto