HTML / PHP: image clean-up with HTML parsing or a regular expression?

I'm trying to clean up some legacy content in my CMS to unify the way tags are used. I wanted to start with images, but I ran into some problems.

Attributes on image tags such as alt, srcset, and sizes are not always present, and when they are, they do not always appear in the same order. I tried two different approaches to clean up my markup:

HTML parsing

I have tried PHP Simple HTML DOM and the alternative built into PHP, with the following code (it's just an example; I don't actually want class to be set to "blabla"):

$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $image->setAttribute('class', 'blabla');
}
$html = $dom->saveHTML();

When I do this, the image classes are updated correctly, but many divs in $html have vanished even though I haven't touched them. I originally had something like this:

<section id="mysection">
<div class="mydiv">test</div>
<section>

and I end up with this:

<section id="mysection">
test
<section>

As this first method was actually making my problem worse, I tried working with a regular expression instead.
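For reference, here is a minimal sketch of a DOMDocument round trip on an HTML fragment. It does not establish why the divs vanished in the real content; it only shows one way to load a fragment, change the img tags, and write back just the fragment rather than a full document. The sample markup and the variable names ($previous, $body, $clean) are assumptions made for illustration.

```php
<?php
// Hypothetical sample input; the real CMS content is not shown in the question.
$html = '<section id="mysection"><div class="mydiv">test <img src="a.png" alt="A"></div></section>';

// Collect libxml warnings instead of emitting them. Depending on the libxml
// version, HTML5 elements such as <section> may be reported as invalid tags.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($dom->getElementsByTagName('img') as $image) {
    $image->setAttribute('class', 'blabla');
}

// loadHTML() wraps a fragment in <html><body>. Serializing only the children
// of <body> returns the fragment without that wrapper or a doctype.
$body = $dom->getElementsByTagName('body')->item(0);
$clean = '';
foreach ($body->childNodes as $child) {
    $clean .= $dom->saveHTML($child);
}

echo $clean, PHP_EOL;
```

Comparing $clean with the input on a real page that loses its divs would help narrow down which part of the markup the parser is repairing.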

REGEX

I went the easy way by using the pattern /<img(.*)>/ and then exploding the result to identify all the attributes inside img. The problem is that the regex does not stop at the first closing > and ends up capturing a lot of unwanted HTML. It should stop at the first occurrence of a >, but I don't know how to do this.

I suppose HTML parsing should be the preferred method for this kind of operation, but the parsing destroys my markup.

Do you have any idea which method I should use to solve this issue?
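One way to make the match stop at the first > is a negated character class. The following is a minimal sketch under the assumption that no attribute value contains a literal > character; the sample markup is hypothetical.

```php
<?php
// Hypothetical sample input.
$html = '<p><img src="a.png" alt="A"> text <img class="x" src="b.png"></p>';

// [^>]* matches any run of characters other than ">", so each match ends at
// the first ">" after "<img". The \b keeps names such as "<imgfoo" from matching.
preg_match_all('/<img\b([^>]*)>/i', $html, $matches);

// $matches[1] holds the raw attribute string of each img tag.
print_r($matches[1]);
```

This breaks as soon as an attribute value contains a >, which is one reason a parser is usually the safer choice for this kind of clean-up.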

edited Dec 27 '18 at 16:19 by Toto

asked Dec 27 '18 at 16:04 by Laurent

1

Add a ? after the * so it is not greedy, or use the U modifier, which inverts the default greediness. You should use the parser, though. (Also note that you aren't using PHP Simple HTML DOM; you are using DOMDocument.)
– user3783243
Dec 27 '18 at 16:20
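The two options in this comment can be compared side by side. This is a minimal sketch with a hypothetical one-line input; the variable names are chosen for illustration only.

```php
<?php
// Hypothetical sample input.
$html = '<img src="a.png"><p>text</p>';

// Greedy (the pattern from the question): .* runs to the last ">" on the line.
preg_match('/<img(.*)>/', $html, $greedy);

// Lazy quantifier: .*? stops at the first ">" that lets the pattern match.
preg_match('/<img(.*?)>/', $html, $lazy);

// U modifier: quantifiers become lazy by default, so .* behaves like .*? here.
preg_match('/<img(.*)>/U', $html, $ungreedy);

var_dump($greedy[1], $lazy[1], $ungreedy[1]);
```

With the U modifier in effect, adding ? to a quantifier makes it greedy again, so mixing the two styles in one pattern is easy to misread.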

1

Can you please post a full example? That example has no img. It appears to work for me: 3v4l.org/OeHZL (the closing section tag is missing a / in your example).
– user3783243
Dec 27 '18 at 16:28

Tags: php, preg-replace, html-parsing


