How to read text in XSLFGraphicFrame with Apache POI for PowerPoint

I'm writing a Java program to find occurrences of a particular keyword in documents. I want to read many file formats, including all Microsoft Office documents.

I already have this working for all of them except PowerPoint files. I'm using Apache POI, with code snippets found on Stack Overflow and other sources.
I discovered that all slides are made of shapes (XSLFTextShape), but many of them are objects of class XSLFGraphicFrame or XSLFTable, for which I can't simply use the toString() method. How can I extract all of the text contained in them using Java?
This is the piece of code (partly pseudocode):

File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;

FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);

for (XSLFSlide slide : ppt.getSlides()) {
    for (XSLFShape shape : slide) {
        if (shape instanceof XSLFTextShape) {
            XSLFTextShape txShape = (XSLFTextShape) shape;
            out.println(txShape.getText());
        } else if (shape instanceof XSLFPictureShape) {
            // do nothing
        } else if (shape instanceof XSLFGraphicFrame || shape instanceof XSLFTable) {
            // print all text in it or in its children
        }
    }
}

asked Dec 30 '18 at 17:16

Antonio Masala

java apache-poi powerpoint


1 Answer
If your requirement "to find occurrences of a particular keyword in documents" simply means searching all the text content of slide shows, then using SlideShowExtractor could be an approach. It can also act as the entry point to a POITextExtractor for getting the textual content of the document metadata / properties, such as author and title.

Example:

import java.io.FileInputStream;

import org.apache.poi.extractor.POITextExtractor;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.xslf.usermodel.*;

public class SlideShowExtractorExample {

 public static void main(String[] args) throws Exception {

  SlideShow<XSLFShape,XSLFTextParagraph> slideshow
   = new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));

  // Extract the slide text, also including comments, master slides and notes.
  SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
   = new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
  slideShowExtractor.setCommentsByDefault(true);
  slideShowExtractor.setMasterByDefault(true);
  slideShowExtractor.setNotesByDefault(true);

  String allTextContentInSlideShow = slideShowExtractor.getText();
  System.out.println(allTextContentInSlideShow);

  System.out.println("===========================================================================");

  // Extract the document metadata / properties (author, title, ...).
  POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
  String metaData = textExtractor.getText();
  System.out.println(metaData);

 }

}
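Once the text is available as a single string, the keyword search itself needs no POI at all. The following is a minimal sketch, not part of the original example: the class name KeywordCounter and the method countOccurrences are hypothetical, and the matching rule (case-insensitive, non-overlapping substring matches) is an assumption that may need adjusting, for instance to whole-word matching.

```java
import java.util.Locale;

class KeywordCounter {

    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    // Returns 0 for null input or an empty keyword.
    static int countOccurrences(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length(); // skip past this match
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by slideShowExtractor.getText().
        String extracted = "Sales report\nQuarterly sales by region\nNotes: SALES targets";
        System.out.println(countOccurrences(extracted, "sales"));
    }
}
```

In the example above, the call would be countOccurrences(allTextContentInSlideShow, keyword).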

Of course, there are kinds of XSLFGraphicFrame that SlideShowExtractor does not read because Apache POI does not support them yet, for example all kinds of SmartArt graphics. Their text content is stored in /ppt/diagrams/data*.xml document parts, which are referenced from the slides. Since Apache POI does not support this yet, it can only be read using the low-level underlying methods.

For example, to additionally get all the text out of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics, we could do:

...
  System.out.println("===========================================================================");

  // additionally get all text out of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics:
  StringBuilder sb = new StringBuilder();
  for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
   for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
    if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
     org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
     org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
     // walk all XML tokens and collect the text ones
     while(cursor.hasNextToken()) {
      if (cursor.isText()) {
       sb.append(cursor.getTextValue() + "\r\n");
      }
      cursor.toNextToken();
     }
     // mark which slide the diagram text belongs to
     sb.append(slide.getSlideNumber() + "\r\n\r\n");
    }
   }
  }
  String allTextContentInDiagrams = sb.toString();
  System.out.println(allTextContentInDiagrams);
...
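Because a .pptx file is a ZIP package, the same diagram data parts can also be reached without POI. The sketch below is a different technique from the snippet above: it uses only java.util.zip and the JDK's DOM parser, and it assumes Java 9 or later (for InputStream.readAllBytes) and that the visible SmartArt text sits in DrawingML a:t elements. The class name DiagramTextReader and the method readDiagramTexts are hypothetical. Unlike the POI version, it scans the whole package and does not tell you which slide a diagram belongs to.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

class DiagramTextReader {

    // DrawingML main namespace; text runs are <a:t> elements in this namespace.
    private static final String DRAWINGML_NS =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the text of every a:t element found in ppt/diagrams/data*.xml.
    static List<String> readDiagramTexts(InputStream pptx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName(); // zip entry names have no leading slash
                if (!name.startsWith("ppt/diagrams/data") || !name.endsWith(".xml")) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                NodeList runs = builder.parse(new ByteArrayInputStream(xml))
                    .getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < runs.getLength(); i++) {
                    texts.add(runs.item(i).getTextContent());
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a .pptx file
        try (InputStream in = new FileInputStream(args[0])) {
            for (String text : readDiagramTexts(in)) {
                System.out.println(text);
            }
        }
    }
}
```

Reading only a:t elements skips whitespace and other non-text tokens, so the output is usually cleaner than a raw token walk, but it may miss text stored elsewhere in the part.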

edited Dec 31 '18 at 8:50

answered Dec 31 '18 at 5:39

Axel Richter

