How to read text in XSLFGraphicFrame with Apache POI for PowerPoint



























I'm writing a Java program that finds occurrences of a particular keyword in documents. It needs to read many file formats, including all Microsoft Office documents.

I have this working for every format except PowerPoint. I'm using Apache POI code snippets found on Stack Overflow and in other sources.
I discovered that slides are made up of shapes (XSLFShape). Some of them are XSLFTextShape objects, but many are instances of XSLFGraphicFrame or XSLFTable, for which I can't simply call toString(). How can I extract all of the text contained in them using Java?
This is the code (partly pseudocode):



File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;

FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);
for (XSLFSlide slide : ppt.getSlides()) {
    for (XSLFShape shape : slide) {
        if (shape instanceof XSLFTextShape) {
            XSLFTextShape txShape = (XSLFTextShape) shape;
            out.println(txShape.getText());
        } else if (shape instanceof XSLFPictureShape) {
            // do nothing
        } else if (shape instanceof XSLFGraphicFrame || shape instanceof XSLFTable) {
            // print all text in it or in its children
        }
    }
}









      java apache-poi powerpoint
















      asked Dec 30 '18 at 17:16









      Antonio Masala
























1 Answer














If your requirement "to find occurrences of a particular keyword in documents" only needs a search across all text content of a slide show, then SlideShowExtractor may be all you need. It can also serve as the entry point to a POITextExtractor for the textual content of the document metadata and properties, such as author and title.

Example:



import java.io.FileInputStream;

import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;

import org.apache.poi.extractor.POITextExtractor;

public class SlideShowExtractorExample {

    public static void main(String[] args) throws Exception {

        SlideShow<XSLFShape,XSLFTextParagraph> slideshow
            = new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));

        // include comments, master slides and notes in the extracted text
        SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
            = new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
        slideShowExtractor.setCommentsByDefault(true);
        slideShowExtractor.setMasterByDefault(true);
        slideShowExtractor.setNotesByDefault(true);

        String allTextContentInSlideShow = slideShowExtractor.getText();

        System.out.println(allTextContentInSlideShow);

        System.out.println("===========================================================================");

        // document metadata / properties, such as author and title
        POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
        String metaData = textExtractor.getText();

        System.out.println(metaData);

    }
}
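
Once the text is available as one string (for instance the value returned by getText() above), the keyword search itself needs only the standard library. The following is a minimal sketch, not part of Apache POI: the class name KeywordCounter, the method countOccurrences and the sample text are made up for illustration, and the sketch assumes a case-insensitive, non-overlapping match is what is wanted.

```java
import java.util.Locale;

// Hypothetical helper, not an Apache POI class.
final class KeywordCounter {

    /** Counts non-overlapping, case-insensitive occurrences of keyword in text. */
    static int countOccurrences(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return 0;
        }
        // Lower-case both sides once so that the search ignores case.
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length(); // continue after the current match
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a slide show (invented sample).
        String extracted = "Budget 2018\r\nThe budget is approved.\r\nNotes: BUDGET review";
        System.out.println(countOccurrences(extracted, "budget")); // prints 3
    }
}
```

If whole-word matching is needed instead, a java.util.regex.Pattern with word boundaries would be the more natural choice.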


Of course, there are kinds of XSLFGraphicFrame that SlideShowExtractor does not read because Apache POI does not support them yet, for example all kinds of SmartArt graphics. Their text content is stored in /ppt/diagrams/data*.xml document parts, which are referenced from the slides. Since Apache POI has no high-level support for this yet, it can only be read using the low-level underlying methods.

For example, to additionally get the text of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics, we could do:



...
System.out.println("===========================================================================");

// additionally get all text out of all /ppt/diagrams/data parts, which hold the texts of SmartArt graphics:
StringBuilder sb = new StringBuilder();
for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
    for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
        if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
            org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
            org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
            // walk every token of the part and collect the text tokens
            while (cursor.hasNextToken()) {
                if (cursor.isText()) {
                    sb.append(cursor.getTextValue() + "\r\n");
                }
                cursor.toNextToken();
            }
            // append the number of the slide that references this diagram part
            sb.append(slide.getSlideNumber() + "\r\n\r\n");
        }
    }
}
String allTextContentInDiagrams = sb.toString();

System.out.println(allTextContentInDiagrams);
...
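
The XmlCursor loop above collects every text token in the part. As a possible variation, and swapping in a different technique (StAX from the JDK instead of XMLBeans), the sketch below keeps only the content of DrawingML text-run elements (a:t). It assumes the stream is one diagram data part, such as the one returned by part.getPackagePart().getInputStream() above. The class name DiagramTextReader, the method readTextRuns and the sample XML are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not an Apache POI class.
final class DiagramTextReader {

    // DrawingML main namespace; text runs are stored in <a:t> elements.
    static final String DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns the content of every a:t element in document order. */
    static List<String> readTextRuns(InputStream xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        List<String> runs = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "t".equals(reader.getLocalName())
                            && DRAWINGML_NS.equals(reader.getNamespaceURI())) {
                        runs.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Invented, heavily shortened stand-in for a /ppt/diagrams/data*.xml part.
        String sample =
            "<dgm:dataModel xmlns:dgm=\"http://schemas.openxmlformats.org/drawingml/2006/diagram\""
            + " xmlns:a=\"" + DRAWINGML_NS + "\"><dgm:ptLst>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>First step</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>Second step</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "</dgm:ptLst></dgm:dataModel>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTextRuns(in)); // prints [First step, Second step]
    }
}
```

Filtering on a:t skips whitespace-only tokens between elements, at the cost of ignoring any text stored elsewhere in the part.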













                edited Dec 31 '18 at 8:50

























                answered Dec 31 '18 at 5:39









                Axel Richter





























