How to read text in XSLFGraphicFrame with Apache POI for PowerPoint
I'm writing a Java program to find occurrences of a particular keyword in documents. I want to read many file formats, including all Microsoft Office documents.
I already have it working for all of them except PowerPoint. I'm using Apache POI code snippets found on Stack Overflow and in other sources.
I discovered that all slides are made of shapes (XSLFTextShape), but many of them are objects of class XSLFGraphicFrame or XSLFTable, for which I can't simply use the toString() method. How can I extract all of the text contained in them using Java?
This is the piece of code (pseudocode):
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintStream;
import org.apache.poi.xslf.usermodel.*;

File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;
FileInputStream is = new FileInputStream(f);
XMLSlideShow ppt = new XMLSlideShow(is);
for (XSLFSlide slide : ppt.getSlides()) {
    for (XSLFShape shape : slide) {
        if (shape instanceof XSLFTextShape) {
            XSLFTextShape txShape = (XSLFTextShape) shape;
            out.println(txShape.getText());
        } else if (shape instanceof XSLFPictureShape) {
            // do nothing
        } else if (shape instanceof XSLFGraphicFrame || shape instanceof XSLFTable) {
            // TODO: print all text in it or in its children
        }
    }
}
java apache-poi powerpoint
asked Dec 30 '18 at 17:16
Antonio Masala
1 Answer
If your requirement "to find occurrences of a particular keyword in documents" only needs a search across all text content of slide shows, then simply using SlideShowExtractor could be an approach. It can also act as the entry point to a POITextExtractor for getting the textual content of the document metadata / properties, such as author and title.
Example:
import java.io.FileInputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.extractor.POITextExtractor;

public class SlideShowExtractorExample {

    public static void main(String[] args) throws Exception {

        SlideShow<XSLFShape,XSLFTextParagraph> slideshow
            = new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));

        SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor
            = new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
        // also include comments, master slide text and speaker notes
        slideShowExtractor.setCommentsByDefault(true);
        slideShowExtractor.setMasterByDefault(true);
        slideShowExtractor.setNotesByDefault(true);

        String allTextContentInSlideShow = slideShowExtractor.getText();
        System.out.println(allTextContentInSlideShow);

        System.out.println("===========================================================================");

        // document metadata / properties, such as author and title
        POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
        String metaData = textExtractor.getText();
        System.out.println(metaData);
    }
}
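Once the text is available as a plain String, the keyword search itself no longer depends on Apache POI. The following is only a minimal sketch under assumptions: the class KeywordCounter and the method countOccurrences are hypothetical names introduced for illustration, the match is case-insensitive and non-overlapping, and the sample string merely stands in for the value returned by slideShowExtractor.getText().

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Apache POI. */
class KeywordCounter {

    /** Counts non-overlapping, case-insensitive occurrences of keyword in text. */
    static int countOccurrences(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        // jump past each hit so that matches never overlap
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the String returned by slideShowExtractor.getText()
        String extracted = "Module 9\nPerformance overview\nperformance by quarter";
        System.out.println(countOccurrences(extracted, "performance"));
    }
}
```

The same method could be applied unchanged to the metadata text or to any other extracted string.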
Of course, there are kinds of XSLFGraphicFrame which are not read by SlideShowExtractor because Apache POI does not support them yet, for example all kinds of SmartArt graphics. Their text content is stored in /ppt/diagrams/data*.xml document parts which are referenced from the slides. Since Apache POI does not support this yet, it can only be read using the underlying low-level methods.
For example, to additionally get all text out of all /ppt/diagrams/data parts, which hold the texts in SmartArt graphics, we could do:
...
System.out.println("===========================================================================");

// additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics:
StringBuilder sb = new StringBuilder();
for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
    for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
        if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
            org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
            org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
            // walk all XML tokens and collect every text token
            while (cursor.hasNextToken()) {
                if (cursor.isText()) {
                    sb.append(cursor.getTextValue() + "\r\n");
                }
                cursor.toNextToken();
            }
            // append the number of the slide that references this diagram data part
            sb.append(slide.getSlideNumber() + "\r\n\r\n");
        }
    }
}
String allTextContentInDiagrams = sb.toString();
System.out.println(allTextContentInDiagrams);
...
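The cursor loop above collects every text token in the data part. If only the visible text runs are wanted, one possible variation is to pick just the DrawingML a:t elements with the JDK's own DOM parser. This is a sketch under assumptions, not something Apache POI provides: DiagramTextReader and readTextRuns are hypothetical names, and the XML in main is a hand-written, heavily simplified stand-in for a real /ppt/diagrams/data*.xml part. In the loop above, the stream from part.getPackagePart().getInputStream() could presumably be passed in instead.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not part of Apache POI. */
class DiagramTextReader {

    // Namespace of DrawingML main elements such as a:p, a:r and a:t
    static final String DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns the content of every a:t element in the given XML stream, in document order. */
    static List<String> readTextRuns(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(xml);
        NodeList runs = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < runs.getLength(); i++) {
            texts.add(runs.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: simplified sample, real diagram data parts carry many more elements
        String sample =
            "<dgm:dataModel xmlns:dgm=\"http://schemas.openxmlformats.org/drawingml/2006/diagram\""
            + " xmlns:a=\"" + DRAWINGML_NS + "\">"
            + "<dgm:ptLst>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>Plan</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>Build</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "</dgm:ptLst></dgm:dataModel>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String text : readTextRuns(in)) {
            System.out.println(text);
        }
    }
}
```

Filtering by namespace matters here because the diagram namespace also has an element with the local name t (dgm:t), which is only a container for the paragraphs.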
edited Dec 31 '18 at 8:50
answered Dec 31 '18 at 5:39
Axel Richter