Performance problem with reading and parsing large XML files

I have a directory which contains several large XML files (total size is about 10 GB). Is there any way to iterate through the directory, read the XML files 50 bytes at a time, and parse them with high performance?
func (mdc *Mdc) Loadxml(path string, wg sync.WaitGroup) {
	defer wg.Done()
	//var conf configuration
	file, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	buf := make([]byte, 1024*1024)
	scanner.Buffer(buf, 50)
	for scanner.Scan() {
		_, err := file.Read(buf)
		if err != nil {
			log.Fatal(err)
		}
	}

	err = xml.Unmarshal(buf, &mdc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(mdc)
}
xml go
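Aside from the parsing strategy, the code above has a concurrency bug worth noting: `wg sync.WaitGroup` is passed by value, so `wg.Done()` decrements a copy and the caller's `wg.Wait()` never returns. A minimal sketch of the corrected pattern (`loadXML` is a hypothetical stand-in for `Loadxml`):

```go
package main

import (
	"fmt"
	"sync"
)

// loadXML is a stand-in for Loadxml: the WaitGroup must be passed
// by pointer, otherwise Done() decrements a copy and Wait() blocks forever.
func loadXML(path string, wg *sync.WaitGroup) {
	defer wg.Done()
	fmt.Println("processing", path)
}

func main() {
	var wg sync.WaitGroup
	for _, p := range []string{"a.xml", "b.xml"} {
		wg.Add(1)
		go loadXML(p, &wg) // pass a pointer, not a copy
	}
	wg.Wait()
	fmt.Println("done")
}
```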
asked Dec 31 '18 at 10:21 by amiramir
edited Dec 31 '18 at 12:12 by Flimzy
golang.org/pkg/encoding/xml/#Decoder.Token
– Peter
Dec 31 '18 at 11:08
2 Answers
You can do something even better: you can tokenize your XML files.
Say you have an XML file like this:
<inventory>
  <item name="ACME Unobtainium">
    <tag>Foo</tag>
    <count>1</count>
  </item>
  <item name="Dirt">
    <tag>Bar</tag>
    <count>0</count>
  </item>
</inventory>
you can use the following data model:
type Inventory struct {
	Items []Item `xml:"item"`
}

type Item struct {
	Name  string   `xml:"name,attr"`
	Tags  []string `xml:"tag"`
	Count int      `xml:"count"`
}
Now, all you have to do is use filepath.Walk and do something like this for each file you want to process:
decoder := xml.NewDecoder(file)

for {
	// Read tokens from the XML document in a stream.
	t, err := decoder.Token()

	// If we are at the end of the file, we are done.
	if err == io.EOF {
		log.Println("The end")
		break
	} else if err != nil {
		log.Fatalf("Error decoding token: %s", err)
	} else if t == nil {
		break
	}

	// Here, we inspect the token.
	switch se := t.(type) {
	// We have the start of an element; the complete token is in t.
	case xml.StartElement:
		switch se.Name.Local {
		// Found an item, so we process it.
		case "item":
			var item Item

			// We decode the element into our data model...
			if err = decoder.DecodeElement(&item, &se); err != nil {
				log.Fatalf("Error decoding item: %s", err)
			}

			// ...and use it for whatever we want.
			log.Printf("'%s' in stock: %d", item.Name, item.Count)
			if len(item.Tags) > 0 {
				log.Println("Tags")
				for _, tag := range item.Tags {
					log.Printf("\t%s", tag)
				}
			}
		}
	}
}
Working example with dummy XML: https://play.golang.org/p/MiLej7ih9Jt
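The answer mentions filepath.Walk without showing it; here is a minimal sketch of such a per-file driver. The directory handling, the `processFile` name, and the element-counting body are illustrative assumptions, not part of the original answer:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// processFile streams tokens from one XML file and counts its start
// elements, so the file is never loaded into memory as a whole.
func processFile(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	dec := xml.NewDecoder(f)
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if _, ok := tok.(xml.StartElement); ok {
			n++
		}
	}
}

func main() {
	// Stand-in directory with one demo file; real code would walk
	// the 10 GB directory instead.
	dir, err := os.MkdirTemp("", "xmldemo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	os.WriteFile(filepath.Join(dir, "demo.xml"),
		[]byte(`<inventory><item name="Dirt"/><item name="Foo"/></inventory>`), 0o644)

	// Walk the directory, handing every .xml file to processFile.
	err = filepath.Walk(dir, func(p string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(p, ".xml") {
			return err
		}
		n, err := processFile(p)
		if err != nil {
			return err
		}
		fmt.Printf("%s: %d elements\n", filepath.Base(p), n)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```

For the demo file this prints one line counting its three start elements (inventory plus two items).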
answered Dec 31 '18 at 12:08 by Markus W Mahlberg

The handling of io.EOF in this model should be a bit more involved: if you hit it without having seen the end element of the root node, this is actually an error, not a legitimate "end of input". Hence I'd say an imaginary "bullet-proof" solution would be your and @david-maze's answers combined ;-)
– kostix
Dec 31 '18 at 12:51

@kostix Feel free to edit accordingly. But the aim was more to create an understanding, less to provide scaffolding, let alone c&p material. Give a man a fish...
– Markus W Mahlberg
Dec 31 '18 at 13:30

I'm fine with my remark existing as a comment—precisely to not over-complicate your answer.
– kostix
Dec 31 '18 at 14:13
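kostix's caveat about io.EOF can be sketched with a depth counter. This is a belt-and-braces check under my assumptions: recent Go versions of Decoder.Token already report a syntax error for EOF inside an open element, so the explicit depth test mainly guards the empty-input case:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenize drains the token stream and reports an error if io.EOF
// arrives before the root element was opened and closed.
func tokenize(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	depth := 0
	sawRoot := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			if depth != 0 || !sawRoot {
				return fmt.Errorf("truncated document: EOF at depth %d", depth)
			}
			return nil // legitimate end of input
		}
		if err != nil {
			return err // includes the decoder's own "unexpected EOF"
		}
		switch tok.(type) {
		case xml.StartElement:
			depth++
			sawRoot = true
		case xml.EndElement:
			depth--
		}
	}
}

func main() {
	fmt.Println(tokenize("<inventory><item/></inventory>") == nil) // complete: true
	fmt.Println(tokenize("<inventory><item/>") == nil)             // truncated: false
}
```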
The encoding/xml package provides a medium-level xml.Decoder type. That lets you read through an XML input stream one Token at a time, not unlike the streaming Java SAX model of old. When you find the thing you're looking for, you can jump back into decoder.Decode to run the normal unmarshaling sequence and get individual objects out. Just remember that the token stream might contain several things that are "irrelevant" (whitespace-only text nodes, processing instructions, comments) and you need to skip over them, while still looking for things that are "important" (non-whitespace text nodes, unexpected start/end elements).
As a high-level example, if you're expecting a very large SOAP message with a list of records, you might do a "streaming" parse until you see the <soap:Body> start-element, check that its immediate child (e.g., the next start-element) is the element you expect, and then call decoder.Decode on each of its child elements. If you see the end of the operation element, you can unwind the element tree (you now expect to see </soap:Body></soap:Envelope>). Anything else is an error that you need to catch and process.
The skeleton of an application here might look like
type Foo struct {
	Name string `xml:"name"`
}

decoder := xml.NewDecoder(r)
for {
	t, err := decoder.Token()
	if err != nil {
		panic(err)
	}
	switch x := t.(type) {
	case xml.StartElement:
		switch x.Name {
		case xml.Name{Space: "", Local: "foo"}:
			var foo Foo
			err = decoder.DecodeElement(&foo, &x)
			if err != nil {
				panic(err)
			}
			fmt.Printf("%+v\n", foo)
		default:
			fmt.Printf("Unexpected SE {%s}%s\n", x.Name.Space, x.Name.Local)
		}
	case xml.EndElement:
		switch x.Name {
		default:
			fmt.Printf("Unexpected EE {%s}%s\n", x.Name.Space, x.Name.Local)
		}
	}
}
https://play.golang.org/p/_ZfG9oCESLJ has a complete working example (not of the SOAP case but something smaller).
XML parsing in Go, like basically everything else, is a "pull" model: you tell the reader what to read and it gets the data from the io.Reader you give it. If you manually create an xml.Decoder you can pull one token at a time from it, and that will presumably call r.Read in digestible chunks, but you can't push tiny increments of data into the parser as you propose.
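If you want some control over read sizes anyway, you can put a buffered reader between the file and the decoder; the decoder then pulls from the buffer regardless of how small its tokens are. A sketch, where the 1 MiB buffer size and the sample document are arbitrary assumptions:

```go
package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Any io.Reader works; for a real file this would be os.Open's result.
	src := strings.NewReader("<root><foo><name>x</name></foo></root>")

	// The decoder pulls from the buffered reader in large chunks,
	// however fine-grained the tokens it hands back are.
	dec := xml.NewDecoder(bufio.NewReaderSize(src, 1<<20)) // 1 MiB, arbitrary

	count := 0
	for {
		_, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		count++
	}
	fmt.Println("tokens:", count) // 3 starts + 1 chardata + 3 ends
}
```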
I can't speak specifically to the performance of encoding/xml, but a hybrid streaming approach like this will at least get you better latency to the first output and keep less live data in memory at a time.

answered Dec 31 '18 at 11:40 by David Maze

Darn, you are fast ;)
– Markus W Mahlberg
Dec 31 '18 at 12:08