Open a zip file and stream the xml file inside of the zip file
I am trying to open bulk data from the USPTO. The xml files within the zips are concatenated xml files containing multiple xml declarations and are quite large. I am trying to read lines from the xml only until I get to the next xml declaration. I found this related question, without code.
What I want to create is a function that does the following:
- For each *.zip file
- Extract all xml file(s) (or open xml file(s) for reading)
- Read lines from the xml file(s)
- Append each line until the next xml declaration
- Return the string
So far, I've been able to open the zip file, find all the xml file(s) and extract each xml file. I would prefer to not write the xml file to disk, but instead create a string that is a single xml document that I then further parse.
    import glob
    import zipfile

    def main():
        path = 'bulk/'
        # Process the archives in a stable, sorted order
        allFiles = glob.glob(path + '*.zip')
        allFiles.sort()
        for file in allFiles:
            try:
                with zipfile.ZipFile(file, mode='r', allowZip64=True) as fin:
                    print(fin, '- ok')
                    print(fin.namelist())
                    for name in fin.namelist():
                        if name.endswith('xml'):
                            print(name)  # all files that end in 'xml'
                            # Writes the member to disk, which I would like to avoid
                            fin.extract(name, path='bulk/')
                            print('extracted ', name)
                            # TODO function to read lines of the xml file and
            except zipfile.BadZipFile:
                print(file, '- Bad zip file')

    if __name__ == '__main__':
        main()
python xml zipfile
edited Jan 3 at 22:38 by cody
asked Jan 3 at 21:49 by Britt
1 Answer
Use `read` instead of `extract`. It returns the bytes of a file in the zip, given a name. It's important to understand that you're essentially extracting the archive to memory, so be aware of how much data is actually going to be extracted and your limitations in that regard.
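One way to gauge that before reading anything is to look at the uncompressed sizes recorded in the archive. The sketch below is only an illustration: `members_within_limit` and `max_bytes` are hypothetical names, not part of `zipfile`; it relies on the standard `infolist()` method and the `ZipInfo.file_size` attribute.

```python
import zipfile


def members_within_limit(zip_path, max_bytes):
    """Return the names of members whose uncompressed size fits in max_bytes.

    Hypothetical helper for illustration; max_bytes is whatever memory
    budget you are willing to spend on a single member.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # ZipInfo.file_size is the uncompressed size recorded in the archive.
        return [info.filename for info in zf.infolist()
                if info.file_size <= max_bytes]
```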
For example, the following function returns a dict with the names of a zip archive's files as keys, and the files' contents as values:
    from zipfile import ZipFile

    def extract(f):
        # Read every member into memory as bytes; the archive is closed on exit
        with ZipFile(f) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
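If a member is too large to hold in memory at once, a streaming variant may be a better fit. The following is a minimal sketch, not a tested solution for the USPTO files: `iter_xml_documents` is a hypothetical name, and it assumes that every xml declaration starts at the beginning of a line and that the data is UTF-8 encoded. It uses `ZipFile.open`, which returns a file-like object that decompresses on demand, wrapped in `io.TextIOWrapper` so the lines come back as text.

```python
import io
import zipfile


def iter_xml_documents(zip_path, member_name, encoding="utf-8"):
    """Yield each concatenated xml document in a zip member as one string.

    Hypothetical helper; assumes each '<?xml' declaration begins a line.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Nothing is written to disk; the member is decompressed as it is read.
        with zf.open(member_name) as raw:
            lines = io.TextIOWrapper(raw, encoding=encoding)
            buffer = []
            for line in lines:
                # A new declaration closes the document collected so far.
                if line.startswith("<?xml") and buffer:
                    yield "".join(buffer)
                    buffer = []
                buffer.append(line)
            # Emit the final document, which has no declaration after it.
            if buffer:
                yield "".join(buffer)
```

Because this is a generator, only one document is held in memory at a time, and each yielded string can be handed to an xml parser as it arrives, for example `for doc in iter_xml_documents(file, name): ...` inside the question's loop over `fin.namelist()`.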
edited Jan 3 at 22:37
answered Jan 3 at 22:31 by cody