Open a zip file and stream the xml file inside of the zip file



I am trying to open bulk data from the USPTO. The XML files within the zips are concatenated XML files containing multiple XML declarations, and they are quite large. I am trying to read lines from the XML only until I get to the next XML declaration. I found this related question, without code.

What I want to create is a function that does the following:

For each *.zip file

Extract all xml file(s) (or open xml file(s) for reading)

Read lines from the xml file(s)

Append each line until the next xml declaration

Return the string

So far, I've been able to open the zip file, find all the xml file(s) and extract each xml file. I would prefer to not write the xml file to disk, but instead create a string that is a single xml document that I then further parse.

import glob
import zipfile


def main():
    path = 'bulk/'
    allFiles = glob.glob(path + '*.zip')
    allFiles.sort()

    for file in allFiles:
        try:
            with zipfile.ZipFile(file, mode='r', allowZip64=True) as fin:
                print(fin, '- ok')
                print(fin.namelist())
                for name in fin.namelist():
                    if name.endswith('xml'):
                        print(name)  # all files that end in 'xml'
                        fin.extract(name, path='bulk/')
                        print('extracted ', name)
                        # TODO function to read lines of the xml file and
        except zipfile.BadZipFile:
            print(file, '- Bad zip file')


if __name__ == '__main__':
    main()

asked Jan 3 at 21:49 by Britt
edited Jan 3 at 22:38 by cody

python xml zipfile


1 Answer

Use read instead of extract. Given the name of a member of the archive, it returns that member's contents as bytes. Keep in mind that this decompresses the whole file into memory, so check how much data will actually be loaded and whether your machine can hold it.

For example, the following function returns a dict with the names of a zip archive's files as keys, and the files' contents as values:

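from zipfile import ZipFile


def extract(f):
    # Map each member name to its decompressed contents (bytes).
    # The with block closes the archive once everything has been read.
    with ZipFile(f) as zf:
        return {name: zf.read(name) for name in zf.namelist()}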
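
If the members are too large to hold in memory at once, a streaming variant may be worth trying. The following is only a minimal sketch: it uses ZipFile.open, which returns a file-like object that decompresses on demand, and it assumes every XML declaration starts at the beginning of a line and that the data is UTF-8. The name iter_xml_documents is a hypothetical helper, not part of zipfile.

```python
import io
import zipfile


def iter_xml_documents(zip_path, encoding="utf-8"):
    """Yield each concatenated XML document in the archive as a string.

    Assumption: every document begins with an XML declaration ("<?xml")
    at the start of a line.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            # ZipFile.open gives a binary stream that is decompressed as it
            # is read, so the whole member is never loaded at once.
            with zf.open(name) as raw:
                lines = []
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    # A new declaration marks the start of the next document,
                    # so emit what has been collected so far.
                    if line.startswith("<?xml") and lines:
                        yield "".join(lines)
                        lines = []
                    lines.append(line)
                # Emit the final document of this member.
                if lines:
                    yield "".join(lines)
```

Looping over iter_xml_documents(path_to_zip) then hands one document at a time to whatever parser you use, so only a single document needs to fit in memory rather than the whole file.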

answered Jan 3 at 22:31 by cody
edited Jan 3 at 22:37
