Converting an xml doc into a specific dot-expanded json structure

I have the following XML document:

<Item ID="288917">
  <Main>
    <Platform>iTunes</Platform>
    <PlatformID>353736518</PlatformID>
  </Main>
  <Genres>
    <Genre FacebookID="6003161475030">Comedy</Genre>
    <Genre FacebookID="6003172932634">TV-Show</Genre>
  </Genres>
  <Products>
    <Product Country="CA">
      <URL>https://itunes.apple.com/ca/tv-season/id353187108?i=353736518</URL>
      <Offers>
        <Offer Type="HDBUY">
          <Price>3.49</Price>
          <Currency>CAD</Currency>
        </Offer>
        <Offer Type="SDBUY">
          <Price>2.49</Price>
          <Currency>CAD</Currency>
        </Offer>
      </Offers>
    </Product>
    <Product Country="FR">
      <URL>https://itunes.apple.com/fr/tv-season/id353187108?i=353736518</URL>
      <Rating>Tout public</Rating>
      <Offers>
        <Offer Type="HDBUY">
          <Price>2.49</Price>
          <Currency>EUR</Currency>
        </Offer>
        <Offer Type="SDBUY">
          <Price>1.99</Price>
          <Currency>EUR</Currency>
        </Offer>
      </Offers>
    </Product>
  </Products>
</Item>

Currently, to get it into json format I'm doing the following:

from lxml import etree
import xmltodict

# s holds the raw XML text shown above
parser = etree.XMLParser(recover=True)
node = etree.fromstring(s, parser=parser)
data = xmltodict.parse(etree.tostring(node))

Of course, xmltodict is doing the heavy lifting here. However, the nested format it returns is not ideal for what I'm trying to accomplish. Here is what I'd like the end data to look like:

{
    "Item[@ID]": 288917, # if no preceding element, use the root node tag
    "Main.Platform": "iTunes",
    "Main.PlatformID": "353736518",
    "Genres.Genre": ["Comedy", "TV-Show"], # list of elements if repeated
    "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"],
    "Products.Product[@Country]": ["CA", "FR"],
    "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"],
    "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"],
    "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"],
    "Products.Product.Offers.Offer.Currency": "EUR"
}

asked Dec 31 '18 at 3:24

David L

python xml recursion elementtree


2 Answers

You can use recursion here. One way is to build up the path progressively as you recurse through the XML document, and return a result dictionary at the end, which can then be serialized to JSON.

The demo below uses the standard library xml.etree.ElementTree for parsing XML documents.

Demo:

from xml.etree.ElementTree import ElementTree
from pprint import pprint

# Setup XML tree for parsing
tree = ElementTree()
tree.parse("sample.xml")
root = tree.getroot()

def collect_xml_paths(root, path=None, result=None):
    """Collect XML paths into a dictionary"""

    # Top-level call: start with a fresh path and result
    # (None defaults avoid sharing one list/dict between calls)
    top_level = result is None
    if top_level:
        path, result = [], {}

        # First collect root attributes, keyed by the root tag
        for root_id, root_value in root.attrib.items():
            root_key = root.tag + "[@%s]" % root_id
            result[root_key] = [root_value]

    # Go through each child from root
    for child in root:

        # Extract text (child.text is None for empty elements)
        text = (child.text or "").strip()

        # Update path
        new_path = path + [child.tag]

        # Create dot separated key
        key = ".".join(new_path)

        # Add each attribute to result
        for k, v in child.attrib.items():
            attrib_key = key + "[@%s]" % k
            result.setdefault(attrib_key, []).append(v)

        # Add text if it exists
        if text:
            result.setdefault(key, []).append(text)

        # Recurse into the child, sharing the same result dictionary
        collect_xml_paths(child, new_path, result)

    # Nested calls only fill in result; the top-level call formats it
    if not top_level:
        return result

    # Separate single values from list values
    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

pprint(collect_xml_paths(root))

Output:

{'Genres.Genre': ['Comedy', 'TV-Show'],
 'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
 'Item[@ID]': '288917',
 'Main.Platform': 'iTunes',
 'Main.PlatformID': '353736518',
 'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
 'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
 'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
 'Products.Product.Rating': 'Tout public',
 'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
                          'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
 'Products.Product[@Country]': ['CA', 'FR']}

If you want to serialize this dictionary to JSON, you can use json.dumps():

from json import dumps

print(dumps(collect_xml_paths(root)))

# {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}
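One thing worth keeping in mind with a recursive helper that accumulates into a dictionary: if that dictionary is a default argument written as `{}`, Python creates it once and reuses it on every call, so a second call can see values left over from the first. The sketch below is only an illustration of that behaviour; `collect_shared` and `collect_fresh` are made-up names, not part of the demo above.

```python
def collect_shared(value, result={}):
    # The {} default is created once, when the function is defined,
    # so every call without an explicit result reuses the same dict.
    result.setdefault("values", []).append(value)
    return result


def collect_fresh(value, result=None):
    # A None default lets each top-level call start with its own dict.
    if result is None:
        result = {}
    result.setdefault("values", []).append(value)
    return result


first = collect_shared("a")
second = collect_shared("b")
print(second)              # {'values': ['a', 'b']}
print(first is second)     # True

print(collect_fresh("a"))  # {'values': ['a']}
print(collect_fresh("b"))  # {'values': ['b']}
```

For a one-off script this may never show up, but it matters as soon as the function is called for more than one document in the same process.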

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner


this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40


This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:

from collections import OrderedDict
from lxml import etree

# file_data holds the raw XML text
parser = etree.XMLParser(recover=True)
node = etree.fromstring(file_data.encode('utf-8'), parser=parser)
data = OrderedDict()

def add_value(key, value):
    # convert it to a list if multiple values
    if key in data:
        if not isinstance(data[key], list):
            data[key] = [data[key]]
        data[key].append(value)
    else:
        data[key] = value

nodes = [(node, '')]  # format is (node, prefix); used as a stack

while nodes:
    sub, prefix = nodes.pop()

    # drop the root tag from the key unless sub is the root itself
    key = sub.tag if sub is node else (prefix + '.' + sub.tag).strip('.')

    # atr
    for k, v in sub.attrib.items():
        add_value(key + '[@%s]' % k, v)

    # tag (sub.text is None for empty elements)
    text = (sub.text or '').strip()
    if text:
        add_value(key, text)

    # push children in reverse so they are popped in document order
    child_prefix = '' if sub is node else key
    for s in reversed(list(sub)):
        nodes.append((s, child_prefix))
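A side note on the work-list pattern used for this kind of iterative traversal: removing entries from a list inside a `for` loop over that same list makes the iterator skip elements, which can scramble the order in which repeated values are collected. The sketch below shows the skip on a plain list and then a pop-based alternative using the standard library parser; `tags_in_document_order` is a made-up helper name for illustration only.

```python
import xml.etree.ElementTree as ET

# Removing from a list while a for loop walks it skips entries.
pending = ["a", "b", "c", "d"]
visited = []
for item in pending:
    visited.append(item)
    pending.remove(item)
print(visited)  # ['a', 'c']


def tags_in_document_order(root):
    """Hypothetical helper: visit every element once, parents before children."""
    order = []
    stack = [root]
    while stack:
        element = stack.pop()
        order.append(element.tag)
        # Push children reversed so the first child is popped first
        stack.extend(reversed(list(element)))
    return order


sample = ET.fromstring("<Item><Main><Platform/></Main><Genres/></Item>")
print(tags_in_document_order(sample))  # ['Item', 'Main', 'Platform', 'Genres']
```

Popping one entry per iteration of the `while` loop keeps the traversal in document order, so values gathered under the same key line up with the order of the elements in the XML.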

answered Dec 31 '18 at 4:57

David L

38516

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53983293%2fconverting-an-xml-doc-into-a-specific-dot-expanded-json-structure%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.

The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.

Demo:

from xml.etree.ElementTree import ElementTree

from pprint import pprint



# Setup XML tree for parsing

tree = ElementTree()

tree.parse("sample.xml")

root = tree.getroot()



def collect_xml_paths(root, path=, result={}):

    """Collect XML paths into a dictionary"""



    # First collect root items

    if not result:

        root_id, root_value = tuple(root.attrib.items())[0]

        root_key = root.tag + "[@%s]" % root_id

        result[root_key] = root_value



    # Go through each child from root

    for child in root:



        # Extract text

        text = child.text.strip()



        # Update path

        new_path = path[:]

        new_path.append(child.tag)



        # Create dot separated key

        key = ".".join(new_path)



        # Get child attributes

        attributes = child.attrib



        # Ensure we have attributes

        if attributes:



            # Add each attribute to result

            for k, v in attributes.items():

                attrib_key = key + "[@%s]" % k

                result.setdefault(attrib_key, ).append(v)



        # Add text if it exists

        if text:

            result.setdefault(key, ).append(text)



        # Recurse through paths once done iteration

        collect_xml_paths(child, new_path)



    # Separate single values from list values

    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}



pprint(collect_xml_paths(root))

Output:

{'Genres.Genre': ['Comedy', 'TV-Show'],

 'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],

 'Item[@ID]': '288917',

 'Main.Platform': 'iTunes',

 'Main.PlatformID': '353736518',

 'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],

 'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],

 'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],

 'Products.Product.Rating': 'Tout public',

 'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',

                      'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],

 'Products.Product[@Country]': ['CA', 'FR']}

If you want to serialize this dictionary to JSON, you can use json.dumps():

from json import dumps



print(dumps(collect_xml_paths(root)))

# {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40

add a comment |

You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.

The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.

Demo:

from xml.etree.ElementTree import ElementTree

from pprint import pprint



# Setup XML tree for parsing

tree = ElementTree()

tree.parse("sample.xml")

root = tree.getroot()



def collect_xml_paths(root, path=, result={}):

    """Collect XML paths into a dictionary"""



    # First collect root items

    if not result:

        root_id, root_value = tuple(root.attrib.items())[0]

        root_key = root.tag + "[@%s]" % root_id

        result[root_key] = root_value



    # Go through each child from root

    for child in root:



        # Extract text

        text = child.text.strip()



        # Update path

        new_path = path[:]

        new_path.append(child.tag)



        # Create dot separated key

        key = ".".join(new_path)



        # Get child attributes

        attributes = child.attrib



        # Ensure we have attributes

        if attributes:



            # Add each attribute to result

            for k, v in attributes.items():

                attrib_key = key + "[@%s]" % k

                result.setdefault(attrib_key, ).append(v)



        # Add text if it exists

        if text:

            result.setdefault(key, ).append(text)



        # Recurse through paths once done iteration

        collect_xml_paths(child, new_path)



    # Separate single values from list values

    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}



pprint(collect_xml_paths(root))

Output:

{'Genres.Genre': ['Comedy', 'TV-Show'],

 'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],

 'Item[@ID]': '288917',

 'Main.Platform': 'iTunes',

 'Main.PlatformID': '353736518',

 'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],

 'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],

 'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],

 'Products.Product.Rating': 'Tout public',

 'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',

                      'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],

 'Products.Product[@Country]': ['CA', 'FR']}

If you want to serialize this dictionary to JSON, you can use json.dumps():

from json import dumps



print(dumps(collect_xml_paths(root)))

# {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40

add a comment |

You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.

The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.

Demo:

from xml.etree.ElementTree import ElementTree

from pprint import pprint



# Setup XML tree for parsing

tree = ElementTree()

tree.parse("sample.xml")

root = tree.getroot()



def collect_xml_paths(root, path=, result={}):

    """Collect XML paths into a dictionary"""



    # First collect root items

    if not result:

        root_id, root_value = tuple(root.attrib.items())[0]

        root_key = root.tag + "[@%s]" % root_id

        result[root_key] = root_value



    # Go through each child from root

    for child in root:



        # Extract text

        text = child.text.strip()



        # Update path

        new_path = path[:]

        new_path.append(child.tag)



        # Create dot separated key

        key = ".".join(new_path)



        # Get child attributes

        attributes = child.attrib



        # Ensure we have attributes

        if attributes:



            # Add each attribute to result

            for k, v in attributes.items():

                attrib_key = key + "[@%s]" % k

                result.setdefault(attrib_key, ).append(v)



        # Add text if it exists

        if text:

            result.setdefault(key, ).append(text)



        # Recurse through paths once done iteration

        collect_xml_paths(child, new_path)



    # Separate single values from list values

    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}



pprint(collect_xml_paths(root))

Output:

{'Genres.Genre': ['Comedy', 'TV-Show'],

 'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],

 'Item[@ID]': '288917',

 'Main.Platform': 'iTunes',

 'Main.PlatformID': '353736518',

 'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],

 'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],

 'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],

 'Products.Product.Rating': 'Tout public',

 'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',

                      'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],

 'Products.Product[@Country]': ['CA', 'FR']}

If you want to serialize this dictionary to JSON, you can use json.dumps():

from json import dumps



print(dumps(collect_xml_paths(root)))

# {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.

The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.

Demo:

from xml.etree.ElementTree import ElementTree

from pprint import pprint



# Setup XML tree for parsing

tree = ElementTree()

tree.parse("sample.xml")

root = tree.getroot()



def collect_xml_paths(root, path=, result={}):

    """Collect XML paths into a dictionary"""



    # First collect root items

    if not result:

        root_id, root_value = tuple(root.attrib.items())[0]

        root_key = root.tag + "[@%s]" % root_id

        result[root_key] = root_value



    # Go through each child from root

    for child in root:



        # Extract text

        text = child.text.strip()



        # Update path

        new_path = path[:]

        new_path.append(child.tag)



        # Create dot separated key

        key = ".".join(new_path)



        # Get child attributes

        attributes = child.attrib



        # Ensure we have attributes

        if attributes:



            # Add each attribute to result

            for k, v in attributes.items():

                attrib_key = key + "[@%s]" % k

                result.setdefault(attrib_key, ).append(v)



        # Add text if it exists

        if text:

            result.setdefault(key, ).append(text)



        # Recurse through paths once done iteration

        collect_xml_paths(child, new_path)



    # Separate single values from list values

    return {k: v[0] if len(v) == 1 else v for k, v in result.items()}



pprint(collect_xml_paths(root))

Output:

{'Genres.Genre': ['Comedy', 'TV-Show'],

 'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],

 'Item[@ID]': '288917',

 'Main.Platform': 'iTunes',

 'Main.PlatformID': '353736518',

 'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],

 'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],

 'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],

 'Products.Product.Rating': 'Tout public',

 'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',

                      'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],

 'Products.Product[@Country]': ['CA', 'FR']}

If you want to serialize this dictionary to JSON, you can use json.dumps():

from json import dumps



print(dumps(collect_xml_paths(root)))

# {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

edited Dec 31 '18 at 5:33

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

answered Dec 31 '18 at 5:14

RoadRunner

11.2k31340

this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40

add a comment |

this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40

this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

– David L
Dec 31 '18 at 19:40

add a comment |

This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:

from collections import OrderedDict, deque

from lxml import etree

# file_data holds the XML document as a string
parser = etree.XMLParser(recover=True)
node = etree.fromstring(file_data.encode('utf-8'), parser=parser)

data = OrderedDict()

def add_value(key, value):
    # keep a single value as-is; convert it to a list if multiple values
    if key in data:
        if not isinstance(data[key], list):
            data[key] = [data[key]]
        data[key].append(value)
    else:
        data[key] = value

nodes = deque([(node, '')])  # format is (node, path of the parent)

while nodes:
    sub, prefix = nodes.popleft()

    # leave the root tag out of the path unless it's for the root's own attributes
    path = '' if sub is node else (prefix + '.' + sub.tag).strip('.')
    name = path or sub.tag

    # tag (sub.text is None for an element with no text)
    text = (sub.text or '').strip()
    if text:
        add_value(name, text)

    # atr
    for k, v in sub.attrib.items():
        add_value(name + '[@%s]' % k, v)

    # queue the child elements, skipping comments and processing instructions
    for s in sub:
        if isinstance(s.tag, str):
            nodes.append((s, path))
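
For anyone who wants to try the idea without installing lxml, here is a minimal sketch of the same breadth-first flattening using only the standard library's `xml.etree.ElementTree`, followed by a `json.dumps` call to get the JSON text. The names `flatten_xml` and `add_value` are hypothetical helpers introduced for illustration, and `sample` is a trimmed copy of the question's document. Treat it as a sketch under those assumptions rather than a drop-in replacement for the code above (in particular, the standard-library parser has no `recover=True` mode).

```python
import json
import xml.etree.ElementTree as ET
from collections import deque


def flatten_xml(xml_text):
    """Flatten an XML string into a dict keyed by dotted paths (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    data = {}

    def add_value(key, value):
        # The first value is stored as-is; repeats turn the entry into a list.
        if key not in data:
            data[key] = value
        elif isinstance(data[key], list):
            data[key].append(value)
        else:
            data[key] = [data[key], value]

    queue = deque([(root, "")])  # (element, dotted path of its parent)
    while queue:
        elem, parent_path = queue.popleft()
        # The root tag is left out of paths but still names the root's own attributes.
        path = "" if elem is root else (parent_path + "." + elem.tag).strip(".")
        name = path or elem.tag
        text = (elem.text or "").strip()  # elem.text is None when there is no text
        if text:
            add_value(name, text)
        for attr, value in elem.attrib.items():
            add_value("%s[@%s]" % (name, attr), value)
        queue.extend((child, path) for child in elem)
    return data


sample = """<Item ID="288917">
  <Main><Platform>iTunes</Platform></Main>
  <Genres>
    <Genre FacebookID="6003161475030">Comedy</Genre>
    <Genre FacebookID="6003172932634">TV-Show</Genre>
  </Genres>
</Item>"""

flat = flatten_xml(sample)
print(json.dumps(flat, indent=2))
```

One consequence of this flat layout is that a key becomes a list only when the path occurs more than once, so an element that appears under a single `Product` (such as `Rating` in the question's document) stays a plain string and is no longer aligned with the per-product lists.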

answered Dec 31 '18 at 4:57

David L

38516

