Converting an XML document into a specific dot-expanded JSON structure



























I have the following XML document:



<Item ID="288917">
  <Main>
    <Platform>iTunes</Platform>
    <PlatformID>353736518</PlatformID>
  </Main>
  <Genres>
    <Genre FacebookID="6003161475030">Comedy</Genre>
    <Genre FacebookID="6003172932634">TV-Show</Genre>
  </Genres>
  <Products>
    <Product Country="CA">
      <URL>https://itunes.apple.com/ca/tv-season/id353187108?i=353736518</URL>
      <Offers>
        <Offer Type="HDBUY">
          <Price>3.49</Price>
          <Currency>CAD</Currency>
        </Offer>
        <Offer Type="SDBUY">
          <Price>2.49</Price>
          <Currency>CAD</Currency>
        </Offer>
      </Offers>
    </Product>
    <Product Country="FR">
      <URL>https://itunes.apple.com/fr/tv-season/id353187108?i=353736518</URL>
      <Rating>Tout public</Rating>
      <Offers>
        <Offer Type="HDBUY">
          <Price>2.49</Price>
          <Currency>EUR</Currency>
        </Offer>
        <Offer Type="SDBUY">
          <Price>1.99</Price>
          <Currency>EUR</Currency>
        </Offer>
      </Offers>
    </Product>
  </Products>
</Item>


Currently, to get it into JSON format I'm doing the following:

from lxml import etree
import xmltodict

parser = etree.XMLParser(recover=True)
node = etree.fromstring(s, parser=parser)
data = xmltodict.parse(etree.tostring(node))

Of course, xmltodict is doing the heavy lifting here. However, it gives me a format that is not ideal for what I'm trying to accomplish. Here is what I'd like the end data to look like:



{
    "Item[@ID]": 288917, # if no preceding element, use the root node tag
    "Main.Platform": "iTunes",
    "Main.PlatformID": "353736518",
    "Genres.Genre": ["Comedy", "TV-Show"], # list of elements if repeated
    "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"],
    "Products.Product[@Country]": ["CA", "FR"],
    "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"],
    "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"],
    "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"],
    "Products.Product.Offers.Offer.Currency": "EUR"
}









      python xml recursion elementtree
















      asked Dec 31 '18 at 3:24 by David L
























          2 Answers














          You can use recursion here. One way is to build up the path progressively as you recurse through the XML document, and return a result dictionary at the end, which can then be serialized to JSON.

          The demo below uses the standard library's xml.etree.ElementTree to parse the XML document.



          Demo:



          from xml.etree.ElementTree import ElementTree
          from pprint import pprint

          # Setup XML tree for parsing
          tree = ElementTree()
          tree.parse("sample.xml")
          root = tree.getroot()

          def collect_xml_paths(root, path=None, result=None):
              """Collect XML paths into a dictionary"""

              # Only the top-level call starts without an accumulator
              top_level = result is None
              if top_level:
                  path, result = [], {}

                  # First collect root items
                  for k, v in root.attrib.items():
                      result.setdefault(root.tag + "[@%s]" % k, []).append(v)

              # Go through each child from root
              for child in root:

                  # Extract text (child.text is None for empty elements)
                  text = (child.text or "").strip()

                  # Update path
                  new_path = path + [child.tag]

                  # Create dot separated key
                  key = ".".join(new_path)

                  # Add each attribute to result
                  for k, v in child.attrib.items():
                      attrib_key = key + "[@%s]" % k
                      result.setdefault(attrib_key, []).append(v)

                  # Add text if it exists
                  if text:
                      result.setdefault(key, []).append(text)

                  # Recurse into the child, sharing the same result dictionary
                  collect_xml_paths(child, new_path, result)

              # Inner calls only accumulate into the shared dictionary
              if not top_level:
                  return result

              # Separate single values from list values
              return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

          pprint(collect_xml_paths(root))


          Output:



          {'Genres.Genre': ['Comedy', 'TV-Show'],
          'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
          'Item[@ID]': '288917',
          'Main.Platform': 'iTunes',
          'Main.PlatformID': '353736518',
          'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
          'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
          'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
          'Products.Product.Rating': 'Tout public',
          'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
          'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
          'Products.Product[@Country]': ['CA', 'FR']}


          If you want to serialize this dictionary to JSON, you can use json.dumps():



          from json import dumps

          print(dumps(collect_xml_paths(root)))
          # {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}
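The same idea also works when the XML is already in memory as a string, as with `s` in the question, rather than in a file. The following is a minimal sketch and not part of the original answer: `flatten_xml`, `visit`, and `SAMPLE` are illustrative names, and `SAMPLE` is a trimmed copy of the question's document.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the question's document (assumption: well-formed XML).
SAMPLE = """<Item ID="288917">
  <Main><Platform>iTunes</Platform></Main>
  <Genres>
    <Genre FacebookID="6003161475030">Comedy</Genre>
    <Genre FacebookID="6003172932634">TV-Show</Genre>
  </Genres>
</Item>"""


def flatten_xml(xml_text):
    """Flatten an XML string into a dict keyed by dot-separated paths."""
    root = ET.fromstring(xml_text)
    collected = defaultdict(list)

    def visit(element, path):
        # The root tag is left out of the paths; it only labels the
        # root element's own attributes and text.
        label = ".".join(path) or root.tag
        for name, value in element.attrib.items():
            collected[f"{label}[@{name}]"].append(value)
        text = (element.text or "").strip()  # text is None for empty elements
        if text:
            collected[label].append(text)
        for child in element:
            visit(child, path + [child.tag])

    visit(root, [])
    # Unwrap single values, keep repeated values as lists.
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}


flat = flatten_xml(SAMPLE)
print(json.dumps(flat, indent=2))
```

Because the accumulator is created inside `flatten_xml`, calling it several times does not carry values over from one call to the next.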





          • this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

            – David L
            Dec 31 '18 at 19:40

































          This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:



          from collections import OrderedDict, deque
          from lxml import etree

          def add_value(data, key, value):
              """Store value under key, converting to a list if multiple values"""
              if key in data:
                  if not isinstance(data[key], list):
                      data[key] = [data[key]]
                  data[key].append(value)
              else:
                  data[key] = value

          # file_data is the XML document as a str
          parser = etree.XMLParser(recover=True)
          node = etree.fromstring(file_data.encode('utf-8'), parser=parser)
          data = OrderedDict()
          nodes = deque([(node, '')])  # format is (node, prefix)

          while nodes:
              sub, prefix = nodes.popleft()

              # leave the root tag out of the path; it only labels the root's own values
              if sub is node:
                  path = ''
              else:
                  path = prefix + '.' + sub.tag if prefix else sub.tag
              key = path or sub.tag

              # tag
              text = (sub.text or '').strip()
              if text:
                  add_value(data, key, text)

              # atr
              for k, v in sub.attrib.items():
                  add_value(data, key + '[@%s]' % k, v)

              # queue the children with this element's path as their prefix
              for child in sub:
                  nodes.append((child, path))
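Since the question already runs the document through xmltodict, another option is to flatten the dictionary it returns instead of walking the XML tree again. The sketch below is an assumption-laden illustration, not code from either answer: `flatten_xmltodict` and `walk` are hypothetical names, and `doc` is a hand-written dictionary that mimics xmltodict's default conventions (attributes under `@name` keys, mixed text under `#text`, repeated siblings as a list), so the example runs without installing xmltodict.

```python
import json


def flatten_xmltodict(doc):
    """Flatten an xmltodict-style nested dict into dot-separated keys."""
    # Assumption: the dict has exactly one top-level key, the root tag.
    (root_tag, root), = doc.items()
    collected = {}

    def walk(node, path):
        # The root tag only labels the root element's own values.
        label = path or root_tag
        if isinstance(node, list):
            # Repeated siblings: visit each one under the same path.
            for item in node:
                walk(item, path)
        elif isinstance(node, dict):
            for key, value in node.items():
                if key.startswith("@"):
                    collected.setdefault(f"{label}[@{key[1:]}]", []).append(value)
                elif key == "#text":
                    collected.setdefault(label, []).append(value)
                else:
                    walk(value, f"{path}.{key}" if path else key)
        elif node is not None:
            # Plain text element with no attributes.
            collected.setdefault(label, []).append(node)

    walk(root, "")
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}


# Hand-written stand-in for the parsed data (assumed shape).
doc = {
    "Item": {
        "@ID": "288917",
        "Main": {"Platform": "iTunes", "PlatformID": "353736518"},
        "Genres": {
            "Genre": [
                {"@FacebookID": "6003161475030", "#text": "Comedy"},
                {"@FacebookID": "6003172932634", "#text": "TV-Show"},
            ]
        },
    }
}

flat = flatten_xmltodict(doc)
print(json.dumps(flat, indent=2))
```

One caveat with this route: the nested dictionary groups same-named siblings together, so the order of keys in the result can differ from the document order when different tags are interleaved.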





          share|improve this answer























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53983293%2fconverting-an-xml-doc-into-a-specific-dot-expanded-json-structure%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.



            The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.



            Demo:



            from xml.etree.ElementTree import ElementTree
            from pprint import pprint

            # Setup XML tree for parsing
            tree = ElementTree()
            tree.parse("sample.xml")
            root = tree.getroot()

            def collect_xml_paths(root, path=, result={}):
            """Collect XML paths into a dictionary"""

            # First collect root items
            if not result:
            root_id, root_value = tuple(root.attrib.items())[0]
            root_key = root.tag + "[@%s]" % root_id
            result[root_key] = root_value

            # Go through each child from root
            for child in root:

            # Extract text
            text = child.text.strip()

            # Update path
            new_path = path[:]
            new_path.append(child.tag)

            # Create dot separated key
            key = ".".join(new_path)

            # Get child attributes
            attributes = child.attrib

            # Ensure we have attributes
            if attributes:

            # Add each attribute to result
            for k, v in attributes.items():
            attrib_key = key + "[@%s]" % k
            result.setdefault(attrib_key, ).append(v)

            # Add text if it exists
            if text:
            result.setdefault(key, ).append(text)

            # Recurse through paths once done iteration
            collect_xml_paths(child, new_path)

            # Separate single values from list values
            return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

            pprint(collect_xml_paths(root))


            Output:



            {'Genres.Genre': ['Comedy', 'TV-Show'],
            'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
            'Item[@ID]': '288917',
            'Main.Platform': 'iTunes',
            'Main.PlatformID': '353736518',
            'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
            'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
            'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
            'Products.Product.Rating': 'Tout public',
            'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
            'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
            'Products.Product[@Country]': ['CA', 'FR']}


            If you want to serialize this dictionary to JSON, you can use json.dumps():



            from json import dumps

            print(dumps(collect_xml_paths(root)))
            # {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}





            share|improve this answer


























            • this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

              – David L
              Dec 31 '18 at 19:40
















            1














            You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.



            The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.



            Demo:



            from xml.etree.ElementTree import ElementTree
            from pprint import pprint

            # Setup XML tree for parsing
            tree = ElementTree()
            tree.parse("sample.xml")
            root = tree.getroot()

            def collect_xml_paths(root, path=, result={}):
            """Collect XML paths into a dictionary"""

            # First collect root items
            if not result:
            root_id, root_value = tuple(root.attrib.items())[0]
            root_key = root.tag + "[@%s]" % root_id
            result[root_key] = root_value

            # Go through each child from root
            for child in root:

            # Extract text
            text = child.text.strip()

            # Update path
            new_path = path[:]
            new_path.append(child.tag)

            # Create dot separated key
            key = ".".join(new_path)

            # Get child attributes
            attributes = child.attrib

            # Ensure we have attributes
            if attributes:

            # Add each attribute to result
            for k, v in attributes.items():
            attrib_key = key + "[@%s]" % k
            result.setdefault(attrib_key, ).append(v)

            # Add text if it exists
            if text:
            result.setdefault(key, ).append(text)

            # Recurse through paths once done iteration
            collect_xml_paths(child, new_path)

            # Separate single values from list values
            return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

            pprint(collect_xml_paths(root))


            Output:



            {'Genres.Genre': ['Comedy', 'TV-Show'],
            'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
            'Item[@ID]': '288917',
            'Main.Platform': 'iTunes',
            'Main.PlatformID': '353736518',
            'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
            'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
            'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
            'Products.Product.Rating': 'Tout public',
            'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
            'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
            'Products.Product[@Country]': ['CA', 'FR']}


            If you want to serialize this dictionary to JSON, you can use json.dumps():



            from json import dumps

            print(dumps(collect_xml_paths(root)))
            # {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}





            share|improve this answer


























            • this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

              – David L
              Dec 31 '18 at 19:40














            1












            1








            1







            You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.



            The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.



            Demo:



            from xml.etree.ElementTree import ElementTree
            from pprint import pprint

            # Setup XML tree for parsing
            tree = ElementTree()
            tree.parse("sample.xml")
            root = tree.getroot()

            def collect_xml_paths(root, path=, result={}):
            """Collect XML paths into a dictionary"""

            # First collect root items
            if not result:
            root_id, root_value = tuple(root.attrib.items())[0]
            root_key = root.tag + "[@%s]" % root_id
            result[root_key] = root_value

            # Go through each child from root
            for child in root:

            # Extract text
            text = child.text.strip()

            # Update path
            new_path = path[:]
            new_path.append(child.tag)

            # Create dot separated key
            key = ".".join(new_path)

            # Get child attributes
            attributes = child.attrib

            # Ensure we have attributes
            if attributes:

            # Add each attribute to result
            for k, v in attributes.items():
            attrib_key = key + "[@%s]" % k
            result.setdefault(attrib_key, ).append(v)

            # Add text if it exists
            if text:
            result.setdefault(key, ).append(text)

            # Recurse through paths once done iteration
            collect_xml_paths(child, new_path)

            # Separate single values from list values
            return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

            pprint(collect_xml_paths(root))


            Output:



            {'Genres.Genre': ['Comedy', 'TV-Show'],
            'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
            'Item[@ID]': '288917',
            'Main.Platform': 'iTunes',
            'Main.PlatformID': '353736518',
            'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
            'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
            'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
            'Products.Product.Rating': 'Tout public',
            'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
            'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
            'Products.Product[@Country]': ['CA', 'FR']}


            If you want to serialize this dictionary to JSON, you can use json.dumps():



            from json import dumps

            print(dumps(collect_xml_paths(root)))
            # {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}





            share|improve this answer















            You can use recursion here. One way is to store the paths progressively as your recurse the XML document, and return a result dictionary at the end, which can be serialized to JSON.



            The below demo uses the standard library xml.etree.ElementTree for parsing XML documents.



            Demo:



            from xml.etree.ElementTree import ElementTree
            from pprint import pprint

            # Setup XML tree for parsing
            tree = ElementTree()
            tree.parse("sample.xml")
            root = tree.getroot()

            def collect_xml_paths(root, path=, result={}):
            """Collect XML paths into a dictionary"""

            # First collect root items
            if not result:
            root_id, root_value = tuple(root.attrib.items())[0]
            root_key = root.tag + "[@%s]" % root_id
            result[root_key] = root_value

            # Go through each child from root
            for child in root:

            # Extract text
            text = child.text.strip()

            # Update path
            new_path = path[:]
            new_path.append(child.tag)

            # Create dot separated key
            key = ".".join(new_path)

            # Get child attributes
            attributes = child.attrib

            # Ensure we have attributes
            if attributes:

            # Add each attribute to result
            for k, v in attributes.items():
            attrib_key = key + "[@%s]" % k
            result.setdefault(attrib_key, ).append(v)

            # Add text if it exists
            if text:
            result.setdefault(key, ).append(text)

            # Recurse through paths once done iteration
            collect_xml_paths(child, new_path)

            # Separate single values from list values
            return {k: v[0] if len(v) == 1 else v for k, v in result.items()}

            pprint(collect_xml_paths(root))


            Output:



            {'Genres.Genre': ['Comedy', 'TV-Show'],
            'Genres.Genre[@FacebookID]': ['6003161475030', '6003172932634'],
            'Item[@ID]': '288917',
            'Main.Platform': 'iTunes',
            'Main.PlatformID': '353736518',
            'Products.Product.Offers.Offer.Currency': ['CAD', 'CAD', 'EUR', 'EUR'],
            'Products.Product.Offers.Offer.Price': ['3.49', '2.49', '2.49', '1.99'],
            'Products.Product.Offers.Offer[@Type]': ['HDBUY', 'SDBUY', 'HDBUY', 'SDBUY'],
            'Products.Product.Rating': 'Tout public',
            'Products.Product.URL': ['https://itunes.apple.com/ca/tv-season/id353187108?i=353736518',
            'https://itunes.apple.com/fr/tv-season/id353187108?i=353736518'],
            'Products.Product[@Country]': ['CA', 'FR']}


            If you want to serialize this dictionary to JSON, you can use json.dumps():



            from json import dumps

            print(dumps(collect_xml_paths(root)))
            # {"Item[@ID]": "288917", "Main.Platform": "iTunes", "Main.PlatformID": "353736518", "Genres.Genre[@FacebookID]": ["6003161475030", "6003172932634"], "Genres.Genre": ["Comedy", "TV-Show"], "Products.Product[@Country]": ["CA", "FR"], "Products.Product.URL": ["https://itunes.apple.com/ca/tv-season/id353187108?i=353736518", "https://itunes.apple.com/fr/tv-season/id353187108?i=353736518"], "Products.Product.Offers.Offer[@Type]": ["HDBUY", "SDBUY", "HDBUY", "SDBUY"], "Products.Product.Offers.Offer.Price": ["3.49", "2.49", "2.49", "1.99"], "Products.Product.Offers.Offer.Currency": ["CAD", "CAD", "EUR", "EUR"], "Products.Product.Rating": "Tout public"}
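
If the single-line output is hard to read, `json.dumps()` also accepts `indent` and `sort_keys`. A small sketch on a hand-written dictionary (the `flat` literal here is just an assumed excerpt of the result above, not real output):

```python
import json

# Assumed excerpt of the flattened dictionary, written out by hand
flat = {
    "Item[@ID]": "288917",
    "Genres.Genre": ["Comedy", "TV-Show"],
}

# indent pretty-prints, sort_keys gives a stable key order
text = json.dumps(flat, indent=2, sort_keys=True)
print(text)

# Keys such as "Item[@ID]" are ordinary JSON strings, so they survive a round trip
restored = json.loads(text)
```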














            edited Dec 31 '18 at 5:33

























            answered Dec 31 '18 at 5:14









            RoadRunner













            • this is a perfect answer and I appreciate the thorough comments within the code. I've asked another similar question as well at stackoverflow.com/questions/53990897/….

              – David L
              Dec 31 '18 at 19:40














































            This is a bit verbose, but it wasn't too hard to format this as a flat dict. Here is an example:



            from collections import OrderedDict
            from lxml import etree

            # parser and file_data are the XMLParser(recover=True) and XML string from the question
            node = etree.fromstring(file_data.encode('utf-8'), parser=parser)
            data = OrderedDict()
            nodes = [(node, '')]  # format is (node, prefix)

            def add_value(key, value):
                # convert it to a list if multiple values
                if key in data:
                    if not isinstance(data[key], list):
                        data[key] = [data[key]]
                    data[key].append(value)
                else:
                    data[key] = value

            while nodes:
                # pop from the front rather than removing items while iterating,
                # so no node is skipped and document order is preserved
                sub, prefix = nodes.pop(0)

                # the root tag is only used as the key for the root's own values
                key = sub.tag if sub is node else (prefix + '.' + sub.tag).strip('.')

                # tag
                text = (sub.text or '').strip()  # sub.text is None for empty elements
                if text:
                    add_value(key, text)

                # atr
                for k, v in sub.attrib.items():
                    add_value(key + '[@%s]' % k, v)

                # children inherit this element's key, except directly below the root
                child_prefix = '' if sub is node else key
                for s in sub:
                    nodes.append((s, child_prefix))















                answered Dec 31 '18 at 4:57









                David L





























