How do I convert an XML file to a nice pandas DataFrame?


Best answer

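You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here is what I would do (when reading from a file, replace xml_data with the name of your file or a file object):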

import pandas as pd
import xml.etree.ElementTree as ET
import io


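def iter_docs(author):
    # Start each row from the parent <author> attributes,
    # then add the <document> attributes and its text.
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict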


xml_data = io.StringIO(u'''YOUR XML STRING HERE''')


etree = ET.parse(xml_data) #create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If the original document has multiple authors, or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row

Then change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))

Have a look at the ElementTree tutorial provided in the xml library documentation.
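
To make the approach above concrete, here is a minimal sketch run on a small made-up XML string. The element names author and document follow the answer; the library root element and all attribute names and values are assumptions chosen purely for illustration.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample data: the <library> root and all attributes are made up.
xml_data = io.StringIO('''<library>
  <author name="Ann" country="NZ">
    <document index="1">first text</document>
    <document index="2">second text</document>
  </author>
  <author name="Bob" country="UK">
    <document index="3">third text</document>
  </author>
</library>''')


def iter_docs(author):
    """Yield one dict per <document>, merged with its parent <author> attributes."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


def iter_author(etree):
    """Walk every <author> in the tree, whatever the root element is."""
    for author in etree.iter('author'):
        yield from iter_docs(author)


etree = ET.parse(xml_data)  # ET.parse accepts a file name or a file-like object
doc_df = pd.DataFrame(list(iter_author(etree)))
print(doc_df)
```

Each row carries the author attributes alongside the document attributes, so the frame has one row per document rather than one row per author.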


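Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works for reading from a file.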

import pandas as pd
import xml.etree.ElementTree as ET


xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'


etree = ET.fromstring(xml_str)
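dfcols = ['id', 'name']


# Collect one row per <data> element, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow.)
rows = [[i.get('id'), i.get('name')] for i in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)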


df.head()


You can also convert by building a list of dictionaries, one per element, and then converting that directly to a DataFrame:

import xml.etree.ElementTree as ET
import pandas as pd


# Contents of test.xml
# <?xml version="1.0" encoding="utf-8"?>
# <tags>
#   <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
#   <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
#   <row Id="3" TagName="elicitation" Count="10" />
#   <row Id="5" TagName="open-source" Count="16" />
# </tags>


root = ET.parse('test.xml').getroot()


tags = {"tags":[]}
for elem in root:
    # Copy the attributes of interest from each <row> element.
    tag = {}
    tag["Id"] = elem.attrib['Id']
    tag["TagName"] = elem.attrib['TagName']
    tag["Count"] = elem.attrib['Count']
    tags["tags"].append(tag)


df_users = pd.DataFrame(tags["tags"])
df_users.head()
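
The loop above copies three named attributes. If you want every attribute, including ones that only some rows have (such as ExcerptPostId in the sample), a possible variation is to pass each element's full attribute dictionary to pandas. This is a sketch rather than part of the original answer: it parses the sample from an inline string instead of test.xml, and the name df_tags is an assumption.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Same sample rows as test.xml above, inlined so the snippet is self-contained.
xml_str = '''<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>'''

root = ET.fromstring(xml_str)

# One dict per <row>; attributes missing from a row become NaN in the frame.
rows = [dict(elem.attrib) for elem in root]
df_tags = pd.DataFrame(rows)

# Attribute values are always strings, so convert numeric columns explicitly.
df_tags["Count"] = pd.to_numeric(df_tags["Count"])
print(df_tags)
```

Because XML attribute values arrive as strings, any column you plan to do arithmetic on needs an explicit conversion like the one shown for Count.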


Chiming in to recommend the (咒语) library. It handles XML text well, and I have used it to ingest an XML file with nearly a million records.


Starting with pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)
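
As a small illustration of that call, the sketch below reads the tags sample used in an earlier answer from an in-memory buffer. The sample data is reused for illustration only. The xpath and parser arguments are optional: parser="etree" is passed here only so the snippet does not need lxml installed, which is an assumption about your environment.

```python
import io

import pandas as pd

# Sample XML reused from an earlier answer, wrapped in a file-like object.
xml_str = '''<tags>
  <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
  <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" />
  <row Id="3" TagName="elicitation" Count="10" />
  <row Id="5" TagName="open-source" Count="16" />
</tags>'''

# xpath selects the elements that become rows; attributes become columns.
df = pd.read_xml(io.StringIO(xml_str), xpath=".//row", parser="etree")
print(df)
```

With a file on disk you would pass its path in place of the StringIO object.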