Loading and parsing a JSON file with multiple JSON objects


Best answer

You have a text file in JSON Lines format. You need to parse the file line by line:

import json


data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole the file is not a valid JSON value, because there is no top-level list or object definition.

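Note that because the file contains JSON on each line, you are spared the trouble of parsing everything in one go or of finding a streaming JSON parser. You can process each line separately before moving on to the next, which saves memory. If your file is really big, you probably don't want to append every result to one list and only then process everything.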
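
As a minimal sketch of the "process each line before moving on" idea, a generator keeps only one parsed record in memory at a time. The function name `iter_json_lines` is made up for this illustration, and skipping blank lines is an assumption about the input rather than something the question states:

```python
import json
from typing import Any, Iterator


def iter_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the file at `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # assumption: blank lines are skipped, since json.loads rejects them
                yield json.loads(line)
```

You would then write `for record in iter_json_lines('file'):` and handle each record inside the loop, instead of building the `data` list first.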

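If you have a file containing individual JSON objects with delimiters in between, see "How do I use the 'json' module to read in one JSON object at a time?" for a way to parse out the individual objects using a buffered method.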
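
For the case where objects are not neatly one per line, one possible approach (not necessarily the one in the linked question) is `json.JSONDecoder.raw_decode`, which parses one value from a string and reports where it stopped. This sketch assumes the values are separated only by whitespace and that the whole text fits in memory; the helper name `iter_concatenated_json` is hypothetical:

```python
import json
from typing import Any, Iterator


def iter_concatenated_json(text: str) -> Iterator[Any]:
    """Yield each JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

A truly buffered version would read the file in chunks and retry when a value is cut off at a chunk boundary; this sketch leaves that out to stay short.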


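That file is ill-formatted as a single JSON document. You have one JSON object per line, but they are not contained in a larger data structure (i.e. an array). You'll either need to reformat it so that it begins with [ and ends with ], with a comma at the end of every line except the last, or parse it line by line as separate dictionaries.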
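
If you do want a single array, a rough sketch of the reformatting route is to parse each line and let `json.dump` write the brackets and commas, rather than editing the text by hand. The function name and the file paths are placeholders for this example, and it assumes the whole file fits in memory:

```python
import json


def jsonl_to_array(src_path: str, dst_path: str) -> int:
    """Rewrite a one-object-per-line file as one JSON array; return the record count."""
    with open(src_path, encoding="utf-8") as src:
        # Parse every non-blank line into its own Python value.
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        # json.dump emits "[", the separating commas, and "]" for us.
        json.dump(records, dst)
    return len(records)
```

After this, `json.load` on the new file returns a plain list of dictionaries.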


For those who stumble upon this question: the Python jsonlines library (much younger than this question) elegantly handles files with one JSON document per line. See https://jsonlines.readthedocs.io/


If you are using pandas and want to load the JSON file as a DataFrame, you can use:

import pandas as pd
df = pd.read_json('file.json', lines=True)

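To convert it into a JSON array, you can use:

df.to_json('new_file.json', orient='records')  # the default orient writes an object keyed by column, not an array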


Adding to @arunppsg's answer, but using multiprocessing to handle a large number of files in a directory.

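import multiprocessing as mp
import os
import time

import numpy as np
import pandas as pd


directory = 'your_directory'


def read_json(json_files):
    # Read one chunk of files and combine them into a single DataFrame.
    # If there are multiple lines in the json file, set lines to True, False otherwise.
    # (DataFrame.append was removed in pandas 2.0, so collect frames and concat.)
    frames = [pd.read_json(os.path.join(directory, j), lines=True) for j in json_files]
    return pd.concat(frames) if frames else pd.DataFrame()


def parallelize_json(json_files, func):
    # Split the file list into 10 chunks and read the chunks in a worker pool.
    json_files_split = np.array_split(json_files, 10)
    pool = mp.Pool(mp.cpu_count())
    df = pd.concat(pool.map(func, json_files_split))
    pool.close()
    pool.join()
    return df


if __name__ == '__main__':
    # json_files was undefined in the original snippet; assuming here that
    # every *.json file in the directory should be read.
    json_files = [name for name in os.listdir(directory) if name.endswith('.json')]

    # start the timer
    start = time.time()

    # read all json files in parallel
    df = parallelize_json(json_files, read_json)

    # end the timer
    end = time.time()

    # print the time taken to read all json files
    print(end - start)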