Converting a Django QuerySet to a pandas DataFrame

I am converting a Django QuerySet to a pandas DataFrame as follows:

qs = SomeModel.objects.select_related().filter(date__year=2012)
q = qs.values('date', 'OtherField')
df = pd.DataFrame.from_records(q)

This works, but is there a more efficient way?


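From the Django side (I'm not familiar with pandas), this is fine. My only concern is that with a very large number of records you could run into memory problems. If that happens, you would need something along the lines of this memory-efficient queryset iterator. (The snippet as written may need some rewriting so that it works cleanly with .values().)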
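As a rough illustration of the batching idea, here is a minimal sketch, not the linked snippet itself. It assumes the rows arrive as an iterable of dicts, for example from `qs.values('date', 'OtherField').iterator()`, which streams rows without filling the queryset's result cache. The helper name `frame_in_chunks` and the sample rows are hypothetical, and plain dicts stand in for the queryset so the example runs without a Django project.

```python
from itertools import islice

import pandas as pd


def frame_in_chunks(rows, chunk_size=10_000):
    """Build a DataFrame from an iterable of dicts, one batch at a time."""
    rows = iter(rows)
    frames = []
    while True:
        # Pull at most chunk_size rows so only one batch of dicts is alive at once.
        batch = list(islice(rows, chunk_size))
        if not batch:
            break
        frames.append(pd.DataFrame.from_records(batch))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-in for qs.values('date', 'OtherField').iterator()
sample_rows = ({"date": f"2012-01-{d:02d}", "OtherField": d} for d in range(1, 8))
df = frame_in_chunks(sample_rows, chunk_size=3)
```

The final DataFrame is the same size either way. What this avoids is holding the full list of Python dicts in memory alongside it.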

import pandas as pd
import datetime
from myapp.models import BlogPost


df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(list(BlogPost.objects.filter(date__gte=datetime.datetime(2012, 5, 1)).values()))


# limit which fields
df = pd.DataFrame(list(BlogPost.objects.all().values('author', 'date', 'slug')))

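The above is how I do the same thing. The most useful addition is specifying the fields you are interested in. If you only need a subset of the available fields, I would expect this to improve performance.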

You could perhaps use model_to_dict:

import pandas as pd
from django.forms.models import model_to_dict

# model_to_dict() turns each instance into a plain dict of its editable field values
pallobjs = [model_to_dict(pallobj) for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')]
df = pd.DataFrame(pallobjs)
df.head()

Django Pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/

From the README:

class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()


from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)

Building the DataFrame from values_list() is more memory efficient than building it from values(). values() yields one dict (key: value pairs) per row, whereas values_list() yields one plain tuple per row. In my test this saved about 50% of the memory; you just need to pass the column names when you call pd.DataFrame().

Method 1:
queryset = models.xxx.objects.values("A","B","C","D")
df = pd.DataFrame(list(queryset))  # uses a lot of memory
#df = pd.DataFrame.from_records(queryset)  # works, but makes little difference to memory usage


Method 2:
queryset = models.xxx.objects.values_list("A","B","C","D")
df = pd.DataFrame(list(queryset), columns=["A","B","C","D"])  # about half the memory in my test
#df = pd.DataFrame.from_records(queryset, columns=["A","B","C","D"])  # did not work for me: it crashed because the data is a QuerySet, not a list

I tested this in my project with more than 1 million rows of data, and peak memory dropped from 2 GB to 1 GB.
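To get a feel for where the saving comes from, the per-row container overhead of dicts and tuples can be compared directly. This is only a rough sketch using made-up rows, not a measurement of Django or pandas. `sys.getsizeof` counts just the container, not the values inside it, and the values here are shared by both representations, so the difference is pure dict-versus-tuple overhead. Exact numbers vary by Python version.

```python
import sys

columns = ["A", "B", "C", "D"]

# Hypothetical rows: tuples mimic values_list(), dicts mimic values().
tuple_rows = [(i, i * 2, str(i), float(i)) for i in range(1000)]
dict_rows = [dict(zip(columns, row)) for row in tuple_rows]

# Sum the size of each row container (the values themselves are not counted).
tuple_bytes = sum(sys.getsizeof(row) for row in tuple_rows)
dict_bytes = sum(sys.getsizeof(row) for row in dict_rows)

print(f"tuples: {tuple_bytes} bytes, dicts: {dict_bytes} bytes")
```

How much of this carries over to peak process memory depends on the data, so a figure such as the 50% above is best checked against your own tables.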