Data structure for maintaining tabular data in memory?

My scenario is as follows: I have a table of data (a handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I chose not to use a database because every option (even SQLite) is overkill for my humble requirement (also, I would like to be able to edit the values offline in a simple way, and nothing is simpler than Notepad).

Assume my data looks as follows (in the file it is separated by commas, without a header; this is just an illustration):

 Row  | Name     | Year   | Priority
------------------------------------
1    | Cat      | 1998   | 1
2    | Fish     | 1998   | 2
3    | Dog      | 1999   | 1
4    | Aardvark | 2000   | 1
5    | Wallaby  | 2000   | 1
6    | Zebra    | 2001   | 3

Remarks:

  1. Row may be a "real" value written to the file, or just an auto-generated value that represents the row number. Either way it exists in memory.
  2. Names are unique.

Things I do with the data:

  1. Look up a row based on either ID (iteration) or name (direct access).
  2. Display the table in different orders based on multiple fields: I need to sort it, e.g. by priority and then year, or year and then priority, etc.
  3. I need to count instances based on sets of parameters, e.g. how many rows have their year between 1997 and 2002, or how many rows are in 1998 with priority > 2, etc.

I know this "cries" for SQL...

I'm trying to figure out the best choice for the data structure. Following are several choices I see:

List of row lists:

a = []
a.append( [1, "Cat", 1998, 1] )
a.append( [2, "Fish", 1998, 2] )
a.append( [3, "Dog", 1999, 1] )
...

List of column lists (there will obviously be an API for add_row etc.):

a = []
a.append( [1, 2, 3, 4, 5, 6] )
a.append( ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] )
a.append( [1998, 1998, 1999, 2000, 2000, 2001] )
a.append( [1, 2, 1, 1, 1, 3] )

Dictionary of column lists (constants can be created to replace the string keys):

a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"]
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001]
a['Priority'] = [1, 2, 1, 1, 1, 3]

Dictionary with keys being tuples of (Row, Field):

# Create constants to avoid string searching
NAME=1
YEAR=2
PRIORITY=3


a={}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...

And I'm sure there are other ways... However, each way has disadvantages when it comes to my requirements (complex ordering and counting).

What's the recommended approach?

EDIT:

To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.


I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.

Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.
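For example, here is a minimal sketch of sorting and counting with a list of row lists (the column indexes are assumed to match the example data above):

from operator import itemgetter

a = [[1, "Cat", 1998, 1], [2, "Fish", 1998, 2], [3, "Dog", 1999, 1]]

# sort by Priority (index 3), then Year (index 2)
by_priority_then_year = sorted(a, key=itemgetter(3, 2))

# count rows with Year between 1997 and 2002
count = sum(1 for row in a if 1997 <= row[2] <= 2002)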

Another thing you can do is have a list of dictionaries.

rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})

This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.
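For instance, a small sketch of sorting and counting with that list of dictionaries (keys and string values assumed as in the snippet above):

# sort by year, then priority
rows.sort(key=lambda r: (r["year"], r["priority"]))

# count rows for a given year
count_1998 = sum(1 for r in rows if r["year"] == "1998")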

Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?

connection = sqlite3.connect(':memory:')

Then you can create/drop/query/update tables in memory with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.

Here is a sample of how one might create and populate the database:

import csv
import sqlite3


db = sqlite3.connect(':memory:')


def init_db(cur):
    cur.execute('''CREATE TABLE foo (
        Row INTEGER,
        Name TEXT,
        Year INTEGER,
        Priority INTEGER)''')


def populate_db(cur, csv_fp):
    rdr = csv.reader(csv_fp)
    cur.executemany('''
        INSERT INTO foo (Row, Name, Year, Priority)
        VALUES (?,?,?,?)''', rdr)


cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()
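And here are a few example queries covering the question's use cases, just as a sketch against the schema above:

# direct lookup by name
cur.execute('SELECT * FROM foo WHERE Name = ?', ('Aardvark',))
print cur.fetchone()

# sort by priority, then year (or any other order)
cur.execute('SELECT * FROM foo ORDER BY Priority, Year')
print cur.fetchall()

# count rows matching a set of parameters
cur.execute('SELECT COUNT(*) FROM foo WHERE Year BETWEEN 1997 AND 2002')
print cur.fetchone()[0]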

If you'd really prefer not to use SQL, you should probably use a list of dictionaries:

lod = []  # "list of dicts"


def populate_lod(lod, csv_fp):
    rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
    lod.extend(rdr)


def query_lod(lod, filter=None, sort_keys=None):
    if filter is not None:
        lod = (r for r in lod if filter(r))
    if sort_keys is not None:
        lod = sorted(lod, key=lambda r: [r[k] for k in sort_keys])
    else:
        lod = list(lod)
    return lod


def lookup_lod(lod, **kw):
    for row in lod:
        for k, v in kw.iteritems():
            if row[k] != str(v):
                break
        else:
            return row
    return None

Testing then yields:

>>> lod = []
>>> populate_lod(lod, csv_fp)
>>>
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print len(query_lod(lod, lambda r:1997 <= int(r['Year']) <= 2002))
6
>>> print len(query_lod(lod, lambda r:int(r['Year'])==1998 and int(r['Priority']) > 2))
0

Personally I like the SQLite version better since it preserves your types better (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.

First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?

You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.

That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.

Above all, measure what operations will be more common and decide which structure will end up costing less.

Have a Table class whose rows attribute is a list of dicts or, better, row objects.

In the Table, do not add rows directly; instead have a method which also updates a few lookup maps, e.g. for name. If you are not adding rows in order, or the ids are not consecutive, you can have an idMap too, e.g.:

class Table(object):
    def __init__(self):
        self.rows = []     # list of row objects; we assume rows are added in id order
        self.nameMap = {}  # for faster direct lookup of a row by name

    def addRow(self, row):
        self.rows.append(row)
        self.nameMap[row['name']] = row

    def getRow(self, name):
        return self.nameMap[name]


table = Table()
table.addRow({'ID': 1, 'name': 'a'})
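Sorting and counting can then be done directly on table.rows; a rough sketch, assuming the rows also carry 'year' and 'priority' keys (they are not part of the example row above):

rows_by_priority_then_year = sorted(table.rows,
                                    key=lambda r: (r['priority'], r['year']))
count_1998 = sum(1 for r in table.rows if r.get('year') == 1998)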

A very old question I know but...

A pandas DataFrame seems to be the ideal option here.

http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html

From the blurb

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure

http://pandas.pydata.org/
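For the question's use cases, a DataFrame sketch might look like this (the filename is illustrative, the CSV is assumed to have no header as in the question, and sort_values requires a reasonably recent pandas; older versions used sort):

import pandas as pd

# 'animals.csv' is an illustrative filename
df = pd.read_csv('animals.csv', header=None,
                 names=['Row', 'Name', 'Year', 'Priority'])

# direct lookup by name
aardvark = df[df['Name'] == 'Aardvark']

# sort by priority, then year (or any other order)
by_priority_then_year = df.sort_values(['Priority', 'Year'])

# count rows matching a set of parameters
count = len(df[(df['Year'] >= 1997) & (df['Year'] <= 2002)])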

I personally wrote a lib for pretty much this quite recently; it is called BD_XML,

as its most fundamental reason for existing is to serve as a way to send data back and forth between XML files and SQL databases.

It is written in Spanish (if that matters in a programming language) but it is very simple.

from BD_XML import Tabla

It defines an object called Tabla (Table); it can be created with a name for identification and a pre-created connection object of a PEP 249-compatible database interface.

Table = Tabla('Animals')

Then you need to add columns with the agregar_columna (add_column) method, which can take various keyword arguments:

  • campo (field): the name of the field

  • tipo (type): the type of data stored; can be things like 'varchar' and 'double', or names of Python objects if you aren't interested in exporting to a database later.

  • defecto (default): sets a default value for the column, used if none is given when you add a row.

  • there are 3 others, but they are only there for database things and not actually functional.

like:

Table.agregar_columna(campo='Name', tipo='str')
Table.agregar_columna(campo='Year', tipo='date')
#declaring it date, time, datetime or timestamp is important for being able to store it as a time object and not only as a number, but you can always put it as an int if you don't care about dates
Table.agregar_columna(campo='Priority', tipo='int')

Then you add the rows with the += operator (or + if you want to create a copy with an extra row):

Table += ('Cat', date(1998,1,1), 1)
Table += {'Year': date(1998,1,1), 'Priority': 2, 'Name': 'Fish'}
#…
#The condition for adding is that the row is a container accessible with either the column name or the position of the column in the table

Then you can generate XML and write it to a file with exportar_XML (export_XML) and escribir_XML (write_XML):

file = os.path.abspath(os.path.join(os.path.dirname(__file__), 'Animals.xml'))
Table.exportar_xml()
Table.escribir_xml(file)

And then import it back with importar_XML (import_XML), giving the file name and an indication that you are using a file and not a string literal:

Table.importar_xml(file, tipo='archivo')
#archivo means file

Advanced

These are ways you can use a Tabla object in a SQL manner.

#UPDATE <Table> SET Name = CONCAT(Name,' ',Priority), Priority = NULL WHERE id = 2
for row in Table:
    if row['id'] == 2:
        row['Name'] += ' ' + row['Priority']
        row['Priority'] = None
print(Table)


#DELETE FROM <Table> WHERE MOD(id,2) = 0 LIMIT 1
n = 0
nmax = 1
for row in Table:
    if row['id'] % 2 == 0:
        del Table[row]
        n += 1
        if n >= nmax: break
print(Table)

These examples assume a column named 'id', but it can be replaced with row.pos for your example:

if row.pos == 2:

The file can be download from:

https://bitbucket.org/WolfangT/librerias