Python Pandas read_csv: skip rows but keep the header

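I'm having trouble figuring out how to skip n rows in a CSV file while keeping the header, which is the first row.

What I want to do is iterate over the file but keep the header from the first row. skiprows makes the header the first row after the skipped rows. What is the best way to do this?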

data = pd.read_csv('test.csv', sep='|', header=0, skiprows=10, nrows=10)

You can pass a list of row numbers to skiprows instead of an integer.

By giving the function the integer 10, you're just skipping the first 10 lines.

To keep row 0 (as the header) and skip rows 1 through 9, so that reading resumes at row 10, you can write:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
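
A minimal sketch to check this behaviour, assuming a small in-memory stand-in for test.csv (the `id` and `value` column names and the 20 data rows are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: a header line plus 20 data rows.
lines = ["id|value"] + [f"{i}|v{i}" for i in range(1, 21)]
csv_text = "\n".join(lines)

# Row 0 (the header) is kept, rows 1-9 are skipped, and the next 10 rows are read.
df = pd.read_csv(io.StringIO(csv_text), sep="|", skiprows=range(1, 10), nrows=10)

print(list(df.columns))
print(df["id"].tolist())
```

The column names still come from the first line of the file, while the data starts at row 10.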

Other ways to skip rows using read_csv

The two main ways to control which rows read_csv uses are the header and skiprows parameters.

Suppose we have the following CSV file with one column:

a
b
c
d
e
f

In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")).

  • Read all lines as values (no header; column names default to integers):

    >>> pd.read_csv(f, header=None)
       0
    0  a
    1  b
    2  c
    3  d
    4  e
    5  f
    
  • Use a particular row as the header (skip all lines before that):

    >>> pd.read_csv(f, header=3)
       d
    0  e
    1  f
    
  • Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):

    >>> pd.read_csv(f, header=[2, 4])
       c
       e
    0  f
    
  • Skip N rows from the start of the file (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=3)
       d
    0  e
    1  f
    
  • Skip one or more rows by giving the row indices (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=[2, 4])
       a
    0  b
    1  d
    2  f
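
The examples above can be reproduced with a short script. This is only a sketch: `make_file` is a hypothetical helper, added because a StringIO buffer is exhausted after one read and has to be rebuilt (or rewound) before each call.

```python
import io

import pandas as pd


def make_file():
    # Rebuild the six-line, one-column file from the examples above.
    return io.StringIO("\n".join("abcdef"))


# No header: every line is data and the column is named 0.
no_header = pd.read_csv(make_file(), header=None)

# Line 3 ("d") becomes the header; the lines before it are dropped.
header_row_3 = pd.read_csv(make_file(), header=3)

# Skip the first 3 lines; the first remaining line ("d") is the header.
skip_first_3 = pd.read_csv(make_file(), skiprows=3)

# Skip lines 2 and 4 ("c" and "e"); line 0 ("a") stays as the header.
skip_2_and_4 = pd.read_csv(make_file(), skiprows=[2, 4])

print(no_header[0].tolist())
print(header_row_3["d"].tolist())
print(skip_first_3["d"].tolist())
print(skip_2_and_4["a"].tolist())
```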
    

To expand on @AlexRiley's answer, the skiprows argument takes a list of row numbers that determines which rows to skip. So:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))

is the same as:

pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])

The best way to ignore specific rows is to create your ignore list (either manually or with a function like range, which produces a sequence of integers) and pass it to skiprows.
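
As an alternative sketch, skiprows also accepts a callable that is evaluated against each row index and returns True for the rows to skip, so the ignore rule does not have to be written out as a list. The sample data and the `id`/`value` column names below are hypothetical:

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: a header line plus 20 data rows.
csv_text = "\n".join(["id|value"] + [f"{i}|v{i}" for i in range(1, 21)])

# Same effect as skiprows=range(1, 10): keep row 0, skip rows 1-9.
df = pd.read_csv(io.StringIO(csv_text), sep="|", skiprows=lambda i: 1 <= i < 10)

print(list(df.columns))
print(df["id"].tolist())
```

A callable is handy when the rule is easier to state as a condition (for example, every other row) than as an explicit list.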

If you're iterating through a long csv file, you can use the chunksize argument. If for some reason you need to manually step through it, you can try the following as long as you know how many iterations you need to go through:

for i in range(num_iters):
    # Keep row 0 as the header, skip the i*10 data rows already read, then read the next 10.
    df = pd.read_csv('test.csv', sep='|', header=0,
                     skiprows=range(1, i*10 + 1), nrows=10)
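
For the chunksize route mentioned above, a minimal sketch might look like this. It assumes an in-memory stand-in for test.csv with made-up `id` and `value` columns and 25 data rows; with chunksize set, read_csv returns an iterator of DataFrames instead of a single DataFrame.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: a header line plus 25 data rows.
csv_text = "\n".join(["id|value"] + [f"{i}|v{i}" for i in range(1, 26)])

chunk_sizes = []
first_ids = []

# Each chunk is a DataFrame of up to 10 rows that carries the header's column names.
for chunk in pd.read_csv(io.StringIO(csv_text), sep="|", header=0, chunksize=10):
    assert list(chunk.columns) == ["id", "value"]
    chunk_sizes.append(len(chunk))
    first_ids.append(int(chunk["id"].iloc[0]))

print(chunk_sizes)
print(first_ids)
```

The file is parsed once rather than being reopened for every block of rows, and the header is applied to every chunk automatically.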
    

Great answers already. Consider this generalized scenario:

Say your xls/csv has junk rows in the top 2 rows (rows #0 and #1). Row #2 (the 3rd row) is the real header, and you want to load 10 rows starting from row #50 (i.e. the 51st row).

Here's the snippet:

pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
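
A small sketch to check how header and skiprows combine here, assuming a made-up pipe-delimited file in which each data line stores its own line number in a hypothetical `row` column:

```python
import io

import pandas as pd

# Hypothetical file: two junk lines, the real header on line 2, data on lines 3-79.
lines = ["junk|x", "junk|y", "row|value"] + [f"{n}|v{n}" for n in range(3, 80)]
csv_text = "\n".join(lines)

# Lines 3-49 are skipped; among the remaining lines, index 2 is still the header.
df = pd.read_csv(io.StringIO(csv_text), sep="|", header=2,
                 skiprows=range(3, 50), nrows=10)

print(list(df.columns))
print(df["row"].tolist())
```

Because the skipped range starts after the header line, the header index is unaffected and the data begins at line 50.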

If you need to skip specific rows, say the first 3 rows (0, 1, 2) and then 2 more rows (4, 5), you can use the following. Row 3 is the first row that is not skipped, so it is retained as the header:

df = pd.read_csv(file_in, delimiter='\t', skiprows=[0,1,2,4,5], encoding='utf-16', usecols=cols)
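
A self-contained sketch of the same call, with assumptions stated: the tab-delimited sample text and the `name`, `score` and `note` columns are invented, and encoding is omitted because the data is already an in-memory string rather than a UTF-16 file on disk.

```python
import io

import pandas as pd

# Hypothetical tab-delimited file: junk on lines 0-2 and 4-5, header on line 3.
lines = [
    "junk0", "junk1", "junk2",
    "name\tscore\tnote",
    "junk4", "junk5",
    "ann\t1\tx",
    "bob\t2\ty",
]
file_in = io.StringIO("\n".join(lines))
cols = ["name", "score"]  # columns to keep, standing in for the original cols

# Line 3 is the first line that is not skipped, so it supplies the column names.
df = pd.read_csv(file_in, delimiter="\t", skiprows=[0, 1, 2, 4, 5], usecols=cols)

print(list(df.columns))
print(df["name"].tolist())
```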