How to keep leading zeros in a column when reading CSV with Pandas?

I am importing study data into a Pandas data frame using read_csv.

My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").

When I import into Pandas, the leading zero is stripped of and the column is formatted as int64.

Is there a way to import this column unchanged maybe as a string?

I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.

70959 次浏览

I don't think you can specify a column type the way you want (if there haven't been changes reciently and if the 6 digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.

EDIT: Take a look at Wes Mckinney's blog, there might be something for you. It seems to be that there is a new parser from pandas 0.10 coming in November.

As indicated in this answer by Lev Landau, there could be a simple solution to use converters option for a certain column in read_csv function.

converters={'column_name': str}

Let's say I have csv file projects.csv like below:

project_name,project_id
Some Project,000245
Another Project,000478

As for example below code is trimming leading zeros:

from pandas import read_csv


dataframe = read_csv('projects.csv')
print dataframe

Result:

      project_name  project_id
0     Some Project         245
1  Another Project         478

Solution code example:

from pandas import read_csv


dataframe = read_csv('projects.csv', converters={'project_id': str})
print dataframe

Required result:

      project_name project_id
0     Some Project     000245
1  Another Project     000478

To have all columns as str:

pd.read_csv('sample.csv', dtype=str)

To have certain columns as str:

# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)

here is a shorter, robust and fully working solution:

simply define a mapping (dictionary) between variable names and desired data type:

dtype_dic= {'subject_id': str,
'subject_number' : 'float'}

use that mapping with pd.read_csv():

df = pd.read_csv(yourdata, dtype = dtype_dic)

et voila!

If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code. You can do the following:

df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file

You could also do:

df = pd.read_csv("your_file.csv", dtype=str)

By doing this you will have all your columns as strings and you won't lose any leading zeros.

You can use converters to convert number to fixed width if you know the width.

For example, if the width is 5, then

data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{x:05}"})

This will do the trick. It works for pandas==0.23.0 and also read_excel.

Python3.6 or higher required.

You Can do This , Works On all Versions of Pandas

pd.read_csv('filename.csv', dtype={'zero_column_name': object})

As an example, consider the following my_data.txt file:

id,A
03,5
04,6

To preserve the leading zeros for the id column:

df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6