Pandas: convert dtype 'object' to int

I read a SQL query into pandas and the values come back with dtype 'object', even though they are strings, dates, and integers. I was able to convert the date 'object' to a pandas datetime dtype, but I get an error when trying to convert the strings and the integers.

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
id    date          purchase
1  abc1  2016-05-22    1
2  abc2  2016-05-29    0
3  abc3  2016-05-22    2
4  abc4  2016-05-22    0


>>> df.dtypes
id          object
date        object
purchase    object
dtype: object

Converting df['date'] to datetime works:

>>> pd.to_datetime(df['date'])
1  2016-05-22
2  2016-05-29
3  2016-05-22
4  2016-05-22
Name: date, dtype: datetime64[ns]

However, when I try to convert df['purchase'] to an integer, I get an error:

>>> df['purchase'].astype(int)
....
pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()


TypeError: long() argument must be a string or a number, not 'java.lang.Long'

Note: I also get a similar error when trying .astype('float').

And when trying to convert to a string, nothing seems to happen:

>>> df['id'].apply(str)
1 abc1
2 abc2
3 abc3
4 abc4
Name: id, dtype: object

Documenting the answer that worked for me based on the comment by @piRSquared.

I needed to convert to a string first, then an integer.

>>> df['purchase'].astype(str).astype(int)
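
For illustration, on a toy Series of numeric strings (stand-in values, not the original SQL result), the two-step cast gives:

>>> import pandas as pd
>>> s = pd.Series(['1', '0', '2'], dtype=object)  # stand-in for df['purchase']
>>> s.astype(str).astype(int)
0    1
1    0
2    2
dtype: int64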

It's simple

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
labels
# array([0, 0, 1, 2, 0])
uniques
# array(['b', 'a', 'c'], dtype=object)

Follow these steps:

1. Clean your file: open your data file in CSV format, check for "?" in place of empty cells, and delete all of them (a combined sketch follows after these steps).

2. Drop the rows containing missing values, e.g.:

df.dropna(subset=["normalized-losses"], axis = 0 , inplace= True)

3. Now use astype for the conversion:

df["normalized-losses"]=df["normalized-losses"].astype(int)

Note: If you still get errors, inspect your CSV file again: open it in Excel to check whether there is a "?" in your required column, delete it, save the file, and re-run your program.

Comment "success!" if it works. :)

My training data contains three features that are object dtype. Applying astype converts the object into numeric, but before that you need to perform some preprocessing steps:

train.dtypes


C12       object
C13       object
C14       object


train['C14'] = train.C14.astype(int)


train.dtypes


C12       object
C13       object
C14       int32
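
The preprocessing itself isn't shown above; one possible sketch, assuming the object columns hold numeric strings with stray whitespace or blanks, is:

import numpy as np
import pandas as pd

train = pd.DataFrame({'C14': [' 1', '2 ', '']}, dtype=object)  # toy stand-in

train['C14'] = train['C14'].str.strip().replace('', np.nan)  # normalize blanks to NaN
train = train.dropna(subset=['C14'])                          # drop rows without a value
train['C14'] = train.C14.astype(int)                          # the cast now succeeds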

pandas >= 1.0

convert_dtypes

The (self) accepted answer doesn't take into consideration the possibility of NaNs in object columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan],
    'b': [True, False, np.nan]}, dtype=object)
df


a      b
0    1   True
1    2  False
2  NaN    NaN


df['a'].astype(str).astype(int) # raises ValueError

This chokes because the NaN is converted to a string "nan", and further attempts to coerce to integer will fail. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes:

df.convert_dtypes()


a      b
0     1   True
1     2  False
2  <NA>   <NA>


df.convert_dtypes().dtypes


a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with your ints, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want


0      1
1      2
2    ...
dtype: string


# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')


0    1.0
1    2.0
2    NaN
dtype: float64


# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes()


0       1
1       2
2    <NA>
dtype: Int64

Cannot comment so posting this as an answer, which is somewhat in between @piRSquared/@cyril's solution and @cs95's:

As noted by @cs95, if your data contains NaNs or Nones, converting to string type will throw an error when trying to convert to int afterwards.

However, if your data consists of (numerical) strings, using convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).

In the case that your data consists only of numerical strings (including NaNs or Nones but without any non-numeric "junk"), a possibly simpler alternative would be to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24) (see also this answer):

df['purchase'].astype(float).astype('Int64')
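
As a toy illustration (assumed values, including one missing entry):

import numpy as np
import pandas as pd

s = pd.Series(['1', '0', np.nan], dtype=object)  # stand-in for df['purchase']
s.astype(float).astype('Int64')

# 0       1
# 1       0
# 2    <NA>
# dtype: Int64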

Note that there has been recent discussion on this on github (currently an -unresolved- closed issue though) and that in the case of very long 64-bit integers you may have to convert explicitly to float128 to avoid approximations during the conversions.

This was my data

## list of columns
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS','CO', 'O3', 'Nox', 'SO2']


for i in l1:
    for j in range(0, 8431):  # rows = 8431
        df[i][j] = int(df[i][j])

I recommend using this only with small data; the nested loops visit every cell individually, so the complexity is O(rows × columns).
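
For larger frames, a vectorized cast over the same columns avoids the per-cell loop (a sketch, assuming the columns already hold clean numeric values):

df[l1] = df[l1].astype(int)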

In my case, I had a df with mixed data:

df:
0   1   2    ...                  242                  243                  244
0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
....

The floats are actually objects, but I need them to be real floats.

To fix it, referencing @AMC's comment above:

def coerce_to_float(val):
    try:
        return float(val)
    except ValueError:
        return val


df = df.applymap(lambda x: coerce_to_float(x))
df['col_name'] = pd.to_numeric(df['col_name'])

This is a better option

To change the data type and save it back into the DataFrame, you need to assign the converted column, as follows:

ds["cat"] = pd.to_numeric(ds["cat"])

or

ds["cat"] = ds["cat"].astype(int)

If these methods fail, you can try a list comprehension such as this:

df["int_column"] = [int(x) if x.isnumeric() else x for x in df["str_column"] ]

Converting object to numerical int or float. The code is:

df["total_sqft"] = pd.to_numeric(df["total_sqft"], errors='coerce').fillna(0, downcast='infer')

Use the astype function to convert the dtype of that column.
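
A minimal sketch (assuming the column holds clean numeric strings):

df["purchase"] = df["purchase"].astype(int)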