How to determine whether a column/variable is numeric or not in Pandas/NumPy?
Is there a better way to determine whether a variable in Pandas and/or NumPy is numeric or not?
I have a self-defined dictionary with dtypes as keys and numeric / not as values.
In pandas 0.20.2 you can do:
>>> import pandas as pd
>>> from pandas.api.types import is_string_dtype
>>> from pandas.api.types import is_numeric_dtype
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1.0, 2.0, 3.0]})
>>> is_string_dtype(df['A'])
True
>>> is_numeric_dtype(df['B'])
True
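The same check can be applied to every column at once. This is only a minimal sketch; the extra column C and the name numeric_columns are made up for illustration and are not part of the answer above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame: one text column and two numeric ones.
df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'B': [1.0, 2.0, 3.0],
                   'C': [1, 2, 3]})

# Keep the names of the columns whose dtype pandas considers numeric.
numeric_columns = [name for name in df.columns if is_numeric_dtype(df[name])]
print(numeric_columns)
```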
You can use np.issubdtype to check whether the dtype is a sub-dtype of np.number. Examples:
np.issubdtype(arr.dtype, np.number) # where arr is a numpy array
np.issubdtype(df['X'].dtype, np.number) # where df['X'] is a pandas Series
This works for NumPy's dtypes but fails for pandas-specific types like pd.Categorical, as Thomas noted. If you are using categoricals, the is_numeric_dtype function from pandas is a better alternative than np.issubdtype.
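To make the categorical caveat concrete, here is a rough sketch. It assumes that np.issubdtype raises TypeError when handed a pandas categorical dtype, which is what happens on the versions I am aware of. The helper name numpy_says_numeric is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

cat = pd.Series(['a', 'b', 'a'], dtype='category')
floats = pd.Series([1.0, 2.0, 3.0])


def numpy_says_numeric(series):
    """Hypothetical wrapper: treat dtypes NumPy cannot interpret as non-numeric."""
    try:
        return bool(np.issubdtype(series.dtype, np.number))
    except TypeError:
        # Assumption: NumPy rejects pandas extension dtypes with TypeError.
        return False


print(numpy_says_numeric(floats))  # plain NumPy dtype, handled directly
print(numpy_says_numeric(cat))     # categorical, handled via the except branch
print(is_numeric_dtype(cat))       # pandas copes with the dtype without help
```

The wrapper only exists to keep the NumPy check from blowing up; if pandas is already imported, calling is_numeric_dtype directly is simpler.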
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
'C': [1j, 2j, 3j], 'D': ['a', 'b', 'c']})
df
Out:
A B C D
0 1 1.0 1j a
1 2 2.0 2j b
2 3 3.0 3j c
df.dtypes
Out:
A int64
B float64
C complex128
D object
dtype: object
np.issubdtype(df['A'].dtype, np.number)
Out: True
np.issubdtype(df['B'].dtype, np.number)
Out: True
np.issubdtype(df['C'].dtype, np.number)
Out: True
np.issubdtype(df['D'].dtype, np.number)
Out: False
For multiple columns you can use np.vectorize:
is_number = np.vectorize(lambda x: np.issubdtype(x, np.number))
is_number(df.dtypes)
Out: array([ True, True, True, False], dtype=bool)
And for selection, pandas now has select_dtypes:
df.select_dtypes(include=[np.number])
Out:
A B C
0 1 1.0 1j
1 2 2.0 2j
2 3 3.0 3j
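If only the column names are needed rather than the data, the result of select_dtypes can be reduced to a list. A small sketch reusing the frame from this answer; the variable names numeric_names and other_names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
                   'C': [1j, 2j, 3j], 'D': ['a', 'b', 'c']})

# include= keeps the matching columns, exclude= keeps everything else.
numeric_names = df.select_dtypes(include=[np.number]).columns.tolist()
other_names = df.select_dtypes(exclude=[np.number]).columns.tolist()

print(numeric_names)
print(other_names)
```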
Based on @jaime's answer in the comments, you need to check .dtype.kind for the column of interest. For example:
>>> import pandas as pd
>>> df = pd.DataFrame({'numeric': [1, 2, 3], 'not_numeric': ['A', 'B', 'C']})
>>> df['numeric'].dtype.kind in 'biufc'
True
>>> df['not_numeric'].dtype.kind in 'biufc'
False
NB: The meaning of biufc: b = bool, i = int (signed), u = unsigned int, f = float, c = complex. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.kind.html#numpy.dtype.kind
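A short sketch of the kind codes on a mixed frame may help. The column names and the helper is_numeric_kind are invented for illustration. Note that this test counts bool columns as numeric because 'b' is in the string.

```python
import numpy as np
import pandas as pd

NUMERIC_KINDS = 'biufc'  # bool, signed int, unsigned int, float, complex


def is_numeric_kind(series):
    """Hypothetical helper: True if the column's dtype kind is in NUMERIC_KINDS."""
    return series.dtype.kind in NUMERIC_KINDS


df = pd.DataFrame({
    'flag': [True, False],
    'small': np.array([1, 2], dtype='uint8'),
    'ratio': [0.5, 1.5],
    'when': pd.to_datetime(['2013-01-01', '2013-01-02']),
    'label': ['x', 'y'],
})

# One-character kind code per column, then the columns that pass the test.
kinds = {name: df[name].dtype.kind for name in df.columns}
numeric_by_kind = [name for name in df.columns if is_numeric_kind(df[name])]

print(kinds)
print(numeric_by_kind)
```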
Pandas has a select_dtypes method. You can easily filter your columns on int64 and float64 like this:
df.select_dtypes(include=['int64','float64'])
_get_numeric_data is a pseudo-internal method that returns only the numeric columns:
In [27]: df = DataFrame(dict(A = np.arange(3),
B = np.random.randn(3),
C = ['foo','bar','bah'],
D = Timestamp('20130101')))
In [28]: df
Out[28]:
A B C D
0 0 -0.667672 foo 2013-01-01 00:00:00
1 1 0.811300 bar 2013-01-01 00:00:00
2 2 2.020402 bah 2013-01-01 00:00:00
In [29]: df.dtypes
Out[29]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
In [30]: df._get_numeric_data()
Out[30]:
A B
0 0 -0.667672
1 1 0.811300
2 2 2.020402
How about just checking type for one of the values in the column? We’ve always had something like this:
isinstance(x, (int, long, float, complex))  # long exists only in Python 2; drop it on Python 3
When I check the dtypes of the columns in the dataframe below, I get 'object' rather than the numerical type I'm expecting:
from datetime import datetime, timedelta
import pandas as pd

df = pd.DataFrame(columns=('time', 'test1', 'test2'))
for i in range(20):
    df.loc[i] = [datetime.now() - timedelta(hours=i*1000), i*10, i*100]
df.dtypes
time datetime64[ns]
test1 object
test2 object
dtype: object
When I do the following, it seems to give me an accurate result:
isinstance(df['test1'][len(df['test1'])-1], (int, long, float, complex))  # Python 2; drop long on Python 3
returns True.
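On Python 3, where long is gone, one way to write the same element-wise test is with the standard library's numbers.Number, which NumPy's numeric scalar types are registered with. This is a sketch only; the helper column_holds_numbers is hypothetical, and it inspects every value rather than just the last one.

```python
import numbers

import numpy as np
import pandas as pd


def column_holds_numbers(series):
    """Hypothetical helper: True if the column is non-empty and every value is a number."""
    return len(series) > 0 and all(isinstance(v, numbers.Number) for v in series)


# Object-dtype columns, as produced by row-wise assignment in the example above.
mixed_numbers = pd.Series([10, 2.5, np.int64(3)], dtype=object)
with_text = pd.Series([10, 'twenty'], dtype=object)

print(column_holds_numbers(mixed_numbers))
print(column_holds_numbers(with_text))
```

Checking values is slower than checking the dtype, so it is mainly useful for object columns like test1 and test2 where the dtype says nothing about the contents. Python's bool is a subclass of int, so True and False count as numbers here.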
You can also try:
df_dtypes = np.array(df.dtypes)
df_numericDtypes = [x.kind in 'biufc' for x in df_dtypes]
It returns a list of booleans: True if numeric, False if not.
Just to add to all the other answers, one can also use df.info() to see the data type of each column.
You can check whether a given column contains numeric values or not using dtypes:
numerical_features = [feature for feature in train_df.columns if train_df[feature].dtypes != 'O']
Note: "O" must be a capital letter. This keeps every column whose dtype is not object, so datetime columns also pass the test.
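A quick sketch of how the not-object test differs from a stricter numeric test. The frame and its column names are made up; is_numeric_dtype comes from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical training frame: a number, a timestamp, and free text.
train_df = pd.DataFrame({
    'n': [1, 2],
    'when': pd.to_datetime(['2013-01-01', '2013-01-02']),
    's': pd.Series(['a', 'b'], dtype=object),
})

# Everything that is not object dtype, versus what pandas calls numeric.
non_object = [c for c in train_df.columns if train_df[c].dtype != 'O']
strictly_numeric = [c for c in train_df.columns if is_numeric_dtype(train_df[c])]

print(non_object)
print(strictly_numeric)
```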