pd.NA vs np.nan in pandas: which one should be used with pandas, and why? What are the main advantages and disadvantages of each when used with pandas?

Some sample code that uses them both:

import pandas as pd
import numpy as np

df = pd.DataFrame({'object': ['a', 'b', 'c', pd.NA],
                   'numeric': [1, 2, np.nan, 4],
                   'categorical': pd.Categorical(['d', np.nan, 'f', 'g'])
                   })

output:

|    | object   |   numeric | categorical   |
|---:|:---------|----------:|:--------------|
|  0 | a        |         1 | d             |
|  1 | b        |         2 | nan           |
|  2 | c        |       nan | f             |
|  3 | <NA>     |         4 | g             |
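
Note that isna() treats each of these markers as missing, whatever the dtype (a quick check of my own, not part of the original sample):

df.isna()
#    object  numeric  categorical
# 0   False    False        False
# 1   False    False         True
# 2   False     True        False
# 3    True    False        False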

As of now (the pandas 1.0.0 release) I would recommend using it carefully.

First, it’s still an experimental feature:

Experimental: the behaviour of pd.NA can still change without warning.

Second, the behaviour differs from np.nan:

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.

Both quotes are from the release notes.
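
As a quick illustration of that difference (my own sketch, not from the release notes):

import pandas as pd
import numpy as np

# np.nan always compares as plain False ...
print(np.nan == 1)    # False
# ... while pd.NA propagates as "unknown"
print(pd.NA == 1)     # <NA>

# in boolean contexts pd.NA follows three-valued (Kleene) logic
print(True | pd.NA)   # True
print(False | pd.NA)  # <NA>
print(False & pd.NA)  # False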

To give an additional example, I was surprised by the interpolation behaviour:

Create a simple DataFrame:

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})
df
#       a    b
# 0     0  0.0
# 1  <NA>  NaN
# 2     2  2.0

and try to interpolate:

df.interpolate()
#       a    b
# 0     0  0.0
# 1  <NA>  1.0
# 2     2  2.0

There are some reasons for that (I am still discovering them; one plausible explanation is sketched below). Anyway, I just wanted to highlight those differences – it is an experimental feature, and it behaves differently in some cases.
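
A plausible explanation (my own observation, not a definitive one): mixing pd.NA with plain Python ints leaves column "a" with the object dtype, and interpolate() only fills numeric columns:

df.dtypes
# a     object
# b    float64
# dtype: object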

I think it will be a very useful feature, but I would be really careful with statements like “it should be completely fine to use it instead of np.nan”. That might be true in most cases, but it can cause trouble when you are not aware of the differences.

According to the docs

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types

So if you have columns with multiple dtypes, use pd.NA; otherwise np.nan should be fine.

However, since pd.NA seems to offer the same functionality as np.nan, it might just be better to use pd.NA for all your missing-value purposes.

That way you only need one import.
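
For example, pd.isna() recognises every missing-value marker, so code written against pd.NA still detects the older sentinels (a small check of my own):

import pandas as pd
import numpy as np

for val in (pd.NA, np.nan, None, pd.NaT):
    print(repr(val), pd.isna(val))
# <NA> True
# nan True
# None True
# NaT True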

Both pd.NA and np.nan denote missing values in a DataFrame.
The main difference I have noticed is that np.nan is a floating-point value, while pd.NA is a dtype-agnostic singleton.
If you have column1 with all integers and some missing values in your dataset, and the missing values are replaced by np.nan, then the dtype of the column becomes float, since np.nan is a float.
But if you have column2 with all integers and some missing values, and the missing values are represented by pd.NA (using the nullable Int64 dtype), then the column stays an integer column, since pd.NA is not tied to any particular dtype.
This is useful if you want to keep a column as int rather than have it silently promoted to float.
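
A minimal sketch of that difference (note that the nullable dtype has to be requested explicitly, e.g. via dtype="Int64"; a bare pd.Series([1, 2, pd.NA]) would fall back to object):

import pandas as pd
import numpy as np

# np.nan silently promotes the integer column to float64
s1 = pd.Series([1, 2, np.nan])
print(s1.dtype)   # float64

# pd.NA works with the nullable Int64 extension dtype,
# so the column keeps its integer semantics
s2 = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s2.dtype)   # Int64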

pd.NA is the new guy in town and is pandas' own null value. Many pandas datatypes are borrowed from NumPy, and that includes np.nan.

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

Let's build a DataFrame with all the different dtypes.

import pandas as pd
import numpy as np

d = {'int': pd.Series([1, None], dtype=np.dtype("O")),
     'float': pd.Series([3.0, np.nan], dtype=np.dtype("float")),
     'str': pd.Series(['test', None], dtype=np.dtype("str")),
     'bool': pd.Series([True, np.nan], dtype=np.dtype("O")),
     'date': pd.Series(['1/1/2000', np.nan], dtype=np.dtype("O"))}
df1 = pd.DataFrame(data=d)

df1['date'] = pd.to_datetime(df1['date'], errors="coerce")
df1.info()
df1

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   int     1 non-null      object        
 1   float   1 non-null      float64       
 2   str     1 non-null      object        
 3   bool    1 non-null      object        
 4   date    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 208.0+ bytes
    int   float str     bool    date
0   1     3.0   test    True    2000-01-01
1   None  NaN   None    NaN     NaT

If you have a DataFrame or Series using traditional types with missing data represented by np.nan, both Series and DataFrame offer a convenience method convert_dtypes() that converts the data to the newer nullable dtypes for integers, strings and booleans (and, from v1.2, floats). Below, convert_integer=False is passed for the float column so that it becomes the nullable Float64 rather than Int64.

df1[['int', 'str', 'bool', 'date']] = df1[['int', 'str', 'bool', 'date']].convert_dtypes()
df1['float'] = df1['float'].convert_dtypes(convert_integer=False)
df1.info()
df1

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   int     1 non-null      Int64         
 1   float   1 non-null      Float64       
 2   str     1 non-null      string        
 3   bool    1 non-null      boolean       
 4   date    1 non-null      datetime64[ns]
dtypes: Float64(1), Int64(1), boolean(1), datetime64[ns](1), string(1)
memory usage: 200.0 bytes
    int     float   str     bool    date
0   1       3.0     test    True    2000-01-01
1   <NA>    <NA>    <NA>    <NA>    NaT

Note the capital ‘F’ in Float64, which distinguishes the nullable pandas dtype from np.float32 or np.float64. Also note string, the new pandas StringDtype (from pandas 1.0), which is not str or object, and the nullable integer dtype Int64 (from pandas 0.24) with a capital ‘I’, which is not np.int64.
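
To make the naming concrete (a small sketch of my own; pd.array is the documented constructor for these extension arrays):

import pandas as pd
import numpy as np

print(pd.array([1, None], dtype="Int64"))      # nullable integer, shows <NA>
print(pd.array([1.5, None], dtype="Float64"))  # nullable float, shows <NA>
print(pd.array(['a', None], dtype="string"))   # StringDtype, shows <NA>

# the lowercase NumPy dtypes cannot represent pd.NA at all
print(np.array([1, 2], dtype=np.int64).dtype)  # int64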

For more on datatypes read here and here. This page has some good info on subtypes.

I am using pandas v1.2.4, so hopefully in time we will have a universal null value for all datatypes, which will warm our hearts.

Warning: this is new and experimental, so use it carefully for now.

pd.NA was introduced in the pandas 1.0.0 release.

I would recommend using it over np.nan: since it is part of the pandas library itself, it should integrate best with DataFrames.

pd.NA is still experimental (https://pandas.pydata.org/docs/user_guide/missing_data.html) and can have undesired outcomes.

For example:

import pandas as pd
df = pd.DataFrame({'id':[1,2,3]})
df.id.replace(2, pd.NA, inplace=True)
df.id.replace(3, pd.NA, inplace=True)

Pandas 1.2.4:

     id
0     1
1  <NA>
2     3

Pandas 1.4.2:

AttributeError: 'bool' object has no attribute 'to_numpy'

It appears that replacing a value with pd.NA changes the DataFrame in a way that makes the second replacement fail.

The same code with np.nan works without problems.

import pandas as pd
import numpy as np
df = pd.DataFrame({'id':[1,2,3]})
df.id.replace(2, np.nan, inplace=True)
df.id.replace(3, np.nan, inplace=True)
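
If you do want pd.NA here, one workaround (my own sketch; behaviour may vary between pandas versions) is to cast to the nullable Int64 dtype first and use mask() instead of an inplace replace():

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3]})
# mask() fills the matched positions with the dtype's native
# missing value, which for Int64 is pd.NA
df['id'] = df['id'].astype('Int64').mask(df['id'].isin([2, 3]))
print(df)
#      id
# 0     1
# 1  <NA>
# 2  <NA>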