I have a Pandas DataFrame with a date column (e.g. 2013-04-01) of dtype datetime.date. When I include that column in X_train and try to fit the regression model, I get the error float() argument must be a string or a number. Removing the date column avoids this error.

What is the proper way to take the date into account in the regression model?
data = sql.read_frame(...)
X_train = data.drop('y', axis=1)
y_train = data.y
rf = RandomForestRegressor().fit(X_train, y_train)
TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295
    296         n_samples, self.n_features_ = X.shape

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    318
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number
The best way is to explode the date into a set of categorical features encoded in boolean form using the 1-of-K encoding (e.g. as done by DictVectorizer). Here are some features that can be extracted from a date:
- hour of the day (24 boolean features)
- day of the week (7 boolean features)
- day of the month (up to 31 boolean features)
- month of the year (12 boolean features)
- year (as many boolean features as there are distinct years in your dataset)
That should make it possible to identify linear dependencies on periodic events over typical human life cycles.
Additionally you can also extract the date as a single float: convert each date to the number of days since the min date of your training set and divide by the number of days between the max date and the min date. That numerical feature should make it possible to identify long term trends between the output and the event date: e.g. a linear slope in a regression problem, to better predict the evolution in forthcoming years, which cannot be encoded with the boolean categorical variable for the year feature.
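A minimal sketch of both ideas, using pandas get_dummies for the boolean expansion (the toy data and column names are illustrative):

```python
import pandas as pd

# toy frame standing in for the training data
df = pd.DataFrame({'date': pd.to_datetime(['2013-04-01', '2013-05-01', '2014-04-01'])})

# 1-of-K encoding of the periodic components; get_dummies does the
# boolean expansion that DictVectorizer would otherwise perform
parts = pd.DataFrame({
    'dayofweek': df['date'].dt.dayofweek,
    'day': df['date'].dt.day,
    'month': df['date'].dt.month,
    'year': df['date'].dt.year,
})
encoded = pd.get_dummies(parts.astype(str))

# single float feature in [0, 1]: days since the min date, scaled by the span
span = (df['date'].max() - df['date'].min()).days
df['date_float'] = (df['date'] - df['date'].min()).dt.days / span
```

The boolean columns capture the periodic structure, while date_float carries the long-term trend.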
You have two options. You can convert the date to an ordinal, i.e. an integer representing the number of days since year 1 day 1. You can do this with Python's datetime.date.toordinal() method.
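For example, the ordinal conversion can be sketched like this (the series of datetime.date values is illustrative):

```python
import datetime
import pandas as pd

# illustrative series of datetime.date values
dates = pd.Series([datetime.date(2013, 4, 1), datetime.date(2013, 5, 1)])

# toordinal() maps each date to the number of days since 0001-01-01,
# giving a single integer feature that preserves ordering
ordinals = dates.map(datetime.date.toordinal)
```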
Alternatively, you can turn the dates into categorical variables using sklearn’s OneHotEncoder. What it does is create a new variable for each distinct date. So instead of something like column
date with values
['2013-04-01', '2013-05-01'], you will have two columns,
date_2013_04_01 with values
[1, 0] and
date_2013_05_01 with values [0, 1].
I would recommend using the
toordinal approach if you have many different dates, and the one hot encoder if the number of distinct dates is small (let’s say up to 10 – 100, depending on the size of your data and what sort of relation the date has with the output variable).
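A short sketch of the one-hot expansion; pandas' get_dummies is substituted here for sklearn's OneHotEncoder just to keep the example self-contained:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-01', '2013-05-01']})

# one new column per distinct date, 1 where the row matches that date
dummies = pd.get_dummies(df['date'], prefix='date')
```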
Before doing boolean encoding using the 1-of-K encoding suggested by @ogrisel, you may try enriching your data and playing with the number of features that you can extract from the datetime-type, i.e. day of week, day of month, day of year, week of year, quarter, etc.
See for example https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.weekofyear.html and links to other functions.
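With the pandas .dt accessor that enrichment might look like the following (the feature set is illustrative; note that weekofyear is deprecated in recent pandas in favour of isocalendar().week):

```python
import pandas as pd

df = pd.DataFrame({'ts': pd.to_datetime(['2013-04-01', '2013-07-15'])})

df['dayofweek'] = df['ts'].dt.dayofweek      # Monday == 0
df['dayofyear'] = df['ts'].dt.dayofyear
df['quarter'] = df['ts'].dt.quarter
df['week'] = df['ts'].dt.isocalendar().week  # ISO week of year
```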
Often it’s better to keep the number of features low, and often not much information from the timestamp is needed. In my case it was enough to keep the date as a day difference from the initial timestamp. This preserves the ordering and leaves you with only one (ordinal) feature.
df['DAY_DELTA'] = (df.TIMESTAMP - df.TIMESTAMP.min()).dt.days
Of course this will not capture behaviour within a single day (hour-dependent effects), so you may want to go down to the time scale that best identifies changing behaviour in your data.
df['HOURS_DELTA'] = (df.TIMESTAMP - df.TIMESTAMP.min()).dt.total_seconds() // 3600
(Note that .dt.components['hours'] would only give the hour part of each delta, not the total elapsed hours, so it would not preserve the ordering.)
The code above adds a new column with the delta value; to remove the old TIMESTAMP column, do this afterwards:
df = df.drop('TIMESTAMP', axis=1)
I usually turn the DateTime into features of interest such as Year, Month, Day, Hour, Minute.
df['Year'] = df['Timestamp'].apply(lambda time: time.year)
df['Month'] = df['Timestamp'].apply(lambda time: time.month)
df['Day'] = df['Timestamp'].apply(lambda time: time.day)
df['Hour'] = df['Timestamp'].apply(lambda time: time.hour)
df['Minute'] = df['Timestamp'].apply(lambda time: time.minute)