How to handle date variable in machine learning data pre-processing

yppdgr · Sep 26, 2017

I have a data set that contains, among other variables, the timestamp of each transaction in the format 26-09-2017 15:29:32. I need to find possible correlations with sales and to predict them (let's say with logistic regression). My questions are:

  1. How should I handle the date format? Should I convert it to a single number (as Excel does automatically)? Should I split it into several variables such as day, month, year, hour, minutes and seconds? Any other suggestions?
  2. What if I want to add a distinct week number per year? Should I add a variable like 342017 (week 34 of year 2017)?
  3. Should I do the same as in question 2 for the quarter of the year?

#         Datetime               Gender        Purchase
1    23/09/2015 00:00:00           0             1
2    23/09/2015 01:00:00           1             0
3    25/09/2015 02:00:00           1             0
4    27/09/2015 03:00:00           1             1
5    28/09/2015 04:00:00           0             0

Answer

Ryan John · Sep 26, 2017

Some random thoughts:

Dates are a good source for feature engineering, but I don't think there is a single right way to use them in a model. Input from business users would be valuable: are there observed trends that can be coded into the data?

Possible suggestions of features include:

  • weekends vs weekdays
  • business hours and time of day
  • seasons
  • week of year number
  • month
  • year
  • beginning/end of month (pay days)
  • quarter
  • days to/from an action event (distance)
  • missing or incomplete data
  • etc.

Which of these are useful depends on the data set, and most won't apply to any given problem.
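As a rough illustration of a few of the features listed above, here is a minimal pandas sketch using the sample rows from the question. The function name `add_date_features`, the output column names, and the 09:00 to 16:59 weekday "business hours" window are my own assumptions, not anything given in the question, and `Series.dt.isocalendar()` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Sample rows copied from the question; timestamps are day-first (dd/mm/yyyy).
df = pd.DataFrame({
    "Datetime": [
        "23/09/2015 00:00:00",
        "23/09/2015 01:00:00",
        "25/09/2015 02:00:00",
        "27/09/2015 03:00:00",
        "28/09/2015 04:00:00",
    ],
    "Gender": [0, 1, 1, 1, 0],
    "Purchase": [1, 0, 0, 1, 0],
})


def add_date_features(frame, column="Datetime"):
    """Return a copy of `frame` with a few calendar features derived from `column`.

    Hypothetical helper: the feature names and thresholds are illustrative only.
    """
    out = frame.copy()
    # Parse with an explicit format so 23/09 is never misread as month/day.
    ts = pd.to_datetime(out[column], format="%d/%m/%Y %H:%M:%S")

    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["quarter"] = ts.dt.quarter
    # ISO 8601 week number (requires pandas >= 1.1).
    out["week_of_year"] = ts.dt.isocalendar().week.astype(int)
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["hour"] = ts.dt.hour

    is_weekday = ts.dt.dayofweek < 5
    out["is_weekend"] = (~is_weekday).astype(int)
    # Assumed business hours: 09:00 to 16:59 on weekdays.
    out["is_business_hours"] = (ts.dt.hour.between(9, 16) & is_weekday).astype(int)
    # Rough proxy for pay days: first or last calendar day of the month.
    out["is_month_start_or_end"] = (
        ts.dt.is_month_start | ts.dt.is_month_end
    ).astype(int)
    return out


features = add_date_features(df)
print(features.drop(columns=["Datetime"]))
```

Keeping year, week and quarter as separate columns, rather than one combined number such as 342017, leaves each one usable on its own. Whether any of them helps a logistic regression still has to be checked against the data.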

Some links:

http://appliedpredictivemodeling.com/blog/2015/7/28/feature-engineering-versus-feature-extraction

https://www.salford-systems.com/blog/dan-steinberg/using-dates-in-data-mining-models

http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/