I am stuck on the following lines:
import math
import quandl
import pandas as pd
import numpy as np
# cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression

df = quandl.get('WIKI/GOOGL')
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100
df['PCT_CHANGE'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100
df = df[['Adj. Close', 'HL_PCT', 'PCT_CHANGE', 'Adj. Open']]
forecast_col = 'Adj. Close'
df.fillna(-99999, inplace=True)
# number of rows to look ahead: 10% of the data set, rounded up
forecast_out = int(math.ceil(0.1 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
print(df.head())
I don't understand what df[forecast_col].shift(-forecast_out) means.
Please explain this command and what it does.
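The shift method of a pandas DataFrame (or Series) shifts the data by the desired number of periods, with an optional time freq. When freq is not given, the values move along a fixed index and the vacated positions are filled with NaN; when freq is given, the index itself is shifted instead. For further information on shift, please refer to the pandas documentation.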
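To make the difference between shifting by rows and shifting with freq concrete, here is a minimal sketch. The series and its dates are made-up values for illustration only, not data from the question.

```python
import pandas as pd

# Hypothetical daily series, invented for this illustration.
idx = pd.date_range("2000-01-01", periods=4, freq="D")
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

# Without freq: the values move down one row, the index stays as it is,
# and the vacated first row becomes NaN.
by_rows = s.shift(1)

# With freq: the index itself moves forward by one day and no value is lost.
by_time = s.shift(1, freq="D")

print(by_rows)
print(by_time)
```

The code in the question does not pass freq, so it gets the first behaviour: values move and the index stays fixed.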
Here is a small example of column values being shifted:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-03", "2000-03-05", "2000-01-03", "2000-03-05",
             "2000-03-05", "2000-07-03", "2000-01-03", "2000-07-03", "2000-07-03"],
    "variable": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D"],
    "no": [1, 2.2, 3.5, 1.5, 1.5, 1.2, 1.3, 1.1, 2, 3],
    "value": [0.469112, -0.282863, -1.509059, -1.135632, 1.212112, -0.173215,
              0.119209, -1.044236, -0.861849, None]})
Below are the column values before they are shifted:
df['value']
output
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 1.212112
5 -0.173215
6 0.119209
7 -1.044236
8 -0.861849
9 NaN
Name: value, dtype: float64
With shift, the values move by the number of periods given.
For example, a positive integer shifts the row values downwards:
df['value'].shift(1)
output
0 NaN
1 0.469112
2 -0.282863
3 -1.509059
4 -1.135632
5 1.212112
6 -0.173215
7 0.119209
8 -1.044236
9 -0.861849
Name: value, dtype: float64
A negative integer shifts the row values upwards:
df['value'].shift(-1)
output
0 -0.282863
1 -1.509059
2 -1.135632
3 1.212112
4 -0.173215
5 0.119209
6 -1.044236
7 -0.861849
8 NaN
9 NaN
Name: value, dtype: float64
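Applied to the code in the question, df[forecast_col].shift(-forecast_out) moves the 'Adj. Close' values up by forecast_out rows, so each row's label is the closing price recorded forecast_out rows later. Here is a minimal sketch of that idea. The prices are made-up numbers standing in for the Quandl data, and the 20% look-ahead is chosen only to keep the example small (the question uses 10%).

```python
import math
import pandas as pd

# Made-up closing prices standing in for the Quandl data (assumption).
prices = pd.DataFrame({"Adj. Close": [100.0, 101.0, 102.0, 103.0, 104.0,
                                      105.0, 106.0, 107.0, 108.0, 109.0]})

forecast_col = "Adj. Close"
# 20% of 10 rows gives a look-ahead of 2 rows.
forecast_out = int(math.ceil(0.2 * len(prices)))

# Each row's label is the close found forecast_out rows further down.
prices["label"] = prices[forecast_col].shift(-forecast_out)

print(prices)
# The last forecast_out rows have no later value to pull up, so they are NaN.
print(prices["label"].isna().sum())
```

Because the question calls fillna before creating the label column, the NaN values that shift introduces at the end of label are still present afterwards. Rows like these are usually dropped (for example with dropna) before a model is trained.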