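When R^2 is suspiciously high: why?
In multiple regression analysis
In y = ax + b, do not swap the roles of y (the target) and x (the predictors).
Wrong example: R^2 = 0.97
y = ypred  # the predicted values from an earlier fit, used as the target by mistake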
X = sales_norm
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
# show the intercept, etc
print(res.params)
# predict values - compare with y
ypred = res.predict(X)
Correct: R^2 = 0.74
y = sales_norm
X = np.array([prev_sales_norm, avg_max_temp_norm, avg_min_temp_norm, precipitation_norm, weekend_norm, weather_number_norm]).T
# COUNT: check that X is (n, 6) and y is (n,)
print(np.array(X).shape, np.array(y).shape)
X = sm.add_constant(X)
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
# show the intercept, etc
print(res.params)
# predict values - compare with y
ypred = res.predict(X)
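One possible contributor to the 0.97, offered as an assumption rather than a diagnosis of the original data: the wrong example calls sm.OLS without sm.add_constant. For a model with no constant, statsmodels reports the uncentered R^2, 1 - SSR / sum(y^2), which is never lower than the centered version computed from the same residuals. The sketch below uses made-up numbers and plain Python instead of statsmodels to show how large the gap can be when the values sit far from zero.

```python
# Hypothetical data: values far from zero with a noisy linear relationship.
x = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
y = [21.0, 20.0, 25.0, 24.0, 29.0, 31.0]

# Least-squares fit through the origin (no constant term): y ~ slope * x
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals = [yi - slope * xi for xi, yi in zip(x, y)]
ssr = sum(r * r for r in residuals)

mean_y = sum(y) / len(y)
uncentered_tss = sum(yi * yi for yi in y)            # denominator when no constant is in the model
centered_tss = sum((yi - mean_y) ** 2 for yi in y)   # denominator when a constant is included

r2_uncentered = 1 - ssr / uncentered_tss
r2_centered = 1 - ssr / centered_tss

print("uncentered R^2: %.3f" % r2_uncentered)
print("centered R^2:   %.3f" % r2_centered)
```

The two numbers come from the same fit and the same residuals; only the baseline they are compared against differs. That is why R^2 values from fits with and without a constant are not directly comparable.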
Python: detecting Japanese public holidays
Add these imports:
import datetime
import math
import sys
Terminal: Python;
After adding code, call exit() and start the interpreter again.
Original code:
http://www.h3.dion.ne.jp/~sakatsu/holiday_logic5.htm
--------
Errors:
File "kaikibunseki_test3.3.py", line 63, in <module>
h = jholiday.holiday_name(date=d)
File "/Users/HOkaniwa/ABEJA/matplotlib_test/jholiday.py", line 140, in holiday_name
if date < datetime.date(1948, 7, 20):
TypeError: can't compare datetime.datetime to datetime.date
TypeError: can't compare datetime.date to unicode
Converting a string to a date
import datetime
tstr = '2012-12-29 13:49:37'
tdatetime = datetime.datetime.strptime(tstr, '%Y-%m-%d %H:%M:%S')
tdate = datetime.date(tdatetime.year, tdatetime.month, tdatetime.day)
holiday = []
for d in date_2:
    print(type(d))  # a string read from the data
    da = datetime.datetime.strptime(d, '%m/%d/%y')
    da = datetime.date(da.year, da.month, da.day)
    print(type(da))  # datetime.date, the type jholiday.holiday_name compares against
    h = jholiday.holiday_name(date=da)
    if h is not None:
        holiday.append(1)
    else:
        holiday.append(0)
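Both TypeErrors above come from comparing a datetime.date with something that is not a datetime.date (first a datetime.datetime, then a string). A minimal sketch of the conversion, using only the standard library; to_date is a hypothetical helper name, and the cutoff is the date that appears in the traceback:

```python
import datetime


def to_date(text, fmt='%m/%d/%y'):
    """Parse a date string and return a datetime.date (not a datetime.datetime)."""
    return datetime.datetime.strptime(text, fmt).date()


cutoff = datetime.date(1948, 7, 20)   # the date jholiday.py compares against

d = to_date('8/18/15')
print(type(d).__name__, d)
print(d < cutoff)   # date vs. date: this comparison is allowed
```

Calling .date() on the parsed datetime does the same job as building datetime.date(year, month, day) by hand.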
Predicting from data: multiple regression analysis
Regression analysis
0: What it is
Definition:
1: Choosing a tool
Reference summary: qiita.com
Sample code:
--- Tool: statsmodels
StatsModels: Statistics in Python — statsmodels 0.6.1 documentation
Clear step-by-step explanations:
Getting started — statsmodels 0.6.1 documentation
Fitting models using R-style formulas — statsmodels 0.6.1 documentation
Other examples:
Notes:
Without this line you get an error:
X = sm.add_constant(X)
Error:
File "kaikibunseki_ex2.py", line 10, in <module>
y = np.dot(X, beta) + e
ValueError: shapes (100,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)
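The shape mismatch above can be reproduced and fixed with numpy alone. The sketch below assumes data similar to the statsmodels getting-started example (two predictors, three coefficients) and builds the constant column by hand; by default sm.add_constant(X) prepends the same column of ones.

```python
import numpy as np

x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))      # shape (100, 2): two predictors
beta = np.array([1.0, 0.1, 10.0])     # three coefficients: constant + two predictors

# Prepend a column of ones so that the constant has a column to multiply.
X_const = np.column_stack((np.ones(len(X)), X))   # shape (100, 3)

y = np.dot(X_const, beta)             # (100, 3) and (3,) are aligned
print(X.shape, X_const.shape, y.shape)
```

With the original X, np.dot(X, beta) fails because the 2 columns of X cannot be matched against the 3 entries of beta.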
You do not have to use this one line:
import MySQLdb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas
import numpy as np
connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor= connection.cursor()
# create the lists
sales = []
avg_max_temp = []
avg_min_temp = []
precipitation = []
# fill them with data; the row indexes follow the SELECT column order
cursor.execute("select avg_sales, avg_min_temp, avg_max_temp, precipitation from minatoa")
data = cursor.fetchall()
for row in data:
    sales.append(row[0])
    avg_min_temp.append(row[1])
    avg_max_temp.append(row[2])
    precipitation.append(row[3])
# For y = a*x1 + b*x2 + ... + c, define y and the predictors X.
y = sales
# x1 = avg_min_temp, x2 = avg_max_temp, x3 = precipitation
X = np.array([avg_min_temp, avg_max_temp, precipitation]).T  # Point 1*
# Use .shape to confirm that X is (384, 3) and y is (384,); the row counts must match for the model to be well-formed.
print(np.array(X).shape, np.array(y).shape)
X = sm.add_constant(X)
# print('----------------------------------')
# print(X)
# print(y)
# print('----------------------------------')
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
# show the intercept, etc
print(res.params)
ーーーーーーーーーーーーー
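Point 1*
When I checked with .shape, the axes of X were the wrong way round relative to y (one row per variable instead of one row per observation). I therefore used numpy's .T to transpose it. .T only exists on numpy arrays, not on plain lists, so the lists are wrapped as np.array([--, --, --]).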
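A small sketch of Point 1 with made-up numbers (four observations instead of 384). The variable names mirror the script, but the values are hypothetical.

```python
import numpy as np

# Hypothetical columns, four observations each.
avg_max_temp = [30, 28, 25, 31]
avg_min_temp = [22, 21, 18, 24]
precipitation = [0.0, 5.5, 12.0, 0.0]

stacked = np.array([avg_max_temp, avg_min_temp, precipitation])
print(stacked.shape)   # one row per variable: (3, 4)

X = stacked.T
print(X.shape)         # one row per observation: (4, 3)
print(X[0])            # the first observation across all three variables
```

np.column_stack((avg_max_temp, avg_min_temp, precipitation)) builds the same (4, 3) array in one step, if the transpose feels indirect.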
-------------------------------------
Result:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.071
Model:                            OLS   Adj. R-squared:                  0.064
Method:                 Least Squares   F-statistic:                     9.674
Date:                Tue, 18 Aug 2015   Prob (F-statistic):           3.62e-06
Time:                        16:23:29   Log-Likelihood:                -2441.3
No. Observations:                 384   AIC:                             4891.
Df Residuals:                     380   BIC:                             4906.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        845.7819     67.419     12.545      0.000       713.221   978.343
x1            37.7164      8.901      4.237      0.000        20.214    55.218
x2           -42.0506      9.224     -4.559      0.000       -60.188   -23.914
x3             0.0009      0.082      0.011      0.992        -0.160     0.161
==============================================================================
Omnibus:                      120.607   Durbin-Watson:                   0.960
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1029.115
Skew:                           1.066   Prob(JB):                    3.39e-224
Kurtosis:                      10.731   Cond. No.                         853.
==============================================================================
Producing predicted values
Prediction (out of sample) — statsmodels 0.6.0 documentation
ーーーーーーーーーーーーーーーーー
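Note: error:
TypeError: unsupported operand type(s) for /: 'list' and 'int'
Check the size of the arrays with len(). The error itself means that a plain Python list was divided by an int; a list has to be converted with np.array() (or divided element by element) first.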
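The TypeError above is what Python raises when a plain list is divided by a number. A minimal reproduction with made-up values, together with an element-by-element alternative that needs no extra library:

```python
values = [120, 80, 100]   # hypothetical sales figures

# Dividing the list itself is not defined and raises the TypeError from the note.
try:
    values / 2
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

# Dividing each element works; len() confirms that no element was lost.
halved = [v / 2.0 for v in values]
print(halved, len(halved) == len(values))
```

Wrapping the list as np.array(values) would also make values / 2 work, because numpy arrays divide elementwise.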
Python matplotlib: drawing several plots on the same axes with plt.legend()
Reference:
Code:
import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
# new
from pandas import *
# A
minatoa = read_csv('minatoA.csv')
m = minatoa.groupby('avg_min_temp')['avg_sales'].mean()
# print (m)
plt.plot(m, "m.", label="A")
# B
tokyoresortb = read_csv('TokyoResortB.csv')
t = tokyoresortb.groupby(['avg_min_temp'])['avg_sales'].mean()
# print (t)
plt.plot(t, "g.", label="B")
# C
suburbresortb = read_csv('SuburbResortC.csv')
s = suburbresortb.groupby('avg_min_temp')['avg_sales'].mean()
# print (s)
plt.plot(s, "r.", label="C")
# D
centerd = read_csv('CenterinVicinityD.csv')
d = centerd.groupby('avg_min_temp')['avg_sales'].mean()
# print(d)
plt.plot(d, "k.", label="D")
plt.legend()
plt.show()
Python matplotlib + pandas: cross tabulation
What I was trying to do (1):
Cross-tabulate weather against temperature
What I used:
pandas.crosstab — pandas 0.16.2 documentation
Reference:
Error:
ValueError: If using all scalar values, you must pass an index
Cause:
I was filling plain list variables with append instead of loading the data with read_csv.
Code:
import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import YearLocator, MonthLocator, DayLocator, DateFormatter
# new
from pandas import *
connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor= connection.cursor()
weather = []
avg_max_temp = []
avg_sales = []
cursor.execute("select avg_sales, weather, avg_max_temp from minatoa")
data = cursor.fetchall()
for row in data:
    avg_sales.append(row[0])
    weather.append(row[1])
    avg_max_temp.append(row[2])
sales_weather_maxtemp = crosstab(weather, avg_max_temp)
# In the sample code: tips = read_csv('tips.csv')
# cross-tabulate
counts = crosstab(weather, avg_max_temp)
print( counts )
Fix:
Code:
minatoa = read_csv('minatoA.csv')
# cross-tabulate (counts only; in current pandas, aggfunc may only be passed together with values=)
counts = crosstab(minatoa.weather, minatoa.avg_max_temp)
print( counts )
Result:
➜ matplotlib_test python cross_ex1.py
avg_max_temp 0 9 10 11 12 13 14 15 16 17 ... 23 24 25 26 27 \
weather ...
J 0 0 0 0 1 0 0 1 0 1 ... 1 0 1 0 0
J/ 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0
J/ 0 0 1 0 1 0 0 1 0 0 ... 1 0 1 0 1
/J 0 2 0 0 2 0 1 1 0 0 ... 1 0 1 1 0
/ 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0
/ 0 3 10 1 1 3 2 1 3 3 ... 2 3 3 2 3
0 13 22 12 9 3 7 6 4 2 ... 0 4 6 3 0
0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0
0 2 4 3 0 1 1 2 0 0 ... 2 1 0 4 3
/J 0 1 4 4 2 1 3 2 2 1 ... 0 3 6 1 5
/ 0 0 2 0 3 0 1 1 0 0 ... 1 1 1 1 1
/ 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0
ݒ 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0
avg_max_temp 28 29 30 31 32
weather
J 0 0 0 0 0
J/ 0 0 0 0 0
J/ 0 0 1 0 0
/J 0 1 2 0 0
/ 0 0 0 0 0
/ 2 3 9 3 1
2 0 7 1 6
0 0 0 0 0
2 0 4 1 1
/J 4 8 9 3 0
/ 0 0 6 2 0
/ 0 0 0 0 0
ݒ 0 0 0 0 0
[13 rows x 25 columns]
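To make clear what each cell of the table above holds, here is the same kind of count written with the standard library only. This is an illustration of the idea, not how pandas implements crosstab, and the rows are made up.

```python
from collections import Counter

# Hypothetical (weather, avg_max_temp) observations.
rows = [("sunny", 30), ("cloudy", 30), ("sunny", 30), ("rainy", 25), ("cloudy", 25)]

counts = Counter(rows)   # maps each (weather, temperature) pair to how often it occurs

weathers = sorted({w for w, _ in rows})
temps = sorted({t for _, t in rows})

print("weather  " + "  ".join(str(t) for t in temps))
for w in weathers:
    cells = "  ".join(str(counts[(w, t)]).rjust(2) for t in temps)
    print(w.ljust(8) + " " + cells)
```

Pairs that never occur simply count as 0, which is why the crosstab output is mostly zeros.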
ーーーーーーーーー
What I was trying to do (2):
Compute the mean sales for each combination of weather and temperature
What I used:
pandas.DataFrame.groupby — pandas 0.16.2 documentation
Code:
import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
from pandas import *
minatoa = read_csv('minatoA.csv')
m = minatoa.groupby(['weather', 'avg_max_temp'])['avg_sales'].mean()
print (m)
Result:
➜ matplotlib_test python cross_ex1.py
weather avg_max_temp
cloudy 9 209.500000
10 329.000000
11 309.666667
13 385.000000
14 394.000000
15 375.500000
18 570.000000
19 422.500000
20 393.500000
21 234.000000
22 246.000000
23 405.000000
24 288.000000
26 360.250000
27 362.333333
28 387.500000
30 327.750000
31 212.000000
32 350.000000
cloudy/rainy 9 327.000000
10 422.250000
11 320.500000
12 442.000000
13 446.000000
14 338.333333
15 392.000000
16 293.500000
17 578.000000
18 364.500000
19 466.000000
...
sunny/cloudy 17 335.000000
18 261.000000
19 339.833333
20 427.600000
21 377.333333
22 348.000000
23 456.500000
24 381.000000
25 312.333333
26 350.000000
27 361.666667
28 369.000000
29 313.333333
30 277.555556
31 326.000000
32 209.000000
sunny/rainy 9 324.000000
12 344.000000
14 365.000000
15 496.000000
18 303.000000
20 320.666667
21 298.000000
22 486.000000
23 397.000000
25 347.000000
26 295.000000
29 550.000000
30 239.500000
sunny/snowy 10 277.000000
Name: avg_sales, dtype: float64
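The grouped mean above can be read as "collect the sales for each (weather, avg_max_temp) pair, then average each pile". A standard-library sketch of that idea with made-up records; it illustrates what groupby(...).mean() computes, not how pandas does it.

```python
from collections import defaultdict

# Hypothetical (weather, avg_max_temp, avg_sales) records.
records = [
    ("cloudy", 9, 200.0),
    ("cloudy", 9, 219.0),
    ("cloudy", 10, 329.0),
    ("sunny", 10, 277.0),
]

groups = defaultdict(list)
for weather, temp, sales in records:
    groups[(weather, temp)].append(sales)   # one list of sales per combination

means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
for key in sorted(means):
    print(key, means[key])
```

Combinations that never occur in the data have no entry at all, which matches the gaps in the temperature column of the output above.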
Python matplotlib subplot: showing several plots
Reference:
from pylab import *
x = arange(1, 10, 0.1)
y1 = sin(x)
y2 = cos(x)
# plot in the top row
subplot(211)
plot(x, y1)
xlabel("1")  # xlabel of the top row
# plot in the bottom row
subplot(212)
plot(x, y2)
xlabel("2")  # xlabel of the bottom row
show()
Tidying up the appearance of matplotlib plots - たこ焼き食べた.net
Going further
Reference links:
pylab_examples example code: subplots_demo.py — Matplotlib 1.4.3 documentation
Example:
import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import YearLocator, MonthLocator, DayLocator, DateFormatter
connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor= connection.cursor()
d = "select avg_sales+abs(avg_destruction), date from centerinvicinityd"
cursor.execute(d)
datad = cursor.fetchall()
proda = []
prodd = []
datestring = []
months = MonthLocator() # every month
monthsFmt = DateFormatter('%m')
days = DayLocator() # every day
for row in datad:
    prodd.append(row[0])
    dateeach = datetime.datetime.strptime(row[1], '%m/%d/%y')
    datestring.append(dateeach)
a = "select avg_sales+abs(avg_destruction), date from minatoa"
cursor.execute(a)
dataa = cursor.fetchall()
for row in dataa:
    proda.append(row[0])
# One figure with two rows. Calling plt.subplots() with no arguments after
# subplot(211) opens a new figure each time, so the plots never share a window.
fig, (ax_d, ax_a) = plt.subplots(2, 1)
# top row: center in vicinity D
ax_d.plot_date(datestring, prodd, '-')
ax_d.set_xlim(min(datestring), max(datestring))
ax_d.set_ylim(min(prodd), max(prodd))
ax_d.xaxis.set_major_locator(months)
ax_d.xaxis.set_major_formatter(monthsFmt)
ax_d.xaxis.set_minor_locator(days)
ax_d.set_xlabel("center in vicinity d")
ax_d.grid(True)
# bottom row: minato A (reuses datestring, so both tables must list the same dates in the same order)
ax_a.plot_date(datestring, proda, '-')
ax_a.set_xlim(min(datestring), max(datestring))
ax_a.set_ylim(min(proda), max(proda))
ax_a.xaxis.set_major_locator(months)
ax_a.xaxis.set_major_formatter(monthsFmt)
ax_a.xaxis.set_minor_locator(days)
ax_a.set_xlabel("minato a")
ax_a.grid(True)
plt.show()
cursor.close()
connection.close()
Python for beginners: plotting a time series
You can pick a plot type from here and build on it (when using matplotlib):
Plot examples: Thumbnail gallery — Matplotlib 1.4.3 documentation
import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import YearLocator, MonthLocator, DayLocator, DateFormatter
connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor= connection.cursor()
query="select avg_sales+abs(avg_destruction), date from centerinvicinityd"
cursor.execute(query)
data = cursor.fetchall()
prod = []
datestring = []
for row in data:
    prod.append(row[0])
    print(row[1])
    dateeach = datetime.datetime.strptime(row[1], '%m/%d/%y')
    datestring.append(dateeach)
years = YearLocator() # every year
months = MonthLocator() # every month
monthsFmt = DateFormatter('%m')
days = DayLocator() # every day
fig, ax = plt.subplots()
ax.plot_date(datestring, prod, '-')
# set the limits on this figure; calling axis() before plt.subplots() would open a separate empty figure
ax.set_xlim(min(datestring), max(datestring))
ax.set_ylim(min(prod), max(prod))
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(monthsFmt)
ax.xaxis.set_minor_locator(days)
ax.set_title(query)
ax.grid(True)
plt.show()
cursor.close()
connection.close()
------------------------------------------
Reference examples:
pylab_examples example code: date_demo1.py — Matplotlib 1.4.3 documentation
pylab_examples example code: date_demo2.py — Matplotlib 1.4.3 documentation
Error notes:
➜ matplotlib_test python avg_min_temp_prod.py
0
Error:
Traceback (most recent call last):
File "avg_min_temp_prod.py", line 22, in <module>
datetime.datetime.strptime(date, '%m/%d/%Y')
TypeError: must be string, not list
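=> strptime was given the whole list; it has to be called on one string at a time, as in the loop below.
------------------
Code:
prod = []
datestring = []
for row in data:
    prod.append(row[0])
    dateeach = datetime.datetime.strptime(row[1], '%m/%d/%Y')
    datestring.append(dateeach)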
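Error:
ValueError: time data '1/1/14' does not match format '%m/%d/%Y'
Cause:
%Y expects a four-digit year, but the data uses two-digit years such as 14, so the format has to be '%m/%d/%y'. Zero-padding the date to 01/01/14 is not what is missing: strptime accepts 1/1/14 once the year directive matches.
Reference:
Fix:
Reference: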
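Both strptime errors in this section can be checked in a few lines with the standard library. The date strings below are made up in the same style as the data; the point to look at is the year directive, %y for two digits versus %Y for four.

```python
import datetime

raw_dates = ['1/1/14', '01/02/14', '12/31/14']   # hypothetical values, m/d/yy style

# strptime takes one string at a time, so parse the list element by element.
# %y matches a two-digit year; month and day may be written with or without zero padding.
parsed = [datetime.datetime.strptime(s, '%m/%d/%y') for s in raw_dates]
print(parsed[0])

# %Y expects a four-digit year, so the same string is rejected with a ValueError.
try:
    datetime.datetime.strptime('1/1/14', '%m/%d/%Y')
    four_digit_ok = True
except ValueError:
    four_digit_ok = False
print(four_digit_ok)
```

The resulting list of datetime objects can be passed to plot_date directly, as the scripts in this section do.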
------------ Reference sites
You should be using datetime.datetime.strptime. Note that very old versions of Python (2.4 and older) don't have datetime.datetime.strptime; use time.strptime in that case.
Reference:
You must first convert your timestamps to python datetime objects (use datetime.strptime). Then use date2num to convert the dates to matplotlib format.
Plot the dates and values using plot_date:
dates = matplotlib.dates.date2num(list_of_datetimes)
plot_date(dates, values)