2015-09-07

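Why is R^2 too high?

In multiple regression,
do not reverse y = ax + b: the target y and the regressors X must not be swapped.

Wrong example: R^2 = 0.97
y = ypred  # wrong: the predicted values from an earlier fit were used as y
X = sales_norm
mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
# show the intercept, etc
print(res.params)
# predict values - compare with y
ypred = res.predict(X)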

Correct: R^2 = 0.74

y = sales_norm
X = np.array([prev_sales_norm, avg_max_temp_norm, avg_min_temp_norm, precipitation_norm, weekend_norm, weather_number_norm]).T

# check the shapes
print(np.array(X).shape, np.array(y).shape)

X = sm.add_constant(X)

mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model
# show the intercept, etc
print(res.params)
# predict values - compare with y
ypred = res.predict(X)
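One possible contributor to the inflated value, offered as an assumption rather than something these notes confirm: the wrong example passes X to sm.OLS without sm.add_constant, and for a model without a constant statsmodels reports the uncentered R^2 (total sum of squares taken around zero instead of around the mean of y). The sketch below uses only NumPy and synthetic data (every name and number in it is made up) to show how far the two definitions can differ on the same data. It is written for Python 3.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
n = 200
x = rng.uniform(0.0, 1.0, n)                    # synthetic regressor
y = 5.0 + 0.5 * x + rng.normal(0.0, 0.5, n)     # synthetic target, weakly related to x


def r2_with_intercept(x, y):
    """Centered R^2 of y regressed on [1, x]."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X.dot(beta)
    return 1.0 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()


def r2_through_origin(x, y):
    """Uncentered R^2 of y regressed on x alone (no constant column)."""
    b = x.dot(y) / x.dot(x)
    resid = y - b * x
    return 1.0 - resid.dot(resid) / y.dot(y)


r2_centered = r2_with_intercept(x, y)
r2_uncentered = r2_through_origin(x, y)
print("with intercept:   ", round(r2_centered, 3))
print("without intercept:", round(r2_uncentered, 3))
```

The two numbers answer different questions, so an R^2 from a fit without a constant should not be compared directly with one from a fit that has it.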

2015-09-03

Python: detecting Japanese public holidays

Add these imports:

import datetime
import math
import sys

In the terminal Python interpreter:
after changing the code, call exit() and start the interpreter again.

Pythonで日本の祝日判定をするスクリプト - brainstorm: http://d.hatena.ne.jp/yuheiomori0718/20140919/1411138964

Original code:

http://www.h3.dion.ne.jp/~sakatsu/holiday_logic5.htm

--------

Errors:

  File "kaikibunseki_test3.3.py", line 63, in <module>
    h = jholiday.holiday_name(date=d)
  File "/Users/HOkaniwa/ABEJA/matplotlib_test/jholiday.py", line 140, in holiday_name
    if date < datetime.date(1948, 7, 20):
TypeError: can't compare datetime.datetime to datetime.date

TypeError: can't compare datetime.date to unicode

holiday_name compares its argument with a datetime.date, so passing a datetime.datetime or a unicode string fails.

Converting a string to a date

import datetime

tstr = '2012-12-29 13:49:37'
tdatetime = datetime.datetime.strptime(tstr, '%Y-%m-%d %H:%M:%S')
tdate = datetime.date(tdatetime.year, tdatetime.month, tdatetime.day)

holiday = []
for d in date_2:
    print(type(d))   # string read from the database
    da = datetime.datetime.strptime(d, '%m/%d/%y').date()
    print(type(da))  # datetime.date, which holiday_name can compare
    h = jholiday.holiday_name(date=da)
    if h is not None:
        holiday.append(1)
    else:
        holiday.append(0)
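Both TypeErrors above come from comparing a datetime.date with a value of another type. A minimal sketch of normalising the input before the comparison follows. `to_date` is a hypothetical helper (it is not part of jholiday), the '%m/%d/%y' default is assumed from the loop above, and the code is written for Python 3.

```python
import datetime


def to_date(value, fmt='%m/%d/%y'):
    """Return a datetime.date for a string, datetime or date input."""
    # datetime.datetime is a subclass of datetime.date, so test it first.
    if isinstance(value, datetime.datetime):
        return value.date()
    if isinstance(value, datetime.date):
        return value
    return datetime.datetime.strptime(value, fmt).date()


threshold = datetime.date(1948, 7, 20)  # the date the traceback compares against
samples = ['1/1/14', datetime.datetime(2014, 5, 3, 12, 0), datetime.date(1948, 1, 1)]
for raw in samples:
    d = to_date(raw)
    print(d, d < threshold)
```

Calling such a helper once, right where the data is read, keeps every later comparison between values of the same type.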

2015-08-17

Predicting from data: multiple regression analysis

Regression analysis

0: What it is

Definition:

Python/NumPyで重回帰分析・多変量解析: http://tokeigaku.blog.jp/python/numpy/%E9%87%8D%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90

1: Choosing a tool

Reference summary: Python で回帰分析を行う 6 つの方法～回帰分析した結果の t値・p値・（自由度修正済み）決定係数・D.W.値などを出力できる方法／出力できない方法 - Qiita: http://qiita.com/HirofumiYashima/items/7e18970d68a5c83084e7

Sample code:

重回帰分析。 - ほんとうに一から始める機械学習: http://kumamotosan.hatenablog.com/entry/2014/03/05/222315

tokeigaku.blog.jp

--- Tool: statsmodels

StatsModels: Statistics in Python — statsmodels 0.6.1 documentation

Clear step-by-step explanations:

Getting started — statsmodels 0.6.1 documentation

Fitting models using R-style formulas — statsmodels 0.6.1 documentation

Another example:

import numpy as np
import statsmodels.api as sm

# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
print(results.summary())

Note:
Without this line, an error occurs:
X = sm.add_constant(X)

Error:

  File "kaikibunseki_ex2.py", line 10, in <module>
    y = np.dot(X, beta) + e
ValueError: shapes (100,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)
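The shape mismatch can be reproduced with NumPy alone: beta has three entries (intercept plus two slopes) while X has only two columns until a constant column is added. In the sketch below np.column_stack stands in for sm.add_constant so that the example needs no statsmodels; that substitution is an assumption made for illustration. Written for Python 3.

```python
import numpy as np

nobs = 100
X = np.random.random((nobs, 2))   # two regressors, shape (100, 2)
beta = [1, .1, .5]                # three coefficients: intercept + 2 slopes

try:
    np.dot(X, beta)               # (100, 2) x (3,) cannot be multiplied
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# Stand-in for sm.add_constant(X): put a column of ones in front.
X_const = np.column_stack([np.ones(nobs), X])
y = np.dot(X_const, beta)
print(X_const.shape, y.shape)
```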

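In the following code this line is not required for the script to run (without it the model simply has no intercept):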

import MySQLdb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas
import numpy as np

connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor = connection.cursor()

# create the lists
sales = []
avg_max_temp = []
avg_min_temp = []
precipitation = []

# fill them with data

cursor.execute("select avg_sales, avg_min_temp, avg_max_temp, precipitation from minatoa")
data = cursor.fetchall()
for row in data:
    # the indices follow the SELECT order: avg_sales, avg_min_temp, avg_max_temp, precipitation
    sales.append(row[0])
    avg_min_temp.append(row[1])
    avg_max_temp.append(row[2])
    precipitation.append(row[3])

# For y = a*x1 + b*x2 + ... + c, define y and the regressors X.

y = sales
# columns of X, in order: avg_min_temp, avg_max_temp, precipitation
X = np.array([avg_min_temp, avg_max_temp, precipitation]).T  # Point 1*

# Use .shape to check that X and y line up, here (384, 3) and (384,), so that the regression can be computed.

print(np.array(X).shape, np.array(y).shape)

X = sm.add_constant(X)

# print('----------------------------------')
# print(X)
# print(y)
# print('----------------------------------')

mod = sm.OLS(y, X) # Describe model
res = mod.fit() # Fit model
print(res.summary()) # Summarize model

# show the intercept, etc
print(res.params)

ーーーーーーーーーーーーー

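Point 1*
Checking .shape showed that the axes of X were the wrong way round relative to y: one row per variable instead of one row per observation. I therefore flipped the axes with NumPy's .T attribute. A plain Python list has no .T, so the data has to be turned into a NumPy array first, which is why X is built with np.array([--, --, --]).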
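A small sketch of Point 1 with made-up numbers (four observations instead of 384). The variable names mirror the script above, but the values are purely illustrative. Written for Python 3.

```python
import numpy as np

# three made-up columns with four observations each
avg_max_temp = [30, 28, 25, 31]
avg_min_temp = [22, 21, 18, 24]
precipitation = [0, 5, 12, 0]

rows = [avg_max_temp, avg_min_temp, precipitation]
print(hasattr(rows, 'T'))      # a plain list has no .T attribute

stacked = np.array(rows)       # one row per variable
X = stacked.T                  # one row per observation
print(stacked.shape, X.shape)
print(X[0])                    # first observation: one value per variable
```

np.column_stack([avg_max_temp, avg_min_temp, precipitation]) would build the same observation-per-row array in one step.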

-------------------------------------

Result:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.071
Model:                            OLS   Adj. R-squared:                  0.064
Method:                 Least Squares   F-statistic:                     9.674
Date:                Tue, 18 Aug 2015   Prob (F-statistic):           3.62e-06
Time:                        16:23:29   Log-Likelihood:                -2441.3
No. Observations:                 384   AIC:                             4891.
Df Residuals:                     380   BIC:                             4906.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        845.7819     67.419     12.545      0.000       713.221   978.343
x1            37.7164      8.901      4.237      0.000        20.214    55.218
x2           -42.0506      9.224     -4.559      0.000       -60.188   -23.914
x3             0.0009      0.082      0.011      0.992        -0.160     0.161
==============================================================================
Omnibus:                      120.607   Durbin-Watson:                   0.960
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1029.115
Skew:                           1.066   Prob(JB):                    3.39e-224
Kurtosis:                      10.731   Cond. No.                         853.
==============================================================================

Getting predicted values

Prediction (out of sample) — statsmodels 0.6.0 documentation
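As a worked example of what a prediction from this fit amounts to, the fitted value is const + x1*c1 + x2*c2 + x3*c3, using the coef column of the summary above. The input observation below is made up, and x1 to x3 simply follow whatever column order X had when the model was fitted. `predict_one` is a hypothetical helper, not a statsmodels function. Written for Python 3.

```python
# Coefficients copied from the coef column of the summary.
coef = {'const': 845.7819, 'x1': 37.7164, 'x2': -42.0506, 'x3': 0.0009}


def predict_one(x1, x2, x3):
    """Fitted value for a single observation."""
    return coef['const'] + coef['x1'] * x1 + coef['x2'] * x2 + coef['x3'] * x3


# Hypothetical observation, in the same column order as X.
y_hat = predict_one(30, 22, 0)
print(round(y_hat, 4))
```

With statsmodels itself, res.predict(X_new) performs the same calculation for a whole array, provided X_new carries the constant column just as the training X did.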

ーーーーーーーーーーーーーーーーー

Note: error:

TypeError: unsupported operand type(s) for /: 'list' and 'int'

Check the size and type of the data! Use len().
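This TypeError appears when a plain Python list is divided by a number. A minimal sketch with made-up values shows the failure and one way around it, converting the list to a NumPy array, which divides element by element. Written for Python 3.

```python
import numpy as np

sales = [120, 240, 360]          # made-up values in a plain list

try:
    sales / 2                    # lists do not support arithmetic
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

sales_arr = np.array(sales)      # arrays divide element by element
halved = sales_arr / 2
print(len(sales), sales_arr.shape, halved)
```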

2015-08-17

Python matplotlib: drawing several graphs on the same axes, with plt.legend()

python matplotlib

Reference:

Python - matplotlib (+ pandas) によるデータ可視化の方法 (4) - Qiita: http://qiita.com/ynakayama/items/e37c222771db53a0e629

Code:

import matplotlib.pyplot as plt
import pandas as pd

# A
minatoa = pd.read_csv('minatoA.csv')
m = minatoa.groupby('avg_min_temp')['avg_sales'].mean()
# print(m)
plt.plot(m.index, m.values, "m.", label="A")

# B
tokyoresortb = pd.read_csv('TokyoResortB.csv')
t = tokyoresortb.groupby('avg_min_temp')['avg_sales'].mean()
# print(t)
plt.plot(t.index, t.values, "g.", label="B")
# C
suburbresortc = pd.read_csv('SuburbResortC.csv')
s = suburbresortc.groupby('avg_min_temp')['avg_sales'].mean()
# print(s)
plt.plot(s.index, s.values, "r.", label="C")
# D
centerd = pd.read_csv('CenterinVicinityD.csv')
d = centerd.groupby('avg_min_temp')['avg_sales'].mean()
# print(d)
plt.plot(d.index, d.values, "k.", label="D")

plt.legend()  # uses the label= given to each plot call
plt.show()

[Figure: average sales by minimum temperature for stores A to D, one colour per store]

2015-08-17

Python matplotlib + pandas: cross tabulation

What I tried to do (1):

Cross-tabulate weather and temperature

What I used:

pandas.crosstab — pandas 0.16.2 documentation

Error:

ValueError: If using all scalar values, you must pass an index

Cause:

Instead of loading the data with read_csv, I had filled plain Python lists with append and passed those lists to crosstab.

Code:

import MySQLdb
from pylab import *
import sys
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import YearLocator, MonthLocator, DayLocator, DateFormatter
# new
from pandas import *

connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor= connection.cursor()

weather = []
avg_max_temp = []
avg_sales = []

cursor.execute("select avg_sales, weather, avg_max_temp from minatoa")
data = cursor.fetchall()
for row in data:
    avg_sales.append(row[0])
    weather.append(row[1])
    avg_max_temp.append(row[2])

# in the sample code: tips = read_csv('tips.csv')

# cross-tabulate; with plain lists this call raised the ValueError above
counts = crosstab(weather, avg_max_temp)
print(counts)

Solution:

Code:

minatoa = read_csv('minatoA.csv')
# cross-tabulate: count each weather / temperature pair
counts = crosstab(minatoa.weather, minatoa.avg_max_temp)
print(counts)
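For reference, a cross tabulation is simply a count of how often each (row, column) pair occurs. The standard-library sketch below shows the idea with made-up weather and temperature values (not the minatoA data). It illustrates the concept only and is not how pandas implements crosstab. Written for Python 3.

```python
from collections import Counter

# made-up observations
weather = ['sunny', 'cloudy', 'sunny', 'rainy', 'sunny', 'cloudy']
avg_max_temp = [30, 25, 30, 22, 28, 25]

counts = Counter(zip(weather, avg_max_temp))   # (weather, temp) -> frequency

rows = sorted(set(weather))
cols = sorted(set(avg_max_temp))
print('weather', *cols)
for w in rows:
    # a pair that never occurs counts as 0
    print(w, *[counts[(w, t)] for t in cols])
```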

Result:

➜ matplotlib_test python cross_ex1.py
avg_max_temp   0   9  10  11  12  13  14  15  16  17 ...  23  24  25  26  27 \
weather                                              ...
J              0   0   0   0   1   0   0   1   0   1 ...   1   0   1   0   0
J/             0   0   0   0   0   0   0   0   0   0 ...   1   0   0   0   0
J/             0   0   1   0   1   0   0   1   0   0 ...   1   0   1   0   1
/J             0   2   0   0   2   0   1   1   0   0 ...   1   0   1   1   0
/              0   0   1   0   0   0   0   0   0   0 ...   0   0   0   0   0
/              0   3  10   1   1   3   2   1   3   3 ...   2   3   3   2   3
               0  13  22  12   9   3   7   6   4   2 ...   0   4   6   3   0
               0   0   1   0   0   0   0   0   0   0 ...   0   0   0   0   0
               0   2   4   3   0   1   1   2   0   0 ...   2   1   0   4   3
/J             0   1   4   4   2   1   3   2   2   1 ...   0   3   6   1   5
/              0   0   2   0   3   0   1   1   0   0 ...   1   1   1   1   1
/              0   0   1   1   0   0   0   0   0   0 ...   0   0   0   0   0
ݒ              1   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0

avg_max_temp  28  29  30  31  32
weather
J              0   0   0   0   0
J/             0   0   0   0   0
J/             0   0   1   0   0
/J             0   1   2   0   0
/              0   0   0   0   0
/              2   3   9   3   1
               2   0   7   1   6
               0   0   0   0   0
               2   0   4   1   1
/J             4   8   9   3   0
/              0   0   6   2   0
/              0   0   0   0   0
ݒ              0   0   0   0   0

[13 rows x 25 columns]

ーーーーーーーーー

What I tried to do (2):

Compute the average sales for each combination of weather and temperature

What I used:

pandas.DataFrame.groupby — pandas 0.16.2 documentation

Code:

import pandas as pd

minatoa = pd.read_csv('minatoA.csv')

# mean of avg_sales for every (weather, avg_max_temp) pair
m = minatoa.groupby(['weather', 'avg_max_temp'])['avg_sales'].mean()
print(m)
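What a group-by on two keys followed by a mean computes can be sketched without pandas: collect the sales values under each (weather, temperature) key, then average each group. The records below are made up for illustration and the sketch does not reflect how pandas implements groupby. Written for Python 3.

```python
from collections import defaultdict

# made-up rows of (weather, avg_max_temp, avg_sales)
records = [
    ('cloudy', 9, 200.0),
    ('cloudy', 9, 219.0),
    ('cloudy', 10, 329.0),
    ('sunny', 30, 250.0),
    ('sunny', 30, 300.0),
]

groups = defaultdict(list)
for weather, temp, sales in records:
    groups[(weather, temp)].append(sales)      # collect values per key pair

means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
for key in sorted(means):
    print(key, means[key])
```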

Result:

➜ matplotlib_test python cross_ex1.py
weather       avg_max_temp
cloudy        9               209.500000
              10              329.000000
              11              309.666667
              13              385.000000
              14              394.000000
              15              375.500000
              18              570.000000
              19              422.500000
              20              393.500000
              21              234.000000
              22              246.000000
              23              405.000000
              24              288.000000
              26              360.250000
              27              362.333333
              28              387.500000
              30              327.750000
              31              212.000000
              32              350.000000
cloudy/rainy  9               327.000000
              10              422.250000
              11              320.500000
              12              442.000000
              13              446.000000
              14              338.333333
              15              392.000000
              16              293.500000
              17              578.000000
              18              364.500000
              19              466.000000
...
sunny/cloudy  17              335.000000
              18              261.000000
              19              339.833333
              20              427.600000
              21              377.333333
              22              348.000000
              23              456.500000
              24              381.000000
              25              312.333333
              26              350.000000
              27              361.666667
              28              369.000000
              29              313.333333
              30              277.555556
              31              326.000000
              32              209.000000
sunny/rainy   9               324.000000
              12              344.000000
              14              365.000000
              15              496.000000
              18              303.000000
              20              320.666667
              21              298.000000
              22              486.000000
              23              397.000000
              25              347.000000
              26              295.000000
              29              550.000000
              30              239.500000
sunny/snowy   10              277.000000
Name: avg_sales, dtype: float64

2015-08-14

Python matplotlib subplot: displaying multiple graphs

Reference:

from pylab import *
x = arange(1, 10, 0.1)
y1 = sin(x)
y2 = cos(x)

# graph in the top row
subplot(211)
plot(x, y1)
xlabel("1") # xlabel of the top row

# graph in the bottom row
subplot(212)
plot(x, y2)
xlabel("2") # xlabel of the bottom row

show()

matplotlibのグラフの体裁を整える - たこ焼き食べた.net

Going further

Reference link:

pylab_examples example code: subplots_demo.py — Matplotlib 1.4.3 documentation

Example:

import datetime

import MySQLdb
import matplotlib.pyplot as plt
from matplotlib.dates import MonthLocator, DayLocator, DateFormatter

connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor = connection.cursor()

months = MonthLocator() # every month
monthsFmt = DateFormatter('%m')
days = DayLocator() # every day


def fetch_series(table):
    # Return (dates, values) for one store table.
    # The table name is fixed in this script, not user input.
    cursor.execute("select avg_sales+abs(avg_destruction), date from " + table)
    dates = []
    values = []
    for row in cursor.fetchall():
        values.append(row[0])
        dates.append(datetime.datetime.strptime(row[1], '%m/%d/%y'))
    return dates, values


# One figure with two rows. Calling plt.subplots() without arguments after
# subplot(211) opens a new figure each time instead of filling the rows.
fig, (ax_d, ax_a) = plt.subplots(2, 1)

panels = [(ax_d, "centerinvicinityd", "center in vicinity d"),
          (ax_a, "minatoa", "minato a")]

for ax, table, label in panels:
    dates, values = fetch_series(table)  # each store is plotted against its own dates
    ax.plot_date(dates, values, '-')
    ax.xaxis.set_major_locator(months)
    ax.xaxis.set_major_formatter(monthsFmt)
    ax.xaxis.set_minor_locator(days)
    ax.set_xlabel(label)
    ax.grid(True)

# fig.suptitle("Prod over Date in Areas")
plt.show()

cursor.close()
connection.close()

[Figure: avg_sales + abs(avg_destruction) by date for centerinvicinityd and minatoa]

2015-08-13

Python for beginners: drawing a time-series graph

When using matplotlib, you can pick a graph type from the gallery below and build from it.

Graph examples: Thumbnail gallery - Matplotlib 1.4.3 documentation

import datetime

import MySQLdb
import matplotlib.pyplot as plt
from matplotlib.dates import MonthLocator, DayLocator, DateFormatter

connection = MySQLdb.connect(host="localhost", db="agemono", user="root", passwd="password", charset="utf8")
cursor = connection.cursor()
query = "select avg_sales+abs(avg_destruction), date from centerinvicinityd"
cursor.execute(query)
data = cursor.fetchall()

prod = []
datestring = []
for row in data:
    prod.append(row[0])
    print(row[1])
    dateeach = datetime.datetime.strptime(row[1], '%m/%d/%y')
    datestring.append(dateeach)

months = MonthLocator() # every month
monthsFmt = DateFormatter('%m')
days = DayLocator() # every day

fig, ax = plt.subplots()
ax.plot_date(datestring, prod, '-')
# set the limits on the axes that hold the plot
ax.set_xlim(min(datestring), max(datestring))
ax.set_ylim(min(prod), max(prod))
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(monthsFmt)
ax.xaxis.set_minor_locator(days)

ax.set_title(query)
ax.grid(True)
plt.show()

cursor.close()
connection.close()

------------------------------------------

Reference examples:

pylab_examples example code: date_demo1.py — Matplotlib 1.4.3 documentation

pylab_examples example code: date_demo2.py — Matplotlib 1.4.3 documentation

Error notes:

➜ matplotlib_test python avg_min_temp_prod.py

Error:

Traceback (most recent call last):
  File "avg_min_temp_prod.py", line 22, in <module>
    datetime.datetime.strptime(date, '%m/%d/%Y')
TypeError: must be string, not list

(strptime was given the whole list; it needs one string at a time.)

------------------

Code:

prod = []
datestring = []

for row in data:
    prod.append(row[0])
    dateeach = datetime.datetime.strptime(row[1], '%m/%d/%Y')
    datestring.append(dateeach)

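Error:

ValueError: time data '1/1/14' does not match format '%m/%d/%Y'

Cause:

At the time I thought the dates had to be zero-padded, like 01/01/14. The mismatch is actually in the year: %Y expects four digits, while the data has two-digit years, which %y reads (as in the '%m/%d/%y' used in the full script above).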
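The mismatch can be checked directly with the standard library. %m and %d accept values without a leading zero, so the sketch below suggests that the two-digit year is what '%Y' rejects; '%y' parses the same string unchanged. Written for Python 3.

```python
import datetime

raw = '1/1/14'   # date string as it appears in the error message

try:
    datetime.datetime.strptime(raw, '%m/%d/%Y')   # %Y wants a 4-digit year
    year_error = False
except ValueError as err:
    year_error = True
    print("ValueError:", err)

parsed = datetime.datetime.strptime(raw, '%m/%d/%y')  # %y reads a 2-digit year
print(parsed.date())
```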

Reference:

Pythonで文字列 <-> 日付(date, datetime) の変換 - Qiita: http://qiita.com/shibainurou/items/0b0f8b0233c45fc163cd

Solution:

Reference:

開発メモ：MySQLのカラムデータに文字列を付け足してUPDATE｜理総研Web＠中学受験専門理科総合研究所: http://www.system-ido.com/risouken/index.php?page=tech&num=2

------------ Reference sites

You should be using datetime.datetime.strptime. Note that very old versions of Python (2.4 and older) don't have datetime.datetime.strptime; use time.strptime in that case.

References:

Why is datetime.strptime not working in this simple example?: http://stackoverflow.com/questions/12070193/why-is-datetime-strptime-not-working-in-this-simple-example

Python time strptime() Method: http://www.tutorialspoint.com/python/time_strptime.htm

plotting time in python with matplotlib: http://stackoverflow.com/questions/1574088/plotting-time-in-python-with-matplotlib

You must first convert your timestamps to python datetime objects (use datetime.strptime). Then use date2num to convert the dates to matplotlib format.

Plot the dates and values using plot_date:

dates = matplotlib.dates.date2num(list_of_datetimes)
plot_date(dates, values)