ch2_learning
Key terms
Y = f(x1, x2, x3)
- want to improve sales (Y) of a product
-> Y: output variable, dependent variable
- control advertising budgets: SNS (x1), streaming (x2), flyers (x3)
->x1, x2, x3 : input variables, independent variables, predictors
Key questions
1) What is the relationship between x1, x2, x3 and Y? -> learning
2) How accurately can we predict Y from x1, x2, x3? -> prediction
data --(learn)--> pattern, knowledge, principles (model)
data <--(apply)-- model
Formally,
collect data : observe Yi and Xi = (Xi1, ..., Xip) for i = 1, ..., n
assume that there is a relationship between Y and X's.
model the relationship f as Yi = f(Xi) + ei, where ei is a zero-mean random error
statistical learning : estimate (learn) f from data
Models are useful for
1) prediction : predict Y from (new or unseen) X
2) inference : understand the relationship between X and Y
Prediction
Once we have a good model, we can predict Y from new X.
y^ (prediction, estimate) = f^(X) (estimate of f, f itself is unknown!)
(The ^ symbol here is called a caret; y^ is read as "y hat".)
How accurate is the prediction?
Reducible vs irreducible errors
True relationship : Y = f(x) + e
We learn f from data and use it for prediction. Y^ = f^(X)
In general, f^ != f : this is the reducible error; it can be reduced by making f^ closer to f.
But even if we knew f exactly and predicted Y^ = f(X), there would still be irreducible error: the prediction still misses e.
Irreducible errors are caused by randomness and by variables outside X (our set of predictor variables).
Quantification of the error
- mean squared error(MSE)
Goal : estimate f so that reducible error is minimized
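As a short derivation sketch (treating f^ and X as fixed, so the only remaining randomness is the zero-mean error e), the expected squared error splits into the two parts above:

E[(Y - \hat{Y})^2] = E[(f(X) + \epsilon - \hat{f}(X))^2]
                   = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}

(the cross term drops out because E[\epsilon] = 0)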
Inference
In prediction, f^ can be treated as a black box: we only care that its predictions are accurate.
But, for inference, we want to know the exact form of f.
Understand how Y changes as a function of X1, ..., Xp.
input (X) --> black box f --> output (Y)
Inference questions
-which predictors are associated with the response? (see the sketch after this list)
e.g. among X1, ..., Xp, which are relevant?
-what is the relationship between the response and each predictor?
e.g. does increasing x1 increase (or decrease) Y?
e.g. does increasing x1 increase (or decrease) Y when x2 is positive?
-Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
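A minimal sketch of attacking the first two questions with a linear model on synthetic data (all names and numbers below are hypothetical; statsmodels is assumed to be available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: only x1 and x2 actually drive Y; x3 is irrelevant.
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()  # ordinary least squares
print(fit.params)   # estimated coefficients: sign and size suggest direction/strength of each effect
print(fit.pvalues)  # a small p-value suggests the predictor is associated with the response
```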
Some examples of Prediction vs. inference
prediction example : direct-marketing
- given 90,000 people with 400 different characteristics, want to predict how much money an individual will donate.
- should I send a mailing to a given individual?
(Don't care how you estimate Y)
x1, ..., xp: demographic data
Y: positive or negative response
Inference example : advertising
- which media contribute to sales?
- which media generate the biggest boost in sales?
- how much increase in sales is associated with a given increase in SNS advertising?
Inference example : Housing
How do we estimate f?
Given a set of training data {(x1, y1), (x2, y2), ..., (xn,yn)}, we want to estimate f.
Two types of approaches.
-parametric methods
-non-parametric methods
Parametric methods
-estimating f -> estimating a set of parameters
Step 1. make an assumption about the functional form, a model, of f.
e.g. a linear model: f(X) = β0 + β1·X1 + ... + βp·Xp
Step 2. Use the training data to fit the model.
e.g. estimate β0, β1, ..., βp of the linear model (using least squares)
linear model : only need to estimate p+1 coefficients! (a least-squares sketch follows below)
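A minimal sketch of Steps 1-2 for a linear model fit by least squares (numpy only; the synthetic data and coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: assume a linear functional form f(X) = b0 + b1*x1 + b2*x2.
# Step 2: estimate the p + 1 = 3 coefficients from training data by least squares.
n, p = 100, 2
X = rng.normal(size=(n, p))                       # predictors
true_beta = np.array([1.0, 2.0, -0.5])            # hypothetical b0, b1, b2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.3, size=n)  # Y = f(X) + e

X_design = np.column_stack([np.ones(n), X])       # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # least-squares estimates
print(beta_hat)  # should be close to the true b0, b1, b2
```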
non-parametric methods
-do not make explicit assumptions about the functional form of f.
-advantage : flexibility! can fit a much wider range of shapes of f.
-disadvantage: harder to learn; typically requires much more data (see the k-nearest-neighbors sketch below).
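As one concrete non-parametric sketch (k-nearest-neighbors regression, used here purely for illustration on made-up 1-D data): f is estimated locally from nearby training points, without assuming any functional form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: y = f(x) + e with a nonlinear f we never write down.
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=200)

def knn_predict(x_new, x_train, y_train, k=5):
    """Estimate f(x_new) as the average y of the k nearest training points."""
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[nearest].mean()

print(knn_predict(2.0, x_train, y_train))  # should be roughly sin(2.0) ~ 0.91
```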
parametric vs. non-parametric models
-There are always parameters.
-Parametric models: parameters are explicitly estimated. (e.g. linear regression)
-Non-parametric models:
-I choose a family of models, but I don't have direct control over the parameters.
-Non-parametric models actually end up with far more parameters!
The more flexible, the better?
Q. Why would we ever choose a more restrictive method instead of a very flexible approach?
Trade-off : flexibility vs. interpretability
[Figure: the trade-off between flexibility and interpretability for various statistical learning methods. In general, as flexibility increases, interpretability decreases.]
Simple models are easier to interpret!
Back to linear regression.
Y^i = β0 + β1·Xi1 + β2·Xi2 + ... + βp·Xip
βj : the average change in Y for a one-unit increase in Xj, holding all other variables constant. (e.g. if β1 = 2, increasing X1 by one unit while fixing the other predictors raises the predicted Y by 2 on average.)
Overfitting
An overly flexible model -> poor estimation: it starts fitting the random error e, not just f.
Y = f(X) + e (random error)
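A small sketch of the effect, assuming polynomial fits of increasing flexibility on made-up data (numpy only; the exact numbers will vary by seed):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: Y = f(X) + e with f(x) = sin(x).
x = rng.uniform(0, 6, size=40)
y = np.sin(x) + rng.normal(scale=0.3, size=40)
x_train, y_train = x[:30], y[:30]
x_test,  y_test  = x[30:], y[30:]

for degree in (1, 4, 15):  # increasing flexibility
    coeffs = np.polyfit(x_train, y_train, degree)   # may warn for the high-degree fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))

# Training MSE keeps shrinking as the degree grows, but test MSE typically
# gets worse for the very flexible fit: it has started modeling the noise e.
```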
Summary
learned key concepts of supervised learning
learning : learn f from (training) data
prediction vs. inference
reducible vs. irreducible errors
parametric vs. non-parametric methods for learning
flexibility vs. interpretability
overfitting
References:
- "Finding the optimal model (subtitle: understanding the bias and variance problem)" (medium.com)
- "Reducible and Irreducible Errors" (https://senthilkumarbala.medium.com/reducible-and-irreducible-errors-663eadace3a3)
- "2.4. Linear model vs non-Linear model" (wikidocs.net)
- "Parametric model vs. non-parametric model" (https://process-mining.tistory.com/131)