ch1_overview
statistical learning
supervised / unsupervised
supervised : given data (x, y), build a model f, a mapping from input x to output y, so that Y ~ f(X)
- use the model f to predict Y for unknown inputs X'
- find the relationship between factors based on data
ex) predict wage from factors such as year, age, gender ...
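A minimal sketch of the supervised setup: fit f on (x, y) pairs, then predict Y for new inputs X'. The data here is synthetic (not the wage data), and the straight-line form of f and the names `coef`, `f`, `x_prime` are assumptions for illustration only.

```python
import numpy as np

# Synthetic training data (an assumption, not the wage data from the notes).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

# Fit f(x) = b0 + b1 * x by least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def f(x_new):
    """Predict Y for inputs x_new using the fitted line."""
    x_new = np.asarray(x_new, dtype=float)
    return coef[0] + coef[1] * x_new

# Predict Y for new inputs X' that were not in the training data.
x_prime = np.array([2.5, 7.0])
y_pred = f(x_prime)
print(y_pred)
```

The fitted coefficients should land near the values used to generate the data (intercept 1, slope 2), which is only checkable here because the data is simulated.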
consider outliers
https://en.wikipedia.org/wiki/Outlier
scatter plot
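box plot : median (center line), box edges at the 25th and 75th percentiles
=> choose the plot based on what you want to show (scatter plot: relationship between two continuous variables, box plot: distribution of a variable per group)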
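A small sketch of the numbers behind a box plot, plus a way to flag outliers. The values are made up, and the 1.5 * IQR cutoff is the usual box-plot convention, assumed here rather than taken from these notes.

```python
import numpy as np

# Made-up sample with one value far from the rest.
values = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 20.0])

# Quartiles: the box spans q1..q3, with the median drawn inside it.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker limits (assumption: 1.5 * IQR rule).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are drawn individually as outliers.
outliers = values[(values < lower) | (values > upper)]
print(q1, median, q3, outliers)
```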
regression : predict a continuous output from inputs (ch3)
classification : predict a categorical output Y from inputs X (ch4)
categorical -> takes one of a finite set of classes, not continuous
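A minimal sketch of predicting a categorical output. k-nearest neighbors is used only because it is short; it is not necessarily the method covered in ch4. The data, the labels "up"/"down", and the helper `knn_predict` are all hypothetical.

```python
import numpy as np

# Made-up training data: two inputs per observation, categorical labels.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])
y_train = np.array(["down", "down", "down", "up", "up", "up"])

def knn_predict(x_new, k=3):
    """Return the majority label among the k nearest training points."""
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

print(knn_predict([1.2, 1.4]))
print(knn_predict([6.8, 6.1]))
```

The output is a class label, not a number, which is the difference from regression.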
machine learning does not always work : some outputs are hard to predict from the available inputs
unsupervised
given only inputs x (no output y), learn the underlying structure => hard to analyze, because there is no y to check the result against
=> use a dimension reduction technique (ch12)
find some 'meaningful' directions Z1 and Z2, then plot the data using Z1 and Z2. now we can see some underlying structure.
ex) gene expression data (expression levels of 6830 genes from 64 cancer cell lines)
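A rough sketch of finding two directions Z1 and Z2. PCA via SVD is one common dimension reduction technique, assumed here; the notes do not say which one ch12 uses. The data is a synthetic stand-in (64 observations, 200 features, two hidden groups), not the gene expression data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 64 observations, 200 features, two hidden groups.
n, p = 64, 200
groups = np.repeat([0, 1], n // 2)
shift = rng.normal(0, 1, size=p)
X = rng.normal(0, 1, size=(n, p)) + 3.0 * np.outer(groups, shift)

# Center each feature, then take the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two right singular vectors: columns are Z1, Z2.
Z = Xc @ Vt[:2].T
print(Z.shape)
```

A scatter plot of `Z[:, 0]` against `Z[:, 1]` would then show the two groups as separate clusters, even though no labels were used to compute Z.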