# Introduction to Machine Learning Concepts
[TOC]
## How to Study Machine Learning

Machine learning is a subject that combines theory and practice.

- Theory-oriented approach:
    - derive everything deeply for solid understanding
    - but cares little about what practitioners actually need
- Technique-oriented approach:
    - flash over the sexiest techniques broadly for shiny coverage
    - with so many techniques, it is hard to choose among them and hard to apply each one to the right scenario

Neither memorizing pure theory nor chasing flashy techniques without studying the underlying theory is appropriate; we should start from the essential foundations of machine learning:

- learn the key theory, the key techniques, and how they are used in practice
- learn what every machine learning practitioner must master
## From Learning to Machine Learning

Learning: acquiring skill with experience accumulated from observations.

observations -> learning -> skill

Machine learning: acquiring skill with experience accumulated/computed from data.

data -> ML -> skill

Skill: improving some performance measure (e.g. prediction accuracy).

data -> ML -> improved performance measure
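A "skill" here is anything that improves a measurable quantity. As a minimal sketch (not from the course; the data are made up for illustration), one common performance measure, prediction accuracy on a classification task, can be computed as:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# A hypothetical classifier's outputs versus the ground truth:
preds = [1, -1, 1, 1, -1]
truth = [1, -1, -1, 1, -1]
print(accuracy(preds, truth))  # 4 of 5 correct -> 0.8
```

A learning algorithm whose hypothesis raises this number over the existing solution has, in this sense, acquired a skill.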
In short, machine learning starts from data and learns a skill that improves on the existing solution. Machine learning is applicable when:

- there exists some underlying pattern to be learned
- but there is no easy, programmable mathematical definition of the pattern
- and there is plenty of data about the pattern
## Examples of Machine Learning

- Food: mining text and location data from social networks to learn how restaurant hygiene affects public health
- Clothing: using sales figures and customer surveys to recommend outfit combinations to customers
- Housing: predicting a building's energy load from its structural features and consumption data
- Transportation: autonomous driving
## Machine Learning and Related Fields

### Machine Learning and Data Mining

- Machine learning: use data to compute a hypothesis g that approximates a target function f.
- Data mining: use (often large amounts of) data to find useful or interesting properties, associations, and so on in the data.
- If data mining's goal is restricted to finding a hypothesis that approximates the target function, then the two fields have the same goal and are essentially no different.
- Data mining's goal is not always that, however. When the interesting property is related to a hypothesis that approximates the target, data mining can help machine learning, and vice versa.
- Traditional data mining also focuses on efficient computation over large databases.
- The two fields are very similar, but not identical.
### Machine Learning and Artificial Intelligence

- Machine learning: use data to compute a hypothesis g that approximates a target f.
- Artificial intelligence: compute something that shows intelligent behavior.
- If the target function to be learned is restricted to one whose behavior is intelligent (e.g. autonomous driving), then machine learning is one way of realizing artificial intelligence.
- But machine learning is not the only route to intelligent behavior.
### Machine Learning and Statistics

- Statistics: use data to make inferences about an unknown process.
- In machine learning terms, g is an inference outcome and f is the unknown, so statistics can be used to achieve machine learning.
- Traditional statistics also focuses on provable results under mathematical assumptions, and cares less about how to compute them.
- Many of the tools used in machine learning appeared in statistics long ago, so statistics supplies machine learning with powerful tools.
## Components of Machine Learning

### Basic Notation

- input: $x \in X$
- output: $y \in Y$
- unknown pattern to be learned $\Leftrightarrow$ target function $f: X \rightarrow Y$
- data $\Leftrightarrow$ training examples
- hypothesis $\Leftrightarrow$ skill with hopefully good performance: $g: X \rightarrow Y$
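To make the notation concrete, here is a sketch of one classic learning algorithm, the perceptron (Rosenblatt's paper in the references below is its origin): from training examples $(x, y)$ it computes a hypothesis $g$ that agrees with the data. The toy data and the convergence bound are invented for illustration, and the sketch assumes the data are linearly separable.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(s):
    return 1 if s > 0 else -1

def pla(examples, max_passes=1000):
    """Perceptron learning: find weights w so that sign(w.x) matches y
    on every training example. Each x already includes a constant 1 as
    its first component, serving as the bias term."""
    w = [0.0] * len(examples[0][0])
    for _ in range(max_passes):
        mistake = False
        for x, y in examples:
            if sign(dot(w, x)) != y:                       # misclassified point
                w = [wi + y * xi for wi, xi in zip(w, x)]  # correct toward it
                mistake = True
        if not mistake:          # a full pass with no mistakes: g fits all data
            return w
    return w

# Made-up, linearly separable toy data: y = +1 iff x1 + x2 > 1.
data = [([1.0, 0.0, 0.0], -1),
        ([1.0, 2.0, 2.0], +1),
        ([1.0, 0.2, 0.3], -1),
        ([1.0, 1.5, 0.5], +1)]
g = pla(data)
print(all(sign(dot(g, x)) == y for x, y in data))  # True: g matches f on the data
```

The weight vector `g` is the hypothesis: a skill, computed from data, that reproduces the unknown pattern on the training examples.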
## References

### Classic Papers

- F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958. (Lecture 2: origin of the perceptron)
- W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963. (Lecture 4: Hoeffding's inequality)
- Y. S. Abu-Mostafa, X. Song, A. Nicholson, M. Magdon-Ismail. The bin model, 1995. (Lecture 4: origin of the bin model)
- V. Vapnik. The Nature of Statistical Learning Theory, 2nd edition, 2000. (Lectures 5-8: complete mathematical derivation and extensions of the VC dimension and VC bound)
- Y. S. Abu-Mostafa. The Vapnik-Chervonenkis dimension: information versus complexity in learning. Neural Computation, 1(3):312-317, 1989. (Lecture 7: the concept and importance of the VC dimension)
### Additional References

- A. Sadilek, S. Brennan, H. Kautz, V. Silenzio. nEmesis: which restaurants should you avoid today? First AAAI Conference on Human Computation and Crowdsourcing, 2013. (Lecture 1: ML applied to food)
- Y. S. Abu-Mostafa. Machines that think for themselves. Scientific American, 289(7):78-81, 2012. (Lecture 1: ML applied to clothing)
- A. Tsanas, A. Xifara. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49:560-567, 2012. (Lecture 1: ML applied to housing)
- J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel. Introduction to the special issue on machine learning for traffic sign recognition. IEEE Transactions on Intelligent Transportation Systems, 13(4):1481-1483, 2012. (Lecture 1: ML applied to transportation)
- R. Bell, J. Bennett, Y. Koren, C. Volinsky. The million dollar programming prize. IEEE Spectrum, 46(5):29-33, 2009. (Lecture 1: the Netflix Prize)
- S. I. Gallant. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, 1(2):179-191, 1990. (Lecture 2: origin of the pocket algorithm; note that the actual pocket algorithm is more complex than the one introduced in the course)
- R. Xu, D. Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005. (Lecture 3: clustering)
- X. Zhu. Semi-supervised learning literature survey. University of Wisconsin-Madison, 2008. (Lecture 3: semi-supervised learning)
- Z. Ghahramani. Unsupervised learning. In Advanced Lectures in Machine Learning (MLSS '03), pages 72-112, 2004. (Lecture 3: unsupervised learning)
- L. Kaelbling, M. Littman, A. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285, 1996. (Lecture 3: reinforcement learning)
- A. Blum. On-line algorithms in machine learning. Carnegie Mellon University, 1998. (Lecture 3: online learning)
- B. Settles. Active learning literature survey. University of Wisconsin-Madison, 2010. (Lecture 3: active learning)
- D. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341-1390, 1996. (Lecture 4: the formal no-free-lunch theorem)
- T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14(3):326-334, 1965. (Lectures 5-6: growth function)
- B. Zadrozny, J. Langford, N. Abe. Cost-sensitive learning by cost-proportionate example weighting. IEEE International Conference on Data Mining, 2003. (Lecture 8: weighted classification)
- G. A. F. Seber, A. J. Lee. Linear Regression Analysis, 2nd edition, Wiley, 2003. (Lecture 9: linear regression from a statistical perspective; Lectures 12-13: linear regression after polynomial transforms)
- D. C. Hoaglin, R. E. Welsch. The hat matrix in regression and ANOVA. American Statistician, 32:17-22, 1978. (Lecture 9: the hat matrix in linear regression)
- D. W. Hosmer, Jr., S. Lemeshow, R. X. Sturdivant. Applied Logistic Regression, 3rd edition, Wiley, 2013. (Lecture 10: logistic regression from a statistical perspective)
- T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. International Conference on Machine Learning, 2004. (Lecture 11: theoretical analysis of stochastic gradient descent for linear models)
- R. Rifkin, A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004. (Lecture 11: one-versus-all)
- J. Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721-747, 2002. (Lecture 11: one-versus-one)
- L. Li, H.-T. Lin. Optimizing 0/1 loss for perceptrons by random coordinate descent. In Proceedings of the 2007 International Joint Conference on Neural Networks (IJCNN '07), pages 749-754, 2007. (Lecture 11: a perceptron algorithm derived from an optimization perspective)
- G.-X. Yuan, C.-H. Ho, C.-J. Lin. Recent advances of large-scale linear classification. Proceedings of the IEEE, 2012. (Lecture 11: more advanced linear classification methods)
- Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471-1490, 2010. (Lecture 12: a method combining polynomial transforms with linear classification models)
- M. Magdon-Ismail, A. Nicholson, Y. S. Abu-Mostafa. Learning in the presence of noise. In Intelligent Signal Processing. IEEE Press, 2001. (Lecture 13: noise and learning)
- A. Neumaier. Solving ill-conditioned and singular linear systems: a tutorial on regularization. SIAM Review, 40:636-666, 1998. (Lecture 14: regularization)
- T. Poggio, S. Smale. The mathematics of learning: dealing with data. Notices of the American Mathematical Society, 50(5):537-544, 2003. (Lecture 14: regularization)
- P. Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503-514, 1989. (Lecture 15: cross-validation)
- R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), volume 2, pages 1137-1143, 1995. (Lecture 15: cross-validation)
- A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth. Occam's razor. Information Processing Letters, 24(6):377-380, 1987. (Lecture 16: Occam's razor)