DataScholars

A blog about data science, computer science, machine learning, artificial intelligence, computational social science, data mining, analysis, and visualization.

An Introduction to Variable and Feature Selection (Isabelle Guyon, André Elisseeff)

by reiver

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

[PDF]

(H/T Kazem Jahanbakhsh)

Using Maximum Entropy for Text Classification (Kamal Nigam, John Lafferty, Andrew McCallum)

by reiver

This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes signicantly better, but also sometimes worse. Much future work remains, but the results indicate that maximum entropy is a promising technique for text classification.

[PDF]

A Spelling Correction Program Based on a Noisy Channel Model (Mark D. Kemighan, Kenneth W. Church, William A. Gale)

by reiver

This paper describes a new program, correct, which takes words rejected by the Unix® spell program, proposes a list of candidate corrections, and sorts them by probability. The probability scores are the novel contribution of this work. Probabilities are based on a noisy channel model. It is assumed that the typist knows what words he or she wants to type but some noise is added on the way to the keyboard (in the form of typos and spelling errors). Using a classic Bayesian argument of the kind that is popular in the speech recognition literature (Jelinek, 1985), one can often recover the intended correction, c, from a typo, t, by finding the correction c that maximizes Pr(c)Pr(tlc). The first factor, Pr(c), is a prior model of word probabilities; the second factor, Pr(t[c), is a model of the noisy channel that accounts for spelling transformations on letter sequences (e.g., insertions, deletions, substitutions and reversals). Both sets of probabilities were trained on data collected from the Associated Press (AP) newswire. This text is ideally suited for this purpose since it contains a large number of typos (about two thousand per month).

[PDF]

Document Clustering Based On Non-negative Matrix Factorization (Wei Xu, Xin Liu, Yihong Gong)

by reiver

In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.

[PDF]

Statistical Data Mining Tutorials (Andrew Moore)

by reiver

If you want to learn statistical data mining, one place to learn this is with Andrew W. Moore's Statistical Data Mining Tutorial.

Here is Andrew's table of contents:

  • Decision Trees. The Decision Tree is one of the most popular classification algorithms in current use in Data Mining and Machine Learning. This tutorial can be used as a self-contained introduction to the flavor and terminology of data mining without needing to review many statistical or probabilistic pre-requisites. If you're new to data mining you'll enjoy it, but your eyebrows will raise at how simple it all is! After having defined the job of classification, we explain how information gain (next Andrew Tutorial) can be used to find predictive input attributes. We show how applying this procedure recursively allows us to build a decision tree to predict future events. We then look carefully at a question so fundamental, it is the basis for much of all statistics and machine learning theory: how do you choose between a complicated model that fits the data really well and an "Occam's razor" model that is succinct yet not so good at fitting data (this topic will be revisited in later Andrew Lectures, including "Cross-validation" and "VC-dimension"). We also discuss the very wide world of improvements and tweaks on the basic decision tree idea.

  • Information Gain. This tutorial steps through the ideas from Information Theory that eventually lead to Information Gain...one of the most popular measures of association currently used in data mining. We visit the ideas of Entropy and Conditional Entropy along the way. Look at the lecture on Gaussians for discussion of Entropy in the case of continuous probability density functions.

  • Probability for Data Miners. This tutorial reviews Probability starting right at ground level. It is, arguably, a useful investment to be completely happy with probability before venturing into advanced algorithms from data mining, machine learning or applied statistics. In addition to setting the stage for techniques to be used over and over again throughout the remaining tutorials, this tutorial introduces the notion of Density Estimation as an important operation, and then introduces Bayesian Classifiers such as the overfitting-prone Joint-Density Bayes Classifier, and the over-fitting-resistant Naive Bayes Classifier.

  • Probability Density Functions. A review of a world that you've probably encountered before: real-valued random variables, probability density functions, and how to deal with multivariate (i.e. high dimensional) probablity densities. Here's where you can review things like Expectations, Covariance Matrices, Independence, Marginal Distributions and Conditional Distributions. Once you're happy with this stuff you won't be a data miner, but you'll have the tools to very quickly become one.

  • Gaussians. Gaussians, both the friendly univariate kind, and the slightly-reticent-but-nice-when-you-get-to-know-them multivariate kind are extremely useful in many parts of statistical data mining, including many data mining models in which the underlying data assumption is highly non-Gaussian. You need to be friend with multivariate Gaussians.

  • Maximum Likelihood Estimation. MLE is a solid tool for learning parameters of a data mining model. It is a methodlogy which tries to do two things. First, it is a reasonably well-principled way to work out what computation you should be doing when you want to learn some kinds of model from data. Second, it is often fairly computationally tractable. In any case, the important thing is that in order to understand things like polynomial regression, neural nets, mixture models, hidden Markov models and many other things it's going to really help if you're happy with MLE.

  • Gaussian Bayes Classifiers. Once you are friends with Gaussians, it it easy to use them as subcomponents of Bayesian Classifiers. This tutorial show you how.

  • Cross-Validation. Cross-validation is one of several approaches to estimating how well the model you've just learned from some training data is going to perform on future as-yet-unseen data. We'll review testset validation, leave-one-one cross validation (LOOCV) and k-fold cross-validation, and we'll discuss a wide variety of places that these techniques can be used. We'll also discuss overfitting...the terrible phenomenon that CV is supposed to present. And at the end, our hairs will stand on end as we realize that even when using CV, you can still overfit arbitrarily badly.

  • Neural Networks. We begin by talking about linear regression...the ancestor of neural nets. We look at how linear regression can use simple matrix operations to learn from data. We gurgle with delight as we see why one initial assumption leads inevitably to the decision to try to minimize sum squared error. We then explore an alternative way to compute linear parameters---gradient descent. And then we exploit gradient descent to allow classifiers in addition to regressors, and finally to allow highly non-linear models---full neural nets in all their glory.

  • Instance-based learning (aka Case-based or Memory-based or non-parametric). Over a century old, this form of data mining is still being used very intensively by statisticians and machine learners alike. We explore nearest neighbor learning, k-nearest-neighbor, kernel methods and locally weighted polynomial regression. Software and data for the algorithms in this tutorial are available from http://www.cs.cmu.edu/~awm/vizier. The example figures in this slide-set were created with the same software and data.

  • Eight Regression Algorithms. You'll have to wait to find out Andrew's ordering on them, but based on all the foundations you've covered so far we will quickly be able to run through: Regression Trees, Cascade Correlation, Group Method Data Handling (GMDH), Multivariate Adaptive Regression Splines (MARS), Multilinear Interpolation, Radial Basis Functions, Robust Regression, Cascade Correlation + Projection Pursuit

  • Predicting Real-valued Outputs: An introduction to regression. This lecture is made up entirely from material from the start of the Neural Nets lecture and a subset of the topics in the "Favorite Regression Algorithms" lecture. We talk about linear regression, and then these topics: Varying noise, Non-linear regression (very briefly), Polynomial Regression, Radial Basis Functions, Robust Regression, Regression Trees, Multilinear Interpolation and MARS.

  • Bayesian Networks. The tutorial first reviews the fundamentals of probability (but to do that properly, please see the earlier Andrew lectures on Probability for Data Mining). It then discusses the use of Joint Distributions for representing and reasoning about uncertain knowledge. Having discussed the obvious drawback (the curse of dimensionality) for Joint Distributions as a general tool, we visit the world of clever tricks involving indepedence and conditional independence that allow us to express our uncertain knowledge much more succinctly. And then we beam with pleasure as we realize we've got most of the knowledge we need to understand and appreciate Bayesian Networks already. The remainder of the tutorial introduces the important question of how to do inference with Bayesian Networks (see also the next Andrew Lecture for that).

  • Inference in Bayesian Networks (by Scott Davies and Andrew Moore). The majority of these slides were conceived and created by Scott Davies (scottd@cs.cmu,edu). Once you've got hold of a Bayesian Network, there remains the question of how you do inference with it. Inference is the operation in which some subset of the attributes are given to us with known values, and we must use the Bayes net to estimate the probability distribution of one or more of the remaining attributes. A typical use of inference is "I've got a temperature of 101, I'm a 37-year-old Male and my tongue feels kind of funny but I have no headache. What's the chance that I've got bubonic plague?".

  • Learning Bayesian Networks. This short and simple tutorial overviews the problem of learning Bayesian networks from data, and the approaches that are used. This is an area of active research by many research group, including Andrew and his students (see the Auton Lab Website for more details).

  • A Short Intro to Naive Bayesian Classifiers. I recommend using Probability For Data Mining for a more in-depth introduction to Density estimation and general use of Bayes Classifiers, with Naive Bayes Classifiers as a special case. But if you just want the executive summary bottom line on learning and using Naive Bayes classifiers on categorical attributes then these are the slides for you.

  • Short Overview of Bayes Nets. This is a very short 5 minute "executive overview" of the intuition and insight behind Bayesian Networks. Read the full Bayes Net Tutorial for more information.

  • Gaussian Mixture Models. Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering (though they are also used intensively for density estimation). In this tutorial, we introduce the concept of clustering, and see how one form of clustering...in which we assume that individual datapoints are generated by first choosing one of a set of multivariate Gaussians and then sampling from them...can be a well-defined computational operation. We then see how to learn such a thing from data, and we discover that an optimization approach not used in any of the previous Andrew Tutorials can help considerably here. This optimization method is called Expectation Maximization (EM). We'll spend some time giving a few high level explanations and demonstrations of EM, which turns out to be valuable for many other algorithms beyond Gaussian Mixture Models (we'll meet EM again in the later Andrew Tutorial on Hidden Markov Models). The wild'n'crazy algebra mentioned in the text can be found (hand-written) here.

  • K-means and Hierarchical Clustering. K-means is the most famous clustering algorithm. In this tutorial we review just what it is that clustering is trying to achieve, and we show the detailed reason that the k-means approach is cleverly optimizing something very meaningful. Oh yes, and we'll tell you (and show you) what the k-means algorithm actually does. You'll also learn about another famous class of clusterers: hierarchical methods (much beloved in the life sciences). Phrases like "Hierarchical Agglomerative Clustering" and "Single Linkage Clustering" will be bandied about.

  • Hidden Markov Models. In this tutorial we'll begin by reviewing Markov Models (aka Markov Chains) and then...we'll hide them! This simulates a very common phenomenon... there is some underlying dynamic system running along according to simple and uncertain dynamics, but we can't see it. All we can see are some noisy signals arising from the underlying system. From those noisy observations we want to do things like predict the most likely underlying system state, or the time history of states, or the likelihood of the next observation. This has applications in fault diagnosis, robot localization, computational biology, speech understanding and many other areas. In the tutorial we will describe how to happily play with the mostly harmless math surrounding HMMs and how to use a heart-warming, and simple-to-implement, approach called dynamic programming (DP) to efficiently do most of the HMM computations you could ever want to do. These operations include state estimation, estimating the most likely path of underlying states, and and a grand (and EM-filled) finale, learning HMMs from data.

  • VC dimension. This tutorial concerns a well-known piece of Machine Learning Theory. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting? If you want to put some formal analysis into the fascinating question of how overfitting can happen, then this is the tutorial for you. In addition to getting good understanding of the overfitting phenomenon, you also end up with a method for estimating how well an algorithm will perform on future data that is solely based on its training set error, and a property (VC dimension) of the learning algorithm. VC-dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers. We'll discuss that. We'll also very briefly compare both CV and SRM to two other model selection methods: AIC and BIC.

  • Support Vector Machines. We review the idea of the margin of a classifier, and why that may be a good criterion for measuring a classifier's desirability. Then we consider the computational problem of finding the largest margin linear classifier. At this point we look at our toes with embarassment and note that we have only done work applicable to noise-free data. But we cheer up and show how to create a noise resistant classifier, and then a non-linear classifier. We then look under a microscope at the two things SVMs are renowned for---the computational ability to survive projecting data into a trillion dimensions and the statistical ability to survive what at first sight looks like a classic overfitting trap.

  • PAC Learning. PAC stands for "Probably Approximately Correct" and concerns a nice formalism for deciding how much data you need to collect in order for a given classifier to achieve a given probability of correct predictions on a given fraction of future test data. The resulting estimate is somewhat conservative but still represents an interesting avenue by which computer science has tried to muscle in on the kind of analytical problem that you would normally find in a statistics department.

  • Markov Decision Processes. How do you plan efficiently if the results of your actions are uncertain? There is some remarkably good news, and some some significant computational hardship. We begin by discussing Markov Systems (which have no actions) and the notion of Markov Systems with Rewards. We then motivate and explain the idea of infinite horizon discounted future rewards. And then we look at two competing approaches to deal with the following computational problem: given a Markov System with Rewards, compute the expected long-term discounted rewards. The two methods, which usually sit at opposite corners of the ring and snarl at each other, are straight linear algebra and dynamic programming. We then make the leap up to Markov Decision Processes, and find that we've already done 82% of the work needed to compute not only the long term rewards of each MDP state, but also the optimal action to take in each state.

  • Reinforcement Learning. You need to be happy about Markov Decision Processes (the previous Andrew Tutorial) before venturing into Reinforcement Learning. It concerns the fascinating question of whether you can train a controller to perform optimally in a world where it may be necessary to suck up some short term punishment in order to achieve long term reward. We will discuss certainty-equivalent RL, the Temporal Difference (TD) learning, and finally Q-learning. The curse of dimensionality will be constantly learning over our shoulder, salivating and cackling.

  • Biosurveillance: An example. We review methods described in other biosurveillance slides as applied to hospital admissions data from the Walkerton Cryptosporidium outbreak of 2000. This is work performed as part of the ECADS project.

  • Elementary probability and Naive Bayes classifiers. This slide repeats much of the material of the main Probability Slide from Andrew's tutorial series, but this slide-set focusses on disease surveillance examples, and includes a very detailed description for non-experts about how Bayes rule is used in practice, about Bayes Classifiers, and how to learn Naive Bayes classifiers from data.

  • Spatial Surveillance. This tutorial discusses Scan Statistics, a famous epidemiological method for discovering overdensities of disease cases.

  • Time Series Methods. This tutorial reviews some elementary univariate time series methods, with a focus on using the time series for alerting when a sequence of observations is starting to behave strangely.

  • Game Tree Search Algorithms, including Alpha-Beta Search. Introduction to algorithms for computer game playing. We describe the assumptions about two-player zero-sum discrete finite deterministic games of perfect information. We also practice saying that noun-phrase in a single breath. After the recovery teams have done their job we talk about solving such games with minimax and then alpha-beta search. We also discuss the dynamic programming approach, used most commonly for end-games. We also debate the theory and practice of heuristic evaluation functions in games.

  • Zero-Sum Game Theory. Want to know how and why to bluff in poker? How games can be compiled down to a matrix form? And general discussion of the basics of games of hidden information? Then these are the slides for you. It might help you to begin by reading the slides on game-tree search.

  • Non-zero-sum Game Theory. Auctions and electronic negotiations are a fascinating topic. These slides take you through most of the basic story of the assumptions, the formalism and the mathematics behind non-zero-sum game theory. It might help you to begin by reading the slides on game-tree search and Zero-sum Game theory with Hidden information available from this same set of tutorials. In this tutorial we cover the definition of a multiplayer non-zero-sum game, domination of strategies, Nash Equilibia. We deal with discrete games, and also games in which strategies include real numbers, such as your bid in a two player double auction negotiation. We cover prisoner's dilemma, tragedy of the commons, double auctions, and multi-player auctions such as the first price sealed auction and the second price auction. The math for the double auction analysis can be found at http://www.cs.cmu.edu/~awm/double_auction_math.pdf.

  • Introductory overview of time-series-based anomaly detection algorithms. This simple tutorial overviews some methods for detecting anomalies in biosurveillance time series. The slides are incomplete: verbal commentary from the presentation has not yet been included as explanatory textboxes. Please let me (awm@cs.cmu.edu) know if you would be interested in more detail on these slides and/or access to the software that implements and graphs the various univariate methods. If I receive enough requests I will try to make both of the above available.

  • AI Class introduction. A very quick informal discussion of the different kinds of AI research motivations out there

  • Search Algorithms. What is a search algorithm? What job does it do and where can it be applied? We introduce various flavors of Breadth First Search and Depth First search and then looks at alternatives and improvements that include Iterative Deepening and Bidirectional Search. Then we look with furrowed brow at an idea called Best First Search. This will be our first view of a search algorithm that is able to exploit a heuristic function.

  • A-star Heuristic Search. The classic algorithm for finding shortests paths given an admissible heuristic. We'll deal with the notion of admissibility (summary: admissible = optimistic). We show how you can prove properties of A*. We'll also briefly discuss IDA* (iterative deepening A*).

  • Constraint Satisfaction Algorithms, with applications in Computer Vision and Scheduling. The tutorial teaches concepts from the AI literature on Constraint Satisfaction. Accompanying animations are in http://www.cs.cmu.edu/~awm/animations/constraint. This is a special case of uninformed search in which we want to find a solution configuration for some set of variables that satisfies a set of constraints. Example problems including graph coloring, 8-queens, magic squares, the Waltz algorithm for interpreting line drawings, many kinds of scheduling and most important of all, the deduction phase of minesweeper. The algorithms we'll look at include backtracking search, forward checking search and constraint propagation search. We'll also look at general-purpose heuristics for additional search accelerations.

  • Robot Motion Planning. We review some algorithms for clever path planning once we arrive in real-valued continuous space instead of the safe and warm discrete space we've been sheltering in so far. We look at configuration spaces, visibility graphs, cell-decomposition, voronoi-based planning and potential field methods. Unfortunately some of the figures are missing from the PDF version.

  • HillClimbing, Simulated Annealing and Genetic Algorithms. Some very useful algorithms, to be used only in case of emergency.

Information Visualization for Large-Scale Data Workflows (Michael Conover)

by reiver

Michael Conover gives a talk on information visualization for large-scale data workflows:

Figure 1. Michael Conover: Information Visualization for Large-Scale Data Workflows

(H/T Peter Skomoroch)

Intro to D3 (Manu Kapoor)

by reiver

D3 is a popular and powerful library for creating visualizations on the web with JavaScript.

Manu Kapoor gives an intro tutorial on D3:

Figure 1. 2013-09-24 - Polyglot Visits Vancouver Data Visualization - Under the Hood of D3 - Manu Kapoor

(H/T Saem Ghani)

Exploiting Similarities among Languages for Machine Translation (Tomas Mikolov, Quoc V. Le, Ilya Sutskever)

by reiver

Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

arXiv:1309.4168 [cs.CL]

Video Tutorials: Intro to R

by reiver

Do you want to learn R?

There is a great series of video tutorial on learning R that teaches the basic and sets you up to be a very competent R programmer.

Keynote on Visualization Principles (Tamara Munzner)

by reiver

Tamara Munzner (bit.ly/tmunzner) presents very lucid and useful guidelines for creating effective visualizations, including how to correctly rank visual channel types and how to use categorical color constraints. She explains advantages of 2D representation and drawbacks of 3D, immersive, or animated visualizations. She also describes how to create visualizations that reduce the viewer's cognitive load, and how to validate visualizations. This talk was presented at VIZBI 2011, an international conference series on visualizing biological data (vizbi.org) funded by NIH & EMBO. This video was filmed and distributed with permission under a creative common license. Slides from the talk are at bit.ly/nCJM5U

[VIDEO]

Figure 1. Tamara Munzner: Keynote on Visualization Principles

Animated Transitions in Statistical Data Graphics (Jeffrey Heer, George G. Robertso)

by reiver

In this paper we investigate the effectiveness of animated transitions between common statistical data graphics such as bar charts, pie charts, and scatter plots. We extend theoretical models of data graphics to include such transitions, introducing a taxonomy of transition types. We then propose design principles for creating effective transitions and illustrate the application of these principles in DynaVis, a visualization system featuring animated data graphics. Two controlled experiments were conducted to assess the efficacy of various transition types, finding that animated transitions can significantly improve graphical perception

[PDF]

Complex Adaptive Dynamical Systems, a Primer (Claudius Gros)

by reiver

Complex Adaptive Dynamical Systems, a Primer is a free book by Claudius Gros.

A related description from the author:

Complex system theory is rapidly developing and gaining importance, providing tools and concepts central to our modern understanding of emergent phenomena. [...]

Network theory, dynamical systems and information theory, the core of modern complex system sciences [...] [covers] basic concepts and phenomena like

  • small-world networks,
  • bifurcation theory and
  • information entropy.

[...] [Other topics under] complex system sciences [...] [are] emergence and self-organization [...] Prominent examples are

  • self-organized criticality in adaptive systems,
  • life at the edge of chaos,
  • hypercycles and coevolutionary avalanches,
  • synchronization phenomena,
  • absorbing phase transitions and
  • the cognitive system approach to the brain.

There is also a number of exercises that are related to this book, at the course page: Complex and Adaptive Dynamical Systems.

(H/T Richard Harper)

Very Simple Classification Rules Perform Well on Most Commonly Used Datasets (Robert C. Holte)

by reiver

This article reports an empirical investigation of the accuracy of rules that classify examples on the basis of a single attribute. On most datasets studied, the best of these very simple rules is as accurate as the rules induced by the majority of machine learning systems. The article explores the implications of this finding for machine learning research and applications.

[PDF] or [HTML]

Tommy Levi's "Data Science in the Wild" Slides

by reiver

We had a record turn out at Tommy Levi's Data Science in the Wild talk. (At least 250 people showed up.)

There has been a lot of people asking for Tommy's slides, so without further adieu, here are they are.

Download them here: tslevi-datascience-in-the-wild.pdf

If you want to do data science with R cluster, you want to read these slides.

Yoshua Bengio's AAAI Tutorial on Deep Learning of Representations

by reiver

Deep Learning of Representations: a AAAI 2013 Tutorial is a tutorial Yoshua Bengio gave at the AAAI 2013.

Get his slides here. And bibliographic references here.

(Via Chris Brockett)

Data Science in the Wild: Tommy Levi speaking at Vancouver Meetup on Wednesday June 26th at 6:30 PM

by reiver

Tommy Levi, a data scientist in Vancouver, is giving an interesting talk on Wednesday June 26th at 6:30 PM.

Tommy will be speaking at a combined meetup event for the Vancouver-based Data Science group, Machine Learning group and the Vancouver R user group.

(At the time of writing, over 200 people have registered for the talk. So there is obviously a lot of interest.)

Here is Tommy's talk's abstract:

Analyzing User Behavior at Plenty of Fish: Data Science in the Wild

How is Machine Learning and Data Science actually used in the real-world?

Tommy Levi tells you how, and goes into the details of how he has been using them.

Tommy will walk through the opening steps (and missteps) he took in starting to analyze user behavior on the Plenty of Fish site. He will discuss data preparation and wrangling, parallel computing and initial exploration and feature analysis. The goal of the talk is not to focus on any specific algorithms, but to show the steps taken for a real world analysis on large, often messy data and how to get actionable, useful results from such an analysis.

If you are in the Vancouver area, and are interested in Data Science or Machine Learning, you should be there.

Recent Developments in Deep Learning

by reiver

Geoff Hinton is a well known name when it comes to artificial neural networks.

Here Geoff Hinton talks about recent developments in deep learning:

Figure 1. Geoff Hinton - Recent Developments in Deep Learning

Deep Learning using Support Vector Machines (Yichuan Tang)

by reiver

Recently, fully-connected and convolutional neural networks have been trained to reach state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics data. For classification tasks, much of these "deep learning" models employ the softmax activation functions to learn output labels in 1-of-K format. In this paper, we demonstrate a small but consistent advantage of replacing softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. In almost all of the previous works, hidden representation of deep networks are first learned using supervised or unsupervised techniques, and then are fed into SVMs as inputs. In contrast to those models, we are proposing to train all layers of the deep networks by backpropagating gradients through the top level SVM, learning features of all layers. Our experiments show that simply replacing softmax with linear SVMs gives significant gains on datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.

arXiv:1306.0239 [cs.LG]

A Bayesian Approach for Predicting the Popularity of Tweets (Tauhid Zaman, Emily B. Fox, Eric T. Bradlow)

by reiver

We predict the popularity of short messages called tweets created in the micro-blogging site known as Twitter. We measure the popularity of a tweet by the time-series path of its retweets, which is when people forward the tweet to others. We develop a probabilistic model for the evolution of the retweets using a Bayesian approach, and form predictions using only observations on the retweet times and the local network or "graph" structure of the retweeters. We obtain good step ahead forecasts and predictions of the final total number of retweets even when only a small fraction (i.e. less than one tenth) of the retweet paths are observed. This translates to good predictions within a few minutes of a tweet being posted and has potential implications for understanding the spread of broader ideas, memes, or trends in social networks and also revenue models for both individuals who "sell tweets" and for those looking to monetize their reach.

arXiv:1304.6777 [cs.SI]

John Langford on Machine Learning

by reiver

John Langford is a machine learning research scientist at Microsoft Research New York and the principal developer of Vowpal Wabbi.

Here John Langford talks about machine learning.

Figure 1. Machine Learning for Industry with Microsoft Research Lead Scientist

Vowpal Wabbit (VW): Fast Open Source Optimization

by reiver

Vowpal Wabbit (or just VW for short) is an open source system designed to be a fast, scalable, useful learning algorithm, used for solving optimization problems.

Examples are available here.

Item-based Collaborative Filtering Recommendation Algorithms (Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl)

by reiver

Recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. These systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the Web. The tremendous growth in the amount of available information and the number of visitors to Web sites in recent years poses some key challenges for recommender systems. These are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. In traditional collaborative filtering systems the amount of work increases with the number of participants in the system. New recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. To address these issues we have explored item-based collaborative filtering techniques. Item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users.

In this paper we analyze different item-based recommendation generation algorithms. We look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). Finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. Our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.

[HTML]

Approximating Bayesian inference with a sparse distributed memory system (Joshua T. Abbott, Jessica B. Hamrick, Thomas L. Griffiths)

by reiver

Probabilistic models of cognition have enjoyed recent success in explaining how people make inductive inferences. Yet, the difficult computations over structured representations that are often required by these models seem incompatible with the continuous and distributed nature of human minds. To reconcile this issue, and to understand the implications of constraints on probabilistic models, we take the approach of formalizing the mechanisms by which cognitive and neural processes could approximate Bayesian inference. Specifically, we show that an associative memory system using sparse, distributed representations can be reinterpreted as an importance sampler, a Monte Carlo method of approximating Bayesian inference. This capacity is illustrated through two case studies: a simple letter reconstruction task, and the classic problem of property induction. Broadly, our work demonstrates that probabilistic models can be implemented in a practical, distributed manner, and helps bridge the gap between algorithmic- and computational-level models of cognition.

[PDF]

Video Lectures: Convex Optimization, by Stephen Boyd

by reiver

Convex Optimization has become more and more important to people researching machine learning.

Stephen Boyd has a series of video lectures available on this topic.

The video lectures are in two parts: "Convex Optimization I" and "Convex Optimization II". Here is the description of the "Convex Optimization I" sub-series:

Convex Optimization I concentrates on recognizing and solving convex optimization problems that arise in engineering. Convex sets, functions, and optimization problems. Basics of convex analysis. Least-squares, linear and quadratic programs, semidefinite programming, minimax, extremal volume, and other problems. Optimality conditions, duality theory, theorems of alternative, and applications. Interior-point methods. Applications to signal processing, control, digital and analog circuit design, computational geometry, statistics, and mechanical engineering.

And here is the description for the "Convex Optimization II" sub-series:

This course introduces topics such as subgradient, cutting-plane, and ellipsoid methods. Decentralized convex optimization via primal and dual decomposition. Alternating projections. Exploiting problem structure in implementation. Convex relaxations of hard problems, and global optimization via branch & bound. Robust optimization. Selected applications in areas such as control, circuit design, signal processing, and communications.

All video below:

A play list is available too.

Stein's Paradox in Statistics (Bradley Efron, Carl Morris)

by reiver

Sometimes a mathematical result is strikingly contrary to generally held belief even though an obviously valid proof is given. Charles Stein of Stanford University discovered such as a paradox in statistics in 1955. His result undermined a century and a half of work on estimation theory, going back to Karl Friedrich Gauss and Adrien Marie Legendre. After a long period of resistance to Stein's ideas, punctuated by frequent and sometimes angry debate, the sense of paradox has diminished and Stein's ideas are being incorporated into applied and theoretical statistics.

Stein's paradox concerns the use of observed averages to estimate unobserved quantities. Averaging is the second most basic process in statistics, the first being the simple act of counting.

[...]

The paradoxical element in Stein's result is that it sometimes contradicts this elementary law of statistical theory.

[PDF]

TOMORROW: Hadley Alexander Wickham: Speaking at Vancouver Meetup on Wednesday May 8th at 7:00 PM

by reiver

Tomorrow is the day.

R users are likely to know the name: Hadley Alexander Wickham.

Hadley will be in Vancouver tomorrow, and will be speaking at a combined meetup event for the Vancouver-based Data Science group and the Vancouver R user group.

If you are in the Vancouver area, use R, or are interested in statistics or visualization, be there tomorrow.

Video Lectures: Signals and Systems, by Alan V. Oppenheim

by reiver

Alan V. Oppenheim has a number of video lectures on: Signals and Systems (up on MIT OpenCourseWare).

Here is the complete list of video lectures:

To get the most out of the video lectures, it would probably be a good idea to work through the assignments and readings too.

A Global Geometric Framework for Nonlinear Dimensionality Reduction (Joshua B. Tenenbaum, Vin de Silva, John C. Langford)

by reiver

Scientists working with large volumes of high-dimensional data, such as global climate patterns, stellar spectra, or human gene distributions, regularly confront the problem of dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. The human brain confronts the same problem in everyday perception, extracting from its high-dimensional sensory inputs—30,000 auditory nerve fibers or 106 optic nerve fibers—a manageably small number of perceptually relevant features. Here we describe an approach to solving dimensionality reduction problems that uses easily measured local metric information to learn the underlying global geometry of a data set. Unlike classical techniques such as principal component analysis (PCA) and multidimensional scaling (MDS), our approach is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations, such as human handwriting or images of a face under different viewing conditions. In contrast to previous algorithms for nonlinear dimensionality reduction, ours efficiently computes a globally optimal solution, and, for an important class of data manifolds, is guaranteed to converge asymptotically to the true structure.

[PDF]