DataScholars

A blog about data science, computer science, machine learning, artificial intelligence, computational social science, data mining, analysis, and visualization.

The emergence and role of strong ties in time-varying communication networks (Márton Karsai, Nicola Perra, Alessandro Vespignani)

by reiver

In most social, information, and collaboration systems the complex activity of agents generates rapidly evolving time-varying networks. Temporal changes in the network structure and the dynamical processes occurring on its fabric are usually coupled in ways that still challenge our mathematical or computational modelling. Here we analyse a mobile call dataset describing the activity of millions of individuals and investigate the temporal evolution of their egocentric networks. We empirically observe a simple statistical law characterizing the memory of agents that quantitatively signals how much interactions are more likely to happen again on already established connections. We encode the observed dynamics in a reinforcement process defining a generative computational network model with time-varying connectivity patterns. This activity-driven network model spontaneously generates the basic dynamic process for the differentiation between strong and weak ties. The model is used to study the effect of time-varying heterogeneous interactions on the spreading of information on social networks. We observe that the presence of strong ties may severely inhibit the large scale spreading of information by confining the process among agents with recurrent communication patterns. Our results provide the counterintuitive evidence that strong ties may have a negative role in the spreading of information across networks.

arXiv:1303.5966 [physics.soc-ph]
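To make the reinforcement idea concrete, here is a toy simulation (my own sketch, not the paper's model): each active agent contacts a brand-new peer with probability c / (c + k), where k is the number of distinct peers it already knows, and otherwise reinforces an existing tie in proportion to its weight. Heavily used ties become "strong"; rarely used ones stay "weak". The memory kernel and all parameter values here are illustrative assumptions.

```python
import random

def simulate(num_agents=100, steps=2000, activity=0.1, c=1.0, seed=42):
    """Toy activity-driven network with memory (illustrative only).

    Each step, an active agent either contacts a fresh peer with
    probability c / (c + k), where k is its number of existing ties,
    or reinforces an existing tie in proportion to its weight.
    """
    rng = random.Random(seed)
    ties = [dict() for _ in range(num_agents)]  # ties[i][j] = contact count
    for _ in range(steps):
        for i in range(num_agents):
            if rng.random() >= activity:
                continue  # agent i is inactive this step
            k = len(ties[i])
            if k == 0 or rng.random() < c / (c + k):
                # explore: pick a peer uniformly at random
                # (may occasionally land on an existing tie)
                j = rng.randrange(num_agents)
                if j == i:
                    continue
            else:
                # exploit: reinforce an existing tie, weighted by use
                peers = list(ties[i])
                weights = [ties[i][p] for p in peers]
                j = rng.choices(peers, weights=weights)[0]
            ties[i][j] = ties[i].get(j, 0) + 1
            ties[j][i] = ties[j].get(i, 0) + 1
    return ties

ties = simulate()
weights = [w for t in ties for w in t.values()]
print("mean tie weight:", sum(weights) / len(weights))
print("max tie weight:", max(weights))
```

Even this crude rule produces a broad spread of tie weights: a few strong, frequently reused ties alongside many weak ones.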

Node Centrality in Weighted Networks: Generalizing Degree and Shortest Paths (Tore Opsahl, Filip Agneessens, John Skvoretz)

by reiver

Ties often have a strength naturally associated with them that differentiates them from each other. Tie strength has been operationalized as weights. A few network measures have been proposed for weighted networks, including three common measures of node centrality: degree, closeness, and betweenness. However, these generalizations have solely focused on tie weights, and not on the number of ties, which was the central component of the original measures. This paper proposes generalizations that combine both these aspects. We illustrate the benefits of this approach by applying one of them to Freeman's EIES dataset.

[PDF]

Also see article.
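The combined degree measure described in the paper takes the form C(i) = k_i^(1-α) · s_i^α, where k_i is the number of ties of node i, s_i is the sum of their weights (the node's strength), and α is a tuning parameter: α = 0 recovers plain degree, α = 1 recovers strength. A minimal sketch (the graph and values are my own example, not from the paper):

```python
def degree_centrality_w(adj, alpha):
    """Weighted degree centrality combining tie count and tie weight.

    adj: dict mapping node -> {neighbor: weight}
    alpha: tuning parameter; 0 recovers plain degree (tie count),
    1 recovers strength (sum of tie weights).
    """
    scores = {}
    for node, nbrs in adj.items():
        k = len(nbrs)           # number of ties
        s = sum(nbrs.values())  # strength
        scores[node] = k ** (1 - alpha) * s ** alpha if k else 0.0
    return scores

# Small example: "a" has two weak ties; "b" has a weak and a strong tie.
adj = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 3},
    "c": {"a": 1, "b": 3},
}
print(degree_centrality_w(adj, alpha=0.0))  # pure degree: a and b tie
print(degree_centrality_w(adj, alpha=1.0))  # pure strength: b wins
```

Intermediate values of α let you decide how much a few strong ties should count relative to many weak ones.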

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems (Sébastien Bubeck, Nicolò Cesa-Bianchi)

by reiver

Sébastien Bubeck and Nicolò Cesa-Bianchi have put out a book on optimization that will be of interest to many readers.

It is called: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems.

You can either buy a copy here (discount code: MAL022024) or download it for free here.

Here's the authors' description:

Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration–exploitation trade-off. This is the balance between staying with the option that gave highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration–exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this monograph, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.
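For a concrete taste of the stochastic (i.i.d.) setting, here is a minimal sketch of the classic UCB1 index policy on Bernoulli arms. This is my own toy example, not code from the monograph; the arm means and horizon are arbitrary.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 index policy on Bernoulli arms and return the
    cumulative (pseudo-)regret relative to always playing the best arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # total reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialize
        else:
            # exploit the empirical mean plus an exploration bonus
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]
    return regret, counts

regret, counts = ucb1([0.3, 0.5, 0.7], horizon=5000)
print("cumulative regret:", round(regret, 1))
print("pulls per arm:", counts)
```

The exploration bonus shrinks as an arm is pulled more often, which is exactly the exploration-exploitation balance the authors describe; the monograph proves logarithmic regret bounds for policies of this type.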

PolyGlot (Un)Conference 2013

by reiver

The popular PolyGlot (Un)Conference is happening Friday, May 24th through Sunday, May 26th in Vancouver. Tickets are on sale now.

If you are in Vancouver, and you are a software engineer or data scientist, you should be there. (I'll be there.)


If you can only make it one day, be there on Saturday, May 25th. (But obviously, the more days you can be there, the more you will get out of it.)

There will be sessions on data science, machine learning, and artificial intelligence.

WTF: The Who to Follow Service at Twitter (Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh)

by reiver

WTF ("Who to Follow") is Twitter's user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter's user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.

[PDF]
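The random-walk idea behind such recommenders can be sketched in a few lines. This is an illustration of a random walk with restart over a follow graph, not Cassovary's actual code; SALSA additionally alternates over a bipartite hub/authority graph, which is omitted here. The graph and account names are made up.

```python
import random
from collections import Counter

def random_walk_recs(graph, user, steps=10000, restart=0.15, seed=1):
    """Rank candidate accounts for `user` by visit counts of a
    random walk with restart over the follow graph.

    graph: dict mapping account -> list of accounts it follows.
    """
    rng = random.Random(seed)
    visits = Counter()
    node = user
    for _ in range(steps):
        out = graph.get(node, [])
        if not out or rng.random() < restart:
            node = user  # teleport back to the user
            continue
        node = rng.choice(out)
        visits[node] += 1
    visits.pop(user, None)
    already = set(graph.get(user, []))  # don't recommend existing follows
    return [n for n, _ in visits.most_common() if n not in already]

follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
print(random_walk_recs(follows, "alice"))  # friends-of-friends rank first
```

Accounts reachable through many short paths from the user accumulate the most visits, which is why "common connections" drive the recommendations.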

Chaotic Boltzmann machines (Hideyuki Suzuki, Jun-ichi Imura, Yoshihiko Horio, Kazuyuki Aihara)

by reiver

The chaotic Boltzmann machine proposed in this paper is a chaotic pseudo-billiard system that works as a Boltzmann machine. Chaotic Boltzmann machines are shown numerically to have computing abilities comparable to conventional (stochastic) Boltzmann machines. Since no randomness is required, efficient hardware implementation is expected. Moreover, the ferromagnetic phase transition of the Ising model is shown to be characterised by the largest Lyapunov exponent of the proposed system. In general, a method to relate probabilistic models to nonlinear dynamics by derandomising Gibbs sampling is presented.

10.1038/srep01610
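For contrast, here is the conventional stochastic update that the chaotic machine derandomizes: sequential Gibbs sampling of binary units. This is a generic Boltzmann-machine sketch with a made-up two-unit example, not the paper's pseudo-billiard dynamics.

```python
import math
import random

def gibbs_sample(weights, biases, steps=20000, temp=1.0, seed=7):
    """Conventional (stochastic) Boltzmann machine: sequential Gibbs
    sampling of binary units s_i in {0, 1}, returning each unit's
    empirical mean.  The chaotic Boltzmann machine replaces the coin
    flip below with deterministic pseudo-billiard dynamics."""
    rng = random.Random(seed)
    n = len(biases)
    s = [rng.randint(0, 1) for _ in range(n)]
    mean = [0.0] * n
    for t in range(steps):
        i = t % n  # sweep units in order
        field = biases[i] + sum(weights[i][j] * s[j]
                                for j in range(n) if j != i)
        p_on = 1.0 / (1.0 + math.exp(-field / temp))
        s[i] = 1 if rng.random() < p_on else 0  # the stochastic step
        for k in range(n):
            mean[k] += s[k]
    return [m / steps for m in mean]

# Two units with a positive (ferromagnetic) coupling and no bias:
W = [[0.0, 2.0],
     [2.0, 0.0]]
b = [0.0, 0.0]
print(gibbs_sample(W, b))
```

With a positive coupling the two units tend to switch on together, so both empirical means sit well above one half; the paper's claim is that a deterministic chaotic system can reproduce this sampling behaviour without the random number generator.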

R 3.0.0 Released

by reiver

R is a popular environment for statistical computing and graphics. R version 3.0.0 has been released.

From the release e-mail:

Major R releases have not previously marked great landslides in terms of new features. Rather, they represent that the codebase has developed to a new level of maturity. This is not going to be an exception to the rule.

Version 1.0.0 was released at a point in time when we felt that we had reached a level of completeness and stability high enough to characterize a full statistical system, which could be put to production use.

Version 2.0.0 came out after strong enhancements of the memory management subsystem as well as several major features, including Sweave.

Version 3.0.0, as of this writing, contains only one really major new feature: the inclusion of long vectors (containing more than 2^31-1 elements!). More changes are likely to make it into the final release, but the main reason for having it as a new major release is that R over the last 8.5 years has reached a new level: we now have 64 bit support on all platforms, support for parallel processing, the Matrix package, and much more.

Josh Wills: Data Scientist Definition

by reiver

A definition of what a data scientist is from Josh Wills:

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

(Sometimes I think data engineer might be a better name for it.)

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Trevor Hastie, Robert Tibshirani, Jerome Friedman)

by reiver

Statistics is a fundamental part of our vocation. We live it and breathe it, so to speak.

Thus, interesting books on statistics tend to catch my attention. And when they are also freely available, even better :-) (Although with this book, you can also buy a copy and help the authors out.)

One such book that seems worth a read is: The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

The book is written by Trevor Hastie, Robert Tibshirani and Jerome Friedman, who have made the book available both as a free download and for sale.

Yelp Dataset Challenge; And A New Data Set To Play With

by reiver

Yelp has made a new data set available with their yelp dataset challenge.

Get it here.