DataScholars

A blog about data science, computer science, machine learning, artificial intelligence, computational social science, data mining, analysis, and visualization.

Revealing social networks of spammers through spectral clustering (Kevin S. Xu, Mark Kliger, Yilun Chen, Peter J. Woolf, Alfred O. Hero III)

by reiver

To date, most studies on spam have focused only on the spamming phase of the spam cycle and have ignored the harvesting phase, which consists of the mass acquisition of email addresses. It has been observed that spammers conceal their identity to a lesser degree in the harvesting phase, so it may be possible to gain new insights into spammers' behavior by studying the behavior of harvesters, which are individuals or bots that collect email addresses. In this paper, we reveal social networks of spammers by identifying communities of harvesters with high behavioral similarity using spectral clustering. The data analyzed was collected through Project Honey Pot, a distributed system for monitoring harvesting and spamming. Our main findings are (1) that most spammers either send only phishing emails or no phishing emails at all, (2) that most communities of spammers also send only phishing emails or no phishing emails at all, and (3) that several groups of spammers within communities exhibit coherent temporal behavior and have similar IP addresses. Our findings reveal some previously unknown behavior of spammers and suggest that there is indeed social structure between spammers to be discovered.

arXiv:1305.0051 [cs.SI]

JSNetworkX: Visualizing Graphs On The Web Using JavaScript

by reiver

Visualizing graphs (in the graph theory sense of the word) can be a challenge. Visualizing graphs on the web can be an even bigger challenge. Luckily there is JSNetworkX.

JSNetworkX is an open source JavaScript library that makes visualizing graph data easy.

JSNetworkX is a port of the Python graph library NetworkX to JavaScript and the Web, with the rendering done using D3.

Here is an example, as shown in figure 1.

Figure 1. Sample graph visualization created with JSNetworkX.

And here is the code for the graph visualization in figure 1, as shown in figure 2.


// Build a graph with three groups of nodes.
var G = jsnx.Graph();

G.add_nodes_from([1,2,3,4], {group:0});
G.add_nodes_from([5,6,7], {group:1});
G.add_nodes_from([8,9,10,11], {group:2});

// Connect the groups with a path, then add edges within each group.
G.add_path([1,2,5,6,7,8,11]);
G.add_edges_from([[1,3],[1,4],[3,4],[2,3],[2,4],[8,9],[8,10],[9,10],[11,10],[11,9]]);

// Colour nodes by their group and render the graph with D3.
var color = d3.scale.category20();
jsnx.draw(G, {
  element: '#PUT_THE_ID_OF_WHERE_TO_RENDER_THE_GRAPH_HERE', // CSS selector of the container element
  layout_attr: {
    charge: -120,
    linkDistance: 20
  },
  node_attr: {
    r: 5,
    title: function(d) { return d.label; }
  },
  node_style: {
    fill: function(d) {
      return color(d.data.group);
    },
    stroke: 'none'
  },
  edge_style: {
    fill: '#999'
  }
});
			
Figure 2. JavaScript source code using JSNetworkX for the graph visualization shown in figure 1.
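If you want to try the code in figure 2, it needs to run on a page that loads D3 and JSNetworkX and that contains an element matching the selector passed as the element option. Here is a minimal sketch of such a page; the id demo-graph and the script file names are placeholders, so point them at wherever you keep your copies of the two libraries.

<!DOCTYPE html>
<html>
  <head>
    <!-- Load D3 (version 3) and JSNetworkX; the paths are placeholders. -->
    <script src="d3.v3.min.js"></script>
    <script src="jsnetworkx.js"></script>
  </head>
  <body>
    <!-- The container the graph will be rendered into. -->
    <div id="demo-graph"></div>

    <script>
      // The code from figure 2 goes here, with
      // element: '#demo-graph'
    </script>
  </body>
</html>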

More information about JSNetworkX is available here.

Networks, Crowds, and Markets: Reasoning About a Highly Connected World (David Easley, Jon Kleinberg)

by reiver

David Easley and Jon Kleinberg have created an excellent book on small-world networks (a subset of graph theory) and their applications, such as social networks, although that description probably doesn't do the book justice. (The book is definitely worth a read.)

It is called: Networks, Crowds, and Markets: Reasoning About a Highly Connected World.

You can either buy a copy here or download it for free.

Here is their description of it:

Over the past decade there has been a growing public fascination with the complex "connectedness" of modern society. This connectedness is found in many incarnations: in the rapid growth of the Internet and the Web, in the ease with which global communication now takes place, and in the ability of news and information as well as epidemics and financial crises to spread around the world with surprising speed and intensity. These are phenomena that involve networks, incentives, and the aggregate behavior of groups of people; they are based on the links that connect us and the ways in which each of our decisions can have subtle consequences for the outcomes of everyone else.

Networks, Crowds, and Markets combines different scientific perspectives in its approach to understanding networks and behavior. Drawing on ideas from economics, sociology, computing and information science, and applied mathematics, it describes the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected.

[HTML + PDF]

Analysing Mood Patterns in the United Kingdom through Twitter Content (Vasileios Lampos, Thomas Lansdall-Welfare, Ricardo Araya, Nello Cristianini)

by reiver

Social Media offer a vast amount of geo-located and time-stamped textual content directly generated by people. This information can be analysed to obtain insights about the general state of a large population of users and to address scientific questions from a diversity of disciplines. In this work, we estimate temporal patterns of mood variation through the use of emotionally loaded words contained in Twitter messages, possibly reflecting underlying circadian and seasonal rhythms in the mood of the users. We present a method for computing mood scores from text using affective word taxonomies, and apply it to millions of tweets collected in the United Kingdom during the seasons of summer and winter. Our analysis results in the detection of strong and statistically significant circadian patterns for all the investigated mood types. Seasonal variation does not seem to register any important divergence in the signals, but a periodic oscillation within a 24-hour period is identified for each mood type. The main common characteristic for all emotions is their mid-morning peak, however their mood score patterns differ in the evenings.

arXiv:1304.5507 [cs.SI]

New: PLOS Text Mining

by reiver

The folks over at PLOS are introducing the PLOS Text Mining Collection.

Text Mining is an interdisciplinary field combining techniques from linguistics, computer science and statistics to build tools that can efficiently retrieve and extract information from digital text. Over the last few decades, there has been increasing interest in text mining research because of the potential commercial and academic benefits this technology might enable.

[...]

First, the rate of growth of the scientific literature has now outstripped the ability of individuals to keep pace with new publications, even in a restricted field of study. Second, text-mining tools have steadily increased in accuracy and sophistication to the point where they are now suitable for widespread application. Finally, the rapid increase in availability of digital text in an Open Access format now permits text-mining tools to be applied more freely than ever before.

[...]

PLOS launches the Text Mining Collection, a compendium of major reviews and recent highlights published in the PLOS family of journals on the topic of text mining. As one of the major publishers of the Open Access scientific literature, it is perhaps no coincidence that research in text mining in PLOS journals is flourishing. As noted above, the widespread application and societal benefits of text mining is most easily achieved under an Open Access model of publishing, where the barriers to obtaining published articles are minimized and the ability to remix and redistribute data extracted from text is explicitly permitted. Furthermore, PLOS is one of the few publishers who is actively promoting text mining research by providing an open Application Programming Interface to mine their journal content.

See it here.

Video Lectures: Information Theory, Pattern Recognition, and Neural Networks, by David J. C. MacKay

by reiver

David J.C. MacKay has a number of video lectures available on: Information Theory, Pattern Recognition, and Neural Networks.

Here is the complete list of video lectures:

Hadley Alexander Wickham: Speaking at Vancouver Meetup on Wednesday May 8th at 7:00 PM

by reiver

For those who use R, the name Hadley Alexander Wickham is a well known one.

Hadley is coming to Vancouver, and will be speaking at a combined meetup event for the Vancouver-based Data Science group and the Vancouver R user group.

If you are in the Vancouver area, use R, or are interested in statistics or visualization, you should be there.

Practical Tools For Exploring Data And Models (Hadley Alexander Wickham)

by reiver

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with. Chapter 2 describes the reshape framework for restructuring data. Second, you plot the data to get a feel for what is going on. Chapter 3 introduces the layered grammar of graphics. Third, you iterate between graphics and models to build a succinct quantitative summary of the data. Chapter 4 introduces some strategies for visualising models. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future. Chapter 5 summarises the impact of my work and my plans for the future.

[PDF]

Spark: Graph Visualization From The Command Line

by reiver

So you are at the command line. You have a bunch of data you pulled from a database or a file. You did a bunch of "magic" with awk, sort, and other command line tools to extract the "important" parts of the data.

Now you want to see that "important" extract in a graph.

What's the fastest way to visualize it? What's the fastest way to see a graph?

MS Excel? OpenOffice Calc? GNU Octave? R?

No, no, no and no.

The fastest way is using spark.

Spark lets you create and view graphs right from the command line.

Here is a very basic example:


spark 0 30 55 80 33 150
				
Figure 1. Very basic usage of spark.

▁▂▃▅▂▇
				
Figure 2. Output of spark command in figure 1.
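As figure 2 shows, spark scales the input values between their minimum and maximum and renders each one as a Unicode block character of corresponding height, so the whole series reads as a tiny inline bar chart.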

And here is a more typical example:


curl http://earthquake.usgs.gov/earthquakes/catalogs/eqs1day-M1.txt --silent | sed '1d' | cut -d, -f9 | spark
				
Figure 3. More typical usage of spark: the magnitudes of earthquakes of at least magnitude 1.0 in the last 24 hours.

 ▅▆▂▃▂▂▂▅▂▂▅▇▂▂▂▃▆▆▆▅▃▂▂▂▁▂▂▆▁▃▂▂▂▂▃▂▆
				
Figure 4. Output of spark command in figure 3.
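Reading the pipeline in figure 3 from left to right: curl fetches the USGS feed of the last 24 hours of earthquakes as CSV, sed '1d' strips the first line (the CSV header), cut -d, -f9 keeps only the ninth comma-separated field (the magnitude), and spark turns that column of magnitudes into the sparkline shown in figure 4.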

(More spark examples here.)

Check out spark.

We Love Open Data: Ontario Open Data

by reiver

The "currency" of a data scientist's vocation is (surprise, surprise) data. Sometimes data scientists have to go to great lengths to gather the data sets themselves. And sometimes people will "give" it to them.

The government of Ontario has done just that.

Meet the Ontario Open Data portal.

Explore the data sets they offer for yourself.