If you read a book on statistical modeling, you will often see examples where the data is beautifully presented: several independent variables, each with a well-defined meaning, mapping onto a known dependent variable. The statistician's task is then merely to apply some known procedures, turn the crank, and produce the correct model. As you might guess, real data rarely arrives this way. In this article, I describe a case where we took data that had little external meaning and extracted actual information from it.
In this case, we were hired to investigate the intranet traffic of a large industrial facility. We were provided with a series of network addresses, timestamps of occurrence, a label indicating whether each event was blocked, and some information about the protocol used. From this, we were asked to identify any evidence of threatening behavior in the network traffic. How did we do it?
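For concreteness, a record in this kind of log can be modeled roughly as follows. The field names below are hypothetical, chosen to mirror the fields described above; they are not the client's actual schema.

```python
from dataclasses import dataclass

# Illustrative record layout for one traffic event.
# Field names are hypothetical, not the client's real schema.
@dataclass
class TrafficEvent:
    src_addr: str      # internal network address of the sender
    dst_addr: str      # address contacted (internal or external)
    timestamp: float   # Unix time at which the event occurred
    blocked: bool      # whether the client's system blocked the event
    protocol: str      # e.g. "tcp", "udp", "http"

event = TrafficEvent("10.0.0.5", "10.0.0.9", 1511310000.0, False, "tcp")
```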
The first step was to realize that each of these internal addresses represented a physical computer in their network, and that they all talked to each other. From there, we were able to construct a graph of how information traveled through the system. Furthermore, a deeper dive into the data showed instances where the intranet traffic touched the open internet, and a smaller number of cases where traffic touched blacklisted IP addresses. Combined with the known labels, we were able to identify bad action by four criteria:
1. Contact of a machine with a blacklisted entity
2. Frequency with which a machine engaged in behavior blocked by the client system, per their logs
3. Statistically outlying behavior derived from the first item
4. Second-order threat risk
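Criteria 1 and 2 can be turned into initial threat scores with straightforward bookkeeping. The sketch below is purely illustrative, not our production code: it assumes events arrive as `(source, destination, blocked)` tuples alongside a set of blacklisted addresses, and the `block_weight` parameter is a made-up scaling factor.

```python
from collections import defaultdict

def initial_threat_scores(events, blacklist, block_weight=0.1):
    """Score each source node by criteria 1 and 2: any contact with a
    blacklisted entity pins the score to 1.0, and each blocked event
    adds a small increment (capped at 1.0). Parameters are illustrative."""
    scores = defaultdict(float)
    blocked_counts = defaultdict(int)
    for src, dst, blocked in events:
        if dst in blacklist:        # criterion 1: blacklist contact
            scores[src] = 1.0
        if blocked:                 # criterion 2: tally blocked behavior
            blocked_counts[src] += 1
    for node, n in blocked_counts.items():
        scores[node] = min(1.0, scores[node] + block_weight * n)
    return dict(scores)

scores = initial_threat_scores(
    [("A", "bad-host", False), ("B", "C", True), ("B", "C", True)],
    blacklist={"bad-host"},
)
```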
Items 1, 2, and 3 should be familiar enough, but what is meant by "second-order threat risk"? Once we have assigned threat scores from the first two items, and with the graph of intranet traffic in hand, some operationalization lets us map the continuous variables you would see in a fluid onto the discrete variables you see in a graph, and apply a generalization of the diffusion equation:

∂u/∂t = D ∂²u/∂x²
Here, u represents the threat score of a particular node in the graph, x represents distance along the graph, and the "time" variable t is a count of contacts between nodes. This equation is more famous for describing the rate at which, say, a drop of dye mixes into the water it is submerged in, but here we use it to describe the travel of a threat vector through the network. Using this algorithm, even with minimal input data, we can determine threat scores for every node on the network using only information about exposure risk. In this case, the customer was then able to identify actual problems on their client computers from the information we provided. Better still, the underlying mathematics would apply equally to any system where ground-truth risk can propagate through a network of identical components.
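A discrete analogue of this diffusion can be sketched in a few lines. The function below is a minimal illustration under assumptions of mine, not the method exactly as we ran it: the graph is an adjacency map of node → neighbors, and each step relaxes a node's score toward its neighbors' scores (the graph-Laplacian form of the diffusion equation). For stability, `rate` times the maximum node degree should stay below 1.

```python
def diffuse(scores, adjacency, rate=0.2, steps=10):
    """Spread threat scores over a graph by discrete diffusion.
    scores:    dict node -> initial threat score (missing nodes = 0.0)
    adjacency: dict node -> list of neighboring nodes
    rate:      diffusion coefficient per step (rate * max_degree < 1)
    """
    u = {node: scores.get(node, 0.0) for node in adjacency}
    for _ in range(steps):
        nxt = {}
        for node, neighbors in adjacency.items():
            # Net flow into this node from its neighbors.
            flux = sum(u.get(m, 0.0) - u[node] for m in neighbors)
            nxt[node] = u[node] + rate * flux
        u = nxt
    return u

# One step on a 3-node chain A-B-C with all the threat starting at A:
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
after_one = diffuse({"A": 1.0}, adj, rate=0.2, steps=1)
```

On an undirected graph this update conserves the total score, so what the initially flagged machines lose, their contacts gain; that is exactly the "second-order" exposure we wanted to surface.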
Last modified: November 22, 2017