Statistics and machine learning can help detect fraudulent behavior both online and in the real world. Christos Faloutsos talks about how data science can help make the world a little safer.
How many friends do you have on social media? How many of your friends are also friends with each other? These two simple questions may hold the answer to how normal you are, at least from a data science point of view. It’s information like this that Christos Faloutsos and his team at Carnegie Mellon University (CMU) use to find fraudulent profiles in social networks and elsewhere on the web.
Christos is a professor in the School of Computer Science at CMU. Originally from Athens, Greece, he and his team work on pattern discovery and anomaly recognition in big data. Christos is currently on a sabbatical from his position at CMU and works with Amazon on new ways to apply his research. He is a cheerful man and a splendid teacher. Each concept he explains to me is outlined with a colorful example and often accompanied by a giggle or laugh.
Using math to characterize fraudsters
The idea behind Christos’ research is that fraudsters in online shops and social networks don’t behave like real people. Since their accounts exist only to achieve a certain goal (be it bad-mouthing sellers with reviews on Amazon or creating fake followers on Twitter), they don’t show the same behavior patterns as regular, “normal” people do.
Christos’ team wants to identify these people by their unnatural behavior. To do this, they build a graph from the available data. Taking the social media example from above, “every node is a person, every edge means they’re friends,” Christos explains. “Now the question is: if you see a group of 50 people all being friends with each other, is this normal or abnormal? Our analyses show that this is abnormal. If you have 50 people, no matter how good friends they are, not everybody will be friends with everybody else.”
To find these abnormally behaving profiles, one first has to find out what’s “normal”. However, there’s no easy definition of what normal means. Instead, normal is what most people do. “Normal is whatever is popular,” Christos says.
How to find out what’s normal
But how do you find out what’s popular? Which indicators should you look at to analyze a user’s behavior? Christos laughs when I ask the question. “Well, it’s kind of an art. You have to figure out how to extract useful features from a graph,” he says. Choosing the features to analyze is manual labor, even in times of AI. Experience, trial and error, and large amounts of validation data are key in this phase of a project.
Using social media as an example again, Christos explains the procedure: “Take 20 friends. If they all know each other, then one good feature to describe the situation is how many friends everybody has — 20, as well as how many triangles they participate in.”
A triangle describes the relationship between three friends. Each corner of the triangle corresponds to an account, while the lines denote that the two accounts on the corners are friends with each other. “If all your friends know each other, you participate in roughly 20*20=400 triangles.”
The number of friends and the number of triangles for each account are used as characteristic features of an account. Christos continues with the next steps: “Then you compare your two numbers with everybody else’s two numbers. If there [are] a lot of people with 20 friends and 400 triangles, then we say [this is] normal. A person with 50 friends that don’t know each other then has 50 friends and 0 triangles. This person will stand out.”
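The two features can be computed directly from a friendship graph. The following is a minimal sketch, not the team's actual code; the helper names `build_graph` and `degree_and_triangles` and the toy graphs are assumptions made for illustration. Note that for an account whose 20 friends all know each other, the exact triangle count is 20*19/2 = 190, which grows with the square of the number of friends just like the rough 20*20 estimate in the quote.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected friendship graph as {account: set of friends}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree_and_triangles(adj):
    """Return {account: (number of friends, number of triangles it is part of)}."""
    features = {}
    for node, friends in adj.items():
        # A triangle exists for every pair of this account's friends
        # who are also friends with each other.
        triangles = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
        features[node] = (len(friends), triangles)
    return features

# Hypothetical account "me" with 20 friends who all know each other.
clique_nodes = ["me"] + [f"f{i}" for i in range(20)]
clique = build_graph(combinations(clique_nodes, 2))

# Hypothetical account "hub" with 50 friends who do not know each other.
star = build_graph(("hub", f"s{i}") for i in range(50))

print(degree_and_triangles(clique)["me"])  # (20, 190)
print(degree_and_triangles(star)["hub"])   # (50, 0)
```

Comparing each account's pair of numbers against the pairs that are common in the rest of the network is what makes the second account stand out.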
Fraudsters work in lock-step
Another telltale sign for fraudsters is synchronized behavior. Christos gives the example of two competing sellers on the Indian online retailer Flipkart, with whom he wrote a paper on the topic. “You want to sell shoes on Flipkart. I’m your competitor, so how do I boost my sales? [I get] 4,000 people to rate you with one star [out of five] and ruin your reputation.”
These people can be easily hired, he explains. They may be real people, but the way they work gives them away. “Usually, they will [post their review] more or less on the same day, because the customer wants results quickly.”
Often such hired guns, as Christos calls them, will not only badmouth one product. These are low-paid one-time jobs, so a reviewer has to take on many of them to make a sustainable income. In Christos’ words, “they will bad-mouth shoes on Monday, shirts on Wednesday, laptops on Sunday, and so on.”
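A very simple way to picture lock-step behavior is to look for bursts of one-star reviews on the same product on the same day, and then for accounts that show up in several such bursts. The sketch below is a toy heuristic under those assumptions, not the method from the Flipkart paper; the function names, thresholds, and review data are all hypothetical.

```python
from collections import defaultdict

def lockstep_bursts(reviews, min_accounts=3):
    """Group one-star reviews by (product, day) and keep the crowded groups.

    reviews: iterable of (account, product, day, stars) tuples.
    """
    buckets = defaultdict(set)
    for account, product, day, stars in reviews:
        if stars == 1:
            buckets[(product, day)].add(account)
    return {key: accts for key, accts in buckets.items()
            if len(accts) >= min_accounts}

def repeat_offenders(bursts, min_bursts=2):
    """Return accounts that take part in at least min_bursts bursts."""
    counts = defaultdict(int)
    for accounts in bursts.values():
        for account in accounts:
            counts[account] += 1
    return {account for account, n in counts.items() if n >= min_bursts}

# Hypothetical data: a1-a3 act together, u1-u3 behave independently.
reviews = [
    ("a1", "shoes", "Mon", 1), ("a2", "shoes", "Mon", 1), ("a3", "shoes", "Mon", 1),
    ("a1", "shirts", "Wed", 1), ("a2", "shirts", "Wed", 1), ("a3", "shirts", "Wed", 1),
    ("u1", "shoes", "Tue", 1), ("u2", "shirts", "Fri", 5), ("u3", "laptops", "Sun", 4),
]

bursts = lockstep_bursts(reviews)
suspects = repeat_offenders(bursts)
print(sorted(suspects))  # ['a1', 'a2', 'a3']
```

A single unhappy customer such as `u1` is not flagged: it takes both the crowd and the repetition across products to trigger the heuristic.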
To identify this lock-step behavior in large amounts of data, Christos and his team use a variety of mathematical methods, as well as machine learning. “One of the matrix algebra tools we’re using is the so-called Singular Value Decomposition (SVD),” Christos says.
SVD is a method that breaks a data matrix down into a small number of dominant components, giving an accurate yet simpler representation of the original data. Each component groups rows and columns that vary together into an overarching concept. This ability can be used to identify underlying similarities within datasets, such as a group of accounts that all act on the same set of products.
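To make this concrete, here is a minimal sketch using NumPy's `numpy.linalg.svd` on a tiny account-by-product matrix. The matrix, the 0.3 threshold, and the idea of reading suspects off the first left singular vector are assumptions chosen for illustration; the team's real pipeline is not described in this article.

```python
import numpy as np

# Rows are accounts, columns are products; 1 means "this account reviewed
# this product". Rows 0-3 are a hypothetical lock-step group that reviewed
# exactly the same three products. Rows 4-7 are unrelated accounts.
A = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The strongest component captures the densest block of the matrix.
# Accounts with a large weight in it belong to that block.
# The sign of a singular vector is arbitrary, so use absolute values.
scores = np.abs(U[:, 0])
suspects = np.flatnonzero(scores > 0.3)  # 0.3 is an arbitrary toy threshold

print(suspects.tolist())  # [0, 1, 2, 3]
```

Account 6 shares one product with the group but does not follow the whole pattern, so its weight in the first component stays small.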
Fighting crime with data
These methods can be applied not only to online communities and social networks. Christos mentions computers in a network, and even genetics, as examples: “In biological networks, genes [are] interacting with each other. The nodes [of the graph] are genes now and the chemical reactions are interactions. Are the interactions of genes in this graph normal? Does [an anomaly] maybe show an onset of cancer?”
Christos tells me about another application of his work: to fight organized crime, in particular human trafficking. He explains that the way traffickers post escort ads in newspapers exhibits lock-step behavior as well.
“[They always use] more or less the same advertising in [different] newspapers. They only change the name or the photograph. This, again, is lock-step behavior. 50 times the same ad with a different name or a different phone number.” The same behavior detection methods, including SVD, can be applied to this case.
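One simple way to picture "the same ad with a different name or phone number" is to compare ads by the word pairs they share. The sketch below uses Jaccard similarity over word bigrams, which is a stand-in technique chosen for brevity and not the SVD-based approach mentioned above; the function names, the 0.5 threshold, and the sample texts are hypothetical.

```python
def shingles(text, k=2):
    """Return the set of k-word sequences (here: word pairs) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Share of elements two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(ads, threshold=0.5):
    """Return index pairs of ads whose word-pair overlap reaches the threshold."""
    sets = [shingles(ad) for ad in ads]
    pairs = []
    for i in range(len(ads)):
        for j in range(i + 1, len(ads)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical ads: the first two differ only in name and phone number.
ads = [
    "new in town call anna today for a relaxing evening downtown 555-0101",
    "new in town call maria today for a relaxing evening downtown 555-0102",
    "used bicycle for sale good condition pick up downtown this weekend",
]

print(near_duplicates(ads))  # [(0, 1)]
```

Comparing every pair of ads only works for small collections; it is meant to show the idea, not to scale.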
This ain’t a scene, it’s an arms race
Christos admits that his work resembles an arms race. While the “good guys” develop better algorithms, fraudsters also improve their methods. But, he says, the goal cannot be to make fraudulent behavior impossible. Instead, the point is to make it expensive.
“We’re raising the bar. If we’re spotting lock-step behavior on Facebook, Flipkart, or escort ads, the fraudsters have to change their behavior. That will cost them.” When ads or reviews have to be more individual, Christos hopes, fewer people will be able to pay for such fraud.
Another restriction for fraudsters is the timeliness of their service. When paying for Twitter followers, a customer wants results immediately, not in the distant future. “[The fraudsters] have to satisfy the customer. If you have to deliver 1,000 customers, you have to deliver them today.” Thus, lock-step behavior will always remain a characteristic of fraudulent behavior, no matter how sophisticated fraud tactics become. You can’t trick time, after all.
With their research, Christos and his team are helping to make the Internet, and the world in general, a safer place. Fraudsters and trolls will always exist, but researchers like Christos and his colleagues try to make it as hard and expensive as possible for them to do their mischievous work.