The Global Mail has ceased operations.
Technology
Ella Rubeli/The Global Mail

Big Data Can Tell By Your Tweets If You’re A Psychopath. That’s Only The Beginning…

It’s made your car more efficient, your medicine more effective and your roads safer. It has also opened up a big hole in your privacy and it’s peering in. Is big data a good thing — or just the world’s newest economic asset?


At the turn of the 20th century, prospectors were rugged, bearded men. They sat on their haunches by mountain streams, rattling pans, sifting through gravel with weather-worn hands in the hope of finding something shiny buried in ancient rock. Very few struck it rich.

Just over a century later, prospectors have degrees in computer science and, it's fair to say, a rather different image. They sit on ergonomic office chairs, tapping out algorithms on keyboards and running them through huge databases in the hope of finding something useful in the seams of ones-and-zeroes still so new to our planet. What constitutes gold in this new industry? For some, treasure is a way of identifying the most persuadable voters; for others, it's predicting trends in criminal activity.

Data miners rely far less on luck for their livelihoods than the prospectors of old. Many earn six-figure salaries for their work — that's pay dirt. And their fields of exploration are no ordinary geography. These people are drilling into Big Data — the 2.5 quintillion bytes of information created daily by businesses, individuals, financial institutions, government departments, research labs and countless other organisations around the world. The potential economic value of insights hidden within this resource recently prompted the World Economic Forum to declare it a new asset class and is pushing the concept of 'data equity' — already well understood by web companies such as Google and Facebook — into the mainstream.

Finding those virtual nuggets is a complicated task. For a start, a Big Data set can include all manner of records: years — sometimes decades — of financial transactions, exchange rates, web history, air temperature, blood pressure, crime rates, personal attributes, Facebook 'likes', status updates, videos ...

This pithy tweet from a Canadian analyst probably explains Big Data best for the layperson:

@SHamelCP: "Simplest definition of #BigData ever: 'it doesn't fit in Excel' :)"

Nowhere near. But volume is only one of the three Vs pushing the 'data tsunami' beyond the capacity of traditional business analysis tools and into the hands of data scientists. Velocity, or the speed with which new information is acquired and the demand to act upon it in real time, and variety — messy, 'unstructured' data such as images, raw text and multimedia — have, until now, also made harnessing the latent power of all this information a monumental challenge.

Kaggle.com

Kaggle CEO Anthony Goldbloom

Hence the latest tech boom: the value of the global big-data market is projected to top the $50 billion mark by 2017, a tenfold increase from the beginning of this year. And business opportunities are manifold, ranging from storage and data management to analytical services and consulting.

Big players include established computing giants, such as IBM, Intel and HP, and traditional business-intelligence providers, while hundreds of big-data-focused start-ups have entered the fray. The rush is well and truly on.

"It's exploding right now, in terms of interest, in terms of activity in the vendor community — and everybody is now talking about big data from an end-user perspective," says Jeff Kelly, a tech-industry analyst speaking to The Global Mail from his office in Marlborough, Massachusetts — the US state which is aiming to become the world's big-data hub. "People want to know how they can use the data they've been collecting for years — now that the technology has become available and affordable and they can actually do something with it, instead of it just sitting in a repository somewhere."

Kelly covers the big-data world for open-source research company Wikibon, and authored the first comprehensive report on the size of the global market. "We're seeing interest from all kinds of areas — financial services, healthcare, retail — I can't find an industry I don't think will be impacted by this to some degree."

In Australia interest is also gathering pace, says Richard Price, vice president of systems at business-intelligence provider Oracle ANZ in Melbourne. "Businesses are realising that this will become a source of competitive advantage. In a big-data world, any organisation that fails to sufficiently leverage its analytical insights will be left behind."

Anthony Goldbloom, an Australian entrepreneur who recently moved his big-data start-up, Kaggle, to Silicon Valley, puts the buzz into perspective. "Put it this way: the first quarter of this year saw more venture capital investment in big data companies than in consumer internet companies." We are most definitely, he says, in "the era of big data".

Numbers fill out the picture: the May 2011 report Big Data: The Next Frontier for Innovation, Competition and Productivity, by the McKinsey Global Institute, claims that using big data could provide USD300 billion in annual value to healthcare in the United States and EUR100 billion in efficiency savings to Europe's public sector. In research conducted by the Economist Intelligence Unit for French professional services firm Capgemini early this year, senior executives reported an average company performance improvement of 26 per cent over the past three years thanks to big data, a figure they expect to rise to 41 per cent in the next three years.

They weren't identified, but some of those executives were likely from large banks, where big data is already being deployed to improve fraud detection.

"They know everything you bought, when you bought it, how you bought it — when you look at that across all the years it's easy to spot when something out of the ordinary happens. In the past you couldn't crunch that much data, so you might say, 'This is a little out of the ordinary but we can't say how much out of the ordinary because we don't have enough to go on.' Now, you can act on it immediately," explains Kelly.

At Predictive Analytics World, the premier commercial-data-science conference, the sessions include movie studios "using big data to optimise and predict opening week at the box office", and "Pfizer: Right Medicine, Right Patient". Even Navy SEALs are covered, in "US Special Forces: Hiring and Selecting Key Personnel Using Predictive Analytics", while marketing research firm Nielsen caters for those with more of an eye for profit, with a financial services session titled "Finding Consumers More Accurately and Actionably Using Data Mining Tools".

CONSUMER TARGETING is undoubtedly where much of the potential treasure lies, as Price explains: "Consider the difference it would make to a company if marketers could quickly and easily see that certain products or services are generating 'buzz' at a given time and location, or even identify a reason why a product is not selling and respond to this by targeting supplementary promotions to the relevant geography," he says, pointing to the McKinsey report figure of a 60 per cent potential increase in profit margins for retailers through big-data applications.

There are, as always, a few points of caution. 'Big-data washing', for example, refers to the fact that "everybody and their brother is coming out saying 'this is our big-data tool' — frankly, some of it is more marketing than reality," warns Kelly.

Courtesy of Kaggle

The Kaggle team

Having mountains of information doesn't necessarily equate to mountains of value.

"A lot of what big data is all about is wading through the crap, for lack of a better term. Maybe you can figure out if somebody's likely to purchase a particular type of gum if the weather's a certain way, but does it really matter? That's not exactly a high-margin business or a significant social insight. That's the challenge — to find the interesting bits that are just buried under petabytes of data," says Kelly.

He hastens to add that he thinks some of the hype is justified.

"There is definitely a lot more chatter going on than there is large-scale deployment, but I'm not sure I'd call that hype, I'd call that early talk — because this technology really does have huge potential to impact all industries."

A few obstacles still lie between the talk and the actual dollars. The biggest by far is a shortage of talent. "Although we have these big-data technologies now, we simply don't have enough qualified people to use them. A lot of this stuff was created by highly skilled engineers at web companies like Google or Yahoo! (things like MapReduce and Hadoop) because they were the first to really need to deal with massive data sets and there were simply no tools available for them to use," says Kelly.

"So what they came up with was not necessarily user-friendly, it was designed for their core business. The people we need to help commercialise this stuff — we call them 'data scientists' but it requires a whole mix of skills around maths, statistics, programming, business, social sciences — there just aren't enough people who meet that criteria now to make big-data analysis possible in too many organisations."

The McKinsey report quantified this workforce shortfall for the US alone at "140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions". Those findings are mirrored in the explosion of data-related job advertisements over the past two years.

"You go to data conferences and just about every presentation ends with 'oh, and by the way, we're hiring'," says Goldbloom, who in November last year secured USD11 million in funding to further grow Kaggle, a competitive crowdsourcing platform that is credited with "making data science a sport".

Goldbloom, a former Australian Treasury economist, founded the company in 2010 after recognising just how big the demand for big-data analysts was becoming. "I was interning at The Economist in London, writing a piece about big data and predictive modelling, and got to speak to a whole lot of CIO-level people and ask them how high on their list of priorities this stuff was. I discovered that they were all wanting to do it but having trouble putting anything into action — they didn't have access to the people who could."

He came up with a model that allows companies and organisations to post their data and particular problem online; there, a pool of over 45,000 data scientists from all over the world competes to find the best solution. A leader board is updated in real time until the competition closes and the winner claims their prize money from the host. Bounty can range from a few thousand dollars to $3 million.

Participants who consistently perform well in public competitions may then be invited — and paid — to compete in private contests.

"It's a meritocracy, like golf or tennis," says Goldbloom, who hopes Kaggle will play a central role in the future of the industry. "We'd like to see the world's best data scientists making their living this way."

In the meantime, more big-data wranglers have to be trained. Goldbloom sits on the advisory board for a data science course being created at New York's Columbia University, one of many educational institutions preparing to offer qualifications specifically designed for this new discipline.

"Universities are starting to come around to the fact that this is an area in great demand around industry, but it will probably take a long time before these courses become ubiquitous and a long time before students are graduating from these courses, so it's a long game. The parallel one might draw is engineering, which wasn't initially a uni degree but now very much is — I think we'll see the same phenomenon with data science."

THE ISSUE OF PRIVACY — we know you've been wondering — is ever present in conversations about big data. While not all information ripe for big-data analysis is derived from the personal lives of human beings (think NASA's climate sensors, or motor-vehicle-performance data), much of the most profitable information is.

A memorable story from earlier this year gives an example of just how powerful — and disturbing — big-data insights based on personal information can be: An in-house statistician at Target (in the US) analysed the purchasing behaviour of women on the department store's baby-shower registry to come up with a "pregnancy prediction" model which could then be applied to all shoppers on its customer database. When a teenage girl in Minneapolis began stocking up on signal items like unscented lotion, vitamins and cotton wool, it prompted Target to send her coupons for baby clothes and maternity wear — a move her father considered grossly inappropriate until he learned she was, in fact, expecting.

"For a lot of people, that crosses a line," says Chris Yiu, the economist heading the Digital Government Unit at UK think tank Policy Exchange in London. Yiu recently authored a report highlighting the potential for between GBP16 billion and GBP33 billion of public-sector efficiency savings through big-data analytics, and says the issue of privacy is one of the biggest obstacles.

"With all of this very rich data you have tremendous potential to save money, but also to infringe privacy and civil liberties. You need a way to hold the government to a very high standard of ethical behaviour," Yiu says. His report recommends governments adopt a Code for Responsible Analytics requiring adherence to the highest ethical and privacy standards, and also suggests test-driving big-data initiatives before rolling them out to the real world.

"We should sandbox and test with synthetic data before releasing this stuff into the wild, because there's so much potential for it to go wildly wrong," says Yiu. "Do it 'in a lab' first and see how it goes, then have a debate about the public policy benefits versus how far you had to go with personal data, and ask 'does it overstep the mark?' If it does, kill it in the lab."

Kelly takes a similar ethical position: "I'd argue the principle that should always be kept in mind is that just because you can do something with big data, doesn't mean you should."

Whether the private sector will display the same level of concern remains to be seen, and will depend largely on what we — consumers — are prepared to provide in return for free services.

"What people will start to understand is that when you log on to Facebook, you're essentially giving away your data. People might find it creepy that an organisation mines social data to make better decisions, but ultimately you've made that decision to give it away," says Kelly.

The potential consequences of that behaviour were on the agenda at DEF CON (the 20th annual, and controversial, computer hacker convention in Las Vegas, held in July this year), when the Online Privacy Foundation presented the results of its Kaggle-hosted competition titled "Psychopathy Prediction Based on Twitter Usage".

The organisation provided an anonymised dataset of around 3,000 Twitter users who had completed a psychological survey which calculated their "psychopathy score". Competitors were then invited to analyse 337 variables derived from the users' Twitter activity to come up with a model that could identify those with high levels of psychopathy based on their online behaviour.

"They did find there is a correlation — if you swear in your tweets or reply with a swear word, the more you do that the higher the psychopathy score. And if you reply with a conjunction — with a 'but' for instance — that increases the probability you're a psychopath. The correlation wasn't crazily strong, but there was one," says Goldbloom.

The real point of the exercise was to raise awareness about social-media use. "For instance, given this algorithm, an employer might run your tweets through to get a sense of your employability based on your Twitter profile."

Price reminds us that mining online chatter could also have positive outcomes.

"Imagine a scenario where health practitioners can use real-time, big-data analytics to understand where the flu virus is spreading, and at what pace, so they can tailor their response and ensure that sufficient vaccine stocks get to the right places," he says.

"The modern world has been built squarely on the foundations of data. Almost every aspect of our lives has been impacted by the ability of organisations to marshal, interrogate and analyse data. Our cars have been made more efficient by it, our medicines more effective, road safety improved and crimes solved faster."

It's a point almost everyone you speak to from the big-data world makes. "We're just doing what human beings have always done" — finding patterns and relationships to help us make better-informed decisions. Whether those insights are used for good or ill, profit or power, still comes down to the people using them. The difference today is merely one of scale.

4 comments on this story
by Carlos

Amazing. Reading this, it's clear that this 'data mining' is only really being done to help me make decisions. Well, that's just fantastic. Thanks ever so much.

August 22, 2012 @ 2:30pm
by Stehen Wilson

Great story, thanks.
It's true that "Big Data" can tell all sorts of things. But the question is, should it? The amazing capacity for sophisticated statistical techniques to extract truths about us seems to have inured some Big Data practitioners not only to the ethical dimension of their work, but its very legality.
Our Privacy Act takes a broad view of the "collection" of Personal Information. A lot of IT people presume that "collection" is an overt act of gathering original information, as by questionnaire or interview. They can be surprised to learn that any process by which Personal Information comes to be in a database is counted as a "collection". This includes the creation of new PI by data mining, and the analysis of "public domain" data. The Privacy Act in fact doesn't even use the concepts of "public" and "private"; if a business is holding Personal Information, it is in general subject to the Privacy Act, regardless of where that PI came from.
And so when Big Data generates new predictions about our preferences, beliefs, mental states and so on, the first questions that should be asked are these:
- Do they have a legitimate need to be collecting such information?
- Do individuals know it's being collected?
- Does the Privacy Policy disclose what data is being collected and why?
- Are the results (especially any related to Health and other Sensitive Information) being safeguarded in line with the National Privacy Principles of the Privacy Act?

August 27, 2012 @ 11:43am
by George Michaelson

At an OECD conference on big data and the Internet I heard speakers present on big-data opportunities in health. In one instance, a speaker pointed out that big data analysis had confirmed all the significant known indicators for one kind of cancer, but identified some more which classical data analysis had not identified. I find that outcome immensely encouraging to the good side of "wading through the crap", and we should be mindful that in the act of *some* privacy loss, we also secure benefits. It's not all one-sided.

February 12, 2013 @ 5:25pm
by rawnet

regardless of any benefits, we must ALWAYS ensure individuals know what is being collected and agree with what it will be used for...

September 9, 2013 @ 1:26am