Predicting people is hard

Jan 24, 2016

I had some fun with machine learning last year. Essentially I had a lot of engagement data for a group of customers and I wanted to see if I could use it to see which way they'd fall on a YES/NO decision.

It turns out I was terrible at working out why people were YES, but much better at working out that they would be NO. Knowing this, I could adjust the aggregate results, and I started getting more realistic predictions. I got some very right! I am a data scientist magician. I got some a little off. Oops?
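One way to make that kind of aggregate adjustment, as a rough sketch: if you know from past results how often each label turns out to be right, you can discount the unreliable YES predictions and top up from the occasional wrong NO. Every number and name below is made up for illustration; this is not the actual analysis.

```python
def adjusted_yes_count(predicted_yes, predicted_no, precision_yes, precision_no):
    """Estimate how many real YES outcomes to expect.

    precision_yes: share of past YES predictions that really were YES.
    precision_no:  share of past NO predictions that really were NO.
    Both are assumed to come from held-out historical data.
    """
    real_yes_among_yes = predicted_yes * precision_yes
    real_yes_among_no = predicted_no * (1 - precision_no)
    return real_yes_among_yes + real_yes_among_no


# Hypothetical model output: 400 predicted YES, 600 predicted NO.
# Hypothetical track record: YES calls right half the time, NO calls right 90% of the time.
raw_yes = 400
adjusted_yes = adjusted_yes_count(400, 600, precision_yes=0.5, precision_no=0.9)
print(f"raw: {raw_yes}, adjusted: {adjusted_yes:.0f}")
```

The adjustment only helps the total. It says nothing about which individual people are in the YES pile, and it assumes the track record holds for the next batch.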

In the end I settled on a rough range I could be reasonably confident in - but it was too wide to base any real decisions on. The vague result I was coming up with was (while neat to pull from raw data) not really better than everyone else's gut.

I still think it was cool I built a gut, but from a business point of view it's fairly pointless. We already had a lot of those.

There's a school of thought that says what I was doing was absolutely the right track: I just needed to gather more information. My machine gut would scale better than everyone else's gut if I just found more firehoses to point at it.

I think this is wishful thinking. We really want machine learning to do the cool things for people-based analysis that we know it can do for more concrete subjects. So the fact that the data is almost universally non-existent, inaccessible or misleading is put to the side.

(Incidentally, I love this story about pigeons being used to process brain scans. If you ever get stuck in the past you can jump-start all kinds of research with a sufficiently large pigeon coop.)

There's a famous story about Target predicting someone was pregnant from their purchase history:

“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Now if you think about it for a moment, this story is fairly suspect. Taboo sex, a father defending honour, a father shamed (by the MACHINE). When stories have features that make them especially shareable (sex, disgust, shame) we should generally be suspicious of them. If the odds say it didn't reach you on its merits, chances are it has few merits.

And as it turns out, this story is very suspect.

Stories like this help construct the idea of data analysis as something that is closer to magic. We put the information in the box, turn it on and we know you better than you know yourself. In reality, predictive systems are quite bad quite a lot of the time. Here is Amazon trying to market to me:

[Image: screenshot of Amazon]

Now I really like books. Amazon has all my card details on file. It has years of my book purchasing history. I have a device that can only display books bought from Amazon. I can click a single button and it will take my money and instantly give me a book on that device.

And Amazon is marketing the same book I don't want to me three times.

Ever bought an oven (or similar, infrequent purchase)? Were you followed round the internet for weeks by an optimistic algorithm hoping this was just the start of your oven buying spree?

Think about the huge number of hours invested in making that incredibly complicated process of identification, auctioning and ad display happen. Think about how stupid the outcome is.

This article on the Facebook newsfeed team (whose job is to guess what things your friends have to say you'll find interesting) has a number of telling details, but I liked this one:

Over the past several months, the social network has been running a test in which it shows some users the top post in their news feed alongside one other, lower-ranked post, asking them to pick the one they’d prefer to read. The result? The algorithm’s rankings correspond to the user’s preferences “sometimes,” Facebook acknowledges, declining to get more specific. When they don’t match up, the company says, that points to “an area for improvement.”

Personalisation like this is a hard trap to see when you're falling into it. You start off with excellent data about people's past engagement, but the instant you start using that information to shortcut the process you damage your own information collecting system. People can't engage with things you don't show them. It's cutting off your own legs on the grounds that if you weigh less you'll run faster.

To avoid this you'd have to do all sorts of clever tricks to avoid self-reinforcing data, justifying the entire team of clever people.
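As one example of that kind of trick, a common approach is to hold back a few slots for things the ranking would never otherwise show, and to label them so the engagement data can be read separately later. This is only a sketch: `build_feed`, the 20% split and the post names are all invented for illustration, and nothing here is how Facebook (or anyone else) actually does it.

```python
import random


def build_feed(ranked_posts, slots, explore_fraction=0.2, rng=None):
    """Fill most slots from the top of the ranking, but reserve a few
    for posts from further down that would otherwise never be shown."""
    rng = rng or random.Random()
    n_explore = int(slots * explore_fraction)
    n_top = slots - n_explore

    top = ranked_posts[:n_top]
    rest = ranked_posts[n_top:]
    explore = rng.sample(rest, min(n_explore, len(rest)))

    # Tag each post with how it got into the feed, so that clicks on
    # "explore" posts can be analysed apart from clicks on "ranked" ones.
    return [(post, "ranked") for post in top] + [(post, "explore") for post in explore]


# Hypothetical ranking of 50 posts, best first; seeded so the run is repeatable.
posts = [f"post-{i}" for i in range(50)]
feed = build_feed(posts, slots=10, rng=random.Random(0))
```

The explore slots cost you something today (you are knowingly showing posts you rate lower) in exchange for data that hasn't been filtered through your own guesses.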

But what's the result? "Sometimes".

There are two assumptions that justify the creation of these systems:

  • If you have enough information, you can know things about people without asking.
  • You have enough information.

The first is the temptation to godhood that cheap processing power offers. The second should be a rebuff to that, but is usually forgotten.

The Handmaid's Tale and Data Collection

Dec 31, 2015

Finally got round to reading The Handmaid's Tale this Christmas. This bit especially struck me as being very relevant to the next few decades:

You had to take those pieces of paper with you when you went shopping, though by the time I was nine or ten most people used plastic cards. Not for the groceries though, that came later. It seems so primitive, totemistic even, like cowrie shells. I must have used that kind of money myself, a little, before everything went on the Compubank.

I guess that’s how they were able to do it, in the way they did, all at once, without anyone knowing beforehand. If there had still been portable money, it would have been more difficult. [...]

Tried getting anything on your Compucard today?

Yes, I said. I told her about that too.

They've frozen them, she said. Mine too. The collective's too. Any account with an F on it instead of an M. All they needed to do is push a few buttons. We're cut off.

One of Atwood's repeated points with her flashbacks and coda is that Gilead isn't an alien thing dropped from the sky. Future oppressive governments won't just be throwbacks, they'll also be logical continuations of us - our society and technology.

There's not really a good reason for a bank account to know if you're male or female - it might help them make pie charts and market to you better, but it doesn't actually come into the service they're providing. It's harmless until it isn't.
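The simplest defence is not to store the field in the first place. A minimal sketch, assuming an allowlist applied at sign-up; the field names and the idea of what a bank "needs" are hypothetical here, not taken from any real system:

```python
# Hypothetical allowlist: the only fields this imaginary bank needs
# in order to actually run an account.
NEEDED_FIELDS = {"account_id", "name", "address"}


def minimise(submitted):
    """Keep only the fields the service needs; drop the rest before storage."""
    return {key: value for key, value in submitted.items() if key in NEEDED_FIELDS}


signup_form = {
    "account_id": "0001",
    "name": "A. Customer",
    "address": "1 Example Street",
    "sex": "F",  # handy for pie charts, not needed to hold someone's money
}

stored = minimise(signup_form)
```

A record stored this way gives a later "every account with an F on it" query nothing to match.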

Maciej Cegłowski makes this point in a talk where he argues that we should treat data as being a little more toxic.

Eric Schmidt of Google suggests that one way to solve the problem is to never do anything that you don’t want made public.

But sometimes there's no way to know ahead of time what is going to be bad.

In the forties, the Soviet Union was our ally. We were fighting Hitler together! It was fashionable in Hollywood to hang out with Communists and progressives and other lefty types.

Ten years later, any hint of Communist ties could put you on a blacklist and end your career. Some people went to jail for it. Imagine if we had had Instagram back then.

Closer to our time, consider the hypothetical case of a gay blogger in Moscow who opens a LiveJournal account in 2004, to keep a private diary.

In 2007 LiveJournal is sold to a Russian company, and a few years later — to everyone's surprise — homophobia is elevated to state ideology.

Now that blogger has to live with a dark pit of fear in his stomach.

If they take control of your country, your company, your data - how easy have you made it for them? What do you store that does you almost no good, but could do others massive harm?

UPDATE dbo.customers SET status = 'Invalid' WHERE ?