Romantic data-mining

Feb 04, 2009 20:24

It is well known, at least in certain circles, that the optimal strategy for choosing a spouse is to first date 37% of the people whom you could conceivably marry, and then continue dating people but marry the next one you come to whom you like better than all those you've dated so far. If you date fewer than 37% of people before doing this then, under certain assumptions which admittedly may not reflect reality perfectly, it is statistically likely that if you'd waited a bit longer you would have found someone you like better than your current spouse. If you date more than 37% then you're statistically more likely to have already gone past your optimal spouse while trying to get a larger sample size. Of course, dating 37% of all the people of the opposite gender to yourself who are within a certain age range of you is hardly feasible - and it's even worse if you swing both ways. It also kind of sucks all the fun out of romance to have to tell someone "Look, I really like you and I think things are working out great, it's just that, well, statistically speaking...".
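For the sceptical, the 37% figure is easy enough to check by simulation. What follows is only a rough sketch under the usual textbook assumptions (candidates arrive in random order, can be ranked strictly, and can't be recalled once passed over); the function name `success_rate` and its parameters are made up for illustration.

```python
import random

def success_rate(n, skip_fraction, trials=20000, seed=0):
    """Estimate how often the 'look, then leap' rule lands the single best candidate.

    n             -- number of candidates (an assumption; pick whatever you like)
    skip_fraction -- fraction of candidates dated purely to calibrate
    """
    rng = random.Random(seed)
    k = int(n * skip_fraction)          # size of the calibration sample
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))          # higher is better; n - 1 is the best
        rng.shuffle(ranks)
        best_seen = max(ranks[:k], default=-1)
        chosen = ranks[-1]              # if nobody beats the sample, settle for the last
        for r in ranks[k:]:
            if r > best_seen:
                chosen = r              # first candidate better than everyone so far
                break
        if chosen == n - 1:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    for fraction in (0.05, 0.37, 0.90):
        print(fraction, success_rate(100, fraction))
```

Run with 100 candidates, the 37% cutoff should pick the best candidate noticeably more often than either a very short or a very long calibration phase.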

In this same vein of completely sucking all the fun out of life, I was thinking just now, rather randomly, of writing a little program to harvest the interests of say 10,000 random female LJ users and doing some analyses. This would be fairly straightforward. You point the program at your own profile page, which contains a list of all your friends' usernames. Given their usernames it's easy for the program to figure out the URL for their profile pages, and then you can repeat the same process for all of your friends' friends. Then you can do their friends' friends' friends, and so on, spidering out through the degrees of separation and stopping when you hit 10,000, or whatever threshold you like. Perfectly simple, and incidentally this is how places like Google build up their huge-ass databases of websites.
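The spidering itself is just a breadth-first search over the friends graph. Here is a minimal sketch of that idea. It deliberately doesn't touch LJ at all: `fetch_friends` is a hypothetical callable standing in for "download this user's profile page and pull out the friends list", and `toy_graph` is made-up data so the thing can run on its own.

```python
from collections import deque

def crawl(start, fetch_friends, limit=10000):
    """Visit users breadth-first from `start`, stopping after `limit` users.

    fetch_friends(username) must return an iterable of usernames; in real
    life it would fetch and parse a profile page (not shown here).
    """
    seen = {start}              # everyone already queued, to avoid revisiting
    queue = deque([start])
    visited = []                # users in the order we actually processed them
    while queue and len(visited) < limit:
        user = queue.popleft()
        visited.append(user)
        for friend in fetch_friends(user):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return visited

# Made-up friends graph, purely for illustration.
toy_graph = {
    "me": ["alice", "bob"],
    "alice": ["me", "carol"],
    "bob": ["carol", "dave"],
    "carol": [],
    "dave": ["me"],
}

if __name__ == "__main__":
    print(crawl("me", lambda user: toy_graph.get(user, []), limit=4))
```

Because the queue is first-in first-out, all of your friends get processed before any friends-of-friends, which is exactly the "degrees of separation" ordering described above.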

If along the way you store each person's interests then you can make estimates of things like "what percentage of females are interested in anime and video games and metal and whatever else you like", which is obviously of interest for the task of answering questions like "How lucky am I to have met potential spouse X who ticks, say, 7 out of my 10 important boxes?". I thought this would be a cool experiment to do to pass some time and make a fun LJ entry out of it, even if it wouldn't be particularly scientifically valid. There is, obviously, a huge sampling bias in effect: we're really answering the question "what proportion of females on LiveJournal are interested in X, Y and Z?", which is not the same as asking it about females in general. People who use LJ are certainly more likely to be nerds than people selected at random off the street, so the proportion of people interested in nerdy things is likely to be substantially overestimated. The data available to someone doing my proposed experiment is not enough to be able to sensibly correct for this bias.
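Once the interests are stored, the "how many boxes does a random person tick" estimate is a one-liner's worth of counting. A small sketch, assuming the harvested data ends up as a dictionary from username to a set of interests; the name `share_matching` and the four profiles below are invented for illustration and are not real LJ data.

```python
def share_matching(profiles, wanted, at_least=None):
    """Fraction of profiles listing at least `at_least` of the `wanted` interests.

    With at_least=None every wanted interest must be present.
    """
    wanted = set(wanted)
    needed = len(wanted) if at_least is None else at_least
    if not profiles:
        return 0.0
    hits = sum(
        1 for interests in profiles.values()
        if len(wanted & set(interests)) >= needed
    )
    return hits / len(profiles)

# Invented sample, standing in for the harvested profiles.
profiles = {
    "u1": {"anime", "metal", "video games"},
    "u2": {"anime", "knitting"},
    "u3": {"metal", "video games"},
    "u4": {"gardening"},
}

if __name__ == "__main__":
    print(share_matching(profiles, ["anime", "metal"]))              # all boxes ticked
    print(share_matching(profiles, ["anime", "metal"], at_least=1))  # at least one box
```

The `at_least` parameter is what you'd use for the "7 out of my 10 important boxes" version of the question.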

So I started looking through people's profile pages to get a feel for just what would be involved in pulling the relevant details out of the HTML code and I realised "Geez, none of my friends have chosen to specify what gender they are. I didn't think I hung out with such privacy-sensitive people on LJ". Then I looked at my own profile page and realised that it didn't have my gender on there, even though when I checked it, it was absolutely set to "male". It would seem that LJ doesn't actually do anything with this information. At first I thought perhaps the default privacy setting was to not show it, but after a quick poke around the options (and I do mean quick, I could easily have missed something) I couldn't find any such setting. I really think it's impossible to get your gender to show up on your profile page. Which completely ruins my experiment, and also raises the question - why do they even bother letting you set your gender if nobody but you can possibly see it? Purely for market research purposes? How odd.

The best way around this would be to develop some strategy for estimating someone's gender based on their username, and estimating the likely chance of error. I suspect that, in principle, you could actually make such guesses with pretty good accuracy, perhaps as high as 80% or above. Observe that the average person on the street, with no particularly high rate of exposure to Japanese media, could still probably get a pretty good gut instinct that Motoko is a girl's name and Kenji is a boy's. There's just an intuitive feel to these sorts of things. But the effort involved in training a program to make the guesses is sufficiently high compared to how interested I am in seeing the results that it destroys the project's status as being quick, easy and fun enough to be worth doing on the spur of the moment. Oh well.

On a closing note, I want to stress that this whole post actually has almost nothing to do with my marriage problems as touched upon in a recent entry, which I'm actually feeling quite a bit better about at the moment. People, especially Kirsty, should interpret this entry as being much, much more about my predilection for at least thinking about behaving in a mathematically optimal sense at all times, and also for having fun doing impulsive nifty tricks with computers the moment that I think of them.

statistics, data mining, maths, social networks
