I wrote a machine learning Instagram bot
Disclaimer: I'm no longer scraping Instagram, and I'm not encouraging you to scrape Instagram; just know the tools are there. In doing this, I did violate Instagram's terms of service (this isn't illegal, just grounds for getting your account banned).
So what did I do exactly? Meet my dog:
When we first started Rowdy's Instagram back in November 2014, it was a joke for friends. I forgot about the account until January 2015, when spring semester started. Considering we only had 22 followers and what was, in my opinion, "quality content", I decided to do the only logical thing to gain traction: write an Instagram bot.
A 'bot' is just a program that pretends to be human. In this instance, the bot pretended to be an Instagram user in order to gain followers.
Initially the bot was pretty simple. I set a series of hashtags, and the bot would like and follow people posting under a particular tag. It would change tags every 5 minutes until it reached the end of the list, and then it would start again. I initially played with changing the bot's volume of likes throughout the day to match the activity level of the account's target demographic. The bot would be more active during the day and then reduce likes in the evening (based on EST). However, we found that the bot was reaching an audience in Japan and Australia, so there was no point restraining it. To maintain what we thought was a reasonable follower-to-following ratio, we came up with a pretty nifty function:
The function itself is pretty simple. It depends on 2 user-set variables, magic and target. Magic is the following-to-follower ratio the account should have once it reaches target followers. For instance: if magic = 0.75 and target = 1000, then at 1000 followers the bot should be following 75% of its followers, or 750 people. This function allowed us to find the sweet spot between following people and being followed at a gradual and smooth rate. Play with the function yourself below.
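The original function is not reproduced in this text, so the following is only a hypothetical curve that satisfies the one property stated above: at target followers, the bot follows magic times its follower count. The exponential form and the names target_ratio, target_following, and should_follow are assumptions for illustration, not the function the bot actually used.

```python
def target_ratio(followers, magic=0.75, target=1000):
    """Hypothetical following-to-follower ratio.

    Starts at 1.0 with zero followers and decays smoothly,
    reaching exactly `magic` when followers == target.
    """
    return magic ** (followers / target)


def target_following(followers, magic=0.75, target=1000):
    """Number of accounts the bot should be following at this follower count."""
    return round(followers * target_ratio(followers, magic, target))


def should_follow(followers, following, magic=0.75, target=1000):
    """True while the bot is following fewer accounts than the curve allows."""
    return following < target_following(followers, magic, target)


if __name__ == "__main__":
    for count in (100, 500, 1000):
        print(count, target_following(count))
```

With magic = 0.75 and target = 1000, this curve lets the bot follow almost as many accounts as follow it early on, then tightens gradually until it sits at 750 followed accounts for 1000 followers.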
Rowdy did really well. A little after a month, we passed 1000 followers. His growth seemed to be linear and reliable. I'd check the account casually throughout the week, rarely posting; yet his numbers seemed to grow at an average of 32 followers a day. I'd tweak the hashtags every couple weeks and let the bot do its business.
This data isn't perfect: the bot went down for an age without us realizing it (in addition to other hiccups), but the general trend is there.
I'd been brewing on the idea of turning Rowdy into a machine learning project for a while, and during a relative lull in school work, I expanded rowdy-bot to use machine learning (making intelligent decisions based on statistical data). At this point, I decided there was enough training data that we could model the accounts Rowdy interacted with. So I set up the bot to asynchronously download and process Instagram data. Here's a little visualization of what we found:
Axes should be labeled Followers, Following, Posts (X,Y,Z) where Blue dots are the bot's followers (the good guys) and Red dots are the folks who don't follow the bot back (the bad guys). This distinction was chosen because those who don't follow back are accounts the bot should actively attempt to stay away from.
My first impression was: hey, the different types of accounts look separable. So, using gradient descent, I came up with an OK logistic regression model that correctly classified 70% of the test set data.
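As a rough sketch of that approach, here is logistic regression trained with batch gradient descent on the three plotted features (followers, following, posts). Everything specific here is an assumption: the toy accounts are made up, the log scaling of the counts is my choice, and the learning rate and epoch count are picked for this tiny data set rather than taken from the bot.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def features(followers, following, posts):
    """Log-scale raw counts so huge accounts don't dominate (an assumption)."""
    return [math.log1p(followers), math.log1p(following), math.log1p(posts)]


def train(samples, labels, lr=0.03, epochs=10000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """True if the model expects the account to follow back."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Hypothetical toy data: (followers, following, posts), label 1 = follows back.
ACCOUNTS = [
    ((100, 500, 20), 1), ((200, 800, 50), 1),
    ((50, 300, 10), 1), ((150, 600, 30), 1),
    ((10000, 100, 500), 0), ((50000, 200, 1000), 0),
    ((20000, 50, 300), 0), ((8000, 150, 200), 0),
]
X = [features(*account) for account, _ in ACCOUNTS]
Y = [label for _, label in ACCOUNTS]
WEIGHTS, BIAS = train(X, Y)
accuracy = sum(predict(WEIGHTS, BIAS, x) == bool(y) for x, y in zip(X, Y)) / len(Y)
```

The toy data is cleanly separable, so the model fits it perfectly; real accounts overlap far more, which is consistent with the roughly 70% figure above. A proper evaluation would also score a held-out test set rather than the training data.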
In addition to recognizing potential followers, a smarter bot should be able to choose its own hashtags for finding and following people. To do this, I played with the PageRank algorithm (what Google used to use for ranking webpages). PageRank takes the number of connections linked to a webpage and creates a score for the webpage. The higher the score, the higher the webpage ranking. In my implementation, I counted a 'connection' as concurrent hashtag use in a post. For instance, if a post used several hashtags, #love among them, then a connection was made between each pair of those tags. More connections to a hashtag means a higher score, but the more connections a tag has in total, the less value each of its connections passes on. There's a little bit more to the algorithm, but that's the gist. Play around with the idea below for some intuition (bigger size means better ranking):
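A minimal sketch of that idea: build a weighted co-occurrence graph from the hashtags in each post, then run PageRank by power iteration. The function names, the damping factor of 0.85, the iteration count, and the sample posts are all assumptions for illustration; the bot's own implementation may have differed.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(posts):
    """Connect every pair of hashtags that appear together in a post."""
    graph = defaultdict(lambda: defaultdict(int))
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a weighted, undirected graph."""
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            total_weight = sum(graph[node].values())
            # A tag splits its score across its connections, so a tag
            # with many connections passes less value along each one.
            for neighbor, weight in graph[node].items():
                new_scores[neighbor] += damping * scores[node] * weight / total_weight
        scores = new_scores
    return scores


# Hypothetical posts, each reduced to its list of hashtags.
posts = [["#dog", "#love"], ["#dog", "#cute"], ["#dog", "#puppy"]]
scores = pagerank(build_cooccurrence(posts))
best_tag = max(scores, key=scores.get)
```

In this toy graph #dog co-occurs with every other tag, so it ends up with the highest score, while the scores across all tags still sum to 1.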
My alteration of PageRank first ranked hashtags used by accounts that followed Rowdy (the good guys) within a 3 hour time period. This allowed us to see what tags were trending among Rowdy's followers and led the bot to better hashtags. We quickly found out that this train of thought was flawed. Some hashtags on Instagram just have an extremely high volume. As such, the algorithm found hashtags that were not particular to Rowdy's followers, but hashtags that were just popular in general. To fix this, I ranked the hashtags used by 'those who Rowdy followed but who didn't follow back' (the bad guys) and then subtracted these values from the other ranking. I was pleased to see this worked pretty well.
#dog ranked pretty well, in addition to some of the values I saw before, like #beauty. However, certain high-ranking tags like #selfie now had a negative value. I was really pleased with that. Try it out for yourself (The positive values are taken from the sliders above. If it's red, then it has a negative value):
[Interactive demo: sliders setting the number of bad connections for #love, #omg, #selfie, and #dog]
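The subtraction step can be sketched in a few lines. The score values below are invented for illustration, and differential_rank is a hypothetical name; the only thing taken from the description above is that the bad-guy score is subtracted from the good-guy score for each tag.

```python
def differential_rank(good_scores, bad_scores):
    """Subtract each tag's bad-guy score from its good-guy score.

    Tags missing from one ranking count as 0.0 there. Returns a list of
    (tag, score) pairs sorted from most to least promising.
    """
    tags = set(good_scores) | set(bad_scores)
    diff = {
        tag: good_scores.get(tag, 0.0) - bad_scores.get(tag, 0.0)
        for tag in tags
    }
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


# Hypothetical PageRank scores for followers and non-followers.
good = {"#dog": 0.30, "#love": 0.25, "#selfie": 0.20}
bad = {"#dog": 0.05, "#love": 0.20, "#selfie": 0.35}
ranked = differential_rank(good, bad)
```

A tag that is popular with everyone, like #selfie in this made-up example, scores high in both rankings and ends up negative, while a tag that is specific to followers stays on top.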
Now that the bot was 'smart', it started averaging around 100 new followers a day. It was also averaging several thousand API requests to fuel its decisions. The bot got a bit too greedy with the number of requests it was making, and Instagram shut it down. Ultimately, it was completely fine. Using a bot was technically wrong. It was just really fun. It was a great learning experience and I felt as if it was worth it. We got cut off at 3700 followers.
If I had to do it again, there are certain things I might reconsider:
Instead of finding potential followers by hashtags, find out what accounts your followers follow and determine if they meet your model. I imagine this would reduce the chance the bot finds spam accounts, and would be more likely to find new followers closer to the existing model.
Be more scientific. Sure, there is some data, but really nothing was controlled. I can't really say whether it was the follower modeling or the hashtag ranking that made the bot better. I can't really determine what leads accounts to follow in the first place. To make the bot better, collecting more data and controlling the environment more strictly would be a must.
Incorporate computer vision. Instagram is a social media outlet based on images! I completely ignored what was probably very valuable data. I can imagine running some sort of unsupervised feature detection to create parameters I could then use in my model.