A social media marketing platform is only as good as the information it can provide to its clients. By analyzing social media platforms it is possible to build a sizable database, but the usable information extracted from it will be limited. Many users do not provide all their information in their profile, or choose to put some of it in their bio instead of in the expected fields. Because of this, the scraped data is often incomplete and of limited use to those searching through it.
Fortunately, users on social media platforms behave in a structured way. The features they leave out are often correlated with other, known features. Machine learning algorithms can capture this relation between the different user features and use it to accurately predict a feature that is missing from the data set but can be inferred from the rest of the profile.
In our case, we are interested in the geographical location of Instagram users. Depending on their physical location, an influencer might or might not be suited to a specific influencer marketing campaign. The location of an influencer’s audience is of even greater interest, as an influencer’s appeal is not strictly tied to their own location. Some influencers have a large international following, and retrieving the general location of their followers would allow clients to choose influencers based on the following they have in the markets they wish to target.
Here our task is to predict the location of users based on the information available in their bio and posts. In essence, we approximate the reading comprehension of a human analyst, but automate it so that it can be applied at the scale of a constantly updating social media database.
An illustration of the relative scarcity of labeled data in social media data.
Only a small subset of Instagram users state their location: just 0.4% of all users provide this information.
Given the total size of our database, this means there are over 2 million profiles available to be used as training data for a machine learning model. In practice, not all of the available data is needed for training. A small subset of 140,000 profiles suffices for finding the relation between the available input and the desired output (location). The input is the text of the profile together with the text of the five most recent captions. To verify the accuracy of the results, a validation set of 30,000 profiles is set apart and used to evaluate how well the model performs on data it has not encountered during the training phase.
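The split described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the newline used to join the bio and captions, the fixed random seed, and the stand-in profile IDs are all hypothetical and do not come from the production pipeline.

```python
import random

def build_model_input(bio, captions, max_captions=5):
    """Join the bio with the most recent captions (assumed to be ordered newest first)."""
    return "\n".join([bio] + list(captions[:max_captions]))

def split_labeled_profiles(profile_ids, n_train=140_000, n_val=30_000, seed=0):
    """Draw disjoint training and validation samples from the labeled profiles."""
    if n_train + n_val > len(profile_ids):
        raise ValueError("not enough labeled profiles for the requested split")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = rng.sample(profile_ids, n_train + n_val)
    return sampled[:n_train], sampled[n_train:]

# Stand-in IDs; the real pool holds over 2 million labeled profiles.
labeled_ids = list(range(200_000))
train_ids, val_ids = split_labeled_profiles(labeled_ids)

example_input = build_model_input("Paris 📍", ["a", "b", "c", "d", "e", "f"])
```

Sampling both sets in one draw guarantees that no profile ends up in training and validation at the same time, which is what makes the validation score a fair estimate on unseen data.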
Google’s BERT (Bidirectional Encoder Representations from Transformers) model works well as a jumping-off point for our approach. It is a relatively recent model that uses transformer units to generate language embeddings for a variety of NLP tasks.
The multilingual base tokenizer addresses our need for meaningful input tokens in many different languages, and the pretrained model weights are a good initialization for our text embeddings. Rather than training the model from scratch, we start from a model that has already been pre-trained on existing data. This means the initial word embeddings are of higher quality than those of a randomly initialized model.
Taking Lefty’s data, we then train a stripped-down version of the BERT model (performance is maintained with only three of the twelve original Transformer units) and tokenize our input sentences with a vocabulary enhanced with location-related emojis: [📍, 🏠, 🇫🇷, 🇯🇵, 🇨🇦, etc.].
After only three epochs of training on our training data for the geolocation task, the weights are fine-tuned for our use case. The embeddings are passed to a fully connected neural network layer and finally to a softmax function that maps them to one of the possible output categories (36 specific countries and 1 catch-all OTHER category).
General overview of the different parts of the Geolocation Model.
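To make the last stage of the model concrete, here is a minimal sketch of a fully connected layer followed by a softmax over the 37 output categories. It is not the trained model: the weights are random, the country labels are placeholders, and the 768-dimensional embedding is simply the hidden size of BERT-base, which we assume here for illustration.

```python
import math
import random

# 36 country labels plus the catch-all class; the country names are placeholders.
LABELS = [f"COUNTRY_{i:02d}" for i in range(1, 37)] + ["OTHER"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedding, weights, biases):
    """Fully connected layer followed by a softmax over all classes."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Toy stand-ins: a random embedding and random, untrained weights.
rng = random.Random(0)
dim = 768
embedding = [rng.uniform(-1, 1) for _ in range(dim)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in LABELS]
biases = [0.0] * len(LABELS)

label, probs = classify(embedding, weights, biases)
```

The probability of the winning class doubles as a rough confidence score for the prediction.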
On the 36 countries that were of interest to our task, the final accuracy of the model is close to 90%. However, accuracy scores can be misleading in the case of our geolocation task. The main reason is that we have imbalanced classes, where the majority of the users belong to a small set of countries (e.g. USA, Brazil, Russia). A model that favors these dominant classes will perform well on the overall data set at the expense of the smaller countries. Because users from the 12 most represented countries make up two thirds of the database, overall accuracy could be high even if the predictions on the smaller classes are poor. We therefore need to look at how well each class performs individually.
An illustration of the imbalance of different classes in our data set. A small number of highly active countries make up a majority of the available data.
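A toy example shows how overall accuracy can hide poor performance on a small class. The numbers below are invented for illustration (a 90/10 split and a placeholder label `SMALL_COUNTRY`), and the "model" is a deliberately naive baseline that always predicts the most common country.

```python
from collections import Counter

# Invented, heavily imbalanced ground truth: 90 users from one country, 10 from another.
y_true = ["USA"] * 90 + ["SMALL_COUNTRY"] * 10

# Naive baseline: always predict the most frequent class.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of all users that received the correct label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_accuracy(y_true, y_pred, label):
    """Fraction of the users from one class that received the correct label."""
    members = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in members) / len(members)

overall = accuracy(y_true, y_pred)
small_class = per_class_accuracy(y_true, y_pred, "SMALL_COUNTRY")
print(f"overall accuracy: {overall:.2f}, small-class accuracy: {small_class:.2f}")
```

The baseline reaches 90% overall accuracy without labeling a single user from the smaller country correctly.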
The metrics we use to look at the per-class performance are precision and recall. Precision is the percentage of the predictions for a class that are correct (e.g. how many of the influencers labeled ‘SPAIN’ are actually Spanish?) and recall is the percentage of class members that are labeled correctly (e.g. what percentage of real Spaniards were assigned the label ‘SPAIN’?).
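Both metrics are straightforward to compute from a list of true labels and a list of predicted labels. The sketch below uses a small invented sample; the function name and the convention of returning 0.0 when a class has no predictions or no members are our own choices.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for a single class."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # everyone labeled as this class
    actual = sum(1 for t in y_true if t == label)     # everyone truly in this class
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Invented sample: 4 users from Spain and 6 from elsewhere.
y_true = ["SPAIN"] * 4 + ["OTHER"] * 6
y_pred = ["SPAIN", "SPAIN", "OTHER", "OTHER", "SPAIN"] + ["OTHER"] * 5

precision, recall = precision_recall(y_true, y_pred, "SPAIN")
```

In this sample, three users are labeled SPAIN and two of them really are Spanish, so precision is 2/3. Only two of the four real Spaniards are found, so recall is 1/2.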
When looking at the initial results it became clear that not all classes performed equally well.
However, it was possible to keep only the labels we were most confident of (in the top 3% of certainty) and set the rest aside in an uncertain category. The users who ended up there were typically either influencers who spoke English and did not mention their location in their posts, or influencers who explicitly mentioned several distinct locations and left it unclear which of them they belonged to.
Using this approach, two-thirds of the total dataset gets assigned directly to the correct label. Of the remaining third, a large majority is assigned the uncertain label, to avoid polluting the pool of real users from a country with users that are anything less than extremely likely to be from that same country.
Overall accuracy is thereby reduced in order to maintain a standard of at least 90% precision for the users added to each country.
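The filtering step can be sketched as a simple threshold on the softmax output. Note the assumption: we read "top 3% of certainty" here as a predicted probability of at least 0.97, which is only one possible interpretation, and the function name, the `UNCERTAIN` label string, and the example probabilities are hypothetical.

```python
UNCERTAIN = "UNCERTAIN"

def assign_label(probs, labels, threshold=0.97):
    """Keep the predicted country only when the model is confident enough."""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:  # assumed cut-off, see the note above
        return labels[best]
    return UNCERTAIN

labels = ["FRANCE", "JAPAN", "OTHER"]

confident = assign_label([0.98, 0.01, 0.01], labels)  # clear signal in bio and captions
ambiguous = assign_label([0.60, 0.30, 0.10], labels)  # e.g. several locations mentioned
```

Raising the threshold moves more users into the uncertain category, trading recall for higher precision within each country.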
Attention Is All You Need; Vaswani et al. 2017.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Devlin et al. 2018.
HuggingFace's Transformers: State-of-the-art Natural Language Processing; Wolf et al. 2019.