Emoji vectors and optimization. Oh my! (Part 1: Word2Vec and Python)
A while back, the Slack bot EmojiBot went offline, endangering my workflow. You could ask it to find an emoji by sending a query like @emojibot electricity, and it would reply with emoji such as , , , and . But all was okay, because in the meantime I built my own replacement: a Python HTTP server connected to a lovely little Hubot instance. This post details how I went about doing that.
Before I move on, a few basics on the overarching question: how does one match up text to emoji?
One great project I encountered was emojilib. The project contains a mapping of emoji to a few descriptive keywords each. You can find a search interface for it here. It works spectacularly for most emoji, but once you pull out a thesaurus, things get worse. For example: ghost finds , but ghoul turns up nothing. happy finds a long list of emoji (, , , and to name a few), but ecstatic finds none. And nothing comes up for Ryan, my very own name, either!
We can do better. Google has word vectors for 3 million words and phrases in its Google News word2vec dataset! Surely there must be a way to utilize this. And there is. There are already methods to make a word2vec model out of emoji, such as the one detailed by this paper. And while the word2vec training methods are amazingly ingenious and interesting, I decided not to dive into the depths of AI deep neural network machine learning and instead play around with Google's word embeddings and emojilib.
With word2vec, you can find the similarity of two words by finding the angle between their two vectors. This is the same as taking the arc cosine of the dot product of their unit vectors. However, we can do without the trigonometry and just take the dot products themselves, which gives similarities between -1 and 1 (in practice, mostly between 0 and 1).
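As a quick illustration of that idea, here's a standalone sketch using NumPy and two made-up vectors (not the actual 300-dimensional word2vec data):
import numpy as np

# Two made-up "word vectors" (real word2vec vectors have 300 dimensions)
cat = np.array([0.9, 0.3, 0.1])
kitten = np.array([0.8, 0.4, 0.2])

# Normalize to unit length; the dot product of unit vectors is the cosine of the angle
cat_u = cat / np.linalg.norm(cat)
kitten_u = kitten / np.linalg.norm(kitten)

similarity = np.dot(cat_u, kitten_u)  # cosine similarity
angle = np.degrees(np.arccos(np.clip(similarity, -1.0, 1.0)))  # the angle itself, if you want it
print(similarity, angle)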
With this method, I can build a chain from fancy words (e.g. ecstatic) -> simple words (e.g. happy) -> emoji, and end up with a pretty good emoji searcher!
Building a Map
Google's word vectors are for words, not emoji. So I need a way to translate words to emoji. Luckily, this is easy to do with emojilib and a little Python:
import json
from collections import defaultdict

with open('emojilib.json', 'r', encoding='utf-8') as f:
    emojis = json.load(f)  # Assume emojilib is downloaded as emojilib.json

mapping = defaultdict(list)  # A dict where each item defaults to []
for emoji_name, emoji in emojis.items():
    mapping[emoji_name].append(emoji['char'])  # The full name maps to the emoji
    for part in emoji_name.split('_'):
        mapping[part].append(emoji['char'])  # So does each part of the name
    for keyword in emoji['keywords']:
        mapping[keyword].append(emoji['char'])  # Add extra terms into the mapping

print(mapping)  # The generated mapping of words -> emoji
Then we get a nice mapping like this for categories:
flag: π¬π§ πΈπ° πΉπ¨ π²π° π³π« π§π― πΉπ· π¬π· π―π΄ π°π² π²πΎ πΉπΏ π΅π π¨π© π©π² π¬π³ π±π· π¨π» π»π¬ π²πΏ π΅π± π»π¦
banner: π¬π§ πΈπ° πΉπ¨ π²π° π³π« π§π― πΉπ· π¬π· π―π΄ π°π² π²πΎ πΉπΏ π΅π π¨π© π©π² π¬π³ π±π· π¨π» π»π¬ π²πΏ π΅π±
country: π¬π§ πΈπ° πΉπ¨ π²π° π³π« π§π― πΉπ· π¬π· π―π΄ π°π² π²πΎ πΉπΏ π΅π π¨π© π©π² π¬π³ π±π· π¨π» π»π¬ π²πΏ π΅π±
nation: π¬π§ πΈπ° πΉπ¨ π²π° π³π« π§π― πΉπ· π¬π· π―π΄ π°π² π²πΎ πΉπΏ π΅π π¨π© π©π² π¬π³ π±π· π¨π» π»π¬ π²πΏ π΅π±
woman: π π©βπ¦ π§ββοΈ π§ββοΈ π£ββοΈ π£ββοΈ π©ββοΈ π©ββοΈ π©βπ©βπ§ π©βπ©βπ§ π§ββοΈ πββοΈ π€¦ββοΈ π€¦ββοΈ π©βπ¨ π©βπ¨
nature: π¦ π© π π π· π πΎ πΊ π π π π π π‘ π π π
man: π π§ββοΈ π§ββοΈ π€·ββοΈ π€·ββοΈ π± π± π€½ββοΈ πββοΈ πββοΈ π£ π§ββοΈ π§ββοΈ πββοΈ πββοΈ
face: π π€ π π€€ π€€ π’ π π π π¬ π· π π π€¬ π£ π π π€£
animal: π¦ π© π πΊ π π½ π π π π π‘ π π π¦ π¦ π€ π¦
human: π π©βπ¦ π©ββοΈ π©βπ©βπ§ π« πββοΈ π©βπ¨ π©βπ©βπ§βπ§ π©βπ§βπ§ π¨βπ€ π©βπΎ βΉ π¨βπ¦ π΅ π π₯ π¨ββοΈ π¨ββ€οΈβπ¨ π©βπ§βπ¦
And this for individual emoji:
lost: π³
moving: π¦
sterling: π·
yummy: π
vegas: π°
suspension: π
eat: π½
bolivarian: π»πͺ
saucer: πΈ
second: π₯
Wonderful! Now on to utilizing this!
Short notice: Making gensim more efficient
I use the Python gensim library for parsing Google's .bin file. However, when loading the dataset, gensim stores it all in memory. With my little 8 GB laptop already running a few browsers (looking at you, Electron), things get a little tight. Actually, extremely tight. Like, Windows freezes as it tries to move memory pages to the disk in a panic. Not too great. To remedy this, you can move all the vectors to a memmap:
gensim stores its data in two places: an array of all the words (in, at, etc.) and a 2D array of vectors, each mapped to its word by index. The code below saves both of these for later use (to avoid loading the full model into memory each time):
import json
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

with open('vocab.json', 'w') as f:
    json.dump(model.index2word, f)  # Save the word strings

model.init_sims(replace=True)  # Calculate all the unit vectors
norms = np.memmap('normsmemmap.dat', dtype=model.wv.vectors_norm.dtype,
                  mode='w+', shape=model.wv.vectors_norm.shape)
norms[:] = model.wv.vectors_norm  # Write the normed vectors to the memmap
del model  # Discard the model from memory
And to load it:
with open('vocab.json', 'r') as f:
    vocab = json.load(f)

norms = np.memmap('normsmemmap.dat', dtype=np.float32).reshape((-1, 300))
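For example, pulling a single word's unit vector back out only reads the rows it needs from disk (the word happy here is just for illustration):
happy_vec = norms[vocab.index('happy')]  # Read one 300-dimensional unit vector
print(happy_vec.shape)                   # (300,)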
Filtering the map
While Google's dataset has 3 million words, emoji names can still be obscure enough to avoid appearing in the English language. For example, shipit is not yet recognized as a word. Neither is dog2. Lowercase country names also aren't included. That requires a little extra parsing work:
lowercasedvocab = [word.lower() for word in vocab]
vocab_set = set(vocab)  # For faster searching
lower_vocab_set = set(lowercasedvocab)  # Again, faster searching

for word in list(mapping.keys()):
    if word not in vocab_set:
        # Check to see if the word appears in a different case
        if word.lower() in lower_vocab_set:
            cased_word = vocab[lowercasedvocab.index(word.lower())]
            mapping[cased_word] = mapping[word]
        # Remove the original word from the mapping
        del mapping[word]
This process performs operations such as the following (→ for rename, ✗ for remove):
papua → Papua
✗ couple_with_heart_woman_man
✗ funeral_urn
✗ man_elf
kazakhstan → Kazakhstan
✗ drooling_face
✗ ideograph
✗ climbing_woman
✗ medal_military
✗ blonde_man
✗ straight face
norfolk_island → NORFOLK_ISLAND
✗ rowing_woman
✗ oncoming_police_car
✗ highfive
full_moon → Full_Moon
Here, utilizing set makes a huge difference. At least a 5x gain in speed. It's a lot faster to search a hash table than to iterate through a 3 million word array.
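If you want to see the difference yourself, here's a quick (and unscientific) timing sketch with a made-up 3-million-entry vocabulary:
import timeit

fake_vocab = ['word%d' % i for i in range(3_000_000)]
fake_vocab_set = set(fake_vocab)

# Membership test for a word near the end of the list vs. in the set
print(timeit.timeit(lambda: 'word2999999' in fake_vocab, number=10))      # linear scan
print(timeit.timeit(lambda: 'word2999999' in fake_vocab_set, number=10))  # hash lookup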
Filtering the word vectors
Google's dataset is HUGE. It takes a ton of memory and a ton of time to load, and in reality we only care about emoji. If words aren't similar to emoji names, we don't need them.
The code below uses np.inner, which is a nice way of computing the dot products between the 3 million x 300 word2vec vocabulary and the 2854 x 300 emoji vocabulary, giving a 3 million x 2854 array of dot product results. You may notice that we could instead transpose the emoji vocabulary array and do a matrix multiplication (@ in numpy) between the resulting 3 million x 300 and 300 x 2854 arrays, but np.inner is a little simpler to call in my opinion.
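The two spellings really are the same computation; a tiny sanity check with random stand-in arrays:
import numpy as np

A = np.random.rand(5, 300).astype(np.float32)  # stand-in for a few word2vec rows
B = np.random.rand(4, 300).astype(np.float32)  # stand-in for a few emoji vocab rows

# np.inner dots every row of A with every row of B, which is exactly A @ B.T
print(np.allclose(np.inner(A, B), A @ B.T))  # True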
Also, computing this 3 million x 2854 array in one go would take a lot of memory. So we do it in chunks and write those chunks to a memmap. Using amax to find the maximum dot product for each word (i.e. to find how close each word gets to our emoji vocabulary) also decreases the memory needed.
# Unit vectors for every word that appears in the emoji mapping
vocab_norms = np.array([norms[vocab.index(word)] for word in mapping])

# For each of the 3 million words, store its highest similarity to any emoji word
dp = np.memmap('dpmap.dat', dtype=norms.dtype, mode='w+', shape=(norms.shape[0],))

CHUNKSIZE = 1000
for s in range(0, norms.shape[0], CHUNKSIZE):
    e = s + CHUNKSIZE
    dp[s:e] = np.amax(np.inner(norms[s:e], vocab_norms), axis=1)
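Sorting dp is enough to peek at the extremes; the two tables below can be produced with something like this (my own reconstruction of the bookkeeping, not necessarily the exact code used):
order = np.argsort(dp)            # indices sorted by similarity, lowest first
least_similar = order[:10]        # words farthest from every emoji word
most_similar = order[-10:][::-1]  # words closest to some emoji word

for i in least_similar:
    print(vocab[i], dp[i])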
Now we can find things like the words least similar to our emoji vocab:
Word Similarity
------------------------- -----------
HuMax_IL8_TM 0.045590762
J.Gordon_##-### 0.05180111
By_DOUG_HAIDET 0.056708228
G.Biffle_###-### 0.068331175
K.Kahne_###-### 0.08244177
HuMax_TAC_TM 0.08358895
mso_para_margin_0in 0.08385273
Nasdaq_NASDAQ_TRIN 0.08743682
BY_ROBERTO_ACOSTA 0.08953415
Globalization_KEY_FACTORS 0.09093615
At least it seems Google #-ed out personal information. That's nice. I have no idea why people's names ended up in the dataset in the first place, though.
And things like the most similar words:
Word Similarity
--------------- ----------
Senegal 1.0000011
bearded 1.000001
drink 1.000001
urn 1.000001
industry 1.0000008
fly 1.0000008
training 1.0000008
organizing 1.0000008
Macao 1.0000007
dark_sunglasses 1.0000007
These dot products should never be greater than one. I'll blame it on floating-point imprecision.
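That kind of overshoot is easy to reproduce with float32 vectors; a quick check (values will vary from run to run):
import numpy as np

v = np.random.rand(300).astype(np.float32)
v /= np.linalg.norm(v)  # a "unit" vector, as exactly as float32 allows
print(np.dot(v, v))     # usually a hair above or below 1.0, e.g. 1.0000001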
I'll choose a somewhat arbitrary cutoff (let's say 0.5) and keep only the words with a similarity above 0.5:
good_indices = np.where(dp > 0.5)
good_vocab = np.array(vocab)[good_indices]

with open('good_vocab.txt', 'w', encoding='utf-8') as f:
    for word in good_vocab:
        f.write(word + ' ')

good_norms = norms[good_indices]
np.save('good_weights.npy', good_norms)
This yields about a 570 MB file (you can cut that down to 144 MB with a threshold of 0.6). That's somewhat more manageable! Gzipped it's about 440 MB, which is still on the large side.
Using the filtered vectors
Now to use the data:
# Load the filtered vocab and unit vectors saved above
good_vocab = open('good_vocab.txt', encoding='utf-8').read().split()
good_weights = np.load('good_weights.npy')
lookup = lambda word: good_weights[good_vocab.index(word)]

# Re-form the list of emoji words and their normalized vectors from the emoji map
emoji_words = list(mapping.keys())
emoji_norms = np.array([lookup(word) for word in emoji_words])

# Calculate similarities (dot products) between every emoji word and the query
dot = np.dot(emoji_norms, lookup('ecstatic'))

# Pull out the ten best matches, highest similarity first
matches = np.argpartition(dot, -10)[-10:]
sortedmatches = matches[np.argsort(dot[matches])][::-1]
for index in sortedmatches:
    print(dot[index], emoji_words[index], ' '.join(mapping[emoji_words[index]]))
This gives:
0.6626913 happy π π π π π πΉ π π π π π πΊ π
0.57641125 disappointed π₯ π π π
0.5699152 surprised π²
0.5558953 glad π
0.55013615 shocked π€―
0.543222 astonished π² π²
0.54264677 proud π€
0.53669024 stunned π§
0.5175118 awesome β¨ βοΈ π π
0.51358485 relieved π₯ π π
Not bad! I've never heard of disappointed being a synonym for ecstatic, but everything else makes sense.
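To round things off, the whole lookup can be wrapped into one small function, which makes it easy to call from something like the Hubot-facing server mentioned at the start. This is just my own sketch; the name find_emoji and the top_n parameter aren't from the original code:
def find_emoji(query, top_n=10):
    """Return (similarity, word, emoji list) tuples for the closest emoji words."""
    query_vec = lookup(query)              # raises ValueError if the query was filtered out
    dots = np.dot(emoji_norms, query_vec)  # similarity of the query to every emoji word
    top = np.argpartition(dots, -top_n)[-top_n:]
    top = top[np.argsort(dots[top])][::-1]  # best matches first
    return [(dots[i], emoji_words[i], mapping[emoji_words[i]]) for i in top]

for score, word, emoji in find_emoji('electricity'):
    print(round(float(score), 3), word, ' '.join(emoji))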
In part 2, I'll move to JavaScript so we can get a nicer emoji search in the browser.