
Yes, I’m very pleased with myself for coding this chart using Elm.

If you’re interested in Search Engine Optimisation (SEO) you will probably come across the term “TF-IDF” sooner or later. You will probably also come across a lot of very good articles explaining what it is. This post is not so much about what it is, but what it does, although I will explain what it is just for those who haven’t yet read those very good articles.

TF-IDF assesses how relevant a document is to a particular term by balancing the frequency of the term in the document against how often it appears elsewhere. It is a key measure in matching a set of keywords to relevant documents. Even if a search engine doesn’t use TF-IDF itself, it will use something like it. So having access to a TF-IDF tool (such as this one) gives you an insight into the terms which a search engine thinks best match a given search query.

Although these tools will do the hard work for you, it is useful to understand what they are doing.

## Some Mathematical Terms

If you have some background in mathematics you can skip this section.

Before we go on, I want to make sure that those who are less familiar with maths understand the mathematical terms I’m going to use. They’re very well known, but in the spirit of being as friendly as possible I’ll explain them anyway. Consider the following expression:

\log(\frac{a}{b})

log means “logarithm”: it is the power to which one number (the “base”) has to be raised in order to get to another number. For instance, the log to the base 10 of 100 is 2 because:

10^2 = 10 \times 10 = 100 = 10^{\log_{10}(100)}

and the log to the base 2 of 8 is 3 because

2^3 = 2 \times 2 \times 2 = 8 = 2^{\log_{2}(8)}
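If you would rather check those two examples than take my word for it, here is a minimal sketch in Python. The variable names are mine, chosen only for illustration.

```python
import math

# log to the base 10 of 100: the power 10 must be raised to in order to give 100
log10_of_100 = math.log10(100)

# log to the base 2 of 8: the power 2 must be raised to in order to give 8
log2_of_8 = math.log2(8)

print(log10_of_100)  # 2.0
print(log2_of_8)     # 3.0
```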

a/b is a fraction. Here, a is referred to as the numerator, because it tells us how many of the fraction we are dealing with, and b is the denominator because it “names” (“nominates”) the size of the fractions being counted.

From the point of view of this article it is important to realise that as a gets bigger the value of a/b gets bigger too, and as b gets bigger the value of a/b gets smaller.

Right, now I’ve got that out the way, let’s return to understanding TF-IDF.

## What is TF-IDF?

TF-IDF is an acronym for the phrase “Term Frequency, Inverse Document Frequency”. It refers to a family of mathematical functions that are used to work out how specific a particular term is to a given document. It is the product of two different functions, called tf and idf respectively (see below). tf stands for “term frequency”, and idf stands for “inverse document frequency”:

tfidf = tf \times idf

### tf

The purpose of tf is to tell us how important a word is to a particular document. tf can be calculated in lots of different ways, but for the purposes of this discussion we’ll use the following:

tf(t,d) = \frac{t}{d}

where t is a count of a term in a document and d is the length of the document.
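As a rough sketch, this version of tf could be written in Python as follows. The function name, the toy sentence and the whitespace tokenisation are all assumptions of mine for illustration; a real tool would tokenise far more carefully.

```python
def tf(term_count: int, doc_length: int) -> float:
    """Term frequency: occurrences of the term divided by the document length."""
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return term_count / doc_length


# Hypothetical eight-word "document", split naively on whitespace
words = "the whale and the sea and the ship".split()

tf_the = tf(words.count("the"), len(words))      # 3 occurrences out of 8 words
tf_whale = tf(words.count("whale"), len(words))  # 1 occurrence out of 8 words

print(tf_the)    # 0.375
print(tf_whale)  # 0.125
```

Even in this tiny example the function word “the” scores higher than “whale”.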

If a word appears frequently in a document we presume that it is important, and if it does not we presume that it is unimportant. However, there is a problem, because although words which are important to the subject of a document will appear relatively frequently in that document, there are other words which appear frequently in any document.

Take these two documents: Moby Dick, and Bushido: The Soul of Japan. I am sure you would agree that they cover two very different topics (full disclosure: I haven’t read them). Here are the most frequent terms in each, as measured using the tf function above:

It should be clear that you cannot differentiate Moby Dick from Bushido: The Soul of Japan using tf alone, and indeed the same would be true if you were to compare any two documents. This is because languages are filled with commonly used function words: words that contribute to the syntax but not the meaning of a sentence.

We clearly need a way of working out which words differentiate one subject from another; those words which are specific to particular subjects.

### idf

A popular solution to this problem was first proposed in 1972 by Karen Spärck Jones, a computer scientist at the University of Cambridge. Spärck Jones proposed that the more specific a word is, the fewer documents it will appear in, leading to the following function:

idf(N,nd) = \log (\frac{N}{nd})

where N is the total number of documents in a corpus (i.e. collection) of documents, and nd is the number of documents in that corpus where a term appears.

This function works in a similar way to the tf function. As the numerator, N, gets larger so N/nd gets larger, but as the denominator nd gets larger so N/nd gets smaller. Intuitively, as a word becomes more commonly used across a corpus so N/nd gets smaller thereby indicating that it has less specificity and vice-versa.
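Here is a minimal sketch of idf in Python. The formula above does not say which base the log uses, so I have assumed the natural log; the function name and the corpus sizes are hypothetical.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: log(N / nd), using the natural log (an assumption)."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("docs_with_term must be between 1 and total_docs")
    return math.log(total_docs / docs_with_term)


# Hypothetical corpus of 1,000 documents
rare_term = idf(1000, 10)         # appears in 10 documents
common_term = idf(1000, 500)      # appears in half the corpus
everywhere_term = idf(1000, 1000) # appears in every document

print(rare_term, common_term, everywhere_term)
```

The rarer the term, the larger the value, and a term that appears in every document scores zero because log(1) is 0.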

The log of N/nd is taken for two reasons: one mathematical (which we won’t discuss) and the other intuitive. The intuitive reason is that the relationship between specificity and N/nd is not necessarily linear. If a word appears in a handful of documents then it is clearly more specific than a word that appears in half of all documents. But it is not obvious that a word that appears in half the documents is much more specific than a word that appears in all of them.
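To get a feel for what the log does to N/nd, the sketch below prints the raw ratio next to its log for the three cases just described. The corpus size and the nd values are made up, and I have again assumed the natural log.

```python
import math

N = 1000  # hypothetical corpus size

# nd values: a handful of documents, half the corpus, the whole corpus
for nd in (5, 500, 1000):
    ratio = N / nd
    print(f"nd={nd:>4}  N/nd={ratio:>6.1f}  log(N/nd)={math.log(ratio):.3f}")

# How far apart the "handful" and "half" cases are, before and after the log
ratio_spread = (N / 5) / (N / 500)
log_spread = math.log(N / 5) / math.log(N / 500)
```

The raw ratios for the handful and half-corpus cases differ by a factor of 100, whereas their logs differ by a much smaller factor, so the log compresses the scale without changing the ordering.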

### tfidf

The TF-IDF function combines these two functions by multiplying them together.

tfidf = tf \times idf = \frac{t}{d}\times \log(\frac{N}{nd})

In plain English, tfidf is the frequency of a word in a document, multiplied by its specificity. Function words, for instance, appear often in a document thereby giving a high tf value, but also appear in most other documents in the corpus, giving a low idf value. When you multiply the two together you get a low TF-IDF value. Similarly, if a word appears frequently in a document but infrequently in the corpus you get a high tf value, and a high idf value giving a high TF-IDF score.
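The comparison between a function word and a subject-specific word can be sketched like this. The counts are invented purely for illustration, the parameter names follow the symbols used above, and the natural log is an assumption.

```python
import math


def tfidf(t: int, d: int, N: int, nd: int) -> float:
    """tf * idf, with tf = t / d and idf = log(N / nd) (natural log assumed)."""
    return (t / d) * math.log(N / nd)


# Hypothetical function word: very frequent in the document, present in almost every document
function_word = tfidf(t=60, d=1000, N=10_000, nd=9_900)

# Hypothetical subject-specific word: less frequent in the document, but rare in the corpus
specific_word = tfidf(t=20, d=1000, N=10_000, nd=50)

print(function_word, specific_word)
```

Although the function word appears three times as often in the document, its score ends up far lower than that of the specific word because its idf is close to zero.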

It is worth mentioning that there are several different implementations of TF-IDF, so don’t be surprised if the results of a tool you use don’t match the function above exactly.

## Using the Chart

Let’s see how these two functions work. In the above chart you can set which parameter is on the x-axis by ticking the checkbox next to it, and then see how the TF-IDF curve changes as you move the other parameters around.

Note that the default settings for these numbers are entirely fictional and quite unrealistic. They’re just to give you a feel of how TF-IDF works. Per Moby Dick, you are unlikely to find a word that appears more than about 10% of the time in any given document.

When t is on the x-axis, TF-IDF is linear. Since t is the numerator in tf, there is a direct relationship between the number of times the term occurs and TF-IDF. As a term becomes more frequent in a document it is presumed to be more important to that document.

When d is on the x-axis, the TF-IDF curve is non-linear. Since d is the denominator in tf, each increase in d produces a smaller change in tf (the difference between 1/2 and 1/3 is greater than the difference between 1/99 and 1/100). Like the explanation of using the log function in idf above, a term which appears 10% of the time in a document is likely to be much more important than one that appears 5% of the time, but one that appears 5% of the time is not likely to be much more important than one which appears 2% of the time.

When nd is on the x-axis the TF-IDF curve is non-linear and reduces as nd increases. As a term appears more frequently in the corpus it is less specific and so the TF-IDF reduces.

When N is on the x-axis the TF-IDF curve is non-linear but increases with N. As the corpus gets larger the specificity of the term increases, because the documents it appears in make up a smaller share of the corpus, and idf increases correspondingly.
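If you can’t play with the chart, the same four behaviours can be reproduced numerically. This is only a sketch: the baseline values are as fictional as the chart defaults, the `sweep` helper is my own invention, and the natural log is assumed.

```python
import math


def tfidf(t: float, d: float, N: float, nd: float) -> float:
    """tf * idf, with tf = t / d and idf = log(N / nd) (natural log assumed)."""
    return (t / d) * math.log(N / nd)


# Fictional baseline values, held fixed while one parameter is varied
base = dict(t=10, d=1000, N=10_000, nd=100)


def sweep(param: str, values: list) -> list:
    """Vary one parameter across `values`, keeping the others at their baseline."""
    return [tfidf(**{**base, param: v}) for v in values]


t_curve = sweep("t", [10, 20, 30, 40])                     # rises in equal steps
d_curve = sweep("d", [1000, 2000, 3000, 4000])             # falls, by shrinking steps
nd_curve = sweep("nd", [100, 200, 400, 800])               # falls as the term spreads
N_curve = sweep("N", [10_000, 20_000, 40_000, 80_000])     # rises as the corpus grows

for name, curve in [("t", t_curve), ("d", d_curve), ("nd", nd_curve), ("N", N_curve)]:
    print(name, [round(v, 4) for v in curve])
```

Only the t sweep moves in equal steps; the other three bend, which is what the chart shows as curved lines.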