TF-IDF: The Gateway to Text Analysis with Machine Learning
Let’s start by thinking about the basics of what we want to achieve. Consider a classification problem where we want features that can effectively differentiate between different documents. The challenge is that if we simply count the frequency of words in each document and use those counts as features, we may end up with features that do not discriminate well.
We want the importance of words that reveal the differences between documents to be weighted more heavily, rather than treating every word equally. For example, common words that appear in nearly all documents (e.g., “is,” “the,” “and”) may not be very informative. To address this, we need a way to assign different weights to words based on their importance in distinguishing documents.
This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes into play. Rather than simply counting how often each word appears in a document, TF-IDF also considers how often the word appears across all documents. This way, it assigns higher weights to words that are rare across the collection and lower weights to words that are common everywhere.
TF-IDF is calculated using two components:
Term Frequency (TF): The frequency of a word in a particular document.
Inverse Document Frequency (IDF): The logarithm of the total number of documents divided by the number of documents that contain the word.
The more documents a word appears in, the lower its IDF value becomes (approaching zero for a word that appears in nearly every document), which reduces the weight of such common words. Multiplying the two components together gives the TF-IDF score of a word in a given document.
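Written out explicitly, for a word w in document d, in a collection of N documents:

TF(w, d) = (number of times w appears in d) / (total number of words in d)
IDF(w) = ln(N / number of documents that contain w)
TF-IDF(w, d) = TF(w, d) × IDF(w)

The natural logarithm is used here to match the calculations below; some implementations use a different log base or add smoothing terms, but the plain form is enough to follow this example.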
For example, suppose we have 100 documents in total and pick 2 of them as samples: Document 1 contains 70 words, and Document 2 contains 80 words. Now let’s look at 3 words that appear in these documents. The table below lists how many times each word appears in each document and how many documents contain it; from these counts we can calculate the TF-IDF values:
| Word | Number of Times in Document 1 | Number of Times in Document 2 | Number of Documents Containing the Word |
| --- | --- | --- | --- |
| High | 15 | 5 | 80 |
| Screw | 4 | 1 | 10 |
| Blade | 1 | 5 | 12 |
The TF value of each word in Document 1 can be calculated as follows.
TF(High) = 15/70 = 0.214
TF(Screw) = 4/70 = 0.057
TF(Blade) = 1/70 = 0.014
Similarly, the TF value of each word in Document 2 is as follows.
TF(High) = 5/80 = 0.063
TF(Screw) = 1/80 = 0.013
TF(Blade) = 5/80 = 0.063
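If it helps to see this as code, here is a minimal Python sketch of the TF step. The term_frequency helper is just an illustrative name, and the counts are the ones from the table above; note that Python's float formatting may round the last digit slightly differently from the hand-rounded values shown here.

```python
# Term Frequency: the count of a word in a document divided by
# the total number of words in that document.
def term_frequency(word_count: int, total_words: int) -> float:
    return word_count / total_words

# Word counts taken from the example table above.
doc1_counts = {"High": 15, "Screw": 4, "Blade": 1}  # Document 1 has 70 words in total
doc2_counts = {"High": 5, "Screw": 1, "Blade": 5}   # Document 2 has 80 words in total

for word, count in doc1_counts.items():
    print(f"TF({word}) in Document 1 = {term_frequency(count, 70):.3f}")
for word, count in doc2_counts.items():
    print(f"TF({word}) in Document 2 = {term_frequency(count, 80):.3f}")
```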
The IDF of a word, on the other hand, is the same regardless of which document we look at, since it depends only on the collection as a whole.
IDF(High) = ln(100/80) = 0.223
IDF(Screw) = ln(100/10) = 2.303
IDF(Blade) = ln(100/12) = 2.120
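The IDF step can be sketched the same way (again, inverse_document_frequency is just an illustrative name; math.log is the natural logarithm, matching the ln used above):

```python
import math

# Inverse Document Frequency: the natural log of the total number of
# documents divided by the number of documents that contain the word.
def inverse_document_frequency(total_docs: int, docs_with_word: int) -> float:
    return math.log(total_docs / docs_with_word)

# Document frequencies from the example: 100 documents in total.
doc_frequencies = {"High": 80, "Screw": 10, "Blade": 12}

for word, df in doc_frequencies.items():
    print(f"IDF({word}) = {inverse_document_frequency(100, df):.3f}")
```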
Multiplying TF by IDF, the TF-IDF value of each word in each document comes out as follows:
|  | High | Screw | Blade |
| --- | --- | --- | --- |
| TF-IDF in Document 1 | 0.048 | 0.132 | 0.030 |
| TF-IDF in Document 2 | 0.014 | 0.029 | 0.133 |
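As a check, here is a short sketch that combines both steps and reproduces the table above from the raw counts (the dictionary layout is just one way to organize the example data):

```python
import math

# Word counts per document (with each document's total word count)
# and document frequencies, all taken from the example above.
counts = {
    "Document 1": ({"High": 15, "Screw": 4, "Blade": 1}, 70),
    "Document 2": ({"High": 5, "Screw": 1, "Blade": 5}, 80),
}
doc_frequencies = {"High": 80, "Screw": 10, "Blade": 12}
total_docs = 100

for doc_name, (word_counts, total_words) in counts.items():
    for word, count in word_counts.items():
        tf = count / total_words                            # term frequency
        idf = math.log(total_docs / doc_frequencies[word])  # inverse document frequency
        print(f"TF-IDF({word}) in {doc_name} = {tf * idf:.3f}")
```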
You can see that even though the word “High” appears in both documents, the difference between its TF-IDF values in the two documents is small (0.034). This is because “High” is relatively common, appearing in 80 out of 100 documents, which makes it less useful as a feature for distinguishing between documents.

By contrast, the words “Screw” and “Blade” appear in only a few documents, and the difference between their TF-IDF values across the two documents is much larger (around 0.1). This indicates that “Screw” and “Blade” are potentially more meaningful features for differentiating between documents.

With this understanding, we have obtained meaningful features that are ready to be used in building a Machine Learning model. Next time, let’s explore the tools available for putting this into practice.