
TF-IDF: The Gateway to Text Analysis with Machine Learning

When we use Machine Learning to build models such as clustering, regression, or classification, we are usually working with structured data, where the features are clearly defined. But what if the data we need to work with is unstructured text? How can we feed such data into our models?

We may be used to feature extraction in the usual sense: counting the number of purchases, or aggregating the quantities and values of products bought in each transaction, figures that carry meaning on their own. But when it comes to text data, can we count or aggregate in the same way?

The answer is both yes and no.
Yes: counting words is the starting point, and word counts can feed directly into machine learning.
No: raw word counts alone are not enough.

Let’s start by thinking about the basics of what we want to achieve. Consider a classification problem where we want features that can effectively differentiate between different documents. The challenge is that if we simply count the frequency of words in each document and use those counts as features, we may end up with features that do not discriminate well.

We want the importance of words that reveal the differences between documents to be weighted more heavily, rather than treating every word equally. For example, common words that appear in nearly all documents (e.g., “is,” “the,” “and”) may not be very informative. To address this, we need a way to assign different weights to words based on their importance in distinguishing documents.

This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes into play. TF-IDF bridges this gap: in addition to counting how often a word occurs within a document, it also considers how many documents the word appears in across the whole collection. It then assigns higher weights to words that are rare across documents and lower weights to common ones.

TF-IDF is calculated from two components:

Term Frequency (TF): the number of times a word occurs in a particular document, divided by the total number of words in that document.

Inverse Document Frequency (IDF): the logarithm of the total number of documents divided by the number of documents that contain the word.

The more documents a word appears in, the lower its IDF value, approaching zero for a word that appears in every document. This means that the weight of such words is reduced. Multiplying the two components together gives the TF-IDF score of a word in a document.
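These definitions translate directly into code. Here is a minimal sketch using the same natural-log IDF as the worked example below; note that many libraries use slightly different variants (smoothing terms, other log bases), so this is one common formulation rather than the only one:

```python
import math

def term_frequency(count_in_doc, total_words_in_doc):
    # TF: occurrences of the word divided by the document's total word count.
    return count_in_doc / total_words_in_doc

def inverse_document_frequency(num_docs, docs_containing_word):
    # IDF: natural log of total documents over documents containing the word.
    return math.log(num_docs / docs_containing_word)

def tf_idf(count_in_doc, total_words_in_doc, num_docs, docs_containing_word):
    # TF-IDF is simply the product of the two components.
    return (term_frequency(count_in_doc, total_words_in_doc)
            * inverse_document_frequency(num_docs, docs_containing_word))
```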

For example, suppose we have 100 documents in total and select two of them as samples: Document 1 contains 70 words and Document 2 contains 80 words. Consider three words that appear in these documents. The table below shows how many times each word appears in each document and how many documents contain it; from these counts we can calculate the TF-IDF values:

Word  | Number of Times in Document 1 | Number of Times in Document 2 | Number of Documents Where the Word Is Present
High  | 15                            | 5                             | 80
Screw | 4                             | 1                             | 10
Blade | 1                             | 5                             | 12

The TF value of each word in Document 1 can be calculated as follows.

TF(High) = 15/70 = 0.214

TF(Screw) = 4/70 = 0.057

TF(Blade) = 1/70 = 0.014

The TF values of the same words in Document 2 are as follows.

TF(High) = 5/80 = 0.063

TF(Screw) = 1/80 = 0.013

TF(Blade) = 5/80 = 0.063
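These values are easy to reproduce. A quick check in Python, using the counts from the table above (printed to four decimal places; the values in this post are rounded to three):

```python
counts = {
    "Document 1": {"High": 15, "Screw": 4, "Blade": 1},
    "Document 2": {"High": 5, "Screw": 1, "Blade": 5},
}
doc_lengths = {"Document 1": 70, "Document 2": 80}

for doc, words in counts.items():
    for word, count in words.items():
        # TF is the word's count divided by the document's total word count.
        print(f"TF({word}) in {doc} = {count / doc_lengths[doc]:.4f}")
```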

The IDF of a word is the same no matter which document we look at, because it depends only on the collection as a whole.

IDF(High) = ln(100/80) = 0.223

IDF(Screw) = ln(100/10) = 2.303

IDF(Blade) = ln(100/12) = 2.120
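Again, this is straightforward to check:

```python
import math

num_docs = 100
doc_frequency = {"High": 80, "Screw": 10, "Blade": 12}

for word, df in doc_frequency.items():
    # IDF depends only on the collection, not on any single document.
    print(f"IDF({word}) = ln({num_docs}/{df}) = {math.log(num_docs / df):.3f}")
```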

Multiplying TF by IDF gives the following TF-IDF values for each word in each document.

                     | High  | Screw | Blade
TF-IDF in Document 1 | 0.048 | 0.132 | 0.030
TF-IDF in Document 2 | 0.014 | 0.029 | 0.133
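Putting the two components together reproduces the table above. Note that the table multiplies the unrounded TF and IDF values, which is why, for example, “Screw” in Document 1 comes out as 0.132 rather than 0.057 × 2.303 ≈ 0.131:

```python
import math

counts = {
    "Document 1": {"High": 15, "Screw": 4, "Blade": 1},
    "Document 2": {"High": 5, "Screw": 1, "Blade": 5},
}
doc_lengths = {"Document 1": 70, "Document 2": 80}
num_docs = 100
doc_frequency = {"High": 80, "Screw": 10, "Blade": 12}

for doc in counts:
    for word in ("High", "Screw", "Blade"):
        tf = counts[doc][word] / doc_lengths[doc]
        idf = math.log(num_docs / doc_frequency[word])
        print(f"TF-IDF({word}) in {doc} = {tf * idf:.3f}")
```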

You can see that even though the word “High” appears in both documents, the difference between its TF-IDF values in the two documents is small (0.034). This is because “High” is relatively common, appearing in 80 of the 100 documents, which makes it less useful as a feature for distinguishing between documents.

On the other hand, the words “Screw” and “Blade” appear in only a few documents, and the difference between their TF-IDF values across the two documents is much larger (around 0.1). This indicates that “Screw” and “Blade” are potentially more meaningful features for differentiating between documents.

With this understanding, we now have meaningful features ready to be used in building a Machine Learning model. In a future post, let’s explore the tools available for putting this into practice.
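As a brief preview of that tooling, scikit-learn’s TfidfVectorizer builds a TF-IDF matrix in a few lines. The two sentences below are invented for illustration, and scikit-learn’s defaults differ slightly from the formula used in this post (it smooths the IDF and L2-normalizes each document vector), so its numbers will not exactly match a hand calculation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the screw holds the blade in place",       # hypothetical Document 1
    "raise the blade high and lock the screw",  # hypothetical Document 2
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse documents-by-terms matrix

print(vectorizer.get_feature_names_out())      # the vocabulary learned from the corpus
print(tfidf_matrix.toarray().round(3))         # TF-IDF weight of each word per document
```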
