Summary
- Google has developed a text vectorizer called RETVec that improves spam detection in Gmail by 38% and reduces false positives and false negatives.
- RETVec is Gmail's largest defense upgrade in years and works across all languages and characters, making it suitable for large-scale text classification.
- RETVec can be deployed on mobile, edge devices, and the web, and it is open-source, with the code available on GitHub.
Google is constantly looking for ways to reduce the spam its customers receive in their Gmail inboxes. A couple of months ago, it required bulk senders to authenticate their email addresses and include an "Unsubscribe" button in their bulk emails. Now, it has outlined more technical ways to fight spam in Gmail.
As spotted by Ars Technica, Google recently detailed its efforts to combat spam in Gmail through better text classification methods. Malicious actors currently employ numerous techniques, including keyword stuffing and invisible characters, to bypass spam detection defenses such as text classifiers based on machine learning algorithms. To combat this problem, Google has developed a text vectorizer called RETVec, which works across multiple languages.
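To make the evasion problem concrete, here is a minimal Python sketch of how invisible characters and look-alike letters can slip past a naive keyword filter. It is only an illustration: the blocklist, the `naive_is_spam` function, and the sample messages are hypothetical and have nothing to do with Gmail's actual classifiers.

```python
# Hypothetical keyword blocklist for a toy spam filter.
BLOCKLIST = {"free", "prize"}


def naive_is_spam(text: str) -> bool:
    """Flag text if any whitespace-separated token exactly matches the blocklist."""
    tokens = text.lower().split()
    return any(token.strip(".,!?") in BLOCKLIST for token in tokens)


ZERO_WIDTH_SPACE = "\u200b"  # invisible when rendered
CYRILLIC_E = "\u0435"        # renders like the Latin letter "e"

plain = "Claim your free prize!"
# Same message to a human reader, but with invisible characters inside the keywords.
invisible = "Claim your fr" + ZERO_WIDTH_SPACE + "ee pri" + ZERO_WIDTH_SPACE + "ze!"
# Same message to a human reader, but with a Latin "e" swapped for a Cyrillic one.
homoglyph = "Claim your fr" + CYRILLIC_E + "e priz" + CYRILLIC_E + "!"

results = {
    "plain": naive_is_spam(plain),
    "invisible": naive_is_spam(invisible),
    "homoglyph": naive_is_spam(homoglyph),
}

for name, flagged in results.items():
    print(f"{name}: flagged={flagged}")
```

Only the unmodified message is flagged, even though all three look the same on screen. This is the kind of character-level manipulation a resilient vectorizer has to withstand.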
RETVec stands for "Resilient and Efficient Text Vectorizer" and it does exactly what it says on the tin, according to Google. The company says that its novel approach to text vectorization ensures state-of-the-art performance while reducing computation cost. In its internal testing spanning over a year, RETVec managed an improvement of 38% over the baseline in Gmail spam detection, along with reductions of 19.49% and 17.71% when it comes to false positives and false negatives, respectively.
Similarly, compared to the baseline, latency was reduced by 30%, whereas the reductions in the number of Tensor Processing Units (TPUs) and their memory utilization were a massive 83.13% and 62.50%, respectively. That said, the number of CPU cores did go up by 20%. Google attributes the performance improvements to a lighter word embedding model with 200,000 parameters, a smaller Transformer, mechanisms to efficiently switch computation between the host system and the TPU, a compact encoder, augmentation-driven training, and the use of metric learning. Collectively, these enhancements have made RETVec Gmail's largest defense upgrade in years, with Google deploying it across its email application for end-users too.
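To put those percentages in perspective, the short Python sketch below converts a percentage reduction into "how many times smaller". The helper name `shrink_factor` is hypothetical; the only inputs are the reduction figures quoted above.

```python
def shrink_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into a 'times smaller' factor.

    A reduction of r% leaves (100 - r)% of the baseline, so the baseline is
    100 / (100 - r) times larger than the new value.
    """
    remaining = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining


# Reduction figures quoted in the article, relative to the baseline.
reported_reductions = {
    "latency": 30.0,
    "TPU count": 83.13,
    "TPU memory": 62.50,
}

for metric, pct in reported_reductions.items():
    print(f"{metric}: {pct}% lower, i.e. about {shrink_factor(pct):.2f}x smaller")
```

Read this way, the reported TPU reduction corresponds to using roughly one-sixth of the baseline's TPUs, while the 20% increase in CPU cores corresponds to 1.2 times the baseline.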
Google has highlighted that RETVec works across all languages and characters with UTF-8 encoding. It does not require any text pre-processing either, which means that you can leverage it as-is. The tech firm has boasted that these capabilities make the vectorizer a strong candidate for deployment across environments which require large-scale text classification on the web or a device itself. The smaller Transformer model ensures reduced latency and computation cost, which are very important factors when deploying text classifiers at scale.
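One way to see why a vectorizer can skip pre-processing and language-specific vocabularies is to encode each character directly from its Unicode code point. The Python sketch below illustrates that general idea only; it is not RETVec's actual implementation, and the function names and the 24-bit width are assumptions chosen for the example.

```python
from typing import List


def encode_chars(text: str, bits: int = 24) -> List[List[int]]:
    """Encode each character as a fixed-width bit vector of its Unicode code point.

    Bits are listed least-significant first. 24 bits is an assumed width that
    comfortably covers the largest Unicode code point (0x10FFFF needs 21 bits).
    """
    return [[(ord(ch) >> i) & 1 for i in range(bits)] for ch in text]


def decode_chars(vectors: List[List[int]]) -> str:
    """Invert encode_chars, showing that the encoding loses no information."""
    return "".join(
        chr(sum(bit << i for i, bit in enumerate(vector))) for vector in vectors
    )


# Mixed scripts and an emoji, with no lowercasing, stemming, or vocabulary lookup.
sample = "Spam: スパム, спам, 🚫"
vectors = encode_chars(sample)

print(len(vectors), "characters encoded as", len(vectors[0]), "bits each")
print(decode_chars(vectors) == sample)
```

Because every character maps to a vector of the same width regardless of script, nothing is ever out of vocabulary. In a real system, a learned model would then turn these raw bit vectors into embeddings.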
That's not all though. Machine learning models that are trained on RETVec can be converted to TFLite through a native implementation in the TensorFlow Text collection of libraries. This means that you can also deploy them on mobile and edge devices, which typically have limited access to computational and network infrastructure. In the same vein, if you want to deploy a RETVec-based model on the web, you can use the TensorFlow.js implementation and check out the RETVecJS demo as well.
Finally, it is important to note that RETVec is open-source, with the code hosted on GitHub along with the installation method and a detailed tutorial available as a Jupyter Notebook file. RETVec should already be resulting in less spam in your Gmail inbox, since it's a back-end enhancement that doesn't require any human intervention.
