In today's post I will be walking you through a Python Library called LangDetect that will do the job of Text Language Detection for you. The Library we will be using is called LangDetect It is a Language-detection library ported from Google's language-detection of Nakatani Shuyo's language-detection library. Below are some coverage about this library/method. Image credits to [Reddit](https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fm92lwt47u0k91.png%3Fwidth%3D1440%26format%3Dpng%26auto%3Dwebp%26s%3Da7f7fba6603239cde5cf6f491a92b09d67863124). Someone on Reddit posted this benchmarking in terms of the accuracy of detecting the language in question. From the looks of it, LangDetect appears to be in a very good positioning with it being stacked at 4th position accuracy-wise in terms of the median. **Side note:** No library can come close to promising 100% accuracy, but based on my experience with this library it was very close to predicting certain languages with a higher accuracy. Without further ado, I am sharing the script below which you can use in Google Colab ## Step 1: Installations !pip install pandas langdetect ## Step 2: Imports & Uploading Your CSV import pandas as pd from langdetect import detect from google.colab import files # Upload the CSV file uploaded = files.upload() # Get the filename of the uploaded file input_filename = list(uploaded.keys())[0] # Read the CSV file df = pd.read_csv(input_filename) print(df.head()) ## Step 3: Function to Detect & Classify Languages def detect_language(text): try: return detect(text) except: return 'unknown' ## Step 4: Dataframe Preview & Generating Output CSV # Apply language detection to the 'title' column df['detected_language'] = df['title'].apply(detect_language) # Display the first few rows of the result print(df.head()) # Save the result to a new CSV file output_filename = 'output_with_language.csv' df.to_csv(output_filename, index=False) # Download the output file files.download(output_filename) ## **Probable Use Cases** - You have an English domain with foreign language titles, you can classify non-English & plan their pruning - Ensuring that the correct language content is sitting on the international site. **Fun fact:** I was able to classify about 100k titles within 10-20 minutes. In my case, I had uploaded a CSV with the mapping of URL to Title column which is why you will find code function in this manner, but you can rework the script for your specific use case.