Classify SMS Replies with Python & AI

Introduction

Hi, a few days ago I was talking to a friend (a non-developer) who had built some kind of monster in Excel to understand how his potential customers responded to a marketing campaign.

The message they were sending was something like this:

🧣 Mojalab Special! Get 50% off our cozy scarves! Reply “Yes” if you’re interested, “Unsubscribe” if you don’t want future messages. Otherwise, maybe next time! 😊

Of course, the thing about humans (and others) is that they do what they want, so the responses that came in were a fairly imaginative mix.

The Excel monster my friend created searched the replies for a few useful words to determine which kind of lead had answered:

✅ Lead interested
❌ Lead not interested in this campaign
🚫 Lead who no longer wants to be contacted (Opt-Out)

Looking at the data, and appreciating the imagination some leads put into their insults when they clearly had not liked the message, I thought that solving this problem might offer some interesting insights into topics related to "Artificial Intelligence".

The idea is to write a simple program in Python that takes the lead's reply to the message as input and produces as output:

A potential response to the lead.
The response classification (Interested, Not Interested, Unsubscribe).
The language used to process the message (for now we support English and Italian).
A score expressing how confident we are in our guess at the user’s intent.

To classify the response we will go through 4 different classifiers in sequence, each more reliable (and more expensive in terms of resources or money) than the last, with parameterizable confidence thresholds. As soon as one classifier's confidence score exceeds its threshold, processing ends and the response is returned. If the threshold is not exceeded, the message moves on to the next classifier.

The classifiers are as follows:

classify_by_keyword: Classifies the intent based on a few keywords; it is similar to the classifier my friend built in Excel.
classify_by_sentiment: Classifies the intent by analyzing the sentiment expressed in the response message.
classify_by_embedding: Uses embeddings (if you don't know what they are, we'll talk about them in depth in a moment) to classify the intent by checking how close the response is to examples expressing interest, lack of interest, or a request to unsubscribe.
classify_with_llm: The last classifier uses an LLM (in our example OpenAI GPT, which is why the function is called classify_with_gpt in the code) to understand what the lead wants to tell us.

Not all messages are easily read. Some are direct ("Yes, call me"), some are ambiguous ("Hmm... I don't know"), and some are written in caps lock with 12 emoji. One method alone is not enough: we need a team of classifiers, each more experienced than the last.
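
The cascade described above can be sketched in a few lines. This is only an illustration under stated assumptions: run_cascade, the two stand-in stage functions and the threshold values are hypothetical and are not taken from the project, where each classifier applies its own threshold internally.

```python
from typing import Callable, List, Optional, Tuple

# Each stage returns (intent, confidence), or None when it has no opinion.
Result = Tuple[str, float]
# A stage is (name, classifier function, confidence threshold).
Stage = Tuple[str, Callable[[str], Optional[Result]], float]


def run_cascade(text: str, stages: List[Stage]) -> Tuple[str, str, float]:
    """Try each stage in order; return the first result that reaches its threshold."""
    for name, classifier, threshold in stages:
        result = classifier(text)
        if result is not None and result[1] >= threshold:
            intent, confidence = result
            return name, intent, confidence
    return "none", "unclear", 0.0


# Hypothetical stand-ins for the real classifiers, cheapest first.
def cheap_keyword_stage(text: str) -> Optional[Result]:
    return ("interested", 0.90) if "yes" in text.lower().split() else None


def expensive_fallback_stage(text: str) -> Optional[Result]:
    return ("unclear", 0.50)


stages: List[Stage] = [
    ("keyword", cheap_keyword_stage, 0.80),
    ("fallback", expensive_fallback_stage, 0.0),
]

print(run_cascade("Yes please", stages))  # answered by the cheap stage
print(run_cascade("hmm...", stages))      # falls through to the last stage
```

The point of the ordering is that the expensive stage only runs for the messages the cheap stages could not settle.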

🔗 Source Code Available: Check out the full project on GitHub

Since we need to process input and output in both Italian and English, we also need a library to identify the language used by the lead. We chose langdetect (https://pypi.org/project/langdetect/), which is a bit dated but gives fairly accurate results.

⚠️ Disclaimer

Before continuing, I would like to point out that this is just a way to practice with some tools; what comes out of it will probably not be the most effective way to categorize messages. Also, since this is a simple exercise, I have not checked whether processing or forwarding our leads' messages in this way violates any privacy regulation.

That said, let us begin:

Classifiers

classify_by_keyword

As we said this classifier simply searches for some keywords in the received response:

import os
import re
from typing import Any, Dict, Optional

# Confidence Scores for Different Methods
KEYWORD_UNSUB_CONFIDENCE = float(os.getenv("KEYWORD_UNSUB_CONFIDENCE", 0.95))
KEYWORD_INTERESTED_CONFIDENCE = float(os.getenv("KEYWORD_INTERESTED_CONFIDENCE", 0.90))
KEYWORD_NOT_INTERESTED_CONFIDENCE = float(os.getenv("KEYWORD_NOT_INTERESTED_CONFIDENCE", 0.90))
# ...
KEYWORD_CATEGORIES = {
    "unsubscribe": { "keywords": { "en": ["stop", "unsubscribe", "don't text", "remove me"], "it": ["stop", "cancellami", "non scrivermi più"] }, "confidence": KEYWORD_UNSUB_CONFIDENCE },
    "interested": { "keywords": { "en": ["yes", "sure", "interested", "call me", "i want", "okay", "yeah", "please call"], "it": ["sì", "certo", "interessato", "chiamami", "voglio info"] }, "confidence": KEYWORD_INTERESTED_CONFIDENCE },
    "not_interested": { "keywords": { "en": ["no", "not interested", "don't want", "maybe later", "no thanks", "nah"], "it": ["no", "non mi interessa", "forse dopo"] }, "confidence": KEYWORD_NOT_INTERESTED_CONFIDENCE }
}
# ...
# --- Classification Logic Functions ---
def classify_by_keyword(text: str, lang: str) -> Optional[Dict[str, Any]]:
    """Classifies intent based on keywords."""
    text_lower = text.strip().lower()
    for intent, category_data in KEYWORD_CATEGORIES.items():
        keywords_for_lang = category_data["keywords"].get(lang, [])
        for keyword in keywords_for_lang:
            if re.search(rf"\b{re.escape(keyword)}\b", text_lower):
                logger.info(f"Keyword match: '{keyword}' -> {intent}")
                return create_response(intent, category_data["confidence"], lang)
    return None

As we can see, this function receives a text (the response received) and the language of the response as input, then walks through KEYWORD_CATEGORIES. If one of the keywords for that language is found in the response, the intent, confidence and language are used to prepare a response. Matching is on whole words (the \b word boundaries in the regular expression), and the first keyword that matches wins.
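
One consequence of "first match wins" is worth seeing in isolation. The sketch below uses a trimmed, hypothetical copy of the English keywords (KEYWORDS, first_match and longest_match are illustration names, not part of the project) to show that "Not interested" hits the keyword "interested" before the not_interested category is even checked. Letting the longest matching keyword win is one possible alternative, shown here as an assumption rather than as the project's approach.

```python
import re
from typing import Dict, List, Optional

# Trimmed, hypothetical copy of the English keywords, in the same category order.
KEYWORDS: Dict[str, List[str]] = {
    "unsubscribe": ["stop", "unsubscribe"],
    "interested": ["yes", "interested"],
    "not_interested": ["no", "not interested"],
}


def first_match(text: str) -> Optional[str]:
    """Same strategy as classify_by_keyword: the first keyword found wins."""
    text_lower = text.strip().lower()
    for intent, keywords in KEYWORDS.items():
        for keyword in keywords:
            if re.search(rf"\b{re.escape(keyword)}\b", text_lower):
                return intent
    return None


def longest_match(text: str) -> Optional[str]:
    """Alternative: collect every hit and let the longest keyword win."""
    text_lower = text.strip().lower()
    hits = [
        (len(keyword), intent)
        for intent, keywords in KEYWORDS.items()
        for keyword in keywords
        if re.search(rf"\b{re.escape(keyword)}\b", text_lower)
    ]
    return max(hits)[1] if hits else None


print(first_match("Not interested, thanks"))    # interested
print(longest_match("Not interested, thanks"))  # not_interested
print(first_match("I have no idea"))            # not_interested
print(first_match("nothing for me"))            # None: \b keeps "no" from matching inside "nothing"
```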

classify_by_sentiment

Before we delve into the code, let's try to understand how sentiment analysis works. Imagine that the computer not only has to understand what a message says, but also how it says it, that is, what is the general emotion or feeling it expresses.

Are we looking at a positive (happy, satisfied), negative (angry, disappointed) or neutral (objective, without strong emotion) message? This is the question that sentiment analysis tries to answer.

Let's imagine the output of our model as a traffic light, where each light corresponds to a feeling:

Green (Positive): The computer has learned (by studying millions of texts already labeled by humans) that certain words or smileys are associated with positive feelings. Words such as "great," "thank you very much," "that's great!", "yay!", "perfect" make the green light come on.
Red (Negative): It has also learned that words such as "terrible," "problem," "not working," "hate," "never again," or unhappy smiley faces indicate negative feelings. These turn on the red light.
Yellow (Neutral): Then there are phrases that do not express strong emotion, but perhaps just give information ("See you tomorrow at 10 a.m.") or are simple questions ("What is this about?"). These often turn on the yellow light.

The workflow will be more or less as follows:

1. Analyzing a New Message: When a new text message arrives (e.g., "Thank you, lousy service!"), the computer analyzes it looking for the positive and negative "clues" it has learned.
- In our example, it finds "Thank you" (positive clue) but also "lousy service" (strongly negative clue).
2. Deciding the Final Color: The computer "weighs" the clues found. If there are many more positive clues than negative ones, the final traffic light will be GREEN (positive). If negative ones prevail, it will be RED (negative). If the clues balance out or are sparse, the result is often YELLOW (neutral). In the example "Thank you, lousy service!", the weight of "lousy" is likely to be so strong that the traffic light becomes RED (negative), despite the initial "Thank you".

In our script to classify SMS responses, we use sentiment analysis in this way:

If the sentiment is very positive, we hypothesize that the user is likely to be "interested."
If it is very negative, we assume that the user may want to unsubscribe (although this is not always true; one might complain but still want the service).
If it is neutral, we consider it closer to "not interested."

Important: Sentiment analysis is not magic and sometimes it gets it wrong! It struggles to understand sarcasm ("Great, another fine..." is sarcastic, but the computer might see it as positive!) or very specific contexts.

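import torch
from scipy.special import softmax  # assumption: the excerpt does not show where softmax is imported from
from transformers import AutoModelForSequenceClassification, AutoTokenizer

SENTIMENT_MODEL_NAME = os.getenv("SENTIMENT_MODEL_NAME", "cardiffnlp/twitter-xlm-roberta-base-sentiment")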
#...
SENTIMENT_POSITIVE_CONFIDENCE = float(os.getenv("SENTIMENT_POSITIVE_CONFIDENCE", 0.75))
SENTIMENT_NEGATIVE_CONFIDENCE = float(os.getenv("SENTIMENT_NEGATIVE_CONFIDENCE", 0.75))
SENTIMENT_NEUTRAL_CONFIDENCE = float(os.getenv("SENTIMENT_NEUTRAL_CONFIDENCE", 0.65))
#...
SENTIMENT_POSITIVE_THRESHOLD = float(os.getenv("SENTIMENT_POSITIVE_THRESHOLD", 0.85))
SENTIMENT_NEGATIVE_THRESHOLD = float(os.getenv("SENTIMENT_NEGATIVE_THRESHOLD", 0.85))
SENTIMENT_NEUTRAL_THRESHOLD = float(os.getenv("SENTIMENT_NEUTRAL_THRESHOLD", 0.8))
#...
# --- Model Loading Function ---
def load_models():
    # ...
    """Loads all ML models into global variables. Called once at startup."""
    global embedding_model, sentiment_tokenizer, sentiment_model, precomputed_embeddings
    logger.info("--- Starting Model Loading ---")
    # Load sentiment model
    logger.info(f"Loading sentiment model: {SENTIMENT_MODEL_NAME}")
    sentiment_tokenizer = AutoTokenizer.from_pretrained(SENTIMENT_MODEL_NAME)
    sentiment_model = AutoModelForSequenceClassification.from_pretrained(SENTIMENT_MODEL_NAME)
    logger.info("Sentiment model loaded successfully.")
    # ...

def classify_by_sentiment(text: str, lang: str) -> Optional[Dict[str, Any]]:
    """Classifies intent based on sentiment analysis."""
    if not sentiment_model or not sentiment_tokenizer:
        logger.error("Sentiment model/tokenizer not loaded. Skipping sentiment classification.")
        return None
    try:
        encoded = sentiment_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            output = sentiment_model(**encoded)
        scores = softmax(output.logits.detach().cpu().numpy()[0])
        sentiment_scores = {'negative': scores[0], 'neutral': scores[1], 'positive': scores[2]}
        logger.info(f"Sentiment scores: Neg={sentiment_scores['negative']:.2f}, Neu={sentiment_scores['neutral']:.2f}, Pos={sentiment_scores['positive']:.2f}")

        if sentiment_scores['positive'] > SENTIMENT_POSITIVE_THRESHOLD:
            logger.info("Sentiment -> interested")
            return create_response("interested", SENTIMENT_POSITIVE_CONFIDENCE, lang)
        elif sentiment_scores['negative'] > SENTIMENT_NEGATIVE_THRESHOLD:
            logger.info("Sentiment -> unsubscribe")
            return create_response("unsubscribe", SENTIMENT_NEGATIVE_CONFIDENCE, lang)
        elif sentiment_scores['neutral'] > SENTIMENT_NEUTRAL_THRESHOLD:
            logger.info("Sentiment -> not_interested")
            return create_response("not_interested", SENTIMENT_NEUTRAL_CONFIDENCE, lang)
        else:
            logger.info("Sentiment -> unclear")
            return None
    except Exception as e:
        logger.error(f"Error during sentiment analysis: {e}", exc_info=True)
        return None

The code is quite simple; practically all the work is done by the sentiment model and its tokenizer. In our case we used cardiffnlp/twitter-xlm-roberta-base-sentiment (https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment).

This model has been trained on nearly 200M tweets in multiple languages and is optimized for sentiment analysis.

First the message is tokenized (converted from human language into a form the model can understand). The model takes tokens as input, which may or may not correspond to individual words. The model we are using accepts at most 512 tokens, so a message that is too long is truncated.

During tokenization, special tokens are also added to give additional information, such as where a sentence begins and ends.

The encoded variable is then passed to the chosen sentiment_model, which returns raw scores (logits) that are not yet ready to be used. To make them usable, the softmax function comes into play.

The softmax function (https://en.wikipedia.org/wiki/Softmax_function) converts the raw output of the model into probabilities, which makes it comparable with the scores of the other classifiers. For every response from our lead we get something like this:

Input: "Thank you but I don't need it now."
Output sentiment scores: Pos=0.15, Neu=0.67, Neg=0.18
Result: not_interested

Note that the three scores add up to 100%. If one of them passes its threshold (specified in the configuration), we call the function that takes the intent, confidence and language and prepares a response. In this example the result is not_interested only if SENTIMENT_NEUTRAL_THRESHOLD is set below 0.67; with the default of 0.8 shown above, the message would be passed on to the next classifier.
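
To make the softmax step concrete, here is a small pure-Python version. The logits are hypothetical values chosen so that the result is close to the scores in the example above; the project itself calls a library softmax on the model output rather than this hand-written function.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw model scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, in the order used by the code above: negative, neutral, positive.
logits = [-0.3, 1.0, -0.5]
neg, neu, pos = softmax(logits)
print(f"Neg={neg:.2f}, Neu={neu:.2f}, Pos={pos:.2f}")
print(f"Sum={neg + neu + pos:.2f}")
```

The largest logit always gets the largest probability, so softmax changes the scale of the scores, not their ranking.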

classify_by_embedding

Suppose we want to teach a computer to understand not only the exact words of an SMS message, but the meaning behind it. For example, figuring out whether a message such as "okay, fine" means "I'm interested," even if it does not contain the word "interested."

Let's try to explain how the method works with embeddings, step by step:

1. To begin with, we need a very advanced Artificial Intelligence model (our "embedding model"). Let's imagine it as an expert linguist who has read a great many texts in various languages. This model has learned to create a kind of giant invisible "map" where each phrase or word can be placed according to its meaning. The embedder receives pieces of text (chunks) as input and assigns them coordinates (like latitude and longitude). Chunks with similar meanings end up at nearby points on the map, while phrases with different meanings are far apart.
2. At this point we have an empty map, so we need to start placing some clear examples on it for each category we are interested in. We tell the model:
- "Look, phrases like 'thank you,' 'call me,' 'I want info,' 'okay fine' all mean INTERESTED." Place these phrases (chunks) on the map. The area on the map where all these chunks are located will be the INTERESTED region.
- "Instead, phrases like 'no thanks,' 'not now,' 'maybe later' mean NOT INTERESTED." By placing these chunks on the map as well, the region you find will be called NOT INTERESTED.
- "And phrases like 'stop,' 'delete me,' 'don't send me any more messages' mean DELETE ME." As with the previous cases, the coordinates of these chunks will define the DELETE ME region.
3. When a new SMS arrives, e.g., "sure, contact me," we ask our AI model to find its coordinates on the meaning map.
4. At this point, the computer does a simple thing: it measures how "close" the position of the new message is to the positions of the examples we had given it for each category. It calculates the distance between the point of the new message and the points (or regions) of "INTERESTED," "NOT INTERESTED," and "DELETE ME."
5. Having found the distances to the other points (or regions), we can try to classify our message:
- If the point of the new message ends up very close to the "INTERESTED" example area, the computer says, "Okay, this message means INTERESTED!"
- If it is closer to the "NOT INTERESTED" zone, it classifies it as such.
- If it is closer to "DELETE ME," it does the same.
- If the message ends up in a somewhat remote spot on the map, far from our example areas, the computer might say, "I'm not sure of the meaning" (classification "unclear").

The Advantage:

This method is powerful because it does not rely only on the exact keywords. It understands context and semantic meaning. It can understand that "okay, proceed" is similar to "yes, I am interested," even if the words are different. And if the AI model is "multilingual" (like the one we use in this experiment), it can do this proximity reasoning on the map even across languages!

Basically, we turn the meaning of sentences into "locations" on a map and then simply look at which of our groups of examples the new sentence is closest to.
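
The map analogy can be played out literally with two coordinates per phrase. Everything in this sketch is invented for illustration: the coordinates, the MAX_DISTANCE cut-off and the classify_point helper are hypothetical, and a real embedding model produces hundreds of dimensions rather than two.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical 2-D coordinates, invented purely for illustration.
EXAMPLES: Dict[str, List[Point]] = {
    "interested": [(8.0, 8.0), (9.0, 7.5)],
    "not_interested": [(2.0, 8.0), (1.5, 7.0)],
    "unsubscribe": [(5.0, 1.0), (5.5, 1.5)],
}
MAX_DISTANCE = 3.0  # beyond this, the message sits in a "remote spot" of the map


def classify_point(point: Point) -> str:
    """Return the label of the nearest example, or 'unclear' if all are far away."""
    best_label, best_distance = "unclear", math.inf
    for label, points in EXAMPLES.items():
        for example in points:
            distance = math.dist(point, example)  # straight-line distance on the map
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label if best_distance <= MAX_DISTANCE else "unclear"


print(classify_point((8.5, 7.0)))  # lands next to the "interested" examples
print(classify_point((5.0, 5.0)))  # too far from every example
```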

Let's take a look at the code:

from sentence_transformers import SentenceTransformer, util

EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", 'distiluse-base-multilingual-cased-v2')
# ...
EMBEDDING_CONFIDENCE_THRESHOLD = float(os.getenv("EMBEDDING_CONFIDENCE_THRESHOLD", 0.60))
# ...
EMBEDDING_EXAMPLES = {
    "interested": ["yes please", "sure thing", "call me", "i want info", "interested",
                   "sì", "certo", "chiamami", "voglio info"],
    "not_interested": ["no thanks", "not interested", "maybe later", "don't want it",
                       "no", "non mi interessa", "forse dopo"],
    "unsubscribe": ["stop", "remove me", "unsubscribe me", "don't text me again",
                    "stop", "cancellami", "non scrivermi più"]
}
# --- Model Loading Function ---
def load_models():
    """Loads all ML models into global variables. Called once at startup."""
    global embedding_model, sentiment_tokenizer, sentiment_model, precomputed_embeddings
    logger.info("--- Starting Model Loading ---")
    try:
        logger.info(f"Loading embedding model: {EMBEDDING_MODEL_NAME}")
        embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME, device='cpu') # Use 'cpu' for compatibility
        logger.info("Embedding model loaded successfully.")
        # Check if embedding model loaded successfully
        if not embedding_model:
            raise RuntimeError("Failed to load embedding model.")
        # ...
        logger.info("Precomputing embeddings for examples...")
        # Precompute embeddings for labeled examples
        for label, examples in EMBEDDING_EXAMPLES.items():
            if embedding_model:  # Check if embedding model loaded successfully
                precomputed_embeddings[label] = embedding_model.encode(examples, convert_to_tensor=True)
            else:
                raise RuntimeError("Cannot precompute embeddings, embedding model failed to load.")
        logger.info("Example embeddings precomputed successfully.")
        logger.info("--- Model Loading Finished ---")

def classify_by_embedding(text: str, lang: str) -> Optional[Dict[str, Any]]:
    """Classifies intent based on semantic similarity using embeddings."""
    logger.info("Classifying using embeddings...")
    if not embedding_model or not precomputed_embeddings:
        logger.error("Embedding model/precomputed embeddings not available. Skipping embedding classification.")
        return None
    try:
        text_emb = embedding_model.encode(text, convert_to_tensor=True)
        best_score = 0.0
        best_label = "unclear"

        for label, example_embs in precomputed_embeddings.items():
            scores = util.cos_sim(text_emb, example_embs)
            max_score_for_label = float(scores.max())
            # logger.debug(f"Embedding score for label '{label}': {max_score_for_label:.4f}")
            if max_score_for_label > best_score:
                best_score = max_score_for_label
                best_label = label

        logger.info(f"Best embedding match: label='{best_label}', score={best_score:.4f}")
        if best_score >= EMBEDDING_CONFIDENCE_THRESHOLD and best_label in CUSTOM_REPLIES:
            logger.info(f"Embedding -> {best_label}")
            return create_response(best_label, best_score, lang)
        else:
            return None
    except Exception as e:
        logger.error(f"Error during embedding classification: {e}", exc_info=True)
        return None

As we can see from the import, we use sentence-transformers. What is it?

Sentence-transformers (https://www.sbert.net/) is a Python library built on top of Hugging Face Transformers and PyTorch that makes it super easy to generate embeddings (vector representations) of sentences, paragraphs, or even short documents.

Sentence-transformers includes some pre-trained models that can be used out of the box. In our case we use distiluse-base-multilingual-cased-v2, which maps sentences and paragraphs (in many different languages) to a 512-dimensional space. (In our map example, with only latitude and longitude, the space had just 2 dimensions. It is a big simplification, but the example still holds.)

In the load_models function, after initializing the embedder, we feed it all the examples in EMBEDDING_EXAMPLES and save the coordinates found for each "region" in precomputed_embeddings.

When we receive a message in the classify_by_embedding function, we embed the message as well and then measure how close the text is to the examples with:

scores = util.cos_sim(text_emb, example_embs)

The cos_sim (cosine similarity) function compares the vector of the incoming message (the 512 numbers that define its coordinates) with the vectors we computed from the examples and stored during initialization. The result is a similarity score where higher means closer in meaning, so for each label we keep the highest score.

At the end of the calculation, best_score holds the best score and best_label the corresponding classification. If the threshold is passed we call the function to prepare the answer; otherwise we go on to the last method.
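
For readers who want to see what cosine similarity computes, here is the formula written out by hand. The three-dimensional vectors are hypothetical toy values, and cosine_similarity is an illustration helper, not the library function the project calls.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|): near 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; the real ones have 512 dimensions.
message = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # same direction, twice the length
orthogonal = [0.0, 0.0, 3.0]      # shares nothing with the message

print(cosine_similarity(message, same_direction))  # very close to 1
print(cosine_similarity(message, orthogonal))      # 0.0
```

Because only the angle between the vectors matters, a vector and a scaled copy of it count as identical in meaning, which is why this measure is a common choice for comparing embeddings.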

classify_with_llm

Imagine that the methods we saw earlier (keywords, sentiment analysis, meaning map/embeddings) are like specialists or assistants who try to work out the user's intent by following precise rules or specific comparisons. Sometimes, however, these specialists run into difficulties:

The message is too vague ("Hmm, I don't know...").
It uses strange or ironic language that confuses the previous methods.
It simply does not resemble any of the examples we had given (it is too "far away" on the map of meanings).

In such cases, when specialists throw up their arms and say, "I'm not sure!" our script does one thing: it calls the "boss," the supreme expert, the consultant who knows a little about everything.

1. Who is the Expert? This expert is a Large Language Model (LLM), such as GPT-4 (the technology that makes ChatGPT work, to be clear). These models are incredibly powerful AIs that have been trained on an inordinate amount of texts, books, etc. They have a very broad and "general" understanding of human language, context, and reasoning.
2. Giving Clear Instructions (The "Prompt"): We cannot simply give the message to the LLM and say "you do it." We have to give it a clear task. Our script prepares a detailed request (called the "prompt") that says, in essence:
- "Dear GPT Model, imagine you are an expert in classifying responses to SMS advertisements."
- "I received this message from a user: [original SMS text is inserted here]."
- "Your task is to read this message and decide which of these four labels best describes the user's intention: interested, not_interested, unsubscribe (i.e., wants to cancel), or unclear (if you just don't get it)."
- "Please respond to me in a specific format (JSON), giving me only the label you chose, a very brief explanation of why, and a number representing your confidence in the response."
3. The Expert Analyzes and Responds: The LLM receives these instructions and the message. Using its vast knowledge, it "reads" and "reasons" about the message in the context we have given it (response to SMS marketing) and chooses the label it thinks is most appropriate from those we have allowed it to use. It then sends us its structured response as requested.
4. Why Use It Last? This method is very powerful because the LLM can understand nuance, irony, typos, and complex contexts much better than the other methods. However, it has two main disadvantages:
- Speed: Asking an LLM for a response takes longer (a few seconds) than the other methods (which are almost instantaneous).
- Cost: Sentiment analysis or embedding can be run successfully on computers that are not particularly powerful. In contrast, an LLM needs dedicated and very powerful hardware. To interact with these LLMs we can either call them remotely through APIs (on hardware provided by vendors) or run a freely available model on sufficiently powerful hardware that we provide ourselves. Either way, whether we use an API or host the model ourselves, we will have costs.

For these reasons, we use the LLM as a last resort (in technical jargon, a "fallback"). Only if the keywords, sentiment, and meaning map have not given us a confident enough answer do we "bother" the great LLM expert to get a final opinion.

It's like having a very good but very expensive consultant: you call him or her only when your internal staff doesn't have the capacity to solve a complex problem.

Let's look at the code:

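import json

import openai

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GPT_MODEL_NAME = os.getenv("GPT_MODEL_NAME", "gpt-4o")
# openai_client and GPT_FALLBACK_CONFIDENCE_FLOOR are used below but are defined elsewhere in the project (not shown in this excerpt).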


def classify_with_gpt(text: str, lang: str) -> Dict[str, Any]:
    """Uses OpenAI GPT as a fallback for classification."""
    logger.info("Falling back to GPT for classification.")
    prompt = (
        f"You are an intent classifier for SMS replies to marketing messages. The user replied with the following message (language: {lang}):\n"
        f"\"{text}\"\n\n"
        f"Classify the user's intent strictly as one of: 'interested', 'not_interested', 'unsubscribe', 'unclear'. Consider the context.\n"
        f"If you detect negative reactions, classify as 'unsubscribe'. If you detect clear discomfort about the marketing message or offensive words classify as 'unsubscribe'.\n"
        f"Respond ONLY with a JSON object containing three keys: 'intent', 'reasoning' (brief), and 'confidence' (float between {GPT_FALLBACK_CONFIDENCE_FLOOR} and 1.0)." )
    try:
        response = openai_client.chat.completions.create(
            model=GPT_MODEL_NAME,
            messages=[
                {"role": "system", "content": "You are an accurate and concise intent classification assistant outputting JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
            response_format={"type": "json_object"}
        )
        gpt_response_content = response.choices[0].message.content
        logger.info(f"Raw GPT response: {gpt_response_content}")
        try:
            gpt_data = json.loads(gpt_response_content)
            intent = gpt_data.get("intent")
            confidence = float(gpt_data.get("confidence", 0.0))
            if intent not in CUSTOM_REPLIES or confidence < GPT_FALLBACK_CONFIDENCE_FLOOR:
                 logger.warning(f"Invalid intent ('{intent}') or low confidence ({confidence}) from GPT. Treating as 'unclear'.")
                 return create_response("unclear", 0.4, lang, error="Invalid GPT response or low confidence.")
            logger.info(f"GPT -> {intent} (Conf: {confidence}). Reason: {gpt_data.get('reasoning', 'N/A')}")
            final_confidence = max(GPT_FALLBACK_CONFIDENCE_FLOOR, min(1.0, confidence))
            return create_response(intent, final_confidence, lang)
        except (json.JSONDecodeError, TypeError, KeyError, ValueError) as json_e:
            logger.error(f"Failed to parse GPT JSON response: {json_e}. Content: {gpt_response_content}", exc_info=True)
            return create_response("unclear", 0.4, lang, error=f"GPT response parsing error: {json_e}")
    except openai.APIError as api_e:
        logger.error(f"OpenAI API error: {api_e}", exc_info=True)
        return create_response("unclear", 0.4, lang, error=f"OpenAI API error: {api_e}")
    except Exception as e:
        logger.error(f"Unexpected error during GPT classification: {e}", exc_info=True)
        return create_response("unclear", 0.4, lang, error=f"Unexpected GPT error: {e}")

Again the code is very simple: with a prompt we tell our LLM what it should do and how it should format the output. In the example we use gpt-4o from OpenAI (https://openai.com/index/hello-gpt-4o/), which is more than we need for this task.

This model lets us specify the format of the response, in this case json_object. In any case, if even the LLM returns something we cannot process, we treat the response as "unclear".

Everything else

Okay, now that we have seen the classifiers in detail let's take a look at everything else.

The function that calls the classifiers one after the other is this:

#--- Main Classification Pipeline Function ---
def classify_message(text: str) -> Dict[str, Any]:
    """Main classification pipeline: Keyword -> Sentiment -> Embedding -> GPT."""
    if not text or not isinstance(text, str) or len(text.strip()) == 0:
        logger.warning("Received empty or invalid text message.")
        return create_response("unclear", 0.1, DEFAULT_LANG, error="Empty message received")

    # Ensure models are loaded before proceeding (belt-and-suspenders check)
    if not embedding_model or not sentiment_model:
        logger.error("Models are not loaded. Cannot perform classification.")
        return create_response("error", 0.0, DEFAULT_LANG, error="Internal server error: Classification models unavailable.")

    logger.info(f"Processing message: '{text}'")
    lang = detect_language(text)
    logger.info(f"Detected language: {lang}")

    # Execute classification steps in order
    result = classify_by_keyword(text, lang)
    if result: return result
    result = classify_by_sentiment(text, lang)
    if result: return result
    result = classify_by_embedding(text, lang)
    if result: return result
    result = classify_with_gpt(text, lang)
    return result

The function to identify language is also very simple:

from langdetect import detect, LangDetectException
# ...
SUPPORTED_LANGUAGES = ["en", "it"]
DEFAULT_LANG = "en"
# ...
def detect_language(text: str) -> str:
    """Detects language, falling back to DEFAULT_LANG if unsupported or detection fails."""
    try:
        lang = detect(text)
        if lang not in SUPPORTED_LANGUAGES:
            logger.warning(f"Detected language '{lang}' not in supported list {SUPPORTED_LANGUAGES}. Using {DEFAULT_LANG} for processing.")
            return DEFAULT_LANG
        return lang
    except LangDetectException:
        logger.warning(f"Language detection failed for text: '{text[:50]}...'. Falling back to {DEFAULT_LANG}.", exc_info=False)
        return DEFAULT_LANG

To respond in the right language to our lead we do this:

CUSTOM_REPLIES = {
    "interested": { "en": "Thanks! We’ll call you shortly.", "it": "Grazie! Ti chiameremo a breve." },
    "not_interested": { "en": "No problem, we’ll try again another time.", "it": "Va bene, magari alla prossima." },
    "unsubscribe": { "en": "Sorry to bother you. You won’t hear from us again.", "it": "Ci dispiace disturbarti. Non ti contatteremo più." },
    "unclear": { "en": "Just to be sure: did you mean yes, no, or stop?", "it": "Sei interessato, non interessato, o vuoi annullare l'iscrizione?" },
    "unsupported_language": { "en": "Sorry, this service currently supports only English and Italian.", "it": "Al momento supportiamo solo Italiano e Inglese." },
    "error": { "en": "Sorry, we encountered a technical issue. Please try again later.", "it": "Siamo spiacenti, si è verificato un problema tecnico. Riprova più tardi." }
}
#...
def get_reply(intent: str, lang: str) -> str:
    """Gets the appropriate reply for a given intent and language, with fallback."""
    if intent not in CUSTOM_REPLIES:
        logger.warning(f"Intent '{intent}' not found in CUSTOM_REPLIES. Falling back to 'unclear'.")
        intent = "unclear"
    return CUSTOM_REPLIES[intent].get(lang, CUSTOM_REPLIES[intent].get(DEFAULT_LANG, next(iter(CUSTOM_REPLIES[intent].values()))))
#...
def create_response(intent: str, confidence: float, lang: str, error: Optional[str] = None) -> Dict[str, Any]:
    """Standardizes the creation of the response dictionary."""
    response = {
        "intent": intent,
        "reply": get_reply(intent, lang),
        "confidence": round(confidence, 2),
        "language": lang
    }
    if error:
        response["error"] = error
    return response

Finally, to give our "text message classifier" an interface usable from the web, we chose Flask (https://flask.palletsprojects.com/en/stable/) and Gunicorn (https://gunicorn.org/):

from flask import Flask, request, jsonify
#...
app = Flask(__name__)
# --- Load Models Globally at Startup ---
# This is crucial for Gunicorn with --preload. It runs when the module is imported.
try:
    load_models()
except Exception as startup_error:
    # Log critical error and prevent app from starting if models fail
    logger.critical(f"FATAL: Failed to load models during app initialization: {startup_error}", exc_info=True)
    # Exit helps ensure Gunicorn doesn't try to run workers with broken state
    exit(1)

# --- Flask Routes ---
@app.route("/classify", methods=["POST"])
def classify_endpoint():
    """Flask endpoint to classify a message."""
    # Check if models are loaded before processing request
    if not embedding_model or not sentiment_model:
         logger.error("Models not loaded, cannot classify request.")
         # Return 503 Service Unavailable if models aren't ready
         return jsonify(create_response("error", 0.0, DEFAULT_LANG, error="Service temporarily unavailable: Models not loaded.")), 503

    if not request.is_json:
        logger.warning("Received non-JSON request")
        return jsonify({"error": "Request must be JSON"}), 400

    data = request.get_json()
    message_text = data.get("message")

    if not message_text:
        logger.warning("Missing 'message' field in request JSON")
        return jsonify({"error": "Missing 'message' field in JSON payload"}), 400

    try:
        classification_result = classify_message(message_text)
        return jsonify(classification_result), 200
    except Exception as e:
        logger.error(f"Unexpected error during classification for message: '{message_text[:50]}...': {e}", exc_info=True)
        error_response = create_response("error", 0.0, DEFAULT_LANG, error="Internal server error during classification.")
        return jsonify(error_response), 500

@app.route("/health", methods=["GET"])
def health_check():
    """Basic health check endpoint, verifies models are loaded."""
    models_loaded = embedding_model is not None and sentiment_model is not None and sentiment_tokenizer is not None and precomputed_embeddings is not None
    status = "ok" if models_loaded else "error"
    http_code = 200 if models_loaded else 503 # Use 503 Service Unavailable if not ready
    logger.info(f"Health check endpoint called. Models loaded: {models_loaded}")
    return jsonify({"status": status, "models_loaded": models_loaded}), http_code

# --- Main Execution Block (for direct 'python app.py' runs ONLY) ---
if __name__ == "__main__":
    # This block is ignored when running with Gunicorn
    logger.info("Attempting to start Flask development server directly (python app.py)...")
    # Models should already be loaded globally above.
    # We add a final check here before trying to run Flask's dev server.
    if not (embedding_model and sentiment_model):
         logger.critical("Models not loaded successfully earlier. Aborting Flask development server start.")
         exit(1)
    try:
        # Run Flask's development server (NOT for production)
        # Use host='0.0.0.0' to make it accessible on your network
        app.run(debug=False, host='0.0.0.0', port=5000)
    except Exception as run_error:
         logger.critical(f"Flask development server failed: {run_error}", exc_info=True)
         exit(1)
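
One edge case the endpoint above does not handle explicitly: a body that is valid JSON but not an object (a list, for example) has no .get method, and a non-string "message" would reach the classifier. A possible way to tighten this is a small pure helper that is easy to test on its own. The name extract_message is hypothetical and not part of the original project.

```python
from typing import Any, Optional, Tuple


def extract_message(payload: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (message, error); exactly one of the two is None."""
    if not isinstance(payload, dict):
        return None, "JSON payload must be an object"
    message = payload.get("message")
    # Reject missing, non-string, and whitespace-only messages alike.
    if not isinstance(message, str) or not message.strip():
        return None, "Missing or empty 'message' field in JSON payload"
    return message.strip(), None


print(extract_message({"message": "  Yes please  "}))
print(extract_message(["Yes"]))
print(extract_message({"message": ""}))
```

Inside classify_endpoint, one could call it on the result of request.get_json() and, when the error is not None, return that string with a 400 status.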

To start the whole thing, run:

gunicorn --workers 1 --bind 0.0.0.0:5001 --preload app:app
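
The same options can also live in a Gunicorn configuration file, which is plain Python. Below is a sketch that mirrors the command above; you would start it with `gunicorn -c gunicorn.conf.py app:app`. The timeout value is an assumption added here for slow classification requests and is not taken from the original project.

```python
# gunicorn.conf.py - mirrors the command-line flags used above

bind = "0.0.0.0:5001"   # same as --bind 0.0.0.0:5001
workers = 1             # same as --workers 1
preload_app = True      # same as --preload: import app.py (and load the models) before forking

# Assumption, not in the original command: give slow requests more
# time than Gunicorn's 30-second default before the worker is restarted.
timeout = 120
```

Keeping the settings in a file makes them easier to version together with the code than a long command line.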

Once the models have downloaded (it takes some patience), you can start experimenting with a command like this. Note that the apostrophe in "don't" is written as '\'' so that it survives the shell's single quotes:

curl -X POST http://localhost:5001/classify \
-H "Content-Type: application/json" \
-d '{"message": "I don'\''t want to hear anymore from Mojalab.com"}'

And this is the answer:

{"confidence":0.75, "intent": "unsubscribe", "language": "en", "reply": "Sorry to bother you. You won\u2019t hear from us again."}
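
If you prefer to call the endpoint from Python, a small client built on the standard library sidesteps shell quoting altogether, because json.dumps takes care of escaping. This is only a sketch: the function names build_request and classify are made up for illustration, and the URL assumes the server is running locally on port 5001 as above.

```python
import json
import urllib.request

CLASSIFY_URL = "http://localhost:5001/classify"  # same address as the curl example


def build_request(message: str, url: str = CLASSIFY_URL) -> urllib.request.Request:
    """Build the POST request; json.dumps handles quotes and apostrophes."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(message: str) -> dict:
    """Send the message to the running service and return the parsed JSON reply.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    with urllib.request.urlopen(build_request(message), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request needs no running server, so we can inspect it safely.
req = build_request("I don't want to hear anymore from Mojalab.com")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the server up, classify("I don't want to hear anymore from Mojalab.com") returns a dictionary with the intent, reply, confidence, and language keys shown in the answer above.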

🛠️ Want to try it yourself?
Explore the full source code and instructions here:
👉 GitHub – message-classification-example

Conclusion: Now Go Build Something!

So, what started as tackling a friend's spreadsheet puzzle turned into an adventure in understanding language with AI! We dove into the unpredictable world of human text replies and emerged with a system capable of figuring out what people really (I hope) mean – whether it's a "yes," a "no," or a "please stop texting me!"

Our approach wasn't about finding one magic bullet, but layering different tools like building blocks. We started simple and fast with keywords and sentiment checks, then brought in the nuanced understanding of embeddings, and finally, called upon the powerhouse reasoning of an LLM for the truly tricky cases.

Think of this project not as a finished solution (it really isn't one), but as a spark. It aims to show how combining readily available AI tools can transform a frustrating problem into an automated, intelligent process. The techniques we explored – understanding sentiment, mapping meaning with embeddings, conversing with LLMs – are more accessible today than ever before.

What everyday challenge could you simplify with a bit of code and AI? What process could you automate? The journey of building something like this is incredibly rewarding. So, take these ideas, experiment, break things, learn, and most importantly – go build something amazing!

Disclaimer: At MojaLab, we aim to provide accurate and useful content, but hey, we’re human (well, mostly)! If you spot an error, have questions, or think something could be improved, feel free to reach out—we’d love to hear from you. Use the tutorials and tips here with care, and always test in a safe environment. Happy learning!!!

No AI was mistreated in the making of this tutorial—every LLM was used with the respect it deserves.
