Hospitality Review Sentiment Analysis

Comparative analysis of guest satisfaction themes between Airbnb and hotel accommodations using advanced NLP techniques

Project Overview

As part of a hospitality industry research project, I conducted sentiment analysis comparing guest satisfaction themes between Airbnb and hotel reviews. The project answered: What are the most frequently mentioned themes of guest satisfaction and dissatisfaction in Airbnb vs. hotel reviews?

Using VADER sentiment analysis, n-gram extraction, and Random Forest classification, this analysis reveals distinct patterns in how guests evaluate different accommodation types and identifies key satisfaction drivers across hospitality models.

Technical Implementation

Multi-Stage NLP Pipeline

VADER Sentiment Analysis

Compound-score thresholds (>= 0.05 positive, <= -0.05 negative, otherwise neutral) label each review

TF-IDF Vectorization

Unigram and bigram extraction (ngram_range=(1, 2), capped at 5,000 features) for theme identification

Random Forest Classification

Grid search over tree count and depth; the best model's feature importances rank the n-grams

Language Detection & Filtering

Automated filtering ensuring English-only review analysis
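
To make the n-gram step concrete, the sketch below shows what a (1, 2) n-gram range yields for one cleaned review: every single token plus every adjacent pair of tokens. This is a plain-Python illustration rather than the project's code. The helper `extract_ngrams` and the sample sentence are hypothetical, and scikit-learn's TfidfVectorizer additionally applies its own tokenization and TF-IDF weighting.

```python
def extract_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the review
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


sample = "great location clean room"  # hypothetical cleaned review
ngrams = extract_ngrams(sample)
print(ngrams)
# ['great', 'location', 'clean', 'room',
#  'great location', 'location clean', 'clean room']
```

Bigrams such as "clean room" keep some of the context that single words lose, which is why they are useful for theme identification.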

Complete Analysis Pipeline

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Hospitality Review Sentiment Analysis
Comparative analysis of Airbnb vs Hotel guest satisfaction
"""

import os
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import precision_recall_fscore_support
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
import pickle

# Utility Functions for Text Processing
def clean_txt(var_in):
    import re
    tmp_t = re.sub("[^A-Za-z']+", " ", var_in).strip().lower()
    return tmp_t

def rem_sw(str_in):
    from nltk.corpus import stopwords
    # Requires the NLTK stopword corpus: nltk.download('stopwords')
    sw = set(stopwords.words('english'))  # a set gives fast membership tests
    return ' '.join(word for word in str_in.split() if word not in sw)

def lemma_fun(var_in):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    split_ex = var_in.split()
    t_l = [lemmatizer.lemmatize(word) for word in split_ex]
    return ' '.join(t_l)

# Language detection for filtering
DetectorFactory.seed = 42

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"
    
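# Load and combine datasets
# NOTE: placeholder paths (assumed, not from the original project); replace with the real file locations
airbnb_path = os.path.join("data", "airbnb_reviews.xlsx")
hotels_path = os.path.join("data", "hotel_reviews.xlsx")
airbnb_data = pd.read_excel(airbnb_path)
hotels_data = pd.read_excel(hotels_path)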

airbnb_data['source'] = 'airbnb'
hotels_data['source'] = 'hotels'
data = pd.concat([airbnb_data, hotels_data], ignore_index=True)

# Comprehensive text preprocessing
data['Reviews_Lemma'] = data['Reviews_Lemma'].fillna("").astype(str).apply(clean_txt)
data['Reviews_Lemma'] = data['Reviews_Lemma'].apply(rem_sw)
data['Reviews_Lemma'] = data['Reviews_Lemma'].apply(lemma_fun)

# Filter for English reviews only
data['language'] = data['Reviews_Lemma'].apply(detect_language)
# .copy() avoids SettingWithCopyWarning when columns are added to the filtered frame later
data = data[data['language'] == 'en'].copy()
print(f"Remaining reviews after filtering: {len(data)}")

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(data['Reviews_Lemma'])

# VADER Sentiment Analysis
vader_analyzer = SentimentIntensityAnalyzer()

def vader_sentiment_score(text):
    sentiment = vader_analyzer.polarity_scores(text)
    if sentiment['compound'] >= 0.05:
        label = 'positive'
    elif sentiment['compound'] <= -0.05:
        label = 'negative'
    else:
        label = 'neutral'
    return sentiment['compound'], label

# Apply sentiment scoring
data[['compound_score', 'sentiment']] = data['Reviews_Lemma'].apply(
    lambda x: pd.Series(vader_sentiment_score(x))
)

# Train Random Forest Classifier
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, data['sentiment'], test_size=0.2, random_state=42
)

rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]}
grid_search = GridSearchCV(rf_model, param_grid, cv=3, scoring='f1_weighted')
grid_search.fit(X_train, y_train)

best_rf_model = grid_search.best_estimator_

# Feature importance analysis
feature_importances = pd.DataFrame(
    best_rf_model.feature_importances_, 
    index=tfidf_vectorizer.get_feature_names_out(),
    columns=['importance']
).sort_values(by='importance', ascending=False)

# Comparative analysis
# X_tfidf already holds the vectorized reviews, so it is reused instead of transforming again
data['predicted_sentiment'] = best_rf_model.predict(X_tfidf)
comparison = data.groupby(['source', 'predicted_sentiment']).size().unstack(fill_value=0)
print(comparison)

print("Analysis complete.")
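
The script imports precision_recall_fscore_support but never calls it, so the held-out split is not scored. A minimal sketch of that evaluation step follows, using a toy corpus in place of the real reviews; the texts, labels, and model settings are illustrative assumptions. In the actual pipeline the labels come from VADER, so these metrics would measure agreement with VADER rather than with human judgments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the cleaned reviews (illustrative data only)
texts = [
    "great clean room", "lovely host great stay",
    "friendly staff clean room", "perfect location great value",
    "dirty room rude staff", "terrible noisy room",
    "awful stay dirty bathroom", "rude host terrible value",
] * 5
labels = (["positive"] * 4 + ["negative"] * 4) * 5

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Stratify so both classes appear in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Weighted averages mirror the f1_weighted scoring used in the grid search
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, model.predict(X_test), average="weighted", zero_division=0
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With classes as imbalanced as the ones reported in the findings, per-class scores (average=None) are worth checking as well, since a weighted average can hide poor recall on the rare negative class.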

Key Findings

Sentiment Patterns

Airbnb Reviews

98% positive reviews

Average sentiment: 0.857

More consistent experiences and higher sentiment scores, which may reflect the personalized hospitality model

Hotel Reviews

93.6% positive reviews

Average sentiment: 0.816

More variation in sentiment, which may reflect standardized service expectations and diverse property types
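
The script does not show how the percentages and averages above were derived. One plausible way, sketched here on a handful of made-up rows, is a groupby over the `source` column. The column names match the pipeline, while the frame `toy` and its values are purely illustrative.

```python
import pandas as pd

# Illustrative rows only; the real pipeline stores these columns in `data`
toy = pd.DataFrame({
    "source": ["airbnb", "airbnb", "airbnb", "hotels", "hotels", "hotels"],
    "compound_score": [0.9, 0.8, -0.3, 0.7, 0.0, -0.6],
    "sentiment": ["positive", "positive", "negative",
                  "positive", "neutral", "negative"],
})

summary = toy.groupby("source").agg(
    # Share of reviews labeled positive, as a percentage
    pct_positive=("sentiment", lambda s: 100 * (s == "positive").mean()),
    # Average VADER compound score
    mean_compound=("compound_score", "mean"),
    # Number of reviews per platform, needed to judge how stable the rates are
    n_reviews=("sentiment", "count"),
)
print(summary.round(3))
```

Reporting the review count next to each rate makes it easier to tell whether a gap such as 98% versus 93.6% rests on comparable sample sizes.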

Satisfaction Themes

Shared Priorities

Cleanliness and location matter equally across platforms, indicating universal guest expectations for accommodation fundamentals.

Airbnb Differentiators

Host interactions are crucial for both satisfaction and dissatisfaction: the quality of the personal relationship drives variability in the experience.

Hotel Differentiators

Guests expect professional service and focus on value for money, reflecting standardized hospitality industry norms.

Critical Insights

Service Mentions

Hotels: 218,814 | Airbnb: 7,450

About 29x more mentions in hotel reviews (raw counts), indicating the centrality of professional service

Price Mentions

Hotels: 166,341 | Airbnb: 22,163

About 7.5x more mentions in hotel reviews (raw counts), reflecting different value expectations

Dissatisfaction Patterns

Airbnb: Variability issues | Hotels: Standardization failures

Different risk profiles across hospitality models
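
The service and price multipliers above follow from dividing the hotel mention count by the Airbnb mention count. The short check below reproduces that arithmetic from the reported counts; the helper name `mention_ratio` is hypothetical.

```python
# Mention counts as reported in this section
mentions = {
    "service": {"hotels": 218_814, "airbnb": 7_450},
    "price": {"hotels": 166_341, "airbnb": 22_163},
}


def mention_ratio(counts):
    """Hotel-to-Airbnb ratio of raw mention counts."""
    return counts["hotels"] / counts["airbnb"]


ratios = {theme: mention_ratio(c) for theme, c in mentions.items()}
for theme, ratio in ratios.items():
    print(f"{theme}: {ratio:.1f}x")
# service: 29.4x
# price: 7.5x
```

These are ratios of raw counts. Comparing mention rates would require dividing each count by the number of reviews on that platform, which is not reported here.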

Business Impact

This analysis quantifies known hospitality industry trends and surfaces actionable insights for platform optimization. The findings can inform strategic decisions by accommodation platforms and property managers by highlighting the distinct value proposition of each hospitality model.

Technical Skills Demonstrated

Advanced NLP Preprocessing

Multi-stage text cleaning with language detection and filtering

Machine Learning Pipeline

TF-IDF vectorization with optimized Random Forest classification

Comparative Analysis

Cross-platform statistical comparison and theme extraction

Business Intelligence Translation

Converting technical findings into actionable insights