Data science at Dealer.com: a case study with sentiment analysis of automotive brands

The Product Analytics Team at Dealertrack consists of web analysts and data scientists 1 who address the business needs of the products division. We use a range of business intelligence, data visualization, and statistical tools to provide insight for product development and performance. Recently, we’ve been increasing our use of R, the open-source language for statistical computing 2. I’ll briefly explain how and why we use R, using an example of our typical workflow.

Step 1: The Question.

Our typical project starts with a business question from a director or product owner. In this post I’ll work through an example with public data to address a question that intrigued me: how does consumer sentiment vary among automotive brands?

Step 2: The Data.

Research Mark Wahlberg https://www.facebook.com/researchmark

To address this question, I needed data from a vast pool of real people sharing their thoughts on automotive brands. Lucky for me, this data is available from everyone’s favorite self-broadcast social messaging service, Twitter, and the people at Twitter provide free access to recent tweets through their API. One of the many reasons to use R is that there’s almost always a package on CRAN to suit your analytical needs. In this case, I used the twitteR package to get the most recent 1,000 tweets (in English) from the past week that mentioned (@brand) or hashtagged (#brand) each of the major automotive brands. For example, this line pulls up to 1,000 tweets from the past week that mention Ford:

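# load twitteR; authenticate first with setup_twitter_oauth()
library(twitteR)

# get up to 1,000 English tweets from the past week that mention Ford
tweets <- searchTwitter("@ford OR #ford", n = 1000,
    since = format(Sys.Date() - 7), lang = "en")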

The full code is more complex, as it also strips retweets and munges the data into a data frame. At this point I also did the incredibly important and time-consuming data janitor work of cleaning the data. For example, the Lincoln car brand is @lincolnmotorco and not @lincoln. I also had to strip non-alphanumeric characters, remove the search string, and so on. After some trial and error, here’s the function I used to clean the tweets.

# function to clean tweets
library(stringr)

cleanTweet <- function(tweet, leaveout) {
  # split the tweet into tokens on spaces
  thistweet <- unlist(str_split(tweet, pattern = " "))
  # replace all non-alphanumeric characters with spaces
  thistweet <- str_replace_all(thistweet, "[^[:alnum:]]", " ")
  # convert to lowercase
  thistweet <- tolower(thistweet)
  # remove brand as it swamps other words
  thistweet <- thistweet[!grepl(leaveout, thistweet)]
  # remove links
  thistweet <- thistweet[!grepl("http", thistweet)]
  # remove 'amp' (left over from the HTML entity &amp;) as a whole word only,
  # so that words such as 'champion' are kept
  thistweet <- thistweet[!grepl("\\bamp\\b", thistweet, perl = TRUE)]
  # recombine and return
  paste(thistweet, collapse = " ")
}
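
To see what this kind of cleaning does to a tweet, here is a minimal, self-contained sketch in base R. It is not the function above: clean_tweets_sketch is a hypothetical name, the two sample tweets are invented for illustration, and it works on a whole character vector at once rather than one tweet at a time.

```r
# Hypothetical, vectorised variant of the cleaning steps (base R only).
clean_tweets_sketch <- function(tweets, leaveout) {
  vapply(tweets, function(tweet) {
    # lowercase, then split on whitespace so links and mentions stay whole
    tokens <- unlist(strsplit(tolower(tweet), "\\s+"))
    # drop links and any token containing the brand
    tokens <- tokens[!grepl("http", tokens, fixed = TRUE)]
    tokens <- tokens[!grepl(leaveout, tokens, fixed = TRUE)]
    # drop the HTML entity &amp; before punctuation is stripped
    tokens <- tokens[tokens != "&amp;"]
    # strip remaining non-alphanumeric characters, then drop empty tokens
    tokens <- gsub("[^[:alnum:]]", "", tokens)
    paste(tokens[nzchar(tokens)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# invented sample tweets, for illustration only
sample_tweets <- c(
  "Love my new @Ford truck! &amp; the mileage is great http://t.co/abc",
  "#ford recall again... terrible service"
)
cleaned <- clean_tweets_sketch(sample_tweets, leaveout = "ford")
print(cleaned)
```

Removing links and the HTML entity before stripping punctuation means neither leaves fragments behind, which is one way to sidestep the stray 'amp' problem.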

Step 3: The Exploration.

As with any new data set, I started by exploring the data visually with correlation plots, histograms, etc. My goal here was to find oddities (e.g. a misspelled Chrysler brings up weird tweets that have nothing to do with Chrysler cars. Yes, more data cleaning!) and to identify interesting patterns. After some perusal, it appeared that tweets about luxury cars (Mercedes, Aston Martin) were more positive than tweets about other brands. So I decided to test the specific hypothesis that luxury car brands have more positive consumer sentiment than economy brands 3.

https://xkcd.com/688/

Step 4: The Final Analysis.

After checking the data and stating the hypothesis, I refined the analysis. I started with the quick and dirty approach. In this case, I followed the protocol of Breen 2011 where I simply split each tweet into distinct words, and then scored the sentiment of each tweet as the number of positive words minus the number of negative words 4. Positive and negative words were defined by the English language opinion lexicon of Hu & Liu 2004. Here’s the actual function I applied

library(stringr)

sentiment_score <- function(text, pos_words, neg_words) {
  # split text into words on whitespace and flatten to a character vector
  words <- unlist(str_split(text, "\\s+"))

  # compare words to the dictionaries of positive and negative terms;
  # match() returns NA for words not in a dictionary, so these are
  # logical vectors marking which words matched
  pos_matches <- !is.na(match(words, pos_words))
  neg_matches <- !is.na(match(words, neg_words))

  # score: TRUE/FALSE are treated as 1/0 by sum()
  sum(pos_matches) - sum(neg_matches)
}

to each tweet using the mutate function from the plyr package 5. Once I had the sentiment of each tweet, I summarized by automotive brand and then ranked the brands. I then grouped the luxury and economy brands, plotted the data, and performed a t-test.
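
The score, summarize, and test steps can be sketched end to end in base R. Everything below is an assumption made for illustration: the tiny word lists stand in for the Hu & Liu lexicon, the tweets and the luxury/economy labels are invented, score_one is a hypothetical helper, and base R's vapply and aggregate are swapped in for plyr. The numbers it prints say nothing about real brands.

```r
# toy stand-ins for the opinion lexicon (assumption)
pos_words <- c("love", "great", "smooth", "reliable")
neg_words <- c("terrible", "recall", "broken", "noisy")

# hypothetical helper: positive minus negative word count for one tweet
score_one <- function(text, pos_words, neg_words) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

# invented, already-cleaned tweets
tweets <- data.frame(
  brand = c("brand_a", "brand_a", "brand_b", "brand_b",
            "brand_c", "brand_c", "brand_d", "brand_d"),
  class = c("luxury", "luxury", "luxury", "luxury",
            "economy", "economy", "economy", "economy"),
  text  = c("love the smooth ride", "noisy cabin but great seats",
            "great and reliable", "recall notice again",
            "terrible broken radio", "love it",
            "reliable commuter", "great value great price"),
  stringsAsFactors = FALSE
)

# score tweet by tweet
tweets$score <- vapply(tweets$text, score_one, numeric(1),
                       pos_words = pos_words, neg_words = neg_words,
                       USE.NAMES = FALSE)

# mean sentiment per brand, ranked from most to least positive
by_brand <- aggregate(score ~ brand + class, data = tweets, FUN = mean)
by_brand <- by_brand[order(-by_brand$score), ]
print(by_brand)

# two-sample t-test on the brand means; t.test() defaults to the Welch
# version, and testing brand means rather than tweets is an assumption
result <- t.test(score ~ class, data = by_brand)
print(result$statistic)
print(result$p.value)
```

Testing brand-level means treats each brand as one observation; testing tweet-level scores instead would give far more observations but would ignore that tweets about the same brand are not independent.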

Twitter sentiment by auto class

The figure showed no clear difference in consumer sentiment between economy and luxury automotive brands in the past week, at least in the biased sample of people who tweet about cars. The statistical test agrees: it found no significant difference (t = 0.65, P = 0.53). Yes, I’m a statistician and have to apply caveats to (nearly) everything.

Step 5: The Last Mile!

At this point, it’s always good to take a break and bask in your results! Now take a deep breath and forget all of it. Very few people, including your mom, care about P-values. The last mile of a project is putting your results into the hands of the people (colleagues, bosses, moms) that care and want to use them. In this area, R excels (get the pun there?). First, I use knitr and rmarkdown to make beautiful reports summarizing the work. But the real game-changer is Shiny, which makes it easy for statisticians to make interactive web applications presenting their results.

For this mini-project, I decided that the quickest visualization to support the sentiment analysis would be a word cloud 6 showing the common words in tweets for each automotive brand. And yes, there’s an R package for word clouds! Here’s a screenshot of the Shiny app including the results of the sentiment analysis and word cloud of the tweets.

Twitter sentiment app

A benefit of the interactive app is that it captures something my static analysis can’t: consumer sentiment on Twitter changes over time, and the Shiny app updates along with it.
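
The input to a word cloud is just a table of word frequencies. Here is a minimal sketch, assuming already-cleaned tweets (the ones below are invented) and a small hypothetical stop-word list. The plotting call uses wordcloud() from the wordcloud package and only runs in an interactive session where that package is installed.

```r
# invented, already-cleaned tweets for one brand
cleaned <- c("love the smooth ride", "smooth and quiet ride",
             "the ride is great")

# hypothetical stop-word list; a real one would be much longer
stop_words <- c("the", "and", "is")

# split into words and drop the stop words
words <- unlist(strsplit(cleaned, "\\s+"))
words <- words[!words %in% stop_words]

# word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# draw the cloud only if the wordcloud package is available
if (interactive() && requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = as.vector(freq),
                       min.freq = 1)
}
```

In a Shiny app, the same frequency table could be recomputed whenever the selected brand changes.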

Some finishing thoughts

At this point, I’ve demonstrated the “tech” behind our work in Product Analytics. I’ve shown how we address a business question, and generate a template for a new product, using open-source statistical tools. The beauty of using R (or Python) is that this work is scalable. I was able to leverage existing libraries to access the data and generate visualizations, with a simple but complete analysis in between. At the end I have version-controlled code that can be reused and further developed for new projects, which makes our team more efficient and limits errors in our work. Finally, the reports and interactive apps we build are crucial for the most important part of our role, which is communicating our results to stakeholders. The actual math and methods don’t matter if you can’t explain what they mean!

Footnotes


  1. I’ve always been uneasy about the term data scientist and prefer to call myself a full-stack statistician, which I define as a statistically trained analyst who can do both the back-end (gather, clean, exploratory data analysis) and front-end (final analysis, report, display) work of data science (Tukey 1961). 
  2. Yes – we could also use Python+Pandas+SciPy here. I’m not going to add to this debate other than to say that I’m trained as a statistician and thus started with R first. 
  3. But wait! Forming the hypothesis after looking at the data!? Isn’t that completely illegal? No. In this case, this is not a designed experiment, but a data-mining exercise. I’m being explicit in my assumptions and tests, and formulated the question before applying any specific tests. 
  4. Continuing this work, I’d probably train models on a set of tweets that I manually classified into meaningful categories such as “Positive car safety” or “Negative car safety”. 
  5. Yes, I’m transitioning to Hadley Wickham’s awesome dplyr package but still prefer plyr sometimes. 
  6. Word clouds have problems. Space is meaningless, and color and size of the words are typically conflated. But the point here is to visually show something, not make the world’s most perfect and complicated figure.