Lovoo is a European mobile dating app. It ranked the 1st in German speaking markets and 2rd in Sounthern Europe. The goal of the app is to heal the human kind from disease of feeling loneliness. The app can help connect people with dating features and with live video community.
library(vembedr) # to embed youtube video
## Warning: package 'vembedr' was built under R version 4.1.3
embed_youtube(id = "FkJLyyttrqM", allowfullscreen = TRUE)
This project uses data of users in the Lovoo app which records information of 3992 female users. In the data, 42 predictors associated with each user are recorded.
Among all 42 predictors in the dataset, we want to predict female users’ attractiveness(counts of kisses user receive)! Specifically, the main questions we are addressing in our project is: what are the factors that make a great user profile on dating app? What makes a person charismatic? How do charismatic people present themselves?Does it worth it to purchase a VIP for a dating app? To answer these questions, we will be modeling using different methods - using linear regression, tree, xgboost, and neural network - and compare their Root Mean Square Error(RMSE) value.
The response variable in our project are
\(\textbf{counts_kisses} :\) Number of unique user accounts that “liked” (called “kiss” on the platform) this user account.
Since there are 42 predictors in our original dataset, after removing variables that are not of interest or lack of description, the followings are the rest of predictors we are considering: (There are 19 predictors, some of the binary variables are grouped)
\(\textbf{genderLooking}\)“: gender the user indicated an interest for
\(\textbf{age}\) : user age
\(\textbf{counts_pictures}\): number of pictures on the user’s profile
\(\textbf{counts_profileVisits}\): number of clicks on this user (to see his/her full profile) from other user accounts.
\(\textbf{counts_fans}\): Number of followers
\(\textbf{flirtInterests/flirtInterests_friendsflirtInterests_date}\): Whether the user indicated being open to chat with others/ be friends with others/ date with others.
\(\textbf{country}\): user’s country
\(\textbf{isFlirtstar/isMobile/isNew/isVip/verified}\): whether this user is an influencer account/ uses the mobile app/ recently created the account/ is VIP/was verified through one of the methods (Facebook, phone number, …)
\(\textbf{lang_count}\): Number of languages the user knows
\(\textbf{lang_fr/lang_en/lang_de/lang_it}\) : whether the user knows French/English/German/Italian
library(ggplot2)
library(dplyr)
library(tree)
library(visdat)
library(maptree)
library(tidyverse)
library(ggExtra)
library(ggalt)
library(GGally)
library(patchwork)
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
library(tidyverse)
library(broom)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.3
library(keras)
## Warning: package 'keras' was built under R version 4.1.3
lovoo_data_original = read_csv("lovoo_data.csv")
lovoo_data = lovoo_data_original %>% dplyr::select( -name, # we delete name because we can identify users simply by user ID.
-counts_details, # there is a lack of description of this variable
-counts_g, # the original dataset is lack of description on this variable
-location, # there are too many locations, so using location as a predictor doesn't really help
-city, # same reason as location
-isHighlighted, # it only reflects information during the fetch time, which is not stable/reliable over time.
-isOnline, # same reason as isHighlighted
-lang_es, # Only 75 users who know spanish, sample size is too small
-lang_pt, # Only 42 users who know spanish, sample size is too small
-lastOnlineDate, # Since we don't know the date when the data is fetched, lastOnlineDate isn't helpful for us
-lastOnlineTime, # same reason as lastOnlineDate
-shareProfileEnabled,
-birthd, # None of the user's birthday is today
-crypt, # lack of description of variable
-freetext, # 97% are null, hard to interpret
-whazzup, #the data under this predictor are all texts, hard to analyze
-pictureId, #Not useful since pictureID is random
-isSystemProfile, # All are system profile, so including this predictor doesn't make any differnece.)
-userId,
-distance,
-gender,
-isInfluencer ## everyone is not an influencer in this dataset
)
The following graph provides an at-a-glance ggplot of the missingness inside a dataframe.
vis_miss(lovoo_data) #provides an at-a-glance ggplot of the missingness inside a dataframe
Black indicates a missing cell and grey indicates a present cell. In the plot, since we only see all gray without any tint of black, there is no missing values in each category.
In this chunk, we aim to split the dataset into a training set and a test set. Specifically, we aim to extract around 80% of the data as training set, and the rest of 20% as test set.
set.seed(123)
# we aim to split 80% of the data to be training data
sample_size = round(0.8*nrow(lovoo_data)) # round it in case 0.8*nrow() is not an integer
train = sample(1:nrow(lovoo_data), size = sample_size, replace = FALSE)
lovoo_train = lovoo_data[train,] # the training set
lovoo_test = lovoo_data[-train,] # the rest of the data is the test set
predictor =names(subset(lovoo_train, select = -c(counts_kisses))) # remove the response variable # "counts_kisses"
predictor
## [1] "genderLooking" "age" "counts_pictures"
## [4] "counts_profileVisits" "counts_fans" "flirtInterests_chat"
## [7] "flirtInterests_friends" "flirtInterests_date" "country"
## [10] "isFlirtstar" "isMobile" "isNew"
## [13] "isVip" "lang_count" "lang_fr"
## [16] "lang_en" "lang_de" "lang_it"
## [19] "verified"
# We get all the predictor names of lovoo_train data
The above are the predictors that we keep in the dataset. In the following chunks, we are going to see the relationship between each of the predictor and the response variable “counts_kisses” one by one.
# This chunk is to visualize the gender orientation of sample. Specifically, we count how many
# females are looking for a male/female/both/none.
ggplot(lovoo_train, aes(x = genderLooking))+
geom_histogram(stat = 'count', color = 'black', fill= "#FED9B7")+
theme_classic() # theme_classic() : remove gridlines, #fill="RRGGBB" : rgb specification
From the histogram, we can see that most of females are still looking for males. There are extremly small portion of female users who are looking for females or both. Surprising, there are quite sustainable amount of users who are not looking for both females, males or both. This is probably because they do not fill in this information when registering their account.
The dataset we get from this dating app has more straight female users than LGBT ones. Thus, we know that this dataset could be more reflected on heterosexual relationship.
# This chunk is to visualize the relationship between the gender the sample is seeking for and
# the number of kisses they receive
kiss_table = table(lovoo_train$counts_kisses) # output the frequency of each unique value of "counts_kisses"
# For example, 167 people received 0 count of kisses.
ggplot(lovoo_train, aes(x = counts_kisses, fill = genderLooking)) +
geom_histogram(position = "identity", alpha = 0.5, bins = 50) +
xlim(0,2000) +
ylim(0,1000) +
ggtitle("the Count of kisses for different Gender Looking")+
xlab("Count of Kisses")+
ylab("frequency") +
theme_classic()
From the graph, we can see that both people who are looking for male and people who are looking for “none” have a similar distribution of count of kisses. Specifically, both groups have the right-skewed distribution of count of kisses. Also, for users who looking for females and both, we can’t tell their distribution from the graph. This is because they only consist of a very tiny portion of the dataset and the sample size is too small for any conclusion. Hence, we can preliminarily conclude that genderLooking may not be a very significant factor.
mean_kisses_age <- lovoo_train %>%
group_by(age) %>%
summarise_at(vars(counts_kisses), list(count_kisses_mean = mean))
mean_kisses_age$age = as.factor(mean_kisses_age$age)
mean_kisses_age %>% ggplot(aes(x = age, y = count_kisses_mean, fill = age)) +
geom_bar(stat="identity") +
ggtitle("Polar Bar plot of the mean counnt of kisses in each age") +
xlab("age") +
ylab("mean of count of kisses") +
coord_polar() +
theme_bw() + theme(axis.text.x = element_text(face = 'bold', size = 10),
axis.text.y = element_text(face = 'bold', size = 10))
Based on the graph, we can see that the mean of likes in the dating app among each age is about the same except age 27. Why is the case? Is age 27 more attractive than all the other ages? Before we reach an conclusion, we should first check the possibility that there is too few samples in age 27 to cause this happening. We could investigate this by plotting a density plot from each age.
# This chunk is to visualize the distribution of counts_kisses for each age using boxplot.
age_fac = lovoo_train
age_fac$age = as.factor(age_fac$age)
age_fac %>%
ggplot(aes(x = age, y=counts_kisses, fill = age)) +
geom_boxplot() +
theme_classic()
# output all user's information of those with age 27
filter(lovoo_train, age == 27)
## # A tibble: 1 x 20
## genderLooking age counts_pictures counts_profileV~ counts_kisses counts_fans
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 27 21 8755 661 0
## # ... with 14 more variables: flirtInterests_chat <lgl>,
## # flirtInterests_friends <lgl>, flirtInterests_date <lgl>, country <chr>,
## # isFlirtstar <dbl>, isMobile <dbl>, isNew <dbl>, isVip <dbl>,
## # lang_count <dbl>, lang_fr <lgl>, lang_en <lgl>, lang_de <lgl>,
## # lang_it <lgl>, verified <dbl>
From the box-plot, we can see that the box of age 27 is significantly higher than any of the other ages. However, we don’t see a clear box for users of age 27. This could indicate that there is only few users with age 27 to form different quantiles. Hence, we try to look at what’s happening at users with age 27. We did this by outputting all user’s information of those with age 27. Then, we found that there is only one user aged 27. Since there is only one data(only one user has age of 27), we speculate that the user with age 27 could be an outlier. If we treat age 27 as an outlier, we can see users of all age receive similar mean counts of kisses. Hence, we conclude that age could be not a significant predictor.
In the following chunk, for each group of users who looking for a specific type of gender, we want to see the relationship between Counts of Pictures and Counts of Fans
p = lovoo_train %>% ggplot() +
geom_point(aes(x = counts_pictures , y = counts_fans,
color = genderLooking, shape = genderLooking), size = 3) +
ggtitle("Scatter Plot between Counts of Pictures and Fans") +
xlab("Counts of Pictures") +
ylab("Counts of Fans") +
theme_bw() # The classic dark-on-light ggplot2 theme: looks nicer
p1 = ggMarginal(p, type = "histogram", fill = "white") #create a marginal histogram
p1 # add the marginal plot to the existing scatterplot.
From the scatter plot, we can roughly see that most points gathered in counts of picture below 15. To look more specifically, we look at the histogram: we can see that is highly right-skewed, and we can see that most points gathered in counts of picture below 10. Hence, we can conclude that most users do not have more than 10 pictures.
From the scatter plot, we can easily see that most points gathered in counts of fans below 150. More specifically, by looking at the histogram, we can see that most users have counts of fans less than approximately 20. Hence, we can conclude that most uses do not have many fans.
Also, from the scatter plot, we don’t see a distinguishable different pattern of distribution of points for people who are looking for different genders.That is, the “plus” sign and the square in the plot have roughly similar distribution(right skewed). Hence, users who are looking for male and users who are looking none have similar property of relationship in terms of relationship between counts of pictures and counts of fans.
Similarly, in the following chunk, for each range nodes(0, 2500, 5000, 7500) of counts_kisses, we want to see the relationship between Counts of Pictures and Counts of Fans
lovoo_train %>% ggplot() +
geom_point(aes(x = counts_pictures, y = counts_fans, size = counts_kisses)) +
ggtitle("Scatter Plot between Counts of Pictures and Counts of kisses") +
xlab("Counts of Pictures") +
ylab("Counts of fans") +
theme_bw()
If more pictures and more counts of fans are correlated to more counts_kisses, we should expect bigger points gathered on the top right of the scatter plot and expect small points gathered on the bottom left corner. However, there is no such an indication since small points are almost everywhere. Hence, from the scatter plot, we don’t see distinguishably different pattern of distribution of points for people who have different number of kisses.
Hence, our preliminary conclusion is the more fans and more pictures doesn’t guarantee more counts of kisses.
In addition, one observation from the scatter plot is that most points of kisses gathered in the bottom of the picture. Does that mean users who have less fans have more kisses? Since it’s hard to tell the relationship between each two variables from counts_kisses, counts_pictures, counts_fans from the above scatter plot, we will use the following chunk to show correlations in a clearer way.
pic_fan_kiss= lovoo_train[, c("counts_kisses","counts_fans","counts_pictures")]
# find the correlation between counts_pictures,counts_fans,counts_kisses
cor(pic_fan_kiss[, unlist(lapply(pic_fan_kiss,is.numeric))]) # output the correlation table
## counts_kisses counts_fans counts_pictures
## counts_kisses 1.0000000 0.13761903 0.35826728
## counts_fans 0.1376190 1.00000000 0.09570505
## counts_pictures 0.3582673 0.09570505 1.00000000
# output the panel with correlation between each combination of two variables.
pairs.panels(pic_fan_kiss)
#library(psych)
From the output graph, we can see that the correlation between counts_kisses and counts_fans is 0.14, correlation between counts_kisses and counts_pictures is 0.36, correlation between counts_fans and counts_pictures is 0.10. Hence, we can’t say there is a negative correlation between counts_fans and counts_kisses. There is slightly positive correlation for each two of the variables. However, the correlation is negligibly small. We may conclude that these three variables do not have significant effect on count of kisses.
lovoo_train %>% ggplot(aes(x = counts_profileVisits, y = counts_kisses)) +
geom_point() +
geom_smooth(method = "lm", alpha = .15) + #overplot the linear relationship
ggtitle("Scatter Plot between count_profileVisits and Counts of kisses") +
xlab("Counts of profile visits") +
ylab("Counts of kisses") +
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
From the scatter plot, we can see a clear positive linear relationship between counts of profile visits and counts of kisses. That is, the more the profile is visited by other users, the more “kisses” the user will receive.
Since we see a clear positive relationship, to verify, we want to know what is the correlation coefficient. Hence, we use the following chunk.
kiss_visit= lovoo_train[, c("counts_kisses", "counts_profileVisits")]
# find the correlation between counts_pictures,counts_fans,counts_kisses
cor(kiss_visit[, unlist(lapply(kiss_visit,is.numeric))]) # output the correlation table
## counts_kisses counts_profileVisits
## counts_kisses 1.0000000 0.8974845
## counts_profileVisits 0.8974845 1.0000000
# output the panel with correlation between each combination of two variables.
pairs.panels(kiss_visit)
We can see that the correlation coefficient between counts_kisses and counts_profileVisits is as high as 0.90! This is a strong positive relationship.
However, we see an outlier at the top right corner. We want to report this user.
filter(lovoo_train, counts_kisses >7500)
## # A tibble: 1 x 20
## genderLooking age counts_pictures counts_profileV~ counts_kisses counts_fans
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 23 16 164425 9288 0
## # ... with 14 more variables: flirtInterests_chat <lgl>,
## # flirtInterests_friends <lgl>, flirtInterests_date <lgl>, country <chr>,
## # isFlirtstar <dbl>, isMobile <dbl>, isNew <dbl>, isVip <dbl>,
## # lang_count <dbl>, lang_fr <lgl>, lang_en <lgl>, lang_de <lgl>,
## # lang_it <lgl>, verified <dbl>
Since there is an outlier, we want to remove the outlier and see again the linear coefficient.
lovoo_train_nooutlier = lovoo_train %>% filter(counts_kisses !=9288) # remove outlier
kiss_visit1= lovoo_train_nooutlier[, c("counts_kisses", "counts_profileVisits")]
# output the panel with correlation between each combination of two variables.
pairs.panels(kiss_visit1)
After we remove the outlier, we see the linear coefficient dropped from 0.9 to 0.88, which is still a strong linear correlation. Hence, our conclusion is unchanged: the more number of times the profile is visited, the higher number of kisses a user receives.
In the following chunks, we want to see the relationship between flirtInterests and counts_kisses. Since there are 3 variables below flirtInteretsts - flirtInterests_chat, filrtInterests_friends,flirtInterests_date - we will see the relationship between counts_kisses and each of the category below flirtInterests.
In the following chunks, TRUE means that the user indicated being open to the corresponding interest. For example, TRUE in flirtInterests_friends means the user indicated being open to making friends.
# Instead of putting 3 separate boxplots we want to combine all 3 categories together in one
#plot.
#we put together all data of 3 categories in one data frame.
df_flirt = data.frame(x = c(lovoo_train$flirtInterests_chat, lovoo_train$flirtInterests_friends, lovoo_train$flirtInterests_date),
y = rep(lovoo_train$counts_kisses,3))
# there are 9582 rows in total,
# the first 1/3 of rows corresponding to coordinates with x=flirtInterets_chat, y = counts_kisses
# the second 1/3 of rows corresponding to coordinates with x=flirtInterets_friends, y =
# counts_kisses
# the third 1/3 of rows corresponding to coordinates with x=flirtInterets_date, y = counts_kisses
#Now we plot them out
df_flirt %>%
ggplot(aes(x = rep(c("interest_chat", "interest_friends", "interest_date"), each = 3194), y=y,
fill = x)) +
labs(x = "Category of Flirt Interests", y = "counts_kisses") +
geom_boxplot( )
From the box plot above, we don’t see a obvious distinction among the 3 categories of flirt interest. However, we do see an interesting observation: the one with most kisses turn all the three categories on. That is, the user who received more than 7500 kisses(the) is open to all chat, date, and being friends.
To have a clearer view, we zoom in to see the box part to see if there is any difference in the mean/lower quantile/upper quantile of the 3 boxes.
## zoom on
df_flirt = data.frame(myx = c(lovoo_train$flirtInterests_chat, lovoo_train$flirtInterests_friends, lovoo_train$flirtInterests_date),
myy = rep(lovoo_train$counts_kisses,3))
df_flirt %>%
ggplot(aes(x = rep(c("interest_chat", "interest_friends", "interest_date"), each = 3194), y=myy, fill = myx)) +
labs(x = "Category of Flirt Interests", y = "counts_kisses") +
ylim(0,500)+
geom_boxplot( )
From the plots, we can see that all three pairs of boxplots of each category of Flirtinterest have similar mean, lower, and upper quantile. Thus, we conclude that interest on dating, flirt or friends does not change the count of kisses significantly even though it affects the recommendation algorithms in this dating app. (Maybe this is a warning for engineerors to fix the recommendation algorithms for this app).
# We first get a mean for each country
mean_kisses_country <- lovoo_train %>%
group_by(country) %>% # now operations are perfomed by group/country
summarise_at(vars(counts_kisses), list(count_kisses_mean = mean))
# Plot the mean for each country
mean_kisses_country %>%
ggplot(aes(x=country, y = count_kisses_mean, fill = country)) +
geom_bar(stat = "identity") +
coord_flip() # We want to be able to read labels better
From the plots,we can see that the country “BR” has the highest count of kisses average among all. It has a mean of about 2000 kisses. For the other countries, most of the other countries have a mean of less than 500 kisses. Other than country “BR”, the only 3 countries that have a mean number of kisses higher than 500 are “US”,“HU”, and “CA”. Hence, we may want to pay more attention on these three countries in our future analysis.
Since it is possible that some countries may have a high mean simply because they have a small sample size and each sample has a high count of kisses. To get a better understanding of the mean of countries, we want to see the distribution of users’s country. Specifically, we will assign a count of frequency to each country.
# get the number of users from each country
country_count <- lovoo_train %>% count(country)
# for example, country "AR" has only 1 user
country_count %>%
ggplot(aes(x=country, y = n, fill = country)) +
labs (y="number of users") +
geom_bar(stat = "identity") + # to see the number of cases for each country
coord_flip() # to see the plot better
From the plot, we can see that country”CH” has the most number of users. From all the countries, only four countries “FR”, “DE”, “CH”, and IT” show a high number of users(more than 100 users). Hence, we want to put more attention on these four countries.
Also, while country “BR” has the highest mean of kisses, it doesn’t have many users. Because of the small sample size, we can’t conclude that users from Brazil are more or the most attractive.
Interestingly, while countries “DE” and “CH” have a similar large number of users(more than 1000), we can see from the last plot that users from country “CH” have approaximately 100 more kisses on average than users from country “DE”. Also, being the third largest country in terms of number of users, “FR” don’t have a high mean number of kisses.
Note that from last plot, we saw “BR”, “US”,“HU”, and “CA” have a high mean count of kisses. However, there aren’t many users from any of these countries in this dating app. This leads to a question: Does a popular country imply less like?
The following chunk is to see the correlation between the number of users in a country and the mean number of kisses.
country_count[,2]
## # A tibble: 28 x 1
## n
## <int>
## 1 1
## 2 18
## 3 1
## 4 3
## 5 5
## 6 2
## 7 1
## 8 1
## 9 1324
## 10 1
## # ... with 18 more rows
lovoo_train_countcountry= cbind(mean_kisses_country, country_count[,2])
kiss_countrycount= lovoo_train_countcountry[, c("count_kisses_mean", "n")]
# find the correlation between country_count and counts_kisses
#and output the correlation table
cor(kiss_countrycount[, unlist(lapply(kiss_countrycount,is.numeric))])
## count_kisses_mean n
## count_kisses_mean 1.0000000 -0.1191468
## n -0.1191468 1.0000000
# output the panel with correlation between each combination of two variables.
pairs.panels(kiss_countrycount)
# #Method 2
# # We first get a mean for each country
# mean_kisses_country1 <- lovoo_train %>%
# group_by(country) %>% # now operations are perfomed by group/country
# summarise_at(vars(counts_kisses), list(n=n(),count_kisses_mean = mean))
From the correlation coefficient of -0.12, we can see that there is a negative correlation between the number of users in a country and the mean number of kisses. However, since the coefficient is small, the negative correlation is negligible. We conclude that there is no negative correlation between the number of users in a country and the mean number of kisses.
To have a better understanding of the distribution of user’s country of origin, we append a map here. The following chunk is to show the map of Europe with the names of user’s country on it. Moreover, we highlighted the 3 countries with the most number of users.
library(imager)
im <- load.image("Highlighted_Map.png")
plot(im)
Note the popular countries are from all from mid Europe, which shows that this dating app may originated from mid Europe. However, since they don’t receive a high mean of kisses, it is possible that users from mid Europe are receiving less likes than others, and people from other countires may be more attracitve.
In this section, we want to explore the relationship between 5 binary properties of user and the counts of kisses the user received. Specifically, the binary properties are the followings: 1) isFlirstar, 2)isMobile, 3)isNew, 4) isVip, 5) verfied. Under these properties, TRUE means “yes, the user is.” while FALSE means the user is not belonging to this category.
We explore the relationship by plotting the boxplot for each of these binary categories with respect to counts of kisses. Specifically, we look at their means, upper quantile, and lower quantile to judge if there is any binary property would be special.
binary_data = lovoo_train
binary_data = binary_data %>%
mutate(isFlirtstar = ifelse(isFlirtstar == 1, "TRUE", "FALSE")) %>%
mutate(isMobile = ifelse(isMobile == 1, "TRUE", "FALSE")) %>%
mutate(isNew = ifelse(isNew == 1, "TRUE", "FALSE")) %>%
mutate(isVip = ifelse(isVip == 1, "TRUE", "FALSE")) %>%
mutate(verified = ifelse(verified == 1, "TRUE", "FALSE"))
df_binary = data.frame(x = c(binary_data$isFlirtstar, binary_data$isMobile, binary_data$isNew, binary_data$isVip, binary_data$verified),
y = rep(binary_data$counts_kisses, 5))
df_binary %>%
ggplot(aes(x = rep(c("is_flirt_star", "is_mobile", "is_new", "is_vip", "is_verified"), each = 3194), y=y, fill = x)) +
ylim(c(0, 100))+
xlab("Binary Class Columns") +
ylab("count of kisses") +
geom_boxplot()
## Warning: Removed 5155 rows containing non-finite values (stat_boxplot).
From the plots, we can tell that: 1) Whether or not the user being a flirt_star would not influence the count of kisses they receive much, since boxplots for “FALSE” and “TRUE” have very similar mean, upper quantile, and lower quantile.
Whether or not the user being a mobile user would not influence the count of kisses they receive much, since boxplots for “FALSE” and “TRUE” have very similar mean, upper quantile, and lower quantile.
Whether or not the user account is recently created could have an impact on counts of kisses users receive. Specifically, we can see that user’s accounts that are recently created have a much lower mean, lower quantile, and upper quantile of counts of kisses received than that of accounts not recently created. It is possible that not being a recent account is associated with more counts of kisses. However, this possibility needs more consideration because it is possible that new users haven’t exposed to enough other users to receive a high number of kisses.
Whether or not the user account is verifed could have an impact on counts of kisses users receive. Specifically, we can see that user’s accounts that are verified have a higher mean, lower quantile, and upper quantile of counts of kisses received than that of accounts that are not verified. It is possible that being verified is associated with more counts of kisses. This can make intuitive sense because people may tend to not like people who can be considered as fake, so people who are verified win more kisses.
Whether or not the user account is VIP could have an impact on counts of kisses users receive. Specifically, we can see that user’s accounts that are VIP have a higher mean, lower quantile, and upper quantile of counts of kisses received than that of accounts that are not VIP. It is possible that being an VIP is associated with more counts of kisses. This makes intuitive sense because VIP account may have more benefits in terms of harvesting kisses.
Despite that some observations make intuitive sense, the differences in the last 3 pairs of boxplot can be caused by large sample variance. For example, if one of the “TRUE” or “FALSE” category has a very small sample size compared to the other, the sample variance can be big and this can affect the mean, lower quantile, and upper quantile of the box. Hence, we want to see the sample size distribution of each category of “TRUE” and “FALSE” in is_new, is_verified, is_vip. If the difference in the pair of boxplot is due to sample variance, we will withhold giving an conclusion on that independent variable and proceed to further analysis.
To see if this result is caused by sample variance, we draw pie charts to see the distribution of “TRUE” and “FALSE” on isFlirtstar,
# Extract only the columns containing binary variables
pie_data = binary_data %>% select(c(isFlirtstar, isMobile, isNew, verified, isVip))
# First, we extract the counts of "FALSE" and "TRUE" for each binary variable
# We use table function in summarise_all function to output a tibble of counts for each variable
pie_data = pie_data %>%
summarise_all(table)
# Convert the tibble to data frame so that we can manipulate the data
pie_data_df = data.frame(pie_data)
# Now we want to plot Pie Charts for each of the binary classes
main = c("Percentgaes of Flirtstar", "Percentages of Mobile Ssers", "Percentages of New Users",
"Percentages of Verified Users", "Percentages of VIP Users")
# We want to put multilple charts together in the following way:
# First row: 2 charts with binary variables which the pair of boxplots show no major difference
#(i.e. isFlirtstar, isMobile)
# Second row: 3 charts with binary variables which the pair of boxplots show difference
#(e.g. isNEW, verified, isVIP)
layout(matrix(c(1,1,1,2,2,2,3,3,4,4,5,5), 2,6, byrow = TRUE))
#Use a for loop to plot all the pie charts
for (i in 1:5){
counts_= pie_data_df[,i]
percent_ = round(100*counts_/sum(counts_), 1)
percent_ = paste(percent_,"% (", counts_,")" ,sep = "")
pie(counts_, labels = percent_, col = rainbow(length(counts_)),
main = main[i])
legend("topright", c("FALSE","TRUE"), cex = 0.8,
fill = rainbow(length(counts_)))
}
To make sure observations (3) - (5) are not due to sample variance. We look at the second row of pie charts. From the pie charts, we can see that whether or not users created the account recently(Is new user or not) could have impact on the number of kisses they receive. This is because the sample size of both FALSE and TRUE are high(>700) Similarly, whether or not users are verified could have impact on the number of kisses they receive. However, whether or not users are VIP could have impact on the number of kisses is not conclusive, because the sample size of VIP users is too small compared to that of not VIP users.
# get the range of count_kisses for each count of languages
lang_count_df = lovoo_train %>%
group_by(lang_count) %>%
summarize(kiss_min = min(counts_kisses), kiss_max = max(counts_kisses))
#create a clustered bar chart using geom_dumbbell()
lang_count_df %>% ggplot() +
geom_dumbbell(aes(x = kiss_min, xend = kiss_max,
y = lang_count, group = lang_count),
colour = 'grey', size = 4,
colour_x = 'green', colour_xend = 'blue') +
ggtitle("Dumbbell Plot min_kiss to max_kiss") +
xlab('Counts of Kisses Range') +
ylab('Language Count') +
theme_bw() +
theme(axis.text.x = element_text(face = 'bold', size = 10),
axis.text.y = element_text(face = 'bold', size = 10))
It seems that the the most popular users are bilinguals. It is interesting that people who know four language have similar range as people who put “0” on their language count. The sample size of people who know six and seven languages are too few to draw any conclusions. Based on the graph, it seems that once you know more than two languages, you are no longer that attractive in dating app by learning more languages.
Since the language is highly correlated to the country, so we would like to make a heatmap on both country and language.
# Get a data frame grouped by country and count of language.
# For each combination of country and count of language, we calculated a kiss_mean.
country_lang_df = lovoo_train %>%
group_by(country, lang_count) %>%
summarize(kiss_mean = mean(counts_kisses))
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.
# Use geom_tile function to plot heatmap
ggplot(country_lang_df, aes(x = country, y = lang_count, fill = kiss_mean)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed(ratio = 1) + # ratio =1 looks the best
theme_classic() #remove gridlines
From the heatmap, we can see an very obvious red square. This red square corresponds to users whose country is Brazil number of language known is 2. Specifically, while this red square has a mean of kisses 4000+, all the other squares have less than 1000 mean number of kisses. We want to study what happened and why so. Hence, we want to report information of all the Brazil users who know two languages.
# we report the the user in Brazil who corresponds to the redest sqaure in the heatmap.
print(filter(lovoo_data_original,country == "BR", lang_count == 2))
## # A tibble: 1 x 42
## gender genderLooking age name counts_details counts_pictures
## <lgl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 FALSE M 21 Gabriella 1 30
## # ... with 36 more variables: counts_profileVisits <dbl>, counts_kisses <dbl>,
## # counts_fans <dbl>, counts_g <dbl>, flirtInterests_chat <lgl>,
## # flirtInterests_friends <lgl>, flirtInterests_date <lgl>, country <chr>,
## # city <chr>, location <chr>, distance <dbl>, isFlirtstar <dbl>,
## # isHighlighted <dbl>, isInfluencer <dbl>, isMobile <dbl>, isNew <dbl>,
## # isOnline <dbl>, isVip <dbl>, lang_count <dbl>, lang_fr <lgl>,
## # lang_en <lgl>, lang_de <lgl>, lang_it <lgl>, lang_es <lgl>, ...
From the output data, we can see that this red square corresponds to just one user in Brazil. Hence, this user is also an influencer to the data.
It turned out that this user is a muscian who has 79.2K number of followers on Instagram. Based on her instagram, we see that the user in Brazil who has the highest number of kisses is a verified musician (Instagram: gabriiellafreiitas).
instagram <- load.image("Insta.png")
plot(instagram)
There are some other observations based on the heatmap. Since “CH”, “DE” ,“FR”, “IT” are the four countries with the most number of users. We want to study them on the heatmap more. The observations from the heatmap are:
For the country “IT” (Italy): users know 4 or 5 are more popular
For the country “CH” : (people with 5 language most popular)
For the country “DE”(people with 5 language most popular)
For the country “FR”(people with 5 language most popular)
Overall, we don’t see any obvious trend in the heatmap, but only some subtle differences.
Since this specific user is very special, we decide to remove this user and plot the heat plot again; otherwise, all other squares in the heatmap are too faint compared to this specific square corresponding to this special user.
# We filtered out the data which country == BR and lang_count == 2
# Be careful with the logical statment filter NOT (A AND B) = filter (A OR B)
country_lang_df1 = filter(country_lang_df, country!="BR" | lang_count !=2)
#Heatmap without the musician
ggplot(country_lang_df1, aes(x = country, y = lang_count, fill = kiss_mean)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "red") +
coord_fixed() +
theme_classic()
From the new heatplot, by looking at the 4 countries with the most users-“CH”, “DE”, “FR”, “IT”(ordered by number of users, from higher to lower), we have the following observations:
2)DE: Users with 5 languages are the most popular. It seems that more languages are correlated with more kisses.
3)FR: Users with 3 languages are the most popular. Users with 4 languages are not as popular as users with 3 languages, so knowing more languages doesn’t guarantee more kisses. However, this subjects to doubt because we only have 4 users who are from French and know 4 languages, so the sample size it too small to draw a conclusion
# To see how many users are from french and can speak 4 languages
filter(lovoo_train, country == "FR", lang_count == 4)
## # A tibble: 4 x 20
## genderLooking age counts_pictures counts_profileV~ counts_kisses counts_fans
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 M 20 3 592 4 0
## 2 M 23 3 2180 53 0
## 3 M 25 6 1476 77 0
## 4 M 19 1 399 9 0
## # ... with 14 more variables: flirtInterests_chat <lgl>,
## # flirtInterests_friends <lgl>, flirtInterests_date <lgl>, country <chr>,
## # isFlirtstar <dbl>, isMobile <dbl>, isNew <dbl>, isVip <dbl>,
## # lang_count <dbl>, lang_fr <lgl>, lang_en <lgl>, lang_de <lgl>,
## # lang_it <lgl>, verified <dbl>
Overall, though more languages correlates more kisses in country IT, this trend doesn’t happen in other countries. However, though knowing more languages doesn’t guarantee higher mean number of kisses in general, we can see that the highest mean kisses are always seen in users who speak no less than 2 languages. Hence, mastering no less than 2 languages is a good tip for boosting attractiveness!
Since most users are from countries “CH”,“DE”, and “FR”, we now want to highlight the data from these three countries by separating them from other countries. As such, we can distinguish people who are from popular countries and other countries.
# case_when function allows you to vectorise multiple if and else if statments
lovoo_train = lovoo_train %>%
mutate(country = case_when(country == "CH" ~ "CH",
country == "DE" ~ "DE",
country == "FR" ~ "FR",
country != "CH" | country != "DE" | country != "FR" ~ "Other"
))
table(select(lovoo_train, country)) # too se the distribution of contries after manipulation
##
## CH DE FR Other
## 1324 1166 531 173
The following chunk is to see if mastering a specific language could affect number of kisses users receive. Specifically, we want to see if being able to speak French/English/ Italian/German could affect distribution of number of kisses.
library(patchwork)
lang_fr <- lovoo_train %>% ggplot(aes(x = counts_kisses, fill = lang_fr)) +
geom_density(alpha = 0.7) +
xlim(c(0, 1000)) +
theme_classic()
lang_en <- lovoo_train %>% ggplot(aes(x = counts_kisses, fill = lang_en)) +
geom_density(alpha = 0.7) +
xlim(c(0, 1000)) +
theme_classic()
lang_it <- lovoo_train %>% ggplot(aes(x = counts_kisses, fill = lang_it)) +
geom_density(alpha = 0.7) +
xlim(c(0, 1000)) +
theme_classic()
lang_de <- lovoo_train %>% ggplot(aes(x = counts_kisses, fill = lang_de)) +
geom_density(alpha = 0.7) +
xlim(c(0, 1000)) +
theme_classic()
(lang_fr | lang_en) / (lang_it | lang_de)
From the plots above, we can see that the distribution of not knowing languages French(lang_fr), Italian(lang_it), and German(lang_de) are almost overlaping with that of knowing the languages. Hence, we can conclude that whether of not knowing French/Italian/German does not impact the number of kisses users receive.
However, from the plot of lang_en, we can see that the count_kisses distribution for people who do not know English is more right skewed than that for people who know English. That is, people who do not know English on average receive less number of kisses! Hence, whether or not the user knows English could impact the user’s attractiveness.
Thus, our conclusion will be that language seems to have no impact except for English.
Now, we start to build our model!The first model we build is through linear regression.
We first build a linear regression model using all all the 19 predictors to see how the model performs.
linear.fit = lm(counts_kisses~., data = lovoo_train)
summary(linear.fit)
##
## Call:
## lm(formula = counts_kisses ~ ., data = lovoo_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1017.1 -30.0 11.8 39.3 3584.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.289e+01 5.551e+01 -1.493 0.1355
## genderLookingF 5.535e+01 5.496e+01 1.007 0.3139
## genderLookingM 2.662e+01 4.011e+01 0.664 0.5069
## genderLookingnone 3.876e+01 4.114e+01 0.942 0.3462
## age 2.241e+00 1.582e+00 1.417 0.1567
## counts_pictures 2.756e-01 7.689e-01 0.358 0.7201
## counts_profileVisits 5.125e-02 4.957e-04 103.390 < 2e-16 ***
## counts_fans 1.258e-02 2.521e-01 0.050 0.9602
## flirtInterests_chatTRUE -4.388e+00 6.339e+00 -0.692 0.4888
## flirtInterests_friendsTRUE -7.279e+00 6.463e+00 -1.126 0.2601
## flirtInterests_dateTRUE 2.360e-01 6.420e+00 0.037 0.9707
## countryDE -3.923e+00 7.691e+00 -0.510 0.6100
## countryFR -1.153e+01 1.100e+01 -1.048 0.2947
## countryOther -3.399e-01 1.687e+01 -0.020 0.9839
## isFlirtstar 2.339e+01 2.628e+01 0.890 0.3735
## isMobile -5.806e+00 6.990e+00 -0.831 0.4062
## isNew 3.876e+01 7.539e+00 5.141 2.90e-07 ***
## isVip 1.667e+01 1.998e+01 0.835 0.4040
## lang_count 2.737e+01 1.081e+01 2.532 0.0114 *
## lang_frTRUE -1.036e+01 1.669e+01 -0.621 0.5350
## lang_enTRUE -3.656e+01 2.004e+01 -1.824 0.0682 .
## lang_deTRUE -6.832e+01 1.629e+01 -4.195 2.81e-05 ***
## lang_itTRUE -3.807e+01 2.079e+01 -1.831 0.0672 .
## verified -1.355e+01 7.579e+00 -1.788 0.0738 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168 on 3170 degrees of freedom
## Multiple R-squared: 0.8121, Adjusted R-squared: 0.8108
## F-statistic: 595.8 on 23 and 3170 DF, p-value: < 2.2e-16
We can see that the adjusted R-squared is 0.8108. That is, there is 81.08% of the variation in the dependent variable that is predictable from the independent variable, which is very high. Hence linear regression using all predictors could be a good model.
However, from the data visualizations, we reached some preliminary conclusions on the effect of the following predictors: “counts_profileVisits”, “country”, “isNew”, “verified”. Hence, we fit a new linear regression model in the following chunk using only these 4 predictors.
linear.fit1 = lm(counts_kisses ~ counts_profileVisits + country + isNew + verified, data = lovoo_train)
summary(linear.fit1)
##
## Call:
## lm(formula = counts_kisses ~ counts_profileVisits + country +
## isNew + verified, data = lovoo_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1045.1 -25.0 11.4 36.4 3578.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.450e+01 5.641e+00 -6.116 1.08e-09 ***
## counts_profileVisits 5.101e-02 4.581e-04 111.350 < 2e-16 ***
## countryDE -2.192e+01 6.894e+00 -3.180 0.00149 **
## countryFR 2.293e+01 8.747e+00 2.622 0.00879 **
## countryOther 1.418e+01 1.367e+01 1.037 0.29967
## isNew 4.064e+01 7.413e+00 5.482 4.53e-08 ***
## verified -1.946e+01 7.377e+00 -2.638 0.00837 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168.8 on 3187 degrees of freedom
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.8089
## F-statistic: 2254 on 6 and 3187 DF, p-value: < 2.2e-16
The new linear regression model using only 4 predictors have a adjusted R-squared of 0.8089. That is, 80.89% of the variation in the dependent variable that is predictable from the independent variable. Though there is slight drop of adjusted R-squared value from 0.8109 to 0.8089, based on the principle of parismony, we choose the second model if we do not remove any outliers.
lovoo_test = lovoo_test %>%
mutate(country = case_when(country == "CH" ~ "CH",
country == "DE" ~ "DE",
country == "FR" ~ "FR",
country != "CH" | country != "DE" | country != "FR" ~ "Other"
))
lm.test_pred = predict(linear.fit1, newdata = lovoo_test)
## change all the negative values to zero, because it does not make sense if we have negative kisses
for (i in 1:length(lm.test_pred)){
if (lm.test_pred[i] < 0){
lm.test_pred[i] = 0
}
}
lm.train_pred = predict(linear.fit1, newdata = lovoo_train)
for (i in 1:length(lm.train_pred)){
if (lm.train_pred[i] < 0){
lm.train_pred[i] = 0
}
}
## RMSE
sqrt(mean((lovoo_train$counts_kisses - lm.train_pred)^2))
## [1] 167.9899
sqrt(mean((lovoo_test$counts_kisses - lm.test_pred)^2))
## [1] 195.8608
For our chosen linear regression model, the training RMSE is 167.9899 and test RMSE is 195.8608. Since our scale is thoudands, the RMSE looks pretty nice.
Since there are some outliers: we speculate that by removing the outliers, we can get a higher adjusted R-squared value. We want to see if removing outliers would give us a lower RMSE or higher adjusted R-squared value.
The following chunk is to detect outliers:
# helper function
studentize_fn <- function(resid, n, p){
resid*sqrt((n - p - 1)/(n - p - resid^2))
}
# model matrix
x_mx <- model.matrix(linear.fit1)
# calculate case influence statistics
fit_df <- augment(linear.fit1, lovoo_train) %>%
mutate(obs_index = row_number(),
.ext.std.resid = studentize_fn(.std.resid,
n = nrow(x_mx),
p = ncol(x_mx) - 1))
# residuals vs. index
fit_df %>%
ggplot(aes(x = obs_index, y = .ext.std.resid)) +
geom_point() +
geom_hline(yintercept = 0, col = "red")+
ggtitle("Outlyingness")+
theme(plot.title = element_text(hjust = 0.5))
From the outlyingness plot, we can see that there are 4 points has
external studentized residual more than 10, we consider these users are
outliers. Hence, we want to identify them and remove them.
# Find the 4 outliers by using the largest four ext.std.resid
fit_df %>%
slice_max(order_by = .ext.std.resid, n = 4) %>%
select(.ext.std.resid)
## # A tibble: 4 x 1
## .ext.std.resid
## <dbl>
## 1 23.1
## 2 16.2
## 3 12.3
## 4 11.0
#Remove the 4 outliers from training data
outlier_id = lovoo_data_original[c(78,164,63,155),] %>% select(userId)
lovoo_train1 = lovoo_data_original[train, ]
lovoo_train2 = filter(lovoo_train1, userId != "50f9099d90c8fd99540000e6",
userId!="51a87b1c150ba0352d0000a3", userId!="50da3a797acab1f041000000", userId !="51989e08c86da19a700003f5")
lovoo_train2 = lovoo_train2 %>% select(-name, # we delete name because we can identify users simply by user ID.
-counts_details, # there is a lack of description of this variable
-counts_g, # the original dataset is lack of description on this variable
-location, # there are too many locations, so using location as a predictor doesn't really help
-city, # same reason as location
-isHighlighted, # it only reflects information during the fetch time, which is not stable/reliable over time.
-isOnline, # same reason as isHighlighted
-lang_es, # Only 75 users who know spanish, sample size is too small
-lang_pt, # Only 42 users who know spanish, sample size is too small
-lastOnlineDate, # Since we don't know the date when the data is fetched, lastOnlineDate isn't helpful for us
-lastOnlineTime, # same reason as lastOnlineDate
-shareProfileEnabled,
-birthd, # None of the user's birthday is today
-crypt, # lack of description of variable
-freetext, # 97% are null, hard to interpret
-whazzup, #the data under this predictor are all texts, hard to analyze
-pictureId, #Not useful since pictureID is random
-isSystemProfile, # All are system profile, so including this predictor doesn't make any differnece.)
-userId,
-distance,
-gender,
-isInfluencer ## everyone is not an influencer in this dataset
)
lovoo_train2 = lovoo_train2 %>%
mutate(country = case_when(country == "CH" ~ "CH",
country == "DE" ~ "DE",
country == "FR" ~ "FR",
country != "CH" | country != "DE" | country != "FR" ~ "Other"
))
#Now we fit linear regression model without 4 outliers
linear.fit2 = lm(counts_kisses ~ counts_profileVisits + country + isNew + verified, data = lovoo_train2)
summary(linear.fit2)
##
## Call:
## lm(formula = counts_kisses ~ counts_profileVisits + country +
## isNew + verified, data = lovoo_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -950.82 -25.08 8.23 30.80 1649.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.623e+01 4.693e+00 -5.589 2.48e-08 ***
## counts_profileVisits 4.814e-02 3.888e-04 123.810 < 2e-16 ***
## countryDE -1.699e+01 5.731e+00 -2.965 0.003049 **
## countryFR 2.428e+01 7.269e+00 3.341 0.000846 ***
## countryOther 1.915e+01 1.136e+01 1.685 0.092002 .
## isNew 3.052e+01 6.167e+00 4.949 7.84e-07 ***
## verified -2.260e+01 6.136e+00 -3.684 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 140.3 on 3183 degrees of freedom
## Multiple R-squared: 0.8401, Adjusted R-squared: 0.8398
## F-statistic: 2788 on 6 and 3183 DF, p-value: < 2.2e-16
After removing outliers, we can see that the R-squaerd improved from 0.81 to 0.8398. Then, we calculate RMSE again.
lm.test_pred1 = predict(linear.fit2, newdata = lovoo_test)
## change all the negative values to zero, because it does not make sense if we have negative kisses
for (i in 1:length(lm.test_pred1)){
if (lm.test_pred1[i] < 0){
lm.test_pred1[i] = 0
}
}
lm.train_pred1 = predict(linear.fit2, newdata = lovoo_train2)
for (i in 1:length(lm.train_pred1)){
if (lm.train_pred1[i] < 0){
lm.train_pred1[i] = 0
}
}
## RMSE
sqrt(mean((lovoo_train2$counts_kisses - lm.train_pred1)^2))
## [1] 139.7012
sqrt(mean((lovoo_test$counts_kisses - lm.test_pred1)^2))
## [1] 192.871
Training RMSE decreases from 167.9899 to 139.7012 and test RMSE decreases from 195.8608 to 192.871 after we remove the 4 outliers. Hence, our model is being more accurate.
However, notice that test root mean square error does not imporve a lot. I would say that removing outlier does not improve accuracy significantly. Thus, we decided to remain it for latter model fitting.
## tree method needs to have characters columns be factor
cols = c("genderLooking","flirtInterests_chat", "flirtInterests_friends", "flirtInterests_date", "country",
"isFlirtstar", "isMobile", "isNew", "isVip", "lang_fr", "lang_en","lang_de", "lang_it", "verified" )
lovoo_train = lovoo_train %>%
mutate_at(cols, funs(factor(.)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
lovoo_test = lovoo_test %>%
mutate_at(cols, funs(factor(.)))
tree.fit = tree(counts_kisses~., data =lovoo_train)
draw.tree(tree.fit, cex = 0.8, size = 2, digits = 3)
summary(tree.fit)
##
## Regression tree:
## tree(formula = counts_kisses ~ ., data = lovoo_train)
## Variables actually used in tree construction:
## [1] "counts_profileVisits" "counts_pictures"
## Number of terminal nodes: 8
## Residual mean deviance: 31400 = 1e+08 / 3186
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2759.0 -27.1 -13.1 0.0 18.9 4812.0
Based on the tree diagram, we can see this tree is highly dependent on the variable counts_profileVisits. The results from tree model aligns with our linear regression model. Hence, counts_profileVisits is the most important fatcor in predicting attraction!
Then, we want to see the training RMSE and test RMSE value for tree model.
tree_train_pred = predict(tree.fit, newdata = lovoo_train)
tree_test_pred = predict(tree.fit, newdata = lovoo_test)
## RMSE of tree model
sqrt(mean((lovoo_train$counts_kisses - tree_train_pred)^2)) # training RMSE
## [1] 176.9816
sqrt(mean((lovoo_test$counts_kisses - tree_test_pred)^2)) # test RMSE
## [1] 265.4276
We can see that this tree does worse than linear regression. (report values) Which is what we expected, because in Linear regression, we have a high R-square value, and tree usually cannot handle huge dataset, especailly when it comes to regression. we do not do tuning or pruning this tree because we are more interested in how random forest or boosting is performing for this dataset.
In this section, we are going to train an extremely popular algorithm called XGBoost.
require(xgboost)
## Loading required package: xgboost
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
x.train = data.matrix(select(lovoo_train, -counts_kisses))
y.train = data.matrix(lovoo_train$counts_kisses)
x.test = data.matrix(select(lovoo_test, -counts_kisses))
y.test = data.matrix(lovoo_test$counts_kisses)
xgbcv <- xgb.cv(data = x.train, label = y.train, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 10, early_stop_round = 50, maximize = F, seed = 123, verbose = 0)
min_loss = min(xgbcv$evaluation_log$test_rmse_mean)
min_loss_index = which.min(xgbcv$evaluation_log$test_rmse_mean)
cat("The miniumm loss of test RMSE we have is, ", min_loss, "\n")
## The miniumm loss of test RMSE we have is, 226.1709
cat("The n rounds parameter shoudld be:", min_loss_index)
## The n rounds parameter shoudld be: 14
This is also the first time I use XGBoost. This is basically a gradient boosting algorithm such that it uses gradient boosting algorithm to keep training on the residuals until it gets its optimal results. Apparently, one big issue disadvantage is that it tends to over fitting. Thus, tuning will be important in this case. The code above evaluation tells us that we should set n_round (similar to number of trees in Random Forest) to 14. Thus, we will have now use 14 as our nrounds to tune other parameters.
set.seed(123)
eta=c(0.05,0.1,0.2,0.5,1)
conv_eta = matrix(NA,min_loss_index,length(eta))
pred_eta = matrix(NA,length(y.test), length(eta))
colnames(conv_eta) = colnames(pred_eta) = eta
for(i in 1:length(eta)){
params=list(eta = eta[i])
xgb=xgboost(x.train, label = y.train, nrounds = min_loss_index, params = params, verbose = 0)
conv_eta[,i] = xgb$evaluation_log$train_rmse
pred_eta[,i] = predict(xgb, x.test)
}
conv_eta = data.frame(iter=1:min_loss_index, conv_eta)
conv_eta = melt(conv_eta, id.vars = "iter")
RMSE_eta_list = matrix(NA, 1,length(eta))
for (i in 1:length(eta)){
RMSE_eta = sqrt(mean((y.test-pred_eta[,i])^2))
RMSE_eta_list[1,i] = RMSE_eta
}
colnames(RMSE_eta_list) = eta
ggplot(data = conv_eta) +
geom_line(aes(x = iter, y = value, color = variable)) +
xlab("iteration") +
ylab("training RMSE")
RMSE_eta_list
## 0.05 0.1 0.2 0.5 1
## [1,] 232.9831 206.1999 213.108 225.6055 280.0424
Based on the training graph, we may tend to choose X_1. However, we will say it is over-fiting by looking at the test RMSE. Its RMSE is far greater than others. Thus, we will choose eta = 0.1 based on testing RMSE.
md=c(2,4,6,10)
set.seed(123)
conv_md=matrix(NA,min_loss_index,length(md))
pred_md=matrix(NA,length(y.test),length(md))
colnames(conv_md)=colnames(pred_md)=md
for(i in 1:length(md)){
params=list(eta= 0.1, max_depth = md[i])
xgb=xgboost(x.train, label = y.train, nrounds = min_loss_index, params=params, verbose = 0)
conv_md[,i] = xgb$evaluation_log$train_rmse
pred_md[,i] = predict(xgb, x.test)
}
conv_md=data.frame(iter=1:min_loss_index,conv_md)
conv_md=melt(conv_md,id.vars = "iter")
RMSE_md_list = matrix(NA, 1,length(md))
for (i in 1:length(md)){
RMSE_md = sqrt(mean((y.test-pred_eta[,i])^2))
RMSE_md_list[1,i] = RMSE_md
}
colnames(RMSE_md_list) = md
ggplot(data=conv_md)+
geom_line(aes(x=iter,y=value,color=variable)) +
xlab("iteration") +
ylab("training RMSE")
RMSE_md_list
## 2 4 6 10
## [1,] 232.9831 206.1999 213.108 225.6055
Notice that even though this parameter is important in xgboost, but it does not change much in test RMSE, compared to test RMSE above when we are not tuning this parameter at all, which means that this model has little room to improve for now. Thus, we will choose our final model with parameters eta = 0.1 and max_depth = 4 to train and evaluate our model.
params=list(eta= 0.1, max_depth = 4)
xgboost.fit = xgboost(x.train, label = y.train, params = params, nrounds = min_loss_index, verbose = 0)
xgb.test_pred = predict(xgboost.fit, newdata = x.test)
xgb.train_pred = predict(xgboost.fit, newdata = x.train)
## RMSE
xgb_train_rmse<- sqrt(mean((lovoo_train$counts_kisses -xgb.train_pred)^2))
xgb_test_rmse <- sqrt(mean((lovoo_test$counts_kisses - xgb.test_pred)^2))
xgb_train_rmse #training RMSE
## [1] 172.0779
xgb_test_rmse # test RMSE
## [1] 205.6341
importance_matrix <- xgb.importance(colnames(x.train), model = xgboost.fit)
xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
Now, we have that our training RMSE for xgboost is 172.0779107. and test RMSE will be around 205.6340844. . Surprisingly, it performs worse than linear regression. This is probably because that XGBoost is not good at regression problems.
Some Notes: Even though I think it makes more sense to do parameter tuning on validation set instead of test set, I believe the result will be similar. This is because based on the Relative Importance Graph, we could see that once xgboost relies heavily on the variable counts_profileVisits. This suggests that it relies on single variable information which may easily cause over-fitting in this case.
Since this is a final project of PSTAT231 in University of California, Santa Barbara, we do not have enough time to go deep enough to train model and parameter tuning. However, we would like to try out different probabilties of fitting each model and make a rough comparison. At this step, we start by
library(keras)
model <- keras_model_sequential()
## Loaded Tensorflow version 2.8.0
model %>%
layer_dense(units = 32, activation = 'relu', input_shape = 19) %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 32, activation = 'relu') %>%
layer_dense(units = 16, activation = 'relu') %>%
layer_dense(units = 8, activation = 'relu') %>%
layer_dense(units = 4, activation = 'relu') %>%
layer_dense(units = 1, activation = 'linear')
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_6 (Dense) (None, 32) 640
##
## dense_5 (Dense) (None, 64) 2112
##
## dense_4 (Dense) (None, 32) 2080
##
## dense_3 (Dense) (None, 16) 528
##
## dense_2 (Dense) (None, 8) 136
##
## dense_1 (Dense) (None, 4) 36
##
## dense (Dense) (None, 1) 5
##
## ================================================================================
## Total params: 5,537
## Trainable params: 5,537
## Non-trainable params: 0
## ________________________________________________________________________________
The model is desgined as a structure with layers 32, 64, 32, …, 1. For each layer we use relu activation function and for output layer we simply use linear function to get the result.
model %>% compile(
loss = 'mse',
optimizer = "adam"
)
history <- model %>% fit(
x.train, y.train,
epochs = 500, batch_size = 64,
verbose = 0,
validation_split = 0.2
)
plot(history)
## `geom_smooth()` using formula 'y ~ x'
nn_train_rmse = sqrt(mean((model %>% predict(x.train) - y.train)^2))
nn_test_rmse = sqrt(mean((model %>% predict(x.test) - y.test)^2))
cat("The training RMSE is,", sqrt(mean((model %>% predict(x.train) - y.train)^2)), "\n")
## The training RMSE is, 166.6925
cat("The test RMSE is, ", sqrt(mean((model %>% predict(x.test) - y.test)^2)))
## The test RMSE is, 192.3958
Based on above report, I would say Among these three algorithms, Linear Regression is better on predicting the models, which is surprising to me. I was initially thinking that maybe Neural Network will outperform other algorithms, but it turns out that Linear Regression is better on both train RMSE and Test RMSE.
Note: Even though I have not done any tuning and drop out for neural network, it still performs relatively well, it performs close to what Linear Regression is doing which has 80% R square. I would say in the futurework, we may focus more on improvement in Neural Network to see if we could do better than current job.
This project analyzed what factors predict more number of “likes” of European dating app users. The analysis can help the Lovoo app founders understand their users better and hopefully improve their product. Specifically, we found that the most important factor is the \(\textbf{number of times the user profile is visited by other users}\). Hence, app designers can pay more attention on designing functions that can help boost profile visits.
Also, this analysis is important for understanding in close relationship in online setting. From our analysis, we surprisingly found that age doesn’t correlate with higher attraction. This is unexpected since we speculated that younger girls are more attractive. However, this result could be due to a limitation of our dataset : our sample are all from similar age group (19-28), so the age range is not big enough for comparison; ages 18-28 are considered as young anyway. Future research can gather a more diverse-aged sample and do more analysis on the fatcor of age. Second, we found that bilinguals are more attractive, but knowing more language doesn’t correlate with higher attraction. Future research can verify the effect of diminishing utility of language after knowing two languages to see if the results are replicable. Third, we surprisingly found that being more open in terms of flirt interest is not correlated with higher attraction. This is unexpected since we expect people who are more open to chat/date/be friends are more accessible and thus more attractive. Future research can do more analysis on the intention of why users download the dating app to have a better understanding of the result.
During the analysis, we used linear regression, tree, and neural network model. Specifically, linear regression model has an adjusted R-square value of 0.81 and test RMSE of 195; tree model has test RMSE of 265.4276; XGB_boost model has test RMSE of 205.6340844. ; neural network model has test RMSE of 192.3957796. . All the test RMSE values are pretty excellent since our scale is in multiple of ten thousands. Very surprisingly, the linear regression model performed very well. However, although linear regression model has the highest adjusted R-square value of 0.81 and lowest RMSE value of 195, we finally selected neural network model. This is because our untuned neural network model could already achieve a similar performance as linear regression model by having test RMSE of 192.3957796. of Neural Network model is still very small compared to the scale. Furthermore, since there is still plenty of room of improvement in our Neural Network model, we believe that the model could perform better than linear regression model after improvement (for example by tuning). Hence, our final chosen model is Neural Network model.
Overall, our dataset provides us with some insights into what makes a user attractive on online dating. While online dating is becoming more popular, we hope our project can help not only Lovoo app founders to improve their app, but also help users to find ways to increase their attraction - Make user profile eye-catchy!