Analysis for the Lifestyle Channel
Maks Nikiforov and Mark Austin Due 10/31/2021
##For markdown automation need a different
## image and cache folder
## for each of the 6 channels so that results
## from different channels don't overwrite each other
##Also setting up currentChannel variable
if (params$channel=="data_channel_is_bus") {
knitr::opts_chunk$set(fig.path = "images/bus/",
cache.path = "cache/bus/")
currentChannel<-"Business"
} else if (params$channel=="data_channel_is_entertainment") {
knitr::opts_chunk$set(fig.path = "images/entertainment/",
cache.path="cache/entertainment/")
currentChannel<-"Entertainment"
} else if (params$channel=="data_channel_is_lifestyle") {
knitr::opts_chunk$set(fig.path = "images/lifestyle/",
cache.path = "cache/lifestyle/")
currentChannel<-"Lifestyle"
} else if (params$channel=="data_channel_is_socmed") {
knitr::opts_chunk$set(fig.path = "images/socmed/",
cache.path = "cache/socmed/")
currentChannel<-"Social Media"
} else if (params$channel=="data_channel_is_tech") {
knitr::opts_chunk$set(fig.path = "images/tech/",
cache.path = "cache/tech/")
currentChannel<-"Tech"
} else if (params$channel=="data_channel_is_world") {
knitr::opts_chunk$set(fig.path = "images/world/",
cache.path = "cache/world/")
currentChannel<-"World"
}
Data Import
Data was imported first to allow for a more automated introduction.
# Read all data into a tibble
fullData<-read_csv("./data/OnlineNewsPopularity.csv")
# Eliminate non-predictive variables
reduceVarsData<-fullData %>% select(-url,-timedelta)
#test code for pre markdown automation
#params$channel<-"data_channel_is_bus"
#filter by the current params channel
channelData<-reduceVarsData %>% filter(eval(as.name(params$channel))==1)
# URL data for top ten articles in each category
channelDataURL <- fullData %>% filter(eval(as.name(params$channel))==1)
###Can now drop the data channel variables
channelData<-channelData %>% select(-starts_with("data_channel"))
Introduction
This page offers an exploratory data analysis of Lifestyle articles in the online news popularity data set. The top ten articles in this category, based on the number of shares on social media, include the following titles:
Shares | Article title |
---|---|
208300 | Obama to Discuss NSA Reform With Lawmakers |
196700 | No Movie Trailer Is Complete Without This One Line |
139600 | 87% of American Teenagers Send Text Messages Each Month |
81200 | High-Tech Wristband Monitors Mood |
73100 | 22 Books for Your Ultimate Summer Reading List |
56000 | Finalists Exhibit Tech for $465 Million Virtual Border Fence |
54900 | Cybersecurity Experts Will Face Off in Mock NetWars |
54200 | 84% of Smartphone Owners Use Apps While Getting Ready in the Morning |
49700 | It’s Still Easy to Get Away With Revenge Porn |
45100 | Beats Solo² Headphones Sound Great, But You’re Paying for Fashion |
Two variables - url
and timedelta
- are non-predictive and have been
removed. The remaining 53 variables comprise 2099 observations, which
makes up 5.3 percent of the original data set. Fernandes et al., who
sourced the data, concentrated on article characteristics such as
verbosity and the polarity of content, publication day, the quantity of
included media, and keyword attributes (Fernandes et al., 2015). A
subset of these variables and the correlations between them are explored
in subsequent sections.
The broader purpose of this analysis is predicated on using supervised
learning to predict a target variable - shares
. To this end, the final
sections outline four unique models for conducting such predictions and
an assessment of their relative performance. Two models are rooted in
multiple linear regression analysis, which assesses relationships
between a response variable and two or more predictors. The remaining
models are based on random forest and boosted tree techniques. The
random forest method averages results from multiple decision trees which
are fitted with a random parameter subset. The boosted tree method
spurns averages in favor of results that stem from weighted iterations
(James et al., 2021).
Summarizations
Numerical Summaries
The first table summarizes information for article shares grouped by
whether an article was a weekend article or not. This summary gives an
idea of the center and spread of shares
across type of day group
levels.
channelData %>%
mutate(dayType=ifelse(is_weekend,"Weekend","Weekday")) %>%
group_by(dayType) %>%
summarise(Avg = mean(shares), Sd = sd(shares),
Median = median(shares), IQR =IQR(shares)) %>% kable()
dayType | Avg | Sd | Median | IQR |
---|---|---|---|---|
Weekday | 3628.255 | 9551.749 | 1600 | 2050 |
Weekend | 3916.696 | 5044.188 | 2100 | 2600 |
The next tables gives expands on the idea of the first table by grouping
shares
by each day of the week. This summary gives an idea of the
center and spread of shares
across day of the week group levels.
dowData<-channelData %>% select(starts_with("weekday_is"),shares) %>%
mutate(dayofWeek=case_when(as.logical(weekday_is_monday)~"Monday",
as.logical(weekday_is_tuesday)~"Tuesday",
as.logical(weekday_is_wednesday)~"Wednesday",
as.logical(weekday_is_thursday)~"Thursday",
as.logical(weekday_is_friday)~"Friday",
as.logical(weekday_is_saturday)~"Saturday",
as.logical(weekday_is_sunday)~"Sunday")) %>%
select(dayofWeek,shares)
dowLevels<-c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday")
dowData$dayofWeek<-factor(dowData$dayofWeek,levels = dowLevels)
dowData %>%
group_by(dayofWeek) %>%
summarise(Avg = mean(shares), Sd = sd(shares),
Median = median(shares), IQR =IQR(shares)) %>% kable()
dayofWeek | Avg | Sd | Median | IQR |
---|---|---|---|---|
Monday | 4345.711 | 14072.938 | 1600 | 2575.00 |
Tuesday | 4152.494 | 13544.476 | 1500 | 1975.00 |
Wednesday | 3173.180 | 5608.013 | 1600 | 1800.25 |
Thursday | 3500.268 | 5820.627 | 1600 | 2250.00 |
Friday | 3025.869 | 4539.610 | 1500 | 2000.00 |
Saturday | 4062.451 | 5350.749 | 2100 | 2650.00 |
Sunday | 3790.376 | 4771.926 | 2100 | 2675.00 |
The table below highlights variables with the highest and most significant correlations in the data set. This output may be considered when analyzing covariance to control for potentially confounding variables.
# Display top 10 highest correlations
covarianceDF <- corr_cross(df = channelData, max_pvalue = 0.05, top = 10, plot = 0) %>%
select(key, mix, corr, pvalue) %>% rename("Variable 1" = key, "Variable 2" = mix,
"Correlation" = corr, "p-value" = pvalue)
# Display non-zero p-values
covarianceDF[4] <- format.pval(covarianceDF[4])
kable(covarianceDF)
Variable 1 | Variable 2 | Correlation | p-value |
---|---|---|---|
kw_max_min | kw_avg_min | 0.956574 | < 2.22e-16 |
self_reference_max_shares | self_reference_avg_sharess | 0.911916 | < 2.22e-16 |
n_unique_tokens | n_non_stop_unique_tokens | 0.906660 | < 2.22e-16 |
n_non_stop_words | average_token_length | 0.870183 | < 2.22e-16 |
kw_min_min | kw_max_max | -0.856370 | < 2.22e-16 |
kw_max_avg | kw_avg_avg | 0.818423 | < 2.22e-16 |
global_rate_negative_words | rate_negative_words | 0.815529 | < 2.22e-16 |
self_reference_min_shares | self_reference_avg_sharess | 0.787590 | < 2.22e-16 |
kw_max_max | kw_avg_max | 0.750110 | < 2.22e-16 |
rate_positive_words | rate_negative_words | -0.731286 | < 2.22e-16 |
Contingency Table
The following contingency table displays counts and sums for the number of article shares within given ranges by the day of week shared. Share ranges were selected to illustrate lower, medium, and higher ranges of shares. Examining these counts can show possible patterns of shares by day or week and the range grouping for shares.
##dig.lab is needed to avoid R defaulting to scientific notation
kable(addmargins(table
(dowData$dayofWeek,cut(dowData$shares,
c(0,200,1000,10000,860000),dig.lab = 7))))
(0,200] | (200,1000] | (1000,10000] | (10000,860000] | Sum | |
---|---|---|---|---|---|
Monday | 3 | 81 | 213 | 25 | 322 |
Tuesday | 1 | 86 | 229 | 18 | 334 |
Wednesday | 3 | 106 | 254 | 25 | 388 |
Thursday | 3 | 87 | 241 | 27 | 358 |
Friday | 3 | 76 | 210 | 16 | 305 |
Saturday | 0 | 9 | 156 | 17 | 182 |
Sunday | 0 | 17 | 180 | 13 | 210 |
Sum | 13 | 462 | 1483 | 141 | 2099 |
Plots
The following histogram looks at the distribution of shares
. A pseudo
log y scale with modified y break values was used so that article
shares
with low frequency will appear. We can tell from the histogram
whether shares
has a symmetric or skewed distribution. The
distribution is symmetric if the tails are the same around the center.
The distribution is right skewed if there is a long left tail and right
skewed if there is a long right tail.
###creating histogram of shares data
##scales comma was used to avoid the default scientific notation
##pseudo log with breaks was used to make low frequency values
## more visisble
g <- ggplot(channelData, aes( x = shares))
g + geom_histogram(binwidth=12000,color = "brown", fill = "green",
size = 1) + labs(x="Article Shares", y="Pseudo Log of Count",
title = "Histogram of Article Shares") +
scale_y_continuous(trans = "pseudo_log",
breaks = c(0:3, 2000, 6000),minor_breaks = NULL) +
scale_x_continuous(labels = scales::comma)
Fernandes et al. highlight several variables in their random forest
model (Fernandes et al., 2015). The following variables from their top
11 were included in the following correlation plot with variables in ()
being renamed for this plot:
shares
,kw_min_avg
,kw_max_avg
,LDA_03
,self_reference_min_shares
(srmin_shares
),kw_avg_max
,self_reference_avg_sharess
(sravg_shares
),LDA_02
,kw_avg_min
,LDA_01
,n_non_stop_unique_tokens
(n_nstop_utokens
).
The plot shows correlation with the response variable shares
and the
other various combinations. Larger circles indicate stronger positive
(blue) or negative (red) correlation with correlation values on the
lower portion of the plot.
##Reduce variable name length for later plotting
## Otherwise var names overwrite Title no matter
## how many other size tweaks were made
corrData<-channelData %>%
mutate(sravg_shares=self_reference_avg_sharess,
srmin_shares=self_reference_min_shares,
n_nstop_utokens=n_non_stop_unique_tokens)
Correlation<-cor(select(corrData, shares, kw_min_avg,
kw_max_avg, LDA_03, srmin_shares,
kw_avg_max, sravg_shares, LDA_02,
kw_avg_min, LDA_01, n_nstop_utokens),
method = "spearman")
corrplot(Correlation,type="upper",tl.pos="lt", tl.cex = .70)
corrplot(Correlation,type="lower",method="number",
add=TRUE,diag=FALSE,tl.pos="n",tl.cex = .70,number.cex = .75,
title =
"Correlation Plot of Shares and Variables of Interest",
mar=c(0,0,.50,0),cex.main = .75)
The following two scatterplots illustrate the relationship between
response article shares shares
and predictor average keyword (max
shares) kw_max_ave
. kw_max_ave
was chosen because it was one of the
potential predictors examined in the previous correlation plot.
Both scatterplots plot these variables and add a simple linear regression line to the graph.
For either graph, an upward relationship indicates higher average keyword values tend towards more article shares. A negative relation would indicate a lower average keyword values tend towards more article shares.
In addition, both graphs use differing color for weekday and weekend articles so that we can spot any possible trends with those values too.
The first scatterplot uses the default R generated axes so that potential outliers or significant observations can be observed.
The second scatterplot reduces the scale of both axes to make it easier to spot relationships for the majority of data that occur within these bounds.
###Create new factor version of weekend variable
### to use later in graphs
scatterData<-channelData %>%
mutate(dayType=ifelse(is_weekend,"Weekend","Weekday"))
scatterData$dayType<-as.factor(scatterData$dayType)
###First scatter plot with ALL data
g<-ggplot(data = scatterData,
aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
geom_smooth(method = lm) +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma) +
labs(x="Avg. keyword (max. shares)", y="Article Shares",
title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",color="")
###Second scatter plot with reduced axes
g<-ggplot(data = scatterData,
aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
geom_smooth(method = lm) +
ylim(0,10000) +
xlim(0,20000) +
labs(x="Avg. keyword (max. shares)", y="Article Shares",
title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",
color="")
The bar plot below shows cumulative article publications for each day of the week, with higher bars indicating more publications. However, days with the largest number of publications are not necessarily ones with the most article shares, as seen in the subsequent box plot.
# Subset columns to include only weekday_is_*
weekdayData <- channelData %>% select(starts_with("weekday_is"))
# Calculate sum of articles published in each week day
articlesPublished <- lapply(weekdayData, function(c) sum(c=="1"))
# Use factor to set specific order in bar plot
weekPubDF <- data.frame(weekday=c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"),
count=articlesPublished)
weekPubDF$weekday = factor(weekPubDF$weekday, levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday"))
# Create bar plot with total publications by day
weekdayBar <- ggplot(weekPubDF, aes(x = weekday, y = articlesPublished)) + geom_bar(stat = "identity", color = "#123456", fill = "#0072B2")
weekdayBar + labs(x = "Day", y = "Number published",
title = "Article publications by day of week")
The boxplot below examines the day of article publication
(Monday-Sunday) and the associated distribution of article shares
. The
median line indicates the center of the distribution of shares
, and
comparatively high medians indicate days that have relatively high
circulation of Mashable articles in social media networks. For days in
which the median is closer to the lower quartile (and where the upper
whisker may be taller than the lower whisker), the distribution is
skewed to the right. Conversely, a median that is closer to the upper
quartile indicates a distribution that is skewed to the left. Days with
relatively taller boxplots also have greater variability of shares
.
# Subset columns to include only weekday_is_*, shares,
# create categorical variable, "day", denoting day of week (Mon-Sun)
medianShares <- channelData %>% select(starts_with("weekday_is"), shares) %>% mutate(day = NA)
# Populate "day"
for (i in 1:nrow(medianShares)) {
if (medianShares$weekday_is_monday[i] == 1) {
medianShares$day[i] = "Monday"
}
else if (medianShares$weekday_is_tuesday[i] == 1) {
medianShares$day[i] = "Tuesday"
}
else if (medianShares$weekday_is_wednesday[i] == 1) {
medianShares$day[i] = "Wednesday"
}
else if (medianShares$weekday_is_thursday[i] == 1) {
medianShares$day[i] = "Thursday"
}
else if (medianShares$weekday_is_friday[i] == 1) {
medianShares$day[i] = "Friday"
}
else if (medianShares$weekday_is_saturday[i] == 1) {
medianShares$day[i] = "Saturday"
}
else if (medianShares$weekday_is_sunday[i] == 1) {
medianShares$day[i] = "Sunday"
}
else {
medianShares$day[i] = NA
}
}
# Transform "day" into factor with levels to control order of boxplots
medianShares$day <- factor(medianShares$day,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
# Plot distribution of shares for each day of the week
sharesBox <- ggplot(medianShares, aes(x = day, y = shares, fill = day))
sharesBox + geom_boxplot(outlier.shape = NA) +
# Exclude extreme outliers, limit range of y-axis
coord_cartesian(ylim = quantile(medianShares$shares, c(0.1, 0.95))) +
# Remove legend after coloration
theme(legend.position = "none") +
labs(x = "Day", y = "Shares",
title = "Distribution of article shares for each publication day") + scale_fill_brewer(palette = "Spectral")
For the empirical cumulative distribution function (ECDF) below, the
dplyr
ranking function ntile()
divides shares
into four groups.
Observations with the fewest shares are placed into group 1, those with
the most shares are placed into group 4, and intermediaries reside in
groups 2 and 3. The horizontal axis lists word count, and the vertical
axis lists the percentage of content with that word count. A divergence
of the colored lines suggests that the number of words differs in
content with the fewest and most shares. At any given percentage of
content (y-value), curves further to the right correspond to more words
within the associated shares
group. Groups with curves that are
further to the left indicate fewer words in that percentage of content.
# Create variable to for binning the shares
binnedShares <- channelData %>% mutate(shareQuantile = ntile(channelData$shares, 4))
binnedShares <- binnedShares %>% mutate(totalMedia = num_imgs + num_videos)
# Render and label word count ECDF, group by binned shares
avgWordHisto <- ggplot(binnedShares, aes(x = n_tokens_content, colour = shareQuantile))
avgWordHisto + stat_ecdf(geom = "step", aes(color = as.character(shareQuantile))) +
labs(title="ECDF - Number of words in the article \ngrouped by article shares (ranked)",
y = "ECDF", x="Word count") + xlim(0,2000) +
scale_colour_brewer(palette = "Spectral", name = "Article shares \n(group rank)")
Modeling
Splitting Data
Per project requirements, the data for each channel are split with 70% of the data becoming training data and 30% of the data becoming test data.
#Using set.seed per suggestion so that work will be reproducible
set.seed(20)
dataIndex <-createDataPartition(channelData$shares, p = 0.7, list = FALSE)
channelTrain <-channelData[dataIndex,]
channelTest <-channelData[-dataIndex,]
Linear Regression Models
Linear regression models describe a linear relationship between a response variable and one or more explanatory variables. Models with one explanatory variable are called simple linear regression models and models with more than one explanatory variable are called multiple linear regression models. Multiple linear regression models can include polynomial and interaction terms. Each explanatory variable has an associated estimated parameter. All linear regression models are linear in the parameters.
For linear regression, explanatory variables can be continuous or categorical. However, response variables are only continuous for linear regression models.
Linear regression models are fit with training data by minimizing the sum of squared errors. Model fitting results in a line for simple linear regression and a saddle for multiple linear regression.
The first linear regression model contains predictors that encompass content keywords, sentiment and subjectivity, the length of content (the effects of which were gleaned previously from the ECDF), and link citations.
# Parallel cluster setup
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
# Linear regression with subset of predictors (p-value < 0.1) selected after performing
# least squares fit on the entire set of predictors.
lmFit1 <- train(shares ~ kw_avg_avg + kw_max_avg + kw_min_avg +
num_hrefs + self_reference_min_shares + global_subjectivity +
num_self_hrefs + n_tokens_title + n_tokens_content + n_unique_tokens +
average_token_length + kw_min_max + num_keywords + kw_max_min + abs_title_subjectivity +
global_rate_positive_words,
data = channelTrain,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5))
stopCluster(cl)
The second linear regression model contains main effects for most of the
predictors listed earlier in the correlation plot. If variables had more
than .50 pairwise correlation in any channel, one variable of that pair
was excluded. Excluded variables were: self_reference_min_shares
,
kw_avg_max
, and LDA_01
.
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
lmFit2 <- train(shares ~ kw_min_avg +
kw_max_avg + LDA_03 +
self_reference_avg_sharess + LDA_02 +
kw_avg_min + n_non_stop_unique_tokens,
data = channelTrain,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 10))
stopCluster(cl)
Random Forest Model
Random forest models aggregate results from many sample decision trees. Those sample trees are produced using bootstrap samples created using resampling with replacement. A tree is trained on each bootstrap sample, resulting in a prediction based on that training sample data. Results from all bootstrap samples are averaged to arrive at a final prediction.
Both bagging and random forest methods use bootstrap sampling with decision trees. However, bagging includes all predictors which can lead to less reduction in variance when strong predictors exist. Unlike bagging, random forests do not use all predictors but use a random subset of predictors for each bootstrap tree fit. Random forests usually have a better fit than bagging models.
In this particular case, the response shares
is continuous and we are
working with regression trees. The mtry
tuning parameter controls how
many random predictors are used in the bootstrap samples. An mtry
of 1
to 30 was chosen as a way to evaluate up to 30 predictors. These values
were chosen to work within available computing constraints. Five fold
cross validation is used to choose the optimal mtry value corresponding
to the lowest RMSE.
##Run time presented a challenge so parallel processing was used
##Followed Parallel instructions on caret page
## https://topepo.github.io/caret/parallel-processing.html
##Various mtry values were tried with a 20 minute runtime goal
## A 20 minute per channel runtime corresponds
## to a total of about 2 hours model fit for all 6 channels
##mtry 1:30 was chosen because it was close to 20 minutes
##mtry 1:20 had 10 minute runtime and 1:30 took 30 minutes
##repeatedcv was evaluated but took over 30 minutes
## thus repeats were not used
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
rfFit <- train(shares ~ ., data = channelTrain,
method = "rf",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneGrid = data.frame(mtry = 1:30))
stopCluster(cl)
rfFit
## Random Forest
##
## 1472 samples
## 52 predictor
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1178, 1176, 1178, 1177, 1179
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 7700.354 0.011448227 3257.875
## 2 7767.243 0.010557108 3344.202
## 3 7818.160 0.008819996 3361.536
## 4 7863.695 0.008708639 3392.410
## 5 7885.215 0.009835263 3415.104
## 6 7903.022 0.010214064 3396.174
## 7 7989.386 0.009209833 3452.842
## 8 8044.493 0.007095766 3463.653
## 9 8072.146 0.006458376 3464.451
## 10 8133.106 0.006624797 3476.167
## 11 8154.974 0.007672426 3486.552
## 12 8140.628 0.010311884 3481.164
## 13 8151.268 0.009191298 3482.209
## 14 8219.685 0.007071879 3495.295
## 15 8266.071 0.007106738 3499.214
## 16 8272.007 0.007487571 3485.774
## 17 8335.853 0.006336330 3522.780
## 18 8343.314 0.007765418 3525.826
## 19 8403.869 0.007720668 3517.991
## 20 8406.939 0.006405060 3545.463
## 21 8422.275 0.008047622 3516.312
## 22 8451.770 0.006368502 3537.152
## 23 8488.364 0.007367648 3524.590
## 24 8454.638 0.006303017 3522.232
## 25 8526.539 0.007246372 3520.831
## 26 8582.794 0.006213899 3544.972
## 27 8627.369 0.006609938 3544.678
## 28 8647.981 0.006894932 3552.627
## 29 8631.074 0.007435105 3541.756
## 30 8678.797 0.006936974 3545.583
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 1.
After fitting the random forest model, the following variable importance plot is created. The top ten most important predictors are plotted using a scale of 0 to 100.
rfImp <- varImp(rfFit, scale = TRUE)
plot(rfImp,top = 10, main="Random Forest Model\nTop 10 Importance Plot")
Boosted Tree Model
Boosting is a general method whereby decision trees are grown sequentially using residuals (the differences between observed values and predicted values of a variable) as the response. Initial prediction values start at 0 for all combinations of predictors, so that the first set of residuals matches the observed values in our data. To mitigate low bias and high variance, contributions from subsequent trees are scaled with a shrinkage parameter, λ. The value of this parameter is generally small (0.01 or 0.001), which slows tree growth and tampers overfitting (James et al., 2021).
# Re-allocate cores for parallel computing
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
# Boosted tree fit with tuneLength (let function decide parameter combinations)
boostedTreeFit <- train(shares ~ ., data = channelTrain,
method = "gbm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneLength = 5)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 81592531.7691 nan 0.1000 -110219.7078
## 2 81125745.4982 nan 0.1000 -171529.4904
## 3 81064940.7366 nan 0.1000 -41573.2315
## 4 80930623.3187 nan 0.1000 138028.7523
## 5 80553322.6423 nan 0.1000 -87917.9221
## 6 80288738.6406 nan 0.1000 -126909.0245
## 7 80070259.2664 nan 0.1000 -264937.3073
## 8 79967993.4034 nan 0.1000 -82901.7624
## 9 79932628.1262 nan 0.1000 -47379.8320
## 10 79723983.2571 nan 0.1000 -52377.8386
## 20 78302534.7038 nan 0.1000 -285124.4793
## 40 77803375.1027 nan 0.1000 -223603.4316
## 50 77553134.0175 nan 0.1000 -415020.9452
# Define tuning parameters based on $bestTune from the permutations above
nTrees <- boostedTreeFit$bestTune$n.trees
interactionDepth = boostedTreeFit$bestTune$interaction.depth
minObs = boostedTreeFit$bestTune$n.minobsinnode
shrinkParam <- boostedTreeFit$bestTune$shrinkage
# Boosted tree fit with defined parameters
bestBoostedTree <- train(shares ~ ., data = channelTrain,
method = "gbm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneGrid = expand.grid(n.trees = nTrees, interaction.depth = interactionDepth,
shrinkage = shrinkParam, n.minobsinnode = minObs))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 81970919.4694 nan 0.1000 33269.3045
## 2 81544792.4778 nan 0.1000 -44965.6893
## 3 80990983.9297 nan 0.1000 12946.2311
## 4 80710153.3339 nan 0.1000 -61735.7828
## 5 80505886.1632 nan 0.1000 -84920.5351
## 6 80223864.9008 nan 0.1000 -102302.6449
## 7 79996735.8107 nan 0.1000 -176239.8212
## 8 79791678.7133 nan 0.1000 -92517.9657
## 9 79674421.6124 nan 0.1000 -114856.9453
## 10 79455266.6071 nan 0.1000 -144634.3177
## 20 78447467.1908 nan 0.1000 -46217.4624
## 40 77584239.2316 nan 0.1000 -154797.5720
## 50 76942816.5418 nan 0.1000 210463.3425
stopCluster(cl)
Model Comparisons
After models were fit with training data, we do predictions with testing data. Finally, RMSE metrics are extracted and compared. The model with lowest RMSE is presented as the winning model.
# Predict using test data
predictLM1 <- predict(lmFit1, newdata = channelTest)
# Metrics
RMSELM1 <- postResample(predictLM1, obs = channelTest$shares)["RMSE"][[1]]
RMSELM1
## [1] 8344.936
# Store value for model comparison
modelPerformance <- tibble(RMSE = RMSELM1, Model = "Linear regression 1")
predictLM2 <- predict(lmFit2, newdata = channelTest)
RMSELM2<-postResample(predictLM2, channelTest$shares)["RMSE"][[1]]
RMSELM2
## [1] 8450.339
modelPerformance <- add_row(modelPerformance, RMSE = RMSELM2, Model = "Linear regression 2")
predictRF <- predict(rfFit, newdata = channelTest)
RMSERF<-postResample(predictRF, channelTest$shares)["RMSE"]
RMSERF
## RMSE
## 8415.603
modelPerformance <- add_row(modelPerformance, RMSE = RMSERF, Model = "Random forest")
# Predict using test data
predictGBM <- predict(bestBoostedTree, newdata = channelTest)
# Metrics
RMSEGBM <- postResample(predictGBM, obs = channelTest$shares)["RMSE"]
RMSEGBM
## RMSE
## 8503.737
modelPerformance <- add_row(modelPerformance, RMSE = RMSEGBM, Model = "Boosted tree")
# Select row with lowest value of RMSE.
selectModel <- modelPerformance %>% slice_min(RMSE)
selectModel
## # A tibble: 1 x 2
## RMSE Model
## <dbl> <chr>
## 1 8345. Linear regression 1
Based on the preceding analyses with test data, the Linear regression 1 model yields the lowest RMSE - 8344.9359123.