Analysis for the Business Channel

Maks Nikiforov and Mark Austin Due 10/31/2021

Data Import
Introduction
Summarizations
Modeling
Model Comparisons
References

##For markdown automation need a different 
##  image and cache folder 
##  for each of the 6 channels so that results
##    from different channels don't overwrite each other
##Also setting up currentChannel variable 
if (params$channel=="data_channel_is_bus") {
  knitr::opts_chunk$set(fig.path = "images/bus/",
                        cache.path = "cache/bus/")
  currentChannel<-"Business"
} else if (params$channel=="data_channel_is_entertainment") {
  knitr::opts_chunk$set(fig.path = "images/entertainment/",
                        cache.path="cache/entertainment/")
  currentChannel<-"Entertainment"
} else if (params$channel=="data_channel_is_lifestyle") {
  knitr::opts_chunk$set(fig.path = "images/lifestyle/",
                        cache.path = "cache/lifestyle/")
  currentChannel<-"Lifestyle"
} else if (params$channel=="data_channel_is_socmed") {
  knitr::opts_chunk$set(fig.path = "images/socmed/",
                        cache.path = "cache/socmed/")
  currentChannel<-"Social Media"
} else if (params$channel=="data_channel_is_tech") {
  knitr::opts_chunk$set(fig.path = "images/tech/",
                        cache.path = "cache/tech/")
  currentChannel<-"Tech"
} else if (params$channel=="data_channel_is_world") {
  knitr::opts_chunk$set(fig.path = "images/world/",
                        cache.path = "cache/world/")
  currentChannel<-"World"
} 

Data Import

Data was imported first to allow for a more automated introduction.

# Read all data into a tibble
fullData<-read_csv("./data/OnlineNewsPopularity.csv")

# Eliminate non-predictive variables
reduceVarsData<-fullData %>% select(-url,-timedelta)

#test code for pre markdown automation
#params$channel<-"data_channel_is_bus"

#filter by the current params channel
channelData<-reduceVarsData %>% filter(eval(as.name(params$channel))==1) 

# URL data for top ten articles in each category
channelDataURL <- fullData %>% filter(eval(as.name(params$channel))==1)

###Can now drop the data channel variables 
channelData<-channelData %>% select(-starts_with("data_channel"))

Introduction

This page offers an exploratory data analysis of Business articles in the online news popularity data set. The top ten articles in this category, based on the number of shares on social media, include the following titles:

Shares	Article title
690400	Dove Experiment Aims to Change the Way You See Yourself
652900	Kanye West Lectures at Harvard About Creativity
310800	It’s Hot as Hell in Australia Right Now
306100	BlackBerry Sold 1 Million BlackBerry 10 Smartphones in Q4
298400	IBM Brings Watson to the Masses and Other News You Need to Know
158900	All The Christmas Movies You Need in One Mashup
144400	Can Beautiful Design Make Your Resume Stand Out?
139500	Apple to Return $32.5 Million for Accidental App Purchases
110200	How Big Data Is Influencing Hiring
106400	MapBox Enables Amazing Custom Maps for Sites and Apps

Two variables - url and timedelta - are non-predictive and have been removed. The remaining 53 variables comprise 6258 observations, which makes up 15.8 percent of the original data set. Fernandes et al., who sourced the data, concentrated on article characteristics such as verbosity and the polarity of content, publication day, the quantity of included media, and keyword attributes (Fernandes et al., 2015). A subset of these variables and the correlations between them are explored in subsequent sections.

The broader purpose of this analysis is predicated on using supervised learning to predict a target variable - shares. To this end, the final sections outline four unique models for conducting such predictions and an assessment of their relative performance. Two models are rooted in multiple linear regression analysis, which assesses relationships between a response variable and two or more predictors. The remaining models are based on random forest and boosted tree techniques. The random forest method averages results from multiple decision trees which are fitted with a random parameter subset. The boosted tree method spurns averages in favor of results that stem from weighted iterations (James et al., 2021).

Summarizations

Numerical Summaries

The first table summarizes information for article shares grouped by whether an article was a weekend article or not. This summary gives an idea of the center and spread of shares across type of day group levels.

channelData %>% 
  mutate(dayType=ifelse(is_weekend,"Weekend","Weekday")) %>%
  group_by(dayType) %>% 
  summarise(Avg = mean(shares), Sd = sd(shares), 
    Median = median(shares), IQR =IQR(shares)) %>% kable()

dayType	Avg	Sd	Median	IQR
Weekday	2975.514	15614.231	1300	1376
Weekend	3909.990	7563.786	2400	2400

The next tables gives expands on the idea of the first table by grouping shares by each day of the week. This summary gives an idea of the center and spread of shares across day of the week group levels.

dowData<-channelData %>% select(starts_with("weekday_is"),shares) %>%
  mutate(dayofWeek=case_when(as.logical(weekday_is_monday)~"Monday",
                             as.logical(weekday_is_tuesday)~"Tuesday",
                             as.logical(weekday_is_wednesday)~"Wednesday",
                             as.logical(weekday_is_thursday)~"Thursday",
                             as.logical(weekday_is_friday)~"Friday",
                             as.logical(weekday_is_saturday)~"Saturday",
                             as.logical(weekday_is_sunday)~"Sunday")) %>%
  select(dayofWeek,shares)

dowLevels<-c("Monday","Tuesday","Wednesday",
             "Thursday","Friday","Saturday","Sunday")
dowData$dayofWeek<-factor(dowData$dayofWeek,levels = dowLevels)

dowData %>%  
  group_by(dayofWeek) %>% 
  summarise(Avg = mean(shares), Sd = sd(shares), 
    Median = median(shares), IQR =IQR(shares)) %>% kable()

dayofWeek	Avg	Sd	Median	IQR
Monday	3887.436	28313.186	1400	1651.00
Tuesday	2932.336	10826.650	1300	1387.00
Wednesday	2676.552	8160.413	1300	1312.00
Thursday	2885.192	13138.010	1300	1296.75
Friday	2363.770	5133.923	1400	1420.25
Saturday	4426.897	10139.138	2600	2150.00
Sunday	3543.784	4979.289	2200	2400.00

The table below highlights variables with the highest and most significant correlations in the data set. This output may be considered when analyzing covariance to control for potentially confounding variables.

# Display top 10 highest correlations
covarianceDF <- corr_cross(df = channelData, max_pvalue = 0.05, top = 10, plot = 0) %>% 
  select(key, mix, corr, pvalue) %>% rename("Variable 1" = key, "Variable 2" = mix, 
                                            "Correlation" = corr, "p-value" = pvalue) 

# Display non-zero p-values
covarianceDF[4] <- format.pval(covarianceDF[4])

kable(covarianceDF)

Variable 1	Variable 2	Correlation	p-value
kw_max_min	kw_avg_min	0.976916	< 2.22e-16
n_unique_tokens	n_non_stop_unique_tokens	0.905731	< 2.22e-16
rate_positive_words	rate_negative_words	-0.903109	< 2.22e-16
kw_max_avg	kw_avg_avg	0.879080	< 2.22e-16
self_reference_max_shares	self_reference_avg_sharess	0.866696	< 2.22e-16
kw_min_min	kw_max_max	-0.855276	< 2.22e-16
self_reference_min_shares	self_reference_avg_sharess	0.810516	< 2.22e-16
global_rate_negative_words	rate_negative_words	0.788070	< 2.22e-16
weekday_is_sunday	is_weekend	0.749185	< 2.22e-16
avg_negative_polarity	min_negative_polarity	0.745006	< 2.22e-16

Contingency Table

The following contingency table displays counts and sums for the number of article shares within given ranges by the day of week shared. Share ranges were selected to illustrate lower, medium, and higher ranges of shares. Examining these counts can show possible patterns of shares by day or week and the range grouping for shares.

##dig.lab is needed to avoid R defaulting to scientific notation
kable(addmargins(table
                 (dowData$dayofWeek,cut(dowData$shares,
                  c(0,200,1000,10000,860000),dig.lab = 7))))

	(0,200]	(200,1000]	(1000,10000]	(10000,860000]	Sum
Monday	2	369	739	43	1153
Tuesday	2	422	714	44	1182
Wednesday	6	470	762	33	1271
Thursday	3	439	762	30	1234
Friday	5	232	582	13	832
Saturday	1	9	220	13	243
Sunday	0	26	300	17	343
Sum	19	1967	4079	193	6258

Plots

The following histogram looks at the distribution of shares. A pseudo log y scale with modified y break values was used so that article shares with low frequency will appear. We can tell from the histogram whether shares has a symmetric or skewed distribution. The distribution is symmetric if the tails are the same around the center. The distribution is right skewed if there is a long left tail and right skewed if there is a long right tail.

###creating histogram of shares data 
##scales comma was used to avoid the default scientific notation
##pseudo log with breaks was used to make low frequency values 
## more visisble
g <- ggplot(channelData, aes( x = shares))
g + geom_histogram(binwidth=12000,color = "brown", fill = "green", 
  size = 1)  + labs(x="Article Shares", y="Pseudo Log of Count",
  title = "Histogram of Article Shares") +
  scale_y_continuous(trans = "pseudo_log",
                     breaks = c(0:3, 2000, 6000),minor_breaks = NULL) +
  scale_x_continuous(labels = scales::comma) 

Fernandes et al. highlight several variables in their random forest model (Fernandes et al., 2015). The following variables from their top 11 were included in the following correlation plot with variables in () being renamed for this plot: shares,kw_min_avg,kw_max_avg,LDA_03,self_reference_min_shares(srmin_shares),kw_avg_max,self_reference_avg_sharess(sravg_shares),LDA_02,kw_avg_min,LDA_01,n_non_stop_unique_tokens(n_nstop_utokens).
The plot shows correlation with the response variable shares and the other various combinations. Larger circles indicate stronger positive (blue) or negative (red) correlation with correlation values on the lower portion of the plot.

##Reduce variable name length for later plotting
## Otherwise var names overwrite Title no matter
##  how many other size tweaks were made
corrData<-channelData %>% 
  mutate(sravg_shares=self_reference_avg_sharess,
         srmin_shares=self_reference_min_shares,
         n_nstop_utokens=n_non_stop_unique_tokens)

Correlation<-cor(select(corrData, shares, kw_min_avg,
        kw_max_avg, LDA_03, srmin_shares,
        kw_avg_max, sravg_shares, LDA_02,
        kw_avg_min, LDA_01, n_nstop_utokens),
        method = "spearman")

corrplot(Correlation,type="upper",tl.pos="lt", tl.cex = .70)
corrplot(Correlation,type="lower",method="number",
         add=TRUE,diag=FALSE,tl.pos="n",tl.cex = .70,number.cex = .75,
         title = 
           "Correlation Plot of Shares and Variables of Interest",
         mar=c(0,0,.50,0),cex.main = .75)

The following two scatterplots illustrate the relationship between response article shares shares and predictor average keyword (max shares) kw_max_ave. kw_max_ave was chosen because it was one of the potential predictors examined in the previous correlation plot.

Both scatterplots plot these variables and add a simple linear regression line to the graph.

For either graph, an upward relationship indicates higher average keyword values tend towards more article shares. A negative relation would indicate a lower average keyword values tend towards more article shares.

In addition, both graphs use differing color for weekday and weekend articles so that we can spot any possible trends with those values too.

The first scatterplot uses the default R generated axes so that potential outliers or significant observations can be observed.

The second scatterplot reduces the scale of both axes to make it easier to spot relationships for the majority of data that occur within these bounds.

###Create new factor version of weekend variable 
### to use later in graphs
scatterData<-channelData %>% 
  mutate(dayType=ifelse(is_weekend,"Weekend","Weekday"))
scatterData$dayType<-as.factor(scatterData$dayType)

###First scatter plot with ALL data 
g<-ggplot(data = scatterData,
          aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
  geom_smooth(method = lm) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(labels = scales::comma) +
  labs(x="Avg. keyword (max. shares)", y="Article Shares",
       title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",color="") 

###Second scatter plot with reduced axes
g<-ggplot(data = scatterData,
          aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
  geom_smooth(method = lm) +
  ylim(0,10000) +
  xlim(0,20000) +
    labs(x="Avg. keyword (max. shares)", y="Article Shares",
       title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",
       color="")

The bar plot below shows cumulative article publications for each day of the week, with higher bars indicating more publications. However, days with the largest number of publications are not necessarily ones with the most article shares, as seen in the subsequent box plot.

# Subset columns to include only weekday_is_*
weekdayData <- channelData %>% select(starts_with("weekday_is"))

# Calculate sum of articles published in each week day
articlesPublished <- lapply(weekdayData, function(c) sum(c=="1"))

# Use factor to set specific order in bar plot
weekPubDF <- data.frame(weekday=c("Monday", "Tuesday", "Wednesday", 
                           "Thursday", "Friday", "Saturday", "Sunday"),
                count=articlesPublished)
weekPubDF$weekday = factor(weekPubDF$weekday, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", 
                           "Thursday", "Friday", "Saturday"))

# Create bar plot with total publications by day
weekdayBar <- ggplot(weekPubDF, aes(x = weekday, y = articlesPublished)) + geom_bar(stat = "identity", color = "#123456", fill = "#0072B2") 
weekdayBar + labs(x = "Day", y = "Number published",
       title = "Article publications by day of week")

The boxplot below examines the day of article publication (Monday-Sunday) and the associated distribution of article shares. The median line indicates the center of the distribution of shares, and comparatively high medians indicate days that have relatively high circulation of Mashable articles in social media networks. For days in which the median is closer to the lower quartile (and where the upper whisker may be taller than the lower whisker), the distribution is skewed to the right. Conversely, a median that is closer to the upper quartile indicates a distribution that is skewed to the left. Days with relatively taller boxplots also have greater variability of shares.

# Subset columns to include only weekday_is_*, shares,
# create categorical variable, "day", denoting day of week (Mon-Sun)
medianShares <- channelData %>% select(starts_with("weekday_is"), shares) %>% mutate(day = NA)

# Populate "day"
for (i in 1:nrow(medianShares)) {
  if (medianShares$weekday_is_monday[i] == 1) {
    medianShares$day[i] = "Monday"
  }
  else if (medianShares$weekday_is_tuesday[i] == 1) {
    medianShares$day[i] = "Tuesday"
  }
  else if (medianShares$weekday_is_wednesday[i] == 1) {
    medianShares$day[i] = "Wednesday"
  }
  else if (medianShares$weekday_is_thursday[i] == 1) {
    medianShares$day[i] = "Thursday"
  }
  else if (medianShares$weekday_is_friday[i] == 1) {
    medianShares$day[i] = "Friday"
  }
  else if (medianShares$weekday_is_saturday[i] == 1) {
    medianShares$day[i] = "Saturday"
  }
  else if (medianShares$weekday_is_sunday[i] == 1) {
    medianShares$day[i] = "Sunday"
  }
  else {
    medianShares$day[i] = NA
  }
}

# Transform "day" into factor with levels to control order of boxplots
medianShares$day <- factor(medianShares$day, 
                           levels = c("Monday", "Tuesday", "Wednesday", 
                                      "Thursday", "Friday", "Saturday", "Sunday"))

# Plot distribution of shares for each day of the week
sharesBox <- ggplot(medianShares, aes(x = day, y = shares, fill = day))

sharesBox + geom_boxplot(outlier.shape = NA) + 
  # Exclude extreme outliers, limit range of y-axis
  coord_cartesian(ylim = quantile(medianShares$shares, c(0.1, 0.95))) +
  # Remove legend after coloration
  theme(legend.position = "none") +
  labs(x = "Day", y = "Shares",
       title = "Distribution of article shares for each publication day") + scale_fill_brewer(palette = "Spectral")

For the empirical cumulative distribution function (ECDF) below, the dplyr ranking function ntile() divides shares into four groups. Observations with the fewest shares are placed into group 1, those with the most shares are placed into group 4, and intermediaries reside in groups 2 and 3. The horizontal axis lists word count, and the vertical axis lists the percentage of content with that word count. A divergence of the colored lines suggests that the number of words differs in content with the fewest and most shares. At any given percentage of content (y-value), curves further to the right correspond to more words within the associated shares group. Groups with curves that are further to the left indicate fewer words in that percentage of content.

# Create variable to for binning the shares
binnedShares <- channelData %>% mutate(shareQuantile = ntile(channelData$shares, 4))
binnedShares <- binnedShares %>% mutate(totalMedia = num_imgs + num_videos)

# Render and label word count ECDF, group by binned shares
avgWordHisto <- ggplot(binnedShares, aes(x = n_tokens_content, colour = shareQuantile))
avgWordHisto + stat_ecdf(geom = "step", aes(color = as.character(shareQuantile))) +
  labs(title="ECDF - Number of words in the article \ngrouped by article shares (ranked)",
     y = "ECDF", x="Word count") + xlim(0,2000) + 
  scale_colour_brewer(palette = "Spectral", name = "Article shares \n(group rank)")

Modeling

Splitting Data

Per project requirements, the data for each channel are split with 70% of the data becoming training data and 30% of the data becoming test data.

#Using set.seed per suggestion so that work will be reproducible
set.seed(20)

dataIndex <-createDataPartition(channelData$shares, p = 0.7, list = FALSE)

channelTrain <-channelData[dataIndex,]
channelTest <-channelData[-dataIndex,]

Linear Regression Models

Linear regression models describe a linear relationship between a response variable and one or more explanatory variables. Models with one explanatory variable are called simple linear regression models and models with more than one explanatory variable are called multiple linear regression models. Multiple linear regression models can include polynomial and interaction terms. Each explanatory variable has an associated estimated parameter. All linear regression models are linear in the parameters.

For linear regression, explanatory variables can be continuous or categorical. However, response variables are only continuous for linear regression models.

Linear regression models are fit with training data by minimizing the sum of squared errors. Model fitting results in a line for simple linear regression and a saddle for multiple linear regression.

The first linear regression model contains predictors that encompass content keywords, sentiment and subjectivity, the length of content (the effects of which were gleaned previously from the ECDF), and link citations.

# Parallel cluster setup
cl <- makePSOCKcluster(6)
registerDoParallel(cl)

# Linear regression with subset of predictors (p-value < 0.1) selected after performing 
# least squares fit on the entire set of predictors. 
lmFit1 <- train(shares ~ kw_avg_avg + kw_max_avg + kw_min_avg + 
    num_hrefs + self_reference_min_shares + global_subjectivity + 
    num_self_hrefs + n_tokens_title + n_tokens_content + n_unique_tokens + 
    average_token_length + kw_min_max + num_keywords + kw_max_min + abs_title_subjectivity + 
    global_rate_positive_words, 
    data = channelTrain,
               method = "lm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", 
                                        number = 5))
stopCluster(cl)

The second linear regression model contains main effects for most of the predictors listed earlier in the correlation plot. If variables had more than .50 pairwise correlation in any channel, one variable of that pair was excluded. Excluded variables were: self_reference_min_shares, kw_avg_max, and LDA_01.

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

lmFit2 <- train(shares ~ kw_min_avg +
        kw_max_avg + LDA_03 +  
        self_reference_avg_sharess + LDA_02 +
        kw_avg_min + n_non_stop_unique_tokens, 
               data = channelTrain,
               method = "lm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", 
                                        number = 10))


stopCluster(cl)

Random Forest Model

Random forest models aggregate results from many sample decision trees. Those sample trees are produced using bootstrap samples created using resampling with replacement. A tree is trained on each bootstrap sample, resulting in a prediction based on that training sample data. Results from all bootstrap samples are averaged to arrive at a final prediction.

Both bagging and random forest methods use bootstrap sampling with decision trees. However, bagging includes all predictors which can lead to less reduction in variance when strong predictors exist. Unlike bagging, random forests do not use all predictors but use a random subset of predictors for each bootstrap tree fit. Random forests usually have a better fit than bagging models.

In this particular case, the response shares is continuous and we are working with regression trees. The mtry tuning parameter controls how many random predictors are used in the bootstrap samples. An mtry of 1 to 30 was chosen as a way to evaluate up to 30 predictors. These values were chosen to work within available computing constraints. Five fold cross validation is used to choose the optimal mtry value corresponding to the lowest RMSE.

##Run time presented a challenge so parallel processing was used
##Followed Parallel instructions on caret page
##   https://topepo.github.io/caret/parallel-processing.html

##Various mtry values were tried with a 20 minute runtime goal
##  A 20 minute per channel runtime corresponds 
##  to a total of about 2 hours model fit for all 6 channels

##mtry 1:30 was chosen because it was close to 20 minutes
##mtry 1:20 had 10 minute runtime and 1:30 took 30 minutes

##repeatedcv was evaluated but took over 30 minutes 
## thus repeats were not used


cl <- makePSOCKcluster(6)
registerDoParallel(cl)

rfFit <- train(shares ~ ., data = channelTrain,
               method = "rf",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv",
                                number = 5),
               tuneGrid = data.frame(mtry = 1:30))

stopCluster(cl)

rfFit

## Random Forest 
## 
## 4382 samples
##   52 predictor
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3506, 3506, 3504, 3506, 3506 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE     
##    1    14890.12  0.03145458  2837.164
##    2    15040.96  0.02933558  2903.931
##    3    15171.47  0.02567270  2933.521
##    4    15335.50  0.02332574  2974.780
##    5    15532.18  0.02025297  2991.222
##    6    15653.58  0.01839011  3023.554
##    7    15677.90  0.01855645  3031.803
##    8    15996.41  0.01622697  3070.579
##    9    15975.03  0.01698446  3080.933
##   10    16042.27  0.01771635  3090.988
##   11    16173.43  0.01588203  3085.666
##   12    16148.39  0.01551689  3100.412
##   13    16119.63  0.01625215  3088.022
##   14    16363.03  0.01455047  3130.017
##   15    16181.99  0.01489809  3120.928
##   16    16251.89  0.01388384  3128.875
##   17    16373.60  0.01431683  3141.608
##   18    16494.23  0.01238045  3154.666
##   19    16413.15  0.01293169  3152.633
##   20    16597.18  0.01249404  3169.338
##   21    16680.73  0.01222214  3170.733
##   22    16402.72  0.01151833  3152.689
##   23    16578.82  0.01358022  3156.448
##   24    16466.51  0.01191402  3168.896
##   25    16567.99  0.01181646  3185.864
##   26    16524.27  0.01221460  3163.637
##   27    16738.15  0.01163398  3193.647
##   28    16837.01  0.01245574  3195.433
##   29    16803.55  0.01211917  3195.775
##   30    16938.78  0.01136888  3210.828
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 1.

After fitting the random forest model, the following variable importance plot is created. The top ten most important predictors are plotted using a scale of 0 to 100.

rfImp <- varImp(rfFit, scale = TRUE)
plot(rfImp,top = 10, main="Random Forest Model\nTop 10 Importance Plot")

Boosted Tree Model

Boosting is a general method whereby decision trees are grown sequentially using residuals (the differences between observed values and predicted values of a variable) as the response. Initial prediction values start at 0 for all combinations of predictors, so that the first set of residuals matches the observed values in our data. To mitigate low bias and high variance, contributions from subsequent trees are scaled with a shrinkage parameter, λ. The value of this parameter is generally small (0.01 or 0.001), which slows tree growth and tampers overfitting (James et al., 2021).

# Re-allocate cores for parallel computing
cl <- makePSOCKcluster(6)
registerDoParallel(cl)


# Boosted tree fit with tuneLength (let function decide parameter combinations)
boostedTreeFit <- train(shares ~ ., data = channelTrain,
               method = "gbm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", 
                                        number = 5),  
               tuneLength = 5)

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1 283409072.2697             nan     0.1000 -83960.4166
##      2 280978872.6293             nan     0.1000 -146705.1979
##      3 279012218.5718             nan     0.1000 -338754.5361
##      4 277061267.4391             nan     0.1000 -469478.7030
##      5 276859225.2438             nan     0.1000 -118503.0608
##      6 275488071.0598             nan     0.1000 -282600.4104
##      7 274290200.1770             nan     0.1000 -650634.8451
##      8 274077262.0476             nan     0.1000 67149.3659
##      9 273787020.2861             nan     0.1000 382104.7027
##     10 273037220.0058             nan     0.1000 -664751.2152
##     20 264986383.8918             nan     0.1000 -341841.7639
##     40 256506148.7169             nan     0.1000 -1703672.3617
##     50 254139580.5369             nan     0.1000 -1073899.6042

# Define tuning parameters based on $bestTune from the permutations above
nTrees <- boostedTreeFit$bestTune$n.trees
interactionDepth = boostedTreeFit$bestTune$interaction.depth
minObs = boostedTreeFit$bestTune$n.minobsinnode
shrinkParam <- boostedTreeFit$bestTune$shrinkage

# Boosted tree fit with defined parameters
bestBoostedTree <- train(shares ~ ., data = channelTrain,
               method = "gbm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", 
                                        number = 5),  
               tuneGrid = expand.grid(n.trees = nTrees, interaction.depth = interactionDepth,
                                      shrinkage = shrinkParam, n.minobsinnode = minObs))

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1 282087483.6694             nan     0.1000 -290705.5199
##      2 280380382.6921             nan     0.1000 -519595.0378
##      3 279343194.4944             nan     0.1000 -376257.1922
##      4 278639210.0420             nan     0.1000 -1257722.7860
##      5 278222946.4359             nan     0.1000 -1191411.8982
##      6 276735736.6750             nan     0.1000 997418.0997
##      7 275206863.3175             nan     0.1000 922257.3742
##      8 272916782.4655             nan     0.1000 -459522.7614
##      9 272999070.4616             nan     0.1000 -847294.8887
##     10 270856713.5097             nan     0.1000 -357775.4802
##     20 263415447.8759             nan     0.1000 -1069225.0043
##     40 254738882.2609             nan     0.1000 -810467.6399
##     50 251233557.7684             nan     0.1000 -836015.9677

stopCluster(cl)

Model Comparisons

After models were fit with training data, we do predictions with testing data. Finally, RMSE metrics are extracted and compared. The model with lowest RMSE is presented as the winning model.

# Predict using test data
predictLM1 <- predict(lmFit1, newdata = channelTest)

# Metrics
RMSELM1 <- postResample(predictLM1, obs = channelTest$shares)["RMSE"][[1]]
RMSELM1

## [1] 9683.193

# Store value for model comparison
modelPerformance <- tibble(RMSE = RMSELM1, Model = "Linear regression 1")

predictLM2 <- predict(lmFit2, newdata = channelTest)
RMSELM2<-postResample(predictLM2, channelTest$shares)["RMSE"][[1]]
RMSELM2

## [1] 9707.911

modelPerformance <- add_row(modelPerformance, RMSE = RMSELM2, Model = "Linear regression 2")

predictRF <- predict(rfFit, newdata = channelTest)
RMSERF<-postResample(predictRF, channelTest$shares)["RMSE"]
RMSERF

##     RMSE 
## 9442.735

modelPerformance <- add_row(modelPerformance, RMSE = RMSERF, Model = "Random forest")

# Predict using test data
predictGBM <- predict(bestBoostedTree, newdata = channelTest)

# Metrics
RMSEGBM <- postResample(predictGBM, obs = channelTest$shares)["RMSE"]
RMSEGBM

##     RMSE 
## 10295.24

modelPerformance <- add_row(modelPerformance, RMSE = RMSEGBM, Model = "Boosted tree")

# Select row with lowest value of RMSE.
selectModel <- modelPerformance %>% slice_min(RMSE)
selectModel

## # A tibble: 1 x 2
##    RMSE Model        
##   <dbl> <chr>        
## 1 9443. Random forest

Based on the preceding analyses with test data, the Random forest model yields the lowest RMSE - 9442.7346433.

References

Fernandes, K., Vinagre, P., & Cortez, P. (2015). A proactive intelligent decision support system for predicting the popularity of online news. In F. Pereira, P. Machado, E. Costa, & A. Cardoso (Eds.), *Progress in artificial intelligence* (pp. 535–546). Springer International Publishing.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An introduction to statistical learning*. Springer US. <https://doi.org/10.1007/978-1-0716-1418-1>