Analysis for the World Channel
Maks Nikiforov and Mark Austin Due 10/31/2021
##For markdown automation need a different
## image and cache folder
## for each of the 6 channels so that results
## from different channels don't overwrite each other
##Also setting up currentChannel variable
if (params$channel=="data_channel_is_bus") {
knitr::opts_chunk$set(fig.path = "images/bus/",
cache.path = "cache/bus/")
currentChannel<-"Business"
} else if (params$channel=="data_channel_is_entertainment") {
knitr::opts_chunk$set(fig.path = "images/entertainment/",
cache.path="cache/entertainment/")
currentChannel<-"Entertainment"
} else if (params$channel=="data_channel_is_lifestyle") {
knitr::opts_chunk$set(fig.path = "images/lifestyle/",
cache.path = "cache/lifestyle/")
currentChannel<-"Lifestyle"
} else if (params$channel=="data_channel_is_socmed") {
knitr::opts_chunk$set(fig.path = "images/socmed/",
cache.path = "cache/socmed/")
currentChannel<-"Social Media"
} else if (params$channel=="data_channel_is_tech") {
knitr::opts_chunk$set(fig.path = "images/tech/",
cache.path = "cache/tech/")
currentChannel<-"Tech"
} else if (params$channel=="data_channel_is_world") {
knitr::opts_chunk$set(fig.path = "images/world/",
cache.path = "cache/world/")
currentChannel<-"World"
}
Data Import
Data was imported first to allow for a more automated introduction.
# Read all data into a tibble
fullData<-read_csv("./data/OnlineNewsPopularity.csv")
# Eliminate non-predictive variables
reduceVarsData<-fullData %>% select(-url,-timedelta)
#test code for pre markdown automation
#params$channel<-"data_channel_is_bus"
#filter by the current params channel
channelData<-reduceVarsData %>% filter(eval(as.name(params$channel))==1)
# URL data for top ten articles in each category
channelDataURL <- fullData %>% filter(eval(as.name(params$channel))==1)
###Can now drop the data channel variables
channelData<-channelData %>% select(-starts_with("data_channel"))
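The filtering step above selects rows through a column whose name is stored in a string. The following is a minimal base R sketch of the same idea; the `toy` data frame and the `channel` variable are hypothetical stand-ins for `reduceVarsData` and `params$channel`, not part of the original analysis.

```r
# Hypothetical toy data standing in for the news data set
toy <- data.frame(
  shares = c(100, 250, 900, 40),
  data_channel_is_world = c(1, 0, 1, 0),
  data_channel_is_tech  = c(0, 1, 0, 0)
)
channel <- "data_channel_is_world"  # stand-in for params$channel

# Keep rows where the indicator column named by `channel` equals 1
channelRows <- toy[toy[[channel]] == 1, ]

# Drop every indicator column once the filter has been applied
keepCols <- !startsWith(names(channelRows), "data_channel")
channelRows <- channelRows[, keepCols, drop = FALSE]
channelRows
```

Indexing with `[[channel]]` avoids evaluating a constructed name, which can be easier to read than `eval(as.name(...))` when the column name comes from a parameter.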
Introduction
This page offers an exploratory data analysis of World articles in the online news popularity data set. The top ten articles in this category, based on the number of shares on social media, include the following titles:
Shares | Article title |
---|---|
284700 | U.S. Will Now Monitor All Travelers From Ebola Zone for 21 Days |
141400 | Study: 54% of Online Adults Would Spend Tax Refunds on Travel |
128500 | Apple Fixes FaceTime Bug With iOS Update |
115700 | Mystery drones fly over French nuclear sites |
111300 | Prince Harry Reaches South Pole After 200-Mile Trek for Charity |
108400 | 12 Hours in ‘Utopia’: On the Set of Fox’s Newest Reality Show |
96500 | These Glasses Let You Play in 3D Virtual Worlds |
84800 | Thousands of Children Orphaned by West Africa’s Ebola Crisis |
75500 | With Lima climate talks entering critical period, Kerry tries to rally leaders to act |
69300 | 11 People Who Should Cancel Their Gym Memberships |
Two variables, `url` and `timedelta`, are non-predictive and have been removed. The remaining 53 variables cover 8427 observations, which make up 21.3 percent of the original data set. Fernandes et al., who sourced the data, concentrated on article characteristics such as verbosity and the polarity of content, publication day, the quantity of included media, and keyword attributes (Fernandes et al., 2015). A subset of these variables and the correlations between them are explored in subsequent sections.
The broader purpose of this analysis is to use supervised learning to predict a target variable, `shares`. To this end, the final sections outline four models for making such predictions and assess their relative performance. Two models are rooted in multiple linear regression, which assesses relationships between a response variable and two or more predictors. The remaining models are based on random forest and boosted tree techniques. The random forest method averages results from multiple decision trees, each fitted with a random subset of predictors. The boosted tree method forgoes averaging; it grows trees sequentially, with each tree fit to the residuals left by the previous ones (James et al., 2021).
Summarizations
Numerical Summaries
The first table summarizes article shares grouped by whether an article was published on a weekend. This summary gives an idea of the center and spread of `shares` for weekday and weekend articles.
channelData %>%
mutate(dayType=ifelse(is_weekend,"Weekend","Weekday")) %>%
group_by(dayType) %>%
summarise(Avg = mean(shares), Sd = sd(shares),
Median = median(shares), IQR =IQR(shares)) %>% kable()
dayType | Avg | Sd | Median | IQR |
---|---|---|---|---|
Weekday | 2229.789 | 6271.037 | 1100 | 1000 |
Weekend | 2679.424 | 4666.479 | 1500 | 1300 |
The next table expands on the first by grouping `shares` by each day of the week. This summary gives an idea of the center and spread of `shares` across the days of the week.
dowData<-channelData %>% select(starts_with("weekday_is"),shares) %>%
mutate(dayofWeek=case_when(as.logical(weekday_is_monday)~"Monday",
as.logical(weekday_is_tuesday)~"Tuesday",
as.logical(weekday_is_wednesday)~"Wednesday",
as.logical(weekday_is_thursday)~"Thursday",
as.logical(weekday_is_friday)~"Friday",
as.logical(weekday_is_saturday)~"Saturday",
as.logical(weekday_is_sunday)~"Sunday")) %>%
select(dayofWeek,shares)
dowLevels<-c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday")
dowData$dayofWeek<-factor(dowData$dayofWeek,levels = dowLevels)
dowData %>%
group_by(dayofWeek) %>%
summarise(Avg = mean(shares), Sd = sd(shares),
Median = median(shares), IQR =IQR(shares)) %>% kable()
dayofWeek | Avg | Sd | Median | IQR |
---|---|---|---|---|
Monday | 2456.054 | 6864.797 | 1100 | 968.25 |
Tuesday | 2220.135 | 5677.929 | 1100 | 929.00 |
Wednesday | 1879.788 | 3135.450 | 1100 | 919.00 |
Thursday | 2394.008 | 8584.880 | 1100 | 911.00 |
Friday | 2228.411 | 5792.085 | 1100 | 1052.00 |
Saturday | 2760.202 | 4864.959 | 1500 | 1500.00 |
Sunday | 2605.483 | 4480.142 | 1400 | 1200.00 |
The table below highlights variables with the highest and most significant correlations in the data set. This output may be considered when analyzing covariance to control for potentially confounding variables.
# Display top 10 highest correlations
covarianceDF <- corr_cross(df = channelData, max_pvalue = 0.05, top = 10, plot = 0) %>%
select(key, mix, corr, pvalue) %>% rename("Variable 1" = key, "Variable 2" = mix,
"Correlation" = corr, "p-value" = pvalue)
# Display non-zero p-values
covarianceDF[4] <- format.pval(covarianceDF[4])
kable(covarianceDF)
Variable 1 | Variable 2 | Correlation | p-value |
---|---|---|---|
n_non_stop_words | average_token_length | 0.963012 | < 2.22e-16 |
kw_max_min | kw_avg_min | 0.955981 | < 2.22e-16 |
n_unique_tokens | n_non_stop_unique_tokens | 0.952642 | < 2.22e-16 |
kw_min_min | kw_max_max | -0.872529 | < 2.22e-16 |
self_reference_max_shares | self_reference_avg_sharess | 0.858039 | < 2.22e-16 |
self_reference_min_shares | self_reference_avg_sharess | 0.846895 | < 2.22e-16 |
kw_max_avg | kw_avg_avg | 0.821122 | < 2.22e-16 |
n_non_stop_words | n_non_stop_unique_tokens | 0.815246 | < 2.22e-16 |
n_non_stop_unique_tokens | average_token_length | 0.772018 | < 2.22e-16 |
global_rate_negative_words | rate_negative_words | 0.765871 | < 2.22e-16 |
Contingency Table
The following contingency table displays counts and sums for the number of article shares within given ranges by the day of the week on which the article was published. Share ranges were selected to illustrate lower, medium, and higher ranges of shares. Examining these counts can show possible patterns of shares by day of week and by share range.
##dig.lab is needed to avoid R defaulting to scientific notation
kable(addmargins(table
(dowData$dayofWeek,cut(dowData$shares,
c(0,200,1000,10000,860000),dig.lab = 7))))
Day | (0,200] | (200,1000] | (1000,10000] | (10000,860000] | Sum |
---|---|---|---|---|---|
Monday | 11 | 607 | 693 | 45 | 1356 |
Tuesday | 12 | 750 | 740 | 44 | 1546 |
Wednesday | 12 | 761 | 757 | 35 | 1565 |
Thursday | 15 | 740 | 762 | 52 | 1569 |
Friday | 14 | 548 | 707 | 36 | 1305 |
Saturday | 12 | 116 | 369 | 22 | 519 |
Sunday | 2 | 121 | 424 | 20 | 567 |
Sum | 78 | 3643 | 4452 | 254 | 8427 |
Plots
The following histogram looks at the distribution of `shares`. A pseudo log y scale with modified y break values was used so that `shares` values with low frequency will appear. We can tell from the histogram whether `shares` has a symmetric or skewed distribution. The distribution is symmetric if the tails are the same around the center. The distribution is left skewed if there is a long left tail and right skewed if there is a long right tail.
###creating histogram of shares data
##scales comma was used to avoid the default scientific notation
##pseudo log with breaks was used to make low frequency values
## more visible
g <- ggplot(channelData, aes( x = shares))
g + geom_histogram(binwidth=12000,color = "brown", fill = "green",
size = 1) + labs(x="Article Shares", y="Pseudo Log of Count",
title = "Histogram of Article Shares") +
scale_y_continuous(trans = "pseudo_log",
breaks = c(0:3, 2000, 6000),minor_breaks = NULL) +
scale_x_continuous(labels = scales::comma)
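The skewness described above can also be checked numerically. The sketch below uses simulated log-normal values rather than the real `shares` data, and the `skewness()` helper is defined here for illustration (it is not from a package used in this analysis).

```r
set.seed(1)
# Simulated right-skewed values (log-normal), not the real shares data
x <- rlnorm(1000, meanlog = 7, sdlog = 1)

# Sample skewness: the third standardized moment
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skewValue <- skewness(x)

# For right-skewed data the mean is pulled above the median
c(mean = mean(x), median = median(x), skewness = skewValue)
```

A positive skewness value corresponds to a long right tail, a negative value to a long left tail, and a value near zero to a roughly symmetric distribution.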
Fernandes et al. highlight several variables in their random forest model (Fernandes et al., 2015). The following variables from their top 11 were included in the correlation plot below, with the names in parentheses used as shortened labels for this plot: `shares`, `kw_min_avg`, `kw_max_avg`, `LDA_03`, `self_reference_min_shares` (`srmin_shares`), `kw_avg_max`, `self_reference_avg_sharess` (`sravg_shares`), `LDA_02`, `kw_avg_min`, `LDA_01`, `n_non_stop_unique_tokens` (`n_nstop_utokens`).
The plot shows the correlation of each variable with the response variable `shares` and with every other variable. Larger circles indicate stronger positive (blue) or negative (red) correlation, and the correlation values appear in the lower portion of the plot.
##Reduce variable name length for later plotting
## Otherwise var names overwrite Title no matter
## how many other size tweaks were made
corrData<-channelData %>%
mutate(sravg_shares=self_reference_avg_sharess,
srmin_shares=self_reference_min_shares,
n_nstop_utokens=n_non_stop_unique_tokens)
Correlation<-cor(select(corrData, shares, kw_min_avg,
kw_max_avg, LDA_03, srmin_shares,
kw_avg_max, sravg_shares, LDA_02,
kw_avg_min, LDA_01, n_nstop_utokens),
method = "spearman")
corrplot(Correlation,type="upper",tl.pos="lt", tl.cex = .70)
corrplot(Correlation,type="lower",method="number",
add=TRUE,diag=FALSE,tl.pos="n",tl.cex = .70,number.cex = .75,
title =
"Correlation Plot of Shares and Variables of Interest",
mar=c(0,0,.50,0),cex.main = .75)
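The correlation matrix above is computed with `method = "spearman"`, which measures monotonic rather than strictly linear association. As a small illustration on simulated data (not the news data), Spearman correlation equals Pearson correlation computed on the ranks of the values:

```r
set.seed(2)
x <- rexp(200)                    # simulated skewed predictor
y <- x^2 + rnorm(200, sd = 0.1)   # monotonic but non-linear relationship

spearman       <- cor(x, y, method = "spearman")
pearsonOnRanks <- cor(rank(x), rank(y))  # Pearson correlation of the ranks
pearson        <- cor(x, y)              # ordinary Pearson, for comparison

c(spearman = spearman, pearsonOnRanks = pearsonOnRanks, pearson = pearson)
```

Because it works on ranks, the Spearman measure is less sensitive to the extreme values that are common in heavily skewed variables such as share counts.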
The following two scatterplots illustrate the relationship between the response, article shares (`shares`), and the predictor average keyword (max shares), `kw_max_avg`. `kw_max_avg` was chosen because it was one of the potential predictors examined in the previous correlation plot.
Both scatterplots plot these variables and add a simple linear regression line to the graph.
For either graph, an upward trend indicates that higher average keyword values tend towards more article shares. A downward trend would indicate that lower average keyword values tend towards more article shares.
In addition, both graphs use differing color for weekday and weekend articles so that we can spot any possible trends with those values too.
The first scatterplot uses the default R generated axes so that potential outliers or significant observations can be observed.
The second scatterplot reduces the scale of both axes to make it easier to spot relationships for the majority of data that occur within these bounds.
###Create new factor version of weekend variable
### to use later in graphs
scatterData<-channelData %>%
mutate(dayType=ifelse(is_weekend,"Weekend","Weekday"))
scatterData$dayType<-as.factor(scatterData$dayType)
###First scatter plot with ALL data
g<-ggplot(data = scatterData,
aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
geom_smooth(method = lm) +
scale_y_continuous(labels = scales::comma) +
scale_x_continuous(labels = scales::comma) +
labs(x="Avg. keyword (max. shares)", y="Article Shares",
title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",color="")
###Second scatter plot with reduced axes
g<-ggplot(data = scatterData,
aes(x= kw_max_avg,y=shares))
g + geom_point(aes(color=dayType)) +
geom_smooth(method = lm) +
ylim(0,10000) +
xlim(0,20000) +
labs(x="Avg. keyword (max. shares)", y="Article Shares",
title = "Scatter Plot of Article Shares Versus Avg. keyword (max. shares)",
color="")
The bar plot below shows cumulative article publications for each day of the week, with higher bars indicating more publications. However, days with the largest number of publications are not necessarily ones with the most article shares, as seen in the subsequent box plot.
# Subset columns to include only weekday_is_*
weekdayData <- channelData %>% select(starts_with("weekday_is"))
# Calculate sum of articles published on each day of the week
# (sapply returns a named numeric vector, one count per weekday column)
articlesPublished <- sapply(weekdayData, function(x) sum(x == 1))
# Use factor to set specific order in bar plot
weekPubDF <- data.frame(weekday=c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"),
count=articlesPublished)
weekPubDF$weekday <- factor(weekPubDF$weekday, levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday"))
# Create bar plot with total publications by day
weekdayBar <- ggplot(weekPubDF, aes(x = weekday, y = count)) + geom_bar(stat = "identity", color = "#123456", fill = "#0072B2")
weekdayBar + labs(x = "Day", y = "Number published",
title = "Article publications by day of week")
The boxplot below examines the day of article publication (Monday-Sunday) and the associated distribution of article `shares`. The median line indicates the center of the distribution of `shares`, and comparatively high medians indicate days that have relatively high circulation of Mashable articles in social media networks. For days in which the median is closer to the lower quartile (and where the upper whisker may be taller than the lower whisker), the distribution is skewed to the right. Conversely, a median that is closer to the upper quartile indicates a distribution that is skewed to the left. Days with relatively taller boxplots also have greater variability of `shares`.
# Subset columns to include only weekday_is_*, shares,
# create categorical variable, "day", denoting day of week (Mon-Sun)
medianShares <- channelData %>% select(starts_with("weekday_is"), shares) %>% mutate(day = NA)
# Populate "day"
# Vectorized with case_when() rather than a row-by-row loop
medianShares <- medianShares %>%
  mutate(day = case_when(weekday_is_monday == 1 ~ "Monday",
                         weekday_is_tuesday == 1 ~ "Tuesday",
                         weekday_is_wednesday == 1 ~ "Wednesday",
                         weekday_is_thursday == 1 ~ "Thursday",
                         weekday_is_friday == 1 ~ "Friday",
                         weekday_is_saturday == 1 ~ "Saturday",
                         weekday_is_sunday == 1 ~ "Sunday",
                         TRUE ~ NA_character_))
# Transform "day" into factor with levels to control order of boxplots
medianShares$day <- factor(medianShares$day,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
# Plot distribution of shares for each day of the week
sharesBox <- ggplot(medianShares, aes(x = day, y = shares, fill = day))
sharesBox + geom_boxplot(outlier.shape = NA) +
# Exclude extreme outliers, limit range of y-axis
coord_cartesian(ylim = quantile(medianShares$shares, c(0.1, 0.95))) +
# Remove legend after coloration
theme(legend.position = "none") +
labs(x = "Day", y = "Shares",
title = "Distribution of article shares for each publication day") + scale_fill_brewer(palette = "Spectral")
For the empirical cumulative distribution function (ECDF) below, the `dplyr` ranking function `ntile()` divides `shares` into four groups. Observations with the fewest shares are placed into group 1, those with the most shares are placed into group 4, and intermediaries reside in groups 2 and 3. The horizontal axis lists word count, and the vertical axis lists the proportion of content with at most that word count. A divergence of the colored lines suggests that the number of words differs between content with the fewest and most shares. At any given proportion of content (y-value), curves further to the right correspond to more words within the associated `shares` group, and curves further to the left correspond to fewer words.
# Create variable for binning the shares
binnedShares <- channelData %>% mutate(shareQuantile = ntile(shares, 4))
binnedShares <- binnedShares %>% mutate(totalMedia = num_imgs + num_videos)
# Render and label word count ECDF, group by binned shares
avgWordHisto <- ggplot(binnedShares, aes(x = n_tokens_content, colour = shareQuantile))
avgWordHisto + stat_ecdf(geom = "step", aes(color = as.character(shareQuantile))) +
labs(title="ECDF - Number of words in the article \ngrouped by article shares (ranked)",
y = "ECDF", x="Word count") + xlim(0,2000) +
scale_colour_brewer(palette = "Spectral", name = "Article shares \n(group rank)")
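To make the reading of the ECDF concrete, here is a small base R sketch with simulated word counts for two hypothetical share groups (these are not values from the news data). Evaluating each group's ECDF at a word count returns the proportion of articles with at most that many words.

```r
set.seed(3)
# Simulated word counts for two hypothetical share groups
wordsLow  <- rpois(500, lambda = 450)
wordsHigh <- rpois(500, lambda = 550)

# ecdf() returns a function giving the proportion of values <= its argument
ecdfLow  <- ecdf(wordsLow)
ecdfHigh <- ecdf(wordsHigh)

# Proportion of articles with at most 500 words in each group
propLow  <- ecdfLow(500)
propHigh <- ecdfHigh(500)
c(low = propLow, high = propHigh)
```

In this simulated case the curve for the second group sits further to the right: at the same word count, a smaller proportion of its articles fall at or below that length.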
Modeling
Splitting Data
Per project requirements, the data for each channel are split with 70% of the data becoming training data and 30% of the data becoming test data.
#Using set.seed per suggestion so that work will be reproducible
set.seed(20)
dataIndex <-createDataPartition(channelData$shares, p = 0.7, list = FALSE)
channelTrain <-channelData[dataIndex,]
channelTest <-channelData[-dataIndex,]
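For intuition, the 70/30 split above can be sketched in base R with a simple random sample. This is an assumption-laden simplification: `createDataPartition()` additionally samples within percentile groups of a numeric outcome so that its distribution is similar in both sets, which plain `sample()` does not do. The `toy` data frame is made up for the example.

```r
set.seed(20)
# Hypothetical toy data; column names borrowed from the data set for readability
toy <- data.frame(shares = rpois(100, 1500),
                  n_tokens_content = rpois(100, 500))

# Draw 70% of the row indices at random, without replacement
trainIndex <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))

toyTrain <- toy[trainIndex, ]   # rows used to fit models
toyTest  <- toy[-trainIndex, ]  # held-out rows used only for evaluation

c(train = nrow(toyTrain), test = nrow(toyTest))
```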
Linear Regression Models
Linear regression models describe a linear relationship between a response variable and one or more explanatory variables. Models with one explanatory variable are called simple linear regression models and models with more than one explanatory variable are called multiple linear regression models. Multiple linear regression models can include polynomial and interaction terms. Each explanatory variable has an associated estimated parameter. All linear regression models are linear in the parameters.
For linear regression, explanatory variables can be continuous or categorical. However, response variables are only continuous for linear regression models.
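Linear regression models are fit with training data by minimizing the sum of squared errors. Model fitting results in a line for simple linear regression and, for multiple linear regression, a plane or higher-dimensional surface (which may be curved, for example a saddle, when polynomial or interaction terms are included).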
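As a minimal sketch of least squares fitting on simulated data (the variables `x1`, `x2`, and `y` are invented for the example), the coefficients returned by `lm()` match the solution of the normal equations, and moving away from them increases the sum of squared errors:

```r
set.seed(4)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n, sd = 0.5)

# Multiple linear regression fit by least squares
fit <- lm(y ~ x1 + x2)

# Same estimates from the normal equations: (X'X) beta = X'y
X <- cbind(1, x1, x2)
betaHat <- solve(t(X) %*% X, t(X) %*% y)

# Sum of squared errors at the least squares solution
sse <- sum(residuals(fit)^2)

# Any other coefficient vector gives a larger sum of squared errors
ssePerturbed <- sum((y - X %*% (betaHat + 0.1))^2)
c(sse = sse, ssePerturbed = ssePerturbed)
```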
The first linear regression model contains predictors that encompass content keywords, sentiment and subjectivity, the length of content (the effects of which were gleaned previously from the ECDF), and link citations.
# Parallel cluster setup
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
# Linear regression with subset of predictors (p-value < 0.1) selected after performing
# least squares fit on the entire set of predictors.
lmFit1 <- train(shares ~ kw_avg_avg + kw_max_avg + kw_min_avg +
num_hrefs + self_reference_min_shares + global_subjectivity +
num_self_hrefs + n_tokens_title + n_tokens_content + n_unique_tokens +
average_token_length + kw_min_max + num_keywords + kw_max_min + abs_title_subjectivity +
global_rate_positive_words,
data = channelTrain,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5))
stopCluster(cl)
The second linear regression model contains main effects for most of the predictors listed earlier in the correlation plot. If two variables had a pairwise correlation above .50 in any channel, one variable of that pair was excluded. Excluded variables were `self_reference_min_shares`, `kw_avg_max`, and `LDA_01`.
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
lmFit2 <- train(shares ~ kw_min_avg +
kw_max_avg + LDA_03 +
self_reference_avg_sharess + LDA_02 +
kw_avg_min + n_non_stop_unique_tokens,
data = channelTrain,
method = "lm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 10))
stopCluster(cl)
Random Forest Model
Random forest models aggregate results from many sample decision trees. Those sample trees are produced using bootstrap samples created using resampling with replacement. A tree is trained on each bootstrap sample, resulting in a prediction based on that training sample data. Results from all bootstrap samples are averaged to arrive at a final prediction.
Both bagging and random forest methods use bootstrap sampling with decision trees. However, bagging considers all predictors, so the trees tend to be correlated when strong predictors exist, which leads to less reduction in variance from averaging. Unlike bagging, random forests consider only a random subset of predictors at each split of each tree. Random forests usually predict better than bagging models.
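The ideas above (bootstrap samples, a random subset of predictors, and averaging) can be sketched in base R. This is a deliberately tiny illustration, not the `randomForest` implementation: each "tree" is a single-split stump, the data are simulated, and `fitStump()` and `predictStump()` are hypothetical helpers written for this example.

```r
set.seed(5)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$y <- ifelse(toy$x1 > 0.5, 10, 0) + rnorm(n)  # only x1 matters

# Fit a one-split regression tree using only the candidate predictors
fitStump <- function(data, candidates) {
  best <- list(sse = Inf)
  for (v in candidates) {
    for (cutPoint in quantile(data[[v]], probs = seq(0.1, 0.9, by = 0.1))) {
      left <- data[[v]] <= cutPoint
      sse <- sum((data$y[left] - mean(data$y[left]))^2) +
        sum((data$y[!left] - mean(data$y[!left]))^2)
      if (sse < best$sse) {
        best <- list(sse = sse, var = v, cutPoint = cutPoint,
                     leftMean = mean(data$y[left]),
                     rightMean = mean(data$y[!left]))
      }
    }
  }
  best
}

predictStump <- function(stump, newdata) {
  ifelse(newdata[[stump$var]] <= stump$cutPoint,
         stump$leftMean, stump$rightMean)
}

predictors <- c("x1", "x2", "x3")
mtry   <- 2    # number of predictors considered at the split
nTrees <- 50

stumps <- lapply(seq_len(nTrees), function(b) {
  bootRows   <- sample(n, replace = TRUE)   # bootstrap sample of rows
  candidates <- sample(predictors, mtry)    # random predictor subset
  fitStump(toy[bootRows, ], candidates)
})

# Final prediction: average the predictions of all trees
allPreds   <- sapply(stumps, predictStump, newdata = toy)
forestPred <- rowMeans(allPreds)
```

Setting `mtry` equal to the number of predictors would turn this sketch into plain bagging, since every stump would then be free to split on the strongest predictor.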
In this particular case, the response `shares` is continuous and we are working with regression trees. The `mtry` tuning parameter controls how many randomly selected predictors are considered as candidates at each split. `mtry` values of 1 to 30 were chosen as a way to evaluate up to 30 predictors while working within available computing constraints. Five-fold cross validation is used to choose the optimal `mtry` value, the one corresponding to the lowest RMSE.
##Run time presented a challenge so parallel processing was used
##Followed Parallel instructions on caret page
## https://topepo.github.io/caret/parallel-processing.html
##Various mtry values were tried with a 20 minute runtime goal
## A 20 minute per channel runtime corresponds
## to a total of about 2 hours model fit for all 6 channels
##mtry 1:30 was chosen because it was close to 20 minutes
##mtry 1:20 had 10 minute runtime and 1:30 took 30 minutes
##repeatedcv was evaluated but took over 30 minutes
## thus repeats were not used
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
rfFit <- train(shares ~ ., data = channelTrain,
method = "rf",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneGrid = data.frame(mtry = 1:30))
stopCluster(cl)
rfFit
## Random Forest
##
## 5900 samples
## 52 predictor
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4720, 4721, 4720, 4720, 4719
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 5182.343 0.03743642 1790.242
## 2 5183.437 0.03735817 1827.322
## 3 5195.340 0.03718695 1852.552
## 4 5185.312 0.04166386 1859.705
## 5 5198.292 0.04054853 1879.654
## 6 5221.081 0.03586387 1890.592
## 7 5219.663 0.03815146 1899.968
## 8 5228.265 0.03728502 1905.179
## 9 5245.001 0.03520613 1917.283
## 10 5259.468 0.03349519 1921.595
## 11 5264.505 0.03343803 1922.949
## 12 5255.401 0.03658707 1927.706
## 13 5266.718 0.03500375 1935.479
## 14 5280.229 0.03399299 1937.477
## 15 5278.936 0.03448087 1940.336
## 16 5289.540 0.03352459 1944.222
## 17 5296.129 0.03241678 1946.882
## 18 5302.975 0.03205211 1954.007
## 19 5318.198 0.03105698 1961.336
## 20 5320.409 0.03177288 1965.356
## 21 5320.975 0.03046745 1960.956
## 22 5336.247 0.03113255 1970.660
## 23 5329.221 0.03135101 1968.487
## 24 5346.014 0.02926197 1977.377
## 25 5339.073 0.03046821 1972.070
## 26 5345.336 0.02997528 1975.250
## 27 5367.916 0.02717491 1982.114
## 28 5360.429 0.02918193 1978.048
## 29 5353.951 0.03005881 1980.034
## 30 5356.728 0.02936480 1984.482
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 1.
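The selection rule reported in the output above (pick the tuning value with the smallest cross-validated RMSE) can be sketched in base R. To keep the example fast and self-contained, a polynomial degree stands in for `mtry` as the tuning parameter and `lm()` stands in for the random forest; the data are simulated.

```r
set.seed(6)
n <- 150
toy <- data.frame(x = runif(n, -2, 2))
toy$y <- 1 + 2 * toy$x - 1.5 * toy$x^2 + rnorm(n, sd = 0.5)

k <- 5
foldId  <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
degrees <- 1:4                                      # hypothetical tuning grid

# For each tuning value, average the hold-out RMSE over the k folds
cvRmse <- sapply(degrees, function(d) {
  foldRmse <- sapply(seq_len(k), function(f) {
    trainSet <- toy[foldId != f, ]
    holdOut  <- toy[foldId == f, ]
    fit <- lm(y ~ poly(x, d), data = trainSet)
    sqrt(mean((holdOut$y - predict(fit, newdata = holdOut))^2))
  })
  mean(foldRmse)
})

# Choose the tuning value with the smallest cross-validated RMSE
bestDegree <- degrees[which.min(cvRmse)]
data.frame(degree = degrees, RMSE = cvRmse)
```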
After fitting the random forest model, the following variable importance plot is created. The top ten most important predictors are plotted using a scale of 0 to 100.
rfImp <- varImp(rfFit, scale = TRUE)
plot(rfImp,top = 10, main="Random Forest Model\nTop 10 Importance Plot")
Boosted Tree Model
Boosting is a general method whereby decision trees are grown sequentially using residuals (the differences between observed values and predicted values of a variable) as the response. Initial prediction values start at 0 for all combinations of predictors, so that the first set of residuals matches the observed values in our data. To keep the model from fitting the training data too quickly, contributions from each tree are scaled with a shrinkage parameter, λ. The value of this parameter is generally small (0.01 or 0.001), which slows the rate at which the model learns and tempers overfitting (James et al., 2021).
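The procedure described above can be sketched in a few lines of base R. This is an illustration under simplifying assumptions, not the `gbm` implementation: each tree is a single-split stump, the data are simulated, `fitStump()` is a hypothetical helper, and λ is set to 0.1 to match the step size shown in the fitting output.

```r
set.seed(7)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# One-split regression tree fit to the current residuals
fitStump <- function(x, r) {
  best <- list(sse = Inf)
  for (cutPoint in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- x <= cutPoint
    sse <- sum((r[left] - mean(r[left]))^2) +
      sum((r[!left] - mean(r[!left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, cutPoint = cutPoint,
                   leftMean = mean(r[left]), rightMean = mean(r[!left]))
    }
  }
  best
}

lambda <- 0.1          # shrinkage parameter
nTrees <- 100
pred   <- rep(0, n)    # predictions start at 0 ...
resid  <- y            # ... so the first residuals equal the observed values
trainMse <- numeric(nTrees)

for (b in seq_len(nTrees)) {
  stump  <- fitStump(x, resid)
  update <- ifelse(x <= stump$cutPoint, stump$leftMean, stump$rightMean)
  pred   <- pred + lambda * update    # add a shrunken version of the new tree
  resid  <- resid - lambda * update   # update the residuals
  trainMse[b] <- mean((y - pred)^2)
}

trainMse[c(1, 10, 50, 100)]
```

A smaller `lambda` makes each tree contribute less, so more trees are needed to reach the same training error.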
# Re-allocate cores for parallel computing
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
# Boosted tree fit with tuneLength (let function decide parameter combinations)
boostedTreeFit <- train(shares ~ ., data = channelTrain,
method = "gbm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneLength = 5)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 28151790.1831 nan 0.1000 69810.4856
## 2 28041695.5198 nan 0.1000 62510.4613
## 3 27986234.6249 nan 0.1000 -11419.0572
## 4 27945138.2537 nan 0.1000 2946.0247
## 5 27908554.4651 nan 0.1000 -29102.3845
## 6 27851707.9009 nan 0.1000 2341.4440
## 7 27808793.4825 nan 0.1000 18211.1615
## 8 27748444.4311 nan 0.1000 34098.4816
## 9 27699550.9633 nan 0.1000 -5195.4857
## 10 27659063.8766 nan 0.1000 4245.6564
## 20 27113289.1077 nan 0.1000 11125.7077
## 40 26589717.1170 nan 0.1000 -6461.5717
## 50 26421556.5563 nan 0.1000 4648.0519
# Define tuning parameters based on $bestTune from the permutations above
nTrees <- boostedTreeFit$bestTune$n.trees
interactionDepth = boostedTreeFit$bestTune$interaction.depth
minObs = boostedTreeFit$bestTune$n.minobsinnode
shrinkParam <- boostedTreeFit$bestTune$shrinkage
# Boosted tree fit with defined parameters
bestBoostedTree <- train(shares ~ ., data = channelTrain,
method = "gbm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv",
number = 5),
tuneGrid = expand.grid(n.trees = nTrees, interaction.depth = interactionDepth,
shrinkage = shrinkParam, n.minobsinnode = minObs))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 28121785.6990 nan 0.1000 92987.3684
## 2 28036091.5933 nan 0.1000 72477.3174
## 3 27962784.5616 nan 0.1000 53691.5602
## 4 27916702.0401 nan 0.1000 -16769.2612
## 5 27843896.5592 nan 0.1000 39695.6885
## 6 27768557.2416 nan 0.1000 49130.0674
## 7 27709356.7327 nan 0.1000 1802.9720
## 8 27647268.6180 nan 0.1000 54616.5060
## 9 27597204.3846 nan 0.1000 46167.8385
## 10 27540892.4831 nan 0.1000 39945.2386
## 20 27073824.5208 nan 0.1000 1016.4510
## 40 26567276.0734 nan 0.1000 -29366.9323
## 50 26423494.8341 nan 0.1000 -7863.5678
stopCluster(cl)
Model Comparisons
After the models are fit with training data, predictions are made with the test data. RMSE metrics are then extracted and compared, and the model with the lowest RMSE is presented as the winning model.
# Predict using test data
predictLM1 <- predict(lmFit1, newdata = channelTest)
# Metrics
RMSELM1 <- postResample(predictLM1, obs = channelTest$shares)["RMSE"][[1]]
RMSELM1
## [1] 7547.569
# Store value for model comparison
modelPerformance <- tibble(RMSE = RMSELM1, Model = "Linear regression 1")
predictLM2 <- predict(lmFit2, newdata = channelTest)
RMSELM2<-postResample(predictLM2, channelTest$shares)["RMSE"][[1]]
RMSELM2
## [1] 7577.951
modelPerformance <- add_row(modelPerformance, RMSE = RMSELM2, Model = "Linear regression 2")
predictRF <- predict(rfFit, newdata = channelTest)
RMSERF<-postResample(predictRF, channelTest$shares)["RMSE"]
RMSERF
## RMSE
## 7532.579
modelPerformance <- add_row(modelPerformance, RMSE = RMSERF, Model = "Random forest")
# Predict using test data
predictGBM <- predict(bestBoostedTree, newdata = channelTest)
# Metrics
RMSEGBM <- postResample(predictGBM, obs = channelTest$shares)["RMSE"]
RMSEGBM
## RMSE
## 7573.298
modelPerformance <- add_row(modelPerformance, RMSE = RMSEGBM, Model = "Boosted tree")
# Select row with lowest value of RMSE.
selectModel <- modelPerformance %>% slice_min(RMSE)
selectModel
## # A tibble: 1 x 2
## RMSE Model
## <dbl> <chr>
## 1 7533. Random forest
Based on the preceding analyses with test data, the random forest model yields the lowest RMSE, 7532.58.
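For reference, RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it by hand and picks the model with the smallest value; the observed and predicted numbers are invented for the example and are unrelated to the results above.

```r
# Invented observed values and predictions from two hypothetical models
observed <- c(1200, 800, 1500, 3000)
predLm   <- c(1000, 900, 1400, 2000)
predRf   <- c(1100, 850, 1600, 2600)

# Root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

modelRmse <- c("Linear regression" = rmse(observed, predLm),
               "Random forest"     = rmse(observed, predRf))

# The winning model is the one with the smallest RMSE
bestModel <- names(which.min(modelRmse))
bestModel
```

Because the errors are squared before averaging, a few articles with very large share counts can dominate the RMSE of every model.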