Class Reflection Post 558
What (if anything) has changed about what you think a data scientist is and what they do?
I had been wondering how new the data scientist role really is compared to what used to be called a statistician role.
In terms of day-to-day work, I had not realized how prominent APIs have become in recent years, so I was glad that we got to experience working with them.
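The pattern we practiced was roughly the following. This is only a minimal sketch, assuming the httr and jsonlite packages and a hypothetical endpoint URL (the real APIs we used differ):

```r
library(httr)
library(jsonlite)

# Hypothetical endpoint; most JSON APIs with query parameters work similarly
url <- "https://api.example.com/v1/records"

resp <- GET(url, query = list(limit = 10))
stop_for_status(resp)  # fail early on HTTP errors

# Parse the JSON body into a data frame for downstream tidyverse work
records <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
head(records)
```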
In an earlier post, I mentioned that I first started hearing the term “data scientist” roughly 10 years ago.
Interestingly, I realized in the last few weeks that much of the process we’ve followed in class has been around for a much longer time. In particular, I recall seeing demos of the SAS Enterprise Miner product over 20 years ago. At that time, Enterprise Miner used a node-based setup, with data processing and preparation nodes passing data to modeling nodes and then to scoring nodes. In other words, the general process was very similar to what we’ve been coding in R.
I’ve started to wonder whether folks were actually working as “data scientists” well before the buzzword spread. The rise of the formal title seems to correspond more to the abundance of large amounts of data and the recognition by more organizations that they could obtain value from their data.
I’ve also noticed that different groups have different ideas about what skills and abilities are needed for data science. For example, at NCSU there are at least three graduate programs, in the analytics, computer science, and statistics groups, with some stated data science focus. Graduates of these programs could all become data scientists while having developed very different emphases in their skills.
In the end, it does seem like a data scientist and statistician role could be very similar or different depending on the context. My observations have reinforced the idea that data science is truly interdisciplinary with statistics being just one part.
What are your current thoughts on using R for data science - do you think you’ll continue to use R going forward? Why or why not?
R seems well suited to some aspects of data science. Combining the tidyverse and R Markdown is an especially productive way to work with data and produce reports.
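As an example of what I mean, a typical R Markdown chunk might summarize and plot data in a short tidyverse pipeline; here is a minimal sketch using the built-in mtcars data:

```r
library(dplyr)
library(ggplot2)

# Summarize mpg by cylinder count, then plot it in the same report
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n()) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean MPG")
```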
I do have concerns about model fitting performance in R. Even with parallel processing, the random forest fitting in our second project seemed slow. When time permits, I may try replicating that part in Python or SAS and comparing the performance with R.
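For reference, the parallel setup was along these lines. This is a sketch only, assuming the caret, doParallel, and randomForest packages; train_dat and the response y are hypothetical placeholders for our project data:

```r
library(caret)
library(doParallel)

# Register a parallel backend so caret can fit resamples across cores
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

rf_fit <- train(
  y ~ .,
  data = train_dat,  # hypothetical training data frame
  method = "rf",
  trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE),
  tuneGrid = expand.grid(mtry = c(2, 4, 6))
)

stopCluster(cl)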
I definitely plan to use R for graduate classes. I took this class early in my program because I knew most of the other classes use R frequently.
Outside of classes, how much I use R will depend on the task and on what software colleagues or projects are using.
My general approach to software has been to use what is available and best for the job. I’ve often taken a hybrid approach combining different software. For example, I’ve worked with various combinations of Perl, PowerShell, PHP, and SAS using the parts that each does better.
After this class, I feel confident enough about R to add it to the languages I use on a regular basis.
What things are you going to do differently in practice now that you’ve had this course?
Before this class, I had not really used R Markdown and was only using R scripts. Now that I’ve had plenty of practice with it, I plan to switch from R scripts to R Markdown. In addition to great reporting, I found R Markdown easier to use for documenting code than traditional R script comments.
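As a rough illustration of why, here is a sketch of a minimal .Rmd file (the title and chunk name are just placeholders): narrative lives in prose between chunks instead of long runs of `#` comments.

````markdown
---
title: "Analysis notes"
output: html_document
---

Explanation of what the analysis does goes here as ordinary prose,
instead of `#` comments inside one long script.

```{r summary-stats}
# Code comments can then focus on the code itself
summary(mtcars$mpg)
```
````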
I’m also considering continuing to use GitHub for saving my work. So far I’ve only used GitHub repos for the class projects, but in the future I hope to use repos more often, since I’ve found they really help keep work organized.
Before this class I had not used the caret package. I’ve found caret especially helpful because it provides a common syntax across many model types, and I plan to keep using it when applicable.
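The common interface is what I found most useful: switching model types mostly means changing the method argument. Here is a sketch using the built-in iris data (rpart ships with base R; the same seed is set before each call so the CV folds match for comparison):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

# Same train() call structure for two very different model types
set.seed(1)
knn_fit <- train(Species ~ ., data = iris, method = "knn",
                 trControl = ctrl, preProcess = c("center", "scale"))

set.seed(1)
tree_fit <- train(Species ~ ., data = iris, method = "rpart",
                  trControl = ctrl)

# Compare resampled accuracy across the two fits
summary(resamples(list(knn = knn_fit, tree = tree_fit)))
```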