Module 8: Linear Regression Visualizations in ggplot2


DISCLAIMER: For this week's submission, I will post the code and my comments here, but I will also submit a separate HTML file compiled from my .Rmd in Canvas, so that my work is easier to view as one organized document.

In this week's module, we explored different styles of statistical analysis on a dataset in R. The key when dealing with data is to remember when it is appropriate to separate it into groups (using factor()) and when to visualize it unfiltered, directly from the source. I used related columns from the data frame to visualize similarities and trends among characteristics of automobiles, where I know that one characteristic can tell us something about the others. Without data, one might assume a larger engine means a quicker quarter mile or worse fuel mileage, but that is not always the case. This dataset does not account for modifications or other factory nuances that might make a car an outlier in its category. Cases like that create small anomalies in the data, but working from what the dataset provides, here are some visualizations of what I believe to be an accurate regression analysis in ggplot2.

## Module 8: Linear Regression Analysis in ggplot2
## Let's conduct a simple regression analysis in R using mtcars
## First, let's load some libraries
## Then we call 'mtcars' and store it in CD



library(dplyr)

library(tidyverse)

library(ggplot2)



data(mtcars)

CD <- data.frame(mtcars)


# Now let's isolate some data and generate a quick scatterplot
# Say you want to analyze the relationship between MPG and vehicle weight
# First, call ggplot() with the data and the aesthetics (aes()), then specify some colors and labels
# To add a regression line (line of best fit), use geom_smooth()

# Inside aes(), refer to columns by name rather than CD$...
ggplot(CD, aes(wt, mpg)) +
  geom_point(
    shape = 21,        # filled circle, so both fill and color apply
    fill = "blue",
    color = "black"
  ) +
  labs(
    x = "Weight (1000 lbs)",
    y = "MPG"
  ) +
  geom_smooth(method = "lm")
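geom_smooth(method = "lm") draws the line from an ordinary least squares fit of y on x, so the same line can be inspected numerically with base R's lm(). The sketch below is not part of the original assignment. The object name wt_fit is my own, and it assumes the plot above puts weight (wt) on the x-axis and mpg on the y-axis, as the aes() call does.

```r
# Fit the same simple linear model that geom_smooth(method = "lm") draws
wt_fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coef(wt_fit)

# Share of the variance in mpg explained by weight
summary(wt_fit)$r.squared
```

A negative slope for wt matches the downward line in the plot: heavier cars get fewer miles per gallon.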



#You can also use stat_smooth() to achieve the same effect as geom_smooth()


ggplot(CD, aes(wt, mpg)) +
  geom_point(
    shape = 21,        # filled circle, so both fill and color apply
    fill = "blue",
    color = "black"
  ) +
  labs(
    x = "Weight (1000 lbs)",
    y = "MPG"
  ) +
  stat_smooth(method = "lm", color = "red")
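To check that the two calls really draw the same line, the fitted values behind each layer can be compared. This is a rough sketch rather than part of the assignment: p_geom, p_stat, and same_line are names I made up, and it relies on ggplot2's layer_data() helper to pull out the values each layer computes.

```r
library(ggplot2)

# Same data and mapping, one plot per function call
p_geom <- ggplot(mtcars, aes(wt, mpg)) +
  geom_smooth(method = "lm", formula = y ~ x)
p_stat <- ggplot(mtcars, aes(wt, mpg)) +
  stat_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the values computed for the first layer of each plot
fit_geom <- layer_data(p_geom)
fit_stat <- layer_data(p_stat)

# TRUE when both layers trace the same fitted line
same_line <- isTRUE(all.equal(fit_geom$y, fit_stat$y))
same_line
```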


# I set the color to red to show that this is the same line, just drawn by a different function call

## Let's try something a little different. Instead of a single regression line over the scatterplot,
## we can fit a separate line of best fit for each group. See the scenario below:

# Let's visualize regression lines on a scatterplot of engine power (hp) against quarter mile time (qsec).
## We will color by cylinder count to show that some larger engines that use more fuel may in fact be faster

# I store the factor under the name I want in the legend title, to avoid using factor() in more than one place or within the title
## Call it my way of keeping my presentation and visualizations clean.

CD$Cylinders <- factor(CD$cyl)
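A quick way to see what factor() changes: cyl is stored as a number, and converting it to a factor turns it into a small set of discrete groups that ggplot2 can color separately. The sketch below uses a name of my own, cyl_groups, and works on mtcars directly.

```r
# cyl is numeric in mtcars; factor() turns it into discrete groups
cyl_groups <- factor(mtcars$cyl)

# The distinct cylinder counts found in the data
levels(cyl_groups)

# How many cars fall into each group
table(cyl_groups)
```

With three levels, a color mapping on this factor gives three distinct colors and three separate regression lines instead of one continuous color scale.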


# Mapping color in the main aes() colors the points and the lines by the same cylinder groups
ggplot(CD, aes(hp, qsec, color = Cylinders)) +
  geom_point() +
  labs(
    x = "Engine Output (HP)",
    y = "Quarter Mile Time (qsec)"
  ) +
  stat_smooth(
    method = "lm",
    se = FALSE,
    fullrange = TRUE
  )
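Each colored line in this plot is its own least squares fit within one cylinder group. As a sketch of what that means, the same per-group fits can be reproduced in base R. The name fits_by_cyl is hypothetical, and I use mtcars directly rather than CD.

```r
# Split the data by cylinder count and fit qsec ~ hp within each group
fits_by_cyl <- lapply(
  split(mtcars, mtcars$cyl),
  function(d) lm(qsec ~ hp, data = d)
)

# One row per cylinder group: intercept and slope of that group's line
group_coefs <- t(sapply(fits_by_cyl, coef))
group_coefs
```

Comparing the hp column across rows shows how strongly horsepower relates to quarter mile time within each group, which a single overall line would hide.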


# We can also style the grid lines inside the theme() call

ggplot(CD, aes(hp, qsec, color = Cylinders)) +
  geom_point() +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later; use size on older versions
  theme(panel.grid = element_line(color = "#8ccde3",
                                  linewidth = 0.75,
                                  linetype = 2)) +
  labs(
    x = "Engine Output (HP)",
    y = "Quarter Mile Time (qsec)"
  ) +
  stat_smooth(
    method = "lm",
    se = FALSE,
    fullrange = TRUE
  )
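To put a number on the trend in these plots, the overall correlation and a multiple regression can be computed alongside the visualization. This is only a sketch under my own assumptions: hp_qsec_cor and multi_fit are names I chose, and treating cyl as a factor in the model is one reasonable choice, not the only one.

```r
# Overall linear association between horsepower and quarter mile time
hp_qsec_cor <- cor(mtcars$hp, mtcars$qsec)
hp_qsec_cor

# Multiple regression: quarter mile time on horsepower,
# with cylinder count entered as a grouping factor
multi_fit <- lm(qsec ~ hp + factor(cyl), data = mtcars)
summary(multi_fit)
```

A negative correlation means higher horsepower goes with lower (quicker) quarter mile times.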



# As you can see from the resulting visualization, quarter mile time depends on a few variables.
## First, as horsepower increases, the points with higher hp belong to engines with more cylinders and, by deduction, larger displacement.
## The results show that as the number of cylinders and the horsepower increase, quarter mile time tends to go down.
## One might expect 8 cylinders to always outperform 6 cylinders, but keep in mind that some of the 6 cylinder cars in mtcars
## may have individual throttle bodies or forced induction, which makes more power and pickup in a smaller engine that is lighter and grips the ground more easily.
## Ultimately, you would have to test that individually or add that data to mtcars: does the car have a turbo/supercharger and, if so, how many PSI of boost does it make at maximum load?
## Other than that, this is the end of this week's review of Linear and Multiple Regression Analysis in ggplot2.
