Data Activity 1 / Unit 1

Task

Getting started with R; creating summary statistics based on Crime Survey for England and Wales, 2013-2014

Process and Findings

# Install and load necessary package
install.packages("haven")
library(haven)

# Read the data from the .sav file
my_data <- read_sav("filelocation.sav")

# List column headings
names(my_data)

# Display frequency table of 'antisocx' variable
table(my_data$antisocx)
# The table prints to the console, but it is easier to inspect the values in the ‘my_data’ tab of the data viewer.

# Check the mean and median of 'antisocx'
mean(my_data$antisocx)    # returned NA, so tried the median to see whether it has the same issue
median(my_data$antisocx)  # NA again: viewing the data shows that many rows hold NA (a missing value), so R must be told to ignore missing values, as shown below

# Ignore missing values (NA)
mean(my_data$antisocx, na.rm = TRUE)
median(my_data$antisocx, na.rm = TRUE)

# Create summary statistics
summary(my_data$antisocx)  # summary() counts the NAs itself and reports them in a separate column, so na.rm is not needed here

Results
  Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
-1.215   -0.788   -0.185   -0.007    0.528    4.015     6694

# Using 'describe' function to get more detailed summary statistics
install.packages("psych")  # only needed once, if the package is not already installed
library(psych)
describe(my_data$antisocx)

   vars    n   mean    sd  median  trimmed   mad    min   max  range  skew  kurtosis    se
X1    1 2149  -0.01  0.99   -0.18    -0.11  1.06  -1.22  4.01   5.23   0.8      0.23  0.02

Interpretation

The interpretation of a negative value is not clear from the data table. Does it mean that antisocial behavior is not a concern?
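One possible reading, offered as an assumption rather than something the data table states: a mean close to 0 and a standard deviation close to 1, as in the describe output above, are what a standardised score would produce. Under that assumption, a negative value simply means "below the sample average". The sketch below uses made-up raw scores (`raw` is hypothetical, not survey data) to show how standardising produces negative values.

```r
# Hypothetical raw scores on some arbitrary scale (not taken from the survey)
raw <- c(2, 4, 4, 5, 10)

# Standardise: subtract the mean, then divide by the standard deviation
z <- (raw - mean(raw)) / sd(raw)

print(round(z, 2))

# After standardising, the mean is 0 and the standard deviation is 1,
# so every raw score below the average becomes negative
print(round(c(mean = mean(z), sd = sd(z)), 2))
```

Whether antisocx was actually constructed this way, and which end of the scale indicates more concern, would need to be checked in the dataset's documentation.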

The fact that the median (-0.18) is smaller than the mean (-0.01) suggests that the distribution is skewed to the right and that some high-score outliers could be pulling the mean upwards. The min is -1.22 and the max is 4.01, so the upper tail extends much further from the centre than the lower one. The positive skew value (0.8) is consistent with a right-skewed distribution rather than a normal one.
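The relationship between mean, median and skew can be illustrated with a small made-up sample. This is a minimal sketch: `x` and the helper `skewness()` are hypothetical names introduced here, and the helper uses a simple moment-based formula, so its result may differ slightly from the value psych reports, depending on which estimator that package applies.

```r
# Hypothetical right-skewed toy sample: most values are low, a few are high
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 9, 13)

# In a right-skewed sample the high values pull the mean above the median
x_mean   <- mean(x)
x_median <- median(x)

# Moment-based skewness: mean cubed deviation divided by the
# mean squared deviation raised to the power 1.5.
# A positive result indicates a longer right tail.
skewness <- function(v) {
  v <- v[!is.na(v)]          # drop missing values first
  d <- v - mean(v)           # deviations from the mean
  mean(d^3) / (mean(d^2))^1.5
}

x_skew <- skewness(x)

print(c(mean = x_mean, median = x_median, skew = x_skew))
```

A histogram, for example `hist(x)`, gives a quick visual check of the same pattern.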

Learnings

Repeating this task using the ‘import’ command for the file did not work, which suggests that the headings and row/column titles are not retained. With the code above, the file must be ‘read’ using read_sav().

Potential missing (NA) values must be checked for (the count appears in the ‘summary’ output), and R needs to be told to ignore them with na.rm = TRUE when calculating the mean and median. Using ‘describe’ gives more information about the variation and skewness of the data.
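A minimal sketch of these missing-value checks on a small made-up vector (`scores` is hypothetical and stands in for a survey variable):

```r
# Hypothetical toy vector with two missing values
scores <- c(-1.2, -0.5, NA, 0.3, 1.8, NA)

# Count missing and non-missing values before summarising
n_missing <- sum(is.na(scores))
n_valid   <- sum(!is.na(scores))

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(scores)

# With na.rm = TRUE, the missing values are dropped before calculating
mean_valid   <- mean(scores, na.rm = TRUE)
median_valid <- median(scores, na.rm = TRUE)

print(c(missing = n_missing, valid = n_valid))
print(mean_with_na)
print(c(mean = mean_valid, median = median_valid))
```

Counting the missing values first makes it clear how many observations the reported mean and median are actually based on.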
