Puzzling behavior in simple logical test applied to vector of values

OK, this has me absolutely perplexed and worried.
As part of a routine, I have been classifying individual observations of variables as TRUE or FALSE based on whether their values are above or below/equal to the median value. However, I have been getting a behavior in R that is largely unexpected from performing this simple test.

So take this set of observations:

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)

For me to classify these values, I did:

data_med=median(data)
quant_data=data
quant_data[quant_data>data_med]="High"
quant_data[quant_data<=data_med]="Low"

I know there are 1 gazillion ways of doing this more efficiently, but what has me worried is that the output from this does not make sense. Since there are no NaNs in the set and the test is all-inclusive (> or <=), I should end up with a vector containing only "High"/"Low" values, but instead I get:

[1] "High"  "High"  "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "High"  "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "1e-04"
[18] "Low"   "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "1e-04" "Low"   "High"  "Low"   "Low"   "High" 
[35] "High"  "Low"   "High"  "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"   "High"  "High"  "Low"   "Low"   "1e-04" "Low"  
[52] "1e-04" "Low"   "Low"   "High"  "Low"   "Low"   "Low"   "Low"   "Low"   "High"  "High"  "High"  "High"  "High"  "Low"   "Low"   "Low"  
[69] "1e-04" "High"  "High"  "High"  "High"  

See the "1e-04"s? What is even stranger, let’s pick value 69, one of the ones that return odd values:

data[69]
>1e-04

If I test this value alone, I get what I expected to get:

data[69]<=data_med
TRUE

Can someone explain this behavior? It just seems downright dangerous…

Let’s walk through what you did here.

data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)



data_med=median(data)  ## 0.5
quant_data=data        ## irrelevant
quant_data[quant_data>data_med]="High"

But by doing this you have converted quant_data to a character vector:

str(quant_data)
##  chr [1:73] "High" "High" "High" "High" "High" "High" "High" ...
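
The same coercion can be reproduced on a tiny vector. This is only a sketch; the vector x and its values are made up for illustration and are not part of your data:

```r
x <- c(0.75, 0.25, 0.0001)   # hypothetical toy values
x[x > 0.5] <- "High"         # assigning a string converts the whole vector to character
x
## [1] "High"  "0.25"  "1e-04"
class(x)
## [1] "character"
```

Note that the untouched numbers are now strings too, and 0.0001 is stored as "1e-04".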

Now the comparison between a character value and the data_med value is almost meaningless, because data_med gets coerced to a character value too, and the two are then compared as strings (in your locale's collation order), not as numbers:

"High" <= "0.5"   ## FALSE
"1e-04" <= "0.5"  ## FALSE -- this is your problem.
quant_data[quant_data<=data_med]="Low"
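
To see why most of the small values still end up as "Low" while 0.0001 does not, it may help to compare the strings directly. A minimal sketch, assuming a typical locale in which digits collate in numeric order (med_chr is a name introduced here for illustration):

```r
med_chr <- as.character(0.5)   # what data_med effectively becomes in the comparison
med_chr
## [1] "0.5"
as.character(0.0001)           # how the problem values are stored after coercion
## [1] "1e-04"

"0.25" <= med_chr    # TRUE: the strings agree up to "0.", then "2" sorts before "5"
"1e-04" <= med_chr   # FALSE: the first character "1" sorts after "0"
```

So the "1e-04" entries fail the second test only because of how they are written as strings. They were never tested against ">" as strings, because the first assignment ran while the vector was still numeric.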

What you presumably meant to do (and a reason to assign quant_data=data) was:

quant_data[data>data_med]="High"
quant_data[data<=data_med]="Low"
table(quant_data)
## High  Low 
##   35   38 

As @Arun points out in comments above, quant_data <- ifelse(data>data_med,"High","Low") would work too. So would an appropriate use of cut().
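
For completeness, here is a sketch of both alternatives on a small made-up vector (vals, vals_med, and the result names are hypothetical and chosen so they do not overwrite your objects):

```r
vals <- c(0.75, 0.25, 0.0001, 0.5, 0.9166)   # hypothetical subset for illustration
vals_med <- median(vals)                     # 0.5

# ifelse() tests the numeric vector and builds the labels in one step
quant_ifelse <- ifelse(vals > vals_med, "High", "Low")

# cut() with the default right = TRUE uses the intervals (-Inf, median] and (median, Inf],
# so values equal to the median are labelled "Low"
quant_cut <- cut(vals, breaks = c(-Inf, vals_med, Inf), labels = c("Low", "High"))

quant_ifelse
## [1] "High" "Low"  "Low"  "Low"  "High"
```

Both approaches always compare the original numeric values, so no coercion to character happens before the test. Note that cut() returns a factor rather than a character vector.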
