ansaurus

Question

How can I generate conditional distributions of data by taking slices of scatterplots?

Answer 1

A:

This page explains it: http://www.statmethods.net/advgraphs/trellis.html

You basically want to alter the formula for the graphs. It should look more like

csalary ~ bsalary|gender

which should break the graphs apart based on the different values of gender. There are additional options for conditioning on continuous variables.
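As a rough sketch of what that formula does in lattice: the `salaries` data frame and its simulated values below are made up for illustration, and only the formula itself comes from the answer above. For a continuous conditioning variable, lattice's `equal.count()` builds a "shingle" (a set of intervals) that can stand in for the factor.

```r
library(lattice)

# Hypothetical data: the column names follow the formula above,
# the values are simulated purely for illustration.
set.seed(1)
salaries <- data.frame(
  bsalary = rnorm(200, mean = 30000, sd = 5000),
  gender  = factor(sample(c("female", "male"), 200, replace = TRUE))
)
salaries$csalary <- 1.5 * salaries$bsalary + rnorm(200, sd = 3000)

# One scatterplot panel per level of gender
by_gender <- xyplot(csalary ~ bsalary | gender, data = salaries)
print(by_gender)

# Continuous conditioning: split bsalary into 4 non-overlapping
# intervals with roughly equal counts, then draw one histogram of
# csalary per interval
slices <- equal.count(salaries$bsalary, number = 4, overlap = 0)
by_slice <- histogram(~ csalary | slices, data = salaries)
print(by_slice)
```

The explicit `print()` calls matter when the code runs inside a function or a sourced script, where lattice objects are not drawn automatically.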

TheSteve0 2010-02-22 06:05:32

@TheSteve0 - Thanks! I'm a fan of Quick-R for reference to things I can already do in SPSS, and it's been a huge help. I think my biggest challenge will be understanding and really becoming conversant in both conceptual modeling language in statistics, and R's particular modeling language for formulae.

briandk 2010-02-23 19:08:45

Answer 2

+2 A:

You can use the cut() function to slice your data into ordinal categories. ggplot2's qplot function can then very easily create your desired plots.

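library(ggplot2)

# Fake data: csalary is noise, bsalary is csalary plus a smaller amount of noise
csalary <- rnorm(100, sd = 100)
bsalary <- csalary + rnorm(100, sd = 10)

# Regular scatter plot
qplot(bsalary, csalary)

# Stacked dot plot: bsalary cut into 10 equal-width bins
qplot(cut(bsalary, 10), csalary)

# Box plot of csalary within each bsalary bin
qplot(cut(bsalary, 10), csalary, geom = "boxplot")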
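The same slicing can be sketched in base R, since cut() lives in base and does not depend on ggplot2. The example below reuses the fake data from above (with a seed added so it is repeatable) and joins the per-slice means with line segments. The names `slice`, `slice_means` and `slice_mid` are just illustrative.

```r
# Fake data, same shape as in the ggplot2 example
set.seed(1)
csalary <- rnorm(100, sd = 100)
bsalary <- csalary + rnorm(100, sd = 10)

# Assign every observation to one of 10 equal-width bsalary slices
slice <- cut(bsalary, breaks = 10)

# Conditional distributions: the csalary values falling in each slice
by_slice <- split(csalary, slice)

# Mean csalary per slice, and mean bsalary per slice for the x position
slice_means <- tapply(csalary, slice, mean)
slice_mid   <- tapply(bsalary, slice, mean)

# Empty slices come back as NA, so drop them before drawing
ok <- !is.na(slice_means)

plot(bsalary, csalary)
lines(slice_mid[ok], slice_means[ok], type = "b", pch = 19, col = "blue")
```

`by_slice` is a list with one numeric vector per slice, so something like `sapply(by_slice, sd)` gives the spread within each slice.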

Ian Fellows 2010-02-22 06:12:15

@Ian - Thanks! I just tried your code, and at a glance it looks like it's producing the exact output I'd hoped. I do have two questions though. 1. The "cut" function doesn't seem to require loading ggplot2, right? 2. I'm new to `lattice`, and I'd never even tried `ggplot2` before. What motivates people to choose those packages to graph data over R's basic plot commands? I imagine there's a slew of ggplot2 and lattice options that could overwhelm a novice like myself :-(

briandk 2010-02-23 18:57:59

@Ian - also, next time I'll try and provide some data so you don't have to make it up :-)

briandk 2010-02-23 19:09:27

For a discussion on ggplot2 versus lattice... http://www.schulte-mecklenbeck.com/?p=65

William Doane 2010-02-24 10:54:19

Answer 3

+2 A:

Do you really want to do that? Turning a continuous variable into an ordinal one throws away information since different values of the X variable end up in the same bin. I think your boxplot graphic conveys much less information than your scatterplot.

If you are dissatisfied with the scatterplot because of points overlapping, one way to preserve information would be to add a smooth curve that captures the trend. Look at the documentation for lowess for an example.
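A minimal sketch of the smooth-curve idea, using simulated data in place of the real salaries (the numbers are assumptions, not the asker's data):

```r
# Simulated stand-in for the salary data
set.seed(1)
bsalary <- rnorm(100, mean = 1000, sd = 100)
csalary <- 1.2 * bsalary + rnorm(100, sd = 50)

# Scatterplot with a lowess trend line on top
plot(bsalary, csalary)
trend <- lowess(bsalary, csalary)   # list with x (sorted) and smoothed y
lines(trend, col = "red", lwd = 2)
```

The `f` argument of lowess() (default 2/3) sets how much of the data feeds each local fit. Smaller values give a wigglier curve.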

In your graph the three observations with salaries higher than $20,000 are pushing the remaining observations into a corner. Dropping those and replotting would give a better graph.

Another approach for skewed data like yours is to plot the logarithms of the variables instead of the variables themselves.
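For instance, with made-up right-skewed data (log-normal draws, chosen only to mimic a few very large salaries), the log scale can be requested directly in plot():

```r
# Simulated right-skewed salaries; the parameters are illustrative only
set.seed(1)
bsalary <- rlnorm(200, meanlog = 9, sdlog = 0.6)
csalary <- bsalary * rlnorm(200, meanlog = 0.3, sdlog = 0.2)

# Raw scale: a handful of large values squeeze the rest into a corner
plot(bsalary, csalary)

# Log scale on both axes; tick labels stay in the original units
plot(bsalary, csalary, log = "xy")
```

Plotting `log(bsalary)` against `log(csalary)` gives the same picture with the axes labelled in log units instead.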

Jyotirmoy Bhattacharya 2010-02-22 06:42:45

@Jmoy - You're absolutely right (see above). For this particular regression assignment, we were actually _comparing_ LOESS smoothing to joined-line-segments of slice means. So, I wanted to know for myself how one might do that (conditionally group data by slice), even though for a lot of cases it gives a much less desirable result than LOESS. My question was less about best practices in data analysis, and more about understanding conceptually and programmatically how one might ask R to group data.

briandk 2010-02-23 18:50:00

Answer 4

+2 A:

Rather than slice the data by the value of the conditioning variable (turning a continuous variable into a discrete variable), it is more efficient to condition using a kernel function. There is a package that does this: hdrcde. Check out the examples in the help files.
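To give a feel for the idea without the package, here is a rough base R sketch. It is not hdrcde's implementation, just an illustration. Instead of a hard slice around x0, every observation gets a Gaussian kernel weight based on how close its x value is to x0, and those weights go into a weighted density estimate of y. The helper `conditional_density` and the bandwidth choices are assumptions made for this example.

```r
# Simulated data (illustrative values only)
set.seed(42)
bsalary <- rnorm(500, mean = 100, sd = 20)
csalary <- bsalary + rnorm(500, sd = 10)

# Hypothetical helper: estimate the density of y given x near x0,
# weighting each observation by a Gaussian kernel in x with bandwidth h
conditional_density <- function(x, y, x0, h) {
  w <- dnorm((x - x0) / h)
  # Bandwidth for y is taken from the marginal rule of thumb, which
  # oversmooths a little but keeps the sketch simple
  density(y, weights = w / sum(w), bw = bw.nrd0(y))
}

h    <- bw.nrd0(bsalary)
low  <- conditional_density(bsalary, csalary, x0 = 80,  h = h)
high <- conditional_density(bsalary, csalary, x0 = 120, h = h)

plot(low, xlim = range(csalary),
     main = "csalary given bsalary near 80 (solid) and 120 (dashed)")
lines(high, lty = 2)
```

Unlike cut(), no observation is thrown into a bin, so points just outside a slice boundary still contribute with a smaller weight.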

Rob Hyndman 2010-02-22 21:18:11

@Rob - thanks! I'll definitely have to check this out. I'm not familiar with kernel functions yet, but I hope to be by the end of my Multiple Regression course.

briandk 2010-02-23 19:01:02
