If you look at CRAN, there are various (I count about 10) packages to read image data, and of course there are various packages to do clustering. In theory, you could plug the raw pixel data straight into a clustering algorithm, but in practice that wouldn't work very well: it would be very slow, and the accuracy would probably be pretty bad too. Modern techniques for clustering image data rely on specialized features extracted from the images and operate on those. The best features are application dependent, but some of the best known are SIFT, SURF, and HOG. Older techniques relied on color histograms as features, which is quite doable with the aforementioned R packages, but not very accurate: a color histogram can hardly distinguish a picture of the sea from a picture of a blue room.
So what to do? It really depends on your ultimate objective. One way would be to use one of the various open source feature extractors out there, save the output to text or some other R-readable format, and then do the data processing in R as usual.
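As a minimal sketch of that workflow, assuming the external extractor wrote one feature vector per image to a CSV file (the file name and layout here are hypothetical, and the contents are faked with random numbers just so the example runs end to end):

```r
# Stand-in for the file an external extractor would write: one row per
# image, one column per feature dimension. Here we fake the contents
# with random numbers purely so the example is self-contained.
write.csv(matrix(rnorm(20 * 50), nrow = 20), "image_features.csv",
          row.names = FALSE)

# Back in R: load the exported features as a numeric matrix.
features <- as.matrix(read.csv("image_features.csv"))

# Cluster the images with plain k-means (centers/nstart are up to you).
km <- kmeans(features, centers = 5, nstart = 10)
km$cluster  # one cluster label per image, in row order
```

From here on it's ordinary R data analysis: `km$cluster` gives you the group of each image, and you can swap `kmeans` for whatever clustering method you prefer.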
A nice open source C library with a command-line interface for extracting features is vlfeat. If you use it, I recommend dense SIFT extraction on each of the three color channels. Then represent each image by the concatenated SIFT vectors and apply your favorite clustering technique (one that can handle vectors with dimensionalities in the thousands). That would hardly give you state-of-the-art performance, but it's a start.
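A toy sketch of that representation in R, with random matrices standing in for the per-channel dense SIFT descriptors (the real ones would come from vlfeat; the sizes here are made up):

```r
set.seed(1)
n <- 20  # number of images

# Placeholders for the dense SIFT descriptors of the R, G and B channels,
# already flattened to one row per image (vlfeat would produce the real ones).
X_r <- matrix(rnorm(n * 1000), nrow = n)
X_g <- matrix(rnorm(n * 1000), nrow = n)
X_b <- matrix(rnorm(n * 1000), nrow = n)

# Concatenate the three channels: one ~3000-dimensional vector per image.
X <- cbind(X_r, X_g, X_b)

# A dimensionality reduction step (here PCA) makes life easier for
# clustering methods that struggle in thousands of dimensions.
pc <- prcomp(X, rank. = 10)
km <- kmeans(pc$x, centers = 3, nstart = 10)
```

PCA is just one way to cope with the dimensionality; any reduction step, or a clustering method that tolerates high-dimensional input directly, would do.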
This page has reference implementations of various feature extractors, but only as binaries.
Beware: in my experience, R doesn't scale too well to large, high-dimensional datasets (sizes in the GB range). I love R to death, but I use C++ for this kind of thing.