Short answer: You can't.
Long answer: The problem you'll face is that you can get the (x,y) coordinates of the click event on div1, but any implementation of caret positioning will require you to know the position of the caret in the content (that is, the number of characters preceding the caret).
To convert the (x,y) coordinates to a character position, you need to know how many characters come before that point (i.e. to the left on the current line and on all the lines above, if the text is ltr).
If you use a fixed-width font, you can simplify the problem: it becomes a matter of mapping an (x,y) coordinate to a (line, column) coordinate on a character grid.
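As a rough sketch of that mapping (TypeScript, assuming you have already measured the width of one character and the line height in pixels, and that x and y are relative to the top-left corner of the text area; the names `pointToGrid`, `charWidth` and `lineHeight` are made up for illustration):

```typescript
interface GridPosition {
  line: number;   // zero-based visual line
  column: number; // zero-based column on that line
}

// Maps a pixel coordinate to a cell on a fixed-width character grid.
// charWidth and lineHeight are assumed to be measured by the caller.
function pointToGrid(
  x: number,
  y: number,
  charWidth: number,
  lineHeight: number
): GridPosition {
  if (charWidth <= 0 || lineHeight <= 0) {
    throw new RangeError("charWidth and lineHeight must be positive");
  }
  return {
    line: Math.max(0, Math.floor(y / lineHeight)),
    column: Math.max(0, Math.floor(x / charWidth)),
  };
}
```

For example, with 8px wide characters and 16px tall lines, `pointToGrid(3, 20, 8, 16)` lands on line 1, column 0 (both zero-based). Note that this only gives a grid cell, not a position in the text.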
However, you still face the problem of not knowing how the text is wrapped. For example:
------------------
|Lorem ipsum |
|dolor sit amet |
|consectetur |
|adipiscing elit |
------------------
If the user clicks on the d in dolor, you know that the character is the 1st on the 2nd line, but without knowing the wrapping algorithm there is no way to know that it is the 13th character in "Lorem ipsum dolor sit…". And there is no guarantee that the wrapping algorithm is identical across browsers and platforms.
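To make the dependency on the wrapping algorithm concrete, here is a sketch in TypeScript that converts a (line, column) pair to a character offset under an assumed greedy word wrap with a fixed number of columns. This is not what any browser actually does; `lineStartOffsets`, `gridToOffset` and `maxColumns` are hypothetical names, and the wrapping rule is a simplification chosen only for illustration:

```typescript
// ASSUMPTION: greedy word wrap. A word that would push the line past
// maxColumns starts a new line. Real browsers may wrap differently.
// Returns the zero-based offset in `text` at which each visual line starts.
function lineStartOffsets(text: string, maxColumns: number): number[] {
  const starts: number[] = [0];
  let lineStart = 0;
  const wordPattern = /\S+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    const wordEnd = match.index + match[0].length;
    // Wrap before this word if it does not fit on the current line.
    if (wordEnd - lineStart > maxColumns && match.index > lineStart) {
      lineStart = match.index;
      starts.push(lineStart);
    }
  }
  return starts;
}

// Converts a zero-based (line, column) to a zero-based character offset,
// valid only if the assumed wrapping matches what is on screen.
function gridToOffset(
  text: string,
  maxColumns: number,
  line: number,
  column: number
): number {
  const starts = lineStartOffsets(text, maxColumns);
  if (line < 0 || line >= starts.length) {
    throw new RangeError("line is outside the wrapped text");
  }
  return starts[line] + column;
}

const sample = "Lorem ipsum dolor sit amet consectetur adipiscing elit";
// 16 columns reproduces the box above: line 1 starts at offset 12 (the 13th character).
console.log(gridToOffset(sample, 16, 1, 0));
// 20 columns wraps differently: the same click maps to offset 18 (the s in sit).
console.log(gridToOffset(sample, 20, 1, 0));
```

The same (line, column) gives two different offsets depending on the wrap width alone, which is why the grid position is not enough without knowing exactly how the browser broke the lines.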
Now, what I'm wondering is why you would use 2 different synced divs in the first place. Wouldn't it be easier to use only one div and make its content editable when the user clicks (or hovers)?