This overlaps a bit with Igor's reply, but here's my take:
Native Control Look - UI controls today have a rather complex appearance. There are many visual cues we instinctively derive from them, and even if a control is just a white rectangle with a frame, the wrong shadow makes it look strangely out of place. A context menu often doesn't simply open today; it slides in from some direction, or fades in.
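For illustration, here's a minimal sketch of how a native Windows app gets that menu animation for free via the Win32 `TrackPopupMenuEx` API; the window handle and menu items are placeholders:

```cpp
#include <windows.h>

// Minimal sketch: the OS animates the menu (slide or fade, honoring the
// user's system-wide animation settings) - no toolkit code required.
void ShowNativeContextMenu(HWND hwnd, POINT screenPt)
{
    HMENU menu = CreatePopupMenu();
    AppendMenuW(menu, MF_STRING, 1, L"Copy");
    AppendMenuW(menu, MF_STRING, 2, L"Paste");

    // TPM_VERPOSANIMATION requests the standard top-down slide.
    TrackPopupMenuEx(menu, TPM_LEFTALIGN | TPM_VERPOSANIMATION,
                     screenPt.x, screenPt.y, hwnd, nullptr);
    DestroyMenu(menu);
}
```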
Native Control Behavior - Even more complex than looks is behavior, and there's a lot of detail to it: different context menus depending on click position, different "hot" areas when selecting or dragging items, keyboard shortcuts, etc.
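As one concrete instance of click-position-dependent menus, here's a sketch against the Win32 list-view API; `ShowItemMenu` and `ShowBackgroundMenu` are hypothetical helpers standing in for the two different menus:

```cpp
#include <windows.h>
#include <commctrl.h>

void ShowItemMenu(HWND list, int item, POINT pt);   // hypothetical helpers
void ShowBackgroundMenu(HWND list, POINT pt);

// Hit-test the click: an item gets item-specific actions, the blank
// area gets view-level actions ("Paste", "Select All", ...).
void OnContextMenu(HWND list, POINT screenPt)
{
    LVHITTESTINFO hit = {};
    hit.pt = screenPt;
    ScreenToClient(list, &hit.pt);
    int item = ListView_HitTest(list, &hit);

    if (item != -1 && (hit.flags & LVHT_ONITEM))
        ShowItemMenu(list, item, screenPt);
    else
        ShowBackgroundMenu(list, screenPt);
}
```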
Attention to detail - There's a lot of consistent UI behavior to discover on any platform. Take just the way arrow keys work in a tree control with respect to selecting, opening and closing nodes.
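The native Windows tree view, for instance, does this on its own; a sketch of what a non-native toolkit would have to replicate:

```cpp
#include <windows.h>
#include <commctrl.h>

// Right expands a collapsed node or moves to the first child;
// Left collapses an expanded node or jumps to the parent.
void OnTreeKeyDown(HWND tree, WPARAM vk)
{
    HTREEITEM sel = TreeView_GetSelection(tree);
    if (!sel) return;
    bool open = (TreeView_GetItemState(tree, sel, TVIS_EXPANDED) & TVIS_EXPANDED) != 0;

    if (vk == VK_RIGHT) {
        if (!open) TreeView_Expand(tree, sel, TVE_EXPAND);
        else       TreeView_SelectItem(tree, TreeView_GetChild(tree, sel));
    } else if (vk == VK_LEFT) {
        if (open)  TreeView_Expand(tree, sel, TVE_COLLAPSE);
        else       TreeView_SelectItem(tree, TreeView_GetParent(tree, sel));
    }
}
```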
Just look at Windows: most non-native toolkits get basic keyboard navigation wrong. Arrow keys, Home, End, PgUp and PgDn, with behavior modified by Ctrl and selection extended by Shift, give up to 32 distinct behaviors. Copy & Paste traditionally works with both Ctrl+C/Ctrl+X/Ctrl+V and Shift+INS/Shift+DEL, and the latter set is often missing. A mouse double click often selects a word, a triple click sometimes a sentence, line or paragraph.
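A sketch of just the clipboard part, with both shortcut sets mapped to the same commands (`DoCopy` etc. are hypothetical placeholders):

```cpp
#include <windows.h>

void DoCopy(); void DoCut(); void DoPaste();  // hypothetical placeholders

// Both the Ctrl+C/X/V set and the older Ctrl+INS/Shift+INS/Shift+DEL
// set must work; non-native toolkits frequently implement only one.
bool HandleClipboardKeys(WPARAM vk)
{
    bool ctrl  = (GetKeyState(VK_CONTROL) & 0x8000) != 0;
    bool shift = (GetKeyState(VK_SHIFT)   & 0x8000) != 0;

    if ((ctrl && vk == 'C') || (ctrl  && vk == VK_INSERT)) { DoCopy();  return true; }
    if ((ctrl && vk == 'X') || (shift && vk == VK_DELETE)) { DoCut();   return true; }
    if ((ctrl && vk == 'V') || (shift && vk == VK_INSERT)) { DoPaste(); return true; }
    return false; // not handled; navigation (8 keys x 4 modifier states
                  // = the 32 behaviors above) follows the same pattern
}
```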
Response time and Muscle Memory - There are, basically, two UI operation modes:
act-look loop, where you wait for the response before deciding the next step,
playback from muscle memory, which is much faster and requires fewer mental processing resources.
Muscle-memory playback, however, has two requirements: the response must be uniform and "instant", and the next action must register correctly immediately (within about 10 ms).
Often enough, non-native toolkits break this: the response lags one or two actions behind (the mind locks onto the discrepancy), or the toolkit takes 50 ms or more to show a menu, during which a click isn't registered as intended.
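A quick way to see this lag on Windows is to compare when an input event entered the queue with when it is actually handled; a sketch, using the ~10 ms budget from above:

```cpp
#include <windows.h>
#include <cstdio>

// GetMessageTime() stamps when the last retrieved message entered the
// queue; the difference to "now" is how far behind the input the UI runs.
void ReportInputLag()
{
    DWORD queuedAt = (DWORD)GetMessageTime();
    DWORD lagMs    = GetTickCount() - queuedAt;
    if (lagMs > 10) // the ~10 ms budget from above
        printf("input handled %lu ms late - muscle-memory playback breaks\n", lagMs);
}
```

Called from a WM_KEYDOWN or WM_LBUTTONDOWN handler, this flags exactly the discrepancy the mind locks onto.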
A polished UI takes long to get right - A good control library can solve most of the per-control issues, but there's always a final 10% that takes 90% of the time, and on top of that come interactions between controls. You have to try different approaches, you have to expect users with FPS-trained reflexes, and you have to try all kinds of workflows.
Cross-platform toolkits can't get it perfectly right - they are stuck between a rock and a hard place: they can opt for internal consistency independent of the platform, or for consistency with the platform they currently run on. Done right, the latter often requires platform-dependent code in the calling code, the very thing you were trying to avoid.
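To make the dilemma concrete, a sketch with a hypothetical cross-platform `Menu` type: staying consistent with the host platform pushes the platform checks into the application code.

```cpp
// Menu and Style are hypothetical cross-platform toolkit types.
enum class Style { Default, SlideDown, FadeIn };
struct Menu { void popup(Style) {} };

// The platform branches land in the calling code - the very
// thing the cross-platform toolkit was supposed to avoid.
void showContextMenu(Menu& menu)
{
#ifdef _WIN32
    menu.popup(Style::SlideDown);   // Windows context menus slide in
#elif defined(__APPLE__)
    menu.popup(Style::FadeIn);      // macOS menus fade
#else
    menu.popup(Style::Default);
#endif
}
```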