If A is clearly and significantly better in each individual A/B test, while B scores better in aggregate, then the main implication is that you can't aggregate those data sets that way: A is better.
If the testing got the same results every day, you wouldn't see this reversal, even with varying sample sizes per day. So I think it also implies that something changed from day to day. It could be anything, though. Maybe what you tested each day changed (perhaps in some very subtle way, like server speed). Or maybe the people you're testing it on changed (perhaps demographically, perhaps just in their mood). That doesn't mean your testing is bad or invalid. It just means you're measuring something that's moving, and that makes things tricky.
And I might be miscalculating or misunderstanding the situation, but I think it is also necessarily true that you haven't been testing A and B the same number of times as each other on each day. That is, if on Monday you tested A 50 times and B 50 times, and on Tuesday you tested A 600 times and B 600 times, and so on, and A outscored B each day, then the aggregate is just an equally weighted average of the daily results for both variants, and I don't see how you could get an aggregate result where B beats A. If unequal per-day counts are true of your test setup, that certainly seems like something you could fix to make your data easier to reason about.
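This reversal is the classic Simpson's paradox pattern, and it does require the unequal per-day counts described above. Here is a minimal sketch with made-up numbers (the success counts and trial counts are purely illustrative): A wins on every day, but because A and B were tested different numbers of times on each day, B wins in aggregate.

```python
# Hypothetical daily results: (A successes, A trials, B successes, B trials).
# On each day, A's success rate beats B's, but the per-day trial counts
# for A and B are very different.
days = [
    (90, 100, 800, 1000),   # day 1: A 90% vs B 80%  -> A wins
    (300, 1000, 20, 100),   # day 2: A 30% vs B 20%  -> A wins
]

# A wins every individual day.
for a_s, a_n, b_s, b_n in days:
    assert a_s / a_n > b_s / b_n

# Aggregating naively flips the result, because each variant's total
# is dominated by the days on which it happened to get more trials.
a_total = sum(d[0] for d in days) / sum(d[1] for d in days)
b_total = sum(d[2] for d in days) / sum(d[3] for d in days)
print(f"aggregate A: {a_total:.3f}, aggregate B: {b_total:.3f}")
# aggregate A: 0.355, aggregate B: 0.745
```

If instead A and B had received equal trial counts on each day, both aggregates would be the same weighted average of the daily rates, and A winning every day would force A to win in aggregate too.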