I need to plot a Lorenz curve: the cumulative share of a variable against the cumulative share of observations. I want both axes displayed as percentages. For example, say the observations are buyers and the y variable is the amount each one bought; the buyers are already ranked in descending order, and I want a plot that says "The top 10% of buyers purchased 90% of the total bought". My dataset has a couple of million observations.
What is the best way to do this? Sub-questions:
If I need to add two variables for the quantiles of total observations and total $ bought (so that I can plot them), which function returns the row number? I tried:
user_quantile <- row(df)/nrow(df)
but I get a matrix of identical columns (user_quantile.1, user_quantile.2) of which I only need one column.
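For reference, here is a minimal sketch of what I am trying to compute, on a made-up data frame (the names `df`, `buyer` and `amount` and the values are placeholders, not my real data). As far as I can tell, `seq_len(nrow(df))` gives the row numbers as a plain vector, whereas `row()` returns a matrix with one column per column of `df`:

```r
# Toy data, already sorted in descending order of amount (hypothetical values)
df <- data.frame(buyer = c("a", "b", "c", "d"),
                 amount = c(50, 30, 15, 5))

# Row number as a plain vector, divided by the number of rows
df$user_quantile <- seq_len(nrow(df)) / nrow(df)

# Cumulative share of the total amount bought
df$amount_share <- cumsum(df$amount) / sum(df$amount)

print(df)
```

With these toy values, the first buyer is 25% of the buyers and accounts for 50% of the total amount.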
Is there instead any way to skip adding percentages as variables and only have them for axes values?
The plot has far more points than I need to draw the line. What is the best approach to minimize the computational effort and still get a nice graph?
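One idea I had, shown as a rough sketch on simulated data (the exponential amounts, the seed and the choice of 1000 points are all arbitrary assumptions): compute the cumulative shares as plain vectors, keep only evenly spaced indices, and multiply by 100 when plotting so that no percentage columns have to be stored in the data frame.

```r
set.seed(1)
n <- 2e6
# Simulated purchase amounts, sorted in descending order like my real data
amount <- sort(rexp(n), decreasing = TRUE)

x <- seq_len(n) / n                # cumulative share of buyers
y <- cumsum(amount) / sum(amount)  # cumulative share of the amount bought

# Keep about 1000 evenly spaced points; the last index is always n
idx <- unique(round(seq(1, n, length.out = 1000)))

# Convert to percentages only at plotting time
plot(100 * x[idx], 100 * y[idx], type = "l",
     xlab = "% of buyers", ylab = "% of total amount bought")
```

I do not know whether thinning by index like this is the recommended approach or whether there is something better suited to data of this size.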
Thanks.