The general approach is to read the source buffer as efficiently as possible, and shift it as required on the way to writing the destination buffer.
You don't have to work a byte at a time. You can always get the source reads long-aligned for the bulk of the operation by handling up to three bytes individually at the beginning (assuming a 4-byte long), and by handling the end the same way, since you shouldn't attempt to read past the stated source buffer length.
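As a rough sketch of that head/body/tail split, the helper below works out how many bytes to handle individually before and after the aligned middle. The names `span_split` and `split_for_aligned_reads` are made up for illustration, a 32-bit word is assumed, and it relies on the (implementation-defined, but true on mainstream platforms) behavior that the low bits of a pointer converted to `uintptr_t` reflect its alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a source buffer divides up. */
typedef struct {
    size_t head;   /* unaligned bytes before the first aligned word */
    size_t words;  /* whole aligned 32-bit words in the middle */
    size_t tail;   /* leftover bytes after the last whole word */
} span_split;

/* Split [src, src + len) into a byte-wise head, an aligned run of
 * 32-bit words, and a byte-wise tail. Nothing past src + len is counted. */
static span_split split_for_aligned_reads(const void *src, size_t len)
{
    const size_t w = sizeof(uint32_t);
    span_split s;
    size_t misalign;

    assert(src != NULL || len == 0);

    /* Assumption: low address bits indicate alignment. */
    misalign = (size_t)((uintptr_t)src & (w - 1));

    s.head = (w - misalign) & (w - 1);   /* 0..3 bytes to reach alignment */
    if (s.head > len)
        s.head = len;                    /* short buffer: all head */
    s.words = (len - s.head) / w;
    s.tail = (len - s.head) % w;         /* 0..3 trailing bytes */
    return s;
}
```

With a buffer that starts one byte past a 4-byte boundary and is 10 bytes long, this gives a 3-byte head, one aligned word, and a 3-byte tail.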
From the values read, you shift as required to get the bit alignment desired and assemble finished bytes for writing to the destination. You can apply the same optimization to the writes, using the widest aligned word size available.
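The shift-and-assemble step might look something like the sketch below. It is only an illustration under some assumptions of mine: bits are numbered MSB-first (bit 0 is the top bit of `src[0]`), the destination starts on a byte boundary, and the function name `copy_bits` is invented here. To keep it short it loads and stores one byte at a time through an accumulator; the wider aligned reads and writes described above would replace those single-byte loads and stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy nbits bits from src, starting src_bit bits into the buffer
 * (bit 0 = most significant bit of src[0]), to dst starting at its
 * first bit. Unused low bits of the last dst byte are set to zero.
 * src_len is the source length in bytes and is never read past. */
static void copy_bits(uint8_t *dst, const uint8_t *src, size_t src_len,
                      size_t src_bit, size_t nbits)
{
    size_t in = src_bit / 8;            /* next source byte to load */
    unsigned skip = (unsigned)(src_bit % 8); /* leading bits to drop */
    uint32_t acc;                       /* accumulator, newest bits lowest */
    unsigned have;                      /* count of valid bits in acc */
    size_t out = 0;

    assert(src_bit + nbits <= src_len * 8);
    if (nbits == 0)
        return;

    /* Prime the accumulator, masking off the bits before src_bit. */
    acc = src[in++] & (0xFFu >> skip);
    have = 8 - skip;

    /* Emit whole destination bytes. */
    while (nbits >= 8) {
        if (have < 8) {
            acc = (acc << 8) | src[in++];
            have += 8;
        }
        dst[out++] = (uint8_t)(acc >> (have - 8));
        have -= 8;
        acc &= (1u << have) - 1;        /* keep only the unconsumed bits */
        nbits -= 8;
    }

    /* Emit a final partial byte, left-justified. */
    if (nbits > 0) {
        if (have < nbits) {
            acc = (acc << 8) | src[in++];
            have += 8;
        }
        dst[out] = (uint8_t)((acc >> (have - nbits)) << (8 - nbits));
    }
}
```

For example, copying 12 bits starting at bit 4 of the bytes `AB CD EF` yields `BC D0` in the destination.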
If you dig around in the source of a compression tool or library that makes extensive use of variable-width tokens (zlib, and codecs for MPEG, TIFF, and JPEG, all leap to mind), you will likely find sample code that treats an input or output buffer as a stream of bits, which should give you some implementation ideas to think about.