Sharing my PoC for a HuggingFace Multi-worker Server
Python's limitations hit hard sometimes.
Python's nature brings a lot of challenges when dealing with blocking I/O. The HuggingFace SDK doesn't provide an out-of-the-box solution for running model inference in multiple threads, although the lower-level frameworks (PyTorch and TensorFlow) provide the necessary tooling. The HF docs suggest using a multi-threaded web server, but my attempts to apply the same snippet didn't work out.
I urgently needed a PoC for a multi-tenant service (more than one user using LLM capabilities at once), so I decided to build one around the workers concept: a configurable number of workers are started alongside the backend to provide a multi-tenant API for LLM inference.
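The workers concept can be sketched roughly as follows. This is a minimal illustration, not the PoC's actual code: I use in-process threads and a stub `run_inference` function for brevity, whereas a real deployment would load a HuggingFace pipeline in each worker (typically a separate process) and replace the stub with the blocking model call.

```python
import queue
import threading

def run_inference(prompt: str) -> str:
    # Stand-in for the real (blocking) model call, e.g. a HF pipeline.
    return f"answer to: {prompt}"

def worker(requests_q: queue.Queue, responses_q: queue.Queue) -> None:
    # Each worker handles one request at a time; a blocking inference
    # call stalls only this worker, not the whole backend.
    while True:
        job = requests_q.get()
        if job is None:  # poison pill: shut down
            break
        job_id, prompt = job
        responses_q.put((job_id, run_inference(prompt)))

NUM_WORKERS = 2  # illustrative; tune to available GPU/CPU memory
requests_q: queue.Queue = queue.Queue()
responses_q: queue.Queue = queue.Queue()
pool = [threading.Thread(target=worker, args=(requests_q, responses_q))
        for _ in range(NUM_WORKERS)]
for t in pool:
    t.start()

# The backend would enqueue one job per tenant request.
for i, prompt in enumerate(["hello", "world"]):
    requests_q.put((i, prompt))
results = dict(responses_q.get() for _ in range(2))

for _ in pool:
    requests_q.put(None)  # stop all workers
for t in pool:
    t.join()
```

The queue decouples the API layer from inference: tenants' requests pile up in `requests_q`, and whichever worker is free picks up the next one.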
The results were satisfying, as the work didn't go too far off-track for me. I published it in a GitHub repo. The demo runs the falcon-40b-instruct model in conversational mode and lets users provide a knowledge source, such as an article, then ask a question that the LLM answers assuming its only knowledge is that article. Instructions for running and using the PoC are available in the repo.
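Constraining the model's knowledge to the supplied article comes down to prompt construction. The template below is hypothetical (the repo's actual wording may differ), but it shows the shape of the idea:

```python
def build_prompt(article: str, question: str) -> str:
    # Hypothetical template: instruct the model to treat the supplied
    # article as its only knowledge source.
    return (
        "Answer the question using ONLY the following article as your "
        "knowledge source. If the answer is not in the article, say so.\n\n"
        f"Article:\n{article}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt("Falcon-40B was released in 2023.", "When was Falcon-40B released?")
```

The resulting string is what gets sent to the conversational pipeline for each user question.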
Header photo: A work by freestocks on Unsplash