Sharing my PoC for HuggingFace Multi-worker Server


Python problems hit hard sometimes.

Python's nature brings a lot of challenges when dealing with blocking I/O. The HuggingFace SDK doesn't provide an out-of-the-box way to run model inference across threads, although the lower-level frameworks (PyTorch and TensorFlow) provide the necessary tooling. The HF docs suggest using a multi-threaded web server, but my attempts to apply the suggested snippet didn't turn out well.
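To make that concrete, the docs' suggested direction essentially boils down to pushing the blocking inference call onto a thread. A minimal sketch of that approach (using asyncio.to_thread and gpt2 as a stand-in model, neither of which comes from the repo) would look roughly like this:

```python
import asyncio

from transformers import pipeline

# Small stand-in model; the PoC itself targets much larger ones.
generator = pipeline("text-generation", model="gpt2")

async def handle_request(prompt: str) -> str:
    # Push the blocking pipeline call onto a worker thread so the
    # event loop can keep accepting other tenants' requests.
    outputs = await asyncio.to_thread(generator, prompt, max_new_tokens=32)
    return outputs[0]["generated_text"]

async def main() -> None:
    # Two concurrent "tenants" issuing requests at once.
    answers = await asyncio.gather(
        handle_request("Hello there,"),
        handle_request("Worker pools are"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())
```

In my attempts, this kind of setup didn't behave well under concurrent load, which is what pushed me toward separate worker processes instead.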

I urgently needed a PoC that could provide a multi-tenant service (more than one user using LLM capabilities at once), so I decided to build one around the worker concept: an arbitrary number of workers can be started alongside the backend to provide a multi-tenant API for LLM inference, as sketched below.
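The repo's actual plumbing may differ in its details, but the core idea can be sketched with Python's multiprocessing: a fixed pool of worker processes, each loading its own copy of the model, consuming jobs from a shared queue. NUM_WORKERS, the queue layout, and the gpt2 stand-in below are all illustrative assumptions:

```python
import multiprocessing as mp

from transformers import pipeline

NUM_WORKERS = 2  # assumption: sized to available GPU/CPU memory

def worker(task_q, result_q):
    # Each worker process owns a full model instance, so requests
    # run in parallel without contending for one interpreter's GIL.
    generator = pipeline("text-generation", model="gpt2")  # stand-in model
    while True:
        job = task_q.get()
        if job is None:  # poison pill: shut this worker down
            break
        job_id, prompt = job
        out = generator(prompt, max_new_tokens=32)
        result_q.put((job_id, out[0]["generated_text"]))

if __name__ == "__main__":
    task_q = mp.Queue()
    result_q = mp.Queue()
    workers = [mp.Process(target=worker, args=(task_q, result_q))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()

    # The backend enqueues one job per tenant request...
    for job_id, prompt in enumerate(["Hello there,", "Worker pools are"]):
        task_q.put((job_id, prompt))

    # ...and collects results as workers finish them.
    for _ in range(2):
        print(result_q.get())

    for _ in workers:
        task_q.put(None)
    for w in workers:
        w.join()
```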

The results were satisfying, as the work didn't go off-track by much. I published it in a GitHub repo named huggingface_multi_worker_server.

The demo runs the falcon-40b-instruct model in conversational mode and lets users provide a knowledge source (an article) and ask a question, which the LLM answers as if the article were its only knowledge. Instructions for running and using the PoC are available in the repo.
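To give a feel for the interaction (the endpoint, port, and payload below are hypothetical placeholders; the repo's instructions document the real API), a client request might look like:

```python
import requests

BASE_URL = "http://localhost:8000"  # hypothetical; see the repo for the real address

article = ("Falcon is a family of large language models released by TII; "
           "falcon-40b-instruct is the instruction-tuned 40B variant.")

# The /ask route and payload shape are made up for illustration only;
# the repo's README documents the actual routes and fields.
resp = requests.post(
    f"{BASE_URL}/ask",
    json={"article": article, "question": "Who released the Falcon models?"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json())
```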

Header photo: A work by freestocks on Unsplash

Creative Commons License

This page's content, excluding exceptions, is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.