Run your own LLM on Hugging Face in 7 steps

Warren Bickley 27 August 2023

Screenshot of the Hugging Face hero banner on the home page
Hugging Face is the home of open source data science and machine learning: a veritable hub of knowledge and collaboration for AI experts and enthusiasts.

TL;DR

Running open source LLMs is easy with Hugging Face's Inference Endpoints, and the future of both software and work will incorporate AI in many capacities.

Step 1: Create a Hugging Face account

To get started, you will need to create an account on the Hugging Face website or log in if you already have an account.

Screenshot of the Hugging Face sign up page
Sign up to Hugging Face.

Step 2: Choose a pre-trained model

Hugging Face hosts a variety of pre-trained models that you can run. If this is your first time working with LLMs, we’d recommend any of the Meta Llama 2 models, with the 7B being the most cost-efficient to run.

You can find models in the Hugging Face model repository. Once you have chosen one, copy the model name by clicking the clipboard icon next to it.

Screenshot of the Llama-2-7b model on the Hugging Face model repository
Find a model you want to run.
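If you’d rather search programmatically than browse, the huggingface_hub Python library can query the model hub for you. Here’s a minimal sketch, assuming a recent version of the library is installed (pip install huggingface_hub):

```python
# A minimal sketch: search the Hugging Face hub for Llama 2 models,
# sorted by downloads (assumes a recent `huggingface_hub` release).
from huggingface_hub import list_models

for model in list_models(search="Llama-2", sort="downloads", direction=-1, limit=5):
    # model.id is the name you paste into the endpoint form,
    # e.g. "meta-llama/Llama-2-7b-hf".
    print(model.id)
```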

Step 3: Create a new inference endpoint

With the model name on your clipboard, you can go to Hugging Face’s Inference Endpoints by clicking “Solutions” in the top menu and then “Inference Endpoints”. From there you can start creating your own inference endpoint.

Screenshot of the menu with a link to the inference endpoints management page.
Go to the inference endpoints management page.

Step 4: Enter your model details

In the “Model Repository” field, paste the model name you copied earlier. You should also give your endpoint a unique name.

Screenshot of the endpoint creation screen.
Input the options for your endpoint.

Step 5: Choose an instance type

You can choose the cloud platform you wish to run the model on, but we’d recommend sticking with AWS us-east-1 for now due to the wider choice of instance types. This usually only matters when running very large models (~70B parameters, for example).

A GPU [medium] instance should be enough to run a 7B model. If the platform doesn’t think your chosen instance type is capable of running the model, it will present you with a helpful warning, so you can click through and experiment with what works.

Screenshot of the select instance type menu.
Choose an instance type.
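If you prefer scripting this setup over clicking through the UI, recent versions of the huggingface_hub library also expose a create_inference_endpoint helper. The sketch below is illustrative only: the endpoint name is made up, and the exact accelerator, instance_type and instance_size values available depend on your account and region, so treat them as assumptions and check the Inference Endpoints documentation before relying on them.

```python
# A minimal sketch of creating an endpoint from code instead of the UI
# (assumes a recent `huggingface_hub` and that you are already logged in,
# e.g. via `huggingface-cli login`).
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-llama-2-7b",                        # hypothetical endpoint name
    repository="meta-llama/Llama-2-7b-hf",  # the model name copied in step 2
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",
    accelerator="gpu",
    instance_type="nvidia-a10g",  # assumption: roughly the UI's GPU [medium]
    instance_size="x1",           # assumption: names vary, check the docs
    min_replica=0,                # scale-to-zero, covered in step 6
    max_replica=1,
)

endpoint.wait()      # block until the endpoint is running
print(endpoint.url)  # the URL you will send requests to in step 7
```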

Step 6: Enable scale-to-zero (optional)

Hugging Face have a really handy “scale-to-zero” option which we highly recommend for non-production environments (and even production environments if handled correctly!). This shuts your instance down if there have been no requests for a specified period of time (currently 15 minutes is the only option). If you try to make a request whilst the instance is down, the gateway will return a 502 status code and start the instance back up. Startup takes some time, but if handled well in your UI you can present the user with this information and auto-retry until the endpoint comes back online.

Screenshot of the scale-to-zero options in Hugging Face.
Enable scaling the instance count to zero.
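In practice that auto-retry is just a short loop around the request. Here’s a minimal sketch using the requests library; the endpoint URL and token are placeholders, and the 502-then-retry behaviour is exactly what’s described above.

```python
# A minimal sketch of retrying whilst a scaled-to-zero endpoint wakes up.
# ENDPOINT_URL and HF_TOKEN are placeholders for your own values.
import time
import requests

ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."                                                 # placeholder

def query(prompt: str, max_wait_seconds: int = 300) -> dict:
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    deadline = time.time() + max_wait_seconds
    while True:
        response = requests.post(ENDPOINT_URL, headers=headers, json={"inputs": prompt})
        if response.status_code == 502:
            # The endpoint is scaled to zero and starting back up:
            # surface this to the user and retry until it's online.
            if time.time() > deadline:
                raise TimeoutError("Endpoint did not come back online in time")
            print("Endpoint is waking up, retrying in 10 seconds...")
            time.sleep(10)
            continue
        response.raise_for_status()
        return response.json()
```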

Step 7: Make a request to your model

With your model now running, you can scroll to the bottom of the endpoint page, where you’ll find call examples in different languages to experiment with.

If you’re using a tool like Postman, you can copy the cURL request and import it (File → Import…) for easy experimentation. You can also click “Add API token” within the Hugging Face UI to automatically add your API token to the request code.

Screenshot of the call examples.
Easily make a request to your endpoint.
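If you’d rather call the endpoint from Python than cURL, the huggingface_hub library’s InferenceClient works with Inference Endpoint URLs too. A minimal sketch, assuming a recent library version; the URL and token are placeholders:

```python
# A minimal sketch: query your endpoint from Python instead of cURL
# (assumes a recent `huggingface_hub`; the URL and token are placeholders).
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://your-endpoint.endpoints.huggingface.cloud",  # your endpoint URL
    token="hf_...",                                             # your API token
)

# Roughly equivalent to the text-generation cURL example shown on the endpoint page.
output = client.text_generation("Explain what an inference endpoint is.", max_new_tokens=100)
print(output)
```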

Running your own LLM couldn’t be easier thanks to Hugging Face’s intuitive platform. It’s a great way to get started with AI and massively reduces the barrier to entry.

Join our waitlist to stay updated on Eject.