Gemma on Device with MediaPipe & TensorFlow Lite
Try On-Device AI! Gemma Demo with MediaPipe & TensorFlow Lite.
Everyone should be running Gemma on their desktop, but sometimes you don't want to run it only on your desktop. For those of you who want to run it on your phones, or who want to run it in your web apps (come on, hands up): imagine every time you pulled up a website or loaded an app on your phone, it was fully unique.
There was nothing preset; it was generated on the fly for you using an on-device model. I'm not going to show that demo (we'll get there in a few years), but that's the future we're heading towards. What we have today is the first version of LLMs running on-device in web and mobile apps, and we're going to walk you through that. Before we get straight into Gemma, let's talk a little bit about the stack that enables this.
Power Up Your Device
TensorFlow Lite is Google's on-device ML framework; all the models and all the math run in TensorFlow Lite. You take a TensorFlow model and run it through the converter, which applies all sorts of nice quantization optimizations for you, to get a flatbuffer that you can run on the device and accelerate against many backends. But sometimes you don't want to work at the low TensorFlow Lite level; handling tensors can be a little messy. Sometimes you want to work at a slightly higher level, and that's MediaPipe. It gives you native APIs where you can pass in text and images and get back solutions. It does that by chaining multiple ML models together along with pre- and post-processing.
So we're going to switch over to MediaPipe, which already has lots of solutions out of the box for you: gesture recognition, segmentation, classification. And just a few hours ago, we released a new solution, LLM Inference, which lets you run lots of LLMs on your device.
Gemma on Device with MediaPipe & TensorFlow Lite
So what can you do with Gemma on your device? You can run it without any of the privacy concerns of sending data back and forth to a server, without any of the cloud bills, and you can do it really quickly across Android, iOS, and the web. Here is Gemma 2B in a web app; this is live, and you can go to the MediaPipe Studio website and play with it yourself. And here we have Gemma running on an Android device and an iOS device, using our sample apps that are hosted on our GitHub; download them and play around with them.
So you can get Gemma running on your device, but what kinds of use cases is it good for? We are primarily talking about the 2B model here; we're not quite at 7B on device yet, as much as we would love to be. You can use it for text generation, and we see lots of great examples of content generation, smart replies, and emails.
Unlocking On-Device AI
We think this is going to be really prevalent in web apps going forward. You can do text rewrites: any time you have users generating content, you can allow them to change the style or the length on the fly. We're also really excited about classification: imagine user-generated content that you can check for toxicity, run sentiment analysis on, or screen to see whether it's appropriate for your site or app. Last but not least, we think document Q&A is going to be really exciting. We're excited to see lots of applications built on top of this, letting developers query against lots of different kinds of text, and once again, you can do all of this offline, with no privacy concerns and without paying any cloud bills.
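To make the classification idea concrete, here is a minimal sketch of sentiment classification done purely by prompting, using an LlmInference instance like the one created in the walkthrough below. The helper function and prompt wording are illustrative assumptions, not a dedicated classification API.

```js
// Hypothetical helper: classify sentiment by prompting an on-device LLM.
// `llmInference` is an LlmInference instance like the one set up later in this walkthrough.
async function classifySentiment(llmInference, userText) {
  const prompt =
      'Classify the sentiment of the following text as POSITIVE, NEGATIVE, or NEUTRAL. ' +
      'Answer with one word.\n' +
      `Text: "${userText}"\n` +
      'Sentiment:';
  // generateResponse can be awaited for the full response when no streaming listener is passed.
  const response = await llmInference.generateResponse(prompt);
  return response.trim();
}
```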
There's a lot of promise here, so let's see how easy it is to actually build this. Before, we showed the Studio demo, which is our nice polished demo; this is our example JavaScript app. We have it on our GitHub, so there's no need to write down all the code we're about to show you, and we're going to walk through all the code you need to create this in just a few slides.
This is all the HTML you'll need; it's not too much. The one thing to call out here is that you have to load our web SDK, which is pretty straightforward. Once you've loaded the web SDK, you hop over to the JavaScript and load in the two functions you'll need, and then you link your model. For this demo we're running it locally, so I have the model in the folder.
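For reference, a minimal sketch of that HTML might look like the following. The element IDs and file names are placeholders, and here the web SDK is pulled in by an ES-module import inside index.js (shown in the next snippet) rather than a separate script tag.

```html
<!-- Minimal page for the demo: a prompt box, an output area, and a module script.
     The MediaPipe GenAI web SDK is imported inside index.js (next snippet);
     element IDs here are illustrative placeholders. -->
<!DOCTYPE html>
<html>
  <body>
    <textarea id="prompt">Draft a polite reply to my landlord.</textarea>
    <button id="generate">Generate</button>
    <pre id="output"></pre>
    <script type="module" src="index.js"></script>
  </body>
</html>
```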
Gemma On-Device Demo
If you were hosting this as a web app, you would host the model and put its URL here at the top. You can see we do some setup: we go grab our Wasm files and put our configuration here. We're just going to use all the defaults, so we only put the base model URL in there, but this is where you would set things like your temperature, your top-k, your max tokens, or anything else you want to configure. Then at the bottom we run our actual LLM inference. It takes two things: your prompt and your listener function, which we showed on the last slide as the couple of lines that print the tokens on screen as they are generated asynchronously. And last but not least, you initialize your LLM inference. That's it! That's all the code you need to run Gemma in your browser.
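Here is a hedged sketch of that JavaScript, assuming the @mediapipe/tasks-genai package and the element IDs from the HTML sketch above; the CDN paths, model filename, and option values are illustrative, not the exact demo code.

```js
// index.js: sketch of the steps described above (paths, filenames, and values are illustrative).
import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

// For this demo the model sits next to the page; a hosted web app would point at a model URL.
const MODEL_PATH = './gemma-2b-it-gpu-int4.bin';

// Grab the Wasm files, then initialize the LLM Inference task with the base model path.
const genaiFileset = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

const llmInference = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: {modelAssetPath: MODEL_PATH},
  // Defaults are fine for the demo; this is where you would tune generation.
  maxTokens: 1000,
  topK: 40,
  temperature: 0.8,
});

// Run inference: pass a prompt and a listener that prints tokens on screen as they stream in.
const output = document.getElementById('output');
document.getElementById('generate').addEventListener('click', () => {
  output.textContent = '';
  const prompt = document.getElementById('prompt').value;
  llmInference.generateResponse(prompt, (partialResult, done) => {
    output.textContent += partialResult;
  });
});
```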
But this is a demo, so we should probably actually run it. Can we switch over to the laptop? Wonderful. What you see here is the demo we just built. So you might wonder: what are you really going to do with this in a web app? I don't know about you, but I spend way too much time
writing emails and worrying about all the nuances. So I'm going to tell Gemma: hey, my landlord just sent me a letter saying I need to pay him some money, and I don't think I really should. I don't want to sit here and write this email for half an hour; I'm going to have Gemma do it for me. And does it do a pretty good job? Not only does it write the email, it even asks: "Can you provide written confirmation of your decision?" I probably would have forgotten that; that's pretty handy.