The Data Coffee Break Podcast

#31 - Open vs. Closed LLMs: A Strategic Guide

Christian Silva & Marc Montanari Season 1 Episode 31


Ever wondered how cycling from London to Paris could inspire an in-depth discussion about AI? Join us as Marc shares his scenic adventure along the charming Avenue Verte, setting the stage for a captivating exploration into the world of open-source large language models (LLMs). Christian and Marc delve into the heart of AI, examining the strengths and limitations of open models like Mistral, Meta's Llama, and Google's Gemma compared to closed systems such as GPT, Gemini, or Claude. Discover the power of open weights and how they provide users with the flexibility to harness pre-trained parameters for a myriad of applications, shedding light on the architecture and performance of these rapidly evolving models.

As we navigate the nuanced landscape of AI, we compare the practicalities and strategic advantages of open versus closed LLMs. While open-source models may require a steeper learning curve and higher initial investment in technical expertise, they offer unparalleled customisation for specialised tasks and niche programming languages. Conversely, closed models like GPT-4 and Gemini shine in delivering cost-effective solutions for more general applications. We also delve into the strategic deployment of LLMs for on-device applications and the critical importance of data sovereignty in fields like healthcare. Tune in to gain a comprehensive understanding of how these developments might impact your projects and reshape your industry.

Follow us on LinkedIn, Instagram, Twitter, TikTok and YouTube!
Music by Skilsel.
This podcast represents our views and not the ones of our employers.

Marc:

Welcome to the Data Coffee Break podcast. I'm Marc and I'm Christian. If you are passionate about data, like us, take a seat, relax and join us for our coffee break where we discuss all things data. And remember, there are no filters, no PR. It's just real-life experience. So let's begin. Hi, Christian, anything crazy you've been doing the last few weeks?

Christian:

Last few weeks? I think I would say the last few months. Oh my God, time flies, considering when we released the previous episode, right? I was experimenting with this NotebookLM application, free to use. It has this very interesting podcast feature. It's still very focused on researchers and students, clearly, that application, but it sounds super natural.

Marc:

I might put in a short segment about what the Data Coffee Break podcast is, maybe. Oh yes, exactly. You will be listening to it now.

AI 1:

Hey, everyone, welcome back, Ready to dive into something new. Today we're going to be checking out this podcast, the Data Coffee.

Marc:

Break. Oh my God, the voice is so great. The voice is so natural. Pretty good, right? 30 episodes since 2022.

AI 2:

Oh, it's a conversation.

AI 1:

Making data, like all this data stuff we hear about, less intimidating for those of us who don't spend all day coding.

AI 2:

Right, exactly Like they want analysts, sales engineers, anyone who works with data, to be in the loop.

AI 1:

Totally so. It's not just for the hardcore techies. I thought we could kind of dig into what they've got going on, what we liked and all that. What really struck you about their approach?

AI 2:

You know, for me it's their knack for breaking down these really complex ideas into something. This is crazy.

Marc:

And what about you? The first thing that comes to my mind is something not related to data, not related to computers or whatever; it's being outside. I've been cycling from London to Paris.

Christian:

Oh yes, I saw that. How was it?

Marc:

Enjoyable but also painful. You get this fantastic, what they call Avenue Verte, so Green Avenue, in particular on the French side, where you're literally on a cycling highway, away from the main road, surrounded by nature for many kilometres. It's really good to do. I would definitely advise anyone who wants to do this kind of thing in the UK and France, but I'm sure in many other countries, Europe, South America, North America, Asia, you can do the same.

Marc:

Cycling is fantastic. How long did it take? We left on Friday after work, stopped by Gatwick Airport, south of London, outside London, and on Saturday went to the coast and took the ferry, obviously. By Tuesday midday we were basically in Paris, and in the afternoon I was taking the Eurostar back to London. I didn't even stay long, but it was really enjoyable.

Christian:

Nice, that was really cool, man yeah.

Marc:

All right, so let's dive into the topic of today. No sports, but building on some of the previous episodes around generative AI and large language models, right? What we have seen in the past few months, 18 months, two years, is the rise of companies developing large language models and GenAI tools for commercial purposes. But there is also a very big and thriving community of what we can call open-source LLMs, open LLMs, right?

Christian:

Exactly, as you mentioned. We have, of course, OpenAI; everyone knows about ChatGPT and GPT, and that's considered a closed model. But we also have Anthropic and Google, big players there, doing their own versions of closed models, these massive black-box LLMs. And we started looking at the open LLMs. The biggest ones, of course, are Mistral, Llama from Meta, which just a couple of days ago released version 3.2, Gemma from Google, and there is a very big repository of open-source LLMs, which is Hugging Face, right? Exactly.

Marc:

And you mentioned Mistral as an example of a company that has an open LLM. But many of our listeners know Mistral is a startup: they have an offering, companies can pay for the service, for the language model, use the API and so on. So what can be confusing for many is: why is it an open LLM if there is still a commercial side to it? I think it would be good to get into that first, right? Yes, let's start from there.

Christian:

When we say open LLM, it's actually what we call open weights. Essentially, you have the publicly available code and, in most cases, access to the architecture: how it was built, whether it's a Transformer architecture or other techniques. But the weights are quite important, because those weights are the parameters that the model learned during its initial training on massive datasets. So you get the parameters that come out of the model's training, which means users of those models can leverage the existing knowledge without having to train it from scratch. That is quite a big thing.

Christian:

The data that was used for training may or may not be open, right? Usually it's not disclosed which datasets were used for training. But the weights, that is quite a thing, right? It's like saying: I give you the parameters that I used to train the model and that produce these results, without you needing to do massive training. So you can just take them, tweak them for your own purposes and start creating your own versions on top of those. As I mentioned, Llama, Falcon, they're all open-weights models. Yeah.

Marc:

Absolutely.

Christian:

Yes. Now, why is this important, or why have they become quite mainstream? Because we see them performing the same as, or very similarly to, these large, let's say, closed models when it comes to common tasks like summarisation and classification. The benchmarks for these models are now almost at the level of the big closed models, right? Just look at Llama or Gemma from Google; they're quite high in the benchmarks.

Marc:

So in this case you mentioned benchmarks such as LMSYS, right? A kind of readable benchmark.

Christian:

The Chatbot Arena, correct. Some people are like: well, I prefer to download it onto my own device. That's another angle, if you want, right? What are the angles that we see? One is the ability to see the parameters; another is the fact that they are free. Some of them may have specific licences, but in general they are free to use. The other thing I was going to mention is that they are a very good option for running on-device, because these models are already pre-trained, which means the model sizes are generally smaller, or you don't need a massive amount of GPUs to get results out of them. So when it comes to running on-device, and the device can be your laptop, right, not necessarily a phone, that's also quite an attractive use of these models.
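The on-device point comes down to simple arithmetic on weight storage. A minimal sketch, with illustrative numbers (the 8-billion-parameter figure is our assumption, roughly matching models like Llama 3.1 8B, not something stated in the episode):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights (ignores activations and KV cache)."""
    return n_params * bytes_per_param / 1024**3

# Illustrative: an 8-billion-parameter open model.
fp16_gb = model_memory_gb(8e9, 2.0)   # 16-bit weights
int4_gb = model_memory_gb(8e9, 0.5)   # 4-bit quantised weights

print(f"fp16: ~{fp16_gb:.1f} GB, int4: ~{int4_gb:.1f} GB")
```

At 4-bit precision the same model fits in a few GB, which is why quantised open-weights models are the usual choice for laptops and phones, while full-precision weights already strain a single consumer GPU.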

Marc:

Absolutely, yeah. But they obviously have some disadvantages in some aspects, and maybe after we can compare that with some of the closed models, the most commercial ones. So, as you mentioned, the performance can obviously be discussed. It seems like the closed models and the big commercial entities, with so many resources for development, will usually be more cutting edge and have better performance; that's my point of view from what I've seen. There is the aspect of having less support and documentation around open models, so you will have to be more technical overall. And, as you mentioned, you can download one and use it on your device, but that requires much more technical skill. In my view, open models give you more flexibility but demand more technical skills, while commercial models have fantastic documentation and very robust APIs that help you scale and build products on top very easily. That's what I've seen so far. I don't know if you've seen the same or not.

Christian:

Yes, I think you pointed that out very well. Maintenance and support is quite a big thing. If we go a bit more holistic, one of the things that is not exactly a downside but more of a risk: there is a security risk and a bias risk, right? Especially with models where you have access to the architecture or code, those can be used by malicious actors to generate harmful content, or you can have bias amplification, where a model inherits certain biases and makes them much more widespread. So I think it's risky.

Marc:

I agree with you, because with commercial closed models they built all the security around them, really focusing on the risk and safety part; they built all the tools and guardrails around them. While with open models you can have some people asking how to build a bio lab, that's an extreme, right? And if you build that into a product for your company and some people use it for illegal purposes, I think from a legal standpoint your company might be liable. I'm thinking of a commercial scenario where you use those open models to build your own product on top and commercialise it; you could be liable for those aspects. That's definitely one major risk.

Christian:

Yeah, yes, exactly. That's why Meta and Mistral have very clear terms of use, similar to the MIT licences that we see in code. Especially in Europe, these open-source models that you can deploy, not necessarily on-device but even in your own cloud environments, are now much more popular. I'm talking about, let's say, Llama 3.1 Instruct, or now Llama 3.2, because of data sovereignty and regulations. Yeah, so it's a big topic.

Marc:

The EU is one of those, I would say, at the forefront in terms of this kind of legal work, those regulations.

Christian:

Singapore and the EU are leading the way, and that also means that in Europe we don't get certain model versions, certain features, you know. Whether this is the right approach is perhaps a debate for another episode. But definitely for organisations that are highly regulated, which require that the data doesn't leave, let's say, the country where it's processed, these models are a really good sweet spot, right? Because, as we were saying at the beginning of the episode, they can do summarisation very well, they can do classification and chat very well, and the data will never leave, because you're not going to call out to a large foundation model like GPT or Gemini in this case.

Marc:

Absolutely. I want to go back to the second point you mentioned, bias. I mean, the grand idea of a large language model is that the output it gives you is based on the training data. If your training data is specifically curated, providing selected data, that is going to bias the results as well, if I'm correct?

Christian:

Yeah, that's correct.

Marc:

Which is something you might like about open models, but you might risk having some bias. Depending on your use case it might not be an issue, but it could be something to consider when you are part of an organisation and want to select an open model: you have to be conscious of those risks.

Christian:

Yes, and I do think as well, you know, like they always say, that technology comes in cycles, right? And the cycle of AI in a lot of respects reminds me, on this topic in particular, of open-source coding versus closed-source coding. Think about .NET back in the early 2000s, where Microsoft, of course, was the one that held the source code. Perhaps it's not as stark now, because the way code is governed has evolved a lot. But the reason organisations, like a large enterprise, liked having Microsoft or Oracle or IBM code is that there is some accountability there, versus community packages or libraries, where there might or might not be that accountability. So I want to believe that we have learned quite a lot from those examples from twenty-something years ago.

Christian:

The community is the one that is actually self-regulating, to an extent, right? And that's why I think Hugging Face is such an important pillar here, such an important piece of the puzzle. Not that they are regulating anything, of course; I'm not making any claim about how malicious or non-malicious some of these community LLMs are. But providing that hub, you know, where you can go and look and get access to pre-trained models that are open, I think that's really important. We got to the point, yes, exactly, where there is a risk that organisations will prefer the closed models for those reasons, the accountability and the dedicated teams. But in the end, the community will always exist, right? You will always have a community, community projects and everything.

Christian:

Absolutely, it's a cornerstone here. So, to wrap up this short talk, I wanted to give some actionable examples. Let's say I want a comparison between open LLMs, again Mistral, the Llamas, versus a closed LLM like GPT-4 or Gemini. When would I use each?

Marc:

There are different criteria, yeah, that you need to take into consideration. We brushed over quite a few of them, maybe not all. Cost-wise, I read quite a few articles saying open LLMs will be higher in cost in the short term: you have to hire more technical people, engineers, to deploy and fine-tune them. But in the long run, as you won't be paying for commercial use most of the time, you should save more money.
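Marc's cost argument can be sketched as a simple break-even calculation. The figures below are hypothetical, purely for illustration, not from any article mentioned in the episode:

```python
def months_to_break_even(setup_cost: float,
                         monthly_infra: float,
                         api_monthly: float) -> float:
    """Months until cumulative self-hosting cost drops below cumulative API spend.

    setup_cost: one-off engineering cost to deploy and fine-tune an open model.
    monthly_infra: ongoing hosting cost for the open model.
    api_monthly: what the same workload would cost through a closed-model API.
    Returns infinity if the API is always cheaper.
    """
    monthly_saving = api_monthly - monthly_infra
    if monthly_saving <= 0:
        return float("inf")
    return setup_cost / monthly_saving

# Hypothetical numbers: 60k upfront, 4k/month to self-host vs 10k/month in API fees.
m = months_to_break_even(setup_cost=60_000, monthly_infra=4_000, api_monthly=10_000)
print(m)  # 10.0 months
```

The point of the sketch is the shape of the trade-off: at low volume the API side wins indefinitely, and only past a certain usage level does the upfront investment in an open model pay back.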

Christian:

Yes, exactly. Or just get them through your Hugging Face API key, right? So when it comes to cost of use, it's generally free, well, we say free, versus the subscription fees and usage costs of the closed models. Customisation, again: when you need to fine-tune something like GPT or Gemini, you are restricted to the APIs, right? You also don't have as much flexibility for modification and adaptation, like the low-rank adaptations, the LoRAs, that you can use with open LLMs. Transparency, right?
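The LoRA idea Christian mentions is worth making concrete: instead of updating a full weight matrix, you train two small low-rank factors and add their product to the frozen weights. A minimal numpy sketch; the matrix sizes are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                             # hidden size of one layer; LoRA rank

W = rng.standard_normal((d, d))            # frozen pre-trained weight matrix
A = rng.standard_normal((d, r)) * 0.01     # trainable low-rank factor
B = rng.standard_normal((r, d)) * 0.01     # trainable low-rank factor

W_adapted = W + A @ B                      # effective weight after adaptation

full_params = d * d                        # what full fine-tuning of W would train
lora_params = d * r + r * d                # what LoRA actually trains
print(lora_params / full_params)           # fraction of parameters trained
```

Because only A and B are trained, roughly 1.6% of the parameters here, you can keep several task-specific adapters around one set of open weights, which is exactly the kind of modification a closed API typically does not expose.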

Marc:

Obviously.

Christian:

Obviously, yes. If you would like full transparency and access to the source code, open LLMs are your choice versus the closed models. I would say as well: let's think about use cases, right?

Marc:

So a use case that can be very interesting, that we found, is when you have niche programming languages and you want to do some code generation; you might be better off with open LLMs. I'm sure on Hugging Face you can find some of them focusing on, what is this programming language in financial services? COBOL? Yeah, that might be one of them.

Christian:

And you are super right there. The other day, when it comes to niche and somehow coding-related use cases, I was looking at one model on Hugging Face. I'm sorry, I just don't remember the name, but if someone remembers, please send us a message in the comments. It was a model trained specifically to look at charts and generate a description of them, and it can handle very crazy charts, I'm talking about waterfalls and the like, just by looking. This was a multimodal model trained on images, right, and it was trained just for that. Imagine you need a model that can describe charts and even give you some code.

Marc:

Maybe it was SQL Coder 2? Let me check again. SQL Coder is a 15-billion-parameter model that outperforms GPT.

Christian:

MatCha ChartQA, far from it. MatCha ChartQA, it's a different one. MatCha ChartQA is a fine-tune of an already fine-tuned model around ChartQA that basically gives you chart-to-text. Then you can see the contributed models on Hugging Face.

Marc:

But yeah, basically it's a fantastic example of saying: you need to go for a specialised model, which, as far as I know, is only available on Hugging Face, to be able to answer this specific use case.

Christian:

That can be super powerful, of course. Exactly, this was a narrower model on top of ChartQA, which is itself another fine-tuned model. Yes, and your example of fine-tuning for a very niche programming language reminded me of this one, which is another example of a very niche use case.

Marc:

Yeah, another use case I read about: some startups are building models around legal matters, so that's going to be quite commercial as well. Obviously, it's highly differentiated based on the country you operate in. So you might be a better fit going for commercial models there. I don't know, that's a good question; I don't know what you think about that.

Christian:

So I think for documents that have a very standard formatting you might be fine with, you know, a GPT-4o or a Gemini Flash, because inference for them is quite cheap and you can do it very quickly.

Christian:

Basically, if you don't need a lot of training on those documents, right, if a document always has the same standard formatting, you can get away with prompt engineering. But, to your point, if you need some fine-tuning to understand nuances within the documents, or the documents change in nature or format, then yes, an open LLM might be the way, so you can fine-tune it.

Christian:

Another use case we see, as I was telling you, is on-device. We've seen this with Apple Intelligence. Apple trained their own LLMs; they don't disclose which models they used, but they deploy them on-device. So most likely they either built their own version of large language models or they took the weights from an open LLM and created their own wrappers on top, right? So Apple Intelligence, let's say for the summarisation and classification tasks that were shown in the demos, runs on-device, and then whenever the task requires much more processing power, or is too complex, or goes beyond those trained models, you get a prompt asking whether you would like to go and ask ChatGPT, right? That is really how you can leverage both.

Marc:

Yeah, that's going to be the future, where you get this kind of flexibility, and the applications that are going to be built will take into consideration that the question or the task will lead to the use of one model or another, whichever is better suited for that specific task. Exactly.

Christian:

I agree. So yes, I guess the last one would be something like healthcare.

Marc:

I mean, I guess, because we're speaking about private patient data, it has to be more around, as you were pointing out, sovereignty, or at least the data not leaving the premises of the company. So that really skews towards having an open LLM, where you will be able to make sure it resides in your infrastructure. That's what I see. Obviously, you also have to go for models that are more adapted to this kind of use case, so that really skews towards open models. I think step number one would be to go to Hugging Face and see if there is already a model adapted for it.

Christian:

Then take it from there. I think the takeaway from the episode is that we see these open LLMs, as we were saying, getting pretty much the same, or very similar, performance on common tasks as the closed or, let's say, more mainstream models. So pick one based on your requirements and your use case. Be smart. I guess my question to you would be: do you see a world where you have multiple LLMs in one single application?

Marc:

As I was pointing out before, depending on the task you're going to carry out, you might have a first step of figuring out which LLM to use to get the most accurate and potentially the fastest answer, depending on what the priority is.
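Marc's "first step" can be sketched as a tiny routing layer. The model names and task labels here are hypothetical, purely to show the shape of the idea:

```python
# Hypothetical routing table: a cheap local open model for routine tasks,
# a large closed model for anything open-ended or unrecognised.
ROUTES = {
    "summarise": "local-open-8b",
    "classify": "local-open-8b",
    "code_generation": "closed-frontier",
    "open_ended_chat": "closed-frontier",
}

def route(task_type: str) -> str:
    """Pick a model per task; fall back to the larger closed model when unsure."""
    return ROUTES.get(task_type, "closed-frontier")

print(route("summarise"))       # local-open-8b
print(route("something_new"))   # closed-frontier
```

This mirrors the Apple Intelligence pattern discussed earlier: routine summarisation and classification stay on the cheap local model, and only unmatched or complex tasks escalate to the expensive closed one.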

Christian:

I don't know if the most important trade-off there is speed versus accuracy. Yeah, it's like when we were talking about search on vectors, right? Depending on which method you use, you get a faster response, but your recall, let's say your results, might not be super accurate, versus waiting longer for better ones. This would make for a good episode on agents; we can talk about that maybe on the next episode. But until then, thank you very much, Christian, that was fantastic. Thanks for listening to this episode. This podcast represents our views and not the ones of our employers.

Marc:

Our mission at the Data Coffee Break podcast is to inform you and help you grow in this always changing data field.

Christian:

Follow us and get into the conversation with the community on our LinkedIn page and Instagram.

Marc:

See you next Tuesday and until then, keep your data caffeinated.
