17 Python Libraries Every AI Engineer Should Know

  • The AI landscape is changing rapidly, making it hard to keep up.
  • This video covers 17 essential Python libraries for AI engineers.
  • Understanding these libraries is crucial for a successful AI engineering career.
  • Key libraries discussed include Pydantic, FastAPI, SQLAlchemy, and others.
  • Familiarity with modern frameworks and databases is important.
  • The generative AI Launchpad project is available for those interested in practical application.

The AI landscape is constantly changing, and it's really hard to keep up. So in this video, I want to cover 17 essential Python libraries that, in my opinion, every AI engineer should know.

These are the exact libraries that we use to build all of our client projects at Datalumina right now. If you understand most of these libraries, you'll set yourself up for a really great career as an AI engineer, something a lot of companies are hiring for right now.

So let's dive in. But first, real quick, let's make sure we're all on the same page about what an AI engineer really is, because this role has really shifted over the past two years in terms of its responsibilities.

By AI engineering, I mean engineers focusing on integrating pre-trained models into applications or products, rather than training models from scratch, which is something a machine learning engineer or a data scientist would focus on.

All right, and the first library on the list is Pydantic. Now, Pydantic is the most widely used data validation library for Python. It's much more powerful than Python's standard data classes.

The reason it works so well for AI projects is that the data flowing through our systems is often very messy and unreliable. Pydantic really enables you as an AI engineer to actually build AI systems rather than just simple wrappers around LLM APIs.

By structuring and validating your data within your application, you can then pass it down to other functions or next steps and control the flow of your application.
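To make that concrete, here's a minimal sketch (the Invoice model and its fields are hypothetical, not from the video) of how Pydantic coerces messy input where it can and rejects it where it can't:

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    amount: float

# Messy input: the amount arrives as a string, but Pydantic coerces it
invoice = Invoice(vendor="Acme", amount="120.50")
print(invoice.amount)  # 120.5

# Truly invalid data raises a ValidationError instead of flowing downstream
try:
    Invoice(vendor="Acme", amount="not a number")
except ValidationError:
    print("rejected invalid amount")
```

Anything downstream that receives an `Invoice` can then trust its fields without re-checking them.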

All right, and then the second library, following up on that, is Pydantic Settings. So this is part of the Pydantic ecosystem, but it's actually a separate library that you can pip install.

Now with Pydantic Settings, instead of the BaseModel class that you use with regular Pydantic models, you use the BaseSettings class to specify and structure your settings.

Now, I like to create separate files within my project. For example, here you can see an example of an LLM config where I provide everything that I need in order to instruct my LLM.

I can make sure that I have the API key in here, which I load at runtime. If the API key is not available, it will throw an error because of the validation mechanisms built into these base settings.

So overall, Pydantic Settings really helps me to, one, structure all of my settings in a central place, and two, validate all of the important information that my application needs.

All right, and then the third library within the setup part is python-decouple. This ensures that you keep sensitive information like API keys or secrets out of your version control and safe within your .env files.

And like I've said, I really like to use this in combination with the Pydantic Settings to load the environment variables and also validate them at the same time.

All right, and then with the project setup out of the way, let's dive into some backend components starting with the API. This is really going to act as the middle layer for your application, typically connecting the front end or user input with your backend logic.

We like to use FastAPI for this. Now, Flask is another very popular Python library to build APIs, but we prefer FastAPI because it's really straightforward, easy to learn, it just works, it's fast, and it integrates with Pydantic.

Since Pydantic plays such a big role in our development workflow, it makes sense to use FastAPI as well. So with FastAPI, we can define endpoints which we can then send data to.

That data coming in can be specified using, like I've said, Pydantic models, so we make sure that the data flowing into our AI system is first validated by Pydantic. We know exactly what we're dealing with.

So again, data validation and reliability play a really big role within our design choices because AI applications and LLMs are very messy and chaotic. We want to try our best to minimize that chaos or control that chaos as much as possible.

All right, and then next up, we have Celery, which is a Python library that you can use to build task queues to distribute work across multiple threads or machines.

Now this is useful when you're scaling up your application. For example, multiple users, and you want to make sure that your endpoint remains available and active without tasks or processes taking a very long time to complete—which can sometimes be the case when you chain together multiple LLM calls.

Sometimes these tasks take multiple seconds, sometimes even minutes, depending on how big your system is. With Celery, we can ensure that our endpoint remains available by taking all incoming requests and handing them off to the task queue, where they can be processed on a separate machine or thread, allowing our application to scale and stay reliable.

So typically what we do within our FastAPI endpoints is we just store the data in a database and then send it off to Celery and put it in a task queue. This ensures that the operation here at the endpoint level is very quick and non-blocking.

Then we have separate logic for essentially picking up that task and then processing it further in our application.

Hey, and since there's so much demand for AI solutions right now, a lot of developers want to take on side projects next to their full-time job or even transition entirely to an independent career as a freelancer or agency owner.

If that sounds like you, but you struggle to take the leap or struggle to land that first or that second client, you might want to check out the first link in the description. It's a video of me going over how my company can help you with this.

In the video, you'll learn more about our program and also how to see if you qualify. So if that sounds like you, make sure to check that out.

All right, then next, let's talk about data management, and let's focus on databases first because your application needs to store data.

Now, while there are plenty of databases available, two common options that you can look into are either PostgreSQL for a SQL approach or MongoDB for a NoSQL approach.

We actually like to use PostgreSQL for all our database needs. The two Python libraries that can really help with that are Psycopg if you use PostgreSQL and PyMongo if you want to look into MongoDB, both options that I believe you should be familiar with.

Now, of course, your database, regardless of which one you use, is going to play a key role in storing all of the key data that your application relies on. This could be the raw data in the form of events that are coming in, for example, to your endpoint.

Then, of course, you have all the processing, maybe some intermediate steps, and the final output that you send back to the user or your application. You probably want to store all of that in a database.

All right, and then next up we have SQLAlchemy, and this is really going to simplify all of the operations around working with your SQL database.

So, for example, in our case, this would be PostgreSQL, and SQLAlchemy just allows us to specify in pure Python all of the common operations that we would need when working with our database—storing the data, retrieving the data, specifying our models. SQLAlchemy is our Python library of choice for that.

All right, and then following up on that, we have Alembic, which works together with SQLAlchemy to manage your database migrations. This is a lightweight tool that you can use to define your database migrations in pure Python straight from your code base.

So whenever you want to change the structure of your tables, for example, add or remove a column, you can specify this from your code base, and Alembic will perform that migration in the database itself without you having to write complex SQL commands or open your database at all.

This again just helps us manage our entire project basically from within our Python code base, which is really convenient.
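A migration script generated with `alembic revision` looks roughly like this (the table and column names are hypothetical); note it only runs under `alembic upgrade` and `alembic downgrade`, not as a standalone script:

```python
"""Hypothetical migration: add a processed_at column to the events table."""
from alembic import op
import sqlalchemy as sa

def upgrade() -> None:
    # Applied by `alembic upgrade head`
    op.add_column("events", sa.Column("processed_at", sa.DateTime(), nullable=True))

def downgrade() -> None:
    # Applied when rolling back with `alembic downgrade`
    op.drop_column("events", "processed_at")
```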

All right, and then the last one in data management is the Pandas library. This is essentially Excel, but in pure Python, and coming from a data science background, this is probably my most used library.

I really like it for structuring data in a more human-readable way. While this is more of a data science library, it's really powerful when you're building, for example, your evaluation datasets, or you're extracting information from unstructured data and want to structure it. Pandas lets you organize data in rows and columns, perform operations to manipulate it, and then view it again in a very human-readable way.

So also another library that I would recommend for you to dive into.
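For example, a quick sketch (the data here is made up) of structuring extraction results into rows and columns and aggregating them:

```python
import pandas as pd

# Hypothetical results from an LLM extraction pipeline
rows = [
    {"doc": "invoice_1.pdf", "vendor": "Acme", "amount": 120.0},
    {"doc": "invoice_2.pdf", "vendor": "Acme", "amount": 80.0},
    {"doc": "invoice_3.pdf", "vendor": "Globex", "amount": 50.0},
]
df = pd.DataFrame(rows)

# Group and aggregate, then view the result in a human-readable table
total_per_vendor = df.groupby("vendor")["amount"].sum()
print(total_per_vendor)
```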

All right, and then up next, let's talk about AI integration. First of all, of course, we have all the LLM model providers. This is probably an open door, but for the sake of clarity and completeness: make sure you understand the OpenAI API and Anthropic's API, look into Google's API, and finally, I can recommend looking into Hugging Face's Hub, which provides a unified interface that you can use to run and experiment with open-source models.

Now, given that you're here watching these videos, you're probably already familiar with these. But also make sure that you go beyond the simple quick starts, right? Really read through the API documentation.

For example, for OpenAI, look into function calling, structured output, and look into the vision models and image generation. There is really a lot behind these APIs that you're probably not familiar with yet.

And now because these APIs from these model providers are really going to be at the core of your application, you really want to make sure that you fully understand all the ins and outs and the capabilities of that particular API.

All right, and then following up on these model providers, we have the Instructor Library, which is currently my favorite way to work with these models and get structured output to build more reliable AI applications.

Now while some of these model providers, like OpenAI, are also integrating their own ways to get structured output from the API, Instructor still has some upsides.

For example, you have more complex data validation mechanisms, and I also like that it's model agnostic, which means that you can easily swap out different models.

So Instructor builds on top of Pydantic, which by now you know I'm a big fan of. What you can do is specify Pydantic models using the base model, and here, for example, you can specify a name and an age.

We want to say the name should be a string and the age should be an integer. We can then create a chat completion, in this case using OpenAI, and plug our model in as the response model.

Essentially we'll ask the LLM model to give us data back in a specific format. If it doesn't fit within that Pydantic model, Pydantic will throw an error, and we can perform a retry.

Now this really increases the reliability of your application, because you can basically be confident that the data we get from those LLMs is exactly how we specified it within our response models.

So again, reliability and validation are core concepts, and as an AI engineer, I would recommend always using structured output in your application. To me, there's just no reason not to use structured output.

Up next, we have all of the frameworks out there that you can use to build applications with large language models, the most popular one being LangChain, probably followed by LlamaIndex.

Now these are a bit controversial because you really have people that love them and people that don't really like them. I put them on the list because as an AI engineer, regardless of whether you like them, you should be familiar with them and at least have tried them.

They cover a lot of the core concepts that you as an AI engineer should be familiar with—combining different large language models, working with embeddings and vector databases, building applications, managing prompts—all of those core concepts are integrated into these frameworks for you to then use in a few lines of code.

What these frameworks do is they abstract away all of the core components that you need to build your application, which then makes it very easy to get up and running in a few lines of code.

But, of course, there's also going to be a trade-off with that, right? You're going to be building upon abstractions that other developers made and designed based on their application and how they see it.

This can introduce some problems. First, you might not fully understand what's going on behind the scenes. Second, there might be some things that are not implemented within the framework.

With that, it can get really messy. In the past, I've run into issues when using LangChain where I had to dig five layers deep into some classes to figure out why I couldn't implement something, and it got messy really quickly.

So for us right now, we don't use any of these frameworks; we build everything from scratch depending on the project we're working on, so we fully understand our complete project.

But again, I put these frameworks in here because, as an AI engineer, you should be familiar with them, because maybe some clients or teams that you're working with do use these tools, and you can get a lot of great inspiration from that, maybe even build entire applications.

That said, out of all of the companies and clients we've talked to, none are using these frameworks in production systems. So just so you know, that's my observation.

All right, and then next, let's talk about vector databases, which play a key role in most AI applications to store and retrieve the right context at the right time through a process called retrieval-augmented generation.

Now, while you can do more with Vector databases, this is probably right now the most common use case. There are a handful of options out there, so again, I'm going to provide you with some options, but you should at least be familiar with most of these, understand the pros and cons, and then dive into the one that makes the most sense for your use case.

So first off, we have Pinecone, which is a really popular one. Then we have Weaviate, and another one that we like to use is pgvectorscale, which is actually an extension for PostgreSQL that you can use to store vector embeddings and perform similarity searches straight inside PostgreSQL.

I did an entire video series on this. We like to use this because it simplifies our workflow, as we can just use one database. Otherwise, for example, if you use one of these other options, you probably also need another database to store your regular application data.

All of these vector databases have somewhat similar functionalities and Python SDKs that you can work with, so make sure to familiarize yourself with them and pick the one that best suits your project.

All right, and then up next, let's talk about observability and monitoring—again another category where there are multiple options that you can pick from, but you should at least be familiar with one of them.

These are going to play a crucial role in maintaining and debugging your application. These platforms, which are all accessible through Python libraries, let you track all of your LLM calls and keep track of key metadata; with Langfuse, for example, you can trace every call your application makes.

So what was the prompt? What was the data? What was the output? What was the latency? What was the cost? Everything that you want to know about your interactions with these large language models can be traced in these platforms.

Now they all work pretty similarly; of course, there are pros and cons. We like to use Langfuse, which is an open-source platform, and by the way, you can also self-host all of these. Another common one is LangSmith.

So multiple options, but really crucial to at least have one of them integrated with your LLM application to track everything.

All right, so by now we already covered a really complete stack to build event-driven AI applications and also make sure that they are reliable and robust.

Now let's get into the final three libraries that I want to cover that can help you with some more specialized tasks within your AI systems.

The first one is DSPy. Now this is definitely a library that I want to do more with. The whole paradigm they're introducing here is programming, not prompting.

DSPy is a library that allows you to iterate fast on building modular AI systems by optimizing prompts and weights. Instead of you, as the AI engineer and prompter, coming up with everything, DSPy offers a framework to start with basic prompts and then let the AI figure out over time what the best prompt is for the problem you're trying to solve.

I think this is a paradigm that, heading into the future—heading towards a more AI-integrated future—is going to be a core part because prompting right now, to me, still feels pretty random.

There are so many ways to tackle a problem. Is this the right approach? Can it be better? Can I do it with fewer tokens? All of these questions take up so much time when you're trying to do this manually.

If there is a good framework and setup that lets you iterate over this and essentially let AI figure out the best prompt, then that is going to be really helpful. DSPy is, I believe, the first one of its kind to do something like that, so definitely one to look into.

It's definitely more of an advanced library that can work really well when you're already a little bit deeper into a project, for example, and you want to increase the performance by optimizing your prompts.

All right, and then the second category in these final tools are ways to extract information from documents or PDFs. There are a couple of options here, and this typically requires some experimentation based on the type of data that you're working with.

One popular library for this is PyMuPDF, which allows you to extract information from PDFs. You can also try pdf2image. These libraries will have different results, again, depending on the data you're using.

We've also found that for some use cases, it works better to actually use a managed service like Amazon Textract or Azure Document Intelligence, for example, when you need a little bit more power.

Now, these are not open-source libraries that you can integrate directly into your project in pure Python; they are services that you use through an API, and there's also a cost associated with that. Still, there are so many great use cases out there for companies sitting on lots of unstructured data in the form of Word documents and PDFs.

These tools are really going to help you extract all of that information and then feed it to your AI system.

All right, and the final one on the list is Jinja. So Jinja is a templating engine for Python that you can use to programmatically fill in templates with data, and Jinja is really cool for building dynamic prompts.

This is something I think you'll see more AI engineers using in the future. There's an article from Jason Liu, creator of the Instructor library, explaining why he thinks using Jinja within Instructor is the right choice. It mainly has to do with Jinja's formatting capabilities, validation, versioning, and logging, and more specifically, like I said, creating dynamic prompts.

So within our projects, we actually like to store our prompts using Jinja templates, where we can introduce some of the logic that we just covered. We then have a simple prompt manager class to manage all of this and load it into our application.
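A small sketch of such a dynamic prompt (the template text and variable names are made up; in a project the template would live in its own file rather than inline):

```python
from jinja2 import Template

# Conditionals and loops let one template adapt to the data it's given
prompt_template = Template(
    "You are a helpful assistant.\n"
    "{% if examples %}Here are some examples:\n"
    "{% for ex in examples %}- {{ ex }}\n{% endfor %}{% endif %}"
    "Answer the question: {{ question }}"
)

prompt = prompt_template.render(
    question="What is RAG?",
    examples=["Q: What is an LLM? A: A large language model."],
)
print(prompt)
```

Rendering the same template without `examples` simply drops that section, which is what makes the prompt dynamic.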

All right, and those are all of the Python libraries that in my opinion every AI engineer should know right now.

Now throughout this video, you saw me walking through examples from this project, which is our Generative AI Launchpad. We made that entire project available as a repository, plus a course that's going to help AI engineers build and deploy generative AI applications faster.

So it has everything you need from all of the infrastructure, the components, all the way with the instructions to then deploy this on a server. If you're interested in this, make sure to check out the link in the description. You'll get the entire project that you see over here, plus a course on how to get started and start building your own applications.

All right, and that's it for this video. If you found it helpful, please leave a like, and also consider subscribing if you want to learn more about AI engineering. Then make sure to check out this video next. It's a video of me going over the entire process that we use inside Datalumina, my company, to find, build, and deliver generative AI projects.