Build production-ready generative AI applications for enterprise search using Haystack pipelines and Amazon SageMaker JumpStart with LLMs

This blog post is co-written with Tuana Çelik from deepset. 

Enterprise search is a critical component of organizational efficiency through document digitization and knowledge management. Enterprise search covers storing documents such as digital files, indexing the documents for search, and providing relevant results based on user queries. With the advent of large language models (LLMs), we can implement conversational experiences in providing the results to users. However, we need to ensure that the LLMs limit the responses to company data, thereby mitigating model hallucinations.

In this post, we showcase how to build an end-to-end generative AI application for enterprise search with Retrieval Augmented Generation (RAG) by using Haystack pipelines and the Falcon-40b-instruct model from Amazon SageMaker JumpStart and Amazon OpenSearch Service. The source code for the sample showcased in this post is available in the GitHub repository.

Solution overview

To restrict the generative AI application responses to company data only, we use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves the information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a response. LLMs limit the maximum length of the input prompt (measured in tokens), so choosing the right passages among thousands or millions of documents in the enterprise has a direct impact on the LLM’s accuracy.

The RAG technique has become increasingly important in enterprise search. In this post, we show a workflow that takes advantage of SageMaker JumpStart to deploy a Falcon-40b-instruct model and uses Haystack to design and run a retrieval augmented question answering pipeline. The final retrieval augmentation workflow covers the following high-level steps:

  1. The user query is passed to a retriever component, which runs a vector search to retrieve the most relevant context from our database.
  2. This context is embedded into a prompt that is designed to instruct an LLM to generate an answer only from the provided context.
  3. The LLM generates a response to the original query by only considering the context embedded into the prompt it received.
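The three steps above can be sketched in plain Python. This is a minimal illustration, not the code from the example repository: `retrieve` and `generate` are hypothetical stand-ins for the vector search and the LLM call, and the prompt wording is an assumption.

```python
from typing import Callable, List


def build_prompt(query: str, documents: List[str]) -> str:
    """Step 2: embed the retrieved context into an instruction prompt."""
    context = "\n".join(documents)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say 'I don't know'.\n"
        f"Context: {context}\n"
        f"Question: {query}\n"
        "Answer: "
    )


def answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    documents = retrieve(query, top_k)       # Step 1: fetch relevant context
    prompt = build_prompt(query, documents)  # Step 2: build the prompt
    return generate(prompt)                  # Step 3: let the LLM respond


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_retrieve(query: str, top_k: int) -> List[str]:
    corpus = ["OpenSearch is a search and analytics suite.", "Unrelated text."]
    return corpus[:top_k]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer("What is OpenSearch?", fake_retrieve, fake_generate, top_k=1))
```

Passing the retriever and the generator as callables keeps the flow independent of any particular vector database or model endpoint.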

SageMaker JumpStart

SageMaker JumpStart serves as a model hub encapsulating a broad array of deep learning models for text, vision, audio, and embedding use cases. With over 500 models, its model hub comprises both public and proprietary models from AWS’s partners such as AI21, Stability AI, Cohere, and LightOn. It also hosts foundation models solely developed by Amazon, such as AlexaTM. Some of the models offer capabilities for you to fine-tune them with your own data. SageMaker JumpStart also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning (ML) with SageMaker.


Haystack

Haystack is an open-source framework by deepset that allows developers to orchestrate LLM applications made up of different components like models, vector DBs, file converters, and countless other modules. Haystack provides pipelines and Agents, two powerful structures for designing LLM applications for various use cases including search, question answering, and conversational AI. With a big focus on state-of-the-art retrieval methods and solid evaluation metrics, it provides you with everything you need to ship a reliable, trustworthy application. You can serialize pipelines to YAML files, expose them via a REST API, and scale them flexibly with your workloads, making it easy to move your application from a prototype stage to production.

Amazon OpenSearch

OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud. OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license.

In recent years, ML techniques have become increasingly popular to enhance search. Among them is the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. A vector database provides efficient vector similarity search by providing specialized indexes like k-NN indexes.
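To make the idea of vector similarity concrete, here is a small brute-force nearest-neighbor search over toy three-dimensional vectors. It is only a sketch: the vectors and document IDs are made up, and a real k-NN index avoids comparing the query against every stored vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings; real ones come from an embedding model.
index = {
    "install-guide": [0.9, 0.1, 0.0],
    "security-faq": [0.0, 0.2, 0.9],
    "cli-reference": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

for doc_id, score in nearest_neighbors(query_vector, index, k=2):
    print(doc_id, round(score, 3))
```

With these toy values, the two documents pointing in roughly the same direction as the query rank ahead of the unrelated one.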

With the vector database capabilities of OpenSearch Service, you can implement semantic search, RAG with LLMs, recommendation engines, and search rich media. In this post, we use RAG to enable us to complement generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

Application overview

The following diagram depicts the structure of the final application.

In this application, we use the Haystack Indexing Pipeline to process and index uploaded documents, and the Haystack Query Pipeline to perform knowledge retrieval from the indexed documents.

The Haystack Indexing Pipeline includes the following high-level steps:

  1. Upload a document.
  2. Initialize DocumentStore and index documents.

We use OpenSearch as our DocumentStore and a Haystack indexing pipeline to preprocess and index our files to OpenSearch. Haystack FileConverters and PreProcessor allow you to clean and prepare your raw files to be in a shape and format that your natural language processing (NLP) pipeline and language model of choice can deal with. The indexing pipeline we’ve used here also uses sentence-transformers/all-MiniLM-L12-v2 to create embeddings for each document, which we use for efficient retrieval.
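As an illustration of the kind of preparation a preprocessing step performs, the following sketch splits a long text into overlapping word windows so that each chunk is small enough to embed and retrieve on its own. It is plain Python, not Haystack code; the parameter names `split_length` and `split_overlap` are chosen for this example and the default values are assumptions.

```python
from typing import List


def split_into_chunks(
    text: str, split_length: int = 100, split_overlap: int = 10
) -> List[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words so that a sentence cut at a
    chunk boundary still appears with some of its surrounding context.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_into_chunks(sample, split_length=4, split_overlap=1):
    print(chunk)
```

Each resulting chunk would then be embedded and written to the DocumentStore as its own document.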

The Haystack Query Pipeline includes the following high-level steps:

  1. We send a query to the RAG pipeline.
  2. An EmbeddingRetriever component acts as a filter that retrieves the most relevant top_k documents from our indexed documents in OpenSearch. We use our choice of embedding model to embed both the query and the documents (at indexing) to achieve this.
  3. The retrieved documents are embedded into our prompt to the Falcon-40b-instruct model.
  4. The LLM returns with a response that is based on the retrieved documents.
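The query pipeline steps above might be wired together roughly as follows. Treat this as a sketch under assumptions rather than the repository's exact code: it assumes a farm-haystack 1.x release with SageMaker support in PromptNode, the environment variable names used later in this post, an HTTPS endpoint on port 443, and a hypothetical `admin` user name. The node names are arbitrary.

```python
import os


def build_rag_pipeline(top_k: int = 5):
    """Build a retriever + PromptNode pipeline (sketch, see assumptions above)."""
    # Imported lazily so this module can be loaded without farm-haystack.
    from haystack import Pipeline
    from haystack.document_stores import OpenSearchDocumentStore
    from haystack.nodes import (
        AnswerParser,
        EmbeddingRetriever,
        PromptNode,
        PromptTemplate,
    )

    document_store = OpenSearchDocumentStore(
        host=os.environ["OPENSEARCH_HOST"],
        port=443,              # assumption: the domain is reached over HTTPS
        username="admin",      # assumption: hypothetical user name
        password=os.environ["OPENSEARCH_PASSWORD"],
        embedding_dim=384,     # output size of all-MiniLM-L12-v2
    )
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model="sentence-transformers/all-MiniLM-L12-v2",
        top_k=top_k,
    )
    template = PromptTemplate(
        prompt="Given the context please answer the question. "
        "If the answer is not contained within the context below, "
        "say 'I don't know'.\n"
        "Context: {join(documents)};\n Question: {query};\n Answer: ",
        output_parser=AnswerParser(reference_pattern=r"Document\[(\d+)\]"),
    )
    prompt_node = PromptNode(
        model_name_or_path=os.environ["SAGEMAKER_MODEL_ENDPOINT"],
        default_prompt_template=template,
        model_kwargs={
            "aws_profile_name": os.environ["AWS_PROFILE_NAME"],
            "aws_region_name": os.environ["AWS_REGION_NAME"],
        },
    )

    pipeline = Pipeline()
    pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
    pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
    return pipeline
```

With the environment variables set and the services running, `build_rag_pipeline().run(query="How can I install the OpenSearch cli?")` would send the query through both nodes.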

For model deployment, we use SageMaker JumpStart, which simplifies deploying models through a simple push of a button. Although we’ve used and tested Falcon-40b-instruct for this example, you may use any Hugging Face model available on SageMaker.

The final solution is available on the haystack-sagemaker repository and uses the OpenSearch website and documentation (for OpenSearch 2.7) as our example data to perform retrieval augmented question answering on.


Prerequisites

Before you can use any AWS services, make sure you have signed up for and created an AWS account. Then create an administrative user and group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

To use Haystack, you have to install the farm-haystack package with the required dependencies. To accomplish this, use the requirements.txt file in the GitHub repository by running pip install -r requirements.txt.

Index documents to OpenSearch

Haystack offers a number of connectors to databases, which are called DocumentStores. For this RAG workflow, we use the OpenSearchDocumentStore. The example repository includes an indexing pipeline and AWS CloudFormation template to set up an OpenSearchDocumentStore with documents crawled from the OpenSearch website and documentation pages.

Often, to get an NLP application working for production use cases, we end up having to think about data preparation and cleaning. This is covered by Haystack indexing pipelines, which allow you to design your own data preparation steps that ultimately write your documents to the database of your choice.

An indexing pipeline may also include a step to create embeddings for your documents. This is highly important for the retrieval step. In our example, we use sentence-transformers/all-MiniLM-L12-v2 as our embedding model. This model is used to create embeddings for all our indexed documents, but also the user’s query at query time.

To index documents into the OpenSearchDocumentStore, we provide two options with detailed instructions in the README of the example repository. Here, we walk through the steps for indexing to an OpenSearch service deployed on AWS.

Start an OpenSearch service

Use the provided CloudFormation template to set up an OpenSearch service on AWS. By running the following command, you’ll have an empty OpenSearch service. You can then either choose to index the example data we’ve provided or use your own data, which you can clean and preprocess using the Haystack Indexing Pipeline. Note that this creates an instance that is open to the internet, which is not recommended for production use.

aws cloudformation create-stack --stack-name HaystackOpensearch --template-body file://cloudformation/opensearch-index.yaml --parameters ParameterKey=InstanceType,ParameterValue=your_instance_type ParameterKey=InstanceCount,ParameterValue=3 ParameterKey=OSPassword,ParameterValue=Password123!

Allow approximately 30 minutes for the stack launch to complete. You can check its progress on the AWS CloudFormation console by navigating to the Stacks page and looking for the stack named HaystackOpensearch.

Index documents into OpenSearch

Now that we have a running OpenSearch service, we can use the OpenSearchDocumentStore class to connect to it and write our documents to it.

To get the hostname for OpenSearch, run the following command:

aws cloudformation describe-stacks --stack-name HaystackOpensearch --query "Stacks[0].Outputs[?OutputKey=='OpenSearchEndpoint'].OutputValue" --output text
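If you prefer to read the endpoint from Python, the same lookup can be sketched with boto3. The helper names are hypothetical; `describe_stacks` and the shape of its response are standard boto3, and the stack and output names are the ones used in the command above.

```python
from typing import Any, Dict, Optional


def find_stack_output(response: Dict[str, Any], output_key: str) -> Optional[str]:
    """Return the value of one output of the first stack in a describe_stacks response."""
    for stack in response.get("Stacks", [])[:1]:
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


def get_opensearch_endpoint(stack_name: str = "HaystackOpensearch") -> Optional[str]:
    """Query CloudFormation for the endpoint (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("cloudformation")
    response = client.describe_stacks(StackName=stack_name)
    return find_stack_output(response, "OpenSearchEndpoint")


# Hand-written sample response to show the expected shape.
sample_response = {
    "Stacks": [
        {"Outputs": [{"OutputKey": "OpenSearchEndpoint", "OutputValue": "example-host"}]}
    ]
}
print(find_stack_output(sample_response, "OpenSearchEndpoint"))
```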

First, export the following:

export OPENSEARCH_HOST='your_opensearch_host'
export OPENSEARCH_PASSWORD=Password123!
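A script that relies on these variables can fail early with a clear message when one of them is missing. The helper below is a hypothetical example of such a check, not part of the example repository.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARIABLES = ("OPENSEARCH_HOST", "OPENSEARCH_PASSWORD")


def load_opensearch_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the connection settings, raising if a variable is unset or empty."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["OPENSEARCH_HOST"],
        "password": env["OPENSEARCH_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment.
print(load_opensearch_settings({"OPENSEARCH_HOST": "h", "OPENSEARCH_PASSWORD": "p"}))
```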

Then, you can use the script to preprocess and index the provided demo data.

If you would like to use your own data, modify the indexing pipeline in that script to include the FileConverter and PreProcessor setup steps you require.

Implement the retrieval augmented question answering pipeline

Now that we have indexed data in OpenSearch, we can perform question answering on these documents. For this RAG pipeline, we use the Falcon-40b-instruct model that we’ve deployed on SageMaker JumpStart.

You also have the option of deploying the model programmatically from a Jupyter notebook. For instructions, refer to the GitHub repo.

  1. Search for the Falcon-40b-instruct model on SageMaker JumpStart.
  2. Deploy your model on SageMaker JumpStart, and take note of the endpoint name.
  3. Export the following values:
    export SAGEMAKER_MODEL_ENDPOINT=your_falcon_40b_instruct_endpoint
    export AWS_PROFILE_NAME=your_aws_profile
    export AWS_REGION_NAME=your_aws_region
  4. Run python

This will start a command line utility that waits for a user’s question. For example, let’s ask “How can I install the OpenSearch cli?”
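Behind the scenes, each question ends up as a request to the SageMaker endpoint. Haystack handles this for you, but the sketch below shows what a direct call could look like. It assumes the endpoint accepts the common Hugging Face text generation format (an `inputs` string plus `parameters`) and returns a list containing a `generated_text` field; check the model's documentation for the exact schema. The function names and the `max_new_tokens` default are hypothetical.

```python
import json


def build_request_body(prompt: str, max_new_tokens: int = 200) -> bytes:
    """Serialize a prompt into the assumed JSON request format."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(payload).encode("utf-8")


def parse_response_body(body: bytes) -> str:
    """Extract the generated text from the assumed JSON response format."""
    result = json.loads(body)
    return result[0]["generated_text"]


def query_endpoint(endpoint_name: str, prompt: str, region_name: str) -> str:
    """Send one prompt to the endpoint (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    runtime = boto3.client("sagemaker-runtime", region_name=region_name)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request_body(prompt),
    )
    return parse_response_body(response["Body"].read())


# Round trip with a hand-written response, no endpoint required.
print(json.loads(build_request_body("Hello")))
print(parse_response_body(b'[{"generated_text": "Hi there"}]'))
```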

This result is achieved because we have defined our prompt in the Haystack PromptTemplate to be the following:

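from haystack.nodes import AnswerParser, PromptTemplate

# Instruct the model to answer only from the retrieved context.
question_answering = PromptTemplate(
    prompt="Given the context please answer the question. "
    "If the answer is not contained within the context below, say 'I don't know'.\n"
    "Context: {join(documents)};\n Question: {query};\n Answer: ",
    # Extract references of the form Document[n] from the generated answer.
    output_parser=AnswerParser(reference_pattern=r"Document\[(\d+)\]"),
)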

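To see what this template does at query time, the following plain-Python sketch mimics the two moving parts: filling the placeholders and applying the reference pattern to a model answer. It is an approximation for illustration, not Haystack's implementation; in particular, joining documents with a single space is an assumption about the default behavior of `join`.

```python
import re
from typing import List

REFERENCE_PATTERN = r"Document\[(\d+)\]"


def render_prompt(query: str, documents: List[str]) -> str:
    """Fill the template placeholders with the query and the retrieved documents."""
    context = " ".join(documents)  # assumption: documents are joined with a space
    return (
        "Given the context please answer the question. If the answer is not "
        "contained within the context below, say 'I don't know'.\n"
        f"Context: {context};\n Question: {query};\n Answer: "
    )


def extract_references(answer: str) -> List[int]:
    """Return the document numbers cited in an answer as Document[n]."""
    return [int(number) for number in re.findall(REFERENCE_PATTERN, answer)]


print(render_prompt("What is OpenSearch?", ["First passage.", "Second passage."]))
print(extract_references("See Document[2] and Document[10]."))
```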
Further customizations

You can make additional customizations to different elements in the solution, such as the following:

  • The data – We’ve provided the OpenSearch documentation and website data as example data. Remember to modify the script to fit your needs if you choose to use your own data.
  • The model – In this example, we’ve used the Falcon-40b-instruct model. You are free to deploy and use any other Hugging Face model on SageMaker. Note that changing a model will likely mean you should adapt your prompt to something it’s designed to handle.
  • The prompt – For this post, we created our own PromptTemplate that instructs the model to answer questions based on the provided context and answer “I don’t know” if the context doesn’t include relevant information. You may change this prompt to experiment with different prompts with Falcon-40b-instruct. You can also simply pull some of our prompts from the PromptHub.
  • The embedding model – For the retrieval step, we use a lightweight embedding model: sentence-transformers/all-MiniLM-L12-v2. However, you may also change this to your needs. Remember to modify the expected embedding dimensions in your DocumentStore accordingly.
  • The number of retrieved documents – You may also choose to play around with the number of documents you ask the EmbeddingRetriever to retrieve for each query. In our setup, this is set to top_k=5. You may experiment with changing this figure to see if providing more context improves the accuracy of your results.
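When you swap the embedding model, a quick sanity check can catch a mismatch between the vectors the model produces and the dimension configured in the DocumentStore. The helper below is a hypothetical example; 384 is the output size of sentence-transformers/all-MiniLM-L12-v2.

```python
from typing import List, Sequence

# Output dimension of the embedding model used in this post.
EXPECTED_DIM = 384


def find_dimension_mismatches(
    vectors: Sequence[Sequence[float]], expected_dim: int = EXPECTED_DIM
) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [
        position
        for position, vector in enumerate(vectors)
        if len(vector) != expected_dim
    ]


# Two well-formed vectors and one produced by a (hypothetical) larger model.
vectors = [[0.0] * 384, [0.1] * 384, [0.2] * 768]
print(find_dimension_mismatches(vectors))
```

An empty result means every vector matches the configured dimension.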

Production readiness

The proposed solution in this post can accelerate the time to value of the project development process. You can build a project that is easy to scale within the security and privacy environment of the AWS Cloud.

For security and privacy, OpenSearch Service provides data protection with identity and access management and cross-service confused deputy prevention. You may employ fine-grained user access control so that users can only access the data they are authorized to access. Additionally, SageMaker provides configurable security settings for access control, data protection, and logging and monitoring. You can protect your data at rest and in transit with AWS Key Management Service (AWS KMS) keys. You can also track the log of SageMaker model deployment or endpoint access using Amazon CloudWatch. For more information, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

To scale OpenSearch Service, size your OpenSearch Service domains appropriately and follow operational best practices. You can also take advantage of auto scaling for your SageMaker endpoint: SageMaker can automatically scale models to adjust the endpoint both when traffic increases and when resources are not being used.
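As a sketch of what endpoint auto scaling can look like with boto3, the following registers an endpoint variant as a scalable target and attaches a target tracking policy through Application Auto Scaling. The variant name `AllTraffic`, the capacity limits, the target value, and the policy name are assumptions for illustration; tune them for your workload.

```python
from typing import Any, Dict


def resource_id(endpoint_name: str, variant_name: str = "AllTraffic") -> str:
    """Resource identifier of an endpoint variant (variant name is an assumption)."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_args(
    endpoint_name: str, min_capacity: int = 1, max_capacity: int = 2
) -> Dict[str, Any]:
    """Arguments for register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id(endpoint_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def scaling_policy_args(
    endpoint_name: str, invocations_per_instance: float = 10.0
) -> Dict[str, Any]:
    """Arguments for put_scaling_policy with a target tracking policy."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id(endpoint_name),
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }


def apply_autoscaling(endpoint_name: str) -> None:
    """Apply both settings (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**scalable_target_args(endpoint_name))
    client.put_scaling_policy(**scaling_policy_args(endpoint_name))


print(scalable_target_args("my-endpoint"))
```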

Clean up

To save costs, delete all the resources you deployed as part of this post. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Similarly, you can delete any SageMaker endpoints you may have created via the SageMaker console.
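If you prefer to script the cleanup, a minimal boto3 sketch could look like the following. The function name is hypothetical, and the clients are passed in so the logic can be exercised without touching real resources; `delete_stack` and `delete_endpoint` are the standard boto3 calls.

```python
def clean_up(cloudformation_client, sagemaker_client, stack_name: str, endpoint_name: str) -> None:
    """Delete the CloudFormation stack and the SageMaker endpoint."""
    cloudformation_client.delete_stack(StackName=stack_name)
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)


def clean_up_with_boto3(stack_name: str, endpoint_name: str) -> None:
    """Run the cleanup against your account (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so clean_up can be tested without boto3

    clean_up(
        boto3.client("cloudformation"),
        boto3.client("sagemaker"),
        stack_name,
        endpoint_name,
    )
```

For example, `clean_up_with_boto3("HaystackOpensearch", "your_endpoint_name")` would remove the stack created earlier and the named endpoint. Deleting an endpoint does not remove its endpoint configuration or model, so you may want to delete those as well.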


Conclusion

In this post, we showcased how to build an end-to-end generative AI application for enterprise search with RAG by using Haystack pipelines and the Falcon-40b-instruct model from SageMaker JumpStart and OpenSearch Service. The RAG approach is critical in enterprise search because it ensures that the responses generated are in-domain and therefore mitigates hallucinations. By using Haystack pipelines, we are able to orchestrate LLM applications made up of different components like models and vector databases. SageMaker JumpStart provides us with a one-click solution for deploying LLMs, and we used OpenSearch Service as the vector database for our indexed data. You can start experimenting and building RAG proofs of concept for your enterprise generative AI applications, using the steps outlined in this post and the source code available in the GitHub repository.

About the Authors

Tuana Çelik is the Lead Developer Advocate at deepset, where she focuses on the open-source community for Haystack. She leads the developer relations function and regularly speaks at events about NLP and creates learning materials for the community.

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers—from small startups to large enterprises—train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Mia Chang is an ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Inaam Syed is a Startup Solutions Architect at AWS, with a strong focus on assisting B2B and SaaS startups in scaling and achieving growth. He possesses a deep passion for serverless architectures and AI/ML. In his leisure time, Inaam enjoys quality moments with his family and indulges in his love for biking and badminton.

David Tippett is the Senior Developer Advocate working on open-source OpenSearch at AWS. His work involves all areas of OpenSearch from search and relevance to observability and security analytics.
