From Fantasy to Fact: The Secret Weapon that Crushes AI Hallucinations

Topic: From Fantasy to Fact: The Secret Weapon that Crushes AI Hallucinations

Presenter: Coach Cindy, Director of Data Science, Machine Learning and Frontend Engineering

Description

Join us on Sunday, 3/16/2025 for an enlightening session titled “From Fantasy to Fact: The Secret Weapon that Crushes AI Hallucinations.” This engaging workshop will be presented by Coach Cindy, a leading expert in the field of Data Science, Machine Learning and Frontend Engineering. Coach Cindy will take you on a deep dive into the fascinating world of Artificial Intelligence, debunking common myths and misconceptions, and revealing the secret weapon that turns AI fantasies into quantifiable realities.

This workshop is designed to give attendees a nuanced understanding of AI’s capabilities and limitations, and with it a pragmatic perspective on applying AI. It will offer insights into the inner workings of AI, enabling attendees to sift fact from fiction and make informed decisions in their professional work. Whether you’re an AI enthusiast, a seasoned professional, or simply curious about the future of technology, this event promises knowledge that could change your understanding of and approach to AI. Discover the secret weapon that can help you use AI effectively and responsibly. Don’t miss this chance to sharpen your technical skills and gain a competitive edge in the ever-evolving digital landscape.

Video Transcript

AI support agents can significantly improve customer communication for companies. However, they can also cause major issues: when they lack answers to customer questions, they may hallucinate, generating incorrect responses and suggesting non-existent products and discounts. In this presentation, Coach Cindy, an engineering leader with extensive experience in machine learning, introduces retrieval-augmented generation, a technology that reduces AI hallucination by enabling AI support agents to access up-to-date information. Don't forget to subscribe to our channel for the latest updates and insights on AI advancements and learning opportunities.

Tonight's topic is "From Fantasy to Fact." When we use a large language model for personal purposes, it may make some errors that do not really matter. But when you use it at the enterprise level, serving your customers through a large language model, a low-confidence or wrong answer really matters. So today we're talking about a situation that we want to solve, and the mechanism to solve it is called retrieval-augmented generation, abbreviated as RAG. Today's audience is a combination of English speakers and Chinese speakers, so for certain important aspects or terminology I might repeat the explanation in Chinese so that people can follow without any language barrier.

Let's start with a story written by a large language model. It says that in the mysterious land of Eloria, in a forest, there is a prophecy that the rise of a hero will restore the lost Crystal of Eternity. There is a dragon with shimmering scales circling the peak of a mountain, and the hero has a sword and is journeying through the enchanted glades and battling [inaudible]. On the right side, it
is a picture generated by AI for this story. So this is a fantasy story: it created a fictional place and a fictional plot, a fictional animal, and a fictional encounter between the hero of the story and a dragon.

After that, I asked my large language model, in this case ChatGPT, a question. I said: I work for this company, Eloria (this is the land of Eloria that it mentioned just now), and my job is to hunt for ghosts and collect them. May I work remotely from Paris? "May I work remotely from Paris" is a very realistic question that a regular employee in a regular company might ask. But given the fictional background, with Eloria as the mysterious land and the job being to collect ghosts, I was really curious what ChatGPT was going to tell me.

After thinking, ChatGPT said: well, it depends on your job responsibilities, on your company's policy, on whether there are any legal or logistical considerations, on how you communicate with your management, how you collaborate with your co-workers, et cetera. So clearly ChatGPT was able to understand that working remotely from Paris is the essential request of my question, but it was not able to recognize that the land of Eloria and hunting ghosts is not exactly a traditional job situation. It did mention that, since the job is presumably in a fictional land, Eloria, remote work might be challenging, so it recognized the fictional part of the context. But the way it answers still really misses the point: it does not understand the humorous part of the question and is not able to answer it in a fictional way. In other words, ChatGPT
is really making up all these job responsibilities, company policies, legal considerations, et cetera, as part of what it understands an answer to a general question should look like in this context. This is what we call hallucination. When things do not really exist, our large language models (in this case ChatGPT; in other cases it could be Meta's Llama, Anthropic's Claude, DeepSeek, or another model) will answer questions in a very similar way, that is, by creating things that do not exist in the real world.

So let's look at how hallucinations happen. Hallucination has existed in ChatGPT probably since it was launched back in 2022, and despite significant efforts to fix the problem, it still exists more or less. Many things have already been done to try to fix it, but it is still happening a lot.

One reason is called domain shift. When a large language model is trained, it uses data that exists in the real world: the data that human beings have created over many years of our civilization. It has the books that we have published, the subtitles of the TV shows and movies that we have produced, natural conversations such as Reddit threads, and some of the email or live chat information that we are able to access legally. With all this training information, a large language model does not really have a way to tell which part of the data it is trained on is more credible than other pieces of the data. In such a situation, when it is being
trained with conflicting data, or data of different importance, it does not really distinguish which data is more reliable than others. Therefore, when you ask questions, it does not necessarily tell you the right answer. An easy example: if you ask what the lowest age is at which you can get a driver's license, or the legal age at which you can get married, then because different countries and different states have different answers, the answer from these models will probably be very vague.

What we saw just now is a case of domain shift: ChatGPT knows about this fictional land of Eloria, and when I tell it that I am actually interested in whether I can work remotely from Paris, it is not able to tell whether Eloria is a place where I can work or a place where no regular work is conducted.

The second reason hallucinations happen is called task shift. As a model, I have already been trained on this data and I'm ready to answer certain questions. But when you ask me to perform a certain task, I'm not sure exactly what you want me to do, and because I assume you might want me to do certain things, the results might not be what you were looking for. A couple of years ago, when Google released their PaLM large language model (P-a-L-M), there was a case that became pretty famous. People asked, "Tell me about the elephant on the moon." The large language model answered: well, the elephant on the moon, her name is Luna, she is pink, she's very cute. And it kept going and told a whole story about the elephant Luna living on the moon and what her life is like. That's why I created this
picture with ChatGPT: a really, really cute little pink elephant. The reason is that when you ask in a scientific manner, "tell me about the elephant on the moon," you expect the model to state the scientific facts first: on the moon there is no elephant living there, no water, no air, et cetera. But the model might be assuming that you're asking for a children's bedtime story. If a five-year-old asks her mother, "Mom, tell me about the elephant on the moon," and the mother answers that there is an elephant named Luna and she's pink, it might be a perfect answer. So this is the second scenario of how hallucinations happen: you have a task in mind, but it is not necessarily conveyed with a clarity that the large language model can understand, so it might understand the task as something else and answer accordingly, giving you an answer that you do not expect.

The third situation is about data restrictions. Just now we mentioned that large language models are trained on the data that we have access to, but there is still a ton of private data that people do not have access to. For example, JPMorgan is a company that has hundreds of years of customer data. If you ask a general large language model, which does not have access to that private data, "What kind of customers have a higher risk of defaulting on a loan?", the general foundation model might give you an answer that is not very precise. Whereas, hypothetically, if JPMorgan has trained a separate private model with the data that they have access to, they might give you a very good answer
regarding what kind of customer would have a higher rate of default on their loans. So those are three reasons why hallucinations might happen.

Now, is there a way to harness the large language model so that it answers when it is sure about what it is talking about, and when it does not know the domain or the task, it just says that it does not know, so that you can earn trust in enterprise usage? In other words, we had a probabilistic answer; let's move it all the way toward a deterministic answer.

Suppose there is an enterprise with millions of customers, exchanging millions of emails on a daily basis, conducting millions of transactions on a daily basis, submitting requests and needing customer support every day. What requirements do such enterprises have in common?

The first is that the answer an enterprise is looking for has to be accurate. Every company has its internal rules and policies. If you ask, "Can I return this item?", it is a yes-or-no answer according to the return policy, so you want to be able to answer that correctly. "How much refund am I going to get?" does not call for a probabilistic answer of the kind many large language models give. You're going to get a refund of $37.95: it's a very precise number that you want to be sure of.

Then there is security. Every enterprise has certain customer data or private customer activity data that it wants to protect. For example, Facebook, or Meta as a company, has billions of customers. The information their customers post online becomes public, but their customer interaction data and customer historical data
are private data that they want to protect. Every company has certain data that it wants to protect, and that data probably also becomes a very important asset of the company that it does not want to share with anyone else.

The third requirement is that an enterprise needs to be reliable. If I tell you that you can return this item, the result needs to be repeatable if you ask another customer support representative. If I tell you that I'm going to refund you $23 and you ask another support representative, you're going to be refunded exactly the same amount. The results need to be repeatable.

Fourth, things need to be cost-reasonable. You could say, "Why don't you just retrain your model with all your private data and private transactions on a daily basis, and then your model will always give the best results?" It does not work that way, because it is extremely expensive to train the model. It can cost hundreds of millions of dollars to build a cluster that is able to train a large language model, and the model can take many months to train, or at least many weeks. So you cannot train fast enough to answer today based on the data that you collected today.

So how do we handle this situation? The architecture for handling this is called retrieval-augmented generation. It comes from a paper; I put the link here, so if you are interested, feel free to read it. The idea was proposed in 2020, and it is described by what is called the response, query, and context triangle. When a customer asks the model something, that is a query. Then we are going to say that,
hey, there is the retrieved context. There will be certain things that we build into the system as context, which we're going to talk about in detail later. So we ask: is what you're asking related to the context that we have? And then, based on the context, we ask: is the response that you're going to generate supported by the context? This is called groundedness. Only when your response can be grounded in the context am I going to reply with an answer relevant to your query, and then we keep going for the next query, and so on.

So you can see that this architecture changes the large language model from "you ask me something and I respond with whatever I have" to an architecture with an additional important piece called context. If you ask about things that are not related to my business, fine, I do not have many restrictions on how I answer. But if you ask anything regarding my business, I want to make sure that everything I answer is grounded, build this groundedness into the responses, and provide them back to the customer.

The next question is: sounds like a great idea, so how do we build it? Let's look at this RAG pipeline. You have a bunch of documents. These can be considered the company's proprietary documents, its private data. They can be internal policies such as the return policy, or internal company documents such as HR documents or the code of conduct for employees. They can be information about the purchase return rate that you do not
necessarily want to share with other people. They can be records of past transactions, or the historical data of a customer, et cetera. With all these private documents, instead of training the model again with your private data, you want to split each document into chunks. We will explain chunks later, but for now you can think of them as small pieces of the document. Then you need to turn these chunks into embeddings. Embedding is an idea commonly used in machine learning. A machine learning model does not understand documents as such, whether human language, pictures, audio, or video. Everything needs to be converted into float numbers that the model can work with, and these are called embeddings. There are many ways to generate embeddings, but that is beyond today's conversation. So you have these documents, you process them so that the models can understand them, you put them in a database, and then you index the database. This is called the ingestion pipeline. Remember that this is where you create the context: you ingest the company information through this ingestion pipeline.

Then there is the retrieval process. When you create responses, you generate them from the context that has already been created. You have a query, you look in this indexed database, and you find your best results, called the top K (we're going to talk about top K later). Then you give all of this to a large language model, saying: this is the context that you have, and this is the original query that was asked; based on these two pieces of information,
generate a response for me. So in the RAG pipeline there are a few things that we're going to talk about one by one in our later presentations. Number one, you take your original documentation and split it. Number two, you convert it to numbers. Number three, you store it in the database for effective search. Number four, you combine the context with the original query to generate a response. That's how a RAG pipeline works.

Let's talk about representing the documentation in a vector space. When you have data, you split it into many pieces. The first question is: why don't you just keep one piece, put it in a vector database, and be done? Why even bother? The reason is search. If you retrieve the entire document, you get a lot of noise when answering. Think about it: when you interact with a large language model and ask a particular question, you expect an answer to what you asked. If the model just says, "Here is a 700-page document; the answer you're looking for is in there, go read it," that is not very helpful. That is why we chunk: so that you retrieve only the most relevant pieces of the information and use only those to answer the question.

Now the question becomes: how many pieces should I chunk it into? I have a 700-page document. Should I chunk it into 700 chunks, or 1,400, or 35? What is my magic number? The quick answer is that it depends. It depends on a few trade-offs, which we'll talk about later as well, but for now you can understand it this way: if you split it into too many chunks, then the
retrieval process will be slow. Say a 700-page document has altogether, I don't know, one million words. If you decide to make every single word a chunk, then every time you search you need to search through one million items, which slows down your search. It also has another problem: when you split things too finely, each word no longer tells you what is in the document. When you split by sentence, you roughly know what the sentence is talking about. But if you split by phrase or by word, you understand what that phrase or word means, but not what the document is talking about. Because all documents are composed of words, the result of splitting this document and splitting another 500-page document would hardly differ: you get words that are individually independent of each other and do not provide context for each other. That is one extreme. The other extreme, as we said, is that I do not split at all: I just hand you the entire document, I know that what you're looking for is in there, and you figure it out yourself. That is not very good customer service either. So we want to pick a number that makes sense.

Let's look at two examples of chunking. This is Abraham Lincoln's Gettysburg Address. It's a very short speech, so we can split it in different ways. Let's look at the left side first. You can see that the number of letters, or tokens as they are called in machine learning, in the first chunk is the same as in the second chunk and the same as in the third chunk.
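
As a rough illustration of this kind of fixed-size split (this is not code from the talk), here is a minimal Python sketch. It assumes that whitespace-separated words stand in for tokens; the function name `chunk_by_tokens`, the default of 20 tokens per chunk, and the optional `overlap` parameter are all illustrative choices, and a real pipeline would normally count tokens with the embedding model's own tokenizer.

```python
def chunk_by_tokens(text: str, chunk_size: int = 20, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens.

    `overlap` is how many tokens each chunk repeats from the end of the
    previous chunk; 0 gives disjoint, fixed-size chunks.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()          # assumption: one word is one token
    step = chunk_size - overlap    # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                  # the window already reached the end
    return chunks


# Opening sentence of the Gettysburg Address, used as sample input.
speech = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)

for number, chunk in enumerate(chunk_by_tokens(speech, chunk_size=20), start=1):
    print(f"chunk {number}: {chunk}")
```

With `overlap=0` every chunk except possibly the last has the same number of tokens, but a boundary can fall in the middle of a sentence and neighboring chunks share no words. Passing a positive `overlap` makes each chunk repeat that many tokens from the one before it.
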
So basically we pre-select a number, a magic number that we like, say 20 tokens per chunk, and we split the document this way: chunk one, chunk two, chunk three. When you ask me a question, I pull out the chunk that is related to what you're asking and answer from it. What is the problem with this way of splitting? First, you split in the middle of a sentence. You understand that 87 years ago our forefathers founded a new nation on this continent, but what is this new nation about? This chunk does not tell you. You are getting cut-off information, and you need to find the next piece that is more related to this new nation; it could be chunk two or it could be something else. So you suffer from incomplete sentences. That's one problem.

The other problem: look at chunk one and chunk two. They have no overlap, which means that in a vector database (we're going to talk about the mechanisms a little later) the first chunk and the second chunk really have nothing to do with each other, so they might be stored in different places. Retrieving the first piece does not make it more likely that you retrieve the second piece. Those are the challenges of chunking your documentation the first way.

Now let's look at the second way. Here the number of tokens in each chunk is a little larger than before, but that is not the point. The real difference is that chunk one, this blue box, and chunk two,
this red box, and chunk three, this orange box, have a decent amount of overlap. That means the information in chunk one is related to the information in chunk two, which is related to the information in chunk three. It also means that when you store them in a vector database, because one and two are somehow related, two will be stored closer to one than in the first approach, and so on going forward. So this is a more pragmatic way of chunking. Number one, you have a window and collect as much information from the original document as possible. Number two, you have a certain overlap, so that an incomplete sentence in chunk one is completed in chunk two, because they are adjacent chunks. Number three, because of the overlap, the chunks are stored closer together in the vector database. Those are a few considerations about chunking. Both are valid ways of doing it: the left side is a valid way and the right side is a valid way, but the right side is probably the more popular approach and will probably also give you better search results.

Just now we talked about embeddings, so let's go back to this slide. First you have the documentation, you split it into chunks, and then you need to encode the chunks into embeddings, which are float numbers. So next we'll talk about embeddings. "Embedding" is a slightly scary word if you have never worked with machine learning, but what it really does is a transformation. You start with a sentence, a phrase, or a token, you apply certain transformations to encode it, and the result of the embedding is a vector of float numbers. Then you're going to
use this vector of float numbers for calculation, and when you're done you should be able to decode the whole thing back to the original output. You can think of creating embeddings like creating a zip file, compressing a file. You start with the original file and compress it; that is the encoder. The process generates intermediate results that, as a human being, you do not really care about, and the model uses this intermediate information for calculation. When the calculation is done, it returns something, and you decompress your zip file with a decoder back into its original format.

Here is an example of the encoding process. First you have a text that says something. What it says does not matter, and how long it is does not matter. You run it through an embedding model and it is turned into a vector. Keep in mind this vector consists of float numbers. It could have 1,000 dimensions or 10,000 dimensions, depending on how your embedding model is defined. Then the large language model uses this for calculation. After it's done, there is a reverse process: if you reverse these arrows, going through the same embedding model in the other direction, it gives you back the original text.

Now the multimodal situation. When we talk about multimodal, it means you can have text, images, audio, or video, and everything can be encoded by an embedding model. It can be a different embedding model for each kind of input, or it can be one
embedding model that can encode everything; it does not matter. The idea is that you start with something that you want to process, you end up with a vector that a machine learning model can understand, and when the machine learning model is done, the result is returned to its original format.

Let's go back to this slide again. We get embeddings, and the embeddings are stored in a database. This database is called a vector database. Let's look at it. Here are the vectors: this is a database holding all the vectors. Keep in mind these are vectors of float numbers. And these are your original data, mapped to those vectors. You might say: since the original data is here, why do you even need the vectors? You can just search the data here in the blue box, and when you get what you need, convert it to a vector. Isn't that enough? Good idea, except that the search mechanism in a vector database is different. Just now we talked about indexing: when you put data in a vector database, it is stored as vectors. When you search, you do not search the original data, because searching the original data is not only slow but also imprecise. Instead, we search the vectors directly. Keep in mind these are float numbers, and machine learning and retrieval systems process numbers very, very fast. So you index the data in your database by the vectors themselves, and the data is just linked to the vectors. When you decide to retrieve, say, these three vectors, you then get the data out through them, instead of searching the data, which would give you slow and incorrect answers. That's a very
important difference. I know that a lot of people are familiar with relational databases, where your data is stored in rows and columns, and with NoSQL databases, where your data is stored in key-value pairs. We already know that a NoSQL database is probably higher-performance for searching, because you can search by keys; you do not necessarily have to search the values. So think of a vector database as a bit like that: the vectors are the keys that you search on. Except that in a traditional NoSQL database your keys are very likely strings and the search is by exact matching, whereas here we search by vectors, so we are able to do the semantic search that people talk about.

In a vector database, first you do the indexing, which puts your data in; then you do the search; and then you process your search results with reranking. Let's talk about how indexing works. You start with a document and split it into chunks like this. The chunks are converted to vectors and stored in this area, and your document data is stored in this blue area. Note that the document chunks on the left side and the document chunks on the right side are exactly the same thing. So when you ingest data into a vector DB, the same data is used for indexing and for retrieval. Keep that in mind. Of course, there are many levels, and because a vector DB is a database as well, there are many ways to handle the scalability issues that any database needs to handle. This is an example: when you have a really large document, you split it into several parent
documents, which are split into many more child documents, and the child documents correspond to the individual chunks. That changes how we do the indexing and retrieval process in order to handle scalability issues, but the concept of how a vector database works is still the same. Note that I’m only giving two examples now; there are a lot more ways to optimize a vector database, just like any other commercial database product has many ways to optimize for scalability.

So then: great, I got everything into my vector database and I want to do a search. How do I do it? I want to look for things. Well, keep in mind that the things are stored as vectors. Let’s go back to this. You saw that a document is split into chunks and each chunk is turned into a vector, and every vector has its own position in the vector space. This is a simplified version, where you simplify the vectors into a 3D space, and you can see the vectors laid out this way. This is another example: here is your origin point, and your data is stored in all these places. Consider every one of them as a vector from your origin point to that place. So, in order to leverage the convenience created by vector search over groups of float numbers, let’s consider that your different vectors are in a multi-dimensional space. We just use this picture to help with visualization; do not assume the vectors are only 3D. They can very likely be 1,000 dimensions, or 4,000 dimensions, or 10,000 dimensions. But the calculation of vectors is still the same, and it obeys all your college vector calculation rules. So when I have a query, I’m going to convert the query into float numbers as well, and my 
float numbers will be represented by this green bubble here. I’m going to query for something that should answer my question, represented by the green bubble, and what the search does is get the k nearest neighbors. There are many different algorithms being used: k-nearest neighbor is one, approximate nearest neighbor is one, and this one is, I think, local... sorry, I don’t remember what LSH stands for. It is some hashing... oh yes, locality-sensitive hashing, a hashing algorithm. Thank you, Dr. Ly. The general idea is that you have a green bubble and you want to find the few closest bubbles, the closest vectors in the space, to this green bubble. This is a document chunk, this is another document chunk, this is another document chunk, all in a multi-dimensional space, and with certain rules of calculation the distance between vectors can be calculated mathematically. It’s not guessing; it’s a mathematical result. When I find that these are the chunks closest to me, I’m going to retrieve these blue bubbles, because I believe they contain the answer I’m looking for. So this is search.

The next question is: why do I want four blue bubbles instead of 40? Why do I want four instead of 400? If I just retrieve everything, say 400 bubbles, wouldn’t I get better answers? Well, let’s see. This is called search reranking: your query gets some results, and then you try to understand your results in a certain way. You query something here, you have a green bubble here, and you get a few bubbles that are closest to 
you. You have four bubbles out, and not all of them are your best answers. So why do I want four? Of course you can get 40 with no problem; it’s just that to process 40 you need more calculation. This is the second tradeoff point: does retrieving more search candidates (these are called search candidates) help you improve your answers? The short answer is that I don’t know. You need to work it out according to exactly what your system wants and needs. But first you get the four bubbles you want, and then you rearrange them according to certain rules; that is called reranking. Normally these days reranking is done by another model as well, but for simplicity let’s just say we rerank by how deep the color is. You rerank in a certain way, and these are the reranked results you use to generate an answer. Suppose your answer is here. So this step is called candidate generation, and this step is called reranking. The purpose of reranking is to make sure the most relevant result is at the very top, and then you can apply a cutoff as you like: I only want the first one to generate my answer. There you go.

For example, if your question is to give me the address for Washington University, or for the University of Washington, then one answer is enough. The most relevant result should be enough, giving you 123 ABC Street with zip code 56789. That is the correct answer, and that’s great: one chunk of the documentation can provide you with a perfect answer. There will be situations that need multiple chunks to get a good answer, like, give me the restaurants near me that are still open at 
this hour and serve Japanese food. Then you are going to get a lot of restaurants, and you do the reranking: which ones are nearest to me and fulfill the conditions I want? Here you do want to provide a list of answers. You do not want to provide just one restaurant; you want to provide multiple restaurants for the customer to choose from. In other words, how search generates candidates, and how reranking decides how many candidates to keep, really depends on exactly what you are trying to do. That does not mean you retune your reranking mechanism for every different query; that would not be right, because, as I said, reranking is normally done by a model. Rather, reranking is defined by the business you are running and what information you are trying to provide.

Okay, so we talked about the vector database and how to ingest data into it. We talked about how to do the search after you get data into the vector database. And we talked about how to do the reranking in order to provide the most valuable results. Now let’s go back and look at the entire RAG query process. You start with the document collection, which is your private data, your enterprise private data, and you have a search query, a question you are trying to answer. With both pieces you go through the encoder, using the same embedding model to translate all of this into float numbers. The float numbers for the documents are stored in the vector database here. Then you search your vector database, you generate a list of search candidates, and you apply a reranking mechanism to decide which results you are going to provide. Then you go from the float numbers back to the original text (in practice by looking up the chunk stored with each vector, since an embedding cannot simply be decoded), and you have a ranked list to provide, you 
know, to generate answers for your customers. So that’s how a RAG query works.

Now, we talked about hallucination. We said that a large language model is not very reliable, and then we said here is this wonderful RAG system that is able to provide very reliable answers, because for every question you ask it grounds the answer in the context it is given, which is your enterprise documentation, and so it should provide really good answers, right? Well, RAG has challenges as well. Just like any software system (and it is a software system) it needs to be built in a reasonable way. Hypothetically it works, but pragmatically there are a lot of tune-ups and a lot of additional work that need to get done.

One challenge is called mismatch: I search for certain things and I do not get them. For example, I want to search for restaurants that are open at this hour, and I get things that are open at this hour but I do not get a restaurant. Why? There could be many reasons contributing to it. One reason can be that I do not have enough documents about restaurants. For example, my documentation tells me there is an open gym nearby, an open spa nearby, an open grocery store nearby, but the restaurants are labeled as stores instead of as restaurants, because for whatever reason my data considers a restaurant part of the grocery store category. Then when I search for this, I get irrelevant information. So a mismatch in your results is very likely a data problem: your data is not in good shape, or you do not have complete data, or your data is wrong, and so on.

The second kind of problem is the coverage problem, as in: I know there is a 
restaurant two blocks away from me, and I know that restaurant is open, but when you return the results to me, they do not include that restaurant. So there is feedback I am able to give this RAG system: hey, you do not have the right coverage; you are not returning information that I know you already have. So just now we said that mismatch is very likely a data issue, whereas coverage, a.k.a. returning an incomplete list, is very likely a search issue: either you are not generating the right candidates from the search, or during your reranking process the relevant candidate is ranked as low relevance so you never display it. That’s where you should look. Then you need to do a little debugging: I know I should return five restaurants; are you returning all five? If you are only returning three, then we know your search has not generated enough candidates, and you need to tune your search parameters a little to generate more candidates. But there could be a situation where I know five restaurants should come up for this search, and the search does get five restaurants, but when you display the information to me you just say, here are two restaurants you can go to. Then that’s a reranking problem, and you tune up your reranking setup.

The third thing is bias and fairness. When we talk about bias and fairness, very often, if you are not dealing with data management a lot, you tend to think of bias from a social perspective: do we have the right racial groups represented, do we have equal opportunities for all? Whereas bias and fairness in a machine learning context is really about data bias and data fairness, as in, there are 
sometimes more dominant data points that simply occur more often, and you need to understand what that means. For example, right now when we talk about inflation, everyone knows egg prices have risen. Why? Number one, yes, egg prices are higher. Number two, it’s in the news all the time. So when you do the document analysis, egg prices would dominate your large language model’s input, as in, a lot of the inflation coverage is about prices rising. But meanwhile gas prices have decreased; it just does not make as much news, so people do not pay attention to it. It is the same when people talk about gas prices and energy prices. People talk about gas prices all the time; people do not talk about natural gas. And especially in the context of energy, gas is one word and natural gas is two words, so when you do the analysis at the word level, very likely you are going to group some of the natural gas information in with the gas information. That makes things wrong as well, because you would be treating natural gas as part of gas, which is not true in a pragmatic sense. Also, when you talk about natural gas, you hear that Russia exports natural gas, which has probably dominated the news cycle recently, but people do not pay attention to the fact that Canada exports natural gas as well; it just does not dominate the news cycle. So bias and fairness in the machine learning domain is about whether you are putting too much attention on the dominant data points. That’s very important. Take a model diagnosing cancer: if you just diagnose everyone as cancer-free, you get 99% accuracy, whereas you get extremely low recall. So bias and fairness can be 
understood and measured in terms of precision and recall, and other metrics such as mean average precision or cumulative gain of your search results, et cetera. So those are real-world problems that need to be handled when you are building a RAG system.

Also, in the real world the data can be very complicated as well. You can have a text saying the lion is the king of the jungle, you can have two pictures of lions that do not look similar to each other, and you can have a voice track explaining this video. So there can be multiple dimensions of information that need to be handled and processed so that we can do a more precise search and give a more precise answer.

Given the time constraint today, there are a lot of things we cannot expand on, so we are going to have more advanced sessions teaching RAG. There will be four classes in April, every Friday: April 4th, April 11th, April 18th and April 25th. So four classes talking about RAG in detail. We are going to do hands-on projects to really get you grounded in the concepts and ideas we talked about just now. We are also going to cover more advanced topics such as query expansion. Say you ask, tell me about the elephant on the moon. That probably should be expanded into two questions: are you asking about the elephant on the moon in a scientific scenario, or in a creative scenario? When we separate these with query expansion, we can answer them separately and then return answers to both to the customer: if you mean the question in a scientific setting, then I can tell you there is no elephant on the moon, but if you want it in a creative way, here 
is a story about the pink elephant whose name is Luna. That makes the answer a lot better. Also, we talked about several search mechanisms today, and you probably wonder, why don’t we have it all? Right: there will be hybrid search, which combines them, and there will be certain ways to do quantization in your vector database, so that your vector database provides you with the best search results. Just now we also talked a little about multimodal search, so that you can search not only for text but also for audio, video, pictures, et cetera. Also, in the situation where I have this RAG system working, how do I know it’s working well? There are several metrics for that. And there are certain tradeoffs you have to make. Just now we talked a little bit about them: there is a cost tradeoff, an answer-relevance tradeoff, and a tradeoff on how grounded your answer is. In certain situations I do not mind sacrificing the accuracy of my results a little; hallucination does not really hurt me there. In other situations, hallucination is a big no-no, and I do not want any of it. So these are all design tradeoff considerations that we are going to talk about in detail.

Okay, I think we have come to the end of the presentation. These are a few QR codes. I conduct one-on-one training on machine learning for interview preparation or career guidance; this is the QR code for it. If you want to join our RAG courses in April, here is the QR code for it. And please also join our Mall AI LinkedIn Group by scanning this QR code as well.
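
As a recap of the query flow described in the talk (split a document into chunks, turn chunks and query into vectors, retrieve the nearest candidates, rerank, then apply a cutoff), here is a minimal Python sketch. It is an illustration under stated assumptions, not the presenter’s implementation: the word-count `embed` function is a toy stand-in for a real embedding model, the brute-force `search` stands in for a vector database’s nearest-neighbor index, the keyword-overlap `rerank` stands in for a reranking model, and all function names and the sample document are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def split_into_chunks(document):
    """Assumed chunking strategy: one chunk per sentence."""
    return [s.strip() for s in document.split(".") if s.strip()]


def build_vocab(chunks):
    """Map every word seen in the chunks to a vector dimension."""
    words = sorted({w for c in chunks for w in tokenize(c)})
    return {w: i for i, w in enumerate(words)}


def embed(text, vocab):
    """Toy embedding: a word-count vector (stand-in for a real embedding model)."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, k):
    """Candidate generation: brute-force k nearest neighbors by cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rerank(query, candidates, cutoff):
    """Toy reranker: order by how many distinct query words each chunk contains."""
    query_words = set(tokenize(query))
    ranked = sorted(
        candidates,
        key=lambda pair: len(query_words & set(tokenize(pair[1]))),
        reverse=True,
    )
    return [text for _, text in ranked[:cutoff]]


document = (
    "The sushi restaurant on Main Street is open until midnight. "
    "The grocery store on Main Street closes at nine. "
    "The gym near the park is open all night. "
    "The ramen restaurant by the station is open until two."
)

# Indexing: the same chunk is stored both as a vector and as text.
chunks = split_into_chunks(document)
vocab = build_vocab(chunks)
index = [(embed(c, vocab), c) for c in chunks]

# Query: embed with the same model, generate candidates, rerank, cut off.
query = "Which restaurant is open late?"
candidates = search(embed(query, vocab), index, k=3)
answers = rerank(query, candidates, cutoff=2)
print(answers)
```

The two knobs discussed in the talk appear here as `k` (how many search candidates to generate) and `cutoff` (how many reranked results to keep). Raising `k` costs more computation in the rerank step, and the right `cutoff` depends on whether the question needs one answer or a list.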