One of the earliest use cases for large language models was in code, particularly around rewriting, updating and testing different coding languages. So I made that the first test, asking each of the bots to write a simple Python program. I used the following prompt: "Develop a Python script that serves as a personal expense tracker. The program should allow users to input their expenses along with categories (e.g., groceries, utilities, entertainment) and the date of the expense. The script should then provide a summary of expenses by category and total spend over a given time period. Include comments explaining each step of your code."

This is designed to test how well ChatGPT and Gemini produce fully functional code, how easy that code is to interact with, its readability and its adherence to coding standards.

Both created a fully functional expense tracker built in Python. Gemini added extra functionality, including labels within a category, and it had more granular reporting options. I've loaded both scripts to my GitHub if you want to try them for yourself.

Reasoning capabilities are one of the major benchmarks for an AI model. It isn't something they all do equally well, and it's a tough category to judge, so I decided to play it safe with a very classic query: "One door leads to safety, and the other door leads to danger. There are two guards, one in front of each door. One guard always tells the truth, and the other always lies. You can ask one guard one question to find out which door leads to safety."

The answer is clearly that you could ask either guard "Which door would the other guard say leads to danger?" It is a useful test of creativity in questioning and of how the AI navigates a truth-lie dynamic, and it also tests logical reasoning, since the model has to account for both possible responses. The downside to this query is that it is such a common prompt that the response is likely well ingrained in the training data, requiring minimal reasoning because the model can draw on memory.

Both gave the right answer and a solid explanation, so in the end I had to judge solely on the explanation and its clarity. Both gave a bullet-point response, but OpenAI's ChatGPT offered slightly more detail and a clearer reply.

Explain Like I'm Five (ELI5)

Anyone who has spent any time browsing the depths of Reddit will have seen the letters ELI5, which stand for Explain Like I'm Five. Basically: simplify the reply, then simplify it again.

For this test I used the very simple prompt: "Explain how airplanes stay up in the sky to a five-year-old." This is a test of how the chatbots expand on a simple prompt and then meet the requirements of a target audience. Each needs to come up with an explanation simple enough for a young child to grasp, stay accurate despite the simplification, and use language that is engaging and will capture a child's interest.

This was a tough one to judge, as both gave a reasonable and accurate response. Both used birds as a way into the explanation, and both used simple language and a personal tone, but Gemini presented its answer as a series of bullet points instead of a block of text. It also gave a practical experiment for the five-year-old to try.

Ethical Reasoning & Decision-Making

Asking an AI chatbot to ponder a scenario that could lead to harm to a human is not easy, but with the advent of driverless vehicles and AI brains going into robots, it is a reasonable expectation that they'll weigh up such a scenario carefully and make a quick judgement call.

For this test I used the prompt: "Consider a scenario where an autonomous vehicle must choose between hitting a pedestrian or swerving and risking the lives of its passengers." I used a strict rubric considering multiple ethical frameworks, how each model weighs up the different perspectives, and its awareness of bias in decision making.

Neither would offer an opinion; however, both did outline the various points to consider and suggested ways to make such a decision in future. They effectively treated it as a third-party problem to assess and report on, leaving someone else to make the call.

Cross-Lingual Translation & Cultural Awareness

I used a different login to sign in to each bot. In my view Gemini had a more nuanced response with more careful consideration, but to be sure I also fed each of the responses, in a blind A-or-B test, to ChatGPT Plus, Gemini Advanced, Claude 2 and Mistral's Mixtral model. All of the AI models selected Gemini as the winner, including ChatGPT, despite not knowing which model produced which content.
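For reference, the expense-tracker prompt from the first test can be satisfied by a fairly small script. The sketch below is my own illustration of one way to meet the prompt's requirements, not the output of either chatbot; the function names and the in-memory list are assumptions for the example.

```python
from collections import defaultdict
from datetime import date

# In-memory store of expenses; each entry is (date, category, amount).
expenses = []

def add_expense(day, category, amount):
    """Record a single expense with its category and date."""
    expenses.append((day, category, float(amount)))

def summarize(start, end):
    """Summarise spend by category, and in total, between two dates (inclusive)."""
    by_category = defaultdict(float)
    for day, category, amount in expenses:
        if start <= day <= end:
            by_category[category] += amount
    total = sum(by_category.values())
    return dict(by_category), total

# Example usage with the categories suggested in the prompt.
add_expense(date(2024, 3, 1), "groceries", 42.50)
add_expense(date(2024, 3, 2), "utilities", 60.00)
add_expense(date(2024, 3, 5), "groceries", 17.25)

by_cat, total = summarize(date(2024, 3, 1), date(2024, 3, 31))
print(by_cat)   # {'groceries': 59.75, 'utilities': 60.0}
print(total)    # 119.75
```

A production version would add interactive input and persistence, but this covers the core of what the prompt asks for: categorised entry, dated records, and a per-category plus total summary over a date range.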
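The logic behind the two-guards puzzle can also be verified mechanically. The snippet below is my own brute-force check, not either chatbot's answer: it models both cases (asking the truthful guard, asking the liar) and confirms that the question "Which door would the other guard say leads to danger?" names the safe door either way.

```python
def other_guard_says_danger(asked_is_liar, safe_door, danger_door):
    """Answer to: 'Which door would the other guard say leads to danger?'"""
    other_is_liar = not asked_is_liar
    # What the other guard would actually say if asked which door is dangerous:
    # the liar names the safe door, the truth-teller names the dangerous one.
    other_says = safe_door if other_is_liar else danger_door
    if asked_is_liar:
        # The liar misreports the other guard's answer, swapping the doors.
        return danger_door if other_says == safe_door else safe_door
    # The truth-teller reports the other guard's answer faithfully.
    return other_says

# Whichever guard is asked, the door named is the safe one.
for asked_is_liar in (False, True):
    print(other_guard_says_danger(asked_is_liar, "left", "right"))  # 'left' both times
```

Because the two inversions (one guard lying about the doors, or lying about the other's report) always stack to exactly one flip, the named door is safe in both cases, which is why the question works.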