Training DBRX (Databricks LLM): Model Design and Challenges. Virtual Talk
Plus: AI That Forecasts Future Events. Video lecture by UC Berkeley.
Hello, fellow human! I hope things are going well for you on this planet. Sophia here with a brief announcement about our next two talks. Would be great if you could join us.
Table of Contents:
Upcoming virtual talk on May 2nd: Training DBRX (Databricks LLM): Model Design and Challenges.
Video recording of a past talk: a GPT-4-based language model (LM) from UC Berkeley that can retrieve and summarize articles, reason about them, and make predictions about future events.
Training DBRX (Databricks LLM): Model Design and Challenges
Our guest, Shashank Rajput, a research scientist at Databricks Mosaic Research, will speak about DBRX, a large language model (LLM) recently introduced by Databricks.
Databricks claims that the new model surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro.
Shashank will share some details about the model:
- Dive into the Mixture of Experts architecture on which DBRX is based.
- Describe the model's components and hyperparameters.
- Discuss the challenges the team faced during large-scale training and how they addressed them.
More details and registration are here.
A reminder: tomorrow, April 18th, we are hosting a talk about an LLM vulnerability stemming from a leak in GPU local memory. It affected Apple, AMD, and Qualcomm devices. Please join us. Registration is here.
AI That Forecasts Future Events
We recently hosted a talk with Danny Halawi, a researcher from UC Berkeley, where he presented a paper: Approaching Human-Level Forecasting with Language Models.
The video lecture for the BuzzRobot community is available on our YouTube channel.
I’m also sharing the key highlights of the lecture below.
The language model was developed to make predictions in the field of judgmental forecasting and to answer questions like:
Will a nuclear weapon be detonated in 2024, including tests and accidents?
Who will be the de facto power in the Gaza Strip on January 1, 2025?
Who will win the US presidential election in November 2024?
This is how the researchers built the model.
Step 1: Set up an ML benchmark
- The researchers sourced binary questions from 5 competitive forecasting platforms with aggregated human predictions.
- They established a retrieval date – the date up to which the language model (LM) can retrieve information from those platforms.
- They fed that information to the LM and prompted it to make predictions.
- They computed the performance of the LM using the Brier score, which measures the accuracy of probabilistic predictions: a score of 0 means a perfect prediction, while a score of 1 means the prediction is completely wrong.
- They compared the model's Brier score to the crowd's Brier score on the retrieval date.
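For intuition, the Brier score for binary questions is just the mean squared difference between the predicted probability and the realized outcome. A minimal sketch:

```python
# Brier score for binary forecasts: mean squared difference between
# the predicted probability and the actual outcome (0 or 1).
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A perfect forecast scores 0; a maximally wrong one scores 1.
print(brier_score([1.0, 0.0], [1, 0]))  # 0.0
print(brier_score([0.0, 1.0], [1, 0]))  # 1.0
print(brier_score([0.7, 0.2], [1, 0]))  # (0.09 + 0.04) / 2 = 0.065
```

Lower is better, which is why the model's score is compared against the crowd's score on the same questions.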
Step 2. Dataset
For the training dataset the researchers collected questions before June 1st, 2023.
The test dataset contained questions from after June 1st, 2023 (after the model's pre-training cutoff).
Examples of questions in the dataset:
Will Tesla’s market cap be more than $1T before July 1st, 2023?
Will Russia successfully land on the Moon by August 2023?
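The train/test split above is purely date-based. A minimal sketch of that split (the question open dates here are hypothetical, for illustration only):

```python
from datetime import date

cutoff = date(2023, 6, 1)  # pre-training cutoff used for the split

# Hypothetical question records: (question text, open date)
questions = [
    ("Will Tesla's market cap be more than $1T before July 1st, 2023?",
     date(2023, 3, 10)),
    ("Will Russia successfully land on the Moon by August 2023?",
     date(2023, 7, 2)),
]

train = [q for q in questions if q[1] < cutoff]   # before the cutoff
test = [q for q in questions if q[1] >= cutoff]   # after the cutoff
```

Keeping the test questions strictly after the cutoff ensures the model cannot have seen their outcomes during pre-training.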
Step 3. The Language model structure
Retrieval of articles: The researchers prompted the model to generate search queries about a future event.
These queries were fed to news APIs to retrieve articles.
After retrieving a set of articles, the LM was asked to rate their relevance with respect to the question.
The most relevant articles were chosen and given to the model to summarize.
Then the researchers prompted two GPT-4-based language models with the questions and news summaries.
The first model also received scratchpad instructions on how to ‘think’ about these questions.
The second, fine-tuned model received no instructions, only questions and news summaries, and had to learn on its own how to use this information for prediction. Each model generated reasoning that was aggregated into final predictions.
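The pipeline above can be sketched as follows. This is a hedged outline, not the authors' implementation: the function names (`forecast`, `top_k_articles`) and the `llm`/`search` callables are hypothetical stand-ins for the prompted GPT-4 calls and news-API requests.

```python
def top_k_articles(articles, scores, k=3):
    # Keep the k articles the LM rated most relevant.
    ranked = sorted(zip(scores, articles), reverse=True)
    return [article for _, article in ranked[:k]]

def forecast(question, llm, search, k=3):
    # 1. Generate search queries about the future event.
    queries = llm(f"Generate search queries for: {question}")
    # 2. Fetch candidate articles via a news API.
    articles = search(queries)
    # 3. Ask the LM to rate each article's relevance to the question.
    scores = [llm(f"Rate the relevance of this article to '{question}': {a}")
              for a in articles]
    # 4. Summarize only the most relevant articles.
    summaries = [llm(f"Summarize: {a}")
                 for a in top_k_articles(articles, scores, k)]
    # 5. Prompt for a final probability given question + summaries.
    return llm(f"Question: {question}\nSummaries: {summaries}\n"
               "Give a probability for this event.")

# top_k_articles keeps the highest-rated articles:
print(top_k_articles(["a", "b", "c", "d"], [0.2, 0.9, 0.5, 0.8], k=2))
# → ['b', 'd']
```

In the real system, each `llm(...)` call is a separate prompted step, and the reasoning from both models is aggregated into the final prediction.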
The fine-tuned model that didn’t receive any instructions had a higher prediction accuracy. You can learn more about how the researchers fine-tuned the model in the video lecture.
Results
The evaluation covered 914 questions, with crowd predictions scraped from the 5 platforms.
The model’s prediction accuracy was 71.5%.
The crowd accuracy was 77%.
For comparison, other prediction systems like Autocast had 67.9% accuracy in predicting future events.
According to Danny, their language model could have achieved a higher accuracy score, but OpenAI’s safety guardrails prevented it from making predictions or providing certain answers.
The researchers also tested open-source models. Unfortunately, they didn’t perform well; open-source LLMs still have a long way to go to catch up with closed-source models.