Background & motivation
The questions
    General performance benchmarks
    Will reinforcement learning be the dominant paradigm in n years?
    How well will reinforcement learning or its successor work?
    Operationalizations of AI relevant compute
    Architecture, modularity, and deep learning
        How many modules?
        Gradient descent modules vs. non-gradient descent modules?
        How many meta-modules?
    Sociological questions
    Governments and nationality
    Consensus in the ML community that the “alignment problem” is important

Background & motivation

A few months ago, BERI commissioned me (Eli Tyre) to collaborate with local individuals working on AI safety to develop a set of existential-risk-relevant questions that would be appropriate for use in prediction markets. BERI is making the questions available for general use in prediction markets, forecasting tournaments, and the like. Feel free to make use of these questions in any capacity.

BERI believes that accurate forecasts are useful for trying to prevent existential risks. Predictions about the near- and medium-term future could inform strategic decisions, such as what types of research should be prioritized. In order to encourage forecasting and prediction markets in this area, BERI is open-sourcing the following questions for general use.

If you are running a prediction market, you are welcome to use any of these questions freely, in any capacity.

Note that this is only a preliminary list of questions and should not be treated as exhaustive. Due to the breadth of the space, there are certainly good questions that my collaborators and I did not consider, or for which we did not find suitable operationalizations. I don’t want the existence of this list to discourage others from trying to develop their own prediction market questions.

Also note that this is simply a list of questions that we believe could be answered by a prediction market; there are other questions that are strategically important to AI x-risk but cannot be easily resolved by simply-worded prediction market questions.

Although we tried to crisply operationalize each question, it would be infeasible to give the full resolution criteria for each one (at least, without turning this blogpost into hundreds of pages of minutiae). Anyone using one of these questions as the basis of a prediction market is responsible for delineating specific resolution criteria and mechanisms for settling edge cases for their market.

If you have questions about the motivation or operationalization of any of the following items, please email eli@existence.org.

The questions

Many of the following questions are open-ended: e.g., “In what year will X benchmark be reached?” These questions could be formulated as a contract that pays out an amount proportional to the year the task is solved, or a contract that pays out only if the task is solved before a specified cutoff year.

In contrast, many other questions are of the form “By a given date, will X event have occurred?” We give suggested specific dates for these questions. However, in most cases, variants of these questions using dates between now and 2030 are expected to be useful.

Below, I list the questions, grouped by broad category.

General performance benchmarks

The following are some standard benchmarks for AI performance. Prediction markets forecasting the arrival of technologies that hit these benchmarks would be useful.

(A note on attribution: many of these questions were taken directly from AI Impacts’ 2016 Expert Survey on Progress in AI, the 2017 AI Index, or the Electronic Frontier Foundation’s AI Progress Measurement open-source notebook. Attributions are noted in brackets after each question.)

  1. By what date will an RL agent match human performance on StarCraft II, with no domain-specific hardcoded knowledge, trained using no more than $10,000 of compute on publicly available hardware?
  1. By what date will an AI system surpass human performance on the open-ended visual question answering task, COCO? [EFF]
  1. By what date will an AI system surpass human performance at visual question answering on the VQA 2.0 dataset? [AI Index]
  1. By what date will an AI system match human performance in reading comprehension on the Facebook bAbI 20 QA dataset, with only one thousand training examples? [EFF]
  1. By what date will an AI chatbot decisively pass the Turing test: a human judge can do no better than chance at distinguishing the chatbot from another human, after an hour of textual conversation?
  1. By what date will an AI system see only one labeled image of a new object, and then be able to recognize the object in real world scenes, to the extent that a typical human can?

    • The new object should be one of the ImageNet classes; the training example can but need not be from ImageNet. The system will be considered to match human performance if it is able to achieve or exceed human accuracy in photos of a wide range of real world scenes. The system may train on labeled images of other objects. [AI Impacts]
  1. By what date will an AI system be able to take a written passage and output a recording that an expert listener cannot distinguish from a voice actor’s reading? [AI Impacts]
  1. By what date will there be commercially available software that takes a body of text and produces a synthetic reading of the text that cannot be distinguished from a non-synthetic production?
  1. By what date will an AI system routinely and autonomously prove mathematical theorems that are publishable in top mathematics journals today, including generating the theorems to prove? [AI Impacts]
  1. By what date will an AI system be able to play new levels of Angry Birds better than the best human players? (For context, this is the goal of the IJCAI Angry Birds AI competition.) [AI Impacts]
  1. By what date will an AI system be able to outperform professional game testers on all Atari games using no game-specific knowledge? (This includes games like Frostbite, which require planning to achieve sub-goals and have posed problems for deep Q-networks.)
  1. By what date will an AI system surpass average human performance in OpenAI’s Gym retro contest?
  1. By what date will there be at least one company that has deployed at least 1,000 fully automated cars, operating on public roads in a major urban area and serving the public?
  1. By what date will there be a publicly available 10-minute clip of synthetic video that cannot be distinguished from a non-synthetic video? (A skilled human, shown a clip of real video and a clip of synthetic video and told to identify the real clip, can do no better than chance.)
  1. By what date will there be a commercially available video game with fully photorealistic graphics, that cannot be distinguished from recorded footage of a similar scene?
  1. In a given year [2025, 2030], how many professional translators and interpreters will work in the United States, according to the Bureau of Labor Statistics?
  1. By what date will there be a single AI architecture that can be trained, using only self-play, to play both Go and each of the seven Atari 2600 games used in DeepMind’s *Playing Atari with Deep Reinforcement Learning*, at a superhuman level?

    • The architecture must be able to learn each game. That is, the criteria of this question are met if one copy of the system is trained on Go, and another copy is trained on Atari, even if no single trained system can play every game. However, the system may not be tuned or modified by a human for the differing tasks. “Superhuman”, here, means performance superior to that of the best human experts in each domain.
  1. By what date will there be a single AI architecture that can be trained, using only self-play, to play Go, StarCraft II, poker, and each of the seven Atari 2600 games used in DeepMind’s *Playing Atari with Deep Reinforcement Learning*, each at a superhuman level?

    • The architecture must be able to learn each game. That is, the criteria of this question are met if one copy of the system is trained on Go, another copy is trained on Atari, etc., even if no single trained system can play every game. However, the system may not be tuned or modified by a human for the differing tasks. “Superhuman”, here, means performance superior to that of the best human experts in each domain.

Will reinforcement learning be the dominant paradigm in n years?

  1. In a given year [2025, 2027, 2030], what percentage of papers published on arXiv in the Computer Science and Statistics categories (or whatever the most commonly used repository of the era is) will include the phrase “Reinforcement Learning” in the title?
  1. In a given year [2025, 2027, 2030], will 3 of the 6 most cited papers (in the fields of AI and Machine Learning) of that year involve reinforcement learning (as judged by a survey of experts, such as NIPS authors)?

How well will reinforcement learning or its successor work?

  1. By what date will there be a general purpose robot that can learn at least three of the following tasks: 1) make a bed, 2) wash dishes, 3) fold laundry, and 4) vacuum a room, requiring only 10 minutes’ worth of data (explanation, demonstration, etc.) from a human per task? (The robot can be pre-trained on other tasks for arbitrary amounts of time/compute/data. The robot does not need to be commercially available.)
  1. By what date will there be an agent that can beat a variation of the OpenAI Minecraft task that awards no points for gold and redstone, only for picking up a diamond, learning only from pixels with no Minecraft-specific hardcoded knowledge?

Operationalizations of AI relevant compute

Answers to the following questions have high information value. However, each of them involves a continuous value that will be hard to determine exactly. Markets on the following questions should either be formulated as individual binary contracts (e.g., “In 2025, the maximum training compute used by a published AI system will be greater than 4,000 petaflop/s-days”) or resolve to ranges instead of point values.
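To make the two resolution styles concrete, here is a minimal sketch (my own illustration; the observed values, thresholds, and range cutpoints are hypothetical, not part of any question):

```python
# Two ways a market on a continuous quantity might resolve (illustrative only).

def resolve_binary(observed: float, threshold: float) -> bool:
    """Binary contract: pays out iff the observed value exceeds the threshold."""
    return observed > threshold

def resolve_to_range(observed: float, cutpoints: list[float]) -> str:
    """Range contract: resolves to the bucket containing the observed value."""
    lo = float("-inf")
    for hi in cutpoints:
        if lo < observed <= hi:
            return f"({lo}, {hi}]"
        lo = hi
    return f"({lo}, inf)"

# e.g. a binary contract on "max training compute > 4,000 petaflop/s-days"
print(resolve_binary(observed=5_200.0, threshold=4_000.0))      # True
print(resolve_to_range(5_200.0, [1_000.0, 4_000.0, 10_000.0]))  # (4000.0, 10000.0]
```

The range formulation trades precision for resolvability: the market maker only needs to establish which bucket the true value falls in, not the value itself.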

  1. In a given year [2020, 2025, 2030], what will be the maximum compute (measured in petaflop/s-days) used in training by a published AI system?

    • A “published AI system” is a system that is the topic of a published research paper or blogpost. In order to be admissible, the paper/blog post must give sufficient information to estimate training compute, within some error threshold.

    • [This question was inspired by this blog post, which delineates a methodology for estimating training compute.]

  1. In a given year [2020, 2025, 2030], what will be the retail cloud computing price of conducting a total of ten million 4000 x 4000 matrix multiplications (10,000 multiplications each of 1,000 different matrices)?

    • The multiplications may be parallelized. The elements of the matrix should be randomly sampled from a standard normal distribution. The number of matrices is constrained only to avoid repeatedly incurring the cost of a random number generator. You may not exploit the property that the matrices are duplicated. The cost should be prorated to fractions of an hour (or other relevant billing cycle), and be for an ‘on-demand’ (non-preemptible, no reservation) price, without any bulk or other special discounts.
  1. In a given year [2025, 2030], how much power will it take to implement 1e14 Traversed Edges Per Second for an hour on the best-case machine from the Graph500 list?
  1. In a given year [2025, 2030], how much power will it take to implement 1e12 Traversed Edges Per Second for an hour on the best-case machine from the Graph500 list?
  1. Serial computation: In a given year [2025, 2030], how many minutes will it take to train ResNet-152 on ILSVRC 2012 using a system available to purchase on the public market for less than $5000?

  1. Parallel computation: In a given year [2025, 2030], how many instances of ResNet-152 will be able to be trained in parallel on ILSVRC 2012 in fewer than 24 hours using a system available to purchase on the public market in [2025, 2030] for less than $5000?
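The matrix-multiplication workload in the pricing question above can be sketched as follows. This is my own illustration, not part of the question's resolution criteria; in particular, I read each "multiplication" as squaring one of the sampled matrices, which is an assumption. The question's actual parameters are size=4000, n_matrices=1000, reps=10_000 (for ten million products in total); the demo uses tiny values so it runs quickly.

```python
import numpy as np

def matmul_workload(size: int, n_matrices: int, reps: int, seed: int = 0) -> int:
    """Run n_matrices * reps matrix products; return the count performed."""
    rng = np.random.default_rng(seed)
    # Elements drawn from a standard normal distribution, per the question.
    matrices = [rng.standard_normal((size, size)) for _ in range(n_matrices)]
    total = 0
    for m in matrices:
        for _ in range(reps):
            _ = m @ m  # one size x size matrix product (assumed: matrix squared)
            total += 1
    return total

# Tiny demo: 4 matrices x 5 reps = 20 products (the real workload is 10 million).
print(matmul_workload(size=8, n_matrices=4, reps=5))  # 20
```

Timing this workload on an on-demand cloud instance and multiplying by the prorated hourly price would give one concrete resolution procedure.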

Architecture, modularity, and deep learning

For each of the following questions, a “module” refers to some division of an AI system such that all information passed between modules is human legible.

As an example, AlphaZero has two modules: a neural net and a Monte Carlo tree search. The neural net, when given a board state, has two outputs to the tree search: a valuation of the board, and a policy over all available actions.

The “board value” is a single number between -1 and 1. A human cannot easily assess how the neural net arrived at that number, but the human can say crisply what the number represents: how good this board state is for the player. Similarly with the policy output. The policy is a probability vector. A human can conceptualize what sort of object it is: a series of weightings on the available moves by how likely those moves are to lead to a win. The board value and the policy are both “human legible”.

Contrast this with a given floating point number inside of a neural net, which will rarely correspond to anything specific from a high-level human perspective. A floating point number in a neural net is not “human legible”.

A module is a component of an AI system that only outputs data that is legible in this way. (In some cases, such as the Monte Carlo tree search of AlphaZero, the internal representation of a module will itself be human legible, and that module could therefore be thought of as several modules. In such cases, prefer the division that has the fewest modules.)

Under this definition, AlphaZero is made up of two modules: the neural net and the Monte Carlo tree search.

An end-to-end neural network that takes in all sense data and outputs motor plans should be thought of as composed of only a single module.
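The module boundary can be made concrete with a small sketch. This is my own hypothetical illustration (the function bodies are stand-ins, not AlphaZero's actual algorithms): the only information crossing the boundary between the two modules is the human-legible pair of a board value in [-1, 1] and a policy vector over moves.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    value: float         # how good the board state is for the player, in [-1, 1]
    policy: list[float]  # probability weighting over the available moves

def neural_net_module(board_state: list[int]) -> Evaluation:
    """Stand-in for the learned module; its internals are not human legible,
    but its outputs are."""
    n_moves = len(board_state)
    return Evaluation(value=0.0, policy=[1.0 / n_moves] * n_moves)

def tree_search_module(board_state: list[int]) -> int:
    """Stand-in for the search module; consumes only the legible interface."""
    ev = neural_net_module(board_state)
    # Pick the most promising move according to the policy.
    return max(range(len(ev.policy)), key=lambda i: ev.policy[i])

print(tree_search_module([0, 0, 0]))  # 0 (uniform policy -> first move wins ties)
```

The point of the sketch is the interface: a human can state exactly what `value` and `policy` mean, even though the computation producing them is opaque.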

How many modules?

  1. In a given year [2025, 2030] how many modules will the state-of-the-art conversational chatbot have?
  1. In a given year [2025, 2030] how many modules will the state-of-the-art AI system for architecture-search have?
  1. In a given year [2025, 2030] how many modules will the state-of-the-art general use robotics system (that can do multiple learned, as opposed to hardcoded, tasks) have?

Gradient descent modules vs. non-gradient descent modules?

  1. In a given year [2025, 2030], how many modules that employ gradient descent on a loss function will the state-of-the-art conversational chatbot have?
  1. In a given year [2025, 2030] how many modules that employ gradient descent on a loss function will the state-of-the-art general use robotics system (that can do multiple learned, as opposed to hardcoded, tasks) have?
  1. In a given year [2025, 2030], what proportion of the power used to deploy (not train) the state-of-the-art conversational chatbot will go to modules that have a loss function? (If the system is a single module, the proportion would be either zero or one.)
  1. In a given year [2025, 2030], what proportion of the power used to deploy (not train) the state-of-the-art general use robotics system (that can do multiple learned, as opposed to hardcoded, tasks) will go to modules that have a loss function? (If the system is a single module, the proportion would be either zero or one.)

A meta-module is a module that controls the allocation of computational resources to other modules.

How many meta-modules?

  1. In a given year [2025, 2030] will the state-of-the-art conversational chatbot have an architecture that includes a meta-module?
  1. In a given year [2025, 2030] will the state-of-the-art AI system for architecture-search have an architecture that includes a meta-module?

Sociological questions

  1. In a given year [2020, 2025, 2030], how many submissions will arXiv receive in Machine Learning and Artificial Intelligence?
  1. In a given year [2020, 2025, 2030], what will the average (arithmetic mean) number of attendees across NIPS, ICML, and AAAI be?
  1. In a given year [2025, 2030], what percentage of graduates leaving top 10 CS undergraduate programs will either go into a PhD in AI, or accept a research job in AI or Machine Learning?
  1. In a given year [2025, 2030] what will be the largest amount spent on research and development of AI technology by any one company in 2018 dollars?
  1. What will be the smallest number of organizations that together account for 10% of the first authors of papers published in [NIPS, ICML, other top venue] in a given year [2020, 2025]?
  1. In a given year [2020, 2025], what will be the proportion of job ads on Hacker News Who’s Hiring mentioning “Artificial Intelligence”, “Machine Learning”, or “Deep Learning” (or an abbreviation of any of those)?
  1. In a given year [2025, 2030], will DeepMind be a financial subsidiary of Google or Alphabet?
  1. In a given year [2025, 2030], how many people will OpenAI employ?

Governments and nationality

  1. By a given year [2025, 2030], a major AI lab in North America, the EU27, the UK, or Australia will have been nationalized.
  1. In a given year [2030], fewer than 50% of major fundamental discoveries (of similar significance to DQN, LSTM, AlphaGo, etc.) made in the previous ten years will be published within two years of discovery.
  1. By a given date [2030], a major AI lab will have been nationalized.

Consensus in the ML community that the “alignment problem” is important

  1. In a given year [2025, 2030], what percentage of authors at the top three AI conferences would agree with the statement, “Artificial General Intelligence poses an extinction risk to humanity”?
  1. In a given year [2025, 2030] what percentage of authors at the top three AI conferences would agree with the statement, “AGI alignment is a critical problem for our generation”?
  1. In a given year [2025, 2030], how many technical employees at OpenAI will work directly on problems of AI alignment?
  1. In a given year [2025, 2030], how many technical employees at DeepMind will work directly on problems of AI alignment?
  1. In a given year [2025, 2030], what percentage of the technical employees at DeepMind will work directly on problems of AI alignment?
  1. In a given year [2025, 2030], what percentage of the technical employees at OpenAI will work directly on problems of AI alignment?