Collaborative Best Arm Identification in Multi-armed Bandits

Date: 28th Mar 2024

Time: 11:00 AM

Venue: Online

PAST EVENT

Details

The literature on multi-armed bandits broadly addresses two problems: regret minimization and best arm identification. This work considers best arm identification in stochastic multi-armed bandits (MAB) under a fixed-budget setting, where the agents learning the MAB instance are connected through a network. The model assumes a collaborative network in which each agent instantaneously observes the actions and rewards of its one-hop neighbors. Under this collaborative model, we estimate the best arm for the entire network and propose several policies to do so. We first devise policies for star networks, since stars are a common motif in many communication systems, and then extend to generic networks. For generic networks, we pursue two sets of approaches: one estimates the best arm with the help of an Ensemble Learner (EL), while the other estimates it without an EL.
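As a rough illustration (not the policies proposed in this work), the collaborative observation model can be sketched as follows: each agent pulls an arm per round, and every agent's observation set also includes the (action, reward) pairs of its one-hop neighbors. The star graph, Bernoulli rewards, and the uniformly random placeholder policy are all assumptions made only for this sketch.

```python
import random
from collections import defaultdict

def simulate_round(adjacency, arm_means, rng):
    """One round of the collaborative model: each agent pulls an arm
    (uniformly at random here, as a placeholder policy) and then observes
    its own sample plus those of its one-hop neighbors."""
    pulls = {}
    for agent in adjacency:
        arm = rng.randrange(len(arm_means))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli reward
        pulls[agent] = (arm, reward)
    obs = defaultdict(list)
    for agent, neighbors in adjacency.items():
        obs[agent].append(pulls[agent])   # agent's own sample
        for nb in neighbors:              # neighbors' samples, seen instantaneously
            obs[agent].append(pulls[nb])
    return obs

# Star network: agent 0 is the hub connected to leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(0)
obs = simulate_round(star, [0.3, 0.5, 0.9], rng)
```

In a star, the hub collects five samples per round (its own plus four leaves') while each leaf collects two, which is why star motifs are attractive for sample sharing.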
In the EL-based policies, the EL aggregates samples from the agents at the end of the budget and declares the best arm. The objective is to design arm selection policies for each agent (and a suitable aggregation policy for the EL) that minimize the probability of incorrect arm identification. We propose two collaborative learning policies that first partition the network of agents into dominating sets. Within each star partition, we then employ either (i) a highly exploring variant of UCB-E or (ii) a 'Follow Your Leader' policy, and at the end of the budget the EL aggregates the samples from all partitions to select the best arm.
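A minimal sketch of how a network might be partitioned into stars whose centers form a dominating set. This greedy heuristic is an assumption for illustration; as noted below, the policies do not require a *minimum* dominating set, so a simple greedy partition like this suffices.

```python
def greedy_star_partition(adjacency):
    """Greedily partition the vertices into stars. Each star's center is
    chosen to cover the most still-uncovered vertices; the centers together
    form a dominating set (not necessarily a minimum one, which would be
    NP-hard to find)."""
    uncovered = set(adjacency)
    stars = {}
    while uncovered:
        # Pick the uncovered vertex that covers the most uncovered vertices.
        center = max(
            uncovered,
            key=lambda v: 1 + sum(1 for u in adjacency[v] if u in uncovered),
        )
        leaves = [u for u in adjacency[center] if u in uncovered]
        stars[center] = leaves
        uncovered -= {center, *leaves}
    return stars

# Example: a 6-cycle is partitioned into a few stars covering every vertex.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
stars = greedy_star_partition(cycle)
covered = set(stars) | {u for leaves in stars.values() for u in leaves}
```

Each star can then run its own within-star policy (e.g. the UCB-E variant or 'Follow Your Leader'), with the EL aggregating across stars at the end of the budget.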
We also propose voting-based policies that use the EL to estimate the best arm. These fall into two categories: one based on upper confidence bounds, the other on successively rejecting the worst-performing arm.
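To make the successive-rejects flavour concrete, here is a hedged, single-machine sketch: each agent independently runs a simplified successive-rejects routine (equal-length phases rather than the usual logarithmic schedule), and a hypothetical EL then takes a majority vote over the agents' recommendations. The arm means, budget split, and voting rule are assumptions for this sketch, not the paper's exact policy.

```python
import random
from collections import Counter

def successive_rejects(arm_means, budget, rng):
    """Simplified successive rejects: in each phase, sample every active
    arm equally, then eliminate the arm with the worst empirical mean.
    The last surviving arm is the agent's recommendation."""
    k = len(arm_means)
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    pulls_per_phase = max(1, budget // (k * (k - 1)))  # equal-phase simplification
    while len(active) > 1:
        for arm in active:
            for _ in range(pulls_per_phase):
                reward = 1.0 if rng.random() < arm_means[arm] else 0.0
                counts[arm] += 1
                sums[arm] += reward
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]

def vote_best_arm(arm_means, n_agents, budget, seed=0):
    """Hypothetical EL aggregation: majority vote over agents' picks."""
    rng = random.Random(seed)
    votes = Counter(
        successive_rejects(arm_means, budget, rng) for _ in range(n_agents)
    )
    return votes.most_common(1)[0][0]

best = vote_best_arm([0.2, 0.4, 0.8], n_agents=11, budget=300)
```

Voting concentrates the per-agent success probability: even if a single agent errs with small probability, a majority of agents erring simultaneously is exponentially less likely in the number of agents.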
Our policies enjoy a probability of error that decays exponentially in both the number of agents and the budget. Notably, the policies based on dominating-set partitions do not require a minimum dominating set of the graph (finding one is an NP-hard problem), and our probability-of-error bounds remain independent of the size of the dominating-set partition.

Speakers

Mr. Amit Anand (EE19S002)

Electrical Engineering