Distributed Competitive Decision Making Using Multi-Armed Bandit Algorithms

Abstract

This paper tackles the problem of Opportunistic Spectrum Access (OSA) in Cognitive Radio (CR) networks. The main challenge for a Secondary User (SU) in OSA is to learn the availability of the existing channels in order to select and access the one with the highest vacancy probability. To reach this goal, we propose a novel Multi-Armed Bandit (MAB) algorithm, called \(\epsilon\)-UCB, that enhances the spectrum learning of a SU and decreases the regret, i.e. the reward lost by selecting suboptimal channels. We corroborate with simulations that the regret of the proposed algorithm has a logarithmic behavior: within a finite number of time slots, the SU can estimate the vacancy probabilities of the targeted channels and select the best one for transmission. We then extend \(\epsilon\)-UCB to consider multiple priority users, where a SU can selfishly estimate and access the channels according to its priority rank. The simulation results show the superiority of the proposed algorithms in both single-user and multi-user cases compared to existing MAB algorithms.
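
For reference, the logarithmic-regret claim can be read against the usual stochastic-MAB notion of regret; a sketch of that definition, assuming channel i is vacant with probability \(\mu_i\) and \(\mu_1 = \max_i \mu_i\):

$$\begin{aligned} R(n) = n \mu _1 - E\left[ \sum _{t=1}^{n} S_{a_t}(t) \right] \end{aligned}$$

where \(a_t\) is the channel selected at slot t and \(S_i(t) \in \{0,1\}\) is the availability of channel i at slot t (see Note 2). Logarithmic regret then means \(R(n) = O(\ln n)\).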


Availability of Data and Materials

The authors declare that all the data and materials in this manuscript are available from the authors.

Notes

  1. A SU in the context of OSA can be considered as an agent in the classic MAB problem, and the frequency channels become equivalent to the different arms (see the illustrative sketch after these notes).

  2. The variable \(S_i(t)\) may also represent the reward of the ith channel at slot t.

  3. According to [24], the Chernoff–Hoeffding bound can be stated as follows: let \(X_1,\ldots ,X_n\) be independent random variables in \([0, 1]\) with \(E[X_t]= \mu\), and let \(S_n = \sum _{i=1}^{n} X_i\). Then, for all \(a\ge 0\), \(P\{S_n \ge n \mu +a\} \le e^{-2 a^2/n}\) and \(P\{S_n \le n \mu -a\} \le e^{-2 a^2/n}\).
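
To make the mapping in Note 1 concrete, below is a minimal, illustrative Python sketch of a UCB-style channel-selection loop using the index form \(B_i = X_i + \sqrt{\alpha \ln (t)/T_i}\) from Appendix 1. The \(\epsilon_t = H/t\) random-exploration schedule is only an assumption suggested by Eq. (37), and the vacancy probabilities and parameter values are made up for illustration; this is a sketch, not the authors' exact \(\epsilon\)-UCB specification.

import math
import random

# Illustrative sketch only: a UCB-style index policy for OSA channel
# selection, using the index B_i = X_i + sqrt(alpha * ln(t) / T_i)
# from Appendix 1. The epsilon_t = H / t exploration schedule and all
# parameter values are assumptions, not the authors' exact epsilon-UCB.

def simulate(mu, n_slots=10000, alpha=2.0, H=10.0, seed=0):
    rng = random.Random(seed)
    K = len(mu)            # number of channels (arms)
    counts = [0] * K       # T_i: number of times channel i was sensed
    means = [0.0] * K      # X_i: empirical vacancy probability estimates
    reward = 0.0
    for t in range(1, n_slots + 1):
        if t <= K:                                # sense each channel once
            i = t - 1
        elif rng.random() < min(1.0, H / t):      # decaying random exploration
            i = rng.randrange(K)
        else:                                     # pick the highest UCB index
            i = max(range(K), key=lambda j: means[j]
                    + math.sqrt(alpha * math.log(t) / counts[j]))
        x = 1.0 if rng.random() < mu[i] else 0.0  # 1 = channel found vacant
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]    # incremental mean update
        reward += x
    return max(mu) * n_slots - reward             # empirical regret

print(simulate([0.9, 0.7, 0.5, 0.3]))             # regret should grow ~ log(n)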

References

  1. Marcus, M., Burtle, C. J., Franca, B., Lahjouji, A., & McNeil, N. (2002). Federal Communications Commission (FCC): Spectrum Policy Task Force. ET Docket, 02-135.

  2. Watkins, C. (1989). Learning from delayed rewards. Cambridge: University of Cambridge.

  3. Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.

  4. Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.

  5. Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48–77.

  6. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.

  7. Burtini, G., Loeppky, J., & Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit, arXiv preprint arXiv:1510.00757.

  8. Kaufmann, E., Cappé, O., & Garivier, A. (2012). On Bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, La Palma, Canary Islands.

  9. Maillard, O., Munos, R., & Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback–Leibler divergences. In Annual conference on learning theory, Budapest, Hungary.

  10. Scott, S. (2010). A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26, 639–658.

  11. Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, Granada, Spain.

  12. Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory, Lyon, France.

  13. Agrawal, S., & Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Artificial intelligence and statistics, Scottsdale, USA.

  14. Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit. In Annual conference on learning theory, Edinburgh, Scotland.

  15. Gai, Y., & Krishnamachari, B. (2011). Decentralized online learning algorithms for opportunistic spectrum access. In Global communications conference, Texas, USA.

  16. Torabi, N., Rostamzadeh, K., & Leung, V. C. (2012). Rank-optimal channel selection strategy in cognitive networks. In Global communications conference, California, USA.

  17. Rosenski, J., Shamir, O., & Szlak, L. (2016). Multi-player bandits-a musical chairs approach. In International conference on machine learning, New York, USA.

  18. Avner, O., & Mannor, S. (2014). Concurrent bandit and cognitive radio networks. In European conference on machine learning and principles and practice of knowledge discovery in databases, Nancy, France.

  19. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). Distributed algorithm to learn OSA channels availability and enhance the transmission rate of secondary users. In International symposium on communications and information technologies, HoChiMinh, Vietnam.

  20. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2018). Opportunistic spectrum access in cognitive radio for tactical network. In European conference on electrical engineering and computer science, Bern, Switzerland.

  21. Modi, N., Mary, P., & Moy, C. (2017). QoS driven channel selection algorithm for cognitive radio network: Multi-user multi-armed bandit approach. IEEE Transactions on Cognitive Communications and Networking, 3, 1–6.

  22. Tekin, C., & Liu, M. (2011). Online learning in opportunistic spectrum access: A restless bandit approach. In International conference on computer communications, Shanghai, China.

  23. Cauchy, A. (1889). Sur la convergence des séries. Œuvres complètes, Sér. 2, 2, 267–279.

  24. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.

  25. Almasri, M., Mansour, A., Moy, C., Assoum, A., Osswald, C., & Lejeune, D. (2019). All-powerful learning algorithm for the priority access in cognitive network. In European signal processing conference, A Coruña, Spain.

Author information

Contributions

All authors contributed to the analytical and numerical results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mahmoud Almasri.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

In this appendix, we derive an upper bound on the term \(\mathbb {A}= \epsilon _t \times Prob\) in \(\epsilon\)-UCB, where Prob can be expressed as follows:

$$\begin{aligned} Prob \le P\big \{B_i(t-1, T_i(t-1)) \ge B_1(t-1, T_1(t-1)); T_i(t-1)\ge l \big \} \end{aligned}$$

The index \(B_i(t, T_i(t))\) of the ith channel is the sum of the exploitation factor \(X_i(T_i(t))\), i.e. the empirical mean reward of channel i, and the exploration factor \(A_i(t, T_i(t))\):

$$\begin{aligned} B_i(t, T_i(t))= X_i(T_i(t))+ A_i(t, T_i(t)) \end{aligned}$$
(28)
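
For readability: \(X_i(T_i(t))\) denotes the empirical mean of the rewards \(S_i(\cdot)\) collected over the \(T_i(t)\) slots in which channel i was sensed (cf. Note 2; this reading is ours, as the appendix does not restate it), while the exploration factor takes the form used in the derivation below:

$$\begin{aligned} A_i(t, T_i(t)) = \sqrt{\frac{\alpha \ln (t)}{T_i(t)}} \end{aligned}$$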

Then, we obtain:

$$\begin{aligned}&Prob \le P\bigg \{X_1({T_1(t-1)}) + A_1(t-1, T_1(t-1)) \le \nonumber \\&\quad X_i({T_i(t-1)}) + A_i(t-1, T_i(t-1)) \text { and } T_i(t-1)\ge l \bigg \} \end{aligned}$$
(29)

By taking the minimum of \(X_1({S_1}) + A_1(t, S_1)\) over the possible values \(0<S_1<t\) of \(T_1(t-1)\), and the maximum of \(X_i({S_i}) + A_i(t, S_i)\) over \(l\le S_i<t\), we can upper bound Prob as follows:

$$\begin{aligned} Prob \le P\Bigg \{ \min _{0<S_1<t} \bigg [ X_1({S_1}) + A_1(t,S_1) \bigg ] \le \max _{l\le S_i<t} \bigg [ X_i({S_i}) + A_i(t,S_i) \bigg ] \Bigg \} \end{aligned}$$
(30)

where \(S_i \ge l\) fulfills the condition \(T_i(t-1) \ge l\). Applying a union bound over the possible values of \(S_1\) and \(S_i\), we obtain:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} P\Bigg \{X_1({S_1}) + A_1(t,S_1) < X_i(S_i)+A_i(t,S_i) \Bigg \} \end{aligned}$$
(31)

Each probability in the above sum can be further bounded, since the event \(\{X_1(S_1) + A_1(t,S_1) < X_i(S_i)+A_i(t,S_i)\}\) implies at least one of the three events below:

$$\begin{aligned}&Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} P\Big \{ X_1({S_1})+A_1({t,S_1}) \le \mu _1 \Big \} \nonumber \\&\qquad + P\Big \{\mu _1<\mu _i+2 A_i({t,S_i})\Big \} \nonumber \\&\qquad + P\Big \{ X_i(S_i)+ A_i({t,S_i}) \ge \mu _i+2A_i({t,S_i}) \Big \} \end{aligned}$$
(32)

Using the ceiling operator \(\lceil \cdot \rceil\), let \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\), where \(\Delta _{i} = \mu _1-\mu _i\). Then, for \(S_i\ge l\), the inequality \(\mu _1<\mu _i+2 A_i({t,S_i})\) in Eq. (32) becomes false; indeed:

$$\begin{aligned} \mu _1-\mu _i-2 A_i({t,S_i})&= \mu _1-\mu _i-2\sqrt{\frac{\alpha \ln (t)}{S_i} }\\&\ge \mu _1-\mu _i-2\sqrt{\frac{\alpha \ln (n)}{l} } \\&\ge \mu _1-\mu _i - \Delta _{i} = 0 \end{aligned}$$

Based on Eq. (32), we obtain:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} P\bigg \{X_1({S_1})\le \mu _1 - A_1( t,S_1) \bigg \} + P\bigg \{X_i(S_i)\ge \mu _i + A_i(t,S_i) \bigg \} \end{aligned}$$
(33)

Using the Chernoff–Hoeffding bound (see Note 3) [24], we can show that:

$$\begin{aligned} P\Big \{X_1({S_1}) \le \mu _1-A_1({t,S_1})\Big \}&\le \exp \left( \frac{-2}{S_1}\Big [S_1 \sqrt{\frac{\alpha \ln (t)}{S_1}} \Big ]^2 \right) = \exp \big ( -2\alpha \ln (t) \big ) \nonumber \\&= t^{-2\alpha } \end{aligned}$$
(34)
$$\begin{aligned} P\Big \{X_i(S_i)\ge \mu _i+A_i({t,S_i})\Big \}&\le \exp \left( \frac{-2}{S_i}\Big [S_i \sqrt{\frac{\alpha \ln (t)}{S_i}} \Big ]^2 \right) = \exp \big ( -2\alpha \ln (t) \big ) \nonumber \\&= t^{-2\alpha } \end{aligned}$$
(35)

Combining the two inequalities above with inequality (33), and noting that the double sum contains at most \(t^2\) terms, we obtain:

$$\begin{aligned} Prob \le \sum _{S_1=1}^{t-1} \sum _{S_i=l}^{t-1} 2 t^{-2 \alpha } \le 2 t^{-2\alpha +2} \end{aligned}$$
(36)

Finally, we obtain:

$$\begin{aligned} \mathbb {A} \le \frac{H}{t} \times 2 t^{-2\alpha +2} = 2H \times t^{-2\alpha +1} \end{aligned}$$
(37)
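
As a quick numerical sanity check of the Chernoff–Hoeffding bound (Note 3) underlying Eqs. (34)–(35), the following minimal Python sketch compares the empirical upper-tail frequency with the bound \(e^{-2a^2/n}\); all parameter values are illustrative:

import math
import random

# Numerical sanity check of Hoeffding's bound (Note 3):
# P{S_n >= n*mu + a} <= exp(-2 a^2 / n) for Bernoulli(mu) variables.
# All parameter values below are illustrative.

def upper_tail_frequency(mu=0.6, n=200, a=15.0, trials=20000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < mu)  # realization of S_n
        if s >= n * mu + a:
            hits += 1
    return hits / trials

empirical = upper_tail_frequency()
bound = math.exp(-2 * 15.0 ** 2 / 200)  # exp(-2 a^2 / n), about 0.105
print(f"empirical tail: {empirical:.4f} <= bound: {bound:.4f}")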

Appendix 2

In this appendix, we derive an upper bound on the term Z, which contributes to the regret upper bound of \(\epsilon\)-UCB:

$$\begin{aligned} Z= P\big \{X_1(T_1(t-1)) \le a;\ T_i(t-1)\ge l \big \} \end{aligned}$$
(38)

where a is a constant chosen as \(a= \frac{\mu _1 + \mu _i}{2}= \mu _1-\frac{\Delta _i}{2}= \mu _i+\frac{\Delta _i}{2}\), with \(\Delta _i=\mu _1-\mu _i\). After the learning period, where \(T_i(t-1)\ge l\), we have \(T_1(t-1)\gg T_i(t-1)\). Then Z can be upper bounded by:

$$\begin{aligned} Z \le&\ P\big \{X_1(T_1(t-1)) \le a;\ T_1(t-1)\ge l \big \}\nonumber \\ \le&\ \sum _{z=l}^{n} P\big \{X_1(T_1(t-1)) \le \mu _1-\frac{\Delta _i}{2};\ T_1(t-1) = z \big \} \nonumber \\ \le&\ \sum _{z=l}^{n} P\big \{X_1(z) \le \mu _1-\frac{\Delta _i}{2}\big \} \end{aligned}$$
(39)

Using the Chernoff–Hoeffding bound (see Note 3) [24], we can upper bound the above sum as follows:

$$\begin{aligned} Z \le \sum _{z=l}^{n} \exp \left( - \frac{2 \Delta _i^2 z^2}{4z} \right) = \sum _{z=l}^{n} \exp \left( - \frac{z \Delta _i^2}{2} \right) \le n \exp \left( \frac{- l\Delta _i^2}{2} \right) \end{aligned}$$
(40)

According to the proof provided in Appendix 1, we have \(l=\lceil \frac{4 \alpha \ln (n)}{\Delta _{i}^2}\rceil\) where \(\alpha =2\). So, we obtain:

$$\begin{aligned} Z \le n \exp \big ( -4\ln (n) \big ) = \frac{1}{n^3} \end{aligned}$$
(41)
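
Making the last substitution explicit: since \(l \ge \frac{4\alpha \ln (n)}{\Delta _i^2}\),

$$\begin{aligned} \frac{l \Delta _i^2}{2} \ge \frac{4\alpha \ln (n)}{\Delta _i^2} \cdot \frac{\Delta _i^2}{2} = 2\alpha \ln (n) = 4\ln (n) \quad \text{for } \alpha = 2, \end{aligned}$$

hence \(Z \le n\, e^{-4\ln (n)} = n^{-3}\).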

About this article

Cite this article

Almasri, M., Mansour, A., Moy, C. et al. Distributed Competitive Decision Making Using Multi-Armed Bandit Algorithms. Wireless Pers Commun 118, 1165–1188 (2021). https://doi.org/10.1007/s11277-020-08064-w
