I am relatively new to LaTeX and struggle to combine different packages. I am trying to wrap a figure within my text, and I also want the caption to sit at the side of the figure (a side caption) instead of below it. My current approach looks as follows; the red arrows show what I would like to improve:
To wrap the image within the text, I am using the wrapfig package as follows:
\begin{wrapfigure}{R}{0.5\textwidth}
\vspace{-25pt}
\centering
\includegraphics[scale=0.41]{figures/transformer3.png}
\caption[Illustration of multiheaded attention]{Illustration of multiheaded attention. The two highlighted attention heads have learned to associate \textit{"it"} with different parts of the sentence.}
\label{fig:transformer3}
\end{wrapfigure}
\noindent The projections are parameter matrices $\bm{W}_{i}^{Q} \in \mathbb{R}^{d \times d_q}$, $\bm{W}_{i}^{K} \in \mathbb{R}^{d \times d_k}$, $\bm{W}_{i}^{V} \in \mathbb{R}^{d \times d_v}$ and $\bm{W}^{O} \in \mathbb{R}^{hd_v \times d}$. By applying multiple attention heads, the model is allowed to jointly attend to information at different positions within the input sequence. In figure \ref{fig:transformer3} for example, the orange attention head associates \textit{“it”} with \textit{“The animal”}, while the green attention head has learned an association to “tired”.
\subsubsection*{Outlook on the Empirical Studies}
While the U-Net and the stacked hourglass are already well-established architectures in the CV domain, Transformers have so far mainly been applied to NLP problems. However, there is a strong belief within the deep learning community that Transformers may represent a suitable architecture for CV tasks as well. For this reason, the empirical study will investigate recent approaches for applying self-attention-based networks to images. The concepts will then be implemented in a neural network that will be trained on a CV task. Finally, the performance will be evaluated against models that instead rely on the U-Net and the stacked hourglass.
For side captions, I read that the floatrow package should be useful. However, when I try to combine it with wrapfig, I get compilation errors. I also found an introductory usage example here. I can reproduce it, but I struggle to align the result with my text appropriately. Can anyone help me out here? Thanks a lot!
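For reference, the layout I am after can be roughly approximated without floatrow by putting two minipages side by side inside the wrapfigure, one for the image and one for the caption. This is only a minimal sketch under assumptions, not a floatrow-based solution: the split widths `0.58\linewidth` and `0.38\linewidth` are arbitrary placeholders, `example-image` is the dummy image from the mwe package (assumed to be installed) standing in for `figures/transformer3.png`, and `lipsum` only supplies filler text.

```latex
\documentclass{article}
\usepackage{graphicx} % \includegraphics
\usepackage{wrapfig}  % wrapfigure environment
\usepackage{lipsum}   % filler text, for illustration only

\begin{document}

\begin{wrapfigure}{R}{0.5\textwidth}
  % Left part: the image, scaled to the width of its minipage
  \begin{minipage}[c]{0.58\linewidth}
    \includegraphics[width=\linewidth]{example-image} % placeholder image
  \end{minipage}\hfill
  % Right part: the caption, vertically centered next to the image
  \begin{minipage}[c]{0.38\linewidth}
    \caption[Illustration of multiheaded attention]{Illustration of
      multiheaded attention.}
    \label{fig:transformer3}
  \end{minipage}
\end{wrapfigure}

\noindent\lipsum[1-2]

\end{document}
```

With the caption in such a narrow column, long captions may produce awkward line breaks, so the width split would probably need tuning for the real figure.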


