In the answer I'm going to start from the simplest ideal 2D airfoil to the real complete 3D wing, adding one brick at a time. In my opinion this is the best way to introduce the induced drag since induced drag is basically a mathematically well defined drag term that appears only going from 2D to 3D. I'm not going to use any equations here, just a linear-growing explanation of why and how the concept of induced drag was born.
At the end I will also discuss some of the existing "simple and appealing" explanations and I will explain where the name comes from.
Why do we care about a not-in-reality-existing 2D airfoil? Well because it is the basic building block of any wing. Any wing is built putting airfoils near each other and therefore the behaviour of the whole wing cannot be truly understood until the behaviour of the single airfoil composing it is not understood. Wings of several aircrafts are built around some more or less standard 2D airfoil: just to name a few, F-22 uses NACA 64A005.29, F-16 NACA 64A204, Agusta A109 NACA 23012, Boeing 747 BAC463, Cessna 172 Skyhawk NACA 2412 and so on. And since the behaviour of a simple 2D airfoil is in practice almost perfectly understood, any complicated CFD simulation or test apparatus is normally first proof tested on a 2D airfoil: if it fails this test then something is wrong and the model must be corrected. So, 2D airfoils are the basis, 2D is real.
Let's start then with the most simple theory that has been developed, the "thin airfoil theory”. Even it is based on the simplification of a bi-dimensional (aka wing of infinite span) inviscid and incompressible flow, it gives a surprisingly useful insight to the behaviour of an aerodynamic body since this theory allows to learn that: a) only the camber of the airfoil counts in the generation of lift and moment; b) the lift coefficient has the value $C_l = 2\pi\alpha$, where $\alpha$ is the angle of incidence; c) the coefficient of moment is constant in respect to the point lying one quarter of the chord behind the leading edge (quarter-chord point) and d) yes! in this case Bernoulli does work and it is used to calculate the just given equation for $C_l$ (but that’s another matter).

Airfoil nomenclature. Source: https://commons.wikimedia.org/wiki/User:Ariadacapo#/media/File:Chord_length_definition_(en).svg
Nice thing the “thin airfoil theory” but... what about the drag? “It seems to me that the theory, developed in all possible rigor, gives, at least in several cases, a strictly vanishing resistance, a singular paradox which I leave to future mathematicians to elucidate”, Jean le Rond d'Alembert. He was obviously right, so right that this contradiction is named after him.
Anyway, even without being committed mathematicians, the answer is simple: since the thin-airfoil theory neglects: a) the 3$^{rd}$ dimension; b) the compressibility and c) the viscosity of the fluid, then the culprit is one of these three.
So: a) cannot be since also 2D bodies generate drag; b) compressibility becomes important only for speed bigger than Mach 0.3 but drag exist also for very low-speed flows, thus it cannot be; then the winner is: c) the viscosity!

Boundary layer around a typical airfoil. Source: https://www.researchgate.net/publication/299486909_Design_of_Diver_Propulsion_Vehicle_Ganendra_RI-1_Using_SolidWorks_Flow_Simulation
Unfortunately viscosity makes the simple equations of the thin-airfoil theory no more solvable per hand and therefore either CFD simulations or, before computer age, extensive wind tunnel tests has (had) to be performed. In particular, wind tunnel tests allow(ed) to learn that viscosity generates a "boundary layer" which interacts with the flow, leading to: a) viscous drag; b) stall at high angle of attack; c) importance of the distribution of the thickness around the camber and d) more generally: non-linearity in all aerodynamic actions (lift, drag and moment - except maybe at small angles of attack). Wind tunnel tests provide(d) also a common basis of comparison among the various airfoil’s geometries: a paper like “Summary of airfoil data” by Abbott, Von Doenhoff and Stivers, published in the 1945, is still today a must-read.
And what about if we go faster? Then also compressibility becomes important with all that this entails, like the formation of the wave drag.
This one is actually "one step forward and two back". The step forward is obviously going from 2D to 3D i.e. from infinite to finite wing; the steps backward are the elimination of viscosity and compressibility, again to make the mathematical model easy enough to be solvable by hand. Indeed, the same mathematical instruments used for the thin-airfoil theory are now still valid except for one essential difference: in 2D, the wake is a line (in a mathematical sense i.e. it possesses no thickness) and doesn't contribute to the aerodynamic actions; however in 3D, as depicted in the following picture, all these wake lines shed from the trailing edge build a surface (called “vortex sheet”) which now does contribute to the generation of the aerodynamic forces:

Vortex sheet wing model. Source: https://www.researchgate.net/publication/344467888_Aerodynamics_and_Control_Aspects_of_Formation_Flight_for_Induced_Drag_Savings
Unfortunately the flow on each one of this line possesses a very "chaotic" velocity which, again, complicates too much the equations. Some other simplification was therefore needed to make the problem solvable again. The simplification found consists in considering a wing with a span much bigger than the chord, i.e. a wing with an high aspect ratio AR. Why that? When the AR is big enough, it exist an “optimal distance” behind the wing which is: a) far enough from the wing so that the flow has lost most of this "chaotic" behaviour (so called “far-field”); but b) not so far that the shape of the wake has become too much different from the starting shape it had by the trailing edge (so called “near-field”). The following picture nicely depicts these three region of near-field, far-field and intermediate (optimal distance) field:

A NACA0012 wing with the near, intermediate and far-field regions.
Source: https://www.researchgate.net/publication/322284435_A_parallel_stability_analysis_of_a_trailing_vortexA_wake
Within the intermediate field (the “optimal distance” we were looking for) a plane is chosen to do sums, which is called “Trefftz plane”:

The Trefftz plane lying within the “intermediate field”. Source: How do wing tip vortices that interact with the wing surface affect drag?
Why do we choose a plane just there? Because in the Trefftz plane the whole wake becomes just a thin line (in red in the picture) having the same shape of the trailing edge of the wing. So, if we do the math on the Trefftz plane, the whole messy wake can be substituted by a geometrically known thin line: that's a smart choice!
So, on the Trefftz plane the equations become again solvable by hand and allow to get some very useful results: a) moving along the wingspan, the wing can be seen as a sequence of 2D airfoil (as in Step 1 of this answer) each one generating its own portion of lift; this is called "spanwise lift distribution"; b) everywhere the lift is different between an airfoil and the adjacent one, a vortex is shed perpendicular to the trailing edge and contributes to the vortex sheet; c) well behind the wing (far-field), the vortex sheet rolls up into two big vortices that eventually dissipate due to the viscosity of the air; the dissipation could take some minutes and dictates an adequate separation between successive aircraft; d) the aerodynamic force now possesses both a lift and a drag; e) the drag reaches a minimum when the spanwise lift distribution is elliptical; in this case it exists a simple mathematical relation between lift and drag coefficients which is $C_{d_i} = C^2_l/\pi AR$; if the spanwise distribution of lift is not elliptical, this expression becomes $C_{d_i} = C^2_l/\pi ARe$ where $e$ < 1; f) yes, also in this case Bernoulli does work and it is used to get the previous equation.
The following picture nicely depicts the points a), b) and c):

The spanwise lift distribution causes the shedding of a vortex sheet behind the aircraft. Source: https://upload.wikimedia.org/wikipedia/commons/1/18/Aircraft_wing_lift_distribution_showing_trailing_vortices_%283%29.svg
So, do we still need today, in times of realistic CFD simulations, all this blabla on 2D airfoils and wing with big AR? Isn't it enough to launch a computer simulation to get all the information that we need? Well, yes but! if the induced drag needs to be evaluated then there's nothing else to do but define a Trefftz plane behind the wing and make calculation on it. Why that? Because this is how the induced drag is defined: you get a plane in the intermediate/optimal field and you do the math on it. The drag jumping out from this calculation is the induced drag.
Hopefully it was not too difficult to follow, comments are obviously welcome.
Bonus material, in no order whatsoever
- A misunderstanding about the relation $C_{d_i} - C_l$:
when all the simplifications listed by “Step 3: ideal 3D” have been put into practice, the aerodynamic force can be finally calculated and decomposed in its two components lift and drag; when afterwards the lift is made equal to the drag, then the much (in)famous relation $C_{d_i} = C^2_l/\pi AR$ is obtained. This relation only implies that lift and drag have been generated by the same mathematical (and physical) process, nothing more, nothing less; it definitely does not imply that it exists a lift-generated or a lift-induced drag; actually that relation could (should?) be written $C_l = \sqrt{C_{d_i}\pi AR}$: would that imply the existence of a "drag-generated" lift? Obviously not, otherwise the paradox would be created that having a lot of drag is unavoidable to have a lot of lift.
- A second misunderstanding about the relation $C_{d_i} - C_l$:
On closer inspection it is clear that the relation actually involves coefficients: that implies that more lift (lift, not coefficient of lift) might actually "induce" less drag.
- A general misunderstanding about “lift-generates-drag”:
To realise that the association “lift-generates-drag” is in general not correct, it is enough to consider a 2D airfoil (as in Step 2): it generates plenty of lift without any lift-related drag (only viscosity- and compressibility-related drag)
From “Step 3: ideal 3D” it should be now finally clear what induced drag truly is: it is a kind of aerodynamic drag which appears when the simplification of infinite wing is dropped. That’s all. It has nothing to do with the amount of lift generated, only with the fact that the wing somewhere ends and the lift becomes null there. It could actually even be said that the induced drag is not the drag due to lift but the drag due to the lack of lift!
- On the naming “induced” drag:
The naming is misleading because it suggests the controversial definition of “drag due to lift”. Actually, as we have seen in “Step 3: ideal 3D”, there is indeed an "inducing" phenomenon involved in the process: the spanwise lift distribution inducing the vortices in the vortex sheet. This is where the name comes from! So, the definition might become correct if expanded into: “drag due to the spanwise variation of lift” (which, again, is a consequence of the fact that the wing somewhere ends and the lift becomes null there)
- Again on the relation between lift and drag:
An aerodynamic body moving in a flow generates an aerodynamic force, period. The decomposition of this aerodynamic force in lift and drag is just a pure convention. A totally different convention is used for example in the helicopter world where the aerodynamic force is decomposed in a component parallel to the rotor shaft, positive upward and one perpendicular to it, positive against the sense of rotation. The sum of the former gives the thrust produced by the rotor while the sum of the latter gives the torque that the engine has to generate to keep the rotor spinning. So, no lift nor drag in helicopters. Now, that a relation between these two terms (lift/drag or thrust/torque) always exists is just because they are the two sides of the exact same coin. They must be related because they both pop up from exactly the same phenomenon: you have an aerodynamic force and you split it in two parts... these two parts are obviously related since they both arise from the same thing.
- On the "tip vortices" model
This model is, among the plethora of explanations, the less wrong. Anyway, as already seen, tip vortices form backward but they are so far from the wing (in the far-field) that they cannot influence the wing anymore (but they do influence another airplane flying into them). The right picture for this model would be therefore the following one, which better depicts the vortex sheet:

A more correct representation of the "tip" vortices. Source: https://cdn-images-1.medium.com/max/1600/0\*2gL-QOXqx51qNdAE.jpg
- On the "momentum" aka "energy" aka "pushing air down" model
Even if the physics behind this model might be correct, I personally find this model way too generic: you take some random quantity of air, you randomly push the air down, you associate this pushing down to lift, but also to drag, you mix everything and you get $D = {blabla} L^2$. And in some versions of this model that's not the end of the story: since in the (in)famous relation a $\pi$ and the aspect ratio AR make their appearance, the random quantity of air is made "less random" using a tube of air with the same diameter of the wing: choosing a tube of air as big as the wing, we get in one shot the aspect ratio AR and the $\pi$ in the equation. But then, why a tube exactly? What if we had chosen a parallelepiped instead? Isn't a parallelepiped better? Maybe with a width a bit bigger of the wingspan to get also some tip vortex? And why not a tube which invest the wing from above? You also get the $\pi$ and the AR. Or even better: why not a downward angled tube? Seen from the front it looks like an oval i.e.an ellipsoid and this could be related to the fact that an elliptical lift distribution is the one giving the lowest induced drag... bang! we've got $\pi$, AR and elliptical lift distribution: three birds with one stone. Whatever. Furthermore, the obtained equation relates $D$ with $L^2$ which is wrong since the actual relation is between ${C^2}_l$ and $C_{d_i}$.
- On the "lift is tilted backward" model
As said, the aerodynamic force is customarily split in 1) a component parallel to the airflow, called drag and 2) a component perpendicular to the airflow, called lift. As also said, this is just an arbitrary and convenient choice that, for example, is definitely not useful for rotors analysis where it is preferred to split the aerodynamic force in respect to the orientation of the rotor's shaft. So, what does it actually mean that the lift is tilted backward? It simply means that... well it is not lift what we are looking at! Lift cannot be tilted backward, per definition: lift is perpendicular to the flow, that's it! If it's not perpendicular to the flow then it is not lift. A better explanation would be to say that, due to the downwash, the wing sees an airflow which is now the composition of $V_{flight}$ (mainly horizontal) and $V_{induced}$ (mainly vertical) and therefore this composed velocity has a lower $\alpha$ in respect to the wing. Therefore the aerodynamic force (and not the lift) is more backward tilted and its decomposition gives, in turn, more drag and less lift. This representation, albeit simple and effective, is anyway not really correct because it implies that the downwash in the wake acts on the wing. But the wake is not on the wing rather behind it, so this model supposes that what happens in the downstream (wake) influences what lies in the upstream (wing), which is also not totally correct.