The coordinates of the screen corners in the cited write-up do essentially “come out of nowhere.” Camera and screen placement are fundamental variables in setting up the view. Their relative placement determines, among other things, the field of view and zoom factor (focal length).
In practice, it might be easier to work backwards by first deciding on the camera’s line of sight. This establishes a line in your scene, which you can parameterize as $P+\lambda N$. $P$ is any point on the line (perhaps the camera’s position $C$, the midpoint of the screen $M$, or some other convenient reference point). $N$ is a unit vector parallel to the line that establishes its direction—the same normal (perpendicular) vector to the screen that’s calculated in step 6. To be consistent with the described method, it should point away from the scene, i.e., from the screen to the camera. You can always flip the signs of its components if you discover that it’s pointing the wrong way. Since $N$ has unit length, the parameter $\lambda$ measures signed distance from $P$ along this line.
Choose a point $M$ along this line to be the midpoint of the screen. How you do that is up to you. An equation of the plane through $M$ that’s perpendicular to the line of sight is $N\cdot((x,y,z)-M)=0$, which expands into $a_n(x-m_x)+b_n(y-m_y)+c_n(z-m_z)=0$. Now, you’ve got another choice to make: you have to specify a direction in the scene that corresponds to “up” on the screen. However you do this, you’ll end up with a vector $U$ that’s parallel to this plane with a direction that corresponds to “up” on screen. (One way is to locate a point $U'$ on the plane that’s directly “up” from $M$ and set $U=U'-M$.) Finding the corner points is simple from there.
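One way to obtain such a $U$ in code, a different route from the $U'$ construction but with the same result, is to start from a rough "up" hint and remove its component along $N$. The following is only a sketch under that assumption; the names `screen_up` and `up_hint` are invented here for illustration and do not come from the cited write-up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return tuple(x / length for x in v)

def screen_up(n, up_hint):
    """Return a unit vector U lying in the plane perpendicular to n.

    Assumption: up_hint is a rough 'up' direction that is not parallel to n.
    """
    n = normalize(n)
    k = dot(up_hint, n)                      # component of the hint along N
    u = tuple(h - k * c for h, c in zip(up_hint, n))
    return normalize(u)                      # raises if the hint is parallel to n

N = (0.0, 0.0, 1.0)                          # from the screen toward the camera
U = screen_up(N, (0.0, 1.0, 0.5))            # hint tilted slightly toward the camera
```

By construction $U\cdot N=0$, so $U$ is parallel to the screen plane as required.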
$S_1$ is the top-left corner of the screen, so it lies half the screen height above $M$, which gives $M+\frac H2{U\over\|U\|}$, but we also need to go left from $M$ by half the screen width. For that, we need a vector that represents right or left on the screen. We’ll go with right, since that’s where we need to go for $S_2$, anyway. Such a vector is given by the cross product $R=U\times N$; in a right-handed coordinate system it points to the right as seen from the camera, because $N$ points from the screen toward the camera. We’ve now got all of the necessary pieces to place the three corners of the screen: $$\begin{align}S_1 &= M + \frac H2{U\over\|U\|} - \frac W2{R\over\|R\|} \\ S_2 &= M+\frac H2{U\over\|U\|}+\frac W2{R\over\|R\|} \\ S_3 &= M-\frac H2{U\over\|U\|}-\frac W2{R\over\|R\|}.\end{align}$$ Note, by the way, that ${\overrightarrow{S_1S_2}\over\|\overrightarrow{S_1S_2}\|}={R\over\|R\|}$ and ${\overrightarrow{S_1S_3}\over\|\overrightarrow{S_1S_3}\|}=-{U\over\|U\|}$, so you’ve already got some of the values that you’ll need for computing the projection. That is to say, you don’t really need to compute $S_2$ and $S_3$ here.
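The three corner formulas translate directly into code. This is a minimal sketch, assuming a right-handed coordinate system and plain tuples for vectors; the function name `screen_corners` and the sample values are made up for illustration.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return tuple(x / length for x in v)

def screen_corners(M, U, N, W, H):
    """Return (S1, S2, S3): top-left, top-right, bottom-left corners."""
    u = unit(U)                 # "up" on the screen
    r = unit(cross(U, N))       # "right" on the screen, R = U x N

    def corner(up_sign, right_sign):
        return tuple(m + up_sign * (H / 2) * a + right_sign * (W / 2) * b
                     for m, a, b in zip(M, u, r))

    return corner(+1, -1), corner(+1, +1), corner(-1, -1)

# Sample setup: screen centered at (0, 0, -1), facing a camera at the origin.
S1, S2, S3 = screen_corners(M=(0.0, 0.0, -1.0), U=(0.0, 1.0, 0.0),
                            N=(0.0, 0.0, 1.0), W=4.0, H=2.0)
```

With these sample values the corners land at $(\mp2,\pm1,-1)$, matching the camera-coordinate layout with $h=1$, $W=4$, $H=2$.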
Having gone through all of that, I must point out that these calculations and the ones for the projection itself are much simpler if you work in the camera’s coordinate system. Effectively, you “front load” the uglier parts of the computation into the world-to-camera transformation. In this sort of setup, the camera is at the origin, “up” is the positive $y$-axis and the camera points down the negative $z$-axis. The image plane (screen) is then simply $z=-h$ and the screen corners are at $S_1(-W/2,H/2,-h)$, $S_2(W/2,H/2,-h)$ and $S_3(-W/2,-H/2,-h)$. The projection of a point $A(x,y,z)$ in front of the camera (that is, with $z<0$) onto this screen is simply $A'(-hx/z,-hy/z,-h)$, from which $(m,n)$ is easily found to be $\left(\frac W2-\frac hzx,\frac H2+\frac hzy\right)$. Clipping is also simple in this coordinate system. After rejecting points with $z\ge0$, which lie beside or behind the camera, you can either check that $0\le m\le W$ and $0\le n\le H$ (which you can do if working in the world coordinate system, too) or check that $-W/2\le -hx/z\le W/2$ and $-H/2\le -hy/z\le H/2$, which is a lot simpler than computing all of those dot products and lengths in the world coordinate system.
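The camera-space projection and clipping test can be sketched in a few lines. This assumes the point has already been transformed into camera coordinates; the function name `project` is hypothetical, and the explicit rejection of points with $z\ge0$ is an added safeguard, since such points are not in front of the camera and the formulas would otherwise divide by zero or mirror them onto the screen.

```python
def project(point, W, H, h):
    """Map a camera-space point to screen coordinates (m, n), or None if clipped.

    m is measured rightward from the left edge and n downward from the top
    edge, so (0, 0) corresponds to S1 and (W, H) to the bottom-right corner.
    """
    x, y, z = point
    if z >= 0:
        return None             # beside or behind the camera (added assumption)
    m = W / 2 - h * x / z
    n = H / 2 + h * y / z
    if 0 <= m <= W and 0 <= n <= H:
        return (m, n)
    return None                 # projects outside the screen rectangle

center = project((0.0, 0.0, -5.0), W=4.0, H=2.0, h=1.0)
upper_right = project((2.0, 1.0, -2.0), W=4.0, H=2.0, h=1.0)
behind = project((0.0, 0.0, 5.0), W=4.0, H=2.0, h=1.0)
```

A point on the line of sight lands at the screen center $(W/2,H/2)$, while a point up and to the right of it gets a larger $m$ and a smaller $n$, because $n$ increases downward.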