For a long time it bothered me how a large number of people, some of them professional software developers, misunderstand software development practices on a fundamental level. While I recognize that more than a decade of Microsoft dominance and about a decade of dotcoms (partially overlapping) could turn any atmosphere poisonous, it should have been long enough since the last talentless dork decided to "study computers" to earn billions by becoming a "second Bill Gates" or creating a web site that clones the functionality of another web site. Writing things that humans are not supposed to read has become sufficiently unpopular again, so only people genuinely interested in software development would do it, and therefore they would pay some attention to things that are obvious and easy to recognize, right? Apparently not. Like poison that has saturated air, soil and water, worst practices continue to perpetuate themselves even among otherwise sane and educated people, and best ones are either ignored or paid lip service to. At the very least, I believe, it will be useful to mention what is currently missing.

1. Software is a collection of machines that process data, and those machines are implemented in the form of code that is also data, interpreted by hardware and other software. Software development is a form of engineering that deals with software.
While those are not the only possible definitions, they describe the proper place of software among other kinds of technology. If you are a software developer, you are subject to the same standards you would expect from other engineers. If you are an electrical, mechanical or other kind of engineer, working on software exposes you to a different, though similar, form of engineering. If you are a scientist, you will find that whatever clarity of thought is necessary for your primary area of expertise is just as necessary for dealing with software; however, the large and entirely artificial infrastructure built by what are supposed to be engineers makes things much more confusing and difficult to follow.
If you are an artist or any other kind of non-engineer, jumping into software development in any form exposes you to a fundamentally different area of human knowledge, much more complex than anything you have ever dealt with before. And among other new, unusual things, it is brutally unforgiving of any mistakes and misunderstandings. It is not trivial. It did not get "easier to use" except on the most superficial level, one you are unlikely to stay at for long as you use more complex functionality.
It is not subject to interpretation unless you are trying to describe it to someone else -- and your description has a good chance of being worthless or worse if your interpretation is wrong. Your opinion means nothing to it -- you can't convince it to do what you want any more than you can convince your car not to slam into the railing when you foolishly enter a highway exit curve at 90mph.
It's true that one can write software without going through many study courses that are considered mandatory for any other engineer. It's also true that courses usually thought to be specific to Computer Science and Computer Engineering, and therefore necessary for software development, are not in fact mandatory for every developer. However, this is merely a result of the enormous breadth of software, and of the fact that very likely you are working on something where others have already laid some groundwork. Poor or incorrect knowledge has its price, though -- in the most obviously disastrous scenario, you will not notice when this groundwork no longer applies and has to be changed or completely replaced. There are plenty of less obvious ones, too.
However, even if your knowledge can be incomplete without immediately disqualifying you from everything software-related, a missing engineering approach or lack of understanding of engineering culture is guaranteed to turn your work into an unusable mess -- which a large amount of software sadly is. And this is what makes engineering education, and understanding that software development is engineering, important.

2. Software is built by a compiler (or a chain of tools including a compiler) from source code written by humans. Code written in a programming language is both the original form of the software itself and the final expression of its design. Expression of design in any other form is either an illustration or explanation (if it describes existing software) or a specification (if it describes software to be implemented). All decisions made while implementing software are design decisions, because they all modify the primary expression of the software's design.
There was a time when "computer" or "calculator" meant a person performing calculations for others, so those others' work was not constantly interrupted by tedious, mind-numbing procedures. Now the idea of using a living person for such a purpose has become so preposterous that the words themselves changed their meaning to signify the machines that replaced those people, and no one considers going back.
In exactly the same manner, there was a time when "programmers" translated formulas and algorithms into a form usable on ancient (50's-70's) computers. Someone else, usually called an analyst if it was a separate person from the end user, handled the task of creating the algorithms and designing the structure of the program -- something beyond the scope visible to the programmers, whose job was overwhelmed by implementation details. This time is gone, for exactly the same reason human calculators are gone. Development of modern programming languages (for the purpose of this distinction, C is a modern programming language, FORTRAN is borderline) made algorithms and data structures transparent to the person writing a program, thus eliminating the possibility of keeping a distinction between "programmer" and "analyst".
Decisions made by the person writing code now happen on exactly the same level as decisions that would be made by a person describing what the code should do, so keeping a person "writing code" separate from a person "designing software" is either a way to pay less to the person who ends up both designing and implementing it, or a procedure where all work is done twice.
One form of this convergence is what would obviously follow from the explanation above -- whoever writes code has to understand the design, and ends up interpreting and extending it, thus leaving his "master" the function of documentation writer at best, redundant paper-pusher at worst.
The opposite form of this convergence is far worse -- trivial or supposedly trivial operations formerly reserved for "programmers" are elevated to the level of system design, while "design decisions" made by everyone else become entirely fictional brochure-ware that has nothing to do with how the software works. Such design has a tendency to degenerate into something impossible to implement, due to some fundamental inapplicability or internal contradiction that went unnoticed because no one relevant reads those worthless papers. When such a mistake happens, the situation is cemented, as no one is in a position to fix it even if he tried. A company can spend months, years or decades in denial, but in the end the result is determined by the actual decisions made by the people who write code. They could be geniuses who spontaneously developed their own design and maintain a great product under the noses of managers and a marketing department who believe the company produces something completely different, or they could be idiots, constantly scrambling to make thousands of little tweaks every time something or someone shatters the illusion that the product does what marketing claims it does. In both cases the developers who write code are the true masters of the product, and it's the quality and completeness of their work that determines all aspects of the result.

3. Software is composed of components. Components are connected to each other, and to various external entities, by interfaces. Interfaces pass data and trigger its processing. Typical interfaces are protocols, collections of functions (and possibly classes in object-oriented software), data structure definitions, system calls and (with some stretch of the typical usage of the term) interrupts and device registers.
Ordering interfaces from the most universal to those specific to an individual instance, a protocol always comes before functions/APIs, as a protocol is the only interface that can be used between devices sharing absolutely nothing else in common.
I thought people had learned this once they started writing network-accessible applications, and would design them accordingly. Foolish me -- apparently most of them belong to one of two groups:
1. People who followed a set of instructions that makes magic pictures appear in a browser after some code in some magic language was uploaded to the web server and some URL was magically associated with that code. Magic being the key word for their understanding of what exactly is running on the server, what exactly is running on the client, and what data is transferred between the two, and how.
2. People who learned how to use some form (or multiple forms) of RPC, and treat it as another form of API.
The only software developers who never invented their own form of RPC (if not the whole concept independently) are the ones who were told early enough what a terrible idea it is. Let me elaborate on this.
A person who has learned how complex software can be produced by combining function calls and data structures, when he encounters a fundamentally new capability -- such as transfer of data over the network and triggering operations on reception of data -- will understandably try to apply the known processing model to it. Unless that person slept through most of his classes, he will inevitably find the most obvious solution:
Create a set of identifiers (integers, strings, path names within a tree, a combination of any of the above with network addressing primitives) that map to particular functions. Devise a serialization/de-serialization scheme for function arguments and return values (also for class and object identifiers if the interface in question is object-oriented).
Write, or create a procedure to generate, a wrapper for the function on the callee side, called when a request is received over the network, and a stub on the caller side that sends the request over the network, waits for the response and passes the data involved. On top of this, create a dispatcher that routes the requests and responses, or reuse one from an existing protocol.
Congratulations, you can call functions remotely! You and every second-year CS student who made exactly the same design.
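That second-year design fits in a page. Here is a minimal sketch of it in Python -- registry, serialization, dispatcher, caller stub -- with an in-process byte transport standing in for the network; all names and the JSON wire format are illustrative, not any real RPC system:

```python
import json

# Callee side: a registry mapping identifiers to functions.
REGISTRY = {}

def expose(fn):
    """Register a function under its name so it can be called remotely."""
    REGISTRY[fn.__name__] = fn
    return fn

@expose
def add(a, b):
    return a + b

# Dispatcher: de-serialize a request, call the mapped function,
# serialize the reply.
def dispatch(raw_request: bytes) -> bytes:
    req = json.loads(raw_request)
    result = REGISTRY[req["method"]](*req["args"])
    return json.dumps({"result": result}).encode()

# Caller side: a stub that serializes the call, "sends" it over the
# transport, waits for the response, and unpacks the result.
def remote_call(transport, method, *args):
    raw = json.dumps({"method": method, "args": list(args)}).encode()
    reply = transport(raw)  # a real system would send this over a socket
    return json.loads(reply)["result"]

print(remote_call(dispatch, "add", 2, 3))  # prints 5
```

Every real RPC framework is this sketch plus the complications described below; the sketch itself is the easy part.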
There are problems, though. First of all, most likely the functions have side effects. They may be confined to data structures referred to by the arguments, and that data will be inaccessible from other functions and objects, so you will just have to transfer all of it back and forth along with your arguments. Or the data may reside on the callee side, and the caller will have no direct access to it except possibly through accessors that are also remotely-called functions. But then you have to deal with another problem -- simultaneous requests from the same or multiple callers that affect, directly or indirectly, the same data. You need a locking scheme, so on top of waiting for a synchronous protocol, the caller has to wait for locks that he is not aware of but the callee is. Maybe he should be aware, but then you have to propagate the locks to clients through some kind of distributed lock manager.
Or maybe there should be separate views presented to multiple clients, and some mechanism to reconcile them and roll back conflicts. Then clients should be able to process the results of this, thus implementing some form of transactions. You can decide to simplify the task by creating an impenetrable wall between clients, but then you have to know when such an object should be created and deleted. Creating a "session" is easy, but how do you know that the client no longer exists? Then there are all of the same problems, but worse, if you choose to support potentially unreliable protocols such as UDP (which, of course, is what the original RPC developers did).
With all (or even some) of those problems solved, you will have created a large, complex system that does nothing in particular, and is poorly suited for anything you can imagine implementing with it. To be fair, all parts of it make perfect sense. Serialization/deserialization of data is useful. Synchronous requests to objects identified by some addressing scheme are useful. Locking, views, transactions, sessions -- all valid and useful concepts. A combination of all those things, jumbled into a huge ball of some sticky, greasy substance where pieces cannot be separated from each other and used on their own, is nearly useless. The problem is, a system of this kind ignores the nature of a network protocol, and is mostly the result of the simple fact that people learn about functions before they learn about protocols.
A protocol can be described by a finite-state machine. For convenience one can separate the high-level "protocol" state from the low-level "parser" state that handles whatever format or language is used to represent requests and data. For anyone who understands this, it should be clear that some data exchanges are by their nature asynchronous, and this whole massive structure is completely unnecessary for them. That some don't even have an "end" -- for example, logging data or the results of transactions is never supposed to end while the system is running; it's just important to receive the data, process or store it, and possibly confirm it later so the sender can be sure that at least this particular set of records was processed. That in some cases the locking mechanism has absolutely nothing to do with "functions". In other words, protocols are better driven by the flow of data than by the control flow in your program, and should be designed accordingly. If such a design requires any of the complexity I have described above, it can have it, but the unnecessary pieces, and all the complexity they bring in, won't be there.
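To make the parser/protocol split concrete, here is a minimal sketch of a data-driven receiver for a hypothetical line-based logging protocol (the protocol, the record format and the batch size are all invented for illustration). The low-level parser state turns an arbitrary byte stream into records; the high-level protocol state counts records and issues confirmations; there is no "end", no call, no return value -- just data arriving:

```python
class LogReceiver:
    ACK_EVERY = 3  # illustrative batch size for confirmations

    def __init__(self):
        self.buffer = b""   # parser state: bytes of a not-yet-complete line
        self.records = []   # protocol state: records received so far
        self.acks = []      # confirmations ("processed up to record N")

    def feed(self, chunk: bytes):
        """Called whenever bytes arrive; chunk boundaries are arbitrary
        and have nothing to do with record boundaries."""
        self.buffer += chunk
        while b"\n" in self.buffer:  # parser: do we have a complete line?
            line, self.buffer = self.buffer.split(b"\n", 1)
            self.on_record(line.decode())

    def on_record(self, record: str):
        """Protocol layer: store the record, confirm every full batch."""
        self.records.append(record)
        if len(self.records) % self.ACK_EVERY == 0:
            self.acks.append(len(self.records))

rx = LogReceiver()
# The stream never "ends"; chunks are cut without regard to records.
for chunk in [b"boot ok\nload", b"ed config\ndisk warn", b"ing\nuser login\n"]:
    rx.feed(chunk)
print(rx.records)  # ['boot ok', 'loaded config', 'disk warning', 'user login']
print(rx.acks)     # [3]
```

Nothing here resembles a function call crossing the network, yet the receiver is complete: it tolerates arbitrary fragmentation and tells the sender exactly how much has been safely processed.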
Protocol design is important for another reason -- a well-designed protocol can be implemented in a system completely unrelated to the one you have, and yet interoperate with it without any problems. You can reuse the components that use or implement a protocol, or replace them with new ones -- being a clearly defined interface with a minimal set of data and states that it handles, a protocol reduces the amount of effort involved in either reuse or refactoring.
But if so, how in the world did it happen that most protocols end up undocumented, while their authors are more than happy to distribute some braindamaged "API library" that implements those protocols in some convoluted manner? That the RPC model with XML-over-HTTP is pretty much as far as most "networked" software goes? The only explanation I can think of is ignorance. People write extremely complex software, but it does not occur to them to write a simple protocol implementation because they don't know how to implement a protocol in a simple manner. They never learned how a parser, or a finite-state machine in general, works. They don't know how to handle sockets. The knowledge they are missing can be condensed to tens of pages in a book that can be read in a week, but the lack of this knowledge is what shapes massive projects that are developed for years and will probably be around for decades, only to be replaced with similar monstrosities.
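The "handling sockets" part of that missing knowledge really is this small. A sketch, using Python's socketpair() as a stand-in for a real TCP connection so it runs self-contained (the PING payload and the handler are, of course, illustrative):

```python
import socket

def receive_all(sock, handle_chunk):
    """Drain a socket, passing each chunk of bytes to a protocol handler.
    recv() returns chunks of arbitrary size, and b"" once the peer closes;
    the protocol layer, not the socket code, finds record boundaries."""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        handle_chunk(chunk)

# socketpair() gives two connected sockets in one process, standing in
# for the client and server ends of a real network connection.
server, client = socket.socketpair()
client.sendall(b"PING\nPING\n")
client.close()

received = []
receive_all(server, received.append)
server.close()
print(b"".join(received))  # b'PING\nPING\n'
```

Everything above the transport -- framing, state, confirmations -- belongs to the protocol's finite-state machine, which is exactly the part people skip.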
...To be continued...