On software development, or how to be less of an idiot while doing it, Part 1. - Alex Belits
For a long time it has bothered me how a large number of people, some of them professional software developers, misunderstand software development practices on a fundamental level. While I recognize that more than a decade of Microsoft dominance and about a decade of dotcoms (partially overlapping) could turn any atmosphere poisonous, it should have been long enough since the last talentless dork decided to "study computers" to earn billions by becoming a "second Bill Gates" or creating a web site that clones the functionality of another web site. Writing things that humans are not supposed to read has become sufficiently unpopular again, so only people genuinely interested in software development would do it, and therefore they would pay some attention to things that are obvious and easy to recognize, right? Apparently not. Like poison that has saturated air, soil and water, the worst practices continue to perpetuate themselves even among otherwise sane and educated people, and the best ones are either ignored or paid lip service to. At the very least, I believe, it will be useful to mention what is currently missing.

1. Software is a collection of machines that process data, and those machines are implemented in the form of code that is also data, interpreted by hardware and other software. Software development is a form of engineering that deals with software.

While those are not the only possible definitions, they describe the proper place of software among other kinds of technology. If you are a software developer, you are subject to the same standards you would expect from other engineers. If you are an electrical, mechanical or other kind of engineer, working on software exposes you to a different, though similar, form of engineering. If you are a scientist, you will find that whatever clarity of thought is necessary for your primary area of expertise is just as necessary for dealing with software; however, the large and entirely artificial infrastructure built by what are supposed to be engineers makes things much more confusing and difficult to follow.

If you are an artist or any other kind of non-engineer, jumping into software development in any form exposes you to a fundamentally different area of human knowledge, much more complex than anything you have ever dealt with before. And among other new and unusual things, it is brutally unforgiving of any mistakes and misunderstandings. It is not trivial. It did not get "easier to use" except on the most superficial level, which you are unlikely to stay on for long once you use more complex functionality.

It is not subject to interpretation unless you are trying to describe it to someone else -- and your description has a good chance of being worthless or worse if your interpretation is wrong. Your opinion means nothing to it -- you can't convince it to do what you want any more than you can convince your car not to slam into the railing when you foolishly enter a highway exit curve at 90mph.

It's true that one can write software without going through many study courses that are considered mandatory for any other engineer. It's also true that courses usually thought to be specific to Computer Science and Computer Engineering, and therefore necessary for software development, are not in fact mandatory for every developer. However, this is merely a result of the enormous breadth of software, and of the fact that very likely you are working on something where others have already done some groundwork. Poor or incorrect knowledge has its price, though -- in the most obviously disastrous scenario, you will not notice when this groundwork no longer applies and has to be changed or completely replaced. There are plenty of less obvious ones, too.

However, even if your knowledge can be incomplete without immediately disqualifying you from everything software-related, a missing engineering approach or a missing understanding of engineering culture is guaranteed to turn your work into an unusable mess -- which, sadly, is what a large amount of software is. And this is what makes engineering education, and the understanding that software development is engineering, important.

2. Software is built by a compiler (or a chain of tools including a compiler) from source code written by humans. Code written in a programming language is both the original form of the software itself and the final expression of its design. Expression of design in any other form is either an illustration or explanation (if it describes existing software) or a specification (if it describes software to be implemented). All decisions made while implementing software are design decisions, because they all modify the primary expression of the software's design.

There was a time when "computer" or "calculator" meant a person performing calculations for others, so that those others' work was not constantly interrupted by tedious, mind-numbing procedures. Now the idea of using a living person for such a purpose has become so preposterous that the words themselves changed their meaning to signify the machines that replaced those people, and no one considers going back.

In exactly the same manner there was a time when there were "programmers" who translated formulas and algorithms into a form usable on ancient (50's-70's) computers. Someone else, usually called an analyst if it was a person separate from the end user, handled the task of creating the algorithms and designing the structure of the program, something that was beyond the scope visible to the programmers, whose job was overwhelmed by implementation details. This time is gone, and for exactly the same reason as why human calculators are gone. Development of modern programming languages (for the purpose of this distinction, C is a modern programming language, FORTRAN is borderline) made algorithms and data structures transparent to the person writing a program, thus eliminating the possibility of keeping a distinction between "programmer" and "analyst".

Decisions made by the person writing code now happen on exactly the same level as decisions that would be made by a person describing what the code should do, so keeping a person "writing code" separate from a person "designing software" is either a way to pay less to someone who ends up both designing and implementing it, or a procedure where all work is done twice.

One form of this convergence is what would obviously follow from this explanation -- whoever writes code has to understand the design, and ends up interpreting and extending it, thus leaving his "master" the function of documentation writer at best, and redundant paper-pusher at worst.

The opposite form of this convergence is far worse -- trivial or supposedly trivial operations formerly reserved for "programmers" are elevated to the level of system design, while "design decisions" made by everyone else become entirely fictional brochure-ware that has nothing to do with how the software works. Such a design has a tendency to degenerate into something impossible to implement due to some fundamental inapplicability or internal contradiction that went unnoticed because no one relevant reads those worthless papers. When such a mistake happens, the situation is cemented, as no one is in a position to fix it even if he tried. A company can spend months, years or decades in denial, but in the end the result is determined by the actual decisions made by the people who write code. They could be geniuses who spontaneously developed their own design and maintain a great product under the noses of managers and a marketing department who believe the company produces something completely different, or they could be idiots, constantly scrambling to make thousands of little tweaks every time something or someone shatters the illusion that the product does what marketing claims it does. In both cases the developers who write code are the true masters of the product, and it's the quality and completeness of their work that determines all aspects of the result.

3. Software is composed of components. Components are connected to each other, and to various external entities, by interfaces. Interfaces pass data and trigger its processing. Typical interfaces are protocols, collections of functions (and possibly classes in object-oriented software), data structure definitions, system calls and (with some stretch of the typical usage of the term) interrupts and device registers. In an ordering of interfaces from most universal to most instance-specific, a protocol always comes before functions/APIs, as a protocol is the only interface that can be used between devices sharing absolutely nothing else in common.

I thought people had learned this after they started writing network-accessible applications, and would have designed them accordingly. Foolish me -- apparently most of them belong to one of two groups:

1. People who followed a set of instructions that makes magic pictures appear in a browser after some code in some magic language is uploaded to the web server and some URL is magically associated with that code. Magic being the key word in their understanding of what exactly is running on the server, what exactly is running on the client, and what data is being transferred between those two, and how.

I realize that having to deal with at least four data formats (HTML, CSS, XML, HTTP headers), at least one protocol (HTTP), at least two full-blown programming languages (Java, PHP, Perl or Python on the server, Javascript on the client), and optionally a gaggle of mini-interfaces of various kinds such as CGI, FastCGI, servlets, Apache modules, etc. does not make people scream "I want to develop a crystal-clear understanding of all the details of this system". But this is also why AJAX-based web application development is a terrible way of learning how any of those -- really very simple on their own -- components work.

2. People who learned how to use some form (or multiple forms) of RPC, and treat it as another form of API.

The only software developers who never invented their own form of RPC (if not the whole concept independently) are the ones who were told early enough what a terrible idea it is. Let me elaborate on this.

A person who has learned how complex software can be produced by combining function calls and data structures will, when he encounters a fundamentally new capability -- such as transferring data over the network and triggering operations on its reception -- understandably try to apply the known processing model to it. Unless that person slept through most of his classes, he will inevitably find the most obvious solution:

1. Create a set of identifiers (integers, strings, path names within a tree, or a combination of any of the above with network addressing primitives) that map to particular functions. Devise a serialization/deserialization scheme for function arguments and return values (also for class and object identifiers if the interface in question is object-oriented).

2. Write, or make a procedure to generate, a wrapper for the function on the callee side that is called when a request is received over the network, and a stub on the caller side that sends the request over the network, waits for the response and passes the data involved. On top of this, create a dispatcher that routes the requests and responses, or reuse one in an existing protocol.
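The steps above can be sketched in a few lines of Python. This is a deliberately minimal illustration of the naive design, not any real RPC library: JSON stands in for the serialization scheme, a dictionary stands in for the dispatcher, and an in-process function call stands in for the network transport. All names are made up for the example.

```python
import json

# Callee side: a registry mapping identifiers to functions (the "dispatcher").
registry = {}

def remote(func):
    """Register a function under its name so it can be called by identifier."""
    registry[func.__name__] = func
    return func

@remote
def add(a, b):
    return a + b

def handle_request(raw):
    """Callee-side wrapper: deserialize a request, call the function, serialize the result."""
    request = json.loads(raw)
    result = registry[request["method"]](*request["args"])
    return json.dumps({"result": result})

def call(method, *args):
    """Caller-side stub: serialize the call, 'send' it, wait for and decode the response."""
    raw = json.dumps({"method": method, "args": list(args)})
    # A real stub would write raw to a socket and block on the reply;
    # here the "network" is a direct function call.
    return json.loads(handle_request(raw))["result"]

print(call("add", 2, 3))  # → 5
```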

Congratulations, you can call functions remotely! You and every second-year CS student who made exactly the same design.

There are problems, though. First of all, the functions most likely have side effects. They may be completely confined to data structures referred to by the arguments, and that data will be inaccessible to other functions and objects, so you will just have to transfer all that data back and forth along with your arguments. Or the data may reside on the callee side, and the caller will have no direct access to it except possibly through accessors that are also remotely-called functions. But then you have to deal with another problem -- simultaneous requests from the same or multiple callers that affect, directly or indirectly, the same data. You need a locking scheme, so on top of waiting for a synchronous protocol, the caller has to wait for locks that he is not aware of but the callee is. Maybe he should be aware, but then you have to propagate them to clients through some kind of distributed lock manager.

Or maybe there should be views presented to multiple clients and some mechanism to reconcile them and roll back conflicts. Then clients should be able to process the results of this, thus implementing some form of transactions. You can decide to simplify the task by creating an impenetrable wall between multiple clients, but then you have to know when such an object should be created and deleted. Creating a "session" is easy, but how do you know that the client no longer exists? Then there are all of the same problems, but worse, if you choose to support potentially unreliable protocols such as UDP (which, of course, is what the original RPC developers did).

With all (or even some) of those problems solved, you will have created a large, complex system that does nothing in particular, and is poorly suited for anything you can imagine implementing with it. To be fair, all parts of it make perfect sense. Serialization/deserialization of data is useful. Synchronous requests to objects identified by some addressing scheme are useful. Locking, views, transactions, sessions -- all valid and useful concepts. A combination of all those things, jumbled into a huge ball of some sticky, greasy substance where pieces cannot be separated from each other and used separately, is nearly useless. The problem is, a system of this kind ignores the nature of a network protocol, and is mostly the result of the simple fact that people learn about functions before they learn about protocols.

A protocol can be described by a finite-state machine. For convenience one can separate the high-level "protocol" state from the low-level "parser" state that handles whatever format or language is used to represent requests and data. For anyone who understands this, it should be clear that some data exchanges are by their nature asynchronous, and this whole massive structure is completely unnecessary for them. Some don't even have an "end" -- for example, logging data or results of transactions is never supposed to end while the system is running; it's just important to receive the data, process or store it, and possibly confirm it later so the sender can be sure that at least this particular set of records was processed. In some cases the locking mechanism has absolutely nothing to do with "functions". In other words, protocols are better driven by the flow of data than by the control flow in your program, and should be designed accordingly. If this design requires any of the complexity I have described above, it can have it, but the unnecessary pieces, and all the complexity they bring in, won't be there.
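To make the finite-state-machine view concrete, here is a sketch of a receiver for an asynchronous, never-ending logging exchange like the one mentioned above. The record format ("LOG <length>" header line followed by that many bytes) is invented for the example; the point is that two states and a buffer are the entire protocol implementation, and the machine works no matter how the bytes are split into network chunks.

```python
# A two-state machine for an asynchronous, record-oriented protocol:
# each record is a header line b"LOG <length>\n" followed by <length> bytes.
# feed() accepts arbitrary chunks, exactly as they would arrive from a socket.

HEADER, BODY = 0, 1

class LogReceiver:
    def __init__(self):
        self.state = HEADER
        self.buffer = b""
        self.needed = 0
        self.records = []

    def feed(self, chunk):
        self.buffer += chunk
        while True:
            if self.state == HEADER:
                newline = self.buffer.find(b"\n")
                if newline < 0:
                    return                      # header incomplete, wait for more data
                header = self.buffer[:newline]
                self.buffer = self.buffer[newline + 1:]
                self.needed = int(header.split()[1])
                self.state = BODY
            else:                               # BODY state
                if len(self.buffer) < self.needed:
                    return                      # body incomplete, wait for more data
                self.records.append(self.buffer[:self.needed])
                self.buffer = self.buffer[self.needed:]
                self.state = HEADER

r = LogReceiver()
r.feed(b"LOG 5\nhel")      # records may arrive split across arbitrary chunks
r.feed(b"loLOG 2\nok")
print(r.records)           # → [b'hello', b'ok']
```

Note that nothing here is synchronous: the sender never waits, and the receiver just consumes data as it flows in, which is exactly the property the RPC model throws away.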

Protocol design is important for another reason -- a well-designed protocol can be implemented in a system that is completely unrelated to the one you have, and yet interoperate with it without any problems. You can reuse the components that use or implement a protocol, or replace them with new ones -- being a clearly defined interface with a minimal set of data and states that it handles, a protocol reduces the amount of effort involved in either reuse or refactoring.

But if so, how in the world did it happen that most protocols end up undocumented, while their authors are more than happy to distribute some braindamaged "API library" that implements those protocols in some convoluted manner? That the RPC model with XML-over-HTTP is pretty much as far as most "networked" software goes? The only explanation I can think of is ignorance. People write extremely complex software, but it does not occur to them to write a simple protocol implementation because they don't know how to implement a protocol in a simple manner. They never learned how a parser, or a finite-state machine in general, works. They don't know how to handle sockets. The knowledge they are missing can be condensed into tens of pages in a book that can be read in a week, but the lack of this knowledge is what shapes massive projects that are developed for years and will probably be around for decades, only to be replaced with similar monstrosities.

...To be continued...
3 comments or Leave a comment
From: (Anonymous) Date: July 31st, 2011 09:25 pm (UTC) (Link)
You seem to be missing a core principle of engineering here: pragmatism.

Let me give you an example - I can write a simple web application (say a blog or a small wiki) with a JSON REST API that works, doesn't break, performs well under load, runs on ten-year-old hardware and is easy for any programmer to maintain or extend. And I can do it in a matter of hours.

How many hours would it take you to design and implement a custom protocol that does the same job? Document it? Debug it? And if you want your software to have a lifetime beyond your initial contribution, how long will it take for each new developer to learn your protocol?

And finally, what would you have gained by doing so? A small decrease in resource usage. If you're lucky.

Standards exist for a reason.
abelits From: abelits Date: August 1st, 2011 11:24 am (UTC) (Link)
You seem to be missing a core principle of engineering here: pragmatism.

No, I do not.

Let me give you an example - I can write a simple web application (say a blog or a small wiki) with a JSON REST API that works, doesn't break, performs well under load, runs on ten-year-old hardware and is easy for any programmer to maintain or extend. And I can do it in a matter of hours.

This is an example of a statement that is "not even wrong" -- you have not merely failed to understand my point, you do not understand what kind of problem I am talking about.

JSON is not a protocol. It's a data format notation, something that can be used to define a message or document syntax, which in its turn is a part of a protocol. It's still the developer's job to define the syntax itself; JSON merely provides an easy way to include aggregated data handling in a syntax, and makes it possible to use a simple parser -- one that already exists or one that the developer can write by himself (a trivial job either way). However, syntax is merely a part of a protocol.
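The distinction is easy to show in code. Below, the JSON parser is the reusable part; everything else -- the field names, the allowed message types, which fields are mandatory -- is a hypothetical message syntax the developer still has to define, document and support. The "post"/"ack" schema here is invented purely for illustration.

```python
import json

# JSON gives you the parser; the message syntax -- which fields exist, what
# types they have, which are mandatory -- is still the developer's decision.
# This hypothetical syntax has a type tag, a sequence number, and a payload.

def parse_message(raw):
    msg = json.loads(raw)                       # generic, reusable JSON parsing
    # Everything below is protocol-specific syntax defined by the developer:
    if not isinstance(msg, dict):
        raise ValueError("message must be an object")
    if msg.get("type") not in ("post", "ack"):
        raise ValueError("unknown message type")
    if not isinstance(msg.get("seq"), int):
        raise ValueError("missing or invalid sequence number")
    if msg["type"] == "post" and not isinstance(msg.get("payload"), str):
        raise ValueError("a post requires a string payload")
    return msg

print(parse_message('{"type": "post", "seq": 1, "payload": "hello"}')["seq"])  # → 1
```

And even this is only the syntax of a single message; the ordering of messages, their semantics, and the state on each side are still further parts of the protocol that JSON says nothing about.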

REST is not even a part of a protocol. It's a vaguely defined design principle, or an ideology applicable to protocol semantics -- another part of protocol design that is not syntax.

It basically says that from the server's point of view the protocol state is completely defined by the data stored on the server, and all messages contain an unambiguous serialized representation of the relevant data. The client trusts the server and incorporates data received from it into its own state. The rest, pardon the pun, is fluff.

If a protocol is implemented over HTTP, those requirements (as I have described above) are a direct result of the nature of the underlying protocol -- an application implemented in some other way would be unsafe, because the server would be swamped with client state from incomplete sessions. It would not be able to change or invalidate such state when the server-side data it refers to is modified, because the server can not arbitrarily initiate a transaction, or verify that its response was successfully processed by the client.

It's something that was a fundamental assumption made when HTTP was designed, an assumption about all applications that would be implemented over HTTP. An assumption so obvious to its developers that they never thought it necessary to spell it out -- even though they realized that there are various alternative forms of design (such as a typical GUI program) that would not fit into HTTP. Then, of course, hordes of idiots arrived, and managed to miss that point.

It's not a protocol. It's not a standard. It's not a product. It's a recommendation on the same level as "Don't stick forks into a power outlet". It should be taught to little programmers so they won't do something monumentally stupid before they learn why protocols should be organized that way.

There are some special cases when this principle is almost broken -- for example, AJAX applications can require a server to store client state and delay the response until an event is triggered by something based on a combination of server state and client state contained in a request. Ex: a mail or bulletin board application that returns an updated list of messages when any message from a given list of threads arrives ("given list of threads" is client state, "message arrived" is server state). It is safe because the lifetime of the request, usually negligible in HTTP, can be used to limit the lifetime of the associated client state; however, this relies on a peculiarity of HTTP clients -- their ability to process requests asynchronously -- and on reliable detection of disconnection. Formally it's still long-living client data on the server, and proper design has to take into account the possibility of resource exhaustion due to a large number of such requests in progress. Those issues, obviously, are glossed over in the web developers' kindergarten where principles such as "REST" are taught.
abelits From: abelits Date: August 1st, 2011 11:25 am (UTC) (Link)

How many hours would it take you to design and implement a custom protocol that does the same job? Document it? Debug it? And if you want your software to have a lifetime beyond your initial contribution, how long will it take for each new developer to learn your protocol?

When you write a "JSON REST API" you are still developing a custom protocol. You can reuse a parser that understands JSON and a server that implements HTTP as the underlying transport, and you can adopt the design principle known as REST; however, the protocol you have developed is your own, and you are responsible for it working properly. You still have to document it. You can still have terrible bugs in it if you do anything wrong. And sure as Hell, you are supposed to support it.

And finally, what would you have gained by doing so? A small decrease in resource usage. If you're lucky.

Where did I ever mention resource usage? Where did I advocate a lack of layering? Did you really read anything I wrote as if I recommend that everyone write his own server, parser, and state machine implementation for everything? Or did you skim it over and decide "Oh, this is some old-school programmer complaining about people not using assembly language anymore -- let's respond to that!"?

I am talking about software design, and you seem to have no understanding of what software design involves. When you do it, you still fail to distinguish between the design decisions that you make and the use of existing components, already designed by others.

The only explanation for this is ignorance -- you do not recognize when you make a design decision because you do not recognize any possible decisions other than the one you instinctively made. It's "obvious" because nothing else comes to your mind as an alternative. You are one of those people who a decade ago claimed "It's in XML, so it's standardized" -- apparently the only progress they made is switching from XML to JSON.

The most dangerous result of such ignorance is that "obvious" decision can be wrong. Seriously wrong. And you will not understand how wrong it is because it's "obvious" and nothing else can be used in its place, so it must be right, right?

Standards exist for a reason.

Standards exist to allow modular and interoperable design. They exist so interfaces can be re-implemented without use, or knowledge, of existing implementations, and still be interoperable.

They do not exist so that a person with no understanding of protocol design can pick up a few buzzwords, call a standard parser, and produce something that, he believes, is not broken because "it uses standards".

In other words, standards are written for smart people.