You may remember my post from a few weeks ago, “NodeJS vs Java by Example“. I compared the same algorithm, insertion sort, chosen because it is not particularly efficient, so it easily generates a bit of delay, and because it can be written with almost the same code in all the languages, without requiring constructs or techniques that would favour one language over another. The comparison was run both on the full-powered local machine, a Mac, and inside a Docker container, in order to limit resource consumption.

For that test I created a server in each of the two languages, Java with Spring and Node with Express, accepting a POST of a JSON payload with the numbers to sort, an array of 9999 elements. The server did the following:

  • Accepting the data
  • Starting a stopwatch for the single execution of the insertion sort
  • Executing the insertion sort
  • Taking the time of that execution
  • Starting 200 threads doing the same thing
  • Returning the result to the requester
  • Having each thread print, at the end of its execution, the elapsed time since the beginning

What I found, as expected, is that NodeJS is roughly comparable for single-threaded execution, although the Java JVM was not optimised at all, while I suspect NodeJS tries to consume more resources even without any specific configuration. Where NodeJS could not compete at all was in multithreading, although I was using workers, the latest (and at the time experimental) feature that should bring some improvement over the days when only fork was available in NodeJS.

The results showed that Java gave back the response (so, it computed the first insertion sort and created the 200 threads) in 302 milliseconds, while the 200 threads finished their computation after 9.8 seconds. On the other side, NodeJS took more than 40 times longer to answer the request, 13.5 seconds, and since computing the first iteration of the insertion sort took almost the same time as Java, the difference was entirely time spent creating the 200 workers.
NodeJS also took about 3.5 times longer to complete the full computation of the 200 workers: 34.5 seconds.

In this post I am going to do the same with another language, Rust. As an initial observation, Rust is an extremely complicated language, especially considering that I had not read anything about it before starting this project. I had no idea how to code anything in Rust, which tools to use, how to do anything. I started from the documentation to set up a new project, and I familiarised myself with Cargo, the package manager. I wanted to be quick, so I searched for an example of a server, and I understood that I just needed the packages to serialise/deserialise data.

[package]
name = "easy_rust"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
serde_json = "1.0"
serde = { version = "1.0", features = ["derive"] }
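
Just to show what those two crates are for, here is the smallest possible round trip, separate from the server code (the three-element array is only an illustration):

use serde::{Serialize, Deserialize};

#[derive(Debug, Serialize, Deserialize)]
struct Data {
    test: Vec<i64>,
}

fn main() {
    // deserialise a JSON string into the struct, then serialise it back
    let data: Data = serde_json::from_str(r#"{"test": [3, 1, 2]}"#).unwrap();
    println!("{}", serde_json::to_string(&data).unwrap());
}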

The web server example that I found was easy to understand but… Rust is a bit low level, and if you don’t know the exact length of the input data, as in my case, you have to deserialise the bytes by writing a loop to receive all of them. This means you have to start fighting with concepts like mutable objects, and especially with references, much like pointers in the old C++ times. Then there is this concept of unwrap, which at first I really didn’t get and didn’t want to analyse. On top of that, I had to go back to the basics of HTTP transfer, meaning I had to manually handle clients sending Expect: 100-continue and reply with the appropriate interim response.
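
For the curious: unwrap takes a Result (or an Option) and either gives you the value or panics on the error. A tiny sketch, unrelated to the server, comparing it with the match it replaces:

fn main() {
    // unwrap(): take the Ok value, or panic with the error
    let n: i64 = "42".parse().unwrap();

    // the explicit equivalent, written out with match
    let m: i64 = match "43".parse() {
        Ok(value) => value,
        Err(e) => panic!("not a number: {}", e),
    };

    println!("{} {}", n, m);
}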

Long story short, I created a single file (just to avoid imports), and writing a server and making everything work as expected was like balancing a pile of mismatched dishes, where any change could make everything collapse. The result is the following:

use std::io::prelude::*;
use std::str;
use std::net::{TcpListener, TcpStream, Shutdown};
use serde::{Serialize, Deserialize};
use std::thread;
use std::time::SystemTime;

#[derive(Debug, Serialize,Deserialize)]
struct Data {
    test: Vec<i64>
}

fn main() -> std::io::Result<()>{
    println!("Starting at 127.0.0.1:7878");
    let listener = TcpListener::bind("0.0.0.0:7878").unwrap();
    for stream in listener.incoming() {
        println!("Connection established");
        let stream = stream.unwrap();
        handle_connection(stream).unwrap();
    }
    Ok(())
}

fn read_stream(stream: &mut TcpStream) -> (String, usize) {
    let buffer_size = 512;
    let mut request_buffer = vec![];
    // let us loop & try to read the whole request data
    let mut request_len = 0usize;
    loop {
        let mut buffer = vec![0; buffer_size];
        match stream.read(&mut buffer) {
            Ok(n) => {

                if n == 0 {
                    break;
                } else {
                    request_len += n;
                    // keep only the bytes actually read, not the zero padding
                    request_buffer.extend_from_slice(&buffer[..n]);

                    // we need not read more data in case we have read less data than buffer size
                    if n < buffer_size {
                        break;
                    }
                }
            },

            Err(e) => panic!("Invalid UTF-8 sequence: {}", e),
        }
    }

    let s = match str::from_utf8(&request_buffer) {
        Ok(v) => v,
        Err(e) => panic!("Invalid UTF-8 sequence: {}", e),
    };
    (s.to_string(), request_len)
}

fn handle_connection(mut stream: TcpStream) -> std::io::Result<()> {
    let (mut request, mut len) = read_stream(&mut stream);
    
    // curl sends "Expect: 100-continue" before transmitting a large body,
    // so acknowledge it and read again to get the actual JSON payload
    if request.contains("Expect: 100-continue") {
        let response = "HTTP/1.1 100 Continue\r\n\r\n";

        stream.write(response.as_bytes()).unwrap();
        stream.flush().unwrap();

        (request, len) = read_stream(&mut stream);
        request.truncate(len);
    }

    let mut v:Data = serde_json::from_str(&request).unwrap();

    println!("Deserialised {} numbers", v.test.len());

    let mut now = SystemTime::now();

    v.test = insertion_sort(&mut v.test).to_vec();

    println!("Insertion sort: {} millis", now.elapsed().expect("wow").as_millis());
    now = SystemTime::now();

    for i in 0..200 {
        let mut newvec = v.test.to_vec();
        let index = i;
        thread::spawn(move || {
            insertion_sort(&mut newvec).to_vec();
            println!("Insertion sort in thread {}: {} millis", index, now.elapsed().expect("wow").as_millis());
        });
    }

    let content = serde_json::to_string(&v).expect("wow");

    let response = format!(
        "HTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: {}\r\n\r\n{}",
        content.len(),
        content
    );

    stream.write(response.as_bytes()).unwrap();
    stream.flush().unwrap();
    stream.shutdown(Shutdown::Both).unwrap();
    Ok(())
}

fn insertion_sort(vec: &mut[i64]) -> &[i64] {
    let array: &mut [i64] = vec;
    for i in 0..array.len() {
        // Start comparing current element with every element before it
        for j in (0..i).rev() {
          
            // Swap elements as required
            if array[j + 1] < array[j] {
                let swap = array[j + 1];
                array[j + 1] = array [j];
                array[j] = swap;
            }
        }
    }
    array
}
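
If you don’t want to use curl to poke the server, here is a minimal hand-rolled client sketch that should exercise it, including the Expect: 100-continue dance (the five-element payload is just for illustration; the real test uses 9999 elements):

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // tiny payload for illustration; the benchmark sends an array of 9999 elements
    let body = r#"{"test": [5, 3, 1, 4, 2]}"#;

    let mut stream = TcpStream::connect("127.0.0.1:7878")?;

    // announce the body first, as curl does for large payloads
    let headers = format!(
        "POST /test HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\nExpect: 100-continue\r\n\r\n",
        body.len()
    );
    stream.write_all(headers.as_bytes())?;

    // wait for the interim 100 Continue response before sending the body
    let mut interim = [0u8; 512];
    let _ = stream.read(&mut interim)?;

    // send the JSON body, then read everything the server gives back
    stream.write_all(body.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{}", response);
    Ok(())
}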

As you may remember, another step was to create a Docker image. Let me clarify immediately one thing that was a bit shocking: by default, Cargo builds the project in debug mode. When I first ran my server, the result was not astonishing at all, but that was because my server, in debug mode, is about 40 times slower than a release build (cargo build --release, or cargo run --release).
The Dockerfile takes care of building in release mode. Here it is:

FROM rust:latest

# 1. Create a new empty shell project
RUN USER=root cargo new --bin easy_rust
WORKDIR /easy_rust

# 2. Copy our manifests
COPY ./Cargo.lock ./Cargo.lock
COPY ./Cargo.toml ./Cargo.toml

# 3. Build only the dependencies to cache them
RUN cargo build --release
RUN rm src/*.rs

# 4. Now that the dependency is built, copy your source code
COPY ./src ./src

# 5. Build for release.
RUN rm ./target/release/deps/easy_rust*
RUN cargo install --path .

EXPOSE 7878

CMD ["easy_rust"]
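
If you want to try it yourself, building and running the image should be as simple as docker build -t easy_rust . followed by docker run -p 7878:7878 easy_rust (the easy_rust tag is arbitrary).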

I then started being curious. If Rust is extremely fast, especially when run in release mode, how does Python perform? So I went through the same process. I needed some external libraries in my requirements.txt:

fastapi
pydantic
uvicorn

And then I needed the code. Please note that I was a bit confused at first because threads and the thread pool were not performing well: this is Python’s Global Interpreter Lock at work, so for a CPU-bound task like this they effectively behave as a series of executions on a single thread. I then wrote a process_pool version, which performed a bit better. The code below contains all three variants.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# pip install "fastapi[all]"

from fastapi import FastAPI
from pydantic import BaseModel

class Request(BaseModel):
    test: list[int]

server = FastAPI()

@server.post("/test")
async def insertion_sort_endpoint(request: Request):
    import time
    print("Computing insertion sort on \n "+str(request.test))
    start_single = time.perf_counter()
    ordered = insertion_sort(request.test.copy())
    end_single = time.perf_counter()
    run_process_pool(200, request.test)
    end_threading = time.perf_counter()
    print("Total computation: "+str(end_threading-start_single))
    print("Single computation: "+str(end_single-start_single))
    print("Thread computation: "+str(end_threading-end_single))
    return {"result": ordered}

def run_threads(n_threads, values):
    import threading
    threads = []
    for thnum in range(n_threads):
        x = threading.Thread(target=thread_core, args=(values,thnum))
        x.start()
        threads.append(x)
    for thread in threads:
        thread.join()

def run_thread_pool(n_threads, values):
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        for thnum in range(n_threads):
            executor.submit(thread_core, values, thnum)

def run_process_pool(n_threads, values):
    import concurrent.futures
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_threads) as executor:
        for thnum in range(n_threads):
            executor.submit(thread_core, values, thnum)

def thread_core(values, thread_num):
    print("Starting thread "+str(thread_num))
    insertion_sort(values.copy())
    print("Ending thread thread "+str(thread_num))

def insertion_sort(values):
    for x in range(1, len(values)):
        for y in range(x):
            j = x-1-y
            if values[j + 1] < values[j]:
                swap = values[j + 1]
                values[j+1] = values[j]
                values[j] = swap
    return values


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Server to sort using insertion sort')

    parser.add_argument('-p', '--port', dest="port", action='store', type=int, default=49163,
                        help='the port accepting requests from customers')
    parser.add_argument('-s', '--server', dest="host", action='store', default="127.0.0.1",
                        help='the host accepting requests from customers')

    args = parser.parse_args()
    import uvicorn
    if args.host:
        uvicorn.run(server, host=args.host, port=int(args.port))
    else:
        uvicorn.run(server, port=int(args.port))

Last but not least, the Dockerfile. In this case I didn’t use Alpine because it had some known issues with some of the libraries.

FROM python:3.9-rc-buster

# Create app directory
WORKDIR /usr/src/app

# Install app
COPY requirements.txt ./

RUN pip install --upgrade pip setuptools && \
    pip install --no-cache-dir --upgrade -r requirements.txt -t .

COPY app.py ./

EXPOSE 49163

ENTRYPOINT ["python", "app.py", "-s", "0.0.0.0"]

Nothing else is left, except, obviously, the statistics. The numbers are machine-dependent, but you should be able to run the code on your machine and get proportionally similar results. Ah, and obviously I am sharing my code in my repository; feel free to clone and run it.

The table shows the statistics for the Docker version and for the version running directly on my local machine. Again, the absolute results are machine-dependent; what matters is how they relate to each other.

Version | Docker single thread | Docker multi-thread | Local single thread | Local multi-thread
NodeJS  | 76 ms                | 9882 ms             | 84 ms               | 7872 ms
Rust    | 50 ms                | 1142 ms             | 62 ms               | 1129 ms
Java    | 107 ms               | 4267 ms             | 121 ms              | 14401 ms
Python  | 8 s                  | 725 s               | 4 s                 | 474 s

Hope you will find this interesting. Stay tuned!!!
