How I built my first hybrid edge setup in 3 hours

Two years ago I bought a Raspberry Pi, then I installed an SSD drive in it because I was too afraid of accidental data loss. And then I did nothing with the device, so I moved it to the office, where the story continued: an over-engineered Raspberry Pi setup with no real use 😂 But a few weeks ago I got drunk in the office and decided to use it for some weekend project. So I took it back home, and here are a few notes I wrote down over the weekend.

Remote management

If I had IPv6, I could skip this part. Thanks for pointing this out, Radek. It's 100% true, but we're not there yet; at least my internet provider isn't (I live in a village, so I have very limited options here). So let's look at the solution that helped me overcome this limitation: Teleport.

Teleport is a tool that simplifies access via SSH. I call it SSH on steroids: it supports SSO, it enforces MFA, and it works even behind NAT via a reverse tunnel. That's exactly my situation, which is why I built a Teleport cluster on top of an EC2 instance and then connected the Raspberry Pi to it.

teleport:
  auth_token: "redacted"
  ca_pin: "redacted"
  auth_servers:
    - "redacted:3080"

ssh_service:
  enabled: true
  labels:
    env: dev

proxy_service:
  enabled: false

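For completeness, the auth_token above comes from the cluster side. This is a sketch of the standard Teleport flow for generating it, not a command taken from my notes:

# run on the EC2 instance hosting the Teleport cluster;
# prints a short-lived join token for a new SSH node
tctl tokens add --type=node --ttl=1h
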
As a result, I have the Raspberry Pi in my Teleport interface marked with an arrow, which stands for the reverse tunnel.

Now I can finally connect to my Raspberry Pi from any location.

➜ tsh ssh root@raspberrypi
root@raspberrypi:~# 

Basic monitoring solution

Now I have two nodes running some software stack, and it would be really nice to monitor them somehow. I could run Prometheus and Alertmanager there, or I could utilize some SaaS offering. My own monitoring solution would make the whole stack independent, but it would ultimately make it extremely complicated. That's why I've decided to use my recent experience with Grafana Cloud and leverage the free tier Grafana offers.

The big pro of the Grafana stack is that it leverages other open-source tools, so I can switch to a full Prometheus solution whenever I want.

Basic monitoring can be easily covered by node_exporter, a well-known component of almost all Kubernetes stacks. We just need to enable the systemd collector and whitelist the important systemd units with the --collector.systemd.unit-whitelist flag. We're explicitly whitelisting units because we're effectively charged for the number of metrics in Grafana Cloud, so it makes sense to export only the interesting units and ignore the rest.

--collector.systemd.unit-whitelist="teleport.service"
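
One thing to keep in mind: the whitelist flag only takes effect when the systemd collector itself is enabled, and it's disabled by default. A full invocation could look roughly like this (the binary path is my assumption, adjust it to your install):

# enable the systemd collector and restrict it to the Teleport unit
/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.systemd.unit-whitelist="teleport.service"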

Then we're ready to scrape these metrics. For this purpose, I'm gonna use the pretty new Grafana Agent, which behaves here like some sort of lightweight Prometheus. Check the scrape_configs section in the snippet; it's exactly the same configuration you'd write for Prometheus.

server:
  http_listen_port: 12345
prometheus:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 15s
  configs:
    - name: integrations
      scrape_configs:
        - job_name: node_exporter
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'node_systemd_unit_state|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_cpu_seconds_total'
              action: keep
          static_configs:
            - targets:
                - localhost:9100
          relabel_configs:
            - source_labels: [__address__]
              target_label: instance
              replacement: "redacted"
      remote_write:
        - url: https://prometheus-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: "redacted"
            password: "redacted"
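
To actually run the agent with this file, something like the following should do; both paths are my assumptions:

# start Grafana Agent with the config shown above
/usr/local/bin/grafana-agent -config.file=/etc/grafana-agent.yaml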

Note the metric_relabel_configs part in the config file. Let's stop here and add some context. I'm collecting just a small number of metrics there, and I have two reasons for that.

  • As mentioned earlier, I'm charged for the volume, so I really don't want to push myriads of metrics there.
  • And I'm not a fan of collecting everything just because I can. I collect only the metrics I use in dashboards and alerting policies, and I collect more only if it turns out that I don't cover all the important parts (see the sample queries after this list). This approach also comes in handy when you run your own Prometheus: when it collects everything, you're gonna find out that Prometheus is an extremely resource-intensive beast.
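
To illustrate how far those four metric names go, here are the kinds of PromQL expressions they support. These are my examples, not queries taken from the actual dashboards:

# memory utilization in percent, from the two node_memory_* metrics
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# matches when the Teleport unit is not in the active state,
# a typical alerting expression
node_systemd_unit_state{name="teleport.service", state="active"} == 0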

Communication between EC2 and Raspberry Pi

This is the same problem we had with management access. There are a couple of solutions for this particular problem. In the past, I was using a VPN, WireGuard more specifically. But this time I've decided to use a specialized solution developed by Alex Ellis: Inlets PRO. So I've purchased a personal license and created the tunnel. It's just as simple as it sounds: it's a specialized tool developed just for this purpose, so the configuration of the tunnel is dead simple 😍

This is the command I run on the EC2:

/usr/local/bin/inlets-pro server --token=redacted --common-name=redacted --listen-data 127.0.0.1:

And this is the Raspberry pi part:

/usr/local/bin/inlets-pro client --url=wss://redacted:8123/connect --ports=8080 --token=redacted --upstream localhost --license-file=/etc/inlets-pro/license

This combination of commands is gonna forward traffic from EC2's port 8080 to the application running on the Raspberry Pi. And that's why we're here.
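
Before wiring anything else on top, it's worth checking that the tunnel actually works. Since the data plane is bound to loopback, a quick probe from the EC2 instance should reach whatever is listening on the Raspberry Pi's port 8080 (assuming the application there is already running):

➜ curl -I http://127.0.0.1:8080

Now let's proceed to the last piece of the puzzle.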

Automatic TLS and exposing the service to the internet

Let's just summarize what we want here. 

  • Accept HTTP and HTTPS traffic
  • Use some valid TLS certificate
  • Redirect HTTP to HTTPS
  • Forward traffic with a certain Host header to localhost:8080

We can do all of the above with Traefik, an edge proxy built for the Cloud Native era. It's distributed as a single binary, it's ready for all the common use cases, and the only thing we need to do is write a YAML or TOML config file and run it, either directly or as a systemd service.
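
If you go the systemd route, a minimal unit file could look like the following; the binary and config paths are my assumptions, not the exact unit from this setup:

[Unit]
Description=Traefik edge proxy
After=network-online.target

[Service]
# Traefik reads its static configuration from the file passed here
ExecStart=/usr/local/bin/traefik --configFile=/etc/traefik/traefik.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target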

Traefik has really extensive documentation, so let me just share a short sample. The following snippet contains all the general (static) configuration for Traefik itself.

log:
  level: ERROR

accessLog: {}

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"

certificatesResolvers:
  myresolver:
    acme:
      email: stepan@vrany.dev
      storage: acme.json
      httpChallenge:
        entryPoint: web

providers:
  file:
    directory: "/etc/traefik/conf"

The interesting part is the certificatesResolvers stanza. We don't need to specify any domain names there; Traefik requests certificates for the hosts referenced in the router rules, so it just works. Check the second snippet with the actual HTTP routing.

http:
  routers:
    router0:
      tls:
        certResolver: myresolver
      entryPoints:
      - web
      - websecure
      service: raspberrypi01-8080
      rule: Host(`redacted`) && Path(`/`)

  services:
    raspberrypi01-8080:
      loadBalancer:
        servers:
        - url: http://localhost:8080/
        passHostHeader: false

Here you can actually see the complete story: the web entrypoint receives traffic, and if the Host header matches the pattern specified in the `rule` property, it's forwarded to localhost:8080. The rest is handled by Inlets PRO and my Raspberry Pi.

Now I can send some dummy requests to the server's port 80, and I get this result.

➜ curl http://redacted -L -I
HTTP/1.1 308 Permanent Redirect
Location: https://redacted/
Date: Sun, 31 Jan 2021 14:17:02 GMT
Content-Length: 18
Content-Type: text/plain; charset=utf-8

HTTP/2 200 
content-type: text/plain; charset=utf-8
date: Sun, 31 Jan 2021 14:17:02 GMT
content-length: 8

➜ curl https://redacted
Hello, !root

Please note that this is just a fraction of the functionality we can use here. If needed, we can also use some of Traefik's built-in functionality, such as authentication middlewares. Check the Traefik documentation for more details; it's some of the best documentation available.

Stuff that you have to solve along the way

Of course, I did not cover everything in this article. Here's a list of things I had to deal with besides the stuff mentioned above:

  • Configuration of Security Groups in AWS, you have to be really strict here
  • Configuration of ACLs / firewalls in the operating systems (see the sketch after this list)
  • Logging, in my case I'm using Loki because I've chosen Grafana Cloud
  • Tracing, same story as logging. I have collectors configured in Grafana Agent
  • Application metrics, this part is WIP, you'll find out why soon
  • Alerting
  • And a bunch of other stuff
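
For the firewall part, the principle on the EC2 side is to deny everything by default and open only what the stack actually needs; the Raspberry Pi only makes outbound connections, which is the whole point of the tunnels. A ufw sketch under that assumption (ufw is my choice for the example, and the ports are taken from the configs above):

# deny all inbound traffic unless explicitly allowed
ufw default deny incoming
# Traefik entrypoints
ufw allow 80/tcp
ufw allow 443/tcp
# Teleport proxy, port taken from the auth_servers entry
ufw allow 3080/tcp
# Inlets PRO control plane, port taken from the client's wss:// URL
ufw allow 8123/tcp
ufw enable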

Wrap

Now I'd like to tell you one thing. I've built this whole thing, but I actually don't have any concrete use case for such infrastructure 😂 I'm apparently a better infrastructure engineer than an inventor or thinker. But you know what? I'll come up with some ideas soon 😁 In the meantime, I will try to enhance the stuff I've already done.

But the headline is 100% true. It all took like 3 hours. In 2021 we have solutions for everything; you just need to stack them properly.
