Jenkins / JENKINS-51880

Increase immunity to network outages


      The plugin is fairly robust for uploads: most network outages and difficulties do not
      affect them, or only increase upload times. Downloads are weaker, and network
      outages break them.

      Network outages test

      Scenario: we have a Jenkins instance with the S3 Artifact Manager Plugin installed,
      and a ToxiProxy service connected to a Squid HTTP proxy. We configured a proxy on
      ToxiProxy that redirects port 8888 to the Squid service port 3128, and we configured
      the Jenkins instance to use port 8888 as its proxy (with the Java properties
      -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888 -Dhttps.proxyHost=127.0.0.1 -Dhttps.proxyPort=8888).
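      For reference, a minimal sketch of starting the Jenkins controller with those proxy
      properties (the jenkins.war location is an assumption, adjust for your installation):

```shell
# Start Jenkins with the JVM proxy properties from the scenario above.
# The jenkins.war path is an assumption, not part of the original setup.
java -Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888 \
     -Dhttps.proxyHost=127.0.0.1 -Dhttps.proxyPort=8888 \
     -jar jenkins.war
```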

      Test Scripts

      Big-file test

      def file = "test.bin"
      
      timestamps {
          node() {
            stage("Generating ${file}") {
              sh "[ -f ${file} ] || dd if=/dev/urandom of=${file} bs=10240 count=102400"
            }
            stage('Archive') {
              archiveArtifacts file
            }
            stage('Unarchive') {
              unarchive mapping: ["${file}": 'test.bin']
            }
          }
      }
      

      Small-files Test

      timestamps {
          node() {
            stage('Setup') {
              for(def i = 1; i < 100; i++) {
                writeFile file: "test/test-${i}.txt", text: "test ${i}"
              }
            }
            stage('Archive') {
              archiveArtifacts "test/*"
            }
            stage('Unarchive') {
              dir('unarch') {
                deleteDir()
                unarchive mapping: ["test/": '.']
              }
            }
          }
      }
      

      Stash Test

      timestamps {
          node() {
            stage('Setup') {
              for(def i = 1; i < 100; i++) {
                writeFile file: "test/test-${i}.txt", text: "test ${i}"
              }
              
            }
            stage('Archive') {
              stash name: 'stuff', includes: 'test/'
            }
            stage('Unarchive') {
              dir('unarch') {
                deleteDir()
                unstash name: 'stuff'
              }
            }
          }
      }
      

      Prepare the environment

      export NET=172.18.5.0/24
      export squid=172.18.5.10
      export toxiproxy=172.18.5.11
      
      export HOSTS="--add-host squid.example.com:${squid} \
        --add-host toxiproxy.example.com:${toxiproxy}"
      
      docker network create --subnet=${NET} toxiNetwork
      

      start the toxiproxy docker container

      docker pull shopify/toxiproxy
      docker run -it -p 8474:8474 -p 8888:8888 --rm --ip ${toxiproxy} --net toxiNetwork ${HOSTS} --name toxiproxy shopify/toxiproxy
      

      start squid

      docker run -d -p 3128:3128 --rm --ip ${squid} --net toxiNetwork ${HOSTS} --name squid minimum2scp/squid
      

      create a configuration file with a redirection rule

      cat <<EOF > toxiproxy.json
      [{
        "name": "squid",
        "listen": "${toxiproxy}:8888",
        "upstream": "${squid}:3128"
      }]
      EOF
      

      load the configuration in toxiproxy

      curl -X POST http://127.0.0.1:8474/populate -d"@toxiproxy.json" && echo
      

      this command enables/disables the proxy; we will use it to simulate network outages.

      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": true}' && echo
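      A small hypothetical wrapper around this call can make the outage scripts below
      easier to read (TOXIPROXY_API and the function names are assumptions, not part of
      the original test):

```shell
# Hypothetical helper around the toxiproxy enable/disable call above.
TOXIPROXY_API=${TOXIPROXY_API:-http://127.0.0.1:8474}

# Build the JSON body for the "enabled" flag.
proxy_body() { printf '{"enabled": %s}' "$1"; }

# Toggle the "squid" proxy: set_proxy true | set_proxy false
set_proxy() {
  curl -s -X POST "${TOXIPROXY_API}/proxies/squid" -d "$(proxy_body "$1")"
}
```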
      

      Simulate a TIME second continuous network outage

      launch the jobs, then execute this script. Change the variable TIME to test
      1, 5, 10, and 30 seconds of outage per 1 second of connection.

      export TIME=1
      while true;
      do
        curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": true}' && echo
        sleep 1
        curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": false}' && echo
        sleep ${TIME}
      done 
      

      Simulate a TIME second isolated network outage

      launch the jobs, then execute this script once in each job stage. Change the
      variable TIME to test 1, 5, 10, and 30 seconds of outage.

      export TIME=1
      
      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": true}' && echo
      sleep 1
      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": false}' && echo
      sleep ${TIME}
      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": true}' && echo
      sleep 1
      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": false}' && echo
      sleep ${TIME}
      curl -X POST http://127.0.0.1:8474/proxies/squid -d '{"enabled": true}' && echo
      

      Simulate latency

      Create the toxic and execute the test jobs

      curl -X POST http://127.0.0.1:8474/proxies/squid/toxics -d '{"name": "latency_squid", "type": "latency", "stream": "upstream", "toxicity": 1.0, "attributes": {"latency": 1000, "jitter": 1000} }' && echo
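      For context, ToxiProxy's latency toxic delays data by latency plus or minus a random
      jitter, both in milliseconds, so with the values above each transfer sees a delay in
      roughly this window (a back-of-the-envelope sketch, not part of the original test):

```shell
# ToxiProxy latency toxic: delay = latency +/- jitter, in milliseconds.
latency=1000   # "latency" attribute from the toxic above
jitter=1000    # "jitter" attribute from the toxic above
echo "delay window: $((latency - jitter)) to $((latency + jitter)) ms"
# prints "delay window: 0 to 2000 ms"
```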
      

      when you have finished, delete the toxic

      curl -X DELETE http://127.0.0.1:8474/proxies/squid/toxics/latency_squid
      

      Simulate bandwidth limitations

      Create the toxic and execute the test jobs

      curl -X POST http://127.0.0.1:8474/proxies/squid/toxics -d '{"name": "bandwidth_squid", "type": "bandwidth", "stream": "upstream", "toxicity": 1.0, "attributes": {"rate": 1024} }' && echo
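      For scale, the bandwidth toxic's rate attribute is in KB/s, so a rough lower bound
      on the 1 GB big-file transfer time can be estimated (a sketch, not part of the
      original test):

```shell
# Bandwidth toxic: "rate" is in KB/s; estimate the floor on transfer time
# for the 1 GB file from the big-file test job.
rate=1024                   # KB/s, as in the toxic above
size_kb=$((1024 * 1024))    # 1 GB expressed in KB
echo "at ${rate} KB/s, 1 GB takes at least $((size_kb / rate)) s"
# prints "at 1024 KB/s, 1 GB takes at least 1024 s"
```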
      

      when you have finished, delete the toxic

      curl -X DELETE http://127.0.0.1:8474/proxies/squid/toxics/bandwidth_squid
      

      Simulate slow_close

      Create the toxic and execute the test jobs

      curl -X POST http://127.0.0.1:8474/proxies/squid/toxics -d '{"name": "slowclose_squid", "type": "slow_close", "stream": "upstream", "toxicity": 1.0, "attributes": {"delay": 1000} }' && echo
      

      when you have finished, delete the toxic

      curl -X DELETE http://127.0.0.1:8474/proxies/squid/toxics/slowclose_squid
      

      Test Results

      Continuous network outages

      We simulate a network outage of N seconds, then restore the network for one second,
      and start a new network outage; we repeat this process until the job finishes.

      Big files

      we run a test job that archives a 1 GB file, and we try different network outage times

      • 1 second - it fails consistently
      • 5 seconds - it fails consistently
      • 10 seconds - it fails consistently
      • 30 seconds - it fails consistently
      Small files

      we run a test job that archives and unarchives a few files, and we try different network outage times

      • 1 second - archive is not affected, unarchive fails 90% of the time
      • 5 seconds - archive is not affected, unarchive fails 90% of the time
      • 10 seconds - archive is not affected, unarchive fails 90% of the time
      • 30 seconds - archive is not affected, unarchive fails 90% of the time
      Stash

      we run a test job that stashes and unstashes a few files, and we try different network outage times

      • 1 second - stash is not affected, unstash fails 70% of the time
      • 5 seconds - stash is not affected, unstash fails 70% of the time
      • 10 seconds - stash is not affected, unstash fails 80% of the time
      • 30 seconds - stash is not affected, unstash fails 90% of the time

      Isolated network outages

      We simulate a network outage of N seconds, then restore the network for one second,
      and start a new network outage; finally, we restore the network again.

      Big files

      we run a test job that archives a 1 GB file, and we try different network outage times

      • 1 second - it is not affected
      • 5 seconds - it is not affected
      • 10 seconds - it is not affected, we can see retry messages in the logs
      • 30 seconds - it is not affected, we can see retry messages in the logs
      Small files

      we run a test job that archives and unarchives a few files, and we try different network outage times

      • 1 second - it is not affected
      • 5 seconds - archive is not affected, we can see retry messages in the logs. Unarchive fails consistently.
      • 10 seconds - archive is not affected, we can see retry messages in the logs. Unarchive fails consistently.
      • 30 seconds - archive is not affected, we can see retry messages in the logs. Unarchive fails consistently.
      Stash

      we run a test job that stashes and unstashes a few files, and we try different network outage times

      • 1 second - stash is not affected, we can see retry messages in the logs. Unstash fails consistently.
      • 5 seconds - stash is not affected, we can see retry messages in the logs. Unstash fails consistently.
      • 10 seconds - stash is not affected, we can see retry messages in the logs. Unstash fails consistently.
      • 30 seconds - stash is not affected, we can see retry messages in the logs. Unstash fails consistently.

      Latency and jitter

      we create a toxic with latency and jitter.

      • 1000 ms - not affected
      • 10000 ms - increases times
      • 30000 ms - fails consistently

      Limited bandwidth

      we create a toxic to limit the bandwidth.

      • 1 KB/s - increases times
      • 100 KB/s - increases times
      • 1024 KB/s - increases times
      • 10240 KB/s - increases times

      Slow close

      we create a toxic that delays the TCP socket from closing until the delay has elapsed.

      • 1000 ms - not affected
      • 10000 ms - increases times
      • 30000 ms - increases times

            Reporter: ifernandezcalvo Ivan Fernandez Calvo