public inbox for isar-users@googlegroups.com
 help / color / mirror / Atom feed
* [PATCH] testsuite: Improve SSH ping
@ 2023-03-17  4:11 Uladzimir Bely
  2023-03-17  8:17 ` Henning Schild
  2023-03-17  8:53 ` Henning Schild
  0 siblings, 2 replies; 6+ messages in thread
From: Uladzimir Bely @ 2023-03-17  4:11 UTC (permalink / raw)
  To: isar-users

When qemu machine boots, it may happen that consecutive SSH connection
fails right after the previous good one. So, we get a situation when
the command/script fails after we consider SSH is ready.

This patch improves detection of SSH server ready status by making at
least three good consecutive SSH pings.

Example of debug output that shows the case:

```
| Waiting for SSH server ready...
| SSH ping result: 255, left: 300s # <== machine is booting
| SSH ping result: 255, left: 294s
| SSH ping result: 255, left: 288s
| SSH ping result: 255, left: 282s
| SSH ping result: 255, left: 276s
| SSH ping result: 255, left: 270s
| SSH ping result: 255, left: 264s
| SSH ping result: 255, left: 258s
| SSH ping result: 0, left: 253s   # <== SSH server is up...
| SSH ping result: 0, left: 251s
| SSH ping result: 255, left: 250s # <== but one ping failed again
| SSH ping result: 0, left: 248s
| SSH ping result: 0, left: 245s
| SSH ping result: 0, left: 243s
| SSH server is ready
| `lsmod | grep example_module` returned 0
```

Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
---
 testsuite/cibuilder.py | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
index 9e84c3a3..4e568b8e 100755
--- a/testsuite/cibuilder.py
+++ b/testsuite/cibuilder.py
@@ -257,17 +257,25 @@ class CIBuilder(Test):
         self.log.debug('Waiting for SSH server ready...')
 
         rc = None
+        goodcnt = 0
         while time.time() < timeout:
             if proc.poll() is not None:
                 self.log.error('Machine is not running')
                 return rc
 
             rc = self.exec_cmd('/bin/true', cmd_prefix)
+            time_left = timeout - time.time()
+            self.log.debug('SSH ping result: %d, left: %.fs' % (rc, time_left))
             time.sleep(1)
 
             if rc == 0:
-                self.log.debug('SSH server is ready')
-                break
+                goodcnt += 1
+                # Let 3 good SSH pings to make sure SSH connection is stable
+                if goodcnt >= 3:
+                    self.log.debug('SSH server is ready')
+                    break
+            else:
+                goodcnt = 0
 
         if rc != 0:
             self.log.error('SSH server is not ready')
-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] testsuite: Improve SSH ping
  2023-03-17  4:11 [PATCH] testsuite: Improve SSH ping Uladzimir Bely
@ 2023-03-17  8:17 ` Henning Schild
  2023-03-17  8:53 ` Henning Schild
  1 sibling, 0 replies; 6+ messages in thread
From: Henning Schild @ 2023-03-17  8:17 UTC (permalink / raw)
  To: Uladzimir Bely; +Cc: isar-users

Am Fri, 17 Mar 2023 05:11:30 +0100
schrieb Uladzimir Bely <ubely@ilbers.de>:

> When qemu machine boots, it may happen that consecutive SSH connection
> fails right after the previous good one. So, we get a situation when
> the command/script fails after we consider SSH is ready.
> 
> This patch improves detection of SSH server ready status by making at
> least three good consecutive SSH pings.
> 
> Example of debug output that shows the case:
> 
> ```
> | Waiting for SSH server ready...
> | SSH ping result: 255, left: 300s # <== machine is booting
> | SSH ping result: 255, left: 294s
> | SSH ping result: 255, left: 288s
> | SSH ping result: 255, left: 282s
> | SSH ping result: 255, left: 276s
> | SSH ping result: 255, left: 270s
> | SSH ping result: 255, left: 264s
> | SSH ping result: 255, left: 258s
> | SSH ping result: 0, left: 253s   # <== SSH server is up...
> | SSH ping result: 0, left: 251s
> | SSH ping result: 255, left: 250s # <== but one ping failed again
> | SSH ping result: 0, left: 248s
> | SSH ping result: 0, left: 245s
> | SSH ping result: 0, left: 243s
> | SSH server is ready
> | `lsmod | grep example_module` returned 0
> ```
> 
> Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
> ---
>  testsuite/cibuilder.py | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
> index 9e84c3a3..4e568b8e 100755
> --- a/testsuite/cibuilder.py
> +++ b/testsuite/cibuilder.py
> @@ -257,17 +257,25 @@ class CIBuilder(Test):
>          self.log.debug('Waiting for SSH server ready...')
>  
>          rc = None
> +        goodcnt = 0
>          while time.time() < timeout:
>              if proc.poll() is not None:
>                  self.log.error('Machine is not running')
>                  return rc
>  
>              rc = self.exec_cmd('/bin/true', cmd_prefix)
> +            time_left = timeout - time.time()
> +            self.log.debug('SSH ping result: %d, left: %.fs' % (rc,
> time_left)) time.sleep(1)
>  
>              if rc == 0:
> -                self.log.debug('SSH server is ready')
> -                break
> +                goodcnt += 1

I find all that code kind of hard to read. With the multiple
indirections to find the command that actually gets executed in the
end. A quick look makes one think it is "/bin/true".

and for a counting loop python has
"for i in range(3)"

The idea to wait a bit longer is good, and if that works it can be
merged. But the code could likely by written in a way that it would be
easier to read and maintain.

Henning

> +                # Let 3 good SSH pings to make sure SSH connection
> is stable
> +                if goodcnt >= 3:
> +                    self.log.debug('SSH server is ready')
> +                    break
> +            else:
> +                goodcnt = 0
>  
>          if rc != 0:
>              self.log.error('SSH server is not ready')


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] testsuite: Improve SSH ping
  2023-03-17  4:11 [PATCH] testsuite: Improve SSH ping Uladzimir Bely
  2023-03-17  8:17 ` Henning Schild
@ 2023-03-17  8:53 ` Henning Schild
  2023-03-20  8:51   ` Uladzimir Bely
  1 sibling, 1 reply; 6+ messages in thread
From: Henning Schild @ 2023-03-17  8:53 UTC (permalink / raw)
  To: Uladzimir Bely; +Cc: isar-users

Am Fri, 17 Mar 2023 05:11:30 +0100
schrieb Uladzimir Bely <ubely@ilbers.de>:

> When qemu machine boots, it may happen that consecutive SSH connection
> fails right after the previous good one. So, we get a situation when
> the command/script fails after we consider SSH is ready.
> 
> This patch improves detection of SSH server ready status by making at
> least three good consecutive SSH pings.
> 
> Example of debug output that shows the case:
> 
> ```
> | Waiting for SSH server ready...
> | SSH ping result: 255, left: 300s # <== machine is booting
> | SSH ping result: 255, left: 294s
> | SSH ping result: 255, left: 288s
> | SSH ping result: 255, left: 282s
> | SSH ping result: 255, left: 276s
> | SSH ping result: 255, left: 270s
> | SSH ping result: 255, left: 264s
> | SSH ping result: 255, left: 258s
> | SSH ping result: 0, left: 253s   # <== SSH server is up...
> | SSH ping result: 0, left: 251s
> | SSH ping result: 255, left: 250s # <== but one ping failed again
> | SSH ping result: 0, left: 248s
> | SSH ping result: 0, left: 245s
> | SSH ping result: 0, left: 243s
> | SSH server is ready
> | `lsmod | grep example_module` returned 0
> ```
> 
> Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
> ---
>  testsuite/cibuilder.py | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
> index 9e84c3a3..4e568b8e 100755
> --- a/testsuite/cibuilder.py
> +++ b/testsuite/cibuilder.py
> @@ -257,17 +257,25 @@ class CIBuilder(Test):
>          self.log.debug('Waiting for SSH server ready...')
>  
>          rc = None
> +        goodcnt = 0
>          while time.time() < timeout:
>              if proc.poll() is not None:
>                  self.log.error('Machine is not running')
>                  return rc
>  
>              rc = self.exec_cmd('/bin/true', cmd_prefix)
> +            time_left = timeout - time.time()
> +            self.log.debug('SSH ping result: %d, left: %.fs' % (rc,
> time_left)) time.sleep(1)
>  
>              if rc == 0:
> -                self.log.debug('SSH server is ready')
> -                break
> +                goodcnt += 1
> +                # Let 3 good SSH pings to make sure SSH connection
> is stable
> +                if goodcnt >= 3:
> +                    self.log.debug('SSH server is ready')
> +                    break
> +            else:
> +                goodcnt = 0

This looks like an endless loop should ssh never come up. Not sure what
would break that loop. In the worst case a test-timeout and everything
being stuck because we might not execute these things in parallel.

Henning

>  
>          if rc != 0:
>              self.log.error('SSH server is not ready')


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] testsuite: Improve SSH ping
  2023-03-17  8:53 ` Henning Schild
@ 2023-03-20  8:51   ` Uladzimir Bely
  2023-03-20  9:12     ` Henning Schild
  0 siblings, 1 reply; 6+ messages in thread
From: Uladzimir Bely @ 2023-03-20  8:51 UTC (permalink / raw)
  To: Henning Schild; +Cc: isar-users

In mail from Friday, 17 March 2023 11:53:58 +03 user Henning Schild wrote:
> Am Fri, 17 Mar 2023 05:11:30 +0100
> 
> schrieb Uladzimir Bely <ubely@ilbers.de>:
> > When qemu machine boots, it may happen that consecutive SSH connection
> > fails right after the previous good one. So, we get a situation when
> > the command/script fails after we consider SSH is ready.
> > 
> > This patch improves detection of SSH server ready status by making at
> > least three good consecutive SSH pings.
> > 
> > Example of debug output that shows the case:
> > 
> > ```
> > 
> > | Waiting for SSH server ready...
> > | SSH ping result: 255, left: 300s # <== machine is booting
> > | SSH ping result: 255, left: 294s
> > | SSH ping result: 255, left: 288s
> > | SSH ping result: 255, left: 282s
> > | SSH ping result: 255, left: 276s
> > | SSH ping result: 255, left: 270s
> > | SSH ping result: 255, left: 264s
> > | SSH ping result: 255, left: 258s
> > | SSH ping result: 0, left: 253s   # <== SSH server is up...
> > | SSH ping result: 0, left: 251s
> > | SSH ping result: 255, left: 250s # <== but one ping failed again
> > | SSH ping result: 0, left: 248s
> > | SSH ping result: 0, left: 245s
> > | SSH ping result: 0, left: 243s
> > | SSH server is ready
> > | `lsmod | grep example_module` returned 0
> > 
> > ```
> > 
> > Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
> > ---
> > 
> >  testsuite/cibuilder.py | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
> > index 9e84c3a3..4e568b8e 100755
> > --- a/testsuite/cibuilder.py
> > +++ b/testsuite/cibuilder.py
> > 
> > @@ -257,17 +257,25 @@ class CIBuilder(Test):
> >          self.log.debug('Waiting for SSH server ready...')
> >          
> >          rc = None
> > 
> > +        goodcnt = 0
> > 
> >          while time.time() < timeout:
> >              if proc.poll() is not None:
> >                  self.log.error('Machine is not running')
> >                  return rc
> >              
> >              rc = self.exec_cmd('/bin/true', cmd_prefix)
> > 
> > +            time_left = timeout - time.time()
> > +            self.log.debug('SSH ping result: %d, left: %.fs' % (rc,
> > time_left)) time.sleep(1)
> > 
> >              if rc == 0:
> > -                self.log.debug('SSH server is ready')
> > -                break
> > +                goodcnt += 1
> > +                # Let 3 good SSH pings to make sure SSH connection
> > is stable
> > +                if goodcnt >= 3:
> > +                    self.log.debug('SSH server is ready')
> > +                    break
> > +            else:
> > +                goodcnt = 0
> 
> This looks like an endless loop should ssh never come up. Not sure what
> would break that loop. In the worst case a test-timeout and everything
> being stuck because we might not execute these things in parallel.
> 

In case ssh never come up, we will exit after the timeout (default value 
600sec), goodcnt will always be 0 in this case.

Using something more readable like "for i in range(3)" you've suggested in the 
previous comment won't work because we need to reset the counter and start 
again in case one of pings goes wrong. E.g, if second ping fails, we need 
`goodcnt` changes like 0-0-0-1-0-1-2-3. With "range" function it will act like 
0-0-0-1-0-3.

But you are right, code like "while time.time() < timeout and goodcnt < 3" 
might be a bit more readable. I'll check it.

> Henning
> 
> >          if rc != 0:
> >              self.log.error('SSH server is not ready')





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] testsuite: Improve SSH ping
  2023-03-20  8:51   ` Uladzimir Bely
@ 2023-03-20  9:12     ` Henning Schild
  2023-03-20  9:17       ` Uladzimir Bely
  0 siblings, 1 reply; 6+ messages in thread
From: Henning Schild @ 2023-03-20  9:12 UTC (permalink / raw)
  To: Uladzimir Bely; +Cc: isar-users

Am Mon, 20 Mar 2023 11:51:57 +0300
schrieb Uladzimir Bely <ubely@ilbers.de>:

> In mail from Friday, 17 March 2023 11:53:58 +03 user Henning Schild
> wrote:
> > Am Fri, 17 Mar 2023 05:11:30 +0100
> > 
> > schrieb Uladzimir Bely <ubely@ilbers.de>:  
> > > When qemu machine boots, it may happen that consecutive SSH
> > > connection fails right after the previous good one. So, we get a
> > > situation when the command/script fails after we consider SSH is
> > > ready.
> > > 
> > > This patch improves detection of SSH server ready status by
> > > making at least three good consecutive SSH pings.
> > > 
> > > Example of debug output that shows the case:
> > > 
> > > ```
> > > 
> > > | Waiting for SSH server ready...
> > > | SSH ping result: 255, left: 300s # <== machine is booting
> > > | SSH ping result: 255, left: 294s
> > > | SSH ping result: 255, left: 288s
> > > | SSH ping result: 255, left: 282s
> > > | SSH ping result: 255, left: 276s
> > > | SSH ping result: 255, left: 270s
> > > | SSH ping result: 255, left: 264s
> > > | SSH ping result: 255, left: 258s
> > > | SSH ping result: 0, left: 253s   # <== SSH server is up...
> > > | SSH ping result: 0, left: 251s
> > > | SSH ping result: 255, left: 250s # <== but one ping failed again
> > > | SSH ping result: 0, left: 248s
> > > | SSH ping result: 0, left: 245s
> > > | SSH ping result: 0, left: 243s
> > > | SSH server is ready
> > > | `lsmod | grep example_module` returned 0
> > > 
> > > ```
> > > 
> > > Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
> > > ---
> > > 
> > >  testsuite/cibuilder.py | 12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
> > > index 9e84c3a3..4e568b8e 100755
> > > --- a/testsuite/cibuilder.py
> > > +++ b/testsuite/cibuilder.py
> > > 
> > > @@ -257,17 +257,25 @@ class CIBuilder(Test):
> > >          self.log.debug('Waiting for SSH server ready...')
> > >          
> > >          rc = None
> > > 
> > > +        goodcnt = 0
> > > 
> > >          while time.time() < timeout:
> > >              if proc.poll() is not None:
> > >                  self.log.error('Machine is not running')
> > >                  return rc
> > >              
> > >              rc = self.exec_cmd('/bin/true', cmd_prefix)
> > > 
> > > +            time_left = timeout - time.time()
> > > +            self.log.debug('SSH ping result: %d, left: %.fs' %
> > > (rc, time_left)) time.sleep(1)
> > > 
> > >              if rc == 0:
> > > -                self.log.debug('SSH server is ready')
> > > -                break
> > > +                goodcnt += 1
> > > +                # Let 3 good SSH pings to make sure SSH
> > > connection is stable
> > > +                if goodcnt >= 3:
> > > +                    self.log.debug('SSH server is ready')
> > > +                    break
> > > +            else:
> > > +                goodcnt = 0  
> > 
> > This looks like an endless loop should ssh never come up. Not sure
> > what would break that loop. In the worst case a test-timeout and
> > everything being stuck because we might not execute these things in
> > parallel. 
> 
> In case ssh never come up, we will exit after the timeout (default
> value 600sec), goodcnt will always be 0 in this case.
> 
> Using something more readable like "for i in range(3)" you've
> suggested in the previous comment won't work because we need to reset
> the counter and start again in case one of pings goes wrong. E.g, if
> second ping fails, we need `goodcnt` changes like 0-0-0-1-0-1-2-3.
> With "range" function it will act like 0-0-0-1-0-3.

It is a nested loop, obviously the outer one needs to be changed a bit
as well. But the inner can maybe be done with range.

Henning

> But you are right, code like "while time.time() < timeout and goodcnt
> < 3" might be a bit more readable. I'll check it.
> 
> > Henning
> >   
> > >          if rc != 0:
> > >              self.log.error('SSH server is not ready')  
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] testsuite: Improve SSH ping
  2023-03-20  9:12     ` Henning Schild
@ 2023-03-20  9:17       ` Uladzimir Bely
  0 siblings, 0 replies; 6+ messages in thread
From: Uladzimir Bely @ 2023-03-20  9:17 UTC (permalink / raw)
  To: Henning Schild; +Cc: isar-users

In mail from Monday, 20 March 2023 12:12:43 +03 user Henning Schild wrote:
> Am Mon, 20 Mar 2023 11:51:57 +0300
> 
> schrieb Uladzimir Bely <ubely@ilbers.de>:
> > In mail from Friday, 17 March 2023 11:53:58 +03 user Henning Schild
> > 
> > wrote:
> > > Am Fri, 17 Mar 2023 05:11:30 +0100
> > > 
> > > schrieb Uladzimir Bely <ubely@ilbers.de>:
> > > > When qemu machine boots, it may happen that consecutive SSH
> > > > connection fails right after the previous good one. So, we get a
> > > > situation when the command/script fails after we consider SSH is
> > > > ready.
> > > > 
> > > > This patch improves detection of SSH server ready status by
> > > > making at least three good consecutive SSH pings.
> > > > 
> > > > Example of debug output that shows the case:
> > > > 
> > > > ```
> > > > 
> > > > | Waiting for SSH server ready...
> > > > | SSH ping result: 255, left: 300s # <== machine is booting
> > > > | SSH ping result: 255, left: 294s
> > > > | SSH ping result: 255, left: 288s
> > > > | SSH ping result: 255, left: 282s
> > > > | SSH ping result: 255, left: 276s
> > > > | SSH ping result: 255, left: 270s
> > > > | SSH ping result: 255, left: 264s
> > > > | SSH ping result: 255, left: 258s
> > > > | SSH ping result: 0, left: 253s   # <== SSH server is up...
> > > > | SSH ping result: 0, left: 251s
> > > > | SSH ping result: 255, left: 250s # <== but one ping failed again
> > > > | SSH ping result: 0, left: 248s
> > > > | SSH ping result: 0, left: 245s
> > > > | SSH ping result: 0, left: 243s
> > > > | SSH server is ready
> > > > | `lsmod | grep example_module` returned 0
> > > > 
> > > > ```
> > > > 
> > > > Signed-off-by: Uladzimir Bely <ubely@ilbers.de>
> > > > ---
> > > > 
> > > >  testsuite/cibuilder.py | 12 ++++++++++--
> > > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/testsuite/cibuilder.py b/testsuite/cibuilder.py
> > > > index 9e84c3a3..4e568b8e 100755
> > > > --- a/testsuite/cibuilder.py
> > > > +++ b/testsuite/cibuilder.py
> > > > 
> > > > @@ -257,17 +257,25 @@ class CIBuilder(Test):
> > > >          self.log.debug('Waiting for SSH server ready...')
> > > >          
> > > >          rc = None
> > > > 
> > > > +        goodcnt = 0
> > > > 
> > > >          while time.time() < timeout:
> > > >              if proc.poll() is not None:
> > > >                  self.log.error('Machine is not running')
> > > >                  return rc
> > > >              
> > > >              rc = self.exec_cmd('/bin/true', cmd_prefix)
> > > > 
> > > > +            time_left = timeout - time.time()
> > > > +            self.log.debug('SSH ping result: %d, left: %.fs' %
> > > > (rc, time_left)) time.sleep(1)
> > > > 
> > > >              if rc == 0:
> > > > -                self.log.debug('SSH server is ready')
> > > > -                break
> > > > +                goodcnt += 1
> > > > +                # Let 3 good SSH pings to make sure SSH
> > > > connection is stable
> > > > +                if goodcnt >= 3:
> > > > +                    self.log.debug('SSH server is ready')
> > > > +                    break
> > > > +            else:
> > > > +                goodcnt = 0
> > > 
> > > This looks like an endless loop should ssh never come up. Not sure
> > > what would break that loop. In the worst case a test-timeout and
> > > everything being stuck because we might not execute these things in
> > > parallel.
> > 
> > In case ssh never come up, we will exit after the timeout (default
> > value 600sec), goodcnt will always be 0 in this case.
> > 
> > Using something more readable like "for i in range(3)" you've
> > suggested in the previous comment won't work because we need to reset
> > the counter and start again in case one of pings goes wrong. E.g, if
> > second ping fails, we need `goodcnt` changes like 0-0-0-1-0-1-2-3.
> > With "range" function it will act like 0-0-0-1-0-3.
> 
> It is a nested loop, obviously the outer one needs to be changed a bit
> as well. But the inner can maybe be done with range.
> 

But there are no currently any nested loops, there is only "while".

Or you mean it should be?

> Henning
> 
> > But you are right, code like "while time.time() < timeout and goodcnt
> > < 3" might be a bit more readable. I'll check it.
> > 
> > > Henning
> > > 
> > > >          if rc != 0:
> > > >              self.log.error('SSH server is not ready')





^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-03-20  9:17 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-17  4:11 [PATCH] testsuite: Improve SSH ping Uladzimir Bely
2023-03-17  8:17 ` Henning Schild
2023-03-17  8:53 ` Henning Schild
2023-03-20  8:51   ` Uladzimir Bely
2023-03-20  9:12     ` Henning Schild
2023-03-20  9:17       ` Uladzimir Bely

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox