Andrew Cooke | Contents | Latest | RSS | Previous | Next

C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Choochoo Training Diary

Last 100 entries

Surprise Paradox; [Books] Good Author List; [Computing] Efficient queries with grouping in Postgres; [Computing] Automatic Wake (Linux); [Computing] AWS CDK Aspects in Go; [Bike] Adidas Gravel Shoes; [Computing, Horror] Biological Chips; [Books] Weird Lit Recs; [Covid] Extended SIR Models; [Art] York-based Printmaker; [Physics] Quantum Transitions are not Instantaneous; [Computing] AI and Drum Machines; [Computing] Probabilities, Stopping Times, Martingales; bpftrace Intro Article; [Computing] Starlab Systems - Linux Laptops; [Computing] Extended Berkeley Packet Filter; [Green] Mainspring Linear Generator; Better Approach; Rummikub Solver; Chilean Poetry; Felicitations - Empowerment Grant; [Bike] Fixing Spyre Brakes (That Need Constant Adjustment); [Computing, Music] Raspberry Pi Media (Audio) Streamer; [Computing] Amazing Hack To Embed DSL In Python; [Bike] Ruta Del Condor (El Alfalfal); [Bike] Estimating Power On Climbs; [Computing] Applying Azure B2C Authentication To Function Apps; [Bike] Gearing On The Back Of An Envelope; [Computing] Okular and Postscript in OpenSuse; There's a fix!; [Computing] Fail2Ban on OpenSuse Leap 15.3 (NFTables); [Cycling, Computing] Power Calculation and Brakes; [Hardware, Computing] Amazing Pockit Computer; Bullying; How I Am - 3 Years Post Accident, 8+ Years With MS; [USA Politics] In America's Uncivil War Republicans Are The Aggressors; [Programming] Selenium and Python; Better Walking Data; [Bike] How Fast Before Walking More Efficient Than Cycling?; [COVID] Coronavirus And Cycling; [Programming] Docker on OpenSuse; Cadence v Speed; [Bike] Gearing For Real Cyclists; [Programming] React plotting - visx; [Programming] React Leaflet; AliExpress Independent Sellers; Applebaum - Twilight of Democracy; [Politics] Back + US Elections; [Programming,Exercise] Simple Timer Script; [News] 2019: The year revolt went global; [Politics] The world's most-surveilled cities; [Bike] Hope Freehub; [Restaurant] Mama Chau's (Chinese, Providencia); [Politics] Brexit Podcast; [Diary] Pneumonia; [Politics] Britain's Reichstag Fire moment; install cairo; [Programming] GCC Sanitizer Flags; [GPU, Programming] Per-Thread Program Counters; My Bike Accident - Looking Back One Year; [Python] Geographic heights are incredibly easy!; [Cooking] Cookie Recipe; Efficient, Simple, Directed Maximisation of Noisy Function; And for argparse; Bash Completion in Python; [Computing] Configuring Github Jekyll Locally; [Maths, Link] The Napkin Project; You can Masquerade in Firewalld; [Bike] Servicing Budget (Spring) Forks; [Crypto] CIA Internet Comms Failure; [Python] Cute Rate Limiting API; [Causality] Judea Pearl Lecture; [Security, Computing] Chinese Hardware Hack Of Supermicro Boards; SQLAlchemy Joined Table Inheritance and Delete Cascade; [Translation] The Club; [Computing] Super Potato Bruh; [Computing] Extending Jupyter; Further HRM Details; [Computing, Bike] Activities in ch2; [Books, Link] Modern Japanese Lit; What ended up there; [Link, Book] Logic Book; Update - Garmin Express / Connect; Garmin Forerunner 35 v 230; [Link, Politics, Internet] Government Trolls; [Link, Politics] Why identity politics benefits the right more than the left; SSH Forwarding; A Specification For Repeating Events; A Fight for the Soul of Science; [Science, Book, Link] Lost In Math; OpenSuse Leap 15 Network Fixes; Update; [Book] Galileo's Middle Finger; [Bike] Chinese Carbon Rims; [Bike] Servicing Shimano XT Front Hub HB-M8010; [Bike] Aliexpress Cycling Tops; [Computing] Change to ssh handling of multiple identities?; [Bike] Endura Hummvee Lite II; [Computing] Marble Based Logic; [Link, Politics] Sanity Check For Nuclear Launch; [Link, Science] Entropy and Life

© 2006-2017 Andrew Cooke (site) / post authors (content).

Optimizing Julia Code

From: andrew cooke <andrew@...>

Date: Wed, 1 Jan 2014 12:51:06 -0300

I've just spent a little less than an hour optimizing some Julia code - I got
a factor of 6 or 7 speedup.

The program is a depth-first search to find the internal state of an RC5
cipher (you can the current code at
https://github.com/andrewcooke/BlockCipherSelfStudy.jl/blob/master/src/RC5.jl#L483
or nearby), so I initially focussed on the inner loop of the encryption.

In an interactive julia session I ran code like:

  include("RC5.jl")
  s = RC5.State(Uint32, 0x6, Uint8[0x0], rotate=false)
  @time (for i in 1:1000; RC5.encrypt(s, 0x12345678, 0x87654321); end)

which printed something like

  elapsed time: 0.000249587 seconds (64000 bytes allocated)

That seemed a little odd - I couldn't see why any memory needed to be
allocated.  It turns out that the tuple returned by the routine needs some
memory.

So I replaced that return with a mutable structure (a "type" in Julia
parlance) passed as an argument, which removed any memory use and gave a
speedup (at the cost of some ugliness).

At this point it wasn't clear what else to do, so I decided to profile.  I
modified the source to look like

  @time key_from_encrypt(1, make_solve_dfs_noro(Uint32, 0x3, 32),
		   fake_keygen(Uint32, 0x3, 0x10, rotate=false),
		   k -> (a, b) -> encrypt(k, a, b),
		   eq=same_ctext(1024, encrypt))
  @time key_from_encrypt(1, make_solve_dfs_noro(Uint32, 0x3, 32),
		   fake_keygen(Uint32, 0x3, 0x10, rotate=false),
		   k -> (a, b) -> encrypt(k, a, b),
		   eq=same_ctext(1024, encrypt))

and ran it from the command line.  The first @time is larger (JIT warming up),
but the second gave a baseline value for the elapsed time.

Then I changed the code to

  @time key_from_encrypt(1, make_solve_dfs_noro(Uint32, 0x3, 32),
		   fake_keygen(Uint32, 0x3, 0x10, rotate=false),
		   k -> (a, b) -> encrypt(k, a, b),
		   eq=same_ctext(1024, encrypt))
  @profile key_from_encrypt(1, make_solve_dfs_noro(Uint32, 0x3, 32),
		   fake_keygen(Uint32, 0x3, 0x10, rotate=false),
		   k -> (a, b) -> encrypt(k, a, b),
		   eq=same_ctext(1024, encrypt))

(note @time -> @profile) and, in an iteractive Julia session, loaded the
module a couple of times.  Once I had done that, running

  Profile.print()

generated a profile like

  julia> Profile.print()
  2   ...erSelfStudy/src/RC5.jl; set_state; line: 472
  1   ...erSelfStudy/src/RC5.jl; test_bits; line: 458
  1   ...erSelfStudy/src/RC5.jl; test_bits; line: 463
       1   no file; anonymous; line: 7
       2   no file; anonymous; line: 8
	  1 task.jl; produce; line: 52
	  1 task.jl; produce; line: 53
       1   no file; anonymous; line: 9
       5   no file; anonymous; line: 25
	  2 task.jl; produce; line: 53
       1   no file; anonymous; line: 26
	 1   no file; anonymous; line: 38
			       461 .../src/RC5.jl; solutions; line: 14
				     456 ...Solve.jl; key_from_encrypt; line: 11
				       8   .../RC5.jl; solve; line: 486
				       1   .../RC5.jl; solve; line: 490
				       1   .../RC5.jl; solve; line: 491
				       190 .../RC5.jl; solve; line: 494
					10 .../RC5.jl; set_state; line: 472
					9  .../RC5.jl; set_state; line: 473
					32 .../RC5.jl; set_state; line: 474
					69 .../RC5.jl; set_state; line: 475
					62 .../RC5.jl; set_state; line: 477
					6  .../RC5.jl; set_state; line: 479
				       14  .../RC5.jl; solve; line: 495
				       225 .../RC5.jl; solve; line: 496
					 1   ...RC5.jl; test_bits; line: 458
					 1   ...RC5.jl; test_bits; line: 463
					  1  ...RC5.jl; encrypt; line: 108
					  1  ...RC5.jl; encrypt; line: 122
					  5  ...RC5.jl; test_bits; line: 456
					  18 ...RC5.jl; test_bits; line: 458
					  6  ...RC5.jl; test_bits; line: 459
					  13 ...RC5.jl; test_bits; line: 460
					  73 ...RC5.jl; test_bits; line: 461
				       +1 6  ...RC5.jl; encrypt; line: 108
				       +1 6  ...RC5.jl; encrypt; line: 109
				       +1 14 ...RC5.jl; encrypt; line: 110
				       +1 2  ...RC5.jl; encrypt; line: 112
				       +1 23 ...RC5.jl; encrypt; line: 115
				       +1 5  ...RC5.jl; encrypt; line: 116
				       +1 1  ...RC5.jl; encrypt; line: 117
				       +1 11 ...RC5.jl; encrypt; line: 120
				       +1 3  ...RC5.jl; encrypt; line: 122
					  6  ...RC5.jl; test_bits; line: 462
					  7  ...RC5.jl; test_bits; line: 463
				       8   .../RC5.jl; solve; line: 502
				       2   .../RC5.jl; solve; line: 508
					 2 string.jl; println; line: 8
				       +1 2 string.jl; println; line: 5
				       +4 2 string.jl; print; line: 4
				     5   ...Solve.jl; key_from_encrypt; line: 12
				       2 ...Solve.jl; eq; line: 39
					1 array.jl; collect; line: 264
					 1 task.jl; done; line: 82
					  1 task.jl; consume; line: 68
					1 array.jl; collect; line: 265
				       2 ...Solve.jl; eq; line: 40
					 2 array.jl; collect; line: 264
					  2 task.jl; done; line: 82
				       +1 2 task.jl; consume; line: 68
				       1 ...Solve.jl; eq; line: 41
					 1 array.jl; collect; line: 265

Now that's the profile after fixing a few things.  The initial data showed
that:

(1) the fix I had done to encrypt() (adding the mutable output structure) had
no real effect in the larger scheme of things (so I reverted the code, since
it was ugly).

(2) the first slow spot was a call to zip(...).  I replaced that with an
explicit loop over indices.

(3) the second slow spot was an increment of the state being searched over.
This was an integer, so a significant chunk of my program's time was basically
spent adding 1 to a value.  That seemed a bit odd (although it's true it was
the innermost operation).  But it turned out that I was selecting the smallest
integer I needed (typically Uint16) and replacing that with Uint64 for all
cases speeded things up significantly.

So the lessons are:

 - profile before optimising the "obvious" (as ever!)

 - avoid exotic integer types when they are not needed (unfortunately I do
   need them in the core crypto, which relies on overflow).

 - avoid high level functional idioms (like zip) in the inner loop.

The last of these is a disappointment.  I wonder if the optimizer will improve
one day.

Andrew

Recursion OK

From: andrew cooke <andrew@...>

Date: Wed, 1 Jan 2014 16:08:46 -0300

Looking at the code I was discussing above, I realised that I had prematurely
optimized the DFS to explicitly store the stack, rather than recursing.

It turns out that a recursive implementation is fine (possibly very slightly
faster).

Andrew

Comment on this post