Generate random content for SharePoint

This post is about how to create dummy files that you can use in SharePoint for various purposes.

Chapters:

  • Exposition
  • Rising Action
  • Climax
  • Falling Action
  • Dénouement

(If you don’t want to read through all the drama, jump directly to the Falling Action section.)

Exposition

When you are building your on-prem SharePoint environment (yes, there are people who still do that), you might wish to know how your farm behaves under pressure. There are a number of blog posts out there on how to perform stress tests, and I am not planning to create another one. There is one thing that you need to keep in mind though: how file uploads and content crawls are handled in SharePoint 2013 and later versions.

First and foremost, I'm sure you've heard about the Shredded Storage feature that Microsoft introduced with full power in SharePoint 2013. (If you haven't, here's the White Paper and here's Bill Baer's blog about how it works in detail.)

Long story short, the WFE servers chop up every file uploaded to SharePoint (yes, each and every one of them) into 64KB chunks and send these chunks to the SQL server. They also cache these chunks, and if a new chunk is identical to one sent earlier, they skip the transfer and simply instruct SQL to reuse the one it saved before.

Also, while stress testing file uploads makes sense, it's just a small part of your solution, as the content crawl and query processing might be just as important.

Let's see them pieces then.

 

Rising Action

Act 1

So this Shredded Storage chops files up into 64KB pieces? So what? As I stated earlier, this causes the shreds to be cached on the WFE server, so we cannot re-use the same file with a different name, because the number of roundtrips to the SQL server would be halved, skewing the test. Cool, so we need to use unique content for the files. At this point you might wonder what the big fuss is about this topic that I decided to create a whole post about it, since plenty of tools are available to generate files with random garbage in them.

True... However, remember that you want to use these files for other purposes too: for testing the whole Search lifecycle (Crawling, Content Processing, Index Replication and Query Processing), of course. Which means that your content should make some sense. Now, those of you who are familiar with Word would say: "Yeah, but there is the famous =rand(x,y) function, which generates random text for you.".

True... However, there are two things that invalidate this approach:

  1. Remember that you need unique content. While this function is useful for generating a large number of paragraphs, the content in them will be repetitive, and as we already know, that is not useful for us because of Shredded Storage.
  2. Since Microsoft introduced the Office Open XML file format, the actual text is compressed into a simple ZIP archive, so generating big files (above a few hundred KB) will be quite hard. Yes, we could fill the content with pictures, but that would once again not be the brightest idea if we want the Search to process the file content.

It's getting confusing, isn't it? So many things to keep in mind, so many variables to pay attention to. No wonder stress testing is considered an art.

So to summarize... We need files that:

  • Have unique content.
  • Have sensible content that we can search for.
  • Vary by size.

What is it that would comply with these criteria?

Act 2

GUIDs seem to be a good candidate. Let's see what a GUID looks like then:

 a3284fbb-6fd6-49bc-a0de-f12886a3e648

As you can see, this is a set of alphanumeric characters, divided by hyphens. Wait a second... Is a hyphen a word breaker character? Yes it is. What is not? Well, an underscore. Cool, so we can generate GUIDs, then swap the hyphens with underscores...

  $GUID = ([guid]::NewGuid()).ToString()
  $StringToDump = $GUID.Replace('-','_')

... and put them into files. Now, here we want to use continuous (streaming) file writing; otherwise we would need to build a 1GB file in memory and then save it, which might kill the server.

  $FileStream = New-Object System.IO.FileStream($FullFilePath,[System.IO.FileMode]::CreateNew)
  $StreamWriter = New-Object System.IO.StreamWriter($FileStream,[System.Text.Encoding]::ASCII,128)
  # Logic to generate n number of GUIDs.
  ...
  $StreamWriter.Write($StringToDump)
  ...
  # End of logic to generate n number of GUIDs.
  $StreamWriter.Close()
  $FileStream.Close()
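
To make the skeleton above concrete, here is a minimal, self-contained sketch of the writer loop. The $FullFilePath and $NumberOfGuids variables are placeholders of my own, not the final script's parameters.

  # Minimal sketch of the writer loop; variable names are placeholders.
  $FullFilePath = 'C:\Temp\DummyFile_5KB.txt'   # placeholder path
  $NumberOfGuids = 138                          # 138 GUIDs x 37 characters ~ 5KB

  $FileStream = New-Object System.IO.FileStream($FullFilePath,[System.IO.FileMode]::CreateNew)
  $StreamWriter = New-Object System.IO.StreamWriter($FileStream,[System.Text.Encoding]::ASCII,128)
  for ($i = 0; $i -lt $NumberOfGuids; $i++) {
      # Each GUID becomes a single unique "word" once the hyphens are swapped.
      $StringToDump = ([guid]::NewGuid()).ToString().Replace('-','_')
      $StreamWriter.Write($StringToDump + ' ')
  }
  $StreamWriter.Close()
  $FileStream.Close()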


Awesome, isn't it? It is. It means that we have 36-character-long "words" that are completely unique. (Yes, technically it is possible to create the same GUID twice, but the chances are astronomical, as the number of possible values is 5.3×10^36 (5.3 undecillion) per Wikipedia.)

Things are taking shape out of the purple mist, I assume, so let's start putting it together. I could write a console application, but since I'm not a coder, I am staying with good old PowerShell. So we need a script that accepts a few parameters:

  • A target folder where we're going to save the files.
  • Maybe a file prefix parameter.
  • A set of parameters for different pre-defined file sizes.
    • Some pre-defined sizes:  5KB, 10KB, 20KB, 50KB, 100KB, 200KB, 500KB, 1MB, 2MB, 5MB, 10MB, 20MB, 50MB, 100MB, 200MB, 500MB, 1GB.
  • A parameter for custom sizes.
  • And maybe a switch parameter to create a folder for each file size.

Easy peasy. You wish...

 

Climax

Act 3

Let's see a little math first. A GUID is 36 characters long, plus we need a space as a separator, so one GUID will consume 37 characters in total. How does that look in view of the proposed file sizes?

File size   Number of characters   Number of GUIDs   Remaining characters
5KB         5,120                  138               14
10KB        10,240                 276               28
20KB        20,480                 553               19
50KB        51,200                 1,383             29
100KB       102,400                2,767             21
200KB       204,800                5,535             5
500KB       512,000                13,837            31
1MB         1,048,576              28,339            33
2MB         2,097,152              56,679            29
5MB         5,242,880              141,699           17
10MB        10,485,760             283,398           34
20MB        20,971,520             566,797           31
50MB        52,428,800             1,416,994         22
100MB       104,857,600            2,833,989         7
200MB       209,715,200            5,667,978         14
500MB       524,288,000            14,169,945        35
1GB         1,073,741,824          29,020,049        11
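
If you want to double-check these numbers, a short sketch like this reproduces them (PowerShell understands the KB/MB/GB suffixes natively):

  # Reproduce the table above: 36 characters per GUID plus 1 separator = 37.
  $GuidLength = 37
  foreach ($Size in 5KB, 10KB, 1MB, 1GB) {
      $Guids = [math]::Floor($Size / $GuidLength)
      $Remaining = $Size % $GuidLength
      '{0:N0} characters = {1:N0} GUIDs, {2} characters remaining' -f $Size, $Guids, $Remaining
  }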

Practically this means that to create a 1GB file we need to generate twenty-nine million, twenty thousand and forty-nine GUIDs. Now, that will take some time. Luckily, this is not something you would do every day, so you can leave a machine running for a few days to create all the files for you.

The next thing is that you cannot use a parameter called -5kfiles, for example, because an argument name cannot start with a number. (If you want to know why, read this blog post.) Here you can decide to use some other character as a starter (ex: -_5kfiles), or just swap the size and the scale identifier (ex: -k5files). I used the first option; a sketch of what the resulting param block could look like follows below.
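
For illustration, here is a hypothetical param block using the underscore-prefix workaround; the parameter names are my own guesses, not the actual script's:

  # Hypothetical param block; names are illustrative, not the actual script's.
  param (
      [Parameter(Mandatory=$true)]
      [string]$DestinationFolder,

      [string]$FilePrefix = 'DummyFile',

      # Number of files to create per pre-defined size; the underscore prefix
      # works around the "no leading digit" rule for parameter names.
      [int]$_5KBFiles = 0,
      [int]$_100KBFiles = 0,
      [int]$_1GBFiles = 0,

      # Custom file sizes in bytes.
      [int64[]]$CustomSize,

      # Create a separate subfolder for each file size.
      [switch]$FolderPerSize
  )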

Last but not least, we have to make sure that the script does not fill up the destination drive, because that would not end well. If the target is pointing to the System drive, we should also make sure that there is enough space left for the memory dump in case of a BSOD. (Of course it won't be our script that causes it, but we do not want to be part of that discussion anyway.) This can be achieved with a simple function.

In this function we need to get the disk information:

  # For network drives:
  $Disk = Get-WmiObject Win32_LogicalDisk | ?{($_.ProviderName) -and ($DestinationFolder -like "*$($_.ProviderName)*")}
  # For local drives:
  $Disk = Get-WmiObject Win32_LogicalDisk -Filter $("DeviceID='" + $DestinationDrive + "'")

And the memory information:

  $PhysicalMemory = (((Get-CimInstance -ClassName "cim_physicalmemory").Capacity) | Measure-Object -Sum).Sum
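
Putting these pieces together, a free-space guard could look something like this minimal sketch; the function name, the $RequiredBytes parameter and the dump-size margin logic are my assumptions, not the actual script's implementation:

  # Minimal sketch of a free-space guard. The function name, parameters and
  # margin logic are assumptions, not the actual script's implementation.
  function Test-EnoughFreeSpace {
      param (
          [string]$DestinationFolder,
          [int64]$RequiredBytes
      )

      $DestinationDrive = ([System.IO.Path]::GetPathRoot($DestinationFolder)).TrimEnd('\')
      $Disk = Get-WmiObject Win32_LogicalDisk -Filter $("DeviceID='" + $DestinationDrive + "'")

      # On the System drive, reserve room for a full memory dump (= RAM size)
      # on top of the requested bytes.
      $Margin = 0
      if ($DestinationDrive -eq $env:SystemDrive) {
          $Margin = (((Get-CimInstance -ClassName "cim_physicalmemory").Capacity) | Measure-Object -Sum).Sum
      }

      return ($Disk.FreeSpace -gt ($RequiredBytes + $Margin))
  }

Calling something like Test-EnoughFreeSpace -DestinationFolder 'D:\Dump' -RequiredBytes 1GB before each file keeps the script from running the drive dry.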

 

Falling Action

Act 4

Putting the above together is not complicated; it's merely a few hours of typing. I'm going to be nice and caring and link it for you. (link).

Just remember... Depending on the number and size of files you want to generate, you might need some time, so don't leave this to the last minute.

Dénouement

Now that we have a bunch of files available, you can start your performance testing. I might also create a blog entry on that. But maybe before that, an entry on how to create random Word files for testing the search functionality in a better way.

Special thanks go to my friend Daniel Vetro, who's a real PowerShell ninja, and without whom this entry would not be here.