Converting HTML E-mail To Plain Text

The Battle Of Evermore...

OK, I admit it. I've caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM & SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or delivered by the e-mail router and converts the HTML into plain text.

After a bit of searching, I found a good article which showed how you could use regular expressions to remove unwanted HTML tags leaving just the plain text - Convert HTML to Plain Text. Converting this from C# to VB (my preferred choice of language) and stripping out some of the bits I didn't need, I came up with the following code which forms the basis of this plug-in.

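 ' Assumes the source file begins with:
 '   Imports System.Text.RegularExpressions
 '   Imports System.Text.RegularExpressions.Regex
 ' so that the unqualified Replace calls below resolve to Regex.Replace.
 Private Function ConvertHTMLToText(ByVal Source As String) As String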
  
     Dim result As String = Source
  
     ' Remove formatting that will prevent regex from running reliably
     ' \r - Matches a carriage return \u000D.
     ' \n - Matches a line feed \u000A.
     ' \f - Matches a form feed \u000C.
     ' For more details see https://msdn.microsoft.com/en-us/library/4edbef7e.aspx
     result = Replace(result, "[\r\n\f]", String.Empty, Text.RegularExpressions.RegexOptions.IgnoreCase)
  
     ' replace the most commonly used special characters:
     result = Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
     result = Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
     result = Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
     result = Replace(result, "&quot;", """", RegexOptions.IgnoreCase)
     result = Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)
  
     ' Remove ASCII character code sequences such as &#nn; and &#nnn;
     result = Replace(result, "&#[0-9]{2,3};", String.Empty, RegexOptions.IgnoreCase)
  
     ' Remove all other special characters. More can be added - see the following for more details:
     ' https://www.degraeve.com/reference/specialcharacters.php
     ' https://www.web-source.net/symbols.htm
     result = Replace(result, "&.{2,6};", String.Empty, RegexOptions.IgnoreCase)
  
     ' Remove all attributes and whitespace from the <head> tag
     result = Replace(result, "< *head[^>]*>", "<head>", RegexOptions.IgnoreCase)
     ' Remove all whitespace from the </head> tag
     result = Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
     ' Delete everything between the <head> and </head> tags (lazy match, stops at the first </head>)
     result = Replace(result, "<head>.*?</head>", String.Empty, RegexOptions.IgnoreCase)
  
     ' Remove all attributes and whitespace from all <script> tags
     result = Replace(result, "< *script[^>]*>", "<script>", RegexOptions.IgnoreCase)
     ' Remove all whitespace from all </script> tags
     result = Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
     ' Delete everything between each <script> tag and its closing </script> tag.
     ' The match is lazy so that text between two separate script blocks is kept.
     result = Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)
  
     ' Remove all attributes and whitespace from all <style> tags
     result = Replace(result, "< *style[^>]*>", "<style>", RegexOptions.IgnoreCase)
     ' Remove all whitespace from all </style> tags
     result = Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
     ' Delete everything between each <style> tag and its closing </style> tag.
     ' The match is lazy so that text between two separate style blocks is kept.
     result = Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)
  
     ' Insert tabs in place of <td> tags
     result = Replace(result, "< *td[^>]*>", vbTab, RegexOptions.IgnoreCase)
  
     ' Insert single line breaks in place of <br> and <li> tags
     result = Replace(result, "< *br[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
     result = Replace(result, "< *li[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
  
     ' Insert double line breaks in place of <p>, <div> and <tr> tags
     result = Replace(result, "< *div[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
     result = Replace(result, "< *tr[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
     result = Replace(result, "< *p[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
  
     ' Remove all remaining HTML tags
     result = Replace(result, "<[^>]*>", String.Empty, RegexOptions.IgnoreCase)
  
     ' Replace repeating spaces with a single space
     result = Replace(result, " +", " ")
  
     ' Remove any trailing spaces and tabs from the end of each line
     result = Replace(result, "[ \t]+\r\n", vbCrLf)
  
     ' Remove any leading whitespace characters
     result = Replace(result, "^[\s]+", String.Empty)
  
     ' Remove any trailing whitespace characters
     result = Replace(result, "[\s]+$", String.Empty)
  
     ' Remove extra line breaks if there are more than two in a row
     result = Replace(result, "\r\n\r\n(\r\n)+", vbCrLf + vbCrLf)
  
     ' That's it.
     Return result
  
 End Function
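
The entity handling in the function above covers only a handful of named entities and discards the rest. As a rough sketch of an alternative, the framework's own decoder can translate named and numeric entities in a single call. This assumes .NET 4.0 or later, where System.Net.WebUtility.HtmlDecode is available (on earlier versions System.Web.HttpUtility.HtmlDecode plays the same role). The module and the DecodeEntities helper are hypothetical names and are not part of the plug-in project.

```vb
Imports System
Imports System.Net

Module EntityDecodingSketch

    ' Hypothetical helper, not part of the original plug-in.
    ' Decodes named entities (&amp;, &ntilde;, ...) and numeric
    ' references (&#169;) in one pass.
    Public Function DecodeEntities(ByVal source As String) As String
        Dim decoded As String = WebUtility.HtmlDecode(source)
        ' &nbsp; decodes to the non-breaking space U+00A0;
        ' map it to an ordinary space for plain-text output.
        Return decoded.Replace(ChrW(&HA0), " "c)
    End Function

    Sub Main()
        ' Returns the string: Señor & Señora ©
        Console.WriteLine(DecodeEntities("Se&ntilde;or &amp; Se&ntilde;ora&nbsp;&#169;"))
    End Sub

End Module
```

If something like this replaced the entity lines in ConvertHTMLToText, it would probably belong after the tag-stripping step, so that escaped markup such as &lt;b&gt; survives as literal text instead of being removed as a tag.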

All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the "DeliverPromote" event, whereas any incoming e-mail handled by the e-mail router triggers the "DeliverIncoming" event. Interestingly enough, the "Create" event was also called as a child pipeline for these events, but modifying the message here didn't have any effect, even in the pre-processing stage.

Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end I added checks to ensure that, even if the plug-in is registered on multiple events, the main code will only run if the plug-in:

  1. is running on the 'DeliverPromote' or 'DeliverIncoming' messages
  2. is running synchronously
  3. is running against the 'Email' entity
  4. is running in the 'pre-processing' stage of the pipeline
  5. is running in a 'Parent' pipeline

 Public Class ConvertHtmlToText
     Implements IPlugin
  
     Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
  
         ' Exit if any of the following conditions are true:
         '  1. plug-in is not running synchronously
         '  2. plug-in is not running against the 'Email' entity
         '  3. plug-in is not running in the 'pre-processing' stage of the pipeline
         '  4. plug-in is not running in a 'Parent' pipeline
         If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
             Exit Sub
         End If
  
         If context.MessageName = "DeliverPromote" OrElse context.MessageName = "DeliverIncoming" Then
  
             For Each item In context.InputParameters.Properties
  
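                 If (item.Name = "Body") Then
                     context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(CStr(item.Value))
                     ' Stop enumerating once the body has been replaced, so the
                     ' collection is not iterated again after being modified
                     Exit For
                 End If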
  
             Next
  
         End If
  
     End Sub
  
 End Class
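
The exit conditions listed above can also be expressed as a single predicate with named constants, which makes the magic numbers easier to read and lets the logic be exercised outside CRM. This is only a sketch: the module, the constant names and the ShouldConvert function are hypothetical, and the numeric values are simply the literals used in the Execute method above rather than values taken from the SDK documentation.

```vb
Imports System

Module PluginGuardSketch

    ' Hypothetical names; the values mirror the literals used in Execute.
    Private Const SynchronousMode As Integer = 0
    Private Const PreProcessingStage As Integer = 10
    Private Const ParentPipeline As Integer = 0

    ' Returns True only when all five conditions from the list hold.
    Public Function ShouldConvert(ByVal mode As Integer,
                                  ByVal entityName As String,
                                  ByVal stage As Integer,
                                  ByVal invocationSource As Integer,
                                  ByVal messageName As String) As Boolean
        Return mode = SynchronousMode AndAlso
               entityName = "email" AndAlso
               stage = PreProcessingStage AndAlso
               invocationSource = ParentPipeline AndAlso
               (messageName = "DeliverPromote" OrElse messageName = "DeliverIncoming")
    End Function

    Sub Main()
        Console.WriteLine(ShouldConvert(0, "email", 10, 0, "DeliverIncoming")) ' True
        Console.WriteLine(ShouldConvert(0, "email", 10, 1, "Create"))          ' False
    End Sub

End Module
```

In the plug-in itself the arguments would come from the corresponding properties of the execution context (Mode, PrimaryEntityName, Stage, InvocationSource and MessageName).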

As always, I have included the source code to my project here. Please do bear in mind that I haven't included any error handling or logging, so it's not production-ready. However, it should provide you with a good head start.

This posting is provided "AS IS" with no warranties, and confers no rights.

Laughing Boy

SRH.CRM.Plugin.Email.zip

Comments

  • Anonymous
    March 15, 2011
    Hi, I've adapted your plug-in to work with CRM 2011. Here is the code snippet I had to change. I added references to M.crm.sdk.proxy and M.xrm.sdk.

     Public Class ConvertHtmlToText
         Implements IPlugin

         Public Sub Execute(ByVal serviceProvider As System.IServiceProvider) Implements Microsoft.Xrm.Sdk.IPlugin.Execute

             Dim context As Microsoft.Xrm.Sdk.IPluginExecutionContext = DirectCast(serviceProvider.GetService(GetType(Microsoft.Xrm.Sdk.IPluginExecutionContext)), IPluginExecutionContext)

             ' Exit if any of the following conditions are true:
             '  1. plug-in is not running synchronously
             '  2. plug-in is not running against the 'Email' entity
             '  3. plug-in is not running in the 'pre-processing' stage of the pipeline
             '  4. plug-in is not running in a 'Parent' pipeline (now, this is configurable in the registration TOOL, I guess, because I couldn't find an equivalent)
             If Not (context.Mode = 0) Or Not (context.PrimaryEntityName = "email") Or Not (context.Stage = 10) Then ' Or Not (context.InvocationSource = 0)
                 Exit Sub
             End If

             If (context.MessageName = "DeliverPromote") Or (context.MessageName = "DeliverIncoming") Then
                 Try
                     For Each elemento In context.InputParameters
                         If (elemento.Key = "Body") Then
                             Dim contenido As String = CStr(elemento.Value)
                             context.InputParameters.Item("Body") = ConvertHTMLToText(contenido)
                             'Throw New System.Exception("The value of key has been modified: Value=" + context.InputParameters.Item("Body")) 'CStr(elemento.Value)) ' + elemento.ToString())
                             Exit For
                         End If
                     Next
                 Catch ex As Exception
                     Throw New System.Exception("The value of key has been modified: " + ex.Message)
                 End Try
             End If

         End Sub


    Also, I've added these Replace statements, because I receive e-mails in Spanish:

         result = Replace(result, "&aacute;", "á", RegexOptions.IgnoreCase)
         result = Replace(result, "&eacute;", "é", RegexOptions.IgnoreCase)
         result = Replace(result, "&iacute;", "í", RegexOptions.IgnoreCase)
         result = Replace(result, "&oacute;", "ó", RegexOptions.IgnoreCase)
         result = Replace(result, "&uacute;", "ú", RegexOptions.IgnoreCase)
         result = Replace(result, "&Aacute;", "Á", RegexOptions.IgnoreCase)
         result = Replace(result, "&Eacute;", "É", RegexOptions.IgnoreCase)
         result = Replace(result, "&Iacute;", "Í", RegexOptions.IgnoreCase)
         result = Replace(result, "&Oacute;", "Ó", RegexOptions.IgnoreCase)
         result = Replace(result, "&Uacute;", "Ú", RegexOptions.IgnoreCase)
         result = Replace(result, "&Ntilde;", "Ñ", RegexOptions.IgnoreCase)
         result = Replace(result, "&ntilde;", "ñ", RegexOptions.IgnoreCase)
         result = Replace(result, " ", vbCrLf, RegexOptions.IgnoreCase)


If you see something wrong let me know, but it's working like a charm. Regards

  • Anonymous
    March 15, 2011
    Nice one Jorge. If I get a chance, I will republish in a new post. I wonder if there is a better way of identifying all language-specific character sets, rather than adding an exception for each character?

  • Anonymous
    March 23, 2011
    Hi, it would be great if you could just give me a keyword to google to find out how to implement such code in Dynamics CRM... Thanks a lot!

  • Anonymous
    March 23, 2011
    Hi again. Finally I could implement the code in Dynamics using the plug-in registration tool. Now I thought there would be a custom step in the workflow area... wrong again :) What do I have to do to remove the HTML tags from my mails? Thank you. Regards

  • Anonymous
    March 24, 2011
    Hi Nicolas, after you have registered the plug-in, you need to register it against two specific events (steps). You can register these steps in the plug-in registration tool as well:

      1. Event: DeliverPromote; Entity: email
      2. Event: DeliverIncoming; Entity: email

    Make sure these are registered to run synchronously in the pre-processing stage of the event pipeline. Best regards, Simon

    • Anonymous
      January 18, 2017
      Hi Simon, I have been trying to implement the above code in CRM 365 on premise, build it and then use the Registration Tool, but it is not working, as I may have wrongly implemented your guide.

      - Event: DeliverPromote; Entity: email
      - Event: DeliverIncoming; Entity: email
      - Make sure these are registered to run synchronously in the pre-processing stage of the event pipeline

      Please advise me how to implement the above three steps.

  • Anonymous
    March 25, 2011
    Hi Simon Thank you very much! Everything works fine! I expected a custom "action" for workflows... Now I know that your plugin converts all mail messages. I'll keep searching :) Have a nice day Best regards, Nicolas

  • Anonymous
    April 03, 2012
    Once this plug-in is compiled into a DLL, is it then somehow installed in Outlook? Once installed, how is it triggered, by a button? I would like to modify this to trigger when I click "Convert Email to CRM Case" and parse out the HTML body to auto populate the Case for fields. Please forgive my foolish questions. Thanks

  • Anonymous
    April 03, 2012
    A plug-in only runs when triggered by a CRM event (such as DeliverPromote), and does not show up in the Outlook or Web UI. To be able to use this as part of the "Convert E-mail To Case" function in the Outlook client, you will need to work out which events are triggered, and modify the plug-in to work with those events. Unfortunately I won't be able to check this out for the next couple of weeks, as I am on vacation right now. Best regards, Simon