I once had a project where I had to work with a large directory of unsorted Word documents. The main directory had a deep tree of sub-directories containing multiple versions of the same document, all with the same file name. It was a mess.
My job was to identify each unique document filename, keep only its most recently modified version, and parse each of those documents for a unique key-id using regular expressions. I won’t get into all of the details, but I found that the easiest way to run regex searches was to reduce each document to raw text and work from there.
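A minimal sketch of how the "most recently modified version per filename" step could be done is shown below. It is not the code from the project: the names collectLatestDocs and scanFolder are hypothetical, and it assumes the Windows Scripting runtime (Scripting.FileSystemObject and Scripting.Dictionary) is available.

```vba
' Hypothetical helper: walks rootPath recursively and records, for each
' file name, the full path of the most recently modified copy.
' "latest" is expected to be a Scripting.Dictionary created by the caller.
Sub collectLatestDocs(ByVal rootPath As String, ByVal latest As Object)
    Dim oFso As Object
    Set oFso = CreateObject("Scripting.FileSystemObject")
    scanFolder oFso.GetFolder(rootPath), latest
End Sub

Private Sub scanFolder(ByVal oFolder As Object, ByVal latest As Object)
    Dim oFile As Object
    Dim oSub As Object
    Dim fileKey As String

    For Each oFile In oFolder.Files
        ' Keep names containing ".doc" (.doc, .docx, .docm) and skip
        ' Word's temporary "~$" lock files
        If LCase(oFile.Name) Like "*.doc*" And Left(oFile.Name, 2) <> "~$" Then
            fileKey = LCase(oFile.Name)
            If Not latest.Exists(fileKey) Then
                latest.Add fileKey, oFile.Path
            ElseIf oFile.DateLastModified > FileDateTime(latest(fileKey)) Then
                ' A newer copy of the same file name replaces the stored path
                latest(fileKey) = oFile.Path
            End If
        End If
    Next oFile

    ' Recurse into every sub-directory
    For Each oSub In oFolder.SubFolders
        scanFolder oSub, latest
    Next oSub
End Sub
```

To use it, create the dictionary with CreateObject("Scripting.Dictionary"), pass it to collectLatestDocs along with the root directory, and then loop over latest.Keys to read back one path per filename.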
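For this post, I built both an early binding version (which requires a reference to the Word Object Library) and a late binding version of the function I used to extract text from each Word document for my project. Both versions share the same function name, so add only one of them to your project.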
NOTE: If you don’t need all of the document content, the function includes an alternative line that pulls a fragment by start and stop index. To use it, comment out the line that assigns oWdoc.Content to docContent and uncomment the line that uses oWdoc.Range(0, 500). Adjust the 0 (start) and 500 (stop) to your needs.
Extract Text From MS Word Document with VBA – Early Binding Example
'----------------------------------------------------
' Get Text From MS Word Document (Early Binding)
'----------------------------------------------------
' NOTE: To use this code, you must reference
' The Microsoft Word 14.0 (or current version)
' Object Library by clicking menu Tools > References
' Check the box for:
' Microsoft Word 14.0 Object Library in Word 2010
' Microsoft Word 15.0 Object Library in Word 2013
' Click OK
'----------------------------------------------------
Function getWordDocText(ByVal iFile As String) As String
    Dim oWord As Word.Application
    Dim oWdoc As Word.Document
    Dim docHeader As String
    Dim docFooter As String
    Dim docContent As String

    ' Initialize Word Objects
    '---------------------------------
    Set oWord = New Word.Application
    ' Open read-only so the source document cannot be changed
    Set oWdoc = oWord.Documents.Open(FileName:=iFile, ReadOnly:=True)

    ' Get Content From Document
    '---------------------------------
    ' Get primary header (1 = wdHeaderFooterPrimary)
    docHeader = oWdoc.Sections(1).Headers(1).Range.Text
    ' Get primary footer (1 = wdHeaderFooterPrimary)
    docFooter = oWdoc.Sections(1).Footers(1).Range.Text
    ' Get document content
    docContent = oWdoc.Content.Text

    '---------------------------------
    ' Limit to first 500 characters of
    ' main document content. Uncomment
    ' to use and adjust accordingly:
    '---------------------------------
    'docContent = oWdoc.Range(0, 500).Text

    '---------------------------------
    ' Return Document Content
    '---------------------------------
    getWordDocText = docHeader & vbNewLine & docContent & vbNewLine & docFooter

    ' Clear Memory
    '---------------------------------
    ' Close without saving so Word never prompts
    oWdoc.Close SaveChanges:=False
    oWord.Quit
    Set oWdoc = Nothing
    Set oWord = Nothing
End Function
Extract Text From MS Word Document with VBA – Late Binding Example
'----------------------------------------------------
' Get Text From MS Word Document (Late Binding)
'----------------------------------------------------
' NOTE: This is the late binding version of the
' Get Text From MS Word Document code. No reference
' to Microsoft Word XX.0 Object Library is needed
'----------------------------------------------------
Function getWordDocText(ByVal iFile As String) As String
    Dim oWord As Object
    Dim oWdoc As Object
    Dim docHeader As String
    Dim docFooter As String
    Dim docContent As String

    ' Initialize Word Objects
    '---------------------------------
    Set oWord = CreateObject("Word.Application")
    ' Open read-only so the source document cannot be changed
    Set oWdoc = oWord.Documents.Open(FileName:=iFile, ReadOnly:=True)

    ' Get Content From Document
    '---------------------------------
    ' Get primary header (1 = wdHeaderFooterPrimary)
    docHeader = oWdoc.Sections(1).Headers(1).Range.Text
    ' Get primary footer (1 = wdHeaderFooterPrimary)
    docFooter = oWdoc.Sections(1).Footers(1).Range.Text
    ' Get All Main Document Content
    docContent = oWdoc.Content.Text

    '---------------------------------
    ' Limit to first 500 characters of
    ' main document content. Uncomment
    ' to use and adjust accordingly:
    '---------------------------------
    'docContent = oWdoc.Range(0, 500).Text

    '---------------------------------
    ' Return Document Content
    '---------------------------------
    getWordDocText = docHeader & vbNewLine & docContent & vbNewLine & docFooter

    ' Clear Memory
    '---------------------------------
    ' Close without saving so Word never prompts
    oWdoc.Close SaveChanges:=False
    oWord.Quit
    Set oWdoc = Nothing
    Set oWord = Nothing
End Function
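As a rough usage sketch, either version of getWordDocText can feed a regular expression search along these lines. The file path and the "ID-" plus six digits pattern are placeholders made up for illustration, not values from the original project, and demoExtractKeyId is a hypothetical name. The regex engine is the late-bound VBScript.RegExp object, so no extra reference is needed.

```vba
' Hypothetical example: the path and pattern below are placeholders.
Sub demoExtractKeyId()
    Dim docText As String
    Dim oRegex As Object
    Dim oMatches As Object

    ' Pull header, body and footer text into one string
    docText = getWordDocText("C:\Docs\example.docx")

    ' Late-bound VBScript regular expression engine
    Set oRegex = CreateObject("VBScript.RegExp")
    oRegex.Pattern = "ID-\d{6}"
    oRegex.IgnoreCase = True
    oRegex.Global = False   ' stop at the first match

    Set oMatches = oRegex.Execute(docText)
    If oMatches.Count > 0 Then
        Debug.Print "Key-id found: " & oMatches(0).Value
    Else
        Debug.Print "No key-id found"
    End If
End Sub
```

Note that each call to getWordDocText starts and quits its own Word instance, which keeps the function self-contained but can be slow across a large directory.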
As always, please comment with questions, issues, etc…