Thursday, March 26, 2009

PPTX extractor -- pptx to text

Note: All programs on this site need .NET Framework 2.

We often need to extract all the text from PowerPoint files and add some basic formatting to it. Unfortunately, PowerPoint only supports exporting the outline as .rtf, NOT all of the text. In other situations we need to extract media files from a presentation in order to modify or reuse them.

I made this simple PPTX file extractor (pptx 2 text), which extracts the text (and the media, if there is any) from PowerPoint 2007 files to an *.rtf (Rich Text Format) file and lets you add some basic formatting to the extracted text, such as changing the fonts and colors.

I provide the source code with explanations, so you can easily edit it to fit your needs. Visual Basic users can use this converter: http://www.developerfusion.com/tools/convert/csharp-to-vb/ or you can write your requests and I'll try to edit it for you.

This is version 1. There may or may not be other versions, so keep track of the blog for the latest news and programs.

Note: in the case of tables, the extracted text will look like this photo.

The code is available under the GNU/GPL license, so you can modify it, improve it, or even use it in your programs freely, but your code will then also be under GNU/GPL and you should provide it for free too.

Download Here

Note: when you download, an alert will appear saying that this file extension may harm your computer, but that is normal because it is an .exe file.

Wait for version 2 soon, with a lot of options ... please leave me a comment about any bugs or new ideas.

PPTX extractor code

Summary: The main idea behind our simple program is that a .pptx file is simply a zip archive (you can try to unzip it). The archive contains a lot of XML files, so we need to find the files that contain the text.

Note: You can use SharpZipLib to unzip files in your project, from http://icsharpcode.net/OpenSource/SharpZipLib

I found that the text is in %file%ppt/slides. The XML tag that contains the text data is "a:t", so we need an XML reader to read this data and write it to our rich text box. Then I added some dialogs to make it a real program.

The main code:

// needs: using System.IO; using System.Xml; using ICSharpCode.SharpZipLib.Zip;
try
{
    // FastZip instance from SharpZipLib
    FastZip unzip = new FastZip();
    // unzip to the Windows temp folder (GetTempPath ends with a path separator)
    string tmploc = Path.GetTempPath();
    // extract only the ppt/slides folder, NOT all files, to save time on slow computers
    unzip.ExtractZip(openFileDialog1.FileName, tmploc, "ppt/slides");
    // count the slide XML files once, so the loop stops after the last one
    string slidesDir = tmploc + "ppt\\slides";
    int slideCount = Directory.GetFiles(slidesDir, "*.xml").Length;
    for (int i = 1; i <= slideCount; i++)
    {
        // create a reader for the current slide; the file name changes on every
        // iteration, and the using block closes the reader before the next one
        using (XmlReader rdr = XmlReader.Create(slidesDir + "\\slide" + i + ".xml"))
        {
            while (rdr.Read())
            {
                // we only care about nodes of type "element" with the tag ( a:t )
                if (rdr.NodeType == XmlNodeType.Element && rdr.Name == "a:t")
                {
                    // read the element content as a string and add it to the rich text box
                    textdata.Text += rdr.ReadElementContentAsString() + "\n";
                }
            }
        }
    }
}
// catch any error and show a message to the user instead of terminating the program
catch (Exception err) { MessageBox.Show(err.Message); }
You can download the rest of the code from HERE
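
The "a:t" idea from the summary can also be tried on its own, without the zip step or the form. The following is only a minimal sketch, not the program's actual code: the class name SlideText and the method ExtractParagraphs are hypothetical names made up for this example. It assumes the "a" prefix is bound to the DrawingML namespace, and it joins the "a:t" runs that belong to the same "a:p" paragraph so that a sentence split over several runs stays on one line. It matches on the namespace and local name rather than on the literal text "a:t", which should keep working if a file uses a different prefix.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration; not part of the downloadable program.
public static class SlideText
{
    // DrawingML main namespace, normally bound to the "a" prefix in slide XML.
    private const string DrawingMl =
        "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns one string per <a:p> paragraph, joining the text of its <a:t> runs.
    public static List<string> ExtractParagraphs(TextReader slideXml)
    {
        List<string> paragraphs = new List<string>();
        StringBuilder current = new StringBuilder();

        using (XmlReader rdr = XmlReader.Create(slideXml))
        {
            // Advance manually: ReadElementContentAsString already moves the
            // reader past the end tag, so an extra Read() would skip a node.
            bool more = rdr.Read();
            while (more)
            {
                if (rdr.NodeType == XmlNodeType.Element
                    && rdr.NamespaceURI == DrawingMl && rdr.LocalName == "t")
                {
                    current.Append(rdr.ReadElementContentAsString());
                    continue; // the reader is already on the next node
                }
                if (rdr.NodeType == XmlNodeType.EndElement
                    && rdr.NamespaceURI == DrawingMl && rdr.LocalName == "p")
                {
                    // end of a paragraph: keep it only if it had any text
                    if (current.Length > 0) paragraphs.Add(current.ToString());
                    current.Length = 0;
                }
                more = rdr.Read();
            }
        }
        return paragraphs;
    }
}
```

To use it on an extracted slide, you could pass something like new StreamReader(tmploc + "ppt\\slides\\slide1.xml") and add each returned string to the rich text box followed by "\n". Line breaks inside a paragraph ("a:br") are ignored in this sketch.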