Issue
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.List;

import com.opencsv.CSVReaderBuilder;

public class NormalJava implements Serializable {

    private static final long serialVersionUID = 7526472295622776147L;

    static String Filename = "/Users/tarunv711/Desktop/ads.csv";
    static String outFile = "/Users/tarunv711/Desktop/TV.txt";

    public static void main(String[] args) {
        readAllData(Filename, outFile);
    }

    public static void readAllData(String Filename, String outFile) {
        try {
            // Create a FileReader for the CSV file.
            FileReader filereader = new FileReader(Filename);

            // Create the CSVReader and read all rows into memory.
            com.opencsv.CSVReader csvReader = new CSVReaderBuilder(filereader)
                    .build();
            List<String[]> allData = csvReader.readAll();

            // Serialize each row to the output file.
            ObjectOutputStream os = new ObjectOutputStream(new FileOutputStream(outFile));
            for (String[] row : allData) {
                os.writeObject(row);
            }
            os.close();

            System.out.println("Done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The class runs perfectly fine and produces a file, but in this example the serialized output file was 9KB while the original CSV file was 6KB. When I tried it with other files, I got the same result. I'm using the standard Java serializer.
Solution
Why is my serializable Java class producing a serialized file that's larger than the original file?
As the commenters have pointed out, this should not be surprising. The Object Serialization Stream Protocol used for encoding a serialized file is not designed to optimize for space.
The serialized form includes type descriptor information for each distinct Java type of the objects in the serialization (with few exceptions). This includes the names and types of all of those types' fields.
A String value consists of the UTF-8 encoding of the characters, plus a TC_STRING or TC_LONGSTRING type byte, a 2 or 8 byte length field (giving the length of the UTF-8 encoding), and an object handle. A String[] value consists of an array of object handles for the strings, plus a TC_ARRAY type byte, a 4 byte length, a classdesc handle, and an object handle.
(The handles are 4 bytes, I think. The spec is not 100% clear.)
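You can see this overhead for yourself with a minimal sketch along these lines (the class name and the row values are mine, chosen purely for illustration): it serializes a single String[] to an in-memory buffer and prints the resulting size next to the raw UTF-8 length of the strings.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;

public class SerializedSizeDemo {
    public static void main(String[] args) throws IOException {
        // A row like one parsed from a CSV file (values are made up).
        String[] row = {"2021-01-01", "TV", "42.50"};

        // Raw payload: just the UTF-8 bytes of the three strings.
        int rawBytes = 0;
        for (String s : row) {
            rawBytes += s.getBytes(StandardCharsets.UTF_8).length;
        }

        // Serialized payload: stream header + class descriptors for
        // String[] + TC_ARRAY record + per-element TC_STRING records.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(row);
        }

        System.out.println("UTF-8 bytes of values : " + rawBytes);
        System.out.println("Serialized bytes      : " + buffer.size());
    }
}

The difference between the two numbers is the fixed and per-value overhead described above.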
Compare this with the way that the CSV file represents each value:
- The per-string overhead is 1 byte (for the separator), plus 2 more if the string is quoted, plus some more if there is escaping within the quotes.
- The per-row overhead is 1 or 2 bytes for the line terminator.
- The per-file "metadata" overhead is either nothing, or the length of the first line if it contains column headings.
If you do an item-by-item comparison, you should be able to see that the overheads are larger in a serialized file than in a CSV file.
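If you want to do that comparison on your own data, here is a rough sketch (the class name and file path are placeholders, and the CSV-size estimate ignores quoting and escaping) that reads the rows the same way the question's code does and prints, per row, the approximate CSV size against the number of bytes the row adds to the serialized stream:

import java.io.ByteArrayOutputStream;
import java.io.FileReader;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.opencsv.CSVReaderBuilder;

public class PerRowOverhead {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point it at your own CSV file.
        String filename = "ads.csv";

        List<String[]> allData;
        try (FileReader filereader = new FileReader(filename)) {
            allData = new CSVReaderBuilder(filereader).build().readAll();
        }

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream os = new ObjectOutputStream(buffer)) {
            for (String[] row : allData) {
                int before = buffer.size();
                os.writeObject(row);
                os.flush(); // push buffered bytes so size() is accurate
                int serializedRow = buffer.size() - before;

                // Approximate CSV size of the row: values joined by commas
                // plus a line terminator (ignores quoting/escaping).
                int csvRow = String.join(",", row)
                        .getBytes(StandardCharsets.UTF_8).length + 1;

                System.out.println("csv=" + csvRow + " bytes, serialized="
                        + serializedRow + " bytes");
            }
        }
    }
}

Note that the first row will look disproportionately large because it also carries the stream header and the class descriptors for String[]; later rows refer back to those via handles.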
Why?
Because minimizing the serialization size was NOT the primary design goal for the object serialization protocol. It was thought to be more important that serialization / deserialization:
- should be type-safe in the face of schema changes1, and
- should produce an object graph that is isomorphic2 to the original one.
1 - This means that the serialized form must contain type descriptors and/or serialization IDs that can be compared with the classes at deserialization time.
2 - This requires that each object must have a handle in the serialized form to denote its identity.
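Footnote 2 can be observed directly: if the very same String object appears twice in the data, the second occurrence is written as a small back-reference (a handle) rather than a second full copy, which is how identity is preserved. A minimal sketch (class, method, and variable names are mine):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class HandleDemo {
    private static int serializedSize(Object obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(obj);
        }
        return buffer.size();
    }

    public static void main(String[] args) throws IOException {
        String value = "some reasonably long string value";

        // Same object twice: the second element is written as a handle.
        String[] sameObject = {value, value};

        // Two distinct (but equal) objects: both are written in full.
        String[] distinctObjects = {value, new String(value)};

        System.out.println("same object twice : " + serializedSize(sameObject) + " bytes");
        System.out.println("distinct objects  : " + serializedSize(distinctObjects) + " bytes");
    }
}

The first array serializes to fewer bytes than the second, because the repeated reference costs only a handle.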
Answered By - Stephen C