File:Filetype identification using long, summarized n-grams (IA filetypeidentifi109455770).pdf

This is a file from the Wikimedia Commons
Source: Wikipedia, the free encyclopedia.

Original file (1,275 × 1,650 pixels, file size: 618 KB, MIME type: application/pdf, 106 pages)

Summary

Filetype identification using long, summarized n-grams
Author
Mayer, Ryan C.
Title
Filetype identification using long, summarized n-grams
Publisher
Monterey, California. Naval Postgraduate School
Description

Past research into file type identification has employed many different techniques in an attempt to accurately classify files and file fragments including N-gram analysis. However, naive application of n-grams breaks down when handling n-grams that are greater than two bytes, due to the sparseness of the feature. As a result, other researchers have generally ignored long n-grams for filetype identification. This thesis explores the use of long n-grams for whole file and file fragment classification by building feature distributions of commonly occurring n-grams for single filetypes and using those distributions to classify unknown files and file fragments. This thesis also utilizes summarized n-grams in order to "collapse" similar n-grams within a file type into common n-grams. The algorithms developed to both generate and compare unknown files are presented as well as results from an experiment that was conducted using another researcher's data set.
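
The general idea in the description (collect the commonly occurring long n-grams for each file type, then compare an unknown file against those collections) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the 4-byte n-gram length, the `top_k` cutoff, and the overlap-count scoring are all assumptions chosen for illustration, and the n-gram summarization step is not attempted because the description does not say how it works.

```python
from collections import Counter


def ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=4, top_k=50):
    """Return the set of the top_k n-grams seen in the most training files.

    Hypothetical helper: each n-gram is counted once per file, so a single
    large file cannot dominate the profile of its file type.
    """
    counts = Counter()
    for sample in samples:
        counts.update(set(ngrams(sample, n)))
    return {gram for gram, _ in counts.most_common(top_k)}


def classify(data: bytes, profiles, n=4):
    """Return the label whose profile shares the most n-grams with data.

    Returns None when there are no profiles or nothing overlaps. The
    overlap count is an assumed stand-in for a real similarity measure.
    """
    if not profiles:
        return None
    seen = set(ngrams(data, n))
    scores = {label: len(seen & profile) for label, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy training data, invented for this example only.
html_samples = [
    b"<html><body>hello</body></html>",
    b"<html><head></head><body></body></html>",
]
png_samples = [
    b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x10",
    b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x20",
]
profiles = {
    "html": build_profile(html_samples),
    "png": build_profile(png_samples),
}
```

With these toy profiles, `classify(b"<body><p>x</p></body>", profiles)` picks `"html"` because the fragment shares n-grams such as `b"body"` with both HTML samples. A fragment is handled the same way as a whole file here, since the comparison only looks at which n-grams are present.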


Subjects: Computer science; Programs; Algorithms
Language English
Publication date March 2011
Current location
IA Collections: navalpostgraduateschoollibrary; fedlink
Accession number
filetypeidentifi109455770
Source
Internet Archive identifier: filetypeidentifi109455770
https://archive.org/download/filetypeidentifi109455770/filetypeidentifi109455770.pdf
Permission
This publication is a work of the U.S. Government as defined in Title 17, United States Code, Section 101. As such, it is in the public domain, and under the provisions of Title 17, United States Code, Section 105, may not be copyrighted.

Licensing

Public domain
This work is in the public domain in the United States because it is a work prepared by an officer or employee of the United States Government as part of that person’s official duties under the terms of Title 17, Chapter 1, Section 105 of the US Code. Note: This only applies to original works of the Federal Government and not to the work of any individual U.S. state, territory, commonwealth, county, municipality, or any other subdivision. This template also does not apply to postage stamp designs published by the United States Postal Service since 1978. (See § 313.6(C)(1) of Compendium of U.S. Copyright Office Practices). It also does not apply to certain US coins; see The US Mint Terms of Use.

File history

Date/Time: 12:25, 20 July 2020 (current)
Dimensions: 1,275 × 1,650, 106 pages (618 KB)
Comment: FEDLINK - United States Federal Collection filetypeidentifi109455770 (User talk:Fæ/IA books#Fork8) (batch 1993-2020 #16432)
